Complex System Maintenance Handbook (Springer Series in Reliability Engineering)


Pages 648 Page size 474 x 675 pts Year 2008


Springer Series in Reliability Engineering

Series Editor Professor Hoang Pham Department of Industrial Engineering Rutgers The State University of New Jersey 96 Frelinghuysen Road Piscataway, NJ 08854-8018 USA

Other titles in this series

The Universal Generating Function in Reliability Analysis and Optimization, Gregory Levitin
Warranty Management and Product Manufacture, D.N.P. Murthy and Wallace R. Blischke
Maintenance Theory of Reliability, Toshio Nakagawa
System Software Reliability, Hoang Pham
Reliability and Optimal Maintenance, Hongzhou Wang and Hoang Pham
Applied Reliability and Quality, B.S. Dhillon
Shock and Damage Models in Reliability Theory, Toshio Nakagawa
Risk Management, Terje Aven and Jan Erik Vinnem
Satisfying Safety Goals by Probabilistic Risk Assessment, Hiromitsu Kumamoto
Offshore Risk Assessment (2nd Edition), Jan Erik Vinnem
The Maintenance Management Framework, Adolfo Crespo Márquez
Human Reliability and Error in Transportation Systems, B.S. Dhillon

Khairy A.H. Kobbacy • D.N. Prabhakar Murthy Editors

Complex System Maintenance Handbook


Khairy A.H. Kobbacy, PhD Management and Management Sciences Research Institute University of Salford Salford, Greater Manchester M5 4WT UK

D.N. Prabhakar Murthy, PhD Division of Mechanical Engineering The University of Queensland Brisbane 4072 Australia

ISBN 978-1-84800-010-0

e-ISBN 978-1-84800-011-7

DOI 10.1007/978-1-84800-011-7
Springer Series in Reliability Engineering series ISSN 1614-7839

British Library Cataloguing in Publication Data
A Complex system maintenance handbook. - (Springer series in reliability engineering)
1. Maintenance 2. Reliability (Engineering) 3. Maintenance - Management
I. Murthy, D. N. P. II. Kobbacy, Khairy A. H.
620'.0046
ISBN-13: 9781848000100

Library of Congress Control Number: 2008923781

© 2008 Springer-Verlag London Limited

Watchdog Agent™ is a trademark of the Intelligent Maintenance Systems (IMS) Center, University of Cincinnati, PO Box 210072, Cincinnati, OH 45221, USA. www.imscenter.net

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Cover design: deblik, Berlin, Germany
Printed on acid-free paper
springer.com

To our wives Iman and Jayashree for their patience, understanding and support

Preface

Modern societies depend on the smooth operation of many complex systems (designed and built by humans) that provide a variety of outputs (products and services). These include transport systems (trains, buses, ferries, ships and aeroplanes), communication systems (television, telephone and computer networks), utilities (water, gas and electricity networks), manufacturing plants (to produce industrial products and consumer durables), processing plants (to extract and process minerals and oil), hospitals (to provide services) and banks (for financial transactions) to name a few. Every system built by humans is unreliable in the sense that it degrades with age and/or usage. A system is said to fail when it is no longer capable of delivering the designed outputs. Some failures can be catastrophic in the sense that they can result in serious economic losses, affect humans and do serious damage to the environment. Typical examples include the crash of an aircraft in flight, failure of a sewerage processing plant and collapse of a bridge. The degradation can be controlled, and the likelihood of catastrophic failures reduced, through maintenance actions, including preventive maintenance, inspection, condition monitoring and design-out maintenance. Corrective maintenance actions are needed to restore a failed system to operational state through repair or replacement of the components that caused the failure. Maintenance has moved from being an engineering activity after a system has been put into operation into an important issue that needs to be addressed during the design and manufacturing or building of the system. Maintenance impacts on reliability (a technical issue) with serious economic and commercial implications. This implies that operators of complex systems need to look at maintenance from an overall business perspective that integrates the technical and commercial issues in an effective manner. The literature on maintenance is vast. 
Over the last 50 years, there have been dramatic changes due to advances in the understanding of the physics of failure, in technologies to monitor and assess the state of the system, in computers to store and process large amounts of relevant data and in the tools and techniques needed to build models to determine the optimal maintenance strategies.

The aim of this book is to integrate this vast literature with different chapters focusing on different aspects of maintenance and written by active researchers and/or experienced practitioners with international reputations. Each chapter reviews the literature dealing with a particular aspect of maintenance (for example, methodology, approaches, technology, management, modelling analysis and optimisation), reports on the developments and trends in a particular industry sector, or deals with a case study. It is hoped that the book will help to narrow the gap between theory and practice and trigger new research in maintenance.

The book is written for a wide audience. This includes practitioners from industry (maintenance engineers and managers) and researchers investigating various aspects of maintenance. It is also suitable for use as a textbook for postgraduate programs in maintenance, industrial engineering and applied mathematics.

We would like to thank the authors of the chapters for their collaboration and prompt responses to our enquiries, which enabled completion of this handbook on time. We also wish to acknowledge the support of the University of Salford and the award of a CAMPUS Fellowship in 2006 to one of us (PM). We gratefully acknowledge the help and encouragement of the editors of Springer, Anthony Doyle and Simon Rees. Our thanks also go to Sorina Moosdorf and the staff involved with the production of the book.

Contents

Part A An Overview
Chapter 1: An Overview, K. Kobbacy and D. Murthy

Part B Evolution of Concepts and Approaches
Chapter 2: Maintenance: An Evolutionary Perspective, L. Pintelon and A. Parodi-Herz
Chapter 3: New Technologies for Maintenance, Jay Lee and Haixia Wang
Chapter 4: Reliability Centred Maintenance, Marvin Rausand and Jørn Vatn

Part C Methods and Techniques
Chapter 5: Condition-based Maintenance Modelling, Wenbin Wang
Chapter 6: Maintenance Based on Limited Data, David F. Percy
Chapter 7: Reliability Prediction and Accelerated Testing, E. A. Elsayed
Chapter 8: Preventive Maintenance Models for Complex Systems, David F. Percy
Chapter 9: Artificial Intelligence in Maintenance, Khairy A. H. Kobbacy

Part D Problem Specific Models
Chapter 10: Maintenance of Repairable Systems, Bo Henry Lindqvist
Chapter 11: Optimal Maintenance of Multi-component Systems: A Review, Robin P. Nicolai and Rommert Dekker
Chapter 12: Replacement of Capital Equipment, P.A. Scarf and J.C. Hartman
Chapter 13: Maintenance and Production: A Review of Planning Models, Gabriella Budai, Rommert Dekker and Robin P. Nicolai
Chapter 14: Delay Time Modelling, Wenbin Wang

Part E Management
Chapter 15: Maintenance Outsourcing, D.N.P. Murthy and N. Jack
Chapter 16: Maintenance of Leased Equipment, D.N.P. Murthy and J. Pongpech
Chapter 17: Computerised Maintenance Management Systems, Ashraf Labib
Chapter 18: Risk Analysis in Maintenance, Terje Aven
Chapter 19: Maintenance Performance Measurement (MPM) System, Uday Kumar and Aditya Parida
Chapter 20: Forecasting for Inventory Management of Service Parts, John E. Boylan and Aris A. Syntetos

Part F Applications (Case Studies)
Chapter 21: Maintenance in the Rail Industry, Jørn Vatn
Chapter 22: Condition Monitoring of Diesel Engines, Renyan Jiang and Xinping Yan
Chapter 23: Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration), Ulla Espling and Uday Kumar
Chapter 24: Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets, Jayantha P. Liyanage
Chapter 25: Fault Detection and Identification for Longwall Machinery Using SCADA Data, Daniel R. Bongers and Hal Gurgenci

Contributor Biographies
Index

Part A

An Overview

1 An Overview

K.A.H. Kobbacy and D.N.P. Murthy

1.1 Introduction

The efficient functioning of modern society depends on the smooth operation of many complex systems, each comprising several pieces of equipment, that provide a variety of products and services. These include transport systems (trains, buses, ferries, ships and aeroplanes), communication systems (television, telephone and computer networks), utilities (water, gas and electricity networks), manufacturing plants (to produce industrial products and consumer durables), processing plants (to extract and process minerals and oil), hospitals (to provide services) and banks (for financial transactions) to name a few.

All equipment is unreliable in the sense that it degrades with age and/or usage and fails when it is no longer capable of delivering the products and services. When a complex system fails, the consequences can be dramatic. A failure can result in serious economic losses, affect humans and do serious damage to the environment as, for example, the crash of an aircraft in flight, the failure of a sewage processing plant or the collapse of a bridge. Through proper corrective maintenance, one can restore a failed system to an operational state by actions such as repair or replacement of the components that failed and in turn caused the failure of the system. The occurrence of failures can be controlled through maintenance actions, including preventive maintenance, inspection, condition monitoring and design-out maintenance. With good design and effective preventive maintenance actions, the likelihood of failures and their consequences can be reduced but failures can never be totally eliminated.

The approach to maintenance has changed significantly over the last one hundred years. Over a hundred years ago, the focus was primarily on corrective maintenance delegated to the maintenance section of the business to restore failed systems to an operational state. Maintenance was carried out by trained technicians and was viewed as an operational issue and did not play a role in the design and operation of the system. The importance of preventive maintenance was fully appreciated during the Second World War. Preventive maintenance involves additional costs and is worthwhile only if the benefits exceed the costs. Deciding the optimum level of maintenance requires building appropriate models and the use of sophisticated optimisation techniques. Also, around this time, maintenance issues began to be addressed at the design stage and this led to the concept of maintainability. Reliability and maintainability (R&M) became major issues in the design and operation of systems.

Degradation and failure depend on the stresses on the various components of the system. These depend on the operating conditions that are dictated by commercial considerations. As a result, maintenance moved from a purely technical issue to a strategic management issue with options such as outsourcing of maintenance, leasing equipment as opposed to buying, etc. Also, advances in technologies (new materials, new sensors for monitoring, data collection and analysis) added new dimensions (science, technology) to maintenance. These advances will continue at an ever-increasing pace in the twenty-first century.

This handbook tries to address the various issues associated with the maintenance of complex systems. The aim is to give a snapshot of the current status and highlight future trends. Each chapter deals with a particular aspect of maintenance (for example, methodology, approaches, technology, management, modelling analysis and optimisation), reports on developments and trends in a particular industry sector, or deals with a case study.

In this chapter we give an overview of the handbook. The outline of the chapter is as follows. Section 1.2 deals with the framework that is needed to study the maintenance of complex systems and we discuss some of the salient issues. Section 1.3 presents the structure of the book and gives a brief outline of the different chapters in the handbook. We conclude with a discussion of the target audience for the handbook.
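The remark in this section that preventive maintenance is worthwhile only if the benefits exceed the costs can be illustrated with a minimal sketch. It is not taken from the handbook: it assumes the classical age replacement policy (replace at failure or at a fixed age, whichever comes first), a Weibull lifetime, and hypothetical cost figures. The function names and parameters (`cost_rate`, `c_pm`, `c_fail` and so on) are illustrative only.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that the item survives beyond age t (Weibull lifetime, an assumption)."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(replace_age, shape, scale, c_pm, c_fail, steps=2000):
    """Long-run cost per unit time when the item is replaced at failure
    or at age replace_age, whichever comes first."""
    h = replace_age / steps
    # Trapezoidal rule for the expected cycle length: integral of R(t) over [0, replace_age].
    area = 0.5 * (1.0 + weibull_reliability(replace_age, shape, scale))
    for i in range(1, steps):
        area += weibull_reliability(i * h, shape, scale)
    expected_length = area * h
    survive = weibull_reliability(replace_age, shape, scale)
    # A cycle ends with a planned replacement (cost c_pm) or a failure (cost c_fail).
    expected_cost = c_pm * survive + c_fail * (1.0 - survive)
    return expected_cost / expected_length


def best_replacement_age(shape, scale, c_pm, c_fail, candidates):
    """Pick the candidate replacement age with the lowest long-run cost rate."""
    return min(candidates, key=lambda age: cost_rate(age, shape, scale, c_pm, c_fail))


if __name__ == "__main__":
    ages = [0.1 * k for k in range(1, 51)]  # hypothetical candidate ages
    for shape in (1.0, 3.0):
        age = best_replacement_age(shape, 1.0, 1.0, 10.0, ages)
        print(shape, round(age, 1), round(cost_rate(age, shape, 1.0, 1.0, 10.0), 2))
```

Under these assumptions, a shape parameter of 1 (no ageing) makes the cost rate fall as the replacement age grows, so planned replacement does not pay, whereas a shape parameter above 1 (wear-out) yields an interior optimum that is cheaper than running to failure.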

1.2 Framework for Study of Maintenance

A proper study of maintenance requires a comprehensive framework that incorporates all the key elements. However, not all the elements would be relevant for a particular maintenance problem under consideration. The systems approach is an effective approach to solving maintenance problems. In this approach, the real world relevant to the problem is described through a characterisation where one identifies the relevant variables and the interaction between the variables. This characterisation can be done using language or a schematic network representation where the nodes represent the variables and the connected arcs denote the relationships. This is good for qualitative analysis. For quantitative analysis, one needs to build mathematical models to describe the relationships. Often this requires stochastic and dynamical formulations as system degradation and failures occur in an uncertain manner.

In this section, we discuss the various key elements and some related issues. We use the term “asset” to denote a complex system or individual equipment. It can include infrastructures such as buildings, bridges etc. in addition to those listed in Section 1.1.


1.2.1 Stakeholders

For an asset there can be several stakeholders as indicated in Figure 1.1.

Figure 1.1. Stakeholders for maintenance of an asset

The number of parties involved would depend on the asset under consideration. For example, in the case of a rail network (used to provide a service to transport people and goods) the customers can include the rail operators (operating the rolling stock) and the public. The owner can be a business entity, a financial institution or a government agency. The operator is the agency that operates the track and is responsible for the flow of traffic. The service provider refers to the agency carrying out the maintenance (preventive and corrective). It can be the operator (in which case maintenance is done in-house) or some external agent (if maintenance is outsourced) or both (when only some of the maintenance activities are outsourced). The regulator is the independent agency which deals with safety and risk issues. It defines the minimum standards for safety and can impose fines on the owner, operator and possibly the service provider should the safety levels be compromised. Government plays a critical role in providing the subsidy and assuming certain risks. In this case all the parties involved are affected by the maintenance carried out on the asset. If the line is shut frequently and/or for long durations, it can affect customer satisfaction and patronage, the returns to the operators and owners and the costs to the government.

1.2.2 Different Perspectives

We focus our attention on the case where the asset is owned by the owner and maintenance is outsourced. In this case, we have two parties: (i) the owner (of the asset) and (ii) the service agent (providing the maintenance). Figure 1.2 is a very simplified system characterisation of the maintenance process where the maintenance activities are defined through a maintenance service contract. The problem is to determine the terms of the service contract.

Figure 1.2. System characterisation for maintenance out-sourcing

Each of the elements of Figure 1.2 involves several variables. For example, the maintenance service contract involves the following: (i) duration of contract, (ii) price of contract, (iii) maintenance performance requirements, (iv) incentives and penalties, (v) dispute resolution, etc. The maintenance performance requirements can include measures such as availability, mean time between failures and so on. The characterisation of the owner’s decision-making process can involve costs, asset state at the end of the contract, risks (service agent not providing the level and quality of service) and so on. The interests and goals of the owner are different from those of the service agent. The study of maintenance is complicated by unknown and uncontrollable factors. These include the rate of degradation (which depends on several factors such as material properties, operating environment etc.) and commercial factors (for example, high demand for power in the case of a power plant due to very hot weather).

1.2.3 Key Issues and the Need for Multi-disciplinary Approach

The key issues in the maintenance of an asset are shown in Figure 1.3. The asset acquisition is influenced by business considerations and its inherent reliability is determined by the decisions made during design. The field reliability and degradation are affected by operations (usage intensity, operating environment, operating load etc.). Through use of technologies, one can assess the state of the asset. The analysis of the data and models allows for optimizing the maintenance decisions (either for a given operating condition or jointly optimizing the maintenance and operations). Once the maintenance actions have been formulated they need to be implemented.
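As a small illustration of the maintenance performance requirements named in Section 1.2.2 (availability and mean time between failures), the sketch below computes both from a log of operating and repair durations and checks them against contract thresholds. It is a hedged example, not part of the handbook: mean time between failures is taken here as the mean operating time between failures, availability as uptime divided by total time, and the function names and thresholds are hypothetical.

```python
def performance_measures(uptimes, downtimes):
    """Return (mtbf, mttr, availability) from observed operating and repair durations.

    uptimes:   operating periods that each ended in a failure
    downtimes: repair periods that followed those failures
    """
    if not uptimes or not downtimes:
        raise ValueError("need at least one uptime and one downtime")
    mtbf = sum(uptimes) / len(uptimes)      # mean operating time between failures
    mttr = sum(downtimes) / len(downtimes)  # mean time to repair
    availability = sum(uptimes) / (sum(uptimes) + sum(downtimes))
    return mtbf, mttr, availability


def meets_contract(uptimes, downtimes, min_availability, min_mtbf):
    """Check hypothetical contract thresholds on availability and MTBF."""
    mtbf, _, availability = performance_measures(uptimes, downtimes)
    return availability >= min_availability and mtbf >= min_mtbf


if __name__ == "__main__":
    up = [100.0, 200.0, 300.0]   # hours of operation before each failure (made-up data)
    down = [10.0, 20.0, 30.0]    # hours of repair after each failure (made-up data)
    print(performance_measures(up, down))
    print(meets_contract(up, down, min_availability=0.9, min_mtbf=150.0))
```

A contract could attach the incentives and penalties of item (iv) to the outcome of such a check, although the measures and thresholds actually used would be a matter for the parties.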


Figure 1.3. Key Issues in maintenance of an asset

To execute effective maintenance one needs to have a good understanding of a variety of concepts and techniques for each of the issues. Another issue is the computer packages that allow one to collect and analyze data and build models and derive the optimal solutions. The linking of the technical and commercial issues is indicated in Figure 1.4 and this requires an inter-disciplinary approach.

Figure 1.4. Linking technical and commercial issues


The disciplines involved are as follows.

1.2.3.1 Engineering
The degradation of an asset depends to some extent on the design and building (or production) of the asset. Poor design leads to poor reliability that in turn results in a high level of corrective maintenance. On the other hand, a well-designed system is more reliable and hence less prone to failures. Maintainability deals with maintenance issues at the design and development stage of the asset.

1.2.3.2 Science
This is very important in the understanding of the physical mechanisms that are at play and have a significant influence on degradation and failure. Choosing the wrong material can have serious consequences and affect the subsequent maintenance actions needed.

1.2.3.3 Economic
Maintenance costs can be a significant fraction of the total operating budget for a business depending on the industry sector. There are two types of costs: annual cost and cost over the life cycle of the asset. The costs can be divided into direct (labour, material etc.) and indirect (consequence of failure).

1.2.3.4 Legal
This is important in the context of maintenance out-sourcing and maintenance of leased equipment. In both cases, the central issue is the contract between the parties involved. Of particular importance is dispute resolution when the parties disagree over whether some terms of the contract have been violated.

1.2.3.5 Statistics
Degradation and failures occur in an uncertain manner. As such, the analysis of degradation and failure data requires the use of statistical techniques. Statistics provides the concepts and tools to extract information from data and for the planning of efficient collection systems.

1.2.3.6 Operational Research
Operational research provides the tools and techniques for model building, analysis and optimization. Often, analytical approaches fail and one needs to use a simulation approach to evaluate the outcomes of different decisions and to choose the optimal (or near optimal) strategies.

1.2.3.7 Reliability Theory
Reliability theory deals with the interdisciplinary use of probability, statistics and stochastic modelling, combined with engineering insights into the design and the scientific understanding of the failure mechanisms, to study the various aspects of reliability. As such, it encompasses issues such as (i) reliability modelling, (ii) reliability analysis and optimization, (iii) reliability engineering, (iv) reliability science, (v) reliability technology and (vi) reliability management.
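As a brief illustration of the probabilistic quantities that reliability theory works with, the standard textbook definitions can be written as follows. They are not taken from this chapter; the symbols $T$ (time to failure), $F$, $f$, $R$ and $h$ are introduced here for illustration, and the block assumes the `amsmath` package.

```latex
\begin{align}
F(t) &= \Pr(T \le t), & R(t) &= 1 - F(t), \\
f(t) &= \frac{dF(t)}{dt}, & h(t) &= \frac{f(t)}{R(t)}, \\
\mathrm{MTTF} &= \int_0^\infty R(t)\,dt .
\end{align}
```

Here $R(t)$ is the reliability (survival) function, $h(t)$ the failure (hazard) rate and MTTF the mean time to failure; an increasing $h(t)$ corresponds to the degradation with age discussed in Section 1.1.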


1.2.3.8 Information Technology and Computer Science
The operation and maintenance of complex assets generate a lot of data. One needs efficient ways to store and manipulate the data and to extract relevant information from data. Computer science provides a range of artificial intelligence techniques such as data mining, expert systems, neural networks etc., which are very important in the context of maintenance.

1.2.4 Maintenance Management

Maintenance management deals with the overall management of the maintenance of an asset. The management needs to be done at three different levels (strategic, tactical and operational) as indicated in Figure 1.5.

[Figure 1.5 content]
Strategic level (maintenance strategy): business perspective; technical and commercial; in-house vs. out-sourcing; replacement / design changes.
Tactical level (maintenance planning and scheduling): degradation (reliability science); maintenance policies; logistics (spares, facilities etc.).
Operational level (maintenance work execution): data collection; data analysis (root cause, other factors).

Figure 1.5. Maintenance management

The strategic level deals with maintenance strategy. This needs to be formulated so that it is consistent and coherent with other (production, marketing, finance, etc.) business strategies. The tactical level deals with the planning and scheduling of maintenance. The operational level deals with the execution of the maintenance tasks and collection of relevant data.

1.3 Structure of the Handbook

The handbook integrates the vast literature on maintenance with each chapter focusing on a different aspect of maintenance and written by active researchers with international reputations and/or experienced practitioners from industry. Each chapter either reviews the literature dealing with a particular aspect of maintenance (for example, methodology, approaches, technology, management, modelling analysis and optimisation), reports on developments and trends in a particular industry sector, or deals with a case study.

The book is structured into six parts and each of the last five parts contains several chapters. The topics of the different chapters are as indicated below.

Part A: An Overview
Chapter 1: An Overview (Khairy Kobbacy and Pra Murthy)

Part B: Evolution of Concepts and Approaches
Chapter 2: Maintenance: An Evolutionary Perspective (Liliane Pintelon and Alejandro Parodi-Herz)
Chapter 3: New Technologies for Maintenance (Jay Lee and Haixia Wang)
Chapter 4: Reliability Centred Maintenance (Marvin Rausand and Jørn Vatn)

Part C: Methods and Techniques
Chapter 5: Condition-based Maintenance Modelling (Wenbin Wang)
Chapter 6: Maintenance Based on Limited Data (David F. Percy)
Chapter 7: Reliability Prediction and Accelerated Testing (Elsayed A. Elsayed)
Chapter 8: Preventive Maintenance Models for Complex Systems (David F. Percy)
Chapter 9: Artificial Intelligence in Maintenance (Khairy A.H. Kobbacy)

Part D: Problem Specific Models
Chapter 10: Maintenance of Repairable Systems (Bo Henry Lindqvist)
Chapter 11: Optimal Maintenance of Multi-component Systems: A Review (Robin P. Nicolai and Rommert Dekker)
Chapter 12: Replacement of Capital Equipment (Philip A. Scarf and Joseph C. Hartman)
Chapter 13: Maintenance and Production: A Review of Planning Models (Gabriella Budai, Rommert Dekker and Robin P. Nicolai)
Chapter 14: Delay Time Modelling (Wenbin Wang)

Part E: Management
Chapter 15: Maintenance Outsourcing (Pra Murthy and Nat Jack)
Chapter 16: Maintenance of Leased Equipment (Pra Murthy and Jarumon Pongpech)
Chapter 17: Computerised Maintenance Management Systems (Ashraf Labib)
Chapter 18: Risk Analysis in Maintenance (Terje Aven)
Chapter 19: Maintenance Performance Measurement (MPM) System (Uday Kumar and Aditya Parida)
Chapter 20: Forecasting for Inventory Management of Service Parts (John E. Boylan and Aris A. Syntetos)

Part F: Applications (Case Studies)
Chapter 21: Maintenance in the Rail Industry (Jørn Vatn)
Chapter 22: Condition Monitoring of Diesel Engines (Renyan Jiang and Xinping Yan)
Chapter 23: Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration) (Ulla Espling and Uday Kumar)
Chapter 24: Integrated e-Operations–e-Maintenance: Applications in North Sea Offshore Assets (Jayantha P. Liyanage)
Chapter 25: Fault Detection and Identification for Longwall Machinery Using SCADA Data (Daniel Bongers and Hal Gurgenci)

A brief outline of each chapter is as follows Chapter 2: Maintenance: An Evolutionary Perspective In the past few decades industrial maintenance has evolved from a non-issue into a strategic concern. During this period the role of maintenance has drastically been transformed. This chapter, while considering the fundamental elements of maintenance and its environment, describes the evolution path of maintenance management and the driving forces of such changes. It basically explains how and why maintenance practice has evolved in time. It includes basic notions of maintenance and clearly classifies and distinguishes between different types of maintenance actions, policies and concepts currently available. The chapter concludes by enlightening the reader with some new challenges in maintenance Chapter 3: New Technologies for Maintenance Predictive maintenance is critical to any engineering system, especially complex systems, in order to avoid system breakdown. With the recent advances in pervasive computing, prognostics can be easily embedded in any devices and systems. When smart machines are networked and remotely monitored, and when their data is modelled and continually analyzed with sophisticated embedded systems, it is possible to go beyond mere “predictive maintenance” to intelligent “prognostics”, the process of pinpointing exactly which components of a machine are likely to fail and then autonomously trigger service and order spare parts. This chapter addresses the paradigm shift in modern maintenance systems from the traditional “fail and fix” practices to a “predict and prevent” methodology. Recent advances in prognostic technologies and tools are presented, and future work directions are discussed. Chapter 4: Reliability Centred Maintenance This chapter gives an introduction to reliability centred maintenance (RCM). The RCM analysis process is divided into 12 distinct steps. Each step is thoroughly described and discussed. 
The main RCM process is similar to the processes outlined in RCM standards and guidelines, but has more focus on the optimization of maintenance intervals. A new approach is proposed based on generic RCM analyses related to specified classes of consequences. The new approach will significantly reduce the workload of the RCM analysis. A computer tool OptiRCM

12

K. Kobbacy and D. Murthy

that has been developed by the authors, is used to illustrate the new approach. Several examples from railway applications are provided. Chapter 5: Condition-based Maintenance Modelling This chapter presents a model for supporting condition based maintenance decision making. The chapter discusses various issues related to the subject, such as the definition of the state of an asset, direct or indirect monitoring, relationship between observed measurements and the state of the asset, and current modelling developments. In particular, the chapter focuses on a modelling technique used recently in predicting the residual life via stochastic filtering. This is a key element in modelling the decision making aspect of condition based maintenance. A few key condition monitoring techniques are also introduced and discussed. Methods of estimating model parameters are outlined and a numerical example based on real data is presented. Chapter 6: Maintenance-based on Limited Data Reliability applications often suffer from paucity of data for making informed maintenance decisions. This is particularly noticeable for high reliability systems and when new production lines or new warranty schemes are planned. Such issues are of great importance when selecting and fitting mathematical models to improve the accuracy and utility of these decisions. This chapter investigates why reliability data are so limited and proposes statistical methods for dealing with these difficulties. It considers graphical and numerical summaries, appropriate methods for model development and validation, and the powerful approach of subjective Bayesian analysis for including expert knowledge about the application area. Chapter 7: Reliability Prediction and Accelerated Testing This chapter presents an overview of accelerated life testing (ALT) methods and their use in reliability prediction at normal operating conditions. 
It describes the most commonly used models and introduces new ones which are “distribution free”. The design of optimum test plans to improve the accuracy of reliability prediction is also presented and discussed. The chapter provides, for the first time, the link between accelerated life testing and maintenance actions. It develops procedures for using the ALT results for estimating the optimum preventive maintenance schedule and the optimum degradation threshold level for degrading systems. The procedures are demonstrated using two numerical examples.

Chapter 8: Preventive Maintenance Models for Complex Systems

Preventive maintenance (PM) of repairable systems can be very beneficial in reducing repair and replacement costs, and in improving system availability. Strategies for scheduling PM are often based on intuition and experience, though considerable improvements in performance can be achieved by fitting mathematical models to observed data. For simple repairable systems comprising few components or many identical components, compound renewal processes are appropriate. This chapter reviews basic and advanced models for complex repairable systems and demonstrates their use for determining optimal PM intervals. Computational difficulties are addressed and practical illustrations are presented, based on subsystems of oil platforms and

Chapter 9: Artificial Intelligence in Maintenance

AI techniques have been used successfully in the past two decades to model and optimise maintenance problems. This chapter reviews the application of Artificial Intelligence (AI) in maintenance management and introduces the concept of developing an intelligent maintenance optimisation system. The chapter starts with an introduction to maintenance management, planning and scheduling, and a brief definition of AI and some of its techniques that have applications in maintenance management. A review of the literature is then presented covering the applications of AI in maintenance. We have focused on five AI techniques, namely Knowledge Based Systems, Case Based Reasoning, Genetic Algorithms, Neural Networks and Fuzzy Logic. This review also covers “hybrid” systems where two or more AI techniques are used in an application. A discussion then follows of the prototype hybrid intelligent maintenance optimisation system (HIMOS), which was developed to evaluate and enhance PM routines of complex engineering systems. The chapter ends with a discussion of future research and concluding remarks.

Chapter 10: Maintenance of Repairable Systems

A repairable system is traditionally defined as a system which, after failing to perform one or more of its functions satisfactorily, can be restored to fully satisfactory performance by any method other than replacement of the entire system. An extended definition used in this chapter includes the possibility of additional maintenance actions which aim at servicing the system for better performance, referred to as preventive maintenance (PM). The common models for the failure process of a repairable system are renewal processes (RP) and non-homogeneous Poisson processes (NHPP).
The chapter considers several generalizations and extensions of the basic models, for example the trend renewal process (TRP), which includes NHPP and RP as special cases and has the property of allowing a trend in processes of non-Poisson type. When several systems of the same kind are considered, there may be an unobserved heterogeneity between the systems which, if overlooked, may lead to wrong decisions. This phenomenon is considered in the framework of the TRP. We then consider the extension of the basic models obtained by introducing the possibility of PM using a competing risks approach. Finally, models for periodically inspected systems are studied, using a combination of time-continuous and time-discrete Markov chains.

Chapter 11: Optimal Maintenance of Multi-component Systems: A Review

This chapter gives an overview of the literature on multi-component maintenance optimization, focusing on work appearing since the 1991 survey by Cho and Parlar. A classification scheme primarily based on the dependence between components (stochastic, structural or economic) is introduced. Next, the papers are also classified on the basis of the planning aspect (short-term vs. long-term), the grouping of maintenance activities (either grouping preventive or corrective maintenance, or opportunistic grouping) and the optimization approach used (heuristic, policy classes or exact algorithms). Finally, attention is paid to the applications of the models.

Chapter 12: Replacement of Capital Equipment

This chapter deals with models of replacement of capital equipment. Capital replacement models may be classified as economic life models or dynamic programming models. The former are concerned with determining the optimal lifetime of an item of equipment taking account of costs over some planning horizon. The latter consider replacement decisions dynamically, determining whether plant should be retained or replaced after each period. We begin by looking at simple economic life models. These are applied in a case study on escalator replacement. Economic life models are then extended to consider first an inhomogeneous fleet and then a network system viewed as an inhomogeneous fleet with interacting items. A number of different dynamic programming models are introduced for singular systems and then expanded to homogeneous and inhomogeneous fleets and networks of assets.

Chapter 13: Maintenance and Production: A Review of Planning Models

This chapter gives an overview of the relation between planning of maintenance and production. Production planning and scheduling models where failures and maintenance aspects are taken into account are considered first. The planning of maintenance activities is considered next, where both preventive and corrective maintenance are discussed. Third, the planning of maintenance activities at moments in time when the items to be maintained are not needed, or are less needed, for production, also called opportunity maintenance, is considered. Apart from describing the main ideas, approaches and results, a number of applications are provided.

Chapter 14: Delay Time Modelling

This chapter presents a modelling tool that was created to model the problems of inspection maintenance and planned maintenance interventions, namely Delay Time Modelling (DTM).
This concept provides a modelling framework readily applicable to a wide class of actual industrial maintenance problems of assets in general and inspection problems in particular. The delay time concept defines the failure process of an asset as a two-stage process. The first stage is the normal operating stage from new to the point that a hidden defect has been identified. The second stage is defined as the failure delay time from the point of defect identification to failure. It is the existence of such a failure delay time which provides the opportunity for preventive maintenance to be carried out to remove or rectify the identified defects before failures. With appropriate modelling of the durations of these two stages, optimal inspection intervals can be identified to optimise a criterion function of interest. This chapter first gives an outline of the delay time concept, then introduces two delay time inspection models, of a single component and of a complex system respectively. The parameter estimation techniques used in DTM are discussed. Extensions to the basic delay time model are highlighted and a discussion of future research in DTM concludes the chapter.
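The two-stage idea can be illustrated with a small simulation. The following is only a minimal sketch and not the model developed in Chapter 14: it assumes exponentially distributed durations for both stages, perfect inspections at a fixed interval, and a defect that is detectable from the moment it arises. The function names and the rate values are hypothetical, chosen purely for illustration.

```python
import math
import random


def simulate_cycle(inspection_interval, defect_rate, delay_rate, rng):
    """Simulate one component life and report how it ends.

    Returns "failure" if the defect leads to failure before the next
    inspection, otherwise "preventive" (defect found and rectified).
    """
    # Stage 1: time from new until a defect arises (assumed exponential).
    defect_time = rng.expovariate(defect_rate)
    # Stage 2: delay time from defect to failure (assumed exponential).
    delay_time = rng.expovariate(delay_rate)
    failure_time = defect_time + delay_time

    # First inspection at or after the defect arises; inspections are
    # assumed perfect, so the defect is always found at that point.
    next_inspection = math.ceil(defect_time / inspection_interval) * inspection_interval

    return "failure" if failure_time < next_inspection else "preventive"


def failure_fraction(inspection_interval, defect_rate, delay_rate,
                     n_cycles=20000, seed=0):
    """Estimate the share of cycles that end in failure rather than PM."""
    rng = random.Random(seed)
    failures = sum(
        simulate_cycle(inspection_interval, defect_rate, delay_rate, rng) == "failure"
        for _ in range(n_cycles)
    )
    return failures / n_cycles


if __name__ == "__main__":
    # Hypothetical rates: mean time to defect 10, mean delay time 2.
    for interval in (1.0, 5.0, 20.0):
        share = failure_fraction(interval, defect_rate=0.1, delay_rate=0.5)
        print(f"inspection interval {interval:5.1f}: failure share {share:.3f}")
```

Under these assumptions, shorter inspection intervals catch more defects inside the delay time window. A criterion function such as expected cost or downtime per unit time would then weigh this benefit against the cost of inspecting more often.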


Chapter 15: Maintenance Outsourcing

It is often uneconomical for businesses to carry out their own maintenance on complex equipment. The alternative is to ‘out-source’ the maintenance function and use an external agent, under a service contract, to carry out some or all of the maintenance actions (preventive and corrective). This chapter develops the framework needed to study decision-making for maintenance outsourcing from both the customer (equipment owner) and service agent perspectives. The relevant literature is reviewed, and a game theoretic approach to maintenance outsourcing and the use of agency theory are discussed. The link between maintenance outsourcing and extended warranties is highlighted and the scope for future research in both areas is examined.

Chapter 16: Maintenance of Leased Equipment

For leased equipment, the lessor has to carry out the maintenance of the equipment over the lease period. To ensure satisfactory performance and maintenance, the lease contract has penalty terms which result in the lessor having to compensate the lessee if the number of failures exceeds some specified number and/or the time to rectify each failure exceeds some specified value. This implies that the lessor needs to take these penalties into account in determining the optimal maintenance strategy. The chapter starts with a conceptual framework to discuss the different issues involved and then looks at models to help the lessor in developing the optimal maintenance strategy.

Chapter 17: Computerised Maintenance Management Systems

Computerised maintenance management systems (CMMSs) are vital for the coordination of all activities related to the availability, productivity and maintainability of complex systems. Modern computational facilities have offered a dramatic scope for improved effectiveness and efficiency in, for example, maintenance. CMMSs have existed, in one form or another, for several decades.
In this chapter, the characteristics of CMMSs are investigated, highlighting the need for them in industry and identifying their current deficiencies. A proposed model is then presented to provide a decision analysis capability that is often missing in existing CMMSs. The effect of such a model is to contribute towards the optimisation of the functionality and scope of CMMSs for enhanced decision analysis support. The use of AI techniques in CMMSs is illustrated. Finally, the features of next generation maintenance systems are highlighted.

Chapter 18: Risk Analysis in Maintenance

Risk analysis can be used for selection and prioritisation of maintenance activities, and this application of risk analysis has been given increased attention in recent years. This chapter presents and discusses the use of risk analysis for this purpose. The chapter reviews some critical aspects of risk analysis important for the successful implementation of such analyses in maintenance. This relates to risk descriptions and categorisations, uncertainty assessments, risk acceptance and risk informed decision making, as well as selection of appropriate methods and tools. Both qualitative and quantitative approaches are covered. A detailed risk analysis is outlined showing the effect of maintenance on risk.


Chapter 19: Maintenance Performance Measurement (MPM) System

It is important that factors influencing the performance of the maintenance process are identified and measured, so that they can be monitored and controlled for improvement. In this chapter, besides an overview of performance measurement, maintenance performance indicators, the associated issues and challenges in developing a maintenance performance measurement framework, and the indicators in use in different industries are discussed. The framework considers stakeholders, business environment, multi-criteria and hierarchical needs, amongst others.

Chapter 20: Forecasting for Inventory Management of Service Parts

This chapter addresses issues pertinent to forecasting for the inventory management of service parts. In some sectors, such as the aerospace and automotive industries, a very wide range of service parts are held in stock, with significant implications for availability and inventory holding. Their management is therefore an important task. First, a number of possible approaches to classifying service parts for forecasting and inventory management related purposes are reviewed. Second, parametric and non-parametric approaches to forecasting service parts requirements are discussed, followed by the presentation of appropriate metrics for measuring the performance of the inventory management system. The existing empirical evidence on various forecasting methods is then summarised. Finally, the conclusions of this work are presented along with the identification of some natural avenues for further research.

Chapter 21: Maintenance in the Rail Industry

The chapter presents two case studies in railway maintenance. The first case study presents an optimisation model for preventive maintenance of a train bogie. In the model a dynamic approach to grouping of maintenance activities is used, enabling, e.g., opportunity maintenance. Data from the Norwegian State Railways have been used in the calculation example.
The second case study presents a life cycle cost approach to prioritization of larger maintenance and renewal projects under budget constraints.

Chapter 22: Condition Monitoring of Diesel Engines

Various techniques have been widely used to monitor the condition of diesel engines. Analysis of engine lubricant is one of the most widely used condition monitoring techniques. In this chapter, a case study applying the oil analysis technique to monitor the condition of marine diesel engines is presented. The case study focuses on analysis and modelling of oil monitoring data. The study first introduces the concept of state discriminant capability of condition variables and uses it to identify the significant condition variables, and then develops a state discriminant model to determine the state of the monitored system based on the current observation. The model parameters are obtained by directly minimizing the misjudgment probability. We believe that the proposed model has great potential for use, due to its plausible mathematical basis and simplicity, though it needs further testing with new data.


Chapter 23: Benchmarking of the Maintenance Process at Banverket (The Swedish National Rail Administration)

To sustain a competitive edge in the business, railway companies all over the world are looking for ways and means to improve their maintenance performance. Benchmarking is a very effective tool that can assist management in their pursuit of continuous improvement of their operation. Three different benchmarks have been studied, based on a project benchmarking the maintenance process across borders, another project dealing with benchmarking of maintenance outsourcing by different track regions in Sweden, and a third project studying the level of transparency among the European railway administrations. The chapter discusses the pros and cons, the areas for improvement and the need for improvement of benchmarking metrics and framework.

Chapter 24: Integrated e-Operations–e-Maintenance: Application in North Sea Offshore Assets

Ongoing developments in Norway provide a good example of how an industry-wide re-engineering process has triggered major changes in operations and maintenance practice of complex and high-risk assets, leading towards what is termed integrated e-operations–e-maintenance. It aims at a step-change to the conventional operations and maintenance practices of offshore assets. Initiatives have already been taken to exploit new methods, smart techniques and digital technologies to enable remote monitoring of offshore equipment condition and asset performance in land-based onshore support facilities using large ICT networks. This has already proved to have direct positive implications for the technical and safety integrity of assets, and subsequently for the plant economics. This chapter shares current experience and knowledge with reference to ongoing developments in the Norwegian oil and gas industry.
It highlights current offshore asset maintenance practice, the changing technical and economic environment that leads towards an e-approach, the development and implementation of integrated e-operations and e-maintenance solutions in the North Sea, key features of the e-approach in North Sea assets, and future challenges to becoming fully integrated and fail-safe.

Chapter 25: Fault Detection and Identification for Longwall Machinery Using SCADA Data

In an attempt to improve equipment availability and facilitate informed, preventative maintenance, engineers may choose to implement one or more fault detection and identification (FDI) technologies. For complex systems (systems for which component interactions are not understood and model uncertainties are significant), data-driven methods of FDI are often the only practicable solution. The development of a data-driven FDI system for longwall mining equipment using SCADA data is described here. Significant data preprocessing was required to generate a quality example set. Missing value estimation (MVE) techniques were required to complete the high-dimensional stream of condition monitoring data from existing sensors. A cost function, in combination with a linear discriminant analysis, was used to ‘align’ the inaccurate, categorical delay records with those delays inferred by the SCADA data. A neural network was developed to determine the state of the system as a function of the real-time SCADA data input. Validation of this algorithm with unseen condition monitoring data showed misclassification rates of machine faults as low as 14.3%.

1.4 Target Audience

The unique features of the book are as follows:

1. It covers the different approaches to maintenance.
2. It deals with many different aspects (scientific, technical, commercial, management, quantitative modelling, etc.).
3. It blends theory with practice.

As such it should appeal to both researchers and practitioners. For researchers (from different disciplines) it should provide a starting point for new research into different aspects of maintenance. For practitioners it should provide the concepts and tools so that these can be used for improvements in the overall business performance. We also hope that it will serve as a reference book for use in postgraduate programs in maintenance.

Part B

Evolution of Concepts and Approaches

2 Maintenance: An Evolutionary Perspective

Liliane Pintelon and Alejandro Parodi-Herz

2.1 Introduction

Over the last decades industrial maintenance has evolved from a non-issue into a strategic concern. Perhaps there are few other management disciplines that underwent so many changes over the last half-century. During this period, the role of maintenance within the organization has been drastically transformed. At first maintenance was nothing more than an inevitable part of production; now it is an essential strategic element in accomplishing business objectives. Without a doubt, the maintenance function is better perceived and valued in organizations. One could consider that maintenance management is no longer viewed as an underdog function; now it is considered as an internal or external partner for success. In view of the intense competition, many organizations seek to survive by producing more, with fewer resources, in shorter periods of time. To meet these needs, physical assets take a central role. However, installations have become highly automated and technologically very complex and, consequently, maintenance management has had to become more complex, having to cope with higher technical and business expectations. Now the maintenance manager is confronted with very complicated and diverse technical installations operating in an extremely demanding business context. This chapter, while considering the fundamental elements of maintenance and its environment, describes the evolution path of maintenance management and the driving forces of such changes. In Section 2.2 the maintenance context is described and its dynamic elements are briefly discussed. Section 2.3 explains how maintenance practices have evolved over time, and different epochs are distinguished. Further, this section devotes special attention to describing a common lexicon for maintenance actions and policies, and then focuses on the evolution of maintenance concepts.
Section 2.4 underlines how the role of the maintenance manager has been reshaped as a consequence of the changes in the maintenance function. Finally, the chapter concludes with Section 2.5, identifying the new challenges for maintenance.


2.2 Maintenance in Context

To discuss the context in which maintenance management is embedded, one may raise the question: what is maintenance as such? Most authors in the maintenance management literature, one way or another, agree on defining maintenance as the “set of activities required to keep physical assets in the desired operating condition or to restore them to this condition”. While this defines what maintenance is about, it may suggest that maintenance is simple, which it is not, as will be confirmed by any maintenance practitioner. Hence “maintenance management” is needed to ingrain maintenance practice in a complex and dynamic context. From a pragmatic view, the key objective of maintenance management is “total asset life cycle optimization”. In other words, maximizing the availability and reliability of the assets and equipment to produce the desired quantity of products, with the required quality specifications, in a timely manner. Obviously, this objective must be attained in a cost-effective way and in accordance with environmental and safety regulations. Figure 2.1 shows that maintenance is embedded in a given business context to which it has to contribute. What is more, it shows that the maintenance function needs to cope with multiple forces and requirements within and outside the walls of the organization. Beyond any doubt, the tasks of maintenance are complex, comprising a blend of management, technology, operations and logistics support elements.

Figure 2.1. Maintenance in context (figure labels: People, Legislation, Management, Society, Technology, Total asset life cycle optimization, Technological evolution, Operations, e-business, Logistics Support, Outsourcing Market, Information Technology, Competition)

To cope with and to coordinate the complex and changing characteristics that constitute maintenance in the first place, a management layer is imperative. Management is about “what to decide” and “how to decide”. In the maintenance arena, a manager juggles technology, operations and logistics elements that mainly need to harmonize with production. Technology refers to the physical assets which maintenance has to support with adequate equipment and tools. Operations indicate the combination of service maintenance interventions with core production activities. Finally, the logistics element supports the maintenance activities in planning, coordinating and ultimately delivering resources like spare parts, personnel, tools and so forth. In one way or another, all these elements are always present, but their intensity and interrelationships will vary from one situation to another. For example, elevator maintenance in a hospital and plant maintenance in the chemical process industries each call for a different maintenance recipe tailored to the specific needs. Clearly, the choice of the structural elements of maintenance is not independent of the environment. Besides, other factors like the business context, society, legislation, technological evolution and the outsourcing market will be important. Furthermore, relatively new trends, such as the e-business context, will influence current and future maintenance management enormously. A whole new era for maintenance is expected as communication barriers are bridged and coordination opportunities of maintenance service become more intense.

2.2.1 Changes in the Playing Field of Maintenance

One should expect that neither maintenance management nor its environment is stationary. The constant changes in the field of maintenance are acknowledged to have enabled new and innovative developments in the field of maintenance science. The technological evolution in production equipment, an ongoing evolution that started in the twentieth century, has been tremendous. At the start of the twentieth century, installations were barely or not mechanized, had simple designs, worked in stand-alone configurations and often had considerable overcapacity. Not surprisingly, nowadays installations are highly automated and technologically very complex. Often these installations are integrated with production lines that are right-sized in capacity. Installations not only became more complex, they also became more critical in terms of reliability and availability.
Redundancy is only considered for very critical components. For example, a pump in a chemical process installation can be considered very critical in terms of safety hazards. Furthermore, built-in equipment characteristics such as modular design and standardization are considered in order to reduce downtime during corrective or preventive maintenance. However, these principles are commonly applied only for some newer, very expensive installations, such as flexible manufacturing systems (FMS). Fortunately, a move towards higher levels of standardization and modularization is beginning to be witnessed at all levels of the installations. As life cycle optimization concepts are commendable, it becomes mandatory that supportability and maintainability requirements are well thought out at the early design stages. Parallel to the technological evolution, the ever-increasing customer focus causes even higher pressure, especially on critical installations. As customer service in terms of time, quality and choice becomes central to production decisions, more flexibility is required to cope with these varying needs. This calls for well-maintained and reliable installations capable of meeting shorter and more reliable lead-time estimates. Physical assets are ever more important for business success.


Maintenance does not escape from the (r)evolution in information and communication technology (ICT), which has tremendously changed business practices. However, we comment further on this topic in Section 2.3, by illustrating the impact on the role of the maintenance manager as such. Furthermore, new production and management principles such as the Just-in-time (JIT) philosophy, Lean principles, total quality management (TQM) and so forth have emerged. These production trends intend, by all means, to reduce waste and remove non-value-added transactions. It is not surprising that work-in-process (WIP) inventories are one of the key issues for improvement. Clearly, WIP inventories incur high costs as a consequence of capital immobilization, expensive floor space, etc. As processes are streamlined, WIP inventories are no longer a buffer for problems; accordingly, asset availability and reliability are ever more imperative. Although these principles were initially conceived for production and manufacturing environments, they are currently also applied and translated to service contexts. Above all, the business environment has also changed. Competition has become fierce and worldwide due to globalization. The latter not only implies that competitors are located all over the world, but also that decisions to move production or service activities from a non-efficient site (e.g. due to high operations and maintenance costs) to another site are quickly taken, even if the other location is on another continent. Obviously, with the advent of globalization and intense competitive pressures, organizations are looking for every possible source of competitive advantage. This implies that the nature of the business environment has become more complex and dynamic, requiring different competitive strategies. Many companies are critically evaluating their value chain and often decide to drastically reorganize it. This results in focusing on the core business.
Consequently, outsourcing of some non-core business activities and the creation of new partnerships and alliances are being considered by many organizations. Not surprisingly, maintenance as a support function is no exception for outsourcing. Yet, it may not be so simple. Outsourcing maintenance of technical systems can become a sensitive issue if it is not handled with diligence. Technical systems are unique and situation specific. For example, outsourcing maintenance of utilities or elevators can be relatively straightforward, but when it comes to production floor equipment it can be a strategic issue that has to be handled with extreme care. These circumstances suggest that outsourcing needs to be considered at operational, tactical and strategic level; see Figure 2.2. The simplest, and also the most common, form of outsourcing is “operational outsourcing”. At this level, a specific task is outsourced and the relationship between supplier and customer is strictly limited to a sell-buy situation. The impact on the internal organization of the customer is also limited. As outsourcing moves up in the organizational pyramid, the relationship between supplier and customer changes and “tactical outsourcing” may be required. At this level of outsourcing the customer shares management responsibility with the supplier and a simple kind of partnership is established. The impact on the internal organization is also greater. Finally, moving towards the organization’s top and for more critical maintenance services, a new form of outsourcing is created, the so-called “strategic outsourcing”. This type of outsourcing is also labelled as “transformational outsourcing” because of its impact on the customer’s internal organization. Here a complete outsourcing is carried out: the maintenance department is cut away from the customer and moved to the supplier. The relationship between customer and supplier is a strong partnership: the customer has fully entrusted the supplier with one of its strategic maintenance activities. This level of outsourcing is as yet less common than the former ones. The rationale for whether or not to outsource maintenance activities is complex and requires a well-thought-out and structured outsourcing process. As mentioned, maintenance outsourcing can cover a lot of alternatives. Fortunately, besides traditional outsourcing of maintenance activities to equipment suppliers or the use of some small local firms, there is nowadays a growing market of medium-sized and large outsourcing firms. These firms offer a range of consulting support, specialized services and even full service to allow strategic outsourcing to work.

Figure 2.2. Outsourcing decision levels (figure labels. Levels: Strategic “Transformational”; Tactic “Partnership”; Operational “Supplier – Customer”. Service types: full service, e.g. outsourcing of all maintenance, BOT, ...; service package, e.g. MRO, utilities, facilities, ...; projects, e.g. renovation, shutdown, ...; specialised services, e.g. high tech equipment, piping, insulation, ...; generic services, e.g. temporary extra capacity (painting, welding, ...). Role labels: to think with, to manage, to organise, to carry out.)

Societal expectations concerning technology are also creating boundary conditions for maintenance management. The attention paid to sustainability (3P: people, profit, planet) is a clear sign of this. Legislation is getting more and more stringent. This is especially important here because of its impact on occupational safety and environmental standards. Note that most of the above-mentioned trends for industrial installations can be easily translated to the service sector. Think, for example, of automated warehouses in distribution centres, hospital equipment or building utilities.


L. Pintelon and A. Parodi-Herz

2.3 Maintenance Practices Over Time

As a consequence of the transformation of the maintenance context, the maintenance function has also evolved drastically, from a non-issue into a strategic concern (see Figure 2.3). At first maintenance was nothing more than an inevitable part of production; it was simply a necessary evil. Repairs and replacements were tackled when needed and no optimization questions were raised. Later on, maintenance came to be seen as a technical matter. This included not only optimizing technical maintenance solutions, but also paying attention to the organization of the maintenance work. Further on, maintenance became a full-blown function instead of a production sub-function. Now maintenance management has clearly become a complex function, encompassing technical and management skills, while still requiring flexibility to cope with the dynamic business environment. Top management recognizes that a well-thought-out maintenance strategy, together with a careful implementation of that strategy, can have a significant financial impact. This has led to treating maintenance as a mature partner in business strategy development, possibly at the same level as production. In turn, these strategies formally consider establishing external partnerships and outsourcing the maintenance function.

Figure 2.3. The maintenance function in a time perspective (timeline by decade, 1940 to 2000: “necessary evil”, “technical matter”, “profit contributor”, “cooperative partnership”)

The fact that maintenance has become more critical implies that a thorough insight into the impact of maintenance interventions, or of their omission, is indispensable. Good maintenance stands for the right allocation of resources (personnel, spares and tools) to guarantee, through a suitable combination of maintenance actions, a higher reliability and availability of the installations. Furthermore, good maintenance foresees and avoids the consequences of failures, which are far more important than the failures as such. Bad or no maintenance can appear to yield some savings in the short run, but sooner or later it will be more costly due to additional unexpected failures, longer repair times, accelerated wear, etc. Moreover, bad or no maintenance may well have a significant impact on customer service, as delivery promises may become difficult to fulfil. Hence, a well-conceived maintenance programme is mandatory to meet business, environmental and safety requirements.

Whatever the particular circumstances, if one intends to compile or judge a maintenance programme, some elementary maintenance terms need to be unambiguous and used consistently. Yet, both in practice and in the literature, a lot of confusion exists. For example, what some call a maintenance policy others refer to as a maintenance action; what some consider preventive maintenance others refer to as predetermined or scheduled maintenance. Furthermore, some argue that some concepts can almost be considered strategies or philosophies, and

Maintenance: An Evolutionary Perspective


so on. Certainly there is a lot of confusion, which is perhaps characteristic of such a dynamic and young management science. The discussion of how to describe some maintenance terms precisely can almost turn into a philosophical argument. However, adopting a rather simple but truly relevant classification is essential. Without intending to disregard preceding terminologies or to impose a norm, we draw attention to three of those confusing terms in particular: maintenance action, maintenance policy and maintenance concept. In the remainder of this chapter the following terminology is adopted.

Maintenance Action. Basic maintenance intervention, elementary task carried out by a technician (What to do?)

Maintenance Policy. Rule or set of rules describing the triggering mechanism for the different maintenance actions (How is it triggered?)

Maintenance Concept. Set of maintenance policies and actions of various types and the general decision structure in which these are planned and supported (The logic and maintenance recipe used?)

2.3.1 Maintenance Actions

Basically, as depicted in Figure 2.4, maintenance actions or interventions can be of two types: corrective maintenance (CM) or precautionary maintenance (PM) actions.

2.3.1.1 Corrective Maintenance Actions (CM)

CM actions are repair or restore actions following a breakdown or loss of function. These actions are “reactive” in nature; this simply means “wait until it breaks, then fix it!”. Corrective actions are difficult to predict, as equipment failure behaviour is stochastic and breakdowns are unforeseen. Replacing a failed light bulb, repairing a ruptured pipeline and repairing a stalled motor are examples of corrective actions.

2.3.1.2 Precautionary Maintenance Actions (PM)

PM actions can be “preventive, predictive, proactive or passive” in nature. These types of actions are somewhat more complex than corrective ones, and a book could be written on each of them. Nonetheless, the fundamental idea is to reduce the failure probability of the physical asset and/or to anticipate, or avoid if possible, the consequences if a failure occurs. Some PM actions (preventive and predictive) are somewhat easier to plan, because they can rely on fixed time schedules or on the prediction of stochastic behaviour. Other types of PM actions, however, become ongoing tasks that originate from the attitude towards maintenance and have become part of the tacit knowledge of the organization. Examples of precautionary actions are lubrication, bi-monthly bearing replacements, inspection rounds, vibration monitoring, oil analysis, design adjustments, etc. All these tasks are considered precautionary maintenance actions; however, the underlying principles may differ.


Figure 2.4. Actions, policies and concepts in maintenance¹ (actions: corrective, i.e. reactive, and precautionary, i.e. preventive, predictive, proactive and passive; policies: FBM, T/UBM, CBM, OBM and DOM; concepts: ad hoc, Q&D, LCC, TPM, RCM, BCM and CIBOCOF, obtained either by optimizing an existing concept or by developing a customized concept)

¹ See abbreviations list at the end of this chapter.

Although this seems a very clear-cut way of defining elementary maintenance interventions, in practice it may still be difficult to assign some interventions to either class. An example is routine maintenance on medical equipment such as a breathing device. Cleaning and sterilizing this equipment can be called precautionary maintenance, since the equipment is not defective at the moment of the intervention. On the other hand, it is very difficult to predict when an intervention will be needed, which is a typical characteristic of a corrective intervention. Furthermore, even within precautionary maintenance it is not always simple to classify certain actions into simple types. This is due to the changing perception of maintenance and the fast evolution of its techniques.

2.3.1.3 Acuity of Maintenance Actions

As maintenance knowledge has grown and more advanced enabling technologies have become available, the perception of which maintenance action is “right” has changed a lot during the last decades. In the 1950s almost all maintenance actions were corrective. Maintenance was considered an annoying and unavoidable cost that could not be managed. Later on, in the 1960s, many companies switched to precautionary (preventive) maintenance programmes, as they recognized that some failures of mechanical components were directly related to the time or number of cycles in use. This belief was mainly based on physical wear of components or age-related fatigue characteristics. At that time, it was accepted

that preventive actions could avoid some of the breakdowns and would lead to cost savings in the long run. The main concern was how to determine, based on historical data, the adequate interval for performing preventive maintenance. Not enough was known about failure patterns, which, among other reasons, led to a whole separate branch of engineering and statistics: reliability engineering.

In the late 1970s and early 1980s, equipment became in general more complex. As a result, the superposition of the failure patterns of individual components started to alter the failure characteristics known from simpler equipment. If there is no dominant age-related failure mode, preventive maintenance actions are of limited use in improving the reliability of complex items. At this point, the effectiveness of preventive maintenance actions started to be questioned and was considered more carefully, and a common concern about “over-maintaining” grew rapidly. Moreover, as the established belief in the benefits of preventive maintenance was called into question, new precautionary (predictive) maintenance techniques emerged. This meant a gradual, though not complete, switch to predictive (inspection and condition-based) maintenance actions. Naturally, predictive maintenance was, and still is, limited to those applications where it is both technically feasible and economically interesting. This trend was supported by the fact that condition-monitoring equipment became more accessible and cheaper. Before that time, these techniques were reserved for high-risk applications such as airplanes or nuclear power plants.

In the late 1980s and early 1990s, the emergence of concurrent engineering or life cycle engineering left a different footprint on maintenance history. Here maintenance requirements were already taken into consideration at earlier product stages such as design or commissioning. As a result, instead of having to deal with built-in characteristics, maintenance became active in setting design requirements for installations and became partly involved in equipment selection and development. All this led to a different type of precautionary (proactive) maintenance, the underlying principle of which is to be proactive at earlier product stages in order to avoid later consequences. Furthermore, as the maintenance function became better appreciated within the organization, more attention was paid to additional proactive maintenance actions. For example, as operators are in direct and regular contact with the installations, they can intuitively identify and “feel” right or wrong working conditions of the equipment. Conditions such as noise, smell, rattle, vibration, etc., which at a given point are not really measured, represent tacit knowledge of the organization that helps to foresee, prevent or avoid failures and their consequences in a proactive manner. These actions are typically not performed by maintenance people themselves, but they are certainly part of the structural evolution of maintenance as a formal or informal partner within the organization.

The last type of precautionary (passive) maintenance actions is driven by the opportunity created by other maintenance actions being planned. These actions are precautionary since they occur prior to a failure, but passive as they “wait” to be scheduled depending on other, probably more critical, actions. Passive actions are in principle of low priority for the maintenance staff since, at a given moment in time, they may not really pose a threat of functional or safety failures. However, these actions can save significant maintenance resources as they may reduce the


number of maintenance interventions, especially when the set-up cost of maintenance is high. For example, when maintenance actions are planned or need to be carried out on offshore oil platforms or on windmills in remote locations, getting to the equipment can be costly. Therefore, finding the best combination of maintenance actions at that point in time is mandatory. This may involve replacing components with significant residual life that in different circumstances would not be replaced.

2.3.2 Maintenance Policies

As new maintenance techniques become available and the economic implications of maintenance actions are better understood, a direct impact on maintenance policies is to be expected. Several types of maintenance policies can be considered to trigger, in one way or another, either precautionary or corrective maintenance interventions. As described in Table 2.1, these policies are mainly failure-based maintenance (FBM), time/use-based maintenance (TBM/UBM), condition-based maintenance (CBM), opportunity-based maintenance (OBM), design-out maintenance (DOM) and e-maintenance.

Table 2.1. Generic maintenance policies

Policy | Description
FBM | Maintenance (CM) is carried out only after a breakdown. In case of CFR behaviour and/or low breakdown costs this may be a good policy.
TBM / UBM | PM is carried out after a specified amount of time or use (e.g. 1 month, 1000 working hours, etc.). CM is applied when necessary. UBM assumes that the failure behaviour is predictable and of the IFR type. PM is assumed to be cheaper than CM.
CBM | PM is carried out each time the value of a given system parameter (condition) exceeds a predetermined value. PM is assumed to be cheaper than CM. CBM is gaining popularity because the underlying techniques (e.g. vibration analysis, oil spectrometry, ...) are becoming more widely available and at better prices. The traditional plant inspection rounds with a checklist are in fact a primitive type of CBM.
OBM | For some components one often waits to maintain them until the “opportunity” arises when repairing some other, more critical components. Whether or not OBM is suited for a given component depends on the expectation of its residual life, which in turn depends on utilization.
DOM | The focus of DOM is to improve the design in order to make maintenance easier (or even eliminate it). Ergonomic and technical (reliability) aspects are important here.

CFR = constant failure rate, IFR = increasing failure rate
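The CFR and IFR assumptions in Table 2.1 can be illustrated numerically. The sketch below is not taken from this chapter: it assumes a classical age-replacement model with a Weibull lifetime, in which a component is replaced preventively at age T at cost c_pm, or correctively on failure at cost c_cm, and it evaluates the long-run cost per unit time. The function names, the Weibull parameters and the cost figures are hypothetical and chosen only for illustration.

```python
import math


def reliability(t, shape, scale):
    """Weibull survival probability R(t); shape = 1 gives CFR, shape > 1 gives IFR."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(T, shape, scale, c_pm, c_cm, steps=2000):
    """Long-run cost per unit time when replacing at age T or at failure.

    Assumed model: cost_rate = (c_pm * R(T) + c_cm * (1 - R(T))) / integral_0^T R(t) dt.
    The integral (expected cycle length) is approximated with the trapezoidal rule.
    """
    h = T / steps
    area = 0.5 * (reliability(0.0, shape, scale) + reliability(T, shape, scale))
    for i in range(1, steps):
        area += reliability(i * h, shape, scale)
    expected_cycle_length = area * h
    r_T = reliability(T, shape, scale)
    expected_cycle_cost = c_pm * r_T + c_cm * (1.0 - r_T)
    return expected_cycle_cost / expected_cycle_length


def best_interval(shape, scale, c_pm, c_cm, candidates):
    """Return the candidate replacement age with the lowest cost rate."""
    return min(candidates, key=lambda T: cost_rate(T, shape, scale, c_pm, c_cm))


if __name__ == "__main__":
    # Hypothetical data: characteristic life 1000 h, CM five times as costly as PM.
    candidates = [100.0 * k for k in range(1, 51)]
    for shape in (1.0, 2.5):
        T = best_interval(shape, 1000.0, 1.0, 5.0, candidates)
        print(f"shape={shape}: best candidate interval = {T} h")
```

Under these assumptions, with shape 1 (CFR) the cost rate keeps falling as T grows, so the sketch selects the largest candidate interval, which amounts to FBM. With a shape above 1 (IFR) and CM more expensive than PM, it selects a finite interval, in line with TBM/UBM.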

For the more common maintenance policies many models have been developed to support tuning and optimization of the policy setting. It is not our intention to explain the fundamental differences between these models, but rather to provide an overview of types of policies available and why these have been developed. Much


has to do with the discussion in the previous section regarding the acuity of maintenance actions. Policy setting, and the understanding of its efficiency and effectiveness, therefore continues to be fine-tuned, as in any other management science. We refer the reader who is particularly interested in the underlying principles and types of models to McCall (1965), Geraerds (1972), Valdez-Flores and Feldman (1989), Cho and Parlar (1991), Pintelon and Gelders (1992), Dekker (1996), Dekker and Scarf (1998) and Wang (2002) for a full overview of the state-of-the-art literature.

The whole evolution of maintenance was based not solely on technical but rather on techno-economic considerations. FBM is still applied provided the cost of PM is equal to or higher than the cost of CM. FBM is also typically suitable in the case of random failure behaviour with a constant failure rate, as TBM or UBM are then not able to reduce the failure probability. In some of these cases, if there exists a measurable condition that can signal the probability of a failure, CBM can also be feasible. Finally, an FBM policy is also applied for installations where frequent PM is impracticable and expensive, as can be the case for glass ovens.

Either TBM or UBM is applied if the CM cost is higher than the PM cost, or if it is necessary because of criticality, for example where there are bottleneck installations or safety hazards. TBM and UBM policies are also appropriate in the case of increasing failure rate behaviour, such as wear-out phenomena.

Traditionally, CBM was mainly applied in situations where the investment in condition-monitoring equipment was justified by high risks, as in aviation or nuclear power generation. Currently, CBM is beginning to be generally accepted for maintaining all types of installations, and it is increasingly becoming common practice in the process industries. In some cases, however, technical feasibility is still a hurdle to overcome. Another reason why CBM attracts the attention of practitioners is the potential saving in spare parts replacements thanks to accurate and timely demand forecasts. In turn, this may enable better spare parts management through coordinated logistics support. Finding and applying a suitable CBM technique is not always easy. For example, analysing the output of some measurement equipment, such as advanced vibration monitoring equipment, requires a lot of experience and is often work for experts. But there are also simpler techniques, such as infrared measurement and oil analysis, that are suitable in other contexts. At the other extreme, predictive techniques can be rather simple, as is the case with checklists. Although a fairly low-level activity, these checklists, together with the human senses (visual inspections, detection of “strange” noises in rotating equipment, etc.), can detect a lot of potential problems and initiate PM actions before the situation deteriorates into a breakdown.

FBM, TBM, UBM and CBM all take the physical assets they intend to maintain as a given. In contrast, there are more proactive maintenance actions and policies which, instead of considering the systems as “a given”, look at the possible changes or safety measures needed to avoid maintenance in the first place. This proactive policy is referred to as DOM. It implies that maintenance is proactively involved at earlier stages of the product life cycle in order to solve potential maintenance-related problems. Ideally, DOM policies aim to avoid maintenance completely throughout the operating life of installations, although this may not be realistic. This leads one to consider a diverse set of maintenance requirements at the


early stages of equipment design. As a consequence, equipment modifications are geared either at increasing reliability by raising the mean time between failures (MTBF) or at increasing maintainability by decreasing the mean time to repair (MTTR). DOM thus aims to improve equipment availability and safety. Some equipment modifications may merely require ergonomic considerations to reduce MTTR; others may need totally new designs. DOM projects are often combined with efforts to increase occupational safety or production capacity, such as set-up reduction programmes.

A rather passive, but important, maintenance policy that needs to be mentioned is OBM. Typically OBM is applied to non-critical components with a relatively long lifetime. For these components no separate maintenance programmes are scheduled; maintenance happens if an opportunity arises due to a maintenance intervention on another component of the same machine.

More recently, in the mid-1990s, with the emergence of the Internet as an enabling technology and the growth of e-business as the standard for business communication, e-maintenance also appeared on the radar of maintenance policies. Rather than a policy, e-maintenance can also be considered a means or enabler for some, if not all, of the previous policies. However, it is more than just an acronym; it is a step towards fully integrated maintenance techniques without the boundaries of place. It is in fact a maintenance policy in its own right that can support other policies. In particular, academics and practitioners watch with anticipation the great impact it may have on CBM. Conditions measured on site can be monitored remotely, opening entirely new dimensions and opportunities for maintenance services. E-maintenance has therefore captured much attention from maintenance researchers, given its great impact on business practice. An example of this evolution is telemaintenance, which allows installations to be diagnosed and a limited range of repairs to be performed from a remote location using ICT and sophisticated control and knowledge tools.

2.3.3 Maintenance Concepts

The idea of an “optimized” maintenance programme suggests that an adequate mix of maintenance actions and policies needs to be selected and fine-tuned in order to improve uptime, extend the total life cycle of physical assets and assure safe working conditions, while bearing in mind limited maintenance budgets and environmental legislation. This is not straightforward and may require a holistic view. Therefore, a “maintenance concept” for each installation is necessary to plan, control and improve the various maintenance actions and policies applied. In the long term a maintenance concept may even become a philosophy, tenet or attitude towards performing maintenance. In some cases advanced maintenance concepts are almost considered strategies in their own right. What is certain is that maintenance concepts determine the business philosophy concerning maintenance, and that they are needed to manage the complexity of maintenance. In practice, more and more companies are spending time and effort on determining the right maintenance concept.

Maintenance concepts need to be formulated considering the physical characteristics of installations and the context within which they operate. Not


surprisingly, as system complexity increases and maintenance requirements become more demanding, maintenance concepts require different levels of complexity. The literature provides various concepts that have been developed through a combination of theoretical insights and practical experience. Choosing and implementing the best concept in a given context is hard. To the question “what concept is best for us?” no short and straightforward answer exists. The right answer is determined by the context, with its complex interaction of technology, business, organization, and so forth. Designing and implementing a good concept takes time and effort. Many companies establish teams with members from different areas (engineering, production, maintenance, ...) to accomplish this difficult task. Many consultants on the market offer their services to assist in this process. This outside help may be very useful to get started and to obtain a better insight into the company’s own situation. However, many consultants have “their” concept (e.g. RCM) which they are used to implementing, and this may bias their judgment on what concept is “right”. Some outside guidance can therefore be useful, but a concept that fits all the company’s needs should be built by in-house people, using all the knowledge available.

Several times in this chapter it has been suggested that, alongside increasing system complexity, maintenance has also evolved over time. This has led to three generations of maintenance concepts with their respective transition points. The following paragraphs offer an overview, which is also summarized in Table 2.2.

In the past, equipment was generally much simpler; hence the need for maintenance decision support was moderate. For truly simple systems, even a single maintenance policy may be considered a concept in its own right. This is considered the simplest form, the “first generation”, of maintenance concepts. Here, only one maintenance policy or even one type of action was applied to certain equipment. For a state-of-the-art review of this type of maintenance concept see Wang (2002).

With the advent of automation, installations became highly mechanized, equipment became more complex, and the interdependencies within multi-unit systems could no longer be ignored. To maintain such installations efficiently, a specific mixture of maintenance policies and actions was required, and the need for decision structures became crucial. These circumstances first prompted the concept of simple quick and dirty (Q&D) decision diagrams. Q&D charts help to select adequate maintenance policies by asking a series of structured but simple questions to which only ‘yes’ or ‘no’ answers can be given. Even though Q&D charts lack the holistic view required for well-conceived and sophisticated maintenance concepts, they are still widely used in practice in specific situations thanks to their simplicity. Examples are reported in Pintelon et al. (2000) and Waeyenbergh and Pintelon (2002).

Eventually, as the complexity of maintenance decisions increased, more advanced maintenance concepts were called for. As a result, in the last 40 years a vast range of maintenance concepts has been extensively documented in the literature. This group of concepts is considered the “second generation” of maintenance concepts and provides a pool of knowledge for maintenance practitioners and researchers. Typical examples, and perhaps the most important ones, are total productive


maintenance (TPM), reliability-centred maintenance (RCM) and life cycle costing (LCC) approaches.

Table 2.2. Description of the maintenance concepts generations

Generation | Concept | Description | Main strengths | Main weaknesses
1st | Ad hoc | Implementing FBM and UBM policies; rarely CBM, DOM, OBM | Simple | Ad hoc decisions
1st → 2nd | Q&D | Easy-to-use decision chart. It helps to decide on the “right” maintenance policy | Consistent; allows for priorities | Rough questions and answers
2nd | LCC | Detailed cost breakdown over the equipment’s lifetime, helping to plan the maintenance logistics | Sound basic philosophy | Resource and data intensive
2nd | TPM | Approach with an overall view on maintenance and production. Especially successful in the manufacturing industry | Considers human/technical aspects; fits in kaizen approach; extensive tool box | Time-consuming implementation
2nd | RCM | Structured approach focused on reliability. Initially developed for high-tech/high-risk environments | Powerful approach; step-by-step procedure | Resource intensive
2nd → 3rd | RCM-based | Approaches focused on remediating some of the perceived RCM shortcomings. Examples: streamlined RCM, BCM, RBCM | Improved performance through e.g. use of sound statistical analysis | Sometimes an oversimplification
3rd | Customized | In-house developed; cherry-picking from existing concepts. Examples: CIBOCOF, VDM | Exploiting the company’s strengths and considering the specific business context | Ensuring consistency and quality in the concept developed

All these concepts, like many others, have several advantages and suffer from specific shortcomings. Accordingly, new maintenance concepts are developed, old ones are updated and methodologies for designing customized maintenance concepts are created. These concepts enjoy a lot of interest in their original form and also give rise to many derived concepts, for example streamlined RCM derived from RCM.

One may consider that customized maintenance concepts constitute the “third generation” of this evolution. They have emerged mainly because it is very difficult to claim a “one size fits all” concept in the complex and constantly changing world of maintenance. They are inspired by the former concepts while trying to avoid previously experienced drawbacks. One way or another, customized maintenance concepts mainly consist of a “cherry picking” of useful techniques and ideas applied in other maintenance concepts. This important but relatively new type of concept is expected to grow in importance both in practice and among academics. Concepts that belong to this generation are, for example, value driven maintenance (VDM) and CIBOCOF, which was developed at the Centre of


Industrial Management (CIB), K.U. Leuven, Belgium. Additionally, in-house maintenance concepts, mostly developed in organizations with a fairly high maintenance maturity, also belong to this category. One example is a petrochemical company that developed a customised concept which basically followed the RCM logic. However, by extending the RCM analysis steps and introducing risk-based inspections (RBI), a more focused and better-conceived maintenance plan could be developed. Moreover, the company borrowed some elements from TPM and incorporated these in its maintenance concept. For example, multi-skilled training programmes were implemented and special tool kits were designed for a number of maintenance jobs using TPM principles.

Even before the third generation of maintenance concepts emerged, such concepts were perceived as necessary. In the literature, an intermediate step bridging the second and third generations is recognized, in which concepts such as business-centred maintenance (BCM) and risk-based centred maintenance (RBCM) were developed. These concepts are essentially RCM-related and are still widely applied in many organizations. However, a slow but steady movement towards more customized maintenance concepts is expected in the near future as the maintenance function matures. Next, a straightforward description of the most important concepts is presented, and important references are provided for the interested reader.

2.3.3.1 Quick & Dirty Decision Charts (Q&D)

A Q&D decision chart is a decision diagram with questions on several aspects, including failure patterns, repair behaviour of the equipment, business context, maintenance capabilities, cost structure, etc. By answering the questions for a given installation, the user proceeds through the branches of the diagram. The process stops with the recommendation of the most appropriate policy for the specific installation.
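The mechanics of such a chart can be sketched in a few lines of code. The chart below is purely hypothetical: the questions, their order and the fallback recommendation are illustrative assumptions loosely based on the policy discussion in Section 2.3.2, not a chart from the literature cited in this chapter. A company would substitute its own questions and sequence.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    """Yes/no answers for one installation (all question names are hypothetical)."""
    pm_cheaper_than_cm: bool             # is a preventive action cheaper than a repair?
    safety_or_bottleneck_critical: bool  # safety hazard or bottleneck installation?
    measurable_condition: bool           # is there a feasible, measurable failure indicator?
    increasing_failure_rate: bool        # wear-out (IFR) rather than random (CFR) failures?
    opportunity_available: bool          # can it be serviced when a more critical part is?


def recommend_policy(a: Answers) -> str:
    """Walk through the illustrative chart and return the first policy that fits."""
    # Cheap, non-critical breakdowns: simply run to failure.
    if not a.pm_cheaper_than_cm and not a.safety_or_bottleneck_critical:
        return "FBM"
    # A measurable condition allows maintenance to be triggered by that condition.
    if a.measurable_condition:
        return "CBM"
    # Age- or use-related wear-out makes fixed intervals worthwhile.
    if a.increasing_failure_rate:
        return "TBM/UBM"
    # Otherwise, piggyback on interventions for other components if possible.
    if a.opportunity_available:
        return "OBM"
    # Nothing fits: consider changing the design itself (assumed fallback).
    return "DOM"


if __name__ == "__main__":
    pump = Answers(True, True, False, True, False)
    print(recommend_policy(pump))
```

The sketch also makes the drawback of the approach visible: each question admits only a yes or a no, and the recommendation depends entirely on the order in which the questions are asked.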
The Q&D approach allows for a relatively quick determination of the most advantageous maintenance policy and ensures consistent decision making for all installations. Although some Q&D decision charts are available from the literature (e.g. Pintelon et al. 2000), most companies adopting this approach prefer to draw up their own charts, which incorporate their experience and knowledge in the decision process. This can be done in several ways, for instance by defining specific questions, adding or deleting maintenance policies, establishing the preferred sequence in which the different policies should be considered, etc. The approach, however, has the drawback of being rough (dirty). The questions are usually put in a basic yes/no format, which limits the possible answers. Moreover, the questions are usually answered on a subjective basis; for example, the question whether a given action or policy is feasible is answered based on experience rather than on a sound feasibility study.

2.3.3.2 Life Cycle Costing (LCC) Approaches

LCC originated in the late 1960s and is now experiencing a revival. The basic principle of LCC is sometimes summarised as “it is unwise to pay too much, but it is foolish to spend too little”. This refers to the two main underlying ideas of LCC. The first concerns the cost iceberg structure presented by Blanchard (1992) by whom LCC


was revived. Mainly he proposes that when considering maintenance or equipment purchasing alternatives, one should not be limited to what momentarily can be seen: “the top of the iceberg”, such as direct maintenance costs (material, labour, etc.) or the purchase price. The indirectly relevant long run cost such as operational expenses, trainning cost, spares inventory costs, etc. are at least of the same order of magnitude. The second refers to the principle that the further one gets in the design or construction cycle of equipment, the more costly it will be to make modifications (e.g. DOM). Maintenance should be taken into account from the very first moment of designing a machine or system. LCC is a methodology for calculating or estimating the total cost of a system during the entire course of its life. This LCC approach implies a synthesis of costing analysis and engineering design principles that must satisfy life cycle requirements at minimum cost. In turn, design decisions are based on total cost of ownership (TCO) principles. In the literature, several LCC approaches can be distinguished. Among the more important ones are Terotechnology, Integrated Logistic Support/Logistics Support Analysis (ILS/LSA) and Capital asset management. During the 1970s, the Terotechnology concept originated in the UK and was the first formal attempt towards LCC (Parkes 1970). It describes a total view of maintenance management that combines management, technology, logistical support and financial control for industrial systems. Terotechnology is concerned with the specification and design for reliability and maintainability of physical assets. The application of Terotechnology also takes into account the processes of installation, commissioning, operation, maintenance, modification and replacement. Decisions are influenced by feedback of information on design, performance and cost, throughout the life cycle of a project. 
Although generally accepted as very useful, it was not until fairly recently that terotechnology or similar LCC approaches were adopted by large-scale industry, largely thanks to developments in ICT that made LCC easier to apply.

In the 1980s a different LCC approach, integrated logistic support/logistics support analysis (ILS/LSA), originated in military logistics support. Maintenance is regarded as an important issue within integral logistical support. ILS comprises the spectrum of all activities related to the logistical support of a system during its entire life cycle. These logistical support activities include maintenance concept development, spare parts provisioning, technical information, the maintenance crew, training programs, etc. The goal of ILS may be summarized as achieving minimum life cycle costs. LSA, in turn, is an iterative analytical process to identify and evaluate the logistic support for a new system. LSA constitutes the integration and application of various techniques and methods to ensure that supportability requirements are considered in the system design process.

Finally, capital asset management, an LCC approach with a real concern for the financial performance of assets, was developed. Capital asset management provides information to make the financial and operational decisions that optimize equipment performance, from deployment through operations, maintenance and retirement. The key focus is not technical, but financial. Asset management aims at maximizing the return on investment (ROI) in capital assets so that they last longer, perform better and cost less to maintain.
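The cost iceberg idea can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions: the cost categories follow the text (purchase price, direct maintenance, operational expenses, spares, training), but the class name, the function name, the discount rate and all figures are hypothetical and not taken from any of the LCC approaches cited.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    """One purchasing alternative; all amounts are hypothetical."""
    name: str
    purchase_price: float      # the visible "top of the iceberg"
    annual_maintenance: float  # direct maintenance cost per year
    annual_operation: float    # operational expenses per year
    annual_spares: float       # spares inventory cost per year
    training_once: float       # one-off training cost at start-up


def life_cycle_cost(alt: Alternative, years: int, discount_rate: float) -> float:
    """Present value of all costs over the assumed life of the equipment."""
    recurring = alt.annual_maintenance + alt.annual_operation + alt.annual_spares
    # Discount each year's recurring cost back to the purchase date.
    pv_recurring = sum(recurring / (1 + discount_rate) ** t
                       for t in range(1, years + 1))
    return alt.purchase_price + alt.training_once + pv_recurring


cheap = Alternative("A (low purchase price)", 100_000, 12_000, 20_000, 4_000, 5_000)
robust = Alternative("B (high purchase price)", 140_000, 6_000, 16_000, 2_000, 5_000)

for alt in (cheap, robust):
    total = life_cycle_cost(alt, years=10, discount_rate=0.08)
    print(f"{alt.name}: life cycle cost = {total:,.0f}")
```

With these assumed figures the alternative with the higher purchase price has the lower life cycle cost, which is the situation the iceberg picture warns about; other figures can of course reverse the ranking.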

Maintenance: An Evolutionary Perspective


2.3.3.3 Total Productive Maintenance (TPM)

TPM (Takahashi and Takeshi 1990) is much more than just a concept; it is even considered a maintenance philosophy, which derives the greater part of its substance from a variety of non-Japanese management structures and practices that were adapted by the Japanese to fit their culture. TPM involves total participation, at all levels of the organization. It aims at maximizing equipment effectiveness and establishing a thorough system of preventive maintenance. TPM fits entirely with the TQM philosophy and the JIT approach. The latter makes sure that problems of various kinds (material related, breakdown related, training related, ...) are tackled and solved one by one, instead of being camouflaged by large buffer stocks, as was the case with MRP approaches.

The TPM toolbox consists of various techniques, some of which are universal, such as Six Sigma, Pareto or ABC analysis, Ishikawa or fishbone diagrams, etc. Other concepts and techniques such as SMED, poka-yoke, jidoka, OEE and the 5S are specific to the TPM philosophy. The last two are of extreme importance and deserve further explanation. The overall equipment effectiveness (OEE) is a powerful tool to measure the effective use of production capacity. The strength of the concept is the integration of production, maintenance and quality issues into what is called the “six big losses” of useful capacity. Figure 2.5 illustrates this concept. The 5S, on the other hand, form one of the basic principles of TPM: Seiri (sorting out), Seiton (systematic arrangement), Seiso (spic and span), Seiketsu (standardizing) and Shitsuke (self-discipline).

[Figure 2.5: time levels, from longest to shortest: total time, loading time, operating time, net operating time, valuable operating time. Planning losses (planning delays, planned maintenance) lie between total time and loading time. The six big losses are grouped as downtime losses (failures; set-up and adjustment), loss of speed (stoppages; reduced speed) and quality losses (process defects; reduced yields).]

Figure 2.5. The “six big losses” of overall equipment effectiveness
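The time levels of Figure 2.5 translate directly into the usual OEE calculation, in which OEE is the product of an availability, a performance and a quality rate. The following is a minimal sketch, assuming that each loss group is expressed as lost time per shift; the function name and the figures in the example are hypothetical.

```python
def oee(loading_time, downtime_losses, speed_losses, quality_losses):
    """Return availability, performance, quality and OEE as fractions.

    All arguments are durations in the same unit (e.g. minutes per shift).
    """
    operating_time = loading_time - downtime_losses
    net_operating_time = operating_time - speed_losses
    valuable_operating_time = net_operating_time - quality_losses
    if loading_time <= 0 or operating_time <= 0 or net_operating_time <= 0:
        raise ValueError("losses must leave positive operating time")

    availability = operating_time / loading_time
    performance = net_operating_time / operating_time
    quality = valuable_operating_time / net_operating_time
    return {
        "availability": availability,
        "performance": performance,
        "quality": quality,
        # Equals valuable operating time divided by loading time.
        "oee": availability * performance * quality,
    }


# Hypothetical shift: 480 min loading time, 60 min failures and set-ups,
# 42 min lost to stoppages and reduced speed, 18 min lost to defects.
result = oee(loading_time=480, downtime_losses=60, speed_losses=42, quality_losses=18)
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

Because the three rates share intermediate terms, the product reduces to valuable operating time over loading time, so the separate rates mainly serve to show where the capacity is lost.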

2.3.3.4 Reliability Centered Maintenance (RCM)

RCM originated in the 1960s in the North American aviation industry. It was later adopted by military aviation, and afterwards it was implemented only at high-risk industrial plants such as nuclear power plants. Now it can be found in industry


at large. Well known are the books by Nowlan and Heap (1978), Anderson and Neri (1990) and Moubray (1997), who contributed to the adoption of RCM by industry. Note that today many versions of RCM are around, streamlined RCM being one of the more popular ones. However, the Society of Automotive Engineers (SAE) holds the RCM definition that is generally accepted. SAE puts forward the following seven basic questions to be answered by any RCM implementation; if any of these is omitted, the method is incorrectly being referred to as RCM. To answer these questions a clear step-by-step procedure exists, and decision charts and forms are available:

• What are the functions and associated performance standards of the asset in its present operating context?
• How can it fail to fulfil its functions? (functional failures)
• What causes each failure? (failure modes)
• What happens when each failure occurs? (failure effects)
• In what way does each failure matter? (failure consequences)
• What should be done to predict or prevent each failure? (proactive tasks and task intervals)
• What should be done if a suitable proactive task cannot be found? (default actions)

RCM is undeniably a valuable maintenance concept. It takes into account system functionality, and not just the equipment itself. The focus is on reliability. Safety and environmental integrity are considered to be more important than cost. Applying RCM helps to increase the asset’s lifetime and to establish more efficient and effective maintenance. Its structured approach fits in the knowledge management philosophy: reduced human error, more and better historical data and analysis, exploitation of expert knowledge and so forth. RCM is popular and many RCM implementations have started during the last decade. Although RCM offers many benefits, there are also drawbacks. From the conceptual point of view there are some weak points.
For instance, the original RCM does not offer a task packaging feature and thus does not automatically deliver a workable maintenance plan, and the standard decision charts and forms offered are helpful but far from perfect. A serious remark, mainly from the academic side, concerns the scientific basis of RCM: the FMEA analysis, which is the heart of the RCM analysis, is often done on a rather ad hoc basis. Available statistical data are often insufficient or inaccurate, there is a lack of insight into the equipment degradation process (failure mechanisms), and the physical environment (e.g. a corrosive or dusty environment) is ignored. The balance between valuable experience and equally valuable, objective statistical evidence is often absent. Many companies call in the (expensive) help of consultants to implement RCM; some of these consultants, however, are not capable of offering the help wanted, and this, in combination with the lack of in-house experience with RCM, discredits the methodology. RCM is in fact an on-going process, which often causes reluctance to engage in an RCM project. RCM is undoubtedly a very resource-consuming process, which also makes it difficult to apply RCM to all equipment.
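Since an analysis that omits one of the seven questions does not qualify as RCM, a simple completeness check on each analysis record can be useful. The sketch below is a hypothetical illustration and not the SAE procedure or an existing tool: the record layout and field names are assumptions, and questions 6 and 7 are treated as alternatives, as the wording of the questions suggests.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical field names for questions 1 to 5.
ANALYSIS_FIELDS = (
    "function",             # 1. functions and performance standards
    "functional_failure",   # 2. how can it fail to fulfil its functions?
    "failure_mode",         # 3. what causes each failure?
    "failure_effect",       # 4. what happens when each failure occurs?
    "failure_consequence",  # 5. in what way does each failure matter?
)


@dataclass
class RcmRecord:
    function: str = ""
    functional_failure: str = ""
    failure_mode: str = ""
    failure_effect: str = ""
    failure_consequence: str = ""
    proactive_task: str = ""   # 6. task (and interval) to predict or prevent
    default_action: str = ""   # 7. used when no suitable proactive task exists


def unanswered(record: RcmRecord) -> List[str]:
    """List the questions that are still open for this record."""
    missing = [name for name in ANALYSIS_FIELDS
               if not getattr(record, name).strip()]
    # Either a proactive task or a default action must be recorded.
    if not record.proactive_task.strip() and not record.default_action.strip():
        missing.append("proactive_task or default_action")
    return missing


complete = RcmRecord(
    function="Pump 800 l/min of cooling water",
    functional_failure="Delivers less than 800 l/min",
    failure_mode="Impeller worn",
    failure_effect="Process temperature rises, line stops",
    failure_consequence="Operational (lost production)",
    proactive_task="Measure flow and vibration monthly",
)
incomplete = RcmRecord(failure_mode="Bearing seized",
                       proactive_task="Grease every 500 h")

print(unanswered(complete))
print(unanswered(incomplete))
```

A record such as the second one, which starts from a failure mode and a task without stating the function, is typical of the “skipping” style of streamlining and would be flagged by this check.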


2.3.3.5 RCM-Related Concepts

RCM as such has proven to be a very valuable concept, focussing on reliability and paying attention to safety and environment. Its structured approach ensures asset sustainability. However, there are some drawbacks that should be kept in mind and, if possible, remedied. In the literature one can find many RCM-related concepts such as those of Gits and Coetzee, BCM, RBCM, streamlined RCM, and so forth. All of them adopt RCM principles with the intention of solving some of its shortcomings. This group of concepts constitutes the bridging step to the third generation of maintenance concepts.

Gits (1984) developed an RCM-like maintenance concept. The main difference with the original RCM is that the methodology delivers a workable maintenance plan. The focus of the concept is on technical and organizational aspects, rather than on economic considerations. This three-phase approach establishes the maintenance plan by quantifying and clustering basic maintenance rules. Those rules are harmonised in operational entities that describe exactly what must be done.

Later on, Jones (1995) put forward risk based reliability centred maintenance (RBCM), a new variant of basic RCM. Basically, RBCM can be described as RCM with a strong statistical background. This tackles and eliminates the drawback of the ad hoc FMEA of the traditional RCM approach. Risk based inspections (RBI) are one of the core concepts here. The RBI methodology enables the assessment of the likelihood and potential consequences of pressure equipment failures. RBI provides companies with the opportunity to prioritize equipment inspections and to optimize the inspection methods, frequencies and resources. Furthermore, RBI helps to develop specific equipment inspection plans and enables the implementation of RCM as such. This results in improved safety, lower failure risks, fewer forced shutdowns and reduced operational costs.
The risk-based approach requires a systematic and integrated use of expertise from the different disciplines that affect plant integrity. These include design, materials selection, operating parameters and scenarios, and an understanding of the current and future degradation mechanisms and of the risks involved.

So far, all the RCM-inspired concepts discussed aimed at remedying technical drawbacks of RCM by converting them into workable solutions. It was not until Kelly (1997), with his business-centred maintenance (BCM), a full-fledged concept for determining a detailed maintenance plan, that the business as such became the focal point. Kelly emphasised the importance of identifying, mapping and auditing the maintenance function. The BCM concept also pays attention to the necessary administrative support. Kelly calls his approach a BUTD (bottom-up/top-down) approach. The first step is top-down: starting from the business context, the exact objectives for maintenance are outlined, considering all corporate levels. The second step is bottom-up: it aims at establishing a life maintenance plan for every item of equipment. In a third and last step, all item life plans are fitted into a maintenance strategy. Applying BCM thus results in a detailed maintenance schedule, ready for use.

RCM implementation is complex, time consuming and not straightforward. Hence, it should be implemented in a controlled fashion with the total support of all levels of the organization. Coetzee (2002) mentions that RCM is a core methodology to ensure that the organization can achieve world-class results. However, to


achieve this objective the traditional RCM should be enhanced. Coetzee proposes a “new” RCM that blends concepts and related techniques from different RCM authors. He also puts forward some innovations, like the funnelling approach, to ensure that RCM efforts are concentrated on the most important failure modes in the organization.

Finally, there is a vast range of so-called “streamlined RCM” concepts. These concepts claim to be derivations of RCM. It is mainly consultants who promote streamlined RCM as the solution for the resource-consuming character of RCM. Although streamlining sounds attractive, it should be applied carefully in order to keep the RCM benefits. Different streamlining approaches exist; however, very few are acceptable as formal RCM methodologies. Based on Pintelon and Van Puyvelde (2006), Table 2.3 provides a picture of popular streamlined RCM approaches.

Table 2.3. Classification of streamlined RCM concepts

Example: Retro-active approach
Characteristics: Starts from the existing maintenance plan. Determines the failure mode for all maintenance tasks and implements the last RCM steps for these.
Pitfalls: Quite time-consuming to find the failure modes for all tasks. “Functions” are detected on an ad hoc basis. It implies that the existing maintenance plan is good.

Example: Generic approach
Characteristics: Uses generic lists of failure modes, or even generic analyses of technical systems.
Pitfalls: Ignores the operational context of the technical systems and the current maintenance practices. It assumes a standard level of analysis detail for all systems.

Example: Skipping approach
Characteristics: Omits one or more steps. Typically, the first step (functions) is skipped and the analysis starts with listing the failure modes.
Pitfalls: Omits the first and essential step of RCM, i.e. the functional analysis, and as such does not allow for sound performance standard setting.

Example: Criticality approach
Characteristics: Limits the implementation to critical functions and/or failures; for these a full RCM analysis is performed.
Pitfalls: Often determines criticality on an ad hoc basis or uses criticality tools which are less reliable than the RCM approach.

Example: Troublemaker approach
Characteristics: Carries out a full RCM analysis for critical equipment only. Critical equipment is defined here as bottleneck equipment which had a lot of maintenance problems in the past or is critical in terms of safety hazards.
Pitfalls: Same as above, although here all RCM steps are followed, which guarantees a complete “picture”.
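The risk-based prioritisation mentioned in this section for RBI (likelihood and potential consequences of failure) and the criticality-driven streamlining of Table 2.3 can be made concrete with a simple ranking. The sketch below rests on assumptions that are not taken from the cited sources: risk is scored as likelihood times consequence on hypothetical 1 to 5 scales, and the equipment tags and scores are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Item:
    tag: str
    likelihood: int   # 1 (rare) .. 5 (frequent), hypothetical scale
    consequence: int  # 1 (minor) .. 5 (severe), hypothetical scale

    @property
    def risk(self) -> int:
        # Assumed scoring rule: risk = likelihood x consequence.
        return self.likelihood * self.consequence


def prioritise(items: List[Item], n_detailed: int) -> Tuple[List[Item], List[Item]]:
    """Split items into those given a detailed analysis and the rest."""
    ranked = sorted(items, key=lambda item: item.risk, reverse=True)
    return ranked[:n_detailed], ranked[n_detailed:]


equipment = [
    Item("V-101", likelihood=2, consequence=5),
    Item("P-201", likelihood=4, consequence=2),
    Item("E-301", likelihood=3, consequence=4),
    Item("T-401", likelihood=1, consequence=3),
    Item("C-501", likelihood=5, consequence=1),
]

detailed, screened = prioritise(equipment, n_detailed=2)
print("detailed analysis:", [item.tag for item in detailed])
print("screening only:", [item.tag for item in screened])
```

The pitfall noted in Table 2.3 applies here as well: if the likelihood and consequence scores are set ad hoc, the ranking inherits that weakness.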

2.3.3.6 Customized Maintenance Concepts

The value driven maintenance (VDM) methodology proposed by Haarman and Delahay (2004) builds a bridge between traditional maintenance philosophies and shareholder value. Not only does VDM simplify the boardroom discussion, it also shows that, far from being a cost center, maintenance is actually a major source of economic value within the overall business performance. It is built on established


best maintenance practices and concepts such as TPM, RCM and RBI. It shows where the added value of maintenance lies and how an organisation can best be structured to realise this value. One of the main contributions of VDM is that it offers a common language for management and maintenance to discuss maintenance matters. VDM identifies four value drivers in maintenance and provides concepts to manage by those drivers. For all four value drivers, maintenance can help to increase a company’s economic value. VDM makes a link between value drivers and core competences. For each of the core competences, some managerial concepts are provided.

Most recently, Waeyenbergh (2005) presented CIBOCOF as a framework to develop customised maintenance concepts. CIBOCOF starts out from the idea that, although all maintenance concepts available from the literature contain interesting ideas, none of them is suitable for implementation without further customization. Companies have their own priorities in implementing a maintenance concept and are likely to go for “cherry picking” from existing concepts. CIBOCOF offers a framework to do this in an integrated and structured way. Figure 2.6 illustrates the steps that this concept goes through. A particularly interesting step is step 5, maintenance policy optimization, where a decision chart is offered to determine which mathematical decision model can be used to optimize the chosen policy (step 4). This decision chart guides the user through the vast literature on the topic.

[Figure 2.6 elements: Maintenance Plan; M1 Start-up; M2 Technical analysis; M3 Policy decision making; M4 Implementation & Evaluation; M5 Continuous improvement.]

Figure 2.6. CIBOCOF logic

2.4 Maintenance Manager

As maintenance management evolved, so did the job of the maintenance manager. Clearly maintenance management is no longer a purely technical function. Business economics (cost-benefit considerations) and business context (how important are the installations in question? what are the functional requirements? …) play an important role. A good maintenance manager needs to have a technical background in order to have an eye for the “big picture” and not lose sight of any aspect.


Nowadays, the decisions expected from the maintenance manager are complex and can sometimes have far-reaching consequences. He/she is (partly) responsible for the operational, tactical and strategic aspects of the company’s maintenance management. This involves the final responsibility for operational decisions, like the planning of maintenance jobs, and for tactical decisions concerning the long-term maintenance policy to be adopted. More recently, maintenance managers are also consulted on strategic decisions, e.g. purchases of new installations, design choices, personnel policy, … The career path of today’s maintenance manager starts out with rather technical content, but evolves over time towards more financial and strategic responsibilities. This career path can be horizontal or vertical.

It is also important that the maintenance manager is a good communicator and people manager, as maintenance remains a labor-intensive function. The maintenance manager needs to be able to attract and retain highly skilled technicians. On-going training for technicians is needed to keep up with rapidly evolving technology. Motivation of maintenance technicians often requires special attention: job autonomy is greater in maintenance than in production, instructions may be vague, immediate assessment of the quality of work is mostly not possible, complaints are heard more often than compliments, etc. Aspects like safety and ergonomics are an indispensable element of current maintenance management.

Besides people, materials are another important resource for maintenance work. Maintenance material logistics mainly concerns spare parts management and finding the optimum trade-off between high spare parts availability and the corresponding stock investments.
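The trade-off between spare parts availability and stock investment can be illustrated with a small calculation. The sketch below is one possible simplification and not a method prescribed in this chapter: it assumes that demand for a part during the replenishment lead time is Poisson distributed and looks for the smallest stock level that covers lead-time demand with a target probability. The mean demand, unit price and targets are hypothetical.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(demand <= k) for Poisson-distributed demand with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))


def min_stock_level(mean_demand: float, target: float, max_level: int = 1000) -> int:
    """Smallest stock level S with P(lead-time demand <= S) >= target."""
    for level in range(max_level + 1):
        if poisson_cdf(level, mean_demand) >= target:
            return level
    raise ValueError("target not reachable within max_level")


# Hypothetical part: on average 2 units demanded per lead time, 1,500 per unit.
unit_price = 1_500
for target in (0.90, 0.95, 0.99):
    level = min_stock_level(mean_demand=2.0, target=target)
    print(f"target {target:.0%}: stock {level} units, "
          f"investment {level * unit_price:,}")
```

Under these assumptions each step up in the availability target requires additional units in stock, so the investment grows with the target and the manager has to decide which level of availability is worth its cost.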
The evolution in maintenance management described above creates a sharp need for decision support techniques of various kinds: statistical analysis tools for predicting the failure behaviour of equipment, decision schemes for determining the right maintenance concept, mathematical models to optimize the maintenance policy parameters (e.g. PM frequency), decision criteria concerning e-maintenance, decision aids for outsourcing decisions, etc. Table 2.4 illustrates the use of some decision support techniques for maintenance management. These techniques are available and have proven their usefulness for maintenance, but they are not yet widely adopted.

In the 1960s most maintenance publications were very mathematically oriented and mainly focussed on reliability. Publications in the 1970s and early 1980s focussed more on maintenance policy optimization, such as determination of the optimum preventive maintenance interval, planning of group replacements and inspection modelling. This was a step forward, although these models were still often too focussed on mathematical tractability rather than on realistic assumptions and hypotheses. This caused an unfortunate gap between academics and practitioners. The former had the impression that industry and the service sector were not “ready” for their work, while the latter felt frustrated because the models were too theoretical. Fortunately, this is changing. Academics pay more attention to the real-life background of their subject and practitioners discover the usefulness of the academic work. Moreover, academic work is getting broader and offers a more diverse range of models and concepts, such as maintenance strategy design models, e-maintenance concepts, service parts supply policies and the like, besides the more traditional maintenance optimization models. With the introduction of maintenance software, the data required for these models could be collected more easily. There still is a big gap between practitioners and academics, but it is already slowly closing.

Table 2.4. OR/OM techniques and their application in maintenance

Techniques               Application examples in maintenance management
Statistics               Describing failure behaviour
Reliability theory       Reliability prediction of complex systems
Markov theory            Availability studies of repairable systems
Renewal theory           Replacement decisions (group or individual)
Math programming         Maintenance policy parameter optimization
Decision theory          Decisions under uncertainty
Queueing theory          Trade-off personnel capacity - service level
Simulation               Comparison of alternative maintenance policies
Inventory control        MRO management: FMI, NMI, SMI and VSMI
Time and motion study    Estimation of maintenance intervention times
Scheduling – rostering   Daily planning of maintenance jobs
Project planning         Planning of turnaround, large renovation projects
MCDM                     Selecting the best outsourcing partner

MRO = maintenance, repair and operating supplies, FMI = fast moving items, NMI = normal moving items, SMI = slow moving items, VSMI = very slow moving items, MCDM = multi-criteria decision making, OR/OM = Operations Research / Operations Management
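As an illustration of the “maintenance policy parameter optimization” entry in Table 2.4, the sketch below searches for a preventive replacement age that minimises the long-run cost per unit time. It uses the classical age replacement model, which is chosen here only as an example and is not singled out in the text: an item is replaced preventively at age T at cost cost_pm, or at failure at the higher cost cost_failure, and lifetimes are assumed to follow a Weibull distribution. All parameter values are hypothetical.

```python
import math


def weibull_reliability(t, beta, eta):
    """Probability that an item survives beyond age t."""
    return math.exp(-((t / eta) ** beta))


def cost_rate(T, beta, eta, cost_pm, cost_failure, steps=2000):
    """Expected cost per unit time when replacing at age T or at failure."""
    h = T / steps
    # Trapezoidal rule for the expected cycle length: integral of R(t) on [0, T].
    area = 0.5 * (1.0 + weibull_reliability(T, beta, eta))
    area += sum(weibull_reliability(i * h, beta, eta) for i in range(1, steps))
    expected_cycle = area * h
    r_T = weibull_reliability(T, beta, eta)
    expected_cost = cost_pm * r_T + cost_failure * (1.0 - r_T)
    return expected_cost / expected_cycle


def best_interval(beta, eta, cost_pm, cost_failure, candidates):
    """Candidate age with the lowest cost rate (simple grid search)."""
    return min(candidates,
               key=lambda T: cost_rate(T, beta, eta, cost_pm, cost_failure))


# Hypothetical item: wear-out behaviour (beta > 1), failure ten times as costly.
params = dict(beta=2.5, eta=1000.0, cost_pm=1.0, cost_failure=10.0)
T_star = best_interval(candidates=range(50, 3001, 50), **params)
run_to_failure = params["cost_failure"] / (
    params["eta"] * math.gamma(1 + 1 / params["beta"]))

print("best replacement age on the grid:", T_star)
print("cost rate at that age:", cost_rate(T_star, **params))
print("cost rate when running to failure:", run_to_failure)
```

The comparison with running to failure shows why such models only pay off when the failure rate increases with age and a failure is markedly more expensive than a planned replacement; with a constant failure rate, preventive replacement brings no benefit in this model.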

The help from information technology (IT) is of special interest when discussing decision support for maintenance managers. Computerized maintenance management systems (CMMS), also called computer aided maintenance management (CAMM), maintenance management information systems (MMIS) or even enterprise asset management systems (EAM), nowadays offer substantial support for the maintenance manager. These systems too have evolved over time (Table 2.5). IT of course also supports the e-maintenance applications and offers splendid opportunities for knowledge management implementations. At the beginning of the knowledge management hype, knowledge management was mainly aimed at fields like R&D, innovation management, etc. Later on the potential benefits of knowledge management were also recognized for most business functions. For maintenance management, a knowledge management programme helps to capture the implicit knowledge and expertise of maintenance workers and secure this information in information systems, so making it accessible for other technicians. The benefits of this in terms of consistency in problem solving approach and knowledge retention are obvious. Other knowledge management applications can be, for example, expert systems, assisting in the diagnosis of complex equipment


failures, or data mining on maintenance history records to learn about failure causes. A knowledge management programme will also help to keep track of individual skills and expertise and as such support personnel management over time.

Table 2.5. Evolution of CMMS

Business IT systems: 1970s
CMMS: 1st generation
Characteristics: Mainly registration and data administration (EDP). Limited or no process support. Low priority mainframe applications. Limited software market, a lot of in-house development.

Business IT systems: 1980s–1990s
CMMS: 2nd generation
Characteristics: Cost control and work order management; MRO management most often included, ... Link with company’s financial information module. First MIS for maintenance. Many stand-alone microcomputer applications. Dynamic, but not always reliable, software market.

Business IT systems: 1990s ...
CMMS: 3rd generation
Characteristics: Broader, e.g. also asset utilization and EHS modules. External communication possible, e.g. e-MRO. Enhanced analytical capabilities. Multimedia and web enabled features. Matured market for embedded (part of e.g. ERP) or BoB.

Clearly, the evolution in maintenance management offers a challenging job environment for today’s maintenance manager. The maintenance manager needs to be aware of “the big picture”, i.e. the business context and the maintenance organization as a whole. Moreover, he/she needs to have a sound technological background and be prepared to keep informed of technological evolutions. The maintenance manager needs real management skills to manage the resources (personnel and materials) in an efficient and effective way, while keeping asset utilization and asset life cycles in mind. Growing in the function of maintenance manager will also mean acquiring new skills, e.g. in financial management. Last but not least, today’s maintenance manager needs to be flexible in order to face threats and grab opportunities in today’s dynamic business environment, where increasing globalisation, many mergers and acquisitions, growing outsourcing markets and emerging e-maintenance technologies are part of daily life.

2.5 Conclusions and New Challenges of Maintenance

Maintenance management undoubtedly has undergone major changes during the past decade. It has moved from being a low-profile, necessary but difficult-to-manage problem to being regarded as a prominent business function and an important element in business strategy. Not only practitioners have changed their minds about maintenance; academics have as well. Maintenance nowadays is a professional business


function and an area of intensive academic research. Efforts are aimed at advancing towards world class maintenance and providing methodologies to do so. Pintelon et al. (2006) describe several maintenance maturity levels required to achieve world class maintenance; these are illustrated in Figure 2.7.

Figure 2.7. Maturity levels of maintenance

Maintenance concept optimization has professionalized. Corrective and precautionary actions are combined in different policies, from reactive to preventive and from predictive to proactive policies. A sound insight into the pros and cons of each of these policies is available in practice, and research supports the selection and optimization of these policies. These policies are no longer ad hoc, loose elements within maintenance management; they are embedded in maintenance concepts focussing on reliability and productivity. These concepts ensure consistent decision making for all equipment and at the same time allow for individualized installation maintenance concepts. Decision tools are available to support this process.

Top management nowadays, at least in most companies, recognizes the importance of maintenance as an element of their business strategy. Expectations for maintenance are no longer formulated as “keep things running”, but are based upon the overall business strategy. This strategy can be based on flexibility, quality and low cost. The maintenance organization, with its structural and infrastructural elements, is built accordingly.

The previous paragraph may give the impression that all problems for maintenance management are already solved; this, however, is not the case. New opportunities in terms of, for example, outsourcing and e-maintenance exist. Moreover, there is a threatening gap between the top management level, where the overall maintenance strategy is determined, and the tactical level, on which the maintenance concepts are designed, detailed and implemented (Figure 2.8). More precisely, the gap lies in the alignment between the tactical and subsequent operational phases on the one hand and the strategic phase on the other. While both aspects are well studied, the link between the two is often not well established. This leads to disappointment among top management as well as frustration among maintenance managers.
Research shows a similar gap. There is some — though


still not enough — research on the link between maintenance and business strategy. The main focus of maintenance management research is still on the tactical and operational planning. Links between the former and the latter part of research however are still very rare. Closing this gap by linking maintenance and business throughout all decision levels is one of the major challenges for the future; every step taken brings us closer to real world-class maintenance.

Figure 2.8. Gap between maintenance and business strategy

2.6 List of Abbreviations

BCM: Business-centred maintenance
BoB: Best-of-breed
BUTD: Bottom-up/top-down analysis
CAMM: Computer aided maintenance management
CBM: Condition-based maintenance
CFR: Constant failure rate
CIBOCOF: Center Industrieel Beleid Onderhoudsontwikkelingsframework
CM: Corrective maintenance
CMMS: Computerized maintenance management systems
DOM: Design-out of Maintenance
DSS: Decision support systems
EAM: Enterprise asset management (system)
EHS: Energy, health and safety
EDP: Electronic data processing
EUC: End user computing
FBM: Failure-based maintenance
FMEA: Failure modes and effect analysis
FMI: Fast moving items
FMS: Flexible manufacturing systems
GUI: Graphical user interface
ICT: Information communication technology
IFR: Increasing failure rate
ILS: Integrated logistics support
IT: Information technology
JIT: Just-in-time
LCC: Life-cycle costing
LSA: Logistics support analysis
MCDM: Multi-criteria decision making
MIS: Management information systems
MMIS: Maintenance management information system
MRO: Maintenance, repair and operating supplies
MTBF: Mean-time-between-failures
MTTR: Mean-time-to-repair
NMI: Normal moving items
OBM: Opportunity-based maintenance
OEE: Overall equipment effectiveness
OM: Operations management
OR: Operations research
PM: Precautionary maintenance
Q&D: Quick & dirty decision charts
R&D: Research & development
RBCM: Risk-based reliability-centred maintenance
RBI: Risk-based inspections
RCM: Reliability-centred maintenance
ROI: Return on investment
SAE: Society of Automotive Engineers
SMED: Single minute exchange of dies
SMI: Slow moving items
TBM: Time-based maintenance
TCO: Total cost of ownership
TPM: Total productive maintenance
TQM: Total quality management
UBM: Use-based maintenance
VDM: Value-driven maintenance
VSMI: Very slow moving items
WIP: Work in progress

2.7 References

Anderson, R.T., Neri, L., (1990), Reliability Centred Maintenance: Management and Engineering Methods, Elsevier Applied Sciences, London
Blanchard, B.S., (1992), Logistics Engineering and Management, Prentice Hall, Englewood Cliffs, New Jersey
Cho, I.D., Parlar, M., (1991), A survey on maintenance models for multi-unit systems. European Journal of Operational Research, 51:1–23
Coetzee, J.L., (2002), An Optimized Instrument for Designing a Maintenance Plan: A Sequel to RCM. PhD thesis, University of Pretoria, South Africa
Dekker, R., (1996), Applications of maintenance optimization models: A review and analysis. Reliability Engineering and System Safety, 52(3):229–240
Dekker, R., Scarf, P.A., (1998), On the impact of optimisation models in maintenance decision making: the state of the art. Reliability Engineering and System Safety, 60:111–119
Geraerds, W.M.J., (1972), Towards a Theory of Maintenance. The English University Press, London
Gits, C.W., (1984), On the Maintenance Concept for a Technical System: A Framework for Design, PhD thesis, TU Eindhoven, The Netherlands
Haarman, M., Delahay, G., (2004), Value Driven Maintenance – New Faith in Maintenance, Mainnovation, Dordrecht, The Netherlands
Jones, R.B., (1995), Risk-Based Maintenance, Gulf Professional Publishing (Elsevier), Oxford
Kelly, A., (1997), Maintenance Organizations & Systems: Business-Centred Maintenance, Butterworth-Heinemann, Oxford
McCall, J.J., (1965), Maintenance policies for stochastically failing equipment: A survey. Management Science, 11(5):493–524
Moubray, J., (1997), Reliability-Centred Maintenance. Second Edition. Butterworth-Heinemann, Oxford
Nowlan, F.S., Heap, H.F., (1978), Reliability Centered Maintenance, United Airlines Publications, San Francisco
Parkes, D. in Jardine, A.K.S., (1970), Operational Research in Maintenance, University of Manchester Press, Manchester
Pintelon, L., Gelders, L., Van Puyvelde, F., (2000), Maintenance Management, Acco, Leuven/Amersfoort
Pintelon, L., Gelders, L., (1992), Maintenance management decision making. European Journal of Operational Research, 58:301–317
Pintelon, L., Pinjala, K., Vereecke, A., (2006), Evaluating the Effectiveness of Maintenance Strategies, Journal of Quality in Maintenance Engineering (JQME), 12(1):214–229
Pintelon, L., Van Puyvelde, F., (2006), Maintenance Decision Making, Acco, Leuven, Belgium
Takahashi, Y., Takashi, O., (1990), TPM: Total Productive Maintenance. Asian Productivity Organization, Tokyo
Valdez-Flores, C., Feldman, R.M., (1989), A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics, 36:419–446
Waeyenbergh, G., (2005), CIBOCOF – A Framework for Industrial Maintenance Concept Development, PhD thesis, Centre for Industrial Management – K.U.Leuven, Leuven, Belgium
Waeyenbergh, G., Pintelon, L., (2002), A framework for maintenance concept development. International Journal of Production Economics, 77:299–313
Wang, H., (2002), A survey of maintenance policies of deteriorating systems. European Journal of Operational Research, 139:469–489

3 New Technologies for Maintenance

Jay Lee and Haixia Wang

3.1 Introduction

For years, maintenance has been treated as a dirty, boring and ad hoc job. It is seen as critical for maintaining productivity but has yet to be recognized as a key component of revenue generation. The question most often asked is “Why do we need to maintain things regularly?” The answer is “To keep things as reliable as possible.” However, the question that should be asked is “How much change or degradation has occurred since the last round of maintenance?” The answer to this question is “I don’t know.”

Today, most machine field services depend on sensor-driven management systems that provide alerts, alarms and indicators. By the time the alarm sounds, it is already too late to prevent the failure. Therefore, most machine maintenance today is either purely reactive (fixing or replacing equipment after it fails) or blindly proactive (assuming a certain level of performance degradation, with no input from the machinery itself, and servicing equipment on a routine schedule whether service is actually needed or not). Both scenarios are extremely wasteful.

Rather than reactive “fail-and-fix” maintenance, world-class companies are moving towards “predict-and-prevent” maintenance. A maintenance scheme referred to as condition based maintenance (CBM) was developed by considering current degradation and its evolution. CBM methods and practices have been continuously improved over recent decades; however, CBM is conducted at equipment level, one piece of equipment at a time, and the developed prognostics approaches are application or equipment specific. A holistic approach, real-time prognostics devices, and a rapid implementation environment are potential future research topics in product and system health assessment and prognostics.
With the level of integrated network systems development in today’s global business environment, machines and factories are networked, and information and decisions are synchronized in order to maximize a company’s asset investments. This generates a critical need for a real-time remote machinery prognostics and health management (R2M-PHM) system. The unmet needs in maintenance can be categorized into the following:


1. Machine intelligence: intelligent monitoring, prediction and prevention, and compensation and reconfiguration for sustainability (self-maintenance).
2. Operations intelligence: prioritized, optimized, and responsive maintenance scheduling for reconfiguration needs.
3. Synchronization intelligence: autonomous information flow from market demand to factory asset utilization.

Based on the unmet needs in maintenance, many research and development questions concerning next generation maintenance systems can be raised. Some of them are the following:

1. How to adapt maintenance schedules to cope dynamically with shop-floor reality?
2. How to feed back information and knowledge gathered in maintenance to the designers of the process?
3. How to link maintenance policies to corporate strategy and objectives?
4. How to synchronize production scheduling based on maintenance performance?

The rest of this chapter is organized as follows. Section 3.2 gives a state-of-the-art review of maintenance technologies, which includes a maintenance paradigm overview and CBM prognostics approaches. Section 3.3 presents the newly developed platform of the Watchdog Agent®-based real-time remote machinery prognostics and health management (R2M-PHM) system, the Watchdog Agent® toolbox method for multi-sensor performance assessment and prognostics, and real-life industrial case studies. Section 3.4 summarizes new developments and discusses future work.

3.2 State-of-the-art Reviews on Maintenance Technologies

3.2.1 Maintenance Paradigm Overview

Looking back on the development history of maintenance technologies and forecasting their likely direction, the roadmap to excellence in maintenance can be illustrated as in Figure 3.1.

3.2.1.1 No Maintenance

There are two kinds of situations in which no maintenance will occur:

• No way to fix it: the maintenance technique is not available for a special application, or the maintenance technique is at too early a stage of development.
• Not worth fixing: some machines were designed to be used only once. Compared with the cost of maintenance, it may be more cost-effective simply to discard them.

Neither of the scenarios above is within the scope of the discussion here.

Figure 3.1. The development of maintenance technologies (axis: machine performance and uptime; stages shown: No Maintenance; Reactive Maintenance (Fire Fighting); Preventive Maintenance (Scheduled Maintenance); Predictive Maintenance; Proactive Maintenance (Failure Root Cause Analysis); Self-Maintenance or Maintenance-free Machine)

3.2.1.2 Reactive Maintenance

The aim of reactive maintenance is simply to “fix it after it’s broken”. Most of the time a machine breaks down without warning and it is urgent for the maintenance crew to put it back to work; this is also referred to as “fire-fighting”. This fire-fighting mode of maintenance is still present in many maintenance operations today because accurate knowledge of the equipment behavior is lacking. Essentially, little to no maintenance is conducted and the machinery operates until a failure occurs. At this time, appropriate personnel are contacted to assess the situation and make the repairs as expeditiously as possible. In a situation where the damage to equipment is not a critical factor, plenty of downtime is available, and the values of the assets are not a concern, the fire-fighting mode may prove to be an acceptable option. Of course, one must consider the additional cost of making repairs on an emergency basis, since soliciting bids to obtain reasonable costs may not be feasible in these situations. Due to market competition and environmental/safety issues, the trend is toward adopting an organized and efficient maintenance program as opposed to fire-fighting.

3.2.1.3 Preventive Maintenance

Preventive maintenance (PM) is an equipment maintenance strategy based on replacing, overhauling or remanufacturing an item at fixed or adaptive intervals, regardless of its condition at the time. These maintenance operations models can be characterized as long term maintenance policies (Wang 2002) that do not take into account instantaneous equipment status. Scheduled restoration tasks and scheduled discard tasks are both examples of preventive maintenance tasks. In preventive maintenance, breakdowns are tracked and recorded in a database, and the information accumulated provides a base for general preventive actions.
The age-dependent PM policy can be considered the most common maintenance policy, in which a unit’s PM times are based on the age of the unit. The basic idea is to replace or repair a unit at its age T or at failure, whichever occurs first (Badia et al., 2002; Mijailovic 2003). Commonly used equipment reliability indices such as mean time between failures (MTBF) and mean time to repair (MTTR) are extracted from the historical databases of equipment behavior over time. These two indices provide a rough estimate of the time between two adjacent breakdowns and the mean time needed to restore a system when such breakdowns happen. Although equipment degradation processes vary from case to case, and the causes of failure can be different as well, the information contained in MTBF and MTTR can still be informative. Other indices can also be extracted and used, including the mean lifetime, mean time to first failure, and mean operational life, as discussed by Pham et al. (1997). With the introduction of minimal repair and imperfect maintenance, various extensions and modifications to the age-dependent PM policy have been proposed (Bruns 2002; Chen et al. 2003).

Another preventive maintenance policy that has received much attention is the periodic PM policy, in which degraded machines are repaired or replaced at fixed time intervals independent of the equipment failures. Various modifications and enhancements to this maintenance policy have also been proposed recently (Cavory et al. 2001).

Preventive maintenance schemes are time-based and do not consider the current health state of the product; they are therefore inefficient and less valuable for a customer whose individual asset is of the most concern. For the case of helicopter gearboxes, it was found that almost half of the units were removed for overhaul even though they were in a satisfactory operating condition. Therefore techniques for more economical and reliable maintenance are needed.

3.2.1.4 Predictive Maintenance

Predictive maintenance (PdM) is a right-on-time maintenance strategy. It is based on the failure limit policy, in which maintenance is performed only when the failure rate, or other reliability indices, of a unit reaches a predetermined level.
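The failure limit policy can be made concrete with a small sketch. It assumes, purely for illustration and not as something stated in this chapter, that the unit's lifetime follows a two-parameter Weibull distribution with shape beta > 1 and scale eta, so that the failure rate is h(t) = (beta/eta)(t/eta)^(beta-1). The maintenance age is then the age at which h(t) reaches the chosen limit. All parameter values below are hypothetical.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Failure rate h(t) of a Weibull(beta, eta) lifetime at age t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def age_at_failure_rate_limit(limit: float, beta: float, eta: float) -> float:
    """Age at which h(t) reaches `limit`.

    Solves (beta/eta) * (t/eta)**(beta-1) = limit for t.
    Requires beta > 1, i.e. an increasing (wear-out) failure rate.
    """
    if beta <= 1.0:
        raise ValueError("failure rate is not increasing; no limit age exists")
    return eta * (limit * eta / beta) ** (1.0 / (beta - 1.0))


# Hypothetical unit: shape 2.5, scale 1000 h, limit 0.002 failures per hour.
t_star = age_at_failure_rate_limit(limit=0.002, beta=2.5, eta=1000.0)
print(f"maintain at about {t_star:.1f} h")
```

In practice the failure rate would be estimated from field data rather than assumed, and a different lifetime model would change the closed-form solution used here.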
This maintenance strategy has been implemented as condition based maintenance (CBM) in most production systems, where certain performance indices are periodically (Barbera et al. 1996; Chen and Trivedi 2002) or continuously monitored (Marseguerra et al. 2002). Whenever an index value crosses some predefined threshold, maintenance actions are performed to restore the machine to its original state, or to a state where the changed value is at a satisfactory level in comparison to the threshold. Predictive maintenance can be best described as a process that requires both technology and human skills, while using a combination of all available diagnostic and performance data, maintenance history, operator logs and design data to make timely decisions about maintenance requirements of major/critical equipment. It is this integration of various data, information and processes that leads to the success of a PdM program. It analyzes the trend of measured physical parameters against known engineering limits for the purpose of detecting, analyzing and correcting a problem before a failure occurs. A maintenance plan is devised based on the prediction results derived from condition based monitoring. This method can cost more up front than PM because of the additional monitoring hardware and software investment, cost of manning, tooling, and education that is required to establish a PdM program. However, it provides a basis for failure diagnostics and maintenance operations, and offers increased equipment reliability and a sufficient advance in information to improve planning, thereby reducing unexpected downtime and operating costs.
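A minimal sketch of the threshold logic described above follows, assuming a single periodically sampled performance index that degrades upward and a straight-line trend fitted by least squares. The function names, the linear trend, and the sample data are illustrative assumptions; real PdM programs combine many data sources and richer models.

```python
from statistics import mean


def linear_trend(times, values):
    """Least-squares slope and intercept of values against times."""
    t_bar, v_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_bar - slope * t_bar


def cbm_decision(times, values, threshold):
    """Return ('maintain', 0.0) if the latest reading has reached the
    threshold; otherwise ('monitor', estimated time until the fitted trend
    reaches it). The estimate is None when the trend is flat or improving."""
    if values[-1] >= threshold:
        return "maintain", 0.0
    slope, intercept = linear_trend(times, values)
    if slope <= 0:
        return "monitor", None
    t_cross = (threshold - intercept) / slope
    return "monitor", max(t_cross - times[-1], 0.0)


# Hypothetical vibration index sampled every 10 h, with an alarm limit of 7.0.
hours = [0, 10, 20, 30, 40]
index = [2.0, 2.5, 3.0, 3.5, 4.0]
print(cbm_decision(hours, index, threshold=7.0))
```

The estimated remaining time is what distinguishes a predictive scheme from a plain alarm: it gives planners advance notice instead of a signal that arrives only when the limit is already reached.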


3.2.1.5 Proactive Maintenance

Proactive maintenance (PaM) is a new maintenance concept that is emerging along with the development of business globalization. It encompasses any tasks that seek to realize the seamless integration of diagnosis and prognosis information and maintenance decision making via a wireless internet or satellite communication network. Machine health information should represent a trend, not just a status, so that a company’s productivity can be focused on asset-level utilization, not just production rates. Moreover, through integrated life-cycle management, such degradation information can be used to make improvements in every aspect of a product’s life-cycle. The intelligent maintenance systems (IMS) concept presented by Lee (1996) is representative of PaM. Specifically, it has three main working directions:

• Develop intertwined embedded informatics and electronic intelligence in a networked and tether-free environment and enable products and systems to intelligently monitor, predict, and optimize their performance.
• Change “failure reactive” to “failure proactive” by avoiding the underlying conditions that lead to machine faults and degradation. Focus on analyzing the root cause, not just the symptoms. That is, seek to prevent or to fix failure at its source.
• Feed the maintenance information back to the product, process and machine design, and ultimately make improvements in every aspect of the product life-cycle.

3.2.1.6 Self-maintenance

Self-maintenance is a new design and system methodology. Self-maintenance machines are expected to be able to monitor, diagnose, and repair themselves in order to increase their uptime. One system approach to enabling self-maintenance is based on the concept of functional maintenance (Umeda et al. 1995). Functional maintenance aims to recover the required function of a degrading machine by trading off functions, whereas traditional repair (physical maintenance) aims to recover the initial physical state by replacing faulty components, cleaning, etc. The way to fulfil the self-maintenance function is by adding intelligence to the machine, making it clever enough for functional maintenance, so that the machine can monitor and diagnose itself, and can still maintain its functionality for a while if any kind of failure or degradation occurs. In other words, self-maintainability would be appended to an existing machine as an additional embedded reasoning system. The required capabilities of a self-maintenance machine (SMM) are defined as follows (Labib 2006):

• Monitoring capability: the SMM must be capable of on-line condition monitoring using sensor fusion. The sensors send the raw data of machine condition to a processing unit.
• Fault judging capability: from the sensory data, the SMM can judge whether the machine condition is in a normal or abnormal state. By judging the condition of the machines, we can know their current condition and the time left to failure.
• Diagnosing capability: if the machine condition is in an abnormal state, the causes of faults must be diagnosed and identified to allow repair planning to be carried out.
• Repair planning capability: the machine is able to propose repair actions based on the result of diagnosis and functional maintenance. Repair planning is performed using knowledge from the experts, which is stored in the database system. There may be more than one repair action proposed; however, the optimized one will be selected for implementation.
• Repair executing capability: the maintenance is carried out by the machine itself without any human intervention. This can be achieved through the computer control system and actuators in the machines.
• Self-learning and improvement: when faced with unfamiliar problems, the machine is able to repair itself, and it is expected that if such problems occur again, the machine will take a shorter time to repair itself and the outcome of maintenance will be more effective and efficient.
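The six capabilities listed above can be read as one control loop. The toy sketch below illustrates that loop only; the class, sensor names, limits, fault labels, and repair actions are all hypothetical and are not taken from Labib (2006) or any real SMM implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SelfMaintenanceMachine:
    """Toy SMM: limits, fault names and repair actions are hypothetical."""
    limits: dict            # sensor name -> upper limit of the normal state
    repair_knowledge: dict  # fault -> list of candidate (action, cost) pairs
    learned: dict = field(default_factory=dict)  # fault -> action used before

    def judge(self, readings):
        """Fault judging: list the sensors whose reading exceeds its limit."""
        return [s for s, v in readings.items() if v > self.limits[s]]

    def diagnose(self, abnormal):
        """Diagnosing: map each abnormal sensor to a fault label."""
        return [f"{s}_fault" for s in abnormal]

    def plan(self, fault):
        """Repair planning: reuse a learned action, else pick the cheapest."""
        if fault in self.learned:
            return self.learned[fault]
        action, _cost = min(self.repair_knowledge[fault], key=lambda ac: ac[1])
        return action

    def cycle(self, readings):
        """One monitor -> judge -> diagnose -> plan -> execute -> learn pass."""
        executed = []
        for fault in self.diagnose(self.judge(readings)):
            action = self.plan(fault)
            executed.append(action)        # stand-in for actuator commands
            self.learned[fault] = action   # self-learning: remember the fix
        return executed


smm = SelfMaintenanceMachine(
    limits={"temperature": 80.0, "vibration": 5.0},
    repair_knowledge={
        "temperature_fault": [("reduce_speed", 1.0), ("increase_coolant_flow", 0.5)],
        "vibration_fault": [("rebalance", 2.0)],
    },
)
print(smm.cycle({"temperature": 91.0, "vibration": 3.2}))
```

Here "optimized" repair selection is reduced to choosing the lowest-cost candidate, and learning to caching the chosen action; both are deliberate simplifications.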

Efforts towards realizing self-maintenance have been mainly in the form of intelligent adaptive control, where investigation of control was achieved using fuzzy logic control. In order to realize self-maintenance, one needs to develop and implement an adaptive neuro-fuzzy inference system which allows the fuzzy logic controller to learn from the data it is modeling and automatically produce appropriate membership functions and the required rules. Such a controller must be able to cater for sensor degradation, and this leads to self-learning and improvement capabilities.

Another system approach to enabling self-maintenance is to add a self-service trigger function to a machine. The machine self-monitors, self-prognoses and self-triggers a service request before a failure actually occurs. The maintenance task may still be conducted by a maintenance crew, but the seamless integration of machine, maintenance schedule, dispatch system and inventory management system will minimize maintenance costs and raise customer satisfaction.

3.2.2 Prognostics Approaches for Condition Based Maintenance

Condition based maintenance (CBM) was presented as a maintenance scheme to provide sufficient warning of an impending failure on a particular piece of equipment, allowing that equipment to be maintained only when there is objective evidence of an impending failure. CBM methods and practices have been continuously improved in recent decades. Sensor fusion techniques are now commonly in use because of their inherent superiority in taking advantage of mutual information from multiple sensors (Hansen et al. 1994; Reichard et al. 2000; Roemer et al. 2001). A variety of techniques in vibration, temperature, acoustic emissions, ultrasonics, oil debris, lubricant condition, chip detectors, and time/stress analyses have received considerable attention.
For example, vibration signature analysis, oil analysis and acoustic emissions, because of their excellent capability for describing machine performance, have been successfully employed for prognostics for a long time (Kemerait 1987; Wilson et al. 1999; Goodenow et al. 2002). Current prognostic approaches can be classified into three basic groups: the model-based approach, the data-driven approach, and the hybrid approach.

The model-based approach requires detailed knowledge of the physical relationships between, and characteristics of, all related components in a system. It is a quantitative model used to identify and evaluate the difference between the actual operating state determined from measurements, and the expected operating state derived from the values of the characteristics obtained from the physical model. Bunday (1991) presented the theory and methodology of obtaining reliability indices from historical data. In direct implementation in maintenance, the reliability of the system is kept at a defined level, and whenever the reliability falls below the defined level, maintenance actions should take place to restore it to its proper level. However, it is usually prohibitive to use the model-based approach, since the relationships and characteristics of all related components in a system and its environment are often too complicated to build a model with a reasonable amount of accuracy. In some cases, values of some process parameters/factors are not readily available. A poor model leads to poor judgment.

The data-driven approach requires a large amount of historical data representing both normal and “faulty” operations. It uses no a priori knowledge of the process but, instead, derives behavioral models only from measurement data from the process itself. Pattern recognition techniques are widely used in this approach. General knowledge of the process can be used to interpret data analysis results, based on which qualitative methods such as fuzzy logic, and artificial intelligence methods, can be used for decision making to realize fault prevention.

The hybrid approach fuses the model-based information and sensor-based information and takes advantage of both model-driven and data-driven approaches, through which more reliable and accurate prognostic results can be generated (Hansen et al. 1994). Garga et al.
(2001) introduced a hybrid reasoning method for prognostics, which integrated explicit domain knowledge and machinery data. In this approach, a feed-forward neural network was trained using explicit domain knowledge to get a parsimonious representation of the explicit domain knowledge. However, a major breakthrough has not been made since.

Existing prognostic methods are application or equipment specific. For instance, the development of neural networks has added new dimensions to solving existing problems in conducting prognostics for a centrifugal pump (Liang et al. 1988). A comparison of the results using the signal identification technique shows various merits of employing neural nets, including the ability to handle multivariate wear parameters in a much shorter time. A polynomial neural network was applied to fault detection, isolation, and estimation for a helicopter transmission prognostic application (Parker et al. 1993). Ray and Tangirala (1996) built a stochastic model of fatigue crack dynamics in mechanical structures to predict remaining service time. Fuzzy logic-based neural networks have been used to predict paper web breakage in a paper mill (Bonissone 1995) and the failure of a tensioned steel band with seeded crack growth (Swanson 2001). Yet another prognostic application presented an integrated system in which a dynamically linked ellipsoidal basis function neural network was coupled with an automated rule extractor to develop a tree-structured rule set which closely approximates the classification of the neural network (Brotherton et al. 2000). That method allowed assessment of trending from the nominal class to each of the identified fault classes, which means quantitative prognostics were built into the network functionality. Vachtsevanos and Wang (2001) gave an overview of different CBM algorithms and suggested a method to compare their performance for a specific application.

Prognostic information, obtained through intelligence embedded into the manufacturing process or equipment, can also be used to improve manufacturing and maintenance operations in order to increase process reliability and improve product quality. For instance, the ability to increase the reliability of manufacturing facilities using awareness of the deterioration levels of manufacturing equipment has been demonstrated through an example of improving robot reliability (Yamada and Takata 2002). Moreover, a life cycle unit (LCU) (Seliger et al. 2002) was proposed to collect usage information about key product components, enabling one to assess product reusability and facilitating the reuse of products that have significant remaining useful life.

Despite the progress in CBM, many fundamental issues still remain. For example:

1. Most research is conducted at the single equipment level, and no infrastructure exists for employing a real-time remote machinery diagnosis and prognosis system for maintenance.
2. Most of the developed prognostics approaches are application or equipment specific. A generic and scalable prognostic methodology or toolbox does not exist.
3. Currently, methods are focused on solving the failure prediction problem. The need for tools for system performance assessment and degradation prediction has not been well addressed.
4. The maintenance world of tomorrow is an information world for feature-based monitoring. Features used for prognostics need to be further developed.
5. Many developed prediction algorithms have been demonstrated in a laboratory environment, but are still without industry validation.

To address the aforementioned unmet needs, Watchdog Agent®-based intelligent maintenance systems (IMS) have been presented by the IMS Center with a vision to develop a systematic approach in advanced prognostics to enable products and systems to achieve near-zero breakdown reliability and performance.
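As a closing illustration of the model-based approach reviewed in Section 3.2.2, the sketch below compares a measured operating state with the state expected from a physical model and flags a fault when the difference persists. The heater energy balance, the tolerance, and the persistence rule are hypothetical choices made for this example; they are not drawn from the cited works.

```python
def expected_outlet_temperature(inlet_temp_c, power_kw, flow_kg_s, cp_kj_kg_k=4.18):
    """Expected state from a physical model: steady-state energy balance for
    a heated water stream, T_out = T_in + P / (m_dot * c_p)."""
    return inlet_temp_c + power_kw / (flow_kg_s * cp_kj_kg_k)


def residuals(measured, expected):
    """Difference between actual and expected operating state, per sample."""
    return [m - e for m, e in zip(measured, expected)]


def fault_indicated(res, tolerance, min_consecutive=3):
    """True when |residual| exceeds `tolerance` for `min_consecutive`
    samples in a row, which filters out isolated noisy readings."""
    run = 0
    for r in res:
        run = run + 1 if abs(r) > tolerance else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical heater: 20 degC inlet, 41.8 kW, 1 kg/s gives about 30 degC expected.
expected = expected_outlet_temperature(20.0, 41.8, 1.0)
measured = [30.1, 29.9, 31.5, 31.8, 32.0]
res = residuals(measured, [expected] * len(measured))
print(fault_indicated(res, tolerance=1.0))
```

The sketch also shows why a poor model leads to poor judgment: any error in the energy balance appears directly in the residuals and can be mistaken for a fault.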

3.3 Watchdog Agent®-based Intelligent Maintenance Systems

Today most state-of-the-art manufacturing, mining, farming, and service machines (e.g., elevators) are actually quite “smart” in themselves. Many sophisticated sensors and computerized components are capable of delivering data concerning a machine’s status and performance. The problem is that little or no practical use is made of most of this data. We have the devices, but we do not have a continuous and seamless flow of information throughout entire processes. Sometimes this is because the available data is not rendered in a usable, or instantly understandable, form. More often, no infrastructure exists for delivering the data over a network, or for managing and analyzing the data, even if the devices were networked.

A Watchdog Agent®-based real-time remote machinery prognostics and health management (R2M-PHM) system has recently been developed by the IMS Center. It focuses on developing innovative prognostics algorithms and tools, as well as remote and embedded predictive maintenance technologies to predict and prevent machine failures, as illustrated in Figure 3.2.

Figure 3.2. Key focus and elements of the Intelligent Maintenance Systems

The rest of the section is organized as follows. Section 3.3.1 deals with the platform of the Watchdog Agent®-based real-time remote machinery prognostics and health management (R2M-PHM) system. Section 3.3.2 presents a generic and scalable prognostic methodology or toolbox, i.e., the Watchdog Agent® toolbox; and Section 3.3.3 illustrates the effectiveness and potential of this new development using several real industry case studies.

3.3.1 Watchdog Agent®-based R2M-PHM Platform

A generic and scalable prognostics framework was presented by Su et al. (1999) to integrate with embedded diagnostics to provide “total health management” capability. A reconfigurable and scalable Watchdog Agent®-based R2M-PHM platform is being developed by the IMS Center, which expands the well known open system architecture for condition-based maintenance (OSA-CBM) standard (Thurston and Lebold 2001) by including real-time remote machinery diagnosis and prognosis systems and embedded Watchdog Agent® technology. As illustrated in Figure 3.3, the Watchdog Agent® (hardware and software) is embedded onto machines to convert multi-sensory data to machine health information. The extracted information is managed and transferred through wireless internet or a satellite communication network, and service is automatically triggered.


Figure 3.3. Illustration of IMS real-time remote machinery diagnosis and prognosis system

3.3.1.1 System Architecture

The system architecture of the Watchdog Agent®-based R2M-PHM platform is shown in Figure 3.4. In most products or systems, different sensors measure different aspects of the same physical phenomena. For example, sensor signals such as vibration, temperature and pressure are collected. A “digital doctor” inspired by biological perceptual systems and machine psychology theory, the Watchdog Agent® consists of embedded computational prognostic algorithms and a software toolbox for predicting degradation of devices and systems. It is being built to be extensible and adaptable to most real-world machine situations. The health related information is saved to the database. The diagnostic and prognostic outputs of the Watchdog Agent®, which is mounted on all the machinery of interest, can then be fed into the decision support tools.

Decision support tools help the operation personnel balance and optimize their resources, when one or more machines are likely to fail, by constantly looking ahead. For example, if a production line has three processes A, B and C, such that A has one machine, B has three machines, and C has one machine, what would we do if we could anticipate that one of the machines at station B is not behaving normally? Perhaps we would arrange a staging area for output from A, or perhaps we would ramp up production on the other two machines at station B. Whatever the case, we would be making our decision before experiencing the impending breakdown. These tools are critical to maintenance and process personnel, enabling them to stay ahead of the game, balancing limited resources with constant change in demand. Decision support tools also help minimize losses in productivity caused by downtime, and help production and logistics managers optimize their maintenance schedule to minimize downtime costs.

The lean and necessary information for maintenance can then be determined and published to the internet through an embedded web server.
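The A, B, C production line example can be worked through numerically. The sketch below assumes a simple serial line whose throughput equals the capacity of its slowest station, with hypothetical machine rates; it is an illustration of the reasoning, not a description of the actual decision support tools.

```python
def line_throughput(stations):
    """Throughput of a serial line = capacity of its slowest station.
    `stations` maps station name -> list of per-machine rates (parts/hour)."""
    return min(sum(rates) for rates in stations.values())


def required_ramp_up(stations, station, failed_index):
    """Factor by which the surviving machines at `station` must speed up so
    that the line keeps the throughput it had before the machine was lost."""
    rates = stations[station]
    surviving = sum(r for i, r in enumerate(rates) if i != failed_index)
    others = min(sum(r) for name, r in stations.items() if name != station)
    target = min(others, sum(rates))
    return target / surviving


# Hypothetical rates: A and C run 60 parts/h; each of the three B machines 20.
line = {"A": [60.0], "B": [20.0, 20.0, 20.0], "C": [60.0]}
degraded = {"A": [60.0], "B": [20.0, 20.0], "C": [60.0]}
print(line_throughput(line), line_throughput(degraded))
print(required_ramp_up(line, "B", failed_index=1))
```

With these numbers, losing one B machine would cut the line to two thirds of its rate unless the remaining two machines run 50% faster. If that ramp-up is not feasible, the alternative mentioned in the text, a staging area for output from A, becomes the option to plan for.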

Figure 3.4. System architecture of a reconfigurable Watchdog Agent® (labels in the figure: sensor signals such as vibration, temperature, pressure, current, voltage and on/off; I/O cards; embedded operating system; embedded software; Watchdog Agent® toolbox; database; decision support tools; web server; embedded computer; client software; remote computer)

The rapid development of web-enabled and cyber-infrastructure technologies is important in providing enablers for remote monitoring and prognostics. One of the major barriers is that most manufacturers adopt proprietary communication protocols, which lead to difficulties in connecting diverse machines and products. Currently, the IMS Center is developing a web-enabled remote monitoring Device-to-Business (D2B)™ platform for remote monitoring and prognostics of diversified products and systems. A system methodology and infotronics platform has been developed that enables the transformation of product condition data into a more useful health information format for remote and network-enabled prognostics applications. The MIMOSA (maintenance information management open system architecture) organization has adopted the IMS infotronics platform as one of its standard platforms and will use an IMS testbed to demonstrate MIMOSA standards in its future activities.

As shown in Figure 3.5, the IMS infotronics platform includes the Watchdog Agent® toolbox (which contains adaptive algorithms for different situations and applications), decision support tools, data storage, and D2B™ (device-to-business) system level connectivity. The Watchdog Agent® toolbox includes signal processing, feature extraction, performance assessment, autonomous learning, prediction and prognostics functions. The lean and necessary information for maintenance from decision support tools can then be determined and sent out through D2B™ system level connectivity to remote workstations or computers.
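To give a flavor of the feature extraction stage named above, the sketch below computes a few time-domain features that are commonly used for vibration signals: RMS, peak, crest factor, and kurtosis. This particular feature set is an assumption chosen for illustration; the text does not list the specific features implemented in the Watchdog Agent® toolbox.

```python
import math


def vibration_features(samples):
    """Common time-domain features of one window of vibration samples."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    variance = sum(c * c for c in centered) / n
    rms = math.sqrt(sum(x * x for x in samples) / n)
    peak = max(abs(x) for x in samples)
    # Kurtosis (non-excess): fourth central moment over squared variance.
    kurtosis = (sum(c ** 4 for c in centered) / n) / (variance * variance)
    return {
        "rms": rms,
        "peak": peak,
        "crest_factor": peak / rms,
        "kurtosis": kurtosis,
    }


# Synthetic check signal: ten full periods of a unit-amplitude sine wave.
signal = [math.sin(2.0 * math.pi * k / 64.0) for k in range(640)]
features = vibration_features(signal)
print({name: round(value, 4) for name, value in features.items()})
```

For a pure sine wave the crest factor is the square root of 2 and the kurtosis is 1.5. Impulsive faults such as bearing defects tend to raise both values, which is why such features are useful inputs to a performance assessment step.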


Figure 3.5. Integrated infotronics platform

3.3.1.2 Hardware Requirements

For a given industry application, the selection of Watchdog Agent® hardware depends on the characteristics of the input/output signals (for example, what type of input/output signal is used and how many channels are needed), which tools or algorithms are selected (for example, different algorithms require different hardware computation and storage capacities), and the hardware’s working environment (which determines, for example, the hardware’s storage type and temperature range).

The hardware prototype currently used in the IMS Center is based on PC104 architecture, as shown in Figure 3.6a. PC104 architecture enables the hardware to be easily expanded to a multi-board system, which includes multiple CPUs and a large number of input channels. It has a powerful VIA Eden 400MHz CPU and 128MB of memory since all of the tools are embedded into the hardware. It has 16 high speed analog input channels to deal with highly dynamic signals. It also has various peripherals that can acquire non-analog sensor signals such as RS232/485/432, parallel and USB. The prototype uses a compact flash card for storage, so it can be placed on top of machine tools and is suitable for withstanding vibrations in a working environment. Once a certain set of tools/algorithms is determined for a given industry application, commercially available hardware, such as Advantech and National Instruments (NI) as illustrated in Figure 3.6b and c, respectively, will be further evaluated for customized Watchdog Agent® applications.

Figure 3.6a–c. Options of hardware prototypes for Watchdog Agent® application

3.3.1.3 Software Development

The software system of the Watchdog Agent®-based IMS platform consists of two parts: the embedded side software and the remote side software, as shown in Figure 3.7.

The embedded side software is the software running on the Watchdog Agent® hardware, which includes a communication module, a command analysis module, a task module, an algorithm module, a function module, and a DAQ module. The communication module is responsible for communicating with the remote side via the TCP/IP protocol. The command analysis module is used to analyze different commands coming from the remote side. The task module includes multithread scheduling and management. The algorithm module contains specific Watchdog Agent® tools. The function module has several auxiliary functions such as channel configuration, security configuration, and the email list. The DAQ module performs A/D conversion using either an interrupt or a software trigger to get data from different sensors.

The remote side software is the software running on the remote computers. It is implemented using ActiveX control technology and can be used as a component of the Internet Explorer browser. The remote side software is mainly composed of a communication module and a user interface module. The communication module is used for communicating with the embedded side via the TCP/IP protocol. The user interface has a health information display, an ATC status display, and a discrete event display. It also possesses an algorithm module, as well as an error log database and a data format interface.
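The role of the command analysis module can be sketched as a parser plus a dispatch table. The text does not specify the command format, so the line-based `NAME key=value` syntax, the command names, and the class below are entirely hypothetical; the TCP/IP transport handled by the communication module is omitted.

```python
def parse_command(line: str):
    """Split 'NAME key=value key=value' into (NAME, {key: value})."""
    name, *pairs = line.strip().split()
    args = dict(pair.split("=", 1) for pair in pairs)
    return name.upper(), args


class EmbeddedSide:
    """Toy stand-in for the embedded side: analyzes a command received from
    the remote side and routes it to a function-module handler."""

    def __init__(self):
        self.channels = {}  # channel id -> sampling rate in Hz
        self.handlers = {
            "SET_CHANNEL": self.set_channel,
            "GET_CHANNELS": self.get_channels,
        }

    def set_channel(self, args):
        """Channel configuration (hypothetical parameters id and rate_hz)."""
        self.channels[int(args["id"])] = float(args["rate_hz"])
        return "OK"

    def get_channels(self, args):
        """Report how many channels are currently configured."""
        return f"CHANNELS {len(self.channels)}"

    def handle(self, line):
        """Command analysis: parse, look up the handler, and execute it."""
        name, args = parse_command(line)
        handler = self.handlers.get(name)
        return handler(args) if handler else f"ERR unknown command {name}"


device = EmbeddedSide()
print(device.handle("set_channel id=3 rate_hz=1000"))
print(device.handle("get_channels"))
print(device.handle("reboot"))
```

A dispatch table keeps the command set easy to extend, since adding a command only requires registering one more handler. An empty command line is not handled in this sketch.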


J. Lee and H. Wang

Figure 3.7. Software structure of Watchdog Agent®

3.3.1.4 Remote Monitoring Architecture and Human Machine Interface Standards A four-layer infrastructure for remote monitoring and human machine interface standards is illustrated in Figure 3.8. The data acquisition layer consists of multiple sensors which obtain raw data from the components of a machine or machines in different locations. The Network layer will use either traditional Ethernet connections, or wireless connections for communication between the Watchdog Agent®s, or for sending short messages (SM) to an engineer’s mobile phone via GPRS services. The Application layer functions as a control server to save related information and control the behavior of the Watchdog Agent®s in the network. The Enterprise layer offers a user-friendly interface for maintenance-related engineers to access information either via an Internet browser or a mobile phone.

Figure 3.8. Illustration of Watchdog Agent®-based remote monitoring architecture

New Technologies for Maintenance


3.3.2 Watchdog Agent® Toolbox for Multi-sensor Performance Assessment and Prognostics The Watchdog Agent® toolbox, with autonomic computing capabilities, is able to convert critical performance degradation data into health features and quantitatively assess their confidence value to predict further trends so that proactive actions can be taken before potential failures occur. Figure 3.9 illustrates one of the developed enabling prognostics tools that can assess and predict the performance degradation of products, machines and complex systems.

Figure 3.9. IMS innovation in advanced prognostics

The Watchdog Agent® toolbox enables one to assess and predict quantitatively performance degradation levels of key product components, and to determine the root causes of failure (Casoetto et al. 2003; Djurdjanovic et al. 2000; Lee 1995, 1996), thus making it possible to realize physically closed-loop product life cycle monitoring and management. The Watchdog Agent® consists of embedded computational prognostic algorithms and a software toolbox for predicting degradation of devices and systems. Degradation assessment is conducted after the critical properties of a process or machine are identified and measured by sensors. It is expected that the degradation process will alter the sensor readings that are being fed into the Watchdog Agent®, and thus enable it to assess and quantify the degradation by quantitatively describing the corresponding change in sensor signatures. In addition, a model of the process or piece of equipment that is being considered, or available application specific knowledge can be used to aid the degradation process description, provided that such a model and/or such knowledge exist. The prognostic function is realized through trending and statistical modeling of the observed process performance signatures and/or model parameters. In order to facilitate the use of Watchdog Agent® in a wide variety of applications (with various requirements and limitations regarding the character of signals, available processing power, memory and storage capabilities, limited space, power consumption, the user’s preference etc.) the performance assessment module of the


Watchdog Agent® has been realized in the form of a modular, open architecture toolbox. The toolbox consists of different prognostics tools, including neural network-based, time-series based, wavelet-based and hybrid joint time-frequency methods, for predicting the degradation or performance loss of devices, processes, and systems. The open architecture of the toolbox makes it easy to add new solutions to the performance assessment modules and to interchange different tools, depending on the application needs. To enable rapid deployment, a quality function deployment (QFD) based selection method has been developed to provide a general suggestion to aid in tool selection; this is especially critical for industry users who have little knowledge of these algorithms. The current tools employed in the signal processing and feature extraction, performance assessment, diagnostics and prognostics modules of the Watchdog Agent® are summarized in Figure 3.10. Each of these modules is realized in several different ways to facilitate the use of the Watchdog Agent® in a wide variety of products and applications.

Figure 3.10. Watchdog Agent® prognostics toolbox

3.3.2.1 Signal Processing and Feature Extraction Module The signal processing module transforms multiple sensor signals into domains that are the most informative of a product’s performance. Time-series analysis (Pandit and Wu 1993) or frequency domain analysis (Marple 1987) can be used to process stationary signals (signals with time invariant frequency content), while wavelet (Burrus et al. 1998; Yen and Lin 2000), or joint time-frequency analysis (Cohen 1995; Djurdjanovic et al. 2002) could be used to describe non-stationary signals (signals with time-varying frequency content). Most real life signals, such as speech, music, machine tool vibration, acoustic emission etc. are non-stationary


signals, which places strong emphasis on the development and use of non-stationary signal analysis techniques, such as wavelets or joint time-frequency analysis. The feature extraction module extracts the features most relevant to describing a product’s performance. Those features are extracted from the domain into which the signal processing module transforms the sensor signals, using expert knowledge about the application, or automatic feature selection methods such as roots of the autoregressive time-series model, or time-frequency moments and singular value decomposition. Currently the following signal processing and feature extraction tools are used in the Watchdog Agent® toolbox:

• The Fourier transformation method has been widely used in de-noising and feature extraction. Noise components in the signal can be distinguished after it is transformed, and feature components can be identified after the removal of noise. However, Fourier transformation is applicable to stationary signals only, since frequency-band energies characterize signals with time-invariant frequency content.

• The autoregressive modeling method calculates frequency peak locations and intensities using autoregressive oscillation modes of sensor readings and bears significant information about the process (usually, mechanical systems are well described by the modes of oscillations).

• The wavelet/wavelet packet decomposition method enables the rapid calculation of non-stationary signal energy distribution at the expense of losing some of the desirable mathematical properties.

• The time-frequency analysis method provides both temporal and spectral information with good resolution, and is applicable to highly non-stationary signals (e.g. impacts or transient behaviors). However, it is not applicable if a large amount of data has to be considered and calculation speed is a concern.

• The application specific feature extraction method is applicable in cases when one can directly extract performance-relevant features from the time-series of sensor readings.
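As an illustration of the Fourier-based feature extraction described above, the following minimal sketch computes frequency-band energies of a stationary signal. The function names, the test signal, and the band edges are hypothetical choices made for this example and are not taken from the Watchdog Agent® toolbox; a plain discrete Fourier transform is used so that the sketch needs only the Python standard library.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the one-sided DFT (bins 0..N//2) of a real signal."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(signal))
        mags.append(abs(acc))
    return mags


def band_energies(signal, sample_rate, bands):
    """Sum of squared spectral magnitudes inside each (low_hz, high_hz) band."""
    n = len(signal)
    mags = dft_magnitudes(signal)
    energies = []
    for low, high in bands:
        energy = 0.0
        for k, mag in enumerate(mags):
            freq = k * sample_rate / n  # frequency of DFT bin k in Hz
            if low <= freq < high:
                energy += mag * mag
        energies.append(energy)
    return energies


# Hypothetical stationary test signal: a 20 Hz tone plus a weaker 60 Hz tone.
sample_rate = 256
signal = [math.sin(2 * math.pi * 20 * i / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 60 * i / sample_rate)
          for i in range(256)]

# Hypothetical band edges in Hz; the band energies form the feature vector.
features = band_energies(signal, sample_rate, [(0, 40), (40, 80), (80, 128)])
print(features)
```

In this example the first band carries most of the energy because the 20 Hz tone has the larger amplitude. For signals whose frequency content changes over time, such a feature vector would blur the changes, which is why the wavelet and time-frequency tools are listed for non-stationary signals.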

3.3.2.2 Performance Assessment Module The performance assessment module evaluates the overlap between the most recently observed signatures and those observed during normal product operation. This overlap is expressed through the so-called confidence value (CV), ranging between zero and one, with higher CVs signifying a high overlap, and hence performance closer to normal (Lee 1995, 1996). In case data associated with some failure mode exist, most recent performance signatures obtained through the signal processing and feature extraction module can be matched against signatures extracted from faulty behavior data as well. The areas of overlap between the most recent behavior and the nominal behavior, as well as the faulty behavior, are continuously transformed into CV over time for evaluating the deviation of the recent behavior from nominal to faulty. Realization of the performance evaluation module depends on the character of the application and extracted performance signatures. If significant application


expert knowledge exists, simple but rapid performance assessment based on the feature-level fused multi-sensor information can be made using the relative number of activated cells in the neural network, or by using the logistic regression approach. For products with open-control architecture, the match between the current and nominal control inputs and the performance criteria can also be utilized to assess the product’s performance. For more sophisticated applications with intricate and complicated signals and performance signatures, statistical pattern recognition methods, or the feature map based approach can be employed. The following performance assessment tools are currently being used in the Watchdog Agent® toolbox:

• The logistic regression method allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. It can quantitatively represent the proximity of current operating conditions to the region of desirable or undesirable behavior. However, it is applicable only when a good feature domain description of unacceptable behavior is available.

• The feature map method assesses the overlap between the normal and most recent process behavior, and is applicable in cases when the Gaussianity of extracted features cannot be guaranteed.

• The statistical pattern recognition method calculates the overlap of feature distributions based on the assumption of Gaussian distribution of the features, and is applicable to a repeatable and stable process. However, it is not applicable to highly dynamic systems in which the feature distribution cannot be approximated as Gaussian.

• The hidden Markov model method is applicable to highly dynamic phenomena when a sequence of process observations rather than a single observation is needed to describe adequately the behavior of process signatures.

• The particle filter performance assessment method is able to describe process performance quantitatively, and is applicable in cases of complex systems that display multiple regimes of operation (both normal and faulty). In this case a hybrid description of the system is needed, incorporating both discrete and continuous states.
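To make the idea of a confidence value concrete, the sketch below computes an overlap between two Gaussian feature distributions, in the spirit of the statistical pattern recognition tool listed above. The text does not specify which overlap measure is used, so the Bhattacharyya coefficient is an assumption made for this example; the function names and the sample feature values are likewise hypothetical.

```python
import math
import statistics


def gaussian_overlap(mean_a, std_a, mean_b, std_b):
    """Bhattacharyya coefficient of two univariate Gaussians (1.0 = identical)."""
    var_sum = std_a ** 2 + std_b ** 2
    scale = math.sqrt(2.0 * std_a * std_b / var_sum)
    return scale * math.exp(-((mean_a - mean_b) ** 2) / (4.0 * var_sum))


def confidence_value(baseline, recent):
    """CV in [0, 1]: overlap between baseline and recent feature samples."""
    return gaussian_overlap(
        statistics.mean(baseline), statistics.stdev(baseline),
        statistics.mean(recent), statistics.stdev(recent),
    )


# Hypothetical values of one feature (for example, a band energy).
baseline = [1.00, 1.02, 0.98, 1.01, 0.99]   # recorded during normal operation
healthy = [1.01, 0.99, 1.00, 1.02, 0.98]    # recent data, still normal
degraded = [1.30, 1.35, 1.28, 1.33, 1.31]   # recent data after a shift

print(confidence_value(baseline, healthy))   # close to one
print(confidence_value(baseline, degraded))  # close to zero
```

The measure equals one when the two distributions coincide and falls toward zero as their means separate, which matches the qualitative behavior of the CV described in the text. It relies on the Gaussian assumption and would not be appropriate for the cases in which the feature map or hidden Markov model tools are recommended.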

3.3.2.3 Diagnostics Module The diagnostics module tells not only the level of behavior degradation (the extent to which the newly arrived signatures belong to the set of signatures describing normal system behavior), but also how close the system behavior is to any of the previously observed faults (overlap between signatures describing the most recent system behavior and those characterizing each of the previously observed faults). This matching allows the Watchdog Agent® to recognize and forecast a specific fault behavior once a high match with the failure-associated signatures is assessed for the current process signatures, or is forecast based on the product’s current and past performance. Figure 3.11 illustrates this signature matching process for performance evaluation.


Figure 3.11. Performance evaluation using Confidence Value (CV)



• The support vector machine method establishes a non-linear maximum margin classifier that infers the machine condition from a new set of measurements. It works by using a non-linear kernel to transform the input vector space (which is a set of measurements believed to be correlated with machine condition) to a much higher dimension feature space, and drawing a linear hyper-plane classifier there. It is especially applicable to the situation when Gaussianity of the performance related features cannot be guaranteed and when a process may display multiple normal and faulty modes of behavior (multiple regimes of operation and/or multiple possible faults in the process). The main drawback of this method is that the choice of a kernel in real applications is usually based on experience or trial-and-error testing.

• The hidden Markov model method is especially applicable to a situation in which multiple signals exist and the system may have multiple failure modes. It is applicable to both stationary and non-stationary signals.

• The Bayesian belief network is a compact representation of cause-and-effect for a complex system, and is especially applicable to situations where there are multiple faults with multiple symptoms. The main drawback of this method is that no standard procedure exists to determine the network structure, and expert knowledge is needed to identify the node states.

• Condition diagnosis based on analytically calculated overlaps of Gaussians, which describe the signatures corresponding to the current process behavior and the signatures corresponding to various modes of normal or faulty equipment behavior, is applicable to cases in which performance related features approximately behave as Gaussians.
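The sequence-based matching performed by the hidden Markov model tool can be sketched with the standard forward algorithm: each candidate condition is represented by its own model, and the condition whose model assigns the highest likelihood to the recent observation sequence is reported. The two models below, their parameters, and the quantization of the signal into two symbols are hypothetical values chosen only for illustration.

```python
def forward_likelihood(obs, start, trans, emit):
    """Likelihood of a discrete observation sequence under an HMM."""
    n_states = len(start)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def diagnose(obs, models):
    """Return the label of the model that best explains the sequence."""
    return max(models, key=lambda label: forward_likelihood(obs, *models[label]))


# Hypothetical two-state models; observation symbols: 0 = low vibration
# level, 1 = high vibration level. Each model is (start, transition, emission).
MODELS = {
    "normal": ([0.9, 0.1],
               [[0.9, 0.1], [0.5, 0.5]],
               [[0.95, 0.05], [0.6, 0.4]]),
    "fault": ([0.5, 0.5],
              [[0.6, 0.4], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]),
}

print(diagnose([0, 0, 0, 0, 0, 0], MODELS))  # normal
print(diagnose([1, 1, 1, 1, 1, 1], MODELS))  # fault
```

In practice the model parameters would be estimated from recorded signatures of each condition rather than written by hand, and long sequences would require scaling or log-probabilities to avoid numerical underflow.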

3.3.2.4 Prediction and Prognostics Module The prediction and prognostics module is aimed at extrapolating the behavior of process signatures over time and predicting their behavior in the future. Currently, autoregressive moving average (ARMA) modeling (Pandit and Wu 1993) and match matrix (Liu et al. 2004) methods are used to forecast the performance behavior. Over time, as new


failure modes occur, performance signatures related to each specific failure mode can be collected and used to teach the Watchdog Agent® to recognize and diagnose those failure modes in the future. Thus, the Watchdog Agent® is envisioned as an intelligent device that utilizes its experience and human supervisory inputs over time to build its own expandable and adjustable world model. Performance assessment, prediction and prognostics can be enhanced through feature-level or decision-level sensor fusion, as defined by Hall and Llinas (2000) (Chapter 2). Feature-level sensor fusion is accomplished through concatenation of features extracted from different sensors, and the joint consideration of the concatenated feature vector in the performance assessment and prediction modules. Decision-level sensor fusion is based on separately assessing and predicting process performance from individual sensor readings and then merging these individual sensor inferences into a multi-sensor assessment and prediction through some averaging technique. In summary, the following performance forecasting tools are currently used in the Watchdog Agent®:

• The autoregressive moving average (ARMA) method is applicable to linear time-invariant systems whose performance features display stationary behavior. ARMA utilizes a small amount of historic data and can provide good short term predictions.

• The compound match matrix/ARMA prediction method is applicable to cases when abundant records of multiple maintenance cycles exist for nonlinear processes. It excels at dealing with high dimension data and can provide good long term prediction by converting vector-based feature prediction to scalar-based prediction.

• The fuzzy logic prediction method is applicable to complex systems whose behavior is unknown and for which no model, function or numerical technique to describe the system is readily available. It works with linguistically vague descriptions and tolerates some imprecision in formulating approximations. Fuzzy logic can give fast approximate solutions.

• The Elman recurrent neural network (ERNN) prediction method is applicable to non-linear systems and can give long term predictions when given a large amount of training data. However, no standard methodology exists to determine the ERNN structure, and trial-and-error is usually used in the modeling process.
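The short-term forecasting role of the ARMA tool can be illustrated with its simplest special case, a first-order autoregressive model with no moving-average terms, fitted by ordinary least squares. This simplification, the function names, and the synthetic confidence-value history are assumptions made for this sketch; a full ARMA fit would normally rely on a dedicated time-series library.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = intercept + phi * x[t-1]."""
    prev, curr = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_curr = sum(curr) / len(curr)
    cov = sum((p - mean_prev) * (c - mean_curr) for p, c in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    intercept = mean_curr - phi * mean_prev
    return intercept, phi


def forecast(series, steps):
    """Iterate the fitted AR(1) recursion to predict the next `steps` values."""
    intercept, phi = fit_ar1(series)
    predictions, last = [], series[-1]
    for _ in range(steps):
        last = intercept + phi * last
        predictions.append(last)
    return predictions


# Hypothetical, noise-free history of a slowly degrading confidence value.
cv_history = [0.95]
for _ in range(9):
    cv_history.append(0.1 + 0.85 * cv_history[-1])

predictions = forecast(cv_history, 3)
print(fit_ar1(cv_history))
print(predictions)
```

Because the example history is generated without noise, the fit recovers the generating coefficients; with measured data the estimates would be approximate and the forecast uncertainty would grow with the prediction horizon, which is consistent with the remark that ARMA is best suited to short term predictions.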

New tools will be continuously developed and added to the modular, open architecture Watchdog Agent® toolbox based on the development procedure as shown in Figure 3.12.


[Flowchart steps: Problem definition & constraints → Tool selection → Parameter & tool selection → Prototyping & testing → Accepted? (Yes/No) → Program development → Evaluation (Yes/No) → Deployment]

Figure 3.12. Flowchart for developing Watchdog Agent® tools

3.3.3 Case Studies Several Watchdog Agent® tools for on-line performance assessment and prediction have already been implemented as stand-alone applications in a number of industrial and service facilities. Several examples are listed below to illustrate the developed tools.

3.3.3.1 Example 1: Prognostics of an AS/RS Material Handling System A time-frequency based method (Cohen 1995) has been implemented for performance assessment of a gearbox in the AS/RS material handling system shown in Figure 3.13. Four vibration sensor readings have been fused to evaluate its performance autonomously while it is on-line. The vibration signals were processed into joint time-frequency energy distributions (Cohen 1995) and a set of time-shift invariant time-frequency moments (Zalubas et al. 1996; Djurdjanovic et al. 2000; Tacer and Loughlin 1996) was extracted. Since those moments asymptotically follow a Gaussian distribution (Zalubas et al. 1996), statistical reasoning was utilized to evaluate the overlap between signatures describing normal process behavior (used for training) and those describing the most recent process behavior. Figure 3.14 shows a screenshot of the software application housing this time-frequency based Watchdog Agent® used for performance assessment of a material handling system. The CV was generated by fusing multiple signal features for performance assessment.


Figure 3.13. Material handling system for mail staging

Figure 3.14. Screenshot of the time-frequency based Watchdog Agent®

3.3.3.2 Example 2: Roller Bearing Prognostics Testbed Most bearing diagnostics research involves studying defective bearings recovered from the field or from laboratory experiments in which the bearings exhibit mature faults. Experiments using defective bearings are less capable of revealing natural defect propagation in its early stages. To reflect real defect propagation processes, bearing run-to-failure tests were performed under normal load conditions on a specially designed test rig sponsored by Rexnord Technical Service. The bearing test rig hosts four test bearings on one shaft. Shaft rotation speed was kept constant at 2000 rpm. A radial load of 6000 lb was applied to the shaft and bearings by a spring mechanism. A magnetic plug installed in the oil feedback pipe collected debris from the oil as evidence of bearing degradation. The test stopped when the accumulated debris adhering to the magnetic plug exceeded a certain level. Four double row bearings were installed on one shaft as shown in Figure 3.15. A high sensitivity accelerometer was installed on each bearing housing. Four thermocouples were attached to the outer race of each bearing to record bearing temperature (which is relevant to the bearing lubrication condition). Several sets of tests ending with various failure modes were carried out. The time domain feature shows that most of the bearing fatigue life is consumed during the period of accumulative material damage, while the period of crack propagation and development is relatively short. This means that if the traditional threshold-based condition monitoring approach is used, the time available for the maintenance crew to respond before catastrophic failure, once a defect is detected in such bearings, is very short. A prognostic approach that can detect the defect at an early stage is needed so that enough buffer time is available for maintenance and logistical scheduling.


Figure 3.15. The bearing test rig sponsored by Rexnord Technical Service

Figure 3.16 presents the vibration waveform collected from bearing 4 at the last stage of the bearing test. The signal exhibits strong impulse periodicity because of the impacts generated by a mature outer race defect. However, the vibration signal recorded three days before the bearing failed shows no sign of periodic impulses, as shown in Figure 3.17a. The periodic impulse feature is completely masked by the noise.

Figure 3.16. The vibration signal waveform of a faulty bearing

An adaptive wavelet filter is designed to de-noise the raw signal and enhance degradation detection. The adaptive wavelet filter is constructed in two steps. First, the optimal wavelet shape factor is found by the minimal entropy method. Then an optimal scale is identified by maximizing the signal periodicity. Applying the designed wavelet filter to the noisy raw signal yields the de-noised signal shown in Figure 3.17b. The periodic impulse feature can then be clearly seen, which serves as strong evidence of bearing outer race degradation. The wavelet filter-based de-noising method successfully enhanced the signal feature and provided potent evidence for prognostic decision-making.


a Raw Signal

b De-noised signal using the wavelet filter

Figure 3.17a, b. The vibration waveform with early stage defect

3.3.3.3 Example 3: Bearing Risk of Failure and Remaining Useful Life Prediction An important issue in prognostic technology is the estimation of the risk of failure, and of the remaining useful life of a component, given the component’s age and its past and current operating condition. In numerous cases, failures are attributed to many correlated degradation processes, which can be reflected by multiple degradation features extracted from sensor signals. These features are the major source of information regarding the health of the component under monitoring; however, the failure boundary is hard to define using these features. In reality, the same feature vector could be attributed to totally different combinations of the underlying degradation processes and their severity levels. There is only a probabilistic relationship between component failure and a given level of the degradation features. A typical example can be found in bearing operation: two bearings of the same type could fail at different levels of RMS and kurtosis of the vibration signal. To capture the probabilistic relationship between the multiple degradation features and the component failure, as well as to predict the risk of failure and the remaining useful life, IMS has developed a Proportional Hazards (PH) approach (Liao et al. 2005) based on the PH model proposed by Cox (1972). The PH model involving multiple degradation features is given as

$$\lambda(t; Z) = \lambda_0(t) \exp(\beta' Z) \qquad (3.1)$$

where λ(t; Z) is the hazard rate of the component given the current age t and the degradation feature vector Z; λ0(t) is called the baseline hazard rate function; β is the model parameter vector. This formulation relates the working age and multiple degradation features to the hazard rate of the component. To estimate the parameters, the maximum likelihood approach can be applied to off-line data, including the degradation features over time of many components and their failure times. Afterwards, the established model can be used to predict the risk of failure of the component by plugging in the working age and the degradation features extracted from the on-line sensor signals. In addition, the remaining useful life L(tcurrent) given the current working age and the history of degradation features can be estimated as

$$L(t_{\mathrm{current}}) \approx \int_{t_{\mathrm{current}}}^{\infty} \exp\left(-\int_{t_{\mathrm{current}}}^{\tau} \lambda(v; \hat{z}(v))\, dv\right) d\tau \qquad (3.2)$$

where ẑ(v) is the predicted feature vector. Consider the vibration data obtained from the test rig in Example 2. To facilitate on-line implementation, root-mean-square (RMS) and kurtosis are calculated and used as degradation features. Figure 3.18 shows the predicted hazard rate over time based on these degradation features. This quantity can be utilized to trigger maintenance when the risk level crosses a predetermined threshold level. Table 3.1 provides the remaining useful life predictions given the current bearing age and the feature observations. The predictions are in accordance with the actual life of the studied bearing (≈ 32 days), and the prediction error becomes small as the degradation progresses.
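Equations 3.1 and 3.2 can be evaluated numerically as in the following sketch. Several assumptions are made here that do not come from the text: the baseline hazard is given a Weibull form, the predicted feature vector is held constant at its latest value, the infinite upper limit of the integral is truncated at a finite horizon, and all parameter values are hypothetical. In a real application β and the baseline parameters would come from the maximum likelihood fit described above.

```python
import math


def hazard_rate(t, features, beta, shape, scale):
    """Equation 3.1 with an assumed Weibull baseline hazard (requires t > 0)."""
    baseline = (shape / scale) * (t / scale) ** (shape - 1)
    return baseline * math.exp(sum(b * z for b, z in zip(beta, features)))


def remaining_useful_life(hazard, t_current, horizon=200.0, step=0.01):
    """Equation 3.2 by the trapezoidal rule, truncated at t_current + horizon."""
    cum_hazard = 0.0                      # inner integral of the hazard rate
    prev_hazard = hazard(t_current)
    prev_survival = 1.0                   # exp(-0) at the current age
    rul = 0.0                             # outer integral of the survival term
    for i in range(1, int(horizon / step) + 1):
        v = t_current + i * step
        h = hazard(v)
        cum_hazard += 0.5 * (prev_hazard + h) * step
        survival = math.exp(-cum_hazard)
        rul += 0.5 * (prev_survival + survival) * step
        prev_hazard, prev_survival = h, survival
    return rul


# Hypothetical model parameters and features (for example, RMS and kurtosis
# after scaling); the feature vector is assumed constant over the horizon.
beta = [0.5, 0.25]
z_hat = [1.0, 2.0]


def predicted_hazard(v):
    return hazard_rate(v, z_hat, beta, shape=2.0, scale=30.0)


print(hazard_rate(26.0, z_hat, beta, shape=2.0, scale=30.0))
print(remaining_useful_life(predicted_hazard, t_current=26.0))
```

With an increasing baseline hazard, the estimated remaining useful life shrinks as the current age grows, and a maintenance trigger can be defined either on the hazard rate or on the remaining life estimate. Replacing the constant feature vector with a forecast of the features would follow Equation 3.2 more closely.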

Figure 3.18. Hazard rate prediction of bearing 3 in Test 1

Table 3.1. Estimates of expected remaining useful life – Test 1, Bearing 3 (unit: day)

Time                                        26        29        31
Estimated expected remaining useful life    3.5549    3.3965    1.5295
True remaining useful life                  6.5278    3.5278    1.5278
Error                                       2.9729    0.1313    0.0017

3.4 Conclusions and Future Research This chapter addresses the paradigm shift in modern maintenance systems from the traditional “fail and fix” practices to a “predict and prevent” methodology. A reconfigurable and scalable Watchdog Agent®-based intelligent maintenance system


has been developed, which serves as a baseline system for researchers and companies to develop next-generation e-maintenance systems. It enables machine makers and users to predict machine health degradation conditions, diagnose fault sources, and suggest maintenance decisions before a fault actually occurs. The Watchdog Agent®-based R2M-PHM platform expands the OSA-CBM architecture topology by including real-time remote machinery diagnosis and prognosis systems and embedded Watchdog Agent® technology. The Watchdog Agent® is an embedded algorithm toolbox which converts multi-sensory data to machine health information. Innovative sensory processing and autonomous feature extraction methods are developed to facilitate the plug-and-play approach in which the Watchdog Agent® can be set up and run without any need for expert knowledge or intervention. Future work will be the further development of the Watchdog Agent®-based IMS platform. Smart software and NetWare will be further developed for proactive maintenance capabilities such as performance degradation measurement, fault recovery, self-maintenance and remote diagnostics. For the embedded Watchdog Agent® application, we need to harvest the developed technologies and tools and to accelerate their deployment in real-world applications through close collaboration between industrial and academic researchers.
Specifically, future work will include the following aspects: (i) evaluate the existing Watchdog Agent® tools and identify the application needs from the smart machine testbed; (ii) develop a configurable prognostics tools platform for rotary machinery elements such as bearings, motors, and gears, so that several of the most frequently used prognostics tools can be pretested and deposited into a ready-to-use tool library; (iii) develop a user interface system for tool selection, which allows users to use the right tools effectively for the right applications and achieve “the first tool correct” accuracy; (iv) validate the reconfiguration of these tools to a variety of similar applications (to be defined by the company participants); and (v) explore research in a “peer-to-peer” (P2P) paradigm in which Watchdog Agent®s embedded on identical products operating under similar conditions could exchange information and thus assist each other in machine health diagnosis and prognosis. To predict, prioritize, and plan precision maintenance actions to achieve an “every action correct” objective, the IMS Center is creating advanced maintenance simulation software for maintenance schedule planning and service logistics cost optimization for transparent decision making. At the same time, the Center is exploring the integration of decision support tools and optimization techniques for proactive maintenance; this integration will facilitate the functionalities of the Watchdog Agent®-based R2M-PHM, in which an intelligent maintenance system can operate as a near-zero downtime, self-sustainable and self-aware artificially intelligent system that learns from its own operation and experience.
Embedding is crucial for creating an enabling technology that can facilitate proactive maintenance and life cycle assessment for mobile systems, transportation devices and other products for which cost-effective realization of predictive performance assessment capabilities cannot be implemented on general purpose personal computers. The main research challenge will be to accomplish sophisticated performance evaluation and prediction capabilities under the severe power consumption, processing power and data storage limitations imposed by embedding. The Center


will develop a wireless sensor network made of self-powered wireless motes for machine health monitoring and embedded prognostics. These networked smart motes can be easily installed in products and machines with ad hoc communications. In addition, the Center is investigating the feasibility of harvesting energy from vibration in an environment equipped with wireless motes for remote monitoring of equipment and machinery. In conjunction with that investigation, the Center is looking at ways of developing communication protocols that require less energy for communication. Power converter circuitry has been designed to convert vibration energy into useful electric energy. These technologies are critical for monitoring equipment or systems in a complex environment where the availability of power is the major constraint. In the area of collaborative product life cycle design and management, the Watchdog Agent® can serve as an infotronics agent to store product usage and end-of-life (EOL) service data and to send feedback to designers and life cycle management systems. Currently, an international intelligent manufacturing systems consortium on product embedded information systems for service and EOL has been proposed. The goal is to integrate Watchdog Agent® capabilities into products and systems for closed-loop design and life cycle management, as illustrated in Figure 3.19.

Figure 3.19. Embedded and tether-free product life cycle monitoring

The Center will continue advancing its research to develop technologies and tools for closed-loop life cycle design for product reliability and serviceability, as well as explore research in new frontier areas such as embedded and networked agents for self-maintenance, self-healing, and self-recovery of products and systems. These new frontier efforts will lead to a fundamental understanding of reconfigurability and allow the closed-loop design of autonomously reconfigurable engineered systems that integrate physical, information, and knowledge domains. These autonomously reconfigurable engineered systems will be able to sense, perform self-prognosis, self-diagnose, and reconfigure the system to function uninterruptedly when subject to unplanned failure events, as illustrated in Figure 3.20.

[Figure 3.20 labels: Near “0” Downtime; Closed-Loop Life Cycle Design (Design for Reliability and Serviceability); Product Center; Health Monitoring (Sensors & Embedded Intelligence); Product or System In Use; Product Redesign; Smart Design; Enhanced Six-Sigma Design; Degradation; Watchdog Agent®; Self-Maintenance (Redundancy, Active, Passive); Communications (Tether-Free (Bluetooth), Internet, TCP/IP); Service (Web-enabled Monitoring & Prognostics, Decision Support Tools for Optimized Maintenance, Condition-based Maintenance (CBM), Business and Service Synchronization, Asset Optimization); Web-enabled D2B™ Platform (XML-based)]

Watchdog Agent and Device-to-Business (D2B) are Trademarks of IMS Center

Figure 3.20. Intelligent maintenance systems and its key elements

3.5 References Badia, F.G., Berrade, M.D. and Campos, C.A., (2002) Optimal Inspection and Preventive Maintenance of Units with Revealed and Unrevealed Failures. Reliability Engineering and System Safety 78: 157–163. Barbera, F., Schneider, H. and Kelle, P., (1996) A Condition Based Maintenance Model with Exponential Failures and Fixed Inspection Interval. Journal of the Operational Research Society 47(8): 1037–1045. Bonissone, G., (1995) Soft computing applications in equipment maintenance and service in: ISIE ’95, Proceedings of the IEEE International Symposium, 2: 10–14. Brotherton, T., Jahns, G., Jacobs, J. and Wroblewski, D., (2000) Prognosis of faults in gas turbine engines, in: Aerospace Conference Proceedings, (2000) IEEE, 6: 18–25. Bruns, P., (2002) Optimal Maintenance Strategies for Systems with Partial Repair Options and without Assuming Bounded Costs. European Journal of Operational Research 139: 146–165. Bunday, B.D., (1991) Statistical Methods in Reliability Theory and Practice, Ellis Horwood. Burrus, C., Gopinath, R. and Haitao, G., (1998) Introduction to wavelets and wavelet transforms – a primer. NJ: Prentice Hall. Casoetto, N., Djurdjanovic, D., Mayor, R., Lee, J. and Ni, J., (2003) Multisensor process performance assessment through the use of autoregressive modeling and feature maps. Trans. of SME/NAMRI, 31:483–490.

New Technologies for Maintenance

77

Cavory, G., Dupas, R. and Goncalves, R., (2001) A Genetic Approach to the Scheduling of Preventive Maintenance Tasks on a Single Product Manufacturing Production Line, International Journal of Production Economics, 74: 135–146. Chen, C.T., Chen, Y.W. and Yuan, J., (2003) On a Dynamic Preventive Maintenance Policy for a System under Inspection. Reliability Engineering and System Safety 80: 41–47. Chen, D. and Trivedi, K., (2002) Closed-Form Analytical Results for Condition-Based Maintenance. Reliability Engineering and System Safety 76: 43–51. Cohen, L., (1995) Time-frequency analysis. NJ: Prentice Hall. Cox, D., (1972) Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34:187–220. Djurdjanovic, D., Widmalm, S.E., William, W.J., et al., (2000) Computerized classification of temporomandibular joint sounds. IEEE Transactions on Biomedical Engineering 47:977–984. Djurdjanovic, D., Ni, J. and Lee, J., (2002) Time-frequency based sensor fusion in the assessment and monitoring of machine performance degradation. Proceedings of 2002 ASME Int. Mechanical Eng. Congress and Exposition, paper number IMECE2002-32032. Garga, A., McClintic, K.T., Campbell, R.L., et al., (2001) Hybrid reasoning for prognostic learning in CBM systems, in: Aerospace Conference, 10–17 March, 2001, IEEE Proceedings, 6: 2957–2969. Goodenow, T., Hardman, W., Karchnak, M., (2000) Acoustic emissions in broadband vibration as an indicator of bearing stress. Proceedings of IEEE Aerospace Conference, 2000; 6: 95–122.L.D. Hall, L.D. and Llinas, J., (Eds.), (2000) Handbook of Sensor Fusion, CRC Press. Hall, L.D., (1992) Mathematical techniques in Multi-Sensor Data Fusion, Artech House Inc. Hansen, R., Hall, D., Kurtz, S., (1994) New approach to the challenge of machinery prognostics. Proceedings of the International Gas Turbine and Aeroengine Congress and Exposition, American Society of Mechanical Engineers, June 13–16 1994: 1–8. 
IMS, NSF I/UCRC Center for Intelligent Maintenance Systems, www.imscenter.net; 2004. Kemerait, R., (1987) New cepstral approach for prognostic maintenance of cyclic machinery. IEEE SOUTHEASTCON, 1987: 256–262. Kleinbaum, D., (1994) Logistic regression. New York: Springer-Verlag. Labib, A.W., (2006) Next generation maintenance systems: Towards the design of a selfmaintenance machine. 2006 IEEE International Conference on Industrial Informatics, Integrating Manufacturing and Services Systems, 16–18 August, Singapore Lee, J., (1995) Machine performance monitoring and proactive maintenance in computerintegrated manufacturing: review and perspective. International Journal of Computer Integrated Manufacturing 8:370–380. Lee, J., (1996) Measurement of machine performance degradation using a neural network model. Computers in Industry 30:193–209. Lee, J., Ni, J., (2002) Infotronics agent for tether-free prognostics. Proceeding of AAAI Spring Symposium on Information Refinement and Revision for Decision Making: Modeling for Diagnostics, Prognostics, and Prediction. Stanford Univ., Palo Alto, CA, March 25–27. Liang, E., Rodriguez, R., Husseiny, A., (1988) Prognostics/diagnostics of mechanical equipment by neural network, Neural Networks 1 (1) 33. Liao, H., Lin, D., Qiu, H., Banjevic, D., Jardine, A., Lee, J., (2005) A predictive tool for remaining useful life estimation of rotating machinery components. ASME International 20th Biennial Conference on Mechanical Vibration and Noise, Long Beach, CA, 24–28 September, 2005. Liu, J., Djurdjanovic, D., Ni, J., Lee, J., (2004) Performance similarity based method for enhanced prediction of manufacturing process performance. Proceedings of the 2004 ASME International Mechanical Engineering Congress and Exposition (IMECE), 2004.

78

J. Lee and H. Wang

Marple, S., (1987) Digital spectral analysis. NJ: Prentice Hall. Marseguerra, M., Zio, E. and Podofillini, L. (2002) Condition-Based Maintenance Optimization by Means of Genetic Algorithm and Monte Carlo Simulation. Reliability Engineering and System Safety 77: 151–166. Mijailovic, V. (2003) Probabilistic Method for Planning of Maintenance Activities of Substation Component. Electric Power System Research 64: 53–58. Pandit, S., Wu, S-M., (1993) Time series and system analysis with application. FL: Krieger Publishing Co. Parker, B.E., Jr., Nigro, T.M., Carley, M.P., et al., (1993) Helicopter gearbox diagnostics and prognostics using vibration signature analysis, in: Proceedings of the SPIE — The International Society for Optical Engineering: 531–542. Pham, H., Suprasad, A. and Misra, R.B. (1997) Availability and Mean Life Time Prediction of Multistage Degraded System with Partial Repairs. Reliability Engineering and System Safety 56: 169–173 Radjou, N., (2002) The collaborative product life-cycle. Forrester Research, May 2002. Ray, A. and Tangirala, S., (1996) Stochastic Modeling of Fatigue Crack Dynamic for OnLine Failure Prognostics, IEEE Transactions on Control Systems Technology, 4(4): 443– 449. Reichard, K., Van Dyke, M. and Maynard, K. (2000) Application of sensor fusion and signal classification techniques in a distributed machinery condition monitoring system. Proceedings of SPIE – The International Society for Optical Engineering 4051:329–336. Roemer, M., Kacprzynski, G. and Orsagh, R., (2001) Assessment of data and knowledge fusion strategies for prognostics and health management. Proceedings of IEEE Aerospace Conference, 2001; 6:62979–62988. Seliger, G., Basdere, B., Keil, T., et al. (2002) Innovative processes and tools for disassembly. Annals of CIRP 51:37–41. Su, L., Nolan M, DeMare G, Carey D. (1999) Prognostics framework ‘for weapon systems health monitoring’. 
Proceedings of IEEE Systems Readiness Technology Conference, IEEE AUTOTESTCON '99, 30 August–2 September 1999: 661–672. Swanson, D.C., (2001) A General Prognostics tracking algorithm for predictive maintenance, Proc. of the IEEE Aerospace Conference, 2001, 6: 2971–2977. Tacer, B., Loughlin, P., (1996) Time-frequency based classification. SPIE Proceedings 42:2697–2705. Thurston, M. and Lebold, M., (2001) Open Standards for Condition Based Maintenance and Prognostic Systems, Pennsylvania State University, Applied Research Laboratory. Umeda, Y., Tomiyama, T. and Yoshikawa, H., (1995) A design methodology for selfmaintenance machines, ASME journal of mechanical design, 117, September Vachtsevanos, G. and Wang, P., (2001) Fault prognosis using dynamic wavelet neural networks, Proceedings of the IEEE International Symposium on Intelligent Control 2001 (ISIC '01): 79–84. Wang, H.Z., (2002) A Survey of Maintenance Policies of Deteriorating Systems, European Journal of Operationa Research, 139: 469–489. Wilson, B.W., Hansen, N.H., Shepard, C.L., et al. (1999) Development of a modular in-situ oil analysis prognostic system. International Society of Logistics (SOLE) 1999 Symposium, Nevada, Las Vegas, 30 August –2 September. Yamada, A., Takata, S., (2002) Reliability improvement of industrial robots by optimizing operation plans based on deterioration evaluation. Annals of the CIRP 51:319–322. Yen, G., Lin, K., (2000) Wavelet packet feature extraction for vibration monitoring. IEEE Trans. on Industrial Electronics 2000; 47:650–667. Zalubas, E.J., O’Neill, J.C., Williams, W.J., et al., (1996) Shift and scale invariant detection. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1996; 5:3637–3640.

4 Reliability Centred Maintenance

Marvin Rausand and Jørn Vatn

4.1 Introduction

Reliability centred maintenance (RCM) is a method for maintenance planning that was developed within the aircraft industry and later adapted to several other industries and military branches. A large number of standards and guidelines have been issued in which the RCM methodology is tailored to different application areas, e.g., IEC 60300-3-11, MIL-STD 2173, NAVAIR 00-25-403 (NAVAIR 2005), SAE JA 1012 (SAE 2002), USACERL TR 99/41 (USACERL 1999), ABS (2003, 2004), NASA (2000) and DEF-STD 02-45 (DEF 2000). On a generic level, IEC 60300-3-11 (IEC 1999) defines RCM as a “systematic approach for identifying effective and efficient preventive maintenance tasks for items in accordance with a specific set of procedures and for establishing intervals between maintenance tasks.”

A major advantage of the RCM analysis process is that it provides a structured and traceable approach to determining the optimal type of preventive maintenance (PM). This is achieved through a detailed analysis of failure modes and failure causes. Although the main objective of RCM is to determine the preventive maintenance, the results from the analysis may also be used in relation to corrective maintenance strategies, spare part optimization, and logistic considerations. In addition, RCM has an important role in overall system safety management.

An RCM analysis process, when properly conducted, should answer the following seven questions:

1. What are the system functions and the associated performance standards?
2. How can the system fail to fulfil these functions?
3. What can cause a functional failure?
4. What happens when a failure occurs?
5. What might the consequence be when the failure occurs?
6. What can be done to detect and prevent the failure?
7. What should be done when a suitable preventive task cannot be found?


The main objectives of an RCM analysis process are to:

• Identify effective maintenance tasks
• Evaluate these tasks by some cost–benefit analysis
• Prepare a plan for carrying out the identified maintenance tasks at optimal intervals

The RCM analysis process is carried out as a sequence of activities. Some of these activities, or steps, overlap in time. The structuring of the RCM process is slightly different in the various standards, guidelines, and textbooks. In this chapter we split the RCM analysis process into the following 12 steps:

1. Study preparation
2. System selection and definition
3. Functional failure analysis (FFA)
4. Critical item selection
5. Data collection and analysis
6. Failure modes, effects, and criticality analysis (FMECA)
7. Selection of maintenance actions
8. Determination of maintenance intervals
9. Preventive maintenance comparison analysis
10. Treatment of non-critical items
11. Implementation
12. In-service data collection and updating

The rest of the chapter is structured as follows: In Section 4.2 we describe and discuss the 12 steps of the RCM process. The concepts of generic and local RCM analysis are introduced in Section 4.3. These concepts have been used in a novel RCM approach to improve and speed up the analyses in a railway application. Models and methods for optimization of maintenance intervals are discussed in Section 4.4. Some main features of a new computer tool, OptiRCM, are briefly introduced. Concluding remarks are given in Section 4.5. The RCM analysis approach that is described in this chapter is mainly in accordance with accepted standards, but also contains some novel issues, especially related to steps 6 and 8 and the approach chosen in OptiRCM. The RCM approach is illustrated with examples from railway applications. Simple examples from the offshore oil and gas industry are also mentioned.

4.2 Main Steps of the RCM Analysis Process

4.2.1 Step 1: Study Preparation

Before the actual RCM analysis process is initiated, an RCM project group must be established. The group should include at least one person from the maintenance function and one from the operations function, in addition to an RCM specialist. In Step 1 the RCM project group should define and clarify the objectives and the scope of the analysis. Requirements, policies, and acceptance criteria with respect to safety and environmental protection should be made visible as boundary conditions for the RCM analysis.

The part of the plant to be analyzed is selected in Step 2. The type of consequences to be considered should, however, be discussed and settled on a general basis in Step 1. Possible consequences to be evaluated may comprise:

• Human injuries and/or fatalities
• Negative health effects
• Environmental damage
• Loss of system effectiveness (e.g. delays, production loss)
• Material loss or equipment damage
• Loss of market shares

The consequence classes cannot usually all be measured in a common unit. It is therefore necessary to prioritize between means affecting the various consequence classes. Such a prioritization is not an easy task and will not be discussed in this chapter. The trade-off problems can to some extent be solved within a decision theoretical framework (Vatn et al. 1996).

RCM analyses have traditionally concentrated on PM strategies. It is, however, possible to extend the scope of the analysis to cover topics like corrective maintenance strategies, spare part inventories, logistic support problems, and input to safety management. The RCM project group must decide what should be within the scope and what should be outside it.

The resources that are available for the analysis are usually limited. The RCM project group should therefore be realistic with respect to what to look into, realizing that the analysis cost should not dominate the potential benefits.

In many RCM applications the plant already has effective maintenance programs. The RCM project will therefore be an upgrade project to identify and select the most effective PM tasks, to recommend new tasks or revisions, and to eliminate ineffective tasks. These changes should then be applied within the existing programs in a way that allows the most efficient allocation of resources. When applying RCM to an existing PM program, it is best to utilize, to the greatest extent possible, established plant administrative and control procedures in order to maintain the structure and format of the current program. This approach provides at least three additional benefits:

• It preserves the effectiveness and success of the current program
• It facilitates acceptance and implementation of the project’s recommendations when they are processed
• It allows incorporation of improvements as soon as they are discovered, without the necessity of waiting for major changes to the PM program or analysis of every system

4.2.2 Step 2: System Selection and Definition

Before a decision to perform an RCM analysis is taken, two questions should be considered:

• For which systems is an RCM analysis beneficial compared with more traditional maintenance planning?
• At what level of assembly (plant, system, subsystem) should the analysis be conducted?

All systems may in principle benefit from an RCM analysis. With limited resources we must, however, set priorities, at least when introducing RCM in a new plant. We should start with the systems we assume will benefit most from the analysis. The following criteria may be used to prioritize systems for an RCM analysis:

• The failure effects of potential system failures must be significant in terms of safety, environmental consequences, production loss, or maintenance costs
• The system complexity must be above average
• Reliability data or operating experience from the actual system, or similar systems, should be available

Most operating plants have developed an assembly hierarchy, i.e. an organization of the system hardware elements into a structure that looks like the root system of a tree. In the offshore oil and gas industry this hierarchy is usually referred to as the tag number system. Several other names are also used. Moubray (1997) refers to the assembly hierarchy as the plant register. In railway infrastructure maintenance it is common to use the disciplinary areas as the next highest level in the plant register. These are typically:

• Superstructure
• Substructure
• Signalling
• Telecommunications
• Power supply (overhead line with supporting systems)
• Low voltage systems

In this chapter, the following terms are used for the levels of the assembly hierarchy:

Plant: A logical grouping of systems that function together to provide an output or product by processing and manipulating various input raw materials and feed stock. An offshore gas production platform may, e.g., be considered as a plant. For railway application a plant might be a maintenance area, where the main function of that “plant” is to ensure satisfactory infrastructure functionality in that area. Moubray (1997) refers to the plant as a cost centre. In railway application a plant corresponds to a train set (rolling stock), or a line (infrastructure).

System: A logical grouping of subsystems that will perform a series of key functions, which often can be summarized as one main function, that is required of a plant (e.g., feed water, steam supply, and water injection). The compression system on an offshore gas production platform may, e.g., be considered as a system. Note that the compression system may consist of several compressors with a high degree of redundancy. Redundant units performing the same main function should be included in the same system. It is usually easy to identify the systems in a plant, since they are used as logical building blocks in the design process.


The system level is usually recommended as the starting point for the RCM process. This is further discussed and justified, e.g., by Smith (1993) and in MIL-STD 2173 (MIL-STD 1986). This means that on an offshore oil/gas platform the starting point of the analysis should be the compression system, the water injection system or the fire water system, and not the whole platform. In railway application the systems were defined above as the next highest level in the plant hierarchy.

The systems may be further broken down into subsystems, sub-subsystems, and so on. For the purpose of the RCM analysis process the lowest level of the hierarchy should be what we will call an RCM analysis item.

RCM analysis item: A grouping or collection of components, which together form some identifiable package that will perform at least one significant function as a stand-alone item (e.g., pumps, valves, and electric motors). For brevity, an RCM analysis item will in the following be called an analysis item. By this definition, a shutdown valve, e.g., is classified as an analysis item, while the valve actuator is not. The actuator is supporting equipment to the shutdown valve, and only has a function as a part of the valve. The importance of distinguishing the analysis items from their supporting equipment is clearly seen in the FMECA in Step 6. If an analysis item is found to have no significant failure modes, then none of the failure modes or causes of the supporting equipment are important, and they therefore do not need to be addressed. Similarly, if an analysis item has only one significant failure mode, then the supporting equipment only needs to be analyzed to determine if there are failure causes that can affect that particular failure mode (Paglia et al. 1991). Therefore, only the failure modes and effects of the analysis items need to be analyzed in the FMECA in Step 6.

An analysis item is usually repairable, meaning that it can be repaired without replacing the whole item. In the offshore reliability database OREDA (2002) the analysis item is called an equipment unit. The various analysis items of a system may be at different levels of assembly. On an offshore platform, for example, a huge pump may be defined as an analysis item in the same way as a small gas detector. If we have redundant items, e.g., two parallel pumps, each of them should be classified as an analysis item. When in Step 6 we identify causes of analysis item failures, we often find it suitable to attribute these failure causes to failures of items on an even lower level of indenture. The lowest level is usually referred to as components.

Component: The lowest level at which equipment can be disassembled without damage or destruction to the items involved. Smith (2005) refers to this lowest level as least replaceable assembly, while OREDA (2002) uses the term maintainable item.

It is very important that the analysis items are selected and defined in a clear and unambiguous way in this initial phase of the RCM analysis process, since the following analysis will be based on these analysis items. If the OREDA database is to be used in later phases of the RCM process, it is recommended as far as possible to define the analysis items in compliance with the “equipment units” in OREDA.
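The assembly hierarchy described above can be sketched as a small tree structure. The following Python sketch is illustrative only: the class name `Node`, the level labels, the "Pumping subsystem" and the item names are assumptions made for the example and are not taken from any standard or from the OREDA taxonomy. It shows that analysis items may sit at different depths of the hierarchy and can still be collected in one list for the later steps.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Levels named in the text, ordered from highest to lowest (labels are illustrative).
LEVELS = ("plant", "system", "subsystem", "analysis item", "component")


@dataclass
class Node:
    name: str
    level: str
    children: List["Node"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.level not in LEVELS:
            raise ValueError(f"unknown level: {self.level!r}")

    def add(self, child: "Node") -> "Node":
        # A child must sit strictly below its parent in the hierarchy.
        if LEVELS.index(child.level) <= LEVELS.index(self.level):
            raise ValueError("a child must sit below its parent in the hierarchy")
        self.children.append(child)
        return child

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal, parent before children.
        yield self
        for child in self.children:
            yield from child.walk()


def analysis_items(root: Node) -> List[str]:
    """Collect the names of all analysis items, whatever their depth."""
    return [node.name for node in root.walk() if node.level == "analysis item"]


# Hypothetical example plant.
platform = Node("Offshore gas production platform", "plant")
compression = platform.add(Node("Compression system", "system"))
compression.add(Node("Compressor A", "analysis item"))  # redundant units stay
compression.add(Node("Compressor B", "analysis item"))  # in the same system
injection = platform.add(Node("Water injection system", "system"))
pumping = injection.add(Node("Pumping subsystem", "subsystem"))
pumping.add(Node("Injection pump 1", "analysis item"))
pumping.add(Node("Injection pump 2", "analysis item"))

print(analysis_items(platform))
```

The level check in `add` is a design choice of this sketch; it only enforces that the hierarchy is ordered, not that every level is present.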


4.2.3 Step 3: Functional Failure Analysis (FFA)

The objectives of this step are to:

1. Identify and describe the system’s required functions
2. Describe input interfaces required for the system to operate
3. Identify the ways in which the system might fail to function

4.2.3.1 Step 3(i): Identification of System Functions

The objective of this step is to identify and describe all the required functions of the system. According to ABS (2004) “each function should be documented as a function statement that contains a verb describing the function, an object on which the function acts, and performance standard(s)”. A function of a shutdown valve may therefore be “close flow of oil within 5 s”.

A complex system will usually have a large number of different functions. It is often difficult to identify all these functions without a checklist. The checklist or classification scheme of the various functions presented below may help the analyst in identifying the functions. The same scheme may be used in Step 6 to identify functions of analysis items. The term item is therefore used in the classification scheme to denote either a system or an analysis item:

1. Essential functions: These are the functions required to fulfil the intended purpose of the item. The essential functions are simply the reasons for installing the item. Often an essential function is reflected in the name of the item. An essential function of a pump is, e.g., to pump a fluid.
2. Auxiliary functions: These are the functions that are required to support the essential functions. The auxiliary functions are usually less obvious than the essential functions, but may in many cases be as important as the essential functions. Failure of an auxiliary function may in many cases be more critical than a failure of an essential function. An auxiliary function of a pump is, e.g., to “contain fluid.”
3. Protective functions: The functions intended to protect people, equipment, and the environment from damage and injury. The protective functions may be classified according to what they protect, as: (i) safety functions, (ii) environment functions, and (iii) hygiene functions. An example of a protective function is the protection provided by a rupture disk on a pressure vessel.
4. Information functions: These functions comprise condition monitoring, various gauges and alarms, and so on.
5. Interface functions: These functions apply to the interfaces between the item in question and other items. The interfaces may be active or passive. A passive interface is, e.g., present when an item is a support or a base for another item.
6. Superfluous functions: According to Moubray (1997) “Items or components are sometimes encountered which are completely superfluous. This usually happens when equipment has been modified frequently over a period of years, or when new equipment has been over-specified”. Superfluous functions are sometimes present when the item has been designed for an operational context that is different from the actual operational context. In some cases failures of a superfluous function may cause failure of other functions.

For analysis purposes the various functions of an item may also be classified as:

• On-line functions: These are functions operated either continuously or so often that the user has current knowledge about their state. The termination of an on-line function is called an evident (or detectable) failure. In relation to safety instrumented systems, on-line functions correspond to high demand systems; see IEC 61508 (IEC 1997).
• Off-line functions: These are functions that are used intermittently or so infrequently that their availability is not known by the user without some special check or test. The protective functions are very often off-line functions. An example of an off-line function is the essential function of an emergency shutdown (ESD) system on an oil platform. The termination of an off-line function is called a hidden (or undetectable) failure. In the IEC 61508 setting, off-line functions correspond to low demand systems.

Note that this classification of functions should only be used as a checklist to ensure that all relevant functions are revealed. Discussions about whether to classify a function as, e.g., “essential” or “auxiliary” should be avoided. The item may in general have several operational modes (e.g., running and standby), and several functions related to each operational mode.

4.2.3.2 Step 3(ii): Functional Block Diagrams

Various types of functional diagrams may represent the system functions identified in Step 3(i). The most common diagram is the so-called functional block diagram. A simple functional block diagram of a diesel engine is shown in Figure 4.1. It is generally not required to establish functional block diagrams for all the system functions. The diagrams are, however, efficient tools to illustrate the input interfaces to a function. In some cases we may want to split system functions into sub-functions on an increasing level of detail, down to functions of analysis items. The functional block diagrams may be used to establish this functional hierarchy in a pictorial manner, illustrating series-parallel relationships, possible feedbacks, and functional interfaces (e.g., see Blanchard and Fabrycky 1998; Rausand and Høyland 2004). Alternatives to the functional block diagram are reliability block diagrams and fault trees. Functional block diagrams are also useful as a basis for the FMECA in Step 6 in the RCM analysis process.

4.2.3.3 Step 3(iii): Functional Failures

The next step of the FFA is to identify and describe how the various system functions may fail. A system function may be subject to a set of performance standards (or functional requirements) that may be grouped as physical properties, operational performance properties including output tolerances, and time requirements such as continuous operation or required availability. An unacceptable deviation from one or more of these performance standards is called a functional failure.
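A function statement with a verb, an object, and a performance standard, together with the definition of a functional failure as an unacceptable deviation from that standard, can be sketched as a small data structure. This is a minimal illustration only: the class and field names are assumptions, and the performance standard is simplified to a single upper limit on one measured quantity, whereas a real function may have several standards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionStatement:
    verb: str           # what the item does
    obj: str            # what the function acts on
    quantity: str       # measured quantity the standard applies to
    upper_limit: float  # performance standard, simplified to a maximum value
    unit: str

    def describe(self) -> str:
        return (f"{self.verb} {self.obj}: "
                f"{self.quantity} <= {self.upper_limit:g} {self.unit}")

    def is_functional_failure(self, measured: float) -> bool:
        # Any value above the limit counts as an unacceptable deviation here.
        return measured > self.upper_limit


# Shutdown valve example from the text: "close flow of oil within 5 s".
close_valve = FunctionStatement("close", "flow of oil", "closing time", 5.0, "s")

print(close_valve.describe())
print(close_valve.is_functional_failure(4.2))  # within the standard
print(close_valve.is_functional_failure(7.0))  # exceeds the standard
```

With these assumptions, a measured closing time of 7 s is flagged as a functional failure, while 4.2 s is not.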


Figure 4.1. Functional block diagram for a diesel engine

The term functional failure is mainly used in the RCM literature, and has the same meaning as the more common term failure mode. In RCM we talk about functional failures at equipment level, and use the term failure mode for the parts of the equipment. The failure modes will therefore be causes of a functional failure. It is important to realize that a functional failure (and a failure mode) is a manifestation of the failure as seen from the outside, i.e., a deviation from performance standards. Functional failures and failure modes may be classified in three main groups related to the function of the item:

• Total loss of function: In this case the function is not achieved at all, or the quality of the function is far outside what is considered acceptable.
• Partial loss of function: This group may be very wide, and may range from the nuisance category almost to the total loss of function.
• Erroneous function: This means that the item performs an action that was not intended, often the opposite of the intended function.

A variety of classification schemes for functional failures (failure modes) have been published. Some of these schemes, e.g., Blache and Shrivastava (1994), may be used in combination with the function classification scheme in Step 3(i) to ensure that all relevant functional failures are identified.

The system functional failures may be recorded on a specially designed FFA-worksheet that is rather similar to a standard FMECA worksheet. An example of an FFA-worksheet is presented in Figure 4.2. In the first column of Figure 4.2 the various operational modes of the system are recorded. For each operational mode, all the relevant functions of the system are recorded in column 2.

Worksheet header fields: System; Ref. drawing no.; Date; Performed by; Page ... of ...

Worksheet columns: (1) Operational mode; (2) Function; (3) Function requirements; (4) Functional failure; (5) Frequency; (6) Criticality, with sub-columns S, E, A, C

Figure 4.2. Example of an FFA-worksheet

The performance requirements to the functions, like target values and acceptable deviations, are listed in column 3. For each function (in column 2) all the relevant functional failures are listed in column 4. In column 5 the frequency/probability of the functional failure is listed. A criticality ranking of each functional failure in that particular operational mode is given is given in column 6. The reason for including the criticality ranking is to be able to limit the extent of the further analysis by disregarding insignificant functional failures. For complex systems such a screening is often very important in order not to waste time and money. The criticality ranking depends on both the frequency/probability of the occurrence of the functional failure, and the severity of the failure. The severity must be judged at plant level. The severity ranking should be given in the four consequence classes: (S) safety of personnel, (E) environmental impact, (A) production availability, and (C) economic losses. For each of these consequence classes the severity should be ranked as for example (H) high, (M) medium, or (L) low. How we should define the borderlines between these classes will depend on the specific application. If at least one of the four entries are (M) medium or (H) high, the severity of the functional should be classified as significant, and the functional failure should be subject to further analysis. The frequency of the functional failure may also be classified in the same three classes. (H) high may, e.g., be defined as more than once per 5 years, and (L) low less than once per 50 years. As above, the specific borderlines will depend on the application. The frequency classes may be used to prioritize between the significant system failure modes. 
If all the four severity entries of a system failure mode are (L) low, and the frequency is also (L) low, the criticality is classified as insignificant, and the functional failure is disregarded in the further analysis. If, however, the frequency is (M) medium or (H) high the functional failure should be included in the further analysis even if all the severity ranks are (L) low, but with a lower priority than the significant functional failures. The FFA may be rather time-consuming because, for all functional failures, we have to list all the maintenance significant items (MSIs) (see Step 4). The MSI lists will hence have to be repeated several times. To reduce the workload we often conduct a simpler FFA where for each main function we list all functional failures in one column, and all the related MSIs in another column. This is illustrated in Figure 4.3 for a railway application.
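The screening rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text; the function name, the dictionary layout, and the returned labels are assumptions made for the example and are not part of the RCM method itself.

```python
from typing import Dict

LEVELS = ("L", "M", "H")                 # low, medium, high
CONSEQUENCE_CLASSES = ("S", "E", "A", "C")  # safety, environment, availability, cost


def classify_functional_failure(severity: Dict[str, str], frequency: str) -> str:
    """Screen one functional failure (hypothetical helper, not from the handbook).

    severity  -- rank (L/M/H) for each consequence class S, E, A, C
    frequency -- rank (L/M/H) of the functional failure frequency
    """
    if set(severity) != set(CONSEQUENCE_CLASSES):
        raise ValueError("severity must have exactly the keys S, E, A, C")
    if frequency not in LEVELS or any(v not in LEVELS for v in severity.values()):
        raise ValueError("ranks must be one of L, M, H")

    # At least one severity entry M or H: significant, analyse further.
    if any(severity[c] in ("M", "H") for c in CONSEQUENCE_CLASSES):
        return "significant"
    # All severities low but frequency M or H: analyse with lower priority.
    if frequency in ("M", "H"):
        return "lower priority"
    # All severities low and frequency low: disregard.
    return "insignificant"
```

For example, `classify_functional_failure({"S": "L", "E": "L", "A": "L", "C": "L"}, "H")` returns `"lower priority"`, matching the rule that frequent but low-severity functional failures are kept in the analysis with reduced priority.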


M. Rausand and J. Vatn

The function name reflects the functions to be carried out on a relatively high level in the system. In principle, we should explicitly formulate the function(s) to be carried out. Instead we often specify the equipment class performing the function. For example, “departure light signal” is specified rather than the more correct formulation “ensure correct departure light signal”. We observe that the last functional failure in Figure 4.3 is not a failure mode for the “correct” functional description (Ensure correct departure light signal), but is related to another function of the “departure light signal”. Thus, if we use an equipment class description rather than an explicit functional statement, the list of failure modes should cover all (implicit) functions of the equipment class. At the functional failure level, it is also convenient to specify whether the failure mode is evident or hidden; see Figure 4.3 where we have introduced an “EF/HF” column. For each function we also list the relevant items that are required to perform the function. These items will form “rows” in the FMECA worksheets; see Step 6.

4.2.4 Step 4: Critical Item Selection

The objective of this step is to identify the analysis items that are potentially critical with respect to the functional failures identified in Step 3(iii). These analysis items are denoted functional significant items (FSI). For simple systems the FSIs may be identified without any formal analysis. In many cases it is obvious which analysis items influence the functional failures. For complex systems with an ample degree of redundancy or with buffers, we may need a formal approach to identify the FSIs. If failure rates and other necessary input data are available for the various analysis items, it is usually a straightforward task to calculate the relative importance of the various analysis items based on a fault tree model or a reliability block diagram.
A number of importance measures are discussed by Rausand and Høyland (2004). In addition to the FSIs, we should also identify items with high failure rate, high repair costs, low maintainability, long lead-time for spare parts, or items requiring external maintenance personnel. These analysis items are denoted maintenance cost significant items (MCSI). Together, the functional significant items and the maintenance cost significant items are denoted maintenance significant items (MSI). In an RCM project for the Norwegian Railway Administration the use of generic RCM analyses (see Section 4.3) made it possible to analyze all identified MSIs. In this case this step could be omitted.

4.2.5 Step 5: Data Collection and Analysis

The purpose of this step is to establish a basis for both the qualitative analysis (relevant failure modes and failure causes), and the quantitative analysis (reliability parameters such as MTTF, PF-intervals, and so on). The data necessary for the RCM analysis may be categorized into the following three groups:

Reliability Centred Maintenance


Figure 4.3. Structure of functional failure analysis. One sheet is filled in per function (e.g., “Home signal”, “Departure light signal”). Example sheet:

Function: “Departure light signal”
Description: “Five lamp signals, with three main signals and two pre-signals”
Functional failure (EF/HF):
- Wrong signal picture (HF)
- Missing signal picture (HF)
- Unclear signal picture (HF)
- Does not prevent contact hazard in case of earth fault (HF)
- etc.
MSI:
- Signal mast
- Brands
- Background shade
- Earth conductor
- Lamp
- Lens
- Transformer
- etc.

1. Design data: (i) System definition: a description of the system boundaries including all subsystems and equipment to fulfil the main functions of the system, (ii) system breakdown: the assembly hierarchy as described in Step 2, (iii) a technical description of each subsystem, such as the structure of the subsystem, capacity and functions (e.g., input and output), (iv) system performance requirements, e.g., desired system availability, environmental requirements, (v) requirements related to maintenance/testing, e.g., according to rules and regulations.
2. Operational and failure data: (i) Performance requirements, (ii) operating profile (continuous or intermittent operation), (iii) control philosophy (remote/local and automatic/manual), (iv) environmental conditions, (v) maintainability, (vi) calendar- and accumulated operating time for overhauls, (vii) maintenance and downtime costs, (viii) recommended maintenance for each analysis item based on manufacturer specification, general guidelines or standards, or in-house recommended practice, and (ix) failure information, what happens when a failure occurs.
3. Reliability data: Reliability data may be derived from the operational data by statistical analysis. The reliability data is used to decide the criticality, to describe the failure process mathematically and to optimize the time between PM tasks.

During the initial phase of the RCM analysis process it often becomes evident that the format and quality of the operational data are not sufficient to estimate the relevant reliability parameters. Some of the main problems encountered are:

• The failure data is on a too high level in the assembly hierarchy, i.e., data is not reported on the RCM analysis item level (MSI).
• Failure mode and failure causes are not reported, or the recorded information does not correspond to definitions and code lists used in the FMECA of Step 6.
• For systems being monitored by measurements or visual inspection, the state information is often not reported, making it impossible to establish models for the failure progression.
• For multiple copies of a component the failure reporting does not link each failure report to a physical unit, but only states that “one of the components has failed and has been replaced.”

When such problems are encountered, it is important to start a process to improve the reporting of operational and failure data. However, there will always be a cost associated with improved reporting because (i) the maintenance personnel need to spend more time on reporting, (ii) the maintenance personnel need to be trained in failure reporting, and get insight into the structured FMECA thinking, and (iii) the reporting systems (maintenance management systems) have to be restructured to allow reporting in a format in accordance with the logical structure of the FMECA worksheets. Our experience is that improved reporting quality is unattainable unless maintenance personnel executing the maintenance also participate in the RCM process. This would give ownership to the process, but it is no guarantee that reporting will improve.

4.2.6 Step 6: FMECA

The objective of this step is to identify the dominant failure modes of the MSIs identified in Step 4. The information entered into the FMECA worksheet should be sufficient both with respect to maintenance task selection in Step 7, and interval optimization in Step 8. Our FMECA worksheet has more fields than the FMECAs found in most RCM standards. The reason for this is that we use the FMECA as the main database for the RCM analysis. Other RCM approaches often use a rather simple FMECA worksheet, but then have to add an additional FMECA-like worksheet with the data required for optimization of maintenance intervals.

TOP Events

Experience has shown that we can significantly reduce the workload of the FMECA by introducing so-called TOP events as a basis for the analysis. The idea is that for each failure mode in the FMECA, a so-called TOP event is specified as a consequence of the failure mode. A number of failure modes will typically lead to the same TOP event. A consequence analysis is then carried out for each TOP event to identify the end consequences of that particular TOP event, covering all consequence classes (e.g., safety, availability/punctuality, environmental aspects). For many plants, risk analyses (or safety cases) have been carried out as part of the design process. These may sometimes be used as a basis for the consequence analysis. Figure 4.4 shows a conceptual model of this approach for a railway application where the part to the left of the TOP event is treated in the FMECA, and the part to the right is treated as generic, i.e., only once for each TOP event.

Figure 4.4. Barrier model for safety. The figure shows a chain from an initiating event (“Red bulb failure”) to a TOP event (“Train collision”) and on to the end consequence classes C1 to C6. Elements shown: failure cause (burn-out bulb); maintenance barrier (preventive replacement); other barriers (directional setting “block”, automatic train protection, train control centre); consequence reducing barriers (rescue team, train construction, fire protection).

In the rectangle (dashed line) on the left-hand side of Figure 4.4 an “initiating event” and a “barrier” are illustrated. To analyze this “rectangle” we need reliability parameters, such as MTTF, aging parameter, and PF interval, that are included in the FMECA worksheet (e.g., see Rausand and Høyland 2004). Three situations are considered:

1. There is a failure or a fault situation that is not related to the component we are analyzing with respect to maintenance. If, for example, we are analyzing the automatic train protection (ATP) on the train, the initiating event may be “locomotive driver does not comply with signaling”, and thus the ATP is a barrier against this initiating event. In this situation the function of the ATP is typically a hidden function.
2. There is a potential failure in the component that is being analyzed, and maintenance is a barrier against this failure. An example is a crack that has been initiated in the rail, or in an axle (initiating event); ultrasonic inspection is a maintenance activity to reveal the crack and prevent a serious incident.
3. The initiating event is a component aging failure, and preventive maintenance is carried out to reduce the likelihood of this failure. In this situation the initiating event and the first barrier in Figure 4.4 merge into one element. An example is aging failure of a light bulb. The likelihood of such a failure will, however, be reduced if the light bulb is periodically replaced with a new one before the aging effect becomes dominant.

“Other barriers” in Figure 4.4 can prevent the component failure from developing into a critical TOP event. “Track circuit detection” may be a barrier against rail breakage, because the track circuit can detect a broken rail. Typical examples of TOP events in railway application are:

• Train derailment
• Collision train-train
• Collision train-object
• Fire
• Persons injured or killed in or at the track
• Persons injured or killed at level crossings
• Passengers injured or killed at platforms

Several consequence-reducing barriers may also be available. Guide rails may, e.g., be installed to mitigate the consequences in case of derailment. In Figure 4.4 we have indicated that the outcome of the TOP event may be one out of six (end) consequence classes:

C1: Minor injury
C2: Medical treatment
C3: Permanent injury
C4: 1 fatality
C5: 2–10 fatalities
C6: >10 fatalities

Note that the consequence reducing barriers and the end consequences are not analyzed explicitly during the FMECA, but treated as generic for each TOP event. In the railway situation this means only six analyses of the safety consequences related to human injuries/fatalities. In the following, a list of fields (columns) for the FMECA worksheets is proposed. The structure of the FMECA is hierarchical, but the information is usually presented in a tabular worksheet. The starting point in the FMECA is the functional failures from the FFA in Step 3. Each maintainable item is analyzed with respect to any impact on the various functional failures. In the following we describe the various columns:

• Failure mode (equipment class level). The first column in the FMECA worksheet is the failure mode at the equipment class level identified in the FFA in Step 3.
• Maintenance significant item (MSI). The relevant MSIs were identified in the FFA.
• MSI function. For each MSI, the functions of the MSI related to the current equipment class failure mode are identified.
• Failure mode (MSI level). For the MSI functions we also identify the failure modes at the MSI level.
• Detection method. The detection method column describes how the MSI failure mode may be detected, e.g., by visual inspection, condition monitoring, or by the central train control system (for railway applications).
• Hidden or evident. Specify whether the MSI function is hidden or evident.
• Demand rate for hidden function, fD. For MSI functions that are hidden, the rate of demand of this function should be specified.
• Failure cause. For each failure mode there are one or more failure causes. A failure mode will typically be caused by one or more component failures at a lower level. Note that supporting equipment to the component is considered for the first time at this step. In this context a failure cause may therefore be a failure mode of supporting equipment.

• Failure mechanism. For each failure cause, there is one or several failure mechanisms. Examples of failure mechanisms are fatigue, corrosion, and wear. To simplify the analysis, the columns for failure cause and failure mechanism are often merged into one column.
• Mean time to failure (MTTF). The MTTF when no maintenance is performed should be specified. The MTTF is specified for one component if it is a “point” object, and for a standardized distance if it is a “line” object such as rails, sleepers, and so on.
• TOP event safety. The TOP event in this context is the accidental event that might be the result of the failure mode. The TOP event is chosen from a predefined list established in the generic analysis.
• Barrier against TOP event safety. This field is used to list barriers that are designed to prevent a failure mode from resulting in the safety TOP event. For example, brands on the signalling pole would help the locomotive driver to recognize the signal in case of a dark lamp.
• PTE-S. This field is used to assess the probability that the other barriers against the TOP event all fail; see Figure 4.4. PTE-S should account for all the barriers listed under “Barrier against TOP event safety”.
• TOP event availability/punctuality. Also for this dimension a predefined list of TOP events may be established in the generic analysis.
• Barrier against TOP event availability/punctuality. This field is used to list barriers that are designed to prevent a failure mode from resulting in an availability/punctuality TOP event. Since the fail safe principle is fundamental in railway operation, there are usually no barriers against the punctuality TOP event when a component fails. An example of a barrier is a two out of three voting system on some critical components within the system.
• PTE-P. This field is used to assess the probability that the other barriers against an availability/punctuality TOP event all fail. PTE-P should account for all the barriers listed under “Barrier against TOP event availability/punctuality”. Due to the fail safe principle, PTE-P will often be equal to one.
• Other consequences. Other consequences may also be listed. Some of these are non-quantitative like noise effects, passenger comfort, and aesthetics. Material damage to rolling stock or components in the infrastructure may also be listed. Material damage may be categorized in terms of monetary value, but this is not pursued here.
• Mean downtime (MDT). The MDT is the time from when a failure occurs until the failure has been corrected and any traffic restrictions have been removed.
• Criticality indexes. Based on already entered information, different criticality indexes can be calculated. These indexes are used to screen out nonsignificant MSIs.
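As an illustration of how a criticality index might be derived from the fields above, the following sketch multiplies an approximate failure frequency (taken as 1/MTTF, which assumes a roughly constant failure rate and no preventive maintenance) by PTE-S. This particular index, the function names, and the MTTF values in the usage note are assumptions made for the example; the text does not prescribe a specific formula.

```python
from typing import List, Tuple


def top_event_frequency(mttf_years: float, pte: float) -> float:
    """Approximate TOP event frequency per year from one failure mode.

    Assumes (for illustration only) a failure frequency of 1/MTTF and that
    `pte` is the probability that all other barriers fail, given the failure.
    """
    if mttf_years <= 0:
        raise ValueError("MTTF must be positive")
    if not 0.0 <= pte <= 1.0:
        raise ValueError("PTE must be a probability in [0, 1]")
    return pte / mttf_years


def rank_by_criticality(rows: List[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """Sort (name, MTTF in years, PTE) rows by descending TOP event frequency."""
    scored = [(name, top_event_frequency(mttf, pte)) for name, mttf, pte in rows]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With a hypothetical MTTF of 2 years for a lamp and a PTE-S of 3 × 10⁻⁴, `top_event_frequency(2.0, 3e-4)` gives about 1.5 × 10⁻⁴ TOP events per year. MSIs whose index falls below an application-specific threshold could then be screened out.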

If a failure mode is considered significant with respect to safety or availability/punctuality (or other dimensions) a preventive maintenance task should be assigned. In order to make such an assignment, further information has to be specified. This additional information will be completed during Steps 7 and 8. The following fields are recommended:

• Failure progression. For each failure cause the failure progression should be described in terms of one of the following categories: (i) gradual observable failure progression, (ii) non-observable and fast observable failure progression (PF model), (iii) non-observable failure progression but with aging effects, and (iv) shock type failures.
• Gradual failure information. If there is a gradual failure progression, information about what values of the measurable quantity represent a fault state should be recorded. Further information about the expected time and standard deviation to reach this state should also be recorded.
• PF-interval information. In case of observable failure progression the PF model is often applied (e.g., see Rausand and Høyland 2004, p. 394). The PF concept assumes that a potential failure (P) can be observed some time before the failure (F) occurs. This time interval is denoted the PF interval. We need information both on the expected value and the standard deviation of the PF interval.
• Aging parameter. For non-observable failure progression aging effects should be described. Relevant categories are strong, moderate or low aging effects. The aging parameter can alternatively be described by a numeric value, i.e., the shape parameter α in the Weibull distribution.
• Maintenance task. The maintenance task is determined by the RCM logic discussed in Step 7.
• Maintenance interval. Often we start by describing the existing maintenance interval, but after the formalized process of interval optimization in Step 8 we enter the optimized interval.
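To make the role of the aging parameter concrete, the sketch below evaluates the Weibull failure rate function for a given MTTF and shape parameter α. It assumes the parametrization R(t) = exp(−(λt)^α), for which MTTF = Γ(1 + 1/α)/λ; the function names are chosen for this example only.

```python
import math


def weibull_scale_from_mttf(mttf: float, alpha: float) -> float:
    """Return lambda in R(t) = exp(-(lambda*t)**alpha) for a given MTTF.

    Uses MTTF = Gamma(1 + 1/alpha) / lambda (assumed parametrization).
    """
    if mttf <= 0 or alpha <= 0:
        raise ValueError("MTTF and alpha must be positive")
    return math.gamma(1.0 + 1.0 / alpha) / mttf


def weibull_failure_rate(t: float, mttf: float, alpha: float) -> float:
    """Failure rate z(t) = alpha * lambda * (lambda*t)**(alpha - 1), for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    lam = weibull_scale_from_mttf(mttf, alpha)
    return alpha * lam * (lam * t) ** (alpha - 1.0)
```

With α = 1 the failure rate is constant and equal to 1/MTTF (no aging). With α > 1 the failure rate increases with age, which is the situation where scheduled overhaul or replacement can be worthwhile.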

An example of an FMECA worksheet is shown in Table 4.1 for a departure light signal.

4.2.7 Step 7: Selection of Maintenance Actions

This step is the most novel compared to other maintenance planning techniques. A decision logic is used to guide the analyst through a question-and-answer process. The input to the RCM decision logic is the dominant failure modes from the FMECA in Step 6. The main idea is for each dominant failure mode to decide whether a preventive maintenance task is suitable, or whether it will be best to let the item deliberately run to failure and afterwards carry out a corrective maintenance task. There are generally three reasons for doing a preventive maintenance task:

• Prevent a failure
• Detect the onset of a failure
• Reveal a hidden failure

Only the dominant failure modes are subjected to preventive maintenance. To obtain appropriate maintenance tasks, the failure causes or failure mechanisms should be considered.

Table 4.1. Example of part of an FMECA worksheet

System function: Ensure correct departure light signal
Functional failure: No signal

MSI | Function | Failure mode | Failure cause | TOP event | Safety barriers | PTE-S | TOP event
Lamp | Give light | No light | Burnt-out filament | Train – Train | Directional block, ATP, TCC, “Black=red” | 3 × 10⁻⁴ | Manual train operation
Lens | Protect lamp | Broken lens | Rock fall | Train – Train | Directional block, ATP, TCC, “Black=red” | 2 × 10⁻⁵ | None
Lens | Slip through light | No light slipping through | Fouling | Train – Train | Directional block, ATP, TCC, “Black=red” | 2 × 10⁻⁴ | None

The failure mechanisms behind each of the dominant failure modes should be entered into the RCM decision logic to decide which of the following basic maintenance tasks is most applicable:

1. Continuous on-condition task (CCT)
2. Scheduled on-condition task (SCT)
3. Scheduled overhaul (SOH)
4. Scheduled replacement (SRP)
5. Scheduled function test (SFT)
6. Run to failure (RTF)

Continuous on-condition task (CCT) is a continuous monitoring of an item to find any potential failures. An on-condition task is applicable only if it is possible to detect reduced failure resistance for a specific failure mode from the measurement of some quantity.

Scheduled on-condition task (SCT) is a scheduled inspection of an item at regular intervals to find any potential failures. There are three criteria that must be met for an on-condition task to be applicable:

1. It must be possible to detect reduced failure resistance for a specific failure mode.
2. It must be possible to define a potential failure condition that can be detected by an explicit task.
3. There must be a reasonably consistent age interval between the time of potential failure and the time of failure.

There are two disadvantages of a scheduled compared with a continuous on-condition task:

• The man-hour cost of inspection is often larger than the cost of installing a sensor.
• Since the scheduled inspection is carried out at fixed points of time, one might “miss” situations where the degradation is faster than anticipated.

An advantage of a scheduled on-condition task is that the human operator is then able to “sense” information that a sensor will not be able to detect. This means that traditional “walk around checks” should not be totally skipped even if sensors are installed.

Scheduled overhaul (SOH) is a scheduled overhaul of an item at or before some specified age limit, and is often called “hard time maintenance”. An overhaul task can be considered applicable to an item only if the following criteria are met:

1. There must be an identifiable age at which the item shows a rapid increase in the item’s failure rate function.
2. A large proportion of the units must survive to that age.
3. It must be possible to restore the original failure resistance of the item by reworking it.

Scheduled replacement (SRP) is scheduled discard of an item (or one of its parts) at or before some specified age limit. A scheduled replacement task is applicable only under the following circumstances:

1. The item must be subject to a critical failure.
2. Test data must show that no failures are expected to occur below the specified life limit.
3. The item must be subject to a failure that has major economic (but not safety) consequences.
4. There must be an identifiable age at which the item shows a rapid increase in the failure rate function.
5. A large proportion of the units must survive to that age.

Scheduled function test (SFT) is a scheduled inspection of a hidden function to identify any failure. A scheduled function test task is applicable to an item under the following conditions:

1. The item must be subject to a functional failure that is not evident to the operating crew during the performance of normal duties.
2. The item must be one for which no other type of task is applicable and effective.

Run to failure (RTF) is a deliberate decision to run to failure because the other tasks are not possible or the economics are less favourable.

Figure 4.5. Maintenance task assignment/decision logic. The flowchart asks: Does a failure alerting measurable indicator exist? If yes: Is continuous monitoring feasible? (yes: continuous on-condition task (CCT); no: scheduled on-condition task (SCT)). If no: Is aging parameter α > 1? If yes: Is overhaul feasible? (yes: scheduled overhaul (SOH); no: scheduled replacement (SRP)). If no: Is the function hidden? (yes: scheduled function test (SFT); no: no PM activity found (RTF)).

In many situations a maintenance task may prevent several failure mechanisms. Hence in some situations it is better to enter failure modes rather than failure mechanisms into the RCM decision logic. Note also that if a failure cause for a dominant failure mode corresponds to supporting equipment, the supporting equipment should be defined as the “item” to be entered into the RCM decision logic. The RCM decision logic is shown in Figure 4.5. Note that this logic is much simpler than that found in most RCM standards and guidelines. It should be emphasized that such logic can never cover all situations. For example, in the situation of a hidden function with aging failures, a combination of scheduled replacements and function tests is required.
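The decision logic can be expressed as a short function. The sketch below follows the order of questions as read from the box labels of Figure 4.5; the function and parameter names are chosen for this example, and, as noted above, such simplified logic cannot cover every situation (for instance, a hidden function with aging failures).

```python
def select_maintenance_task(
    indicator_exists: bool,
    continuous_monitoring_feasible: bool,
    alpha: float,
    overhaul_feasible: bool,
    function_hidden: bool,
) -> str:
    """Return one of CCT, SCT, SOH, SRP, SFT, RTF (illustrative sketch only).

    alpha is the aging (Weibull shape) parameter of the failure mode.
    """
    # A failure-alerting measurable indicator allows an on-condition task.
    if indicator_exists:
        return "CCT" if continuous_monitoring_feasible else "SCT"
    # Aging failures (alpha > 1): overhaul if feasible, otherwise replace.
    if alpha > 1.0:
        return "SOH" if overhaul_feasible else "SRP"
    # No aging: a hidden function can only be revealed by a function test.
    if function_hidden:
        return "SFT"
    # No PM activity found.
    return "RTF"
```

For a lamp with no measurable indicator, strong aging (say α = 3, a hypothetical value), and no possibility of overhaul, `select_maintenance_task(False, False, 3.0, False, False)` returns `"SRP"`, i.e., scheduled replacement.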

4.2.8 Step 8: Determination of Maintenance Intervals

Usually, formalized methods for optimization of maintenance intervals are not a part of the RCM analysis. In order to optimize maintenance intervals we need to structure the analysis in such a way that it fits into the maintenance optimization models that exist. See Section 4.4 for a discussion of determination of maintenance intervals using optimization models.

4.2.9 Step 9: Preventive Maintenance-Comparison Analysis

Two overriding criteria for selecting maintenance tasks are used in RCM. Each task selected must meet two requirements:

• It must be applicable
• It must be effective

Applicability: The task is applicable in relation to our reliability knowledge and in relation to the consequences of failure. If a task is found based on the preceding analysis, it should satisfy the applicability criterion. A PM task is applicable if it can eliminate a failure, or at least reduce the probability of occurrence to an acceptable level (Hoch 1990), or if it reduces the impact of failures.

Cost-effectiveness: The task does not cost more than the failure(s) it is going to prevent. The PM task’s effectiveness is a measure of how well it accomplishes that purpose and whether it is worth doing. Clearly, when evaluating the effectiveness of a task, we are balancing the cost of performing the maintenance against the cost of not performing it. In this context, we may refer to the cost as follows (Hoch 1990). The cost of a PM task may include:

• The risk of maintenance personnel error, e.g., “maintenance introduced failures”
• The risk of increasing the effect of a failure of another component while one is out of service
• The use and cost of physical resources
• The unavailability of physical resources elsewhere while in use on this task
• Production unavailability during maintenance
• Unavailability of protective functions during maintenance of these
• “The more maintenance you do the more risk you expose your maintenance personnel to”

On the other hand, the cost of a failure may include:

• The consequences of the failure should it occur (e.g., loss of production, possible violation of laws or regulations, reduction in plant or personnel safety, or damage to other equipment)
• The consequences of not performing the PM task even if a failure does not occur (e.g., loss of warranty)
• Increased costs for emergency
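The cost-effectiveness criterion can be illustrated with a simple annual comparison. The sketch below is a deliberately crude assumption: it reduces all cost elements listed above to a single cost per PM task and a single cost per failure, and it takes the failure frequencies with and without PM as given inputs. All names are hypothetical.

```python
def is_cost_effective(
    pm_cost_per_task: float,
    pm_tasks_per_year: float,
    cost_per_failure: float,
    failures_per_year_without_pm: float,
    failures_per_year_with_pm: float,
) -> bool:
    """Return True if the yearly PM cost does not exceed the avoided failure cost."""
    if failures_per_year_with_pm > failures_per_year_without_pm:
        raise ValueError("PM is assumed not to increase the failure frequency")
    annual_pm_cost = pm_cost_per_task * pm_tasks_per_year
    avoided_failure_cost = cost_per_failure * (
        failures_per_year_without_pm - failures_per_year_with_pm
    )
    return annual_pm_cost <= avoided_failure_cost
```

For instance, two PM tasks per year at a cost of 500 each are justified under this simplification if they reduce the failure frequency from 0.5 to 0.1 per year and each failure costs 5000, since 1000 is less than the avoided cost of 2000.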

4.2.10 Step 10: Treatment of Non-MSIs

In Step 4 critical items (MSIs) were selected for further analysis. A remaining question is what to do with the items that are not analyzed. For plants already having a maintenance program it is reasonable to continue this program for the non-MSIs. If a maintenance program is not in effect, maintenance should be carried out according to vendor specifications if they exist; otherwise no maintenance should be performed. See Paglia et al. (1991) for further discussion.

4.2.11 Step 11: Implementation

A necessary basis for implementing the result of the RCM analysis is that the organizational and technical maintenance support functions are available. A major issue is therefore to ensure the availability of the maintenance support functions. The maintenance actions are typically grouped into maintenance packages, each package describing what to do, and when to do it. Many accidents are related to maintenance work. When implementing a maintenance program it is therefore of vital importance to consider the risk associated with the execution of the maintenance work. Checklists may be used to identify potential risk involved with maintenance work:

• Can maintenance people be injured during the maintenance work?
• Is a work permit required for execution of the maintenance work?
• Are measures taken to avoid problems related to re-routing, by-passing, etc.?
• Can failures be introduced during maintenance work?

Task analysis, e.g., see Kirwan and Ainsworth (1992), may be used to reveal the risk involved with each maintenance job. See Hoch (1990) for further discussion on implementing the RCM analysis results.

4.2.12 Step 12: In-service Data Collection and Updating

The reliability data we have access to at the outset of the analysis may be scarce, or almost non-existent. In our opinion, one of the most significant advantages of RCM is that we systematically analyze and document the basis for our initial decisions and, hence, can better utilize operating experience to adjust those decisions as operating experience data is collected. The full benefit of RCM is therefore only achieved when operation and maintenance experience is fed back into the analysis process. The updating process should be concentrated on three major time perspectives:

1. Short term interval adjustments
2. Medium term task evaluation
3. Long term revision of the initial strategy

For each significant failure that occurs in the system, the failure characteristics should be compared with the FMECA. If the failure was not covered adequately in the FMECA, the relevant part of the RCM analysis should, if necessary, be revised. The short-term update can be considered as a revision of previous analysis results. The input to such an analysis is updated reliability figures either due to more data, or updated data because of reliability trends. This analysis should not require excessive resources, since the framework for the analysis is already established. Only Steps 5 and 8 in the RCM process will be affected by short-term updates. The medium term update will also review the basis for the selection of maintenance actions in Step 7. Analysis of maintenance experience may identify significant failure causes not considered in the initial analysis, requiring an updated FMECA in Step 6.


The long-term revision will consider all steps in the analysis. It is not sufficient to consider only the system being analyzed; it is required to consider the entire plant with its relations to the outside world, e.g., contractual considerations, new laws regulating environmental protection, and so on.

4.3 Generic and Local RCM Analyses

An RCM analysis should be conducted for physical units in a stated operational context. Assume that we are planning to carry out an RCM analysis of a specific railway point¹ (turnout) at location X on line Y. For this railway point we identify all functions, failure modes, and so on. We then propose a set of maintenance tasks, and finally we choose the maintenance intervals based on reliability parameters for the railway point, punctuality parameters, and personnel risk. Now, there might be several hundred similar railway points, with slightly varying reliability performance and risk profiles that would require different maintenance intervals. To avoid repeating the entire RCM analysis for all these railway points, we propose to conduct a generic RCM analysis, and then make local adjustments with regard to reliability and risk parameters. The following steps are then required:

1. Conduct a generic RCM analysis for selected components. In this analysis we use generic (average) values of reliability and risk parameters (regarding punctuality and personnel risk).

2. Establish a generic RCM database. The results from generic RCM analyses of selected equipment types are stored in a generic RCM database. In a first phase we may restrict ourselves to broad classes of typical railway points. In a later phase, we may want to refine our analysis to cover specific types and brands of railway points (with different failure modes).

3. Select local analysis objects. In the local analysis we work with a subset of the railway system. This can, for example, be a specific railway point, railway points in the main track of a specific line, and so on.

4. Find an appropriate generic RCM template. For a local analysis object, we now recall the corresponding generic RCM analysis from the RCM database. We first verify that the generic RCM analysis object (template) is appropriate in terms of qualitative properties, with respect to functions, failure modes, and so on. At this point it might be necessary to add more functions, failure modes, etc. In this case, we add the “new” RCM object to the generic RCM database in order to make the generic RCM database more comprehensive.

5. Adjust parameters. At the local level we identify differences from the parameters used in the generic RCM database. A specific line may, for example, have very old railway points that may cause the MTTF to be smaller than the average MTTF. In this step of the procedure we have to consider all parameters that are involved in the optimization model (see Section 4.4).

6. Re-run the optimization procedure. Based on the new “local” parameters we next re-run the optimization procedure to adjust maintenance intervals taking local differences into account. To carry out this process we need a computerized tool to streamline the work.

7. Document the results. The results from the local analysis are stored in a local RCM database. This is a database where only the adjustment factors are documented, for example, that for railway points A, B, C, and D on line Y the MTTF is 30 % higher than the average. The maintenance interval is then adjusted accordingly.

¹ A railway point is a railway “switch” that allows a train to go from one track to another. A railway point is called a “turnout” in American English.
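The split between a generic database (step 2) and a local database that stores only adjustment factors (steps 5 and 7) can be sketched as follows. This is a minimal illustration, not the OptiRCM data model: the class name, the dictionary layout, the keys and all numerical values are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericTemplate:
    """Generic (average) parameters for one equipment class (hypothetical layout)."""
    equipment_class: str
    mttf: float   # generic mean time to failure, arbitrary time unit
    aging: float  # aging (shape) parameter


# Hypothetical generic RCM database holding a single template.
GENERIC_DB = {
    "railway_point": GenericTemplate("railway_point", mttf=10.0, aging=3.0),
}

# Hypothetical local RCM database: only adjustment factors are stored.
LOCAL_MTTF_FACTORS = {
    ("line_Y", "A"): 1.3,  # MTTF 30 % higher than the generic average
    ("line_Z", "K"): 0.7,  # old railway points, MTTF below the average
}


def local_parameters(equipment_class, line, point):
    """Recall the generic template and apply the local MTTF adjustment."""
    template = GENERIC_DB[equipment_class]
    factor = LOCAL_MTTF_FACTORS.get((line, point), 1.0)  # no entry: no adjustment
    return GenericTemplate(template.equipment_class, template.mttf * factor, template.aging)
```

The adjusted parameters returned by `local_parameters` would then be fed to whatever optimization procedure is re-run in step 6.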

4.4 Modelling and Optimizing Maintenance Intervals

A wide range of general models and methods for maintenance optimization have been proposed, e.g., see Rausand and Høyland (2004), Pierskalla and Voelker (1979), Valdez-Flores and Feldman (1989), Cho and Parlar (1991), Gertsbakh (2000) and Wang (2002). A large number of models and methods for specific applications have also been developed, e.g., see Vatn and Svee (2002), Chang (2005), Castanier and Rausand (2006), and Welte et al. (2006). In this section we present the basic elements required to optimize the maintenance interval, τ, and propose a standard procedure for setting up the cost function, C(τ).

A computerized tool called OptiRCM has been developed by the authors to support the RCM procedure presented in this chapter. OptiRCM is currently being used by the Norwegian National Railway (NSB). The Norwegian National Rail Administration (JBV) has also adopted the same procedure. OptiRCM imports the FMECA results generated by Steps 6 and 7 of the RCM analysis process. Cost information is usually not available in the FMECA; hence information about preventive and corrective maintenance costs must be provided separately. A screen presenting the information on the MSI level is shown in Figure 4.6. OptiRCM uses a procedure with three steps to optimize maintenance intervals: (i) the component performance is established (left-hand part of Figure 4.6), (ii) the system model is established (centre part of Figure 4.6), and (iii) the total cost is calculated (right-hand part of Figure 4.6).

4.4.1 Component Model

The aim of the component model is to establish the effective failure rate with respect to a specific failure mode, λE(τ), as a function of the maintenance interval τ. The effective failure rate is the unconditional expected number of failures per time unit for a given maintenance level. Typically, the effective failure rate is an increasing function of τ. A large number of models for determining the effective failure rate as a function of the maintenance strategies, the degradation models, and so on, have been proposed in the literature.


Figure 4.6. OptiRCM input and analysis screen

The interpretation of the effective failure rate is not straightforward for hidden functions. For such functions we also need to specify the rate at which the hidden function is demanded. In this situation we may approximate the effective failure rate by the product of the demand rate and the probability of failure on demand (PFD) for the hidden function. In the following we indicate models that may be used for modelling the effective failure rate, and we refer to the literature for details. The aim of OptiRCM has been:

• To cover the standard situations, both with respect to evident/hidden failures and with respect to the type of failure progression.
• To provide formulae that do not require too many reliability parameters to be specified.
• To limit the number of probabilistic models used as a basis for the optimization.
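For hidden functions, the approximation stated above (effective failure rate ≈ demand rate × PFD) can be written out in a few lines. The sketch below is illustrative only. In particular, the helper `pfd_periodic_test` uses the common textbook approximation PFD ≈ λτ/2 for a periodically tested item with a constant failure rate and small λτ; that formula is an assumption added here and is not taken from this chapter.

```python
def effective_failure_rate_hidden(demand_rate, pfd):
    """Approximate effective failure rate of a hidden function: demand rate x PFD."""
    if not 0.0 <= pfd <= 1.0:
        raise ValueError("pfd must be a probability between 0 and 1")
    return demand_rate * pfd


def pfd_periodic_test(failure_rate, test_interval):
    """Assumed textbook approximation PFD ~ lambda * tau / 2 (valid for small lambda * tau)."""
    return failure_rate * test_interval / 2.0
```

With a demand rate of 0.5 per time unit, a failure rate of 0.01 per time unit and a test interval of 2 time units, the assumed PFD is 0.01 and the effective failure rate becomes 0.005 per time unit.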

Only the Weibull distribution is used to model aging failures in OptiRCM. There may, of course, be situations where another distribution would be more realistic, but our experience is that the user of such a tool rarely has data or insight that would help to do better than applying the Weibull model.

4.4.1.1 Effective Failure Rate in the Situation of Aging

A standard block replacement policy is considered where an aging component is periodically replaced after intervals of length τ. Upon a failure in one interval, the component is replaced without affecting the next planned replacement. The effective failure rate, i.e., the average number of failures per time unit, is then given by λE(τ) = W(τ)/τ, where W(τ) is the renewal function (e.g., see Rausand and Høyland 2004). Approximation formulas for the effective failure rate exist if we assume Weibull distributed failure times (e.g., see Chang et al. 2006). OptiRCM uses the renewal equation to establish an iterative scheme for the effective failure rate based on an initial approximation.

4.4.1.2 Effective Failure Rate in the Situation of Gradual Observable Failure Progression

The assumption behind this situation is that the failure progression, say Y(t), can be observed as a function of time. In the simplest situation Y(t) is one-dimensional, whereas in more complex situations Y(t) may be multidimensional. We may also have situations where Y(t) denotes some kind of signal where, for example, the fast Fourier transform of the signal is available. In OptiRCM a very simple situation is considered, where Y(t) is monotonically increasing. As Y(t) increases, the probability of failure also increases, and at a predefined level (maintenance limit), say l, the component is replaced or overhauled. The effective failure rate, λE(τ, l), is now a function of both the inspection interval and the maintenance limit. In OptiRCM a Markov chain model is used to model the failure progression (e.g., see Welte et al. 2006 for details of the Markov chain modelling, and also an extension where it is possible to reduce the inspection intervals as we approach the maintenance limit). In the Markov chain model it is easy to treat the situation where Y(t) is a nonlinear function of time. If we restrict ourselves to linear failure progression, continuous models such as the Wiener and gamma processes may also be used.

4.4.1.3 Effective Failure Rate in the PF Model

The assumption behind the PF model is that failure progression is not observable for a rather long time, and then at some point of time we have a rather fast failure progression. This is the typical situation for cracks (potential failures) that can be initiated after a large number of load cycles. The cracks may develop rather fast, and it is important to detect the cracks before they develop into breakages.
The time from when a crack becomes observable until a failure (breakage) occurs is denoted the PF-interval. The important reliability parameters are the rate of potential failures, the mean and standard deviation of the PF-interval, and the coverage of the inspection method. The model implemented in OptiRCM for the PF situation is described in Vatn and Svee (2002). See also Castanier and Rausand (2006) for a similar approach, and the more general application of delay time models (Christer and Waller 1984).

4.4.2 System Model

Figure 4.4 shows a simplified model of the risk picture related to the component failure being analyzed. In order to quantify the risk related to safety, we need the following input data:

• The effective failure rate, λE(τ).
• The probability that the other barriers against the TOP event with respect to safety all fail, PTE−S.
• The probability, PCj, that the TOP event results in consequence class Cj, for j running through the consequence classes.

Table 4.2. PLL and cost contribution for each consequence class

Consequence              PLLj = PLL-contribution    SCj = Cost (Euro)
C1: Minor injury         0.01                       2,000
C2: Medical treatment    0.05                       30,000
C3: Permanent injury     0.1                        300,000
C4: 1 fatality           0.7                        1,600,000
C5: 2–10 fatalities      4.5                        13,000,000
C6: >10 fatalities       30                         160,000,000

The frequency of consequence class Cj is given by

$$F_j = \lambda_E(\tau) \cdot P_{\mathrm{TE\text{-}S}} \cdot P_{C_j} \tag{4.1}$$

where PCj is the probability that the TOP event results in consequence class Cj. We will later indicate how we can model Equation 4.1 as a function of the maintenance interval, τ. In some situations we also assign a cost and/or a PLL (potential loss of life) contribution to the various consequence classes. PLL denotes the annual, statistically expected number of fatalities in a specified population. Proposed values adopted by the Norwegian National Rail Administration are given in Table 4.2. See the discussion by Vatn (1998) regarding what it means to assign monetary values to safety. The total PLL contribution related to the component failure being analyzed is then

$$\mathrm{PLL} = P_{\mathrm{TE\text{-}S}} \cdot \sum_{j=1}^{6} \left(P_{C_j} \cdot \mathrm{PLL}_j\right) \cdot \lambda_E(\tau) \tag{4.2}$$

and the total cost contribution related to the component is

$$C_S = P_{\mathrm{TE\text{-}S}} \cdot \sum_{j=1}^{6} \left(P_{C_j} \cdot SC_j\right) \cdot \lambda_E(\tau) \tag{4.3}$$

where SCj is the safety cost of consequence class Cj. Note that in the FMECA analysis we can have an automatic procedure that calculates both the PLL contribution and the safety cost contribution based on the reliability parameters and the type of TOP event.

In the same way as we have done for safety consequences, we proceed with punctuality or unavailability costs. Here we simplify, and assume that there exists a fixed (expected) cost for each TOP event for punctuality, say PC(TOP). The punctuality cost per time unit is then

$$C_P = P_{\mathrm{TE\text{-}P}} \cdot PC(\mathrm{TOP}) \cdot \lambda_E(\tau) \tag{4.4}$$

This procedure may, if required, be repeated for other dimensions like environment, material damage, and so on.

4.4.3 Total Cost and Interval Optimization

The approach to interval optimization is based on minimizing the total cost related to safety, punctuality, availability, material damage, etc. Within an ALARP regime (e.g., see Vatn 1998) this requires that the risk is not unacceptable. Assuming that risk is acceptable, we proceed by calculating the total cost per time unit:

$$C(\tau) = C_S(\tau) + C_P(\tau) + C_{PM}(\tau) + C_{CM}(\tau) \tag{4.5}$$

where CS(τ) and CP(τ) are given by Equations 4.3 and 4.4, respectively. Further,

$$C_{PM}(\tau) = \mathrm{PM\ Cost} / \tau \tag{4.6}$$

where PM Cost is the cost per preventive maintenance activity. Note that for condition-based tasks we distinguish between the cost of monitoring the item, and the cost of physically improving the item by some restoration or renewal activity. This complicates Equation 4.6 slightly because we have to calculate the average number of renewals. Further, if CM Cost is the cost of a corrective maintenance activity, we have

$$C_{CM}(\tau) = \mathrm{CM\ Cost} \cdot \lambda_E(\tau) \tag{4.7}$$

Table 4.3. Generic probabilities, PCj, of consequence class Cj for the different TOP events

TOP event                                      PC1     PC2     PC3     PC4     PC5     PC6
Derailment                                     0.1     0.1     0.1     0.1     0.05    0.01
Collision train-train                          0.02    0.03    0.05    0.5     0.3     0.1
Collision train-object                         0.1     0.2     0.3     0.15    0.01    0.001
Fire                                           0.1     0.2     0.2     0.1     0.02    0.005
Passengers injured or killed at platforms      0.3     0.3     0.2     0.05    0.01    0.001
Persons injured or killed at level crossings   0.1     0.2     0.3     0.3     0.09    0.01
Persons injured or killed in or at the track   0.2     0.2     0.2     0.3     0.1     0.0001
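The sums in Equations 4.2 and 4.3 combine one row of Table 4.3 with the columns of Table 4.2. The sketch below, a plain illustration rather than the OptiRCM procedure, encodes Table 4.2 and two rows of Table 4.3 and evaluates both equations. The function and variable names are hypothetical.

```python
# Table 4.2: PLL contribution and cost (Euro) for consequence classes C1..C6.
PLL_J = [0.01, 0.05, 0.1, 0.7, 4.5, 30.0]
SC_J = [2_000, 30_000, 300_000, 1_600_000, 13_000_000, 160_000_000]

# Table 4.3: generic probabilities P_Cj per TOP event (two rows shown).
P_C = {
    "Derailment": [0.1, 0.1, 0.1, 0.1, 0.05, 0.01],
    "Fire": [0.1, 0.2, 0.2, 0.1, 0.02, 0.005],
}


def expected_per_top_event(top_event, weights):
    """Sum over j of P_Cj * weight_j for the given TOP event."""
    return sum(p * w for p, w in zip(P_C[top_event], weights))


def pll_rate(top_event, p_te_s, lambda_e):
    """Equation 4.2: total PLL contribution per time unit."""
    return p_te_s * expected_per_top_event(top_event, PLL_J) * lambda_e


def safety_cost_rate(top_event, p_te_s, lambda_e):
    """Equation 4.3: total safety cost contribution per time unit."""
    return p_te_s * expected_per_top_event(top_event, SC_J) * lambda_e
```

For the TOP event FIRE, `expected_per_top_event("Fire", SC_J)` evaluates to 1,286,200 Euros, i.e., about 1.286 million Euros per TOP event.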


To find the optimum maintenance interval we can now calculate C(τ) in Equation 4.5 for various values of the maintenance interval, τ, and then choose the τ value that minimizes C(τ).

As a numerical example we consider a pump used for oil cooling of the main high voltage transformer in a locomotive. The relevant figures in the example are assessed by experts in the Norwegian National Railway. Upon failure of the oil pump, the TOP event for punctuality will most likely be a FULL STOP, with probability PTE−P = 0.75 for this punctuality consequence. It is considered that a full stop gives an average delay of 15 min, and the cost of 1 min delay is set to 150 Euros. The potential TOP event for safety is a FIRE, but the likelihood is very small, i.e., PTE−S = 0.0005. The reliability parameters of the pump are the aging parameter, α = 3.5, and the mean time to failure without any preventive maintenance, MTTF = 10 million km.

To calculate the safety cost we find ∑j (PCj ⋅ SCj) = 1.286 million Euros by combining Tables 4.2 and 4.3. Equation 4.3 thus reads CS(τ) = 643 ⋅ λE(τ). The punctuality cost in Equation 4.4 is similarly given by CP(τ) = 0.75 ⋅ 2250 ⋅ λE(τ), where PC(TOP) = 15 ⋅ 150 = 2250 Euros is the expected cost of a full stop. For PM and CM cost we have PM Cost = 3100 Euros and CM Cost = 4400 Euros, respectively. Chang et al. (2006) argue that a good approximation for the effective failure rate is

$$\lambda_E(\tau) = \left(\frac{\Gamma(1 + 1/\alpha)}{\mathrm{MTTF}}\right)^{\alpha} \tau^{\alpha - 1} \left[1 - \frac{0.1\,\alpha\,\tau^{2}}{\mathrm{MTTF}^{2}} + \frac{(0.09\alpha - 0.2)\,\tau}{\mathrm{MTTF}}\right] \tag{4.8}$$

The total cost C(τ ) in Equation 4.5 can now be found as a function of τ ; see Figure 4.7 for a graphical illustration. The optimum interval is found to be 7.5 million km. The maintenance action is scheduled replacement of the pump; see Figure 4.5.
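The numerical example can be reproduced approximately with a short script. The sketch below is not the OptiRCM implementation: it evaluates the approximation in Equation 4.8, applies Equations 4.3 to 4.7 literally with the input figures of the example (so the punctuality term includes PTE−P = 0.75), and locates the minimum by a plain grid search. The function names, the search range and the grid step are assumptions made for illustration. τ and MTTF are in million km, and costs are in Euros per million km.

```python
import math

ALPHA = 3.5                      # aging parameter
MTTF = 10.0                      # million km, without preventive maintenance
PM_COST = 3100.0                 # Euros per preventive replacement
CM_COST = 4400.0                 # Euros per corrective maintenance activity
SAFETY_COST = 0.0005 * 1.286e6   # P_TE-S * sum_j(P_Cj * SC_j), Equation 4.3
PUNCT_COST = 0.75 * 15 * 150     # P_TE-P * PC(TOP), Equation 4.4


def effective_failure_rate(tau, alpha=ALPHA, mttf=MTTF):
    """Approximation of the effective failure rate, Equation 4.8."""
    base = (math.gamma(1.0 + 1.0 / alpha) / mttf) ** alpha * tau ** (alpha - 1.0)
    correction = 1.0 - 0.1 * alpha * tau**2 / mttf**2 + (0.09 * alpha - 0.2) * tau / mttf
    return base * correction


def total_cost(tau):
    """Total cost per time unit, Equation 4.5."""
    lam = effective_failure_rate(tau)
    return (SAFETY_COST + PUNCT_COST + CM_COST) * lam + PM_COST / tau


def optimum_interval(lo=1.0, hi=10.0, step=0.01):
    """Grid search for the interval minimizing the total cost (assumed range and step)."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return min(grid, key=total_cost)
```

Under these assumptions the grid search should return an interval close to the value reported in the text. The cost curve is rather flat around its minimum, so small changes in the cost inputs shift the optimum only slightly.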

Figure 4.7. Cost elements as a function of the maintenance interval
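As a cross-check on any closed-form approximation, the definition λE(τ) = W(τ)/τ from Section 4.4.1.1 can be evaluated numerically. The sketch below discretizes the renewal equation W(t) = F(t) + ∫ W(t − x) dF(x) on a uniform grid for Weibull distributed failure times. It is a plain first-order discretization chosen for brevity, not the iterative scheme used in OptiRCM. The function names and the default number of grid steps are assumptions.

```python
import math


def weibull_cdf(t, alpha, mttf):
    """Weibull distribution function, parameterized by its mean (MTTF) and shape alpha."""
    scale = mttf / math.gamma(1.0 + 1.0 / alpha)
    return 1.0 - math.exp(-((t / scale) ** alpha))


def renewal_function(tau, alpha, mttf, steps=400):
    """Approximate W(tau) from the discretized renewal equation on a uniform grid."""
    h = tau / steps
    cdf = [weibull_cdf(i * h, alpha, mttf) for i in range(steps + 1)]
    w = [0.0] * (steps + 1)
    for i in range(1, steps + 1):
        acc = cdf[i]
        # Convolution term: W(t_i - t_j) weighted by the probability mass in (t_{j-1}, t_j].
        for j in range(1, i + 1):
            acc += w[i - j] * (cdf[j] - cdf[j - 1])
        w[i] = acc
    return w[steps]


def effective_failure_rate_renewal(tau, alpha, mttf, steps=400):
    """Effective failure rate under block replacement: W(tau) / tau."""
    return renewal_function(tau, alpha, mttf, steps) / tau
```

A simple sanity check is the case α = 1 (exponential failure times), where W(τ) = τ/MTTF and the effective failure rate must equal 1/MTTF regardless of τ. For α > 1 the rate increases with τ, as stated in Section 4.4.1.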


4.5 Conclusions

The main parts of the RCM approach that we have described in this chapter are compatible with common practice and with most of the RCM standards. We are, however, using a more complex FMECA where we also record data that are necessary during maintenance interval optimization. The novel parts of our approach are related to the use of so-called generic RCM analysis and to maintenance interval optimization. The use of generic RCM analysis will significantly reduce the workload of a complete RCM analysis.

Maintenance optimization is, generally, a very complex task, and only a brief introduction is presented in this chapter. For maintenance personnel to be able to use the proposed methods, they need to have access to simple computerized tools where the mathematically complex methods are hidden. This was our objective in developing the OptiRCM tool. Maintenance optimization modules are, more or less, non-existent in the standard RCM tools. OptiRCM is not a replacement for these tools, but rather a supplement. OptiRCM is still in the development stage, and we are currently implementing several new features. Among these are additional methods related to maintenance strategies, and grouping of maintenance tasks.

4.6 References

ABS, (2003) Guide for Survey Based on Reliability-Centered Maintenance. American Bureau of Shipping, Houston.
ABS, (2004) Guidance Notes on Reliability-Centered Maintenance. American Bureau of Shipping, Houston.
Blanchard BS, Fabrycky WJ, (1998) Systems Engineering and Analysis, 3rd ed. Prentice Hall, Englewood Cliffs, NJ.
Blache KM, Shrivastava AB, (1994) Defining failure of manufacturing machinery and equipment. Proceedings from the Annual Reliability and Maintainability Symposium, pp. 69–75.
Castanier B, Rausand M, (2006) Maintenance optimization for subsea oil pipelines. Pressure Vessels and Piping 83:236–243.
Chang KP, (2005) Reliability-centered maintenance for LNG ships. ROSS report 200506, NTNU, Trondheim, Norway.
Chang KP, Rausand M, Vatn J, (2006) Reliability Assessment of Reliquefaction Systems on LNG Carriers. Submitted for publication in Reliability Engineering and System Safety.
Cho DI, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51:1–23.
Christer AH, Waller WM, (1984) Delay time models of industrial inspection maintenance problems. Journal of the Operational Research Society 35:401–406.
DEF-STD 02-45 (NES 45), (2000) Requirements for the application of reliability-centred maintenance technique to HM ships, submarines, Royal fleet auxiliaries and other naval auxiliary vessels. Defense Standard, U.K. Ministry of Defence, Bath, England.
Gertsbakh I, (2000) Reliability Theory with Applications to Preventive Maintenance. Springer, New York.
Hoch R, (1990) A practical application of reliability centered maintenance. The American Society of Mechanical Engineers, 90-JPGC/Pwr-51, Joint ASME/IEEE Power Generation Conference, Boston, MA, 21–25 October.
IEC60300-3-11, (1999) Dependability management – Application guide – Reliability centered maintenance. International Electrotechnical Commission, Geneva.
IEC61508, (1997) Functional safety of electrical/electronic/programmable electronic safety-related systems, Part 1–7. International Electrotechnical Commission, Geneva.
Kirwan B, Ainsworth LK, (1992) A Guide to Task Analysis. Taylor and Francis, London.
MIL-STD-2173 (AS), (1986) Reliability-Centered Maintenance. Requirements for Naval Aircraft, Weapon Systems and Support Equipment. U.S. Department of Defense, Washington, DC.
Moubray J, (1997) Reliability-centered Maintenance II, 2nd ed. Industrial Press, New York.
NASA, (2000) Reliability Centered Maintenance Guide for Facilities and Collateral Equipment. NASA Office of Safety and Mission Assurance, Washington DC.
NAVAIR 00-25-403, (2005) Guidelines for the naval aviation reliability-centered maintenance process. Naval Air Systems Command, U.S.A.
OREDA, (2002) Offshore Reliability Data, 4th ed. OREDA Participants. Available from Det Norske Veritas, NO-1322 Høvik, Norway.
Paglia A, Barnard D, Sonnett D, (1991) A case study of the RCM project at V.C. Summer Nuclear Generating Station. 4th International Power Generation Exhibition and Conference, Tampa, Florida, USA. 5:1003–1013.
Pierskalla WP, Voelker JA, (1979) A survey of maintenance models: The control and surveillance of deteriorating systems. Naval Research Logistics Quarterly 23:353–388.
Rausand M, Høyland A, (2004) System Reliability Theory; Models, Statistical Methods, and Applications. Wiley, New York.
SAE JA1012, (2002) A Guide to the Reliability-Centered Maintenance (RCM) Standard. The Engineering Society for Advancing Mobility Land, Sea, Air, and Space, Warrendale, PA.
Smith AM, (1993) Reliability-Centered Maintenance. McGraw-Hill, New York.
Smith DJ, (2005) Reliability, Maintainability and Risk; Practical Methods for Engineers including Reliability Centred Maintenance and Safety-Related Systems. Elsevier, Butterworth Heinemann, Amsterdam.
USACERL TR 99/41, (1999) Reliability centered maintenance (RCM) guide. Operating a more effective maintenance program. U.S. Army Corps of Engineers.
Valdez-Flores C, Feldman RM, (1989) A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics 36:419–446.
Vatn J, Hokstad P, Bodsberg L, (1996) An overall model for maintenance optimization. Reliability Engineering and System Safety 51:241–257.
Vatn J, (1998) A discussion of the acceptable risk problem. Reliability Engineering and System Safety 61:11–19.
Vatn J, Svee H, (2002) A risk based approach to determine ultrasonic inspection frequencies in railway applications. ESReDA Conference, Madrid, 27–28 May.
Wang H, (2002) A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 139:469–489.
Welte T, Vatn J, Heggset J, (2006) Markov state model for optimization of maintenance and renewal of hydro power components. 9th International Conference on Probabilistic Methods Applied to Power Systems, KTH, Stockholm, 11–15 June 2006.

Part C

Methods and Techniques

5 Condition-based Maintenance Modelling

Wenbin Wang

5.1 Introduction

The use of condition monitoring techniques in industry to direct maintenance actions has increased rapidly over recent years, to the extent that it has marked the beginning of what is likely to prove a new generation in production and maintenance management practice. There are both economic and technological reasons for this development, driven by tight profit margins, high outage costs and an increase in plant complexity and automation. Technical advances in condition monitoring techniques have provided a means to achieve high availability and to reduce scheduled and unscheduled production shutdowns. In all cases, the measured condition information does, in addition to potentially improving decision making, have a value added role for a manager in that there is now a more objective means of explaining actions if challenged.

In November 1979, the consultants Michael Neal & Associate Ltd published ‘A Guide to Condition Monitoring of Machinery’ for the UK Department of Trade and Industry; Neal et al. (1979). This groundbreaking report illustrated the difference in maintenance strategies (e.g., breakdown, planned, etc.) and suggested that condition based maintenance, using a range of techniques, would offer significant benefits to industry. By the late 1990s condition based maintenance had become widely accepted as one of the drivers to reduce maintenance costs and increase plant availability. With the advent of e-procurement, business to business (B2B), customer to business (C2B), business to customer (B2C) etc., industry is fast moving towards enterprise wide information systems associated with the internet. Today, plant asset management is the integration of computerised maintenance management systems and condition monitoring in order to fulfil the business objectives. This enables significant production benefits through objective maintenance prediction and scheduling, and positions the manufacturer to remain competitive in a dynamic market.

Today there exists a large and growing variety of condition monitoring techniques for machine condition monitoring and fault diagnosis. A particularly popular one for rotating and reciprocating machinery is vibration analysis. However, irrespective of the particular condition monitoring technique used, the working principle of condition monitoring is the same, namely condition data become available which need to be interpreted and appropriate actions taken accordingly.

There are generally two stages in condition based maintenance. The first stage is related to condition monitoring data acquisition and their technical interpretations. There have been numerous papers contributing to this stage, as evidenced by the proceedings of COMADEM over recent years. This stage is characterised by engineering skill, knowledge and experience. Much of the research effort at this stage has gone into determining the appropriate variables to monitor, Chen et al. (1994), the design of systems for condition monitoring data acquisition, Drake et al. (1995), signal processing, Wong et al. (2006), Samanta et al. (2006), Harrison (1995), Li and Li (1995), and how to implement computerised condition monitoring, Meher-Homji et al. (1994). These are just a few examples, and none of them explicitly models the maintenance decision process based upon the results of condition monitoring. For detailed technical aspects of condition monitoring and fault diagnosis, see Collacott (1997).

The second stage is maintenance decision making, namely what to do now given that condition information data and their interpretations are available. The decision at this stage can be complicated and entails consideration of cost, downtime, production demand, preventive maintenance shutdown windows, and most importantly, the likely survival time of the item monitored. Compared with the extensive literature on condition monitoring techniques and their applications, relatively little attention has been paid to the important problem of modelling appropriate decision making in condition based maintenance.

This chapter focuses on the second stage of condition monitoring, namely condition based maintenance modelling as an aid to effective decision making. In particular, we will highlight a modelling technique used recently in condition based maintenance, namely residual life modelling via stochastic filtering (Wang and Christer 2000). This is a key element in modelling the decision making aspect of condition based maintenance.

The chapter is organised as follows. Section 5.2 gives a brief introduction to condition monitoring techniques. Section 5.3 focuses on condition based maintenance modelling and discusses the various modelling techniques used. Section 5.4 presents the modelling of the residual life conditional on observed monitoring information using stochastic filtering. Section 5.5 concludes the chapter with a discussion of topics for future research.

5.2 Condition Monitoring Techniques

For many years condition monitoring has been defined as “The assessment on a continuous or periodic basis of the mechanical and electrical condition of machinery, equipment and systems from the observation and/or recordings of selected measurement parameters” (Collacott 1997). One of the obvious analogies is the temperature measurement of a human body, where the observation is the temperature and the system is the human body. Just as doctors strongly recommend periodic checks of key health parameters such as blood pressure, pulse, weight and/or temperature for an early indication of potential health problems, for industrial equipment some measurements can be taken and the likely condition of the plant assessed.

Today there exists a large and growing variety of forms of condition monitoring techniques for machine condition monitoring and fault diagnosis. Understanding the nature of each monitoring technique and the type of information measured will certainly help us when establishing a decision model. Here we briefly introduce five main techniques; among them, vibration and oil analysis are the two most popular.

5.2.1 Vibration Based Monitoring

Vibration based monitoring is the mainstream of current applications of condition monitoring in industry. It is an on-line or off-line technique used to detect system malfunction based on measured vibration signals. Generally speaking, vibration is the variation with time of the magnitude of a quantity that is descriptive of the motion or position of a mechanical system, when the magnitude is alternately greater than and smaller than some average value or reference. Vibration monitoring consists essentially in identifying two quantities:

• The magnitude (overall level) of the vibration
• The frequency content (and/or time waveform)

The magnitude is basically used for establishing the severity of the vibration and the frequency content for the cause or origin. Vibration velocity has been seen as the most meaningful magnitude criterion for assessing machine condition, though displacement or acceleration is also used. The magnitude of vibration is usually measured in root mean square (rms). If T denotes the period of vibration and V(t) is the vibration (say, velocity) measured at time t, then

$$V_{\mathrm{rms}} = \sqrt{\frac{1}{T} \int_{0}^{T} \left(V(t)\right)^{2} \, dt},$$

which is directly related to the energy of vibration (Reeves 1998). However, since vibration signals from machines are, in general, periodic in nature, a great deal of information is contained in their frequency spectrum. The frequency spectrum is usually obtained digitally using a digital analyser or computer via a mathematical algorithm known as the “fast Fourier transform” (FFT). The spectrum analysis of vibration signals is commonly used in the fault diagnosis of rotating machines.

Potentially, all machines can benefit from vibration monitoring except, perhaps, those running at very low speed (below about 20 rev/min), and those where isolation (or damping) occurs between the source and the sensor. From observed vibration signals we often see a typical two-stage process where the signals may stay flat over the normal operation period and then display some increasing trend when a defect has initiated (Wang 2002). Another factor coming into play when establishing a vibration based maintenance model is the causal relationship between the measured signals and the state of the plant. It is the defect which causes the abnormal signals, but not vice versa (Wang 2002). This factor plays an important role when selecting an appropriate model for describing such a relationship.

5.2.2 Oil Based Monitoring

A detailed analysis of a sample of engine, transmission and hydraulic oils is a valuable preventive maintenance tool for machines. In many cases it enables the identification of potential problems before a major repair is necessary, has the potential to reduce the frequency of oil changes, and increases the resale value of used equipment. Oil based monitoring involves sampling and analyzing oil for various properties and materials to monitor wear and contamination in an engine, transmission or hydraulic system, etc. Sampling and analyzing on a regular basis establishes a baseline of normal wear and can help indicate when abnormal wear or contamination is occurring.

Oil analysis works as follows. Oil that has been inside any moving mechanical apparatus for a period of time reflects the possible condition of that assembly. Oil is in contact with engine or mechanical components, and metallic wear particles enter the oil. These particles are so small that they remain in suspension. Many products of the combustion process will also become trapped in the circulating oil. The oil becomes a working history of the machine. Particles caused by normal wear and operation will mix with the oil. Any externally caused contamination also enters the oil. By identifying and measuring these impurities, one can get an indication of the rate of wear and of any excessive contamination. An oil analysis will also suggest methods to reduce accelerated wear and contamination. The typical oil analysis tests for the presence of a number of different materials to determine sources of wear, find dirt and other contamination, and even check for the use of appropriate lubricants.
Today there exists a variety of oil based condition monitoring methods and techniques to check the volume and nature of foreign particles in oil for equipment health monitoring. These include spectrometric oil analysis, scanning electron microscopy/energy dispersive X-ray analysis, energy dispersive X-ray fluorescence, low powered optical microscopy, and ferrous debris quantification. One purpose of oil analysis is to provide a means of predicting possible impending failure without dismantling the equipment. One can “look inside” an engine, transmission or hydraulic system without taking it apart. For oil based monitoring there is no such clear cut distinction between normal and abnormal operation based on observed particle information in the oil samples. The foreign particles that accumulate in the lubricant oil increase monotonically, so that we may not be able to see a two-stage failure process as seen in vibration based monitoring. The causal relationship between the measured amount of particles in the oil and the state of the plant may also be bilateral in that, for example, the wear may cause the increase of observed metals in the oil, but the metals and other contaminants in the oil may also accelerate the wear. This marks a difference when modelling the state of the plant in oil based monitoring compared to vibration based monitoring.


5.2.3 Other Monitoring Techniques

The other popular condition monitoring techniques are infrared thermography, acoustics and motor current analysis. The basis of infrared thermography is quite simple. All objects emit heat or infrared electro-magnetic energy, but only a very small proportion of this energy is visible to the naked eye. At low temperatures an infrared camera must be used in order to ‘see’ the heat being emitted. The camera detects the invisible thermal energy and converts it to a visible image on a screen. The image can then be analyzed to identify any abnormality. The acoustic emission (AE) based method is widely used for monitoring the condition of rotating machinery. Compared to traditional vibration based methods, the high frequency approach of AE has the advantage of a significant improvement in signal to noise ratio. It can also be used for non-rotating machinery where defect activities do not generate distinct repetition frequencies and hence FFT analysis cannot be used. An item to note is that AE transducers need to have a relatively narrow band to be able to detect high frequency faults. Motor current signature analysis, which monitors the operating characteristics of an electric motor-operated device such as a motor-operated valve, has frequently been used for early detection of rotor related faults in AC induction motors. Frequency domain signal analysis techniques are applied to a conditioned motor current signal to identify distinctly various operating parameters of the motor driven device from the motor current signature. The signature may be recorded and compared with subsequent signatures to detect operating abnormalities and degradation of the device.
This diagnostic method does not require special equipment to be installed on the motor-operated device, and the current sensing may be performed at remote control locations, e.g., where the motor-operated devices are used in inaccessible or hostile environments. All the techniques briefly introduced above can offer some help for indicating the current state or condition of the plant monitored. Based on the technical analysis of the observed condition monitoring data, a maintenance decision has to be made to maintain the plant in a cost effective way. We discuss in the next section, how modeling can be used to support such a decision making utilizing available monitoring information.

5.3 Condition Based Maintenance Modelling

There is a basic but not always clearly answered question in condition monitoring: what is the purpose of condition monitoring? Have we lost sight of the ultimate need? Condition monitoring is not an end in itself; it involves an expenditure entered into by managers in the belief that it will save them money. How is this saving achieved? It can be obtained by using monitored condition information to optimise maintenance to achieve minimum breakdown of the plant with maximum availability for production, and to ensure that maintenance is only carried out when necessary. This is what one calls condition based maintenance, in which maintenance is only carried out when it becomes necessary utilizing available condition information, in contrast with the traditional breakdown or time based maintenance policies. But in reality, all too often we see effort and money spent on monitoring equipment for faults which rarely occur, and we also see planned maintenance being carried out when the equipment is perfectly healthy though the monitored information indicates something is “wrong”. A study of oil based condition monitoring of gear boxes of locomotives used by Canadian Pacific Railway (Aghjagan 1989) indicated that since condition monitoring was commissioned (entailing 3–4 samples per locomotive per week, 52 weeks per year), the incidence of gear box failures while in use fell by 90 %. This is a significant achievement. However, when the gear boxes were subsequently stripped down for reconditioning/overhaul, there was nothing evidently wrong in 50 % of cases. Clearly, condition monitoring can be highly effective, but may also be very inefficient at the same time. Modelling is necessary to improve the cost effectiveness and efficiency of condition monitoring.

5.3.1 The Decision Model

This is an extension of the age-based replacement model in that the replacement decision depends not only upon the age, but also upon the monitored information, plus other cost or downtime parameters. If we take the cost model as an example, then the decision model amounts to minimising the long run expected cost per unit time. We use the following notation:

$c_f$: The mean cost per failure
$c_p$: The mean cost per preventive replacement
$c_m$: The mean cost per condition monitoring
$t_i$: The $i$th and the current monitoring point
$Y_i$: Monitored information at $t_i$, with $y_i$ its observed value
$\Im_i$: History of observed condition variables to $t_i$, $\Im_i = \{y_1, \ldots, y_i\}$
$X_i$: The residual life at time $t_i$
$p_i(x_i \mid \Im_i)$: Pdf of $X_i$ conditional on $\Im_i$

The long term expected cost per unit time, $C(t)$, given that a preventive replacement is scheduled at time $t > t_i$, is given by (Wang 2003)

$$C(t) = \frac{(c_f - c_p)\, P(t - t_i \mid \Im_i) + c_p + i c_m}{t_i + (t - t_i)\bigl(1 - P(t - t_i \mid \Im_i)\bigr) + \int_0^{t - t_i} x_i\, p_i(x_i \mid \Im_i)\, dx_i} \tag{5.1}$$

where $P(t - t_i \mid \Im_i) = P(X_i < t - t_i \mid \Im_i) = \int_0^{t - t_i} p_i(x_i \mid \Im_i)\, dx_i$, which is the probability of a failure before $t$ conditional on $\Im_i$. The right hand side of Equation 5.1 is the expected cost per unit time formulated as a renewal reward function, though the lifetimes are independent but not identical.


The time point $t$ is usually bounded within the time period from the current to the next monitoring since a new decision shall be made once a new monitoring reading becomes available at time $t_{i+1}$. In general, if a minimum of $C(t)$ is found within the interval to the next monitoring in terms of $t$, then this $t$ should be the optimal replacement time. If no minimum is found, then the recommendation would be to continue to use the plant and evaluate Equation 5.1 at the next monitoring point when new information becomes available. For a graphical illustration of the above principle see Figure 5.1.

Figure 5.1. A graph to show the optimal replacement time (vertical axis: $C(t)$; horizontal axis: $t$, marking the current time $t_i$, the optimal replacement time $t^*$ and the next monitoring time $t_{i+1}$; a curve with no minimum in this interval is labelled "No replacement is recommended")
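The decision rule built on Equation 5.1 can be sketched numerically. The following is a minimal illustration, not the author's implementation: it assumes the residual life density is available on a grid, uses two hypothetical Weibull densities as stand-ins for $p_i(x_i \mid \Im_i)$, and borrows the cost figures quoted in the case study of Section 5.4.4. All function names are introduced here for illustration.

```python
import numpy as np

def trapezoid(f, x):
    """Trapezoidal rule on a (possibly non-uniform) grid."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(x)))

def expected_cost_rate(t, t_i, i, x_grid, pdf, c_f, c_p, c_m, m=400):
    """Equation 5.1 for a preventive replacement planned at time t > t_i.

    x_grid, pdf: residual life values and the density p_i(x | history) on them.
    """
    w = t - t_i                                  # time from now to the planned replacement
    xs = np.linspace(0.0, w, m)
    ps = np.interp(xs, x_grid, pdf)
    p_fail = trapezoid(ps, xs)                   # P(X_i < t - t_i | history)
    partial_mean = trapezoid(xs * ps, xs)        # integral of x * p_i(x) over [0, t - t_i]
    cost = (c_f - c_p) * p_fail + c_p + i * c_m
    length = t_i + w * (1.0 - p_fail) + partial_mean
    return cost / length

def recommend_replacement(t_i, t_next, i, x_grid, pdf, c_f, c_p, c_m, n=300):
    """Return the minimising t in (t_i, t_next), or None if C(t) is still falling at t_next."""
    ts = np.linspace(t_i, t_next, n + 1)[1:]
    costs = np.array([expected_cost_rate(t, t_i, i, x_grid, pdf, c_f, c_p, c_m) for t in ts])
    k = int(np.argmin(costs))
    return (None if k == len(ts) - 1 else float(ts[k])), costs

def weibull_pdf(x, shape, scale):
    """Hypothetical stand-in for the residual life density (assumption for this sketch)."""
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * np.exp(-z ** shape)

x_grid = np.linspace(0.0, 3000.0, 30001)
short_life = weibull_pdf(x_grid, 5.0, 10.0)      # residual life concentrated near 10 h
long_life = weibull_pdf(x_grid, 2.0, 500.0)      # residual life of several hundred hours

# Fifth monitoring point at t_i = 100 h, next monitoring at 130 h.
t_star, _ = recommend_replacement(100.0, 130.0, 5, x_grid, short_life, 6000.0, 2000.0, 30.0)
t_none, _ = recommend_replacement(100.0, 130.0, 5, x_grid, long_life, 6000.0, 2000.0, 30.0)
print("short residual life:", t_star)
print("long residual life:", t_none)
```

With the short residual life density the cost curve has an interior minimum, so a replacement time is returned; with the long one the curve is still decreasing at the next monitoring point and the sketch returns `None`, which corresponds to the "no replacement is recommended" case.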

Obviously the key element in Equation 5.1 is the determination of $p_i(x_i \mid \Im_i)$, which is the topic of the next two sections.

5.3.2 Modelling $p_i(x_i \mid \Im_i)$

Before we proceed to the discussion of the modelling of $p_i(x_i \mid \Im_i)$, there are a few issues that need clarification. The first relates to the concept of direct and indirect monitoring (Christer and Wang 1995). In direct monitoring, the actual condition of the item, say the depth of a brake pad, can be observed, and a critical level, say $C$, can be set up. In the indirect monitoring case, by contrast, we can only collect measurements related to the actual condition of the item monitored in a stochastic manner. For example, in the vibration monitoring case, if a high vibration signal is observed we may suspect the item’s condition might be bad, but we may neither know the exact condition of it, nor its quantification. For directly monitored systems, Markov models are popular; see Black et al. (2005), Chen and Trivedi (2005), and Love (2000). Counting processes have also been used for modeling the deterioration of directly monitored plant; see Aven (1996) and Jensen (1992). Christer and Wang (1992) used a random coefficient model for a directly monitored case. It is noted however that the majority of condition monitoring applications are indirect monitoring, such as the five popular monitoring techniques discussed earlier. This chapter therefore concentrates on indirect monitoring cases. The second issue is the appropriate definition of the plant state. This also relates to the first issue of whether the monitoring is direct or indirect. In direct monitoring, the actual observed condition of the item is clearly the plant state. In the indirect monitoring case we can only observe measures indirectly related to the actual condition of the item monitored, as discussed earlier. The most simple and intuitive definition is a set of categorical states ranging, say, from 0 (new) to $N$ (failed) as seen in Markov based models (Baruah and Chinnam 2005). Wang (2006a) also used a generic term of wear to represent the state of the monitored plant, which is particularly useful in modelling wear related problems in condition monitoring. Wang and Christer (2000) first used the residual life at the time of checking as a measure of the state of the monitored unit of interest. This definition provides an immediate modeling means to establish directly a link between the measured information and the residual life of interest. It is noted however that this residual life is usually not observable, which increases modeling complexity. A model of $p_i(x_i \mid \Im_i)$ introduced later will be based on this definition. Various methods or models have been proposed in the literature to formulate and calculate $p_i(x_i \mid \Im_i)$. Proportional Hazard Modeling (PHM, one particular and natural form for modelling the hazard) is a popular one; see Kumar and Westberg (1997), Love and Guo (1991), Makis and Jardine (1991), Jardine et al. (1998), and Banjevic et al. (2001). Accelerated life models (Kalbfleisch and Prentice 1980; Wang and Zhang 2005) could also be used here, and may be more appropriate since the analogy between accelerated life testing, where these models originate, and condition monitoring is a close one.
It should be noted that accelerated life models and proportional hazard models are identical when the time to failure distribution is Weibull, that is, when the hazard function is given by $h(t) = \alpha \beta t^{\beta - 1}$.
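A short derivation may make this equivalence concrete. It is a sketch under notation introduced here for illustration only: $z$ is a single covariate, $\gamma$ the proportional hazards coefficient, $\theta$ the accelerated life coefficient, and $h_0$, $S_0$ the baseline hazard and survival functions.

```latex
% Baseline Weibull hazard and survival function
h_0(t) = \alpha \beta t^{\beta - 1}, \qquad S_0(t) = e^{-\alpha t^{\beta}}

% Proportional hazards: the covariate multiplies the hazard
h_{\mathrm{PH}}(t \mid z) = h_0(t)\, e^{\gamma z} = \alpha \beta t^{\beta - 1} e^{\gamma z}

% Accelerated life: the covariate rescales time, S(t | z) = S_0(t e^{\theta z})
h_{\mathrm{AL}}(t \mid z) = e^{\theta z}\, h_0\!\left(t e^{\theta z}\right)
                          = e^{\theta z}\, \alpha \beta \left(t e^{\theta z}\right)^{\beta - 1}
                          = \alpha \beta t^{\beta - 1} e^{\beta \theta z}

% The two hazards coincide when
\gamma = \beta \theta
```

Under these assumptions the two models differ only in how the covariate coefficient is parameterised.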

There are two problems with proportional hazards modeling or accelerated life models in condition based maintenance. The first is that the current hazard is determined partially by the current monitoring measurements and the full monitoring history is not used. The second is the assumption that the hazard or the life is a function of the observed monitoring data which acts directly on the hazard via a covariate function. Both problems relate to the modeling assumption rather than the technique. The first can be overcome if some sort of transformation of the observed data is used. The second problem remains unless the nature of monitoring indicates so. It is noted however that, for most condition monitoring techniques, the observed monitoring measurements are concomitant types of information which are a function of the underlying plant state. A typical example is in vibration monitoring where a high level of vibration is usually caused by a hidden defect but not vice versa, as we have discussed earlier. In this case the observed vibration signals may be regarded as concomitant variables which are caused by the plant state. Note that in oil based monitoring things are different, as the metal particles and other contaminants observed in the oil can be regarded both as concomitant variables and as covariates, as we discussed earlier. In this case a model that considers both kinds of variable might be appropriate. The last decade has seen an increased use of stochastic filtering and Hidden Markov Models (HMM) for modelling $p_i(x_i \mid \Im_i)$ in condition based maintenance; see Hontelez et al. (1996), Christer et al. (1997), Wang and Christer (2000), Bunks et al. (2000), Dong and He (2004), Lin and Makis (2003, 2004), Baruah and Chinnam (2005), and Wang (2006a). These techniques overcome both problems of PHM and provide a flexible way to model the relationship between the observed signals and the unobserved plant state. HMM can be seen as a specific type of stochastic filtering model that is usually used for discrete state and observation variables. If the noise factors in the model are not Gaussian, then a closed form for $p_i(x_i \mid \Im_i)$ is generally not available and one has to resort to numerical approximations. A comparison study using both filtering (Wang 2002) and PHM (Makis and Jardine 1991) based on vibration data revealed that the filtering based model produced a better result in terms of prediction accuracy (Matthew and Wang 2006). It should also be noted that if the monitored variables also influence the state to some extent, then both HMM and PHM should be used to tackle the problem. Alternatively an interactive HMM can also be formulated where a bilateral relationship is assumed between the observed and the unobserved. In the next section, we shall discuss in detail a specific filtering model used for the derivation of $p_i(x_i \mid \Im_i)$. This model is simple to use and is analytically tractable.

5.4 Conditional Residual Life Prediction

First we define the true state of plant as the residual life conditional upon measured condition related information to date, such as vibration, temperature, etc. Next we assume these pieces of condition information are functions of the residual life, that is, it is the residual life which controls the behavior of the measured condition information, but not vice versa (this assumption can be relaxed). Generally we expect that a short residual life (depending on the severity of the defect) will generate a high signal level in some of the measures of condition variables, though in a typical stochastic fashion. In theory, we may have the following relationship:

Defect → Short residual life → Higher than normal signal may be observed.

If the severity of the defect is represented by the length of the residual life, the relationship between the residual life and observed condition related variables follows.


5.4.1 Conditional Residual Life Prediction

The model is built based on the following assumptions:

1. Plant items are monitored regularly at discrete time points.
2. There are two periods in the plant life, where the first period is the time length from new to the point when the item was first identified to be faulty, and the second period is the time interval from this point to failure if no maintenance intervention is carried out. The second period is often called the failure delay time. It is also assumed that these two periods are statistically independent of each other.
3. A threshold level is established to classify the item monitored to be in a potentially faulty state if the condition information signal is above the level. Such a threshold level is usually determined by engineering experience or by a statistical analysis of measured condition related variables.
4. The condition information obtained at time $t_i$, $y_i$, during the failure delay time is a random variable which depends on $x_i$.

Assumptions 1 and 2 can often be observed in condition monitoring practice. Assumption 3 can be relaxed, and a model which can both identify the starting point of the second stage and predict the residual life can be established (Wang 2006b). For now, to keep the model simple we still use assumption 3. Assumption 4 was first proposed in Wang and Christer (2000), which states that the rapid increase in the observed condition information is partly due to the shortened residual life because of the hidden defect. However this relationship is contaminated with random noise. Assumption 4 is the fundamental principle underpinning our model. For a detailed discussion on assumption 4 see Wang and Christer (2000).

Because the interest in residual life prediction is over the failure delay time (assuming it exists) and the information collected over the normal working period may not be beneficial for residual life prediction, we revise our notation on $t_i$ as the $i$th and the current monitoring time since the item was suspected to be faulty but still operating (note that the order starts from the moment when the item was first identified to be possibly faulty). This implies that $t_1$ is the first monitoring point which may indicate that the second stage has started. However, some monitoring, such as oil based monitoring, may not be able to display a two-stage process. If this is the case, we can simply set the threshold level to be zero. Figure 5.2 shows a typical condition monitoring practice. It is noted from Figure 5.2 that the condition information obtained before $t_1$ is not used since it is irrelevant to the decision making process. It is noted however that the time to $t_1$ is one of the important information sources to be used in determining the condition monitoring interval (Wang 2003). Since the residual life at $t_i$ is the residual life at $t_{i-1}$ minus the interval between $t_i$ and $t_{i-1}$, provided the item has survived to $t_i$ and no maintenance action has been taken, it follows that

$$X_i = \begin{cases} X_{i-1} - (t_i - t_{i-1}) & \text{if } X_{i-1} > t_i - t_{i-1} \\ \text{not defined} & \text{otherwise} \end{cases} \tag{5.2}$$


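Figure 5.2. Condition monitoring practice (vertical axis: reading $y_i$ with the threshold level marked; horizontal axis: time from 0 through the monitoring points $t_1$, $t_2$, $t_3$ to failure; the readings $y_1$, $y_2$, $y_3$ and the corresponding residual lives $x_1$, $x_2$, $x_3$ are indicated)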

The relationship between Yi and X i is yet to be identified. From assumption 4 we know that it can be described by a distribution, say, p( yi | xi ) . We will discuss this later when fitting the model to data. We wish to establish the expression of pi ( xi | ℑi ) , and therefore a consequential decision model can be constructed on the basis of such a conditional probability; see Equation 5.1. Since ℑi = { y1 , y2 ,..., yi } = { yi , ℑi −1} , then pi ( xi | ℑi ) can be expressed as pi ( xi | ℑi ) = p( xi | yi , ℑi −1 ) . It follows that

$$p_i(x_i \mid \Im_i) = p(x_i \mid y_i, \Im_{i-1}) = \frac{p(x_i, y_i \mid \Im_{i-1})}{p(y_i \mid \Im_{i-1})} \tag{5.3}$$

By using the multiplicative rule, the joint distribution $p(x_i, y_i \mid \Im_{i-1})$ is given as

$$p(x_i, y_i \mid \Im_{i-1}) = p(y_i \mid x_i, \Im_{i-1})\, p(x_i \mid \Im_{i-1}) \tag{5.4}$$

By assumption 4, given both $x_i$ and $\Im_{i-1}$, $y_i$ depends on $x_i$ only, so Equation 5.4 reduces to

$$p(x_i, y_i \mid \Im_{i-1}) = p(y_i \mid x_i, \Im_{i-1})\, p(x_i \mid \Im_{i-1}) = p(y_i \mid x_i)\, p(x_i \mid \Im_{i-1}) \tag{5.5}$$

Integrating out the $x_i$ term in Equation 5.5 we have

$$p(y_i \mid \Im_{i-1}) = \int_0^{\infty} p(x_i, y_i \mid \Im_{i-1})\, dx_i = \int_0^{\infty} p(y_i \mid x_i)\, p(x_i \mid \Im_{i-1})\, dx_i \tag{5.6}$$


We focus our attention on $p(x_i \mid \Im_{i-1})$, which appears both in Equation 5.4 and Equation 5.6. From Equation 5.2 we have $x_{i-1} = g(x_i) = x_i + (t_i - t_{i-1})$ conditional on $X_{i-1} > t_i - t_{i-1}$. Then the distribution of $X_i \mid \Im_{i-1}$ can be expressed by a transformation of variables from $X_i$ to $X_{i-1}$ (Freund 2004) as

$$p(x_i \mid \Im_{i-1}) = p_{i-1}\bigl(g(x_i) \mid \Im_{i-1}, X_{i-1} > t_i - t_{i-1}\bigr)\, \frac{dg(x_i)}{dx_i} \tag{5.7}$$

Since $\dfrac{dg(x_i)}{dx_i} = 1$ and

$$p_{i-1}\bigl(g(x_i) \mid \Im_{i-1}, X_{i-1} > t_i - t_{i-1}\bigr) = \frac{p_{i-1}(g(x_i) \mid \Im_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\, dx_{i-1}} \tag{5.8}$$

we finally have

$$p(x_i \mid \Im_{i-1}) = \frac{p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})}{\int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\, dx_{i-1}} \tag{5.9}$$

Using Equations 5.5, 5.6 and 5.9, Equation 5.3 becomes

$$p_i(x_i \mid \Im_i) = \frac{p(y_i \mid x_i)\, p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})}{\int_0^{\infty} p(y_i \mid x_i)\, p_{i-1}(x_i + t_i - t_{i-1} \mid \Im_{i-1})\, dx_i} \tag{5.10}$$

which is a recursive equation that starts at time $t_1$. At time $t_1$, using Equation 5.10 we have

$$p_1(x_1 \mid \Im_1) = \frac{p(y_1 \mid x_1)\, p_0(x_1 + t_1 - t_0 \mid \Im_0)}{\int_0^{\infty} p(y_1 \mid x_1)\, p_0(x_1 + t_1 - t_0 \mid \Im_0)\, dx_1} \tag{5.11}$$

Since $\Im_0$ is usually empty or not available, $p_0(x_1 + t_1 - t_0 \mid \Im_0) = p_0(x_1 + t_1 - t_0)$; then, if $p_0(x_0)$ and $p(y_1 \mid x_1)$ can be specified, Equation 5.11 can be determined. Similarly we can proceed to determine $p_i(x_i \mid \Im_i)$ if $p_{i-1}(x_{i-1} \mid \Im_{i-1})$ is available from the previous step calculation at time $t_{i-1}$ and $p(y_i \mid x_i)$ is specified. Now the task is how to specify $p_0(x_0)$ and $p(y_i \mid x_i)$.
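The recursion of Equation 5.10 can be carried out numerically on a grid when no closed form is at hand. The sketch below is illustrative only. The prior is a hypothetical Weibull density, and the observation model is a hypothetical Gaussian whose mean rises as the residual life shortens; it is a placeholder for $p(y_i \mid x_i)$, which the text specifies in the next section. All names and parameter values are assumptions made for this example.

```python
import numpy as np

def weibull_pdf(x, shape, scale):
    """Hypothetical prior p_0 for the delay time (assumption for this sketch)."""
    z = x / scale
    return (shape / scale) * z ** (shape - 1.0) * np.exp(-z ** shape)

def gaussian_obs_pdf(y, x, a=5.0, b=20.0, c=0.05, sd=2.0):
    """Placeholder p(y | x): the mean reading rises as the residual life x shortens."""
    mean = a + b * np.exp(-c * x)
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def filter_step(x, post_prev, dt, lik):
    """One update of Equation 5.10 on a uniform grid x.

    post_prev: p_{i-1}(x | history) on the grid; dt = t_i - t_{i-1};
    lik: p(y_i | x) on the grid. The survival term of Equation 5.9 cancels
    in the ratio, so only the shifted density is needed.
    """
    dx = x[1] - x[0]
    shifted = np.interp(x + dt, x, post_prev, right=0.0)   # p_{i-1}(x + dt | history)
    unnorm = lik * shifted
    return unnorm / (unnorm.sum() * dx)

def run_filter(x, p0, times, readings, obs_pdf):
    """Apply the recursion from t_0 = 0 through all monitoring points."""
    post, t_prev = p0, 0.0
    for t, y in zip(times, readings):
        post = filter_step(x, post, t - t_prev, obs_pdf(y, x))
        t_prev = t
    return post

def mean_residual_life(x, post):
    return float(np.sum(x * post) * (x[1] - x[0]))

x = np.linspace(0.0, 600.0, 6001)
p0 = weibull_pdf(x, 2.0, 100.0)
times = [10.0, 20.0, 30.0]
post_high = run_filter(x, p0, times, [15.0, 18.0, 22.0], gaussian_obs_pdf)  # rising readings
post_low = run_filter(x, p0, times, [6.0, 6.0, 6.0], gaussian_obs_pdf)      # flat, low readings
print(mean_residual_life(x, post_high), mean_residual_life(x, post_low))
```

Rising readings pull the posterior towards short residual lives, while flat low readings leave it close to the shifted prior.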


5.4.2 Specification of $p_0(x_0)$ and $p(y_i \mid x_i)$

$p_0(x_0)$ is just the delay time distribution over the second stage of the plant life. Here we use the Weibull distribution as an example in this context. In practice or theory, the density function $p_0(x_0)$ should be chosen as the one which best fits the data or follows from some known theory. The set-up of the $p(y_i \mid x_i)$ term requires more attention. Here we follow the one used in Wang (2002), where $y_i \mid x_i$ is assumed to follow a Weibull distribution with the scale parameter being equal to the inverse of $A + Be^{-cx_i}$. In this way we establish a negative correlation between $y_i$ and $x_i$ as expected, that is, $E(Y_i \mid X_i = x_i) \propto A + Be^{-cx_i}$. The pdf is given below:

$$p(y_i \mid x_i) = \frac{\eta}{A + Be^{-cx_i}} \left(\frac{y_i}{A + Be^{-cx_i}}\right)^{\eta - 1} e^{-\left(\frac{y_i}{A + Be^{-cx_i}}\right)^{\eta}} \tag{5.12}$$
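As a numerical check of Equation 5.12, the sketch below evaluates the density for several residual lives and confirms that it integrates to one and that its mean is proportional to $A + Be^{-cx_i}$. The parameter values are the estimates reported in Table 5.2 of the case study; the function names and the chosen residual lives are assumptions made for this illustration.

```python
import math
import numpy as np

def obs_pdf(y, x, A, B, c, eta):
    """Equation 5.12: Weibull density of the reading y given residual life x."""
    scale = A + B * np.exp(-c * x)     # shrinks towards A as the residual life grows
    z = y / scale
    return (eta / scale) * z ** (eta - 1.0) * np.exp(-z ** eta)

def trapezoid(f, t):
    """Trapezoidal rule on a grid t."""
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t)))

A, B, c, eta = 7.069, 27.089, 0.053, 4.559   # estimates from Table 5.2
y = np.linspace(0.0, 200.0, 200001)

areas, means = {}, {}
for x in (0.0, 20.0, 100.0):                 # residual lives in hours (illustrative)
    pdf = obs_pdf(y, x, A, B, c, eta)
    areas[x] = trapezoid(pdf, y)
    means[x] = trapezoid(y * pdf, y)
    print(x, round(areas[x], 4), round(means[x], 2))

# For a Weibull density the mean is scale * Gamma(1 + 1/eta).
expected_mean_at_zero = (A + B) * math.gamma(1.0 + 1.0 / eta)
```

The mean reading falls as the residual life increases, which is the negative relationship the model is designed to capture.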

The scale parameter in Equation 5.12 varies with $x_i$; this is the concept of a floating scale parameter, which is particularly useful in this case (Wang 2002). There are other choices to model the relationship between $y_i$ and $x_i$, but these will not be discussed here, and can be found in Wang (2006a).

5.4.3 Estimating the Model Parameters Within $p_i(x_i \mid \Im_i)$

To calculate the actual $p_i(x_i \mid \Im_i)$ we need to know the values of the model parameters. They are the parameters of $p_0(x_0)$ and $p(y_i \mid x_i)$. The most popular way to estimate them is the method of maximum likelihood. At each monitoring point $t_i$, two pieces of information are available, namely $y_i$ and $X_{i-1} > t_i - t_{i-1}$, both conditional on $\Im_{i-1}$. The pdf of $y_i \mid \Im_{i-1}$ is given by Equation 5.6, and the probability of $X_{i-1} > t_i - t_{i-1} \mid \Im_{i-1}$ is given by

$$P(X_{i-1} > t_i - t_{i-1} \mid \Im_{i-1}) = \int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\, dx_{i-1} \tag{5.13}$$

If the item monitored failed at time $t_f$ after the last monitoring at time $t_n$, the complete likelihood function is then given by

$$L(\Theta) = \left(\prod_{i=1}^{n} p(y_i \mid \Im_{i-1}) \int_{t_i - t_{i-1}}^{\infty} p_{i-1}(x_{i-1} \mid \Im_{i-1})\, dx_{i-1}\right) p_n(t_f - t_n \mid \Im_n) \tag{5.14}$$

where Θ is the set of parameters to be estimated. Taking logs on both sides of Equation 5.14 and maximising it in terms of unknown parameters should give the estimated values of those parameters. However, computationally it has to be solved numerically since Equation 5.14 involves many integrals which may not have analytical solutions.
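A grid-based evaluation of the log of Equation 5.14 for a single item might look as follows. This is a sketch under stated assumptions, not the estimation procedure used in the cited work: the monitoring times, readings and failure time are hypothetical, the integrals are replaced by simple sums on a uniform grid, and the function names are introduced here. The first parameter vector uses the estimates reported in Table 5.2; the second is a deliberately mis-scaled alternative for comparison.

```python
import numpy as np

def weibull_pdf(x, alpha, beta):
    """p_0(x) = alpha*beta*(alpha*x)^(beta-1) * exp(-(alpha*x)^beta)."""
    z = alpha * x
    return alpha * beta * z ** (beta - 1.0) * np.exp(-z ** beta)

def obs_pdf(y, x, A, B, c, eta):
    """Equation 5.12."""
    scale = A + B * np.exp(-c * x)
    z = y / scale
    return (eta / scale) * z ** (eta - 1.0) * np.exp(-z ** eta)

def log_likelihood(theta, times, readings, t_fail, x):
    """Log of Equation 5.14 for one failed item, on a uniform grid x."""
    alpha, beta, A, B, c, eta = theta
    dx = x[1] - x[0]
    post = weibull_pdf(x, alpha, beta)                         # p_0 on the grid
    t_prev, ll = 0.0, 0.0
    for t, y in zip(times, readings):
        dt = t - t_prev
        surv = np.sum(post[x >= dt]) * dx                      # Equation 5.13
        prior = np.interp(x + dt, x, post, right=0.0) / surv   # Equation 5.9
        joint = obs_pdf(y, x, A, B, c, eta) * prior            # Equation 5.5
        pred = np.sum(joint) * dx                              # Equation 5.6
        ll += np.log(surv) + np.log(pred)
        post = joint / pred                                    # Equation 5.10
        t_prev = t
    ll += np.log(np.interp(t_fail - t_prev, x, post))          # p_n(t_f - t_n | history)
    return float(ll)

x = np.linspace(0.0, 1000.0, 10001)
times = [10.0, 20.0, 30.0, 40.0]       # hypothetical monitoring times (h)
readings = [8.0, 9.0, 11.0, 14.0]      # hypothetical readings
t_fail = 52.0                          # hypothetical failure time (h)

theta_table = (0.011, 1.873, 7.069, 27.089, 0.053, 4.559)     # Table 5.2 estimates
theta_bad = (0.011, 1.873, 70.69, 270.89, 0.053, 4.559)       # scale inflated tenfold
ll_table = log_likelihood(theta_table, times, readings, t_fail, x)
ll_bad = log_likelihood(theta_bad, times, readings, t_fail, x)
print(ll_table, ll_bad)
```

In practice the log-likelihoods of all monitored items would be summed and passed to a numerical optimiser; that step is omitted here.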


5.4.4 A Case Study

Figure 5.3 shows the data of overall vibration level in rms of six bearings, which is from a fatigue experiment (Wang 2002). It can be seen from Figure 5.3 that the bearing lives vary from around 100 h to over 1000 h, which shows a typical stochastic nature of the life distribution. The monitored vibration signals also indicate an increasing trend with bearing ages in all cases, but with different paths. An important observation is the pattern of vibration signals which stays relatively flat in the early stage of the bearing life and then increases rapidly (a defect may have been initiated). This indicates the existence of the two stage failure process as defined earlier.

Figure 5.3. Vibration data of six bearings

The initial point of the second stage in these bearings is identified using a control chart called the Shewhart average level chart and the threshold levels of the bearings are shown in Table 5.1 (Zhang 2004).

Table 5.1. Threshold level for each bearing

Bearing          1     2     3     4     5     6
Threshold level  5.06  5.62  4.15  5.14  3.92  4.9
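The text does not give the construction of the control chart, so the following is only a generic sketch of how a Shewhart-type upper limit could flag the start of the second stage. It assumes a limit of the baseline mean plus three standard deviations, estimated from early readings taken to be in the normal stage; the readings are hypothetical and the function names are introduced here.

```python
import numpy as np

def shewhart_upper_limit(baseline, k=3.0):
    """Upper control limit: baseline mean plus k sample standard deviations (assumed rule)."""
    baseline = np.asarray(baseline, dtype=float)
    return float(baseline.mean() + k * baseline.std(ddof=1))

def first_exceedance(readings, limit):
    """Index of the first reading above the limit, or None if there is none."""
    above = np.flatnonzero(np.asarray(readings, dtype=float) > limit)
    return int(above[0]) if above.size else None

# Hypothetical rms vibration readings: flat at first, then rising.
readings = [3.1, 3.3, 2.9, 3.2, 3.0, 3.4, 3.1, 3.2, 4.9, 6.3, 8.8]
limit = shewhart_upper_limit(readings[:8])     # first eight readings taken as the normal stage
start = first_exceedance(readings, limit)
print(round(limit, 3), start)
```

The index returned would play the role of $t_1$, the first monitoring point at which the second stage may have started.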


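Assuming both distributions for $p_0(x_0)$ and $p(y_i \mid x_i)$ are Weibull, where

$$p_0(x_0) = \alpha\beta(\alpha x_0)^{\beta - 1} e^{-(\alpha x_0)^{\beta}}$$

and

$$p(y_i \mid x_i) = \frac{\eta}{A + Be^{-cx_i}} \left(\frac{y_i}{A + Be^{-cx_i}}\right)^{\eta - 1} e^{-\left(\frac{y_i}{A + Be^{-cx_i}}\right)^{\eta}},$$

then starting from $t_1$ and after recursive filtering we have

$$p_i(x_i \mid \Im_i) = \frac{(x_i + t_i)^{\beta - 1} e^{-(\alpha(x_i + t_i))^{\beta}} \prod_{k=1}^{i} \psi_k(x_i, t_i)}{\int_0^{\infty} (z + t_i)^{\beta - 1} e^{-(\alpha(z + t_i))^{\beta}} \prod_{k=1}^{i} \psi_k(z, t_i)\, dz} \tag{5.15}$$

where

$$\psi_k(z, t_i) = \frac{e^{-\left(y_k \left(A + Be^{-C(z + t_i - t_k)}\right)^{-1}\right)^{\eta}}}{\left(A + Be^{-C(z + t_i - t_k)}\right)^{\eta}}$$

and $C$ denotes the same parameter as $c$ in Equation 5.12.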

To estimate the parameters in $p_0(x_0)$ and $p(y_i \mid x_i)$ we need to write down the likelihood function as in Equation 5.14. The actual process of estimating these unknown parameters is complicated and involves heavy numerical manipulation, which we omit; interested readers can find the details in Zhang (2004). The estimated result is listed in Table 5.2.

Table 5.2. Estimated parameter values in $p_0(x_0)$ and $p(y_i \mid x_i)$

Parameter  $\hat{\alpha}$  $\hat{\beta}$  $\hat{A}$  $\hat{B}$  $\hat{C}$  $\hat{\eta}$
Estimate   0.011           1.873          7.069      27.089     0.053      4.559

Based on the estimated parameter values in Table 5.2 and Equation 5.15 the predicted residual life at some monitoring points given the history information of bearing 6 in Figure 5.3 is plotted in Figure 5.4. In Figure 5.4 the actual residual lives at those checking points are also plotted with symbol *. It can be seen that actual residual lives are well within the predicted residual life distribution as expected. Given the estimated values for parameters and associated costs such as c f = 6000 , c p = 2000 and cm = 30 (Wang and Jia 2001) we have the expected cost per unit time for one of the bearings at various checking time t, shown in Figure 5.5.
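Equation 5.15 with the estimates of Table 5.2 can be evaluated directly on a grid. The sketch below does this in log form to avoid numerical underflow. The monitoring times and readings are hypothetical (the bearing data of Figure 5.3 are not reproduced in the text), and the function names are introduced here for illustration.

```python
import numpy as np

# Estimates from Table 5.2
ALPHA, BETA, A, B, C, ETA = 0.011, 1.873, 7.069, 27.089, 0.053, 4.559

def residual_life_pdf(z, times, readings):
    """Equation 5.15 on a uniform grid z, given readings y_1..y_i at times t_1..t_i."""
    t_i = times[-1]
    # Log of the Weibull prior term (z + t_i)^(beta-1) * exp(-(alpha (z + t_i))^beta)
    log_p = (BETA - 1.0) * np.log(z + t_i) - (ALPHA * (z + t_i)) ** BETA
    for t_k, y_k in zip(times, readings):
        s = A + B * np.exp(-C * (z + t_i - t_k))
        log_p += -ETA * np.log(s) - (y_k / s) ** ETA        # log psi_k(z, t_i)
    p = np.exp(log_p - log_p.max())                          # rescale before normalising
    return p / (p.sum() * (z[1] - z[0]))

def mean_residual_life(z, pdf):
    return float(np.sum(z * pdf) * (z[1] - z[0]))

z = np.linspace(0.0, 1000.0, 10001)
times = [10.0, 20.0, 30.0, 40.0]                 # hypothetical monitoring times (h)
pdf_mild = residual_life_pdf(z, times, [8.0, 9.0, 10.0, 11.0])    # slowly rising readings
pdf_severe = residual_life_pdf(z, times, [8.0, 9.0, 11.0, 14.0])  # faster rising readings
print(mean_residual_life(z, pdf_mild), mean_residual_life(z, pdf_severe))
```

Faster rising readings shift the predicted residual life distribution towards shorter lives, in line with assumption 4.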


Figure 5.4. Predicted condition residual life of bearing 6

Figure 5.5. Expected cost per unit time vs. planned replacement time in hours from the current time t (vertical axis: expected cost per unit time; horizontal axis: planned replacement time, 0 to 30 h; one curve for each of t = 80.5, 92.5, 104, 116.5 and 129 h)

It can be seen from Figure 5.5 that at both t = 116.5 and 129 h planned replacements are recommended within the next 30 h. To illustrate an alternative decision chart in terms of the actual condition monitoring reading, we transformed the cost related decision into the actual reading in Figure 5.6, where the dark grey area indicates that if the reading falls within this area a preventive replacement is required within the planning period of consideration. The advantage of Figure 5.6 is that it not only tells us whether a preventive replacement is needed but also shows us how far the reading is from the area of preventive replacement, so that appropriate preparation can be done before the actual replacement.


Figure 5.6. Decision chart using observed CM reading (vertical axis: observed CM reading, 0 to 14; horizontal axis: time (age in hours) at which the CM reading was taken, at 80.5, 92.5, 104, 116.5 and 129 h; the chart is divided into a preventive replacement area and a no preventive replacement area)

The transformation is carried out in this way – at each monitoring point of ti , by gradually changing the value of yi in pi ( xi | ℑi ) used in Equation 5.1 until a preventive replacement is recommended by the model within the planning period, and then marking this value of yi as the threshold value at time ti . Connecting these threshold values at those monitoring points forms the boundary between the light and dark grey areas. Finally mark the actual reading of yi on the graph to see which area it falls in.

5.5 Future Research Directions

5.5.1 Multi-component Systems

Previous condition based prognosis models developed in the literature mainly focused on a single failure mode system subject to routine monitoring and replacement such as bearings, pumps and motors, and various probability distributions are used to describe the lifetime of the component. In the case of a high value and high risk system with many components such as aircraft engines and gas turbines, how to assess the health condition and make prognosis based on condition information obtained from all components is still an open question. It is typical with a multicomponent system that many observed signal parameters are available and the times between failures are neither independent nor identical.

5.5.2 Identification of the Initial Point of a Random Defect

With the delay time concept (see Chapter 14), system life is assumed to be classified into two stages. The first is the normal working stage where no abnormal condition parameters are to be expected. The second starts when a hidden defect is first initiated with possible abnormal signals. The identification of the initial point in the evolution of such a defect is important and has a direct impact on the subsequent prediction model. Most research on fault diagnosis focuses on the location of the fault, the possible cause of the fault and, of course, the type of fault. This serves the engineering purpose of deciding what to repair, but does not aid the decision of when to do the task. This initial point defect identification has received very little attention in prognosis literature. Wang (2006b) addressed this problem to some extent using a combination of the delay time concept and the HMM. Much work still remains. It is possible that a multi-stage (>2) failure process could be used, which might be more appropriate in some cases.

5.5.3 The Definition of Plant State

The definition of the underlying state and the relationship between the observed monitoring parameters and the state of the system are issues which still need attention. In the model presented in this chapter, the state of the system is defined as the residual life, which is assumed to influence the observed signal parameters. Whilst the modelling output appears to make sense, there are a few potential problems with the approach. The first is the issue that the life of the plant is fixed at birth (installation) but unknown. This is termed as playing God. Second, the residual life is not the direct cause of the observed abnormal signals. These are more likely caused by some hidden defects which are linked to the residual life in this chapter. To correct the first problem we can introduce another equation describing the relationship between $X_i$ and $X_{i-1}$ deterministically or randomly. This will allow $X_i$ to change during use, which is more appropriate. If the relationship is deterministic, then a closed form of Equation 5.3 is still available, but if it is random, HMM must be used and no closed form of Equation 5.3 exists unless the noises associated are normally distributed.
The second problem can be overcome if we adopt a discrete or continuous state hidden Markov chain to describe the system deterioration process where the state space of the chain represents the system state in question.

5.5.4 Information Fusion

There is now a considerable amount of condition monitoring and process control information available in industry, thanks to recent developments in condition monitoring technology. It is noted that not all information is useful, and because of correlation one may obtain similar information from different sources. There are two ways to deal with this. One is to use some statistical method to reduce the dimension of the original data, such as principal component analysis, and the other is to use multi-variate distributions. The principal component analysis method has been used in Wang and Zhang (2005), but unless the first principal component accounts for most of the variation in the original data we still need to deal with a data set with more than two dimensions. The use of multi-variate distributions in prognosis has not been reported apart from the normal distribution, which has the drawback of producing negative values. A final point worth mentioning is that, in practice, observed condition monitoring variables could be concomitant variables or covariates with respect to the system state. A model which can handle both types of information is ideal, but very few attempts have been made (Hussin and Wang 2006).
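The dimension reduction step mentioned in Section 5.5.4 can be sketched with a principal component analysis computed from a singular value decomposition. The data below are synthetic: four hypothetical monitored variables are generated from one common latent signal plus noise, so that the first principal component is expected to dominate. Function and variable names are assumptions made for this illustration.

```python
import numpy as np

def pca_explained_variance(data):
    """Return the share of variance of each principal component and the component vectors."""
    centred = data - data.mean(axis=0)
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    var = s ** 2 / (data.shape[0] - 1)      # variances along the principal directions
    return var / var.sum(), vt

rng = np.random.default_rng(0)
latent = rng.normal(size=200)               # hypothetical common degradation signal
data = np.column_stack([latent + 0.1 * rng.normal(size=200) for _ in range(4)])

ratio, components = pca_explained_variance(data)
scores = (data - data.mean(axis=0)) @ components[0]   # first principal component scores
print(np.round(ratio, 3))
```

When the first entry of `ratio` is close to one, the scores on the first component could serve as a single monitored variable; otherwise more than one component has to be carried forward, which is the difficulty noted in the text.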

5.6 Summary and Conclusions

This chapter introduces the concept of condition monitoring, key condition monitoring techniques, condition based maintenance and associated modelling support in aid of condition based maintenance. Particular attention is paid to the residual time prediction based on available condition information to date. An important development made here is the establishment of the relationship between the observed information and underlying condition which is the residual life in this case. This is achieved by letting the mean of the observed information at $t_i$ be a function of the residual life at that point conditional on $X_i = x_i$. The mathematical development is based on a recursive algorithm called filtering where all past information is included. The example illustrated is based on real data which came from a fatigue experiment. However, data from industry has shown the robustness of the approach and the residual life predictions conducted so far are satisfactory.

5.7 References

Aghjagan, H.N., (1989) Lubeoil analysis expert system, Canadian Maintenance Engineering Conference, Toronto.
Aven, T., (1996) Condition based replacement policies – a counting process approach, Rel. Eng. & Sys. Safety, 51(3), 275–281.
Banjevic, D., Jardine, A.K.S., Makis, V. and Ennis, M., (2001) A control-limit policy and software for condition based maintenance optimization, INFOR, 39(1), 32–50.
Baruah, P. and Chinnam, R.B., (2005) HMM for diagnostics and prognostics in machining processes, I. J. Prod. Res., 43(6), 1275–1293.
Black, M., Brint, A.T. and Brailsford, J.R., (2005) A semi-Markov approach for modelling asset deterioration, J. Opl. Res. Soc., 56(11), 1241–1249.
Bunks, C., McCarthy, D. and Al-Ani, T., (2000) Condition based maintenance of machine using hidden Markov models, Mech. Sys. & Sig. Pro., 14(4), 597–612.
Chen, D. and Trivedi, K.S., (2005) Optimization for condition based maintenance with semi-Markov decision process, Rel. Eng. & Sys. Safety, 90(1), 25–29.
Chen, W., Meher-Homji, C.B. and Mistree, F., (1994) COMPROMISE: an effective approach for condition-based maintenance management of gas turbines, Engineering Optimization, 22, 185–201.
Christer, A.H., Wang, W. and Sharp, J.M., (1997) A state space condition monitoring model for furnace erosion prediction and replacement, Euro. J. Opl. Res., 101, 1–14.
Christer, A.H. and Wang, W., (1992) A model of condition monitoring inspection of production plant, I. J. Prod. Res., 30, 2199–2211.
Christer, A.H. and Wang, W., (1995) A simple condition monitoring model for a direct monitoring process, Euro. J. Opl. Res., 82, 258–269.
Collacott, R.A., (1977) Mechanical fault diagnosis and condition monitoring, Chapman and Hall Ltd., London.
Dong, M. and He, D., (2004) Hidden semi-Markov models for machinery health diagnosis and prognosis, Trans. North Amer. Manu. Res. Ins. of SME, 32, 199–206.
Drake, P.R., Jennings, A.D., Grosvenor, R.I. and Whittleton, D., (1995) acquisition system for machine tool condition monitoring, Quality and Reliability Engineering International, 11, 15–26.
Freund, J.E., (2004) Mathematical statistics with applications, Pearson Prentice and Hall, London.
Harrison, N., (1995) Oil condition monitoring for the railway business, Insight, 37, 278–283.
Hontelez, J.A.M., Burger, H.H. and Wijnmalen, D.J.D., (1996) Optimum condition based maintenance policies for deteriorating systems with partial information, Rel. Eng. & Sys. Safety, 51(3), 267–274.
Hussin, B. and Wang, W., (2006) Conditional residual time modelling using oil analysis: a mixed condition information using accumulated metal concentration and lubricant measurements, to appear in Proc. 1st Main. Eng. Conf, Chendu, China.
Jardine, A.K.S., Makis, V., Banjevic, D., Braticevic, D. and Ennis, M., (1998) A decision optimization model for condition based maintenance, J. Qua. Main. Eng., 4(2), 115–121.
Jensen, U., (1992) Optimal replacement rules based on different information level, Naval Res. Log., 39, 937–955.
Kalbfleisch, J.D. and Prentice, R.L., (1980) The Statistical Analysis of Failure Time Data, Wiley, New York.
Kumar, D. and Westberg, U., (1997) Maintenance scheduling under age replacement policy using proportional hazard modelling and total-time-on-test plotting, Euro. J. Opl. Res., 99, 507–515.
Li, C.J. and Li, S.Y., (1995) Acoustic emission analysis for bearing condition monitoring, Wear, 185, 67–74.
Lin, D. and Makis, V., (2003) Recursive filters for a partially observable system subject to random failures, Adv. Appl. Prob., 35(1), 207–227.
Lin, D. and Makis, V., (2004) Filters and parameter estimation for a partially observable system subject to random failures with continuous-range observations, Adv. Appl. Prob., 36(4), 1212–1230.
Love, C.E., Zhang, Z.G., Zitron, M.A. and Guo, R., (2000) A discrete semi-Markov decision model to determine the optimal repair/replacement policy under general repairs, Euro. J. Opl. Res., 125(2), 398–409.
Love, C.E. and Guo, R., (1991) Using proportional hazard modelling in plant maintenance, Quality and Reliability Engineering International, 7, 7–17.
Makis, V. and Jardine, A.K.S., (1991) Computation of optimal policies in replacement models, IMA J. Maths. Appl. Business & Industry, 3, 169–176.
Matthew, C. and Wang, W., (2006) A comparison study of proportional hazard and stochastic filtering when applied to vibration based condition monitoring, submitted to Int. Tran OR.
Meher-Homji, C.B., Mistree, F. and Karandikar, S., (1994) An approach for the integration of condition monitoring and multi-objective optimization for gas turbine maintenance management, International Journal of Turbo and Jet Engines, 11, 43–51.
Neal, M., and Associates, (1979) Guide to the condition monitoring of machinery, DTI, London.
Reeves, C.W., (1998) The vibration monitoring handbook, Coxmoor Publishing Company, Oxford.
Samanta, B., Al-Balushi, K.R. and Al-Araimi, S.A., (2006) Artificial neural networks and genetic algorithm for bearing fault detection, Soft Computing, 10(3), 264–271.
Wang, W., (2002) A model to predict the residual life of rolling element bearings given monitored condition monitoring information to date, IMA. J. Management Mathematics, 13, 3–16.
Wang, W., (2003) Modelling condition monitoring intervals: A hybrid of simulation and analytical approaches, J. Opl. Res. Soc., 54, 273–282.
Wang, W., (2006a) A prognosis model for wear prediction based on oil based monitoring, to appear in J. Opl. Res. Soc.
Wang, W., (2006b) Modelling the probability assessment of the system state using available condition information, to appear in IMA. J. Management Mathematics.
Wang, W. and Christer, A.H., (2000) Towards a general condition based maintenance model for a stochastic dynamic system, J. Opl. Res. Soc., 51, 145–155.
Wang, W. and Jia, Y., (2001) A multiple condition information sources based maintenance model and associated prototype software development, proceedings of COMADEM 2001, Eds. A. Starr and Raj B.K.N. Rao, Elsevier, 889–898.
Wang, W. and Zhang, W., (2005) A model to predict the residual life of aircraft engines based on oil analysis data, Naval Logistics Research, 52, 276–284.
Wong, M.L.D., Jack, L.B. and Nandi, A.K., (2006) Modified self-organising map for automated novelty detection applied to vibration signal monitoring, Mech. Sys. & Sig. Proc., 20(3), 593–610.
Zhang, W., (2004) Stochastic modeling and applications in condition based maintenance, PhD thesis, University of Salford, UK.

6 Maintenance Based on Limited Data David F. Percy

6.1 Introduction

Reliability applications often suffer from a lack of data with which to make informed maintenance decisions. Indeed, the very purpose of maintenance is to prevent failures, and hence failure data, from arising! This effect is particularly noticeable for high reliability systems such as aircraft engines and emergency vehicles, and when new production lines are established or warranty schemes are planned. The evaluation of such systems is a learning process and knowledge is continually updated as more information becomes available. Such issues are of great importance when selecting and fitting mathematical models to improve the accuracy and utility of these decisions.

This chapter investigates why reliability data are so limited, identifies the problems that this causes and proposes statistical methods for dealing with these difficulties. In particular, it considers graphical and numerical summaries, appropriate methods for model development and validation, and the powerful approach of subjective Bayesian analysis for including expert knowledge about the application area, such as information pertaining to a particular manufacturing process and experience of similar operational systems.

Many reliability problems involve making strategic decisions under risk or uncertainty. Stochastic models involving unknown parameters are often adopted for this purpose and our concern is how to make inference about, and arising from, these unknown parameters. The easiest approach involves skilfully guessing the parameter values by subjective means, which is fine so long as there is sufficient expert knowledge to perform this task well. More commonly, the parameters are estimated from observed data and decisions are then made by assuming the parameters equal to their estimates. This frequentist approach to inference is very good if there are sufficient data to estimate the parameters well. However, few data are available in many areas of maintenance and replacement; see Percy et al. (1997) and Kobbacy et al. (1997) for example. There are several reasons why data are scarce in these situations. New systems and processes


naturally offer scant historical data about their performance and reliability. Poor and incomplete maintenance records are often kept, as the engineers and managers do not always appreciate the potential benefits that can be achieved through quantitative modelling and analysis. Of equal importance, many observations of failure times tend to be censored due to maintenance interventions. Typical applications take the form of reliability analysis, such as modelling a critical system’s time to failure, and scheduling problems, such as determining efficient policies for scheduling capital replacement and preventive maintenance, all of which are considered elsewhere in this book. Other applications include determining appropriate thresholds for condition monitoring and specifying warranty schemes for new products. Under these circumstances, it is important to allow for the uncertainty about the unknown model parameters. This is readily achieved by adopting the Bayesian approach to inference, as described by Bernardo and Smith (2000) and O’Hagan (2004). The structure for the remainder of this chapter is as follows. Section 6.2 explains the need for Bayesian analysis and Section 6.3 introduces the concepts beginning with Bayes’ theorem, which is of great importance in its own right. Section 6.4 discusses the construction of prior and posterior distributions, whilst Section 6.5 considers the role of predictive distributions and Section 6.6 considers techniques for setting the hyperparameters of prior distributions. One of the great strengths of the Bayesian approach, particularly in relation to practical problems in reliability and maintenance, is its ability to improve the quality of decision analysis, as described in Section 6.7. Section 6.8 presents a review of the Bayesian approach to maintenance and Section 6.9 includes specific case studies that demonstrate these methods. Finally, Section 6.10 suggests topics for future research and possible new applications. 
For convenience, there follows a list of symbols and acronyms that are used throughout this chapter.

P(⋅) : Probability
E(⋅) : Expected value
p(⋅) : Probability mass function
f(⋅) : Probability density function
R(⋅) : Reliability function
L(⋅) : Likelihood function
g(⋅) : Prior or posterior probability density function
Be(θ) : Bernoulli distribution
Po(µ) : Poisson distribution
Ge(θ) : Geometric distribution
Ex(λ) : Exponential distribution
No(µ, ψ) : Normal distribution
Ga(α, λ) : Gamma distribution
We(α, λ) : Weibull distribution


6.2 Need for Bayesian Approach

Figure 6.1 shows the links between equipment, maintenance, models, parameters and data. Starting with the equipment, imperfect reliability necessitates some forms of maintenance. These affect the performance of the equipment as implied by the arrow, which represents a directional influence. In order to determine suitable maintenance policies and strategies, we formulate appropriate mathematical models. These involve unknown parameters that are modelled using expert knowledge and observed data. Variations to the models arise due to modified reliability characteristics when maintenance strategies are in place for particular equipment, forming the cycle at the top of the chart.

Figure 6.1. The link between fundamental aspects of maintenance modelling and analysis

The conventional approach to model fitting is based upon frequentist methods of estimation, as described in statistics books. One of the best such methods is that of maximum likelihood. Essentially, all unknown model parameters are replaced by estimates calculated from samples of data. For example, a parameter that represents the mean lifetime of a rechargeable battery in a portable computer might be replaced by the average lifetime calculated from a sample of such batteries that were run from charged to flat. However, this approximation can and does lead to substantial errors, inaccuracies and poor decisions, particularly when the estimates are based on small samples of data. When data are limited in this way, one starts with subjective estimates and updates them as new data are observed. This is a very common scenario in reliability and maintenance, where the samples typically contain few and censored failure data.

Example 6.1 Suppose we use a Weibull We(α, λ) distribution to model the random variable X, which represents the breaking strain of a steel cable. As destructive testing can be very expensive and safety precautions can be crucial, it is feasible that we might only collect right-censored observations of the form D = {x_i : X > x_i; i = 1, 2, …, n}. In order to make useful inference involving the model parameters α > 0 and λ > 0, we need to construct the likelihood function. For this scenario, the likelihood involves the reliability function of X,

$$R(x \mid \alpha, \lambda) = \exp\left(-\lambda x^{\alpha}\right) \tag{6.1}$$


for x > 0, and takes the form

$$L(\alpha, \lambda; D) \propto \prod_{i=1}^{n} R(x_i \mid \alpha, \lambda) = \exp\left(-\lambda \sum_{i=1}^{n} x_i^{\alpha}\right). \tag{6.2}$$

We typically maximize this function in order to evaluate the maximum likelihood estimates of α and λ. To do so, the likelihood equations are

$$\lambda \sum_{i=1}^{n} x_i^{\alpha} = 0; \tag{6.3}$$

$$\lambda \sum_{i=1}^{n} x_i^{\alpha} \log x_i = 0. \tag{6.4}$$

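These have no finite solutions for α̂ and λ̂: equation (6.3) cannot be satisfied for any λ > 0 because every term x_i^α is strictly positive, and the likelihood (6.2) increases monotonically as λ decreases towards zero. Our analysis has therefore been thwarted by the lack of uncensored data. □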
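The behaviour described in Example 6.1 can be seen numerically. The following is a minimal sketch in Python, assuming a handful of hypothetical censoring points (the values and the function name `censored_loglik` are illustrative and do not come from the chapter). It evaluates the logarithm of likelihood (6.2) for a fixed shape parameter and a decreasing sequence of λ values.

```python
def censored_loglik(alpha, lam, censor_points):
    """Log of likelihood (6.2): every observation is right-censored at x_i."""
    return -lam * sum(x ** alpha for x in censor_points)


# Hypothetical censoring points (illustrative values, not from the chapter).
censor_points = [1.2, 1.5, 0.9, 2.0]

# For a fixed shape alpha, shrinking lam always raises the log-likelihood,
# so its supremum (zero) is approached as lam -> 0 but never attained.
for lam in (1.0, 0.1, 0.01, 0.001):
    print(f"lam={lam:<6} loglik={censored_loglik(2.0, lam, censor_points):.6f}")
```

The log-likelihood stays negative and keeps rising as λ shrinks, which is why no maximum likelihood estimate exists inside the parameter space when all observations are censored.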

6.3 Bayesian Inference

In 1763, some research on probability theory by the Reverend Thomas Bayes was posthumously published in Philosophical Transactions of the Royal Society. This contained an incredibly important statement of what we now refer to as Bayes’ theorem. In its simplest form, Bayes’ theorem states that for two events A and B, the conditional probability of B given that A has occurred can be expressed as

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)} \tag{6.5}$$

where it is sometimes useful to evaluate the probability of A using the law of total probability

$$P(A) = P(A \mid B)\,P(B) + P(A \mid B')\,P(B') \tag{6.6}$$

where the event B′ is the complement of the event B; that is, the event that B does not occur. Bayes’ theorem can be interpreted as a way of transposing the conditionality from P(A | B) to P(B | A), or as a way of updating the prior probability P(B) to give the posterior probability P(B | A).

Example 6.2 An aircraft warning light comes on if the landing gear on either side is faulty. Suppose we know that faults only occur 0.4% of the time, that they are detected with 99.9% reliability and that false alarms only occur 0.5% of the time when the landing gear is operational. Defining events W = “warning light comes on” and L = “landing gear faulty”, this information can be summarized


concisely as P(L) = 0.004, P(W | L) = 0.999, P(W | L′) = 0.005. Our aim is to calculate the probability that the landing gear is faulty if the warning light comes on. Intuitively, one might suppose that this probability is very close to one, as the alarm system appears to be very accurate. However, the law of total probability gives

$$P(W) = P(W \mid L)\,P(L) + P(W \mid L')\,P(L') = 0.999 \times 0.004 + 0.005 \times 0.996 = 0.008976, \tag{6.7}$$

from which Bayes’ theorem gives

$$P(L \mid W) = \frac{P(W \mid L)\,P(L)}{P(W)} = \frac{0.999 \times 0.004}{0.008976} = 0.45 \tag{6.8}$$

to two decimal places. This result implies that most (55%) of these warning lights are false alarms, despite the apparent accuracy of the alarm system! The reason for this paradoxical outcome is that the landing gear is operational for the vast majority of the time. If we were to specify P(L) = 0.04 instead, we would obtain P(L | W) = 0.89, which is far more acceptable. Similar patterns of behaviour apply to medical screening procedures: in order to reduce the incidence of misdiagnoses, only patients deemed to be at risk of an illness are routinely screened for it. □

Only in the mid-twentieth century were the real benefits of Bayes’ theorem appreciated, though. Not only does it apply to probabilities, but also to random variables. For example, suppose X is a discrete random variable and Y is a continuous random variable. Then the conditional probability density function of Y given X can be determined using Bayes’ theorem, if we know the marginal distributions of X and Y, and the conditional distribution of X given Y:

$$f(y \mid x) = \frac{p(x \mid y)\,f(y)}{p(x)}. \tag{6.9}$$
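Returning briefly to the event form of Bayes’ theorem, the arithmetic of Example 6.2 is easy to check numerically. The following is a minimal sketch in Python; the function and variable names are our own, chosen for illustration, and the only inputs are the three probabilities quoted in the example.

```python
def posterior_probability(prior, p_signal_given_event, p_signal_given_no_event):
    """Bayes' theorem (6.5), with P(A) expanded by the law of total probability (6.6)."""
    p_signal = (p_signal_given_event * prior
                + p_signal_given_no_event * (1.0 - prior))
    return p_signal_given_event * prior / p_signal


# Figures quoted in Example 6.2: P(L), P(W | L), P(W | L').
p_fault_given_light = posterior_probability(0.004, 0.999, 0.005)
# Same alarm system, but with faults ten times as common.
p_fault_given_light_common = posterior_probability(0.04, 0.999, 0.005)

print(round(p_fault_given_light, 2))         # 0.45, as in (6.8)
print(round(p_fault_given_light_common, 2))  # 0.89, as quoted in the example
```

Varying the first argument shows how strongly the posterior probability depends on the prior fault rate, which is the point the example makes about screening.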

This rule for “transposing the conditionals” has proven to be crucial in a variety of important applications, including quality control, fault diagnosis, image processing, medical screening and criminal trials. Even more importantly, we can apply Bayes’ theorem to unknown model parameters. This is the foundation of the Bayesian approach to statistical inference and has had an enormous and profound impact on the subject over the last few decades.

Suppose that a continuous random variable X has a probability distribution that depends on an unknown parameter θ. For example, X might represent the fire-breach time of a door in minutes and it might have an exponential distribution with unknown mean µ = 1/θ. A naïve approach to statistical inference would simply replace θ by a good guess based on expert opinions. However, this is inherently inaccurate and can lead to poor decisions. A better method is the frequentist approach to inference, whereby we evaluate an estimate θ̂ for the unknown parameter θ based on a set of observed data D = {x_1, x_2, …, x_n}, which might consist of a random sample of actual fire-breach times for the above example. Subsequent analyses generally invoke the approximation θ ≈ θ̂, which can again lead to poor decisions. In contrast, the Bayesian approach does not involve any guesses or estimates of unknown parameters in the model. Rather, it uses Bayes’ theorem to update our prior beliefs about θ in response to the observed data D thus:

$$g(\theta \mid D) = \frac{f(D \mid \theta)\,g(\theta)}{f(D)}. \tag{6.10}$$

This enables us to make any inference we wish about θ. We can also use our posterior beliefs about θ for any subsequent inference involving X. The price that we pay for obtaining exact answers and avoiding approximations in this way comes in two parts: the need to assume a prior distribution for θ and the increase in algebraic complexity. This chapter shows how to resolve these issues.

Example 6.3 Suppose the unknown parameter θ represents the proportion of car batteries that fail within two years and our prior beliefs about θ can be expressed in terms of the probability density function

$$g(\theta) = 2(1-\theta); \quad 0 < \theta < 1. \tag{6.11}$$

Suppose also that we observe three car batteries, one of which fails within two years and two of which do not. Then we can express the likelihood of these data using the binomial probability mass function

$$p(D \mid \theta) = 3\theta(1-\theta)^{2}, \tag{6.12}$$

which is the discrete equivalent to the probability density function f(D | θ) referred to above. As p(D), the discrete equivalent of f(D), does not depend on θ, an application of Bayes’ theorem as stated above gives

$$g(\theta \mid D) \propto p(D \mid \theta)\,g(\theta) \propto \theta(1-\theta)^{3} \tag{6.13}$$

for 0 < θ < 1, so our posterior beliefs about the unknown parameter θ can be expressed as a beta distribution, θ | D ~ Be(2, 4). We elaborate on this process further in Section 6.4. □
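The update in Example 6.3 can also be checked by brute force. The sketch below, in Python, multiplies the prior (6.11) by the likelihood (6.12) and normalises the product numerically with a simple midpoint rule; the helper names and the number of grid steps are our own illustrative choices. For a beta distribution with parameters 2 and 4 the mean is 2/(2 + 4) = 1/3, so the numerical posterior mean should be close to that value.

```python
def prior_density(theta):
    """Prior (6.11): g(theta) = 2(1 - theta) on (0, 1)."""
    return 2.0 * (1.0 - theta)


def likelihood(theta):
    """Binomial likelihood (6.12): one failure among three batteries."""
    return 3.0 * theta * (1.0 - theta) ** 2


def midpoint_integral(func, steps=10000):
    """Midpoint-rule integral of func over the interval (0, 1)."""
    h = 1.0 / steps
    return h * sum(func((k + 0.5) * h) for k in range(steps))


# Normalising constant p(D), then the posterior mean of theta.
evidence = midpoint_integral(lambda t: likelihood(t) * prior_density(t))
posterior_mean = midpoint_integral(
    lambda t: t * likelihood(t) * prior_density(t)) / evidence

print(evidence, posterior_mean)
```

The same two-step pattern (integrate the product, then divide) applies when the posterior has no recognisable closed form, although the grid would then need to be chosen with more care.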


6.4 Prior and Posterior Distributions

Section 6.3 concluded by deriving an equation for updating a prior probability density function g(θ) for an unknown parameter θ, based on some observed data D, to give a posterior probability density function g(θ | D). The term f(D | θ) is proportional to the likelihood function. If the data set D consists of a random sample of observations x_1, x_2, …, x_n of a continuous random variable X with probability density function f(x | θ), then the likelihood function becomes

$$L(\theta; D) \propto \prod_{i=1}^{n} f(x_i \mid \theta). \tag{6.14}$$

As the term f(D) does not depend on θ, we can therefore write

$$g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \tag{6.15}$$

or, in words, “posterior is proportional to likelihood times prior”. This is the fundamental rule for Bayesian inference.

Example 6.4 Previously, in Example 6.3, we considered the proportion of car batteries that fail within two years. This involved the use of Bayes’ theorem for this unknown model parameter θ and was an illustration of how the fundamental rule “posterior is proportional to likelihood times prior” can be applied. To clarify this demonstration, the likelihood function takes the form

$$L(\theta; D) \propto \prod_{i=1}^{3} p(x_i \mid \theta) = \theta(1-\theta)^{2} \tag{6.16}$$

where the probability mass function p(x_i | θ) corresponds to a Bernoulli distribution. Consequently, the posterior probability density function of θ given the data D has the form

$$g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \propto \theta(1-\theta)^{3} \tag{6.17}$$

for 0 < θ < 1 , which agrees with the result we obtained previously. The corresponding prior and posterior probability density functions are graphed for comparison in Figure 6.2. □

[Figure 6.2 plot: curves prior(θ) and posterior(θ) against θ for 0 ≤ θ ≤ 1; vertical axis from 0 to 3]

Figure 6.2. Prior and posterior probability density functions for Example 6.4

Having evaluated a posterior distribution using this rule, we can evaluate the posterior mode θ̂ such that

$$g(\hat{\theta} \mid D) \ge g(\theta \mid D) \quad \forall\, \theta, \tag{6.18}$$

by solving the equation

$$\frac{d}{d\theta}\left\{L(\theta; D)\,g(\theta)\right\} = 0. \tag{6.19}$$
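As a brief worked illustration, added here as a sketch that uses only the posterior already obtained in Example 6.4, applying (6.19) to L(θ; D) g(θ) ∝ θ(1 − θ)³ gives

$$\frac{d}{d\theta}\left\{\theta(1-\theta)^{3}\right\} = (1-\theta)^{3} - 3\theta(1-\theta)^{2} = (1-\theta)^{2}(1-4\theta) = 0,$$

so the posterior mode on 0 < θ < 1 is θ̂ = 1/4. This lies between the mode of the prior density 2(1 − θ), which is at θ = 0, and the observed proportion of failures, 1/3.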

However, to find the median or mean, and to use this posterior density to make any further inference, we need to determine the constant of proportionality in the fundamental rule above. In standard situations, we can recognise the functional form of L(θ; D) g(θ) and hence quote published work on probability distributions to determine this constant of proportionality and so derive g(θ | D) explicitly. In non-standard situations, we determine this constant of proportionality using numerical quadrature or simulation, both of which we discuss later.

6.4.1 Reference Priors

There are two main types of prior distribution, which loosely correspond with objective priors and subjective priors. As objective priors strictly do not exist, this category is generally known as reference priors; these are used if little prior information is available, and as a benchmark against which to compare the output from using subjective priors. This offers a default Bayesian analysis that is not dependent upon any personal prior knowledge. The simplest reference prior is proposed by the Bayes-Laplace postulate and simply recommends the use of a uniform or locally-uniform prior g(θ) ∝ 1 for all θ in the region of support R_θ.


However, different parameterisations can lead to different inferences with this approach. To avoid this inconsistency, the standard univariate reference prior that analysts now adopt is the invariant prior of Jeffreys (1998), defined by

$$g(\theta) \propto \sqrt{I(\theta)}; \quad \theta \in R_{\theta} \tag{6.20}$$

where

$$I(\theta) = -E_{X \mid \theta}\left\{\frac{d^{2}\log f(x \mid \theta)}{d\theta^{2}}\right\} \tag{6.21}$$

is Fisher’s expected information. An extension exists for the case of a parameter vector θ, though we usually assume the components of θ are independent, so g(θ) is just the product of the univariate invariant priors. This invariant prior distribution is occasionally improper, as its integral sometimes diverges. However, this problem is generally unimportant because the corresponding posterior distributions are usually proper. Books on Bayesian methods, such as Bernardo and Smith (2000) and Lee (2004), present tables of invariant prior and posterior distributions for common models.

6.4.2 Subjective Priors

Subjective prior distributions should be used if prior information is available, which is almost always the case. They represent the best available knowledge about unknown parameters and can be specified using smoothed histograms, relative likelihoods or parametric families. The first two of these are arbitrary and computationally awkward, so we now investigate the last of these. A family of priors C is closed under sampling if

$$g(\theta) \in C \;\Rightarrow\; g(\theta \mid D) \in C, \tag{6.22}$$

so that the posterior density has the same functional form as the prior density. This property is particularly appealing, as our prior knowledge can be regarded as posterior to some previous information. Again, we tend to suppose that components in multi-parameter problems are independent, so that their joint prior density is the product of corresponding univariate marginal priors. Such closed priors exist, and are called natural conjugate priors, for sampling distributions f(x | θ) that belong to the exponential family. This family includes Bernoulli, binomial, geometric, negative binomial, Poisson, exponential, gamma, normal and lognormal models. For a model in the exponential family with scalar parameter θ, we can express the probability density or mass function in the form

$$f(x \mid \theta) = \exp\left\{a(x)\,b(\theta) + c(x) + d(\theta)\right\} \tag{6.23}$$

and the natural conjugate prior for θ is defined by

g (θ ) ∝ exp {k1b (θ ) + k2 d (θ )}

(6.24)

for suitable constants k1 and k 2 . However, any conjugate prior of the form

g (θ ) ∝ h (θ ) exp {k1b (θ ) + k2 d (θ )}

(6.25)

is also closed under sampling for models in the exponential family. Books on Bayesian methods, such as Bernardo and Smith (2000), present tables of the conjugate prior and posterior distributions for common models. However, many applications in reliability and maintenance are not amenable to such simple analyses. For example, the Weibull distribution is not a member of the exponential family. As a result of this, the constant of proportionality in the expression

$$g(\theta \mid D) \propto L(\theta; D)\,g(\theta) \tag{6.26}$$

cannot always be evaluated algebraically, and analytical approximations or numerical computation are then required. It is desirable to avoid the inconsistency of using natural conjugate priors when they exist and other forms of subjective prior, such as location-scale forms, when they do not. The following recommendations by Percy (2004) provide a simple, consistent and comprehensive strategy that achieves this for general use:

• Infinite range −∞ < θ < ∞: use a normal prior distribution for θ
• Semi-infinite range 0 < θ < ∞: use a gamma prior distribution for θ
• Finite range 0 < θ < 1: use a beta prior distribution for θ

If necessary, linear transformations of the parameters ensure that these priors are sufficient for modelling all situations. They match with the natural conjugate priors for simple models and extend to deal with more complicated models. Mixtures of these priors can be used if multimodality is present and prior independence can be assumed for multiparameter situations.
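To illustrate the numerical route mentioned in this section, the following minimal Python sketch normalises a posterior by quadrature and compares the result with the conjugate answer. It assumes an exponential model with a gamma Ga(a, b) prior on the rate (shape a, rate b), so that a closed form exists for checking; the hyperparameters, the lifetimes and the function names are hypothetical values chosen purely for illustration.

```python
import math


def unnormalised_posterior(lam, a, b, data):
    """Exponential likelihood times Ga(a, b) prior, up to a constant factor."""
    n, total = len(data), sum(data)
    return lam ** (a + n - 1) * math.exp(-(b + total) * lam)


def posterior_mean_by_quadrature(a, b, data, upper, steps=20000):
    """Midpoint-rule posterior mean of lam over the interval (0, upper)."""
    h = upper / steps
    grid = [(k + 0.5) * h for k in range(steps)]
    weights = [unnormalised_posterior(lam, a, b, data) for lam in grid]
    return sum(l * w for l, w in zip(grid, weights)) / sum(weights)


# Hypothetical hyperparameters and lifetimes in hours (illustration only).
a, b = 2.0, 1000.0
data = [400.0, 650.0, 900.0]

numeric_mean = posterior_mean_by_quadrature(a, b, data, upper=0.02)
# Conjugacy gives Ga(a + n, b + sum(x_i)), with mean (a + n) / (b + sum(x_i)).
closed_form_mean = (a + len(data)) / (b + sum(data))
print(numeric_mean, closed_form_mean)
```

For a non-conjugate model such as the Weibull, only `unnormalised_posterior` would change; the truncation point `upper` is an assumption that must be large enough for the neglected tail to be negligible.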

6.5 Predictive Distributions

The frequentist approach to inference involves estimating unknown parameters, evaluating confidence intervals and performing significance tests. Such intervals and tests are statements about the data rather than the parameters and so are of little use. For example, null hypotheses are often strictly impossible, in which case a test will be significant if, and only if, sufficient data are observed. In contrast, the


Bayesian approach to inference makes statements about the parameters given the data, which are precisely what is required. O’Hagan (1994) commented that the “Bayesian approach … is fundamentally sound, very flexible, produces clear and direct inferences and makes use of all the available information.” In contrast, he noted that the “Classical approach suffers from some philosophical flaws, has a restrictive range of inferences with rather indirect meanings and ignores prior information.”

One of the most important and useful features of the Bayesian approach arises when we wish to make predictions about future values of the random variable X where f(x | θ) is specified. If θ is unknown, the prior predictive probability density function of X is

$$f(x) = \int_{-\infty}^{\infty} f(x \mid \theta)\,g(\theta)\,d\theta. \tag{6.27}$$

If data D are observed, the posterior predictive probability density function of X is

$$f(x \mid D) = \int_{-\infty}^{\infty} f(x \mid \theta)\,g(\theta \mid D)\,d\theta \;\propto\; \int_{-\infty}^{\infty} f(x \mid \theta)\,L(\theta; D)\,g(\theta)\,d\theta. \tag{6.28}$$

In contrast, a frequentist approach either uses the approximation f(x | D) ≈ f(x | θ̂) or gives a point prediction x̂ = E(X | θ̂) with a prediction interval if available.

Example 6.5 Suppose that the time X to breakdown of a large pulper in a paper mill has an exponential sampling distribution given some unknown hazard parameter λ. With Jeffreys’ invariant prior, g(λ) ∝ 1/λ, the prior predictive density is given by

$$f(x) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\frac{1}{\lambda}\,d\lambda \propto \frac{1}{x} \tag{6.29}$$

for x > 0, which is improper. However, this does provide information about the relative likelihoods for different values of X. For example, the ratio of probabilities that X lies in the intervals (5, 10) and (10, 20) is given by

$$\frac{P(5 < X < 10)}{P(10 < X < 20)} = \frac{\int_{5}^{10} \frac{1}{x}\,dx}{\int_{10}^{20} \frac{1}{x}\,dx} = \frac{\log 10 - \log 5}{\log 20 - \log 10} = 1 \tag{6.30}$$

so the time to breakdown of this pulper is equally likely to lie in these two intervals without taking account of any subjective or empirical information that might be available. Once we observe a random sample of lifetimes D = {x_1, x_2, …, x_n}, the posterior predictive density

$$f(x \mid D) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x) \times \lambda^{n-1} \exp\left\{-\left(\sum_{i=1}^{n} x_i\right)\lambda\right\} d\lambda = \frac{n!}{\left(x + \sum_{i=1}^{n} x_i\right)^{n+1}}; \quad x > 0 \tag{6.31}$$

can be normalised, because the right-hand side is integrable over x > 0 whenever n ≥ 1, and relative likelihoods can be evaluated as they were for the prior predictive density. In contrast, a frequentist approach would merely generate the approximation X | D ~ Ex(1/x̄) and could do no better than guess a value for X before observing any data. □

Example 6.6 Reconsidering the time to breakdown of the pulper in Example 6.5, suppose we instead use a gamma prior to reflect the knowledge of experts on site. The prior predictive density is now given by

$$f(x) = \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\frac{b^{a}\lambda^{a-1}\exp(-b\lambda)}{\Gamma(a)}\,d\lambda = \frac{a b^{a}}{(x+b)^{a+1}}; \quad x > 0 \tag{6.32}$$

which corresponds with a special form of gamma-gamma distribution. If we subsequently observe a random sample of lifetimes D = {x_1, x_2, …, x_n}, the posterior predictive density is given by

$$f(x \mid D) \propto \int_{0}^{\infty} \lambda \exp(-\lambda x)\,\lambda^{a+n-1} \exp\left\{-\left(b + \sum_{i=1}^{n} x_i\right)\lambda\right\} d\lambda = \frac{\Gamma(a+n+1)}{\left(x + b + \sum_{i=1}^{n} x_i\right)^{a+n+1}}; \quad x > 0 \tag{6.33}$$


which again corresponds to a gamma-gamma distribution. As before, a frequentist approach would merely yield the approximation X | D ~ Ex(1/x̄). □
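The closed forms in Examples 6.5 and 6.6 can be checked against direct numerical integration. The sketch below, in Python, evaluates the integral in (6.32) by a midpoint rule and compares it with a bᵃ/(x + b)ᵃ⁺¹, then recomputes the ratio in (6.30). The hyperparameters a and b, the lifetime x, the truncation point and the function names are all hypothetical choices made for illustration.

```python
import math


def gamma_prior_predictive(x, a, b):
    """Closed form (6.32): a * b**a / (x + b)**(a + 1)."""
    return a * b ** a / (x + b) ** (a + 1)


def predictive_by_quadrature(x, a, b, upper, steps=20000):
    """Midpoint-rule value of the integral in (6.32) over (0, upper)."""
    h = upper / steps
    total = 0.0
    for k in range(steps):
        lam = (k + 0.5) * h
        prior = b ** a * lam ** (a - 1) * math.exp(-b * lam) / math.gamma(a)
        total += lam * math.exp(-lam * x) * prior
    return total * h


# Hypothetical hyperparameters and lifetime (illustration only).
a, b, x = 2.0, 100.0, 50.0
closed_form = gamma_prior_predictive(x, a, b)
numeric = predictive_by_quadrature(x, a, b, upper=1.0)

# Relative likelihood (6.30) under the improper predictive f(x) proportional to 1/x.
jeffreys_ratio = (math.log(10) - math.log(5)) / (math.log(20) - math.log(10))
print(closed_form, numeric, jeffreys_ratio)
```

Truncating the integral at `upper=1.0` is adequate here only because the integrand decays like exp(−150λ) for these particular values; other hyperparameters would need a different cut-off.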

6.6 Prior Specification

In Section 6.4 we discussed what objective and subjective prior distributions are appropriate for practical applications. As some prior knowledge is always available, a conjugate prior should be used whenever possible. However, reference priors are useful in these circumstances:

• For an objective analysis with no specific personal inputs
• For comparison with similar analyses by other investigators
• As baselines to assess the sensitivity of results to choice of prior

We now consider the difficult problem of assigning values to the hyperparameters of subjective prior distributions. Suppose we have a model for a continuous random variable X with probability density function f(x | θ), which depends on a parameter θ with subjective prior probability density function g(θ). Typically, this prior distribution involves two unknown hyperparameters, which we now label a and b. We set fixed values for these hyperparameters to reflect our prior knowledge about θ. For two hyperparameters, we need two distinct pieces of information, such as the lower and upper tertiles (33⅓ and 66⅔ percentiles) of θ or the cumulative probabilities corresponding to any two suitable values of θ. Alternative information about θ could be provided, though quantiles and cumulative probabilities are the easiest and best formulations. One obvious alternative is to specify the prior mode, but this is occasionally at an endpoint of the parameter’s range and so provides no useful information. Furthermore, there is no suitable candidate for the second piece of information when the prior mode is used. Another obvious alternative is to specify the prior mean and standard deviation. However, we cannot make meaningful judgments about these purely mathematical abstracts.

Unfortunately, parameters are not observable and we cannot make accurate statements about them directly. The sole exception is when our parameter represents the probability of an event associated with infinitely repeatable Bernoulli trials. In this case, it is feasible to elicit information about an identical quantity, the asymptotic proportion. In general, however, we can elicit hyperparameter values by considering the prior predictive distribution introduced in Section 6.5, which is also a function of a and b; refer to Percy (2002) for further details.
Research in this area is still ongoing, particularly for models for which the prior predictive cumulative distribution function cannot be determined analytically and for multiparameter models for which there are implicit and indeterminable constraints on the prior predictive quantiles.

Example 6.7 We saw earlier that the prior predictive probability density function for the exponential sampling distribution (perhaps representing the time $X$ to failure of a pulper, as before, or the downtime $X$ incurred as a result of a computer system failure) with a gamma prior is given by

$$f(x) = \frac{a b^{a}}{(x + b)^{a+1}}; \quad x > 0. \tag{6.34}$$

Hence the prior predictive cumulative distribution function is

$$F(x) = \int_{0}^{x} \frac{a b^{a}}{(u + b)^{a+1}}\,du = 1 - \left(\frac{b}{x + b}\right)^{a}; \quad x > 0. \tag{6.35}$$

If an expert specifies tertiles $L$ and $U$, such that $F(L) = \tfrac{1}{3}$ and $F(U) = \tfrac{2}{3}$, then we can solve these two nonlinear simultaneous equations numerically for $a$ and $b$. These can then be substituted into our prior predictive density $f(x)$. □

Example 6.8 In Example 6.7, the exponentially distributed random variable $X$ might instead represent the lifetime of an energy efficient light bulb, in operating hours. Suppose that, based on subjective knowledge of similar light bulbs, we believe that one third of the new type will fail within 2500 operating hours and one third will last for at least 7500 operating hours. This implies that we believe that the remaining third will fail between these two values. Then $L = 2500$ and $U = 7500$, so we need to solve these two simultaneous nonlinear equations numerically for $a$ and $b$:

$$\frac{1}{3} = 1 - \left(\frac{b}{2{,}500 + b}\right)^{a}; \tag{6.36}$$

$$\frac{2}{3} = 1 - \left(\frac{b}{7{,}500 + b}\right)^{a}. \tag{6.37}$$

There are many algorithms for solving simultaneous nonlinear equations and several computer packages that contain these algorithms. Mathcad gives the values a = 3.5240 and b = 20,502 , so the prior distribution for the exponential parameter λ is specified completely as λ ~ Ga ( 3.5240, 20,502 ) . □
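The calculation in Example 6.8 can be sketched in a few lines of Python. This is only an illustrative alternative to the Mathcad solution: the function names are hypothetical, and the approach (taking logarithms of Equations 6.36 and 6.37, dividing to eliminate $a$, and solving the remaining equation in $b$ by bisection) is one of many possible algorithms. The search bracket for $b$ is an assumption.

```python
import math

def prior_predictive_cdf(x, a, b):
    """Equation 6.35: F(x) = 1 - (b / (x + b))**a for x > 0."""
    return 1.0 - (b / (x + b)) ** a

def elicit_gamma_hyperparameters(lower, upper, p_lower=1 / 3, p_upper=2 / 3):
    """Solve F(lower) = p_lower and F(upper) = p_upper for (a, b).

    Taking logs gives a*log(1 + lower/b) = -log(1 - p_lower) and
    a*log(1 + upper/b) = -log(1 - p_upper). Their ratio does not
    involve a, so b is found by bisection and a follows directly.
    """
    target = math.log(1.0 - p_upper) / math.log(1.0 - p_lower)

    def ratio(b):
        return math.log1p(upper / b) / math.log1p(lower / b)

    lo, hi = 1e-6 * lower, 1e6 * upper  # assumed search bracket for b
    if not ratio(lo) < target < ratio(hi):
        raise ValueError("no solution: the quantiles are incompatible with this model")
    for _ in range(200):
        mid = math.sqrt(lo * hi)  # bisect on the log scale
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    b = math.sqrt(lo * hi)
    a = -math.log(1.0 - p_lower) / math.log1p(lower / b)
    return a, b

a, b = elicit_gamma_hyperparameters(2500.0, 7500.0)
print(f"a = {a:.4f}, b = {b:.0f}")
print(prior_predictive_cdf(2500.0, a, b), prior_predictive_cdf(7500.0, a, b))
```

The printed values should agree closely with those quoted in Example 6.8, and the two cumulative probabilities should reproduce the elicited tertiles. A solution exists only when the ratio $U/L$ exceeds $\ln 3 / \ln 1.5 \approx 2.71$, which is why the sketch checks the bracket before iterating.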

6.7 Bayesian Decision Theory

Much research into maintenance modelling, as presented throughout this book, involves making informed decisions in the presence of stochastic variability. Sensitivity analyses are always advisable in such circumstances, to consider how the conclusions are affected by misspecification of the model and its parameters; see Kobbacy et al. (1995). Rather than replacing model parameters by guesses or estimates, however, more accurate decisions can be made by adopting a Bayesian analysis to allow for the uncertainty attached to these parameters. This effect is particularly important when dealing with limited amounts of data, a common problem in the area of reliability and maintenance and the subject of this chapter.

For example, the author recently acquired a set of data relating to the performance of an industrial valve subject to corrective and preventive maintenance. Only 12 uncensored lifetime observations were available, even though they represent six years of data collection. From a frequentist point of view, it would be unwise to fit any model involving more than three parameters to these data. However, the Bayesian is not constrained in this manner, as prior knowledge gleaned from experience of similar systems can be incorporated in the analysis. Of course, parsimony still dictates that models with fewer parameters are more robust for predictive purposes, even if models with more parameters provide better fits to the observed data. We can resolve such issues with model comparison methods based on prior odds, Bayes factors and posterior odds, which we do not discuss here.

Consider a set of possible decisions $d \in \Delta$ with associated utility function $u(d, \theta)$, which depends on an unknown parameter $\theta$. The best decision is that which maximizes the prior expected utility

$$E\{u(d, \theta)\} = \int_{-\infty}^{\infty} u(d, \theta)\, g(\theta)\, d\theta \tag{6.38}$$

with respect to the prior probability density function $g(\theta)$. Alternatively, we can minimize the prior expected loss

$$E\{l(d, \theta)\} = \int_{-\infty}^{\infty} l(d, \theta)\, g(\theta)\, d\theta \tag{6.39}$$

for some loss function $l(d, \theta)$. If we observe exchangeable data $D = \{x_1, x_2, \ldots, x_n\}$ from the sampling density $f(x \mid \theta)$, the criterion to maximize (minimize) is the posterior expected utility (loss) defined by

$$E\{u(d, \theta) \mid D\} = \int_{-\infty}^{\infty} u(d, \theta)\, g(\theta \mid D)\, d\theta \tag{6.40}$$

where

$$g(\theta \mid D) \propto L(\theta; D)\, g(\theta) \tag{6.41}$$

is the posterior probability density function.

Example 6.9 Which of two alarm systems should we buy if they cost $c_i$ units and fail at times $X_i$, where

$$X_i \mid \lambda_i \sim \mathrm{Ex}(\lambda_i) \tag{6.42}$$

for $i = 1, 2$ respectively? Assuming replacements on failure for an infinite horizon, the elementary renewal theorem gives the expected cost per unit time for action $i$ as the loss function

$$l(i, \lambda_i) = c_i \lambda_i. \tag{6.43}$$

Then

$$E_{\lambda_i}\{l(i, \lambda_i)\} = c_i E(\lambda_i) \tag{6.44}$$

and we choose system $i$ which minimizes this expected loss, where $E(\lambda_i)$ is the prior mean. □
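As a minimal numerical sketch of Example 6.9, suppose each hazard $\lambda_i$ has a gamma prior $\mathrm{Ga}(a_i, b_i)$ with mean $a_i / b_i$, as in the exponential row of Table 6.1. The costs and hyperparameter values below are hypothetical and serve only to illustrate Equation 6.44.

```python
def expected_loss(cost, a, b):
    """Prior expected cost per unit time, c * E(lambda), for lambda ~ Ga(a, b)."""
    return cost * a / b

# Hypothetical inputs: system index -> (cost c_i, prior shape a_i, prior rate b_i)
systems = {
    1: (120.0, 2.0, 4000.0),
    2: (150.0, 2.0, 6000.0),
}

losses = {i: expected_loss(*params) for i, params in systems.items()}
best = min(losses, key=losses.get)

for i, loss in losses.items():
    print(f"system {i}: expected cost per unit time = {loss:.4f}")
print(f"choose system {best}")
```

With these illustrative numbers the dearer system is preferred, because its lower prior mean hazard more than offsets its higher purchase cost.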

6.8 Review of Bayesian Approach to Maintenance

Whether we are interested in modelling the reliability of components or systems, assessing the quality of manufactured products, determining optimal replacement policies, deciding when to intervene with preventive maintenance, interpreting the results from condition monitoring, resolving stock control problems or establishing warranty schemes, mathematical models and statistical analysis offer many advantages over subjective expert knowledge alone. This book describes many techniques related to the modelling aspects and generally advocates the frequentist approach of estimating unknown model parameters based upon random samples of observed data. However, this chapter has emphasised that this approach only provides approximate inference, decisions, predictions and solutions. When many data are available, such as might arise when analysing the returns data from common household appliances, these approximations are very accurate. However, they can be very inaccurate when few data are available. We often encounter this situation in maintenance modelling, as the whole purpose of maintenance is to prevent failures from occurring and so lifetime observations are typically censored. Moreover, some applications in this general area relate to products or systems that are completely new or modified versions and for which reliability data are simply not available.

By combining the observed data with expert knowledge, the Bayesian approach to statistical analysis avoids the need for approximate inference and yields exact answers under the assumed model and prior formulations. These enable us to make the best maintenance decisions given all available information. This chapter began by justifying this approach and then investigated suitable forms for the prior distributions corresponding to common reliability models. After describing how to calculate the related posterior and predictive distributions, it discussed how to use this knowledge for decision making in practice.


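Table 6.1. Common distributions for maintenance modelling

| Model | Parameter | Prior | Posterior |
|---|---|---|---|
| Bernoulli, $\mathrm{Be}(\theta)$ | probability $\theta$ | beta, $\mathrm{Be}(a, b)$ | beta, $\mathrm{Be}\left(a + n\bar{x},\ b + n(1 - \bar{x})\right)$ |
| Poisson, $\mathrm{Po}(\mu)$ | mean $\mu$ | gamma, $\mathrm{Ga}(a, b)$ | gamma, $\mathrm{Ga}\left(a + n\bar{x},\ b + n\right)$ |
| Geometric, $\mathrm{Ge}(\theta)$ | probability $\theta$ | beta, $\mathrm{Be}(a, b)$ | beta, $\mathrm{Be}\left(a + n,\ b + n(\bar{x} - 1)\right)$ |
| Exponential, $\mathrm{Ex}(\lambda)$ | hazard $\lambda$ | gamma, $\mathrm{Ga}(a, b)$ | gamma, $\mathrm{Ga}\left(a + n,\ b + \sum x\right)$ |
| Normal, $\mathrm{No}(\mu, \psi)$ | mean $\mu$ (known precision $\psi$) | normal, $\mathrm{No}(a, b)$ | normal, $\mathrm{No}\left(\dfrac{ab + n\bar{x}\psi}{b + n\psi},\ b + n\psi\right)$ |
| Normal, $\mathrm{No}(\mu, \psi)$ | precision $\psi$ (known mean $\mu$) | gamma, $\mathrm{Ga}(c, d)$ | gamma, $\mathrm{Ga}\left(c + \dfrac{n}{2},\ d + \dfrac{n(\bar{x} - \mu)^{2}}{2} + \dfrac{(n - 1)s^{2}}{2}\right)$ |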

Table 6.1 presents a summary of the probability distributions commonly encountered in maintenance analysis, together with details of their natural conjugate prior and posterior distributions. For models with two parameters, including the unconstrained normal, gamma and Weibull sampling distributions, the analysis is less straightforward and readers are referred to Section 6.4.2 for guidance.

Among the published research that applies this methodology to maintenance modelling is the extensive book on Bayesian reliability analysis by Martz and Waller (1982). Journal papers that address specific issues include those by Soland (1969), Bury (1972) and Canavos and Tsokos (1973), who are concerned particularly with analysis of the Weibull distribution. Singpurwalla (1988) and Percy (2004) are concerned with prior elicitation for reliability analysis and O’Hagan (1998) presents an accessible, general discussion of Bayesian methods. There are many other academic publications dealing with Bayesian approaches in maintenance, and a representative sample of recent articles includes those by van Noortwijk et al. (1992), Mazzuchi and Soyer (1996), Chen and Popova (2000), Apeland and Aven (2000), Kallen and van Noortwijk (2005) and Celeux et al. (2006). The general aim is to determine optimal policies for maintenance scheduling and operation, by combining subjective prior knowledge with observed data using Bayes’ theorem and employing belief networks for larger systems.

6.9 Case Studies

We now consider two case studies in which the techniques of this chapter can be applied successfully.

6.9.1 Digital Set Top Boxes

The proportion of defective test versions of digital set top boxes $\theta$ in a large shipment is unknown, but a beta prior probability density function of the form

$$g(\theta) = \frac{1}{B(a, b)}\, \theta^{a-1} (1 - \theta)^{b-1}; \quad 0 < \theta < 1 \tag{6.45}$$

is appropriate. An expert believes that $\theta$ is equally likely to lie in each of the intervals $(0, \tfrac{1}{50})$, $(\tfrac{1}{50}, \tfrac{1}{20})$ and $(\tfrac{1}{20}, 1)$, which corresponds to hyperparameter values of $a = 1.112$ and $b = 24.03$ as displayed in Figure 6.3. Given that 100 boxes are selected at random from the shipment and 3 of these are found to be defective, we can determine the posterior probability density function of $\theta$ from the first row of Table 6.1 as a $\mathrm{Be}(1.112 + 3,\ 24.03 + 97) = \mathrm{Be}(4.112, 121.0)$ distribution, which is also displayed in Figure 6.3. This enables us to evaluate numerically the posterior probability that the proportion of defective boxes in the shipment exceeds $\tfrac{1}{10}$ as $P(\theta > \tfrac{1}{10} \mid D) = 0.0013$, or 1 in 763.

As a final exercise, suppose we select a further box at random from the shipment and consider the random variable $X$ which takes the value 0 if the box is functional, or 1 if it is defective. Then Equation 6.28 can be used to determine the posterior predictive probability mass function for $X$ given the data above as

$$p(x \mid D) = \begin{cases} 0.967; & x = 0 \\ 0.033; & x = 1 \end{cases} \tag{6.46}$$

so the posterior probability that a randomly chosen box from the shipment is defective is $P(X = 1 \mid D) = 0.033$, or 1 in 30.
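The posterior calculations in this case study can be reproduced approximately with the short sketch below. It is not the method used to produce the figures in the text: the function names are hypothetical, and the tail probability is evaluated by composite Simpson's rule, chosen here only because it needs nothing beyond the standard library. The predictive probability of a defective box is taken as the posterior mean of $\theta$.

```python
import math

def beta_pdf(theta, a, b):
    """Beta(a, b) density on the log scale; returns 0 outside (0, 1),
    which matches the density at the endpoints when a, b > 1."""
    if theta <= 0.0 or theta >= 1.0:
        return 0.0
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(theta)
                    + (b - 1) * math.log1p(-theta) - log_beta)

def beta_tail(t, a, b, steps=20000):
    """P(theta > t) by composite Simpson's rule on [t, 1]; steps must be even."""
    h = (1.0 - t) / steps
    total = beta_pdf(t, a, b) + beta_pdf(1.0, a, b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * beta_pdf(t + k * h, a, b)
    return total * h / 3.0

# Prior hyperparameters and data from Section 6.9.1
a0, b0 = 1.112, 24.03
n, defects = 100, 3

# Conjugate update from the first row of Table 6.1
a1, b1 = a0 + defects, b0 + (n - defects)

p_tail = beta_tail(0.1, a1, b1)   # P(theta > 1/10 | D)
p_defective = a1 / (a1 + b1)      # P(X = 1 | D), the posterior mean

print(f"posterior: Be({a1:.3f}, {b1:.2f})")
print(f"P(theta > 0.1 | D) ~ {p_tail:.4f}")
print(f"P(X = 1 | D) ~ {p_defective:.3f}")
```

The results should be close to the values 0.0013 and 0.033 reported above, up to rounding and the accuracy of the numerical integration.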

[Figure omitted: plot of prior(θ) and posterior(θ) against θ]

Figure 6.3. Prior and posterior probability density functions for digital set top boxes

6.9.2 Rechargeable Tool Batteries

A manufacturer is interested in assessing the unknown hazard $\lambda$ of rechargeable tool batteries for inter-charge operational times measured in hours. Her prior beliefs are represented by a $\mathrm{Ga}(10, 40)$ distribution with probability density function

$$g(\lambda) = \frac{40^{10} \lambda^{9} \exp(-40\lambda)}{9!}; \quad \lambda > 0. \tag{6.47}$$

She runs an experiment for one day, replacing each flat battery by an identical fully charged battery after failure, so that the total number of failures $X$ has a Poisson distribution with probability mass function

$$p(x \mid \lambda) = \frac{(24\lambda)^{x}}{x!} \exp(-24\lambda); \quad x = 0, 1, 2, \ldots. \tag{6.48}$$

In fact, she runs $n = 10$ such experiments in parallel, giving a sample mean of $\bar{x} = 6.7$. Referring to the second row of Table 6.1 and transforming to failures per hour, we see that her posterior beliefs about $\lambda$ correspond to a $\mathrm{Ga}(10 + 10 \times 6.7,\ 40 + 10 \times 24) = \mathrm{Ga}(77, 280)$ distribution, as displayed in Figure 6.4. The posterior mode is $(77 - 1)/280 \approx 0.27$, which corresponds to the most likely value of $\lambda$.
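The update in this case study amounts to a few lines of arithmetic, sketched below. The function name and its `exposure` argument (hours of observation per experiment) are illustrative choices, not notation from the text.

```python
def gamma_poisson_update(a, b, n, mean_count, exposure):
    """Posterior Ga(a + n*mean_count, b + n*exposure) for a Poisson rate
    per unit exposure, given n counts each observed over `exposure` units."""
    return a + n * mean_count, b + n * exposure

# Prior Ga(10, 40); ten one-day (24 h) experiments with mean 6.7 failures each
a1, b1 = gamma_poisson_update(a=10.0, b=40.0, n=10, mean_count=6.7, exposure=24.0)

posterior_mode = (a1 - 1.0) / b1   # valid because a1 > 1
posterior_mean = a1 / b1

print(f"posterior: Ga({a1:.0f}, {b1:.0f})")
print(f"mode ~ {posterior_mode:.3f} failures per hour, mean ~ {posterior_mean:.3f}")
```

Scaling the rate parameter by the 24-hour exposure is what the text calls transforming to failures per hour.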

[Figure omitted: plot of prior(λ) and posterior(λ) against λ]

Figure 6.4. Prior and posterior probability density functions for rechargeable tool batteries

6.10 Conclusions

Bayesian inference represents a methodology for mathematical modelling and statistical analysis of random variables and unknown parameters. It provides an excellent alternative to the frequentist approach which gained immense popularity throughout the twentieth century. Whereas the frequentist approach is based upon the restrictive inference of point estimates, confidence intervals, significance tests, p-values and asymptotic approximations, the Bayesian approach is based upon probability theory and provides complete solutions to practical problems. Advocates of the Bayesian approach regard it as superior to the frequentist approach in most circumstances and infinitely superior in some. However, it does depend upon the existence and specification of subjective probability to represent individual beliefs, whereas the frequentist approach is almost completely objective. Partial resolution of these difficulties was addressed in Section 6.6 and continues to be improved upon, particularly with regard to eliciting subjective prior knowledge for multiparameter models.

The approach advocated here also involves more analytical and computational complexity, though this is not much of a hindrance with modern computing power. In particular, this approach often involves intractable integrals of the forms

$$g(\theta \mid D) = \frac{L(\theta; D)\, g(\theta)}{\int_{-\infty}^{\infty} L(\theta; D)\, g(\theta)\, d\theta} \quad \text{posterior densities;} \tag{6.49}$$

$$f(x) = \int_{-\infty}^{\infty} f(x \mid \theta)\, g(\theta)\, d\theta \quad \text{predictive densities;} \tag{6.50}$$

$$E\{u(d, \theta)\} = \int_{-\infty}^{\infty} u(d, \theta)\, g(\theta)\, d\theta \quad \text{expected utilities.} \tag{6.51}$$

Monte Carlo simulation can be used to approximate any integral of this form by generating many pseudo-random numbers $\theta_1, \theta_2, \ldots, \theta_n$ from the prior or posterior density in the integrand and evaluating the unbiased estimator

$$\int_{-\infty}^{\infty} s(\theta)\, g(\theta)\, d\theta \approx \frac{1}{n} \sum_{i=1}^{n} s(\theta_i), \tag{6.52}$$

though more efficient procedures exist. Rejection methods can be used to generate the pseudo-random numbers, and the most powerful sampling algorithms are referred to as Markov chain Monte Carlo (MCMC) methods, the most common of which is Gibbs sampling. At the time of writing, WinBUGS software is freely available for performing MCMC calculations and may be downloaded from the internet. Further information about MCMC techniques, and other analytical and numerical methods for Bayesian computation, is given in the textbooks mentioned in the introduction.

We have explained why many problems arising in maintenance applications are hampered by a lack of data and so are prime candidates for applying the ideas presented in this chapter. In particular, we suggested and demonstrated how this methodology might benefit decision making related to modelling times to failure and scheduling problems, such as determining efficient policies for scheduling capital replacement and preventive maintenance, determining appropriate thresholds for condition monitoring and specifying warranty schemes for new products.

There is considerable scope for developing these techniques for new application areas within maintenance and extending them into related areas. Potential future projects might consider original products, such as recent inventions or modified lines, and items that are tailored to consumers’ specifications, such as construction works, for which historical data are not available. Similarly, some rare, expensive and safety critical systems will have limited failure data with which to estimate model parameters. Enhancements to warranty analysis are also possible, particularly in cases where returns data are not readily available, including natural extensions of the basic concepts to the analysis of extended warranties. Finally, broader definitions of reliability and maintenance would enable us to apply some of the preceding ideas to non-industrial systems, such as information networks, social communities and public services.
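The Monte Carlo estimator of Equation 6.52 can be illustrated with the posterior $\mathrm{Ga}(77, 280)$ from Section 6.9.2. The sketch below is an illustration rather than part of the original analysis: it takes $s(\lambda) = \exp(-\lambda t)$, the probability that a battery survives $t$ hours if inter-charge times are exponential with hazard $\lambda$ (an assumption consistent with the Poisson count model), and compares the simulated value with the closed form $(b / (b + t))^{a}$ implied by Equation 6.35. The helper name, sample size, seed and the choice $t = 2$ are arbitrary.

```python
import math
import random

def monte_carlo_expectation(s, sampler, n, seed=0):
    """Estimate the integral of s(theta) g(theta) by averaging s over n draws from g."""
    rng = random.Random(seed)
    return sum(s(sampler(rng)) for _ in range(n)) / n

a, b = 77.0, 280.0   # posterior Ga(a, b) with rate b, from Section 6.9.2
t = 2.0              # hours; an arbitrary illustrative horizon

estimate = monte_carlo_expectation(
    s=lambda lam: math.exp(-lam * t),
    # random.gammavariate takes a shape and a *scale*, so the rate b is inverted
    sampler=lambda rng: rng.gammavariate(a, 1.0 / b),
    n=100_000,
)
exact = (b / (b + t)) ** a

print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"closed form:          {exact:.4f}")
```

Because the draws here come directly from a standard distribution, no MCMC is needed; the same averaging step applies unchanged when the draws are produced by an MCMC sampler instead.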

6.11 References

Apeland S, Aven T, (2000) Risk based maintenance optimization: foundational issues. Reliability Engineering & System Safety 67:285–292
Bernardo JM, Smith AFM, (2000) Bayesian Theory. Chichester: Wiley
Bury KV, (1972) Bayesian decision analysis of the hazard rate for a two-parameter Weibull process. IEEE Transactions on Reliability 21:159–169
Canavos GC, Tsokos CP, (1973) Bayesian estimation of life parameters in the Weibull distribution. Operations Research 21:755–763
Celeux G, Corset F, Lannoy A, Ricard B, (2006) Designing a Bayesian network for preventive maintenance from expert opinions in a rapid and reliable way. Reliability Engineering & System Safety 91:849–856
Chen TM, Popova E, (2000) Bayesian maintenance policies during a warranty period. Communications in Statistics 16:121–142
Jeffreys H, (1998) Theory of Probability. Oxford: University Press
Kallen MJ, van Noortwijk JM, (2005) Optimal maintenance decisions under imperfect maintenance. Reliability Engineering & System Safety 90:177–185
Kobbacy KAH, Percy DF, Fawzi BB, (1995) Sensitivity analyses for preventive-maintenance models. IMA Journal of Mathematics Applied in Business and Industry 6:53–66
Kobbacy KAH, Percy DF, Fawzi BB, (1997) Small data sets and preventive maintenance modelling. Journal of Quality in Maintenance Engineering 3:136–142
Lee PM, (2004) Bayesian Statistics: an Introduction. London: Arnold
Martz HF, Waller RA, (1982) Bayesian Reliability Analysis. New York: Wiley
Mazzuchi TA, Soyer R, (1996) A Bayesian perspective on some replacement strategies. Reliability Engineering & System Safety 51:295–303
O’Hagan A, (1994) Kendall’s Advanced Theory of Statistics Volume 2B: Bayesian Inference. London: Arnold
O’Hagan A, (1998) Eliciting expert beliefs in substantial practical applications. The Statistician 47:21–35
Percy DF, (2002) Bayesian enhanced strategic decision making for reliability. European Journal of Operational Research 139:133–145
Percy DF, (2004) Subjective priors for maintenance models. Journal of Quality in Maintenance Engineering 10:221–227
Percy DF, Kobbacy KAH, Fawzi BB, (1997) Setting preventive maintenance schedules when data are sparse. International Journal of Production Economics 51:223–234
Singpurwalla ND, (1988) An interactive PC-based procedure for reliability assessment incorporating expert opinion and survival data. Journal of the American Statistical Association 83:43–51
Soland RM, (1969) Bayesian analysis of the Weibull process with unknown scale and shape parameters. IEEE Transactions on Reliability 18:181–184
van Noortwijk JM, Dekker A, Cooke RM, Mazzuchi TA, (1992) Expert judgement in maintenance optimization. IEEE Transactions on Reliability 41:427–432

7 Reliability Prediction and Accelerated Testing

E. A. Elsayed

7.1 Introduction

Reliability is one of the key quality characteristics of components, products and systems. It cannot be directly measured and assessed like other quality characteristics but can only be predicted for given times and conditions. Its value depends on the use conditions of the product as well as the time at which it is to be predicted. Reliability prediction has a major impact on critical decisions such as the optimum release time of the product, the type and length of warranty policy and associated duration and cost, and the determination of the optimum maintenance and replacement schedules. Therefore, it is important to provide accurate reliability predictions over time in order to determine accurately the repair, inspection and replacement strategies of products and systems.

Reliability predictions are based on testing a small number of samples or prototypes of the product. Prediction is further complicated by limitations such as the available time to conduct the test and budget constraints, among others. Testing products at design conditions requires extensive time, a large number of units and considerable cost. Clearly some kind of reliability testing, other than testing at normal design conditions, is needed. One of the most commonly used approaches for testing products within the above stated constraints is accelerated life testing (ALT), where units or products are subjected to more severe stress conditions than normal operating conditions to accelerate their failure, and the test results are then used to predict (extrapolate) the reliability at design conditions. This chapter addresses the determination of the optimum maintenance schedule at normal operating conditions while utilizing the results from accelerated testing.

We classify ALT into two types: accelerated failure time testing (AFTT) and accelerated degradation testing (ADT).
AFTT is conducted when accelerated conditions result in the failure of test units without introducing failure mechanisms different from those occurring at normal operating conditions, and when there are “enough” units to be tested at different conditions. Moreover, the economics of conducting AFTT need to be justified, as the test is destructive and its duration is directly related to the reliability of test units and the applied stresses. Finally, testing at stresses far from normal makes it difficult to predict reliability accurately at normal conditions, and in some cases few or no failures are observed even under accelerated conditions, making reliability inference via failure time analysis highly inaccurate, if not impossible.

On the other hand, ADT is a viable alternative to AFTT when the product’s physical characteristics or performance indices leading to failure (e.g. drift in resistance value of a resistor, change in light intensity of light emitting diodes (LED) and loss of strength of a bridge structure) experience degradation over time. Moreover, significant degradation data can be obtained by observing degradation of a small number of units over time. Degradation testing may also be conducted either at normal or accelerated conditions, and no actual failure is required for reliability inference (Liao 2004).

In this chapter, we address the issues associated with conducting accelerated life testing and describe how the reliability models obtained from ALT are used in the determination of the optimum maintenance schedules at normal operating conditions. This chapter is organized as follows. Section 7.1 provides an overview of the role of reliability prediction and the importance of accelerated life testing. In Section 7.2 we present the two most commonly used accelerated life testing types in reliability engineering. The approaches and models for predicting reliability using accelerated life testing are described in Section 7.3, while Section 7.4 focuses on mathematical formulation and solution of the design of accelerated life testing plans. Section 7.5 shows how accelerated life testing is related to maintenance decisions at normal operating conditions. Models to determine the optimum preventive maintenance schedules for both failure time models and degradation models are presented in Section 7.6.
A summary of the chapter is presented in Section 7.7. We begin by describing the ALT types.

7.2 ALT Types

7.2.1 Accelerated Failure Time Testing

It is known that the more reliable the device, the more difficult it is to measure its reliability. In fact, many devices last so long that life testing at normal operating conditions is impractical. Furthermore, testing devices or components at normal operating conditions requires an extensive amount of time and a large number of devices in order to obtain accurate measures of their reliabilities. ALT is commonly used to obtain reliability and failure rate estimates of devices and components in a much shorter time.

A simple way to accelerate the life of many components or products that are used on a continuous time basis, such as tires and light bulbs, is to accelerate time (i.e. run the product at a higher usage rate). It is typically assumed that the number of cycles, hours, etc., to failure during testing is the same as would be observed at the normal usage rate. For example, in evaluating the failure time distribution of light bulbs that are used on average about 6 h per day, one year of operating experience can be compressed into three months by using the light bulb for 24 h every day. The advantage of this type of testing is that no assumptions need to be made about the relationship of the failure time distributions at the accelerated and the normal conditions. However, it is not always true that the number of cycles to failure at a high usage rate is the same as that at the normal usage rate. Moreover, the effect of aging is ignored. Therefore, this type of testing must be run with special care to ensure that product operation and stress remain normal in all regards except usage rate, and that the effect of aging is taken into account, if possible.

An alternative to the above accelerated failure time testing is to accelerate stress (apply stresses more severe than those of the normal conditions) to shorten product or component life. Typical accelerating stresses are temperature, voltage, humidity, pressure, vibration, and fatigue cycling. It is important to recognize the type of stress which indeed accelerates product or component life. Suitable accelerating stresses need to be determined. One may also wish to know how product life depends on several stresses operating simultaneously. In accelerated life testing, the test stress levels should also be controlled. They cannot be so high as to produce other failure modes that rarely occur, or are unlikely to occur, at normal conditions. Yet levels should be high enough to yield enough failures similar to those that occur at the design (operating) stress. The limited range of the stress levels needs to be specified in the test plans to avoid invalid or biased estimates of reliability. The stress loading can be constant, increase (or decrease) continuously or in steps, vary cyclically, vary randomly, or combine these patterns. The choice of stress loading depends on how the product is loaded in service and on practical and theoretical limitations (Shyur 1996).
7.2.2 Accelerated Degradation Testing

In some cases, applying high stresses might not induce failures or result in sufficient data, and reliability inference via failure time analysis becomes highly inaccurate, if not impossible. However, if a product’s physical characteristics or performance indices leading to failure experience degradation over time, then degradation analysis could be a viable alternative to traditional failure time analysis. The advantages of degradation modeling over time-to-failure modeling are significant. Indeed, degradation data may provide more reliability information than would otherwise be available from time-to-failure data with censoring. Moreover, degradation testing may be conducted either at normal or accelerated conditions, and no actual failure is required for reliability inference.

Degradation data needed for reliability inference may be obtained from two categories: the first is field application and the second is degradation testing experiments. The first category requires an extensive data collection system over a long time. Since the collected data are often subject to a highly random stress environment and human errors, the data may exhibit significant volatility and sometimes their accuracy is questionable, limiting their use for reliability inference and prediction. The second category, prognostics, is a process of predicting the future state of a product (or component). Degradation data analysis might be used in this process to minimize field failures and reduce the life-cycle expenses by recommending condition-based maintenance on observed components or systems. Moreover, degradation testing is usually conducted to demonstrate products’ reliability and helps in revealing the main failure mechanisms and the major failure-causing stress factors.


It may be conducted at or close to the normal operating conditions to provide more accurate and precise information for reliability estimates. Yet, to save time and cost, accelerated degradation testing (ADT) is commonly used to obtain immediate data for extrapolating reliability under normal conditions. ADT is conducted by testing units (products or components) at accelerated conditions and measuring their degradation indicators with time. The test can be terminated once “enough” observations are obtained without causing destruction of the test unit (nondestructive testing) if possible. For general purposes, a degradation model along with an inference procedure that can utilize both field degradation data and degradation testing data is preferred, and its potential to be embedded into the development of systems for prognostics purposes is of additional value to the manufacturers.

Reliability assessment using ADT experiments requires an appropriate degradation model, a carefully designed test plan and insightful investigation of the field operating environment in order to achieve high accuracy of the reliability estimates. An appropriate degradation model is one that accurately interprets the effects of the stresses on the degradation process of a product based on its physical properties and the related probability distributions. On the other hand, a carefully designed test plan may improve the accuracy of the developed degradation model and the efficiency of the experiments. The design of the test plan consists of objective functions, several constraints and decision variables such as stress levels, sample allocation ratios at stress levels, frequency of observing and measuring degradation and test termination time. Inappropriate assignments of these decision variables in practice result in inaccurate reliability estimates.

Moreover, it is a challenging and critical issue to consider the stochastic nature of the normal (field) operating conditions in reliability inference from ADT to the normal operating conditions. When field stresses are not deterministic, which is usually the case, their uncertainty will potentially influence the degradation process of the product. If such variations and extremes are ignored in a reliability model, an inaccurate estimate will result, sometimes misleading the judgment for reliability requirements, warranty decisions and the maintenance plans. Therefore, it is important to design robust test plans subject to constraints. The plans should be robust to inaccurate estimates of the model parameters, to the underlying distributions (in case of misspecification) and to the underlying stress-life relationship. Currently, the literature relating ADT to field applications is sparse. Without scientific guidance from the literature, it is hard to produce an appropriate robust design that tolerates the extremes while avoiding “over-design” of the product.

In both accelerated failure time (usage or stress) and degradation testing (normal or stress), robust models that relate the results of the test to the normal operating conditions (or other conditions) are needed. In the following section, we describe such models and discuss their assumptions and limitations.

7.3 ALT Reliability Estimation Models

The accuracy of reliability estimation depends on the models that relate the failure data under severe conditions, or high stress, to that at normal operating conditions, or design stress. Elsayed (1996) classifies these models into three groups: statistics-based models, physics-statistics models, and physics-experimental models. Furthermore, he classifies the statistics-based models into two sub-categories: parametric and nonparametric models. We limit the models in this chapter to the statistics-based models, as they are more general, while the physics-statistics and physics-experimental models are usually developed for particular applications such as fatigue testing, creep testing and electromigration.

7.3.1 Statistics-based Models: Parametric

The failure times at each stress level are used to determine the most appropriate failure time distribution along with its parameters. We refer to these models as AFT (accelerated failure time) models. Parametric models assume that the failure times at different stress levels are related to each other by a common failure time distribution with different parameters. Usually, the shape parameter of the failure time distribution remains unchanged for all stress levels, but the scale parameter may have a multiplicative relationship with the stress levels. For practical purposes, we assume that the time scale transformation (also referred to as the acceleration factor, $A_F > 1$) is constant, which implies that we have a true linear acceleration. The relationships between the accelerated and normal conditions are then summarized as follows (Tobias and Trindade 1986; Elsayed 1996). Let the subscripts $o$ and $s$ refer to the operating conditions and stress conditions, respectively. The relationship between the time to failure at operating conditions and stress conditions is

$$t_o = A_F \times t_s. \tag{7.1}$$

The cumulative distribution functions are related as

$$F_o(t) = F_s\!\left(\frac{t}{A_F}\right). \tag{7.2}$$

The probability density functions are related as

$$f_o(t) = \left(\frac{1}{A_F}\right) f_s\!\left(\frac{t}{A_F}\right). \tag{7.3}$$

The failure rates are given by

$$h_o(t) = \left(\frac{1}{A_F}\right) h_s\!\left(\frac{t}{A_F}\right). \tag{7.4}$$
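Equations 7.2 to 7.4 can be illustrated with a short numerical sketch. It assumes a Weibull failure time distribution at the stress condition; the shape, scale and acceleration factor below are hypothetical values chosen only for illustration. Under these assumptions the operating-condition distribution is again Weibull with the same shape and a scale multiplied by the acceleration factor.

```python
import math

def weibull_cdf(t, shape, scale):
    """Weibull cumulative distribution function."""
    return 1.0 - math.exp(-((t / scale) ** shape))

def weibull_pdf(t, shape, scale):
    """Weibull probability density function."""
    return (shape / scale) * (t / scale) ** (shape - 1) * math.exp(-((t / scale) ** shape))

def weibull_hazard(t, shape, scale):
    """Weibull failure (hazard) rate."""
    return (shape / scale) * (t / scale) ** (shape - 1)

# Hypothetical values, not taken from the chapter.
SHAPE = 1.8           # common shape parameter at all stress levels
SCALE_STRESS = 120.0  # scale parameter (hours) at the stress condition
AF = 25.0             # constant acceleration factor, AF > 1

def cdf_operating(t):
    """Equation 7.2: F_o(t) = F_s(t / AF)."""
    return weibull_cdf(t / AF, SHAPE, SCALE_STRESS)

def pdf_operating(t):
    """Equation 7.3: f_o(t) = (1 / AF) f_s(t / AF)."""
    return weibull_pdf(t / AF, SHAPE, SCALE_STRESS) / AF

def hazard_operating(t):
    """Equation 7.4: h_o(t) = (1 / AF) h_s(t / AF)."""
    return weibull_hazard(t / AF, SHAPE, SCALE_STRESS) / AF
```

Evaluating `cdf_operating(t)` and `weibull_cdf(t, SHAPE, AF * SCALE_STRESS)` at the same `t` gives the same value, which is the linear acceleration assumption expressed on the scale parameter.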


The acceleration factor is obtained by determining the median lives of units tested at two different accelerated stresses and extrapolating to the median life at normal operating stress. It can also be estimated by replacing the medians with other quantiles. The accuracy of the reliability estimates suffers when small samples are tested at the stress conditions, since it becomes difficult to determine the failure time distribution that properly describes these failures. More importantly, the assumption of having the same failure time distribution at different stress levels is difficult to justify, especially when small numbers of failures are observed. In these cases, it is more appropriate to use nonparametric models, as described next.

7.3.2 Statistics-based Models: Nonparametric

Nonparametric models relax the requirement of a common failure time distribution, i.e., no common failure time distribution is required among all stress levels. Several nonparametric models have been developed and validated in recent years. We describe these models below.

7.3.2.1 Proportional Hazards Model

Cox’s Proportional Hazards (PH) model (Cox 1972, 1975) is the most popular nonparametric model. It has become the standard nonparametric regression model for accelerated life testing in the past few years. The PH model is distribution-free, requiring only that the ratio of hazard rates between two stress levels be constant with time. The proportional hazards model has the following form:

$$\lambda(t; z) = \lambda_0(t) \exp(\beta z) \tag{7.5}$$

The baseline hazard function $\lambda_0(t)$ is an arbitrary function; it is modified multiplicatively by the covariates (i.e., applied stresses). Elsayed and Zhang (2006) assume $\lambda_0(t)$ to be linear with time: $\lambda_0(t) = \gamma_0 + \gamma_1 t$. Substituting $\lambda_0(t)$ into the PH model, we obtain $\lambda(t; z) = (\gamma_0 + \gamma_1 t)\exp(\beta z)$, where $z = (z_1, z_2, \ldots, z_p)^T$ is a column vector of the covariates (or applied stresses). For ALT, the column vector represents the stresses used in the test and/or their interactions. $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$ is a row vector of the unknown coefficients corresponding to the covariates $z$. These coefficients can be estimated using a partial likelihood estimation procedure. This model usually produces “good” reliability estimates with failure data for which the proportional hazards assumption holds, and even when it does not hold exactly.

7.3.2.2 Extended Linear Hazards Regression Model

The PH and AFT models have different assumptions. The only model that satisfies both assumptions is the Weibull regression model (Kalbfleisch and Prentice 2002). For generalization, the Extended Hazard Regression (EHR) model (Ciampi and Etezadi-Amoli 1985; Etezadi-Amoli and Ciampi 1987; Shyur et al. 1999) is proposed to combine the PH and AFT models into one form:

$$\lambda(t; z) = \lambda_0\!\left(e^{z'\beta}\, t\right) \exp(z'\alpha) \tag{7.6}$$

The unknowns of this model are the regression coefficients $\alpha$, $\beta$ and the unspecified baseline hazard function $\lambda_0(t)$. The model reflects that the covariate $z$ has both a time-scale changing effect and a hazard multiplicative effect. It becomes the PH model when $\beta = 0$ and the AFT model when $\alpha = \beta$. Elsayed et al. (2006) propose a new model called the Extended Linear Hazard Regression (ELHR) model. The ELHR model (e.g., with one covariate) assumes those coefficients to change linearly with time:

$$\lambda(t; z) = \lambda_0\!\left(t\, e^{(\beta_0 + \beta_1 t) z}\right) \exp\!\left((\alpha_0 + \alpha_1 t) z\right) \tag{7.7}$$
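The special cases mentioned above can be checked with a minimal sketch of Equation 7.7 for a single covariate. The quadratic baseline and all numerical coefficients are hypothetical; the function names are illustrative only.

```python
import math

def quadratic_baseline(t, g0=0.01, g1=0.002, g2=0.0005):
    """Hypothetical baseline hazard lambda_0(t); coefficients are made up."""
    return g0 + g1 * t + g2 * t * t

def elhr_hazard(t, z, baseline, alpha0, alpha1=0.0, beta0=0.0, beta1=0.0):
    """ELHR hazard of Equation 7.7 for one covariate z."""
    scaled_time = t * math.exp((beta0 + beta1 * t) * z)
    return baseline(scaled_time) * math.exp((alpha0 + alpha1 * t) * z)

# PH special case: no time-scale effect and constant coefficients.
ph_value = elhr_hazard(3.0, 2.0, quadratic_baseline, alpha0=0.4)

# AFT special case: alpha equals beta and both are constant in time.
aft_value = elhr_hazard(3.0, 2.0, quadratic_baseline, alpha0=0.3, beta0=0.3)
```

With `beta0 = beta1 = alpha1 = 0` the function returns $\lambda_0(t)\exp(\alpha_0 z)$, the PH form of Equation 7.5; with `alpha1 = beta1 = 0` and `alpha0 = beta0` it returns $\lambda_0(e^{\beta_0 z} t)\, e^{\beta_0 z}$, the AFT form.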

The model considers the proportional hazards effect, the time-scale changing effect and the time-varying coefficients effect. It encompasses all previously developed models as special cases. It may provide a refined model fit to failure time data and a better representation of complex failure processes. Since the covariate coefficients and the unspecified baseline hazard cannot be expressed separately, the partial likelihood method is not suitable for estimating the unknown parameters. Elsayed et al. (2006) propose the maximum likelihood method, which requires the baseline hazard function to be specified in a parametric form. In the EHR model, the baseline hazard function has two specific forms: one is a quadratic function and the other is a quadratic spline. In the proposed ELHR model, we assume the baseline hazard function $\lambda_0(t)$ to be a quadratic function:

$$\lambda_0(t) = \gamma_0 + \gamma_1 t + \gamma_2 t^2 \tag{7.8}$$

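Substituting $\lambda_0(t)$ into the ELHR model yields

$$\lambda(t; z) = \gamma_0 e^{\alpha_0 z + \alpha_1 z t} + \gamma_1 t\, e^{\theta_0 z + \theta_1 z t} + \gamma_2 t^2 e^{\omega_0 z + \omega_1 z t} \tag{7.9}$$

where $\theta_0 = \alpha_0 + \beta_0$, $\theta_1 = \alpha_1 + \beta_1$, $\omega_0 = \alpha_0 + 2\beta_0$ and $\omega_1 = \alpha_1 + 2\beta_1$.

The cumulative hazard rate function is obtained as

$$\begin{aligned}
\Lambda(t; z) &= \int_0^t \lambda(u; z)\,du = \int_0^t \gamma_0 e^{\alpha_0 z + \alpha_1 z u}\,du + \int_0^t \gamma_1 u\, e^{\theta_0 z + \theta_1 z u}\,du + \int_0^t \gamma_2 u^2 e^{\omega_0 z + \omega_1 z u}\,du \\
&= \frac{\gamma_0}{\alpha_1 z} e^{\alpha_0 z + \alpha_1 z t} - \frac{\gamma_0}{\alpha_1 z} e^{\alpha_0 z} + \frac{\gamma_1 t}{\theta_1 z} e^{\theta_0 z + \theta_1 z t} - \frac{\gamma_1}{(\theta_1 z)^2} e^{\theta_0 z + \theta_1 z t} + \frac{\gamma_1}{(\theta_1 z)^2} e^{\theta_0 z} \\
&\quad + \frac{\gamma_2 t^2}{\omega_1 z} e^{\omega_0 z + \omega_1 z t} - \frac{2\gamma_2 t}{(\omega_1 z)^2} e^{\omega_0 z + \omega_1 z t} + \frac{2\gamma_2}{(\omega_1 z)^3} e^{\omega_0 z + \omega_1 z t} - \frac{2\gamma_2}{(\omega_1 z)^3} e^{\omega_0 z}
\end{aligned}$$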

The reliability function $R(t; z)$ and the probability density function $f(t; z)$ are obtained as

$$R(t; z) = \exp(-\Lambda(t; z)), \qquad f(t; z) = \lambda(t; z) \exp(-\Lambda(t; z)).$$
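As a check on the closed-form cumulative hazard of the ELHR model with a quadratic baseline, the following sketch compares it with a numerical integration (Simpson's rule) of the hazard in Equation 7.9. All parameter values are hypothetical, and the closed form requires $\alpha_1 z$, $\theta_1 z$ and $\omega_1 z$ to be non-zero.

```python
import math

def _coefficients(alpha, beta):
    """Return (a0, a1, theta0, theta1, omega0, omega1) as defined after Eq. 7.9."""
    a0, a1 = alpha
    b0, b1 = beta
    return a0, a1, a0 + b0, a1 + b1, a0 + 2.0 * b0, a1 + 2.0 * b1

def elhr_hazard(t, z, g, alpha, beta):
    """Hazard of Equation 7.9; g = (gamma0, gamma1, gamma2)."""
    g0, g1, g2 = g
    a0, a1, th0, th1, om0, om1 = _coefficients(alpha, beta)
    return (g0 * math.exp(a0 * z + a1 * z * t)
            + g1 * t * math.exp(th0 * z + th1 * z * t)
            + g2 * t * t * math.exp(om0 * z + om1 * z * t))

def elhr_cumulative_hazard(t, z, g, alpha, beta):
    """Closed-form cumulative hazard; c1, c2, c3 must all be non-zero."""
    g0, g1, g2 = g
    a0, a1, th0, th1, om0, om1 = _coefficients(alpha, beta)
    c1, c2, c3 = a1 * z, th1 * z, om1 * z
    term0 = g0 / c1 * (math.exp(a0 * z + c1 * t) - math.exp(a0 * z))
    term1 = g1 * (t / c2 * math.exp(th0 * z + c2 * t)
                  - math.exp(th0 * z + c2 * t) / c2 ** 2
                  + math.exp(th0 * z) / c2 ** 2)
    term2 = g2 * (t * t / c3 * math.exp(om0 * z + c3 * t)
                  - 2.0 * t / c3 ** 2 * math.exp(om0 * z + c3 * t)
                  + 2.0 / c3 ** 3 * math.exp(om0 * z + c3 * t)
                  - 2.0 / c3 ** 3 * math.exp(om0 * z))
    return term0 + term1 + term2

def simpson(func, lower, upper, intervals=2000):
    """Composite Simpson's rule; intervals must be even."""
    h = (upper - lower) / intervals
    total = func(lower) + func(upper)
    for k in range(1, intervals):
        total += (4.0 if k % 2 else 2.0) * func(lower + k * h)
    return total * h / 3.0

# Hypothetical parameter values for illustration only.
G = (0.01, 0.002, 0.0005)
ALPHA = (0.5, 0.01)
BETA = (0.2, 0.005)
Z = 1.5

def reliability(t):
    """R(t; z) = exp(-Lambda(t; z)) at the covariate value Z."""
    return math.exp(-elhr_cumulative_hazard(t, Z, G, ALPHA, BETA))
```

Comparing `elhr_cumulative_hazard(10.0, Z, G, ALPHA, BETA)` with `simpson(lambda u: elhr_hazard(u, Z, G, ALPHA, BETA), 0.0, 10.0)` is a quick way to catch a sign or index slip when coding the nine closed-form terms.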

Although the ELHR model is developed based on the distribution-free concept, a close investigation of the model reveals its capability of capturing the features of commonly used failure time distributions. The main limitation of this model is that “good” estimates of its many parameters require a large number of test units.

7.3.2.3 Proportional Mean Residual Life Model

Oakes and Dasu (1990) originally proposed the concept of the Proportional Mean Residual Life (PMRL) by analogy with the PH model. Two survivor distributions $F(t)$ and $F_0(t)$ are said to have PMRL if

$$e(x) = \theta\, e_0(x) \tag{7.10}$$

where $e(x)$ is the mean residual life at time $x$. We extend the model to a more general framework with a covariate vector $Z$ (applied stress):

$$e(t \mid z) = \exp(\beta^T z)\, e_0(t) \tag{7.11}$$
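A minimal numerical sketch of Equation 7.11 follows. It computes the baseline mean residual life from its definition, $e_0(t) = \int_t^\infty R_0(u)\,du / R_0(t)$, by the trapezoidal rule and scales it by $\exp(\beta z)$. The exponential baseline, its rate, and the truncation of the integral at a finite upper limit are assumptions made only for illustration.

```python
import math

def baseline_reliability(t, rate=0.02):
    """Hypothetical exponential baseline R_0(t); its MRL is the constant 1/rate."""
    return math.exp(-rate * t)

def mean_residual_life(reliability, t, upper=2000.0, steps=20000):
    """e(t) = integral of R(u) from t to `upper`, divided by R(t).

    `upper` truncates the infinite integral and must be large enough
    for R(upper) to be negligible.
    """
    h = (upper - t) / steps
    total = 0.5 * (reliability(t) + reliability(upper))
    for k in range(1, steps):
        total += reliability(t + k * h)
    return total * h / reliability(t)

def pmrl(t, z, beta):
    """Equation 7.11 with a scalar covariate z."""
    return math.exp(beta * z) * mean_residual_life(baseline_reliability, t)
```

For example, `pmrl(10.0, 1.0, -0.5)` returns the baseline mean residual life at `t = 10` reduced by the factor `exp(-0.5)`; the negative coefficient is a hypothetical choice representing a stress that shortens residual life.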

We refer to this model as the proportional mean residual life regression model, which is used to model accelerated life testing. Clearly, $e_0(t)$ serves as the MRL corresponding to a baseline reliability function $R_0(t)$ and is called the baseline mean residual life function; $e(t \mid z)$ is the conditional mean residual life function of $T - t$ given $T > t$ and $Z = z$, where $z^T = (z_1, z_2, \ldots, z_p)$ is the vector of covariates, $\beta^T = (\beta_1, \beta_2, \ldots, \beta_p)$ is the vector of coefficients associated with the covariates, and $p$ is the number of covariates. Typically, we can experimentally obtain $\{(t_i, z_i),\ i = 1, 2, \ldots, n\}$, the set of failure times and the vectors of covariates for each unit (Zhao and Elsayed 2005). The main assumption of this model is the proportionality of mean residual lives with applied stresses. In other words, the mean residual life of a unit subjected to high stress is proportional to the mean residual life of a unit subjected to low stress.

7.3.2.4 Proportional Odds Model

In many applications, however, it is often unreasonable to assume that the effects of covariates on the hazard rates remain fixed over time. Brass (1971) observes that the ratio of the death rates, or hazard rates, of two populations under different stress levels (for example, one population of smokers and the other of nonsmokers) is not constant with age, or time, but follows a more complicated course, in particular converging closer to unity for older people. So the PH model is not suitable for this case. Brass (1974) proposes a more realistic model: the proportional odds (PO) model. The proportional odds model has been successfully used in categorical data analysis (McCullagh 1980; Agresti and Lang 1993) and survival analysis (Hannerz 2001) in the medical fields. The PO model has a distinctly different assumption on proportionality, and is complementary to the PH model. It had not previously been used in reliability analysis of accelerated life testing. Zhang and Elsayed (2005) extend this model for reliability estimation using ALT data. We describe the PO model as follows. Let $T > 0$ be a failure time associated with stress level $z$, with cumulative distribution function $F(t; z)$, and let the ratio $\dfrac{F(t; z)}{1 - F(t; z)}$, or $\dfrac{1 - R(t; z)}{R(t; z)}$, be the odds on failure by time $t$. The PO model is then expressed as

$$\frac{F(t; z)}{1 - F(t; z)} = \exp(\beta z)\, \frac{F_0(t)}{1 - F_0(t)} \tag{7.12}$$

where $F_0(t) \equiv F(t; z = 0)$ is the baseline cumulative distribution function and $\beta$ is the unknown regression parameter. Let $\theta(t; z)$ denote the odds function; then the above PO model is transformed to

$$\theta(t; z) = \exp(\beta z)\, \theta_0(t) \tag{7.13}$$

where $\theta_0(t) \equiv \theta(t; z = 0)$ is the baseline odds function. For two failure time samples with stress levels $z_1$ and $z_2$, the difference between the respective log odds functions is

$$\log[\theta(t; z_1)] - \log[\theta(t; z_2)] = \beta (z_1 - z_2),$$

which is independent of the baseline odds function $\theta_0(t)$ and the time $t$. Hence, the odds functions are constantly proportional to each other. The baseline odds function could be any monotone increasing function of time $t$ with the property $\theta_0(0) = 0$. When $\theta_0(t) = t^{\varphi}$, the PO model presented by Equation 7.13 becomes the log-logistic accelerated failure time model (Bennett 1983), which is a special case of the general PO models. In order to use the PO model in predicting reliability at normal operating conditions, it is important that both the baseline function and the covariate parameter, $\beta$, be estimated accurately. Since the baseline odds function of the general PO models could be any monotone increasing function, it is important to define a viable baseline odds function structure that approximates most, if not all, of the possible odds functions. In order to find such a “universal” baseline odds function, we investigate the properties of the odds function and its relation to the hazard rate function. The odds function $\theta(t)$ is given by

$$\theta(t) = \frac{F(t)}{1 - F(t)} = \frac{1 - R(t)}{R(t)} = \frac{1}{R(t)} - 1 \tag{7.14}$$

From the properties of the reliability function and its relation to the odds function shown in Equation 7.14, we can easily derive the following properties of the odds function $\theta(t)$:

1. $\theta(0) = 0$, $\theta(\infty) = \infty$
2. $\theta(t)$ is a monotonically increasing function of time
3. $\theta(t) = \dfrac{1 - \exp[-\Lambda(t)]}{\exp[-\Lambda(t)]} = \exp[\Lambda(t)] - 1$, and $\Lambda(t) = \ln[\theta(t) + 1]$
4. $\lambda(t) = \dfrac{\theta'(t)}{\theta(t) + 1}$
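The properties above, together with the constant log-odds difference of the PO model, can be illustrated with a short sketch. It assumes the log-logistic baseline $\theta_0(t) = t^{\varphi}$ mentioned earlier; the values of $\varphi$ and $\beta$ are hypothetical, and the derivative $\theta'(t)$ is approximated by a central difference.

```python
import math

def baseline_odds(t, phi=1.5):
    """Baseline odds theta_0(t) = t**phi (log-logistic case); phi is hypothetical."""
    return t ** phi

def odds(t, z, beta, phi=1.5):
    """PO model of Equation 7.13: theta(t; z) = exp(beta * z) * theta_0(t)."""
    return math.exp(beta * z) * baseline_odds(t, phi)

def reliability_from_odds(theta):
    """From Equation 7.14: R = 1 / (theta + 1)."""
    return 1.0 / (1.0 + theta)

def cumulative_hazard_from_odds(theta):
    """Property 3: Lambda = ln(theta + 1)."""
    return math.log(theta + 1.0)

def hazard_from_odds(t, z, beta, phi=1.5, step=1e-6):
    """Property 4: lambda(t) = theta'(t) / (theta(t) + 1), central difference."""
    derivative = (odds(t + step, z, beta, phi) - odds(t - step, z, beta, phi)) / (2.0 * step)
    return derivative / (odds(t, z, beta, phi) + 1.0)

def log_odds_difference(t, z1, z2, beta):
    """Equals beta * (z1 - z2) for every t under the PO model."""
    return math.log(odds(t, z1, beta)) - math.log(odds(t, z2, beta))
```

Calling `log_odds_difference` at several values of `t` returns the same number, which is the proportionality assumption of the PO model stated on the log scale.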

Further investigation of such a “universal” odds function shows that it can be approximated by a polynomial function.

An appropriate ALT model is important since it explains the influence of the stresses on the expected life of a product based on its physical properties and the related statistical properties. On the other hand, a carefully designed test plan improves the accuracy and efficiency of the reliability estimation. The design of an accelerated life testing plan consists of the formulation of the objective function, the determination of constraints and the definition of the decision variables, such as stress levels, sample size, allocation of test units to each stress level, stress level changing time and test termination time. Inappropriate values of the decision variables result in inaccurate reliability estimates and/or unnecessary use of test resources. Thus it is important to design test plans that minimize the objective function under specific time and cost constraints.


7.4 Design of Accelerated Life Testing Plans

Conducting an accelerated life test (ALT) requires the determination or development of a reliability inference model that relates the failure data at stress conditions to design or operating conditions. Moreover, an accelerated test plan needs to be developed to obtain appropriate and sufficient information in order to estimate reliability performance accurately at operating conditions. A test plan requires the identification of the type of stresses to be applied, stress levels, methods of stress application (constant, ramp, cyclic), number of units at every stress level, minimum number of failures at every stress level, optimum test duration, frequency of test data collection and other test parameters. Indeed, without an optimum test plan, it is likely that a long sequence of expensive and time-consuming tests will be conducted, which might delay the product release or, in some cases, lead to termination of the entire product. In this section, we describe the procedure for designing an optimum test plan based on the proportional hazards model, followed by a numerical example. Optimum test plans based on other ALT models can be developed in a similar fashion.

7.4.1 Design of ALT Plans

An ALT plan requires the determination of the type of stress, the method of applying stress, the stress levels, the number of units to be tested at each stress level and an applicable accelerated life testing model that relates the failure times at accelerated conditions to those at normal conditions. When designing an ALT, we need to address the following issues: (a) select the stress types to use in the experiment, (b) determine the stress levels for each stress type selected, (c) determine the proportion of devices to be allocated to each stress level (Elsayed and Jiao 2002). We refer the reader to Meeker and Escobar (1998) and Nelson (2004) for other approaches to the design of ALT plans.

We consider the selection of the stress levels $z_i$ and the proportion of devices $p_i$ to allocate to each $z_i$ such that the most accurate reliability estimate at use conditions $z_D$ can be obtained. We consider two types of censoring. Type I censoring involves running each test unit until a prespecified time; the censoring times are fixed and the number of failures is random. Type II censoring involves simultaneously testing units until a prespecified number of them fail; the censoring time is random while the number of failures is fixed. We use the following notation:

ln                   Natural logarithm
ML                   Maximum likelihood
$n$                  Total number of test units
$z_H, z_M, z_L$      High, medium and low stress levels, respectively
$z_D$                Specified design stress
$p_1, p_2, p_3$      Proportion of test units allocated to $z_L$, $z_M$ and $z_H$, respectively
$T$                  Pre-specified period of time over which the reliability estimate is of interest
$R(t; z)$            Reliability at time $t$, for given $z$
$f(t; z)$            Pdf at time $t$, for given $z$
$F(t; z)$            Cdf at time $t$, for given $z$
$\Lambda(t; z)$      Cumulative hazard function at time $t$, for given $z$
$\lambda_0(t)$       Unspecified baseline hazard function at time $t$

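We assume the baseline hazard function $\lambda_0(t)$ to be linear with time:

$$\lambda_0(t) = \gamma_0 + \gamma_1 t$$

Substituting $\lambda_0(t)$ into the PH model given by Equation 7.5, we obtain

$$\lambda(t; z) = (\gamma_0 + \gamma_1 t) \exp(\beta z)$$

We obtain the corresponding cumulative hazard function $\Lambda(t; z)$ and the variance of the hazard function as

$$\Lambda(t; z) = \left(\gamma_0 t + \frac{\gamma_1 t^2}{2}\right) e^{\beta z}$$

$$\mathrm{Var}\!\left[(\hat\gamma_0 + \hat\gamma_1 t)\, e^{\hat\beta z_D}\right] = \left(\mathrm{Var}[\hat\gamma_0] + \mathrm{Var}[\hat\gamma_1]\, t^2\right) e^{2\left(\hat\beta z_D + \mathrm{Var}[\hat\beta]\, z_D^2\right)} + e^{2\hat\beta z_D + \mathrm{Var}[\hat\beta]\, z_D^2} \left(e^{\mathrm{Var}[\hat\beta]\, z_D^2} - 1\right) (\gamma_0 + \gamma_1 t)^2$$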
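The expressions above translate directly into code. The sketch below is illustrative only: the function names and every numerical value passed to them are hypothetical. The variance expression has the form obtained when $\hat\beta$ is treated as normally distributed and independent of uncorrelated $\hat\gamma_0$ and $\hat\gamma_1$; that reading is an assumption, not something stated in the text.

```python
import math

def ph_hazard(t, z, g0, g1, beta):
    """PH hazard with linear baseline: (g0 + g1 * t) * exp(beta * z)."""
    return (g0 + g1 * t) * math.exp(beta * z)

def ph_cumulative_hazard(t, z, g0, g1, beta):
    """Lambda(t; z) = (g0 * t + g1 * t**2 / 2) * exp(beta * z)."""
    return (g0 * t + 0.5 * g1 * t * t) * math.exp(beta * z)

def ph_reliability(t, z, g0, g1, beta):
    """R(t; z) = exp(-Lambda(t; z))."""
    return math.exp(-ph_cumulative_hazard(t, z, g0, g1, beta))

def hazard_variance(t, z_d, g0, g1, beta, var_g0, var_g1, var_beta):
    """Variance of the estimated hazard at the design stress z_d."""
    s2 = var_beta * z_d ** 2
    first = (var_g0 + var_g1 * t * t) * math.exp(2.0 * (beta * z_d + s2))
    second = (math.exp(2.0 * beta * z_d + s2) * (math.exp(s2) - 1.0)
              * (g0 + g1 * t) ** 2)
    return first + second
```

Integrating `hazard_variance` over $[0, T]$ (for example with a simple quadrature rule) gives the quantity that the test plan in the next subsection seeks to minimize.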

7.4.1.1 Formulation of the Test Plan

Under the constraints of available test units, test time and specification of the minimum number of failures at each stress level, the problem is to allocate stress levels and test units optimally so that the asymptotic variance of the hazard rate estimate at normal conditions is minimized over a prespecified period of time $T$. If we consider three stress levels, then the optimal decision variables $(z_L^*, z_M^*, p_1^*, p_2^*, p_3^*)$ are obtained by solving the following optimization problem with a nonlinear objective function and both linear and nonlinear constraints:

$$\min \int_0^T \mathrm{Var}\!\left[(\hat\gamma_0 + \hat\gamma_1 t)\, e^{\hat\beta z_D}\right] dt$$

subject to

$$\begin{aligned}
&\Sigma = F^{-1} \\
&0 < p_i < 1, \quad i = 1, 2, 3 \\
&\sum_{i=1}^{3} p_i = 1 \\
&z_D < z_L < z_M < z_H \\
&n p_i \Pr[t \le \tau \mid z_i] \ge \mathrm{MNF}, \quad i = 1, 2, 3
\end{aligned}$$
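The constraints of this formulation can be screened for a candidate plan before any optimization is attempted. The sketch below assumes the PH model with a linear baseline hazard for $\Pr[t \le \tau \mid z_i]$, reads MNF as the minimum number of failures required at each stress level, and uses hypothetical parameter values and coded stress levels; it checks feasibility only and does not evaluate the objective function.

```python
import math

def prob_failure_by(tau, z, g0, g1, beta):
    """Pr[t <= tau | z] under the PH model with baseline hazard g0 + g1 * t."""
    cumulative_hazard = (g0 * tau + 0.5 * g1 * tau ** 2) * math.exp(beta * z)
    return 1.0 - math.exp(-cumulative_hazard)

def plan_is_feasible(stresses, proportions, z_design, n_units, tau, mnf,
                     g0, g1, beta):
    """Check the constraints for stresses (z_L, z_M, z_H) and (p_1, p_2, p_3)."""
    if any(p <= 0.0 or p >= 1.0 for p in proportions):
        return False
    if abs(sum(proportions) - 1.0) > 1e-9:
        return False
    ordered = [z_design] + list(stresses)
    if any(low >= high for low, high in zip(ordered, ordered[1:])):
        return False
    # Expected number of failures at each level must reach the minimum.
    return all(n_units * p * prob_failure_by(tau, z, g0, g1, beta) >= mnf
               for z, p in zip(stresses, proportions))

# Hypothetical planning values and coded stress levels.
PLANNING = dict(g0=0.001, g1=0.0001, beta=2.0)
feasible = plan_is_feasible((0.4, 0.7, 1.0), (0.5, 0.3, 0.2), z_design=0.0,
                            n_units=200, tau=300.0, mnf=25, **PLANNING)
```

A feasibility filter of this kind is typically called inside a search over $(z_L, z_M, p_1, p_2, p_3)$ so that only admissible plans are scored.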

In this formulation, MNF is the minimum number of failures and $\Sigma$ is the inverse of the Fisher information matrix. Other objective functions can be formulated, which result in different designs of the test plans. These functions include the D-optimal design, which provides efficient estimates of the parameters of the distribution. It allows relatively efficient determination of all quantiles of the population, but the estimates are distribution dependent.

7.4.1.2 Numerical Example

An accelerated life test is to be conducted at three temperature levels for MOS capacitors in order to estimate their life distribution at a design temperature of 50 °C. The test needs to be completed in 300 h. The total number of items to be placed under test is 200 units. To avoid the introduction of failure mechanisms other than those expected at the design temperature, it has been decided, through engineering judgment, that the testing temperature should not exceed 250 °C. The minimum number of failures for each of the three temperatures is specified as 25. Furthermore, the experiment should provide the most accurate reliability estimate over a 10-year period of time. Consider three stress levels; then the formulation of the objective function and the test constraints follows the formulation given in the above section. The optimum plan derived (Elsayed and Jiao 2002), which optimizes the objective function and meets the constraints, is as follows:

$$z_L = 160\,^\circ\mathrm{C}, \quad z_M = 190\,^\circ\mathrm{C}, \quad z_H = 250\,^\circ\mathrm{C}$$

The corresponding allocations of units to each temperature level are:

$$p_1 = 0.5, \quad p_2 = 0.4, \quad p_3 = 0.1$$

7.4.1.3 Concluding Remarks

The design of ALT plans plays a major role in providing accurate estimates of reliability, mean time to failure and the variance of failure time at normal operating conditions. These estimates have a major impact on many decisions during the product life cycle, such as maintenance schedules, warranty and repair policies and replacement times.
Therefore, the test plans should be robust (Pascual 2006), i.e., they should be:

1. Robust to planning values of the model parameters. This implies that ALTs conducted at three or more stresses are more robust than those conducted at two stresses. Allocating more units at the low stress level will also improve the robustness of the plan.
2. Robust to the type of the underlying distribution. In other words, misspecification of the underlying distribution should not result in significant errors in calculating reliability characteristics.
3. Robust to the underlying stress-life relationship. The commonly used concept that higher stresses result in more failures might result in the “wrong” stress-life relationship. For example, testing circuit packs at higher temperature reduces humidity, which in turn results in fewer failures than those at field conditions. In essence, this is a deceleration test (higher stresses show fewer failures).


7.5 Relating ALT Results to Maintenance Decisions at Normal Operating Conditions

It is important to note that it is not necessary to conduct destructive ALT when the product’s characteristics can be monitored through degradation with time. For example, light emitting diodes (LEDs) are likely to experience degradation in light intensity before they are deemed completely unsuitable for use. In such cases it is important to conduct accelerated degradation testing (ADT). The threshold level at which a unit is considered unacceptable might also be taken as the threshold level for replacement or maintenance (if possible). In a typical experiment the threshold level is set as the level at which the light intensity drops to 50% of its original value. This threshold level is set based on engineering and users’ experience. Of course, an optimum level can be determined based on other factors such as economics, maintenance strategy, availability of maintenance crew and others. It should be noted that this level is set for accelerated conditions. Clearly, the determination of the optimum maintenance schedule at normal operating conditions depends on many factors, as follows.

(1) The variance of the time to failure at normal conditions is much larger than that of the ADT, as shown in Figure 7.1.
(2) The failure time distribution or the degradation paths at accelerated conditions are directly related to the failure rate or degradation rate. Higher accelerated stresses result in higher rates, as shown in Figure 7.2. Thus the failure rate at normal conditions requires careful evaluation, as it directly affects the maintenance schedule.
(3) Since there are no universal normal operating conditions, but a distribution is likely to describe these conditions, the maintenance threshold level will be greatly affected by such a distribution.
(4) The repair rate in field conditions is likely to be different from that of the ALT.
(5) The effect of aging at stress conditions is not captured.
(6) When a unit is repaired it is not considered as good as new; consequently the time to next failure is shorter.

Therefore, the maintenance threshold level needs to be optimally determined so that the total maintenance cost is minimized or the system availability is maximized, as discussed in Section 7.7.

Figure 7.1. Distributions of the time to failure at stress and normal conditions

Figure 7.2. Distributions of degradation paths with time at different stress levels (axes: Time vs. Degradation Path; paths shown for 40 °C, 60 °C and 80 °C, with a Threshold line)

In order to determine the optimum maintenance schedule at normal operating conditions using accelerated testing results, one needs to perform the following two steps:

1. Relate the reliability function at stress conditions to that at normal conditions by developing an appropriate model using the approaches discussed earlier in this chapter or using an ADT model as described in Eghbali and Elsayed (2001), Liao (2004) and Meeker and Escobar (1998).
2. Relate the maintenance threshold level to the operating conditions. For example, when the stress at operating conditions is higher than the mean of the normal conditions, a lower threshold level is used. Similarly, when the stress at operating conditions is lower than the mean of the normal operating conditions, a higher threshold level is used.

The first step has been discussed in Section 7.3 and the second step will be discussed in Section 7.6.

7.6 Determination of the Optimum Preventive Maintenance Schedule and Optimum Threshold Degradation Level at Normal Conditions

The optimum preventive maintenance schedule at operating conditions can be determined by relating the reliability function at accelerated conditions to that at normal conditions and then using an optimization function that relates reliability to the preventive maintenance schedule. In Section 7.6.1 we demonstrate these steps through an example. Another approach for determining the optimum preventive maintenance for degrading systems is to determine the optimum threshold degradation level at which maintenance actions are taken, by minimizing the overall cost of maintenance or by ensuring a minimum acceptable system availability level (Liao et al. 2005). This will be illustrated in Section 7.6.2.

7.6.1 Optimum Preventive Maintenance Schedule at Operating Conditions

The first step is to relate the accelerated testing results to stress conditions and obtain a reliability expression which is a function of the applied stresses. We then substitute the normal operating conditions in the expression to obtain a reliability function at normal conditions. We illustrate this by designing an optimum test plan and then using its results to obtain the reliability expression. Suppose we develop an accelerated life test plan for a certain type of electronic device using two stresses: temperature and electric voltage. The reliability estimate at the design condition over a 10-year period of time is of interest. The design condition is characterized by 50 °C and 5 V. From engineering judgment, the highest levels (upper bounds) of temperature and voltage are pre-specified as 250 °C and 10 V, respectively. The allowed test duration is 200 h, and the total number of devices placed under test is 200. The minimum number of failures at any test combination is specified as 10. The test plan is determined through the following steps:

1. According to the Arrhenius model, we use 1/(absolute temperature) as the first covariate $z_1$ and 1/(voltage) as the second covariate $z_2$ in the ALT model.
2. The PH model is used in conducting reliability data analysis and designing the optimal ALT plan using the approach described in Section 7.4.1.1. The model is given by

$$\lambda(t; z) = \lambda_0(t) \exp(\beta_1 z_1 + \beta_2 z_2)$$

where $\lambda_0(t) = \gamma_0 + \gamma_1 t + \gamma_2 t^2$.
3. A baseline experiment is conducted to obtain initial estimates for the model parameters. These values are $\hat\gamma_0 = 0.0001$, $\hat\gamma_1 = 0.5$, $\hat\gamma_2 = 0$, $\hat\beta_1 = -3800$ and $\hat\beta_2 = -10$. Approximating $\hat\gamma_0$ by zero, we write the hazard rate function as

$$\lambda(t; T, V) = 0.5\, t\, e^{-\left(\frac{3800}{T} + \frac{10}{V}\right)} \tag{7.15}$$

The reliability and the probability density function (pdf) expressions are respectively given as

$$R(t; T, V) = \exp\!\left[-\left(e^{-0.25\left((3800/T) + 10/V\right)}\, t\right)^2\right] \tag{7.16}$$

$$f(t; T, V) = 0.5\, t \exp\!\left[-\left(e^{-0.25\left((3800/T) + 10/V\right)}\, t\right)^2\right] \tag{7.17}$$

Assume that the normal operating temperature is 30 °C and the normal operating voltage is 5 V. Substituting in Equations 7.16 and 7.17 yields

$$R_n(t) = R(t; 30\,^\circ\mathrm{C}, 5\,\mathrm{V}) = \exp\!\left[-\left(e^{-3.6336}\, t\right)^2\right] \tag{7.18}$$

$$f_n(t) = f(t; 30\,^\circ\mathrm{C}, 5\,\mathrm{V}) = 0.5\, t \exp\!\left[-\left(e^{-3.6336}\, t\right)^2\right] \tag{7.19}$$
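The constant 3.6336 in Equations 7.18 and 7.19 can be checked by evaluating $0.25\left((3800/T) + 10/V\right)$ at the normal operating conditions. The sketch below assumes that $T$ is the absolute temperature in kelvin, obtained by adding 273.15 to the Celsius value; the function names are illustrative.

```python
import math

KELVIN_OFFSET = 273.15  # assumption: T in the model is absolute temperature (K)

def stress_exponent(temp_celsius, voltage, b_temp=3800.0, b_volt=10.0):
    """Return 3800/T + 10/V with T converted from Celsius to kelvin."""
    absolute_temp = temp_celsius + KELVIN_OFFSET
    return b_temp / absolute_temp + b_volt / voltage

def hazard_rate(t, temp_celsius, voltage, gamma1=0.5):
    """Hazard of Equation 7.15: gamma1 * t * exp(-(3800/T + 10/V))."""
    return gamma1 * t * math.exp(-stress_exponent(temp_celsius, voltage))

# Constant appearing in Equations 7.18 and 7.19 (30 degrees C, 5 V).
normal_constant = 0.25 * stress_exponent(30.0, 5.0)
```

Under this assumption `normal_constant` agrees with the printed value 3.6336 to three decimal places, and `hazard_rate` shows how strongly the hazard of Equation 7.15 rises between the normal condition and the upper test bounds of 250 °C and 10 V.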

In the second step, we choose an appropriate preventive maintenance (PM) model and determine the optimum PM schedule. Consider a simple preventive maintenance and replacement policy. Under this policy, two types of actions are performed. The first type is the preventive replacement that occurs at fixed intervals of time: components or parts are replaced at predetermined times regardless of the age of the component or part being replaced. The second type of action is the failure replacement, where components or parts are replaced upon failure. This policy is illustrated in Figure 7.3. The most widely used criterion in maintenance models is to minimize the total expected maintenance and replacement cost per unit time. This can be accomplished by developing a total expected cost function per unit time as follows.

Figure 7.3. Constant interval replacement policy (timeline labels: new item, failure replacements, preventive replacement; one cycle runs from 0 to $t_p$)

Let $c(t_p)$ be the total replacement cost per unit time as a function of $t_p$. Then

$$c(t_p) = \frac{\text{Total expected cost in the interval } (0, t_p]}{\text{Expected length of the interval}}. \tag{7.20}$$

The total expected cost in the interval $(0, t_p]$ is the sum of the expected cost of failure replacements and the cost of the preventive replacement. During the interval $(0, t_p]$, one preventive replacement is performed at a cost of $c_p$ and $M(t_p)$ failure replacements at a cost of $c_f$ each, where $M(t_p)$ is the expected number of replacements (or renewals) during the interval $(0, t_p]$. The expected length of the interval is $t_p$. Equation 7.20 can be rewritten as

$$c(t_p) = \frac{c_p + c_f M(t_p)}{t_p}. \tag{7.21}$$
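Equation 7.21 can be evaluated numerically once the renewal function $M(t_p)$ is available. The sketch below approximates $M(t)$ on a uniform grid by discretizing the renewal equation $M(t) = F(t) + \int_0^t M(t-u)\,dF(u)$ and then searches the grid for the interval with the lowest cost rate. The Weibull failure distribution and its parameters are hypothetical and are not those of the electronic device example; only the cost values $c_p = 100$ and $c_f = 1200$ are borrowed from the text.

```python
import math

def renewal_function(cdf, horizon, steps):
    """Approximate M(k*h), k = 0..steps, from the discretized renewal equation."""
    h = horizon / steps
    F = [cdf(k * h) for k in range(steps + 1)]
    M = [0.0] * (steps + 1)
    for k in range(1, steps + 1):
        acc = F[k]
        for j in range(1, k + 1):
            # Probability of a first failure in ((j-1)h, jh] times the
            # expected number of further renewals in the remaining time.
            acc += M[k - j] * (F[j] - F[j - 1])
        M[k] = acc
    return M, h

def cost_rate_curve(cdf, c_p, c_f, horizon, steps=400):
    """Return [(t_p, c(t_p))] on the grid, with c(t_p) from Equation 7.21."""
    M, h = renewal_function(cdf, horizon, steps)
    return [(k * h, (c_p + c_f * M[k]) / (k * h)) for k in range(1, steps + 1)]

def weibull_cdf(t, shape=2.0, scale=1.0):
    """Hypothetical failure time distribution with increasing failure rate."""
    return 1.0 - math.exp(-((t / scale) ** shape))

curve = cost_rate_curve(weibull_cdf, c_p=100.0, c_f=1200.0, horizon=1.0)
best_tp, best_cost = min(curve, key=lambda pair: pair[1])
```

The grid approximation slightly underestimates $M(t)$ because failure times are effectively rounded up to the next grid point; refining `steps` reduces this bias at a quadratic cost in run time.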

We apply the above model to determine the optimum preventive maintenance schedule for the electronic devices of the example, whose reliability and pdf functions are obtained from accelerated conditions and are given by Equations 7.18 and 7.19, respectively. Assuming $c_p = 100$ and $c_f = 1200$, we rewrite Equation 7.21 as:

$$c(t_p) = \frac{100 + 1200 \displaystyle\int_0^{t_p} t f_n(t)\,dt}{t_p} \tag{7.22}$$

Calculated values of the cost per unit time are shown in Table 7.1 and plotted in Figure 7.4. The optimum preventive maintenance schedule at normal operating conditions is 0.18 time units.

Table 7.1. Time vs. cost per unit time values (the optimum values are time 0.18 and cost 836)

Time             0.13   0.14   0.15   0.16   0.17   0.18   0.19   0.20
Cost/unit time   918    885    862    847    839    836    840    848

Figure 7.4. Optimum preventive maintenance schedule (axes: Time vs. Cost per unit time)


7.6.2 Optimum Preventive Maintenance Schedule Based on Accelerated or Normal Degradation

Determining the optimum maintenance schedule for systems subject to degradation follows the same procedure described in Section 7.6.1. It begins by developing a degradation model (at normal operating conditions, or at accelerated conditions followed by extrapolation to normal conditions as shown above). Liao et al. (2005) assume that the degradation is described by a gamma process and obtain the optimum degradation level accordingly. Ettouney and Elsayed (1999) obtain the reliability function for different threshold degradation levels. We demonstrate the determination of the degradation threshold level at normal stress levels using the results of Ettouney and Elsayed (1999); then we use the optimum degradation level to determine the corresponding optimum preventive maintenance schedule as follows.

Consider the case of corrosion in reinforced concrete bridges, which is a major concern to professional engineers because of both public safety and the cost associated with needed repairs and replacement. Prediction of bridge functional degradation due to corrosion conditions is investigated below. The two main corrosion parameters which affect the reinforcing bars in reinforced concrete bridges are the corrosion rate, $r_{corr}$, and the time it takes to initiate corrosion, $T_1$. Enright and Frangopol (1998) present several mean and variance test measurements for both $r_{corr}$ and $T_1$. In a typical case, they show that the mean and variance of $r_{corr}$ are 0.005 in/year and $3 \times 10^{-6}$ in/year, respectively. The mean and standard deviation of $T_1$ are 10 years and 0.4 years, respectively. In order to estimate the time-variant strength of a corroded reinforced concrete beam, the corrosion effect on the diameter of the reinforcing bars is evaluated first. After the corrosion initiation time, $T_1$, the diameter of a reinforcing bar, $D(t)$, can be evaluated as

$$D(t) = D_i - r_{corr}\,(t - T_1) \tag{7.23}$$

where $D_i = 1.41$ in. is the initial reinforcing bar diameter and $t$ is the elapsed time. Note that $t \ge T_1$ and $D(t) \ge 0$. For more details of Equation 7.23 the reader is referred to Enright and Frangopol (1998). The time-variant reinforced concrete strength, $M_p(t)$, can now be evaluated using the conventional design equations in Enright and Frangopol (1998):

$$M_p = n A_s f_y \left(d - \frac{a}{2}\right) \tag{7.24}$$

$$a = \frac{n A_s f_y}{0.85 f'_c\, b} \tag{7.25}$$

Note that $A_s = \pi D(t)^2 / 4$. The reinforcing steel and concrete strengths are $f_y$ and $f'_c$, respectively. The number of reinforcing bars is $n$. The effective depth and the width of the beam are $d$ and $b$, respectively. For the current example, the values of the different parameters are chosen as $f_y = 40$ ksi, $f'_c = 3$ ksi, $d = 27$ in. and $b = 16$ in. Using Equations 7.23 through 7.25, the random time-variant strength $M_p(t)$ can be estimated. Using the previously mentioned values of $r_{corr}$ and $T_1$ and a Monte Carlo simulation technique, different strength values for different reinforced concrete beams can be simulated. Thus, a discrete time-variant reinforced concrete strength $x_{ij}$ can be evaluated from the Monte Carlo simulation of the continuous strength $M_p(t)$. Eghbali and Elsayed (2001) show that the reliability function for a specified failure threshold degradation $x$ is expressed as

$$R_x(t) = P(X > x; t) = \exp\!\left[\frac{-x^{\gamma}}{b \exp(-at)}\right] \tag{7.26}$$
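The Monte Carlo step described above (sampling $r_{corr}$ and $T_1$, then evaluating Equations 7.23 to 7.25) can be sketched as follows. The material and geometry values are those quoted for the example; the number of bars is not given in the text, so `N_BARS = 4` is a hypothetical choice, and drawing $r_{corr}$ and $T_1$ from normal distributions (truncated at zero) is likewise an assumption made only for illustration.

```python
import math
import random

F_Y = 40.0        # reinforcing steel strength, ksi
F_C = 3.0         # concrete strength, ksi
DEPTH = 27.0      # effective depth d, in.
WIDTH = 16.0      # beam width b, in.
D_INITIAL = 1.41  # initial bar diameter D_i, in.
N_BARS = 4        # hypothetical: the number of bars n is not stated in the text

def bar_diameter(t, r_corr, t_init):
    """Equation 7.23; no loss before corrosion starts, never below zero."""
    if t <= t_init:
        return D_INITIAL
    return max(D_INITIAL - r_corr * (t - t_init), 0.0)

def beam_strength(t, r_corr, t_init, n_bars=N_BARS):
    """Equations 7.24 and 7.25 with A_s = pi * D(t)**2 / 4 (result in kip-in)."""
    area = math.pi * bar_diameter(t, r_corr, t_init) ** 2 / 4.0
    force = n_bars * area * F_Y
    a = force / (0.85 * F_C * WIDTH)
    return force * (DEPTH - a / 2.0)

def simulate_strengths(t, n_samples, rng):
    """Sample beam strengths at time t (years)."""
    strengths = []
    for _ in range(n_samples):
        # Assumed normal distributions with the quoted moments, floored at zero.
        r_corr = max(rng.gauss(0.005, math.sqrt(3e-6)), 0.0)
        t_init = max(rng.gauss(10.0, 0.4), 0.0)
        strengths.append(beam_strength(t, r_corr, t_init))
    return strengths
```

Calling `simulate_strengths(year, n, random.Random(seed))` for each year of interest produces a data set of the kind denoted $x_{ij}$ in the text, one list per year.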

In Equation 7.26, $X$ is a random variable representing the degradation measure, and $a$, $b$ and $\gamma$ are constants. The maximum likelihood method was used to estimate the parameters of Equation 7.26:

$$L(\gamma, a, b, t) = \prod_{i=1}^{m} \left(\frac{\gamma}{b \exp(-a t_i)}\right)^{n_i} \prod_{i=1}^{m} \prod_{j=1}^{n_i} x_{ij}^{\gamma - 1} \exp\!\left(\frac{-x_{ij}^{\gamma}}{b \exp(-a t_i)}\right) \tag{7.27}$$

where m is the number of years, ni is the total number of degradation data in year i and xij is the strength of unit j in year i. Taking the logarithm of Equation 7.27 we obtain

$$\ln L = \sum_{i=1}^{m} n_i \ln \gamma - \sum_{i=1}^{m} n_i \ln b + \sum_{i=1}^{m} n_i a t_i + \sum_{i=1}^{m} \sum_{j=1}^{n_i} (\gamma - 1) \ln x_{ij} - \sum_{i=1}^{m} \sum_{j=1}^{n_i} \frac{x_{ij}^{\gamma}}{b \exp(-a t_i)} \tag{7.28}$$
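As a hedged illustration, the log-likelihood of Equation 7.28 can be coded directly and checked on synthetic data. The helper `simulate_data` is hypothetical: it draws strengths whose survivor function is exp(−x^γ/(b exp(−at))), and the parameter values passed to it below are illustrative inputs rather than a re-analysis of the bridge data.

```python
import math
import random


def log_likelihood(gamma, a, b, data):
    """Equation 7.28; data maps each time t_i to its list of strengths x_ij."""
    total = 0.0
    for t_i, xs in data.items():
        theta = b * math.exp(-a * t_i)
        total += len(xs) * (math.log(gamma) - math.log(b) + a * t_i)
        for x in xs:
            total += (gamma - 1.0) * math.log(x) - x ** gamma / theta
    return total


def simulate_data(gamma, a, b, years, n_per_year, rng):
    """Synthetic strengths with P(X > x; t) = exp(-x**gamma / theta(t))."""
    data = {}
    for t_i in years:
        theta = b * math.exp(-a * t_i)
        data[t_i] = [(theta * rng.expovariate(1.0)) ** (1.0 / gamma)
                     for _ in range(n_per_year)]
    return data


rng = random.Random(0)
data = simulate_data(1.49, 0.12, 1.1346e7, range(1, 11), 200, rng)
at_truth = log_likelihood(1.49, 0.12, 1.1346e7, data)
print(at_truth)
```

A numerical optimiser applied to `log_likelihood` would play the role of the root-finding step described next; the sketch stops short of that to stay within the standard library.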

Equating the partial derivatives of Equation 7.28 with respect to γ, a and b to zero and solving the resulting equations using a modified Powell hybrid algorithm and a finite difference approximation to the Jacobian yields a = 0.12, b = 1.1346 × 10^7 and γ = 1.49. The resulting reliability function is

$$R_x(t) = P(X > x; t) = \exp\left[ \frac{-x^{\gamma}}{b \exp(-at)} \right]$$

or

$$R_x(t) = \exp\left[ \frac{-x^{1.49}}{1.1346598 \times 10^{7} \times \exp(-0.12t)} \right].$$

The reliability for different threshold values of the strength is shown in Figure 7.5. The times to failure for threshold values of 4800, 4000, 3500, 3000 and 2500 are 25.04, 27.25, 28.88, 30.76 and 33.0 years, respectively.
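The fitted reliability function is easy to tabulate. The sketch below evaluates Equation 7.26 with the estimates quoted above for the five thresholds; the ten-year time grid is an arbitrary choice for display.

```python
import math

GAMMA, A, B = 1.49, 0.12, 1.1346e7  # estimates quoted in the text


def reliability(x, t, gamma=GAMMA, a=A, b=B):
    """Equation 7.26: P(X > x; t) for threshold x at time t (years)."""
    return math.exp(-x ** gamma / (b * math.exp(-a * t)))


for s in (4800, 4000, 3500, 3000, 2500):
    curve = [round(reliability(s, t), 3) for t in range(0, 61, 10)]
    print(s, curve)
```

For a fixed time, a lower threshold gives a higher reliability, and for a fixed threshold the reliability falls as the beam ages.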

Figure 7.5. Reliability for different threshold levels (reliability versus time in years, with curves for s = 2500, 3000, 3500, 4000 and 4800)

The next step is to determine the optimum preventive maintenance schedule for every threshold level and select the schedule corresponding to the smallest cost among all optimum cost values. This will represent both the optimum threshold level and the corresponding optimum preventive maintenance schedule. We demonstrate this for two threshold levels (S = 4800 and S = 2500), assuming cp = 10 and cf = 1200; we utilize Equation 7.21 as follows:

$$c(t_p) = \frac{10 + 1200 \int_0^{t_p} t\, f(x; t)\, dt}{t_p} \tag{7.29}$$

where

$$f(x; t) = \frac{\gamma}{\theta(t)}\, x^{\gamma - 1} \exp\left( \frac{-x^{\gamma}}{\theta(t)} \right), \quad t > 0, \quad \theta(t) = b e^{-at} \tag{7.30}$$

As shown in Figure 7.6, the optimum tp values for S = 4800 and S = 2500 are 17 and 16 years, respectively. The smaller of the two minimum costs is the one corresponding to S = 2500. Therefore, the optimum threshold is 2500 and the corresponding optimum maintenance schedule is 16 years.
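The cost comparison can be explored numerically. The sketch below takes Equations 7.29 and 7.30 as printed, integrates with the trapezoidal rule and searches a grid of tp values. The 40-year search window and the 0.01-year step are assumptions made here, and a grid minimum computed this way need not coincide with the values read from Figure 7.6.

```python
import math

GAMMA, A, B = 1.49, 0.12, 1.1346e7  # estimates quoted in the text
C_P, C_F = 10.0, 1200.0             # preventive and failure costs from the text


def density(x, t):
    """Equation 7.30 evaluated at threshold x and time t."""
    theta = B * math.exp(-A * t)
    return (GAMMA / theta) * x ** (GAMMA - 1.0) * math.exp(-x ** GAMMA / theta)


def cost_curve(x, t_max=40.0, step=0.01):
    """Equation 7.29 on a grid, using a cumulative trapezoidal integral."""
    curve = []
    integral = 0.0
    prev_t, prev_val = 0.0, 0.0  # the integrand t * f(x; t) vanishes at t = 0
    n_steps = int(round(t_max / step))
    for k in range(1, n_steps + 1):
        t = k * step
        val = t * density(x, t)
        integral += 0.5 * (prev_val + val) * (t - prev_t)
        curve.append((t, (C_P + C_F * integral) / t))
        prev_t, prev_val = t, val
    return curve


def optimum(x):
    """Grid point (t_p, cost) with the smallest cost per unit time."""
    return min(cost_curve(x), key=lambda point: point[1])


for s in (4800.0, 2500.0):
    t_p, cost = optimum(s)
    print(s, round(t_p, 2), round(cost, 3))
```

The same loop could be run over every candidate threshold, keeping the threshold whose minimum cost is smallest, which is the selection rule described above.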

Figure 7.6. Total cost per unit time vs. time (curves for S = 4800 and S = 2500)

7.7 Summary

In this chapter we present the common approaches for predicting reliability using accelerated life testing. The models are classified as accelerated life testing models (ALT) and accelerated degradation models (ADT). The ALT models are further classified as accelerated failure time models with assumed failure time distributions and "distribution free" models. We also modify the proportional odds model so that it can be used for reliability prediction with multiple stresses.

Most of the research in the literature does not extend the use of accelerated life testing beyond reliability predictions at normal conditions. This is the first work that links ALT to maintenance theory and maintenance scheduling. We develop optimum preventive maintenance schedules for both ALT models and degradation models. We demonstrate how the reliability prediction models obtained from ALT can be used in obtaining the optimum maintenance schedules. We also demonstrate the link between the optimum degradation threshold level and the optimum maintenance schedule.

This work can be extended to include other maintenance costs or the assurance of a minimum availability level for a system. Further work is needed to investigate the relationship between threshold levels at accelerated conditions and those at normal conditions. Moreover, the models need to include the repair rate as well as spares availability.


7.8 References

Agresti, A. and Lang, J.B., (1993) Proportional odds model with subject-specific effects for repeated ordered categorical responses, Biometrika, 80, pp. 527–534
Bennett, S., (1983) Log-logistic regression models for survival data, Applied Statistics, 32, pp. 165–171
Brass, W., (1971) On the scale of mortality, In: Brass, W., editor. Biological aspects of Mortality, Symposia of the society for the study of human biology. Volume X. London: Taylor & Francis Ltd.: 69–110
Brass, W., (1974) Mortality models and their uses in demography, Transactions of the Faculty of Actuaries, Vol. 33, pp. 122–133
Ciampi, A. and Etezadi-Amoli, J., (1985) A general model for testing the proportional hazards and the accelerated failure time hypotheses in the analysis of censored survival data with covariates, Commun. Statist. - Theor. Meth., Vol. 14, pp. 651–667
Cox, D.R., (1972) Regression models and life tables (with discussion), Journal of the Royal Statistical Society B, Vol. 34, pp. 187–208
Cox, D.R., (1975) Partial likelihood, Biometrika, Vol. 62, pp. 269–276
Eghbali, G. and Elsayed, E.A., (2001) Reliability estimate using degradation data, in Advances in Systems Science: Measurement, Circuits and Control, Mastorakis, N. E. and Pecorelli-Peres, L. A. (Editors), Electrical and Computer Engineering Series, WSES Press, pp. 425–430
Elsayed, E.A., (1996) Reliability engineering, Addison-Wesley Longman, Inc., New York
Elsayed, E.A. and Jiao, L., (2002) Optimal design of proportional hazards based accelerated life testing plans, International Journal of Materials & Product Technology, Vol. 17, Nos. 5/6, pp. 411–424
Elsayed, E.A. and Zhang, H., (2006) Design of PH-based accelerated life testing plans under multiple-stress-type, to appear in the Reliability Engineering and Systems Safety
Elsayed, E.A., Liao, H., and Wang, X., (2006) An extended linear hazard regression model with application to time-dependent-dielectric-breakdown of thermal oxides, IIE Transactions on Quality and Reliability Engineering, Vol. 38, No. 4, pp. 329–340
Elsayed, E.A. and Zhang, H., (2005) Design of optimum simple step-stress accelerated life testing plans, Proceedings of 2005 International Workshop on Recent Advances in Stochastic Operations Research. Canmore, Canada
Enright, M.P. and Frangopol, D.M., (1998) Probabilistic analysis of resistance degradation of reinforced concrete bridge beams under corrosion, Engineering Structures, Vol. 20, No. 11, pp. 960–971
Etezadi-Amoli, J. and Ciampi, A., (1987) Extended hazard regression for censored survival data with covariates: a spline approximation for the baseline hazard function, Biometrics, Vol. 43, pp. 181–192
Ettouney, M. and Elsayed, E.A., (1999) Reliability estimation of degraded structural components subject to corrosion, Fifth ISSAT International Conference, Las Vegas, Nevada, August 11–13
Hannerz, H., (2001) An extension of relational methods in mortality estimation, Demographic Research, Vol. 4, pp. 337–368
Kalbfleisch, J.D. and Prentice, R.L., (2002) The statistical analysis of failure time data, John Wiley & Sons, New York, New York
Liao, H., Elsayed, E.A., and Ling-Yau Chan, (2005) Maintenance of continuously monitored degrading systems, European Journal of Operational Research, Vol. 75, No. 2, pp. 821–835
Liao, H., (2004) Degradation models and design of accelerated degradation testing plans, Ph.D. Dissertation, Department of Industrial and Systems Engineering, Rutgers University
McCullagh, P., (1980) Regression models for ordinal data, Journal of the Royal Statistical Society. Series B, Vol. 42, No. 2, pp. 109–142
Meeker, W.Q. and Escobar, L.A., (1998) Statistical methods for reliability data, John Wiley & Sons, New York, New York
Nelson, W., (2004) Accelerated testing: statistical models, test plans, and data analyses, John Wiley & Sons, New York, New York
Oakes, D. and Dasu, T., (1990) A note on residual life, Biometrika, 77, pp. 409–410
Pascual, F.G., (2006) Accelerated life test plans robust to misspecification of the stress-life relation, Technometrics, Vol. 48, No. 1, pp. 11–25
Shyur, H-J., (1996) A General nonparametric model for accelerated life testing with time-dependent covariates, Ph.D. Dissertation, Department of Industrial and Systems Engineering, Rutgers University
Shyur, H-J., Elsayed, E.A. and Luxhoj, J.T., (1999) A General model for accelerated life testing with time-dependent covariates, Naval Research Logistics, Vol. 46, pp. 303–321
Tobias, P. and Trindade, D., (1986) Applied reliability, Van Nostrand Reinhold Company, New York, New York
Zhang, H. and Elsayed, E.A., (2005) Nonparametric accelerated life testing based on proportional odds model, Proceedings of the 11th ISSAT International Conference on Reliability and Quality in Design, St. Louis, Missouri, USA, August 4–6
Zhao, W. and Elsayed, E.A., (2005) Optimum accelerated life testing plans based on proportional mean residual life, Quality and Reliability Engineering International

8 Preventive Maintenance Models for Complex Systems

David F. Percy

8.1 Introduction Preventive maintenance (PM) of repairable systems can be very beneficial in reducing repair and replacement costs, and in improving system availability, by reducing the need for corrective maintenance (CM). Strategies for scheduling PM are often based on intuition and experience, though considerable improvements in performance can be achieved by fitting mathematical models to observed data; see Handlarski (1980), Dagpunar and Jack (1993) and Percy and Kobbacy (2000) for example. For systems comprising few components, and systems comprising many identical components, modelling and analysis using compound renewal processes might be possible. Such situations are considered by Dekker et al. (1996) and Van der Duyn Schouten (1996). However, many systems comprise a large variety of different components and are too complicated for applying this methodology. We refer to these as complex repairable systems. This chapter reviews basic models for complex repairable systems, explaining their use for determining optimal PM intervals. Then it describes advanced methods, concentrating on generalized proportional intensities models, which have proven to be particularly useful for scheduling PM. Computational difficulties are addressed and practical illustrations are presented, based on sub-systems of oil platforms and refineries. The motivation is that for complex systems, one needs to build models for failures based on the history of maintenance (PM and CM) available. Once a model is built, one can evaluate different PM strategies to determine the best one. The focus is to look at different models and how to determine the best model based on historical data. Section 8.2 presents some real examples of complex systems with historical data sets. In each case, it discusses current maintenance policies and any problems with collection or accuracy of the data. 
Section 8.3 considers the effects of PM and CM actions upon system reliability and availability, so justifying the need for modelling the operating situations in order to determine suitable scheduling strategies. In Section 8.4, we review the models that can be used for this purpose. We also assess the relevance, strengths and weaknesses of each model and provide references where readers can find more details. The remainder of the chapter presents general recommendations for modelling of complex systems in order to schedule PM in practice. Section 8.5 describes the generalized proportional intensities model, Section 8.6 reviews the method of maximum likelihood for estimating unknown model parameters, Section 8.7 addresses the problem of model selection and considers statistical tests for this purpose, and Section 8.8 looks at the scheduling problem. Finally, Section 8.9 applies these methods to some of the data of Section 8.2 and Section 8.10 presents some concluding remarks. For convenience, we now present a list of symbols and acronyms that are used throughout this chapter.

PM            Preventive maintenance
CM            Corrective maintenance
ROCOF         Rate of occurrence of failures
NHPP          Nonhomogeneous Poisson process
T1, T2, …     Failure times of a system
X1, X2, …     Inter-failure times of a system
N(t)          Number of failures up to time t
H(t)          History of process up to time t
ι(t)          Intensity function
ι0(t)         Baseline intensity function
Po(µ)         Poisson distribution
F(x)          Cumulative distribution function
f(x)          Probability density function
R(x)          Reliability or survivor function
h(x)          Hazard function
h0(x)         Baseline hazard function
DRP           Delayed renewal process
DARP          Delayed alternating renewal process
VAM           Virtual age model
PHM           Proportional hazards model
IRM           Intensity reduction model
PIM           Proportional intensities model
GPIM          Generalized proportional intensities model
MLE           Maximum likelihood estimate
AIC           Akaike information criterion
BIC           Bayes information criterion


8.2 Examples with Historical Data Sets

Example 8.1 Ascher and Feingold (1984) presented three hypothetical sets of reliability data to illustrate the forms of historical failure information that are typically observed for complex systems. The numbers represent inter-failure times corresponding to happy, sad and noncommittal systems respectively and are displayed in Table 8.1. The inter-failure times are increasing for the happy system, as the system settles down and fewer failures occur later on. This phenomenon can arise with prototype systems, such as a new aircraft, items subject to a burn-in phase of operation, such as a piston engine, and debugging of computer programs. Conversely, the inter-failure times are decreasing for the sad system, as the system ages and wears over time. This situation is very common and applies to most systems, such as television sets, music centres and motor vehicles. The noncommittal system displays no clear trend in inter-failure times.

Table 8.1. Hypothetical reliability data from Ascher and Feingold (1984)

Happy system    Sad system    Noncommittal system
15              177           51
27              65            43
32              51            27
43              43            177
51              32            15
65              27            65
177             15            32
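A quick numerical check of the three patterns in Table 8.1 is the Laplace trend statistic. This is a standard reliability tool rather than part of this chapter's own analysis, and the form used below assumes that observation stops at the last failure.

```python
import math
from itertools import accumulate

HAPPY = [15, 27, 32, 43, 51, 65, 177]
SAD = [177, 65, 51, 43, 32, 27, 15]
NONCOMMITTAL = [51, 43, 27, 177, 15, 65, 32]


def laplace_statistic(inter_failure_times):
    """Laplace trend statistic when observation ends at the last failure."""
    times = list(accumulate(inter_failure_times))  # failure times T_1, ..., T_n
    n, t_end = len(times), times[-1]
    mean_earlier = sum(times[:-1]) / (n - 1)
    spread = t_end * math.sqrt(1.0 / (12.0 * (n - 1)))
    return (mean_earlier - t_end / 2.0) / spread


for name, data in (("happy", HAPPY), ("sad", SAD), ("noncommittal", NONCOMMITTAL)):
    print(name, round(laplace_statistic(data), 2))
```

Negative values indicate failures concentrated early (an improving system), positive values indicate failures concentrated late (a deteriorating system), and values near zero indicate no clear trend.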

Example 8.2 Percy et al. (1998) published a set of data relating to the reliability and maintenance history of a valve in a petroleum refinery, as displayed in Table 8.2. The two columns successively represent the times in days between maintenance actions and the types of actions, where 0 indicates no failure (PM) and 1 indicates failure (CM). At first glance, this would appear to be a noncommittal system. However, on further inspection, there appear to be fewer failures later on and more preventive actions. Whether the PM is proving to be effective or the system is generally happy is not easy to determine. Modelling can provide these answers though. Based on these data, our ultimate goal is to decide how often to perform PM in future or on similar systems. When collecting such data, it is very important to record all PM and CM events accurately, as errors of omission or commission can result in wrong decisions. For example, if the first failure were not recorded, the average time until system failure over the first 94 days would appear to be twice its actual value, perhaps suggesting that PM is not required.


Table 8.2. Reliability and maintenance history of a petroleum refinery valve

Time since last action    Type of action    Time since last action    Type of action
71                        1                 186                       0
23                        1                 14                        1
64                        1                 8                         1
207                       0                 112                       1
136                       1                 57                        0
66                        1                 28                        1
37                        0                 4                         1
119                       0                 139                       0
2                         1                 250                       0
5                         1                 206                       0
250                       0                 144                       0
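The observations made about Table 8.2 can be checked with a short script. It assumes the records are read down the first pair of columns and then down the second pair, which is consistent with the remark about the first 94 days; the helper names are illustrative.

```python
# (days since last action, action type): 1 = failure (CM), 0 = no failure (PM)
VALVE = [(71, 1), (23, 1), (64, 1), (207, 0), (136, 1), (66, 1), (37, 0),
         (119, 0), (2, 1), (5, 1), (250, 0), (186, 0), (14, 1), (8, 1),
         (112, 1), (57, 0), (28, 1), (4, 1), (139, 0), (250, 0), (206, 0),
         (144, 0)]


def failures(records):
    """Number of corrective actions among the records."""
    return sum(kind for _, kind in records)


def mean_time_per_failure(records):
    """Total observed time divided by the number of failures."""
    return sum(days for days, _ in records) / failures(records)


half = len(VALVE) // 2
print(failures(VALVE[:half]), failures(VALVE[half:]))   # earlier vs later records
print(mean_time_per_failure(VALVE[:2]))                 # both failures recorded
print(mean_time_per_failure([(71 + 23, 1)]))            # first failure omitted
```

Omitting the first failure merges the first two intervals into one, which is why the apparent average time to failure over the first 94 days doubles.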

Example 8.3 Kobbacy et al. (1997) published a set of historical reliability and maintenance data collected from a main pump at an oil refinery over a period of nearly seven years. These data are reproduced in Table 8.3, with consecutive observations reading down the columns successively from left to right.

Table 8.3. Reliability and maintenance history of a main oil refinery pump

Times since last actions
34*     1       37      22      3
14      4       28      51      21
81*     13      38      51      6
86*     27      20      15      26
156*    8       28*     18      15*
20*     148*    44      1       35
96*     92      3       26*     44*
47*     13      56      37      61
45*     13      64      36      84*
97      67*     8*      2       12
88*     29      62*     12*     65*
30      12      8       27      43*
4       1       46      102     4


Right-censored observations corresponding to preventive maintenance are marked by asterisks, whereas unmarked observations correspond to failures and corrective maintenance. To clarify this point, consider the pump’s performance from the time when data collection commenced. After 34 days without failure, PM was performed. The pump then continued to operate for 14 more days and then failed. Following CM, the pump worked for 81 more days without failure and then PM was performed. Following 6 further PM actions, the next failure occurred 676 (=34+14+…+97) days after data collection began. By scanning the inter-event times in Table 8.3, it is clear that preventive maintenance was not performed at regular intervals or according to any other simple pattern. Such irregularity can arise because of opportunistic PM, such as when a maintenance team is on site or has idle time, or because of condition monitoring warnings, such as vibration and noise indicators. In many applications including this, however, PM is simply not modelled and monitored effectively. This can result in excessive repair costs and unacceptable levels of downtime.
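The bookkeeping described above can be reproduced for the first column of Table 8.3. In this small sketch each record carries a flag for PM (the asterisked, right-censored observations), and failure times are accumulated from the start of data collection.

```python
# First column of Table 8.3: (days since last action, True if PM / censored)
FIRST_COLUMN = [(34, True), (14, False), (81, True), (86, True), (156, True),
                (20, True), (96, True), (47, True), (45, True), (97, False),
                (88, True), (30, False), (4, False)]


def failure_times(records):
    """Days from the start of data collection at which failures occurred."""
    elapsed, times = 0, []
    for days, is_pm in records:
        elapsed += days
        if not is_pm:
            times.append(elapsed)
    return times


print(failure_times(FIRST_COLUMN))  # [48, 676, 794, 798]
```

The second entry confirms that the second failure occurred 676 days after data collection began.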

8.3 Effects of Preventive and Corrective Maintenance Before considering suitable models for the reliability and maintenance of complex repairable systems, we must consider what is meant by these terms. A complex system consists of any structure of more than one component, which performs a particular function. Typical systems include industrial and domestic machinery, such as production lines, utility supplies, railway operations, motor vehicles, central heating systems and washing machines. We concentrate on industrial systems, which benefit greatly from reliability and maintenance modelling. Such complex systems are often subject to failures, upon which we either discard the systems or repair them. Failures can be total or catastrophic, in which case the system stops working, such as when an exhaust pipe drops off a car or a microchip short circuits in a refrigerator. Alternatively, they can be partial or debilitating, such as when a car headlight bulb blows or a refrigerator clogs up with ice. Total failures incur immediate repair costs. Repairs usually consist of replacing broken components and we incur the costs of replacement parts, labour associated with repair and system downtime. For expensive systems, the cost of replacement parts might contribute most. For dangerous situations, the cost of labour might be most influential. For continuous process industries, the cost of downtime will dominate. As these costs can be very large, management will seek to avoid catastrophic failures by intervening with preventive maintenance at a much smaller cost. Debilitating failures are of less importance, as they do not incur direct costs. However, when observable, they can serve as indicators of when to perform preventive maintenance or capital replacement. Consequently, the failures that this chapter generally refers to are catastrophic in nature. 
Preventive maintenance can be specific, as identified by condition monitoring indicators, or opportunistic, when such actions are convenient because of other environmental factors. These possibilities are very much application dependent and require in-depth analyses, though the models we consider here do extend to include
such information. Much preventive maintenance is less specific in terms of particular systems but not in terms of the work involved, and applies more generally. For example, motor vehicles might be serviced annually according to a strict checklist procedure. The actual work conducted during PM can involve many tasks, such as cleaning surfaces, lubricating joints, sharpening blades, replacing fluids, removing waste, cooling down and redecorating. As for CM, we incur costs of PM due to parts, labour and downtime, though these tend to be substantially less than for repairs. The challenge is to balance the costs of preventive maintenance with the supposed improvements in system reliability. Too few PM actions means we incur big CM costs and small PM costs, whereas too many PM actions means we incur small CM costs and big PM costs. Unfortunately, there is no simple explanation of how CM and PM affect system reliability. By modelling the failure patterns of these systems mathematically, we can gain valuable insights about cost-effective strategies for maintenance and replacement.

8.4 Review of Suitable Models Many mathematical models have been proposed for statistical analysis of complex repairable systems. Table 8.4 presents a summary of the main types. In order to discuss the strengths and weaknesses of each model in more depth, we first introduce some standard notation. Suppose that each time a system fails, we repair it and thereby return it to operational condition. For a preliminary analysis, we also assume that repair times are negligible. Let T1 , T2 , T3 ,… be the times to successive failures of the system and let X i = Ti − Ti −1 be the time between failure i − 1 and failure i where T0 = 0 . The Ti and X i are random variables and we define ti and xi to be their corresponding realized values. Figure 8.1 illustrates this situation. We also define N (t ) as the number of failures in the interval (0, t ] .

Figure 8.1. Notation for a repairable system

We generally model the time to first failure using a familiar lifetime probability distribution or hazard function. However, this approach is inadequate for modelling other times to failure, as the inter-failure times are neither independent nor identically distributed in general (Ascher and Feingold 1984). Stochastic processes form the appropriate basis for models to use under these circumstances. We are interested in the probability that a system fails in the interval (t, t + ε ] given the history of the process up to time t . We describe the behaviour of the failure process by the intensity function (identified here by the Greek letter iota):

$$\iota(t) = \lim_{\varepsilon \to 0} \frac{P\{N(t + \varepsilon) - N(t) \geq 1 \mid H(t)\}}{\varepsilon}. \tag{8.1}$$

For an orderly process, where simultaneous failures are impossible, the intensity function is equal to the derivative of the conditional expected number of failures:

$$\iota(t) = \frac{d}{dt} E\{N(t) \mid H(t)\}, \tag{8.2}$$

which is referred to as the rate of occurrence of failures (ROCOF).

Table 8.4. Summary of models for complex repairable systems

Model | Comments | References
Renewal process | Repair back to (or replace by) new item | Taylor and Karlin (1994)
Nonhomogeneous Poisson process | Only CM actions, zero repair times | Ascher and Feingold (1984); Crowder et al. (1991); Lindqvist et al. (2003)
Delayed renewal process | Distributions for failures after PM and CM actions, zero downtimes | Watson (1970)
Delayed alternating renewal process | Fixed or random downtimes | Percy et al. (1998a)
Virtual age model | CM minimal repair, PM reduction in age | Jack (1998); Doyen and Gaudoin (2004)
Proportional hazards model | Different hazard functions for failures after PM and CM actions | Cox (1972a); Jardine et al. (1987); Newby (1994); Lutigheid et al. (2004)
Intensity reduction model | CM minimal repair, PM reduction in intensity function | Doyen and Gaudoin (2004)
Proportional intensities model | Takes account of covariates, CM as minimal repairs | Cox (1972b); Percy et al. (1998b)
Generalized proportional intensities model | Both CM and PM affect the intensity function | Percy and Alkali (2006)

For more details on statistical inference in this context, we refer readers to Crowder et al. (1991). Our fundamental model is the nonhomogeneous Poisson process (NHPP), which effectively implies that a repair restores a system to the state it was in immediately before failure. Such corrective maintenance effects are commonly referred to as minimal repairs. The NHPP satisfies these conditions for 0 0 [gamma]

β

};

x > 0 [Weibull]

The form of the hazard function is precisely the same as the form of the intensity function if we were to use a stochastic process to model the complex system. For a nonhomogeneous Poisson process, this intensity function applies beyond the first failure. However, successive hazard functions for inter-failure times have different forms, which correspond to shifted and truncated versions of the distribution for time to first failure. Imperfect maintenance models must allow for the dynamic evolution of a system and take account of hypothesized and observed knowledge about the effectiveness of repairs. As mentioned above, this section reviews a variety of existing models for repairable systems and describes suitable adaptations for systems that are subject to preventive maintenance. In passing, we remark that time is used as the only scale of measurement here. Some applications use running time instead, or both, such as the flight time of an aircraft or the mileage and age of a car. Further details of such variations are described by Baik et al. (2004) and Jiang and Jardine (2006).


8.4.1 Renewal Process (Maximal Repair) This model assumes that repairs renew a system to its condition as new. A renewal process is a counting process that registers the successive occurrence of events during a given time interval ( 0,t ] where the time durations between consecutive events X 1 , X 2 , X 3 ,… form a sequence of independent and identically distributed non-negative random variables. The special case where their distribution is exponential corresponds to the homogeneous Poisson process. We can characterize the intensity function of a renewal process by

$$\iota(t) = \iota_0\left(t - t_{N(t)}\right) \tag{8.3}$$

where ι0(t) is the baseline intensity function, which would prevail if there were no system failures. As this is a renewal process, the baseline intensity function is equal to the hazard function for the inter-failure times: ι0(x) = h(x). The baseline intensity function can take many forms, including:

(i) ι0(t) = α [constant]
(ii) ι0(t) = αβ^t [loglinear]
(iii) ι0(t) = αt^β [power-law]
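As a minimal sketch, the three baseline forms and the renewal-process intensity of Equation 8.3 can be written as small functions. The function names and the numerical values used in the example call are illustrative only.

```python
import bisect


def constant(alpha):
    """Baseline (i): constant intensity."""
    return lambda t: alpha


def loglinear(alpha, beta):
    """Baseline (ii): alpha * beta**t, so the log intensity is linear in t."""
    return lambda t: alpha * beta ** t


def power_law(alpha, beta):
    """Baseline (iii): alpha * t**beta."""
    return lambda t: alpha * t ** beta


def renewal_intensity(baseline, failure_times, t):
    """Equation 8.3: baseline evaluated at the time since the latest failure."""
    k = bisect.bisect_right(failure_times, t)  # N(t): failures in (0, t]
    last_failure = failure_times[k - 1] if k else 0.0
    return baseline(t - last_failure)


base = power_law(2.0, 1.5)
print(renewal_intensity(base, [3.0, 7.0], 8.0))  # one time unit after the failure at 7
```

Because each failure resets the argument of the baseline to zero, the intensity depends only on the time since the most recent renewal, not on the overall age of the system.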

The renewal process is a plausible first order model for components or parts when the repair time is negligible, since complete replacement of a component after failure implies renewal instead of repair. Conversely, the renewal process is a poor model for complex systems, where repairs involve replacing or restoring just a fraction of the system's components. If a large portion of a system needs to be restored, it is often more economical to replace the entire system. Even if a repair restores the system's performance to its original specification, the presence of predominantly aged components implies that system reliability is not renewed.

8.4.2 Nonhomogeneous Poisson Process (Minimal Repair)

The assumptions underlying this model imply that, when a repair is carried out, a system assumes the same condition that it was in immediately before failure. The nonhomogeneous Poisson process (NHPP) differs from the homogeneous Poisson process only in that the rate of occurrence of failures varies with time rather than being constant. As mentioned early in this section, it is the fundamental model for repairable systems. The NHPP is also the most appropriate model for the reliability of a complex system comprising infinitely many components. However, for a finite number of components, this model can only serve as an approximation, often poor, as the intensity function changes following each repair. In this model, the interarrival times X1, X2, X3, … are neither independent nor identically distributed.


An important characteristic of the NHPP is that the intensity function depends on the system’s global operating time, measured from the instant the system is put into operation. A simple NHPP model can be expressed as

$$\iota(t) = \iota_0(t) \tag{8.4}$$

where ι0(t) is the baseline intensity function introduced earlier. In modelling the reliability of repairable systems under the nonhomogeneous Poisson process assumptions, the numbers of events in non-overlapping intervals are independent random variables and the intensity becomes the rate of occurrence of failures or peril rate of a repairable system. This model corresponds to minimal repair, whereby system reliability returns to the condition immediately before failure. If repair times are small relative to times between failures, so that they can be ignored, then we have ι(t) = h(t).

8.4.3 Delayed Renewal Process

We refer to a repairable system as stationary if there is no long-term improvement or deterioration of its performance. For many applications, the assumptions of renewal and minimal repair are too restrictive. We have encountered the need for an alternative scenario that allows for minor repairs, as follows:

• Corrective maintenance is performed upon failure, to restore the system to a reasonable operating state
• Preventive maintenance takes place at regular intervals, to reset the system to a good operating state

Corrective maintenance (CM) corresponds to major or minor repair work and may involve replacing the damaged components, whereas preventive maintenance (PM) usually corresponds to minor interventions such as lubrication, cleaning and inspection. Given this structure, we assume that failure times after corrective operations are independent and identically distributed, as are failure times after preventive operations. However, we allow for different probability distributions in the two cases and this defines the delayed renewal process (DRP). This is not a simple renewal process, because of the different lifetime distributions following the two types of action. However, the simple renewal process could be regarded as a limiting case of the DRP, if corrective operations were to repair the system to the same state as preventive operations. Maximal repairs involve restoring the system upon failure to its condition at new. Similarly, if corrective operations were to restore the system to the state immediately before failure, minimal repairs would result. This is not strictly a special case of the delayed renewal process, but a computer program could easily allow for this assumption if required. However, we believe that minimal repairs are convenient for mathematical modelling but are not always valid in practice.


Figure 8.2. Delayed renewal process

As shown in Figure 8.2, define the random variables U and V to be the lifetimes after PM and CM respectively. Their probability density functions, conditional upon known parameters, are fU ( u ) and fV ( v ) respectively. These distributions might take the exponential, gamma or Weibull forms defined earlier, to achieve the required flexibility. Note that the exponential distribution is a limiting case of the gamma as α → 1 and Weibull as β → 1 . The DRP assumes that downtimes are negligible compared with the costs of parts and labour. We now consider the effects of non-ignorable downtimes. 8.4.4 Delayed Alternating Renewal Process The delayed renewal process described above assumes that the downtimes for preventive and corrective maintenance are negligible when compared with the lifetimes. It also assumes that the costs associated with these downtimes are dominated by the costs of parts and labour. The model and analysis are further complicated when we allow for periods of downtime, when maintenance actions take place. In many applications involving continuous-process industries, the principal costs are not due to parts and labour, but are due to lost production whilst the system is down. Consequently, we must consider downtime costs and durations when determining cost-effective strategies for scheduling PM. This extension results in the delayed alternating renewal process (DARP), for which analytical solution is not even feasible in practice. The downtimes following preventive and corrective maintenance can be fixed or random. Since analytical solution of the optimisation problems is not possible and we are adopting a simulation approach here, either of these can be included in the calculations with ease. In the following work, we consider them fixed to avoid confusion. 
Another benefit of simulation over numerical solution of the renewal equations is that anomalies are readily catered for, such as switching from CM to PM if the system is in the failed state when PM is due. The DARP is illustrated in Figure 8.3.

Figure 8.3. Delayed alternating renewal process

The delayed alternating renewal process is appropriate when the time to replace (or repair back to new) a failed item is non-zero. In this case, we have working and


D. Percy

failed states and these alternate. So far, we have only allowed for systems that display no long-term trends, corresponding to improvement or deterioration. We now discuss age-based models that allow for such trends. These models can also be used for stationary and non-stationary systems when concomitant information is available. We discuss these benefits later, as the need for including such extra sources of information is described.

8.4.5 Virtual Age Model (Rejuvenation)

The virtual age model (VAM) modifies the hazard function for a system’s inter-failure times at each corrective maintenance action. For these repairs, the system’s virtual age at any given time is determined by a variety of additive or multiplicative age-reduction factors. This resets the system to a younger state, which is only an approximation for reasons mentioned earlier. The intensity function of a point process under the age reduction model may be additive

$$\iota(t) = \iota_0\left(t - \sum_{i=1}^{N(t)} s_i\right) \tag{8.5}$$

or multiplicative

$$\iota(t) = \iota_0\left(t \prod_{i=1}^{N(t)} s_i\right) \tag{8.6}$$

where the si are constants, representing the age reduction factors, and ι0(t) is the baseline intensity function again. To evaluate the intensity function for a sequence of failures under age reduction, we note that the renewal function governs the system failure pattern. The additive model can generate negative intensities but the multiplicative model is suitable if replacement components are infallible. The age-reduction model has been applied to systems under a block replacement policy. A critical defect of the age-reduction model and its many variants is that they do not provide a realistic description of the failure processes. For example, replacing a corroded exhaust pipe does not reduce a car’s age, as very many other components are no less likely to fail.

8.4.6 Proportional Hazards Model

The proportional hazards model (PHM) is more flexible than the renewal process, DRP and DARP, as it allows for non-stationarity. It is also more flexible than the virtual age model because it allows for concomitant information. In principle, this model appears to be inappropriate for representing a complex system, because hazards naturally relate to lifetimes of components rather than inter-failure times of processes. We cannot physically justify this model as readily as the proportional intensities model described later. However, this does not invalidate its use in this context as a statistical model rather than a mathematical model and considerable


success in applying the proportional hazards model to real reliability and PM scheduling problems has been achieved. In formulating the PHM for a repairable system, we adopt different hazard functions after PM

$$\kappa(u) = \kappa_0(u) \exp\left(y_t' \gamma\right) \tag{8.7}$$

and after CM

$$\lambda(v) = \lambda_0(v) \exp\left(z_t' \delta\right) \tag{8.8}$$

where u and v represent the lifetimes following PM and CM respectively. The baseline hazard functions can take any suitable forms, including exponential, Gumbel and Weibull. The covariates that might be contained in the vectors yt and zt include cumulative observations of:

• Time since last PM
• Time since last CM
• Total number or total downtime of PMs
• Total number or total downtime of CMs
• Average PM interval duration

We might consider other factors and covariates for inclusion here, representing the concomitant information mentioned earlier. These could include:

• Severity measures of failures
• Quality measures of maintenance
• Condition-monitoring measurements

Temporal, or continuously time varying covariates (time since last PM and time since last CM) cause substantial computational difficulties. These may be avoided by choosing baseline hazard functions that are sufficiently flexible. The vectors γ and δ contain the regression coefficients, which generally take the form of unknown parameters. The results from extensive analyses demonstrate that this proportional hazards model is flexible, easy to use and of considerable practical value, despite its doubtful mathematical suitability for modelling repairable systems.

8.4.7 Intensity Reduction Model (Correction)

Improvement factors feature in additive and multiplicative intensity reduction models (IRM) for imperfect maintenance. Perhaps the most suitable of these is an intensity reduction model that involves a multiplicative scaling of the intensity function upon each failure and repair. This is the natural model for systems that are improving or deteriorating with time and provides a perfect description of the physical situation. This model can be expressed as an NHPP with intensity function

$$\iota(t) = \iota_0(t) \prod_{i=1}^{N(t)} s_i \tag{8.9}$$

where the si are constants representing the intensity reduction factors and ι0(t) is the baseline intensity function again. We later generalize this model by supposing si are simple functions of i, or are random variables that are independent of the failure and repair process. Having concluded that this model is ideally suited to modelling complex repairable systems, this chapter later considers how to extend it to allow for preventive maintenance and concomitant information.

8.4.8 Proportional Intensities Model

Whilst the proportional hazards model offered a valuable generalization of the delayed renewal process and delayed alternating renewal process to allow for nonstationarity and concomitant information, it is not the natural model for repairable systems. The natural model takes the form of a nonhomogeneous Poisson process and is the essence of the proportional intensities model (PIM), which is the subject of this subsection and is a generalization of the intensity reduction model described above. Define the random variable N(t) as the number of system failures by time t. Then the NHPP is characterised by conditionally independent increments, corresponding with conditionally independent times between failures that occur with intensity

$$\iota(t) = \lim_{\varepsilon \to 0} \frac{P\{N(t+\varepsilon) - N(t) \ge 1 \mid H(t)\}}{\varepsilon} \tag{8.10}$$

at system age t units, where H ( t ) is the history of the process. However, the NHPP corresponds with minimal repair as in Section 8.4.2 and makes no allowances for system improvement, or even deterioration, arising from maintenance actions. Hence, we modify the intensity function by introducing a multiplicative factor, so that we can express the intensity function as

$$\iota(t) = \iota_0(t) \exp\left(x_t^T \beta\right), \tag{8.11}$$

where the baseline intensity ι0 ( t ) has a standard form such as constant, loglinear and power-law. Furthermore, the parameter vector β represents the regression coefficients and the observation vector x t contains factors and covariates relating to the system, such as the cumulative observations and concomitant information mentioned in Section 8.4.6. An alternative option arises when using the PIM to model a complex repairable system subject to PM. Rather than adopting a global time scale for the baseline intensity function as implied above, we could reset the time scale of the baseline intensity function to zero upon each PM action. This introduces an element of


renewal that might be applicable if PM involves major reworking. System age could then be included amongst the covariates if necessary. However, this intervention results in a hybrid model, which suffers from the same difficulty of interpretation as does the PHM. As for predictor variables, the process simulation calculations for scheduling PM simplify greatly if we hold factors and covariates at fixed values throughout each PM interval. However, this essentially treats all CM as minimal repair work, an assumption that we earlier claimed is often unreasonable. To avoid this constraint, we need to consider variables that change during a PM interval, such as the cumulative number of failures. The computational effort required to incorporate such temporal covariates in our simulation is immense, but this relates to computer power rather than manpower and so is quite acceptable.

8.5 Generalized Proportional Intensities Model (GPIM)

The GPIM is this chapter’s main model of interest, as it allows for covariates and offers much potential for decision making related to scheduling preventive maintenance. Special cases of the GPIM are the intensity reduction model and the proportional intensities model investigated in Section 8.4. An algebraic representation of the GPIM in terms of the intensity function is given by

$$\iota(t) = \iota_0(t) \left\{\prod_{i=1}^{M(t)} r_i\right\} \left\{\prod_{j=1}^{N(t)} s_j\right\} \exp\left(x_t^T \beta\right). \tag{8.12}$$

Here, ι0(t) is the baseline intensity function, whilst ri > 0 and sj > 0 are the intensity scaling factors for preventive maintenance (PM) and corrective maintenance (CM) actions respectively. Furthermore, M(t) and N(t) are the total numbers of PM and CM actions, whilst xt is a vector of predictor variables and β is an unknown parameter vector of regression coefficients. One might expect the ri and sj to be less than one for a deteriorating system and greater than one for an improving system, though replacing failed components with used parts and accidentally introducing faults during maintenance can produce the opposite effects. System copies can have different forms of baseline intensity function. For reduction of intensity, the scaling factors can take the forms of positive constants, random variables, deterministic functions of time (t) and events (i and j) or stochastic functions of time and events. As for the intensity reduction model described in Section 8.4.7, a reasonable assumption for initial analysis is that ri = ρ for i = 1, 2, …, M(t) and sj = σ for j = 1, 2, …, N(t), in which case the GPIM corresponds with the PIM of Section 8.4.8. The vector of predictor variables might include:

• Quality of last maintenance action
• Time since last maintenance action
• Condition indicators

The quality of maintenance affects the functionality of a system and its future performance. Our justification for including the time since last maintenance here is to allow for the possibility that maintenance interventions can introduce problems similar to the burn-in of new components. The first of these is a discrete function of time, whereas the second is a continuous function of time. Condition indicators, when available, give direct and very strong guidance on the likely occurrence of failures. They are typically discrete functions of time that vary at, and between, maintenance actions.
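The intensity function (8.12) is straightforward to evaluate once the maintenance history is known. The following minimal sketch assumes constant scaling factors ri = ρ and sj = σ, and counts only the actions strictly before time t (a convention chosen here for illustration, matching the left limits used in the likelihood). The function name, argument names and numerical values are hypothetical and do not come from the chapter.

```python
import math

def gpim_intensity(t, baseline, pm_times, cm_times, rho, sigma, x=(), coef=()):
    """Evaluate the GPIM intensity (8.12) at time t with constant scaling factors.

    baseline : callable giving the baseline intensity iota_0(t)
    pm_times, cm_times : times of past PM and CM actions
    x, coef  : predictor values at time t and their regression coefficients
    """
    m = sum(1 for s in pm_times if s < t)   # M(t): PM actions before t (assumed convention)
    n = sum(1 for s in cm_times if s < t)   # N(t): CM actions before t (assumed convention)
    linear_predictor = sum(xi * ci for xi, ci in zip(x, coef))
    return baseline(t) * rho ** m * sigma ** n * math.exp(linear_predictor)

# Hypothetical illustration: constant baseline, two PMs and one CM before t = 25
example = gpim_intensity(25.0, lambda t: 0.1, [10.0, 20.0], [5.0], rho=0.5, sigma=2.0)
```

Setting rho = sigma = 1 and leaving the predictors empty returns the baseline intensity itself, which gives a quick check on any implementation.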

8.6 Parameter Estimation

All of the preceding models contain unknown parameters. In order to make any decisions based on these models, such as determining when to schedule the next PM activity, we need to quantify our knowledge about these parameters subjectively and empirically. Three forms of inference are applicable here. In increasing order of accuracy and precision, but also of algebraic complexity, they are naïve (fully subjective), frequentist (fully objective) and Bayesian (both subjective and objective). The first of these is trivial, whereas the others both require us to specify the likelihood function.

Firstly, consider the delayed renewal process of Section 8.4.1. In practical applications, the model parameters for each of the PM and CM lifetime distributions are unknown and we need some subjective or objective information about these parameters. Subjective information typically represents the expert views of maintenance engineers about a system’s repair and failure process, and can take many forms such as simply specifying values for the unknown parameters. Objective information typically takes the form of historically observed failure and repair data for the system under consideration, of the form

$$D = \left\{(u_i, v_{ij});\ i = 1, \ldots, n;\ j = 1, \ldots, n_i\right\} \tag{8.13}$$

which covers n complete PM intervals, where interval i contains ni failures. Note that the ui are right censored if ni = 0 and the vij are right censored when j = ni. Otherwise, the observations represent actual failure times. We introduce the indicator variables

$$c_i = \begin{cases} 0; & u_i \text{ right censored} \\ 1; & u_i \text{ observed lifetime} \end{cases} \tag{8.14}$$

and

$$d_{ij} = \begin{cases} 0; & v_{ij} \text{ right censored} \\ 1; & v_{ij} \text{ observed lifetime} \end{cases} \tag{8.15}$$

to identify when observations are right censored. The likelihood function for this delayed renewal process then becomes

$$L(\theta, \phi; D) \propto \prod_{i=1}^{n} \left[ \{f(u_i \mid \theta)\}^{c_i} \{R(u_i \mid \theta)\}^{1-c_i} \prod_{j=1}^{n_i} \{f(v_{ij} \mid \phi)\}^{d_{ij}} \{R(v_{ij} \mid \phi)\}^{1-d_{ij}} \right] \tag{8.16}$$

where R ( ⋅) represents the corresponding reliability function. Due to the nature of the DRP model, this likelihood function can be written as the product of a function of θ and a function of φ , so that

$$L(\theta, \phi; D) = L(\theta; D)\, L(\phi; D) \tag{8.17}$$

where

$$L(\theta; D) \propto \prod_{i=1}^{n} \{f(u_i \mid \theta)\}^{c_i} \{R(u_i \mid \theta)\}^{1-c_i} \tag{8.18}$$

and

$$L(\phi; D) \propto \prod_{i=1}^{n} \prod_{j=1}^{n_i} \{f(v_{ij} \mid \phi)\}^{d_{ij}} \{R(v_{ij} \mid \phi)\}^{1-d_{ij}}. \tag{8.19}$$
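As an illustration of how the censored likelihoods (8.18) and (8.19) can be evaluated, the sketch below computes the log-likelihood for right-censored lifetimes and applies it to an exponential lifetime distribution, for which the maximum likelihood estimate has a simple closed form (number of observed failures divided by total time). The data are hypothetical and the function names are our own; any other lifetime distribution could be supplied through its log-density and log-reliability functions.

```python
import math

def censored_loglik(times, observed, log_pdf, log_rel):
    """Log of (8.18) or (8.19): observed lifetimes contribute log f, censored ones log R."""
    return sum(log_pdf(t) if c else log_rel(t) for t, c in zip(times, observed))

def exponential_loglik(rate, times, observed):
    """Censored log-likelihood for an exponential lifetime with the given failure rate."""
    return censored_loglik(
        times, observed,
        log_pdf=lambda t: math.log(rate) - rate * t,   # log f(t) = log(rate) - rate * t
        log_rel=lambda t: -rate * t,                   # log R(t) = -rate * t
    )

def exponential_mle(times, observed):
    """Closed-form MLE for the exponential case: observed failures / total time."""
    return sum(observed) / sum(times)

# Hypothetical lifetimes after PM; an indicator of 0 marks a right-censored value
u = [12.0, 30.0, 7.5, 30.0, 21.0]
c = [1, 0, 1, 0, 1]
rate_hat = exponential_mle(u, c)
```

For gamma or Weibull lifetimes no closed form is available in general, and the same log-likelihood would be maximised numerically.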

For a frequentist analysis, we evaluate the maximum likelihood estimates of θ and φ by maximising the natural logarithm of this function with respect to these parameters. Subsequent inference generally assumes that the parameters are equal to these values. To avoid the errors that arise through adopting a naïve or frequentist approach, we can instead adopt a Bayesian approach. This leads naturally to a decision-theoretic solution to the problem of PM scheduling and we refer interested readers to the article by Percy et al. (1998a) for details.

We now turn our attention to parameter estimation for the nonhomogeneous Poisson process. For failure times T1, T2, …, TN(T) with observed values t1, t2, …, tN(T) in the interval (0, T], the likelihood function corresponding to a NHPP with intensity function ι(t) is given by

$$L\{\iota; H(t)\} \propto \left\{\prod_{i=1}^{N(T)} \iota\left(t_i^-\right)\right\} \exp\left\{-\int_0^T \iota(t)\, dt\right\} \tag{8.20}$$


and so the log-likelihood function becomes

$$l\{\iota; H(t)\} = \text{const.} + \sum_{i=1}^{N(T)} \log \iota\left(t_i^-\right) - \int_0^T \iota(t)\, dt. \tag{8.21}$$

Therefore, once we specify the formulation of ι(t), we can obtain estimates for its unknown parameters via likelihood-based methods.

Example 8.4 Assuming T = tN(T) so that observation ceases at a failure, the maximum likelihood estimates (MLEs) can be determined analytically for the power-law process (NHPP with power-law intensity). With ι(t) = α t^β and n = N(T), the MLEs are

$$\hat{\beta} = \frac{n}{\sum_{i=1}^{n} \log\left(T / t_i\right)} - 1 \tag{8.22}$$

and

$$\hat{\alpha} = \frac{n\left(\hat{\beta} + 1\right)}{T^{\hat{\beta} + 1}}. \tag{8.23}$$
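Equations (8.22) and (8.23) can be checked numerically. The sketch below applies them to the seven arrival times quoted in Example 8.4 and also codes the log-likelihood (8.21) for the power-law intensity, so that the estimates can be compared with neighbouring parameter values. The function names are our own and the code is only an illustrative check, not part of the original analysis.

```python
import math

def power_law_mle(times):
    """MLEs (8.22) and (8.23) for iota(t) = alpha * t**beta, with T equal to the last failure time."""
    n = len(times)
    T = times[-1]
    beta_hat = n / sum(math.log(T / t) for t in times) - 1.0
    alpha_hat = n * (beta_hat + 1.0) / T ** (beta_hat + 1.0)
    return alpha_hat, beta_hat

def power_law_loglik(alpha, beta, times):
    """Log-likelihood (8.21), omitting the constant, for the power-law intensity (beta > -1)."""
    T = times[-1]
    log_intensities = sum(math.log(alpha) + beta * math.log(t) for t in times)
    expected_failures = alpha * T ** (beta + 1.0) / (beta + 1.0)   # integral of iota over (0, T]
    return log_intensities - expected_failures

arrivals = [15, 42, 74, 117, 168, 233, 410]   # days, as quoted in Example 8.4
alpha_hat, beta_hat = power_law_mle(arrivals)
print(round(beta_hat, 4), round(alpha_hat, 4))
```

The printed values should agree with those quoted in the example to the precision shown there, and the log-likelihood at these estimates should exceed its value at nearby parameter settings.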

For a particular system, successive arrival times (not inter-arrival times) were observed to be 15, 42, 74, 117, 168, 233 and 410 days. With n = 7, T = 410 and t1 = 15, …, t7 = 410, we have β̂ ≈ −0.3007 and then α̂ ≈ 0.07288. As β̂ < 0, the intensity is a strictly decreasing function of time; this is a happy system that seems to improve with age.

Analysis of the intensity based models follows by extending this likelihood function corresponding to the NHPP. Consider the generalized proportional intensities model of Section 8.5. The choice of which predictor variables to include depends upon the sample size (history of failures) and the results of standard selection procedures based on analyses of deviance for nested models. Only important predictors should be included in order to produce a robust model. We can estimate the parameters in the model by maximum likelihood, on extending the NHPP likelihood presented above, whereby the log-likelihood is given by

$$l\{\iota; H(T)\} = \text{const.} + \sum_{k=1}^{n} c_k \left\{ \log \iota_0\left(t_k^-\right) + M\left(t_k^-\right) \log \rho + N\left(t_k^-\right) \log \sigma + x_{t_k^-}^T \gamma \right\} - \sum_{k=0}^{n} \left\{ \rho^{M(t_k)} \sigma^{N(t_k)} \int_{t_k}^{t_{k+1}} \iota_0(t) \exp\left(x_t^T \gamma\right) dt \right\}. \tag{8.24}$$

This corresponds to the simple case where the scaling factors are constant: minor changes are needed for the more general cases.


8.7 Model Selection

In this section, we consider how to choose among the many approaches described above, namely the renewal process (RP), delayed renewal process (DRP), delayed alternating renewal process (DARP), virtual age model (VAM), proportional hazards model (PHM), nonhomogeneous Poisson process (NHPP), intensity reduction model (IRM), proportional intensities model (PIM) and generalized proportional intensities model (GPIM). The main distinguishing features are process stationarity, goodness of fit, mathematical robustness, consistency and ease of implementation.

All of these models are concerned with describing the failure and repair process of complex repairable systems subject to preventive maintenance. We subsequently use the fitted models to forecast the system behaviour under different PM strategies by simulation. This enables us to determine the optimal strategy by minimising the expected cost per unit time over a suitable horizon, finite or infinite, with respect to suitable loss or utility functions.

The RP only applies to individual components, for which corrective and preventive maintenance effectively amount to replacement. The DRP and DARP apply to stationary systems, whereas all of the later models allow for nonstationarity. The DRP is easier to fit than the DARP, but ignores the influence of downtimes, so we use the latter if these are significant.

Ascher and Feingold (1984) discuss several methods for assessing stationarity. Perhaps the simplest of these is a graph that plots the observed cumulative number of failures against the observed cumulative operating hours. Consistent departures from linearity might suggest that some trends are present. Naturally, we must exercise care to avoid distorting the results when allowing for PM interventions. Sometimes though, we seek a formal hypothesis test to assess whether the assumption of stationarity is reasonable. One of these is Laplace's trend test, which is simple and sufficient for most needs.
Suppose we observe the system history from time 0 until time t and suppose that we observe n failures at times t1, t2, …, tn. Then Laplace's trend test compares the test statistic

$$U = \frac{\sum_{i=1}^{n} t_i - \frac{nt}{2}}{t \sqrt{\frac{n}{12}}} \tag{8.25}$$

with standard normal critical values, rejecting the null hypothesis of no trend if U ∉ (−z_{p/2}, z_{p/2}) for a hypothesis test at the 100p% level of significance, where the proportion p represents the size of the test. For a 5% significance test, the critical values are given by z_{p/2} = 1.960. If we decide that a system is nonstationary, we could use the VAM or PHM, which are easier to fit to data than the stochastic processes considered next, but are less robust because of their statistical rather than mathematical derivation. However, all of these models require numerical computation to some extent. The VAM and PHM might provide a better fit to the observed data on occasions,


though the mathematical justification and consequent robustness of the stochastic processes are most appealing properties. The NHPP corresponds to minimal repairs and only applies to systems containing very many similar components. However, it is the fundamental model for complex repairable systems and its simplicity appeals to many practitioners. The IRM and PIM improve upon the NHPP by allowing for partial repairs and preventive maintenance. The GPIM combines the best features of both models and perhaps offers the most potential for PM scheduling problems, despite the extra computational burden it attracts.

Whichever model we choose to fit to our data, some degree of model comparison is necessary. For the RP, DRP and DARP, we need to decide which lifetime distributions to fit following PM and CM. For the VAM and PHM, we must select suitable baseline hazard functions, scaling factors and explanatory variables for our linear predictors. Similar choices are necessary for the NHPP, IRM, PIM and GPIM. We can assess the goodness of fit of a model using its likelihood function and can compare different models using likelihood ratios or Bayes factors. Consider two nested models M1 ⊃ M2 with p1 > p2 parameters and likelihood functions L1 > L2 respectively. Then under general conditions, asymptotic sampling distribution theory states that

$$2 \log \frac{L_1}{L_2} \sim \chi^2\left(p_1 - p_2\right) \tag{8.26}$$

and so we can test whether the extra parameters are significant. This is particularly beneficial when choosing which elements to include in a linear predictor. If the models M1 and M2 are not nested, we cannot use this formal test and simply compare the log-likelihood functions log L1 and log L2, choosing the model with the larger log-likelihood. This is appropriate for choosing between gamma and Weibull baseline hazard functions, for example. However, it is only valid if p1 = p2, as a model with more parameters often fits better than a model with fewer parameters, by definition. To compare non-nested models with different numbers of parameters, we usually apply a correction factor to the log-likelihood functions. Two common modified forms are the Akaike information criterion (AIC), which suggests that we compare log L1 − p1 with log L2 − p2, and the Schwarz criterion, or Bayes information criterion (BIC), which suggests that we compare log L1 − (p1 log n)/2 with log L2 − (p2 log n)/2, where n is the number of observations in the data set. The latter arises as the limiting case of the posterior odds resulting from a Bayesian analysis with reference priors. In each case, the best model to choose is the one that maximizes the information criterion.

Example 8.5 Suppose we fit two non-nested models to a set of lifetime data, based on n = 31 observed failures. The first model contains three parameters and has a likelihood of L1 = 8.742 × 10^−18. The second model contains five parameters and has a likelihood of L2 = 3.110 × 10^−17. The Bayes information criterion for the first model is log L1 − (p1 log n)/2 ≈ −44.43 and for the second model it is log L2 − (p2 log n)/2 ≈ −46.59, so we prefer the first, simpler model here.
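The arithmetic of Example 8.5 is easily reproduced. The short sketch below codes the two criteria in the "larger is better" form used in this section and evaluates them for the two likelihoods quoted in the example; the function and variable names are our own.

```python
import math

def aic(log_lik, p):
    """Akaike criterion in the form used here: log-likelihood minus number of parameters."""
    return log_lik - p

def bic(log_lik, p, n):
    """Schwarz (Bayes) criterion in the form used here: log L - (p log n) / 2."""
    return log_lik - p * math.log(n) / 2.0

n_obs = 31
models = {
    "model 1": (math.log(8.742e-18), 3),   # (log-likelihood, number of parameters)
    "model 2": (math.log(3.110e-17), 5),
}
scores = {name: (aic(ll, p), bic(ll, p, n_obs)) for name, (ll, p) in models.items()}
preferred = max(scores, key=lambda name: scores[name][1])   # largest BIC wins
```

Note that many software packages report these criteria multiplied by −2, in which case the preferred model is the one with the smallest value.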


8.8 Preventive Maintenance Scheduling

The objective of model fitting is to determine the optimal PM period for minimising the expected cost per unit time. We will see that analytical solution of this problem is not possible and that simulation of the failure and repair process over a given horizon provides the best approach for resolving this difficulty. Sometimes, no particular horizon is specified and we can do no more than assume an infinite horizon. However, this problem simplifies for stationary systems involving the models based on the renewal process, as we only need to simulate the process over a single PM interval.

On other occasions, a finite horizon is clearly defined. Perhaps a factory or machine is owned on a 20-year lease. Alternatively, the equipment might be retained until cost efficiencies on a larger scale recommend replacement or scrapping. For a pre-determined finite horizon such as these, we base decisions on simulating the process for the whole horizon. Further analysis could be performed for the situation where a finite horizon is not pre-specified and must be regarded as random. The extra complexity introduced is a current research issue.

We begin by considering the delayed renewal process again. Suppose that the costs associated with PM and CM are kPM and kCM units respectively. Assuming an infinite horizon, we now simulate a PM interval of length t. This involves generating a pseudo-random observation u from fU(u), to represent a typical lifetime following PM. If u ≥ t, the interval is complete and the total cost incurred is kPM. However, if u < t we generate a pseudo-random observation v1 from fV(v), to represent a typical lifetime following CM, and add a cost kCM for the repair. We continue this process, generating CM lifetimes v1, v2, v3, … and adding a further cost kCM each time until this interval is complete, and then calculate the total cost for this interval. Call this total cost K1.
This procedure has completely simulated a PM interval of length t. We next repeat the procedure, until we have m repetitions in total, and determine the total costs for these simulated intervals, K1, K2, …, Km. Then their sample mean

$$\bar{K} = \frac{1}{m} \sum_{i=1}^{m} K_i \tag{8.27}$$

represents an unbiased estimator for the total cost per PM interval. This enables us to estimate the expected cost per unit time as K̄/t, the sample mean cost divided by the interval length. Now we must repeat the whole simulation for different values of t, using an efficient search algorithm, to determine the value of t that minimises this expected cost per unit time. This is the recommended PM interval duration. We advocate direct search algorithms for practical implementation, such as golden-section search. For practical purposes, t is unlikely to vary continuously and discrete values will dominate. Convenient multiples of days, weeks or months provide suitable units of measurement for practical implementation.
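The simulation procedure for the delayed renewal process can be sketched in a few lines. The code below is an illustration only: the Weibull lifetime distributions, their parameters, the costs and the candidate intervals are all hypothetical, and a simple comparison over a discrete grid of candidate intervals stands in for the golden-section search mentioned above.

```python
import random

def simulate_pm_interval(t, draw_u, draw_v, k_pm, k_cm):
    """Total cost of one simulated PM interval of length t under the DRP."""
    cost = k_pm                 # each interval incurs one PM cost
    elapsed = draw_u()          # first lifetime follows the PM at the start of the interval
    while elapsed < t:          # a failure occurs before the next PM is due
        cost += k_cm            # add the cost of the corrective repair
        elapsed += draw_v()     # later lifetimes in the interval follow CM
    return cost

def cost_rate(t, draw_u, draw_v, k_pm, k_cm, m=2000):
    """Estimate the expected cost per unit time, K-bar / t, from m replications."""
    total = sum(simulate_pm_interval(t, draw_u, draw_v, k_pm, k_cm) for _ in range(m))
    return total / m / t

def best_interval(candidates, **kwargs):
    """Return the candidate PM interval with the smallest estimated cost rate."""
    return min(candidates, key=lambda t: cost_rate(t, **kwargs))

rng = random.Random(1)                               # seeded for repeatability
draw_u = lambda: rng.weibullvariate(100.0, 2.5)      # hypothetical lifetime after PM (scale, shape)
draw_v = lambda: rng.weibullvariate(60.0, 1.5)       # hypothetical lifetime after CM (scale, shape)
grid = [7, 14, 30, 60, 90, 180]                      # candidate PM intervals in days
t_star = best_interval(grid, draw_u=draw_u, draw_v=draw_v, k_pm=1.0, k_cm=10.0)
```

Because the estimated cost rates are themselves random, a larger number of replications m, or common random numbers across candidate intervals, may be needed before close candidates can be separated reliably.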


To deal with scenarios involving finite horizons, we modify this simulation procedure. Instead of generating one simulated PM interval on many occasions, we simulate the process over the whole horizon h and accumulate the costs of PM and CM over this period. If we redefine K1 as the total cost over this horizon, then successive replications of this simulated process generate total costs K1, K2, …, Km as before. This time however, the expected cost per unit time is given by K̄/h, where K̄ is the sample mean defined earlier.

We now shift our attention to the delayed alternating renewal process. For most applications, it is reasonable to suppose that all PM activities have downtimes of similar durations, at an average cost of kPM units, and that all CM activities have downtimes of similar durations, at an average cost of kCM units. The analysis of this DARP model then proceeds exactly as for the DRP model, except that the simulation of successive PM intervals must also take account of these downtimes. Our model assumptions could be extended to consider different levels of maintenance activity, if these are evident in practice. For example, PM and CM might each be performed as minor or major activities, with corresponding downtimes. Such possibilities are application specific and can readily be incorporated as required, by adapting the basic simulation program. Indeed, simulation is the only feasible method of analysis and optimisation in this case.

To investigate PM scheduling for the nonhomogeneous Poisson process, we condition only upon the history at time t to avoid the problems associated with doubly stochastic processes and obtain

$$P\{N(t+\varepsilon) - N(t) = n \mid H(t)\} = \frac{\{\mu_t(\varepsilon)\}^n}{n!} \exp\{-\mu_t(\varepsilon)\} \tag{8.28}$$

for n = 0, 1, 2, … where

$$\mu_t(\varepsilon) = \int_t^{t+\varepsilon} \iota(t)\, dt \tag{8.29}$$

is the mean number of failures in the interval (t, t + ε). Consequently, the reliability function for the next failure from time t is

$$R_t(\varepsilon) = P\{N(t+\varepsilon) - N(t) = 0 \mid H(t)\} = \exp\{-\mu_t(\varepsilon)\}, \tag{8.30}$$

from which we can determine the lifetime distribution following a particular maintenance action at time t as

$$f_t(\varepsilon) = -R_t'(\varepsilon) = \iota(t+\varepsilon) \exp\{-\mu_t(\varepsilon)\}. \tag{8.31}$$
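For simulation purposes, the reliability function (8.30) can be inverted to generate the time to the next failure. The sketch below does this for the power-law intensity ι(t) = α t^β with β > −1, for which the mean number of failures (8.29) has a closed form. This choice of intensity, the function names and the parameter values (taken from Example 8.4 purely for illustration) are assumptions of this sketch; other intensities would generally require numerical integration and root finding.

```python
import math
import random

def mean_failures(t, eps, alpha, beta):
    """mu_t(eps) of (8.29) for iota(t) = alpha * t**beta, assuming beta > -1."""
    b = beta + 1.0
    return alpha * ((t + eps) ** b - t ** b) / b

def next_failure_gap(t, alpha, beta, u):
    """Solve R_t(eps) = u for eps, where u is a uniform draw from (0, 1]."""
    b = beta + 1.0
    # exp(-mu_t(eps)) = u  implies  (t + eps)**b = t**b - b * log(u) / alpha
    return (t ** b - b * math.log(u) / alpha) ** (1.0 / b) - t

rng = random.Random(42)          # seeded for repeatability
t_now = 410.0                    # current system age
u_draw = 1.0 - rng.random()      # lies in (0, 1], so log(u_draw) is defined
gap = next_failure_gap(t_now, alpha=0.07288, beta=-0.3007, u=u_draw)
```

Repeated draws of this kind, with the intensity updated after each maintenance action, provide the simulated failure times needed to evaluate expected costs over a finite horizon.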


This allows us to simulate the process as before, evaluate expected costs over a finite horizon, and so deduce the most economical time for the next preventive maintenance. This decision can be made at any specific event, such as during PM or CM, or even between events, so long as the intensity function is known.

Next we consider the proportional hazards model. To avoid referring separately to the hazard functions κ(u) and λ(v), consider a general hazard function h(x). For the purposes of simulation in order to schedule PM in the future, the reliability function can be determined as

$$R(x) = \exp\left\{-\int_0^x h(x)\, dx\right\}, \tag{8.32}$$

from which the probability density function is

$$f(x) = -R'(x) = h(x) \exp\left\{-\int_0^x h(x)\, dx\right\}, \tag{8.33}$$

allowing us to simulate the system’s failure process for PM optimisation as before.

8.9 Applications

We now apply some of these models to the data sets in Section 8.2.

Example 8.6 For each system, we fitted the intensity reduction model using constant, loglinear and power-law baseline intensities with constant reduction factors. Its goodness of fit is measured by the log-likelihoods in Table 8.5, obtained using Mathcad software. For comparison, we also display the log-likelihoods for the extremes of renewal process (maximal repairs) and nonhomogeneous Poisson process (minimal repairs).

Table 8.5. Log-likelihoods for analyses of hypothetical reliability data

| Model | Baseline intensity | Happy system | Sad system | Noncommittal system |
|---|---|---|---|---|
| intensity reduction | constant | −33.7 | −33.7 | −35.5 |
| intensity reduction | loglinear | −32.4 | −28.5 | −33.4 |
| intensity reduction | power-law | −29.4 | −32.0 | −34.7 |
| maximal repair | constant | −35.5 | −35.5 | −35.5 |
| maximal repair | loglinear | −34.8 | −34.8 | −34.8 |
| maximal repair | power-law | −35.1 | −35.1 | −35.1 |
| minimal repair | constant | −35.5 | −35.5 | −35.5 |
| minimal repair | loglinear | −34.8 | −32.0 | −35.2 |
| minimal repair | power-law | −35.0 | −31.8 | −35.3 |

As expected, the intensity reduction model provides a good fit to all three systems, preferring the power-law baseline intensity for the happy system and the loglinear baseline intensity for the sad and noncommittal systems. Figure 8.4 shows that these baseline intensities are all increasing functions and any apparent happiness is due to the high quality of repairs rather than a self-improving system.


Figure 8.4. Best fitting models for happy, sad and noncommittal systems, respectively [three panels, each plotting the intensity function λ(t, a, b, s) against time t from 0 to 410, with a vertical axis from 0 to 0.18]


Example 8.7 Regarding all PM actions as CM actions for demonstration purposes, we apply the Laplace trend test to determine whether there is any evidence of non-stationarity at the 5% level of significance. Our test statistic is

$$U = \frac{\sum_{i=1}^{n} t_i - \frac{nt}{2}}{t \sqrt{\frac{n}{12}}} = \frac{21{,}901 - \frac{22 \times 2{,}128}{2}}{2{,}128 \times \sqrt{\frac{22}{12}}} \approx -0.5230. \tag{8.34}$$
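The calculation in (8.34) can be reproduced from the three summary quantities that appear in it: the sum of the failure times, the number of failures and the length of the observation period. The function name below is our own, and the critical value 1.960 is the two-sided 5% point of the standard normal distribution quoted in Section 8.7.

```python
import math

def laplace_statistic(sum_times, n, t):
    """Laplace trend statistic (8.25) from the sum of n failure times observed on (0, t]."""
    return (sum_times - n * t / 2.0) / (t * math.sqrt(n / 12.0))

u_stat = laplace_statistic(sum_times=21901, n=22, t=2128)
reject = abs(u_stat) > 1.960     # two-sided test at the 5% level of significance
```

When the individual failure times are available, the first argument is simply their sum.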

As −1.960 < U < 1.960, the test is not significant at the 5% level and we conclude that this test provides no evidence of non-stationarity for these data. Consequently, the delayed renewal process might provide an adequate fit to these data, without the need for a more complicated model. However, we might consider using the DARP if downtime is important or one of the later models if concomitant information is also available.

Example 8.8 Here the data comprise 65 event observations collected over seven years. In the first half of this period, there were 15 CM and 11 PM actions. In the second half of this period, there were 29 CM actions and 10 PM actions. Hence, this is a sad system, which might benefit from preventive maintenance. We fit the generalized proportional intensities model to these data with explanatory variables representing quality of last maintenance and time since last maintenance. A loglinear baseline with constant reduction factors generates the results in Table 8.6.

Table 8.6. Log-likelihoods and parameter estimates for GPIM analyses of oil pump data

| Predictor variables | Log-likelihood | α̂ | β̂ | ρ̂ | σ̂ | γ̂ |
|---|---|---|---|---|---|---|
| (none) | −211.7 | 5 × 10^−4 | 1.01 | 0.719 | 0.740 | |
| Quality of last action | −210.2 | 6 × 10^−4 | 1.01 | 0.699 | 0.745 | 6 × 10^−3 |
| Time since last action | −210.8 | 7 × 10^−4 | 1.01 | 0.666 | 0.728 | −8 × 10^−3 |
| Quality of last action; time since last action | −209.5 | 8 × 10^−4 | 1.01 | 0.653 | 0.734 | 6 × 10^−3; −7 × 10^−3 |

The best model includes both “quality of last maintenance action” and “time since last maintenance action” as predictor variables. This is not surprising, as it contains six parameters whereas the model with no predictor variables has only four. As the associated PM reduction factor ρ̂ is about two-thirds, preventive maintenance reduces the intensity of critical failures for this system and so improves its reliability. Although slightly less impressive, corrective maintenance reduces the intensity function too. Hence, the maintenance workforce appears to be very effective for this application! A graph of the intensity function for the GPIM with both covariates follows in Figure 8.5, based on the corresponding parameter estimates in the last row of Table 8.6.

[Figure 8.5: plot of the intensity function λ(t, a, b, r, s, c1, c2) (vertical axis, 0 to 0.1) against time t (horizontal axis, 0 to 2487).]

Figure 8.5. Intensity function for GPIM analysis of oil pump data with two covariates
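Because a larger model can never have a lower maximised log-likelihood than a model nested within it, the log-likelihoods in Table 8.6 can usefully be read alongside a penalised criterion such as the Akaike information criterion, AIC = 2k − 2 log L. The sketch below is illustrative only. The parameter counts of four (no predictors) and six (both predictors) are stated in the text; the count of five for each single-predictor model is an assumption, as are all names in the code.

```python
# Maximised log-likelihoods from Table 8.6, paired with parameter counts.
# Counts of 4 and 6 are stated in the text; 5 for the single-covariate
# models is an assumption made for this illustration.
models = {
    "no covariates": (-211.7, 4),
    "quality of last action": (-210.2, 5),
    "time since last action": (-210.8, 5),
    "both covariates": (-209.5, 6),
}

def aic(log_likelihood, n_params):
    """Akaike information criterion: smaller values are preferred."""
    return 2 * n_params - 2 * log_likelihood

aic_by_model = {name: aic(ll, k) for name, (ll, k) in models.items()}

# Ranking by raw log-likelihood favours the largest model ...
best_by_loglik = max(models, key=lambda name: models[name][0])
# ... whereas AIC penalises each extra parameter by two units.
best_by_aic = min(aic_by_model, key=aic_by_model.get)

for name, value in aic_by_model.items():
    print(f"{name:26s} AIC = {value:.1f}")
print("highest log-likelihood:", best_by_loglik)
print("lowest AIC:", best_by_aic)
```

Under these assumptions the four AIC values lie between about 430.4 and 431.6, so the criterion separates the models only weakly. This is consistent with the caution in the text that the apparent superiority of the six-parameter model is not surprising.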

We now perform a simulation analysis for this last model based on the methods described in Section 8.8, in order to determine an optimal strategy for scheduling preventive maintenance. Several convenient PM intervals are considered for our calculations, including weekly, monthly, two-monthly, quarterly, biannually, annually and biennially. The minimum cost per unit time over a ten-year fixed horizon is achieved with monthly PM and generates a projected 80% saving over annual PM, though this estimated reduction in costs is sensitive to the choice of model. The previous policy implemented averages about three PM actions per year, which our simulation estimates would cost about four times as much in preventive maintenance when compared with the optimal policy of monthly PM.

8.10 Conclusions

This chapter discussed the ideas of modelling complex repairable systems, with the intention of scheduling preventive maintenance to improve operational efficiency and reduce running costs. It started by emphasising the importance of improved, accurate and complete data collection in practice. It then presented the renewal process, delayed renewal process and delayed alternating renewal process as reasonable models for systems that exhibit stationary failure patterns.


The virtual age model and proportional hazards model were described as suitable for systems that do not exhibit stationarity and for systems where predictor variables such as condition monitoring observations are also measured. The nonhomogeneous Poisson process, intensity reduction model and proportional intensities model, with a promising generalization, were described next. We claim that these models offer natural interpretations of the underlying physical reliability and maintenance processes. Finally, this chapter demonstrated some applications of these ideas using reliability and maintenance data taken from the oil industry and reviewed several methods for model selection and goodness-of-fit testing, including graphs, the Laplace trend test, likelihood ratios and the Akaike and Bayes information criteria. The use of mathematical modelling and statistical analysis in this fashion can improve, and has improved, the quality of PM scheduling. This can then result in considerable cost savings and help to improve system availability.

8.11 References

Ascher HE, Feingold H, (1984) Repairable Systems Reliability: Modeling, Inference, Misconceptions and their Causes. New York: Marcel Dekker
Baik J, Murthy DNP, Jack N, (2004) Two-dimensional failure modeling with minimal repair. Naval Research Logistics 51:345–362
Cox DR, (1972a) Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B 34:187–220
Cox DR, (1972b) The statistical analysis of dependencies in point processes. In Stochastic Point Processes (Lewis PAW). New York: Wiley
Crowder MJ, Kimber AC, Smith RL, Sweeting TJ, (1991) Statistical Analysis of Reliability Data. London: Chapman and Hall
Dagpunar JS, Jack N, (1993) Optimizing system availability under minimal repair with nonnegligible repair and replacement times. Journal of the Operational Research Society 44:1097–1103
Dekker R, Frenk H, Wildeman RE, (1996) How to determine maintenance frequencies for multi-component systems? A general approach. In Reliability and Maintenance of Complex Systems (Ozekici S). Berlin: Springer
Doyen L, Gaudoin O, (2004) Classes of imperfect repair models based on reduction of failure intensity or virtual age. Reliability Engineering and System Safety 84:45–56
Handlarski J, (1980) Mathematical analysis of preventive maintenance schemes. Journal of the Operational Research Society 31:227–237
Jack N, (1998) Age-reduction model for imperfect maintenance. IMA Journal of Mathematics Applied in Business and Industry 9:347–354
Jardine AKS, Anderson PM, Mann DS, (1987) Application of the Weibull proportional hazards model to aircraft and marine engine failure data. Quality and Reliability Engineering International 3:77–82
Jiang R, Jardine AKS, (2006) Composite scale modeling in the presence of censored data. Reliability Engineering and System Safety 91:756–764
Kobbacy KAH, Fawzi BB, Percy DF, Ascher HE, (1997) A full history proportional hazards model for preventive maintenance scheduling. Quality and Reliability Engineering International 13:187–198
Lindqvist BH, Elvebakk G, Heggland K, (2003) The trend-renewal process for statistical analysis of repairable systems. Technometrics 45:31–44
Lugtigheid D, Banjevic D, Jardine AKS, (2004) Modelling repairable systems reliability with explanatory variables and repair and maintenance actions. IMA Journal of Management Mathematics 15:89–110
Newby M, (1994) Perspective on Weibull proportional hazards models. IEEE Reliability Transactions 43:217–223
Percy DF, Alkali BM, (2006) Generalized proportional intensities models for repairable systems. IMA Journal of Management Mathematics 17:171–185
Percy DF, Kobbacy KAH, (2000) Determining economical maintenance intervals. International Journal of Production Economics 67:87–94
Percy DF, Bouamra O, Kobbacy KAH, (1998a) Bayesian analysis of fixed-interval preventive-maintenance models. IMA Journal of Mathematics Applied in Business and Industry 9:157–175
Percy DF, Kobbacy KAH, Ascher HE, (1998b) Using proportional-intensities models to schedule preventive-maintenance intervals. IMA Journal of Mathematics Applied in Business and Industry 9:289–302
Taylor HM, Karlin S, (1994) An Introduction to Stochastic Modelling. London: Academic Press
Van der Duyn Schouten F, (1996) Maintenance policies for multicomponent systems: an overview. In Reliability and Maintenance of Complex Systems (Ozekici S). Berlin: Springer
Watson C, (1970) Is preventive maintenance worthwhile? In Operational Research in Maintenance (Jardine AKS). Manchester: University Press

9 Artificial Intelligence in Maintenance

Khairy A. H. Kobbacy

9.1 Introduction

Over the past two decades there has been substantial research and development in operations management, including maintenance. Kobbacy et al. (2007) argue that the continuous research in these areas implies that solutions were not found to many problems. This was attributed to the fact that many of the solutions proposed were for well-defined problems, that the solutions assumed accurate data were available and that the solutions were too computationally expensive to be practical. Artificial intelligence (AI) was recognised by many researchers as a potentially powerful tool, especially when combined with OR techniques, to tackle such problems. Indeed, there has been vast interest in the applications of AI in the maintenance area, as witnessed by the large number of publications in the area. This chapter reviews the application of AI in maintenance management and planning and introduces the concept of developing an intelligent maintenance optimisation system.

The outline of the chapter is as follows. Section 9.2 deals with various maintenance issues including maintenance management, planning and scheduling. Section 9.3 introduces a brief definition of AI, some of its techniques that have applications in maintenance and Decision Support Systems. A review of the literature is then presented in Section 9.4 covering the applications of AI in maintenance. We have focused on five AI techniques, namely knowledge based systems, case based reasoning, genetic algorithms, neural networks and fuzzy logic. This review also covers “hybrid” systems where two or more of the above mentioned AI techniques are used in an application. Other AI techniques seem to have very few applications in maintenance to date. A discussion of the development of the prototype hybrid intelligent maintenance optimisation system (HIMOS), which was developed to evaluate and enhance preventive maintenance (PM) routines of complex engineering systems, follows in Section 9.5. HIMOS uses a knowledge based system to identify suitable models to schedule PM activities and case based reasoning to add the capability to utilise past experience in model selection. Future developments and the outline design of an Adaptive Maintenance Measurement and Control Model are covered in Section 9.6. Concluding remarks are presented in Section 9.7.

The following abbreviations are used throughout this chapter.

AHP: Analytic hierarchy process
AI: Artificial intelligence
CBR: Case based reasoning
CO: Corrective action
DMG: Decision making grid
DSS: Decision support system
FL: Fuzzy logic
GAs: Genetic algorithms
HIMOS: Hybrid intelligent maintenance optimisation system
IDSS: Intelligent decision support system
IMOS: Intelligent maintenance optimisation system
KBS: Knowledge based systems
NHPP: Non-homogeneous Poisson process
NNs: Neural networks
OR: Operational research
PHM: Proportional hazards model
PIM: Proportional intensities model
PM: Preventive maintenance
RBR: Rule based reasoning

9.2 Maintenance Management, Planning and Scheduling

Most industrial organisations have maintenance departments which deal with many issues regarding operations. For example, they can be involved in process design, inventory, scheduling and staffing. However, the ultimate objective of maintenance is to keep equipment at an acceptable standard. To achieve this objective a variety of maintenance actions are employed, including inspection, repair, planned maintenance and replacement. Adequate planning of the type, contents and timing of maintenance actions is essential for the success of the maintenance function (Kobbacy 1992).

A survey of some 34 companies was carried out in the UK (Kobbacy et al. 2005). It indicated that around half of the work carried out by maintenance departments was on repair, around a quarter was on preventive maintenance and 5% was on inspection. The remaining effort was on other types of maintenance actions including opportunistic maintenance, condition monitoring and design-out maintenance.

Repairs represent the largest proportion of maintenance actions carried out by maintenance departments and indeed all departments surveyed carried out repairs. Repair is the maintenance action that restores the equipment to operating condition. Some repair actions restore equipment to as new condition while others are classed as minimal repair, i.e. they restore equipment to the condition just prior to failure. In reality, equipment is likely to be restored to a condition between these two states. Occasionally, repairs may introduce faults to the equipment.


Preventive maintenance is the maintenance action that is undertaken in the belief that it reduces the occurrence of failures as compared with the alternative of repairing components only upon failure (Kobbacy et al. 1995a). PM is perhaps the most intractable of maintenance actions in terms of mathematical modelling. The main reason is that only one point is usually known on the curve representing cost/availability against PM interval, and the analyst attempts to predict the failure rate at a range of PM intervals in order to select the optimal interval.

Inspection is the action taken to establish the condition of equipment at some point in time. It can be triggered by observing unusual performance of equipment, e.g. noise, or else the inspection can be carried out at regular predetermined intervals. A major difference between PM and inspection is that PM routines usually involve planned maintenance action, e.g. replace component, make adjustment, etc., while inspection involves checking the condition of equipment and carrying out maintenance action based on the outcome of inspection. In other words, inspection routines, unlike PM, do not contain predetermined restoration of equipment condition.

Fault diagnosis is an integral part of maintenance actions and it follows from realizing that a fault has occurred. It is essentially required before repairs are carried out following failure, preventive maintenance, inspection or condition monitoring.

There are two approaches to maintenance planning/management: the engineering approach and the mathematical approach (Gits 1984). The engineering approach takes a broad view of the maintenance problem, as the maintenance concept is determined through consideration of the operations plan, maintenance constraints and item behaviour. Thus it emphasises the development of rules or guidelines for planning maintenance action. The mathematical approach places more emphasis on developing optimal maintenance policies, e.g. optimal PM interval. A major challenge in this field is how to integrate these approaches. Many software packages have been developed over the years to help in the analysis and modelling of maintenance situations, though they have their limitations, including the need for intervention by an analyst, which can slow the process or make the analysis almost intractable for large systems.

Scheduling of maintenance actions is a part of maintenance planning. Not all maintenance actions require scheduling, e.g. repair upon failure and design-out maintenance. Opportunistic maintenance, by definition, is carried out taking advantage of the time when equipment is not in use, but planning for spare parts can be required. Condition monitoring can be a continuous process but often requires planning of the monitoring interval and the subsequent replacement. The other two major maintenance actions that require scheduling are preventive maintenance and “planned” inspections. Typically, one needs first to establish a model for the failure pattern, i.e. times between failures. A non-homogeneous Poisson process is usually the model of first choice for deteriorating repairable systems (Ascher and Kobbacy 1995). There have been many attempts to schedule PM routines, i.e. to decide on the frequency of PM actions per year. Ascher and Kobbacy (1995) present models for scheduling PM by minimizing cost or maximizing availability using an NHPP. Other attempts use Cox’s proportional hazards model (Kobbacy et al. 1997) and the proportional intensities model (Percy et al. 1998). The latter has proved to be of great promise and is indeed being investigated for application in more complex PM situations, e.g. multiple PM routines.

9.3 AI Techniques

AI is a branch of computer science that develops programmes to allow machines to perform functions normally requiring human intelligence (Microsoft ENCARTA College Dictionary 2001). The goal of AI is to teach machines to “think” to a certain extent under special conditions (Firebaugh 1988). There are many AI techniques; those most used in maintenance decision support are as follows.

Knowledge based systems (KBS): use domain specific rules of thumb or heuristics (production rules) to identify a potential outcome or suitable course of action.

Case based reasoning (CBR): utilises past experiences to solve new problems. It uses case index schemes, similarity functions and adaptation. It provides machine learning through updating of the case base.

Genetic algorithms (GAs): these are based on the principle that solutions can evolve. Potential promising solutions evolve through mutation and weaker solutions become extinct.

Neural networks (NNs): use the back propagation algorithm to emulate the behaviour of the human brain. Both NNs and GAs are capable of learning how to classify, cluster and optimise.

Fuzzy logic (FL): allows the representation of information of an uncertain nature. It provides a framework in which membership of a category is graded and hence quantifies such information for mathematical modelling, etc.

There are several other AI techniques, including Data Mining, Robotics and Intelligent Agents. However, to date very few publications are available about their applications in maintenance.

9.3.1 Intelligent Decision Support Systems

A useful definition of DSS is as follows. It is a computer based system that helps decision makers confront ill-structured problems through direct interaction with data and analysis models (Sprague and Watson 1986). The result of integrating an AI technique within a DSS is referred to in this chapter as an Intelligent DSS. This is essentially a DSS as defined above, but has the additional capabilities to “understand”, “suggest” and “learn” in dealing with managerial tasks and problems. The method of integration and the features of the end product depend very much on the area of application.
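Of the techniques listed in Section 9.3, the idea of graded membership in fuzzy logic is perhaps the easiest to illustrate in a few lines. The following is a minimal sketch rather than anything taken from the literature reviewed here: the triangular shape, the vibration example and all numerical breakpoints are hypothetical.

```python
def triangular_membership(x, low, peak, high):
    """Degree (between 0 and 1) to which x belongs to a triangular fuzzy set.

    Membership rises linearly from 0 at `low` to 1 at `peak`,
    then falls linearly back to 0 at `high`.
    """
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)

# Hypothetical fuzzy sets for a vibration reading (arbitrary units)
vibration = 4.0
degree_normal = triangular_membership(vibration, 0.0, 2.0, 5.0)
degree_high = triangular_membership(vibration, 3.0, 7.0, 11.0)

print(f"normal: {degree_normal:.3f}, high: {degree_high:.3f}")
# prints: normal: 0.333, high: 0.250
```

The same reading belongs partly to both categories, which is how fuzzy logic turns a vague judgement such as “the vibration is rather high” into numbers that a mathematical model can use.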

9.4 AI in Maintenance

AI techniques have been used successfully in the past two decades to model and optimise maintenance problems. Since the resurgence of AI in the mid-1980s researchers have considered the applications of AI in this field. The article by Dhaliwal (1986) is one of the early ones that argued for the appropriateness of using AI techniques for addressing the issues of operating and maintaining large and complex engineering systems. Kobbacy (1992) discusses the useful role of knowledge based systems in the enhancement of maintenance routines. Over the years the applications of AI in maintenance grew to cover a very wide area of applications using a variety of AI techniques. This can be explained by the individual nature of each technique. For example, GAs and NNs have the advantage of being useful in optimising complex and nonlinear problems and overcome the limitations of the classic “black box” approaches, where an attempt is made to identify the system by relating system outputs to inputs without understanding and modelling the underlying process. This explains their widespread application in the scheduling area and also in fault diagnosis.

In this section, an up to date survey is presented covering the area of application of AI techniques in maintenance including fault diagnosis. This chapter will only refer to some of the references in the vast applications of AI in fault diagnosis. Interested readers can refer to the recent comprehensive review by Kobbacy et al. (2007) on applications of AI in Operations.

9.4.1 Case Based Reasoning (CBR)

CBR is an interesting AI technique which adds learning capabilities to DSSs. This may explain the lack of publications on using CBR on its own in maintenance. Instead there are a few hybrid applications which utilise CBR together with other AI techniques. Details about the CBR technique are discussed while presenting the case study in Section 9.5.3. Yu et al. (2003) present a problem-oriented multi-agent-based E-service system (POMAESS). The system uses a CBR-based decision support function. The case study, which is discussed later in this chapter, deals with a hybrid KBS/CBR maintenance optimisation system (HIMOS).

More publications are found in fault diagnosis, including papers on its application in locomotive diagnostics, e.g. Varma and Roddy (1999). Xia and Rao (1999) argue the need to develop dynamic CBR, which introduces new mechanisms such as time-tagged indexes and dynamic and multiple indexing to help accurate solving of problems, taking into account system dynamics and fault propagation phenomena. Cunningham et al. (1998) describe an incremental CBR mechanism that can initiate the fault diagnosis process with only a few features. There are also papers on hybrid CBR systems in fault diagnosis, including the use of CBR with Petri nets for induction motor fault diagnosis (Tang et al. 2004), CBR with FL in fault diagnosis of modern commercial aircraft (Wu et al. 2004), CBR with NN in a web-based intelligent fault diagnosis system (Hui et al. 2001), CBR with heuristic reasoning and hypermedia for incident monitoring (Rao et al. 1998) and CBR with KBS in the pattern search problem in fault diagnosis (Kohno et al. 1997).


9.4.2 Genetic Algorithms (GAs)

GAs are popular in maintenance applications because of their robust search capabilities that help reduce the computational complexity of large optimisation problems (Morcous and Lounis 2005), such as large scale maintenance scheduling models. GAs have applications in infrastructure networks including programming the maintenance of concrete bridge decks (Morcous and Lounis 2005; Lee and Kim 2007), pavement maintenance programmes (Chootinan et al. 2006), and optimising highway life-cycle by considering maintenance of roadside appurtenances (Jha and Abdullah 2006). GAs also have applications in maintenance activities in nuclear power plants, including optimising the technical specification of a nuclear safety system by coupling GAs and Monte Carlo simulation in an attempt to minimise the expected value of system unavailability and its associated variance (Marseguerra et al. 2004).

Another important area of application is in manufacturing. Ruiz et al. (2006) present an approach for scheduling of PM in a flowshop problem with the aim of maximising availability. Sortrakul et al. (2005) present a heuristic based on genetic algorithms to solve an integrated optimisation model for production scheduling and preventive maintenance planning. Chan et al. (2006) propose a GA approach to deal with the distributed flexible manufacturing system scheduling problem subject to machine maintenance constraints.

Other popular application areas for GAs include preventive maintenance scheduling optimisation. Application areas in PM include chemical process operations (Tan and Kramer 1997), power systems (Huang 1998), single product manufacturing production lines (Cavory et al. 2001) and mechanical components (Tsai et al. 2001). GAs are also used in deciding on opportunistic maintenance policies (Saranga 2004; Dragan et al. 1995).

GAs have had some moderate but constant interest over the past decade in the area of fault diagnosis.
Applications range from manufacturing systems (Khoo et al. 2000), nuclear power plants (Yangping et al. 2000) and electrical distribution networks (Wen and Chang 1998) to a new area of application in automotive fuel cell power generators (Hissel et al. 2004).

9.4.3 Neural Networks (NNs)

NNs are a popular AI technique applied in the areas of maintenance and in particular in fault diagnosis. NNs are the primary information processing structure used in neurocomputing, i.e. systems that learn the relationship between data through a process of training (Dendronic Decisions Ltd 2003). NNs have many applications in the areas of predictive maintenance and condition monitoring. Gilabert and Arnaiz (2006) present a case study for noncritical machinery, where an NN is used for elevator monitoring and diagnosis as no previous experience existed. Al-Garni et al. (2006) also use an NN for predicting the failure rate of airplane tyres. Gromann de Araujo Goes et al. (2005) have developed a computerised online reliability monitoring system for nuclear power plant applications. An interesting application, developed by Garcia et al. (2004), uses NNs to aid tele-maintenance, where staff can carry out the work remotely and in collaboration with other experts. Other applications of NNs in condition monitoring include the work of Bansal et al. (2004) on machine systems, Booth and McDonald (1998) on electrical power transformers and Spoerre (1997) on bearings. Shyur et al. (1996) use NNs to predict component inspection requirements for ageing aircraft and Eldin and Senouci (1995) use NNs for the condition rating of joint concrete pavements. Lin and Wang (1996) developed an approach combining NNs and advanced vibration monitoring methods for online predictive maintenance of rotating machinery. Luxhoj and Williams (1996) present a hybrid NN/KBS DSS for aircraft safety inspection.

NNs suit model based fault detection and isolation when analytical models are not available. Frank and Koppen-Seliger (1997) define three steps for fault detection: residual generation, i.e. generation of a signal that reflects the fault; residual evaluation, i.e. the logical decision making on the time of occurrence and location of the fault; and fault analysis, i.e. determination of the type of fault, its size and cause. NNs have to be trained for both residual generation and evaluation, using collected or simulated data for the former and residuals for the latter.

There is a large number of papers published on the use of NNs in fault diagnosis covering a wide range of applications. These include diagnosis in induction motors (Yang and Kim 2006), marine propulsion systems (Kuo and Chang 2004), supervision of desalination plant during dynamic states, e.g. start up (Tarifa et al. 2003), engineering structures (Chen et al. 2003), navigation systems (Zhang et al. 2001), power plants (Simani and Fantuzzi 2000) and automotive engine management (Shayler et al. 2000). There are various applications in the chemical process industry for using NNs in fault diagnosis, e.g. packed towers (Sharma et al. 2004) and batch processes (Scenna 2000).

There are studies that make use of hybrid NN systems for fault diagnosis. Yang et al. (2004) integrate CBR with an ART-KNN to enhance fault diagnosis when solving a new problem, with the NN used to make hypotheses and to guide CBR to search for similar previous cases. Jota et al. (1998) use neuro-fuzzy, neuro-expert and fuzzy expert algorithms for fault detection in a range of electrical power system equipment.

9.4.4 Knowledge Based Systems (KBSs)

The use of KBS in maintenance management represents one of the early applications of AI in maintenance. Martland et al. (1990) developed a knowledge-based expert system to guide the rail scheduling process, i.e. in developing a plan for rail relay or replacement. Ahmed et al. (1991) developed an expert system for offshore structure inspection and maintenance. Kobbacy (1992) argued for the use of KBS in the evaluation and enhancement of maintenance routines. Batanov et al. (1993) developed EXPERT-MM, an expert system that supports maintenance policy suggestions, machine diagnosis and maintenance scheduling. Feldman et al. (1992) designed a rule-based expert system to investigate maintenance policies with regard to replacement, minimal repair or no action in continuous manufacturing environments. Srinivasan et al. (1993) present an intelligent scheduling system using KBS for application on a power-distributed system. Drury and Prabhu (1996) provide a framework for information design that captures the interaction between the inspection task and its information requirements in the operation of commercial aircraft. The framework is used together with the cognitive control categories of skill-rule-knowledge-based behaviour to analyse the information needs of aircraft inspectors. de Brito et al. (1997) developed a prototype system for optimising the inspection, maintenance and repair strategies for bridges. A fuzzy knowledge based method for maintenance planning in power systems is demonstrated by Sergaki and Kalaitzakis (2002). In addition, the work of Kobbacy and Jeon (2001) is discussed later (Section 9.5.3).

KBSs also have a wide range of applications in fault diagnosis which, unlike applications in maintenance planning, are showing an increasing trend. KBSs can be used in all three phases of fault diagnosis (see Section 9.4.3). In the case of complex systems where there is insufficient information to formulate a mathematical model, KBSs have been particularly successful. Examples of applications of KBS in fault diagnosis include diagnosing electrical failures in induction motors (Acosta et al. 2006), fault diagnosis of rotating machinery (Yang et al. 2005), CNC machine-tools (Leung and Romagnoli 2002), industrial gas turbines (Milne et al. 2001), research reactors (Varde et al. 1998), power transmission networks (Baroni et al. 1997), real-time fault detection of greenhouse sensors (Beaulah and Chalabi 1997), monitoring, diagnosis and optimisation of a coal washing plant (Villanueva and Lamba 1997), continuous and semi-continuous chemical processes (Nam et al. 1996) and diagnosis and maintenance of robotic systems (Patel et al. 1995). Miller et al. (1990) developed a vehicle trouble-shooting expert system which has integrated imaging capability. The system is used to diagnose maintenance problems in the electrical/hydraulic subsystems. Hybrid KBS applications in fault diagnosis include the KBS/NNs application in batch chemical plants (Ruiz et al. 2001).
Frank and Ding (1997) outline advances in the theory of observer-based fault diagnosis in dynamic systems, covering the use of AI including KBSs and NNs.

9.4.5 Fuzzy Logic (FL)

FL has been used in various applications in the maintenance area to deal with uncertainty. Oke and Charles-Owaba (2006) apply an FL control model to Gantt charting preventive maintenance scheduling. Al-Najjar and Alsyouf (2003) use fuzzy multiple criteria decision making to select in advance the most informative (efficient) maintenance approach, i.e. strategies, policies or philosophies. Braglia et al. (2003) adopt FL to help analysts formulate efficiently their assessments of possible causes of failure in failure mode, effects and criticality analysis. Sudiarso and Labib (2002) investigated an FL approach to an integrated maintenance/production scheduling algorithm. Jeffries et al. (2001) develop an efficient hybrid method for capturing machine information in a packaging plant using FL, fuzzy condition monitoring, in order to reduce wastage and maintenance overheads.

Examples of FL hybrid applications include the use of a KBS for bridge damage diagnosis, which aims at providing information about the impact of design factors on bridge deterioration, with FL used to handle uncertainties (Zhao and Chen 2001). Sinha and Fieguth (2006) propose a neuro-fuzzy classifier that combines FL and NNs for the classification of defects by extracting features in segmented buried pipe images. Applications for FL in fault diagnosis include fault diagnosis of railway wheels (Skarlatos et al. 2004), thrusters for an open-frame underwater vehicle (Omerdic and Roberts 2004), chemical processes (Dash et al. 2003) and rolling element bearings in machinery (Mechefske 1998).

9.5 The Hybrid Intelligent Maintenance Optimisation System

In this section we discuss the Hybrid Intelligent Maintenance Optimisation System (HIMOS).

9.5.1 Why Intelligent Maintenance DSSs are Needed

Optimisation of the maintenance policies of complex technical systems, such as telecommunication systems and complex manufacturing plants, can prove to be difficult. With the developments in information technology over the past two decades, many organisations with complex technical systems have developed maintenance databases. Though the stored history data is potentially very useful to the maintenance engineer aiming to improve maintenance policies, in many cases the data are mainly used to produce simple statistics for management reporting. This is not due to a lack of interest on the part of maintenance practitioners, but to the challenging nature of these systems. The following difficulties are likely to be encountered while attempting to optimise the maintenance routines of complex systems:

1. The system contains a large number of sub-systems and components. This gives rise to a wide variety of maintenance situations that can be handled using different models and methods.
2. A maintenance engineer optimising the maintenance routines using available software packages requires familiarity with maintenance modelling in addition to engineering expertise.
3. Even if engineers with such experience were available, the time required to examine a large number of components using this type of software can be prohibitive.
4. The changeable nature of large technical systems, e.g. replacement of components with different types or modification of design, will present constant challenges.

All these difficulties accentuate the need to develop special computerised systems that can cope with the management of complex engineering systems. Intelligent DSSs are a candidate.


9.5.2 The Required Functional Features of an Intelligent Maintenance DSS

The main functional features which would be expected of such a system, to cope with the above situation, are (Kobbacy 2004):

1. To access the history data from a maintenance database.
2. To check the quality of data.
3. To recognise characteristic data patterns.
4. To query the user for additional information, judgement and criteria.
5. To select the most appropriate PM scheduling model for the decision analysis.
6. To optimise the selected model, evaluate the current policy and propose an optimal maintenance policy.
7. To present the results of the analysis in a flexible format.
8. To respond to user enquiries, perform ‘What if?’ decision modelling and provide explanations of the recommended decisions.
9. To have learning capabilities.
10. To have a user-friendly Windows interface.

In the following section we will present one specific application of intelligent systems in maintenance, namely the development of an intelligent system to schedule PM for complex technical systems. This section is based on the work of the author with others (see references below).

9.5.3 The Hybrid Intelligent Maintenance Optimisation System (HIMOS)

HIMOS aims at deciding the optimal PM cycle interval for a repairable system by selecting and applying the most appropriate optimisation model automatically and without the need for expert intervention (Kobbacy and Jeon 2001). HIMOS was developed from its predecessor IMOS (Kobbacy et al. 1995b), the intelligent maintenance optimisation system, which used rule-based reasoning to select an appropriate model for analysis. HIMOS employs hybrid reasoning by combining rule-based reasoning (RBR) and case-based reasoning (CBR) to choose a model from a model base for a given data set.

Analysis of a typical large data file by IMOS showed that about two thirds of components could not be modelled, mostly because of insufficient history data needed for model selection (Kobbacy 2004). However, some of the cases which could not be modelled may have parameters with values close to those of a model’s acceptance level as stated in the rule base. By introducing case-based reasoning, the system can model cases which are not identified by the rule base when it has analysed similar cases in the past. Thus, such a hybrid (KBS and CBR) system is expected to increase the previously low percentage of cases for which the system is able to identify a suitable model.


Figure 9.1. Outline Design of HIMOS (Kobbacy and Jeon 2001)

Figure 9.1 illustrates the conceptual structure of HIMOS, which is divided into two areas. The DSS contribution area contains a database to store maintenance historical data, a model base for data analysis models and optimisation models, and a user interface to communicate with the user. In the AI contribution area, there are two bases which contain experts’ knowledge: the knowledge base and the case base.

9.5.3.1 HIMOS Procedure

Figure 9.2 illustrates the model selection for a data set consisting of a sequence of preventive maintenance (PM) and corrective action (CO) events to enable calculation of the optimal PM interval. HIMOS has the ability to use a set of production rules to select and then optimise a suitable model in order to provide an evaluation of the current maintenance routine and to propose an optimal policy. These rules are acquired from experts’ knowledge and may require subjective judgements to be made. The processor of HIMOS identifies data patterns through a data analysis procedure and then selects the most appropriate model for a given data set by consulting the rule base. If a data set cannot be matched by any of the KBS rules, then the system attempts to use CBR to identify a suitable model.


Figure 9.2. Model Selection In HIMOS (Kobbacy and Jeon 2001)

Data Formatting and Analysis

After reading data from the input data file, the system formats and checks the data to create a suitable data set for the next step of analysis. Suspect or missing items of data are flagged in order to be sorted out by the system or investigated by the user. The analysis consists of five steps: recognition of PM and CO patterns; calculation of current availability; Weibull distribution fitting to failure times; trend tests of frequency and severity to establish whether or not the data are stationary in these respects; and, if applicable, analysis of multi-PM cases. In the first step a basic analysis is carried out to identify the features of the data set, such as the numbers of PM and CO events and the mean lives to failure, so that the data set can be compared with characteristic data patterns in the model selection process. The data produced in this process are referred to as ‘metadata’.

Model Base

The model base contains two sets of models: the data analysis models and the PM scheduling optimisation models. The data analysis models identify a data pattern which, together with the RBR/CBR, helps to select an optimisation model. The optimisation models are a set of mathematical models of maintenance policies which evaluate current policies and, in certain circumstances, recommend an optimal policy. These models deal with components rather than systems and they assume independence of components, i.e. the failure of one component does not affect the performance of another. The models in the model base are classified into single-PM and multi-PM models and, for the former case, into stationary and nonstationary models. Stationary models deal with the data sets in which no trend is found. If there are frequency or severity trends, then a nonstationary model can be used. Multi-PM models are structured to deal with components subject to more than one PM routine. IMOS model base includes 21 different models. The description of the models used in HIMOS can be found elsewhere (Kobbacy and Jeon 2001).

Model Selection Using the Rule-Base

In HIMOS the rule base (or knowledge base) consists of a list of rules capturing some of the knowledge of experts in maintenance modelling concerning mathematical modelling techniques and their applicability to various situations. The rules match data sets to the models by searching for patterns in the data set for each component, such as relative numbers of CO and PM events, component life distribution, range of PM intervals, etc. The approach used to develop the rule base is described in Kobbacy (2004). The knowledge base implemented in HIMOS consists of a set of 15 rules, examples of which are shown below. If the rule base fails to identify a suitable model, the CBR is invoked.

RULE 1:
    If   Not matched and
         There are multi-PMs
    Then Apply Multi-PM Model
         Matched

RULE 2:
    If   Not matched and
         Trend test statistics of frequency is significantly large and
         Trend test statistics of severity of CO is significantly large and
         Trend test statistics of severity of PM is significantly large
    Then Apply NHPPScoSpm Model
         Matched

RULE 3:
    If   Not matched and
         Trend test statistics of frequency is significantly large and
         Trend test statistics of severity of CO is significantly large
    Then Apply NHPPSco Model
         Matched
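The three example rules can be read as an ordered list in which the first rule that fires sets the ‘Matched’ flag and stops the search. The following is a minimal sketch of that reading, not the HIMOS implementation: the `Metadata` field names, the `CRITICAL` threshold and the one-sided comparison are all assumptions, since the text only says that a statistic is ‘significantly large’.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Metadata:
    """Summary of one component's history (field names are hypothetical)."""
    n_pm_routines: int        # number of distinct PM routines
    trend_frequency: float    # trend test statistic, event frequency
    trend_severity_co: float  # trend test statistic, severity of CO
    trend_severity_pm: float  # trend test statistic, severity of PM


# Assumed cut-off for "significantly large"; the source gives no value.
CRITICAL = 1.96


@dataclass
class Rule:
    name: str
    condition: Callable[[Metadata], bool]
    model: str


# Order matters: RULE 2 is stricter than RULE 3, so it is tried first.
RULES: List[Rule] = [
    Rule("RULE 1", lambda m: m.n_pm_routines > 1, "Multi-PM"),
    Rule("RULE 2",
         lambda m: (m.trend_frequency > CRITICAL
                    and m.trend_severity_co > CRITICAL
                    and m.trend_severity_pm > CRITICAL),
         "NHPPScoSpm"),
    Rule("RULE 3",
         lambda m: (m.trend_frequency > CRITICAL
                    and m.trend_severity_co > CRITICAL),
         "NHPPSco"),
]


def select_model(meta: Metadata, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the model of the first rule that fires.

    Returning at the first hit plays the role of the 'Not matched' guard.
    None means no rule matched, i.e. the case would be passed on to CBR.
    """
    for rule in rules:
        if rule.condition(meta):
            return rule.model
    return None
```

A call such as `select_model(Metadata(1, 3.0, 2.5, 0.1))` falls through RULE 1 and RULE 2 and is matched by RULE 3. Keeping the rules as data rather than nested conditionals makes it easy to add the remaining rules in the same form.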

Model Selection Using CBR

CBR is an approach to problem solving that utilises past experiences to solve new problems. The first step in the operation of a CBR system is retrieval, in which the inputs are analysed to determine the critical features to use in retrieving past cases from the case database. Among the well-known methods for case retrieval is the nearest neighbour method, which is used in HIMOS. To find the nearest neighbour matching the case being considered, the case with the largest weighted average of similarity functions for selected features is selected. In HIMOS four features were selected and all given equal weights. These features are: number of PM, number of CO, trend value and variability of PM cycle length. The reason for selecting these features is that they were found to be the main causes of failure to select a suitable model using the rule-based system. The similarity function was selected as the difference between the values of a feature in the current and retrieved cases divided by the standard deviation of the feature.

Once the best matching case has been retrieved, adaptation is carried out to reduce any prominent difference between the retrieved case and the current case through the derivational replay method. Thus in the CBR phase, the system uses rules similar to those used in the KBS phase to find a solution. However, some critical values in the adaptation rules are more relaxed compared with the original rules.

In the evaluation step the system displays multiple candidate models (possible solutions) with their critical features for the current case (adaptation results). The user can then evaluate these alternatives and select one using their expertise. For the non-expert user, the system itself provides the ‘Recommended Model’ as a result of evaluation. Here the system compares the results of adaptation with the results of retrieval. If there is no matching model then no recommendation is made, otherwise the system recommends the matching model. If there is more than one matching model, the system merely recommends the first ranked (nearest neighbour) model.

9.5.3.2 Results and Validation of HIMOS

HIMOS results for a component include some basic statistics, e.g. number of PM, CO, current availability, etc. The most important result from the decision-maker’s point of view is the recommended PM interval. The optimal availability gives an estimate of the availability which might be achieved if the recommended PM policy is implemented.

Table 9.1 shows the percentage success rate of HIMOS in modelling a large number of components. As can be seen, around two thirds of the components could not be modelled by RBR alone because no rule matched the data to a specific model. The introduction of case-based reasoning can add to the success rate of modelling components: the table shows that the introduction of CBR reduces the percentage of cases where no suitable model was identified from 68.6 % to 52.7 %. Given the self-learning nature of CBR, where the case base expands with use, it is possible to improve the success rate with the extended use of the system in certain environments.
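The nearest neighbour retrieval described under ‘Model Selection Using CBR’ can be sketched as follows. This is an illustration under stated assumptions rather than the HIMOS code: the feature keys and the example cases are hypothetical, and the per-feature similarity is taken here as the negated absolute difference divided by the feature’s standard deviation, so that the largest weighted average corresponds to the closest case.

```python
from statistics import pstdev
from typing import Dict, List

# The four features named in the text; the key names are hypothetical.
FEATURES = ("n_pm", "n_co", "trend", "pm_cycle_variability")
# Equal weights, as stated in the text.
WEIGHTS = {f: 1.0 / len(FEATURES) for f in FEATURES}


def feature_spread(case_base: List[Dict]) -> Dict[str, float]:
    """Standard deviation of each feature over the stored cases."""
    spread = {}
    for f in FEATURES:
        sd = pstdev([case[f] for case in case_base])
        spread[f] = sd if sd > 0 else 1.0  # guard against a constant feature
    return spread


def similarity(current: Dict, stored: Dict, spread: Dict[str, float]) -> float:
    """Weighted average of per-feature similarities.

    Assumption: similarity = -|difference| / standard deviation, so that
    0 means identical and more negative means less similar.
    """
    return sum(
        WEIGHTS[f] * -abs(current[f] - stored[f]) / spread[f]
        for f in FEATURES
    )


def retrieve(current: Dict, case_base: List[Dict]) -> Dict:
    """Return the stored case with the largest weighted similarity."""
    spread = feature_spread(case_base)
    return max(case_base, key=lambda c: similarity(current, c, spread))


# Hypothetical case base: each past case records the model chosen for it.
EXAMPLE_CASES = [
    {"n_pm": 10, "n_co": 2, "trend": 0.1, "pm_cycle_variability": 0.2,
     "model": "RP"},
    {"n_pm": 4, "n_co": 9, "trend": 2.4, "pm_cycle_variability": 0.8,
     "model": "NHPP"},
    {"n_pm": 7, "n_co": 5, "trend": 1.0, "pm_cycle_variability": 0.5,
     "model": "Weibull"},
]
```

For a new case such as `{"n_pm": 9, "n_co": 3, "trend": 0.3, "pm_cycle_variability": 0.25}`, `retrieve` returns the first stored case, which is closest on every feature. Dividing by the standard deviation puts counts and trend statistics on a comparable scale before they are averaged.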


Table 9.1. Percentage use of maintenance models for HIMOS when applied to large systems, 1633 components in three data files (Kobbacy 2004)

Model                  HIMOS*
                       RBR      RBR+CBR
Stochastic
  RP                   6.6      12.8
  NHPP                 1.6      1.6
  NRP                  2.3      2.3
  Total stochastic     10.5     16.4
Geometric I            15.7     23.5
Geometric II           1.7      1.8
Weibull                1.7      3.7
Deterministic          1.8      1.9
No model suitable      68.6     52.7

HIMOS was validated using test cases by comparing the results of analysis of selected cases by HIMOS with the recommendations of an expert panel. For the validation of HIMOS, eight data sets were used and a panel of five experts was involved. In general there was agreement between HIMOS and the experts. The experts had a measure of disagreement in their advice as a result of making different assumptions in their analysis. Experts also made useful suggestions for the operation of the system. Table 9.2 is a typical example of HIMOS and the experts’ recommendations.

Table 9.2. Example of validation of HIMOS

Data Set 3
HIMOS       Increase PM interval from 177 to 403 days (CBR-RPOW model)
Expert A    There is no evidence of trend. Increase PM interval but should not be allowed to approach 600 days
Expert B    Unless failure has substantive safety or risk association, PM could be extended.
Expert C    Optimal PM interval is found to be 404 days
Expert D    Increase PM interval
Expert E    Increase PM interval to 250 days


9.6 Future Developments

In approaching the problem of maintenance management of complex engineering systems one can identify two broad levels for tackling the maintenance issues (Kobbacy and Labib 2003). At the higher decision-making level, one is usually concerned with effectiveness issues such as prioritising machines, modes of failure and types of maintenance actions that will lead to improving system operations. At the lower decision level, one is concerned with maintenance efficiency issues, e.g. the PM interval. Researchers tend to address either the higher-level issues of effectiveness or the lower-level issues of efficiency.

Labib (1998) proposed two techniques to identify effective maintenance policies at the higher level, namely the rule-based decision making grid (DMG) and the analytic hierarchy process (AHP). The AHP is a technique for prioritisation that relies on modelling a problem as a hierarchical structure with a goal at the apex and levels of criteria and alternatives at the bottom. The DMG acts as a map where the machines with the worst performances are placed, based on selected multiple criteria. These criteria, such as downtime and frequency of breakdowns, are determined through prioritisation based on the AHP approach. The objective is to take maintenance actions to improve the machines’ performance as measured by the selected multiple criteria. This approach is discussed in Chapter 17.

In order to tackle the issue of efficient maintenance management for complex engineering systems, Kobbacy (1992) proposed integrating artificial intelligence techniques such as rule-based reasoning with mathematical modelling. Such an approach allows automated modelling of large amounts of maintenance data to carry out analysis and propose an optimal maintenance schedule, i.e. the frequency of PM. This approach has been explained in Section 9.5.
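To illustrate the kind of prioritisation the AHP performs on criteria such as downtime and frequency of breakdowns, the sketch below derives priority weights from a pairwise comparison matrix. The pairwise comparison step is standard AHP practice but is not described in this chapter, the row geometric mean is used here as a common approximation to the principal eigenvector, and the judgement value of 3 is purely hypothetical.

```python
from math import prod
from typing import List


def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priority weights by the row geometric mean method.

    pairwise[i][j] states how much more important criterion i is than
    criterion j; the matrix is assumed reciprocal, i.e.
    pairwise[j][i] == 1 / pairwise[i][j].
    """
    n = len(pairwise)
    geo_means = [prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical judgement: downtime is judged three times as important
# as frequency of breakdowns.
CRITERIA = ["downtime", "frequency of breakdowns"]
PAIRWISE = [
    [1.0, 3.0],
    [1.0 / 3.0, 1.0],
]

WEIGHTS = ahp_weights(PAIRWISE)
```

With this matrix the weights come out as 0.75 for downtime and 0.25 for frequency of breakdowns. For a perfectly consistent matrix the geometric mean method reproduces the eigenvector weights exactly; for inconsistent judgements it is only an approximation.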
Kobbacy and Labib (see Section 9.8) propose merging their approaches of DMG and HIMOS in order to develop an integrated approach to ‘effective’ and ‘efficient’ maintenance management. Figure 9.3 outlines the design of such a futuristic system. The proposed concept emphasises the sharing of data and tools between the two models while maintaining their distinct features and allowing flow of information between them.


Figure 9.3. Outline design of AMMCM (Adaptive Maintenance Measurement and Control Model)

9.7 Concluding Remarks

There have been many developments in the use of AI in the maintenance area, and hundreds of papers have been published. Kobbacy et al. (2007) have shown that the number of publications using NNs and GAs in maintenance has followed an increasing trend in the past few years, which can be explained by their use in optimising complex and nonlinear problems (see Section 9.4). There is an apparent increase in using hybrid approaches and utilising their combined strengths. There is enormous potential for developments in many applications of AI in maintenance by combining two or more AI techniques. Multiple hybrid intelligent management systems (MHIMS) are potentially powerful tools that can help in making the right decisions right, i.e. making effective and efficient decisions. The author has a vision that such MHIMS may be assembled in the future from off-the-shelf modules, resulting in a reduction in the time and cost of development.


9.8 Acknowledgments

The author wishes to acknowledge the contributions of those who collaborated at the various stages of the development of IMOS and HIMOS. In particular I wish to acknowledge the significant contribution of A.W. Labib in developing the proposal for the AMMCM presented in Section 9.6.

9.9 References

Acosta, G.G., Verucchi, C.J. and Gelso, E.R. (2006), A current monitoring system for diagnosing electrical failures in induction motors, Mechanical Systems and Signal Processing, 20, 953–965.
Ahmed, K., Langdon, A. and Frieze, P.A. (1991), An expert system for offshore structure inspection and maintenance, Computers and Structures, 40, 143–159.
Al-Garni, A.Z., Jamal, A., Ahmad, A.M., Al-Garni, A.M. and Tozan, M. (2006), Neural network-based failure rate prediction for De Havilland Dash-8 tires, Engineering Applications of Artificial Intelligence, 19, 681–691.
Al-Najjar, B. and Alsyouf, I. (2003), Selecting the most efficient maintenance approach using fuzzy multiple criteria decision making, International Journal of Production Economics, 84, 85–100.
Ascher, H.E. and Kobbacy, K.A.H. (1995), Modelling preventive maintenance for deteriorating repairable systems, IMA Journal of Mathematics Applied in Business & Industry, 6, 85–99.
Bansal, D., Evans, D.J. and Jones, B. (2004), A real-time predictive maintenance system for machine systems, International Journal of Machine Tools and Manufacture, 44, 759–766.
Baroni, P., Canzi, U. and Guida, G. (1997), Fault diagnosis through history reconstruction: an application to power transmission networks, Expert Systems with Applications, 12, 37–52.
Batanov, D., Nagarue, N. and Nitikhunkasem, P. (1993), EXPERT-MM: A knowledge-based system for maintenance management, Artificial Intelligence in Engineering, 8, 283–291.
Beaulah, S.A. and Chalabi, Z.C. (1997), Intelligent real-time fault diagnosis of greenhouse sensors, Control Engineering Practice, 5, 1573–1580.
Booth, C. and McDonald, J.R. (1998), The use of artificial neural networks for condition monitoring of electrical power transformers, Neurocomputing, 23, 97–109.
Braglia, M., Frosolini, M. and Montanari, R. (2003), Fuzzy criticality assessment model for failure modes and effects analysis, International Journal of Quality & Reliability Management, 20, 503–524.
Cavory, G., Dupas, R. and Goncalves, G. (2001), A genetic approach to the scheduling of preventive maintenance tasks on a single product manufacturing production line, International Journal of Production Economics, 74, 135–146.
Chan, F.T.S., Chung, S.H., Chan, L.Y., Finke, G. and Tiwari, M.K. (2006), Solving distributed FMS scheduling problems subject to maintenance: Genetic algorithms approaches, Robotics and Computer-Integrated Manufacturing, 22, 493–504.
Chen, Q., Chan, Y.W. and Worden, K. (2003), Structural fault diagnosis and isolation using neural networks based on response-only data, Computers & Structures, 81, 2165–2172.


Chootinan, P., Chen, A., Horrocks, M.R. and Bolling, D. (2006), A multi-year pavement maintenance program using a stochastic simulation-based genetic algorithm approach, Transportation Research Part A: Policy and Practice, 40, 725–743.
Cunningham, P., Smyth, B. and Bonzano, A. (1998), An incremental retrieval mechanism for case-based electronic fault diagnosis, Knowledge-Based Systems, 11, 239–248.
Dash, S., Rengaswamy, R. and Venkatasubramanian, V. (2003), Fuzzy-logic based trend classification for fault diagnosis of chemical processes, Computers & Chemical Engineering, 27, 347–362.
de Brito, J., Branco, F.A., Thoft-Christensen, P. and Sorensen, J.D. (1997), An expert system for concrete bridge management, Engineering Structures, 19, 519–526.
Dendronic Decisions Ltd (2003), www.dendronic.com/articles.htm.
Dhaliwal, D.S. (1986), The use of AI in maintaining and operating complex engineering systems, in Expert Systems and Optimisation in Process Control, A. Mamdani and J.E. Pstachion, eds, 28–33, Gower Technical Press, Aldershot.
Dragan, A.S., Walters, G.A. and Knezevic, J. (1995), Optimal opportunistic maintenance policy using genetic algorithms, 1 formulation, Journal of Quality in Maintenance Engineering, 1, 34–49.
Drury, C.G. and Prabhu, P. (1996), Information requirements of aircraft inspection: framework and analysis, International Journal of Human-Computer Studies, 45, 679–695.
Eldin, N.N. and Senouci, A.B. (1995), Use of neural networks for condition rating of joint concrete pavements, Advances in Engineering Software, 23, 133–141.
Feldman, R.M., William, M.L., Slade, T., McKee, L.G. and Talbert, A. (1992), The development of an integrated mathematical and knowledge-based maintenance delivery system, Computers & Operations Research, 19, 425–434.
Firebaugh, M.W. (1988), Artificial Intelligence: A Knowledge-based Approach, Boyd & Fraser Publishing Co., Danvers, MA, USA.
Frank, P.M. and Ding, X. (1997), Survey of robust residual generation and evaluation methods in observer-based fault detection systems, Journal of Process Control, 7, 403–427.
Frank, P.M. and Koppen-Seliger, B. (1997), New developments using AI in fault diagnosis, Engineering Applications of Artificial Intelligence, 10, 3–14.
Garcia, E., Guyennet, H., Lapayre, J.C. and Zerhouni, N. (2004), A new industrial cooperative tele-maintenance platform, Computers & Industrial Engineering, 46, 851–864.
Gilabert, E. and Arnaiz, A. (2006), Intelligent automation systems for predictive maintenance: A case study, Robotics and Computer-Integrated Manufacturing, 22, 543–549.
Gits, C.W. (1984), On the maintenance concept for a technical system, PhD Thesis, Eindhoven Technische Hogeschool, Eindhoven.
Gromann de Araujo Goes, A., Alvarenga, M.A.B. and Frutuoso e Melo, P.F. (2005), NAROAS: a neural network-based advanced operator support system for the assessment of systems reliability, Reliability Engineering & System Safety, 87, 149–161.
Hissel, D., Pera, M.C. and Kauffmann, J.M. (2004), Diagnosis of automotive fuel cell power generators, Journal of Power Sources, 128, 239–246.
Huang, S.J. (1998), Hydroelectric generation scheduling – an application of genetic-embedded fuzzy system approach, Electric Power Systems Research, 48, 65–72.
Hui, S.C., Fong, A.C.M. and Jha, G. (2001), A web-based intelligent fault diagnosis system for customer service support, Engineering Applications of Artificial Intelligence, 14, 537–548.
Jeffries, M., Lai, E., Plantenberg, D.H. and Hull, J.B. (2001), A fuzzy approach to the condition monitoring of a packaging plant, Journal of Materials Processing Technology, 109, 83–89.


Jha, M.K. and Abdullah, J. (2006), A Markovian approach for optimising highway life-cycle with genetic algorithms by considering maintenance of roadside appurtenances, Journal of the Franklin Institute, 343, 404–419.
Jota, P.R.S., Islam, S.M., Wu, T. and Ledwich, G. (1998), A class of hybrid intelligent system for fault diagnosis in electric power systems, Neurocomputing, 23, 207–224.
Khoo, L.P., Ang, C.L. and Zhang, J. (2000), A fuzzy-based genetic approach to the diagnosis of manufacturing systems, Engineering Applications of Artificial Intelligence, 13, 303–310.
Kobbacy, K.A.H. (1992), The use of knowledge-based systems in evaluation and enhancement of maintenance routines, International Journal of Production Economics, 24, 243–248.
Kobbacy, K.A.H. (2004), On the evolution of an intelligent maintenance optimisation system, Journal of the Operational Research Society, 55, 139–146.
Kobbacy, K.A.H. and Jeon, J. (2001), The development of a hybrid intelligent maintenance optimisation system (HIMOS), Journal of the Operational Research Society, 52, 762–778.
Kobbacy, K.A.H., Percy, D.F. and Fawzi, B.B. (1995a), Sensitivity analysis for preventive maintenance models, IMA Journal of Mathematics Applied in Business & Industry, 6, 53–66.
Kobbacy, K.A.H., Proudlove, N.L. and Harper, M.A. (1995b), Towards an intelligent maintenance optimisation system, Journal of the Operational Research Society, 46, 229–240.
Kobbacy, K.A.H., Fawzi, B.B., Percy, D.F. and Ascher, H.E. (1997), A full history proportional hazards model for preventive maintenance modelling, Journal of Quality and Reliability Engineering International, 13, 187–198.
Kobbacy, K.A.H., Percy, D.F. and Sharp, J.M. (2005), Results of preventive maintenance survey, unpublished report, University of Salford.
Kobbacy, K.A.H., Vadera, S. and Rasmy, M.H. (2007), AI and OR in management of operations: history and trends, Journal of the Operational Research Society, 58, 10–28.
Kohno, T., Hamada, S., Araki, D., Kojima, S. and Tanaka, T. (1997), Error repair and knowledge acquisition via case-based reasoning, Artificial Intelligence, 91, 85–101.
Kuo, H-C. and Chang, H-K. (2004), A new symbiotic evolution-based fuzzy-neural approach to fault diagnosis of marine propulsion systems, Engineering Applications of Artificial Intelligence, 17, 919–930.
Labib, A.W. (1998), World class maintenance using a computerised maintenance management system, Journal of Quality in Maintenance Engineering, 4, 66–75.
Lee, C-K. and Kim, S-K. (2007), GA-based algorithm for selecting optimal repair and rehabilitation methods for reinforced concrete (RC) bridge decks, Automation in Construction, 16, 153–164.
Leung, D. and Romagnoli, J. (2002), An integration mechanism for multivariate knowledge-based fault diagnosis, Journal of Process Control, 12, 15–26.
Lin, C-C. and Wang, H-P. (1996), Performance analysis of rotating machinery using enhanced cerebellar model articulation controller (E-CMAC) neural networks, Computers and Industrial Engineering, 30, 227–242.
Luxhoj, J.T. and Williams, T.P. (1996), Integrated decision support for aviation safety inspectors, Finite Elements in Analysis and Design, 23, 381–403.
Marseguerra, M., Zio, E. and Podofillini, L. (2004), A multiobjective genetic algorithm approach to optimisation of the technical specifications of a nuclear safety system, Reliability Engineering & System Safety, 84, 87–99.
Martland, C.D., McNeil, S., Axharya, D., Mishalani, R. and Eshelby, J. (1990), Applications of expert systems in railroad maintenance: Scheduling rail relays, Transportation Research Part A: General, 24, 39–52.


Mechefske, C.K. (1998), Objective machinery fault diagnosis using fuzzy logic, Mechanical Systems and Signal Processing, 12, 855–862.
Microsoft ENCARTA College Dictionary (2001), St Martin’s Press, N.Y.
Miller, D., Mellichamp, J.M. and Wang, J. (1990), An image enhanced knowledge based expert system for maintenance trouble shooting, Computers in Industry, 15, 187–202.
Milne, R., Nicole, C. and Trave-Massuyes, L. (2001), TIGER with model based diagnosis: initial deployment, Knowledge-Based Systems, 14, 213–222.
Morcous, G. and Lounis, Z. (2005), Maintenance optimisation of infrastructure networks using genetic algorithms, Automation in Construction, 14, 129–142.
Nam, D.S., Jeong, C.W., Choe, Y.J. and Yoon, E.S. (1996), Operation-aided system for fault diagnosis of continuous and semi-continuous processes, Computers & Chemical Engineering, 20, 793–803.
Oke, S.A. and Charles-Owaba, O.E. (2006), Application of fuzzy logic control model to Gantt charting preventive maintenance scheduling, International Journal of Quality & Reliability Management, 23, 441–459.
Omerdic, E. and Roberts, G. (2004), Thruster fault diagnosis and accommodation for open-frame underwater vehicles, Control Engineering Practice, 12, 1575–1598.
Patel, S.A., Kamrani, A.K. and Orady, E. (1995), A knowledge-based system for fault diagnosis and maintenance of advanced automated systems, Computers & Industrial Engineering, 29, 147–151.
Percy, D.F., Kobbacy, K.A.H. and Ascher, H.E. (1998), Using proportional intensities models to schedule preventive maintenance intervals, IMA Journal of Mathematics Applied in Business & Industry, 9, 289–302.
Rao, M., Yang, H. and Yang, H. (1998), Integrated distributed intelligent system architecture for incidents monitoring and diagnosis, Computers in Industry, 37, 143–151.
Ruiz, D., Canton, J., Nougues, J.M., Espuna, A. and Puigjaner, L. (2001), On-line fault diagnosis system support for reactive scheduling in multipurpose batch chemical plants, Computers & Chemical Engineering, 25, 829–837.
Ruiz, R., Garcia-Diaz, C. and Maroto, C. (2006), Considering scheduling and preventive maintenance in the flowshop sequencing problem, Computers & Operations Research, 34, 3314–3330.
Saranga, H. (2004), Opportunistic maintenance using genetic algorithms, Journal of Quality in Maintenance Engineering, 10, 66–74.
Scenna, N.J. (2000), Some aspects of fault diagnosis in batch processes, Reliability Engineering & System Safety, 70, 95–110.
Sergaki, A. and Kalaitzakis, K. (2002), Reliability Engineering & System Safety, 77, 19–30.
Sharma, R., Singh, K., Singhal, D. and Ghosh, R. (2004), Neural network applications for detecting process faults in packed towers, Chemical Engineering and Processing, 43, 841–847.
Shayler, P.J., Goodman, M. and Ma, T. (2000), The exploitation of neural networks in automotive engine management systems, Engineering Applications of Artificial Intelligence, 13, 147–157.
Shyur, H.J., Luxhoj, J.T. and Williams, T.P. (1996), Using neural networks to predict component inspection requirements for aging aircraft, Computers & Industrial Engineering, 30, 257–267.
Simani, S. and Fantuzzi, C. (2000), Fault diagnosis in power plant using neural networks, Information Sciences, 127, 125–136.
Sinha, S.K. and Fieguth, P.W. (2006), Neuro-fuzzy network for the classification of buried pipe defects, Automation in Construction, 15, 73–83.
Skarlatos, D., Karakasis, K. and Trochidis, A. (2004), Railway wheel fault diagnosis using a fuzzy-logic method, Applied Acoustics, 65, 951–966.


Sortrakul, N., Nachtmann, H.L. and Cassady, C.R. (2005), Genetic algorithms for integrated preventive maintenance planning and production scheduling for a single machine, Computers in Industry, 56, 161–168.
Spoerre, J.K. (1997), Application of the cascade correlation algorithm (CCA) to bearing fault classification problems, Computers in Industry, 32, 295–304.
Sprague, R.H. and Watson, H.J. (1986), Decision support systems – putting theory into practice, Prentice Hall, Englewood Cliffs, New Jersey.
Srinivasan, D., Liew, A.C., Chen, J.S.P. and Chang, C.S. (1993), Intelligent maintenance scheduling of distributed system components with operating constraints, Electric Power Systems Research, 26, 203–209.
Sudiaros, A. and Labib, A.W. (2002), A fuzzy logic approach to an integrated maintenance/production scheduling algorithm, International Journal of Production Research, 40, 3121–3138.
Tan, J.S. and Kramer, M.A. (1997), A general framework for preventive maintenance optimization in chemical process operations, Computers & Chemical Engineering, 21, 1451–1469.
Tang, B-S., Jeong, S.K., Oh, Y-M. and Tan, A.C.C. (2004), Case-based reasoning system with Petri nets for induction motor fault diagnosis, Expert Systems with Applications, 27, 301–311.
Tarifa, E.E., Humana, D., Franco, S., Martinez, S.L., Nunez, A.F. and Scenna, N.J. (2003), Fault diagnosis for MSF using neural networks, Desalination, 152, 215–222.
Tsai, Y-T., Wang, K-S. and Teng, H-Y. (2001), Optimizing preventive maintenance for mechanical components using genetic algorithms, Reliability Engineering & System Safety, 74, 89–97.
Varde, P.V., Sankar, S. and Verma, A.K. (1998), An operator support system for research reactor operations and fault diagnosis through a connectionist framework and PSA based knowledge based system, Reliability Engineering and System Safety, 60, 53–69.
Varma, A. and Roddy, N. (1999), ICARUS: design and deployment of a case-based reasoning system for locomotive diagnostics, Engineering Applications of Artificial Intelligence, 12, 681–690.
Villanueva, H. and Lamba, H. (1997), Operator guidance system for industrial plant supervision, Expert Systems with Applications, 12, 441–454.
Wen, F. and Chang, C.S. (1998), A new approach to fault diagnosis in electrical distribution networks using a genetic algorithm, Artificial Intelligence in Engineering, 12, 69–80.
Wu, H., Liu, Y., Ding, Y. and Qiu, Y. (2004), Fault diagnosis expert system for modern commercial aircraft, Aircraft Engineering and Aerospace Technology, 76, 398–403.
Xia, Q. and Rao, M. (1999), Dynamic case-based reasoning for process operation support systems, Engineering Applications of Artificial Intelligence, 12, 343–361.
Yang, B-S. and Kim, K.J. (2006), Applications of Dempster-Shafer theory in fault diagnosis of induction motors, Mechanical Systems and Signal Processing, 20, 403–420.
Yang, B-S., Han, T. and Kim, Y-S. (2004), Integration of ART-Kohonen neural network and case-based reasoning for intelligent fault diagnosis, Expert Systems with Applications, 26, 387–395.
Yang, B-S., Lim, D-S. and Tan, A.C.C. (2005), VIBEX: an expert system for vibration fault diagnosis of rotating machinery using decision tree and decision table, Expert Systems with Applications, 28, 735–742.
Yangping, Z., Bingquan, Z. and DongXin, W. (2000), Application of genetic algorithms to fault diagnosis in nuclear power plants, Reliability Engineering & System Safety, 67, 153–160.
Yu, R., Iung, B. and Panetto, H. (2003), A Multi-Agents based E-maintenance system with case-based reasoning decision support, Engineering Applications of Artificial Intelligence, 16, 321–333.

Artificial Intelligence in Maintenance

231

Zhang, H.Y., Chan, C.W., Cheung, K.C. and Ye, Y.J. (2001) Fuzzy artmap neural network and its application to fault diagnosis of navigation systems, Automatica, 37, 1065–1070. Zhao, Z. and Chen, C. (2001), concrete bridge deterioration diagnosis using fuzzy inference system, Advances in Engineering Software, 32, 317–325.

Part D

Problem Specific Models

10 Maintenance of Repairable Systems

Bo Henry Lindqvist

10.1 Introduction

A commonly used definition of a repairable system (Ascher and Feingold 1984) states that this is a system which, after failing to perform one or more of its functions satisfactorily, can be restored to fully satisfactory performance by any method other than replacement of the entire system. In order to cover more realistic applications, and to cover much recent literature on the subject, we need to extend this definition to include the possibility of additional maintenance actions which aim at servicing the system for better performance. This is referred to as preventive maintenance (PM), where one may further distinguish between condition based PM and planned PM. The former type of maintenance is due when the system exhibits inferior performance while the latter is performed at predetermined points in time.

Traditionally, the literature on repairable systems is concerned with modeling of the failure times only, using point process theory. A classical reference here is Ascher and Feingold (1984). The most commonly used models for the failure process of a repairable system are renewal processes (RP), including the homogeneous Poisson processes (HPP), and nonhomogeneous Poisson processes (NHPP). While such models are often sufficient for simple reliability studies, the need for more complex models is clear. In this chapter we consider some generalizations and extensions of the basic models, with the aim to arrive at more realistic models which give better fit to data.

First we consider the trend-renewal process (TRP) introduced and studied in Lindqvist et al. (2003). The TRP includes NHPP and RP as special cases, and the main new feature is to allow a trend in processes of non-Poisson (renewal) type. As exemplified by some real data, in the case where several systems of the same kind are considered, there may be unobserved heterogeneity between the systems which, if overlooked, may lead to non-optimal or possibly completely wrong decisions. We will consider this in the framework of the TRP, which in Lindqvist et al. (2003) is extended to the so-called HTRP model, which includes the possibility of heterogeneity. Heterogeneity can be thought of as an effect of an unobserved covariate.

Another extension of the basic models is to allow the systems to be preventively maintained. We review some recent research in this direction, where this situation is modeled as a competing risks problem between failure and PM. This leads to a need for combining the theory of competing risks with repair models and point process theory. Relevant statistical data for such analyses are found in most modern reliability databases. The book by Bedford and Cooke (2001) contains a chapter related to this. A general reference to competing risks is the book by Crowder (2001).

Figure 10.1. Event times ($T_i$) and sojourn times ($X_i$) of a repairable system

The last extension of the basic models to be considered in the present chapter consists of using Markov models to model the behavior of periodically inspected systems in between inspections, with the use of separate Markov models for the maintenance tasks at inspections. Recent review articles concerning repairable systems and maintenance include Peña (2006) and Lindqvist (2006). A review of methods for analysis of recurrent events with a medical bias is given by Cook and Lawless (2002). General books on statistical models and methods in reliability, covering many of the topics considered here, are Meeker and Escobar (1998) and Rausand and Høyland (2004).

10.2 Point Process Approach

10.2.1 Notation and Basic Definitions

Consider a repairable system where time usually runs from $t = 0$ and events occur at ordered times $T_1, T_2, \ldots$. Here time is not necessarily calendar time, but can be for example operation time, number of cycles, number of kilometers run, length of a crack, etc. In the present treatment we shall disregard time durations of repair and maintenance, and assume that the system is always restarted immediately after failure or maintenance action. The inter-event, or inter-failure, times will be denoted $X_1, X_2, \ldots$. Here $X_i = T_i - T_{i-1}$, $i = 1, 2, \ldots$, where for convenience we define $T_0 \equiv 0$. Figure 10.1 illustrates the notation. We also make use of the counting process representation $N(t)$ = number of events in $(0, t]$.

In order to describe probability models for repairable systems we use some notation from the theory of point processes. A key reference is Andersen et al. (1993). $\mathcal{H}_t$ denotes the history of the failure process up to, but not including, time $t$. The conditional intensity of the process at time $t$ is defined as

$$\gamma(t) = \lim_{\Delta t \downarrow 0} \frac{\Pr(\text{event in } [t, t + \Delta t) \mid \mathcal{H}_t)}{\Delta t}. \tag{10.1}$$

From this we obtain an expression for the likelihood function, which is needed for statistical inference. Suppose that a single system as described above is observed from time $0$ to time $\tau$, resulting in observations $T_1, T_2, \ldots, T_{N(\tau)}$. The likelihood function is then given by (Andersen et al. 1993, Section II.7)

$$L = \left\{ \prod_{i=1}^{N(\tau)} \gamma(T_i) \right\} \exp\left\{ -\int_0^{\tau} \gamma(u)\,du \right\}. \tag{10.2}$$

10.2.2 Perfect and Minimal Repair Models

Consider a system with failure rate $z(t)$. Suppose first that after each failure, the system is repaired to a condition as good as new, called a perfect repair. In this case the failure process can be modeled by a renewal process (RP) with inter-event time distribution $F$, denoted RP($F$). Clearly, the conditional intensity defined in Equation 10.1 is given by

$$\gamma(t) = z(t - T_{N(t-)}),$$

where $t - T_{N(t-)}$ is the time since the last failure strictly before time $t$.

Suppose instead that after a failure, the system is repaired only to the state it had immediately before the failure, called a minimal repair. This means that the conditional intensity of the failure process immediately after the failure is the same as it was immediately before the failure, and hence is exactly as it would be if no failure had ever occurred. Thus we must have

$$\gamma(t) = z(t),$$

and the process is a nonhomogeneous Poisson process (NHPP) with intensity $z(t)$, denoted NHPP($z(\cdot)$). In practice a minimal repair usually corresponds to repairing or replacing only a minor part of the system. If $z(t) = \lambda$ does not depend on $t$, then NHPP($z(\cdot)$) is a homogeneous Poisson process which we denote by HPP($\lambda$). Note that an HPP is at the same time an RP with exponential inter-failure times.

10.2.3 The Trend-Renewal Process

The idea behind the trend-renewal process is to generalize the following well known property of the NHPP. First let the cumulative intensity function corresponding to an intensity function $\lambda(\cdot)$ be defined by $\Lambda(t) = \int_0^t \lambda(u)\,du$. Then if $T_1, T_2, \ldots$ is an NHPP($\lambda(\cdot)$), the time-transformed stochastic process $\Lambda(T_1), \Lambda(T_2), \ldots$ is HPP(1).

The trend-renewal process (TRP) is defined simply by allowing the above HPP(1) to be any renewal process RP($F$). Thus, in addition to the intensity function $\lambda(t)$, for a TRP we need to specify a distribution function $F$ of the inter-arrival times of this renewal process. Formally we can define the process TRP($F, \lambda(\cdot)$) as follows. Let $\lambda(t)$ be a nonnegative function defined for $t \ge 0$, and let $\Lambda(t) = \int_0^t \lambda(u)\,du$. The process $T_1, T_2, \ldots$ is called TRP($F, \lambda(\cdot)$) if the process $\Lambda(T_1), \Lambda(T_2), \ldots$ is RP($F$), that is if the $\Lambda(T_i) - \Lambda(T_{i-1})$; $i = 1, 2, \ldots$ are i.i.d. with distribution function $F$. The function $\lambda(\cdot)$ is called the trend function, while $F$ is called the renewal distribution. In order to have uniqueness of the model it is usually assumed that $F$ has expected value 1.

Figure 10.2 illustrates the definition. For the cited property of the NHPP, the lower axis would be an HPP with unit intensity, HPP(1). For the TRP, this process is instead taken to be any renewal process, RP($F$), where $F$ has expectation 1. This shows that the TRP includes the NHPP as a special case. Further, if $\lambda(t) \equiv 1$, then $\Lambda(T_i) = T_i$, and so $T_1, T_2, \ldots$ is RP($F$). For an NHPP($\lambda(\cdot)$), the RP($F$) would be HPP(1). Thus TRP($1 - e^{-x}, \lambda(\cdot)$) = NHPP($\lambda(\cdot)$). Also, TRP($F, 1$) = RP($F$), which shows that the TRP class includes both the RP and NHPP classes.

Figure 10.2. The defining property of the trend-renewal process
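The defining property can be turned into a simple simulation recipe: generate a renewal process on the transformed time scale and map each point back through the inverse of $\Lambda$. The following is a minimal sketch under illustrative assumptions that are not part of the definition: $F$ is taken to be Weibull with expected value 1, the trend is the power law $\lambda(t) = cbt^{b-1}$ (so $\Lambda(t) = ct^b$), and the function names are hypothetical.

```python
import math
import random


def weibull_unit_mean_scale(shape):
    """Scale parameter that gives a Weibull distribution expected value 1."""
    return 1.0 / math.gamma(1.0 + 1.0 / shape)


def simulate_trp(c, b, shape, tau, rng):
    """Simulate the event times of a TRP(F, lambda) on [0, tau].

    Illustrative assumptions: F is Weibull with the given shape and mean 1,
    and lambda(t) = c*b*t**(b-1), so that Lambda(t) = c*t**b.
    """
    scale = weibull_unit_mean_scale(shape)
    events = []
    transformed_time = 0.0  # current position on the Lambda scale
    while True:
        # Renewal step on the transformed scale: add an i.i.d. draw from F.
        transformed_time += rng.weibullvariate(scale, shape)
        # Map back to the original time scale with the inverse of Lambda.
        t = (transformed_time / c) ** (1.0 / b)
        if t > tau:
            return events
        events.append(t)


rng = random.Random(1)
event_times = simulate_trp(c=0.01, b=1.5, shape=0.7, tau=100.0, rng=rng)
```

With `shape=1.0` the renewal distribution is the unit exponential and the sketch produces an NHPP; with `c=1.0, b=1.0` the time transformation is the identity and the output is an RP($F$).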

It can be shown (Lindqvist et al. 2003) that the conditional intensity function, given the history $\mathcal{H}_t$, for the TRP($F, \lambda(\cdot)$) is

$$\gamma(t) = z(\Lambda(t) - \Lambda(T_{N(t-)}))\,\lambda(t) \tag{10.3}$$

where $z(\cdot)$ is the hazard rate corresponding to $F$. This is a product of one factor, $\lambda(t)$, which depends on the age $t$ of the system and one factor which depends on a transformed time from the last previous failure.

Suppose now that a single system has been observed in $[0, \tau]$, with failures at $T_1, T_2, \ldots, T_{N(\tau)}$. If a TRP($F, \lambda(\cdot)$) is used as a model, then substitution of Equation 10.3 into Equation 10.2 gives the likelihood

$$L = \left\{ \prod_{i=1}^{N(\tau)} z[\Lambda(T_i) - \Lambda(T_{i-1})]\,\lambda(T_i) \right\} \exp\left\{ -\int_0^{\tau} z[\Lambda(u) - \Lambda(T_{N(u-)})]\,\lambda(u)\,du \right\}. \tag{10.4}$$

For the NHPP($\lambda(\cdot)$) we have $z(t) \equiv 1$, so the likelihood simplifies to the well known expression (Crowder et al. 1991, p 166)

$$L = \left\{ \prod_{i=1}^{N(\tau)} \lambda(T_i) \right\} \exp\left\{ -\int_0^{\tau} \lambda(u)\,du \right\}.$$
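For a concrete feel for the NHPP likelihood, consider the power-law trend $\lambda(t) = cbt^{b-1}$, which is also the parametric form used in the data examples of this chapter. In that case the log-likelihood is $N \log(cb) + (b-1)\sum_i \log T_i - c\tau^b$, and setting its derivatives to zero gives closed-form maximizers. The sketch below assumes a single system observed on $[0, \tau]$ with at least one failure; the function names and the failure times in the usage lines are hypothetical.

```python
import math


def power_law_nhpp_loglik(c, b, times, tau):
    """Log-likelihood of an NHPP with lambda(t) = c*b*t**(b-1) on [0, tau]."""
    n = len(times)
    return (n * math.log(c * b)
            + (b - 1.0) * sum(math.log(t) for t in times)
            - c * tau ** b)


def power_law_nhpp_mle(times, tau):
    """Closed-form maximizers (b_hat, c_hat) of the log-likelihood above."""
    n = len(times)
    b_hat = n / sum(math.log(tau / t) for t in times)
    c_hat = n / tau ** b_hat
    return b_hat, c_hat


# Hypothetical failure times, for illustration only.
failure_times = [10.0, 35.0, 60.0, 80.0, 95.0]
b_hat, c_hat = power_law_nhpp_mle(failure_times, tau=100.0)
```

An estimate `b_hat` above 1 points to an increasing trend (failures becoming more frequent), while a value below 1 points to a decreasing one.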

Returning to the general case, if $f$ is the density function corresponding to $F$, then we can write the likelihood at Equation 10.4 as

$$L = \left\{ \prod_{i=1}^{N(\tau)} f[\Lambda(T_i) - \Lambda(T_{i-1})]\,\lambda(T_i) \right\} \left\{ 1 - F[\Lambda(\tau) - \Lambda(T_{N(\tau)})] \right\}. \tag{10.5}$$
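Equation 10.5 is straightforward to evaluate numerically once $F$ and $\lambda(\cdot)$ are given parametric forms. The sketch below computes its logarithm under the illustrative assumptions of a Weibull renewal distribution with expected value 1 and a power-law trend $\lambda(t) = cbt^{b-1}$; the function name and argument layout are hypothetical.

```python
import math


def trp_loglik(times, tau, c, b, shape):
    """Log of Equation 10.5 for Weibull F (mean 1) and lambda(t) = c*b*t**(b-1)."""
    scale = 1.0 / math.gamma(1.0 + 1.0 / shape)  # forces E[F] = 1

    def big_lambda(t):
        return c * t ** b

    def log_f(x):  # log density of the Weibull renewal distribution
        z = x / scale
        return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape

    def log_survival(x):  # log(1 - F(x))
        return -((x / scale) ** shape)

    loglik = 0.0
    previous = 0.0
    for t in times:
        gap = big_lambda(t) - big_lambda(previous)  # transformed inter-event time
        loglik += log_f(gap) + math.log(c * b * t ** (b - 1.0))
        previous = t
    # Censored last interval: no event in (T_N, tau].
    loglik += log_survival(big_lambda(tau) - big_lambda(previous))
    return loglik
```

Setting `shape=1.0` reduces the value to the NHPP log-likelihood, and setting `c=1.0, b=1.0` gives the log-likelihood of an RP($F$) observed on $[0, \tau]$, in line with the special cases noted in the text.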

This latter form of the likelihood of the TRP follows directly from the definition, since the conditional density of $T_i$ given $T_1 = t_1, \ldots, T_{i-1} = t_{i-1}$ is $f[\Lambda(t_i) - \Lambda(t_{i-1})]\,\lambda(t_i)$, and the probability of no failures in the time interval $(T_{N(\tau)}, \tau]$, given $T_1, \ldots, T_{N(\tau)}$, is $1 - F[\Lambda(\tau) - \Lambda(T_{N(\tau)})]$. This again simplifies if $\lambda(t) \equiv 1$, in which case it gives the likelihood of an RP($F$) observed on $[0, \tau]$.

10.2.4 Observations from Several Similar Systems

Suppose that $m$ systems of the same kind are observed, where the $j$-th system ($j = 1, 2, \ldots, m$) is observed in the time interval $[0, \tau_j]$. For the $j$-th system, let $N_j$ denote the number of failures that occur during the observation period, and let the specific failure times be denoted $T_{1j} < T_{2j} < \cdots < T_{N_j j}$. Figure 10.3 illustrates the notation and explains the information given in a so-called event plot which is provided by computer packages for analysis of this kind of data (see examples below).

Example 1 Nelson (1995) presented data for times of valve-seat replacements in a fleet of $m = 41$ diesel engines. Figure 10.4 shows an event plot for the complete dataset.

Example 2 Bhattacharjee et al. (2003) presented failure data for motor operated closing valves in safety systems at two boiling water reactor plants in Finland. Failures of the type “external leakage” were considered for 104 valves with a follow-up time of nine years. An event plot for the 16 valves which experienced at least one failure is given in Figure 10.5. The remaining 88 valves had no failures.

Figure 10.3. Observation of failure times of $m$ systems. The $j$-th system is observed over the time interval $[0, \tau_j]$, with $N_j \ge 0$ observed failures

Figure 10.4. Event plot for times of valve seat replacements for 41 diesel engines, taken from Nelson (1995)

When data are available for m systems as described above, one will typically assume that the systems behave independently but with the same probability laws (“i.i.d. rules”). The total likelihood for the data will then be the product of the likelihoods at Equations 10.4 or 10.5, one factor for each of the m systems.


Figure 10.5. Event plot for times of external leakage from nuclear plant valves, taken from Bhattacharjee et al. (2003). In addition, 88 valves had no failures in 3286 days (9 years)

However, even if the $m$ systems are considered to be of the same type, they may well exhibit different probabilistic failure mechanisms. For example, systems may be used under varying environmental or operational conditions. To cover such cases we shall assume that failures of the $j$-th system follow the process TRP($F, \lambda_j(\cdot)$), $j = 1, \ldots, m$, where the renewal distribution $F$ is fixed and differences between systems are modeled by letting the trend functions $\lambda_j(t)$ vary from system to system. The assumption of a fixed $F$ parallels the NHPP case, where $F$ is the unit exponential distribution. Assuming that systems work independently of each other, we obtain from Equation 10.5 the full likelihood $L \equiv \prod_{j=1}^{m} L_j$ where

$$L_j = \left\{ \prod_{i=1}^{N_j} f[\Lambda_j(T_{ij}) - \Lambda_j(T_{i-1,j})]\,\lambda_j(T_{ij}) \right\} \left\{ 1 - F[\Lambda_j(\tau_j) - \Lambda_j(T_{N_j j})] \right\}. \tag{10.6}$$

As an example of the use of Equation 10.6, assume that differences between system performances can be attributed to an observable covariate vector $\mathbf{x}$, and that the trend $\lambda_j(t)$ for system $j$ is represented by a proportional trend model with

$$\lambda_j(t) = g(\mathbf{x}_j)\,\lambda(t), \quad j = 1, \ldots, m. \tag{10.7}$$

Here $\lambda(\cdot)$ is a basic trend function common to all systems, while $g$ is a function of the covariate vector $\mathbf{x}_j$ of system $j$. The special cases of this model corresponding to NHPP and RP are studied, respectively, by Lawless (1987) and Follmann and Goldberg (1988).

10.2.5 The Heterogeneous Trend Renewal Process

As noted in the introduction, in addition to observable differences there may be an unobserved heterogeneity between systems. A common way of incorporating such heterogeneity is to modify Equation 10.7 to $\lambda_j(t) = a_j\,g(\mathbf{x}_j)\,\lambda(t)$ where the $a_j$ are unobservable (positive) random variables taking values independently across systems (Andersen et al. 1993, Chapter IX). For simplicity we shall in this chapter restrict attention to the case with no observed covariates, and instead concentrate on unobserved heterogeneity. In the following we thus assume the model

$$\lambda_j(t) = a_j\,\lambda(t) \tag{10.8}$$

where the $a_j$ are independently distributed according to a common probability distribution $H$, say, and where for convenience we assume that the expected value of $a_j$ equals 1. Thus in Equation 10.8, $\lambda(\cdot)$ is regarded as a basic trend function, while the $a_j$ represent a possibly different failure intensity “level” for each system, averaging to 1. The special case when $a_j = 1$ with probability 1 will be referred to as the “no heterogeneity” case.

For given values of the $a_j$ the likelihood for the $j$-th system is, by Equation 10.6,

$$L_j(a_j) = \left\{ \prod_{i=1}^{N_j} f[a_j(\Lambda(T_{ij}) - \Lambda(T_{i-1,j}))]\,a_j\,\lambda(T_{ij}) \right\} \left\{ 1 - F[a_j(\Lambda(\tau_j) - \Lambda(T_{N_j j}))] \right\}.$$

However, since the $a_j$ are unobservable, we need to take the expectation with respect to the $a_j$, giving

$$L_j = E[L_j(a_j)] = \int L_j(a_j)\,dH(a_j)$$

as the contribution to the likelihood from the $j$-th system. The total likelihood is then the product

$$L = \prod_{j=1}^{m} L_j. \tag{10.9}$$

We shall use the notation HTRP($F, \lambda(\cdot), H$) for the model with the likelihood given by Equation 10.9. Here the renewal distribution $F$ and the heterogeneity distribution $H$ are distributions corresponding to positive random variables with expected value 1, while the basic trend function $\lambda(t)$ is a positive function defined for $t \ge 0$.
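When $H$ is discrete, the integral over $a_j$ becomes a finite sum. As a minimal sketch, take the simplest heterogeneous case, in which $F$ is the unit exponential and the trend is a constant $\nu$. Then $L_j(a) = (a\nu)^{N_j} e^{-a\nu\tau_j}$, and for a two-point $H$ the contribution $L_j$ is a weighted sum of two such terms. The two-point form, the labels "good" and "bad", and all names below are assumptions chosen for illustration.

```python
import math


def hhpp_two_point_loglik(counts, taus, nu, a_good, p_good):
    """Log of Equation 10.9 for constant trend nu, exponential F, two-point H.

    counts[j] and taus[j] are the number of failures and the observation
    length of system j; H puts mass p_good on a_good and the rest on a_bad.
    """
    a_bad = (1.0 - p_good * a_good) / (1.0 - p_good)  # enforces E[a] = 1
    total = 0.0
    for n, tau in zip(counts, taus):
        # L_j(a) = (a*nu)**n * exp(-a*nu*tau) for an HPP with rate a*nu.
        l_good = (a_good * nu) ** n * math.exp(-a_good * nu * tau)
        l_bad = (a_bad * nu) ** n * math.exp(-a_bad * nu * tau)
        total += math.log(p_good * l_good + (1.0 - p_good) * l_bad)
    return total


# Hypothetical data: three systems with 0, 2 and 1 failures.
value = hhpp_two_point_loglik([0, 2, 1], [100.0, 100.0, 50.0],
                              nu=0.01, a_good=0.5, p_good=0.8)
```

Setting `a_good=1.0` forces both support points to 1, which recovers the "no heterogeneity" case.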


A useful feature of the HTRP model is that several important models for repairable systems are easily represented as submodels. With the notation HPP, NHPP, RP and TRP used as before, we define corresponding models with heterogeneity as at Equation 10.8 by putting an H in front of the abbreviations. Specifically, from a full model, HTRP($F, \lambda(\cdot), H$), we can identify the seven submodels described in Table 10.1.

Table 10.1. The seven submodels of HTRP($F, \lambda(\cdot), H$). 'exp' means the unit exponential distribution, '1' means the distribution degenerate at 1. The third column contains references to work on the corresponding models or special cases of them.

Submodel                       HTRP-formulation
HPP($\nu$)                     HTRP(exp, $\nu$, 1)
RP($F, \nu$)                   HTRP($F$, $\nu$, 1)
NHPP($\lambda(\cdot)$)         HTRP(exp, $\lambda(\cdot)$, 1)
TRP($F, \lambda(\cdot)$)       HTRP($F$, $\lambda(\cdot)$, 1)
HHPP($\nu, H$)                 HTRP(exp, $\nu$, $H$)
HRP($F, \nu, H$)               HTRP($F$, $\nu$, $H$)
HNHPP($\lambda(\cdot), H$)     HTRP(exp, $\lambda(\cdot)$, $H$)

The HTRP and the seven submodels may also be represented in a cube, as illustrated in Figures 10.6 and 10.7. Each vertex of the cube represents a model, and the lines connecting them correspond to changing one of the three “coordinates” in the HTRP notation. Going to the right corresponds to introducing a time trend, going upwards corresponds to entering a non-Poisson case, and going backwards (inwards) corresponds to introducing heterogeneity. When analyzing data with parametric HTRP models, the cube gives a convenient, visual way of presenting the maximum log-likelihood values of the different models, as we shall see below. The log-likelihood cube was introduced in Lindqvist et al. (2003).

Example 1 (continued) Figure 10.6 shows the log-likelihood cube of the valve-seat data. It should be noted that each arrow points in a direction where exactly one parameter is added (see the caption of Figure 10.6 for definitions of the parameters). Using standard asymptotic likelihood theory we know that if this parameter has no influence in the model, then twice the difference in log likelihood is approximately chi-square distributed with one degree of freedom. For example, if twice the difference is larger than 3.84, then the p-value for the hypothesis that the extra parameter has no effect is less than 5%, and we have an indication that the extra parameter in fact has some relevance. Note that adding an extra parameter can never decrease the maximum log likelihood, but from what we just argued, the difference needs to be more than, say, 3.84 / 2 = 1.92 to be of real interest.
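The comparison of two neighboring vertices of the cube can be sketched in a few lines. For one degree of freedom the chi-square tail probability has the closed form $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$, so no statistical library is needed. The function name is hypothetical, and the result is only the asymptotic approximation described above.

```python
import math


def lr_test_pvalue(loglik_small, loglik_large):
    """Approximate p-value for adding one parameter (chi-square, 1 d.f.)."""
    statistic = 2.0 * (loglik_large - loglik_small)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))


# A log-likelihood gain of 1.92 sits at the 5% borderline.
borderline = lr_test_pvalue(-100.00, -98.08)
```

The log-likelihood values in the usage line are arbitrary; only their difference matters.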


Figure 10.6. The log-likelihood cube for the valve-seat data of Nelson (1995), fitted with a parametric HTRP($F, \lambda(\cdot), H$) model and its sub-models. Here $F$ is a Weibull distribution with expected value 1 and shape parameter $s$, $\lambda(t) = cbt^{b-1}$ is a power function of $t$, and $H$ is a gamma distribution with expected value 1 and variance $v$. The maximum value of the log likelihood is denoted $l$

Looking at the valve-seat data cube in Figure 10.6 we note first that going from a vertex of the front face to the corresponding vertex of the back face (adding “H” in front of the model acronym) there is never much to gain (1.17 at most, from HPP to HHPP). This indicates no apparent heterogeneity between the various engines. By comparing the left and right faces we conclude, however, that there seems to be a gain in including a time trend. Having already excluded heterogeneity we are thus faced with the possibilities of either NHPP or TRP. Here the latter model “wins”, since the difference in log likelihood is as large as $(-343.66) - (-346.49) = 2.83$, and twice the difference, 5.66, corresponds to an approximate p-value of 0.017. The resulting estimated TRP is seen to have a renewal distribution which is Weibull with shape parameter 0.6806, which implies a decreasing failure rate. This means that the conditional intensity function will jump upward at each failure, which may be explained by burn-in problems at each valve-seat replacement. Further, there will be an estimated time trend of the form

$$\hat{\lambda}(t) = 3.26 \times 10^{-6} \times 1.929 \times t^{0.929} = 6.29 \times 10^{-6} \times t^{0.929},$$

which increases with $t$ so that replacements are becoming more and more frequent.

Example 2 (continued) For the closing valve failures considered by Bhattacharjee et al. (2003), previous studies had shown significant variations in the number of failures of each valve, suggesting a heterogeneity between valves. Bhattacharjee et al. (2003) thus stressed the importance of taking heterogeneity into consideration and concluded that even very simple models may describe the heterogeneous behavior successfully. In particular they considered a model where heterogeneity is represented by assuming that each valve is either “good” or “bad”. While Bhattacharjee et al. (2003) used hierarchical Bayes models, we fitted an HTRP model and its sub-models, with a trend function of power law type as for the valve-seat data, $\lambda(t) = cbt^{b-1}$, but now with a heterogeneity distribution $H$ being a two-point distribution with values $a_1$ = “good”, $a_2$ = “bad” (so $a_1 \le a_2$ by assumption) and $P(\text{“good”}) = p$. In order to have uniqueness of parameters we imposed the restriction of expected value 1 for the distribution $H$, leading to $pa_1 + (1 - p)a_2 = 1$. The results are given in the log-likelihood cube of Figure 10.7.

By comparing the front and back faces of Figure 10.7 it is clear that there is a considerable heterogeneity present, leaving us with the back face. Thus we continue by investigating whether we have Poisson behavior or renewal behavior at failures. This is done by comparing the bottom and top faces, in other words (HHPP, HNHPP) vs. (HRP, HTRP). The difference from HHPP to HRP happens to be 1.92 so the p-value is 5%. Thus we might prefer the HRP model. However, in order to obtain a simple model with a simple interpretation we might go for the HHPP, according to which the closing valve is a “good” one with probability 0.9524, with failures following an HPP with rate

$$1.083 \times 10^{-4} \times 0.35 = 3.79 \times 10^{-5} \text{ (per day)},$$

or a “bad” one with probability 0.0476 and rate

$$1.083 \times 10^{-4} \times 14.0 = 1.52 \times 10^{-3} \text{ (per day)}.$$

The expected number of failures in 3286 days is hence 0.125 and 4.99, respectively, for the “good” and “bad” valves.
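The arithmetic behind these figures can be retraced as follows. The sketch uses only the rounded estimates quoted in the text, so the last digit may differ slightly from the printed values; the variable names are hypothetical.

```python
base_rate = 1.083e-4  # fitted HPP rate per day, as quoted in the text
levels = {"good": (0.9524, 0.35), "bad": (0.0476, 14.0)}  # (probability, level a)
follow_up_days = 3286

expected_failures = {}
for label, (prob, a) in levels.items():
    rate = base_rate * a  # failures per day for this group of valves
    expected_failures[label] = rate * follow_up_days

# The two-point distribution H is constrained to have expected value 1.
mean_level = sum(prob * a for prob, a in levels.values())
```

Since an HPP with rate $r$ observed for a time $\tau$ has a Poisson number of failures with mean $r\tau$, the expected counts are simply rate times follow-up time.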

10.3 A Competing Risks Model for Failure vs. Preventive Maintenance

10.3.1 A General Setup

Consider again the situation illustrated in Figure 10.1, where the sojourns $X_1, X_2, \ldots$ are times to failure of a system which is repaired immediately before the start of the sojourn. In the present section we consider the case when the failure which we expect at the end of the sojourn $X_i$ may be avoided by a preventive maintenance (PM) after a time $Z_i$ in the sojourn. The experienced sojourn time will in this case be $Y_i = \min(X_i, Z_i)$, and it will result in either a failure or a PM according to whether $Y_i = X_i$ or $Y_i = Z_i$. We thus have a competing risks situation with two risks, corresponding to failure and PM.

Figure 10.7. The log-likelihood cube for the data of Bhattacharjee et al. (2003) concerning failures of motor operated closing valves in nuclear reactor plants in Finland, fitted with a parametric HTRP($F, \lambda(\cdot), H$) model and its sub-models. Here $F$ is a Weibull distribution with expected value 1 and shape parameter $s$, $\lambda(t) = cbt^{b-1}$ is a power function of $t$, and $H$ is a two-point distribution with unit expectation, giving probability $p$ for the value “low” and $1 - p$ for the value “high”. The maximum value of the log likelihood is denoted by $l$

Doyen and Gaudoin (2006) recently presented a point process approach for modeling such competing risks situations between failure and PM. A general setup for this kind of process is furthermore suggested in the review paper by Lindqvist (2006). For simplicity we shall in this chapter consider only the case where the component or system is perfectly repaired or maintained at the end of each sojourn. This will lead to the observation of independent copies of the competing risks situation in the same way as for a renewal process. We will therefore in the following consider only a single sojourn and hence suppress the subscripts of the observed times.

Thus we let $X$ and $Z$ be, respectively, the potential time to failure and time to PM of a single sojourn. Then $Y = \min(X, Z)$ is the observed sojourn, and in addition we observe the indicator variable $\delta$ which we define to be 1 if there is a PM ($Y = Z$) and 0 if there is a failure ($Y = X$). This situation has been extensively studied by Cooke (1993, 1996), Bedford and Cooke (2001), Langseth and Lindqvist (2003, 2006), Lindqvist et al. (2006) and Lindqvist and Langseth (2005).

Note that the observable result is the pair $(Y, \delta)$, rather than the underlying times $X$ and $Z$, which may often be the times of interest. For example, knowing the distribution of $X$ would be important as a basis for maintenance optimization. It is well known (see Crowder 2001, Chapter 7), however, that in a competing risks case as described here, the marginal distributions of $X$ and $Z$ are not identifiable from observation of $(Y, \delta)$ alone unless specific assumptions are made on the dependence between $X$ and $Z$. The most frequently used assumption of this kind is to let $X$ and $Z$ be independent, in which case identifiability follows. This assumption is not reasonable in our application, however, since the maintenance crew is likely to have some information regarding the system’s state during operation. This insight is used to perform maintenance in order to avoid failures. We are thus in practice usually faced with a situation of dependent competing risks between $X$ and $Z$.

10.3.2 Random Signs Censoring

Cooke (1993, 1996) suggested that the competing risks situation between failure and PM will often satisfy what he called the random signs censoring property. The important features of random signs censoring are that the marginal distribution of $X$ is always identifiable, and that an indication of the validity of this type of censoring can be obtained by plotting the data.

A lifetime $Z$ is said to be a random signs censoring of $X$ if the event $\{Z < X\}$ is stochastically independent of $X$, i.e. if the event of having a PM before failure is not influenced by the time $X$ at which the system fails or would have failed without PM. The idea is that the system emits some kind of signal before failure, and that this signal is discovered with a probability which does not depend on the age of the system.

We now introduce some notation. Below we assume without further mention that $X, Z$ are positive and continuous random variables, with $P(X = Z) = 0$. We let $F_X(t) = P(X \le t)$ and $F_Z(t) = P(Z \le t)$ be the cumulative distribution functions of $X$ and $Z$, respectively.
The subdistribution functions of $X$ and $Z$ are defined as, respectively, $F_X^*(t) = P(X \le t, X < Z)$ and $F_Z^*(t) = P(Z \le t, Z < X)$. Note that the functions $F_X^*$ and $F_Z^*$ are nondecreasing with $F_X^*(0) = 0$ and $F_Z^*(0) = 0$. Moreover, we have $F_X^*(\infty) + F_Z^*(\infty) = 1$.

We will also use the notion of conditional distribution functions, defined by $\tilde{F}_X(t) = P(X \le t \mid X < Z)$ and $\tilde{F}_Z(t) = P(Z \le t \mid Z < X)$. Note then that $\tilde{F}_X(t) = F_X^*(t)/F_X^*(\infty)$ and $\tilde{F}_Z(t) = F_Z^*(t)/F_Z^*(\infty)$.

It is important to note that the functions $F_X^*, F_Z^*, \tilde{F}_X, \tilde{F}_Z$ are identifiable from data of the form $(Y, \delta)$, since they are given in terms of probabilities of events that can be expressed by $(Y, \delta)$. For example, $F_X^*(t) = P(Y \le t, \delta = 0)$ and can hence be estimated consistently from a sample of values of $(Y, \delta)$. On the other hand, as already mentioned, the marginal distribution functions $F_X, F_Z$ are not identifiable in general since they are not probabilities of events that can be expressed by $(Y, \delta)$.

We now show that the marginal distribution of $X$ is identifiable under random signs censoring. In fact this follows directly from the definition, since we must have

$$\tilde{F}_X(t) = P(X \le t \mid X < Z) = P(X \le t) = F_X(t) \tag{10.10}$$

by independence of $X$ and the event $X < Z$. As verified above, $\tilde{F}_X(t)$ can always be estimated consistently from data, and thus this holds for $F_X(t)$ as well by Equation 10.10. Hence we have the somewhat surprising result under random signs censoring that the marginal distribution of $X$ is the same as the distribution of the observed occurrences of $X$.

Cooke (1993) showed that under random signs censoring we have

$$\tilde{F}_X(t) < \tilde{F}_Z(t) \text{ for all } t > 0. \tag{10.11}$$
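Since both conditional distribution functions are identifiable from $(Y, \delta)$, the inequality can be examined empirically. The following is a minimal descriptive sketch, not a formal test: it builds the empirical versions of $\tilde{F}_X$ and $\tilde{F}_Z$ and compares them on a grid of time points. The function names are hypothetical, and the data set is assumed to contain at least one failure and at least one PM.

```python
def conditional_ecdfs(observations):
    """Empirical versions of the two conditional distribution functions.

    observations: list of (y, delta) pairs, with delta = 1 for a PM (Y = Z)
    and delta = 0 for a failure (Y = X). Returns two step functions.
    """
    failures = sorted(y for y, delta in observations if delta == 0)
    pms = sorted(y for y, delta in observations if delta == 1)

    def ecdf(sample):
        return lambda t: sum(1 for v in sample if v <= t) / len(sample)

    return ecdf(failures), ecdf(pms)


def random_signs_plausible(observations, grid):
    """True if the estimate of F~_X never exceeds that of F~_Z on the grid."""
    f_x, f_z = conditional_ecdfs(observations)
    return all(f_x(t) <= f_z(t) for t in grid)


# Hypothetical sojourn data, for illustration only.
data = [(1.0, 1), (2.0, 0), (1.5, 1), (3.0, 0), (2.5, 1), (4.0, 0)]
plausible = random_signs_plausible(data, [0.5 * k for k in range(1, 9)])
```

In practice one would plot both step functions rather than reduce them to a single flag, since sampling variation can make the empirical curves cross even when the inequality holds for the true distributions.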

Moreover, Cooke showed a kind of converse: whenever Equation 10.11 holds, there exists a joint distribution of $(X, Z)$ satisfying the requirements of random signs censoring and giving the same sub-distribution functions. On the other hand, if $\tilde{F}_X(t) \ge \tilde{F}_Z(t)$ for some $t$, then there is no joint distribution of $(X, Z)$ for which the random signs requirement holds. For more discussion on random signs censoring and its applications we refer to Cooke (1993, 1996) and Bedford and Cooke (2001, Chapter 9). One idea is to estimate the functions $\tilde{F}_X(t)$ and $\tilde{F}_Z(t)$ from data to check whether Equation 10.11 may possibly hold and, when this is the case, to suggest a model that satisfies the random signs property.

10.3.3 The Repair Alert Model

Lindqvist et al. (2006) introduced the so-called repair alert model which extends the idea of random signs censoring by defining an additional repair alert function which describes the “alertness” of the maintenance crew as a function of time. The definition can be given as follows. The pair $(X, Z)$ of life variables satisfies the requirements of the repair alert model provided the following two conditions both hold:

(i) $Z$ is a random signs censoring of $X$.
(ii) There exists an increasing function $G$ defined on $[0, \infty)$ with $G(0) = 0$, such that for all $x > 0$,

$$P(Z \le z \mid Z < X, X = x) = \frac{G(z)}{G(x)}, \quad 0 < z \le x.$$

The function $G$ is called the cumulative repair alert function. Its derivative $g$ (when it exists) is called the repair alert function. The repair alert model is hence a specialization of random signs censoring, obtained by introducing the cumulative repair alert function $G$.

Part (ii) of the above definition means that, given that there would be a failure at time $X = x$, and given that the maintenance crew will perform a PM before that time (i.e. given that $Z < X$), the conditional density of the time $Z$ of this PM is proportional to the repair alert function $g$.

Lindqvist et al. (2006) showed that whenever Equation 10.11 holds there is a unique repair alert model giving the same sub-distribution functions. Thus, restricting to repair alert models we are able to strengthen the corresponding result for random signs censoring, which does not guarantee uniqueness.

The repair alert function is meant to reflect the reaction of the maintenance crew. More precisely, $g(t)$ ought to be high at times $t$ for which failures are expected and the alert therefore should be high. Langseth and Lindqvist (2003) simply put $g(t) = \lambda(t)$ where $\lambda(t)$ is the failure rate of the marginal distribution of $X$. This choice of $g(t)$ of course simplifies analyses since it reduces the number of parameters, but at the same time it seems fairly reasonable given a competent maintenance crew. In a subsequent paper, Langseth and Lindqvist (2006) present ways to test whether $g(t)$ can be assumed equal to the hazard function $\lambda(t)$.

It follows from the construction in Lindqvist et al. (2006) that the repair alert model is completely determined by the marginal distribution function $F_X$ of $X$, the cumulative repair alert function $G$, the probability $q \equiv P(Z < X)$, and the assumption that $X$ is independent of the event $\{Z < X\}$ (i.e. random signs censoring). Thus, given statistical data, the inference problem consists of estimating $F_X(t)$ (possibly in parametric form), the repair alert function $g$ (or $G$), and the probability $q$ of PM. We refer to Lindqvist et al. (2006) and Lindqvist and Langseth (2005) for details on such statistical inferences. The following is a simple example of a repair alert model.

Example 3 Let $(X, Z)$ be a pair of life variables with joint density parameterized by $\lambda > 0$ and $0 < q < 1$,

$$f_{XZ}(x, z; \lambda, q) = (q/x)\,\lambda e^{-\lambda x} \quad \text{for } x > 0,\ 0 < z < x/q.$$

The marginal distribution of X is the exponential distribution with density f_X(x) = λ e^{−λx}, while the conditional distribution of Z given X = x is the uniform distribution on (0, x/q). From this we obtain P(Z < X | X = x) = q for all x > 0. Thus the event {Z < X} is independent of X and condition (i) of the definition is satisfied. The following computation shows that condition (ii) holds as well. Let 0 < z < x. Then

$$P(Z \le z \mid Z < X, X = x) = \frac{P(Z \le z, Z < X \mid X = x)}{P(Z < X \mid X = x)} = \frac{P(Z \le z \mid X = x)}{q} = \frac{z\,(q/x)}{q} = \frac{z}{x},$$

which implies condition (ii) of Definition 2 with G(t) = t.
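The two conditions verified in Example 3 can also be checked by simulation. The following is a minimal sketch, assuming illustrative values λ = 2 and q = 0.3 and a fixed seed (none of which come from the text); the function name `simulate_example3` is hypothetical. It draws pairs (X, Z) from the joint density of the example and estimates P(Z < X), the mean of Z/X among sojourns ending in PM, and the mean of X among those sojourns.

```python
import random

def simulate_example3(lam, q, n, seed=0):
    """Draw n pairs (X, Z) from the joint density of Example 3."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.expovariate(lam)        # X is exponential with rate lam
        z = rng.uniform(0.0, x / q)     # given X = x, Z is uniform on (0, x/q)
        pairs.append((x, z))
    return pairs

# Illustrative parameter values (assumptions, not taken from the text).
pairs = simulate_example3(lam=2.0, q=0.3, n=200_000)
pm = [(x, z) for x, z in pairs if z < x]           # sojourns that end in a PM

q_hat = len(pm) / len(pairs)                       # estimates P(Z < X)
mean_ratio = sum(z / x for x, z in pm) / len(pm)   # estimates E(Z/X | Z < X)
mean_x_pm = sum(x for x, _ in pm) / len(pm)        # estimates E(X | Z < X)

print(q_hat, mean_ratio, mean_x_pm)
```

If the example is read correctly, `q_hat` should be close to q, `mean_ratio` close to 1/2 (since Z/X is uniform on (0, 1) given Z < X when G(t) = t), and `mean_x_pm` close to E(X) = 1/λ, the last reflecting the independence of X and the event {Z < X}.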


The practical interpretation of this example is as follows. We consider a component or system with lifetime X which is exponentially distributed with failure rate λ. With probability q a PM is performed before X, at a time which for given X = x is uniformly distributed on the interval from 0 to x.

10.3.4 Further Properties of the Repair Alert Model

The following formula (taken from Lindqvist et al. 2006) shows in particular why Equation 10.11 holds under the repair alert model:

$$\tilde{F}_Z(t) = F_X(t) + G(t) \int_t^\infty \frac{f_X(y)}{G(y)}\,dy. \tag{10.12}$$

Note that for random signs censoring, and hence for the repair alert model, we have $\tilde{F}_X(t) = F_X(t)$. We next discuss some implications of the repair alert model, in particular how the parameters q and G influence the observed performance of PM and failures. In order to help intuition, we sometimes consider the power version G(t) = t^β where β > 0 is a parameter. Then g(t) = β t^{β−1}, so β = 1 means a constant repair alert function, while β < 1 and β > 1 correspond to, respectively, a decreasing and an increasing repair alert function.

Under the random signs assumption, the parameter q = P(Z < X) is connected to the ability to discover “signals” regarding a possibly approaching failure. More precisely, q is understood as the probability that a failure is avoided by a preceding PM. Given that there will be a PM, one should ideally have the time of PM immediately before the failure. It is seen that this issue is connected to the function G. For example, large values of β will correspond to distributions with most of their mass near x. Moreover, it follows from Equation 10.12 that

$$E(Z \mid Z < X) = \int_0^\infty \bigl(1 - \tilde{F}_Z(z)\bigr)\,dz = E(X) - E\!\left[\frac{M(X)}{G(X)}\right],$$

where M(x) = ∫_0^x G(t) dt. For the special case when G(t) = t^β, we obtain the simple result

$$E(Z \mid Z < X) = \frac{\beta}{\beta + 1}\,E(X), \tag{10.13}$$

which clearly indicates that good PM performance corresponds to large values of β . An interesting observation is, furthermore, that Equation 10.13 can be used to estimate β from a sample of (Y , δ ) . In fact, E ( Z | Z < X ) can be estimated simply by the average of the observed Z , and since E ( X ) = E ( X | X < Z ) for random


signs censoring, we can estimate E(X) similarly by the average of the observed X. An estimate of the quotient β/(β + 1), and hence of β, follows.

Instead of merely considering the conditional expectation E(Z | Z < X) one may more generally study the conditional distribution of Z given Z < X, or the conditional distribution of X − Z given Z = z, Z < X. A good PM performance would then mean that the former distribution is stochastically as large as possible, while the latter distribution should be small (stochastically). For precise results in this direction we refer to Lindqvist et al. (2006).

Consider next Y = min(X, Z), which is the actual sojourn time. The following results are hence of practical interest, and may in addition shed light on the influence of the parameters of the repair alert model:

$$P(Y \le t) = F_X(t) + q\,G(t) \int_t^\infty \frac{f_X(y)}{G(y)}\,dy,$$

$$E(Y) = E(X) - q\,E\!\left[\frac{M(X)}{G(X)}\right], \quad \text{where } M(x) = \int_0^x G(t)\,dt.$$

Furthermore, if G(t) = t^β, then

$$E(Y) = E(X)\left(1 - \frac{q}{\beta + 1}\right). \tag{10.14}$$
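The moment estimator of β suggested by Equation 10.13 is easy to try out on simulated data. The sketch below assumes an exponential lifetime with mean 10, q = 0.4 and β = 3 (illustrative values only); the function names and the encoding of δ as the strings "PM" and "failure" are hypothetical. PM times are generated from condition (ii) with G(t) = t^β, i.e. Z = x·U^{1/β} with U uniform on (0, 1).

```python
import random

def simulate_sojourns(beta, q, mean_x, n, seed=1):
    """Simulate (Y, delta) under a repair alert model with G(t) = t**beta."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.expovariate(1.0 / mean_x)          # potential failure time X
        if rng.random() < q:                       # PM occurs, independently of X
            z = x * rng.random() ** (1.0 / beta)   # P(Z <= z | Z < X, X = x) = (z/x)**beta
            data.append((z, "PM"))
        else:
            data.append((x, "failure"))
    return data

def estimate_beta(data):
    """Moment estimator of beta based on Equation 10.13."""
    pm_times = [y for y, delta in data if delta == "PM"]
    failure_times = [y for y, delta in data if delta == "failure"]
    mean_z = sum(pm_times) / len(pm_times)             # estimates E(Z | Z < X)
    mean_x = sum(failure_times) / len(failure_times)   # estimates E(X | X < Z) = E(X)
    ratio = mean_z / mean_x                            # estimates beta / (beta + 1)
    return ratio / (1.0 - ratio)

# Illustrative parameter values (assumptions, not taken from the text).
data = simulate_sojourns(beta=3.0, q=0.4, mean_x=10.0, n=200_000)
beta_hat = estimate_beta(data)
mean_y = sum(y for y, _ in data) / len(data)   # compare with Equation 10.14
print(beta_hat, mean_y)
```

With these values Equation 10.14 gives E(Y) = 10(1 − 0.4/4) = 9, which the sample mean of Y should approximate. The estimator is only defined when the estimated quotient is below 1, which a real data set need not guarantee.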

We finally give a simple illustration of how the parameters q and β (assuming G(t) = t^β for simplicity) influence the long run cost per time unit under the repair alert model. Let C_PM and C_F be the costs of PM and failure, respectively, for a single sojourn. Assume now that following an event (PM or failure), the operation is restarted with a system assumed to be as good as new, and that this process continues. This leads to a sequence of observations of (Y, δ), which we shall assume are independent and identically distributed. The theory of renewal reward processes (e.g. Ross 1983, p 78) implies that the expected cost per unit time in the long run equals the expected cost per sojourn divided by the expected length of a sojourn, i.e.

$$\frac{q\,C_{PM} + (1 - q)\,C_F}{E(X)\left(1 - \dfrac{q}{\beta + 1}\right)},$$

where we used Equation 10.14. This is a decreasing function of β , which seems reasonable. On the other hand, it is a decreasing function of q provided β > CPM / (CF − CPM ) . This last inequality is likely to hold in many practical cases since the right hand side will usually be much less than 1, while β should for a competent maintenance crew be larger than 1. Thus a high value of q is usually preferable.
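The behaviour of the cost rate in q and β can be explored numerically. The following sketch evaluates the long run cost per time unit for assumed costs C_PM = 1, C_F = 10 and E(X) = 1 (illustrative values, not from the text); the function names are hypothetical.

```python
def cost_rate(q, beta, c_pm, c_f, mean_x):
    """Long run cost per time unit under the repair alert model with G(t) = t**beta."""
    expected_cost = q * c_pm + (1.0 - q) * c_f           # expected cost per sojourn
    expected_length = mean_x * (1.0 - q / (beta + 1.0))  # E(Y), Equation 10.14
    return expected_cost / expected_length

def beta_threshold(c_pm, c_f):
    """Value of beta above which the cost rate decreases in q."""
    return c_pm / (c_f - c_pm)

# Illustrative costs (assumptions): a failure costs ten times as much as a PM.
c_pm, c_f, mean_x = 1.0, 10.0, 1.0
threshold = beta_threshold(c_pm, c_f)                    # equals 1/9 here

for beta in (0.05, threshold, 2.0):
    low_q = cost_rate(0.2, beta, c_pm, c_f, mean_x)
    high_q = cost_rate(0.8, beta, c_pm, c_f, mean_x)
    print(f"beta={beta:.3f}: cost rate {low_q:.3f} at q=0.2, {high_q:.3f} at q=0.8")
```

For β above the threshold the cost rate at q = 0.8 is lower than at q = 0.2, for β below it the order is reversed, and at the threshold itself the cost rate does not depend on q.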


10.4 Periodically Tested Systems

Certain systems, for example alarm systems, are tested only at fixed times which are usually periodic. If the system is found in a failed state, then it is repaired or replaced. Thus repair is usually not done at the same time as the failure, and the situation is hence not covered by the methods considered earlier in this chapter. A simple model of this situation was suggested by Hokstad and Frøvig (1996) and further studied and extended by Lindqvist and Amundrustad (1998), which is the main source for the present section. The approach of Lindqvist and Amundrustad (1998) involves a continuous time Markov model for the system state when time runs between testing epochs, and in addition two discrete time Markov chains for the states of the system reported immediately before and after each test, respectively. As will be seen, the given framework also allows, in an easy manner, the potentially useful extension to modeling of incomplete repairs or maintenance actions.

We consider a standby system observed from time 0, with testing, repair and PM performed periodically at times τ, 2τ, 3τ, …, called PM epochs. Here τ > 0 is the length of what we shall call the PM interval.

10.4.1 The Markov Model

Let X(t) ∈ S denote the state of the system at time t, where the set S of possible states is finite. It is assumed that X(t) behaves like a time homogeneous Markov chain as long as time runs inside PM intervals, i.e. inside time intervals nτ ≤ t < (n + 1)τ for n = 0, 1, …. This Markov chain is governed by an infinitesimal intensity matrix A, where the entry a_jk of A for j ≠ k is the transition intensity from state j to state k; see for example Taylor and Karlin (1984, p 254). An example of an intensity matrix A is given by Equation 10.15, an illustration of which is provided by the state diagram in Figure 10.9. Let

$$P_{jk}(t) = P(X(t) = k \mid X(0) = j); \quad j, k \in S,\ t > 0$$

denote transition probabilities for the Markov chain governed by A and let

$$P(t) = (P_{jk}(t);\ j, k \in S)$$

be the corresponding transition matrix. In order to specify the effect of maintenance and repair at PM epochs, we next introduce for n = 1, 2, …,

$$Y_n = X(n\tau-) \equiv \lim_{t \uparrow n\tau} X(t),$$


which is the state of the system immediately before the n-th PM epoch. The effect of PM at time nτ is to change the state of the system from Y_n to Z_n according to a transition matrix R = (R_jk), where

$$P(Z_n = k \mid Y_n = j) = R_{jk}; \quad j, k \in S.$$

Moreover, given Yn it is assumed that Z n is independent of all transitions of the system state before time nτ . The definitions of the Yn and Z n are illustrated in Figure 10.8.

Figure 10.8. The definition of Yn and Z n

The model description is completed by defining the initial state of the Markov chain X(t) running inside the PM interval [nτ, (n + 1)τ) to be X(nτ) ≡ Z_n (n = 0, 1, …), where Z_0 is the initial state of the system, usually the perfect state in S. It is furthermore assumed that the Markov chain X(t) on [nτ, (n + 1)τ), given its initial state Z_n, is independent of all transitions occurring before time nτ. Let the distribution of Z_0 ≡ X(0) be denoted ρ = (ρ_j; j ∈ S), where ρ_j = P(Z_0 = j). Then for any k ∈ S,

$$\begin{aligned}
P(Y_1 = k) &= P(X(\tau-) = k) \\
&= \sum_{j \in S} P(X(\tau-) = k \mid X(0) = j)\, P(X(0) = j) \\
&= \sum_{j \in S} \rho_j P_{jk}(\tau) = [\rho P(\tau)]_k.
\end{aligned}$$

Thus the distribution of Y_1 is given by the vector-matrix product ρP(τ). Further, for n ≥ 1,

$$\begin{aligned}
P(Y_{n+1} = k \mid Y_n = j) &= \sum_{\ell \in S} P(Y_{n+1} = k \mid Z_n = \ell, Y_n = j)\, P(Z_n = \ell \mid Y_n = j) \\
&= \sum_{\ell \in S} P_{\ell k}(\tau)\, R_{j\ell} = [R P(\tau)]_{jk}.
\end{aligned}$$

It follows that Y_1, Y_2, … is a discrete time Markov chain on S with transition matrix

$$Q = R P(\tau).$$


On the other hand,

$$\begin{aligned}
P(Z_{n+1} = k \mid Z_n = j) &= \sum_{\ell \in S} P(Z_{n+1} = k \mid Y_{n+1} = \ell, Z_n = j)\, P(Y_{n+1} = \ell \mid Z_n = j) \\
&= \sum_{\ell \in S} P_{j\ell}(\tau)\, R_{\ell k} = [P(\tau) R]_{jk}.
\end{aligned}$$

Thus, our assumptions imply that Z_0, Z_1, … is a discrete time Markov chain on S with transition matrix

$$T = P(\tau) R.$$

10.4.2 Reliability Measures

The approach may now be used to compute interesting reliability measures.

10.4.2.1 Average Rate of Critical Failures

Let π = (π_j, j ∈ S) be the stationary distribution of the Markov chain Y_1, Y_2, …, i.e. π is the unique probability vector satisfying the equation

$$\pi Q \equiv \pi R P(\tau) = \pi.$$

For any subset G ⊂ S, define

$$\pi_G = \sum_{j \in G} \pi_j.$$

This is the expected relative number of PM epochs, in the long run, where the system is found to be in G. Moreover, 1/π_G is the mean time, in the long run, between visits to G (measured with time unit τ). These facts are well known from the theory of Markov chains (Taylor and Karlin 1984, Chapter 4). In the following, let G be the subset of S defining the critical failure states of the system. Then as in Hokstad and Frøvig (1996) we define the mean time between critical failures to be

$$\mathrm{MTBF}_{crit} = \tau / \pi_G$$

and the average rate of critical failures to be

$$\lambda_{crit} = 1 / \mathrm{MTBF}_{crit} = \pi_G / \tau.$$
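As a small numerical illustration of these definitions, the sketch below computes π, MTBF_crit and λ_crit for a hypothetical three-state chain with one critical state. The matrix Q, the value τ = 1000 hours and the function names are assumptions made for the example. The stationary distribution is approximated by repeated multiplication, which presupposes that the chain is irreducible and aperiodic; solving the linear equations πQ = π directly is an equally valid route.

```python
def stationary_distribution(Q, iterations=2000):
    """Approximate the probability vector pi with pi Q = pi by repeated multiplication."""
    n = len(Q)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[j] * Q[j][k] for j in range(n)) for k in range(n)]
    total = sum(pi)
    return [p / total for p in pi]

def critical_failure_measures(Q, critical_states, tau):
    """Return pi, MTBF_crit = tau / pi_G and lambda_crit = pi_G / tau."""
    pi = stationary_distribution(Q)
    pi_G = sum(pi[j] for j in critical_states)
    mtbf_crit = tau / pi_G
    return pi, mtbf_crit, 1.0 / mtbf_crit

# Hypothetical transition matrix of the chain Y_1, Y_2, ... (state 2 is critical).
Q = [[0.90, 0.08, 0.02],
     [0.50, 0.40, 0.10],
     [1.00, 0.00, 0.00]]
pi, mtbf_crit, lambda_crit = critical_failure_measures(Q, critical_states=[2], tau=1000.0)
print(pi, mtbf_crit, lambda_crit)
```

For this particular Q the equations πQ = π can be solved by hand, giving π = (30/35, 4/35, 1/35), so that MTBF_crit = 35 000 hours.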

10.4.2.2 Critical Safety Unavailability

Consider a PM interval [nτ, (n + 1)τ). The expected relative amount of time in this interval that the system is in a critical state, i.e. in G, is

$$U_n = \frac{1}{\tau} \int_{n\tau}^{(n+1)\tau} P(X(t) \in G)\,dt.$$


By our assumptions, X(t) behaves in the interval [nτ, (n + 1)τ) in the same manner as if it was run in the interval [0, τ) and started in state Z_n. Thus

$$U_n = \frac{1}{\tau} \int_0^\tau \sum_{j \in S} P_{jG}(t)\, P(Z_n = j)\,dt,$$

where P_jG(t) = Σ_{k∈G} P_jk(t).

Letting n tend to infinity, the probabilities P(Z_n = j) tend to the limiting values γ_j defined by the stationary distribution γ = (γ_j) of the Markov chain Z_0, Z_1, …. This distribution is found by solving the equations

$$\gamma T \equiv \gamma P(\tau) R = \gamma.$$

Following Hokstad and Frøvig (1996) we shall define the critical safety unavailability (CSU) of the system by

$$\mathrm{CSU} = \lim_{n \to \infty} U_n = \frac{1}{\tau} \int_0^\tau \sum_{j \in S} P_{jG}(t)\, \gamma_j\,dt = \sum_{j \in S} \gamma_j Q_j,$$

where

$$Q_j = \frac{1}{\tau} \int_0^\tau P_{jG}(t)\,dt$$

is the critical safety unavailability given that the system state is j at the beginning of the PM interval.

10.4.3 The Failure Model of Hokstad and Frøvig

As an illustration we shall reconsider the most general failure model of Hokstad and Frøvig (1996), namely their Failure Mechanism III. Here the state space is

$$S = \{O, D, K_I, K_{II}\},$$

where
O = the system is as good as new,
D = the system has a failure classified as degraded (noncritical),
K_I = the system has a failure classified as critical, caused by a sudden shock,
K_II = the system has a failure classified as critical, caused by the degradation process.

It is assumed that the Markov chain X(t) is defined by the state diagram of Figure 10.9, and thus has infinitesimal transition matrix


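$$A = \begin{bmatrix}
-\lambda_d - \lambda_k & \lambda_d & \lambda_k & 0 \\
0 & -\lambda_k - \lambda_{dk} & \lambda_k & \lambda_{dk} \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix} \tag{10.15}$$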

Note that both K I and K II are absorbing states.

Figure 10.9. State diagram for the failure mechanism of Hokstad and Frøvig (1996)

The model assumes that no repairs are done in the time intervals between PM epochs. Moreover, since A is upper triangular, we can obtain P(t) = e^{tA} rather easily. It is clear that P(t) can be written

$$P(t) = \begin{bmatrix}
P_{OO}(t) & P_{OD}(t) & P_{OK_I}(t) & P_{OK_{II}}(t) \\
0 & P_{DD}(t) & P_{DK_I}(t) & P_{DK_{II}}(t) \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$

where expressions for the entries are found in Lindqvist and Amundrustad (1998).

In practice it is of interest to quantify the effect of various forms of preventive maintenance. This can be done in the presented framework by means of the repair matrix R. Some examples are given below. If all failures are repaired at PM epochs, then the PM always returns the system back to state O, and we have

$$R = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}$$


Next, if only critical failures are repaired at PM epochs, then the appropriate R matrix is

$$R = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}$$

More generally one may consider an extension of this by assuming that all critical failures are repaired, while degraded failures are repaired with probability 1 − r and remain unrepaired with probability r, 0 ≤ r ≤ 1. The repair strategy is thus determined by the parameter r. This clearly leads to the matrix

$$R = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 - r & r & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}$$

A more general imperfect repair model can be defined by

$$R = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 - r & r & 0 & 0 \\
1 - r_{k1} & 0 & r_{k1} & 0 \\
1 - r_{k2} & 0 & 0 & r_{k2}
\end{bmatrix}$$

Here r has the same meaning as before, while 1 − r_k1 is the probability of successful repair of a K_I failure and 1 − r_k2 is the corresponding probability for K_II.
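To see how the pieces of Section 10.4 fit together, the following sketch evaluates λ_crit and the CSU for the intensity matrix of Equation 10.15 combined with the one-parameter repair matrix (degraded failures left unrepaired with probability r). It is an illustration under stated assumptions, not the computation used in the cited papers: the rates and τ are invented, P(t) = e^{tA} is obtained numerically by scaling and squaring rather than from the closed-form entries, the stationary vectors are found by repeated multiplication, and the integral defining Q_j is approximated by the trapezoidal rule. All function names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[math.fsum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def expm(a, t, squarings=10, terms=20):
    """exp(t * a) by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    scale = t / 2 ** squarings
    small = [[a[i][j] * scale for j in range(n)] for i in range(n)]
    result, term = identity(n), identity(n)
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in matmul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result

def stationary(m, iterations=2000):
    """Stationary row vector of a stochastic matrix by repeated multiplication."""
    v = [[1.0 / len(m)] * len(m)]
    for _ in range(iterations):
        v = matmul(v, m)
    total = math.fsum(v[0])
    return [x / total for x in v[0]]

def analyse(lam_d, lam_k, lam_dk, tau, r, steps=400):
    """Return P(tau), pi, gamma, lambda_crit and CSU; states ordered O, D, K_I, K_II."""
    A = [[-lam_d - lam_k, lam_d, lam_k, 0.0],
         [0.0, -lam_k - lam_dk, lam_k, lam_dk],
         [0.0, 0.0, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
    R = [[1.0, 0.0, 0.0, 0.0],
         [1.0 - r, r, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0]]
    critical = (2, 3)                                  # G = {K_I, K_II}
    P_tau = expm(A, tau)
    pi = stationary(matmul(R, P_tau))                  # chain Y_n, matrix Q = R P(tau)
    gamma = stationary(matmul(P_tau, R))               # chain Z_n, matrix T = P(tau) R
    lambda_crit = math.fsum(pi[j] for j in critical) / tau
    # Trapezoidal rule for Q_j = (1/tau) * integral of P_jG(t) over [0, tau].
    step = expm(A, tau / steps)
    P_t, Q = identity(4), [0.0] * 4
    for k in range(steps + 1):
        weight = 0.5 if k in (0, steps) else 1.0
        for j in range(4):
            Q[j] += weight * (P_t[j][2] + P_t[j][3]) / steps
        P_t = matmul(P_t, step)
    csu = math.fsum(g * q_j for g, q_j in zip(gamma, Q))
    return P_tau, pi, gamma, lambda_crit, csu

# Invented rates (per hour) and a test interval of 2190 hours, for illustration only.
P_tau, pi, gamma, lambda_crit, csu = analyse(1e-4, 2e-5, 5e-4, tau=2190.0, r=0.5)
print(lambda_crit, csu)
```

Calling `analyse` with r = 0 reproduces the case where every failure is repaired at PM epochs; then γ puts all its mass on state O and the CSU reduces to Q_O. Increasing r leaves more degraded units in service after a test, which should raise the CSU whenever λ_dk > 0.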

10.5 Concluding Remarks

In the present chapter we have considered some aspects of the modeling and analysis of repaired and maintained systems. Rather than giving a comprehensive review of the field we have concentrated on a few points, partly chosen according to the interests of the author. It is believed, however, that the chapter touches some topics that have to a certain degree been overlooked in much of reliability practice.

The first point concerns the use of the NHPP as the single model for repairable systems with trend. Although this is appropriate in perhaps most cases, there are cases where renewal effects caused by repair or maintenance destroy the randomness associated with Poisson processes. One way of checking NHPP models is to embed them in larger models, and here the TRP can serve as a means of model


checking (see for example the consideration of maximum log likelihoods in the examples of Section 10.2.5). Another way of extending the NHPP processes is via the large class of imperfect repair models. The classical model is here the one suggested by Brown and Proschan (1983) (see the review paper Lindqvist 2006 for an introduction to the subsequent literature). Imperfect repair models combine two basic ingredients, a hazard rate z (t ) of a new system together with a particular repair strategy which governs a so called virtual age process. The idea is that the virtual age of the system is reduced at repairs by a certain amount which depends on the repair strategy. The extreme cases are the perfect repair (renewal) models where the virtual age is set to 0 after each repair, and the minimal repair (NHPP) models where the virtual age is not reduced at repairs and hence always equals the actual age. Second, we have put some emphasis on the consideration of possible heterogeneity between systems of the same kind. Recall our Example 2 based on data from Bhattacharjee et al. (2003). The authors write in their conclusion: “The heterogeneity of failure behaviour of safety related components, such as valves in our case study, may have important implications for reliability analysis of safety systems. If such heterogeneity is not identified and taken into account, the decisions made to maintain or to enhance safety can be non-optimal or even erroneous. This non-optimality is more serious if the safety related decisions are made on the basis of failure histories of the components”. Still it is believed that heterogeneity has been neglected in many reliability applications. In fact, analyses of reliability data will often lead to an apparent decreasing failure rate which is counterintuitive in view of wear and ageing effects. Proschan (1963) pointed out that such observed decreasing rates could be caused by unobserved heterogeneity. 
Proschan presented failure data from 17 air conditioner systems on Boeing 720 airplanes, concluding that an HPP model was appropriate for each plane, but that the rates differed from plane to plane. This is a classical example of heterogeneity in reliability. If times between failures had been treated as independent and identically distributed across planes, the conclusion would have been that these times between failures had a decreasing failure rate. It has long been known in biostatistics that neglecting individual heterogeneity may lead to severe bias in estimates of lifetime distributions. The idea is that individuals have different “frailties”, and that those who are most “frail” will die or fail earlier than the others. This in turn leads to a decreasing population hazard, which has often been misinterpreted in the same manner as mentioned for the reliability applications. Important references on heterogeneity in the biostatistics literature are Vaupel et al. (1979), Hougaard (1984) and Aalen (1988). It should be noted that heterogeneity is in general unidentifiable if being considered an individual quantity. For identifiability it is necessary that frailty is common to several individuals, for example in family studies in biostatistics, or if several events are observed for each individual, such as for the repairable systems considered in this paper. The presence of heterogeneity is often apparent for data from repairable systems if there is a large variation in the number of events per system. However, it is not really possible to distinguish between heterogeneity and dependence of the intensity on past events for a single process.


The third point to be mentioned regards the use, or lack of use, of methods for competing risks in reliability applications. The following is a citation from Crowder (2004) appearing in the article on Competing Risks in the Encyclopedia of Actuarial Science: “If something can fail, it can often fail in one of several ways and sometimes in more than one way at a time. In the real world, the cause, mode, or type of failure is usually just as important as the time to failure. It is therefore remarkable that in most of the published work to date in reliability and survival analysis there is no mention of competing risks. The situation hitherto might be referred to as a lost case”. Fortunately, some work has been done recently in order to include competing risks in the study of repaired and maintained systems. Much of this work, partly reviewed in Section 10.3, has been motivated by the work of Cooke (1996) and his collaborators. His point of departure was formulated in the conclusion of Cooke (1996): “The main themes of Parts I and II of this article are that current RDB (Reliability Data Bank) designs: 1. are not giving RDB users what they need; 2. are not doing a good job of analyzing competing risk data; 3. are not doing a good job in handling uncertainty. Improvements in all these areas are possible. However, it must be acknowledged that the models and methods presented here merely scratch the surface. It is therefore appropriate to conclude with a summary of open issues...” The final section of the present chapter considers an example of an approach which in some sense generalizes the competing risks issue, namely using Markov chains to model failure mechanisms of various equipment. The chapter has mostly considered the modeling of repairable systems, with less mention of statistical methods. 
It is believed that much of future research on maintenance of repairable systems will still be centered around modeling, possibly with an increased emphasis on point process models including multiple types of events (see for example Doyen and Gaudoin 2006). More detailed models of the underlying failure and maintenance mechanisms may indeed be of great value for planning and optimization of maintenance actions. On the other hand, the new advances in modeling certainly lead to considerable statistical challenges. This point was touched on by Cooke (1996) as cited above, and it is clear that the information in reliability databases could and should be handled by more sophisticated methods than the ones that are traditionally used. Here there is much to learn from the biostatistics literature where there has for a long time been an emphasis on nonparametric methods and on regression methods using covariate information.

10.6 References

Aalen OO, (1988) Heterogeneity in survival analysis. Statistics in Medicine 7:1121–1137.
Andersen P, Borgan O, Gill R, Keiding N, (1993) Statistical Models Based on Counting Processes. Springer, New York.
Ascher H, Feingold H, (1984) Repairable Systems – Modeling, inference, misconceptions and their causes. Marcel Dekker, New York.
Bedford T, Cooke RM, (2001) Probabilistic Risk Analysis: Foundations and Methods. Cambridge University Press, Cambridge.


Bhattacharjee M, Arjas E, Pulkkinen U, (2003) Modeling heterogeneity in nuclear power plant valve failure data. In: Mathematical and Statistical Methods in Reliability (Lindqvist BH, Doksum KA, eds.) World Scientific Publishing, Singapore, pp 341–353.
Brown M, Proschan F, (1983) Imperfect repair. Journal of Applied Probability 20:851–859.
Cook RJ, Lawless JF, (2002) Analysis of repeated events. Statistical Methods in Medical Research 11:141–166.
Cooke RM, (1993) The total time on test statistics and age-dependent censoring. Statistics and Probability Letters 18:307–312.
Cooke RM, (1996) The design of reliability databases, Part I and II. Reliability Engineering and System Safety 51:137–146 and 209–223.
Crowder MJ, (2001) Classical competing risks. Chapman & Hall/CRC, Boca Raton.
Crowder MJ, (2004) Competing risks. In: Encyclopedia of actuarial science (Teugels JL, Sundt B, eds.) Wiley, Chichester, pp 305–313.
Crowder MJ, Kimber AC, Smith RL, Sweeting TJ, (1991) Statistical Analysis of Reliability Data. Chapman & Hall, Great Britain.
Doyen L, Gaudoin O, (2006) Imperfect maintenance in a generalized competing risks framework. Journal of Applied Probability 43:825–839.
Follmann DA, Goldberg MS, (1988) Distinguishing heterogeneity from decreasing hazard rate. Technometrics 30:389–396.
Hokstad P, Frøvig AT, (1996) The modelling of degraded and critical failures for components with dormant failures. Reliability Engineering and System Safety 51:189–199.
Hougaard P, (1984) Life table methods for heterogeneous populations: Distributions describing the heterogeneity. Biometrika 71:75–83.
Langseth H, Lindqvist BH, (2003) A maintenance model for components exposed to several failure mechanisms and imperfect repair. In: Mathematical and Statistical Methods in Reliability (Lindqvist BH, Doksum KA, eds.) World Scientific Publishing, Singapore, pp 415–430.
Langseth H, Lindqvist BH, (2006) Competing risks for repairable systems: A data study. Journal of Statistical Planning and Inference 136:1687–1700.
Lawless JF, (1987) Regression methods for Poisson process data. Journal of the American Statistical Association 82:808–815.
Lindqvist BH, (2006) On the statistical modelling and analysis of repairable systems. Statistical Science 21:532–551.
Lindqvist BH, Amundrustad H, (1998) Markov models for periodically tested components. In: Safety and Reliability. Proceedings of the European Conference on Safety and Reliability - ESREL ’98 (Lydersen S, Hansen GK, Sandtorv HA, eds.) AA Balkema, Rotterdam, pp 191–197.
Lindqvist BH, Langseth H, (2005) Statistical modelling and inference for component failure times under preventive maintenance and independent censoring. In: Modern Statistical and Mathematical Methods in Reliability (Wilson A, Limnios N, Keller-McNulty S, Armijo Y, eds.) World Scientific Publishing, Singapore, pp 323–337.
Lindqvist BH, Elvebakk G, Heggland K, (2003) The trend-renewal process for statistical analysis of repairable systems. Technometrics 45:31–44.
Lindqvist BH, Støve B, Langseth H, (2006) Modelling of dependence between critical failure and preventive maintenance: The repair alert model. Journal of Statistical Planning and Inference 136:1701–1717.
Meeker WQ, Escobar LA, (1998) Statistical methods for reliability data. Wiley, New York.
Nelson W, (1995) Confidence limits for recurrence data – applied to cost or number of product repair. Technometrics 37:147–157.
Peña EA, (2006) Dynamic modelling and statistical analysis of event times. Statistical Science 21:487–500.


Proschan F, (1963) Theoretical explanation of observed decreasing failure rates. Technometrics 5:375–383.
Rausand M, Høyland A, (2004) System reliability theory: Models, statistical methods, and applications. 2nd ed. Wiley-Interscience, Hoboken, N.J.
Ross SM, (1983) Stochastic Processes. Wiley, New York.
Taylor HM, Karlin S, (1984) An introduction to stochastic modeling. Academic Press, Orlando.
Vaupel JW, Manton KG, Stallard E, (1979) The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography 16:439–454.

11 Optimal Maintenance of Multi-component Systems: A Review

Robin P. Nicolai and Rommert Dekker

11.1 Introduction

Over the last few decades the maintenance of systems has become more and more complex. One reason for this is that systems consist of many components which depend on each other. On the one hand, interactions between components complicate the modelling and optimization of maintenance. On the other hand, interactions also offer the opportunity to group maintenance, which may save costs. It follows that planning maintenance actions is a big challenge and it is not surprising that many scholars have studied maintenance optimization problems for multi-component systems. In some articles new solution methods for existing problems are proposed, in other articles new maintenance policies for multi-component systems are studied. Moreover, the number of papers with practical applications of optimal maintenance of multi-component systems is still growing.

Cho and Parlar (1991) give the following definition of multi-component maintenance models: “Multi-component maintenance models are concerned with optimal maintenance policies for a system consisting of several units of machines or many pieces of equipment, which may or may not depend on each other (economically/stochastically/structurally).” So, these models are all about finding an optimal maintenance plan for systems consisting of components that interact with each other. We will come back later to the concepts of optimality and interaction. For now it is important to remember that the condition of the systems depends on (the state of) the components, which will only function if adequate maintenance actions are performed.

In this chapter we will give an up-to-date review of the literature on multi-component maintenance optimization. Let us start with a brief summary of the overview articles that have appeared in the past. Cho and Parlar (1991) review articles from 1976 to 1991. 
The authors divide the literature into five topical categories: machine-interference/repair models, group/block/cannibalization/opportunistic models, inventory/maintenance models, other maintenance/replacement models and inspection/maintenance models. Dekker et al. (1996) deal exclusively


with multi-component maintenance models that are based on economic dependence. Emphasis is put on articles that have been published after 1991, but there is an overlap with the review of Cho and Parlar (1991). The classification scheme of Dekker et al. (1996) differs from that of Cho and Parlar (1991). First, models are classified based on the planning aspect of the model: stationary (long-term) and dynamic (short-term). Second, the stationary-grouping models are divided into the categories grouping corrective maintenance, grouping preventive maintenance and opportunistic grouping maintenance. Here, opportunistic grouping is grouping both preventive and corrective maintenance. The dynamic grouping models are divided into two categories: those with a finite horizon and those with a rolling horizon. In a recent article Wang (2002) gives an overview of maintenance policies of deteriorating systems. The emphasis is on policies for single component systems. One section is devoted to opportunistic maintenance policies for multi-component systems. The author primarily considers models with economic dependence.

The existing review articles indicate that there are several ways to categorize articles and models. In Section 11.2 of this chapter we structure the field and present our comprehensive classification scheme. It differs from the schemes used in the review articles discussed earlier. First of all, we distinguish between models with economic, structural and stochastic dependence. Economic dependence implies that grouping maintenance actions either saves costs (economies of scale) or results in higher costs (because of, e.g. high down-time costs), as compared to individual maintenance. Stochastic dependence occurs if the condition of components influences the lifetime distribution of other components. Structural dependence applies if components structurally form a part, so that maintenance of a failed component implies maintenance of working components. 
In Sections 11.3–11.5, we discuss papers concerning economic, stochastic and structural dependence between components. In Section 11.6 we classify articles according to the planning aspect of the maintenance model and the method used to optimize the model. Following the review of Dekker et al. (1996) we distinguish between models with finite and infinite planning horizons. Models with an infinite planning horizon are called stationary, since they usually provide static rules for maintenance, which do not change over the planning horizon. Finite horizon models are called dynamic, since these models can generate dynamic decisions that may change over the planning horizon. In these models short-term information can be taken into account. With respect to the optimization methods, we divide the papers into three categories: exact, heuristic and policy optimization. Section 11.7 covers trends and open research areas in multi-component maintenance. Conclusions are drawn in Section 11.8.

11.2 Structuring the Field

In Section 11.2.1 we give a short review of the terminology used in multi-component maintenance optimization models and explain how we searched the literature. In Section 11.2.2 we present our comprehensive classification scheme.


11.2.1 Search Strategy and Terminology

Presenting a scientific review on a certain topic implies that one tries to discuss all relevant articles. Finding these articles, however, is very difficult. It depends on the search engines and databases used, electronic availability of articles and the search strategy. We used Google Scholar, Scirus and Scopus as search engines, and ScienceDirect, JStor and MathSciNet as (online) databases. We primarily searched on key words, abstracts and titles, but we also searched within the papers for relevant references. Note that papers published in books or proceedings that are not electronically available are likely not to have been identified.

Terminology is another important issue, as the use of other terms can hide a very interesting paper. The field has been delineated by maintenance, replacement or inspection on one hand and optimization on the other. This combination, however, provides almost 5000 hits in Google Scholar. Next, the term multi-component has been used in conjunction with related terms such as opportunistic maintenance (policies), piggyback(ing), joint replacement, joint overhaul, combining maintenance, grouping maintenance, economies of scale and economic dependence. With respect to the term stochastic dependence, we have also searched for synonyms and related terms such as failure interaction, probabilistic dependence and shock damage interaction. This yields approximately 500 hits. Relevant articles have been selected from this set by scanning the articles.

The vast literature on maintenance of multi-component systems has been reviewed earlier by others. Therefore, we have also consulted existing reviews and overview articles in this field. Moreover, we have applied a citation search (looking both backwards and forwards in time for citations) to all articles found. This citation search is an indirect search method, whereas the above methods are direct methods. 
The advantage of this method is that one can easily distinguish clusters of related articles.

11.2.2 Classification Scheme

First of all, we classify the multi-component maintenance models on the basis of the dependence/interaction between components in the system considered. Thomas (1986) defines three different types of interactions: economic, structural and stochastic dependence. Simply said, economic dependence implies that the cost of joint maintenance of a group of components does not equal the total cost of individual maintenance of these components. The effect of this dependence comes to the fore in the execution of maintenance activities. On the one hand, the joint execution of maintenance activities can save costs in some cases (e.g. due to economies of scale). On the other hand, grouping maintenance may also lead to higher costs (e.g. due to manpower restrictions) or may not be allowed. For this reason, we will subdivide the models with economic dependence into two categories: positive and negative economic dependence. That is, we refine the definition of economic dependence as compared to the definition used in the review article of Dekker et al. (1996). Note that in many systems both positive and negative economic dependence between


R. Nicolai and R. Dekker

components are present. We give special attention to the modelling of maintenance optimization of these systems, in particular the k-out-of-n system. Stochastic dependence occurs if the condition of components influences the lifetime distribution of other components. Synonyms of stochastic dependence are failure interaction or probabilistic dependence. This kind of dependence defines a relationship between components upon failure of a component. For example, it may be the case that the failure of one component induces the failure of other components or causes a shock to other components. Structural dependence applies if components structurally form a part, so that maintenance of a failed component implies maintenance of working components, or at least dismantling them. So, structural dependence restricts the maintenance manager in his decision on the grouping of maintenance activities. A second classification of the models is based on the planning aspect: stationary or dynamic. That is, do we make a short-term/operational or a long-term/ strategic planning for the maintenance activities? Is the planning horizon finite or infinite? In stationary models, a long-term stable situation is assumed and mostly these models assume an infinite planning horizon. Models of this kind provide static rules for maintenance, which do not change over the planning horizon. They generate for example long-term maintenance frequencies for groups of related activities or control-limits for carrying out maintenance depending on the state of components. In dynamic grouping models, short-term information such as a varying deterioration of components or unexpected opportunities can be taken into account. These models generate dynamic decisions that may change over the planning horizon. The last classification we consider is based on the type of optimization method used. This can be an exact method, a heuristic or a search within classes of policies. 
Exact optimization methods are designed to find the real optimal solution of a problem. However, if the computing time of the optimization method increases exponentially with the number of components, then exact methods are only desirable to a certain extent. In that case solving problems with many components is impossible and heuristics should be used. Heuristics are local optimization methods that do not pretend to find the global optimum, but can be applied to find a solution to the problem in reasonable time. The quality of such a solution depends on the problem instance. In some cases it is possible to give an upper bound on the gap between the optimal solution and the solution found by the heuristic. In many papers a maintenance planning is made by optimizing a certain type of policy. Well known maintenance policies are the age and block replacement policies and their extensions. The advantage of policy optimization over other optimization methods is that it gives more insight into the solution of the problem. Note that policy optimization will not always result in the global optimal solution, since there may be another policy that results in a better solution. In some cases however, it can be proved that applying a certain maintenance policy results in the exact (global) optimal solution.
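The remark that exact methods become impractical as the number of components grows can be illustrated with a small sketch. It enumerates every way of grouping n components into joint maintenance jobs and picks the cheapest grouping by brute force. The cost function `toy_group_cost` and all of its parameters are hypothetical and are not taken from any of the reviewed papers; they only mimic a shared set-up cost (positive dependence) combined with an overtime penalty for large groups (negative dependence).

```python
def partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        # Put `first` into each existing group in turn ...
        for i, group in enumerate(smaller):
            yield smaller[:i] + [[first] + group] + smaller[i + 1:]
        # ... or give it a group of its own.
        yield [[first]] + smaller


def count_groupings(n):
    """Number of candidate groupings an exhaustive search must inspect."""
    return sum(1 for _ in partitions(list(range(n))))


def toy_group_cost(size, setup=10.0, variable=2.0, crew_limit=2, overtime=8.0):
    """Hypothetical cost of maintaining `size` components in one job.

    One shared set-up cost, a variable cost per component, and an
    overtime penalty for every component beyond the crew limit.
    """
    extra = max(0, size - crew_limit)
    return setup + variable * size + overtime * extra


def best_grouping(n, group_cost):
    """Exact search: evaluate every grouping and keep the cheapest one."""
    best = None
    for grouping in partitions(list(range(n))):
        cost = sum(group_cost(len(g)) for g in grouping)
        if best is None or cost < best[0]:
            best = (cost, grouping)
    return best


if __name__ == "__main__":
    print([count_groupings(n) for n in range(1, 9)])
    print(best_grouping(4, toy_group_cost))
```

The number of groupings grows faster than exponentially in n, which is why the heuristics and policy classes mentioned above are attractive for larger systems, even for a cost model as simple as this one.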


11.3 Economic Dependence

In this section we review articles on multi-component systems with economic dependence. We focus on articles appearing since the review of Dekker et al. (1996). In Sections 11.3.1 and 11.3.2 we discuss models with positive and negative dependence, respectively. In Section 11.3.3 we discuss articles on k-out-of-n systems, in which both positive and negative dependence between components are present.

11.3.1 Positive Economic Dependence

Positive economic dependence implies that costs can be saved when several components are jointly instead of separately maintained. Compared with the review of Dekker et al. (1996) we refine the concept of (positive) economic dependence and distinguish the following forms:

• Economies of scale
  – General
  – Single set-up
  – Multiple set-ups
    o Hierarchy of set-ups
• Downtime opportunity

The term economies of scale is often used to indicate that combining maintenance activities is cheaper than performing maintenance on components separately. The term economies of scale is very general and it seems to be similar to positive economic dependence. In this chapter we will speak of economies of scale when the maintenance cost per component decreases with the number of maintained components.

Economies of scale can result from preparatory or set-up activities that can be shared when several components are maintained simultaneously. The cost of this set-up work is often called the set-up cost. Set-up costs can be saved when maintenance activities on different components are executed simultaneously, since execution of a group of activities requires only one set-up. In this overview we distinguish between single set-ups and multiple set-ups. In the latter case there usually is a hierarchy of set-ups. For instance, consider a system consisting of two components, which both consist of two subcomponents. Maintenance of the subcomponents of the components may require a set-up at system level and component level.
First, this means that the set-up cost at component level is paid only once when the maintenance of two subcomponents of a component is combined. Second, the set-up cost at system level is paid only once when all subcomponents are maintained at the same time. Set-up costs usually come back in the objective function of the maintenance problem. If economies of scale are not explicitly modelled by including set-up costs in the objective function, then we classify the model in the category ‘general’. Another form of positive dependence is the downtime opportunity. Component failures can often be regarded as opportunities for preventive maintenance of nonfailed components. In a series system a component failure results in a non-operating


system. In that case it may be worthwhile to replace other components preventively at the same time. This way the system downtime results in cost savings since more components can be replaced at the same time. Moreover, by grouping corrective and preventive maintenance the downtime can be regulated and in some cases it can even be reduced. Note that if the downtime cost is included in the set-up cost in a certain paper, then we will not classify the paper in the category ‘downtime opportunity’, but in the category ‘set-up cost’. In general, however, it is difficult to assess the cost associated with the downtime (see, e.g. Smith and Dekker (1997), who approximate the availability and the cost of downtime for a 1-out-of-n system). Therefore, the downtime cost is usually not included in the set-up cost.

In the paragraphs below we discuss articles dealing with positive economic dependence. Our main focus is on the modelling of this dependence.

11.3.1.1 Economies of Scale

General

In comparison with Dekker et al. (1996) the category ‘general economies of scale’ is new. The papers in this category deal with multi-component systems for which joint maintenance of components is cheaper than individual maintenance of components. This form of economies of scale cannot be modelled by introducing a single set-up cost. The cost associated with the maintenance of components is often concave in the number of components that are maintained simultaneously.

Dekker et al. (1998a) evaluate a new maintenance concept for the preservation of highways. In road maintenance cost savings can be realized by maintaining larger sections instead of small patches. The road is divided into sectors of 100-m length. Set-up costs are present in the form of the direct costs associated with the maintenance of different parts of the road. The set-up cost is a function of the number of these parts in a maintenance group. A heuristic search procedure is proposed to find the optimal maintenance planning.
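The link between a concave cost function and economies of scale, as defined earlier in this chapter (a decreasing cost per component), can be made explicit with a short derivation. The symbols C, m and n are introduced here for illustration only, and the assumption C(0) = 0 (no cost when nothing is maintained) is ours, not a statement from the reviewed papers.

```latex
% Assumptions (illustrative, not from the chapter): C(n) is the cost of
% jointly maintaining n components, C is concave on [0, \infty), C(0) = 0.
\[
C(m) \;=\; C\!\left(\tfrac{m}{n}\, n + \left(1 - \tfrac{m}{n}\right) 0\right)
\;\ge\; \tfrac{m}{n}\, C(n) + \left(1 - \tfrac{m}{n}\right) C(0)
\;=\; \tfrac{m}{n}\, C(n), \qquad 0 < m \le n .
\]
% Dividing by m gives a non-increasing cost per component:
\[
\frac{C(m)}{m} \;\ge\; \frac{C(n)}{n}, \qquad 0 < m \le n .
\]
% Applying the first inequality to m and to n with m + n as the larger argument:
\[
C(m) + C(n) \;\ge\; \tfrac{m}{m+n}\, C(m+n) + \tfrac{n}{m+n}\, C(m+n) \;=\; C(m+n) .
\]
```

Under these assumptions the cost per component never increases with the group size, and maintaining two groups together never costs more than maintaining them separately, which is the sense in which such models express positive economic dependence without an explicit set-up cost.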
Papadakis and Kleindorfer (2005) introduce the concept of network topology dependencies (NTD) for infrastructure networks. In these networks two types of NTD can be distinguished: contiguity and set-up discounts. Both types define positive economic dependence between components. In the former case savings are realized when costs are paid once when contiguous sections are maintained at the same time. In the latter case savings are realized when costs are paid once for a neighbourhood of the infrastructure network, independently of how much work is carried out on it. For both types of dependencies a non-linear discount function is defined. The authors consider the problem of maintaining an infrastructure network. It is modelled as an undirected network. Risk measures or failure probabilities for the segments of this network are assumed to be known. A maximum flow minimum cut formulation of the problem is developed. This formulation makes it easier to solve the problem exactly and efficiently.

Single Set-up

Nearly all articles reviewed by Dekker et al. (1996) fall into this category. The objective function of the maintenance optimization model usually consists of a


fixed cost (the set-up cost) and variable costs. In the articles discussed below, this will not be different. Castanier et al. (2005) consider a two-component series system. Economic dependence between the two components is present in the following way. The setup cost for inspecting or replacing a component is charged only once if the actions on both components are combined. That is, joint maintenance of components saves costs. In this article the condition of the components is modelled by a stochastic process and it is monitored by non-periodic inspections. In the opportunistic maintenance policy several thresholds are defined for doing inspections, corrective and preventive replacements, and opportunistic maintenance. These thresholds are decision variables. Many articles on this type of models have appeared, but most of these articles only consider single component models. The articles of Scarf and Deara (1998, 2003) consider both economic and stochastic dependence between components in a series system. This combination is scarce in the literature. Positive economic dependence is modelled on the basis that the cost of replacement of one or more components includes a one-off set-up cost whose magnitude does not depend on the number of components replaced. We will discuss these articles in more detail in Section 11.4. In one of the few case studies found in the literature, Van der Duyn Schouten et al. (1998) investigate the problem of replacing light bulbs in traffic control signals. Each installation consists of three compartments for the green, red, and yellow lights. Maintenance of light bulbs means replacement, either correctively or preventively. First, positive economic dependence is present in the form of set-up cost, because each replacement action requires a fixed cost in the form of transportation of manpower and equipment. Second, the failure of individual bulbs is an opportunity for doing preventive maintenance on other bulbs. 
The authors propose two types of maintenance policies. In the first policy, also known as the standard indirect-grouping strategy (introduced in maintenance by Goyal and Kusy 1985; for a review of this strategy we refer to Dekker et al. 1996), corrective and preventive replacements are strictly separated. Economies of scale can thus only be achieved by combining preventive replacements of the bulbs. The authors also propose the following opportunistic age-based grouping policy. Upon failure of a light bulb, the failed bulbs and all other bulbs older than a certain age are replaced. Budai et al. (2006) consider a preventive maintenance scheduling problem (PMSP) for a railway system. In this problem (short) routine activities and (long) unique projects for one track have to be scheduled in a certain period. To reduce costs and inconvenience for the travellers and operators, these activities should be scheduled together as much as possible. With respect to the latter, maintenance of different components of one track simultaneously requires only one track possession. Time is discretized and the PMSP is written as a mixed-integer linear programming model. Positive dependence is taken into account by the objective function, which is the sum of the total track possession cost and the maintenance cost over a finite horizon. To reduce possible end-of-horizon effects an end-of-horizon valuation is also incorporated in the objective function. Note that the possession cost can be seen as a downtime cost. The cost is modelled as a fixed/ set-up cost. This is the reason that it is classified in this category. Besides this positive dependence there also exists negative dependence between components, since some activities exclude each other.


The advantage of a discrete time model is that negative dependence can be incorporated in the model by adding additional restrictions. It appears that the PMSP is an NP-hard problem. Heuristics are proposed to find near-optimal solutions in reasonable time.

Multiple Set-ups

This is also a new category. The maintenance of different components may require different set-up activities. These set-up activities may be combined when several components are maintained at the same time. We have found one article in this category; it assumes a complex hierarchical set-up structure.

Van Dijkhuizen (2000) studies the problem of clustering preventive maintenance jobs in a multiple set-up multi-component production system. As far as the authors know, this is the first attempt to model a maintenance problem with a hierarchical (tree-like) set-up structure. Different set-up activities have to be done at different levels in the production system before maintenance can be done. Each component is maintained preventively at an integer multiple of a certain basic interval, which is the same for all components, and corrective maintenance is carried out in between whenever necessary. So, every component has its own maintenance frequency; the frequencies are based on the optimal maintenance planning for single components. Obviously, set-up activities may be combined when several components are maintained at the same time. The problem is to find the maintenance frequencies that minimize the average cost per unit of time. This problem is an extension of the standard indirect-grouping problem (for an overview of this problem see Dekker et al. 1996).

11.3.1.2 Downtime Opportunity

As we stated earlier, the downtime of a system is often an opportunity to combine preventive and corrective maintenance. This is especially true for series systems, where a single failure results in a system breakdown.
Of course, non-failed components should not be replaced when they are in a good condition, because useful lifetime can be wasted. The maintenance policies proposed in the articles discussed below use this idea. Gürler and Kaya (2002) propose a new opportunistic maintenance policy for a series system with identical items. The article is an extension of the work by Van der Duyn Schouten and Vanneste (1993), who also propose an opportunistic policy for such a system. In their model, the lifetime of the components is described by several stages, which are classified as good, doubtful, preventive maintenance due and failed. Gürler and Kaya (2002) classify the stages in the same way, but the stages good and doubtful are subdivided into a number of states. The proposed policy is of the control-limit type. Components which are PM due (failed) are preventively (correctively) replaced immediately. The entire system is replaced when a component is PM due or down and the number of components in doubtful states is at least N. Here, N is a decision variable. It appears that this policy achieves significant savings over a policy where the components are maintained individually without any system replacement. Popova and Wilson (1999) consider m-failure, T-age and (m,T) failure group policies for a system of identical components operating in parallel. According to


these policies the system is replaced at the time of the m-th failure, every T time units, and at the minimum time of these events, respectively. These policies were first introduced by Assaf and Shanthikumar (1987), Okumoto and Elsayed (1983) and Ritchken and Wilson (1990), respectively. Popova and Wilson (1999) assume that downtime costs are incurred when failed components are not repaired or replaced. So, when the system operates there is also negative dependence between the components. After all, when the components are left in a failed condition, with the intention to group corrective maintenance, then downtime costs are incurred. In the maintenance policies a trade-off between the downtime costs and the advantages of grouping (corrective) maintenance is made. Sheu and Jhang (1996) propose a new two-phase opportunistic maintenance policy for a group of independent identical repairable units. Their model takes into account downtime costs and the maintenance policy includes minimal repair, overhaul, and replacement. In the first phase, (0,T], minor failures are removed by minimal repairs and ‘catastrophic’ failures by replacements. In the second phase, (T,T+W], minor failures are also removed by minimal repairs, but ‘catastrophic’ failures are left idle. Group maintenance is conducted at time T+W or upon the k-th idle, whichever comes first. The generalized group maintenance policy requires inspection at either the fixed time T+W or the time when exactly k units are left idle, whichever comes first. At an inspection, all idle components are replaced with new ones and all operating components are overhauled so that they become as good as new. Higgins (1998) studies the problem of scheduling railway track maintenance activities and crews. In this problem positive economic dependence is present in the following way. The occupancy of track segments due to maintenance prevents all train movements on those segments. 
The costs associated with this can be regarded as downtime costs. The maintenance scheduling problem is modelled as a large scale 0-1 programming problem with many (non-linear) restrictions. The objective is to minimize expected interference delay with the train schedule and prioritized finishing time. The downtime costs are modelled by including downtime probabilities in the objective function. The author proposes tabu search to solve the problem. The neighbourhood, which plays a prominent role in local search techniques, is easily defined by swapping the order of activities or maintenance crews.

The article of Sriskandarajah et al. (1998) discusses the maintenance scheduling of rolling stock. Multiple train units have to be overhauled before a certain due date. The aim is to find a suitable common due date for each train so that the due dates of individual units do not deviate too much from the common due date. Maintenance carried out too early or too late is costly since this may cause loss of use of a train. A genetic algorithm is proposed to solve this scheduling problem.

11.3.2 Negative Economic Dependence

Negative economic dependence between components occurs when maintaining components simultaneously is more expensive than maintaining components individually. There can be several reasons for this:


• Manpower restrictions
• Safety requirements
• Redundancy/production-loss

First, grouping maintenance results in a peak in manpower needs. Manpower restrictions may even be violated and additional labour needs to be hired, which is costly. The problem here is to find the balance between workload fluctuation and grouping maintenance. Second, there are often restrictions on the use of equipment when executing maintenance activities simultaneously. For instance, use of equipment may hamper use of other equipment and cause unsafe operations. Legal and/or safety requirements often prohibit joint operation. Third, joint (corrective) maintenance of components in systems in which some kind of redundancy is available may not be beneficial. Although there may exist economies of scale through simultaneous repair of a number of (identical) components, leaving components in a failed condition for some time increases the risk of costly production losses. We will come back to this in Section 11.3.3.

Production loss may increase more than linearly with the number of components out of operation. For an example of this type of economic dependence we refer to Stengos and Thomas (1980). The authors give an example of the maintenance of blast furnaces. The more furnaces are out of operation, the larger the disturbance due to maintenance. That is, the cost of overhauling the furnaces increases more than linearly with the number of furnaces out of action.

It appears that maintenance of systems with negative dependence is often modelled in discrete time. The models can be regarded as scheduling problems with many restrictions. These restrictions can easily be incorporated in discrete time models such as (mixed) integer programming models. With respect to these models, there is always the question whether the exact solution can be found efficiently. In other words, the question arises whether the problem is NP-hard. An example of discrete time modelling is given by the article of Grigoriev et al. (2006).
In this article the so-called periodic maintenance problem (PMP) is studied. In this problem machines have to be serviced regularly to prevent costly production losses. The failures causing these production losses are not modelled. Time is discretized into unit-length periods. In each period at most one machine can be serviced. Apparently, negative economic dependence in the form of manpower restrictions or safety measures plays a role in the maintenance of the machines. The problem is to find a cyclic maintenance schedule of a given length T that minimizes total service and operating costs. The operating costs of a machine increase linearly with the number of periods elapsed since the last servicing of that machine. PMP appears to be an NP-hard problem and the authors propose a number of solution methods. This leads to the first exact solutions for larger sized problems.

In Stengos and Thomas (1980) time is also discretized but the maintenance problem, scheduling the overhaul of two pieces of equipment, is set up as a Markov decision process. The pieces can be in different states and the probability of failure increases with the time since the last overhaul. So, in contrast to the problem of Grigoriev et al. (2006), pieces can fail during operation. Negative economic dependence is modelled as follows. The cost of overhauling the pieces


increases more than linearly with the number of pieces out of action. The objective is to minimize the ‘loss of production’ cost, which is incurred when a piece is overhauled. The optimal policy is found by a relative value successive approximation algorithm.

In Langdon and Treleaven (1997) the problem of scheduling maintenance for electrical power transmission networks is studied. There is negative economic dependence in the network due to redundancy/production-loss. Grouping certain maintenance activities in the network may prevent a cheap electricity generator from running, so requiring a more expensive generator to be run in its place. That is, some parts of the network should not be maintained simultaneously. These exclusions are modelled by adding restrictions to the MIP formulation of the problem. The authors propose several genetic algorithms and other heuristics to solve the problem.

11.3.3 k-out-of-n Systems

In this section we discuss the different dependencies in the k-out-of-n system in more detail. This system is a typical example of a system with both positive and negative economic dependence between components. A k-out-of-n system functions if at least k components function. If k = 1, then it is a parallel system; if k = n, then it is a series system. Let us for the moment distinguish between the cases k = n and k < n.

In the series system (k = n), there is positive economic dependence due to downtime opportunities. The failure of one component results in an expensive downtime of the system and this time can be used to group preventive and corrective maintenance. Negative economic dependence is not explicitly present in the series system.

If k < n, then there is redundancy in the system and it fails less often than its individual components. This way a specified reliability can be guaranteed. Typically, the components of this system are identical which allows for economies of scale in the execution of maintenance activities.
It is not only possible to obtain savings by grouping preventive maintenance, but also by grouping corrective maintenance. Note that the latter form of grouping is not advantageous in series systems. In other words, the redundant components introduce additional positive dependence in the system. Whereas positive economic dependence is present upon failure of a component, negative economic dependence plays a role as long as the system operates. A single failure of a component may not always be an opportunity to combine maintenance activities. First, grouping corrective and preventive maintenance upon the failure of the component increases the probability of system failure and costly production losses. Second, leaving components in a failed condition for some time, with the intention to group corrective maintenance at a later stage, has the same effect. So, there is a trade-off between the potential loss resulting from a system failure and the benefit of joint maintenance. One problem of optimizing (age-based) maintenance in k-out-of-n systems is the determination of downtime costs, as a failure does not directly result in system failure. Smith and Dekker (1997) derive the uptime, downtime and costs of maintenance in a 1-out-of-n system (with cold standby), but in general it is very difficult


to assess the availability and the downtime costs of a k-out-of-n system. In their article, Smith and Dekker (1997) optimize the following age-replacement policy. A component is taken out for preventive maintenance and replaced by a stand-by one, if its age has reached a certain value Tpm. Moreover, they determine the number of redundant components needed in the system.

In the maintenance policies considered in the articles below, an attempt is made to balance the negative aspects of downtime costs and the positive aspects of grouping (corrective) maintenance. The opportunistic maintenance policies proposed in these articles are age-based and also contain a threshold for the number of failures (except for the policy introduced by Sheu and Kuo 1994).

In Dekker et al. (1998b) the maintenance of light standards is studied. A light standard consists of n independent and identical lamps screwed on a lamp assembly. To guarantee a minimum luminance, the lamps are replaced if the number of failed lamps reaches a pre-specified number m. In order to replace the lamps the assembly has to be lowered. This set-up activity is an opportunity to combine corrective and preventive maintenance. Several opportunistic age-based variants of the m-failure group replacement policy (in its original form only corrective maintenance is grouped) are considered in this paper. Simulation optimization is used to determine the optimal opportunistic age threshold.

Pham and Wang (2000) introduce imperfect PM and partial failure in a k-out-of-n system. They propose a two-stage opportunistic maintenance policy for the system. In the first stage failures are removed by minimal repair; in the second stage failed components are jointly replaced with operating components when m components have failed, or the entire system is replaced at time T, whichever occurs first. Positive economic dependence is of an opportunistic nature. Joint maintenance requires less time than individual maintenance.
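The group replacement rules that recur in this section (replace at the m-th failure, at a fixed age T, or at whichever of the two comes first) can be sketched in a few lines. This is a simplified illustration, not the model of any particular paper: it assumes that all components start as new at time 0, that failed components stay failed until the group replacement, and that there is no minimal repair. The function name and the example lifetimes are hypothetical.

```python
def group_replacement_time(failure_times, m=None, T=None):
    """Time of the group replacement within one renewal cycle.

    failure_times : failure time of each component (all new at time 0)
    m             : replace at the m-th failure (None switches this rule off)
    T             : replace at age T (None switches this rule off)
    With both m and T given, the earlier of the two events triggers
    the replacement, as in an (m, T) policy.
    """
    candidates = []
    if m is not None:
        ordered = sorted(failure_times)
        if 1 <= m <= len(ordered):
            candidates.append(ordered[m - 1])  # time of the m-th failure
    if T is not None:
        candidates.append(T)
    if not candidates:
        raise ValueError("no replacement rule applies: check m and T")
    return min(candidates)


def failed_at(failure_times, t):
    """Number of components that have failed by time t."""
    return sum(1 for x in failure_times if x <= t)


if __name__ == "__main__":
    lifetimes = [4.0, 9.5, 2.5, 7.0]  # hypothetical sample of lifetimes
    for rule in ({"m": 2}, {"T": 3.0}, {"m": 2, "T": 3.0}, {"m": 2, "T": 6.0}):
        t = group_replacement_time(lifetimes, **rule)
        print(rule, "-> replace at", t, "with", failed_at(lifetimes, t), "failed")
```

Comparing the number of failed components at the replacement time across the rules shows the trade-off discussed above: waiting for more failures allows more corrective work to be grouped, but components spend more time in a failed state.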
Sheu and Kuo (1994) introduce a general age replacement policy for a k-out-of-n system. Their model includes minimal repair, planned and unplanned replacements, and general random repair costs. The system is replaced when it reaches age T. The long-run expected cost rate is obtained. The aim of the paper is to find the optimal age replacement time T that minimizes the long-run expected cost per unit time of the policy.

The article of Sheu and Liou (1992) will be discussed in Section 11.4, because they assume stochastic dependence between the components of a k-out-of-n system.
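The definition of the k-out-of-n system used in this section (the system functions if at least k of its n components function) leads to a simple reliability formula when the components are identical and fail independently. The sketch below assumes exactly that; the independence assumption is ours and does not hold for the models with stochastic dependence treated in Section 11.4. The function name and the symbol p (probability that a single component functions) are introduced here for illustration.

```python
from math import comb


def k_out_of_n_reliability(k, n, p):
    """Probability that at least k of n independent, identical
    components function, each functioning with probability p."""
    if not (1 <= k <= n):
        raise ValueError("require 1 <= k <= n")
    if not (0.0 <= p <= 1.0):
        raise ValueError("p must be a probability")
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    p = 0.9
    print("series   (3-out-of-3):", k_out_of_n_reliability(3, 3, p))
    print("2-out-of-3           :", k_out_of_n_reliability(2, 3, p))
    print("parallel (1-out-of-3):", k_out_of_n_reliability(1, 3, p))
```

The two limiting cases reproduce the familiar results: for k = n the expression reduces to p^n (series system) and for k = 1 to 1 - (1 - p)^n (parallel system), which shows how redundancy makes the system fail less often than its individual components.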

11.4 Stochastic Dependence

In the survey of Thomas (1986) multi-component maintenance models with stochastic dependence are considered as a separate class of models. In the more recent review articles this is not the case. In Cho and Parlar (1991) some articles dealing with failure interaction are discussed, but the modelling of failure interaction between components is not. In Wang (2002) nothing is said about systems with failure interaction; articles on this kind of systems only appear in the references. Actually, this is the first publication, since the survey of Thomas (1986), to give a comprehensive review of multi-component maintenance models with stochastic dependence. We do not aim to give solely a list of papers that have appeared.


Instead, we want to give insight into the different ways of modelling failure interaction between components and explain the implications of certain approaches and assumptions with respect to practical applicability. Stochastic dependence, also referred to as failure interaction or probabilistic dependence, implies that the state of components can influence the state of the other components. Here, the state can be given by the age, the failure rate, state of failure or any other condition measure. In their seminal work on stochastic dependence, Murthy and Nguyen (1985b) introduce three different types of failure interaction in a two-component system. Type I failure interaction implies that the failure of a component can induce a failure of the other component with probability p (q), and has no effect on the other component with probability 1 – p (1 – q). It follows that there are two types of failures: natural and induced. The natural failures are modelled by random variables and the induced failures are characterized by the probabilities p and q. In Murthy and Nguyen (1985a) the authors extend type I failure interaction to systems with multiple components. It is assumed that whenever a component fails it induces a total failure of the system with probability p and has no effect on the other components with probability (1 – p). In this chapter we will consider this to be the definition of type I failure interaction. Type II failure interaction in a two-component system is defined as follows. The failure of component 2 can induce a failure of component 1 with probability q, whereas every failure of component 1 acts as a shock to component 2, without inducing an instantaneous failure, but affecting its failure rate. Type III failure interaction implies that the failure of each component affects the failure rate of the other component. That is, every failure of one of the components acts as a shock to the other component. 
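The multi-component version of type I failure interaction described above (each component failure induces a total system failure with probability p and otherwise leaves the other components untouched) can be explored with a small simulation. It is only a sketch: it assumes that successive failures induce total failure independently of each other and ignores the timing of the natural failures. The function names, the seed and the number of runs are hypothetical choices.

```python
import random


def failures_until_total(p, rng):
    """Count component failures up to and including the first one
    that induces a total system failure (probability p per failure)."""
    if not (0.0 < p <= 1.0):
        raise ValueError("p must lie in (0, 1]")
    count = 1
    # rng.random() is uniform on [0, 1), so the loop continues with
    # probability 1 - p, i.e. when the failure has no further effect.
    while rng.random() >= p:
        count += 1
    return count


def mean_failures_until_total(p, runs=20000, seed=1):
    """Monte Carlo estimate of the mean number of failures per total failure."""
    rng = random.Random(seed)
    return sum(failures_until_total(p, rng) for _ in range(runs)) / runs


if __name__ == "__main__":
    for p in (0.1, 0.25, 0.5, 1.0):
        print(p, mean_failures_until_total(p), "theory:", 1.0 / p)
```

Under the stated independence assumption the count is geometrically distributed with mean 1/p, so a small p means that most failures are isolated single-component failures, while a large p makes total system failures dominant.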
A potential problem of the failure rate interaction defined by the last two types is determining the size of the shock. In practice it is very difficult to assess the effect of a failure of one component on the failure rate of another component. Usually there is not much data on the course of the failure rate of a component after the occurrence of a shock. Shocks can also be modelled by adding a (random) amount of damage to the state of another component. Natural failures then occur if the state of a component (measured by the cumulative damage) exceeds a certain level. In this chapter we will bring this modelling of type II and III failure interaction together in one definition. That is, we renew the definition of type II failure interaction for multi-component systems. It reads as follows. The system consists of several components and the failure of a component either affects the failure rate of, or causes a (random) amount of damage to the state of, one or more of the remaining components. It follows that we regard a mixture of induced failures and shock damage as type II failure interaction. Models with type II failure interaction will also be called shock damage models.

In general, the maintenance policies considered in the literature on stochastic dependence are mainly of an opportunistic nature, since the failure of one component is potentially harmful for the other component(s). Modelling failure interaction appears to be quite elaborate. Therefore, most articles only consider two-component systems.

Below we review the articles on failure interaction in the following order. First, we will discuss the type I interaction models. For this type of interaction different opportunistic versions of the well-known age and block replacement policies have been proposed. Second, the articles on type II interaction will be reviewed. We will see that in most of these articles the occurrence of shocks is modelled as a non-homogeneous Poisson process (NHPP) or that the failure rate of components is adjusted upon failure of other components. Third, we pay attention to articles that consider both types of failure interaction. Finally, we discuss other forms of modelling failure interaction.

11.4.1 Type I Failure Interaction

Murthy and Nguyen (1985a) consider two maintenance policies in a multi-component system with type I failure interaction. Under the first policy all failed components are replaced by new ones. When there is no total system failure, then only the single failed component is replaced. Under the second policy all components, including the functioning component(s), are replaced. When there is no total system failure, then the single failed component is subjected to minimal repair and made operational. The failure rate of the failed component after repair is the same as that just before failure. The authors deduce both the expected cost of keeping the system operational for a finite time period and the expected cost per unit time of keeping the system operational for an infinite time period.

Sheu and Liou (1992) consider an optimal replacement policy for a k-out-of-n system subject to shocks. Shocks arrive according to an NHPP. The system is replaced preventively whenever it reaches age T > 0 at a fixed cost $c_0$. If the m-th shock arrives at age $S_m < T$, it can cause the simultaneous failure of i components with probability $p_i(S_m)$ for i = 0, 1, ..., n, where

$$\sum_{i=0}^{n} p_i(S_m) = 1.$$

If i ≥ k, then the k-out-of-n system is replaced by a new one at a cost $c_\infty$ (unplanned failure replacement). So, the downtime is used to replace all components. If 0 ≤ i < k, then the system is minimally repaired with cost $c_i(S_m)$. After a complete replacement (either a planned or a failure replacement), the shock process is set to zero. All failures subject to shocks are assumed to be instantly detected and repaired. The aim of the paper is to find the optimal T that minimizes the long run expected cost per unit time of the maintenance policy.

The articles of Scarf and Deara (1998, 2003) consider failure-based, (opportunistic) age and (opportunistic) block replacement policies for a labelled two-component series system with type I failure interaction. The articles can be seen as an extension of the article of Murthy and Nguyen (1985b) on failure-based replacement for such systems. Note that since we deal with a series system, the failure of either component causes a system downtime. So, if the system is down, this does not necessarily mean that both components have failed. Economic dependence is modelled on the basis that the cost of replacement of one or more components includes a one-off set-up cost whose magnitude does not depend on the number of components replaced. The maintenance policies considered in Scarf and Deara (1998) are of the age-based replacement type: replace a component on failure or at age T, whichever is sooner. Failure-based maintenance is viewed as the limiting case (T → ∞) of age-based replacement. As there is also economic dependence between components,


the authors consider opportunistic age-based replacement policies: replace a component on failure or at age T or at age T' < T if an opportunity exists.

The policies considered in Scarf and Deara (2003) are of the block replacement type and are extended for two-component systems. The independent block replacement policy is a single component policy and it is of the following form: replace all failed components, replace component 1 at times k∆1, k = 1, 2,... and replace component 2 at times k∆2, k = 1, 2,... . Block replacement can be grouped: replace failed components and replace the system at times k∆, k = 1, 2,... . It can also be combined: replace both components (whether failed or not) on failure of the system and replace the system at times k∆, k = 1, 2,... . In modified block replacement policies for a two-component system, a component is only replaced at the block replacement times if its age is greater than some critical value. The block replacement times may be independent or grouped, or the components may be combined.

Opportunistic modified block replacement policies are of the form: on failure of component 1, if the age of component 2, τ2, is greater than b2′, then replace both components; otherwise just replace component 1. On failure of component 2, if the age of component 1, τ1, is greater than b1′, then replace both components; otherwise just replace component 2. At block replacement times for component 1, k∆1, k = 1, 2,..., replace component 1 if τ1 > b1 and replace component 2 if τ2 > b2′; at block replacement times for component 2, k∆2, k = 1, 2,..., replace component 2 if τ2 > b2 and replace component 1 if τ1 > b1′ (for suitably chosen thresholds b1, b2, b1′ and b2′).

In both articles the maintenance policies are considered in the context of the clutch system used in a bus fleet. This system consists of the clutch assembly (component 2) and the clutch controller (component 1).
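The opportunistic modified block replacement rule described above is essentially a pair of threshold tests, which can be sketched as follows. The function and parameter names are hypothetical; `b1_opp` and `b2_opp` stand for the opportunistic thresholds b1′ and b2′, and the sketch only returns which components to replace, without modelling costs or failure times.

```python
def replace_on_failure(failed, age1, age2, b1_opp, b2_opp):
    """Components to replace when component `failed` (1 or 2) fails.

    The failed component is always replaced; the other one is replaced
    opportunistically if its age exceeds its primed threshold.
    """
    if failed == 1:
        return {1, 2} if age2 > b2_opp else {1}
    if failed == 2:
        return {1, 2} if age1 > b1_opp else {2}
    raise ValueError("failed must be 1 or 2")


def replace_at_block_time(owner, age1, age2, b1, b2, b1_opp, b2_opp):
    """Components to replace at a block replacement time of `owner`.

    At a block time of component 1 the thresholds are (b1, b2'),
    at a block time of component 2 they are (b1', b2).
    """
    if owner == 1:
        limit1, limit2 = b1, b2_opp
    elif owner == 2:
        limit1, limit2 = b1_opp, b2
    else:
        raise ValueError("owner must be 1 or 2")
    to_replace = set()
    if age1 > limit1:
        to_replace.add(1)
    if age2 > limit2:
        to_replace.add(2)
    return to_replace
```

For example, `replace_on_failure(1, age1=3.0, age2=5.0, b1_opp=4.0, b2_opp=4.5)` returns both components, because component 2 is older than its opportunistic threshold when component 1 fails.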
Actually, the failure of the controller causes a failure of the assembly with probability 1 and the failure of the assembly causes a failure of the controller with probability 0. It is important to mention that the maintenance policies are not only compared on the basis of cost, but also on ease of implementation and system reliability. It is found that an age-based policy is best, but since this implies that component ages have to be monitored, the authors propose to implement a block or modified block policy. Combined modified block replacement seems to be the best alternative for the clutch system under consideration. Combining maintenance of components has the advantage that the system is in general more reliable, although the long run costs per unit time are higher. The economic gains from using a complex policy have to be weighed up against the additional investment required to implement such policies.

Jhang and Sheu (2000) address the problem of analyzing preventive maintenance policies in a multi-component system with type I failure interaction. The i-th component (1 ≤ i ≤ N) has two types of failures. Type 1 failures are minor failures and are rectified through minimal repair. Type 2 failures are catastrophic failures and induce a total failure of the system (i.e. failure of all other components in the system). Type 2 failures are removed by an unplanned/unscheduled replacement of the system. The model takes into account costs for minimal repairs, replacements and preventive maintenance. Generalized age and block replacement policies are proposed. The age replacement policy implies preventive replacement of all components whenever an operating system reaches age T. In the case of a block replacement policy the system is preventively replaced every T years. The expected long-run cost per unit time for each policy is derived and it is discussed how the optimal T can be determined. Various special cases are discussed in detail. Finally, the authors mention the application of their model to the maintenance of mining cables used for hoisting loads.

11.4.2 Type II Failure Interaction

Satow and Osaki (2003) consider a two-component parallel system. Component 1 is repairable and at failure minimal repair is done. Failures of component 1 occur according to an NHPP. Whenever the component fails it induces a random amount of damage to component 2. The damage is additive and component 2 fails whenever the total damage exceeds a certain failure level. A system failure always occurs whenever component 2 fails, because both components fail simultaneously. By assumption component 2 is not repairable. This means that a failed system needs to be replaced by a new one. Since preventive replacement is cheaper than failure replacement, a two-parameter preventive replacement policy is analyzed. The policy takes into account both system age and the total damage of component 2. The system is replaced preventively whenever the total damage of component 2 exceeds k or at time T and it is replaced correctively at system failures. An expression for the expected cost per unit time for long run operation is derived and the policy is optimized analytically for two special cases (the one-parameter policies). Numerical examples show that the policy imposing a limit on the total damage (k) of component 2 outperforms the age T policy. It appears that the two-parameter preventive maintenance policy does not necessarily lead to lower expected costs. This is because in this model the state of component 2 is best indicated by the total damage and its age does not provide any additional information.
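A two-parameter (k, T) policy of this kind can be explored with a small Monte Carlo sketch. Everything numerical here is an assumption made for illustration and is not taken from Satow and Osaki (2003): failures of component 1 follow an NHPP with cumulative intensity a·t^b, each failure adds exponentially distributed damage to component 2, a minimal repair costs `c_min`, and no minimal repair is charged at a failure that triggers a replacement. The long-run cost rate is estimated as total cost divided by total cycle length (the renewal-reward argument).

```python
import random


def simulate_cycle(k, T, failure_level, a, b, mean_damage,
                   c_min, c_prev, c_fail, rng):
    """Simulate one renewal cycle; return (cycle cost, cycle length)."""
    gamma = 0.0    # arrival time of a unit-rate Poisson process
    damage = 0.0   # cumulative damage of component 2
    cost = 0.0
    while True:
        # Time transformation: if gamma is a unit-rate Poisson arrival,
        # t = (gamma / a) ** (1 / b) is an NHPP arrival with mean a * t**b.
        gamma += rng.expovariate(1.0)
        t = (gamma / a) ** (1.0 / b)
        if t >= T:                         # age limit reached first
            return cost + c_prev, T
        damage += rng.expovariate(1.0 / mean_damage)
        if damage > failure_level:         # component 2 (and the system) fails
            return cost + c_fail, t
        if damage > k:                     # damage limit: preventive replacement
            return cost + c_prev, t
        cost += c_min                      # minimal repair of component 1


def cost_rate(k, T, failure_level, a, b, mean_damage,
              c_min, c_prev, c_fail, n_cycles=20000, seed=0):
    """Estimate the long-run expected cost per unit time of a (k, T) policy."""
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(n_cycles):
        cost, length = simulate_cycle(k, T, failure_level, a, b, mean_damage,
                                      c_min, c_prev, c_fail, rng)
        total_cost += cost
        total_time += length
    return total_cost / total_time
```

Setting `T = float("inf")` gives the pure damage-limit policy and setting `k = failure_level` gives the pure age policy, so the two one-parameter special cases can be compared with the same code.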
Zequeira and Bérenguer (2005) study inspection policies for a two-component parallel standby system. The system operates successfully if at least one component functions. Failures can be detected only by periodic inspections. The failure times are modelled as independent random variables. Type II failure interaction is modelled as follows. The failure of one component modifies the (conditional) failure probability of the other component with probability p and does not influence the failure time with probability 1 – p. In this respect, the model extends the failure rate interaction models proposed by Murthy and Nguyen (1985b). Inspections are either staggered, i.e. the components are inspected one at a time, or non-staggered, i.e. the components are inspected simultaneously. It is assumed that there are no economies of scale in doing non-staggered inspections. Numerical experiments show that for the case of constant hazard rates, staggered inspections outperform non-staggered inspections on the expected average cost per unit time criterion. The authors explain this counterintuitive result as follows. When inspections are staggered, at least one component is in an operating condition more frequently than when inspections are not staggered.


Lai and Chen (2006) consider a two-component system with failure rate interaction. The lifetimes of the components are modelled by random variables with increasing failure rates. Component 1 is repairable and it undergoes minimal repair at failures. That is, component 1 failures occur according to an NHPP. Upon failure of component 1 the failure rate of component 2 is modified (increased). Failures of component 2 induce the failure of component 1 and consequently the failure of the system. The authors propose the following maintenance policy. The system is completely replaced upon failure, or preventively replaced at age T, whichever occurs first. The expected average cost per unit time is derived and the policy is optimized with respect to parameter T. The optimum turns out to be unique.

Barros et al. (2006) introduce imperfect monitoring in a two-component parallel system. It is assumed that the failure of component i is detected with probability 1 – p_i and is not detected with probability p_i. The components have exponential lifetimes; when a component fails, extra stress is placed on the surviving component, whose failure rate is increased. Moreover, independent shocks occur according to a Poisson process. These shocks correspond to common cause failures and induce a system failure. The following maintenance policy is proposed. Replace the system upon failure (either due to a shock or failure of the components separately), or preventively at time T, whichever occurs first. Assuming that preventive replacement is cheaper, the total expected discounted cost over an unbounded horizon is minimized. Numerical examples show the relevance of taking into account monitoring problems in the maintenance model. The model is applied to a parallel system of electronic components. When one fails, the surviving one is overworked so as to keep the delivery rate unaffected.
11.4.3 Types I and II Failure Interaction

Murthy and Nguyen (1985b) derive the expected cost of operating a two-component system with type I or type II failure interaction for both a finite and an infinite time period. They consider a simple, non-opportunistic maintenance policy: always replace failed components immediately. This means that the system is only renewed if a natural failure induces a failure of the other component.

Nakagawa and Murthy (1993) elaborate on the ideas of Murthy and Nguyen (1985b). They consider two types of failure interaction between two components. In the first case the failure of component 1 induces a failure of component 2 with a certain probability. In the second case the failure of component 1 causes a random amount of damage to the other component. In the latter case the damage accumulates and the system fails when the total damage exceeds a specified level. Failures of component 1 are modelled as an NHPP with increasing intensity function. The following maintenance policy is examined. The system is replaced at failure of component 2 or at the N-th failure of component 1, whichever occurs first. For both models the optimal number of failures before replacing the system, so as to minimize the expected cost per unit time over an infinite horizon, is derived. The maintenance policy for the shock damage model is extended as follows: the system is also replaced at time T. This results in a two-parameter maintenance policy, which is also optimized. The authors give an application of their models to the


chemical industry; component 1 is a pneumatic pump and component 2 is a metal container. The failure of the pneumatic pump may either lead to an explosion, causing system failure (model 1), or lead to a reduction in the wall thickness of the container (model 2). The extension of model 2 captures the introduction of preventive maintenance of the container at time T.

11.4.4 Other Types of Failure Interaction

Özekici (1988) considers a reliability system of n components. The state of the system is given by the random vector $X_t$ of the ages of the components at time t, that is $X_t = (X_t^1, \ldots, X_t^n)$. It is assumed that $X_t^i \ge 0$ for all t > 0 and i = 1, ..., n, where $X_t^i = \infty$ implies that component i is in a failed state at time t. The stochastic structure of the system is that the stochastic process with state-space [0, ∞) is a positive, increasing, right-continuous and quasi-left-continuous strong Markov process. Stochastic dependence between the components is modelled by making the age (state) of a component at time t dependent on the age of the system up to time t. The failure interaction considered here differs from type I and II failure interaction defined above. It is worth mentioning that this paper was written independently of the work of Murthy and Nguyen (1985a,b).

Maintenance is modelled as follows. There are periodic overhauls at which the state of the system is inspected and a replacement decision is made on the components based on the observation of the system. Here the cost structure of the maintenance decision is very general and consists of two types: costs which only depend on the number of replaced components and costs which depend on the state of the system at the time of inspection. Economic dependence between components is ‘hidden’ in the former costs. Replacing a group of components together is cheaper than replacing the components separately or in a smaller subgroup. The optimal replacement problem is formulated as a Markov decision process.
The author proposes a very general class of replacement policies, for which the decision to replace a component depends on the age of all components. It appears to be possible to characterize the optimal solution to the replacement problem. Unfortunately, it cannot be proved that there exists a single critical age for the system, which describes the optimal replacement problem. The author provides some intuitive results, e.g. it is not always optimal to replace new components and if the age of components that have to be replaced is increased, then the optimal policy does not change. He also gives an important counter-intuitive result: it is not true that more components are replaced as the system gets older.

11.5 Structural Dependence

Structural dependence means that some operating components have to be replaced, or at least dismantled, before failed components can be replaced or repaired. In other words, structural dependence between components indicates that they cannot be maintained independently. This is not failure dependence, but maintenance dependence. Since the failure of a component offers an opportunity to replace other


components, opportunistic policies are expected to perform well on systems with structural dependence between components. Obviously, preventive maintenance may also be advantageous, since maintenance of structurally dependent components can be grouped.

There may be several reasons for structural dependence. For example, a bicycle chain and a cassette form a union, which should always be replaced together rather than individually. Another example is from Dekker et al. (1998a), who consider road maintenance. Several deterioration mechanisms affect roads, e.g. longitudinal and transversal unevenness, cracking and ravelling. For each mechanism one may define a virtual component, but if one applies a maintenance action to such a component it also affects the state with respect to the other failure mechanisms.

The seminal paper in this category is from Sasieni (1956). He considers the production of rubber tyres. The machine that produces the tyres consists of two “bladders”; one tyre is produced on each bladder simultaneously. Upon failure of a bladder, the machine must be stripped down before replacement can be done. This means that the other bladder can be replaced at the same time. Note that immediate replacement is not mandatory, but a failed bladder will produce faulty tyres. Two maintenance policies are analyzed and optimized. The first is a preventive maintenance policy. Bladders which have made a predetermined number of tyres (m) without failure are replaced. The second is an opportunistic version of the first policy. When a machine is stripped to replace one bladder, replace the other bladder if it has produced more than n ≤ m tyres.
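The opportunistic (m, n) bladder policy just described can be written as a small decision rule. The sketch below is only an illustration: the function name and argument layout are hypothetical, bladders are indexed 0 and 1, and we assume that a failed bladder is always replaced at the moment the decision is taken.

```python
def bladders_to_replace(tyres_made, failed, m, n):
    """Return the indices of the bladders to replace.

    tyres_made -- pair with the number of tyres made by each bladder
    failed     -- pair of booleans, True if that bladder has failed
    m          -- preventive limit on the number of tyres
    n          -- opportunistic limit, with n <= m
    """
    if not 0 <= n <= m:
        raise ValueError("need 0 <= n <= m")
    # A bladder is due if it has failed or has reached the preventive limit.
    due = {i for i in (0, 1) if failed[i] or tyres_made[i] >= m}
    if len(due) == 1:
        # The machine is stripped anyway, so also replace the other
        # bladder if it has produced more than n tyres.
        other = 1 - next(iter(due))
        if tyres_made[other] > n:
            due.add(other)
    return due
```

Choosing n = m recovers the purely preventive policy, since the second bladder is then never replaced early.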

11.6 Planning Horizon and Optimization Methods

In this section we will classify articles on the basis of the planning horizon of the maintenance model and the optimization methods used to solve this model. Actually, these two concepts are related. The majority of the articles reviewed here assume an infinite horizon. This assumption facilitates the mathematical analysis; it is often possible to derive analytical expressions for optimal control parameters and the corresponding optimal costs. So, in the category of infinite horizon (stationary grouping) models, policy optimization is the most popular optimization method. For convenience we will not review the articles in this category.

Finite-horizon models consider the system in this horizon only, and hence assume implicitly that the system is not used afterwards, unless a so-called residual value is incorporated to estimate the industrial value of the system at the end of the horizon. In the article of Budai et al. (2006) the so-called end-of-horizon effect is eliminated by adding an additional term to the objective function. This term values the last interval.
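To illustrate why the infinite-horizon assumption simplifies policy optimization, consider the classical single-component age replacement policy, which is not one of the multi-component models reviewed here but shows the mechanism. By the renewal-reward argument the long-run cost rate is the expected cost per cycle divided by the expected cycle length, C(T) = (c_p R(T) + c_f (1 − R(T))) / ∫_0^T R(t) dt, where R is the survival function. The sketch below assumes a Weibull lifetime and hypothetical cost values, and searches a grid instead of solving the optimality condition analytically.

```python
import math


def survival(t, shape, scale):
    """Weibull survival function R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def cost_rate(T, shape, scale, c_prev, c_fail, steps=2000):
    """Long-run cost per unit time of age replacement at age T."""
    h = T / steps
    # Trapezoidal rule for the expected cycle length, the integral of R on [0, T].
    area = 0.5 * (1.0 + survival(T, shape, scale))
    area += sum(survival(i * h, shape, scale) for i in range(1, steps))
    expected_length = area * h
    r_T = survival(T, shape, scale)
    expected_cost = c_prev * r_T + c_fail * (1.0 - r_T)
    return expected_cost / expected_length


def best_age(shape, scale, c_prev, c_fail, grid):
    """Return the replacement age on the grid with the lowest cost rate."""
    return min(grid, key=lambda T: cost_rate(T, shape, scale, c_prev, c_fail))
```

With an increasing failure rate (shape > 1) and failure replacement more expensive than preventive replacement, the best age lies in the interior of a sufficiently wide grid; with a constant failure rate (shape = 1) the cost rate decreases in T, so preventive replacement does not pay off.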


R. Nicolai and R. Dekker

The optimization methods applied to finite horizon models are either exact methods or heuristics (see footnote 1). Exact methods always find the global optimum solution of a problem. If the complexity of an optimization problem is high and the computing time of the exact method increases exponentially with the size of the problem, then heuristics can be used to find a near-optimal solution in reasonable time.

The scheduling problem studied by Grigoriev et al. (2006) appears to be NP-hard. Instead of defining heuristics, the authors choose to work on a relatively fast exact method. Column generation and a branch-and-price technique are utilized to find the exact solution of larger-sized problems. The problem considered by Papadakis and Kleindorfer (2005) is first modelled as a mixed integer linear programming problem, but it appears that it can also be formulated as a max-flow min-cut problem in an undirected network. For this problem efficient algorithms exist and thus an exact method is applicable.

Langdon and Treleaven (1997), Sriskandarajah et al. (1998), Higgins (1998) and Budai et al. (2006) propose heuristics to solve complex scheduling problems. The first two articles utilize genetic algorithms. Higgins (1998) applies tabu search and Budai et al. (2006) define different heuristics that are based on intuitive arguments. In all four articles the heuristics perform well; a good solution is found within reasonable time.

11.7 Trends and Open Areas

In this section we comment on future research on optimal maintenance of multi-component systems. We first analyze the trends in modelling multi-component maintenance and then discuss the future research areas in this field.

11.7.1 Trends

In the last few years several articles have appeared on optimal maintenance of systems with stochastic dependence. In particular, the shock-damage models have received much attention. One explanation for this is that type II failure interaction can be modelled in several ways, whereas there is not much room for extensions in the type I failure model. Another reason is that since the field of stochastic dependence is not very broad yet, it is easy to add a new feature such as minimal repair or imperfect monitoring to an existing model. Third, many existing opportunistic maintenance policies for systems with economic dependence have not yet been applied to systems with (type II) failure interaction.

Another upcoming field in multi-component maintenance modelling is the class of finite horizon maintenance scheduling problems. Finite horizon models can be regarded as dynamic models, because short-term information can be taken into account. Maintenance scheduling problems are often modelled in discrete time as mixed integer linear programming problems. These problems can be NP-hard and in that case heuristics or local search methods have to be developed in order to solve the problems to near-optimality efficiently. In the last decade tabu search, genetic algorithms and problem specific heuristics have already been applied to maintenance scheduling problems (see Langdon and Treleaven 1997, Sriskandarajah et al. 1998, Higgins 1998 and Budai et al. 2006). However, there is still a need for better local search algorithms.

Footnote 1: Actually, if the maintenance policy is relatively simple, it is sometimes possible to determine the expected maintenance costs over a finite period of time. For instance, Murthy and Nguyen (1985a,b) consider failure-based policies in a system with stochastic dependence and derive an expression for the expected cost of operating the system for a finite time.

11.7.2 Open Areas

There is scope for more work in the following areas.

11.7.2.1 Finite Horizon Models

On one hand, the class of infinite horizon models has been studied extensively in the literature. Based on the renewal-reward theory many maintenance policies for stationary grouping models have been analyzed. On the other hand, the class of finite horizon models, which includes many maintenance scheduling problems, has never had that much attention. However, maintenance of multi-component systems has to be made operational. Therefore, finite horizon and especially rolling horizon models, which also take short-term information into account, have to be developed. In order to solve these models heuristics/local search methods should be further developed. Exact algorithms also need more attention. The article of Grigoriev et al. (2006) shows that some scheduling problems of reasonable size can be optimized exactly in a reasonable time.

11.7.2.2 Case-studies

This review shows that case-studies are not represented very well in the field. This is surprising, since maintenance is an applied topic. In our opinion many models are just (mathematical) extensions of existing models and most of the time models are not validated empirically. Case-studies can lead to new models, both in the context of cost structures and dependencies between components.

11.7.2.3 Modelling Multiple Set-up Activities

In this article we have subdivided the category “economic dependence” into a number of subcategories. It appears that examples of modelling maintenance of systems with multiple set-up activities are scarce. Therefore, this seems to be a promising field for further research. After all, in many production systems complex set-up structures exist.

11.7.2.4 Structural Dependence

The field of structural dependence is wide open. In our opinion there have only been a few articles published on this topic.


11.7.2.5 Stochastic Dependence

Two decades ago Murthy and Nguyen published two articles on the maintenance of systems with stochastic dependence. Although this topic has had much attention since then, most articles still deal with two-component systems. So, there is still a lot of work to do on modelling maintenance of systems with failure interaction consisting of more than two components.

11.7.2.6 Combination of Dependencies

In this article we have seen one example of the combination of stochastic and economic dependence (Scarf and Deara 1998, 2003). We have also reviewed some papers with both positive and negative economic dependence. Obviously, the combination of different types of interaction results in difficult optimization models. So, this is also an opportunity for researchers to come up with some new models.

11.7.2.7 Simulation Optimization

We have already said that much work has been done on maintenance policies for the class of infinite horizon models. Many maintenance policies are not analytically tractable and simulation is needed to analyze these policies. We observe that the optimization of policies via simulation is often done by using algorithms for deterministic optimization problems. Methods such as simulated annealing and response surface methodology may be more efficient. This should be investigated further.

11.8 Conclusions

In this chapter we have reviewed the literature on optimal maintenance of multi-component systems. We first classified articles on the basis of the type of dependence between components: economic, stochastic and structural dependence. Subsequently, we subdivided these classes into new categories. For example, we have introduced the categories positive and negative economic dependence. We have paid attention to articles with both forms of interaction. Moreover, we have defined several subcategories in the class of models with positive economic dependence. With respect to articles in the class of stochastic dependence, we are the first to review these articles systematically.

Another classification has been made on the basis of the planning horizon of the models and the optimization methods. We have focussed our attention on the use of heuristics and exact methods in finite horizon models. We have concluded that this is a promising open research area.

We have discussed the trends and the open areas of research reported in the literature on multi-component maintenance. We have observed a shift from infinite horizon models to finite horizon models and from economic to stochastic dependence. This immediately defines the open research areas, which also include topics such as case studies, modelling combinations of dependencies between components and modelling multiple set-up activities.


11.9 References Assaf D, Shanthikumar J, (1987) Optimal group maintenance policies with continuous and periodic inspections. Management Science 33:1440–1452 Barros A, Bérenguer C, Grall A, (2006) A maintenance policy for two-unit parallel systems based on imperfect monitoring information. Reliability Engineering and System Safety 91:131–136 Budai G, Huisman D, Dekker R, (2006) Scheduling preventive railway maintenance activities. Journal of the Operational Research Society 57:1035–1044 Castanier B, Grall A, Bérenguer C (2005) A condition-based maintenance policy with nonperiodic inspections for a two-unit series system. Reliability Engineering & System Safety 87:109–120 Cho D, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51:1–23 Dekker R, Plasmeijer R, Swart J, (1998a) Evaluation of a new maintenance concept for the preservation of highways. IMA Journal of Mathematics applied in Business and Industry 9:109–156 Dekker R, van der Duyn Schouten F, Wildeman R, (1996) A review of multi-component maintenance models with economic dependence. Mathematical Methods of Operations Research 45:411–435 Dekker R, van der Meer J, Plasmeijer R, Wildeman R, (1998b) Maintenance of lightstandards: a case-study. Journal of the Operational Research Society 49:132–143 Goyal, S, Kusy M, (1985) Determining economic maintenance frequency for a family of machines. Journal of the Operational Research Society 36:1125–1128 Grigoriev A, van de Klundert J, Spieksma F, (2006) Modeling and solving the periodic maintenance problem. European Journal of Operational Research 172:783–797 Gürler Ü, Kaya A, (2002) A maintenance policy for a system with multi-state components: an approximate solution. Reliability Engineering & System Safety 76:117–127 Higgins A, (1998) Scheduling of railway track maintenance activities and crews. 
Journal of the Operational Research Society 49:1026–1033 Jhang J, Sheu S, (2000) Optimal age and block replacement policies for a multi-component system with failure interaction. International Journal of Systems Science 31:593–603 Lai M, Chen Y, (2006) Optimal periodic replacement policy for a two-unit system with failure rate interaction. The International Journal of Advanced Manufacturing and Technology 29:367–371 Langdon W, Treleaven P, (1997) Scheduling maintenance of electrical power transmission networks using genetic programming. In Warwick K, Ekwue A, Aggarwal R, (eds.) Artificial intelligence techniques in power systems, Institution of Electrical Engineers, Stevenage, UK, 220–237 Murthy D, Nguyen D, (1985a) Study of a multi-component system with failure interaction. European Journal of Operational Research 21:330–338 Murthy D, Nguyen, D (1985b) Study of two-component system with failure interaction. Naval Research Logistics Quarterly 32:239–247 Nakagawa T, Murthy D, (1993) Optimal replacement policies for a two-unit system with failure interactions. RAIRO Recherche operationelle / Operations Research 27:427–438 Okumoto K, Elsayed E, (1983) An optimum group maintenance policy. Naval Research Logistics Quarterly 30:667–674 Özekici S, (1988) Optimal periodic replacement of multicomponent reliability systems. Operations Research 36:542–552 Papadakis I, Kleindorfer P, (2005) Optimizing infrastructure network maintenance when benefits are interdependent. OR Spectrum 27:63–84


Pham H, Wang H, (2000) Optimal (τ, T) opportunistic maintenance of a k-out-of-n:G system with imperfect PM and partial failure. Naval Research Logistics 47:223–239
Popova E, Wilson J, (1999) Group replacement policies for parallel systems whose components have phase distributed failure times. Annals of Operations Research 91:163–189
Ritchken P, Wilson J, (1990) (m; T) group maintenance policies. Management Science 36:632–639
Sasieni M, (1956) A Markov chain process in industrial replacement. Operational Research Quarterly 7:148–155
Satow T, Osaki S, (2003) Optimal replacement policies for a two-unit system with shock damage interaction. Computers and Mathematics with Applications 46:1129–1138
Scarf P, Deara M, (1998) On the development and application of maintenance policies for a two-component system with failure dependence. IMA Journal of Mathematics Applied in Business and Industry 9:91–107
Scarf P, Deara M, (2003) Block replacement policies for a two-component system with failure dependence. Naval Research Logistics 50:70–87
Sheu S, Jhang J, (1996) A generalized group maintenance policy. European Journal of Operational Research 96:232–247
Sheu S, Kuo C, (1994) Optimal age replacement policy with minimal repair and general random repair costs for a multi-unit system. RAIRO Recherche Operationelle/Operations Research 28:85–95
Sheu S, Liou C, (1992) Optimal replacement of a k-out-of-n system subject to shocks. Microelectronics Reliability 32:649–655
Smith M, Dekker R, (1997) Preventive maintenance in a 1 out of n system: the uptime, downtime and costs. European Journal of Operational Research 99:565–583
Sriskandarajah C, Jardine A, Chan C, (1998) Maintenance scheduling of rolling stock using a genetic algorithm. Journal of the Operational Research Society 49:1130–1145
Stengos D, Thomas L, (1980) The blast furnaces problem. European Journal of Operational Research 4:330–336
Thomas L, (1986) A survey of maintenance and replacement models for maintainability and reliability of multi-item systems. Reliability Engineering 16:297–309
Van der Duyn Schouten F, Vanneste S, (1993) Two simple control policies for a multicomponent maintenance system. Operations Research 41:1125–1136
Van der Duyn Schouten F, van Vlijmen B, Vos de Wael S, (1998) Replacement policies for traffic control signals. IMA Journal of Mathematics Applied in Business and Industry 9:325–346
Van Dijkhuizen G, (2000) Maintenance grouping in multi-setup multi-component production systems. In Ben-Daya M, Duffuaa S, Raouf A, (eds.) Maintenance, Modeling and Optimization, Kluwer Academic Publishers, Boston, 283–306
Wang H, (2002) A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 139:469–489
Zequeira R, Bérenguer C, (2005) On the inspection policy of a two-component parallel system with failure interaction. Reliability Engineering and System Safety 88:99–107

12 Replacement of Capital Equipment P.A. Scarf and J.C. Hartman

12.1 Introduction

Businesses require equipment in order to function and deliver their outputs. In the global, competitive environment, this equipment is critical to success. However, equipment generally degrades with age and usage, and investment is required to maintain the functional performance of equipment. For example, in mass urban transportation, annual expenditure on equipment replacement for the Hong Kong underground is of the order of $50 million, and further, the Hong Kong underground network is a fraction of the size of that in London, Paris or New York. Where equipment replacement impacts significantly on the bottom line of a corporation and decision making about such expenditure is under the control of the company executive, the modelling of such decision making is within the scope of this chapter.

Capital equipment investment projects are typically driven by operating cost control, technical obsolescence, requirements for performance and functionality improvements, and safety. That is, rational decision making about capital equipment replacement will take account of engineering, economic, and safety requirements. In this chapter we will assume that the engineering requirements concerning replacement will define certain choices for equipment replacement. For example, engineers would normally propose a number of options for providing the continuity of equipment function: retain the current equipment as is, refurbish the equipment in order to improve operation and functionality, or replace the equipment with new improved technology. We will further assume that safety requirements are addressed when these options are analysed by engineers. Consequently, we argue that rational choice between the defined replacement options is an economic question. Thus, a logistics corporation may be considering replacement of certain assets in its road transportation fleet. The organisation may have to raise capital to fund such replacement. There is the expectation that engineers for the corporation will offer a number of choices for replacement (e.g. buy tractors from company X or Y, buy tractors now or in N years' time, or scrap or retain existing tractors as spares) that meet future functional and safety requirements. In this way, decision making about


replacement then necessarily considers the costs of the replacement options over some suitable planning horizon. As capital equipment replacement potentially incurs significant costs, the cost of capital is a factor in the decision problem and models to support decision making typically take account of the time value of capital through discounting.

Capital equipment is a significant asset of a business. It consists of necessarily complex systems and a business would typically own or operate a fleet of equipment: the Mass Transit Railway Corporation Limited of Hong Kong operates hundreds of escalators; FedEx Express, the cargo airline, operates more than 600 aircraft; electricity distribution systems comprise thousands of kilometres of cable and hundreds of thousands of items such as transformers and switches; water supply networks are on a similar scale. We can appeal to the law of large numbers and assume with some justification that the economic costs that enter capital equipment replacement decisions are deterministic. Consequently, we consider deterministic models in this chapter and model rational decision making throughout using net present value techniques (e.g. see Arnold 2006; Northcott 1985).

When considering optimal equipment replacement in an uncertain environment, authors have argued the case for using real options (Dixit and Pindyck 1994; Bowe and Lee 2004). Whenever replacement decisions may be exercised continuously, it is argued that the choice to replace an existing asset with a new asset at a specified time is characteristic of an American call option; this approach seeks to value the opportunity to replace the asset. Such a modelling approach would be valuable when considering expansion of assets, for example, through the building of a new transportation link for which the likely return on investment would be highly uncertain. However, we do not consider this approach in this chapter.

We do not consider problems of component replacement in which the functionality of repairable systems is optimized either on a cost basis or a required reliability basis. Such maintenance does not typically involve capital expenditure, and the models used are often stochastic in nature: times to failure are considered to be random. For a recent review of such models, see Wang (2002).

The outline of the chapter is as follows. In Section 12.2 we describe the framework for the classification of models that are discussed in this chapter. This framework considers the nature of capital equipment replacement problems in general and presents further detail regarding the nature of cost factors that contribute to replacement decisions. Section 12.3 looks at economic life models and discusses several models and an application of one of the models. Section 12.4 deals with replacement of a network system. Dynamic programming models are discussed in Section 12.5 and the chapter concludes with a discussion of topics for future research in Section 12.6.

12.2 Framework for Replacement Modelling of Capital Equipment

The composition of a fleet may be classified as singular (one operating plant), multiple identical (homogeneous), or multiple non-identical (inhomogeneous). Replacement policies may be classified as single plant replacement, sub-fleet replacement, or entire fleet replacement (Scarf and Christer 1997). The capital replacement models that are considered in this chapter may be classified as economic life models or dynamic programming models. The former are concerned with determining the optimal lifetime of an item of equipment, taking account of costs over some planning horizon. The latter consider replacement decisions dynamically, determining whether plant should be retained or replaced after each period. Economic life models may be further classified according to the length of the planning horizon: infinite, variable finite (with length of the horizon a function of decision variables), or fixed (with a variable number of replacement cycles). Dynamic programming models generally require a finite horizon, but may be used to identify the optimal time zero decision for an infinite horizon.

Early models (e.g. Eilon et al. 1966) were formulated in continuous time with optimum policy obtained using calculus. More complex models are simpler to implement under a discrete time formulation. In the case of economic life models, optimization may be performed using a crude search when there are only a few decision variables. For fleets with many items, the discrete time formulation naturally gives rise to mathematical programming problems. Dynamic programming models necessarily require a discrete time formulation. Real options models are formulated in continuous time.

We begin by looking at simple economic life models. These are applied in a case study on escalator replacement. Economic life models are then extended to consider first an inhomogeneous fleet and second a network system viewed as an inhomogeneous fleet with interacting items. A number of different dynamic programming models are introduced for singular systems and then expanded to homogeneous and inhomogeneous fleets and networks of assets.

It is assumed that data relating to maintenance are available and sufficient for modelling purposes.
Data on other “age” related operating costs, such as fuel costs and failures (breakdowns), would also ideally be available. Where usage of plant is non-uniform, particularly if decreasing with age, usage data are also required for replacement policy to be meaningful. This is because, for example, maintenance costs for older plant may be artificially low due to under utilization or neglect of good maintenance practice for plant near the end of their useful life. Some plant may even be retired as occasional spares. Under reporting and thus bias of maintenance cost data may also be significant (Scarf 1994). Replacement models have also been considered when cost information is obtained subjectively (Apeland and Scarf 2003). Penalty costs play a role in all replacement decisions (Christer and Scarf 1994). It is only the extent to which penalty cost is quantified in the modelling process that varies. Rather than attempt to estimate the values of “difficult to quantify” parameters such as penalty cost and then determine optimal policy, the influence of these parameters on the decision should be quantified. In this latter approach, threshold values that lead to a step-change in optimum policy can be investigated and presented and the decision makers can then consider whether they believe that such values are realistic within the context of the problem. Thus, the penalty cost can be used to measure in part the subjective component of a replacement choice. All costs considered in the modelling will be discounted to net present value through the use of a constant discount factor. We refer the reader to Kobbacy and


Nicol (1994) for a detailed discussion of the role of discounting in capital replacement. Appropriate functions describing resale values are assumed to be known, as are purchase costs. Tax considerations in particular contexts should be taken into account and modelled.

12.3 Economic Life Models

12.3.1 A Simple Model for Individual Plant

Early economic life models such as Eilon et al. (1966) considered an idealised item of equipment replaced at age T, that is, replacement every T time units, in perpetuity. In this idealised framework, for T small, frequent replacement leads to high replacement or capital costs. Infrequent replacement (large T), on the other hand, results in high operating or revenue costs (assuming that operating costs increase with the age of equipment). Trading off capital costs against revenue costs leads to an optimum age at replacement, T*, the so-called economic life. The decision criterion is typically the total cost per unit time or the annuity; this latter term has been called the rent by Christer (1984). In the case without discounting, the total cost per unit time, c(T), and the annuity are equivalent and

\[ c(T) = \Big\{ \int_0^T m_0(t)\,dt + R \Big\} \Big/ T, \tag{12.1} \]

where $m_0(t)$ is the operating cost rate and R is the replacement cost, and assuming no residual value. From Equation 12.1, it follows that T* is the solution of

\[ \int_0^{T^*} m_0(t)\,dt + R = T^* m_0(T^*), \]

provided it exists. In its discrete time form the total cost per unit time is $c(T) = \{\sum_{i=1}^{T} m_{0i} + R\}/T$, where $m_{0i}$ is the operating cost in time period i. With a discount factor $\nu$, discounting to year end, and a residual value function S(T), the net present value (NPV) of all future costs in perpetuity is



\[ c_{\mathrm{NPV}}(T) = (1 + \nu^{T} + \nu^{2T} + \cdots)\Big\{ \sum_{i=1}^{T} m_{0i}\nu^{i} + \nu^{T}[R - S(T)] \Big\} = (1 - \nu^{T})^{-1}\Big\{ \sum_{i=1}^{T} m_{0i}\nu^{i} + \nu^{T}[R - S(T)] \Big\}. \]

An objection to this criterion is that as ν → 1, cNPV (T ) → ∞ . Consequently, we recommend the annuity or rent (the amount paid annually and in perpetuity that is necessary to meet the total discounted cost) given by



\[ (1 + \nu + \nu^{2} + \cdots)\, c_{\mathrm{rent}}(T) = (1 + \nu^{T} + \nu^{2T} + \cdots)\Big\{ \sum_{i=1}^{T} m_{0i}\nu^{i} + \nu^{T}[R - S(T)] \Big\}, \]

whence

\[ c_{\mathrm{rent}}(T) = \frac{1 - \nu}{1 - \nu^{T}} \Big\{ \sum_{i=1}^{T} m_{0i}\nu^{i} + \nu^{T}[R - S(T)] \Big\}. \]
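The rent criterion can be evaluated directly for each candidate T. The following is a minimal sketch in Python rather than a spreadsheet; the function names and the cost figures (linearly increasing operating cost, R = 100, zero residual value, discount factor 0.95) are hypothetical and chosen only for illustration.

```python
def rent(T, m0, R, S, nu):
    """Annuity c_rent(T) for replacement every T periods.

    m0 : list of operating costs, m0[i - 1] is the cost in period i of an item's life
    R  : replacement cost
    S  : function giving the residual value at age T
    nu : discount factor per period (0 < nu < 1)
    """
    discounted_operating = sum(m0[i - 1] * nu**i for i in range(1, T + 1))
    cycle_cost = discounted_operating + nu**T * (R - S(T))
    return (1 - nu) / (1 - nu**T) * cycle_cost


def economic_life(m0, R, S, nu):
    """Crude search over T = 1..len(m0); returns (T*, c_rent(T*))."""
    candidates = {T: rent(T, m0, R, S, nu) for T in range(1, len(m0) + 1)}
    T_star = min(candidates, key=candidates.get)
    return T_star, candidates[T_star]


# Hypothetical data: operating cost rising linearly with age, no resale value.
m0 = [10 + 2 * i for i in range(1, 31)]
T_star, c_star = economic_life(m0, R=100, S=lambda age: 0.0, nu=0.95)
print(T_star, round(c_star, 2))
```

With a discount factor very close to one and zero residual value, `rent` reproduces the undiscounted total cost per unit time, which gives a simple check on an implementation.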

Notice that as $\nu \to 1$, $c_{\mathrm{rent}}(T) \to c(T)$, the total cost per unit time. The economic life can be obtained by minimising $c_{\mathrm{rent}}(T)$, typically using a spreadsheet by considering a range of values of T.

12.3.2 Analysing Technological Change Using a Two-cycle Model

The economic life model can be adapted to consider technological change in a number of ways. One can consider economic factors for new models of equipment (future operating costs) in a parametric fashion, specifying a model for technological change which then implies operating cost functions, replacement cost and residual values for each replacement cycle into the future (Elton and Gruber 1976). Alternatively, one can model replacement over a limited time scale, either by fixing the time horizon, or by fixing the number of replacement cycles. Christer (1984) did the latter and described a two-cycle model which models the immediate replacement decision problem by considering existing plant as having age $\tau$ and age-related operating cost $m_{0i}$, and new plant as having operating cost $m_{1i}$. In its discrete form, the annuity for this model is

\[ c^{2}_{\mathrm{rent}}(K, L) = \frac{\sum_{i=1}^{K} m_{0(i+\tau)}\nu^{i} + \nu^{K}\Big\{ R_1 - S_0(K+\tau) + \sum_{i=1}^{L} m_{1i}\nu^{i} + \nu^{L}[R_1 - S_1(L)] \Big\}}{\sum_{i=1}^{K+L} \nu^{i}}. \tag{12.2} \]

Here K and L are decision variables, with K modelling the time (from now) to replacement of the existing asset; K+L is the time to second replacement. The advantage of this model is that one need only estimate the operating cost of the existing and new assets (as functions of age), the capital cost for the new asset, $R_1$, and the age-related resale or residual values of the existing and new assets, $S_0$ and $S_1$.

12.3.3 A Fixed Planning Horizon Model

In the financial appraisal of projects, a standard approach fixes the time horizon and determines the NPV of future costs over this horizon (e.g. Northcott 1985). This fixed horizon model has been studied by Scarf and Hashem (2003) and its simplicity lends itself to application in complex contexts (e.g. Scarf and Martin 2001). The annuity for this model can be derived from Equation 12.2 above simply by setting X = K and K + L = h, the length of the planning horizon, and then considering h as fixed. Hence, there is only one decision variable, X, the time to replacement. Given the possibility that X = h, that is, no replacement over the planning horizon whence we retain the current asset, the annuity function has a discontinuity at X = h, and X* = h implies that it is not optimal to undertake the


(replacement) project. Furthermore, since the replacement at the end of the horizon has a fixed cost (with respect to the decision variable X) its inclusion or exclusion has no effect on the optimal time to replacement. It is natural not to include the replacement cost at the horizon-end since a standard financial appraisal approach would only account for revenue costs up to project execution, capital costs at project execution, subsequent revenue costs up to the horizon-end, and residual values. Including the replacement at h on the other hand allows cost comparisons with the two-cycle model and the associated rent, Equation 12.2. We take the former approach here however and the annuity is



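\[ c^{h}_{\mathrm{rent}}(X) = \begin{cases} \Big\{ \sum_{i=1}^{X} m_{0(i+\tau)}\nu^{i} + \nu^{X}[R_1 - S_0(X+\tau)] + \sum_{i=X+1}^{h} m_{1(i-X)}\nu^{i} - \nu^{h} S_1(h-X) \Big\} \Big/ \sum_{i=1}^{h}\nu^{i}, & X < h, \\[1ex] \Big\{ \sum_{i=1}^{h} m_{0(i+\tau)}\nu^{i} - \nu^{h} S_0(h+\tau) \Big\} \Big/ \sum_{i=1}^{h}\nu^{i}, & X = h. \end{cases} \tag{12.3} \]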
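As an illustration of how Equation 12.3 might be evaluated in practice, the following Python sketch computes the fixed horizon annuity for each X from 0 to h and picks the smallest. The function name, the cost functions and all numerical values are hypothetical; the operating costs are passed as functions of age, which is an implementation choice and not part of the model.

```python
def fixed_horizon_rent(X, h, tau, m0, m1, R1, S0, S1, nu):
    """Annuity of Equation 12.3: replace the existing asset (age tau) after X periods.

    m0(age), m1(age) : operating cost in the period ending at the given age
    S0(age), S1(age) : residual values of the existing and new asset
    R1               : capital cost of the new asset
    X == h means the existing asset is retained over the whole horizon.
    """
    annuity_factor = sum(nu**i for i in range(1, h + 1))
    old = sum(m0(i + tau) * nu**i for i in range(1, X + 1))
    if X == h:
        return (old - nu**h * S0(h + tau)) / annuity_factor
    capital = nu**X * (R1 - S0(X + tau))
    new = sum(m1(i - X) * nu**i for i in range(X + 1, h + 1))
    return (old + capital + new - nu**h * S1(h - X)) / annuity_factor


# Hypothetical illustration: ageing existing plant versus cheaper-to-run new plant.
h, tau, nu = 20, 10, 0.95
costs = {
    X: fixed_horizon_rent(
        X, h, tau,
        m0=lambda age: 5.0 + 0.8 * age,
        m1=lambda age: 3.0 + 0.3 * age,
        R1=60.0, S0=lambda age: 0.0, S1=lambda age: 0.0, nu=nu,
    )
    for X in range(0, h + 1)
}
X_star = min(costs, key=costs.get)
print(X_star, round(costs[X_star], 2))
```

Because the case X = h is handled by a separate branch, the discontinuity at the horizon end is compared with the other candidates on the same footing.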

12.3.4 A Modified Two-cycle Model

It is interesting to consider the behaviour of the models at Equations 12.2 and 12.3 when the operating costs are constant (or increasing only slowly), since it is not unusual for plant to age only slowly. Of course, replacement of an existing asset in these circumstances would only be contemplated if the operating cost of the new asset is significantly lower (or its functionality significantly higher), e.g. electricity supply network components; see Brint et al. (1998). The behaviour is simplest to follow for the continuous time formulation when the discount factor is unity (no discounting) and residual values are zero. Under these circumstances, the costs per unit time (annuity) for the two-cycle model and the fixed horizon model become

\[ c^{2}_{\mathrm{rent}}(K, L) = (K m_0 + R + L m_1 + R)/(K + L), \tag{12.4} \]

and

\[ c^{h}_{\mathrm{rent}}(X) = \begin{cases} [X m_0 + R + (h - X) m_1]/h, & X < h, \\ m_0, & X = h, \end{cases} \]

respectively. From Equation 12.4 we get

\[ dc^{2}_{\mathrm{rent}}(K, L)/dK = [L(m_0 - m_1) - 2R]/(K + L)^{2}. \]

Thus there is no K such that $dc^{2}_{\mathrm{rent}}(K, L)/dK = 0$, but $dc^{2}_{\mathrm{rent}}(K, L)/dK > 0$, and hence K* = 0, if $L(m_0 - m_1) > 2R$ for any fixed L. Furthermore,

\[ dc^{2}_{\mathrm{rent}}(K, L)/dL = [K(m_1 - m_0) - 2R]/(K + L)^{2}, \]

and so if $m_0 > m_1$ (which is a necessary condition for $dc^{2}_{\mathrm{rent}}(K, L)/dK > 0$) then $dc^{2}_{\mathrm{rent}}(K, L)/dL < 0$ and so L* does not exist. However, in any practical implementation of the two-cycle model, one would bound L with some upper value $l_{\max}$, say. Then the optimal policy would be K* = 0 (L* = $l_{\max}$) if

\[ l_{\max}(m_0 - m_1) > 2R. \tag{12.5} \]

We can consider a similar argument for the fixed horizon model. Thus, $dc^{h}_{\mathrm{rent}}(X)/dX = (m_0 - m_1)/h$ for X < h, and so $dc^{h}_{\mathrm{rent}}(X)/dX > 0$ if $m_0 > m_1$. However, since $c^{h}_{\mathrm{rent}}(X)$ has a discontinuity at X = h, X* = 0 is optimal only if $m_0 > m_1$ and $c^{h}_{\mathrm{rent}}(0) < c^{h}_{\mathrm{rent}}(h)$. That is, if $(R + h m_1)/h < m_0$, that is, if

\[ h(m_0 - m_1) > R. \tag{12.6} \]
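The two conditions at Equations 12.5 and 12.6 can point to different decisions for the same costs. The sketch below checks this numerically for the constant-cost, undiscounted case; the values of m0, m1, R, the bound l_max = h = 20 and the search limit K_max are hypothetical and were picked so that R < l_max(m0 - m1) < 2R.

```python
def two_cycle_rent(K, L, m0, m1, R):
    """Equation 12.4: constant operating costs, no discounting, two replacements."""
    return (K * m0 + R + L * m1 + R) / (K + L)


def fixed_horizon_rent_const(X, h, m0, m1, R):
    """Constant-cost fixed horizon annuity; X == h means no replacement."""
    if X == h:
        return m0
    return (X * m0 + R + (h - X) * m1) / h


# Hypothetical values chosen so that R < l_max * (m0 - m1) < 2R.
m0, m1, R = 10.0, 7.0, 40.0
l_max = h = 20
K_max = 20  # hypothetical upper limit of the search over K

K_star = min(range(0, K_max + 1), key=lambda K: two_cycle_rent(K, l_max, m0, m1, R))
X_star = min(range(0, h + 1), key=lambda X: fixed_horizon_rent_const(X, h, m0, m1, R))

print("two-cycle: K* =", K_star)      # inequality 12.5 fails, so replacement is delayed
print("fixed horizon: X* =", X_star)  # inequality 12.6 holds, so replace immediately
```

In this example the saving in operating cost over the horizon, 20 × 3 = 60, exceeds one replacement cost but not two, so the fixed horizon model favours immediate replacement while the two-cycle model pushes K to the edge of the search range.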

Thus, comparison of the inequalities at Equations 12.5 and 12.6 shows that the two models have different properties in terms of the behaviour of optimal policy as a function of cost parameters. Thus the two-cycle model is inconsistent with standard financial models. However, a simple modification to the model will correct this inconsistency. Scarf et al. (2006) suggest simply omitting the replacement at the end of the second cycle. For the constant revenue cost case above, the rent becomes $c^{2}_{\mathrm{rent}}(K, L) = (K m_0 + R + L m_1)/(K + L)$ and optimal policy would be K* = 0 (L* = $l_{\max}$) if $l_{\max}(m_0 - m_1) > R$, which is consistent with the fixed horizon model and hence with standard financial appraisal models.

However, it would appear that the two-cycle model with its two replacements (at t = K and at t = K + L) is applicable for the case of increasing operating costs and that a modified two-cycle model with one replacement (at t = K only) is applicable for operating costs that are constant or increasing only slowly. This issue can be resolved. When operating costs are increasing only slowly, typically L* does not exist, and in practice L must be constrained such that $L \le l_{\max}$ (as pointed out above) since numerically we can only search for L* over a finite space. In constraining $L \le l_{\max}$ under the two replacements formulation, we impose a replacement at $l_{\max}$ when in fact there should not be a second replacement since L* does not exist. This then suggests that the two-cycle replacement model should be modified in the following subtle way: if there does not exist an L such that $c^{2}_{\mathrm{rent}}(K, L)$ has a minimum strictly within the search space, that is, within $\{(K, L) : 0 < K < K_{\max},\ 0 < L < l_{\max}\}$, then, when determining that K which minimises $c^{2}_{\mathrm{rent}}(K, l_{\max})$, no replacement cost should be incurred at $t = K + l_{\max}$. Thus the model should be modified so that there is only one replacement. Otherwise the “cost hurdle” for replacement of the current asset will be set artificially high (inequality at Equation 12.5). Thus, in all practical situations for which operating costs are increasing only slowly, one should use this modified two-cycle model or the fixed horizon model as a special case.

12.3.5 Discussion of Finite Horizon Replacement Models

Using the fixed horizon model or equivalently using the modified two-cycle model with a finite search space may lead to significant end-of-horizon effects (since costs beyond the horizon-end are ignored). Thus time to first replacement will depend on h (or equivalently $l_{\max}$). Choice of h (or $l_{\max}$) will need to be considered carefully; in practice the horizon may be specified by company policy on accounting methods and discounting may reduce those costs incurred in the distant


future to a small or insignificant level. Furthermore, specification of the residual value may be problematic, particularly for non-movable assets with either constant or slowly changing operating revenue. This is because the market resale value of the asset is arguably zero. However, the residual value, as measured by the benefit of the function the asset performs rather than its value if sold, may be non-zero. In this case company policy may prescribe a “straight line” depreciation so that the residual value is proportional to the estimated asset life fraction remaining at replacement or horizon end. However, such an approach may be difficult to justify since the asset life is unknown and linearity is a strong assumption. One possible approach here would be to look at sensitivity to the parameters in a residual value model such as this but there would be a number of parameters and this may become over-complex. An alternative would be to equate the residual value at the horizon-end to the cost-benefit of the replacement (whenever it took place) over the next m years. But this then amounts to extending the planning horizon from K+L to K+L+m or from h to h+m. This of course will lead to models the same as those considered at present but with longer horizons (or to a three cycle model if a subsequent refurbishment is also considered). Thus if one accepts that a two-cycle model is sufficient for modelling purposes, then, logically, consideration of residual values for a non-movable asset amounts to considering sensitivity to horizon length (either h or K+L whichever model is used). Restricting replacement models to (at most) two cycles may lead to sub-optimal replacement policies. However, for typical discount rates and planning horizons the modelling of operation and replacement beyond the end of the second cycle may have only a small effect on the time to first replacement, the issue of principal interest in practice. 
Replacement models that are not restricted to (at most) two replacement cycles are considered in Section 12.5.

12.3.6 An Application of Finite Horizon Replacement Models

Example 12.1

Decision making regarding the replacement of escalators on a mass transit rail system in a particular city has been considered over a number of years by the corporation that owns and operates the system (Scarf et al. 2006). Maintenance of escalators is generally outsourced to equipment suppliers due to the difficulty that alternative contractors have in obtaining proprietary spares. The original manufacturers can keep costs down as a result of the economy of scale that is achievable through maintaining equipment over a large number of client organisations. Currently, the corporation operates of the order of 600 escalators and the annual maintenance contract price is over $10 million. Escalator replacement is therefore a significant issue within the organisation. Studies by the corporation suggest that the economic life of escalators is of the order of 25 years but that, based on overseas experience, escalator life can be extended to up to 40 years. However, given the size of the fleet, a strategy has to be set to manage escalator maintenance and to deal with the replacement or refurbishment of older escalator assets. A key factor in this strategy is the approach of the organisation to the re-negotiation of maintenance contracts and in particular to


determine the scale of refurbishment of older assets and the level of major parts replacement and supply within the negotiated contract. For the presentation of the modelling work in this example, it is necessary to consider the asset management options open to the corporation in a simple manner, and a homogeneous sub-fleet of the escalators is considered, with modelling carried out for a typical escalator; this is a reasonable simplification since all escalators in the sub-fleet were installed at approximately the same time. For this group, replacement, although crudely costed, was not really a viable option: economic costs were too high and disruption unacceptable given the duration of replacement work. Refurbishment by the original manufacturer (replacing worn parts and upgrading the control system and maintenance access) was being carefully considered by the corporation as a viable strategy for managing the asset life. Cost savings could be achieved through a reduction in the annual maintenance contract price subsequent to refurbishment. Thus, put simply, for the escalator group, the corporation was faced with the decision: continue with the current relatively higher-priced maintenance contract or refurbish and benefit from a new relatively lower-priced maintenance contract. Other benefits would also accrue from refurbishment for both contractor and the corporation. For the contractor, improved access and safety for maintenance was part of the refurbishment package. For the corporation, upgrade of the control system would result in fewer unplanned escalator stoppages. We consider four asset management options: “do nothing”, continue with the high-price maintenance contract; “refurb”, renew worn parts, retro-fit a new control system and proceed with the lower-price maintenance contract; “delay refurb”, delay refurbishment for up to n years; and “replace”, a full replacement option with nominal costs included for comparison purposes.
The cost of refurbishment (per escalator) in the present study was obtained from initial quotations from the respective manufacturers and is $63K. On-going annual maintenance contract costs (per escalator) are: $9K pre-refurbishment; $7K post-refurbishment. Prior to refurbishment the cost of replacement of major parts is in addition to the annual maintenance contract and major parts are replaced on the basis of condition. Post-refurbishment, the annual maintenance contract includes replacement of major parts at no extra cost. Given that we might expect major parts to be replaced somewhat less frequently than dictated by their recommended lives, we introduce a cost parameter to model such life-extension; this is called the effective life factor, ρ. A value ρ = 1 implies that major parts are replaced at a frequency corresponding to their recommended life (for example, once every 25 years for the steps at a cost of $48K), and the replacement frequency is proportional to 1/ρ (ρ = 2 implies replacement of steps once every 50 years). The cost of a replacement ($170K) is a nominal figure and used mainly for crude comparison with refurbishment. In practice, replacement may cost significantly more than this. The corporation recommends a discount rate of r = 0.11 and a projected inflation rate of i = 0.05. This corresponds to an effective discount rate of approximately 0.057, that is, an effective discount factor ν = (1 + i)/(1 + r) ≈ 0.946. Integral to the refurbishment option is the up-grading of the escalator control system to allow “power-dip ride-through”; this facility prevents unnecessary emergency stops caused by momentary power loss that can cause injuries to passengers. However, the effectiveness of the “ride-through” facility is uncertain; hence we introduce another cost parameter, control system


retro-fit effectiveness, which represents the percentage of passenger injuries due to power dips that would be prevented by up-grading of the control system at refurbishment. Also, for the purposes of sensitivity analysis it is necessary to place a cost on an emergency stop due to a power dip. We call this the penalty cost of failure. Historic records of the number of such stops (approximately 0.5 stops per escalator per year) and the retro-fit effectiveness are used to calculate a total penalty cost (saving) per escalator per year post-refurbishment. Another parameter that is difficult to quantify was also considered, the passenger delay cost due to refurbishment, but the results are omitted here. In Table 12.1, we look at annuities for the modified two-cycle model and the fixed horizon model for a range values of the respective decision variables. Note that the annuities for the fixed horizon model lie on the diagonal indicated. This is because the fixed horizon model is equivalent to the modified two-cycle model with the additional constraint that K+L= h. These annuities are also presented in Figure 12.1a. Figure 12.1b shows the annuities for the two models in the case of no discounting—discounting has the effect of slightly extending the economic life (since the NPV of future costs are reduced) and this accounts for the small difference in optimum policy between the fixed horizon model and the modified two-cycle model in Figure 12.1(a), X* = 1, K* = 4 (years). Table 12.1. Annuities ($000s per escalator per year) escalator for modified two-cycle model with refurbishment at K years from now and again after a further L years. Annuities for fixed horizon model with h = 22 years highlighted, except for X* = 22 (no replacement) for which annuity = $139.4K. 
Cost parameters as follows: refurbishment cost, $62.9K; effective discount factor, 0.06 (equivalent to inflation rate of approximately 0.05 and discount rate of 0.11); penalty cost of failure, $5K; effective life parameter, 1.5; control system retro-fit effectiveness, 75%; cost of refurbishment delay, $10K; annual maintenance contract prerefurbishment, $8.8K (per escalator); annual maintenance contract post-refurbishment, $6.9K (per escalator).

Columns: K, length of the first cycle, years. Rows: L, length of the second cycle, years. Entries marked * lie on the fixed horizon diagonal K + L = 22.

L \ K       1       3       5       7       9      11      13      15      17      19      21
    1   491.8   302.4   241.1   211.6   193.9   182.2   173.9   167.8   163.1   159.4   156.4*
    3   298.0   236.1   206.8   190.2   179.3   171.6   165.9   161.5   158.0   155.3*  153.0
    5   233.0   202.7   186.2   176.1   169.0   163.7   159.7   156.5   154.0*  151.9   150.2
    7   200.3   182.8   172.5   166.1   161.4   157.7   154.9   152.6*  150.7   149.1   147.8
    9   180.7   169.6   162.9   158.7   155.5   153.0   151.0*  149.3   148.0   146.8   145.9
   11   167.8   160.2   155.8   153.1   150.9   149.2*  147.8   146.7   145.7   144.9   144.2
   13   158.6   153.3   150.3   148.6   147.2*  146.1   145.2   144.4   143.8   143.2   142.8
   15   151.8   148.0   146.0   145.0*  144.2   143.6   143.0   142.5   142.1   141.8   141.5
   17   146.6   143.8   142.5*  142.1   141.7   141.4   141.2   140.9   140.8   140.6   140.4
   19   142.5   140.5*  139.7   139.7   139.6   139.6   139.6   139.6   139.6   139.5   139.5
   21   139.2*  137.8   137.4   137.7   137.9   138.1   138.3   138.4   138.5   138.6   138.7
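The relationship between the two models can be checked directly from the tabulated values. The following is a minimal sketch, not part of the original study: it transcribes Table 12.1 into a Python list (the names `CYCLES`, `ANNUITY` and `best_policy` are illustrative) and searches either the whole grid (modified two-cycle model) or only the cells with K + L = h (fixed horizon model).

```python
# Annuities transcribed from Table 12.1 ($000s per escalator per year).
# Row r holds L = CYCLES[r]; column c holds K = CYCLES[c].
CYCLES = list(range(1, 22, 2))  # 1, 3, ..., 21 years
ANNUITY = [
    [491.8, 302.4, 241.1, 211.6, 193.9, 182.2, 173.9, 167.8, 163.1, 159.4, 156.4],
    [298.0, 236.1, 206.8, 190.2, 179.3, 171.6, 165.9, 161.5, 158.0, 155.3, 153.0],
    [233.0, 202.7, 186.2, 176.1, 169.0, 163.7, 159.7, 156.5, 154.0, 151.9, 150.2],
    [200.3, 182.8, 172.5, 166.1, 161.4, 157.7, 154.9, 152.6, 150.7, 149.1, 147.8],
    [180.7, 169.6, 162.9, 158.7, 155.5, 153.0, 151.0, 149.3, 148.0, 146.8, 145.9],
    [167.8, 160.2, 155.8, 153.1, 150.9, 149.2, 147.8, 146.7, 145.7, 144.9, 144.2],
    [158.6, 153.3, 150.3, 148.6, 147.2, 146.1, 145.2, 144.4, 143.8, 143.2, 142.8],
    [151.8, 148.0, 146.0, 145.0, 144.2, 143.6, 143.0, 142.5, 142.1, 141.8, 141.5],
    [146.6, 143.8, 142.5, 142.1, 141.7, 141.4, 141.2, 140.9, 140.8, 140.6, 140.4],
    [142.5, 140.5, 139.7, 139.7, 139.6, 139.6, 139.6, 139.6, 139.6, 139.5, 139.5],
    [139.2, 137.8, 137.4, 137.7, 137.9, 138.1, 138.3, 138.4, 138.5, 138.6, 138.7],
]


def best_policy(h=None):
    """Return (annuity, K, L) with the smallest tabulated annuity.

    With h=None the whole grid is searched (modified two-cycle model);
    with h given, only cells with K + L == h are searched (fixed horizon).
    """
    cells = (
        (ANNUITY[r][c], K, L)
        for r, L in enumerate(CYCLES)
        for c, K in enumerate(CYCLES)
        if h is None or K + L == h
    )
    return min(cells)


print(best_policy())      # (137.4, 5, 21)
print(best_policy(h=22))  # (139.2, 1, 21)
```

Because the table lists odd values of K and L only, the grid minimum at K = 5 brackets rather than reproduces the reported K* = 4; the diagonal minimum at K = 1 agrees with X* = 1.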

The cost parameters in Table 12.1 and Figure 12.1 are held at intermediate values. In Figure 12.2, we present annuities for a number of “replacement” options as a function of each of the cost parameters. These replacement options correspond to those considered by the corporation, with “refurb” referring to immediate refurbishment (in year 1), and “delay refurb” referring to refurbishment in year 10 (from time of study). Given the size of the fleet, a constraint on the number of escalators that can be refurbished at any one time, and the duration of refurbishment, we would expect the refurbishment programme to last some 15 years, so a significant proportion of the fleet would experience this kind of delay prior to refurbishment. We therefore include it as a particular policy for indicative purposes. We use the fixed horizon model here because one would wish to compare the cost of different options over the same horizon. Equivalently, we could use the modified two-cycle model with the additional constraint K + L = h = 22 (years), say.

[Figure 12.1: two panels, a and b, each plotting annuity per escalator (HK$K) against K (years), with curves for L = 10, 12, 14, 16, 18 and 21 and for the fixed horizon model.]

Figure 12.1a,b. Annuities ($000s per escalator per year) for modified two-cycle model with refurbishment at K years from now and operation for a further L years. Annuities for fixed horizon model with h = 22 years also shown (X < 22: bold, solid curve; X = 22: ■). Cost parameters as Table 12.1 except: a effective discount factor equals 0.06 (equivalent to inflation rate of approximately 0.05 and discount rate of 0.11); b no discounting.

From Figure 12.2 we can see that the optimum policy is certainly sensitive to these cost factors, with the influence of each cost parameter as expected. Threshold values that lead to a step-change in the optimum policy (option) can be observed from these figures. Thus while estimation of the penalty cost of failure, for example, may be difficult and contentious, the importance of its effect can be observed. This may then provide an incentive for further investigation of this parameter or discussion about whether its true value is above or below the threshold of policy change. As a final note for the escalator replacement problem in particular, one could argue that the cost of differing options or policies will reflect the maintenance contractor’s profit requirement, whatever the details of the arrangement, and therefore the total costs of the options would be expected to vary very little. What can differ, however, is the risk: some options may lead to lower risk (for example, where the contractor bears the cost of major parts’ wear-out, which may be subject to significant uncertainty), and lower risk is certainly desirable from the point of view of the operator.


12.3.7 A Model for an Inhomogeneous Fleet


Consider now a fleet consisting of sub-fleets classified on the basis of class (e.g. vehicle-type) and age (or condition) so that the operator of the fleet is concerned with the replacement of sub-fleets, and not with replacement of individual equipment or with replacement of the entire fleet. For this fleet, it is natural to focus on the replacement of particular sub-fleet(s). The economic life models of the previous section must be extended given that the replacement of particular sub-fleets has cost implications for the rest of the fleet.

[Figure 12.2: four panels, a to d, each plotting annuity per escalator (HK$000s) for the options refurb, delay refurb, do nothing and replace, against: a effective life factor; b penalty cost of failure (HK$K); c discount rate; d horizon (years).]

Figure 12.2a–d. Annuities (per escalator) as a function of cost parameters for fixed horizon model, with h = 22 years for various refurbishment/replacement options: a annuity vs. effective life parameter; b annuity vs. penalty cost of failure; c annuity vs. nominal discount rate; d annuity vs. horizon length h. Cost parameter values when not varying set at: effective life, 1.5; penalty cost of failure, $5K; nominal discount rate, 0.11; control system retro-fit effectiveness, 75%; refurbishment delay cost, $10K.


Considering a “rolling schedule” of replacements, the questions of interest are then: which sub-fleet should be replaced first (second, third); when should each be replaced; and what model (equipment specification) should be purchased at each replacement? We call the order in which the sub-fleets are replaced the replacement schedule. We call a particular replacement schedule, along with the time scale for replacements and the choice of model for purchase, a replacement policy. It is expected that the operator would have significant input into the choice of sub-fleets to be replaced, based on experience with the fleet. The choice of model for purchase at replacements is also likely to be decided in advance by the operator. This would give the operator a level of control over the modelling process, which is highly desirable in practice (Russell 1982).

The purpose of the modelling is therefore to provide decision support for the operator on: (i) the cost implications of alternative replacement schedules; (ii) the time scale for replacements and budget requirements; (iii) the cost implications of particular sub-optimal policies necessitated by technological obsolescence or changing economic considerations (e.g. Suzuki and Pautsch 2005). The most important considerations are the choice of sub-fleet for first replacement and the time to first replacement. It is expected that the optimal policy will be updated periodically, as the fleet evolves, and when new information about maintenance costs and new (equipment) models becomes available.

To model the replacement scenario described, we consider an extension of the simple fixed horizon model (Equation 12.3) in which we have a variable number, N, of replacement cycles. Let the inhomogeneous fleet comprise r sub-fleets, with the current sub-fleets indexed by k = 1,...,r. New replacement sub-fleets are indexed by k = r+1,...,r+N.
For a fixed planning horizon of length h, and a given replacement schedule and choice of class for the replacement sub-fleets, the decision variables are then: the number of cycles, N (> 1); and the time from the beginning of the i-th cycle to the replacement of sub-fleet i, $L_i$ (i = 1,...,N). The whole fleet is operated over cycle i, which ends with the resale of sub-fleet i of size $n_i$ and purchase of sub-fleet r+i of size $n_{r+i}$. Sub-fleets need not be homogeneous and the current ages of plant are denoted by $\tau_{ij}$ (i = 1,...,r+N). The fleet size may be constant ($n_i = n_{r+i}$ for all i) or variable. For a given replacement schedule, the total discounted cost over the horizon can be expressed as

$$c_{tdc}(N, L; h) = \sum_{i=1}^{N} \nu^{t_i} \Big\{ \sum_{s=t_{i-1}+1}^{t_i} m_i(s)\,\nu^{s-t_i} + n_{r+i} R_{r+i} - S_i(t_i) \Big\} \tag{12.7}$$

where $t_i = \sum_{j=0}^{i-1} L_j$ with $L_0 = 0$. Here $m_i(\cdot)$ is the age related operating cost of the whole fleet in cycle i; $S_i(\cdot)$ is the age related resale value of plant in sub-fleet i; $R_{r+i}$ is the cost of each replacement plant in sub-fleet r+i; and $\nu$ is the discount rate. The costs $m_i(\cdot)$ and $S_i(\cdot)$ may be expressed as

$$m_i(s) = \sum_{k=i}^{r+i-1} \sum_{j=1}^{n_k} M_k(\tau_{kj} + s), \quad (i = 1,\ldots,N),$$

$$S_i(L_i) = \sum_{j=1}^{n_i} S_{i1}(\tau_{ij} + L_i), \quad (i = 1,\ldots,N),$$


where $M_k(\cdot)$ is the age related operating cost per unit time for an individual plant in sub-fleet k (k = 1,...,r+N), and $S_{i1}(\cdot)$ is the age related resale value for individual plant in sub-fleet i. (Also, $\tau_{kj} = 0$ for k > r.) Appropriate penalty costs, associated with failures, may be incorporated into the operating costs. The annuity, $c_{tdc}(N, L; h) / \sum_{i=1}^{h} \nu^i$, or other suitable objective function may then be minimized subject to the constraint $\sum_{i=1}^{N} L_i = h$. Technological change is allowed for in that costs relating to proposed plant for cycles 2,...,N may be assigned as appropriate.

The optimum replacement schedule may be obtained by minimizing the objective function over all possible schedules. In practice the range of possible schedules would be narrowed greatly by the experience of the operator. However, as the decision-maker will not have a firm value for the horizon length, the optimum policy must be robust to variation in h. Furthermore, because the fleet is mixed, both different replacement schedules and different planning horizon lengths will give rise to different age compositions of the fleet at the end of the horizon. Thus replacement policies may need to be compared not just on the basis of cost but also on the basis of the age composition of the fleet at the end of the planning horizon. This final age composition can be considered as quantifying the end-of-horizon effect.

Non-uniform usage, particularly between sub-fleets, may be allowed for by varying the fleet size at replacements. For example, if older plant are under-utilized, a smaller number of new plant would be required to meet the demand currently placed on an older sub-fleet. This effectively reduces the replacement cost for that sub-fleet by a factor equal to the ratio of the utilization of the old to the new sub-fleet. Of course, other more complex methods of accounting for differing usage may be considered.
Given sufficient data, operating costs could be quantified in terms of usage and optimum policy may be obtained given forecasts for usage of sub-fleets over the planning horizon.

The models may be extended to the case in which sub-fleets are retired as spares. The number of sub-fleets would simply increase by one at each replacement, with the costs associated with the retired sub-fleet added. Predicting operating costs for a retired sub-fleet would be difficult, however, as it is likely that no data would be available for this. Also it is assumed that equipment is bought new: in principle it is a simple matter to extend Equation 12.7 to the case in which used equipment may be purchased. Note that the formulation as presented allows for the possibility of a sub-fleet composed of a single unit of equipment. This may be appropriate if the fleet is small. The complexity of the computational problem increases rapidly as the number of sub-fleets increases. However, we do not consider efficient algorithms for determining optimum policy here.

12.3.8 Application of a Replacement Model for an Inhomogeneous Fleet

Example 12.2 Scarf and Hashem (1997) consider the inter-city coach fleet operated by Express National Berhad in Malaysia. The fleet comprised 160 vehicles of 5 vehicle-types of varying ages, with maintenance cost modelled as $M(\tau) = a\tau^b$ and resale values $S(\tau) = 0.6R(0.81)^\tau$, for replacement cost R (Table 12.2). The data available were not sufficient for obtaining the maintenance cost model for all vehicle-types; for example, for the MAN, only data relating to their first year of operation were available. Furthermore, for older vehicles the costs appeared to be decreasing. This could perhaps be put down to under-utilization (partial retirement) and also neglect of vehicles reaching the end of their useful life. It was therefore necessary to pool the data to obtain reasonable cost models. The fitted maintenance cost models for the Cummins, Isuzu CJR and MAN were obtained by first fitting an overall cost model to data on vehicles up to eight years old, and then scaling this model to the costs of the individual vehicle-types in the manner described in Christer (1988). The costs for the older sub-fleets, the Mitsubishi and Isuzu CSA, were taken as constant. Penalty costs for breakdowns on the road were also modelled; see Scarf and Hashem (1997) for a full discussion of this.

It was known that the Mitsubishi and Isuzu sub-fleets were in partial retirement and candidates for immediate replacement, capital expenditure permitting. The usage of sub-fleets was unknown, although with a daily requirement for 125 vehicles, it was reasonable to suppose that the usage level for the Mitsubishi and Isuzu sub-fleets was about half that of the other newer sub-fleets. This assumption led to the null “optimal policy” (replace the Mitsubishi and Isuzu CSA sub-fleets as soon as possible), which is uninteresting from a model validation point of view. Therefore, in order to illustrate the replacement model, we consider the following subproblem in detail: investigate replacement policy for the “fleet” comprising the Cummins, Isuzu CJR and MAN, assuming a fixed fleet size (93 vehicles) and uniform usage.

Table 12.2. Fleet composition by vehicle type showing purchase cost, R, maintenance cost parameters and age distribution at time of replacement study

Vehicle type    R, M$000s       a       b
Isuzu CSA             750    55.6       0
Mitsubishi            800    57.8       0
Cummins               500    24.7    0.72
Isuzu CJR             300    11.1    0.72
MAN                   450    18.4    0.72

Age distribution: number of vehicles in each age group (2 year intervals: 2–3, 4–5, 6–7, 8–9, 10–11, >12) [...]

[...] C (> 0) be the capital cost of project P. Assume income cashflows are negative and expenditure cashflows are positive, and that all cashflows are incurred at the year end and discounted at rate $\nu$. If project P is released in year x from now then the total cashflow over h years from now will be



$$\sum_{t=1}^{x-1} f_{t0}\,\nu^t + \nu^x \Big( \sum_{t=0}^{h-x} f_{t1}\,\nu^t + C \Big). \tag{12.8}$$

If project P is not released then the cashflow over the horizon will be $\sum_{t=1}^{h} f_{t0}\,\nu^t$. Define the “gain” from releasing project P in year x to be the difference between these cashflows:

$$g_P(x; h) = \sum_{t=1}^{h} f_{t0}\,\nu^t - \Big[ \sum_{t=1}^{x-1} f_{t0}\,\nu^t + \nu^x \Big\{ \sum_{t=0}^{h-x} f_{t1}\,\nu^t + C \Big\} \Big]$$

$$= \sum_{t=x}^{h} ( f_{t0} - f_{t-x,1} )\,\nu^t - \nu^x C.$$

Release of project P in year x (x ≤ h) is worthwhile if $g_P(x; h) > 0$. We can optimize project release by choosing that x for which the gain is a maximum and positive. If $g_P(x; h) \le 0$ for all x = 1,...,h then release of project P will not be recommended (over the horizon). The consequences of project release may also be measured in terms of performance (Scarf and Martin, 2001).

For a large system comprising many potential projects, the outcome of this modelling approach will typically be a list of projects that should be released immediately, and a list of those that should not. Of course, the release of projects will, in both cases, be limited by the budget for capital expenditure.

For prioritizing project release, we can consider all projects over the horizon (0, h). Let $g_{ij}(h)$ be the gain when project i is released in year j (1 ≤ j ≤ h), and suppose that, in year j, n projects have positive gains. These projects might be listed in order of magnitude of their gains. They might also be listed in order of magnitude of the “profitability” index, $g_{ij}(h) / C_i$, where $C_i$ is the capital cost of project i. If rational decision criteria are to be used to determine policy then projects at the top of the list should be given priority, since they would be associated with the largest expected gain over the planning horizon. Under capital rationing with a fixed budget, this project priority list would indicate which projects can be released in the current year.

For consequences considered in cashflow terms, appropriate discount rates may be chosen to reflect the investment risks of projects. A higher required rate of return (smaller discount rate) might be imposed on network expansion projects than on replacement of existing assets. Also, factors other than cost may impact on decisions: safety-related projects or projects with high customer benefit may take priority.
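The gain calculation can be illustrated with a short script. This is a minimal sketch under stated assumptions rather than the authors' implementation: `f0[t]` is read as the cashflow in year t if the project is not released and `f1[t]` as the cashflow t years after release, and all cashflow values, the capital cost and the discount factor are hypothetical. The script evaluates the gain both as the difference of the two cashflows and in its simplified form, and picks the release year with the largest positive gain.

```python
def gain(f0, f1, C, nu, x, h):
    """Gain g_P(x; h): no-release cashflow minus the cashflow of Equation 12.8."""
    no_release = sum(f0[t] * nu**t for t in range(1, h + 1))
    before = sum(f0[t] * nu**t for t in range(1, x))
    after = nu**x * (sum(f1[t] * nu**t for t in range(h - x + 1)) + C)
    return no_release - (before + after)


def gain_simplified(f0, f1, C, nu, x, h):
    """Simplified form: sum over t = x..h of (f_t0 - f_{t-x,1}) nu^t, minus nu^x C."""
    return sum((f0[t] - f1[t - x]) * nu**t for t in range(x, h + 1)) - nu**x * C


def best_release_year(f0, f1, C, nu, h):
    """Return (x, gain) for the year with the largest positive gain, else None."""
    best_gain, best_x = max((gain(f0, f1, C, nu, x, h), x) for x in range(1, h + 1))
    return (best_x, best_gain) if best_gain > 0 else None


# Hypothetical data (expenditure cashflows are positive, as in the text).
h, nu, C = 10, 0.9, 40.0
f0 = [0.0] + [20.0 + 3.0 * t for t in range(1, h + 1)]  # f0[t], t = 1..h; index 0 unused
f1 = [10.0 + 1.0 * t for t in range(h)]                 # f1[t], t = 0..h-1

print(best_release_year(f0, f1, C, nu, h))
```

Returning `None` when no year has a positive gain mirrors the rule that release is then not recommended over the horizon.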
A capital rationing model to prioritize project release over k years (k ≤ h) may be formulated as a linear program (LP), similar to that proposed in vehicle fleet management by Karabakal et al. (1994). Suppose the capital investment budget for year j is $B_j$ (j = 1,...,h). Introduce the indicator variable $x_{ij}$, which takes the value 1 if project i is released in year j and 0 otherwise. Then seek those values of $x_{ij}$ (i = 1,...,n; j = 1,...,k) which maximize the total gain over (0, h) of all projects released, subject to the constraint that the capital investment budget is not exceeded in each year. That is,

maximize

$$\sum_{i=1}^{n} \sum_{j=1}^{k} x_{ij}\, g_{ij}(h)$$

subject to

$$\sum_{i=1}^{n} x_{ij} C_i \le B_j \quad \text{for all } j = 1,\ldots,k; \tag{12.9}$$

$$\sum_{j=1}^{k} x_{ij} \le 1 \quad \text{for all } i = 1,\ldots,n; \tag{12.10}$$

$$x_{ij} \in \{0, 1\}.$$
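For a handful of projects, the formulation above can be checked by exhaustive enumeration. The sketch below is illustrative only: it is not an LP solver and is not the method of Karabakal et al. (1994); the function name `capital_rationing` and all gains, costs and budgets are hypothetical. Each project is assigned either a release year or no release, which enforces the at-most-once constraint by construction, and plans that exceed any yearly budget are discarded.

```python
from itertools import product


def capital_rationing(gains, costs, budgets):
    """Brute-force search for the release plan with the largest total gain.

    gains[i][j]: gain g_ij(h) if project i is released in year j (0-based j).
    costs[i]:    capital cost C_i of project i.
    budgets[j]:  capital budget B_j for year j.
    Returns (total_gain, plan), where plan[i] is the release year or None.
    """
    n, k = len(costs), len(budgets)
    best_value, best_plan = 0.0, tuple([None] * n)
    # One choice per project (a year or None) means each is released at most once.
    for plan in product([None, *range(k)], repeat=n):
        spend = [0.0] * k
        value = 0.0
        for i, j in enumerate(plan):
            if j is not None:
                spend[j] += costs[i]
                value += gains[i][j]
        within_budget = all(s <= b for s, b in zip(spend, budgets))
        if within_budget and value > best_value:
            best_value, best_plan = value, plan
    return best_value, best_plan


# Hypothetical instance: three projects, two years.
GAINS = [[10.0, 8.0], [7.0, 6.0], [4.0, 5.0]]
COSTS = [5.0, 4.0, 3.0]
BUDGETS = [6.0, 5.0]

print(capital_rationing(GAINS, COSTS, BUDGETS))  # (16.0, (0, 1, None))
```

Enumeration grows as (k + 1) to the power n, so it is only a check for small instances; a realistic network would need an integer programming solver.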

Constraint set 12.9 ensures that the budget for year j is not exceeded. Constraint set 12.10 ensures that project i is released at most once over the planning horizon. Note that if an individual project has negative gain whatever its execution time, then the contribution to the objective function from this project will be greatest when this project is not released over (0, k). Typically such planning may be informative over the planning horizon, but only decisions relating to the immediate future (one to two years) would be acted on. Therefore policy would be continually updated, implying a “rolling horizon” approach.

Where a network consists of many identical components, the modelling of project planning may be extended to the case in which a proportion of “similar” projects are released in a given year. This could be done by formulating the capital rationing model (CRM) as a mixed programming problem.

Consider now dependence between projects. For example, a major expansion project, while not replacing existing assets, may have significant operating cost or performance implications for particular assets: the building of a large ring-main in a water supply network is one such example. Essentially, if two projects $P_1$ and $P_2$ interact in this way, then new projects $P_1' = (P_1, \text{not } P_2)$, $P_2' = (\text{not } P_1, P_2)$, and $P_{12}' = (P_1, P_2)$ would have to be introduced, along with a constraint to ensure that at most one of $P_1'$, $P_2'$, and $P_{12}'$ is released over the planning horizon. While this approach may lead to a significant increase in the number of “projects” in the model, in principle the solution procedure would remain unchanged. The existence of future-cost dependencies between projects would have to be identified by the network owner. This may be extremely difficult in practice. However, such dependency would very much characterize the network replacement problem, and therefore the approach described is an advance over current methods. A similar approach has been taken by Santhanam and Kyparisis (1996) in modelling dependency in the project release of information systems. Capital costs may be considered simply using the concept of shared set-up.

It is possible that it may be optimal to release both $P_1$ and $P_2$ during the planning horizon, but not simultaneously. This presents a more difficult modelling task, unless many pseudo-projects are introduced. For example, we could consider: release $P_1$ at time s and $P_2$ at time t; however for k = 10, say, this would mean the introduction of 25 variables, $x_{(P_1 P_2)(s,t)}$, for the $P_1$, $P_2$ decision alone!


For reasons of budgeting constraints or technical delays, the release of some recommended project at some optimal execution moment x* may not be possible. In such cases, it would be informative to have an indication of the extra cost to be incurred in revenue expenditure because of lack of capital expenditure; this is the marginal increased revenue expenditure due to delayed release. Given the capital rationing model, and focusing on cashflows, the operating cost consequences of capital rationing can be determined by calculating the delay associated with each project as a result of capital rationing. The revenue cost implications due to this delay would, from Equation 12.8, be

$$\delta_P(x^*, x'; h) = \sum_{t=x^*}^{x'-1} f_{t0}\,\nu^t + \sum_{t=0}^{h-x'} f_{t1}\,\nu^{t+x'} - \sum_{t=0}^{h-x^*} f_{t1}\,\nu^{t+x^*},$$





where $x'$ is the execution time for the project under capital rationing. The marginal increase in revenue expenditure would be found by summing over all projects. In a similar manner, the marginal increase in revenue expenditure due to projects delayed in year j could be found by summing over all projects with $x^* = j$, and this measure indicates how much more capital investment would be required to reduce revenue expenditure to the optimum level.

Uncertainty in the cashflow/performance model parameter estimates, reflecting the extent of currently available information about particular components and potential projects, and the extent of technological developments (new materials and techniques), may be propagated through into uncertainty in the gain function, $g(\cdot)$. This would be most easily done using the delta method; see Baker and Scarf (1995) for an example of this in maintenance. The variance of the gain, as well as the expected gain, may then be used to produce the project priority list. Projects for which the expected gain is high and the uncertainty in the gain (variance of the gain) is low are candidates for release; these projects would be viewed as sound investments. Markowitz (1952) is the classic reference here; for a more recent discussion see Booth and King (1998). Also, a real options approach might be taken (e.g. Bowe and Lee 2004).

Where there are no data regarding a potential project, there will be no objective basis for determining if and where the project lies on the project priority list. One possible approach to this problem would be to use data relating to other projects that are similar in design. Also subjective data may be collected, and used to update component data for the whole network in the manner described in O’Hagan (1994) and Goldstein and O’Hagan (1996) in the context of sewer networks. These methods are particularly useful for multi-component systems in which there are only limited data for a limited number of individual components. On the other hand, the income cashflow may be deterministic in some situations. For example, expansion of the network may be initiated by legislation, with the compensation for the investment costs fixed and predetermined per customer connection.
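The delay cost $\delta_P(x^*, x'; h)$ defined earlier in this section can be computed in the same way as the gain. The sketch below is an illustration only: the cashflow series, discount factor and release years are hypothetical, and `f0[t]` and `f1[t]` are again read as the year-t cashflow without the project and the cashflow t years after release. It also checks that the delay cost equals the difference between the operating (non-capital) parts of Equation 12.8 evaluated at the delayed and the optimal release years.

```python
def revenue_cost(f0, f1, nu, x, h):
    """Operating cashflows of Equation 12.8 for release in year x (capital cost excluded)."""
    before = sum(f0[t] * nu**t for t in range(1, x))
    after = sum(f1[t] * nu**(t + x) for t in range(h - x + 1))
    return before + after


def delay_cost(f0, f1, nu, x_opt, x_delayed, h):
    """delta_P(x*, x'; h): extra revenue expenditure when release slips from x* to x'."""
    old_plant = sum(f0[t] * nu**t for t in range(x_opt, x_delayed))
    new_if_delayed = sum(f1[t] * nu**(t + x_delayed) for t in range(h - x_delayed + 1))
    new_if_on_time = sum(f1[t] * nu**(t + x_opt) for t in range(h - x_opt + 1))
    return old_plant + new_if_delayed - new_if_on_time


# Hypothetical data (expenditure cashflows positive).
h, nu = 10, 0.9
f0 = [0.0] + [20.0 + 3.0 * t for t in range(1, h + 1)]  # f0[t], t = 1..h; index 0 unused
f1 = [10.0 + 1.0 * t for t in range(h)]                 # f1[t], t = 0..h-1

print(delay_cost(f0, f1, nu, x_opt=1, x_delayed=4, h=h))
```

Summing `delay_cost` over all delayed projects gives the marginal increase in revenue expenditure described in the text.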


12.5 Dynamic Programming Models

Dynamic programming (DP) is a versatile technique for modelling and solving optimization problems which are sequential in nature. Thus, it is ideally suited to solving capital equipment replacement problems that consider whether plant should be kept or replaced after each period. The use of dynamic programming in equipment replacement analysis is significant because the methodology allows keep/replace decisions to be evaluated after every period. This relaxes the assumption of economic life models that an asset and its replacements are retained for the same length of time over the horizon. Furthermore, dynamic programming allows for general modelling of costs and technological change.

A dynamic program evaluates the transition from an initial state to possible eventual states, determining the optimal path of decisions over time. In our application, this entails evaluating an asset and the periodic decisions to keep or replace that asset over some horizon. This requires the definition of a state for the asset. As there are many possibilities for the definition of a state space, a number of methods have been developed. We highlight these different models in the following sections.

12.5.1 Age Based Model

Bellman (1955) introduced the first dynamic programming model to analyze the equipment replacement problem. In this model, the state of the system is defined as the age of the asset and the decision to be evaluated at each stage is whether to keep or replace the asset. Thus, a solution consists of keep and replace decisions in each period of the horizon.

The dynamic program can be described by the network in Figure 12.3. Each node in the network represents the age of the asset, which is the state of the system, at the end of the period. The states are labelled according to the age of the asset along the y-axis, increasing from 1 to N, the maximum allowable age of the asset (N = 5 in the figure), at the end of the time period, which is labelled on the x-axis from 0 to T, the horizon time (T = 4 in the figure). The arcs connecting the nodes represent keep and replace decisions. An arc representing a “keep” decision (K) connects a state (age) of n to n+1 in consecutive stages (periods), as the asset ages one period. A “replace” decision (R) connects a state of n to a state of 1, as the n-period old asset is salvaged and a new asset is purchased and used for one period. The initial decision is made at time zero, with n = 4 in the figure, and the asset is salvaged at the end of the horizon.

Define $f_t(n)$ as the minimum net present value cost of making optimal keep and replace decisions for an asset of age n at time period t through time period T. Mathematically, we evaluate $f_t(n)$ with the following recursion:

$$f_t(n) = \min \begin{cases} \text{K:} & \alpha \left( C_{t+1}(n+1) + f_{t+1}(n+1) \right) \\ \text{R:} & P_t - S_t(n) + \alpha \left( C_{t+1}(1) + f_{t+1}(1) \right) \end{cases}, \quad n \le N,\ t \le T-1 \tag{12.11}$$

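[Figure 12.3: network with time period (0 to T = 4) on the horizontal axis and asset age (1 to N = 5) on the vertical axis; arcs are labelled K (keep) and R (replace).]

Figure 12.3. Dynamic programming network for an age-based model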

If the n-period old asset is kept (K), the operating and maintenance (O&M) cost $C_{t+1}(n+1)$ is incurred for the asset in the following period. As the asset is age n+1 at the end of the period, $f_{t+1}(n+1)$ defines the costs going forward. (This is why $f_t$ is often referred to as the “cost to go function” in dynamic programming.) If the asset is replaced (R), then a salvage value $S_t(n)$ is received and a purchase price $P_t$ is paid for a new asset. The new asset is utilized for the period as the state transitions to an age of 1, defined by costs $f_{t+1}(1)$ going forward. If the asset reaches the maximum age of N, then only the replace decision is feasible. When the horizon time T is reached, the asset is salvaged such that

$$f_T(n) = -S_T(n) \tag{12.12}$$

Traditionally, the recursion is solved backwards, such that Equation 12.12 is evaluated for each feasible age n. These values are substituted into Equation 12.11 when determining $f_{T-1}(n)$ for each feasible n. This process continues until the value and decision at stage zero (t = 0) are computed, signaling the initial decision in the optimal sequence of decisions over time. Note that O&M costs are paid at the end of the period and thus are discounted by the periodic factor $\alpha$, along with the ensuing state cost $f_{t+1}(\cdot)$. Salvage values and purchase costs are assumed to occur at the beginning of the period. As the recursion works from the horizon time T to time zero, the net present value cost is computed.

Example 12.3 Assume a four-period old asset is owned at time zero, its maximum age is 5, and an asset is required to be in service in each of the next four periods (such that the decisions are represented in Figure 12.3). The purchase price is $50,000 with first year O&M costs of $10,000, increasing 20% per period of use. The salvage value is expected to decline 30% (from the purchase price) after the first year of use and an additional 10% each year thereafter. For simplicity, we assume no technological change and the interest rate is 12% per period.

Table 12.4. Dynamic programming state values $f_t(n)$ for the example problem

t \ n          1          2          3          4          5
    0          -          -          -    $47,996          -
    1    $16,332          -          -          -    $35,602
    2      –$407     $6,292          -          -          -
    3   –$17,411   –$12,455    –$7,353          -          -
    4   –$35,000   –$31,500   –$28,350   –$25,515          -
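The state values in Table 12.4 can be reproduced with a short backward recursion. The following is a minimal sketch of Equations 12.11 and 12.12 using the data of Example 12.3; the function and constant names are illustrative, and the discount factor is taken as 1/1.12 rather than the rounded 0.893.

```python
from functools import lru_cache

P = 50_000.0        # purchase price of a new asset
ALPHA = 1 / 1.12    # one-period discount factor at 12% interest
T, N_MAX = 4, 5     # horizon and maximum allowable asset age


def om_cost(age):
    """O&M cost for the period that ends with the asset at the given age."""
    return 10_000.0 * 1.2 ** (age - 1)


def salvage(age):
    """Salvage value: 30% below the purchase price after one year, then 10% less per year."""
    return 0.7 * P * 0.9 ** (age - 1)


@lru_cache(maxsize=None)
def f(t, n):
    """Return (cost to go, decision) for an asset of age n at time t."""
    if t == T:
        return -salvage(n), "salvage"                      # Equation 12.12
    replace = P - salvage(n) + ALPHA * (om_cost(1) + f(t + 1, 1)[0])
    if n >= N_MAX:
        return replace, "R"                                # keeping is infeasible
    keep = ALPHA * (om_cost(n + 1) + f(t + 1, n + 1)[0])
    return (keep, "K") if keep <= replace else (replace, "R")


cost, decision = f(0, 4)
print(round(cost), decision)  # 47996 R
```

Memoizing on (t, n) means each state is evaluated once, which is the same work as filling the table from the last row upwards.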

Table 12.4 shows the results of solving the dynamic programming algorithm. The values in the final row ($f_4(n)$) are the negative salvage values received for a given asset of age n at that time. To illustrate a calculation at t = 3, consider n = 1. Substituting into Equation 12.11:

$$f_3(1) = \min \begin{cases} \text{K:} & 0.893\,(\$12{,}000 - \$31{,}500) \\ \text{R:} & \$50{,}000 - \$35{,}000 + 0.893\,(\$10{,}000 - \$35{,}000) \end{cases} = -\$17{,}411$$

The recursion continues in this fashion until $f_0(4)$ is evaluated, with the decision to replace the asset immediately with a new asset. This new asset is retained through the horizon. The net present value cost of this sequence of decisions is $47,996.

The benefit of using this model, in addition to allowing for replacements after each period, is that periodic costs are explicitly modelled on each arc in the network. This allows for detailed cost modelling of technological change, as in Regnier et al. (2004), or of those costs associated with after-tax analysis, as in Hartman and Hartman (2001).

A similar line of models has also been developed in which the condition of the asset, not its age, is tracked (e.g. Derman 1963). As opposed to moving from state to state by increasing the age of the asset, there is some probability that the asset will degrade to a lower condition during a period. The work assuming stochastic deterioration has been extended to include technological change (Hopp and Nair 1994) or consider probabilistic utilization (Hartman 2001).

12.5.2 Period Based Model

Wagner (1975) offered an alternative dynamic programming formulation for the equipment replacement problem in which the state of the system is the time period and the decision at each period is the length of time to retain an asset. This model is described in the network in Figure 12.4. The nodes represent the state of the system (time period) and the arcs connecting two nodes represent the decision to keep an asset in service between those time periods.

[Figure 12.4: network with nodes for time periods 0 to 4.]

Figure 12.4. Dynamic programming network for a period-based model

The objective is to find the sequence of service lives that minimizes costs from time 0 through time T. (As previously, T = 4 in the figure.) Assuming costs along an arc connecting node t to node t+n are defined as net present value costs at time t, the optimal sequence of decisions can be determined by solving the following recursion:

$$f(t) = \min_{n \le N,\ t+n \le T} \left\{ c_{tn} + \alpha^n f(t+n) \right\}, \quad t = 0, 1, \ldots, T-1 \tag{12.13}$$

where $c_{tn}$ represents the cost of retaining the asset for n periods from period t. Using our previous notation, $c_{tn}$ is defined as

$$c_{tn} = P_t + \sum_{j=1}^{n} \alpha^j C_{t+j}(j) - \alpha^n S_{t+n}(n) \tag{12.14}$$

This model can be solved similarly to the age-based model, assuming that f(T) = 0 is substituted into Equation 12.13. Note that the network in Figure 12.4 assumes that a new asset is purchased at time 0. To include the option to keep or replace an asset owned at time zero, another set of arcs must be drawn, emanating from node 0, representing the length of time to retain the owned asset with its associated costs. As these arcs parallel those illustrated in Figure 12.4, the higher cost parallel arcs can be deleted, as they will not reside on the optimal path. This can be completed in a pre-processing step, with the recursion ensuing as defined.

Example 12.4 Utilizing the same data from Example 12.3, the network in Figure 12.4 represents the options associated with purchasing a new asset in each period. We would add an arc from node 0 to node 1 to represent the decision to retain the four-period old asset for one additional period (to its maximum feasible age of 5). Table 12.5 provides the net present value costs (at time t) on the arcs from node t to node t+n. The arc from node 0 to 1 represents the cost of retaining the four-period old asset for one period, as this is cheaper than salvaging the used asset and purchasing a new asset for one period of use. The values of $c_{02}$, $c_{03}$, and $c_{04}$ include the revenue received for salvaging the four-period old asset at time zero. With the values in Table 12.5, the dynamic programming recursion in Equation 12.13 can be solved.

Table 12.5. Arc costs for Figure 12.4 using the example data (rows: arrival node t+n; columns: departure node t)

t+n \ t    0          1          2          3
1          -$1,989
2          $17,868    $27,679
3          $33,051    $43,383    $27,679
4          $47,996    $58,566    $43,383    $27,679

To illustrate a calculation, note that f(4) = 0, f(3) = $27,679, and consider t = 2. Substituting into Equation 12.13:

$$f(2) = \min \left\{ \begin{array}{l} \$43{,}383 + \$0 \\ \$27{,}679 + 0.893(\$27{,}679) \end{array} \right\} = \$43{,}383,$$

indicating that it is cheaper to keep the asset for two periods (from the end of period 2 to the end of the horizon) rather than replacing it after one period of use. Continuing in this manner, it is found that f(0) = $47,996, signaling that the four-period old asset should be sold and the new asset should be retained through the horizon. This is the same solution found with Bellman’s model. While this model can be shown to be more computationally efficient than the age-based model, it is the ease with which multiple challengers (as parallel arcs) or technological change can be modelled that has led to numerous extensions in the literature. See Oakford et al. (1984), Bean et al. (1985, 1994), and Hartman and Rogers (2006).

12.5.3 Cumulative-usage Based Model

Recently, Hartman and Murphy (2006) offered a third dynamic programming formulation for the equipment replacement problem following the form of the classical knapsack model. The model determines the number of times an asset is used for a given length of time over some horizon. The dynamic program is described by the network in Figure 12.5. The y-axis defines the periods, 1 through T, while the x-axis identifies the stage in which retaining an asset for a given length of time, 1 through N, is evaluated. In the figure, the order is ages 4, 3, 2, and then 1. Thus, in the first stage of the dynamic program, the number of times to retain an asset for four consecutive periods is analyzed. (For this small example with T = 4, the asset can only be retained for four periods once.) In the second stage, it is evaluated whether an asset should be retained for three periods. In the third stage, it is evaluated whether an asset should be retained for two periods either once (for two periods of total service) or twice (for four periods of total service).

Figure 12.5. Dynamic programming network for a cumulative usage-based model (vertical axis: periods of service, 0 through T = 4; horizontal axis: stages, in the order 4, 3, 2, 1)

A node in the network represents the cumulative service that has been accrued through a given stage. For example, after the first stage in Figure 12.5, either 0 or 4 periods of service have been reached. As the horizon is 4, a solution must ultimately result in 4 periods of service. As with the other dynamic programming models, the goal is to find the minimum cost path from the initial node, representing no service at time zero, to the final node, representing an entire horizon’s worth of service after the final stage. To determine an optimal solution it is assumed that the costs are stationary and the stages (lengths of service) are ordered according to increasing annualized costs. Thus, before the recursion can be solved, the annualized costs of keeping an asset for each possible service life must be computed such that the stages can be ordered accordingly.

Example 12.5 We revisit the previous examples again. From the given costs, the annual equivalent costs (AEC) are computed as given in Table 12.6. For example, retaining the asset for two years is equivalent to a cost of $25,670 per year, assuming a 12 percent interest rate. The net present value (NPV) costs are also given. We restrict the set of decisions to those of a new asset, namely how many to purchase and how long to retain them over the finite horizon.

Table 12.6. Annual equivalent costs of keeping the asset for up to five years

Age    O&M        SV         AEC        NPV
0                 $50,000
1      $10,000    $35,000    $31,000    $27,679
2      $12,000    $31,500    $25,670    $43,383
3      $14,400    $28,350    $24,384    $58,566
4      $17,280    $25,515    $24,202    $73,511
5      $20,736    $22,964    $24,540    $88,462

Given the information in Table 12.6, the stages are ordered according to ages 4, 3, 5, 2, and 1, as the annual equivalent costs increase accordingly. As an asset is only required for four periods, the age 5 cost can be ignored.
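The AEC and NPV columns of Table 12.6, and the resulting stage order, can be reproduced as follows. This is a minimal sketch under the assumptions that the salvage value at age 0 is the purchase price of a new asset and that the AEC is the NPV multiplied by the standard capital recovery factor at 12 percent; the function names are illustrative.

```python
# O&M costs and salvage values by age, as listed in Table 12.6.
P = 50_000.0
OM = {1: 10_000, 2: 12_000, 3: 14_400, 4: 17_280, 5: 20_736}
SV = {1: 35_000, 2: 31_500, 3: 28_350, 4: 25_515, 5: 22_964}
RATE = 0.12
ALPHA = 1 / (1 + RATE)


def npv(n):
    """NPV cost of buying a new asset, keeping it n periods and salvaging it."""
    om = sum(ALPHA ** j * OM[j] for j in range(1, n + 1))
    return P + om - ALPHA ** n * SV[n]


def aec(n):
    """Annual equivalent cost: NPV times the capital recovery factor."""
    growth = (1 + RATE) ** n
    return npv(n) * RATE * growth / (growth - 1)


# Stages are ordered by increasing annual equivalent cost.
stage_order = sorted(range(1, 6), key=aec)

for n in range(1, 6):
    print(n, round(npv(n)), round(aec(n)))
print("stage order:", stage_order)
```

The age with the smallest AEC (here four periods) is the economic life of the asset under these costs, which is why it defines the first stage.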


According to Figure 12.5, an asset can be retained a maximum of one time for four years, at a cost of $73,511. Thus, the states in the first stage and their values are

f1(0) = 0, f1(4) = $73,511.

Similar reasoning defines f2(0) = 0, f2(3) = $58,566, and f2(4) = $73,511. For the third stage, the decisions are more interesting because an asset can be retained for two years twice in the sequence. Thus f3(0) = 0, f3(2) = $43,383, f3(3) = $58,566, and

$$f_3(4) = \min \left\{ \$73{,}511,\; \$43{,}383 + (0.893)^{2}\, \$43{,}383 \right\} = \$73{,}511.$$
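The stage values above, together with the final stage, can be reproduced with a knapsack-style sketch. It is only an illustration and not Hartman and Murphy's exact formulation: it assumes that assets selected in earlier stages are used first in time, so that the cycles added in the current stage are discounted from the cumulative service already accrued. This matches the worked numbers here, but the ordering rule and all names are assumptions.

```python
ALPHA = 1 / 1.12
# NPV cost of buying a new asset and keeping it n periods (Table 12.6).
NPV = {1: 27_679, 2: 43_383, 3: 58_566, 4: 73_511}
HORIZON = 4
STAGE_ORDER = [4, 3, 2, 1]  # service lengths in order of increasing AEC


def run_stages(order=STAGE_ORDER, horizon=HORIZON):
    """Return [(service length, stage values f_k(0..horizon)), ...]."""
    inf = float("inf")
    f = [0.0] + [inf] * horizon  # before any stage only zero service exists
    history = []
    for n in order:
        g = list(f)  # m = 0: service length n is not used in this stage
        for s in range(n, horizon + 1):
            m, block = 1, 0.0
            while m * n <= s:
                # NPV of m back-to-back cycles of length n, valued at their start
                block += ALPHA ** ((m - 1) * n) * NPV[n]
                start = s - m * n  # service accrued before these cycles
                if f[start] < inf:
                    g[s] = min(g[s], f[start] + ALPHA ** start * block)
                m += 1
        f = g
        history.append((n, list(f)))
    return history


for n, values in run_stages():
    print(n, [round(v) if v < float("inf") else None for v in values])
```

States that cannot be reached with the service lengths considered so far keep an infinite value, which is why only cumulative service of 0, 3 or 4 periods is available after the second stage.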

The final stage evaluates using assets for a single period with previous combinations (three-period and two-period aged assets). It can be shown that the optimal decision is to retain the asset for all four periods at a net present value cost of $73,511. Note that this is the same decision found with the two previous formulations, as $73,511 less the salvage value of the four-period old asset ($25,515) is $47,996.

This recursion was not developed in order to provide another computational approach to the equipment replacement problem. Rather, it was developed to illustrate the relationship between the infinite and finite horizon solutions under stationary costs. Specifically, as the optimal solution to the infinite horizon problem is to repeatedly replace an asset at its economic life (the age which minimizes equivalent annualized costs), the question being investigated was whether that solution (replacing at the economic life) translates to the finite horizon case. It was shown that using the infinite horizon solution provides a good answer when O&M costs increase over the life of an asset more drastically than salvage values decline. In the case when the salvage value declines are more drastic than the O&M cost increases, it is generally better to retain the final asset in the sequence for a period longer than the economic life of the asset. For the cases when O&M cost increases and salvage value declines are similar, it is beneficial to solve a dynamic programming recursion to find the optimal policy.

12.5.4 Infinite Horizon Considerations

The solution of a dynamic programming algorithm assumes that the horizon is finite. In the case of an infinite horizon in which an asset is expected to remain in service indefinitely, it may be possible to identify an optimal time zero decision. Bean et al. (1985) show that if the time zero decision for an equipment replacement problem does not change for N consecutive horizons, where N is the maximum age of an asset, then the decision is optimal for any length of horizon, including an infinite horizon. Unfortunately, this does not guarantee the existence of an optimal time zero decision. For the age or period based dynamic programming recursions, the models must be solved over T, T+1, T+2, …, T+N horizons. If the time zero decision does not change for these problems, then the optimal time zero decision is found. If this is not the case, the progression must continue until N consecutive identical time zero decisions are identified. This may be more easily facilitated using a forward recursion. In the period-based model, this requires defining f(t) as a function of f(t–1), f(t–2), etc., with f(0) = 0. We illustrate by revisiting Example 12.4.

Example 12.6 We illustrate the first few stages of the forward recursion, as its implementation is better suited for infinite horizon analysis. As noted earlier, the recursion is initialized with f(0) = 0. Stepping forward in time, it is assumed that T = 1. Using the values from Table 12.2, it is clear that the only feasible decision is to retain the four-period old asset for one period such that f(1) = -$1,989. For the second stage, there are two feasible decisions to evaluate, such that

$$f(2) = \min \left\{ \begin{array}{l} 0.893(\$27{,}679) - \$1{,}989 \\ \$17{,}868 + \$0 \end{array} \right\} = \$17{,}868.$$

The first decision evaluates using the new asset for one period, assuming (from stage 1) that the four-period old asset is retained for one period. The second decision assumes the four-period old asset is retired immediately and a new asset is used for two periods. This process moves forward in time, increasing the value of T in each step. The process stops when, in this case, five consecutive solutions (with increasing T) result in the same time zero decision.

12.5.5 Modeling Complex Systems

The presented dynamic programming algorithms are designed for single asset systems. More complex systems are obviously defined by multiple assets which are not independent, otherwise the presented models would be sufficient. The most straightforward case is where all assets operate in parallel, such as a fleet. Jones et al. (1991) offered the first dynamic programming recursion for the parallel machine replacement model, which can be used to analyze fleet replacement decisions. Machines are assumed to operate in parallel and thus the capacity of the system is equal to the sum of the individual asset capacities. In addition to defining the capacity of the system, the assets are often linked economically. Jones et al. focused on the assumption that a fixed cost would be charged in any period in which a replacement occurs (in addition to the typical per unit charges for each asset replaced). This provides an incentive to replace multiple assets together over time so as to reduce the number of times the fixed charge is incurred over some horizon.

To model replacement decisions for this system, the state of the system is defined as the number of assets aged 1 through N, represented as a vector, [m1, m2, …, mN]. In general, this would seem to be an intractable model, as the number of feasible combinations of replacements to evaluate in each period is exponential, since one could replace any combination of the m1+m2+…+mN assets. However, Jones et al. proved two key theorems that drastically reduce the computational complexity. First, it was shown that clusters, or assets of the same age in the same time period, do not split. That is, a group of same aged assets is kept or replaced in its entirety at the end of each period. Second, under some mild cost assumptions, they showed that older clusters of assets are replaced before younger clusters. With these two theorems, the number of possible replacements in a given period is drastically reduced. In each period, either the oldest cluster is replaced, or the two oldest clusters are replaced, etc., for a given state of the system.

Consider the network in Figure 12.6. At time zero, a system is defined by six assets: three of age one, two of age two, and one of age three, and the maximum feasible age of an asset is three. Replacement decisions, assuming the no-splitting rule and oldest cluster replacement rule, are illustrated for three periods. Note that the maximum number of decisions for a given state is N = 3. Determining the optimal sequence of replacement decisions for the system is similar to our previous dynamic programming recursions. A value is assigned to f3(S), where S refers to the state vector. This would merely be the sum of the salvage values for each asset owned at time T = 3. Then, the value of each state S at time 2 would be determined by summing the costs of the decision and discounting the value of the resulting state. For example, moving from state [3,0,3] to state [3,3,0] would entail selling the three three-period old assets and purchasing three new assets.
The new assets would be utilized for one period (incurring O&M costs) while the three one-period old assets (at the end of time period two) would be utilized for a second year, also incurring O&M costs. These costs (discounted accordingly) would be added to the discounted value of f3(3,3,0). This value would be compared to the decision of replacing all six assets (leading to state [6,0,0]) to determine the value of state f2(3,0,3).
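The effect of the two rules on the decision set can be sketched by enumerating the successor states of a given state. The encoding below is an assumption made for illustration: a decision n replaces every asset of age n and older, n = N + 1 stands for keeping all assets, and assets at the maximum age must be replaced.

```python
def successors(state):
    """Map each reachable next state to the decision n that produces it.

    state[j - 1] is the number of assets of age j. Decision n replaces all
    assets of age n and older (no splitting, oldest clusters first);
    n = len(state) + 1 means that every asset is kept.
    """
    n_ages = len(state)
    result = {}
    for n in range(n_ages + 1, 0, -1):
        if n == n_ages + 1 and state[-1] > 0:
            continue  # assets at the maximum age cannot be kept
        replaced = sum(state[n - 1:])
        kept = list(state[:min(n - 1, n_ages - 1)])
        # New assets are age 1 next period; kept assets move up one age class.
        nxt = (replaced, *kept, *([0] * (n_ages - 1 - len(kept))))
        result.setdefault(nxt, n)  # decisions with the same outcome are merged
    return result


def reachable(initial, periods):
    """Sets of states reachable after 0, 1, ..., periods periods."""
    layers = [{tuple(initial)}]
    for _ in range(periods):
        layers.append({nxt for s in layers[-1] for nxt in successors(s)})
    return layers


layers = reachable((3, 2, 1), 3)
print(sorted(layers[1]))
print(successors((3, 0, 3)))
```

Because decisions that lead to the same state are merged, no state has more than N successors, compared with the exponential number of subsets that would otherwise have to be examined.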

Figure 12.6. Dynamic programming network for parallel-machine replacement problem (states shown by period. Time 0: [3,2,1]. Time 1: [1,3,2], [3,3,0], [6,0,0]. Time 2: [2,1,3], [5,1,0], [0,3,3], [3,0,3], [0,6,0], [6,0,0]. Time T = 3: [3,2,1], [4,2,0], [0,5,1], [1,5,0], [3,0,3], [3,3,0], [0,0,6], [0,6,0], [6,0,0])

Define n as the decision of what minimum aged assets are to be replaced for a given state at time t. That is, all assets of age n and older are replaced while the remaining assets are retained. We can model the recursion in general as follows:

$$f_t(m_1, m_2, \ldots, m_N) = \min_{n} \left\{ K_t \cdot \mathbf{1}_{n>1} + \Big( \sum_{j=n}^{N} m_j \Big) P_t - \sum_{j=n}^{N} m_j S_t(j) + \alpha \left[ \Big( \sum_{j=n}^{N} m_j \Big) C_{t+1}(1) + \sum_{j=1}^{n-1} m_j C_{t+1}(j+1) + f_{t+1}(m_n + m_{n+1} + \cdots + m_N,\, m_1, m_2, \ldots, m_{n-1}, 0, 0, \ldots, 0) \right] \right\} \tag{12.15}$$

Examining the recursion, a purchase price is paid and a salvage value is received for all assets that are replaced. All of the newly purchased assets (their total number is $m_n + m_{n+1} + \cdots + m_N$) incur the O&M cost of a new asset while the O&M costs of the retained assets are incurred according to their age. A fixed charge Kt is paid if at least one group of assets is replaced (n>1), captured by the indicator function. The resulting state is a group of new assets (age 1) with all other assets incrementing one period in age.

A number of extensions to this model have been published in the literature, although many utilize integer programming modeling techniques to deal with the large state space. Chand et al. (2000) focus on the use of dynamic programming and include capacity expansion decisions with the replacement decisions. Unfortunately, capital budgeting constraints greatly complicate the problem as it cannot be assumed that groups of assets must be kept or replaced together. While the theorems presented in Jones et al. (1991) greatly reduce the computational difficulties of solving the dynamic program for the parallel replacement problem, it should be clear that using dynamic programming to address replacement decisions for more complex systems may be difficult because of the number of combinations of replacement alternatives. (See Hartman and Ban (2002) and the references therein for a discussion of these issues.)

Consider a more complex system in which a number of machines are used in series (such as a production line) and there are a number of lines in parallel, such as the one given in Figure 12.7. The lines are labeled 1, 2 and 3 while the machines are labeled a, b, c, and d.

Figure 12.7. System with assets in series and lines in parallel (lines 1, 2 and 3 in parallel; machines a, b, c and d in series within each line)

The capacity of a line is now defined by the machine in the line with the minimum capacity. However, the capacity of the system is raised due to the parallel design. The capacity of the system is defined by the sum of the capacity of each line. Therefore, it is defined by the sum of the minimum capacity asset in each line. Reliability is measured similarly to capacity, in that it is reduced by the series structure but increased with parallel (redundant) structure. For a given series, the reliability of the line is equal to the product of the reliability of each individual asset. That is because if one asset is down, the line is down. The reliability of the system, assuming only one line must be up and running, is increased as the system is operating even if two of the three lines are down.

If one defines minimum system capacity or reliability constraints, these can be incorporated into a dynamic programming recursion that evaluates the possibility of replacing any combination of assets in each period over some horizon. Presumably, newer assets would have higher capacity or reliability, either due to technological change or due to the fact that they are new (and have not deteriorated), and thus would increase the respective capacity or reliability of the system (in order to meet the defined constraints). The difficulty with using a dynamic programming recursion to evaluate these decisions is not in capturing the capacity or reliability constraints. Rather, the difficulty is in the exponential growth in the number of possible combinations of replacements in each period. Consider the 12 assets shown in Figure 12.7. In the most general problem, each asset and each combination of assets can be replaced in each period, totaling $2^{12}$ = 4,096 combinations each period for each state of the system. This system could easily become more complicated, merely by defining a, b, c, and d as processes, each of which may have a number of assets in parallel (or in series).

In the parallel machine replacement problem, a similar problem was encountered, but the number of possible decisions was reduced to N (the maximum allowable age for an asset) for each possible state in each period with the two theorems introduced by Jones et al., without sacrificing optimality. Unfortunately, the interaction of the assets may prohibit the application of these theorems to other systems. In fact, even how to define the state of the system is not entirely clear. For the system described in Figure 12.7, we could define the state as a matrix of asset ages. Each row would be defined by the age of each machine in a given line, with a row defined for each line. If an asset is replaced, then its age would be 1 in the next stage, while it would merely increase by one period if the machine is retained. This modeling approach could be expanded to the case of multiple machines in a given process by expanding the size of the matrix. Again, the difficulty would be in restricting the number of decisions to evaluate for each state in a given period. Following the approach of Jones et al. (1991), older assets would be replaced first (and even further restricted to have to be above a certain age for consideration) and similarly aged assets of the same type would be replaced in the same time period. Another approach would be to only consider replacing assets that increase the system capacity or reliability. Thus, replacements could be examined in the order of either increasing capacity or increasing reliability. Whether these heuristic approaches provide a good solution for a given problem instance would require extensive numerical testing.
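The capacity and reliability measures and the age-matrix state described above can be sketched as follows. This is an illustration only: the numerical capacities and reliabilities are hypothetical, asset failures are assumed to be independent, and the function names are not taken from the literature.

```python
from math import prod


def system_capacity(capacities):
    """Sum over parallel lines of the minimum machine capacity in each line."""
    return sum(min(line) for line in capacities)


def system_reliability(reliabilities):
    """Probability that at least one line is up.

    A line is up only if every machine in it is up (series structure);
    independence of failures is an assumption of this sketch.
    """
    line_rel = [prod(line) for line in reliabilities]
    return 1 - prod(1 - r for r in line_rel)


def age_step(ages, replaced):
    """Next age matrix: replaced assets become age 1, retained assets age by one."""
    return [[1 if (i, j) in replaced else age + 1
             for j, age in enumerate(line)]
            for i, line in enumerate(ages)]


# Hypothetical data for three lines of four machines (a, b, c, d).
capacities = [[10, 8, 9, 12], [7, 9, 9, 9], [11, 11, 6, 10]]
reliabilities = [[0.9] * 4 for _ in range(3)]
ages = [[1, 2, 3, 1], [2, 2, 2, 2], [4, 1, 1, 3]]

print(system_capacity(capacities))
print(system_reliability(reliabilities))
print(age_step(ages, {(2, 0), (0, 2)}))

# With no restriction, any subset of the 12 assets may be replaced each period.
n_assets = sum(len(line) for line in ages)
print(2 ** n_assets)
```

A minimum capacity or reliability constraint would then be a simple check on these two functions for each candidate state; the difficulty noted in the text lies in the number of candidate subsets, not in evaluating the constraints.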


12.6 Discussion and Further Topics for Research

In this chapter we have reviewed both economic life and dynamic programming models to address capital replacement problems which can arise in various settings, including manufacturing, transportation, and utility industries. It should be clear that the trend in developing solutions to these problems has migrated from single assets to complex systems. While good solutions exist for systems with homogeneous assets in parallel, computational difficulties exist for those with inhomogeneous assets both in series and parallel, and opportunities exist to develop optimal solution methods with advanced computational techniques or good solution rules developed from simpler, tractable models.

In addition to the investigation of systems with multiple assets, further savings can be achieved by considering operational and replacement decisions simultaneously (Hartman 1999, 2004). It should be clear that the usage of an asset over time impacts its replacement schedule. In the context of multiple assets, it may be possible to allocate usage to assets over time, thereby influencing replacement schedules. Thus, in order to minimize total system costs over time, both replacement and operating decisions should be considered simultaneously. There are numerous application areas where this analysis is warranted, including transportation networks, water distribution networks, and production systems.

A final area of future research must center on technological change. While assets are often replaced due to deterioration, newer assets are often purchased because they are technologically advanced, providing similar capabilities at lower cost or additional capabilities for additional revenue. Numerous studies focus on the continuous evolution of technological change; however, more detailed research must focus on appropriate models for various applications, as it is clear that technology advances in different ways for different industrial sectors.

12.7 References

Apeland, S. and Scarf, P.A. (2003) A fully subjective approach to capital equipment replacement. Journal of the Operational Research Society 54, 371–378.
Arnold, G. (2006) Essentials of Corporate Financial Management. Pearson, London.
Baker, R.D. and Scarf, P.A. (1995) Can models fitted to small data samples lead to maintenance policies with near-optimal cost? IMA Journal of Mathematics Applied in Business and Industry 6, 3–12.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1985) A dynamic infinite horizon replacement economy decision model. The Engineering Economist 30, 99–120.
Bean, J.C., Lohmann, J.R. and Smith, R.L. (1994) Equipment replacement under technological change. Naval Research Logistics 41, 117–128.
Bellman, R.E. (1955) Equipment replacement policy. Journal of the Society for the Industrial Applications of Mathematics 3, 133–136.
Booth, P. and King, P. (1998) The relationship between finance and actuarial science. In Hand, D.J., Jacka, S.D. (Eds), Statistics in Finance, Arnold, London, pp. 7–40.
Bowe, M. and Lee, D.L. (2004) Project evaluation in the presence of multiple embedded real options: evidence from the Taiwan High-Speed Rail Project. Journal of Asian Economics 15, 71–98.


Brint, A.T., Hodgkins, W.R., Rigler, D.M. and Smith, S.A. (1998) Evaluating strategies for reliable distribution. IEEE Comput.Applns. in Power 11, 43–47.
Chand, S., McClurg, T. and Ward, J. (2000) A model for parallel machine replacement with capacity expansion. European Journal of Operational Research 121, 519–531.
Christer, A.H. (1984) Operational research applied to industrial maintenance and replacement. In Eglese, R.W. and Rand, G.K. (Eds) Developments in Operational Research (pp. 31–58). Pergamon Press, Oxford.
Christer, A.H. (1988) Determining economic replacement ages of equipment incorporating technological developments. In Rand, G.K. (Ed.) Operational Research ’87 (pp. 343–354). Elsevier, Amsterdam.
Christer, A.H. and Scarf, P.A. (1994) A robust replacement model with applications to medical equipment. J.Opl.Res.Soc. 45, 261–275.
Derman, C. (1963) Inspection-maintenance-replacement schedules under Markovian deterioration. In Mathematical Optimization Techniques, University of California Press, Berkeley, CA, pp. 201–210.
Dixit, A.K. and Pindyck, R.S. (1994) Investment Under Uncertainty. Princeton University Press, New Jersey.
Eilon, S., King, J.R. and Hutchinson, D.E. (1966) A study in equipment replacement. Opl.Res.Quart. 17, 59–71.
Elton, D.J. and Gruber, M.J. (1976) On the optimality of an equal life policy for equipment subject to technological change. Opl.Res.Quart. 22, 93–99.
Goldstein, M. and O’Hagan, A. (1996) Bayes linear sufficiency and systems of expert posterior assessments. Journal of the Royal Statistical Society Series B 58, 301–316.
Hartman, J.C. (1999) A General Procedure for Incorporating Asset Utilization Decisions into Replacement Analysis. Eng. Econ. 44(3), 217–238.
Hartman, J.C. (2001) An Economic Replacement Model with Probabilistic Asset Utilization. IIE Transactions 33, 717–729.
Hartman, J.C. (2004) Multiple asset replacement analysis under variable utilization and stochastic demand. European Journal of Operational Research 59, 145–165.
Hartman, J.C. and Ban, J. (2002) The series-parallel replacement problem. Robotics and Computer Integrated Manufacturing 18, 215–221.
Hartman, J.C. and Hartman, R.V. (2001) After-Tax Replacement Analysis. The Engineering Economist 46, 181–204.
Hartman, J.C. and Murphy, A. (2006) Finite Horizon Equipment Replacement Analysis. IIE Transactions 38, 409–419.
Hartman, J.C. and Rogers, J.L. (2006) Dynamic Programming Approaches for Equipment Replacement Problems with Continuous and Discontinuous Technological Change. IMA Journal of Management Mathematics 17, 143–158.
Hopp, W.J. and Nair, S.K. (1991) Timing replacement decisions under discontinuous technological change. Naval Research Logistics 38, 203–220.
Hopp, W.J. and Nair, S.K. (1994) Markovian deterioration and technological change. IIE Transactions 26, 74–82.
Jones, P.C., Zydiak, J.L. and Hopp, W.J. (1991) Parallel machine replacement. Naval Research Logistics 38, 351–365.
Karabakal, N., Lohmann, J.R. and Bean, J.C. (1994) Parallel replacement under capital rationing constraints. Management Science 40, 305–319.
Kobbacy, K. and Nicol, D. (1994) Sensitivity of rent replacement models. Int.J.Prod.Econ. 36, 267–279.
Markowitz, H.M. (1952) Portfolio selection. Journal of Finance 7, 77–91.
Northcott, D. (1985) Capital Investment Decision Making. Dryden Press, London.
Oakford, R.V., Lohmann, J.R. and Salazar, A. (1984) A dynamic replacement economy decision model. IIE Transactions 16, 65–72.


O’Hagan, A. (1994) Robust modelling for asset management. Journal of Statistical Planning and Inference 40, 245–259.
Regnier, E., Sharp, G. and Tovey, C. (2004) Replacement under ongoing technological progress. IIE Transactions 36, 497–508.
Russell, J.C. (1982) Vehicle replacement: a case study in adapting a standard approach for a large organisation. Journal of the Operational Research Society 33, 899–911.
Santhanam, R. and Kyparisis, G.J. (1996) A decision model for interdependent information system project selection. European Journal of Operational Research 89, 380–399.
Scarf, P.A. (1994) Optimal buying, running and selling policy for the private motorist: an application of capital replacement modelling. IMA Bulletin 30, 181–186.
Scarf, P.A. and Christer, A.H. (1997) Applications of capital replacement models with finite planning horizons. International Journal of Technology Management 13, 25–36.
Scarf, P.A. and Hashem, M. (1997) On the application of an economic life model with a fixed planning horizon. International Transactions in Operations Research 4, 139–150.
Scarf, P.A. and Hashem, H. (2003) Characterization of optimal policies for capital replacement models. IMA Journal of Management Mathematics 13, 261–271.
Scarf, P.A. and Martin, H. (2001) A framework for maintenance and replacement of a network structured system. Int.J.Prod.Econ. 69, 287–296.
Scarf, P.A., Dwight, R., McCusker, A. and Chan, A. (2006) Asset replacement for an urban railway using a modified two-cycle replacement model. Journal of the Operational Research Society (doi: 10.1057/palgrave.jors.2602288).
Suzuki, Y. and Pautsch, G.R. (2005) A vehicle replacement policy for motor carriers in an unsteady economy. Transportation Research Part A: Policy and Practice 39, 463–480.
Wagner, H.M. (1975) Principles of Operations Research. Prentice-Hall Inc., Englewood Cliffs, NJ.
Wang, H. (2002) A survey of maintenance policies of deteriorating systems. European Journal of Operational Research 139, 469–489.

13 Maintenance and Production: A Review of Planning Models

Gabriella Budai, Rommert Dekker and Robin P. Nicolai

13.1 Introduction

Maintenance is the set of activities carried out to keep a system in a condition where it can perform its function. Quite often these systems are production systems where the outputs are products and/or services. Some maintenance can be done during production and some can be done during regular production stops in evenings, at weekends and on holidays. However, in many cases production units need to be shut down for maintenance. This may lead to tension between the production and maintenance departments of a company. On the one hand the production department needs maintenance for the long-term well-being of its equipment; on the other hand maintenance leads to shutting down operations and loss of production. It will be clear that both departments can benefit from decision support based on mathematical models.

In this chapter we give an overview of mathematical models that consider the relation between maintenance and production. The relation exists in several ways. First of all, when planning maintenance one needs to take production into account. Second, maintenance can also be seen as a production process which needs to be planned, and finally one can develop integrated models for maintenance and production. Apart from giving a general overview of models we will also discuss some sectors in which the interactions between maintenance and production have been studied.

Many review articles have been written on maintenance, e.g. Cho and Parlar (1991), but to our knowledge only one on the combination of maintenance and production, Ben-Daya and Rahim (2001). This review differs from that one in several respects. First of all, we also consider models which take production restrictions into account, rather than only integrated models. Second, we discuss some specific sectors. Finally, we discuss the more recent articles published since that review.

Maintenance is related to production in several ways. First of all, maintenance is intended to allow production, yet to execute maintenance, production often has to be stopped. This negative effect therefore has to be considered in maintenance planning and optimization. It comes forward specifically in the costing of downtime and in opportunity maintenance. All articles taking the effect of production on maintenance explicitly into account fall into this category. Second, maintenance can also be seen as a production process which needs to be planned. Planning in this respect implies determining appropriate levels of capacity (e.g. manpower) in relation to the demand. Third, we are concerned with production planning in which one needs to take maintenance jobs into account. The point is that the maintenance jobs take production capacity away and hence they need to be planned together with production. Maintenance has to be done either because of a failure or because the quality of the produced items is not high enough. In this third category we also consider the integrated planning of production and maintenance.

The relation between maintenance and production is also determined by the business sector. We consider the following sectors: railways, road, airlines and electrical power system maintenance.

The outline of the rest of this chapter is as follows. In Section 13.2 we present an overview of the main elements of maintenance planning, as these are essential to understand the rest of this chapter. Following our classification scheme, in Section 13.3 we review articles in which maintenance is modelled explicitly and where the needs of production are taken into account. Since these needs differ between business sectors, we discuss in Section 13.4 the relation between production and maintenance for some specific business sectors. In Section 13.5 we consider the second category in our classification scheme: maintenance as a production process which needs to be planned. In Section 13.6 we are concerned with production planning in which one needs to take maintenance jobs into account (integrated production and maintenance planning).
Trends and open research areas are discussed in Section 13.7 and, finally, conclusions are drawn in Section 13.8.

13.2 Maintenance Planning and Optimization: An Overview

In maintenance several important decisions have to be made. We distinguish between (i) the long-term strategy and maintenance concept, (ii) medium-term planning, (iii) short-term scheduling and finally (iv) control and performance indicators. Major strategic decisions concerning maintenance are made in the design process of systems. What type of maintenance is appropriate and when should it be done? This is laid down in the so-called maintenance concept. Many optimization models address this problem and the relation with production is implicit in some of them.

Another important strategic problem is the organization of the maintenance department. Is maintenance done by production personnel (in the way total productive maintenance prescribes) or are there dedicated maintenance personnel? In addition, questions such as “Where is it located?”, “Are specific types of work outsourced?”, etc. should be answered. Although these are important topics, they are more the concern of industrial organization than the topic of mathematical models. Further important strategic issues concern how a system can be maintained, whether specific expertise or equipment is needed, whether one can easily reach


the subsystems, what information is available and what elements can be easily replaced. These are typical maintainability aspects, but they have little to do with production.

In the tactical phase, usually between a month and a year, one plans for the major maintenance/upgrade of major units and this has to be done in cooperation with the production department. Accordingly, specific decision support is needed in this respect. Another tactical problem concerns the capacity of the maintenance crew. Is there enough manpower to carry out the preventive maintenance program? These questions can be addressed by use of models as will be indicated later on.

In the short-term scheduling phase one determines the moment and order of execution, given an amount of outstanding corrective or preventive work. This is typically the domain of work scheduling where extensive model-based support can be given.

We will next consider another important aspect in maintenance, which is the type of maintenance. A typical distinction is made between corrective and preventive maintenance work. The first is carried out after a failure, which is defined as the event by which a system stops functioning in a prescribed way. Preventive work, however, is carried out to prevent failures. Although this distinction is often made, we would like to remark that the difference is not as clear as it may seem. This is due to the definition of failure. An item may be in a bad state while still functioning, and one may or may not consider this as a failure. In any case, an important distinction between the two is that corrective maintenance is usually not plannable, but preventive maintenance typically is. The execution of maintenance can also be triggered by condition measurements and then we speak of condition-based maintenance. This has often been advocated as more effective and efficient than time-based preventive maintenance.
Yet it is very hard to predict failures well in advance, and hence condition-based maintenance is often unplannable. Instead of time-based maintenance one can also base the preventive maintenance on utilisation (run hours, mileage), as these are more appropriate indicators of wear-out. Finally one may also have inspections, which can be done by sight or with instruments and often do not affect operation. They do not improve the state of a system, however, but only the information about it. This can be important in case machines start producing items of a bad quality. There are inspection-quality problems where inspection optimization is connected to quality control.

Another distinction is about the amount of work. Often there are small works, grouped into maintenance packages. They may start with inspection, cleaning and next some improvement actions like lubricating and/or replacing some parts. These are typically part of the preventive maintenance program attached to a system and have to be done on a repetitive basis (monthly, quarterly, yearly or two-yearly). Next, one has replacements of parts or subsystems and overhauls or refurbishments where a substantial system is improved. The latter are planned well in advance and carried out as projects with individual (or separate) budgets.

A traditional optimization problem has been the choice and trade-off between preventive and corrective maintenance. The typical motivation is that preventive maintenance is cheaper than corrective. Maintenance costs are usually due to man-hours, materials and indirect costs. The difference between corrective and preventive maintenance costs is especially in the latter category. Indirect costs represent loss of production and environmental damage or safety consequences. Costing these consequences can be a difficult problem and is tackled in Section 13.3.1. It will also be clear that preventive maintenance should be done when production is least affected. This can be done using opportunities, which has given rise to a specific class of models dealt with in a separate section (Section 13.3.2).

13.3 When to do Maintenance in Relation to Production

In this section we discuss articles in which maintenance (planning or scheduling) is modelled explicitly and the needs of production are taken into account. The latter, however, is not usually modelled as such, but it is taken into account in the form of constraints or requirements. Alternatively the effect of maintenance on varying production scenarios may be considered. Following this reasoning we arrive at three streams of research. A first stream assesses the costs of downtime, which are important in the planning of maintenance. The second stream deals with studies where one tries to schedule maintenance work at those moments that units are not needed for production (opportunities) and in the last stream articles are considered which schedule maintenance in line with production. Each stream is dealt with in a separate section.

13.3.1 Costing of Downtime

Assessing the costs of downtime is an important step in the determination of costs of preventive and corrective maintenance. Although exact values are not necessary, as most optimization results show, it is important to assess these values with a reasonable accuracy. It is easier to determine downtime costs in case of preventive maintenance than in case of corrective maintenance as failures may have many unforeseen consequences. Yet even in case of preventive maintenance the assessment can be difficult, e.g. in case of highway shutdowns or railway stoppage. Another problem to be tackled is the system-unit relation. A system can be a complex configuration of different units, which may imply that downtime of one unit does not necessarily halt the full system. Accordingly, an assessment of the consequences of unit downtime on system performance has to be made. This is especially a problem in case of k-out-of-n systems or even in more general configurations. Several articles deal with this issue. Some give an overall model, others describe a detailed case. Geraerds (1985) gives an outline of a general structuring to determine downtime costs. In Dekker and Van Rijn (1996) a downtime model is described for k-out-of-n systems used on oil production platforms. Edwards et al. (2002) give a detailed model for the costs of equipment downtime in open-pit mining. They use regression models based on historical data. Knights et al. (2005) present a model to assist maintenance managers in evaluating the economic benefits of maintenance improvement projects.
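
The system-unit relation for a k-out-of-n system can be illustrated with a small calculation. The sketch below is not taken from any of the cited articles. It assumes n independent, identical units, each available with the same probability, and a single cost per hour that is incurred whenever fewer than k units are up. All function and parameter names are hypothetical.

```python
from math import comb


def system_unavailability(n: int, k: int, unit_availability: float) -> float:
    """Probability that fewer than k of n independent, identical units are up."""
    a = unit_availability
    return sum(comb(n, j) * a**j * (1 - a) ** (n - j) for j in range(k))


def expected_downtime_cost_rate(n, k, unit_availability, system_cost_per_hour):
    """Expected system downtime cost per hour (assumption: all-or-nothing cost)."""
    return system_cost_per_hour * system_unavailability(n, k, unit_availability)


def marginal_cost_of_unit_outage(n, k, unit_availability, system_cost_per_hour):
    """Extra expected cost per hour while one unit is taken out for maintenance.

    With one unit down for certain, the system needs k of the remaining n - 1.
    """
    reduced = system_unavailability(n - 1, k, unit_availability)
    normal = system_unavailability(n, k, unit_availability)
    return system_cost_per_hour * (reduced - normal)
```

Under these assumptions, the marginal figure is the downtime cost that would be attributed to maintaining one unit: it is well below the full system cost per hour as long as spare units remain, and it approaches the full cost once no redundancy is left.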


13.3.2 Opportunity Maintenance

Opportunity maintenance is the maintenance that is carried out at an opportune moment, i.e. at moments at which the units to be maintained are less needed for their function than normally. We speak of opportunities if these events occur occasionally and if they are difficult to predict in advance. There can be several reasons for a maintenance opportunity:

• Failure and hence repairs of other units/components. The failure of one component is often an opportunity to preventively maintain other components. Especially if the failure causes the breakdown of the production system it is favourable to perform preventive maintenance on other components. After all, little or no production is lost above that resulting from the original failure. An example is given in Van der Duyn Schouten et al. (1998) who consider the replacement of traffic lights at an intersection.

• Other interruptions of production. Production processes are not only interrupted by failures or repairs. Several outside events may create an opportunity as well. These can be market interruptions, or other work for which production needs to be stopped (e.g. replacing catalysts), which offers an opportunity to combine it with preventive maintenance.

According to the foregoing discussion there are two approaches to opportunities. The first models a whole multi-component system in which upon a failure preventive maintenance can be carried out on other components as well. In the second stream the opportunities are modelled as an outside event at which one may do maintenance. In the simplest form one considers one component, with maintenance which may be done at opportunities, or also with a forced shutdown.

Bäckert and Rippin (1985) consider the first type of opportunistic maintenance for plants subject to breakdowns. In this article three methods are proposed to solve the problem. In the first two cases the problem is formulated as a stochastic decision tree and solved using a modified branch and bound procedure. In the third case the problem is formulated as a Markov decision process. The planning period is discretised, resulting in a finite state space to which a dynamic programming procedure can be applied. In Wijnmalen and Hontelez (1997) a multi-component system is considered where failures of one component may create an opportunity, but the opportunity process is approximated by an independent process with the same mean rate. In this way they circumvent the problem of dimensionality which appears in the study of Bäckert and Rippin (1985).

There are several articles considering the other stream. Tan and Kramer (1997) propose a general framework for preventive maintenance optimization in chemical process operations. The authors combine Monte Carlo simulation with a genetic algorithm. Opportunities are the failure of other components. In Dekker and Dijkstra (1992) and Dekker and Smeitink (1991) it is assumed that the opportunity-generating process is completely independent of the failure process and is modelled as a renewal process. Dekker and Smeitink (1994) consider multi-component maintenance at opportunities of restricted duration and determine priorities of what preventive maintenance to do at an opportunity.
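
The second stream, in which opportunities arrive independently of the failure process, can be illustrated with a small simulation. The sketch below is an illustration only and does not reproduce any of the cited models. It assumes a single component with a Weibull lifetime, opportunities that arrive as a Poisson process (a special case of a renewal process), preventive replacement at the first opportunity after the component's age passes a control limit, and corrective replacement upon failure. All names and parameters are hypothetical.

```python
import random


def simulate_cost_rate(control_limit, opportunity_rate, shape, scale,
                       cost_preventive, cost_failure, cycles=20000, seed=1):
    """Estimate the long-run cost per unit time by simulating renewal cycles."""
    rng = random.Random(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(cycles):
        lifetime = rng.weibullvariate(scale, shape)
        # Poisson opportunities are memoryless, so the wait from the control
        # limit to the next opportunity is exponentially distributed.
        next_opportunity = control_limit + rng.expovariate(opportunity_rate)
        if next_opportunity < lifetime:
            # The component survives until the opportunity: preventive replacement.
            total_cost += cost_preventive
            total_time += next_opportunity
        else:
            # The component fails first: corrective replacement.
            total_cost += cost_failure
            total_time += lifetime
    return total_cost / total_time
```

Passing `float("inf")` as the control limit gives the purely corrective policy as a baseline. Comparing cost rates over a grid of control limits is then a crude way to locate a good threshold under these assumptions.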


In Dekker and Van Rijn (1996) a decision-support system (PROMPT) for opportunity-based preventive maintenance is discussed. PROMPT was developed to take care of the random occurrence of opportunities of restricted duration. Here, opportunities are not only failures of other components, but also preventive maintenance on (essential) components. Many of the techniques developed in the articles of Dekker and Smeitink (1991), Dekker and Dijkstra (1992) and Dekker and Smeitink (1994) are implemented in the decision-support system. In PROMPT preventive maintenance is split up into packages. For each package an optimum policy is determined, which indicates when it should be carried out at an opportunity. From the separate policies a priority measure is derived that indicates which maintenance package should be executed at a given opportunity.

In Dekker et al. (1998b) the maintenance of light-standards is studied. A light-standard consists of n independent and identical lamps screwed on a lamp assembly. To guarantee a minimum luminance, the lamps are replaced if the number of failed lamps reaches a prespecified number m. In order to replace the lamps the assembly has to be lowered. As a consequence, each failure is an opportunity to combine corrective and preventive maintenance. Several opportunistic age-based variants of the m-failure group replacement policy (in its original form only corrective maintenance is grouped) are considered. Simulation optimization is used to determine the optimal opportunistic age threshold.

Dagpunar (1996) introduces a maintenance model where replacement of a component within a system is possible when some other part of the system fails, at a cost of c2. The opportunity process is Poisson. A component is replaced at an opportunity if its age exceeds a specified control limit t. Upon failure a component is replaced at cost c4 if its age exceeds a specified control limit x, otherwise it is minimally repaired at cost c1.
In case of a minimal repair the age and failure rate of the component after the repair are as they were immediately before failure. There is also a possibility of a preventive or “interrupt” replacement at cost c3 if the component is still functioning at a specified age T. A procedure to optimise the control limits t and T is given in Dekker and Plasmeijer (2001).

13.3.3 Maintenance Scheduling in Line with Production

Here we consider models where the effect of production on maintenance is explicitly taken into account. These models only address maintenance decisions, but they do not give advice on how to plan production. The models developed in the articles in this category show that a good maintenance plan, one that is integrated with the production plan, can result in considerable cost savings. This integration with production is crucial because production and maintenance have a direct relation. Any breakdown in machine operation results in disruption of production and leads to additional costs due to downtime, loss of production, decrease in productivity and quality, and inefficient use of personnel, equipment and facilities. Below we review articles following this stream of research in chronological order.

Dedopoulos and Shah (1995) consider the problem of determining the optimal preventive maintenance policy parameters for individual items of equipment in multipurpose plants. In order to formulate maintenance policies, the benefits of


maintenance, in the form of reduced failure rates, must be weighed against the costs. The approach in this study first attempts to estimate the effect of the failure rate of a piece of equipment on the overall performance/profitability of the plant. An integrated production and maintenance planning problem is also solved to determine the effects of PM on production. Finally, the results of these two procedures are then utilized in a final optimization problem that uses the relationship between profitability and failure rate as well as the costs of different maintenance policies to select the appropriate maintenance policy. Vatn et al. (1996) present an approach for identifying the optimal maintenance schedule for the components of a production system. Safety, health and environment objectives, maintenance costs and costs of lost production are all taken into consideration, and maintenance is thus optimized with respect to multiple objectives. The approach is flexible as it can be carried out at various levels of detail, e.g. adapted to available resources and to the management’s willingness to give detailed priorities with respect to objectives on safety vs. production loss. Frost and Dechter (1998) define the scheduling of preventive maintenance of power generating units within a power plant as constraint satisfaction problems. The general purpose of determining a maintenance schedule is to determine the duration and sequence of outages of power generating units over a given time period, while minimizing operating and maintenance costs over the planning period. Vaurio (1999) develops unavailability and cost rate functions for components whose failures can occur randomly. Failures can only be detected through periodic testing or inspections. If a failure occurs between consecutive inspections, the unit remains failed until the next inspection. 
Components are renewed by preventive maintenance periodically, or by repair or replacement after a failure, whichever occurs first (age-replacement). The model takes into account finite repair and maintenance durations as well as costs due to testing, repair, maintenance and lost production or accidents. For normally operating units the time-related penalty is loss of production. For standby safety equipment it is the expected cost of an accident that can happen when the component is down due to a dormant failure, repair or maintenance. The objective is to minimize the total cost rate with respect to the inspection and the replacement interval. General conditions and techniques are developed for solving optimal test and maintenance intervals, with and without constraints on the production loss or accident rate. Insights are gained into how the optimal intervals depend on various cost parameters and reliability characteristics. Van Dijkhuizen (2000) studies the problem of clustering preventive maintenance jobs in a multiple set-up multi-component production system. This article has been reviewed in Chapter 11, which gives an overview of multi-component maintenance models. Cassady et al. (2001) introduce the concept of selective maintenance. Often production systems are required to perform a sequence of operations with finite breaks between each operation. The authors establish a mathematical programming framework for assisting decision-makers in determining the optimal subset of maintenance activities to perform prior to beginning the next operation. This decision making process is referred to as selective maintenance. The article of Haghani and Shafahi (2002) deals with the problem of scheduling bus maintenance activities. A mathematical programming approach to the problem


is proposed. This approach takes as input a given daily operating schedule for all buses assigned to a depot along with available maintenance resources. Then a daily inspection and maintenance schedule is designed for the buses that require inspection so as to minimize the interruptions in the daily bus-operating schedule, maximize the reliability of the system and efficiently utilize the maintenance facilities.

Charles et al. (2003) examine the interaction effects of maintenance policies on batch plant scheduling in a semiconductor wafer fabrication facility. The purpose of the work is the improvement of the quality of maintenance department activities by the implementation of optimized preventive maintenance (PM) strategies and comes within the scope of a total productive maintenance (TPM) strategy. The production of semiconductor devices is carried out in a wafer lab. In this production environment equipment breakdown or procedure drifting usually induces unscheduled production interruptions.

Cheung et al. (2004) consider a plant with several units of different types. There are several shutdown periods for maintenance. The problem is to allocate units to these periods in such a way that production is least affected. Maintenance is not modelled in detail, but incorporated through frequency or period restrictions.
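
The allocation of units to shutdown periods can be illustrated with a toy enumeration. This is only a sketch of the problem type and not the method of Cheung et al. (2004). It assumes that each unit must be maintained exactly once, that a production loss is known for every unit and period, that period restrictions are given as a set of allowed periods per unit, and that each shutdown period can hold a limited number of units. The data layout and all names are hypothetical.

```python
from itertools import product


def allocate_units(loss, allowed, capacity):
    """Assign each unit to one shutdown period with minimum total production loss.

    loss[u][p]  : production loss if unit u is maintained in period p
    allowed[u]  : periods in which unit u may be maintained (period restriction)
    capacity    : maximum number of units per shutdown period (assumption)
    """
    units = sorted(loss)
    best_cost, best_plan = float("inf"), None
    # Enumerate every combination of allowed periods; only viable for toy sizes.
    for choice in product(*(sorted(allowed[u]) for u in units)):
        if any(choice.count(p) > capacity for p in set(choice)):
            continue  # too many units in one shutdown period
        cost = sum(loss[u][p] for u, p in zip(units, choice))
        if cost < best_cost:
            best_cost, best_plan = cost, dict(zip(units, choice))
    return best_cost, best_plan
```

Exhaustive enumeration grows exponentially with the number of units, so a realistic instance would call for a mathematical programming or constraint-based formulation instead.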

13.4 Specific Business Sectors

The purpose here is to illustrate the interdependence between maintenance and production for some specific sectors in more detail. Moreover, it shows what ideas were employed in which sector and the difference between them. Although many sectors could be distinguished we take those where maintenance plays an important role. Not surprisingly, these are all capital intensive sectors with high maintenance expenditure and we discuss railway, road, airline and electric power system maintenance.

13.4.1 Railway Maintenance

Since rail is an important transportation mode, proper maintenance of the existing lines and repairs and replacements carried out in time are all important to ensure efficient operation. Moreover, since some failures might have a strong impact on the safety of the passengers, it is important to prevent these failures by carrying out preventive maintenance works in time and according to predefined schedules. The preventive maintenance works are the small routine works and/or projects. The routine (spot) maintenance activities, which consist of inspections and small repairs (see Esveld 2001), do not take much time to be performed and are done regularly, with frequencies varying between monthly and once a year. The projects include renewal works and they are carried out once or twice every few years. In the literature there are a couple of articles that provide useful methods for finding optimal track possession intervals for carrying out preventive maintenance works, i.e. time periods when a track is required for maintenance and is therefore blocked for operation. In production planning terms track possession means


downtime required for maintenance. The main question is when to carry out maintenance such that the inconvenience for the train operators, the disruption to the scheduled trains and the infrastructure possession time for maintenance are minimized and the maintenance cost is the lowest possible. For a more detailed overview of techniques used in planning railway infrastructure maintenance we refer to Dekker and Budai (2002) and Improverail (2002).

In some articles (see, e.g. Higgins 1998, Cheung et al. 1999 and Budai et al. 2006) the track possession is modelled in between operations. This can be done for occasionally used tracks, which is the case in Australia and some European countries. If tracks are used frequently, one has to perform maintenance during nights, when the train traffic is almost absent, or during weekends (with possible interruption of the train services), when there are fewer disturbances for the passengers. In the first case one can either make a cyclic static schedule, which is done by Den Hertog et al. (2005) and Van Zante-de Fokkert (2001) for the Dutch situation, or a dynamic schedule with a rolling horizon, which is done in Cheung et al. (1999). The latter schedule has to be made regularly.

Some other articles deal with grouping railway maintenance activities to reduce costs, downtime and inconvenience for the travellers and operators. Here we mention the study of Budai et al. (2006) in which the preventive maintenance scheduling problem is introduced. This problem arises in other public/private sectors as well, since preventive maintenance of other technical systems (machine, road, airplanes, etc.) also contains small routine works and large projects.

13.4.2 Road Maintenance

Road maintenance has many common characteristics with railway maintenance. Failures are often indirect, in the sense that norms are surpassed, but there may not be any consequences.
The production function is indirect, but that does not mean that it is not felt by many. Governments may define a cost penalty due to one hour waiting per vehicle because of congestion caused by road maintenance. Similar to railway maintenance one sees that work is shifted to nights or a lot of work is combined into a large project on which the public is informed long before it is started. The night work causes high logistics costs for maintenance, yet it is useful for small repairs or patches. Another similarity with railroads is the large number of identical parts (a road is typically split up in lanes of 100 meters about which information is stored). Vans with complex road analysing equipment are used to assess the road quality. For railways special trains with complex measuring equipment are used. Videos are used in both cases. Next, both roads and rails have multiple failure modes. Furthermore, the assets to be maintained are spread out geographically, which results in high logistics costs for maintenance. This is also true for airline and truck maintenance. Both road and rail need much maintenance and as a result large budgets need to be allocated for both.

Although several articles have been written on road maintenance, few take the production or user consequences into account. We would like to mention Dekker et al. (1998a) who compare two concepts to do road maintenance – one with small projects carried out during nights and the other where large road segments (some


4 km) are overhauled in one stretch. In the latter case the traffic is diverted to other lanes or the side of the road. It is shown that the latter is both advantageous for the traffic and cheaper, provided the volume of traffic on the road is not too high. Another interesting contribution is from Rose and Bennett (1992) who provide a model to locate and decide on the size (or capacity) of road maintenance depots, for corrective maintenance.

13.4.3 Airline Maintenance

Maintenance costs are a substantial part of an airline’s costs. Estimates are that 20% of the cost is due to maintenance. Maintenance is crucial because of safety reasons and because of high downtime costs. Apart from a crash, the worst event for an airline is an aircraft on ground (AOG) because of failures. Accordingly a lot of technology has been developed to facilitate maintenance. We would like to mention in-flight diagnosis, such that quick actions can be taken on the ground, and a very high level of modularity, such that failed components can easily be replaced. Yet in an aircraft there is still a high level of time-based preventive maintenance rather than condition-based maintenance. A plane has to undergo several checks, ranging from an A check taking about an hour after each flight, to a monthly B check, a yearly C check and a five-yearly D check, where it is completely overhauled and which can take a month. The presence of the monthly check implies that planes cannot always fly the same route, but need to be rotated on a regular basis. It also implies that airlines need multiple units of a type in order to provide a consistent service.

Several studies have addressed the issue of fleet allocation and maintenance scheduling. In the fleet allocation one decides which planes fly which route and at which time. One would preferably make an allocation which remains fixed for a whole year, but due to the regular maintenance checks this is not possible.
Gopalan and Talluri (1998) give an overview of mathematical models on this problem. Moudani and Mora-Camino (2000) present a method to do both flight assignment and maintenance scheduling of planes. It uses dynamic programming and heuristics. A case of a charter airline is considered. Sriram and Haghani (2003) consider the same problem. They solve it in two phases. Finally, Feo and Bard (1989) consider the problem of maintenance base planning in relation to an airline’s fleet rotation, while Cohn and Barnhart (2003) consider the relation between crew scheduling and key maintenance routing decisions.

In another line of research, Dijkstra et al. (1994) develop a model to assess maintenance manpower scheduling and requirements in order to perform inspection checks (A type) between flight turnarounds. It appears that their workload is quite peaked because of many flights arriving more or less at the same time (so-called banks) in order to allow fast passenger transfers. The same problem is also tackled by Yan et al. (2004). The articles in this line of research consider in effect the production planning of maintenance, a topic also addressed in Section 13.5. As the last article in this category we would like to mention Cobb (1995) who presents a simulation model to evaluate current maintenance system performance or the positive effect of ad hoc operating decisions on maintenance turn times (i.e. the time maintenance takes to carry out a check or to do a repair).


13.4.4 Electric Power System Maintenance

Kralj and Petrovic (1988) have presented an overview article on optimal maintenance of thermal generating units in power systems. They primarily focused on articles published in IEEE Transactions on Power Apparatus and Systems. Here we will briefly discuss the typical problems of the maintenance of power systems and review two articles dealing with these problems. First of all, note that maintenance of power systems is costly, because it is impossible to store generated electrical energy. Moreover, the continuity of supply is very important for its customers. A second problem of scheduling the maintenance of power systems is that joint maintenance of units is often impossible or very expensive, since that would affect production too much.

Frost and Dechter (1998) consider the problem of scheduling preventive maintenance of power generating units within a power plant. The purpose of the maintenance scheduling is to determine the duration and sequence of outages of power generating units over a given time period, while minimizing operating and maintenance costs over the planning period, subject to various constraints. A subset of the constraints contains the pairs of components that cannot be maintained simultaneously. In this article the maintenance problems are cast as constraint satisfaction problems (CSPs). The optimal solution is found by solving a series of CSPs with successively tighter cost-bound constraints.

Langdon and Treleaven (1997) study the problem of scheduling maintenance for electrical power transmission networks. Grouping maintenance in the network may prevent the use of a cheap electricity generator, so requiring a more expensive generator to be run in its place. That is, some parts of the network should not be maintained simultaneously. These exclusions are modelled by adding restrictions to the MIP formulation of the problem.
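
The idea of solving a series of CSPs with successively tighter cost bounds can be sketched on a toy instance. The code below is an illustration under simplifying assumptions and is not the formulation of Frost and Dechter (1998). It assumes that each outage lasts one period, that costs are non-negative integers, and that the only constraints are pairs of units that may not be maintained in the same period. A plain backtracking search stands in for the CSP solver, and all names are hypothetical.

```python
def solve_csp(units, periods, cost, conflicts, bound):
    """Return any assignment unit -> period with total cost <= bound, or None."""
    assignment = {}

    def consistent(unit, period):
        # A conflicting pair may not share an outage period.
        return all(q != period or frozenset((unit, v)) not in conflicts
                   for v, q in assignment.items())

    def backtrack(i, spent):
        if spent > bound:  # prune; valid because costs are non-negative
            return False
        if i == len(units):
            return True
        unit = units[i]
        for period in periods:
            if consistent(unit, period):
                assignment[unit] = period
                if backtrack(i + 1, spent + cost[unit][period]):
                    return True
                del assignment[unit]
        return False

    return dict(assignment) if backtrack(0, 0) else None


def optimise_schedule(units, periods, cost, conflict_pairs):
    """Tighten the cost bound until the CSP becomes infeasible."""
    conflicts = {frozenset(pair) for pair in conflict_pairs}
    best, bound = None, float("inf")
    while True:
        solution = solve_csp(units, periods, cost, conflicts, bound)
        if solution is None:
            return best  # the last feasible solution is optimal
        best = solution
        # Integer costs assumed: demand a strictly cheaper schedule next time.
        bound = sum(cost[u][p] for u, p in solution.items()) - 1
```

Each pass only asks whether a schedule within the bound exists, so any satisfiability-oriented search technique can be reused unchanged. Optimality follows from the final infeasible pass.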

13.5 Production Planning of Maintenance

Maintenance can also be regarded as a production process which needs to be planned. Planning in this respect implies determining appropriate levels of capacity in relation to the demand. It will be clear that this activity can only be carried out for plannable maintenance, e.g. overhauls or refurbishment, and that it is only needed when there are capacity restrictions, e.g. in a shipyard. What distinguishes maintenance production planning from standard production planning is that there tend to be more unforeseen events and more intervening corrective maintenance work. Articles in this category are Dijkstra et al. (1994) and Yan et al. (2004), who both consider manpower determination and allocation problems in case of a fluctuating workload for aircraft maintenance. Shenoy and Bhadury (1993) use the MRP approach to develop a maintenance-manpower plan. Bengü (1994) discusses the organization of maintenance centres that are specialized to carry out particular types of maintenance jobs in the telecommunication sector. Al-Zubaidi and


Christer (1997) consider the problem of manpower planning for hospital building maintenance. Another typical production planning problem is with respect to layout planning. A case study for a maintenance tool room is described in Rosa and Feiring (1995). The study by Rose and Bennett (1992), which was discussed in Section 13.4, also falls into this category.

13.6 Integrated Production and Maintenance Planning

In recent years there has been considerable interest in models attempting to integrate production, quality and maintenance (Ben-Daya 2001). Whereas in the past these aspects have been treated as separate problems, nowadays models take into account the mutual interdependencies. Production planning typically concerns determining lot sizes and evaluating capacity needs, in case of fluctuating demand. Both the optimal lot size and the capacity needs are influenced by failures. On the other hand, maintenance prevents breakdowns and improves quality. Accordingly, they should be planned in an integrated way (see, e.g. Nahmias 2005).

We subdivide the class of integrated production and maintenance planning models into four categories: high-level models considering conceptual and process design problems (Section 13.6.1); the economic manufacturing quantity model, which was originally posed as a simple inventory problem, but has been (successfully) extended to deal with quality and failure aspects (Section 13.6.2); models of production systems with buffer capacities, which by definition are suitable to deal with breakdowns (Section 13.6.3); finally, production and maintenance rate optimization models, which aim to find the production and preventive/corrective maintenance rates of machines so as to minimize the total cost of inventory, production and maintenance (Section 13.6.4). In Section 13.6.5 we discuss articles which do not fit in any of the above categories.

13.6.1 Conceptual and Design Models

In a number of articles conceptual models are developed that integrate the preventive and corrective aspects of the maintenance planning, with aspects of the production system such as quality, service level and priority and capacity activities.
For instance, Finch and Gilbert (1986) present an integrated conceptual framework for maintenance and production in which they focus especially on manpower issues in corrective and preventive work. Weinstein and Chung (1999) test the hypothesis that integrating the maintenance policy with aggregate production planning significantly reduces total cost; this appears to be the case in the experimental setting they investigate. Lee (2005) considers production inventory planning, where high-level decisions on maintenance (viz. their effects) are made.

Another group of articles deals with integrating process design, production and maintenance planning. Decisions on the process system and the initial reliabilities of the equipment are already made at the design stage. Pistikopoulos et al. (2000) describe an optimization framework for general multipurpose process models,

Maintenance and Production: A Review

which determine both the optimal design and the production and maintenance plans simultaneously. In this framework, the basic process and system reliability-maintainability characteristics are determined in the design phase with the selection of system structure, components, etc. The remaining characteristics are determined in the operation phase with the selection of appropriate operating and maintenance policies. Therefore, the optimization of process system effectiveness depends on the simultaneous identification of optimal design, operation and maintenance policies, having properly accounted for their interactions. In Goel et al. (2003) a reliability allocation model is coupled with the existing design, production, and maintenance optimization framework. The aim is to identify the optimal size and initial reliability for each unit of equipment at the design stage. They balance the additional design and maintenance costs against the benefits obtained from increased process availability.

13.6.2 EMQ Problems

In the classical economic manufacturing quantity (EMQ) model items are produced at a constant rate p and the demand rate for the items is equal to d < p. The aim of the model is to find the production uptime that minimizes the sum of the inventory holding cost and the average fixed ordering cost. This model is an extension of the well-known economic order quantity (EOQ) model, the difference being that in the EOQ model an order is placed, and delivered in full, when there is no inventory, whereas in the EMQ model inventory builds up gradually during the production uptime. Note that the EMQ model is also referred to as the economic production quantity (EPQ) model.

In the extensive literature on production and inventory problems, it is often assumed that the production process does not fail, that it is not interrupted and that it only produces items of acceptable quality. Unfortunately, in practice this is not always the case. A production process can be interrupted due to a machine breakdown or because the quality of the produced items is no longer acceptable.
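As an illustration of the classical EMQ model of Section 13.6.2, the following minimal Python sketch computes the lot size and production uptime that minimize the average cost per unit time. Only the rates p and d come from the text; the symbols K (fixed ordering or setup cost per production run) and h (holding cost per item per unit time), and all function names, are assumptions introduced here. Under these assumptions the average cost is C(Q) = K d / Q + h Q (1 - d/p) / 2, which is minimized at Q* = sqrt(2 K d / (h (1 - d/p))).

```python
import math


def emq_cost(Q, K, h, p, d):
    """Average cost per unit time for lot size Q (setup plus holding)."""
    if not 0 < d < p:
        raise ValueError("the model requires 0 < d < p")
    setup = K * d / Q                      # d / Q production runs per unit time
    holding = h * Q * (1.0 - d / p) / 2.0  # average inventory is Q (1 - d/p) / 2
    return setup + holding


def emq_lot_size(K, h, p, d):
    """Lot size Q* that minimizes emq_cost."""
    if not 0 < d < p:
        raise ValueError("the model requires 0 < d < p")
    return math.sqrt(2.0 * K * d / (h * (1.0 - d / p)))


def emq_uptime(K, h, p, d):
    """Production uptime per cycle: time needed to produce Q* at rate p."""
    return emq_lot_size(K, h, p, d) / p


if __name__ == "__main__":
    # Hypothetical data: K = 100, h = 2, p = 2000 items/time, d = 1000 items/time.
    Q = emq_lot_size(100.0, 2.0, 2000.0, 1000.0)
    print(round(Q, 1), round(emq_uptime(100.0, 2.0, 2000.0, 1000.0), 4))
```

When p is very large relative to d, the factor (1 - d/p) tends to one and Q* reduces to the EOQ lot size.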
The EMQ model has been extended to deal with these aspects, and we divide the literature on EMQ models into two categories accordingly. First, we consider EMQ problems that take into account the quality of the items produced. The second category of EMQ models analyzes the effects of (stochastic machine) breakdowns on the lot sizing decision.

13.6.2.1 EMQ Problems with Quality Aspects

One of the reasons why a production process is interrupted is the (lack of) quality of the items produced. Obviously, items of inferior quality can only be sold at a lower price or cannot be sold at all. Thus, the production of these items results in a loss (or a lower profit) for the firm. This type of interruption is usually modelled as follows. It is assumed that at the start of the production cycle the production process is in an “in-control” state, producing items of acceptable quality. After some time the production process may shift to an “out-of-control” state. In this state a certain percentage of the items produced are defective or of substandard quality. The time that the process spends in the in-control state before the shift occurs is a random variable. Once a shift to the out-of-control state has occurred, it is assumed that the production process stays in that state unless it is

G. Budai, R. Dekker and R. Nicolai

discovered by (a periodic) inspection of the process, followed by corrective maintenance.

One of the earliest works to consider the problem of finding the optimal lot size and optimal inspection schedule is the article of Lee and Rosenblatt (1987). They show that the derived optimal lot size is smaller than the classical EMQ if the time for which the process stays in the in-control state follows an exponential distribution. Lee and Rosenblatt (1989) extend this work by assuming that the cost of restoration is a function of the time elapsed since the production process shifted from the in-control to the out-of-control state. In addition, shortages are allowed in the model.

Many attempts have been made to extend these two models. For instance, Tseng (1996) assumes that the process lifetime is arbitrarily distributed with an increasing failure rate. Furthermore, two maintenance actions are considered. The first is a perfect maintenance action, which restores the system to an as-good-as-new condition if the process is in the in-control state. If, however, the production process is in the out-of-control state, it is restored to the in-control state at a given restoration cost. Secondly, maintenance is always done at the end of a production cycle to ensure that the process is perfect at the beginning of each production cycle. Wang and Sheu (2003) assume that the periodic inspections are imperfect. Two types of inspection errors are considered, namely (I) the process is declared out-of-control when it is in-control and (II) the process is declared in-control when it is out-of-control. They use a Markov chain to jointly determine the production cycle, process inspection intervals, and maintenance level. Wang (2006) derives some structural properties for the optimal production/preventive maintenance policy, under the assumption that the (sufficient) conditions for the optimality of the equal-interval PM schedule hold.
This increases the efficiency of the solution procedure.

The quality characteristics of the product in a production process can be monitored by an x̄-control chart. The economic design of the x̄-control chart determines the sample size n, the sampling interval h, and the control limit coefficient k such that the total cost is minimized. Rahim (1994) develops an economic model for the joint determination of production quantity, inspection schedule and control chart design for a production process that is subject to a non-Markovian random shock. In this model it is assumed that the in-control period follows a general probability distribution with an increasing failure rate and that production ceases only if the process is found to be out of control during inspection. However, if the alarm turns out to be false, the time spent searching for an assignable cause is assumed to be zero. Rahim and Ben-Daya (1998) generalize the model of Rahim (1994) by assuming that production stops for a fixed amount of time not only for a true alarm, but also whenever there is a false alarm during the in-control state. Rahim and Ben-Daya (2001) further extend the model of Rahim (1994) by looking at the effect of deteriorating products and a deteriorating production process on the optimal production quantity, inspection schedule and control chart design parameters. The deterioration times for both product and process are assumed to follow Weibull distributions. It is assumed that the process is stopped either at failure or at the m-th inspection

interval, whichever occurs first. Furthermore, the inventory is depleted to zero before a new cycle starts. Tagaras (1988) develops an economic model that incorporates both process control and maintenance policies, and simultaneously optimizes their design parameters. Lam and Rahim (2002) present an integrated model for the joint determination of the economic design of x̄-control charts, economic production quantity, production run length and maintenance schedules for a deteriorating production system. In the model of Ben-Daya and Makhdoum (1998) PM activities are also coordinated with quality control inspections, but they are carried out only when a preset threshold of the shift rate of the production process is reached.

13.6.2.2 EMQ Problems with Failure Aspects

Several articles study the EMQ model in the presence of random machine breakdowns or random failures of a bottleneck component. For instance, Groenevelt et al. (1992a) consider the effects of stochastic machine breakdowns and corrective maintenance on economic lot sizing decisions. Maintenance of the machine is carried out after a failure or after a predetermined time interval, whichever occurs first. They consider two production control policies. Under the first policy, when the machine breaks down the interrupted lot is not resumed and a new lot starts only when all available inventory is depleted. Under the second policy, production is immediately resumed after a breakdown if the current on-hand inventory is below a certain threshold level. They show that under these policies the optimal lot size increases with the failure rate and that, assuming a constant failure rate and instantaneous repair times, the optimal lot sizes are always larger than the EMQ. Nevertheless, Groenevelt et al. (1992a) propose to use the EMQ as an approximation to the optimal production lot size. Chung (2003) provides a better approximation to the optimal production lot size. Groenevelt et al.
(1992b) study the problem of selecting the economic lot size for an unreliable manufacturing facility with a constant failure rate and generally distributed repair times. The quantity of safety stock that is used while the machine is being repaired is derived from the managerially prescribed service level. Makis and Fung (1995) present a model for the joint determination of the lot size, inspection interval and preventive replacement time for a production facility that is subject to random failure. The time that the process stays in the in-control state is exponentially distributed and, once the process is in the out-of-control state, a certain percentage of the items produced is defective or qualitatively unacceptable. Periodic inspections are done to review the production process, and the time to machine failure is a generally distributed random variable. Preventive replacement of the production facility is based on operation time, i.e. after a certain number of production runs the production facility is replaced.

Some other articles are concerned with PM policies for EMQ models. For instance, in Srinivasan and Lee (1996) an (S, s) policy is considered, i.e. as soon as the inventory level reaches S, a preventive maintenance operation is initiated and the machine becomes as good as new. After the preventive maintenance operation, production resumes as soon as the inventory level drops to or below a prespecified value, s, and the facility continues to produce items until the inventory level is raised back to S. If the facility breaks down during operation, it is minimally repaired and put back into commission. Okamura et al. (2001) generalize the model of Srinivasan and Lee (1996) by assuming that both the demand and the production process are continuous-time renewal counting processes. Furthermore, they suppose that machine breakdowns occur according to a non-homogeneous Poisson process. In Lee and Srinivasan (2001) the demand and production rates are considered constant and a production run begins as soon as the inventory drops to zero. If the facility fails during operation, it is assumed to be repaired, but the repair restores the facility only to the condition it was in before the failure. Lee and Srinivasan (2001) consider an (S, N) policy, where the control variable N specifies the number of production cycles the machine should go through before it is set aside for a preventive maintenance overhaul, which restores the facility to its original condition.

Recently, Lin and Gong (2006) determined the effect of breakdowns on the decision of optimal production uptime for items subject to exponential deterioration under a no-resumption policy. Under this policy, a production run is executed for a predetermined period of time provided that no machine breakdown has occurred in this period. Otherwise, the production run is immediately aborted. The inventories are built up gradually during the production uptime and a new production run starts only when all on-hand inventories are depleted. If a breakdown occurs then corrective maintenance is carried out and this takes a fixed amount of time. If the inventory build-up during the production uptime is not enough to meet the demand during the entire period of the corrective maintenance, shortages (lost sales) will occur. Maintenance restores the production system to the same initial working conditions.
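The no-resumption policy described above can be illustrated with a small Monte Carlo sketch. It is deliberately simpler than the model of Lin and Gong (2006): item deterioration is ignored and the time to breakdown is assumed to be exponential with rate lam. The names lam, planned_uptime and repair_time, and the function names, are hypothetical. With planned uptime T, the realized uptime is min(X, T) for an exponential X, so its mean is (1 - exp(-lam T)) / lam; a breakdown causes lost sales whenever the stock (p - d) X built up before the failure is smaller than the demand d times repair_time during the fixed repair.

```python
import math
import random


def expected_uptime(lam, planned_uptime):
    """Closed-form mean of min(X, T) for X ~ Exp(lam) and T = planned_uptime."""
    return (1.0 - math.exp(-lam * planned_uptime)) / lam


def simulate_no_resumption(p, d, lam, planned_uptime, repair_time,
                           n_runs=100_000, seed=0):
    """Simulate production runs under a simplified no-resumption policy.

    Returns (mean realized uptime, mean lost sales per production run).
    Assumptions (not from the source): exponential time to breakdown,
    no item deterioration, unmet demand during repair is lost.
    """
    rng = random.Random(seed)
    total_uptime = 0.0
    total_lost = 0.0
    for _ in range(n_runs):
        time_to_failure = rng.expovariate(lam)
        if time_to_failure >= planned_uptime:
            # No breakdown during the run: it completes as planned.
            total_uptime += planned_uptime
            continue
        # Breakdown: the run is aborted and a repair of fixed length starts.
        total_uptime += time_to_failure
        stock = (p - d) * time_to_failure          # inventory built so far
        demand_during_repair = d * repair_time
        total_lost += max(0.0, demand_during_repair - stock)
    return total_uptime / n_runs, total_lost / n_runs


if __name__ == "__main__":
    mean_up, mean_lost = simulate_no_resumption(
        p=10.0, d=4.0, lam=0.5, planned_uptime=2.0, repair_time=1.0)
    print(mean_up, expected_uptime(0.5, 2.0), mean_lost)
```

Comparing the simulated mean uptime with the closed form is a quick sanity check on the simulation; the lost-sales estimate shows how a longer planned uptime trades holding cost against shortage risk.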
13.6.3 Deteriorating Production System with Buffer Capacity

In order to reduce the negative effect of a machine breakdown on the production process, a buffer inventory may be built up during the production uptime (as is done in the EMQ model). The role of this buffer inventory is to satisfy the demand during the period in which corrective maintenance is carried out after an unexpected failure of the installation. One of the earliest works on this subject is Van der Duyn Schouten and Vanneste (1995). In their model the demand rate is constant and equal to d (units/time) and, as long as the fixed buffer capacity (K) is not reached, the installation operates at a constant rate of p units/time (p > d) and the excess output is stored in the buffer. When the buffer is full, the installation reduces its speed from p to d. Upon failure corrective maintenance starts, after which the installation is as good as new. It is also possible to perform preventive maintenance, which takes less time than repair and likewise brings the installation back into the as-good-as-new condition. The decision to start a preventive maintenance action is based not only on the condition of the installation, but also on the level of the buffer. The criterion is to minimize the average inventory level and the average number of backorders. Since the optimal policy is difficult to implement, the authors develop suboptimal (n, N, k) control-limit policies. Under such a policy, if the buffer is full, preventive maintenance is undertaken at age n. If the buffer is not full, but it has at least k items, preventive maintenance is undertaken at age N. Preventive maintenance is never performed unless the buffer holds at least k items. The objective is to obtain the best values for n, N and k.

Iravani and Duenyas (2002) extend the above model by assuming a stochastic demand and production process. Demand that cannot be met from inventory is lost and a penalty is incurred. Moreover, it is assumed that the production characteristics of the system change with usage: the more the system deteriorates, the lower its production rate and the more time-consuming and costly its maintenance. In a recent article, Yao et al. (2005) assume that the production system can produce at any rate from 0 (idle) to its maximum rate if it is in the working state. Upon failure corrective maintenance is performed immediately to restore the system to the working state. Preventive maintenance actions can be performed as well. Both the failure process and the times to complete corrective/preventive maintenance are assumed to be stochastic. Thus, in addition to the direct cost of performing corrective/preventive maintenance, the non-negligible maintenance completion time leads to an indirect cost of lost production capacity due to system unavailability.

Kyriakidis and Dimitrakos (2006) study an infinite-state generalisation of Van der Duyn Schouten and Vanneste (1995). The deterioration process of the installation is considered nonstationary, i.e. the transition probabilities depend not only on the working condition of the installation but also on its age and the buffer level. Furthermore, the cost structure is more general than in Van der Duyn Schouten and Vanneste (1995) since it includes operating and maintenance costs of the installation as well as storage and shortage costs. It is assumed that the operating costs of the installation depend on both the working condition and the age of the installation.

Another way of maintaining the buffer inventory is according to an (S, s) policy, i.e.
the system stops production when the buffer inventory reaches S and production restarts when the inventory drops to s. This idea is used by Das and Sarkar (1999). They assume that exogenous demand for the product arrives according to a Poisson process. Back-orders are not allowed. The unit production time, the time between failures, and the repair and maintenance times are assumed to have general probability distributions. Preventive maintenance decisions are made only at the time that the buffer inventory reaches S, and they depend on both the current inventory level and the number of items produced since the last repair/maintenance operation. The objective is to determine when to perform preventive maintenance on the system in order to improve the system performance.

A different approach to integrated maintenance/production scheduling with buffer capacity is presented in Chelbi and Ait-Kadi (2004). They assume that preventive maintenance actions are performed regularly (every T time periods) and that the duration of corrective and preventive maintenance actions is random. The proposed strategy consists of building up a buffer stock whose size S covers at least the average consumption during the repair periods following breakdowns within the period of length T. When the production unit has to be stopped to undertake the planned preventive maintenance actions, a certain level of buffer stock must still be available in order to avoid stoppage of the subsequent assembly line. The two decision variables are the period T at which preventive maintenance must be performed and the level S of the buffer stock.
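The (n, N, k) control-limit policy of Van der Duyn Schouten and Vanneste (1995), described earlier in this section, can be written as a simple decision rule. The sketch below is a minimal illustration and not the authors' implementation: the function and argument names are hypothetical, the age of the installation is treated as a plain number, and preventive maintenance is assumed to start as soon as the age reaches the relevant limit.

```python
def start_preventive_maintenance(age, buffer_level, capacity, n, N, k):
    """Return True if PM should start under an (n, N, k) control-limit rule.

    age          -- current age of the installation
    buffer_level -- current buffer content, 0 <= buffer_level <= capacity
    capacity     -- fixed buffer capacity (K in the text)
    n            -- age limit used when the buffer is full
    N            -- age limit used when the buffer holds at least k items
    k            -- minimum buffer content below which PM is never started
    """
    if not 0 <= buffer_level <= capacity:
        raise ValueError("buffer_level must lie between 0 and capacity")
    if buffer_level < k:
        return False              # too little stock to cover a PM stoppage
    if buffer_level == capacity:
        return age >= n           # full buffer: use the age limit n
    return age >= N               # partly filled buffer: use the age limit N


if __name__ == "__main__":
    # Hypothetical parameters: capacity 10, n = 5, N = 8, k = 4.
    for age, level in [(6, 10), (6, 7), (9, 7), (9, 2)]:
        print(age, level, start_preventive_maintenance(age, level, 10, 5, 8, 4))
```

If n is chosen no larger than N, a full buffer triggers preventive maintenance at a younger age than a partly filled one, since more stock is available to cover demand during the stoppage.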

A recent article of Kenne et al. (2006) considers the effects of both preventive maintenance policies and machine age on optimal safety stock levels. As the machine ages, significant stock levels are held to hedge against more frequent random failures. The objective of the study is to determine when to perform preventive maintenance on the machine and to find the level of the safety stock to be maintained.

13.6.4 Production and Maintenance Rate Optimization

Integrated production and maintenance planning can also be carried out by optimizing the production and maintenance rates of the machines under consideration. In this line of research we mention the work of Gharbi and Kenne (2000, 2005), Kenne and Boukas (2003) and Kenne et al. (2003). In these articles a multiple-identical-machine manufacturing system with random breakdowns, repairs and preventive maintenance activities is studied. The objective of the control problem is to find the production and preventive maintenance rates of the machines so as to minimize the total cost of inventory/backlog, repair and preventive maintenance.

13.6.5 Miscellaneous

Finally, we list some articles that deal with integrated maintenance and production planning, but whose modelling approaches or problem settings differ from those of the articles in the previous categories. For instance, the model presented in Ashayeri et al. (1996) deals with the scheduling of production and preventive maintenance jobs on multiple production lines, where each line has one bottleneck machine. The model indicates whether or not to produce a certain item in a certain period on a certain production line. In Kianfar (2005) the manufacturing system is composed of one machine that produces a single product. The failure rate of the machine is a function of its age and the demand for the manufactured product is time-dependent; its rate depends on the level of advertisement of the product.
The objective is to maximize the expected discounted total profit of the firm over an infinite time horizon.

Sarper (1993) considers the following problem: given a fixed repair/maintenance capacity, how many of each of the low-demand large items (LDLIs) should be started so that there are no incomplete jobs at the end of the production period? The goal is to ensure that the portion of the total demand that is started will be completed, regardless of the amount by which some machines may stay idle due to insufficient work. A mixed-integer model is presented to determine what portion of the demand for each LDLI type should be rejected as lost sales so that the remaining portion can be finished completely.

13.7 Trends and Open Areas

Initial publications on models in the production and maintenance area date from the end of the 1980s (Lee and Rosenblatt 1987). Since that time many papers have

been published, with the majority dating from the 1990s and the new millennium. The most popular area in this review, that of integrated models for maintenance and production, is also the oldest one. Nevertheless, many papers still appear in that area and the models are becoming more and more complex, with more decision parameters and more aspects. The topics of opportunity maintenance and scheduling maintenance in line with production have also been popular, but perhaps more in the past than today. We expected to find more studies on specific business sectors, but found many only for the airline sector. That sector seems to be the most popular because it combines a lot of interaction between maintenance and production with high costs. In the other sectors we do see the interaction, and perhaps more papers will be published in the future. The other sections are interesting but small in terms of papers published.

In general, the demands on maintenance are becoming higher as the public and companies are less likely to accept failures, bad-quality products or non-performance. Yet at the same time the inventory of capital goods in western societies is both increasing and ageing. This is very much the case for roads, railways, electric power generation, transport, and aircraft. As there are continuous pressures on maintenance budgets, we foresee a need for research supporting maintenance and production decisions, also because decision support software is gaining in popularity and more data is becoming electronically available. A theory is therefore needed for such decision support systems. As several case studies have taught us that practical problems have many complex aspects, there is a great need for more theory that can help us to understand and improve complex maintenance decision-making.

13.8 Conclusions

In this chapter we have given an overview of planning models for production and maintenance. These models are classified on the basis of the interactions between maintenance and production. First, although maintenance is intended to allow production, production is often stopped during maintenance. The question arises of when to do maintenance such that production is least affected. In order to answer this question, planning models should take into account the needs of production. These needs are business sector specific and thus applications of planning models in different areas have been considered. In comparison with other specific sectors, much work has been done on modelling maintenance for the airline sector. Second, maintenance itself can also be seen as a production process which needs to be planned. Models for maintenance production planning mainly address allocation and manpower determination problems. Finally, maintenance also affects the production process since it takes capacity away. In production processes maintenance is mostly initiated by machine failures or low-quality items. Maintenance and production should therefore be planned in an integrated way to deal with these aspects. Indeed, integrated maintenance and production planning models determine optimal lot sizes while taking into account failure and quality aspects. We observe

continuing attention for such models, which take more and more “real world” aspects into account.

Although many articles have been written on the interaction between production and maintenance, a careful reader will detect several open issues in this review. The theory developed thus far is far from complete and any real application is likely to reveal many more open issues.

13.9 Acknowledgements

The authors would like to thank Georgios Nenes, Sophia Panagiotidou, and the editors for their helpful suggestions and comments.

13.10 References

Al-Zubaidi H, Christer A, (1997) Maintenance manpower modelling for a hospital building complex. European Journal of Operational Research 99:603–618
Ashayeri J, Teelen A, Selen W, (1996) A production and maintenance planning model for the process industry. International Journal of Production Research 34:3311–3326
Bäckert W, Rippin D, (1985) The determination of maintenance strategies for plants subject to breakdown. Computers and Chemical Engineering 9(2):113–126
Ben-Daya M, Makhdoum M, (1998) Integrated production and quality model under various preventive maintenance policies. Journal of the Operational Research Society 49(8):840–853
Ben-Daya M, Rahim M, (2001) Integrated production, quality & maintenance models: an overview. In Rahim M, Ben-Daya M, (eds) Integrated models in production planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 3–28
Bengü G, (1994) Telecommunications systems maintenance. Computers and Operations Research 21:337–351
Budai G, Huisman D, Dekker R, (2006) Scheduling preventive railway maintenance activities. Journal of the Operational Research Society 57:1035–1044
Cassady C, Pohl E, Murdock W, (2001) Selective maintenance modeling for industrial systems. Journal of Quality in Maintenance Engineering 7(2):104–117
Charles A, Floru I, Azzaro-Pantel C, Pibouleau L, Domenech S, (2003) Optimization of preventive maintenance strategies in a multipurpose batch plant: application to semiconductor manufacturing. Computers and Chemical Engineering 27:449–467
Chelbi A, Ait-Kadi D, (2004) Analysis of a production/inventory system with randomly failing production unit submitted to regular preventive maintenance. European Journal of Operational Research 156:712–718
Cheung B, Chow K, Hui L, Yong A, (1999) Railway track possession assignment using constraint satisfaction. Engineering Applications of AI 12(5):599–611
Cheung K, Hui C, Sakamoto H, Hirata K, O'Young L, (2004) Short-term site-wide maintenance scheduling. Computers and Chemical Engineering 28:91–102
Cho D, Parlar M, (1991) A survey of maintenance models for multi-unit systems. European Journal of Operational Research 51:1–23
Chung K, (2003) Approximations to production lot sizing with machine breakdowns. Computers & Operations Research 30:1499–1507
Cobb R, (1995) Modeling aircraft repair turntime: simulation supports maintenance marketing. Journal of Air Transport Management 2:25–32

Cohn A, Barnhart C, (2003) Improving crew scheduling by incorporating key maintenance routing decisions. Operations Research 51(3):387–396
Dagpunar J, (1996) A maintenance model with opportunities and interrupt replacement options. Journal of the Operational Research Society 47:1406–1409
Das T, Sarkar S, (1999) Optimal preventive maintenance in a production inventory. IIE Transactions 31:537–551
Dedopoulos L, Shah N, (1995) Preventive maintenance policy optimisation for multipurpose plant equipment. Computers and Chemical Engineering 19:693–698
Dekker R, Budai G, (2002) An overview of techniques used in planning railway infrastructure maintenance. In Geraerds W, Sherwin D, (eds), Proceedings of IFRIMmmm (maintenance management and modelling) conference, Vaxjo University, Sweden, 1–8
Dekker R, Dijkstra M, (1992) Opportunity-based age replacement: exponentially distributed times between opportunities. Naval Research Logistics 39:175–190
Dekker R, Plasmeijer R, (2001) Multi-parameter maintenance optimisation via the marginal cost approach. Journal of the Operational Research Society 52:188–197
Dekker R, Smeitink E, (1991) Opportunity-based block replacement. European Journal of Operational Research 53:46–63
Dekker R, Smeitink E, (1994) Preventive maintenance at opportunities of restricted duration. Naval Research Logistics 41:335–353
Dekker R, van Rijn C, (1996) Prompt - a decision support system for opportunity based preventive maintenance. In Özekici S, (ed) Reliability and Maintenance of Complex Systems, NATO ASI series 154:530–549
Dekker R, Plasmeijer R, Swart J, (1998a) Evaluation of a new maintenance concept for the preservation of highways. IMA Journal of Mathematics applied in Business and Industry 9:109–156
Dekker R, van der Meer J, Plasmeijer R, Wildeman R, (1998b) Maintenance of lightstandards - a case-study. Journal of the Operational Research Society 49:132–143
Den Hertog D, van Zante-de Fokkert J, Sjamaar S, Beusmans R, (2005) Optimal working zone division for safe track maintenance in the Netherlands. Accident Analysis and Prevention 37:890–893
Dijkstra M, Kroon L, Salomon M, van Nunen J, van Wassenhoven L, (1994) Planning the size and organization of KLM's aircraft maintenance personnel. Interfaces 24:47–58
Edwards D, Holt G, Harris F, (2002) Predicting downtime costs of tracked hydraulic excavators operating in the UK opencast mining industry. Construction Management & Economics 20:581–591
Esveld C, (2001) Modern Railway Track. MRT-Productions, Zaltbommel, The Netherlands
Feo T, Bard J, (1989) Flight scheduling and maintenance base planning. Management Science 35(12):1415–1432
Finch B, Gilbert J, (1986) Developing maintenance craft labor efficiency through an integrated planning and control system: a prescriptive model. Journal of Operations Management 6(4):449–459
Frost D, Dechter R, (1998) Optimizing with constraints: a case study in scheduling maintenance of electric power units. Lecture Notes in Computer Science 1520:469–488
Geraerds W, (1985) The cost of downtime for maintenance: preliminary considerations. Maintenance Management International 5:13–21
Gharbi A, Kenne J, (2000) Production and preventive maintenance rates control for a manufacturing system: an experimental design approach. International Journal of Production Economics 65:275–287
Gharbi A, Kenne J, (2005) Maintenance scheduling and production control of multiple-machine manufacturing systems. Computers and Industrial Engineering 48:693–707

Goel H, Grievink J, Weijnen M, (2003) Integrated optimal reliable design, production, and maintenance planning for multipurpose process plant. Computers and Chemical Engineering 27:1543–1555
Gopalan R, Talluri K, (1998) Mathematical models in airline schedule planning: a survey. Annals of Operations Research 76(1):155–185
Groenevelt H, Pintelon L, Seidmann A, (1992a) Production batching with machine breakdowns and safety stocks. Operations Research 40(5):959–971
Groenevelt H, Pintelon L, Seidmann A, (1992b) Production lot sizing with machine breakdowns. Management Science 38(1):104–123
Haghani A, Shafahi Y, (2002) Bus maintenance systems and maintenance scheduling: model formulations and solutions. Transportation Research Part A 36:453–482
Higgins A, (1998) Scheduling of railway maintenance activities and crews. Journal of the Operational Research Society 49:1026–1033
Improverail (2002) http://www.tis.pt/proj/improverail/downloads/d6final.pdf (accessed September 26, 2006)
Iravani S, Duenyas I, (2002) Integrated maintenance and production control of a deteriorating production system. IIE Transactions 34:423–435
Kenne J, Boukas E, (2003) Hierarchical control of production and maintenance rates in manufacturing systems. Journal of Quality in Maintenance Engineering 9:66–82
Kenne J, Boukas E, Gharbi A, (2003) Control of production and corrective maintenance rates in a multiple-machine, multiple-product manufacturing system. Mathematical and Computer Modelling 38:351–365
Kenne J, Gharbi A, Beit M, (2006) Age-dependent production planning and maintenance strategies in unreliable manufacturing systems with lost sale. Accepted for publication in European Journal of Operational Research 178(2):408–420
Kianfar F, (2005) A numerical method to approximate optimal production and maintenance plan in a flexible manufacturing system. Applied Mathematics and Computation 170:924–940
Knight P, Jullian F, Jofre L, (2005) Assessing the “size” of the prize: developing business cases for maintenance improvement projects. Proceedings of the International Physical Asset Management Conference, 284–302
Kralj B, Petrovic R, (1988) Optimal preventive maintenance scheduling of thermal generating units in power systems – a survey of problem formulations and solution methods. European Journal of Operational Research 35:1–15
Kyriakidis E, Dimitrakos T, (2006) Optimal preventive maintenance of a production system with an intermediate buffer. European Journal of Operational Research 168:86–99
Lam K, Rahim M, (2002) A sensitivity analysis of an integrated model for joint determination of economic design of x̄-control charts, economic production quantity and production run length for a deteriorating production system. Quality and Reliability Engineering International 18:305–320
Langdon W, Treleaven P, (1997) Scheduling maintenance of electrical power transmission networks using genetic programming. In Warwick K, Ekwue A, Aggarwal A, (eds), Artificial intelligence techniques in power systems, Institution of Electrical Engineers, Stevenage, UK, 220–237
Lee H, (2005) A cost/benefit model for investments in inventory and preventive maintenance in an imperfect production system. Computers and Industrial Engineering 48:55–68
Lee H, Rosenblatt M, (1987) Simultaneous determination of production cycle and inspection schedules in a production system. Management Science 33:1125–1137
Lee H, Rosenblatt M, (1989) A production and maintenance planning model with restoration cost dependent on detection delay. IIE Transactions 21(4):368–375


Lee H, Srinivasan M, (2001) A production/inventory policy for an unreliable machine. In Rahim M, Ben-Daya M, (eds) Integrated models in production planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 79–94 Lin G, Gong D, (2006) On a production-inventory system of deteriorating items subject to random machine breakdowns with a fixed repair time. Mathematics and Computer Modelling 43:920–932 Makis V, Fung J, (1995) Optimal preventive replacement, lot sizing and inspection policy for a deteriorating production system. Journal of Quality in Maintenance Engineering, 1(4): 41–55 Moudani WE, Mora-Camino F, (2000) A dynamic approach for aircraft assignment and maintenance scheduling by airlines. Journal of Air Transport Management 6:233–237 Nahmias S, (2005) Production and operations analysis (5th ed). McGraw-Hill, Boston Okamura H, Dohi T, Osaki S, (2001) Computation algorithms of cost-effective EMQ policies with PM. In Rahim M, Ben-Daya M, (eds) Integrated models in production planning, inventory, quality, and maintenance, Kluwer Academic Publishers, 31–65 Pistikopoulos E, Vassiliadis C, Papageorgiou L, (2000) Process design for maintainability: an optimization approach. Computers and Chemical Engineering 24:203–208 Rahim M, (1994) Joint determination of production quantity, inspection schedule, and control chart design. IIE Transactions, 26(6), 2–11 Rahim M, Ben-Daya M, (1998) A generalized economic model for joint determination of production run, inspection schedule and control chart design. International Journal of Production Research 36:277–289 Rahim M, Ben-Daya M, (2001) Joint determination of production quantity, inspection schedule, and quality control for an imperfect process with deteriorating products. Journal of the Operational Research Society 52(12):1370–1378 Rosa L, Feiring B, (1995) Layout problem for an aircraft maintenance company tool room. 
International Journal of Production Economics 40:219–230 Rose G, Bennett D, (1992) Locating and sizing road maintenance depots. European Journal of Operations Research 63:151–163 Sarper H, (1993) Scheduling for the maintenance of completely processed low-demand large items. Applied Mathematical Modelling 17:321–328 Shenoy D, Bhadury B, (1993) MRSRP – a tool for manpower resources and spares requirements planning. Computers and Industrial Engineering 24:421–439 Srinivasan M, Lee H, (1996) Production-inventory systems with preventive maintenance. IIE Transactions 28:879–890 Sriram C, Haghani A, (2003) An optimization model for aircraft maintenance scheduling and re-assignment. Transportation Research Part A 37:29–48 Tagaras G, (1988) An integrated cost model for the joint optimization of process control and maintenance. Journal of the Operational Research Society 39(8):757–766 Tan J, Kramer M, (1997) A general framework for preventive maintenance optimization in chemical process operations. Computers and Chemical Engineering 21(12):1451–1469 Tseng S, (1996) Optimal preventive maintenance policy for deteriorating production systems. IIE Transactions 28:687–694 Van der Duyn Schouten F, Vanneste S, (1995) Maintenance optimization of a production system with buffer capacity. European Journal of Operational Research 82:323–338 Van der Duyn Schouten F, van Vlijmen B, Vos de Wael S, (1998) Replacement policies for traffic control signals. IMA Journal of Mathematics Applied in Business and Industry 9:325–346 Van Dijkhuizen G, (2000) Maintenance grouping in multi-setup multi-component production systems. In Ben-Daya M, Duffuaa M, Raouf A, (eds) Maintenance, Modeling and Optimization, Kluwer Academic Publishers, 283–306


Van Zante-de Fokkert J, den Hertog D, van den Berg F, Verhoeven J, (2001) Safe track maintenance for the Dutch Railways, Part II: Maintenance schedule. Technical report, Tilburg University, the Netherlands Vatn J, Hokstad P, Bodsberg L, (1996) An overall model for maintenance optimization. Reliability Engineering and System Safety 51:241–257 Vaurio J, (1999) Availability and cost functions for periodically inspected preventively maintained units. Reliability Engineering and System Safety 63:133–140 Wang C, (2006) Optimal production and maintenance policy for imperfect production systems. Naval Research Logistics 53:151–156 Wang C, Sheu S, (2003) Determining the optimal production-maintenance policy with inspection errors: using a Markov chain. Computers & Operations Research 30:1–17 Weinstein L, Chung C, (1999) Integrating maintenance and production decisions in a hierarchical production planning environment. Computers & Operations Research 26:1059–1074 Wijnmalen D, Hontelez A, (1997) Coordinated condition-based repair strategies for components of a multi-component maintenance system with discounts. European Journal of Operational Research 98:52–63 Yan S, Yang T, Chen H, (2004) Airline short-term maintenance manpower supply planning. Transportation Research Part A 38:615–642 Yao X, Xie X, Fu M, Marcus S, (2005) Optimal joint preventive maintenance and production policies. Naval Research Logistics 52:668–681

14 Delay Time Modelling

Wenbin Wang

14.1 Introduction

In this chapter we present a modelling tool created to model the problems of inspection maintenance and planned maintenance interventions, namely delay time modelling (DTM). This concept provides a modelling framework readily applicable to a wide class of actual industrial maintenance problems of assets in general, and inspection problems in particular.

The concept of the delay time was first mentioned by Christer (1976) in the context of building maintenance. It was not until 1984 that the concept was first applied to an industrial maintenance problem (Christer and Waller 1984). Since then, a series of research papers has appeared on the theory and applications of delay time modelling of industrial asset inspection problems; see Christer (1999) for a detailed review.

The delay time concept itself is simple: it defines the failure process of an asset as a two-stage process. The first stage is the normal operating stage, from new to the point at which a hidden defect can first be identified. The second stage is the failure delay time, from the point at which the defect can first be identified to failure. It is the existence of such a failure delay time which provides the opportunity for preventive maintenance to be carried out to remove or rectify the identified defects before they cause failures. With appropriate modelling of the durations of these two stages, optimal inspection intervals can be identified to optimise a criterion function of interest.

The delay time concept is similar in definition to the well-known potential failure (PF) interval in reliability centred maintenance (Moubray 1997). Two differences between the definitions, however, mark a fundamental difference in modelling maintenance inspection of assets. First, the delay time is random in Christer’s definition, while the PF interval is assumed to be constant. Second, the initial point at which a defect can be identified is very important to the setting of an appropriate inspection interval, but it is ignored by Moubray. Moreover, Moubray did not provide any means of modelling the inspection practice, while DTM


provides a rich source of modelling methodologies ranging from the concept to practical solutions.

Asset inspection modelling has long been researched by many others. Among them, the model proposed by Barlow and Proschan (1965) is perhaps the most famous. They consider a unit subject to inspections as follows. The unit is inspected at prespecified times, where each inspection is executed perfectly and instantaneously. The policy terminates with an inspection that detects the unit failure. This implies that the unit may already have failed during an operating interval between inspections, but the failure can only be identified at the next inspection. Various modifications and extensions to Barlow and Proschan’s model have been proposed; see, for example, Thomas et al. (1991), Luss (1983), Abdel-Hameed (1995), Kaio and Osaki (1989) and McCall (1965).

The delay time inspection model differs from the classical Barlow and Proschan model on two counts. First, a failure is identified immediately when it occurs. This is perhaps more rational than the assumption in Barlow and Proschan’s model since, if the system fails, it may have stopped operating and the failure should be observed immediately by the operators. Second, DTM includes a failure delay time which characterises the abnormal deterioration before failure, and which is not defined in Barlow and Proschan’s model. It is noted, however, that for a certain class of equipment, such as fire extinguishers, Barlow and Proschan’s model is appropriate.

To clarify the objective of the type of inspection modelling we are concerned with here, consider a plant item with an inspection practice every period $T$ (say weeks or months), with repair of failures undertaken as they arise. The inspection consists of a check list of activities to be undertaken, and a general inspection of the operational state of the plant. Any defect identified leads to immediate repair, and the objective of the inspection is to minimise operational downtime. Other objectives could be considered, for example cost, availability or output. There are other types of inspection activity, such as condition monitoring and preventive maintenance, which are introduced and discussed elsewhere in this book; for now we focus on the inspection practice outlined above, using the delay time inspection modelling technique.

This chapter is organised as follows. Section 14.2 gives an outline of the delay time concept. Sections 14.3 and 14.4 introduce delay time inspection models of a complex system and a single component respectively. Section 14.5 discusses the parameter estimation techniques used in DTM. Section 14.6 highlights extensions to the basic delay time model and future research in DTM, and Section 14.7 concludes the chapter.

14.2 The Delay Time Concept

We are interested in the relationship between the performance of assets and inspection intervention and, to capture this, the conventional reliability analysis of time to first failure, or time between failures, requires enrichment. Consider a repairable item of an asset. It could be, say, a component, a machine, a building, or an integrated set of machines forming a production line, but viewed by management as a unit. For now we take a complex system of multiple components as an


example; the case of a single component will be considered in Section 14.4. The interaction between inspection and equipment performance may be captured using the delay time concept presented below.

Let the item of an asset be maintained on a breakdown basis. The time history of breakdown or failure events is a random series of points; see Figure 14.1. For any one of these failures, the likelihood is that, had the item been inspected at some point just prior to failure, the inspection could have revealed a defect which, though the item was still working, would ultimately lead to a failure. Signals of such a defect include excessive vibration, unusual noise, excessive heat, surface staining, smell, reduced output and increased quality variability. The first instance at which the presence of a defect might reasonably be expected to be recognised by an inspection, had one taken place, is called the initial point $u$ of the defect, and the time $h$ from $u$ to failure is called the delay time of the defect; see Figure 14.2. Had an inspection taken place in $(u, u + h)$, the presence of the defect could have been noted and corrective action taken prior to failure. Given that a defect arises, its delay time represents a window of opportunity for preventing a failure.

Clearly, the delay time $h$ is a characteristic of the item concerned, the type of defect, the nature of any inspection, and perhaps the person inspecting. For example, if the item were a vehicle, and the maintenance practice were to respond when the driver reported a problem, then there is in effect a form of continuous monitoring inspection of cab related aspects of the vehicle, with a reasonably long delay time consistent with the rate of deterioration of the defect. However, should the exhaust collapse because a support bracket had corroded through, the likely warning period for the driver, the delay time, would be virtually zero, since he would not normally be expected to look under the vehicle. At the same time, had an inspection been undertaken by a service mechanic, the delay time might have been measured in weeks or months. Had the exhaust collapsed because securing bolts became loose before falling out, then the driver could have had a warning period of excessive vibration, and perhaps noise, and the defect would have had a driver related delay time measured in days or weeks.













Figure 14.1. Failure points ‘●’

Figure 14.2. The delay time for a defect (time axis showing the initial point ‘○’ at $u$, the failure ‘●’, and the delay time $h$ between them)

To see why the delay time concept is of use, consider Figure 14.3, which incorporates the same failure point pattern as Figure 14.1 along with the initial points associated with each failure arising under a breakdown system. Had an inspection taken place at point (A), one defect could have been identified and the seven failures could have been reduced to six. Likewise, had inspections taken place at points (B) and (A), four defects could have been identified and the seven failures could have been reduced to three. Figure 14.3 demonstrates that, provided it is possible to model the way defects arise, that is, the rate of arrival of defects $\lambda(u)$, and their associated delay time $h$, the delay time concept can capture the relationship between the inspection frequency and the number of plant failures. We are assuming for now that inspections are perfect, that is, a defect is recognised if, and only if, it is there, and is removed by corrective action. Delay time modelling is still possible if these assumptions are not valid, but this more complex case is discussed in Section 14.3.1.

Figure 14.3. ‘○’ initial points; ‘●’ failure points (time axis with inspection points A and B marked)
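
The counting argument behind Figure 14.3 can be illustrated with a small simulation. The sketch below is not part of the original study: it assumes a constant defect arrival rate (a homogeneous Poisson process), exponentially distributed delay times, perfect and instantaneous inspections, and parameter values chosen purely for illustration. A defect with initial point $u$ and delay time $h$ becomes a failure only if no inspection falls in $(u, u + h)$.

```python
import math
import random


def generate_defects(rate, mean_delay, horizon, seed=1):
    """Return (u, h) pairs: initial points from a Poisson process with the
    given rate, each with an exponential delay time (illustrative choices)."""
    rng = random.Random(seed)
    defects, u = [], 0.0
    while True:
        u += rng.expovariate(rate)           # time to the next defect arrival
        if u > horizon:
            return defects
        defects.append((u, rng.expovariate(1.0 / mean_delay)))


def count_failures(defects, interval):
    """Count defects that fail before the next perfect inspection.
    Inspections take place at interval, 2*interval, 3*interval, ..."""
    failures = 0
    for u, h in defects:
        next_inspection = (math.floor(u / interval) + 1) * interval
        if u + h <= next_inspection:         # no inspection inside (u, u + h)
            failures += 1
    return failures


# Hypothetical plant: 0.1 defects per day, mean delay time of 15 days.
defects = generate_defects(rate=0.1, mean_delay=15.0, horizon=2000.0)
for T in (10.0, 20.0, 40.0, math.inf):       # math.inf means no inspections
    print(f"T = {T}: {count_failures(defects, T)} failures "
          f"out of {len(defects)} defects")
```

Because the same set of defects is reused for every interval, the comparison isolates the effect of the inspection frequency; with no inspections every defect ends in a failure.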

14.3 Delay Time Models for Complex Plant

14.3.1 Perfect Inspections

A complex plant, or multi-component plant, is one where a large number of failure modes arise, and the correction of one defect or failure has nominal impact in the steady state upon the overall plant failure characteristics. Consider the following basic complex plant maintenance modelling scenario where:

1. An inspection takes place every $T$ time units, costs $c_s$ units and requires $d_s$ time units, where $d_s$ […] $(i-1)T$ when $i$ is large.

Similarly the expected downtime due to an inspection renewal with a defect identified is



$$\sum_{i=1}^{\infty} \left[(i-1)d_s + d_r\right] \int_{(i-1)T}^{iT} g(u)\left[1 - F(iT-u)\right] du \tag{14.17}$$

Summing Equations 14.16 and 14.17 gives the complete expected downtime per renewal cycle:

$$E(CD) = \sum_{i=1}^{\infty} \left\{ \left[(i-1)d_s + d_r\right] \int_{(i-1)T}^{iT} g(u)\,du + (d_f - d_r) \int_{(i-1)T}^{iT} g(u) F(iT-u)\,du \right\} \tag{14.18}$$

The expected cycle length is obtained in a similar manner and is given by

$$E(CL) = \sum_{i=1}^{\infty} \left\{ \int_{(i-1)T}^{iT} t \int_{(i-1)T}^{t} g(u) f(t-u)\,du\,dt + iT \int_{(i-1)T}^{iT} g(u)\left[1 - F(iT-u)\right] du \right\} \tag{14.19}$$

Finally the expected downtime per unit time is given by

$$C(T) = \frac{\displaystyle\sum_{i=1}^{\infty} \left\{ \left[(i-1)d_s + d_r\right] \int_{(i-1)T}^{iT} g(u)\,du + (d_f - d_r) \int_{(i-1)T}^{iT} g(u) F(iT-u)\,du \right\}}{\displaystyle\sum_{i=1}^{\infty} \left\{ \int_{(i-1)T}^{iT} t \int_{(i-1)T}^{t} g(u) f(t-u)\,du\,dt + iT \int_{(i-1)T}^{iT} g(u)\left[1 - F(iT-u)\right] du \right\}} \tag{14.20}$$

14.4.3 A Case Example

The medical physics department of a teaching hospital in England, which maintains a large amount of medical equipment, records the history of breakdowns and repairs using history cards for each individual item of departmental equipment. The information available included purchase date, dates of preventive maintenance, failures and some description of the work carried out. No costs were recorded, but some estimated cost values were provided by the hospital staff.


Following a discussion with the chief technician, it seemed best to focus on the following items, to ensure a sample of similar machine types, under heavy and constant use, with a usefully long history of failures, and with reasonably well-defined modes of failure. Two types of pump were chosen, namely volumetric infusion pumps and peristaltic pumps, all from the intensive-care, neurosurgery and heart-care units. There were 105 volumetric pumps, and their most frequent failure mode was failure of the pressure transducer. There were 35 peristaltic pumps, and their most frequent failure mode was battery failure. For a detailed description of the case, data and model fitting see Baker and Wang (1991). Several distributions were tried for the initial time and delay time distributions for both pumps, and in both cases a Weibull distribution proved best for the initial time and an exponential distribution for the delay time. The parameter values estimated from the history data by the maximum likelihood method are shown in Table 14.1.

Table 14.1. Estimated parameter values for the pumps

Pump                  Initial time pdf                                                  Delay time pdf
                      $g(u) = \alpha\eta(\alpha u)^{\eta-1} e^{-(\alpha u)^{\eta}}$     $f(h) = \beta e^{-\beta h}$
Volumetric infusion   $\hat{\alpha} = 0.0017$, $\hat{\eta} = 1.42$                      $\hat{\beta} = 0.0174$
Peristaltic           $\hat{\alpha} = 0.0007$, $\hat{\eta} = 2.41$                      $\hat{\beta} = 0.0093$

Although the cost data were not recorded, it was relatively easy to estimate the cost of an inspection (called preventive maintenance in the hospital) and the cost of an inspection repair if a defect was identified. However, it was extremely difficult to estimate the failure cost, since if a pump failed to work when needed the penalty cost could be very high compared with the cost of the pump itself. Nevertheless, some estimates were provided, which are shown in Table 14.2.

Table 14.2. Cost estimates

Pump                  Inspection cost   Inspection repair cost   Failure cost
Volumetric infusion   £15               £50                      £2000
Peristaltic           £15               £70                      £1000

This time we cannot derive an analytical formula for the expected cost because of the use of the Weibull distribution. Numerical integrations have to be used to calculate Equation 14.20. We did this using the maths software package MathCad and the results are shown in Figures 14.9 and 14.10.

Figure 14.9. Expected cost per unit time vs. inspection interval for the volumetric infusion pump (vertical axis: $C(T)$; horizontal axis: $T$, from 0 to 120)

Figure 14.10. Expected cost per unit time vs. inspection interval for the peristaltic pump (vertical axis: $C(T)$; horizontal axis: $T$, from 0 to 120)

Time is given in days in Figures 14.9 and 14.10, so the optimal inspection interval for the volumetric infusion pump is about 30 days and for the peristaltic pump is around 70 days. The hospital at the time checked the pumps at an interval of six months, so clearly for both pumps the inspection intervals should be shortened. However, it has to be pointed out that the model is sensitive to the failure cost, and had a different estimate been provided, the recommendation would have been different.
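
Equation 14.20 can be evaluated with any numerical integration routine. The following sketch is not the MathCad worksheet used in the study; it is a minimal illustration under stated assumptions: the downtime parameters $d_s$, $d_r$ and $d_f$ in Equation 14.20 are replaced by the cost estimates of Table 14.2 (called `c_s`, `c_r` and `c_f` here), the Weibull and exponential forms of Table 14.1 are used, the inner integral over $t$ is written in closed form for the exponential delay time, the outer integral uses a composite Simpson rule, and the infinite sum is truncated once the probability that no defect has yet arisen is negligible. All function names are illustrative.

```python
import math


def weibull_pdf(u, alpha, eta):
    """Initial time pdf g(u) from Table 14.1 (eta >= 1 assumed)."""
    if u < 0.0:
        return 0.0
    z = alpha * u
    return alpha * eta * z ** (eta - 1.0) * math.exp(-(z ** eta))


def simpson(func, a, b, n=40):
    """Composite Simpson rule on [a, b] with an even number n of panels."""
    step = (b - a) / n
    total = func(a) + func(b)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * func(a + k * step)
    return total * step / 3.0


def expected_cost_rate(T, alpha, eta, beta, c_s, c_r, c_f, tail=1e-10):
    """Equation 14.20 with costs in place of downtimes (an assumption)."""
    numerator = denominator = 0.0
    i = 1
    while True:
        lo, hi = (i - 1) * T, i * T

        def g(u):
            return weibull_pdf(u, alpha, eta)

        def g_times_F(u):
            # g(u) * F(iT - u) with the exponential delay time cdf F
            return g(u) * (1.0 - math.exp(-beta * (hi - u)))

        def g_times_failure_time(u):
            # g(u) * integral over t in (u, iT) of t * f(t - u), closed form
            H = hi - u
            decay = math.exp(-beta * H)
            inner = u * (1.0 - decay) + 1.0 / beta - (H + 1.0 / beta) * decay
            return g(u) * inner

        p_defect = simpson(g, lo, hi)
        p_failure = simpson(g_times_F, lo, hi)
        numerator += ((i - 1) * c_s + c_r) * p_defect + (c_f - c_r) * p_failure
        denominator += (simpson(g_times_failure_time, lo, hi)
                        + hi * (p_defect - p_failure))
        if math.exp(-((alpha * hi) ** eta)) < tail:  # P(no defect by iT)
            return numerator / denominator
        i += 1


# Volumetric infusion pump: estimates from Tables 14.1 and 14.2, T in days.
volumetric = dict(alpha=0.0017, eta=1.42, beta=0.0174,
                  c_s=15.0, c_r=50.0, c_f=2000.0)
for T in (10, 30, 60, 120, 180):
    print(T, round(expected_cost_rate(T, **volumetric), 3))
```

Scanning a grid of $T$ values in this way, and repeating the scan with a different value of `c_f`, gives a quick check of how strongly the recommended interval depends on the failure cost estimate.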


14.5 Delay Time Model Parameter Estimation

14.5.1 Introduction

In previous sections, delay time models for both a complex system and a single component have been introduced. In a practical situation, however, before expected cost or downtime models can be constructed, it is necessary to estimate the values of the parameters that characterise the defect arrival and failure processes. In this section we discuss various methods developed to estimate the parameters from either ‘subjective’ data of experts’ opinions or ‘objective’ data collected at failures and inspections. Naturally, the parameter estimation process is not the same for the different types of delay time model: single component models, where a single potential failure mode is modelled and only one defect may (or may not) be present at any one time, differ from complex system models, where many defects can exist simultaneously and many failures can occur in the interval between inspections. This is particularly important for the method using objective data. In this section we focus mainly on the estimation methods for complex systems, since these systems are the asset items to which DTM is most applicable. Details of the approaches developed for parameter estimation for a single component DTM can be found in Baker and Wang (1991, 1993).

14.5.2 Subjective Data Method

If the maintenance records of failures and recorded findings at maintenance interventions such as inspections (collectively called objective data in this chapter) are available and sufficient in quantity and quality, the delay time distribution and parameters can be estimated by the classical statistical method of maximum likelihood; see Section 14.5.3 and the paper by Christer et al. (1995). If, however, such a data set does not exist, or is insufficient in quality and quantity for the purpose of estimation, the alternative is to use the subjective judgement of experienced maintenance engineers or technicians to obtain the delay time distribution and parameters. This section introduces three methods, developed by Christer and Waller (1984), Wang (1997) and Wang and Jia (2007), for estimating the delay time distribution and the associated parameters using subjective data.

14.5.2.1 Subjective Estimation of the Delay Times Through an On-site and On-spot Survey

This method has to be carried out over a period of time, collecting detailed information and assessments at every maintenance intervention or failure (Christer and Waller 1984). At every failure repair, the maintenance technician repairing the plant would be asked to estimate:

HLA: how long ago the defect causing the failure might first have been expected to be recognised at an inspection.

If a defect was identified at an inspection, then in addition to HLA, the technician would be asked to estimate:


HML: how much longer the defect could be left unattended before a repair was essential.

The estimates are given by $\hat{h} = \mathrm{HLA}$ for a failure, and $\hat{h} = \mathrm{HLA} + \mathrm{HML}$ for an inspection repair; see Figure 14.11a,b. $f(h)$ is then estimated from the data $\{\hat{h}\}$.

Figure 14.11. HLA and HML estimates at failure and inspection: (a) failure, with HLA measured back from the failure point ‘●’; (b) inspection, with HLA before and HML after the inspection point
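
As a minimal sketch of how the survey returns might be processed, the code below converts each record into a delay time estimate $\hat{h}$ and fits an exponential delay time distribution by maximum likelihood. The records are invented for illustration, the record layout is hypothetical, and the exponential form is an assumption made here for simplicity; the text states only that $f(h)$ is estimated from the data $\{\hat{h}\}$.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SurveyRecord:
    kind: str                     # "failure" or "inspection"
    hla: float                    # how long ago the defect was detectable (days)
    hml: Optional[float] = None   # how much longer it could be left (days)


def delay_time_estimate(record):
    """h_hat = HLA at a failure, HLA + HML at an inspection repair."""
    if record.kind == "failure":
        return record.hla
    if record.kind == "inspection":
        if record.hml is None:
            raise ValueError("an inspection record needs an HML estimate")
        return record.hla + record.hml
    raise ValueError(f"unknown record kind: {record.kind!r}")


def fit_exponential_rate(delay_times):
    """Maximum likelihood rate of an exponential pdf: n / sum of the data."""
    return len(delay_times) / sum(delay_times)


# Invented survey returns, for illustration only.
records = [
    SurveyRecord("failure", hla=12.0),
    SurveyRecord("inspection", hla=5.0, hml=9.0),
    SurveyRecord("failure", hla=30.0),
    SurveyRecord("inspection", hla=2.0, hml=21.0),
]
h_hats = [delay_time_estimate(r) for r in records]
print("delay time estimates:", h_hats)
print("fitted exponential rate:", fit_exponential_rate(h_hats))
```

Any other candidate family could be fitted to the same list of estimates; the exponential is used here only because its maximum likelihood estimate is a one-line formula.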

At the time of repair, the maintenance technician has information available on which to base his estimate: in addition to his experience, the defect is present, the plant may be examined, and operatives questioned. The rate of defect arrivals can be estimated directly from the number of observed failures and defects identified over the survey period. For a case study using this approach to estimate delay time model parameters, see Christer and Waller (1984).

14.5.2.2 Subjective Estimation of the Delay Times Based on Identified Failure Modes

The method introduced above is a questionnaire survey based approach in which the subjective opinions of maintenance engineers are sought. It has the advantage that the engineer is directly facing the defect or failure when the information regarding the delay time is requested. However, it also has the following problems: (a) conducting such a survey is time consuming, particularly when the frequency of failures or defects is not high, which implies a longer time to get sufficient data; (b) the estimation process is not easy to control, since all the forms are left in the hands of the maintenance engineers involved without an analyst present, which may result in confusion and mistakes, as experienced in the studies of Christer and Waller (1984) and Christer et al. (1998b).

Wang (1997) recommended a new approach that estimates the delay time distribution directly, based on pre-defined major failure modes or types. The idea is as follows:

1. If the estimates can be made for pre-selected major failure types instead of for each individual failure or defect when it occurs, the time spent on the questionnaire survey will be greatly reduced, since the estimates for all major failure types can be made at the same time, which may take only a few hours. This also creates the opportunity for an analyst to be present to reduce possible confusion and mistakes.

2. A group of experts should be questioned on the same failure type, so that their opinions can be properly combined to reduce sampling errors.

3. The question asked should be a probabilistic measure of the delay time over all possible ranges.


The following phases for estimating the delay time were suggested by Wang (1997).

The problem identification phase

This phase identifies all major failure types and the possible causes of the failures. It is normally done via a failure mode and criticality analysis, so that a list of dominant failures can be obtained. The process entails a series of discussions with the maintenance engineers to clarify any hidden issues. If some failure data exist, they should be used to validate the list; otherwise a questionnaire should be designed and forwarded to the person concerned to obtain a list of dominant failure types.

Expert identification and choice phase

The term ‘expert’ is not defined by any quantitative measure of resident knowledge. However, it is clear in the case here that a person who is regarded by others as being one of the most knowledgeable about the machine should be chosen as the expert. The shop floor fitters, or any maintenance technicians or engineers who maintain the machine, would be the desired experts (Christer and Waller 1984). After the set of experts has been identified, a choice is made of which experts to use in the study. Full discussion with management is necessary in order to select the persons who know the machine ‘best’. On psychological grounds, no more than five experts, but no fewer than three, are expected to take part in the exercise.

The question formulation phase

The questions we want to ask in this case concern the rate of occurrence of defects (assuming we are modelling a complex plant) and the delay time distribution. For the rate of arrival of a defect type, we can simply ask for a point estimate, since it is not a random variable. Without maintenance interventions this would, in the long term, be equal to the average number of failures of the same type per unit time. For example, we may ask ‘how many failures of this type will occur per year, month, week or day?’. It is noted that this quantity is usually observable.

Our focus is mainly on the delay time estimates. Given the amount of uncertainty inherent in making a prediction of the delay time, the experts may feel uncomfortable about giving a point estimate, and may prefer to communicate something about the range of their uncertainty. Accepting these points, perhaps the best that experts could do in this case would be to give their subjective probability mass function for the quantity in question. In other words, they could provide an estimate over each interval such that the mass above the interval is proportional to their subjective probability measure, which gives a histogram. Alternatively, three point estimates can be requested, such as the most likely, the minimum and the maximum duration of the delay time for a particular type of failure. The term ‘delay time’ was not used in the question, since it takes some effort to explain what the delay time is. Instead, we asked a question similar to HLA. Even so, our case experience showed that the experts found this question difficult to understand. The lesson learned is to work through one example with them before starting the session.


The elicitation phase

Elicitation should be performed with each expert individually. If possible, the analyst should be present, which proved to be vital in our case studies. The above-mentioned histogram was used to draw out the answer from the experts, so that they could have a visual overview of their estimates and a smooth histogram could be achieved if the experts were advised to produce one. The maximum number of histogram intervals was set at five, as suggested by psychological experiments.

The calibration phase

Roughly speaking, calibration is intended to measure the extent to which a set of probability mass functions ‘corresponds to reality’. Having reviewed the problem, we concluded that subjective calibration is not recommended because it is so time consuming. If any objective data are available, we may calibrate the experts’ opinions by a Bayesian approach, as discussed by many others. Another approach is to calibrate the estimate by matching it to an observed statistic. If a significant difference is found, the estimates must be revised.

The combination phase

Expert resolution, or combining probabilities from experts, has received some attention. Here we use one of the simplest approaches, namely the weighting method, which is simply a weighted average of the estimates of all the experts. The weights need to be selected carefully according to each expert’s level of expertise, and they should sum to one. Other, more complicated methods are available; see Wang (1997).

It is noted that the combined delay time distribution obtained from this phase is in the form of a discrete probability distribution, whereas a continuous delay time distribution is needed in delay time inspection modelling.
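
The weighting method can be sketched in a few lines. The example below is illustrative only: the interval boundaries, the experts' probability masses and the weights are made-up values, and the function name is hypothetical. Each expert supplies a probability mass over the same (at most five) delay time intervals, and the combined mass in each interval is the weighted average across experts.

```python
def combine_expert_pmfs(pmfs, weights, tol=1e-9):
    """Weighted average of experts' probability mass functions.

    pmfs    -- one list of interval probabilities per expert
    weights -- one weight per expert; the weights must sum to one
    """
    if len(pmfs) != len(weights):
        raise ValueError("need exactly one weight per expert")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to one")
    n_intervals = len(pmfs[0])
    for pmf in pmfs:
        if len(pmf) != n_intervals or abs(sum(pmf) - 1.0) > tol:
            raise ValueError("each pmf must cover the same intervals and sum to one")
    return [sum(w * pmf[k] for w, pmf in zip(weights, pmfs))
            for k in range(n_intervals)]


# Hypothetical delay time intervals (weeks) and elicited histograms.
intervals = [(0, 1), (1, 2), (2, 4), (4, 8), (8, 16)]
experts = [
    [0.10, 0.20, 0.40, 0.20, 0.10],
    [0.05, 0.15, 0.30, 0.30, 0.20],
    [0.20, 0.30, 0.30, 0.15, 0.05],
]
weights = [0.5, 0.3, 0.2]          # reflect each expert's assumed expertise

combined = combine_expert_pmfs(experts, weights)
for (low, high), mass in zip(intervals, combined):
    print(f"{low:>2}-{high:<2} weeks: {mass:.3f}")
```

Since every input histogram sums to one and the weights sum to one, the combined histogram is again a valid discrete distribution.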
To obtain a continuous distribution, an estimate $\hat{F}(h)$ of $F(h)$ can be obtained, based upon the number of delay times in each interval, by fitting a distribution from a known family of failure distributions, such as the exponential or Weibull, using the least squares method or the maximum likelihood method.

The updating phase

This phase applies mainly once some failure data and recorded findings become available. In a sense it is a way of calibrating. A case study using the above method is detailed in Akbarov et al. (2006).

14.5.2.3 An Empirical Bayesian Approach for Estimating the DTM Parameters Based on Subjective Data

In the previous subjective data based approaches to delay time estimation (Christer and Waller 1984; Wang 1997; Akbarov et al. 2006), direct subjective estimates of the delay time are required, which experts have found extremely difficult to provide, since the delay time is not usually observable and is difficult to explain (Akbarov et al. 2006). We now introduce a recently developed approach which starts with subjective data and then updates the estimates when objective data become available. The initial estimates are made using the empirical Bayesian method, by matching a few subjective summary statistics provided by the experts. These statistics should be designed to be easy to obtain from the experience of the experts and from observed practice, rather than from unobservable delay times. Then the updating

Delay Time Modelling

363

mechanism enters the process when objective data become available, which requires a repeated evaluation of the likelihood function which will be introduced later. In the framework of Bayesian statistics and assuming no objective data is available at the beginning, we basically first assume a prior on the parameters which characterize the underlying defect and failure arrival processes. When objective data becomes available, we calculate the joint posterior distribution of the parameters, and then we may use this posterior distribution to evaluate the expected cost or downtime per unit time conditional on observed data. Assuming for now that we are interested in the rate of arrival of defects, λ , and the delay time pdf., f (h) , which is characterised by a two parameter distribution f (h | α , β ) . Unlike the methods proposed in Christer and Waller (1984) and Wang (1997), here we treat parameters λ and the α and β in f (h | α , β ) as random variables. The classical Bayesian approach is used here to define the prior distributions for model parameters λ , α and β as f (λ | Φ λ ) , f (α | Φα ) and f ( β | Φ β ) , where Φ • is the set of hyper-parameters within f (• | Φ • ) . Once those Φ • are available, the point estimates of λ , α and β are the expected values of them and are given by



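$$\hat{\lambda} = \int_0^\infty \lambda f(\lambda \mid \Phi_\lambda)\,d\lambda, \quad \hat{\alpha} = \int_0^\infty \alpha f(\alpha \mid \Phi_\alpha)\,d\alpha \quad \text{and} \quad \hat{\beta} = \int_0^\infty \beta f(\beta \mid \Phi_\beta)\,d\beta .$$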

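Let $g(\lambda, \alpha, \beta)$ denote a statistic of interest, which may be a function of $\lambda$, $\alpha$ and $\beta$ (say, the mean number of failures within an inspection interval), and let $E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)]$ denote its expected value in terms of $\Phi_\lambda$, $\Phi_\alpha$ and $\Phi_\beta$. Then we have

$$E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)] = \int_0^\infty \int_0^\infty \int_0^\infty g(\lambda, \alpha, \beta) f(\lambda \mid \Phi_\lambda) f(\alpha \mid \Phi_\alpha) f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta . \tag{14.21}$$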

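If we can obtain a subjective estimate of $E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)]$ provided by the experts, denoted by $g_s$, then letting $E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)] = g_s$, we have

$$g_s = \int_0^\infty \int_0^\infty \int_0^\infty g(\lambda, \alpha, \beta) f(\lambda \mid \Phi_\lambda) f(\alpha \mid \Phi_\alpha) f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta . \tag{14.22}$$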

Equation 14.22 is only one such equation; if several different subjective estimates were provided, we would have a set of equations like Equation 14.22. The hyper-parameters $\Phi_\bullet$ may be estimated by solving these equations, provided that the number of equations is at least the number of hyper-parameters in $\Phi_\bullet$. We now demonstrate this for our case. Suppose that the experts can provide the following subjective statistics for estimating $\Phi_\lambda$:


• The average number of failures within $[0, T)$, denoted by $n_f$
• The average number of defects identified at inspection time $T$, denoted by $n_d$
• The average probability of no defect at all in $[0, T)$, denoted by $p_{nd}$.

In this case, if the statistic of interest is the average number of defects within $[0, T)$, we have from the property of the HPP that $g(\lambda, \alpha, \beta) = \lambda T$, and then

$$E[g(\Phi_\lambda, \Phi_\alpha, \Phi_\beta)] = \int_0^\infty \int_0^\infty \int_0^\infty \lambda T f(\lambda \mid \Phi_\lambda) f(\alpha \mid \Phi_\alpha) f(\beta \mid \Phi_\beta)\,d\lambda\,d\alpha\,d\beta = \int_0^\infty \lambda T f(\lambda \mid \Phi_\lambda)\,d\lambda .$$

Since, if inspection is perfect, we have $g_s = n_f + n_d$, it follows from Equation 14.22 that

$$n_f + n_d = \int_0^\infty \lambda T f(\lambda \mid \Phi_\lambda)\,d\lambda . \tag{14.23}$$

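Similarly, from the property of the HPP, that is, $P(N_d(0,T) = n \mid \lambda) = e^{-\lambda T}(\lambda T)^n / n!$, we have

$$p_{nd} = \int_0^\infty \Pr(N_d(0,T) = 0 \mid \lambda) f(\lambda \mid \Phi_\lambda)\,d\lambda = \int_0^\infty e^{-\lambda T} f(\lambda \mid \Phi_\lambda)\,d\lambda , \tag{14.24}$$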

where $N_d(0, T)$ is the number of defects in $[0, T)$. If we have only two hyper-parameters in $\Phi_\lambda$, then solving Equations 14.23 and 14.24 simultaneously in terms of $\Phi_\lambda$ will give the estimated values of the hyper-parameters in $\Phi_\lambda$. Note that $\lambda$ is independent of $\alpha$ and $\beta$, so the integrals over $f(\alpha \mid \Phi_\alpha)$ and $f(\beta \mid \Phi_\beta)$ each equal one and drop out of Equation 14.21. Similarly, if more subjective estimates were provided, the hyper-parameters in $\Phi_\alpha$ and $\Phi_\beta$ can be obtained. For a detailed description of such an approach to estimating delay time model parameters see Wang and Jia (2007).

This approach improves on the previously developed subjective methods both in the way the data are obtained and in the accuracy of the estimated parameters. It is also naturally linked, via Bayes' theorem, to the objective method for estimating DTM parameters presented in the next section, if such objective data become available (Wang and Jia 2007).

14.5.3 Objective Data Method

Objective data for complex systems under regular inspections should consist of the failures (and associated times) in each interval of operation between inspections and the number of defects found in the system at each inspection. From these data, we estimate the parameters for the chosen form of the delay time model.
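Returning briefly to the subjective method, the matching of Equations 14.23 and 14.24 can be sketched numerically. The sketch assumes, purely for illustration, a gamma prior on λ with shape a and rate b as the two hyper-parameters; the text does not prescribe this family. Under that assumption Equation 14.23 becomes n_f + n_d = Ta/b and Equation 14.24 becomes p_nd = (b/(b + T))^a. The function name and the expert statistics below are hypothetical.

```python
import math


def fit_gamma_prior(n_f, n_d, p_nd, T):
    """Match a gamma prior on the defect arrival rate to expert statistics.

    Assumes lambda ~ Gamma(shape=a, rate=b), an illustrative choice that is
    not taken from the text. Then
        Equation 14.23:  n_f + n_d = T * a / b
        Equation 14.24:  p_nd      = (b / (b + T)) ** a
    """
    mean_defects = n_f + n_d
    if not math.exp(-mean_defects) < p_nd < 1.0:
        raise ValueError("p_nd must lie between exp(-(n_f + n_d)) and 1")

    # Substituting a = mean_defects * b / T and x = T / b into 14.24 gives
    # log(1 + x) / x = -log(p_nd) / mean_defects, solved here by bisection.
    target = -math.log(p_nd) / mean_defects

    def h(x):
        return math.log1p(x) / x  # decreases from 1 towards 0 as x grows

    lo, hi = 1e-12, 1.0
    while h(hi) > target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if h(mid) > target:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    b = T / x
    a = mean_defects * b / T
    return a, b


# Hypothetical expert statistics for a weekly inspection interval (T = 1).
T = 1.0
a, b = fit_gamma_prior(n_f=0.4, n_d=1.1, p_nd=0.3, T=T)
lam_hat = a / b  # prior mean, used as the point estimate of lambda
print(a, b, lam_hat)
```

A solution exists only when p_nd exceeds exp(-(n_f + n_d)), the value a fixed, known λ would give; any uncertainty about λ raises the probability of seeing no defects. The check at the top of the function enforces this.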


Initially, we consider a simple case of the estimation problem for the basic delay time model where only the number of failures, $m_i$, occurring in each cycle $[(i-1)T, iT)$ and the number of defects found and repaired, $j_i$, at each inspection (at time $iT$) are required. We do not know the actual failure times within the cycles. The probability of observing $m_i$ failures in $[(i-1)T, iT)$ is

$$P(N_f((i-1)T, iT) = m_i) = \frac{e^{-E[N_f((i-1)T,\,iT)]}\, E[N_f((i-1)T, iT)]^{m_i}}{m_i!} . \tag{14.25}$$

Similarly, the probability of removing $j_i$ defects at inspection $i$ (at time $iT$) is

$$P(N_s(iT) = j_i) = \frac{e^{-E[N_s(iT)]}\, E[N_s(iT)]^{j_i}}{j_i!} . \tag{14.26}$$

As the observations are independent, the likelihood of observing the given data set is just the product of the Poisson probabilities of observing each cycle of data, $m_i$ and $j_i$. As such, the likelihood function for $K$ intervals of data is

$$L(\Theta) = \prod_{i=1}^{K} \left\{ \left( \frac{e^{-E[N_f((i-1)T,\,iT)]}\, E[N_f((i-1)T, iT)]^{m_i}}{m_i!} \right) \left( \frac{e^{-E[N_s(iT)]}\, E[N_s(iT)]^{j_i}}{j_i!} \right) \right\} , \tag{14.27}$$

where $\Theta$ is the set of parameters within the delay time model. The likelihood function is maximised with respect to the parameters to obtain the estimated values. This process can be simplified by taking natural logarithms. The log-likelihood function is

$$\ell(\Theta) = \sum_{i=1}^{K} \Big( m_i \log\{E[N_f((i-1)T, iT)]\} + j_i \log\{E[N_s(iT)]\} - E[N_f((i-1)T, iT)] - E[N_s(iT)] \Big) - \sum_{i=1}^{K} \big( \log(m_i!) + \log(j_i!) \big) , \tag{14.28}$$

where the final summation term is irrelevant when maximising the log-likelihood, as it is a constant and therefore not a function of any of the parameters under investigation.

When the times of failures are available, it is often necessary to refine the likelihood function in Equation 14.27 by considering the detailed pattern of behaviour within each interval in terms of the number of failures and their associated times. Define $t_{ij}$ as the time of the $j$-th failure in the $i$-th inspection interval; the likelihood is given by (Christer et al. 1998a)

$$L(\Theta) = \prod_{i=1}^{K} \left\{ \left[ \prod_{j=1}^{m_i} v_i(t_{ij}) \right] e^{-E[N_f((i-1)T,\,iT)]} \left( \frac{e^{-E[N_s(iT)]}\, E[N_s(iT)]^{j_i}}{j_i!} \right) \right\} , \tag{14.29}$$

where $v_i(t_{ij})$ is given by Equation 14.4.
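For the count-only data of Equation 14.28, the log-likelihood is easy to evaluate once the expected counts are written as functions of the parameters. The sketch below assumes, for illustration only, the simplest model: HPP defect arrivals at rate λ, an exponential delay time with rate α, and perfect inspection. The data, the function names and the grid search are hypothetical choices, not the procedure used in the cited case studies.

```python
import math


def expected_counts(lam, alpha, T):
    """Expected failures and expected defects found in one inspection cycle.

    Illustrative assumptions: defects arrive as an HPP with rate lam, the
    delay time is exponential with rate alpha, and inspection every T time
    units is perfect. A defect arising at time u is still present at T with
    probability exp(-alpha * (T - u)); integrating over u gives e_ns.
    """
    e_ns = lam * (1.0 - math.exp(-alpha * T)) / alpha
    e_nf = lam * T - e_ns  # every other defect has failed before T
    return e_nf, e_ns


def log_likelihood(lam, alpha, T, failures, defects):
    """Equation 14.28 for per-cycle failure counts m_i and defect counts j_i."""
    e_nf, e_ns = expected_counts(lam, alpha, T)
    total = 0.0
    for m_i, j_i in zip(failures, defects):
        total += m_i * math.log(e_nf) + j_i * math.log(e_ns) - e_nf - e_ns
        total -= math.lgamma(m_i + 1) + math.lgamma(j_i + 1)  # log(m!) + log(j!)
    return total


# Hypothetical data: six inspection cycles of length T = 1.
failures = [1, 0, 2, 1, 0, 1]
defects = [3, 2, 4, 3, 2, 3]
T = 1.0

# Coarse grid search, used only to keep the sketch dependency-free.
candidates = (
    (0.05 * a, 0.01 * b) for a in range(10, 121) for b in range(5, 301)
)
lam_hat, alpha_hat = max(
    candidates, key=lambda p: log_likelihood(p[0], p[1], T, failures, defects)
)
print(lam_hat, alpha_hat)
```

In practice a numerical optimiser would replace the grid search, and the expected counts would come from the delay time model actually chosen, for example one with imperfect inspection.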


In the case study of Christer et al. (1995), only the daily numbers of failures are available. They formulated a different likelihood to take account of this pattern of data. This was done essentially by formulating the probability of a particular number of failures for each day over each inspection interval; the likelihood for a particular inspection interval is then just the product of these probabilities and the probability of observing some number of defects at the inspection. See Christer et al. (1995) for details.

14.5.4 A Case Example

A copper works in the north-west of England has used the same extrusion press for over 30 years, and the plant is a key item in the works since 70% of its products go through this press at some stage of their production. The machine comprises a 1700-ton oil-hydraulic extrusion press with one 1700 kW induction heater and completely mechanised gear for the supply of billets to the press and for the removal of the extruded products. The machine was operated 15–18 h a day (two shifts), five days a week, excluding holidays and maintenance downtime. Preventive maintenance (PM) has been carried out on this machine since 1993. It consisted of a thorough inspection of the machinery, along with any subsequent adjustments or repairs if the defects found could be rectified within the PM period. Any major defects which could not be rectified during the PM time were supposed to be dealt with during non-production hours. PM lasted about 2 h and was performed once a week at the beginning of each week. Questions of concern are (i) whether PM is or could be effective for this machine; (ii) whether the current PM period is the right choice, particularly the one-week PM interval, which was based upon maintenance engineers’ subjective judgement; (iii) whether PM is efficient, i.e. whether it can identify most defects present and reduce the number of failures caused by those defects.
In this case study, the delay time model introduced earlier was used to address the above questions. The first question can also be answered in part by comparing the total downtime per week under PM with the total downtime per week of the previous years without PM. A parallel study carried out by the company revealed that PM has lowered the total downtime: the proportion of downtime was reduced from 7.8% to 5.8%. To establish the relationship between the downtime measure and the PM activities using the delay time concept, the first task is to estimate the parameters of the underlying delay time distribution from available data, and hence build a model to describe the failure and PM processes. The type of delay time model used in the study is the non-perfect inspection model. In the original study, Christer et al. (1995), a number of different candidate delay time distributions were considered, including exponential and Weibull distributions. The chosen form for the delay time distribution is a mixed distribution consisting of an exponential distribution (scale parameter $\alpha$) with a proportion $P$ of defects having a delay time of 0. The cdf is given by

$$F(h) = 1 - (1 - P)e^{-\alpha h} .$$


An optimisation algorithm is required for maximisation of the likelihood with respect to the parameters. The estimated values are given in Table 14.3 with their associated coefficients of variation (CV).

Table 14.3. Estimated model parameters

Parameter                                     Estimate                     CV
Rate of occurrence of defects                 $\hat{\lambda}$ = 1.3561     0.0832
Probability of perfect inspection             $\hat{r}$ = 0.902            3.4956
Proportion of defects with zero delay time    $\hat{P}$ = 0.5546           0.4266
Scale parameter                               $\hat{\alpha}$ = 0.0178      1.1572

Inserting the optimal parameter estimates into the log-likelihood function gives an ML value of 101.86. See Christer et al. (1995) on the analysis and the fit of the model to the data.
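To get a feel for the fitted delay time distribution, the cdf F(h) = 1 − (1 − P)e^{−αh} can be evaluated with the estimates from Table 14.3. The sketch below is an illustration only: the function names are hypothetical, and the time unit of h is whatever was used in the original study, which this section does not state.

```python
import math

P_HAT = 0.5546      # proportion of defects with zero delay time (Table 14.3)
ALPHA_HAT = 0.0178  # exponential parameter of the delay time (Table 14.3)


def delay_time_cdf(h, p=P_HAT, alpha=ALPHA_HAT):
    """F(h) = 1 - (1 - P) * exp(-alpha * h) for h >= 0."""
    if h < 0:
        return 0.0
    return 1.0 - (1.0 - p) * math.exp(-alpha * h)


def mean_delay_time(p=P_HAT, alpha=ALPHA_HAT):
    """Mean of the mixture: zero with probability P, else exponential."""
    return (1.0 - p) / alpha


for h in (0, 10, 50, 100):
    print(h, round(delay_time_cdf(h), 4))
print("mean delay time:", round(mean_delay_time(), 2))
```

The jump of size P at h = 0 reflects the defects that fail immediately and so cannot be caught by inspection, however short the inspection interval.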

14.6 Other Developments in DTM and Future Research

Several useful extensions have been made over the last decade to make the delay time model more realistic, at the price of increased mathematical complexity. Christer and Wang (1995) addressed an NHPP non-perfect inspection delay time model of multiple component systems. In this case the assumption of a constant inspection interval no longer holds, and a recursive algorithm was developed in Wang and Christer (2003) to find the optimal non-constant intervals until final replacement. Christer and Redmond (1990) reported a problem of sampling bias, and proposed ways of estimating the delay time distribution from subjective data. Wang and Christer (1997) modelled a single component system subject to inspections over a finite time horizon. Christer et al. (1997) used an NHPP in modelling the rate of arrival of defects within a case study. Wang (2000) developed a model of nested inspections using the delay time concept. Wang and Jia (2007) reported the use of empirical Bayesian statistics in the estimation of delay time model parameters using subjective data, which overcame a number of problems in previous subjective delay time parameter estimation. Christer et al. (2000) addressed the case in which the downtime due to failures cannot be ignored when calculating the expected number of failures during an inspection interval, and proposed a refined method. Christer et al. (2001) compared the delay time model with an equivalent semi-Markov setting to explore the robustness of both modelling techniques to the Markov assumption. Carr and Christer (2003) studied the problem of non-perfect repairs at failures, where failures can recur if the repair is not perfect.

Future research on the DTM depends on the application areas, the data involved, and the objective function chosen. We consider the following areas or problems worthy of research using the delay time concept:


1. PM type of inspections. Inspections may consist of many activities, and some of them are purely preventive, such as greasing, topping up oil, and cleaning, which may have no connection with defect identification. It is noted, however, that this type of PM may change the rate of defect arrivals and therefore change the expected number of failures within an inspection interval. This problem has not been modelled in previous DTM research, but it is a reality we have to face. An initial idea is to introduce another parameter in the rate of defect arrivals to model the effectiveness of such PM activities.
2. Multiple inspections scheme. This is again common in practice, in that more than one inspection interval, of different scales or types, is in place. Wang (2000) developed a DTM for nested inspections, but the model is not generic and can only be used for a specific type of problem.
3. Condition monitoring (CM). This is becoming more popular in industry and offers abundant modelling opportunities with a large amount of data. CM may identify the initial point of a random defect at an earlier stage than manual inspections, and it is possible that u, the initial point of a random defect, becomes observable by CM. Pilot research has been carried out to investigate the use of DTM in condition based maintenance modelling (Wang 2006).
4. Parameter estimation. This is still an ongoing research item, since for each specific problem we may have to develop a tailor-made approach. The empirical Bayesian approach outlined earlier is promising since it combines both subjective and objective data. It is noted, however, that the computation involved is intensive, and therefore algorithm development is required to speed up the process.

14.7 Conclusion

There is considerable scope for advances in maintenance modelling that impact productively upon current maintenance practice. This chapter reports upon one methodology for modelling inspection practice. The power of mathematics and statistics is used to exploit an elementary mathematical construct of the failure process to build operational models of maintenance interactions. The delay time concept is a natural one within the maintenance engineering context. More importantly, it can be used to build quantitative models of the inspection practice of asset items, which have proved to be valid in practice. The theory is still developing, but so far there has been no technical barrier to developing a DTM for any plant item studied. This chapter has introduced the delay time concept and has shown how it can be applied to various production equipment to optimise inspection intervals. To provide substance to this statement, the processes of model parameter estimation and case examples outlining the use of delay time modelling in practice have been introduced. We have presented only some fundamental DTMs and the associated parameter estimation procedures; interested readers can consult the references listed at the end of the chapter.


14.8 Dedications This chapter is dedicated to Professor Tony Christer who recently passed away. Tony was a “world class” researcher with an international reputation. He was the originator of the delay time concept and had produced in conjunction with others a considerable number of papers in delay time modelling theory and applications. He was a great man who enthused, mentored and guided many of us to strive for higher quality research. He will be sadly missed by all who knew him.

14.9 References

Abdel-Hameed, M., (1995), Inspection, maintenance and replacement models, Computers and Operations Research, V22, 4, 435–441.
Akbarov, A., Wang, W. and Christer, A.H., (2006), Problem identification in the frame of maintenance modelling: a case study, to appear in I. J. Prod. Res.
Baker, R.D. and Wang, W., (1991), Estimating the delay time distribution of faults in repairable machinery from failure data, IMA J. Maths. Applied in Business and Industry, 4, 259–282.
Baker, R. and Wang, W., (1993), Developing and testing the delay time model, Journal of Operational Research Society, Vol. 44, No. 4, 361–374.
Barlow, R.E. and Proschan, F., (1965), Mathematical theory of reliability, Wiley, New York.
Carr, M.J. and Christer, A.H., (2003), Incorporating the potential for human error in maintenance models, J. Opl. Res. Soc., 54 (12), 1249–1253.
Christer, A.H., (1976), Innovative decision making, proceedings of NATO conference on the role of effectiveness of theory of decision in practice, eds. Bowen K.C and White D.J., Hodder and Stoughton, 368–377.
Christer, A.H., (1999), Developments in delay time analysis for modeling plant maintenance, J. Opl. Res. Soc., 50, 1120–1137.
Christer, A.H. and Redmond, D.F., (1990), A recent mathematical development in maintenance theory, Int. J. Prod. Econ, 24, 227–234.
Christer, A.H. and Waller, W.M., (1984), Delay time Models of Industrial Inspection Maintenance Problems, J. Opl. Res. Soc., 35, 401–406.
Christer, A.H. and Wang, W., (1995), A delay time based maintenance model of a multicomponent system, IMA Journal of Maths. Applied in Business and Industry, Vol. 6, 205–222.
Christer, A.H. and Whitelaw, J., (1983), An Operational Research approach to breakdown maintenance: problem recognition, J. Opl. Res. Soc., 34, 1041–1052.
Christer, A.H., Wang, W., Baker, R.D. and Sharp, J.M., (1995), Modelling maintenance practice of production plant using the delay time concept, IMA J. Maths. Applied in Business and Industry, Vol. 6, 67–83.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D., (1997), A stochastic modelling problem of high-tech steel production plant, in Stochastic Modelling in Innovative Manufacturing, Lecture Notes in Economics and Mathematical Systems, (Eds. A.H. Christer, Shunji Osaki and L.C. Thomas), Springer, Berlin, 196–214.
Christer, A.H., Wang, W., Choi, K. and Sharp, J.M., (1998a), The delay-time modelling of preventive maintenance of plant given limited PM data and selective repair at PM, IMA J. Maths. Applied in Business and Industry, Vol. 9, 355–379.
Christer, A.H., Wang, W., Sharp, J.M. and Baker, R.D., (1998b), A case study of modelling preventive maintenance of production plant using subjective data, J. Opl. Res. Soc., 49, 210–219.
Christer, A.H., Wang, W. and Lee, C., (2000), A data deficiency based parameter estimating problem and case study in delay time PM modelling, Int. J. Prod. Eco., Vol. 67, No. 1, 63–76.
Christer, A.H., Wang, W., Choi, K. and Schouten, F.A., (2001), The robustness of the semi-Markov and delay time maintenance models to the Markov assumption, IMA J. Management Mathematics, 12, 75–88.
Kaio, N. and Osaki, S., (1989), Comparison of Inspection Policies, Journal of the Operational Research Society, Vol. 40, No. 5, 499–503.
Luss, H., (1983), An Inspection Policy Model for Production Facilities, Management Science, Vol. 29, No. 9, 1102–1109.
McCall, J., (1965), Maintenance Policies for Stochastically Failing Equipment: A Survey, Management Science, Vol. 11, No. 5, 493–524.
Moubray, J., (1997), Reliability Centred Maintenance, Butterworth-Heineman, Oxford.
Ross, (1983), Stochastic processes, Wiley, New York.
Taylor, H.M. and Karlin, S., (1998), An introduction to stochastic modeling, 3rd Ed., Academic Press, San Diego.
Thomas, L.C., Gaver, D.P. and Jacobs, P.A., (1991), Inspection Models and their application, IMA Journal of Management Mathematics, 3(4), 283–303.
Wang, W., (1997), Subjective estimation of the delay time distribution in maintenance modelling, European Journal of Operational Research, 99, 516–529.
Wang, W., (2000), A model of multiple nested inspections at different intervals, Computers and Operations Research, 27, 539–558.
Wang, W., (2006), Modelling the probability assessment of the system state using available condition information, to appear in IMA J. Management Mathematics.
Wang, W. and Christer, A.H., (1997), A modelling procedure to optimise component safety inspection over a finite time horizon, Quality and Reliability Engineering International, 13, No. 4, 217–224.
Wang, W. and Christer, A.H., (2003), Solution algorithms for a multi-component system inspection model, Computers and OR, 30, 190–134.
Wang, W. and Jia, X., (2007), A Bayesian approach in delay time maintenance model parameters estimation using both subjective and objective data, Quality Maintenance and reliability Int., 23, 95–105.

Part E

Management

15 Maintenance Outsourcing D.N.P. Murthy and N. Jack

15.1 Introduction

Every business (mining, processing, manufacturing and service-oriented businesses such as transport, health, utilities, communication) needs a variety of equipment to deliver its outputs. Equipment is an asset that is critical for business success in the fiercely competitive global economy. However, equipment degrades with age and usage and ultimately becomes non-operational, and businesses incur heavy losses when their equipment is not in full operational mode. For example, in open cut mining, the loss in revenue resulting from a typical dragline being out of action is around one million dollars per day, and the loss in revenue from a 747 plane being out of action is roughly half a million dollars per day. Non-operational equipment leads to delays in delivery of goods and services, and this in turn causes customer dissatisfaction and loss of goodwill. Rapid changes in technology have resulted in equipment becoming more complex and expensive. Maintenance action can reduce the likelihood of such equipment becoming non-operational (referred to as preventive maintenance) and also restore a non-operational unit to an operational state (referred to as corrective maintenance). For most businesses it is no longer economical to carry out maintenance in house. There are a variety of reasons for this, including the need for a specialist work force and diagnostic tools that often require constant upgrading. In these situations it is more economical to outsource the maintenance (in part or in total) to an external agent through a service contract. Campbell (1995) gives details of a survey where it was reported that 35% of North American companies had considered outsourcing some of their maintenance. Consumer durables (products such as kitchen appliances, televisions, automobiles, computers, etc.) that are bought by individuals are certainly getting more complex. A 1990 automobile is immensely more complex than its 1950 counterpart.
Customers need assurance that a new product will perform satisfactorily over its lifetime. In the case of consumer durables, manufacturers have used warranties to provide this assurance during the early part of a product’s useful life. Under warranty the manufacturer repairs all failures that occur within the warranty period, and this is often done at no cost to the customer. The warranty period for most consumer durables has been increasing and the warranty terms have been becoming more favourable to the customer. For example, the typical warranty period for an automobile in 1930 was 90 days, in 1970 it was 1 year, and in 1990 it was 3 years. A warranty is tied to the sale of a product and the cost of servicing the warranty is factored into the sale price. For customers who need assurance beyond the warranty period, manufacturers and/or third parties (such as financial institutions, insurance companies and independent operators) offer extended warranties (or service contracts) at an additional cost to the customer. Extended warranties for automobiles of 5–7 years are now fairly common.

Governments (local, state or national) own infrastructure (roads, rail and communication networks, public buildings, dams, etc.) that was traditionally maintained by in-house maintenance departments. Here there is a growing trend towards outsourcing these maintenance activities to external agents so that the governments can focus on their core activities.

In all the above cases, we have an asset (complex equipment, consumer durable or an element of public infrastructure) that is owned by the first party (the owner), and the asset maintenance is outsourced to the second party (the service agent, who is also referred to as the “contractor” in many technical papers) under a service contract. This chapter deals with maintenance outsourcing from the perspectives of both the owner (the customer for the maintenance service) and the service agent (the service provider). We focus on the first case (where the customer is a business) and we develop a framework to indicate the different issues involved, carry out a review of the literature, and indicate topics that need further investigation and research.
The outline of the chapter is as follows. Section 15.2 deals with the customer and the agent perspectives. In Section 15.3, we propose a framework to study maintenance outsourcing. Section 15.4 reviews the relevant literature on maintenance outsourcing and on extended warranties. Section 15.5 deals with a game theoretic approach to maintenance outsourcing and extended warranties. In Section 15.6 we briefly discuss agency theory and its relevance to maintenance outsourcing and, in Section 15.7 we conclude with a brief discussion of future research in maintenance outsourcing.

15.2 Customer and Service Agent Perspectives

15.2.1 Customer

Outsourcing of maintenance involves some or all of the maintenance actions (preventive and/or corrective) being carried out by an external service agent under a service contract. The contract specifies the terms of maintenance and the cost issues. It can be simple or complex and can involve penalty and incentive terms.


15.2.1.1 Businesses

Businesses (producing products and/or services) need to come up with new solutions and strategies to develop and increase their competitive advantage. Outsourcing is one of these strategies that can lead to greater competitiveness (Embleton and Wright 1998). It can be defined as a managed process of transferring activities performed in-house to some external agent. The conceptual basis for outsourcing (see Campbell 1995) is as follows:

1. Domestic (in-house) resources should be used mainly for the core competencies of the company.
2. All other (support) activities that are not considered strategic necessities, and/or for which the company does not possess adequate competencies and skills, should be outsourced (provided there is an external agent who can carry out these activities in a more efficient manner).

Most businesses tend not to view maintenance as a core activity and have moved towards outsourcing it. The advantages of outsourcing maintenance are as follows:

1. Better maintenance due to the expertise of the service agent.
2. Access to high-level specialists on an “as and when needed” basis.
3. Fixed cost service contract removes the risk of high costs.
4. Service providers respond to changing customer needs.
5. Access to latest maintenance technology.
6. Less capital investment for the customer.
7. Managers can devote more resources to other facets of the business by reducing the time and effort involved in maintenance management.

However, there are some disadvantages of outsourcing maintenance, as indicated below:

1. Dependency on the service provider.
2. Cost of outsourcing.
3. Loss of maintenance knowledge (and personnel).
4. Becoming locked in to a single service provider.

For very specialised (and custom built) products, the knowledge to carry out the maintenance and the spares needed for replacement need to be obtained from the original equipment manufacturer (OEM). In this case, the customer is forced into having a maintenance service contract with the OEM and this can result in a noncompetitive market. In the USA, Section II of the Sherman Act (Khosrowpour 1995) deals with this problem by making it illegal for OEMs to act in this manner. When the maintenance service is provided by an agent other than the original equipment manufacturer (OEM) often the cost of switching prevents customers from changing their service agent. In other words, customers get “locked in” and are unable to do anything about it without a major financial consequence.


As a result, it is very important for businesses to carry out a proper evaluation of the implications of outsourcing their maintenance. If done properly, outsourcing can be cheaper than in-house maintenance and can lead to greater business profitability.

15.2.1.2 Owners of Infrastructure

Traditionally, governments owned and operated infrastructure (such as road, rail, water and electricity networks). There has been a growing trend towards selling these assets to private businesses, who either lease them back to the government or operate the asset. The maintenance of the asset is often outsourced, as it is again viewed as not being the core activity of the business owning the asset. A complicating factor is the additional parties involved, and these are shown in Figure 15.1. For example, in the case of a rail network, the operators are the different rail companies that use the track, and the maintenance is outsourced to specialist contractors. The government plays a critical role in terms of providing loans to and/or acting as a guarantor for the owner, and the regulators are independent authorities responsible for ensuring public safety. The role of maintenance now becomes important in the context of safety and risk. For further discussion see Vickerman (2004).

[Figure 15.1 shows the parties involved: regulator, owner, service agent (maintenance), asset (infrastructure), government, operator and public.]

Figure 15.1. Different parties that need to be considered in the maintenance of infrastructures

15.2.1.3 Individual Consumers

In the case of consumer durables, the cost of rectifying failures in the post-warranty period is a concern to buyers. The uncertainty in the cost of repair and the customer's attitude to risk determine the amount a customer is willing to pay for an extended warranty or service contract. In one sense, opting for an extended warranty can be viewed as taking out insurance to cover potential future costs resulting from product failures in the post-warranty period.


15.2.1.4 Decision Problems In the case of businesses (producing goods and services) and infrastructure operators the decision problems are (i) whether to outsource or not, (ii) what maintenance activities to outsource and, (iii) how to implement and manage the process. We will discuss these issues in a later section. In the case of an extended warranty, the customer has to decide (i) whether or not to buy an extended warranty and (ii) the best one to buy when there are several different options. 15.2.2 Service Agent – Issues and Decisions The service agent providing the maintenance needs to operate as a service business. This implies that issues such as return on investment (ROI), number of customers to service (market share), location of operations, range of service contracts to offer are some of the variables that are important in the context of strategic management of the business. The type of contract depends on the needs of customers and they can be either standard contracts or customized. At the operational level, the service agent needs to deal with issues such as scheduling of maintenance tasks, spare part inventory control, etc. The pricing of the different service contracts offered is critical for business profitability. If the price is low, the service agent might end up making a loss instead of profit. On the other hand, if it is too high then there might be no customer for the service. The price must cover the costs and estimating the cost is a challenge due to information uncertainties. 15.2.2.1 Extended Warranty Providers – Issues and Decisions For most products, the product market has become global and highly competitive, resulting in many similar brands. Survival and growth in such an environment requires the manufacturers to differentiate their products from those of competitors. Product support provides the mechanism for this differentiation. 
Product support deals with issues such as providing better information about the product before sale and post-sale support in the form of warranty, extended warranty, training, upgrades, spares, etc. The bundling of products with product support is a mechanism that manufacturers have used very effectively to market their products (see Eppen et al. 1991). In many industries (for example, consumer electronics) extended warranties have been highly profitable to manufacturers (see Padmanabhan 1996 and the UK Competition Commission Report 2003). The popularity of extended warranties has resulted in third parties (financial institutions, insurance companies and independent operators) providing these to customers. The decision problem here is the pricing of extended warranties. The price must exceed the cost of servicing claims over the warranty period. Where the extended warranty is offered by the manufacturer, the manufacturer has some information about product reliability. Third parties offering extended warranties lack this information, so their pricing decisions must take this uncertainty into account.


D. Murthy and N. Jack

15.3 Framework to Study Maintenance Outsourcing

A proper framework to study maintenance outsourcing from both customer and service agent points of view involves several interlinked elements as indicated in Figure 15.2. In Section 15.2 we discussed the customer and the service agent elements and in this section we discuss the remaining elements.

[Figure 15.2. Framework for study of maintenance outsourcing. Elements shown: past usage, past maintenance, asset state at the start of contract, owner (customer), contract, service agent, nominated usage rate, actual usage rate, penalties / incentives, asset degradation rate, nominated maintenance, actual maintenance, asset state at the end of contract]

15.3.1 Asset and State of Asset

In general, an asset is a complex system comprising several components. The state of the system degrades with age and/or usage and this leads to a failure. An asset is said to be in a failed state when it is no longer functioning properly. In the case of equipment, or a consumer durable, the failure is due to the failure of one or more components. In the case of infrastructure, for example a road, a failure occurs when a pothole reaches some size or the number of potholes per kilometre exceeds some specified amount. In the case of a new asset, the initial state is determined by the decisions made during its design and construction (or manufacture). The asset reliability characterises the probability of no failure and this decreases with age. The field reliability also depends on the operating stress (load) on the asset and the operating environment. The stress can be thermal, mechanical, electrical, etc., and the reliability decreases as the stress increases and/or the environment gets harsher.

When a failure occurs, the asset can be restored to an operational state through corrective maintenance (CM). In the case of equipment, this involves repairing or replacing the failed components. In the case of the road example, the CM involves filling the potholes and resealing a section of the road. The degradation in the asset state can be controlled through use of preventive maintenance (PM) and, in the case of equipment, this involves regular monitoring and replacing of components before failure.

The asset state at any given time (subsequent to it being put into operation) is a function of its inherent reliability and past history of usage and maintenance. This information is important in the context of maintenance service contracts for used assets. The information that the service agent (and the customer) has can vary from very little to a lot (if detailed records of past usage and maintenance have been kept).

Finally, for some assets, the delivery of maintenance requires the service agent to visit the site where the asset is located (for example, lifts in buildings and roads) and for others (most consumer durables and some industrial equipment) the failed asset can be brought to a service centre to carry out the maintenance actions.

15.3.2 Maintenance

15.3.2.1 Corrective Maintenance (CM)

These are corrective actions performed when the asset has a failure. The most common form of CM is “minimal repair” where the state of the asset after repair is nearly the same as that just before failure. The other extreme is “as good as new” repair and this is seldom possible unless one replaces the failed asset by a new one. Any repair action that restores the asset state to better than that before failure but not as good as that of a new asset is referred to as “imperfect repair”.
15.3.2.2 Preventive Maintenance (PM)

In the case of equipment or consumer durables, PM actions are carried out at component level where components are replaced based on age, usage and/or condition. As a result, there are several different kinds of PM policies (Blischke and Murthy 2000). Some of the more commonly used ones are the following:

• Age based maintenance. Replace a component (under PM) when it reaches age T (after being put into use) or on failure under CM, if the item fails earlier.
• Clock based maintenance. Replace a component (under PM) at set times t = kT, k = 1, 2, …, or on failure under CM.
• Opportunistic maintenance. This is based on exploiting opportunities that become available. An example is PM actions for some components being carried out at the same time as the CM action for a failed component.
• Condition-based maintenance. Here, the maintenance action is based on an assessment of the state of a component from a set of measurement data obtained. For example, the state of a turbine bearing is assessed on data relating to noise, vibration, wear debris in oil, etc.
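As a rough illustration of the age based policy, the sketch below evaluates the long-run expected cost per unit time of replacing a component at age T or on failure, whichever comes first. The Weibull lifetime, the cost figures c_pm and c_cm, the candidate ages and all function names are assumptions introduced here for illustration and do not come from the chapter; the cost-rate expression is the usual renewal-reward ratio of expected cycle cost to expected cycle length.

```python
import math

def weibull_reliability(t, shape, scale):
    """Probability that a component survives beyond age t (Weibull lifetime, assumed)."""
    return math.exp(-((t / scale) ** shape))

def age_replacement_cost_rate(T, shape, scale, c_pm, c_cm, steps=2000):
    """Long-run expected cost per unit time when replacing at age T or on failure.

    expected cycle cost   = c_pm * R(T) + c_cm * (1 - R(T))
    expected cycle length = integral of R(t) over [0, T] (trapezoidal rule here)
    """
    h = T / steps
    area = 0.0
    for i in range(steps):
        a = weibull_reliability(i * h, shape, scale)
        b = weibull_reliability((i + 1) * h, shape, scale)
        area += 0.5 * (a + b) * h
    r_T = weibull_reliability(T, shape, scale)
    return (c_pm * r_T + c_cm * (1.0 - r_T)) / area

def best_replacement_age(shape, scale, c_pm, c_cm, candidates):
    """Pick the candidate age T with the lowest cost rate (simple grid search)."""
    return min(candidates,
               key=lambda T: age_replacement_cost_rate(T, shape, scale, c_pm, c_cm))

if __name__ == "__main__":
    ages = [0.1 * k for k in range(1, 51)]  # hypothetical candidate ages
    T_star = best_replacement_age(2.5, 1.0, c_pm=1.0, c_cm=5.0, candidates=ages)
    print(T_star, age_replacement_cost_rate(T_star, 2.5, 1.0, 1.0, 5.0))
```

With a constant failure rate (shape equal to 1) the cost rate only falls as T grows, so under these assumptions age based replacement pays off only when the failure rate increases with age and a failure costs more than a planned replacement.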


15.3.2.3 Modeling Failures and Maintenance Actions

To evaluate different maintenance actions, mathematical models are needed for the failure of assets and the effect of maintenance on these failures. The modeling can be done at two levels – system or component.


System level modeling

If only CM and no PM is used and the time to repair is very much smaller than the time between failures, then one can model failures over time as a stochastic point process with an intensity function λ(t) that is increasing with t (time or age) to capture the degradation with time (see Rigdon and Basu 2000). The effect of operating stress and operating environment can be modeled through a Cox-regression model where the intensity function is modified to g(z)λ(t), where z is the vector of covariates representing the stress and environmental variables (see Cox and Oakes 1984). The effect of PM actions can be modeled through a reduction in the intensity function as shown in Figure 15.3. The level of PM (indicated by δ in the figure) determines the reduction in the intensity function and the cost of a PM action increases with the level of PM.

[Figure 15.3. Effect of PM actions on the intensity function. Axes: intensity function against time; PM actions at times T1 and T2 with reductions δ1 and δ2]
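The system level description can be sketched numerically. The example below is a minimal illustration, not the authors' model: it assumes a power-law baseline intensity, takes g(z) to be an exponential function of the covariates, and treats each PM action as a fixed drop δ in the intensity from the time it is applied. The parameter names beta, eta and coeffs, and the representation of PM actions as (time, delta) pairs, are hypothetical.

```python
import math

def baseline_intensity(t, beta, eta):
    """Power-law intensity (an assumed form); increasing in t when beta > 1."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def covariate_factor(z, coeffs):
    """g(z) = exp(sum of b_j * z_j): a common proportional-intensity choice, assumed here."""
    return math.exp(sum(b * zj for b, zj in zip(coeffs, z)))

def intensity_with_pm(t, beta, eta, z, coeffs, pm_actions):
    """Intensity at time t after every PM action scheduled at or before t.

    pm_actions is a list of (T_i, delta_i): from time T_i the intensity is lower by delta_i.
    The result is floored at zero so a large delta cannot make it negative.
    """
    reduction = sum(delta for (T, delta) in pm_actions if T <= t)
    return max(0.0, covariate_factor(z, coeffs) * baseline_intensity(t, beta, eta) - reduction)

def expected_failures(horizon, beta, eta, z, coeffs, pm_actions, steps=10000):
    """Expected number of failures in [0, horizon]: integral of the intensity (midpoint rule)."""
    h = horizon / steps
    return sum(intensity_with_pm((i + 0.5) * h, beta, eta, z, coeffs, pm_actions) * h
               for i in range(steps))
```

Comparing expected_failures with and without entries in pm_actions gives a rough measure of what a given level of PM buys, which can then be weighed against the cost of the PM actions.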

Component level modeling

If a component of the asset fails and is non-repairable and/or too costly to repair, then it is replaced by a new one. If the replacement time is small relative to the mean time to failure, then it can be ignored and component failures (over time) can be modeled by a renewal process (see Ross 1980). If the component is repairable and costly and a failed component is subjected to minimal repair, then failures (over time) can be modeled by a stochastic point process with intensity function having the same form as the hazard function of the component.
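The two component level models can be compared with a small sketch. It assumes a Weibull lifetime and negligible repair and replacement times; the function names, the number of simulation runs and the seed are illustrative choices. Under minimal repair the expected number of failures in [0, t] is the cumulative hazard, while under replacement by new it is the renewal function, estimated here by simulation.

```python
import random

def expected_failures_minimal_repair(t, shape, scale):
    """Under minimal repair the intensity equals the hazard, so the expected number of
    failures in [0, t] is the cumulative hazard; for a Weibull this is (t/scale)**shape."""
    return (t / scale) ** shape

def simulate_renewal_count(t, shape, scale, rng):
    """Number of failures in [0, t] when each failed component is replaced by a new one."""
    clock, count = 0.0, 0
    while True:
        clock += rng.weibullvariate(scale, shape)  # lifetime of the current component
        if clock > t:
            return count
        count += 1

def expected_failures_renewal(t, shape, scale, runs=20000, seed=1):
    """Monte Carlo estimate of the renewal function M(t)."""
    rng = random.Random(seed)
    return sum(simulate_renewal_count(t, shape, scale, rng) for _ in range(runs)) / runs
```

For a shape parameter of 1 (constant hazard) the two models agree, since a minimally repaired component and a new one are statistically identical; for an increasing hazard, replacement yields fewer expected failures over the same horizon.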


15.3.3 Contract

The contract is a legal document that is binding on both parties (customer and service agent) and it needs to deal with technical, management and economic issues.

15.3.3.1 Technical and Management Issues

Maintenance of an asset involves carrying out several activities as indicated in Figure 15.4 (adapted from Dunn 1999). There are many different contract scenarios depending on how these activities are outsourced. Table 15.1 indicates three different scenarios (S-1 to S-3) where:

• (D-1). What (components) need to be maintained?
• (D-2). When should the maintenance be carried out?
• (D-3). How should the maintenance be carried out?

[Figure 15.4. Activities in asset maintenance: work identification, work planning, work scheduling, work execution, data recording, data analysis]

Table 15.1. Different contract scenarios

SCENARIO    DECISIONS: CUSTOMER    DECISIONS: SERVICE AGENT
S-1         D-1, D-2               D-3
S-2         D-1                    D-2, D-3
S-3         -                      D-1, D-2, D-3

In scenario S-1, the service agent only provides the resources (workforce and material) to execute the work. This corresponds to the minimalist approach to outsourcing. In scenario S-2, the customer decides what is to be done and the service agent decides how and when. Finally, in scenario S-3 the service agent makes all three decisions.

There is a growing trend towards functional guarantee contracts. Here the contract specifies a level for the output generated from equipment, for example, the amount of electricity produced by a power plant, or the total length of flights and number of landings and takeoffs per year. The service agent has the freedom to decide on the maintenance needed (subject to operational constraints), with incentives if the target levels are exceeded and/or penalties if they are not met. For more on this, see Kumar and Kumar (2004). In the context of infrastructures, there is a trend towards giving the service agent the responsibility for ongoing upgrades or the responsibility for the initial design, resulting in a BOOM (build, own, operate and maintain) contract. The levels of risk to both parties vary with the contract scenario.

15.3.3.2 Economic Issues

There are a number of alternative contract payment structures. The following list is from Dunn (1999):

• Fixed or firm price
• Variable price
• Price ceiling incentive
• Cost plus incentive fee
• Cost plus award fee
• Cost plus fixed fee
• Cost plus margin

Each of these price structures represents a different level of risk sharing between the customer and the service agent. According to Vickerman (2004), an increasingly important issue in privatized infrastructure is the appropriate incentives needed to ensure adequate maintenance of the infrastructure as a public resource.

15.3.3.3 Other Issues

Some other issues are as follows:

Requirements. Both parties might need to meet some stated requirements. For example, the customer needs to ensure that the stresses on the asset do not exceed the levels specified in the contract as this can lead to greater degradation and higher servicing costs to the service agent. Similarly, the service agent needs to ensure proper data recording.

Contract duration. This is usually fixed with options for renewal at the end of the contract.

Dispute resolution. This specifies the avenues to follow when there is a dispute. The dispute can involve going to a third party (legal courts).

Unless the contract is written properly and relevant data (relating to equipment and collected by the service agent) are analysed properly by the customer, the long-term costs and risks will escalate.

15.3.4 Maintenance Outsourcing Market

Whether the maintenance outsourcing market is competitive or not depends on the number of customers and service agents. Table 15.2 indicates the different market scenarios. These have an impact on issues such as the types of service contracts available to customers and the pricing of the contracts.


Table 15.2. Maintenance outsourcing market scenarios

NUMBER OF CUSTOMERS    NUMBER OF SERVICE AGENTS
                       ONE     FEW
ONE                    A-1     B-1
FEW                    A-2     B-2
MANY                   A-3     B-3

15.4 Review of Literature

There is a vast literature on maintenance and it covers a range of topics (approaches to maintenance, mathematical models for deciding optimal maintenance, maintenance management, etc.). Several review papers have appeared over the last 40 years and these include McCall (1965), Pierskalla and Voelker (1976), Jardine and Buzzacot (1985), Sherif and Smith (1986), Thomas (1986), Valdez-Flores and Feldman (1989), Pintelton and Gelders (1992) and Scarf (1997). Cho and Parlar (1991) and Dekker et al. (1997) deal with the maintenance of multi-component systems. There are also several maintenance books. In contrast, the literature on maintenance outsourcing is very limited and in this section we briefly review this literature.

15.4.1 Maintenance Outsourcing

The literature deals with maintenance outsourcing mainly from the customer perspective and is focussed on management issues. More specifically, attempts are made to address one or more of the following questions in a qualitative manner:

1. Does outsourcing make sense?
2. Are the objectives achievable?
3. Is the organisation ready?
4. What are the outsourcing alternatives?
5. What maintenance activities should be outsourced?
6. How should the best service agent be selected?
7. What are the negotiating tactics for contract formation?

Some of the relevant papers are Campbell (1995), Judenberg (1994), Martin (1997), Levery (1998) and Sunny (1995). Unfortunately, cost has been the sole basis used by businesses for making maintenance outsourcing decisions. Sunny (1995) looks at what activities are to be outsourced by considering the long-term strategic dimension (core competencies) as well as the short-term cost issues. Bertolini et al. (2004) take a quantitative approach and use the analytic hierarchy process (AHP) to make decisions regarding the outsourcing of maintenance. Asgharizadeh and Murthy (2000) and Murthy and Asgharizadeh (1998, 1999) look at maintenance outsourcing from both customer and service agent perspectives and propose game-theoretic models to determine the optimal strategies for both parties. This approach is discussed further in Section 15.5.

On the application side, Armstrong and Cook (1981) look at clustering of highway sections for awarding maintenance contracts to minimise the cost and use a fixed-charge goal programming model to determine the optimal strategy. Bevilacqua and Braglia (2000) illustrate their AHP model in the context of an Italian brick manufacturing business having to make decisions regarding maintenance outsourcing. Stremersch et al. (2001) look at the industrial maintenance market.

15.4.2 Extended Warranties

The literature can be broadly divided into three groups.

15.4.2.1 Group 1: Warranty Cost Analysis

The cost analysis of many different types of basic warranties can be found in Blischke and Murthy (1994, 1996). For a review of more recent literature, see Murthy and Djamaludin (2002). These techniques can be easily extended to obtain the costs for extended warranties and this has been done by Sahin and Polatoglu (1998).

15.4.2.2 Group 2: Warranty Servicing Strategy

When a repairable asset fails under warranty, the manufacturer has the choice of either repairing it or replacing it with a new one. The first option costs less than the second but a repaired asset has a greater probability of failing during the remainder of the warranty period. It is therefore important for the manufacturer to choose an appropriate servicing strategy in order to minimise the expected cost of servicing the warranty per asset sold. Servicing strategies for products sold with one-dimensional warranties have received considerable attention. Biedenweg (1981) and Nguyen and Murthy (1986, 1989) assume that repaired items have independent and identically distributed lifetimes different from that of a new item and consider strategies where the warranty period is divided into distinct intervals for repair and replacement.
Nguyen (1984) introduces the first servicing model with minimal repair (see Barlow and Hunter 1960), with the warranty period split into a replacement interval followed by a repair interval. The length of the first interval is selected optimally to minimize the expected warranty cost. Jack and Van der Duyn Schouten (2000) show that this strategy is sub-optimal and that the optimal servicing strategy is in fact characterized by three distinct intervals – [0, x), [x, y] and (y, W] where W is the warranty period. The optimal strategy is to carry out minimal repairs in the first and last intervals and to use either minimal repair or replacement by new in the middle interval depending on the age of the item at failure. This strategy is difficult to implement, so Jack and Murthy (2001) propose a near optimal strategy involving the same three intervals but with only the first failure in the middle interval resulting in a replacement and all other failures being minimally repaired.
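The near optimal three-interval rule described above can be written as a simple decision function. This is a sketch of the rule as stated in the text, not an implementation taken from Jack and Murthy (2001); the function names and the idea of tracking a single "already replaced" flag are illustrative, and choosing good values of x and y is a separate optimisation that is not shown.

```python
def servicing_action(failure_time, x, y, W, replaced_in_middle):
    """Return 'minimal repair' or 'replace' for a failure at failure_time in [0, W].

    Rule as described in the text: minimal repair in [0, x) and (y, W];
    in [x, y] only the first failure triggers a replacement.
    """
    if not 0 <= x <= y <= W:
        raise ValueError("require 0 <= x <= y <= W")
    if not 0 <= failure_time <= W:
        raise ValueError("failure lies outside the warranty period")
    in_middle = x <= failure_time <= y
    if in_middle and not replaced_in_middle:
        return "replace"
    return "minimal repair"

def service_history(failure_times, x, y, W):
    """Apply the rule to a sequence of failure times and list the actions taken."""
    actions, replaced = [], False
    for t in sorted(failure_times):
        action = servicing_action(t, x, y, W, replaced)
        replaced = replaced or action == "replace"
        actions.append((t, action))
    return actions
```

For example, with x = 0.5, y = 1.5 and W = 2.0, failures at times 0.2, 0.9, 1.1 and 1.8 are handled by minimal repair, replacement, minimal repair and minimal repair respectively.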


Servicing strategies for products sold with two-dimensional warranties have been studied by Iskandar and Murthy (2003), who propose two strategies similar to those of Nguyen and Murthy (1986, 1989) but with minimal repair. Iskandar et al. (2005) deal with a servicing strategy similar to that given in Jack and Murthy (2001). When the cost of replacement is high compared to the cost of a minimal repair, strategies involving replacement are not appropriate. In this case, strategies involving imperfect repair (where the failure characteristics of the repaired asset are better than those after minimal repair but are not the same as those of a new item) are more appropriate. The advantage of using imperfect repair is that the degree of improvement in the reliability after repair is a decision variable under the control of the manufacturer. Yun et al. (2006) discuss this topic.

Every extended warranty (EW) provider also needs to choose appropriate servicing strategies to minimise the costs of servicing the EWs that they have sold. The techniques that have been developed for basic warranties can easily be adapted to the EW case.

15.4.2.3 Group 3: Market for EWs

A number of studies show how EWs can be used as a tool for market segmentation. Unfortunately, most of the failure modeling used in these studies is static in nature: the asset either functions or does not function properly during the EW period. Padmanabhan and Rao (1993) consider strategies that manufacturers should adopt for warranty provision when consumers vary in risk attitude and consumer moral hazard is also present. Moral hazard problems occur when consumers who have purchased EWs reduce their level of maintenance effort and this causes increased servicing costs to EW providers. Lutz and Padmanabhan (1994) look at the effect of income variation on EW purchasing and Padmanabhan (1995) and Hollis (1999) consider heterogeneity in consumer usage.
Lutz and Padmanabhan (1998) investigate differences in consumers’ valuations of a working asset and the effect of independent EW providers in the market. Desai and Padmanabhan (2004) consider the impact of different distributional arrangements for the sale of assets and their EWs.

15.5 Game Theoretic Approach

In the game theoretic approach, the outsourcing problem is viewed as a game with two players – customer and service agent. Each player has his/her own goal or objective and a set of decisions that need to be selected optimally. There are several different scenarios depending on whether there is a dominant player (a leader-follower situation where the actions of the follower depend on the actions of the leader – referred to as a “Stackelberg game formulation”) or there isn’t (both players decide on their actions either in a cooperative or non-cooperative mode – referred to as a “Nash game formulation”), and also on the kinds of information available to each player and their attitudes to uncertainty and risk. This approach allows maintenance outsourcing to be studied from both customer and service agent perspectives.


15.5.1 Maintenance Outsourcing

Consider the case where the service agent is the leader and offers n options A_i(θ_i), 1 ≤ i ≤ n, to the customer, where θ_i, 1 ≤ i ≤ n, are the decision variables corresponding to the different options that the agent needs to select optimally. As an illustrative case, let n = 2 and the two options that the service agent offers for CM actions are as follows:

Option 1 [Fixed price service contract – A_1(θ_1)]: For a fixed price P, the service agent agrees to rectify all failures occurring over a period L at no additional cost to the customer. If a failure is not rectified within a period τ, the service agent incurs a penalty. If Y denotes the time for which the equipment is in the non-operational state before it becomes operational, then the penalty incurred is given by max{0, α(Y − τ)}, where α is the penalty cost per unit time. This ensures that the service agent does not deprive the customer of the use of the equipment for too long. Here, θ_1 = {P, τ, α}.

Option 2 [Pay for each repair contract – A_2(θ_2)]: In this case, whenever a failure occurs, the service agent charges an amount C_s for each repair and does not incur any penalty if the equipment is in the non-operational state for longer than τ units of time. Here, θ_2 = {C_s}.

In the Stackelberg game formulation, given the set of options (along with the values for the decision variables of the service agent), the customer chooses the best option to optimize his/her goal. This generates the optimal response function A*(θ_1, θ_2, …, θ_n) as shown in Figure 15.5. Using this, the service agent then optimally selects the decision variables to optimize his/her objective.

[Figure 15.5. Stackelberg game formulation: the service agent offers the options A_i(θ_i), 1 ≤ i ≤ n, to the customer, and the customer returns the optimal response A*(θ_1, θ_2, …, θ_n)]

Murthy and Asgharizadeh (1998, 1999) and Asgharizadeh and Murthy (2000) use a Stackelberg game formulation for a special case where the time between equipment failures is given by an exponential distribution so that the failures over time occur according to a Poisson process. They consider the two options discussed earlier and the following three cases:

1. Single service agent and single customer (Case A-1)
2. Single service agent, multiple customers (Case A-2) and one repair facility so that only one failed equipment can be repaired at any given time
3. Single service agent, multiple customers (Case A-3) and more than one repair facility


In case 2 the service agent has to decide the optimal number of customers to service and in case 3 he has to decide the optimal number of repair facilities.

15.5.2 Extended Warranties

Jack and Murthy (2006) consider the case where the product is complex and so the specialist knowledge of the manufacturer is required to carry out any repairs after the base warranty expires. The consumer must decide how long to keep the item and how to maintain it until replacement. Two maintenance options are available: the consumer can (i) pay the manufacturer to repair the item each time it fails, or (ii) purchase an extended warranty (EW) from the manufacturer. These are similar to Options 2 and 1 respectively, discussed earlier. The EW contract specifies that the manufacturer will again rectify all failures free of charge to the consumer. The consumer has flexibility in choosing when the EW will begin and the length of cover. The price of the EW depends on these two variables and is set by the manufacturer. The manufacturer also has to decide the price of each repair if the item fails and the consumer does not have an EW. A Stackelberg game formulation is used to determine the optimal strategies for both the consumer and the manufacturer.

15.6 Agency Theory (The Principal – Agent Problem)

Agency theory deals with the relationship that exists between two parties (a principal and an agent) where the principal delegates work to the agent who performs that work and a contract defines the relationship. Agency theory is concerned with resolving two problems that can occur in agency relationships. The first problem arises when the two parties have conflicting goals and it is difficult or expensive for the principal to verify the actual actions of the agent and whether the agent has behaved properly or not. The second problem involves the risk sharing that takes place when the principal and agent have different attitudes to risk (due to various uncertainties). According to Eisenhardt (1989), the focus of the theory is on determining the optimal contract, behaviour vs. outcome, between the principal and the agent. Many different cases have been studied in depth in the principal-agent literature and these deal with the range of issues indicated in Figure 15.6. Agency theory has also been applied in many different disciplines. For an overview see Van Ackere (1993).


[Figure 15.6. Issues in agency theory: principal, agent, contract, costs, monitoring, incentives, informational asymmetry, risk preferences, moral hazard, adverse selection]

15.6.1 Issues in Agency Theory

Moral hazard. Moral hazard refers to lack of effort (or shirking) on the part of the agent. The agent does not put in the agreed-upon effort because the objectives of the two parties are different and the principal cannot assess the level of effort that the agent has actually used.

Adverse selection. Adverse selection refers to any misrepresentation of ability by the agent that the principal is unable to verify completely before deciding to hire the agent.

Information. To counteract adverse selection, the principal can invest in getting information about the agent’s ability. One way of getting the desired information is by contacting people for whom the agent has provided service in the past.

Monitoring. The principal can counteract the moral hazard problem by monitoring the actions of the agent. Monitoring provides information about the agent’s actual actions.

Information asymmetry. There are several uncertainties that affect the overall outcome of the relationship. The two parties, in general, will have different information to make an assessment of these uncertainties and will also differ in terms of other information.

Risk. This results from the different uncertainties that affect the outcome of the relationship. The risk attitude of the two parties, in general, will differ for a variety of reasons. A problem arises when the two parties disagree over the allocation of risk between them.

Costs. There are various kinds of costs for both parties. Some of these depend on the outcome (which is influenced by uncertainties), while others arise from acquiring information, monitoring and the administration of the contract. The heart of principal-agent theory is the trade-off between (i) the cost of monitoring the actions of the agent and (ii) the cost of measuring the outcomes of the relationship and the transferring of risk to the agent.

Contract. The design of a contract that takes into account the issues discussed above is the challenge that lies at the heart of the principal-agent relationship.

15.6.2 Relevance to Maintenance Outsourcing and Extended Warranties

15.6.2.1 Maintenance Outsourcing

Outsourcing of maintenance involves all the agency theory issues discussed in Section 15.6.1 with the customer as the principal and the maintenance service provider as the agent. The key factor is the contract that specifies what, when, and how maintenance is to be carried out. This needs to be designed taking into account all the various issues. Kraus (1996) reviews the literature on incentive contracting. The customer and service agent both potentially face moral hazard. This can occur for the customer when the service agent shirks to reduce costs and does not do proper maintenance, and it can occur for the agent when the customer uses the asset in a manner different to that stated in the contract. Adverse selection can also take place when the customer chooses from a pool of potential maintenance service providers (the B scenarios in Table 15.2). The two parties have different information about asset state, usage level, care and attention of the asset, and quality of maintenance used and this asymmetry will affect the outcome of their relationship.

The different market scenarios for maintenance outsourcing are as indicated in Table 15.2. In scenario A-1, the classical principal-agent model discussed in Section 15.6.1 is appropriate with a single principal (customer) and a single agent (maintenance provider). This could be a large business unit, for example. In the remaining five scenarios, there are multiple principals and/or multiple agents. In scenarios A-2 and A-3, the equipment under consideration could be a particular brand of lift installed in different buildings within a city.
In this case, all the equipment is maintained either by the original equipment manufacturer (OEM) or an agent of the OEM. There is an extensive literature dealing with the design of contracts for multiple principal/multiple agent problems (Macho-Stadler and Perez-Castrillo 1997 and Laffont and Martimort 2002 are two examples from this literature) and all the issues from Section 15.6.1 are still relevant. The principal-agent models that have been studied in the literature are static in nature and new, dynamic models need to be formulated so that they can be applied meaningfully in the context of maintenance outsourcing.

15.6.2.2 Extended Warranties

This case is similar to A-3. In the case of standard commercial and industrial products and consumer durables, the EW policy is decided by the EW provider and the customer does not have any direct input. The issues from agency theory (such as moral hazard, adverse selection, risk, monitoring, etc.) are all relevant for EW policies. Current EW offerings lack flexibility from the customer point of view and there is a perception (amongst customers and EW regulators) that the pricing of EWs is not fair. This provides an opportunity for EW providers to offer flexible warranties to meet the different needs across the customer population. Agency theory offers a framework to evaluate the costs of different policies taking into account all the relevant issues.

15.7 Conclusion and Topics for Future Research

In this chapter we have proposed a framework to look at maintenance outsourcing from both the equipment owner (customer for maintenance service) and the service agent (maintenance service provider) perspectives. A review of the literature indicates that the bulk of it is qualitative, with very few papers dealing with the topic in a more quantitative manner. Also, not all the relevant issues have been addressed effectively. Agency theory provides an approach to address all these issues in a unified manner. This will require building new models and offers scope for a lot of new research in the future. The provision of extended warranties is very similar to maintenance outsourcing. We have highlighted this link and have also discussed the concept of flexible EWs. The framework proposed in this chapter, combined with agency theory, can be used by EW providers to obtain better estimates of the cost of offering different EW options in a more objective and scientific manner where all the various issues such as moral hazard, adverse selection, risk, etc., are taken into account. Again, there is considerable scope for future research in EWs.

15.8 References Armstrong, R.D. and Cook, W.D. (1981), The contract formation problem in preventive pavement maintenance: A fixed-charge goal-programming model, Comp. Environ. Urban Systems, 6, 147–155 Ashgarizadeh, E. and Murthy, D.N.P. (2000), Service contracts – a stochastic model, Mathematical and Computer Modelling, 31, 11–20 Barlow, R.E. and Hunter, L.C. (1960), Optimum preventive maintenance policies, Operations Research, 8, 90–100 Bertolini, M., Bevilacqua, M. Braglia, M. and Frosolini, M. (2004), An analytical method for maintenance outsourcing service selection, International Journal on Quality & Reliability Management, 21, 772–788 Bevilacqua, M. and Braglia, M. (2000), The analytic hierarchy process applied to maintenance strategy selection, Reliability Engineering & System Safety, 70, 71–83. Biedenweg, F. M. (1981), Warranty Analysis: Consumer Value vs. Manufacturers Cost, Unpublished Ph.D. Thesis, Stanford University, U.S.A. Blischke, W.R. and Murthy, D.N.P. (1994), Warranty Cost Analysis. Marcel Dekker, New York Blischke, W.R. and Murthy, D.N.P. (1996), Product Warranty Handbook, Marcel Dekker, New York Blischke, W.R. and Murthy D.N.P. (2000), Reliability, Wiley, New York Campbell, J.D. (1995), Outsourcing in maintenance management: a valid alternative to selfprovision, Journal of Quality in Maintenance Engineering, 1, 18–24.

Maintenance Outsourcing


Cho, D. and Parlar, M. (1991), A survey of maintenance models for multi-unit systems, European Journal of Operational Research, 51, 1–23
Cox, D.R. and Oakes, D. (1984), Analysis of Survival Data, Chapman and Hall, New York
Day, E. and Fox, R.J. (1985), Extended warranties, service contracts and maintenance agreements – A marketing opportunity? Journal of Consumer Marketing, 2, 77–86
Dekker, R., Wildeman, R.E. and van der Duyn Schouten, F.A. (1997), Review of multicomponent models with economic dependence, Zor/Mathematical Methods of Operations Research, 45, 411–435
Desai, P.S. and Padmanabhan, V. (2004), Durable good, extended warranty and channel coordination, Review of Marketing Science, 2, Article 2, available at www.bepress.com/romsjournal/vol2/iss1/art2
Dunn, S. (1999), Maintenance outsourcing – Critical issues, available at: www.plantmaintenance.com/maintenance_articles_outsources.html
Eisenhardt, K.M. (1989), Agency theory: An assessment and review, The Academy of Management Review, 14, 57–74
Embleton, P.R. and Wright, P.C. (1998), A practical guide to successful outsourcing, Empowerment in Organizations, 6(3), 94–106
Eppen, G.D., Hanson, W.A. and Martin, R.K. (1991), Bundling – new products, new markets, low risks, Sloan Management Review, Summer, 7–14
Hollis, A. (1999), Extended warranties, adverse selection and aftermarkets, The Journal of Risk and Insurance, 66, 321–343
Iskandar, B.P. and Murthy, D.N.P. (2003), Repair-replace strategies for two-dimensional warranty policies, Mathematical and Computer Modelling, 38, 1233–1241
Iskandar, B.P., Murthy, D.N.P. and Jack, N. (2005), A new repair-replace strategy for items sold with a two-dimensional warranty, Computers and Operations Research, 32, 669–682
Jack, N. and Murthy, D.N.P. (2001), A servicing strategy for items sold under warranty, Jr. Oper. Res. Soc., 52, 1284–1288
Jack, N. and Murthy, D.N.P. (2006), A flexible extended warranty and related optimal strategies, Jr. Oper. Res. Soc. (accepted for publication)
Jack, N. and Van der Duyn Schouten, F. (2000), Optimal repair-replace strategies for a warranted product, Int. J. Production Economics, 67, 95–100
Jardine, A.K.S. and Buzacott, J.A. (1985), Equipment reliability and maintenance, European Journal of Operational Research, 19, 285–296
Judenberg, J. (1994), Applications maintenance outsourcing, Information Systems Management, 11, 34–38
Khosrowpour, M. (ed) (1995), Managing Information Technology Investments with Outsourcing, Idea Group Publishing, Harrisburg
Kraus, S. (1996), An overview of incentive contracting, Artificial Intelligence, 83, 297–346
Kumar, R. and Kumar, U. (2004), Service delivery strategy: Trends in mining industries, Int. J. Surface Mining, Reclamation and Environment, 18, 299–307
Laffont, J. and Martimort, D. (2002), The Theory of Incentives: The Principal-Agent Model, Princeton University Press
Levery, M. (1998), Outsourcing maintenance: a question of strategy, Engineering Management Journal, February, 34–40
Lutz, N.A. and Padmanabhan, V. (1994), Income variation and warranty policy, Working Paper, Graduate School of Business, Stanford University
Lutz, N.A. and Padmanabhan, V. (1998), Warranties, extended warranties and product quality, International Journal of Industrial Organization, 16, 463–493
Macho-Stadler, I. and Perez-Castrillo, D. (1997), An Introduction to the Economics of Information, Oxford University Press


D. Murthy and N. Jack

Martin, H.H. (1997), Contracting out maintenance and a plan for future research, Journal of Quality in Maintenance Engineering, 3, 81–90
McCall, J.J. (1965), Maintenance policies for stochastically failing equipment: A survey, Management Science, 11, 493–524
Murthy, D.N.P. and Ashgarizadeh, E. (1998), A stochastic model for service contract, Int. Jr. of Reliability Quality and Safety Engineering, 5, 29–45
Murthy, D.N.P. and Ashgarizadeh, E. (1999), Optimal decision making in a maintenance service operation, European Journal of Operational Research, 116, 259–273
Murthy, D.N.P. and Djamaludin, I. (2002), Product warranty – A review, International Journal of Production Economics, 79, 231–260
Nguyen, D.G. (1984), Studies in Warranty Policies and Product Reliability, Unpublished Ph.D. Thesis, The University of Queensland, Australia
Nguyen, D.G. and Murthy, D.N.P. (1986), An optimal policy for servicing warranty, Jr. Oper. Res. Soc., 37, 1081–1088
Nguyen, D.G. and Murthy, D.N.P. (1989), Optimal replace-repair strategy for servicing items sold with warranty, Euro. Jr. of Oper. Res., 39, 206–212
Padmanabhan, V. (1995), Usage heterogeneity and extended warranties, Journal of Economics and Management Strategy, 4, 33–53
Padmanabhan, V. (1996), Extended warranties, in Product Warranty Handbook, W.R. Blischke and D.N.P. Murthy (eds), Marcel Dekker, New York
Padmanabhan, V. and Rao, R.C. (1993), Warranty policy and extended warranties: theory and an application to automobiles, Marketing Science, 12, 230–247
Pierskalla, W.P. and Voelker, J.A. (1976), A survey of maintenance models: The control and surveillance of deteriorating systems, Naval Research Logistics Quarterly, 23, 353–388
Pintelon, L.M. and Gelders, L. (1992), Maintenance management decision making, European Journal of Operational Research, 58, 301–317
Rigdon, S.E. and Basu, A.P. (2000), Statistical Methods for the Reliability of Repairable Systems, Wiley, New York
Ross, S.M. (1980), Stochastic Processes, Wiley, New York
Sahin, I. and Polatoglu, H. (1998), Quality, Warranty and Preventive Maintenance, Kluwer, Amsterdam
Scarf, P.S. (1997), On the application of mathematical models to maintenance, European Journal of Operational Research, 63, 493–506
Sherif, Y.S. and Smith, M.L. (1986), Optimal maintenance models for systems subject to failure – A review, Naval Logistics Research Quarterly, 23, 47–74
Stremersch, S., Wuyts, S. and Frambach, R.T. (2001), The purchasing of full-service contracts: An exploratory study within the industrial maintenance market, Industrial Marketing Management, 30, 1–12
Sunny, I. (1995), Outsourcing maintenance: making the right decisions for the right reasons, Plant Engineering, 49, 156–157
Thomas, L.C. (1986), A survey of maintenance and replacement models for maintainability and reliability of multi-item systems, Reliability Engineering, 16, 297–309
UK Competition Commission (2003), A report into the supply of extended warranties on domestic electrical goods within the UK, available at: www.competition-commission.org.uk/inquiries/completed/2003/warranty/index.htm
Valdez-Flores, C. and Feldman, R.M. (1989), A survey of preventive maintenance models for stochastically deteriorating single-unit systems, Naval Research Logistics Quarterly, 36, 419–446
Van Ackere, A. (1993), The principal-agent paradigm: Its relevance to various functional fields, European Journal of Operational Research, 70, 83–103


Vickerman, R. (2004), Maintenance incentives under different infrastructure regimes, Utilities Policy, 12, 315–322
Yun, W.Y., Murthy, D.N.P. and Jack, N. (2006), Warranty servicing with imperfect repair, submitted for publication

16 Maintenance of Leased Equipment

D.N.P. Murthy and J. Pongpech

16.1 Introduction

Businesses need equipment to produce their outputs (goods/services). Equipment degrades with age and usage, and eventually fails (Blischke and Murthy 2000). This impacts business performance in several ways: reduced equipment availability, lower output quality, higher operating costs, increased customer dissatisfaction, etc. The degradation can be controlled through preventive maintenance (PM) actions, whilst corrective maintenance (CM) actions restore failed equipment to its working state.

Prior to 1970, businesses owned their equipment, and maintenance was done in-house. Since 1970, there has been a shift towards outsourcing of maintenance. This was primarily due to a change in the management paradigm, in which the activities of a business were classified as either core or non-core, with the non-core activities to be outsourced to external agents if this was deemed to be cost effective. Also, as technology became more complex, it was no longer economical to carry out in-house maintenance because of the need for expensive maintenance equipment and highly trained maintenance staff.

Since 1990, there has been an increasing trend towards leasing rather than owning equipment. According to Fishbein et al. (2000) there are several reasons for this. Some of these are as follows:

• Rapid technological advances have resulted in improved equipment appearing on the market, making the earlier generation of equipment obsolete at an ever-increasing pace.
• The cost of owning equipment has been increasing very rapidly.
• Businesses view maintenance as a non-core activity.
• It is often more economical to lease equipment than to buy it, as leasing involves less initial capital investment and often brings tax benefits that make it attractive.


In the USA, the Equipment Leasing Association (ELA) conducted a survey in 2002 (ELA, 2002a). Its findings were as follows:

• 80% of businesses acquire equipment through leasing.
• Leasing accounts for roughly 30% of business capital investment.
• Nearly 50% of office equipment is leased.
• Leasing companies own more equipment than companies in other US industries.

The leasing industry grew from 1990 until the last quarter of 2001, when it experienced an economic downturn due to the impact of 9/11. In 2002, the Department of Commerce predicted equipment leasing volumes of $208 billion for 2003 and $218 billion for 2004.

The ELA Online Focus Groups Report (ELA 2002b) states that 60% of leasing benefits come from maintenance options. This is because some equipment leases come with maintenance as an integral part of the lease, so that the physical equipment is bundled with maintenance service and offered as a package under a lease contract. The lessee can then focus on the core activities of the business and not be distracted by equipment maintenance.

Maintenance of leased equipment raises several new issues for both the lessor and the lessee (Desai and Purohit 1998; Kleiman 2001). The strategic issues deal with the size and composition of the equipment fleet, the number and location of lease centers, workshop facilities, warehouses for spares, etc. The operational issues include logistics, pricing, marketing, and maintenance strategies. In this chapter we touch on these issues and then focus our attention on maintenance strategies for leased equipment.

The outline of the chapter is as follows. Section 16.2 starts with a general introduction to equipment leasing and then discusses the different types of leases. Section 16.3 deals with a framework to study equipment leasing and reviews the relevant literature. In Section 16.4, we look at the maintenance of equipment under operational lease; we discuss the modeling issues and propose various maintenance policies. Section 16.5 looks at the analysis of two of these policies and the optimal selection of the policy parameters. We conclude with a brief discussion of topics for future research in Section 16.6. We use the following abbreviations and notation.
Abbreviations

AFT: Accelerated failure time
PH: Proportional hazard
NHPP: Non-homogeneous Poisson process
ROCOF: Rate of occurrence of failure
CM: Corrective maintenance
PM: Preventive maintenance

Notation

F(t): Failure distribution for the time to first failure of new equipment
f(t), r(t): Failure density and hazard functions associated with F(t)
λ_0(t): Intensity function with only CM actions
λ(t): Intensity function with both CM and PM actions
A: Age of used equipment
x: Reduction in age with PM action
L: Duration of lease period
δ_j: Reduction in intensity function with j-th PM action
t_j: Time instant of j-th PM action
N(L): Number of equipment failures over the lease period
Y: Time to carry out minimal repair (random variable)
G(y): Distribution function for Y
γ, τ: Parameters of penalty cost
C_p(δ): Cost of PM action with reduction in intensity function δ
C_u(x): Cost of PM action with reduction in virtual age x
C_f: Mean cost of a CM action (minimal repair)
C_n: Penalty cost per failure (when number of failures exceeds γ)
C_t: Penalty cost per unit time (when repair time exceeds τ)
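To make the failure-related notation concrete, the following is a minimal sketch and not the chapter's own model. It assumes that failures under CM only follow an NHPP with a power-law intensity λ_0(t) = (β/α)(t/α)^(β−1), a form this section does not specify. Under that assumption, N(L) is Poisson distributed with mean (A+L)^β/α^β − A^β/α^β for equipment of age A at the start of the lease. The sketch also assumes that the penalty C_n is charged for each failure in excess of γ. The function names and parameters alpha and beta are hypothetical.

```python
import math


def cumulative_intensity(t, alpha, beta):
    """Cumulative intensity (t / alpha) ** beta of the ASSUMED power-law form."""
    return (t / alpha) ** beta


def expected_failures(lease_length, alpha, beta, age=0.0):
    """E[N(L)] with CM only (minimal repair) for equipment of age `age` at lease start."""
    return (cumulative_intensity(age + lease_length, alpha, beta)
            - cumulative_intensity(age, alpha, beta))


def expected_excess_failures(mean, gamma):
    """E[max(N - gamma, 0)] for N ~ Poisson(mean) and integer gamma >= 0."""
    pmf = math.exp(-mean)  # P(N = 0)
    shortfall = 0.0        # accumulates E[max(gamma - N, 0)]
    for k in range(gamma + 1):
        shortfall += (gamma - k) * pmf
        pmf *= mean / (k + 1)  # now P(N = k + 1)
    # Identity: E[max(N - g, 0)] = E[N] - g + E[max(g - N, 0)]
    return mean - gamma + shortfall


def expected_failure_cost(lease_length, alpha, beta, c_f, c_n, gamma, age=0.0):
    """Expected CM cost plus expected penalty for failures beyond gamma (assumption)."""
    mean = expected_failures(lease_length, alpha, beta, age)
    return c_f * mean + c_n * expected_excess_failures(mean, gamma)


if __name__ == "__main__":
    # Hypothetical numbers: 5-year lease of new equipment, alpha = 2 years, beta = 2.
    print(expected_failures(5.0, alpha=2.0, beta=2.0))
    print(expected_failure_cost(5.0, 2.0, 2.0, c_f=100.0, c_n=50.0, gamma=4))
```

With beta greater than 1 the assumed intensity increases with age, so used equipment (A > 0) is expected to fail more often over the same lease period than new equipment.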

16.2 Equipment Leasing

16.2.1 Lease Definition

A lease is a contractual agreement under which the owner of equipment (referred to as the “lessor”) allows another person (referred to as the “lessee”) to operate the equipment for a stated period of time and under specified conditions. Examples of leased equipment include aircraft, computers, telecommunications equipment, hospital equipment, office equipment, cars, forklifts, etc.

16.2.2 Types of Leases

There are several types of leases but, unfortunately, there is no standard terminology. The terms used in the USA often differ from those used in the UK. We briefly discuss the three main types.

16.2.2.1 Operating Lease

In an operating lease the lessee pays the lessor for the use of equipment over a specified period. Operating leases usually cover new equipment (for example, cars), but used equipment is sometimes leased in this way as well. The lease period is much shorter than the equipment’s expected useful life. At the end of the lease period, the lessor retains ownership of the equipment and can renew the lease contract (if the lessee is interested), lease the equipment to some other lessee, or sell it as second-hand equipment. The lessor provides additional services as part of the lease contract, such as operator training (to ensure that the leased item is operated properly, for example in the case of specialized industrial equipment) and maintenance (to ensure that the equipment is in a proper operating condition and meets the requirements stated in the lease contract). This kind of lease is also referred to as a “true” lease. In the USA, the Internal Revenue Code defines a true lease as a transaction that allows the lessor to claim ownership and the lessee to claim rental payments as tax deductions.

The advantages and disadvantages of an operating lease from the lessee’s perspective are as follows:

Advantages

• The lessee can obtain new equipment (based on the latest technologies) and thus avoid the risks associated with equipment obsolescence.
• The lessee usually gets maintenance and other support from the lessor, so that the business can focus on core activities.
• Equipment disposal is the lessor’s responsibility.

Disadvantages

• If the lessee’s needs change over the lease period, then premature termination of the lease agreement can incur penalties.
• There is a risk that the lessor does not provide the level of maintenance needed.

16.2.2.2 Finance Lease

In a finance lease, the lessee pays the lessor for the use of equipment over a specified period. At the end of the lease period, the lessee gets ownership of the equipment either at no cost or at a previously established price. The total payments by the lessee must cover the lessor’s initial investment (for acquiring the equipment) and the profit margin. The equipment supplied under this type of lease can vary from very expensive industrial and commercial equipment (such as a financial institution leasing aircraft to an airline operator) to less expensive consumer products (banks or retailers leasing domestic appliances, cars, etc. to consumers who own the equipment at the end of the lease). This type of lease is also referred to as a “capital” or “full payout” lease.

The advantages and disadvantages of a finance lease from the lessee’s perspective are as follows:

Advantages

• The lessee is able to spread the payments over the lease period (no need for initial cash at purchase).
• It offers greater flexibility, as the lessee can choose from a range of lease options, especially in the consumer product market, where several institutions offer different types of leases.

Disadvantages

• If the lessee fails to make lease payments as per schedule, the leased equipment can be repossessed and sold by the lessor to recover the payments due.
• Maintenance is often not a part of the lease agreement, so the lessee has to provide for this separately.
• The overall cost to the lessee is significantly higher than the purchase price of the equipment, because the payments include not only the financing costs but also other costs associated with insurance, taxes, etc.
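The last disadvantage can be illustrated with a small calculation. This is a minimal sketch, not a formula from the chapter: it assumes equal payments at the end of each period, computed with the standard level-payment annuity formula, plus a flat per-period charge standing in for insurance, taxes and similar costs. The function names, the interest rate and all amounts are hypothetical.

```python
def level_payment(principal, rate_per_period, n_periods):
    """Equal end-of-period payment that repays `principal` over `n_periods`."""
    if rate_per_period == 0:
        return principal / n_periods
    return principal * rate_per_period / (1 - (1 + rate_per_period) ** -n_periods)


def total_lease_cost(principal, rate_per_period, n_periods, extras_per_period=0.0):
    """Sum of all payments: financing component plus assumed per-period extras."""
    payment = level_payment(principal, rate_per_period, n_periods)
    return n_periods * (payment + extras_per_period)


if __name__ == "__main__":
    # Hypothetical example: equipment priced at 50,000, leased over 60 months
    # at 0.8% interest per month, with 40 per month for insurance and taxes.
    price = 50_000.0
    total = total_lease_cost(price, 0.008, 60, extras_per_period=40.0)
    print(f"total paid: {total:.2f}, excess over purchase price: {total - price:.2f}")
```

Whenever the interest rate or the per-period extras are positive, the total paid exceeds the purchase price, which is the point made in the list above.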

16.2.2.3 Sale and Leaseback Under a sale and leaseback lease, the owner sells the equipment to a lessor (usually a finance company) and leases it immediately without ever surrendering the use of equipment. The maintenance is carried out either by the lessee or some third party. This type of lease is used mainly for infrastructure assets such as rail transport, ele