Designing a Data Warehouse: Supporting Customer Relationship Management
Chris Todman
Publisher: Prentice Hall PTR
First Edition, December 1, 2000
ISBN: 0-13-089712-4, 360 pages
Today's next-generation data warehouses are being built with a clear goal: to maximize the power of customer relationship management. To make CRM-focused data warehousing work, you need new techniques and new methodologies. In Designing a Data Warehouse, Dr. Chris Todman, one of the world's leading data warehousing consultants, delivers the first start-to-finish methodology for defining, designing, and implementing CRM-focused data warehouses. Todman covers all this, and more: a new look at data warehouse conceptual models, logical models, and physical implementation; project management: deliverables, assumptions, risks, and team-building, including a full work breakdown; data warehouse futures: temporal databases, OLAP SQL extensions, active decision support, integrating external and unstructured data, search agents, and more.
Designing a Data Warehouse: Supporting Customer Relationship Management

List of Figures

Preface
  FIRST-GENERATION DATA WAREHOUSES
  SECOND-GENERATION DATA WAREHOUSES AND CUSTOMER RELATIONSHIP MANAGEMENT
  WHO SHOULD READ THIS BOOK

Acknowledgments

1. Customer Relationship Management
  THE BUSINESS DIMENSION
  BUSINESS GOALS
  BUSINESS STRATEGY
  THE VALUE PROPOSITION
  CUSTOMER RELATIONSHIP MANAGEMENT
  SUMMARY

2. An Introduction to Data Warehousing
  INTRODUCTION
  WHAT IS A DATA WAREHOUSE?
  DIMENSIONAL ANALYSIS
  BUILDING A DATA WAREHOUSE
  PROBLEMS WHEN USING RELATIONAL DATABASES
  SUMMARY

3. Design Problems We Have to Face Up To
  DIMENSIONAL DATA MODELS
  WHAT WORKS FOR CRM
  SUMMARY

4. The Implications of Time in Data Warehousing
  THE ROLE OF TIME
  PROBLEMS INVOLVING TIME
  CAPTURING CHANGES
  FIRST-GENERATION SOLUTIONS FOR TIME
  VARIATIONS ON A THEME
  CONCLUSION TO THE REVIEW OF FIRST-GENERATION METHODS

5. The Conceptual Model
  REQUIREMENTS OF THE CONCEPTUAL MODEL
  THE IDENTIFICATION OF CHANGES TO DATA
  DOT MODELING
  DOT MODELING WORKSHOPS
  SUMMARY

6. The Logical Model
  LOGICAL MODELING
  THE IMPLEMENTATION OF RETROSPECTION
  THE USE OF THE TIME DIMENSION
  LOGICAL SCHEMA
  PERFORMANCE CONSIDERATIONS
  CHOOSING A SOLUTION
  FREQUENCY OF CHANGED DATA CAPTURE
  CONSTRAINTS
  EVALUATION AND SUMMARY OF THE LOGICAL MODEL

7. The Physical Implementation
  THE DATA WAREHOUSE ARCHITECTURE
  CRM APPLICATIONS
  BACKUP OF THE DATA
  ARCHIVAL
  EXTRACTION AND LOAD
  SUMMARY

8. Business Justification
  THE INCREMENTAL APPROACH
  THE SUBMISSION
  SUMMARY

9. Managing the Project
  INTRODUCTION
  WHAT ARE THE DELIVERABLES?
  WHAT ASSUMPTIONS AND RISKS SHOULD I INCLUDE?
  WHAT SORT OF TEAM DO I NEED?
  SUMMARY

10. Software Products
  EXTRACTION, TRANSFORMATION, AND LOADING
  OLAP QUERY TOOLS
  DATA MINING
  CAMPAIGN MANAGEMENT
  PERSONALIZATION
  METADATA TOOLS
  SORTS

11. The Future
  TEMPORAL DATABASES (TEMPORAL EXTENSIONS)
  OLAP EXTENSIONS TO SQL
  ACTIVE DECISION SUPPORT
  EXTERNAL DATA
  UNSTRUCTURED DATA
  SEARCH AGENTS
  DSS-AWARE APPLICATIONS

A. Wine Club Temporal Classifications

B. Dot Model for the Wine Club

C. Logical Model for the Wine Club

D. Customer Attributes
  HOUSEHOLD AND PERSONAL ATTRIBUTES
  BEHAVIORAL ATTRIBUTES
  FINANCIAL ATTRIBUTES
  EMPLOYMENT ATTRIBUTES
  INTERESTS AND HOBBY ATTRIBUTES

References
Designing a Data Warehouse: Supporting Customer Relationship Management

Library of Congress Cataloging-in-Publication Data
Todman, Chris.
Designing a data warehouse: in support of customer relationship management / Chris Todman.
p. cm.
Includes bibliographical references and index.
ISBN: 0-13-089712-4
1. Data warehousing. I. Title.
HD30.2.T498 2001
658.4/03/0285574 21 1220202534
CIP

Credits
Editorial/Production Supervisor: Kerry Reardon
Project Coordinator: Anne Trowbridge
Acquisitions Editor: Jill Pisoni
Editorial Assistant: Justin Somma
Manufacturing Buyer: Maura Zaldivar
Manufacturing Manager: Alexis Heydt
Marketing Manager: Dan DePasquale
Art Director: Gail Cocker-Bogusz
Cover Designer: Nina Scuderi
Cover Design Director: Jerry Votta
Manager, HP Books: Patricia Pekary
Editorial Director, Hewlett-Packard Professional Books: Susan Wright

© 2001 by Hewlett-Packard Company

Published by Prentice Hall PTR
Prentice-Hall, Inc.
Upper Saddle River, NJ 07458

Prentice Hall books are widely used by corporations and government agencies for training, marketing, and resale. The publisher offers discounts on this book when ordered in bulk quantities. For more information, contact Corporate Sales Department, Phone: 800-382-3419; Fax: 201-236-7141; E-mail: [email protected]. Or write: Prentice Hall PTR, Corporate Sales Dept., One Lake Street, Upper Saddle River, NJ 07458.

All rights reserved. No part of this book may be reproduced, in any form or by any means, without permission in writing from the publisher.

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Prentice-Hall International (UK) Limited, London
Prentice-Hall of Australia Pty. Limited, Sydney
Prentice-Hall Canada Inc., Toronto
Prentice-Hall Hispanoamericana, S.A., Mexico
Prentice-Hall of India Private Limited, New Delhi
Prentice-Hall of Japan, Inc., Tokyo
Pearson Education Asia, Pte. Ltd.
Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro
List of Figures

1.1 Just who are our best customers?
1.2 CRM in an organization.
1.3 The components of CRM.
1.4 The number of communication channels is growing.
2.1 Fragment of data model for the Wine Club.
2.2 Three-dimensional data cube.
2.3 Wine sales dimensional model for the Wine Club.
2.4 Data model showing multiple join paths.
2.5 The main components of a data warehouse system.
2.6 General state transition diagram.
2.7 State transition diagram for the orders process.
2.8 Star schema showing the relationships between facts and dimensions.
2.9 Stratification of the data.
2.10 Snowflake schema for the sale of wine.
2.11 Levels of summarization in a data warehouse.
2.12 Modified data warehouse structure incorporating summary navigation and data mining.
3.1 Star schema for the Wine Club.
3.2 Third normal form version of the Wine Club dimensional model.
3.3 Confusing and intimidating hierarchy.
3.4 Common organizational hierarchy.
3.5 Star schema for the Wine Club.
3.6 Sharing information.
3.7 General model for customer details.
3.8 General model for a customer with changing circumstances.
3.9 Example model showing customer with changing circumstances.
3.10 The general model extended to include behavior.
3.11 The example model extended to include behavior.
3.12 General conceptual model for a customer-centric data warehouse.
3.13 Wine Club customer changing circumstances.
3.14 Wine Club customer behavior.
3.15 Derived segment examples for the Wine Club.
4.1 Fragment of operational data model.
4.2 Operational model with additional sales fact table.
4.3 Sales hierarchy.
4.4 Sales hierarchy with sales table attached.
4.5 Sales hierarchy showing altered relationships.
4.6 Sales hierarchy with intersection entities.
4.7 Sales hierarchy with data.
4.8 Simple general business hierarchy.
4.9 Graphical representation of temporal functions.
4.10 Types of temporal query.
4.11 Traditional resolution of m:n relationships.
4.12 Representation of temporal attributes by attaching them to the dimension.
4.13 Representation of temporal hierarchies by attaching them to the facts.
4.14 Representation of temporal attributes by attaching them to the facts.
5.1 Example of a two-dimensional report.
5.2 Example of a three-dimensional cube.
5.3 Simple multidimensional dot model.
5.4 Representation of the Wine Club using a dot model.
5.5 Customer-centric dot model.
5.6 Initial dot model for the Wine Club.
5.7 Refined dot model for the Wine Club.
5.8 Dot modeling worksheet showing Wine Club sales behavior.
5.9 Example of a hierarchy.
6.1 ER diagram showing new relationships to the time dimension.
6.2 Logical model of part of the Wine Club.
7.1 The EASI data architecture.
7.2 Metadata model for validation.
7.3 Integration layer.
7.4 Additions to the metadata model to include the source mapping layer.
7.5 Metadata model for the VIM layer.
7.6 Customer nonchanging details.
7.7 The changing circumstances part of the GCM.
7.8 Behavioral model for the Wine Club.
7.9 Data model for derived segments.
7.10 Daily partitions.
7.11 Duplicated input.
8.1 Development abstraction shown as a roadmap.
9.1 Classic waterfall approach.
9.2 Example project team structure.
10.1 Extraction, transformation, and load processing.
10.2 Typical OLAP architecture.
10.3 Descriptive field distribution.
10.4 Numeric field distribution using a histogram.
10.5 Web plot that relates gender to regions.
10.6 Rule induction for wine sales.
10.7 Example of a multiphase campaign.
Preface

The main subject of this book is data warehousing. A data warehouse is a special kind of database that, in recent years, has attracted a great deal of interest in the information technology industry. Quite a few books have been published about data warehousing generally, but very few have focused on the design of data warehouses. There are some notable exceptions, and these will be cited in this book, which concentrates principally on the design aspects of data warehousing.

Data warehousing is all about making information available. No one doubts the value of information, and everyone agrees that most organizations have a potential "Aladdin's Cave" of information that is locked away within their operational systems. A data warehouse can be the key that opens the door to this information. There is strong evidence to suggest that our early foray into the field of data warehousing, what I refer to as first-generation data warehouses, has not been entirely successful. As is often the case with new ideas, especially in the information technology (IT) industry, IT practitioners were quick to spot the potential, and they tried hard to secure for their organizations the competitive advantage that the data warehouse promised. In doing so, I believe, two points were overlooked.

The first point is that, at first sight, a data warehouse can appear to be quite a simple application. In reality it is anything but simple. Quite apart from the basic issue of sheer scale (data warehouse databases are among the largest on earth) and the consequent performance difficulties this presents, the data structures are inherently more complex than the early pioneers of these systems realized. As a result, there was a tendency to oversimplify the design so that, although the database was simple to understand and use, many important questions could not be asked.
The second point is that data warehouses are unlike operational systems in that it is not possible to define the requirements precisely. This is at odds with conventional systems, where it is the specification of requirements that drives the whole development lifecycle. Our approach to systems design is still, largely, founded on a thorough understanding of requirements: the "hard" systems approach. In data warehousing we often don't know what the problems are that we are trying to solve. Part of the role of the data warehouse should be to help organizations understand what their problems are.

Ultimately it comes down to design and, again, there are two main points to consider. The first concerns the data warehouse itself: just how do we ensure that the data structures will enable us to ask the difficult questions? The second is that the hard systems approach has been shown to be too restrictive, and a softer technique is required. So not only do we need to improve our design of data warehouses, we also need to improve the way in which we approach the design. It is in response to these two needs that this book has been written.
FIRST-GENERATION DATA WAREHOUSES

Historically, first-generation data warehouses were built on certain principles that were laid down by gurus in the industry. This author recognizes two great pioneers in data warehousing: Bill Inmon and Ralph Kimball. These two chaps, in my view, have done more to advance the development of data warehousing than any others. Although many claim to have been "doing data warehousing long before it was ever called data warehousing," Inmon and Kimball can realistically claim to be the founders because they alone laid down the definitions and design principles that most practitioners are aware of today. Even if their guidelines are not followed precisely, it is still common to refer to Inmon's definition of a data warehouse and Kimball's rules on slowly changing dimensions.

Chapter 2 of this book is an introduction to data warehousing. In some respects it should be regarded as a scene-setting chapter, as it introduces data warehouses from first principles by describing the following:

- The need for decision support
- How data warehouses can help
- The differences between operational systems and data warehouses
- Dimensional models
- The main components of a data warehouse

Chapter 2 lays the foundation for the evolution to second-generation data warehouses.
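Slowly changing dimensions are treated properly later in the book; as a rough illustration only, the sketch below shows the idea behind a "Type 2" change, where history is kept by closing off the current dimension row and inserting a new one. The table and column names (customer_dim, region, valid_from, and so on) are invented for this example and are not taken from the book.

```python
# A minimal sketch of a "Type 2" slowly changing dimension.
# All table and column names here are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each change to a tracked attribute gets a new row with its own
# surrogate key; the old row is closed off with an end date.
cur.execute("""
    CREATE TABLE customer_dim (
        customer_key  INTEGER PRIMARY KEY,   -- surrogate key
        customer_id   TEXT,                  -- business (natural) key
        region        TEXT,                  -- tracked attribute
        valid_from    TEXT,
        valid_to      TEXT,                  -- NULL = still open
        is_current    INTEGER
    )
""")

cur.execute(
    "INSERT INTO customer_dim VALUES (1, 'C042', 'North', '2000-01-01', NULL, 1)"
)

def apply_type2_change(cur, customer_id, new_region, change_date):
    """Close the current row for this customer and open a new one."""
    cur.execute("""
        UPDATE customer_dim
        SET valid_to = ?, is_current = 0
        WHERE customer_id = ? AND is_current = 1
    """, (change_date, customer_id))
    cur.execute("""
        INSERT INTO customer_dim
            (customer_id, region, valid_from, valid_to, is_current)
        VALUES (?, ?, ?, NULL, 1)
    """, (customer_id, new_region, change_date))

# The customer moves region: history is preserved, so facts dated
# before 2001-06-01 still join to the "North" row, later facts to "South".
apply_type2_change(cur, "C042", "South", "2001-06-01")

rows = cur.execute(
    "SELECT customer_key, region, valid_from, valid_to"
    " FROM customer_dim ORDER BY customer_key"
).fetchall()
print(rows)
# → [(1, 'North', '2000-01-01', '2001-06-01'), (2, 'South', '2001-06-01', None)]
```

The point of the technique is exactly the one the temporal chapters of the book wrestle with: a dimension attribute that changes over time must not silently rewrite the past.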
SECOND-GENERATION DATA WAREHOUSES AND CUSTOMER RELATIONSHIP MANAGEMENT

Before the introduction to data warehousing, we take a look at the business issues in a kind of rough guide to customer relationship management (CRM). Data warehousing has been waiting for CRM to appear. Without it, data warehouses were still popular but, very often, the popularity was as much in the IT domain as anywhere else. IT management was quick to see the potential of data warehouses, but the business justification was not always the main driver, and this has led to the failure of some data warehouse projects. There was often a reluctance on the part of business executives to sponsor these large and expensive database development projects, and those that were sponsored by IT just didn't hit the spot. The advent of CRM changed all that. CRM cannot be practiced in business without a major source of information, which, of course, is the data warehouse's raison d'être. Interest in data warehousing has been revitalized, and this time it is the business people who are firmly in the driving seat.

Having introduced the concept of CRM and described its main components, we explore, with the benefit of hindsight, the flaws in the approach to designing first-generation data warehouses and propose a method for the next generation. We start by examining some of the design issues and pick our way carefully through the more sensitive areas in which the debate has smoldered, if not raged a little, over the past several years. One of the fundamental issues surrounds the representation of time in our design. There has been very little real support for this, which is a shame, since data warehouses are true temporal applications that have become pervasive in all kinds of businesses. In formulating a solution, we reintroduce, from the mists of time, the old conceptual, logical, and physical approach to building data warehouses. There are good reasons why we should do this and, along the way, these reasons are aired.

We have a short chapter on the business justification. The message is clear: if you cannot justify the development of the data warehouse, then don't build it. No one will thank us for designing and developing a beautifully engineered, high-performing system if, ultimately, it cannot pay for itself within an appropriate time. Many data warehouses can justify themselves several times over, but some cannot. We do not want to add to the list of failed projects. Ultimately, no one benefits from this, and we should be quite rigorous in the justification process.
Project management is a crucial part of a data warehouse development, and the normal approach to project management doesn't work. There are many seasoned, top-drawer project managers who, in the beginning, are very uncomfortable with data warehouse projects. The uncertainty of the deliverables and the imprecise nature of the acceptance criteria send them howling for the safety net of the famous system specification. It is hoped that the chapter on project management will provide some guidance.

People who know me think I have a bit of a "down" on software products and, if I'm honest, I suppose I do. I get a little irritated when, as each new thing comes along, the same old query tools get dusted off and relaunched as though they were new products. Once upon a time a query tool was a query tool. Now it's a data mining product, a segmentation product, and a CRM product as well. OK, these vendors have to make a living but, as professional consultants, we have to protect our customers, particularly the gullible ones, from some of these vendors. Some of the products do add value and some, while being astronomically expensive, don't add much value at all. The chapter on software products sheds some light on the types of tools that are available, what they're good at, what they're not good at, and what the vendors won't tell you if you don't ask.
WHO SHOULD READ THIS BOOK

Although there is a significant amount of technical material in the book, the potential audience is quite wide.

For anyone wishing to learn the principles of data warehousing, Chapter 2 has been adapted from undergraduate course material. It explains, in simple terms:

- What data warehouses are
- How they are used
- The main components
- The data warehouse "jargon"

There is also a description of some of the pitfalls and problems faced in building data warehouses.

For consultants, the book contains a method for ensuring that the business objectives will be met. The method is a top-down approach using proven workshop techniques. There is also a chapter devoted to assisting in the building of the business justification.

For developers of data warehouses, the book contains a massive amount of material about the design, especially in the area of the data model, the treatment of time, and the conceptual, logical, and physical layers of development. The book contains a complete methodology that provides assistance at all levels of development. The focus is on the creation of a customer-centric model that is ideal for supporting the complex requirements of customer relationship management.

For project managers, there is an entire chapter that provides guidelines on the approach, together with:

- A full work breakdown structure (WBS)
- Project team structure
- Skills needed
- The "gotchas"
Acknowledgments

I would like to thank my wife, Chris, for her unflagging support during the past twenty years. Chris has been a constant source of encouragement, guidance, and good counsel. I would also like to thank Malcolm Standring and Richard Getliffe of Hewlett-Packard Consulting for inviting me to join their data warehousing practice in 1995. Although I was already deeply involved in database systems, the role at HP opened doors to many new and exciting possibilities. Thank you, Mike Newton and Prof. Pat Hall of the Open University, for your technical guidance over several years. As far as the book is concerned, thanks are due to Chris, again, for helping to tighten up the grammar. Thanks especially to Jim Meyerson, of Hewlett-Packard, for a rigorous technical review and helpful suggestions. Finally, I am grateful to Jill Pisoni and the guys at Prentice Hall for publishing this work.
Chapter 1. Customer Relationship Management

THE BUSINESS DIMENSION
BUSINESS GOALS
BUSINESS STRATEGY
THE VALUE PROPOSITION
CUSTOMER RELATIONSHIP MANAGEMENT
SUMMARY
THE BUSINESS DIMENSION

First and foremost, this book is about data warehousing. Throughout the book, we will be exploring ways of designing data warehouses with a particular emphasis on the support of a customer relationship management (CRM) strategy. This chapter provides a general introduction to CRM and its major components. Before that, however, we'll take a short detour and review what has happened in the field of data warehousing from a business perspective.

Although data warehousing has received a somewhat mixed reception, it really has captured the imagination of business people. In fact, it has become so popular in industry that it is cited as the highest-priority postmillennium project by more than half of information technology (IT) executives. It has been estimated that, as far back as 1997 (Menefee, 1998), $15 billion was spent on data warehousing worldwide. Recent forecasts (Business Wire, August 31, 1998) expect the market to grow to around $113 billion by the year 2002. A study carried out by the Meta Group (Meyer and Cannon, 1998) found that 95 percent of the companies surveyed intended to build a data warehouse.

Data warehousing is being taken so seriously by the industry that the Transaction Processing Performance Council (TPC), which has defined a set of benchmarks for general databases, introduced an additional benchmark specifically aimed at data warehousing applications, known as TPC/D, followed in 1999 by further benchmarks (TPC/H and TPC/R). As a further indication of the "coming of age" of data warehousing, a consortium has developed an adjunct to the TPC benchmark called "The Data Warehouse Challenge" as a means of assisting prospective users in the selection of products.

The benefits of building a data warehouse can be significant. For instance, increasing knowledge within an organization about customers' trends and business can provide a significant return on the investment in the warehouse. There are many documented examples of huge increases in revenue and profits as a result of decisions taken based upon information extracted from data warehouses. So if someone asked you the following question:
How many data warehouse projects ultimately are regarded as having failed?
How would you respond? Amazingly, research has shown that it's over 70 percent! This is quite staggering. Why is it happening, and how will we know whether we are being successful? It's all about something we've never really been measured on in the past: business benefit.

Data warehouses are different in quite a few ways from other, let us say traditional, IT projects. In order to explain one of these differences, we have to delve back into the past a little. Another charge that has historically been leveled at the IT industry is that the solution that is finally delivered to the customer, or users, is not the solution they were expecting. This problem was caused by the methods used by IT departments and system integration companies. Having identified that there was a problem to be solved, they would send in a team of systems analysts to analyze the current situation, interview the users, and recommend a solution. This solution would then be built and tested by a system development team and delivered back to the users when it was finished. The system development lifecycle that was adopted consisted of a set of major steps:

1. Requirements gathering
2. Systems analysis
3. System design
4. Coding
5. System testing
6. Implementation

The problem with this approach was that each step had to be completed before the next could really begin. It has been called the waterfall approach to systems development, and its other problem was that the whole process was carried out out of sight of the users, who just continued with their day jobs until, one day, the systems team descended upon them with their new, completed system. This process could have taken anything from six months to two years or more. So, when the systems team presents the new system to the users, what happens? The users say, "Oh! But this isn't what we need." And the systems project leader exclaims, "But this is what you asked for!" and it all goes a bit pear-shaped after that.
Looking back, the problems and issues are clear to see, but suffice it to say there were always lots of arguments. The users were concerned that their requirements had clearly not been understood or that they had been ignored. The systems team would get upset because they had worked hard, doing their level best to develop a quality system. Then there would be recriminations. The senior user, usually the one paying for the system, would criticize the IT manager. The systems analyst would be interrogated by the IT manager (a throat grip was sometimes involved) and eventually they would agree that there had been a communications misunderstanding.

There are two main reasons for this kind of mismatch. The first is that the analyst may have misinterpreted the needs of the users and designed a solution that was simply inappropriate. This happened a lot and was usually the fault of the analyst, whose job it was to ensure that the needs were clearly spelled out. The second reason is more subtle and is due to the fact that businesses change over time as a kind of natural evolution. Simple things like new products in the catalog or people changing roles can cause changes in the everyday business processes. Even if the analyst and the users had not misunderstood each other, there is little hope that, after two years, the delivered system would match the needs of the people who were supposed to use it. Subsequently, the business processes would have to change again in order to accommodate the new system. After a little while, things would settle down, the "teething troubles" would be fixed, and the system would, hopefully, provide several years of service.

Anyway, switched-on IT managers had to figure out a way of ensuring that the users would not have any reason to complain in the future, and the problem was solved by the introduction of the now famous "system specification," or simply "system spec." Depending on the organization, this document had many different names, including system manual, design spec, design manual, and system architecture.
The purpose of the system spec was to establish a kind of contract between the IT department, or systems integrator, and the users. The system spec contained a full and detailed description of precisely what would be delivered. Each input screen, process, and output was drawn up and included in the document. Both parties "signed off" the document, which reflected precisely what the IT department was to deliver and, theoretically at least, also reflected what the users expected to receive. So long as the IT department delivered to the system spec, they could no longer be accused of having ignored or misunderstood the requirements of the users. So the system spec was a document invented by IT as a means of protecting themselves from impossibly ungrateful users. In this respect it was successful, and it is still the cornerstone of most development methods.

When data warehouses started to be developed, the developers began using their tried and trusted methodologies to help build them. And why not? This approach of nailing down the requirements has proved to be successful in the past, at least as far as IT departments were concerned. The problem is that data warehouses are different. Until now, IT development practitioners have been working almost exclusively on streamlining and improving business processes and business functions. These are systems that usually have predefined inputs and, to some extent at least, predefined outputs. We know that is true because the system spec said so. The traditional methods we use are sometimes referred to as "hard" systems development methods, meaning that they are used to solve problems that are well defined. But, and this is where the story really starts, the requirements for a data warehouse are never well defined! These are the softer "We think there's a problem but we're not sure what it is" type of issues. It's actually very difficult to design systems to solve this kind of problem, and our familiar "hard" systems approach is clearly inappropriate. How can we write a system specification that nails down the problem and the solution when we can't even clearly define the problem?

Unfortunately, most practitioners have not quite realized that herein lies the crux of the problem and, consequently, the users are often forced to state at least some requirements and sign the inevitable system specification so that the "solution" can be developed. Then, once the document is signed off, the development can begin as normal, the usual systems development lifecycle kicks in, and away we go, folks. The associated risk is that we'll finish up by contributing to the 70 percent failure statistic.

Is there a solution? Yes, there is. All it takes is to recognize this soft systems issue and to develop an approach that is sympathetic to it. The original question posed near the beginning of this section was "How will we know whether we are being successful?" If the main reason for failure is that we didn't produce sufficient business benefit, then the answer is to focus on the business: on what the business is trying to achieve, not on the requirements of the data warehouse.
BUSINESS GOALS

The phrase "Focus on what the business is trying to achieve" refers to the overall business, or the part of the business that is engaging us in this project. This means going to the very top of the organization. Historically, it has been common for IT departments to invest in the building of data warehouses on a speculative basis, assuming that once in place the data warehouse will draw business users like bees to a honey pot. While the sentiments are laudable, this "build it and they will come" approach is generally doomed to fail from the start. The reason is that the warehouse is built around information that the IT department, rather than the business, thinks is important.

The main rule is quite simple. If you are hired by the CEO to solve the problem, you have to understand what the CEO is trying to achieve. If you are hired by the marketing director, then you have to find out what drives the marketing director. It all comes down to business goals. Each senior manager in an organization has goals. They may not always be written down. They may not be well known around the organization and, to begin with, even the manager may not be able to articulate them clearly, but they do exist. As data warehouse practitioners, we need some extra "soft" skills and techniques to help our customers express these soft system problems, and we explore this subject in detail in Chapter 5 when we build the conceptual model.

So what is a business goal? Well, it's usually associated with some problem that an executive has to solve. The success or failure of the executive in question may be measured in terms of their ability to solve this problem. Their salary level may depend on their performance in solving it, and ultimately their job may depend on it as well. In short, it's the one, two, or three things that sometimes keep them awake at night (jobwise, that is). How is a business goal defined? Well, it's important to be specific.
Some managers will say things like, "We need to increase market share," "We'd like to increase our gross margin," or maybe "We have to get customer churn down." These are not bad for a start, but they aren't specific enough. Look at this one instead: "Our objective is to increase customer loyalty by 1 percent each year for the next five years." This is a real goal from a real company and it's perfect. A good business goal should be:

1. Measurable
2. Time bounded

3. Customer oriented

This helps us to answer the question of how we'll know we've been successful. The managers will know whether they have been successful if they hit their measured goal targets within the stated time scale. Just a point about number three on the list: it is not an absolute requirement, but it is a good check. There is a kind of edict inside Hewlett-Packard that goes like this: "If you aren't doing it for a customer, don't do it!" In practice, most business goals, in my experience, are customer oriented. Generally, as businesses we want to:

Get more good customers

Keep our better customers

Maybe offload our worst customers

Sell more to customers

People have started to wake up to the fact that the customer is king. It's the customer we have to identify, convince, and ultimately satisfy if we are to be really successful. It's not about products or efficient processes, although these things are important too. Without the customer we might as well stay at home. Anyway, once we know what a manager's goals are, we can start to talk strategy.
BUSINESS STRATEGY

So now we know our customer's goals. The next step in ensuring success is to find out, for each goal, exactly how they plan to achieve it. In other words, what is their business strategy? Before we begin, let's synchronize terminology. There is a risk at this point of sinking into a semantic discussion on what is strategic and what is tactical. For our purposes, a strategy is defined as one or more steps to be employed in pursuit of a business goal. After a manager has explained a business goal, it is reasonable to follow up with the question "And what is your strategy for achieving this goal?"
THE VALUE PROPOSITION

Every organization, from the very largest down to the very smallest, has a value proposition. A company's value proposition is the thing that distinguishes its business offering from all the others in the marketplace. Most senior managers within the organization should be able to articulate their value proposition, but often they cannot. It is helpful, when dealing with these people, to discuss their business. Anything they do should be in some way relevant to the overall enhancement of their organization's value proposition. It is generally accepted that the value proposition of every organization falls into one of three major categories of value discipline (Treacy and Wiersema, 1993): customer intimacy, product leadership, and operational excellence. We'll briefly examine each of these.
Customer Intimacy

We call this the customer intimacy discipline because companies that operate this type of value proposition really do try to understand their individual customers' needs and will move heaven and earth to accommodate them. For instance, in the retail clothing world, a bespoke tailor will know precisely how their customers like to have their clothes cut. They will specially order in the types and colors of fabric that the customer prefers and will always deal with the customer on a one-to-one, personal basis. These companies are definitely not cheap. In fact, their products are usually quite expensive, because personal service is an expensive commodity: it usually has to be delivered by highly skilled, and therefore expensive, people. However, their customers prefer to use them because they feel as though they are being properly looked after, and their lives are sufficiently enriched to justify the extra cost.
Product Leadership

The product leaders are the organizations that could be described as "leading edge." Their value proposition is that they can keep you ahead of the pack. This means that they are always on the lookout for new products and new ideas that they can exploit to keep their customers interested and excited. Technology companies are an obvious example of this
type of organization, but they exist in almost every industry. Just as with the bespoke service, there is an example in the retail fashion industry: the so-called designer label clothes are a good example of the product leadership type of value proposition. The people who love to buy these products appreciate the "chic-ness" bestowed upon them. Another similarity with the customer intimate service is that these products also tend to be very expensive. A great deal of research and development often goes into these products, and the early adopters must expect to pay a premium.
Operational Excellence

This type of organization excels at operational efficiency. They are quick, efficient, and usually cheap. Mail order companies that offer big discounts and guaranteed same-day or next-day delivery fall into this category. They have marketing slogans like "It's on time or it's on us!" If you need something in a hurry and you know what you want, these are the guys who deliver. Don't expect a tailor-made service or much in the way of after-sales support, but do expect the lowest prices in town. Is there a fashion industry equivalent? Well, there have always been mail order clothes stores. Even some of the large department stores, if they're honest with themselves, would regard themselves as operationally efficient rather than being strong on personal service or product leadership.

So are we saying that all companies must somehow be classified into one of the three groups? Not exactly, but all companies tend to have a stronger affinity to one of the three categories than to the other two, and it is important for an organization to recognize where its strengths lie. The three categories have everything to do with the way in which the organization routinely interacts with its customers. It is just not possible for a company that majors on operational excellence to become a product leader or to provide a bespoke service without a major change in its internal organization and culture. Some companies are very strong in two of the three categories, while others are working hard toward this.

Marks and Spencer is a major successful retail fashion company. Traditionally its products are sold through a branch network of large department stores all over the world. It also has a growing mail order business. That would seem to place it pretty squarely in the operational excellence camp. However, it has recently opened up a completely new range of products, called "Autograph," that is aimed at providing a bespoke service to customers. Large areas of its biggest stores are being turned over to this exciting new idea. So here is one company that has already been successful with one value proposition, aiming to achieve excellence in a second.
Oddly enough, the Autograph range of products has been designed, in part, by established designers, and so they might even claim to be nibbling at the edges of the product leadership category, too!
The point is this: an organization needs to understand:

1. How it interacts with its customers

2. How it would like to interact with its customers

You can then start to come up with a strategy to help improve your customer relationship management.
CUSTOMER RELATIONSHIP MANAGEMENT

The world of business is changing as never before. Our customers are better informed and much more demanding than ever. The loyalty of our customers is something that can no longer be taken for granted, and the loss of customers, sometimes known as customer churn, is a subject close to the heart of most business people today. It is said that it can cost up to 10 times as much to recruit a new customer as it does to retain an existing one. The secret is to know who our customers are (you might be surprised how many organizations don't) and what it is that they need from us. If we can understand their needs, then we can offer goods and services to satisfy those needs, and maybe even go a little further and start to anticipate their needs so that they feel cared for and important. We need to think about finding products for our customers instead of finding customers for our products.

Every business person is very keen to advance their share of the market and turn prospects into customers, but we must not forget that each of our customers is on someone else's list of hot prospects. If we do not satisfy their needs, there are many business people out there who will. The advent of the Internet intensifies this problem; our competitors are now just one mouse click away! And the competition is appearing from the strangest places. As an example, U.K. supermarkets traditionally sold food and household goods, maybe a little stationery and odds and ends. The banks and insurance companies were shocked when the supermarket chains started offering a very credible range of retail financial services, and it hasn't stopped there. Supermarkets now routinely sell:

Mobile phones

White goods

Personal computers

Drugs

Designer clothes

Hi-fi equipment

The retail supermarket chains are ideally placed to penetrate almost all markets when products or services become commodity items, which, eventually, they almost always will. They have excellent infrastructure and distribution channels, not to mention economies of scale that most organizations struggle to compete with.
They also have something else—and that is customers. The one-stop shop service, combined with the extremely competitive prices offered by supermarkets, is irresistible to many customers, and as a result they are tending to abandon the traditional sources of these goods. The message is clear. No organization can take its customers for granted. Any business executive who feels complacent about their relationship with their customers is likely to be heading for a fall.
So What Is CRM?

Customer relationship management is a term that has become very popular. Many businesses are investing heavily in this area. What do we mean by CRM? Well, it's really a concept, a sort of cultural and attitudinal thing. However, in order to enable us to think about it in terms of systems, we need to define it. A working definition is:
CRM is a strategy for optimizing the lifetime value of customers.

Sometimes, CRM is interpreted as a soft and fluffy, cuddly sort of thing where we have to be excessively nice to all our customers and then everything will be fine. This is not the case at all. Of course, at the customer-facing level of our organization, courtesy, honesty, and trustworthiness are qualities that should be taken for granted. However, we are in business to make a profit, and our management and shareholders will see to it that we do. Studies by the First Manhattan Group have indicated that while 20 percent of a bank's customers contribute 150 percent of the profits, 40 to 50 percent of customers eliminate 50 percent of the profits. Similar studies reveal the same pattern in other industries, especially telecommunications. The graph in Figure 1.1 shows this. Notice too that the best (i.e., most profitable) customers are twice as likely to be tempted away to other suppliers as the average customer. So how do we optimize the lifetime value of customers? It is all about these two things:

1. Getting to know our customers better

2. Interacting appropriately with our customers

Figure 1.1. Just who are our best customers?
During the ordinary course of business, we collect vast amounts of information about customers that, if properly analyzed, can provide a detailed insight into their circumstances and behavior. As we come to comprehend their behavior, we can begin to predict it and perhaps even influence it a little.

We are all consumers. We all know the things we like and dislike. We all know how we would like to be treated by our suppliers. Well, surprise, surprise, our customers are just like that, too! We all get annoyed by blanket mail shots that have no relevance to us. Who has not been interrupted during dinner by an indiscriminate telephone call attempting to interest us in UPVC windows or kitchen makeovers? Organizations that continue to adopt this blanket approach to marketing do not deserve to succeed and, in the future, they and their methods will disappear.

Our relationship with our customers has to be regarded more in terms of a partnership where they have a need that we can satisfy. If we show real interest in our customers and treat them as the unique creatures they are, then the likelihood is that they will be happy to continue to remain as customers. Clearly, a business that has thousands or even millions of customers cannot realistically expect to have a real personal relationship with each and every one. However, the careful interpretation of information that we routinely hold about customers can drive our behavior so that the customer feels that we understand their needs. If we can show that we
understand them and can satisfy their needs, then the likelihood is that the relationship will continue. What is needed is not blanket marketing campaigns, but targeted campaigns directed precisely at those customers who might be interested in the products on offer. The concept of personalized marketing, sometimes called "one-to-one" marketing, epitomizes the methods that are now starting to be employed in the marketplace, and we'll be looking at this and other components of CRM in the following sections.

It is well known that we are now firmly embedded in the age of information. As business people we have much more information about all aspects of our business than ever before. The first-generation data warehouses were born to help us capture, organize, and analyze this information so that we can make decisions about the future based on past behavior. The idea was that we would identify the data that needed to be gathered from our operational business systems, place it into the warehouse, and then ask questions of the data in order to derive valuable information. It will become clear, if indeed it is not already, that in almost all cases a data warehouse provides the foundation of a successful CRM strategy.

CRM is partly a cultural thing. It is a service that fits as a kind of layer between our ordinary products and services and our customers. It is the CRM culture within an organization that leads to our getting to know our customers better, through the accumulation of knowledge. Equally, the culture enables the appropriate interactions to be conducted between our organization and our customers. This is shown in Figure 1.2.

Figure 1.2. CRM in an organization.
Figure 1.2 shows how the various parts of an organization fit together to provide the information we need to understand our customers better and the processes that enable us to interact with our customers in an appropriate manner. The CRM culture, attitudes, and behaviors can then be built on top of this, hopefully enhancing our customers' experiences in their dealings with us. In the remainder of this section, we will explore the various aspects of CRM. Although this book is intended to help with the design of data warehouses, it is important to understand the business dimension. This section should be enough to help you understand the business imperatives around CRM, and you should regard it as a kind of "rough guide" to CRM. The pie diagram in Figure 1.3 shows the components of CRM.

Figure 1.3. The components of CRM.
As you can see in Figure 1.3, there are many slices in the CRM pie, so let us review some of the major ones:
Customer Loyalty and Customer Churn

Customer loyalty and customer churn are just about the most important issues facing most businesses today, especially businesses that have vast numbers of customers. This is a particular problem for:

Telecommunications companies

Internet service providers

Retail financial services

Utilities

Retail supermarket chains
First, let us define what we mean by customer churn. In simple terms, it relates to the number of customers lost to competitors over a period of time (usually a year). These customers are said to have churned. All companies have a churn metric: the number of customers lost, expressed as a percentage of the total number of active customers at the beginning of the period. So if the company had 1,000 active customers at the beginning of the year and during the year 150 of those customers defected to a competitor, then the churn metric for the company is 15 percent. Typically, the metric is calculated each month as a rolling 12-month moving average.

Some companies operate a kind of "net churn" metric. This is simply 100 minus the number of active customers at the end of the period expressed as a percentage of the number of active customers at the beginning of the period. So if the company starts the period with 1,000 customers and ends the period with 920 customers, then the calculation is:

100 − (920 / 1,000 × 100) = 8 percent net churn
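The divergence between the two metrics is easy to see in a few lines of code. This is a minimal sketch using the figures from the example above; the function names are mine, not the book's, and Python is used purely for illustration:

```python
def simple_churn(start_actives, lost):
    """Customers lost over the period, as a percentage of actives at the start."""
    return lost / start_actives * 100

def net_churn(start_actives, end_actives):
    """100 minus end-of-period actives as a percentage of start-of-period actives."""
    return 100 - (end_actives / start_actives * 100)

# The company from the example: 1,000 actives at the start of the year,
# 150 of whom defect, but recruitment leaves 920 actives at year end.
print(simple_churn(1000, 150))  # real churn: 15 percent
print(net_churn(1000, 920))     # net churn: 8 percent, masking the real figure

# Recruit more customers than you lose and net churn goes negative.
print(net_churn(1000, 1100))
```

The last line illustrates the point made below: because recruitment and losses are netted off against each other, the metric can flatter, or even invert, the real rate of defection.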
This method is sometimes favored by business executives for two reasons:

1. It's easy to calculate. All you have to do is count the number of active customers at the beginning and end of the period. You don't have to figure out how many active customers you've lost and how many you've recruited. These kinds of numbers can be hard to obtain, as we'll see in the chapters on design.

2. It hides the truth about real churn. The number presents a healthier picture than the reality. Also, with this method it's possible to end up with negative churn if you happen to recruit more customers than you lose.

Great care must be taken when considering churn metrics. Simple counts are okay as a guide but, in themselves, they reveal nothing about your organization, your customers, or your relationship with your customers. For instance, in describing customer churn, the term active was used several times to describe customers. What does this mean? The answer might sound obvious but, astonishingly, in most organizations it's a devil of a job to establish which customers are active and which are not.

There are numerous reasons why a customer might defect to another supplier. These reasons are called churn factors. Some common churn factors are:
The wrong value discipline. You might be providing a customer intimate style of service. However, customers who want quick delivery and low prices
are unlikely to be satisfied with the level of service you can provide in this area. Customers are unlikely to have the same value discipline requirements for all the products and services they use. For instance, one customer might prefer a customer intimacy type of supplier to do their automobile servicing but would prefer an efficient and inexpensive supplier to service their stationery orders.
A change in circumstances. This is a very common cause of customer churn. Your customer might simply move out of your area. They may get a new job with a much bigger salary and want to trade up to a more exclusive supplier. Maybe they need to make economies and your product is one they can live without.
Bad experience. Usually, one bad experience won't make us go elsewhere, especially if the relationship is well established. Continued bad experiences will almost certainly lead to customer churn. Bad experiences can include unkept promises, phone calls not returned, poor quality, brusque treatment, etc. It is important to monitor complaints from customers to measure the trends in bad experiences, although the behavior of customers in this respect varies from one culture to another. For instance, in the United Kingdom, it is uncommon for people to complain. They just walk.
A better offer. This is where your competitors have you beat. These days it is easy for companies in some industries to leap-frog each other in the services they provide and, equally, it is easy for customers to move freely from one supplier to another. A good example of this is the prepay mobile phone business. When it first came out, it was attractive because there was no fixed contract, but it placed restrictions on the minimum amount of calling you had to do in any one period, say, $50 per quarter. As more vendors entered this market the restriction was driven down and down until it got to the stage where you only had to make one call in six months!

So how can you figure out which of your customers are active, which are not, and which you are at the most risk of losing? What you need is customer insight!
Customer Insight

In order to be successful at CRM, you simply have to know your customers. In fact, some people define CRM as precisely that—knowing your customers. Actually, it is much more
than that, but if you don't know your customers, then you cannot be successful at CRM. It's obvious really, isn't it? In order for any relationship to be successful, both participants in the relationship have to have some level of understanding so that they can communicate successfully and in a way that is rewarding for both parties. Notice the use of the term "both." Ultimately, a relationship consists of two people. Although we can say that we have a relationship with our friends, what we are actually saying is that we have a relationship with each of our friends, and that results in many relationships. Each one of those relationships has to be initiated and developed, sometimes with considerable care. We invest part of ourselves, sometimes a huge amount, into maintaining the relationships that we feel are the most important. In order for a relationship to be truly successful, it has to be built on a strong foundation, and that usually means knowing as much as we can about the other person in the relationship. The more we know, the better we can be at responding to the other person's needs. Each of the parties involved in a relationship has to get some return from their investment in order to justify continuing with it.

The parallels here are obvious. Customers are people, and if we are to build a sustained relationship with them, we have to make an investment. Whereas with personal relationships the investments we make are our time and emotion, which we usually do not measure, business relationships involve time and money, and we can and should measure them. The purpose of a business relationship is profit. If we cannot profit from the business relationship, then we must consider whether the relationship is worth the investment. So how do we know whether our relationship with a particular customer is profitable? It's all tied in with the notion of customer insight—knowing about our customers.
Whereas our knowledge regarding our friends is mostly in our heads, our knowledge about customers is generally held in stored records. Every order, payment, inquiry, and complaint is a piece of a jigsaw puzzle that, collectively, describes the customer. If we can properly manage this information, then we can build a pretty good picture of our customers, their preferences and dislikes, their behavioral traits, and their personal circumstances. Once we have that picture, we can begin to develop customer insight. Let's look at some of the uses of customer insight, starting with segmentation.
Segmentation

Terms like knowing our customers and customer insight are somewhat abstract. As previously stated, a large company may have thousands or even millions of customers, and it is not possible to know each one personally in the same way as with our friends and family. A proper relationship in that sense is not a practical proposition for most
organizations. It has been tried, however. Some banks, for instance, operate "personal" account managers. These managers are responsible for a relatively small number of customers, and their mission is to build and maintain relationships by getting to know the customers in their charge. However, this is not a service that is offered to all customers. Mostly, it applies to the more highly valued customers. Even so, this service is expensive to operate and is becoming increasingly rare.

Our customers can be divided into categorized groups called segments. This is one way of getting to know them better. We automatically divide our friends into segments, perhaps without realizing that we are doing it. For instance, friends can be segmented as:

Males or females

Work mates

Drinking buddies

Football supporters

Evening classmates

Long-standing personal friends

Clearly, there are many ways of classifying and grouping people. The way in which we do it depends on our associations and our interests. Similarly, there are many ways in which we might want to segment our customers and, as before, the way in which we choose to do it depends on the type of business we are engaged in. There are three main types of segmentation that we can apply to customers: the customer's circumstances, their behavior, and derived information. Let's have a brief look at these three.

Circumstances
This is the term that I use to describe those aspects of the customer that relate to their personal details. Circumstances are the information that defines who the customer is. Generally speaking, this type of information is customer specific and independent, and has nothing to do with our relationship with the customer. It is the sort of information that any organization might wish to hold. Some obvious elements included in customer circumstances are:

Name
Date of birth

Sex

Marital status

Address

Telephone number

Occupation

A characteristic of circumstances is that they are relatively fixed. Some IT people might refer to this type of information as reference data, and they would be right. It is just that circumstances is a more accurate layperson's description. Some less obvious examples of circumstances are:

Hobbies

Ages of children

Club memberships

Political affiliations

In fact, there is almost no limit to the amount of information relating to circumstances that you can gather if you feel it would be useful. In the appendix to this book, I have included several hundred examples that I have encountered in the past. One retail bank, for some reason, would even like to know whether the customer's spouse keeps indoor potted plants. The fact that I have described this type of information as relatively fixed means that it may well change. The reality is that some things will change and some will not. This type of data tends to change slowly over time. For instance, official government research shows that, in the United Kingdom, approximately 10 percent of people change their address each year. A date of birth, however, will never change unless it is to correct an error. Theoretically, each of the data elements that we record about a customer could be used for segmentation purposes. It is very common to create segments of people by:

Sex
Age group

Income group

Geography

Behavioral Segmentation
Whereas a customer's circumstances tend not to relate to the relationship between us, a customer's behavior relates to their interaction with our organization. Behavior encompasses all types of interaction, such as:

Purchases—the products or services that the customer has bought from us.

Payments—payments made by the customer.

Contacts—where the customer has written, telephoned, or communicated in some other way. This might be an inquiry about a product, a complaint about service, perhaps a request for assistance, etc.

The kind of segmentation that could be applied to this aspect of the relationship could be:

Products purchased or groups of products. For instance, an insurance company might segment customers into major product groups such as pensions, motor insurance, household insurance, etc.

Spending category. Organizations sometimes segment customers by spending bands.

Category of complaint.

Derived Segmentation
The previous types of segmentation, relating to a customer's circumstances and their behavior, are quite straightforward to achieve because they require no interpretation of data about the customer. For instance, if you wish to segment customers by the amount of money they spend with you, it is a simple matter of adding up the value of all the orders placed over the period in question; the appropriate segment for any particular customer is then immediately obvious.
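As a sketch of how mechanical this kind of behavioral segmentation is, the banding step can be written in a few lines of Python. The band thresholds and the sample orders here are invented for illustration; they are not figures from the text.

```python
# Segment customers into spending bands by summing their order values.
# The band thresholds and the sample orders are illustrative assumptions,
# not figures from the Wine Club case study.

def spending_segment(total_spend):
    """Map a customer's total spend to a named band."""
    if total_spend >= 1000:
        return "high"
    if total_spend >= 250:
        return "medium"
    return "low"

def segment_customers(orders):
    """orders: list of (customer_code, order_value) pairs."""
    totals = {}
    for customer, value in orders:
        totals[customer] = totals.get(customer, 0) + value
    return {customer: spending_segment(total) for customer, total in totals.items()}

orders = [("C001", 120.0), ("C001", 300.0), ("C002", 80.0), ("C003", 1500.0)]
print(segment_customers(orders))  # C001 -> medium, C002 -> low, C003 -> high
```

The interesting business decision is where to draw the band boundaries; the computation itself is trivial, which is the point being made above.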
Very often, however, we need to segment our customers in ways that require significant manipulation and interpretation of information. Such segmentation classifications may be derived from the customer's circumstances or behavior or, indeed, a combination of both. As this book is, essentially, intended to assist in the development of data warehouses for CRM, we will return to the subject of derived segmentation quite frequently in the upcoming chapters. However, some examples of derived segmentation are:
Lifetime value. Once we have recorded a representative amount of circumstantial and behavioral information about a customer, we can use it in models to assist in predicting future behavior and future value. For instance, there is a group of young people that have been given the label “young plastics” (YP). The profile of a YP is someone who has recently graduated from college and is now embarking on the first few years of their working life. They have often just landed their first job and are setting about securing the trappings of life in earnest. Their adopted lifestyle usually does not quite match their current earnings, and they are often debt laden, becoming quite dependent on credit. At first glance, these people do not look like a good proposition upon which to build a business relationship. However, their long-term prospects are, statistically, very good. Therefore, it may be more appropriate to consider the potential lifetime value of the relationship and “cut them some slack,” which means developing products and services that are designed for this particular segment of customers.
Propensity to churn. We have already talked about the problem of churn earlier in this chapter. If we can assess our customers in such a way as to grade them on a scale of, say, 1 to 10, where 1 is a safe customer and 10 is a customer we are at risk of losing, then we can modify our own behavior in our relationship with each customer so as to manage the risk.
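A minimal sketch of the grading step, assuming some model has already produced a churn probability for each customer (how that probability is derived is a separate modeling question, not covered here):

```python
# Map a modeled churn probability onto the 1-10 risk grade described above:
# grade 1 = safe, grade 10 = high risk of leaving. Producing the probability
# (e.g., from behavioral data) is a separate modeling problem; this shows
# only the grading step, and the scale mapping is an assumption.

def churn_grade(probability):
    """Convert a churn probability in [0, 1] to an integer grade 1-10."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # 0.0-0.1 -> 1, 0.1-0.2 -> 2, ..., 0.9-1.0 -> 10
    return min(10, int(probability * 10) + 1)

print(churn_grade(0.05))  # 1: safe
print(churn_grade(0.95))  # 10: at risk
```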
Up-sell and cross-sell potential. By carefully analyzing customers' behavior, and possibly combining segments, it is possible to derive models of future behavior and even potential behavior. Potential behavior is a characteristic we all have; the term describes behavior we might engage in, given the right stimulus. Advertising companies stake their very existence on this. It enables us to identify opportunities to sell more products and services (up-selling) and different products and services (cross-selling) to our customers.
Entanglement potential. This is related to up-sell and cross-sell, and it applies to customers who might be encouraged to purchase a whole array of products from us. Over time, it can become increasingly difficult and bothersome for the customer to disentangle their relationship and go elsewhere. The retail financial services industry is good at this. The bank that manages our checking account encourages us to pay all our bills out of the account automatically each month. This is good because it means we don't have to remember to write the checks. Also, if the bank can persuade us to buy our house insurance, contents insurance, and maybe even our mortgage from them too, then it becomes a real big deal if we want to transfer, say, the checking account to another bank.

"Householding" is another example of this and, again, it's popular with the banks. It works by enticing all the members of a family to participate in a particular product or service so that, collectively, they all benefit from special deals such as a reduced interest rate on overdrawn accounts. If any of the family members withdraws from the service, then the deal can be revoked. This of course has an effect on the remainder of the family, and so there is an inducement to remain loyal to the bank.

Sometimes there are relationships between different behavioral components that would not be spotted by analysts and knowledge workers. To uncover these relationships, we have to employ different analytic techniques, and the best way to do so is to use a data mining product. The technical aspects of data mining will be described later in this book, but it is important to recognize that there are other ways of defining segments. As a rather obvious example, we can all comprehend the relationship between, say, soft fruit and ice cream, but how many supermarkets place soft fruit and ice cream next to each other?
If there is a statistically significant relationship such that customers who purchase soft fruit also purchase ice cream, then a data mining product would be able to detect it. As I said, this is an obvious example, but there will be others. For instance, is there a relationship between:

Vacuum cleaner bags and dog food?
Toothpaste and garlic?
Diapers and beer? (Surely not!)

As we have seen, there are many ways in which we can classify our customers into segments. Each segment provides us with opportunities that might be exploited. Of course, it is the business people who must decide whether the relationships are real or merely
coincidental. Once we have identified potential opportunities, we can devise a campaign to help us persuade our target customers of the benefits of our proposal. In order to do this, it might be helpful to employ another facet of CRM, campaign management.
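The co-occurrence counting that a data mining product automates can be sketched directly. This is a deliberately bare-bones illustration of the "support" and "confidence" measures behind rules like the soft fruit and ice cream example above; the shopping baskets are invented:

```python
# A bare-bones version of the rule detection a data mining product performs:
# for a candidate rule "customers who buy X also buy Y", count how often the
# pair occurs together (support) and how often Y appears given X (confidence).
# The baskets below are invented for illustration.

def rule_stats(baskets, x, y):
    """Return (support, confidence) for the rule x -> y."""
    n = len(baskets)
    with_x = sum(1 for b in baskets if x in b)
    with_both = sum(1 for b in baskets if x in b and y in b)
    support = with_both / n
    confidence = with_both / with_x if with_x else 0.0
    return support, confidence

baskets = [
    {"soft fruit", "ice cream", "milk"},
    {"soft fruit", "ice cream"},
    {"soft fruit", "bread"},
    {"milk", "bread"},
]
support, confidence = rule_stats(baskets, "soft fruit", "ice cream")
print(support, confidence)  # 0.5 and about 0.67
```

A real product searches over every candidate pair (and larger combinations) rather than being told which rule to test, but the statistics it reports are essentially these.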
Campaign Management

As I see it, there are, essentially, two approaches to marketing: defensive marketing and aggressive marketing. Defensive marketing is all about keeping what we already have. For instance, a strategy for reducing churn could be regarded as defensive because we are deploying techniques to keep our customers from being lured away by the aggressive marketing of our competitors. Aggressive marketing, therefore, is about trying to get more. By "more" we could be referring to:

Capturing more customers
Cross-selling to existing customers
Up-selling to existing customers

A well-structured strategy involving well-organized and managed campaigns is a good example of aggressive marketing. The concept of campaigns is quite simple and well known to most of us. We have all been the target of marketing campaigns at some point. There are three types of campaign:
Single-phase campaigns are usually one-off special offers. The company makes the customer, or prospect, an offer (this is often called a treatment) and if the customer accepts the offer (referred to as the response), then the campaign, in that particular case, can be regarded as having been successful. An example of this would be three months free subscription to a magazine. The publishing company is clearly hoping the customer will enjoy the magazine enough to pay for a subscription at the end of the free period.
Multi-phase campaigns, as the name suggests, involve a set of treatments instead of just one. The first treatment might be the offer of a book voucher or store voucher if the customer, say, visits a car dealer and accepts a test drive in a particular vehicle. This positive response is recorded and may be followed up by a second treatment. The second treatment could be the offer
to lend that customer the vehicle for an extended period, such as a weekend. A positive response to this would result in further treatments that, in turn, provoke further responses, the purpose ultimately being to increase the sales of a particular model.
Recurring campaigns run continually. For example, if the customer is persuaded to buy the car, then, shortly after, they can expect to receive a "welcome pack" that makes them feel they have joined some exclusive club.

Campaigns can be very expensive to execute. Typically, they are operated under very tight budget constraints and are usually expected to show a profit. This means that the cost of running the campaign is expected to be recouped out of the extra profits made by the company as a result of the campaign. The problem is: How do you know whether the sale of your product was influenced by the campaign? The people who ended up buying the product might have done so without being targeted in the campaign.

The approach that most organizations adopt in order to establish the efficacy of a campaign is to identify a "control" group, in much the same way as is done in clinical trials and other scientific experiments. The control group is identified as a percentage of the target population and receives no special treatments. At the end of the campaign, the two groups are compared. If, say, 2 percent of the control group purchase the product and 5 percent of the target group purchase the product, then it is assumed that the 3 percent difference was due to the influence of the campaign. The box on the following page shows how it's figured out.

One of the big problems with campaigns, as you can see, is the minuscule responses. If the response in our example had been 4 percent instead of 5, the campaign would have shown a loss of $19,000 instead of the healthy profit we actually got. The line between profit and loss is indeed a fine one to tread. It seems obvious that careful selection of the target population is critical to success. We can't just blitz the marketplace with a carpet-bombing approach. It's all about knowing, or having a good idea about, who might be in the market to buy a car.
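The control-group arithmetic can be sketched as follows. Since the book's worked box is not reproduced here, the population size, campaign cost, and profit per sale below are illustrative assumptions; only the 5 percent and 2 percent response rates come from the discussion above.

```python
# Control-group arithmetic for judging a campaign. The population size,
# campaign cost, and profit-per-sale figures are illustrative assumptions,
# not the numbers from the book's worked example.

def campaign_profit(target_size, target_rate, control_rate,
                    campaign_cost, profit_per_sale):
    """Profit attributed to the campaign via the uplift over the control group."""
    uplift = target_rate - control_rate          # e.g., 5% - 2% = 3%
    extra_sales = target_size * uplift           # sales credited to the campaign
    return extra_sales * profit_per_sale - campaign_cost

# 100,000 targets, 5% vs. 2% response, $45 profit per sale, $100,000 to run:
print(campaign_profit(100_000, 0.05, 0.02, 100_000, 45))   # roughly $35,000 profit
# Drop the target response to 4% and the same campaign loses money:
print(campaign_profit(100_000, 0.04, 0.02, 100_000, 45))   # roughly a $10,000 loss
```

Notice how a one-point change in the response rate swings the result from profit to loss, which is exactly the fine line described above.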
If you think about it, this is the most important part. And the scary thing is, it has nothing to do with campaign management. It has everything to do with knowing your customers. Campaign management systems are an important component of CRM, but the most important part, identifying which customers should be targeted, is outside the scope of most campaign management systems.
It would be really good if we could make our campaigns a little more tailored to the individual people in our target list. If the campaign had a target of one, instead of thousands, and we could be sure that this one target is really interested in the product, our chances of success would be far greater.
Personalized Marketing

Personalized marketing is sometimes referred to as a "segment of one" or "one-to-one marketing" and is the ultimate manifestation of having gotten to know our customers. Ideally, we know precisely:

What they need
When they need it
How much they are willing to pay for it
Then, instead of merely having a set of products that we try to sell to customers, we can begin to tailor our offerings more specifically to our customers. Some of the Internet-based companies are beginning to develop applications that recognize customers as they connect to the site and present them with "content" that is likely to be of interest to them. Different customers, therefore, will see different things on the same Web page depending on their previous visits. This means that the customer should never be presented with material that is of no interest to them. This is a major step forward in responding to customers' needs. The single biggest drawback is that this approach is currently limited to a small number of Internet-based companies. However, as other companies shift their business onto the Internet, the capability for this type of solution will grow.

It is worth noting that the success of these types of application depends entirely on information. It does not matter how sophisticated the application is; it is the information that underpins it that will determine success or failure.
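A minimal sketch of the idea, with an invented visit history and catalog: remember which content categories a customer viewed on previous visits, and lead with those categories the next time they connect.

```python
# A minimal sketch of visit-driven personalization: remember which content
# categories a customer looked at before, and lead with those next time.
# The categories and catalog here are invented for illustration.

def personalized_page(visit_history, catalog, slots=2):
    """Pick up to `slots` items, preferring categories seen in past visits."""
    seen = {category for _, category in visit_history}
    preferred = [item for item, category in catalog if category in seen]
    fallback = [item for item, category in catalog if category not in seen]
    return (preferred + fallback)[:slots]

history = [("chablis-offer", "white wine"), ("decanter-guide", "glassware")]
catalog = [("new-rioja", "red wine"),
           ("chardonnay-case", "white wine"),
           ("crystal-goblets", "glassware")]
print(personalized_page(history, catalog))  # ['chardonnay-case', 'crystal-goblets']
```

The logic is trivial; what makes it work in practice is the stored history of each customer's visits, which is exactly the information dependency noted above.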
Customer Contact

One of the main requirements in the implementation of our CRM strategy is to get to know our customers, and in order to do this we recognize the value of information. However, the information that we use tends to be collected in the routine course of daily business. For instance, we routinely store data regarding orders, invoices, and payments so that we can analyze the behavior of customers. Yet in virtually all organizations there exists a rich source of data that is often discarded. Each time a customer, or prospective customer, contacts the organization in any way, we should be thinking about the value of the information that could be collected and used to enhance our knowledge about the customer. Let's consider for a moment the two main types of contact that we encounter every day:

1. Enquiries. Every time an existing customer or prospect makes an inquiry about our products or services, we might reasonably conclude that the customer may be interested in purchasing that product or service. How many companies keep a managed list of prospects as a result? If we did, we would have a ready-made list for personalized campaign purposes.

2. Complaints. Customers complain for many reasons, usually good ones, and most quality companies have a purpose-built system for managing complaints. However, some customers are "serial" moaners, and it would be good to know, when faced with people like this, what segments they are
classified under and whether they are profitable or loss making. If the operators dealing with these communications had access to such information, they could respond appropriately. Remember that appropriate interaction with customers is the second main requirement in the implementation of our CRM strategy.

When we consider customer contact, we tend to think automatically of telephone contact. These days, however, there is a plethora of different media that a customer might use to contact us. Figure 1.4 shows the major channels of contact information.

Figure 1.4. The number of communication channels is growing.
There is quite a challenge in linking all these channels together into a cohesive system for the collection of customer contact information. There is a further point to this. Each customer contact costs money to deal with. Remember that the overall objective of a CRM strategy is to optimize the value of a customer. Therefore, the cost of dealing with individual customers should be taken into account if we are to assess their value accurately. Unfortunately, most organizations aren't able to figure out the cost of an individual customer. What tends to
happen is that they sum up the total cost of customer contact, divide it by the total number of customers, and use the result as the customer cost. This arbitrary approach is OK for general information, but it is unsatisfactory in a CRM system, where we really do want to know which individual customers cost us money to service.
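The difference between the flat average and a per-customer attribution is easy to see in a sketch; the contact log and handling costs below are invented:

```python
# Average cost per customer versus cost attributed per customer. The contact
# log and per-contact costs are invented; the point is that the flat average
# hides which individual customers are expensive to service.

contacts = [  # (customer_code, cost_of_handling_contact)
    ("C001", 4.0), ("C001", 4.0), ("C001", 12.0),   # frequent caller
    ("C002", 4.0),
    ("C003", 2.0),
]
customers = {"C001", "C002", "C003"}

# The arbitrary approach: total contact cost spread evenly.
average = sum(cost for _, cost in contacts) / len(customers)

# The CRM approach: attribute each contact's cost to the customer who caused it.
per_customer = {}
for code, cost in contacts:
    per_customer[code] = per_customer.get(code, 0.0) + cost

print(average)        # every customer looks the same
print(per_customer)   # C001 costs 20.0 to service, C003 only 2.0
```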
SUMMARY

Different people, clearly, have different perspectives on the world. Six experts will give six different views as to what is meant by customer relationship management. In this chapter I have expressed my view of what CRM means, its definition, and why it's important to different types of businesses. We have also explored the major components of CRM and looked at ways in which the information that we routinely hold about customers might be used to support a CRM strategy. This book, essentially, is about data warehousing. Now we have to figure out how to design a data warehouse that will support the kind of questions that we need to ask about customers in order to help us be successful at CRM.
Chapter 2. An Introduction to Data Warehousing

INTRODUCTION
WHAT IS A DATA WAREHOUSE?
DIMENSIONAL ANALYSIS
BUILDING A DATA WAREHOUSE
PROBLEMS WHEN USING RELATIONAL DATABASES
SUMMARY
INTRODUCTION

In this chapter we provide an introduction to data warehousing. It is sensible, as a starting point, to introduce data warehousing using "first generation" principles so that we can then go on to explore the issues in order to develop a "second generation" architecture.

Data warehousing belongs to a branch of a general business subject known as decision support. So in order to understand what data warehousing is all about, we must first understand the purpose of decision support systems (DSS) in general. Decision support systems have existed, in different forms, for many years. Long before the invention of any form of database management system (DBMS), information was being extracted from applications to assist managers in the more effective running of their organizations.
So what is a decision support system? The purpose of a decision support system is to provide decision makers in organizations with information. The information advances the decision makers' knowledge in some way so as to assist them in making decisions about the organization's policies and strategy. A DSS tends to have the following characteristics:

It is aimed at the less well structured, underspecified problems that more senior managers typically face.
It possesses capabilities that make it easy for noncomputer people to use interactively.
It is flexible and adaptable enough to accommodate changes in the environment and in the decision-making approach of the user.

The job of a DSS is usually to provide a factual answer to a question phrased by the user. For instance, a sales manager would probably be concerned if her actual product sales were falling short of the target set by her boss. The question she would like to be able to ask might be:
Why are my sales not meeting my targets?

There are, as yet, no computer systems available to answer such a question. Imagine trying to construct an SQL (Structured Query Language) query that did that! Her questioning has to be more systematic, such that the DSS can give factual responses. So the first question might be:
For each product, what are the cumulative sales and targets for the year?

A DSS would respond with a list of products and the sales figures. It is likely that some of the products are ahead of target and some are behind. A well-constructed report might highlight the offending products to make them easier to see. For instance, they could be displayed in red, or flashing. She could have asked:
What are the cumulative sales and targets for the year for those products where the actual sales are less than the target?

Having discovered those products that are not achieving the target, she might ask what the company's market share is for those products, and whether that share is decreasing. If it is, maybe it's due to a recently imposed price rise. The purpose of the DSS is to respond to ad hoc questions like these, so that the user can ultimately come to a conclusion and make a decision.

A major constraint in the development of DSS is the availability of data—that is, having access to the right data at the right time. Although the proliferation of database systems, and the proper application of the database approach, enables us to separate data from applications and provides for data independence, the provision of data remains a challenge. The introduction of sophisticated DBMSs has certainly eased the problems caused by traditional applications but, nonetheless, unavailability of data persists as a problem for most organizations. Even today, data remains "locked away" in applications.

The main reason is that most organizations evolve over time. As they do, the application systems increasingly fail to meet the functional requirements of the organization. As a result, the applications are continually being modified in order to keep up with the ever-changing business. There comes a time in the life of almost every application when it has been modified to the point where it becomes impossible or impractical to modify it further. At this point a decision is usually made to redevelop the application. When this happens, it is usual for the developers to take advantage of whatever improvements in technology have occurred during the life of the application. For instance, the original
application may have used indexed sequential files because this was the most appropriate technology of the day. Nowadays, most applications obtain their data through relational database management systems (RDBMS). However, most large organizations have dozens or even hundreds of applications. These applications reach the end of their useful lives at various times and are redeveloped on a piecemeal basis. This means that, at any point in time, an organization is running applications that use many different types of software technology. Further, large organizations usually have their systems on diverse hardware platforms. It is very common to see applications in a single company spread over the following:

A large mainframe
Several mid-range multiprocessor machines
External service providers
Networked and stand-alone PCs

A DSS may need to access information from many of these applications in order to answer the questions being put to it by its users.
Introduction to the Case Study

To illustrate the issues, let us examine the operation of a (fictitious) organization that contains some of the features just described. The organization is a mail order wine club. With great originality, it is called the Wine Club. As well as its main products (wines), it also sells accessories such as:

Glassware—goblets, decanters, glasses, etc.
Tableware—ice buckets, corkscrews, salvers, etc.
Literature—books and pamphlets on wine-growing regions, reviews, vintages, etc.

It has also recently branched out into organizing trips to special events such as the Derby, the British Formula One Grand Prix, and the Boat Race. These trips generally involve the provision of a marquee in a prominent position with copious supplies of the
club's wines and a luxury buffet meal. These are mostly one-day events, but there are an increasing number of longer trips, such as those that take in the French wine-growing regions by coach tour.

The club's information can be modeled by an entity attribute relationship (EAR) diagram. A high-level EAR diagram of the club's data is shown in Figure 2.1.

Figure 2.1. Fragment of data model for the Wine Club.
Accessory(ProductCode, ProductDesc, SellingPrice, CostPrice)
Class(ClassCode, ClassName, Region)
Color(ColorCode, ColorDesc)
Customer(CustomerCode, CustomerName, CustomerAddress, CustomerPhone)
CustomerOrder(OrderCode, OrderDate, ShipDate, Status, TotalCost)
OrderItem(OrderCode, ItemCode, Quantity, ItemCost)
ProductGroup(GroupCode, Description)
Reservation(CustomerCode, TripCode, Date, NumberOfPeople, Price)
Shipment(ShipCode, ShipDate)
Shipper(ShipperCode, ShipperName, ShipperAddress, ShipperPhone)
Stock(LocationCode, StockOnHand)
Supplier(SupplierCode, SupplierName, SupplierAddress, SupplierPhone)
Trip(TripCode, Description, BasicCost)
TripDate(TripCode, Date, Supplement, NumberOfPlaces)
Wine(WineCode, Name, Vintage, ABV, PricePerBottle, PricePerCase)

The Wine Club has the following application systems in place:
Customer administration. This enables the club to add new customers. This is particularly important after an advertising campaign, typically in the Sunday color supplements, when many new customers join the club at the same time. It is important that the new customers' details, and their orders, are promptly dealt with in order to create a good first impression. This application also enables changes to a customer's address to be recorded, as well as removing ex-customers from the database. There are about 100,000 active customers.
Stock control. The goods inward system enables newly arrived stock to
be added to the stock records. The club carries about 2,200 different wines and 250 accessories from about 150 suppliers.
Order processing. The directors of the club place a high degree of importance on the fulfillment of customers' orders. Much emphasis is given to speed and accuracy. It is a stated policy that orders must be shipped within ten days of receipt of the order. The application systems that support order processing are designed to enable orders to be recorded swiftly so that they can be fulfilled within the required time. The club processes about 750,000 orders per year, with an average of 4.5 items per order.
Shipments. Once an order has been picked, it is packed and placed in a pre-designated part of the dispatch area. Several shipments are made every day.
Trip bookings. This is a new system that records customer bookings for planned trips. It operates quite independently of the other systems, although it shares the customer information held in the customer administration system.

The club's systems have evolved over time and have been developed using different technologies. The order processing and shipments systems are based on indexed-sequential files accessed by COBOL programs. The customer administration system is held on a relational database. All these systems run on the same mid-range computer. The stock control system is a software package that runs on a PC network. The trip bookings system is held on a single PC that runs a PC-based relational database system.

There is a general feeling among the directors and senior managers that the club is losing market share. Within the past three months, two more clubs have been formed, and their presence in the market is already being felt. Also, recently, more customers than usual appear to be leaving the club, and new customers are being attracted in fewer numbers than before. The directors have held meetings to discuss the situation. The information upon which the discussions are based is largely anecdotal. They are all certain that a problem exists but find it impossible to quantify. They also know that helpful information passes through their systems and should be available to answer questions. In reality, however, while it is not too difficult to get answers
to the day-to-day operational questions, it is almost impossible to get answers to more strategic questions.
Strategic and Operational Information

It is very important to understand the difference between the terms strategic and operational. In general, strategic matters deal with planning and policy making, and this is where a data warehouse can help. For instance, in the Wine Club the decision as to when a new product should be launched would be regarded as a strategic decision. Examples pertaining to other types of organization include:

A telecommunications company deciding to introduce very cheap off-peak tariffs to attract callers away from the peak times, rather than installing extra equipment to cope with increasing demand.
A large supermarket chain deciding to open its stores on Sundays.
A general 20 percent price reduction for one month in order to increase market share.

Whereas strategic matters relate to planning and policy, operational matters are generally more concerned with the day-to-day running of a business or organization. Operations can be regarded as the implementation of the organization's strategy (its policies and plans). The day-to-day ordering of supplies, satisfying customers' orders, and hiring new employees are examples of operational procedures. These procedures are usually supported by computer applications and, therefore, the applications must be able to provide answers to operational questions such as:

How many unfulfilled orders are there?
On which items are we out of stock?
What is the position on a particular order?

Typically, operational systems are quite good at answering questions like these because
they are questions about the situation as it exists right now. You could add the words "right now" to the end of each of those questions and they would still make sense. Questions such as these arise out of the normal operation of the organization. The sort of questions the directors of the Wine Club wish to ask are:

1. Which product lines are increasing in popularity and which are decreasing?
2. Which product lines are seasonal?
3. Which customers place the same orders on a regular basis?
4. Are some products more popular in different parts of the country?
5. Do customers tend to purchase a particular class of product?

These, clearly, are not "right now" types of questions and, typically, operational systems are not good at answering them. Why is this? The answer lies in the nature of operational systems: they are developed to support the operational requirements of the organization. Let's examine the operational systems of the Wine Club and see what they actually do. Each application's role in the organization can usually be expressed in one or two sentences. The customer administration system contains details of current customers. The stock control system contains details of the stock currently held. The order processing system holds details of unfulfilled customer orders, and the shipments system records details of fulfilled orders awaiting delivery to the customers.

Notice the use of words like details and current in those descriptions. They underline the "right now" nature of operational systems. You could say that the operational systems represent a "snapshot" of an organization at a point in time. The values held are constantly changing. At any point in time, dozens or even hundreds of inserts, updates, and deletes may be executing on all, or any, parts of the systems. If you were to freeze the systems momentarily, they would provide an accurate reflection of the state of the organization at precisely that moment. One second earlier, or one second later, the situation would have changed.
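The contrast can be sketched with an in-memory SQLite database (the tables and figures are invented): a snapshot table answers a "right now" question easily, but a trend question needs history that operational systems do not keep.

```python
# Why a "right now" snapshot cannot answer a trend question. An operational
# table holds only current state; answering "is this product gaining or
# losing popularity?" needs history kept over time. Tables and rows here
# are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational snapshot: current unfulfilled orders only.
conn.execute("CREATE TABLE open_orders (order_code TEXT, wine_code TEXT)")
conn.executemany("INSERT INTO open_orders VALUES (?, ?)",
                 [("O1", "W10"), ("O2", "W10"), ("O3", "W20")])

# An operational, "right now" question is easy:
open_count = conn.execute("SELECT COUNT(*) FROM open_orders").fetchone()[0]
print(open_count)  # 3

# A strategic question needs sales recorded over time:
conn.execute("CREATE TABLE sales_history (wine_code TEXT, year INTEGER, bottles INTEGER)")
conn.executemany("INSERT INTO sales_history VALUES (?, ?, ?)",
                 [("W10", 1999, 500), ("W10", 2000, 350),
                  ("W20", 1999, 200), ("W20", 2000, 420)])
trend = conn.execute(
    "SELECT wine_code, year, SUM(bottles) FROM sales_history "
    "GROUP BY wine_code, year ORDER BY wine_code, year").fetchall()
print(trend)  # W10 falling (500 -> 350), W20 rising (200 -> 420)
```

The snapshot table is overwritten as orders are fulfilled; only the history table, which no operational system above maintains, can reveal the trend.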
Now let us examine the five questions that the directors of the Wine Club need to ask in
order to reach decisions about their future strategy. What is it that the five questions have in common? If you look closely you will see that each of the five questions is concerned with sales of products over time. Looking at the first question:

Which product lines are increasing in popularity and which are decreasing?

This is obviously a sensible strategic business question. Depending on the answer, the directors might:

- Expand their range of some products and shrink their range of other products
- Offer a financial incentive on some products, such as reduced prices or discounts for volume purchases
- Enhance the promotional or advertising techniques for the products that are decreasing in popularity

For the moment, let's focus on sales of wine and assess whether the information required to answer such a question is available to the directors. Have a look back at the EAR diagram at the beginning of the case study. Remember we are looking for “sales of products over time.” The only way we can assess whether a product line is increasing or decreasing in popularity is to trace its demand over time. If the order processing information were held in a relational database, we could devise an SQL query such as:
Select Name, Sum(Quantity), Sum(ItemCost) Sales
From CustomerOrder a, OrderItem b, Wine c
Where a.OrderCode = b.OrderCode
And b.WineCode = c.WineCode
Group By Name
Order By Name

Question 4: Are some products more popular in different parts of the country?
This query shows, for each wine, both the number of orders and the total number of bottles ordered by area.
Select WineName, AreaDescription, Count(*) "Total Orders", Sum(Quantity) "Total Bottles"
From Sales S, Wine W, Area A, Time T
Where S.WineCode = W.WineCode
And S.AreaCode = A.AreaCode
And S.OrderTimeCode = T.TimeCode
And T.PeriodNumber Between 012001 And 122002
Group By WineName, AreaDescription
Order By WineName, AreaDescription

Question 5: Do customers tend to purchase a particular class of product?
This query presents us with a problem. There is no reference to the class of wine in the data warehouse. Information relating to classes does exist in the original EAR model, so it seems that the star schema is incomplete. What we have to do is extend the schema as shown in Figure 2.10.

Figure 2.10. Snowflake schema for the sale of wine.
Of course the Class information has to undergo the extraction and integration processing before it can be inserted into the database. A foreign key constraint must be included in the Wine table to refer to the Class table. The query can now be coded:
Select CustomerName, ClassName, Sum(Quantity) "TotalBottles"
From Sales S, Wine W, Customer Cu, Class Cl, Time T
Where S.WineCode = W.WineCode
And S.CustomerCode = Cu.CustomerCode
And W.ClassCode = Cl.ClassCode
And S.OrderTimeCode = T.TimeCode
And T.PeriodNumber Between 012001 And 122002
Group By CustomerName, ClassName
Having Sum(Quantity) > 2 *
   (Select AVG(Quantity)
    From Sales S, Wine W, Class C, Time T
    Where S.WineCode = W.WineCode
    And W.ClassCode = C.ClassCode
    And S.OrderTimeCode = T.TimeCode
    And T.PeriodNumber Between 012001 And 122002)
Order By CustomerName, ClassName

The query lists all customers and classes of wine where the customer has ordered that class of wine at more than twice the average quantity for all classes of wine. There are other ways that the query could be phrased. It is always a good idea to ask the directors precisely how they would define their questions in business terms before translating the question into an SQL query.

There are any number of ways the directors can question the data warehouse in order to answer their strategic business questions. We have shown that the data warehouse supports those types of questions in a way in which the operational applications could never hope to do.

The queries show very clearly that the arithmetic functions such as AVG() and particularly SUM() are used in just about every case. Therefore, a golden rule with respect to fact tables can be defined:
The nonkey columns in the fact table must be summable.

Data attributes such as Quantity and ItemCost are summable, whereas text columns such as descriptions are not. Unfortunately, it is not as straightforward as it seems. Care must be taken to ensure that the summation is meaningful. For some attributes the summation is meaningful only across certain dimensions. For instance, ItemCost can be summed by product, customer, area, and time with meaningful results. Quantity sold can be summed by product but might be regarded as meaningless across other dimensions. Although this problem applies to the Wine Club, it is much more easily explained in a different organization such as a supermarket. While it is reasonable to sum sales revenue across products (e.g., the revenue from sales of apples added to the revenue from sales of oranges and other fresh fruit each contribute toward the sum of revenue for fresh fruit), adding the quantity of apples sold to the quantity of oranges sold produces a meaningless
result. Attributes that are summable across some dimensions, but not all, are referred to as semisummable attributes. Clearly they have a valuable role to play in a data warehouse, but their usage must be restricted to avoid the generation of invalid results.

So have we now completed the data warehouse design? Not quite. Remember that the fact table may grow to more than 62 million rows over time. There is the possibility, therefore, that a query might have to trawl through every single row of the fact table in order to answer a particular question. In fact, it is very likely that many queries will require a large percentage of the rows, if not the whole table, to be taken into account. How long will it take to do that? The answer is: quite a long time. Some queries are quite complex, involving multiple join paths, and this will seriously increase the time taken for the result set to be presented back to the user, perhaps to several hours. The problem is exacerbated when several people are using the system at the same time, each with a complex query to run. If you were to join the 62-million-row fact table to the customer table and the wine table, how many rows would the Cartesian product contain?
In principle, there is no need for rapid responses to strategic queries, as they are very different from the kind of day-to-day queries that are executed while someone is hanging on the end of the telephone waiting for a response. In fact, it could be argued that, previously, the answer was impossible to obtain, so even if the query took several days to execute, it would still be worth it. That doesn't mean we shouldn't do what we can as designers to speed things up as much as possible. Indexes might help, but in many cases the queries will need to access more than half the data, and indexes are much less efficient in those cases than a full sequential scan of the tables. No, the answer lies in summaries. Remember we said that almost all queries would be summing large numbers of rows together and returning a result set with a smaller number of rows. Well, if we can predict, to
some degree, the types of queries the users will mostly be executing, we can prepare some summarized fact tables so that the users can access those if they happen to satisfy the requirements of the query. Where the aggregates don't supply the required data, the user can still access the detail. If we question the users closely enough, we should be able to come up with a set, maybe half a dozen or so, of summarized fact tables. The star schema and the snowflake principles still apply, but the result is that we have several fact tables instead of just one. It should be emphasized that this is a physical design consideration only. Its only purpose is to improve the performance of the queries.

Some examples of summarization for the Wine Club might be:

- Customers by wine for each month
- Customers by wine for each quarter
- Wine by area for each month
- Wine by area for each quarter

Notice that the above examples are summarizing over time. There are other summaries, and you may like to try to think of some, but summarizing over time is a very common practice in data warehouses. Figure 2.11 shows the levels of summarization commonly in use.

Figure 2.11. Levels of summarization in a data warehouse
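As a sketch of how such a summary table might be built, the following uses SQLite from Python with an invented, much-simplified fact table (the real Wine Club schema is richer than this). A Create Table ... As Select statement collapses the detail rows to one row per customer, wine, and month:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# A tiny detail-level fact table: one row per order line.
# Table and column names here are illustrative, not the book's exact schema.
cur.execute("Create Table Sales (WineCode, CustomerCode, Month, Quantity, ItemCost)")
cur.executemany("Insert Into Sales Values (?,?,?,?,?)", [
    ("W1", "C1", "2001-01", 6, 42.0),
    ("W1", "C1", "2001-01", 6, 42.0),
    ("W1", "C2", "2001-02", 12, 84.0),
    ("W2", "C1", "2001-02", 3, 30.0),
])

# Precompute a summary fact table: customers by wine for each month.
cur.execute("""
    Create Table Sales_by_Customer_by_Month As
    Select CustomerCode, WineCode, Month,
           Sum(Quantity) Quantity, Sum(ItemCost) ItemCost
    From Sales
    Group By CustomerCode, WineCode, Month
""")

# The summary has fewer rows than the detail but answers the same
# aggregate questions without scanning every detail row.
detail_rows = cur.execute("Select Count(*) From Sales").fetchone()[0]
summary_rows = cur.execute("Select Count(*) From Sales_by_Customer_by_Month").fetchone()[0]
print(detail_rows, summary_rows)  # 4 3
```

On real volumes the same pattern turns a 62-million-row scan into a scan of a table that may be several orders of magnitude smaller.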
One technique that is very useful to people using the data warehouse is the ability to drill down from one summary level to a lower, more detailed level. For instance, you might observe that a certain range of products was doing particularly well or particularly badly. By drilling down to individual products, you can see whether the whole range, or maybe just one isolated product, is affected. Conversely, the ability to drill up would enable you to make sure, if you found one product performing badly, that the whole range is not affected. These abilities to drill down and drill up are powerful reporting capabilities provided by a data warehouse where summarization is used.

The usage of the data warehouse must be monitored to ensure that the summaries are being used by the queries that are exercising the database. If it is found that they are not being used, then they should be dropped and replaced by others that are of more use.
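The drill-down idea can be illustrated with a toy example (Python with SQLite; the range and wine names are invented for the illustration). Drilling down is simply reissuing the same aggregate query one level lower in the hierarchy:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Illustrative schema: each wine belongs to a range of products.
cur.execute("Create Table Wine (WineCode, WineName, RangeName)")
cur.execute("Create Table Sales (WineCode, Quantity)")
cur.executemany("Insert Into Wine Values (?,?,?)", [
    ("W1", "Chinon", "Loire Reds"),
    ("W2", "Bourgueil", "Loire Reds"),
    ("W3", "Chablis", "Burgundy Whites"),
])
cur.executemany("Insert Into Sales Values (?,?)", [
    ("W1", 5), ("W2", 40), ("W3", 20),
])

# Summary level: sales by range. "Loire Reds" looks healthy overall.
by_range = cur.execute("""
    Select RangeName, Sum(Quantity)
    From Sales S, Wine W
    Where S.WineCode = W.WineCode
    Group By RangeName
""").fetchall()

# Drill down: the same aggregation one level lower shows that within
# the range, one isolated product (Chinon) is performing badly.
by_wine = cur.execute("""
    Select WineName, Sum(Quantity)
    From Sales S, Wine W
    Where S.WineCode = W.WineCode And W.RangeName = 'Loire Reds'
    Group By WineName
""").fetchall()
print(by_range)
print(by_wine)
```

In a summarized warehouse the first query would run against a summary table and the second against the detail, but the user's view of the operation is the same.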
Summary Navigation

The introduction of summaries raises some questions:

1. How do users, especially noncomputer professionals, know which summaries are available and how to take advantage of them?
2. How do we monitor which summaries are, in fact, being used?

One solution is to use a summary navigation tool. A summary navigator is an additional layer of software, usually a third-party product, that sits between the user interface (the presentation layer) and the database. The summary navigator receives the SQL query from the user and examines it to establish which columns are required and the level of summarization needed.

How do summary navigators work? This is a prime example of the use of metadata. Remember metadata is data about data. Summary navigators hold their own metadata within the data warehouse (or in a database separate from the warehouse). The metadata is used to provide a “mapping” between the queries formulated by the users and the data warehouse itself. Tables 2.2 and 2.3 are example metadata tables.
Table 2.2. Available Summary Tables for Aggregate Navigation (Summary_Tables)

Table_Name                      DW_Column
Sales_by_Customer_by_Year       Sales
Sales_by_Customer_by_Year       Customer
Sales_by_Customer_by_Year       Year
Sales_by_Customer_by_Quarter    Sales
Sales_by_Customer_by_Quarter    Customer
Sales_by_Customer_by_Quarter    Quarter

Table 2.3. Metadata Mapping Table for Aggregate Navigation (Column_Map)

User_Column    User_Value    DW_Column    DW_Value    Rating
Year           2001          Year         2001        100
Year           2001          Quarter      Q1_2001     80
Year           2001          Quarter      Q2_2001     80
Year           2001          Quarter      Q3_2001     80
Year           2001          Quarter      Q4_2001     80
The Summary_Tables table contains a list of all the summaries that exist in the data warehouse, together with the columns contained within them. The Column_Map table provides a mapping between the columns specified in the user's query and the columns that are available from the warehouse. Let's look at an example of how it works. We will assume that the user wants to see the sum of sales for each customer for 2001. The simple way to do this is to formulate the following query:
Select CustomerName, Sum(Sales) "Total Sales"
From Sales S, Customer C, Time T
Where S.CustomerCode = C.CustomerCode
And S.TimeCode = T.TimeCode
And T.Year = 2001
Group By C.CustomerName

As we know, this query would look at every row in the detailed Sales fact table in order to produce the result set, and it would very likely take a long time to execute. If, on the other hand, the summary navigator stepped in and grabbed the query before it was passed through to the RDBMS, it could redirect the query to the summary table called “Sales_by_Customer_by_Year.” It does this by:

1. Checking that all the columns needed are present in the summary table. Note that this includes columns in the “Where” clause that are not necessarily required in the result set (such as “Year” in this case).
2. Checking whether there is a translation to be done between what the user has typed and what the summary is expecting.

In this particular case, no translation was necessary, because the summary table “Sales_by_Customer_by_Year” contained all the necessary columns. So the resultant query would be:
Select CustomerName, Sum(Sales) "Total Sales"
From Sales_by_Customer_by_Year S, Customer C
Where S.CustomerCode = C.CustomerCode
And S.Year = 2001
Group By C.CustomerName

If, however, “Sales_by_Customer_by_Year” did not exist as an aggregate table (but “Sales_by_Customer_by_Quarter” did), then the summary navigator would have more work to do. It would see that Sales by Customer was available and would have to refer to the Column_Map table to see if the “Year” column could be derived. The Column_Map table shows that, when the user types “Year = 2001,” this can be translated to:
Quarter In ("Q1_2001", "Q2_2001", "Q3_2001", "Q4_2001")

So, in the absence of “Sales_by_Customer_by_Year,” the query would be reconstructed as follows:
Select CustomerName, Sum(Sales) "Total Sales"
From Sales_by_Customer_by_Quarter S, Customer C
Where S.CustomerCode = C.CustomerCode
And S.Quarter In ("Q1_2001", "Q2_2001", "Q3_2001", "Q4_2001")
Group By C.CustomerName

Notice that the Column_Map table has a rating column. This tells the summary navigator that “Sales_by_Customer_by_Year” is summarized to a higher level than “Sales_by_Customer_by_Quarter” because it has a higher rating. This directs the summary navigator to select the most efficient path to satisfying the query.

You may think that the summary navigator itself adds an overhead to the overall processing time involved in answering queries, and you would be right. Typically, however, the added overhead is on the order of a few seconds, which is a price worth paying for the 1,000-fold improvements in performance that can be achieved using this technique.

We opened this section with two questions. The first question asked how users, especially noncomputer professionals, know which aggregates are available and how to take advantage of them. It is interesting to note that where summary navigation is used, the user never knows which fact table their queries are actually using. This means that they don't need to know which summaries are available or how to take advantage of them. If “Sales_by_Customer_by_Year” were to be dropped, the summary navigator would automatically switch to using “Sales_by_Customer_by_Quarter.”

The second question asked how we monitor which summaries are being used. Again, this is simple when you have a summary navigator. As it is formulating the actual queries to be executed against the data warehouse, it knows which summary tables are being used and can record the information. Not only that, it can record:

- The types of queries that are being run, to provide statistics so that new summaries can be built
- Response times
- Which users use the system most frequently

All kinds of useful statistics can be stored. Where does the summary navigator store this information? In its metadata tables.

As a footnote to summary navigation, it is worth mentioning that several of the major RDBMS vendors have expressed the intention of building summary navigation into their products. This development has been triggered by the enormous growth in data warehousing over the past few years.
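The selection logic a summary navigator applies can be sketched in a few lines of Python. This is a toy illustration, not any vendor's product: it drives off in-memory copies of the Tables 2.2 and 2.3 metadata and merely chooses a table and a translated filter, rather than rewriting real SQL.

```python
# Metadata from Table 2.2 (Summary_Tables): which columns each
# summary fact table carries.
summary_tables = {
    "Sales_by_Customer_by_Year": {"Sales", "Customer", "Year"},
    "Sales_by_Customer_by_Quarter": {"Sales", "Customer", "Quarter"},
}

# Metadata from Table 2.3 (Column_Map): how a user-level filter can be
# translated, with a rating (higher = more summarized = cheaper).
column_map = {
    ("Year", "2001"): [
        ("Year", ["2001"], 100),
        ("Quarter", ["Q1_2001", "Q2_2001", "Q3_2001", "Q4_2001"], 80),
    ],
}

def navigate(required_columns, user_filter, available):
    """Return (table, filter column, filter values) for the most highly
    rated summary table that can satisfy the query."""
    for dw_column, dw_values, rating in sorted(
            column_map[user_filter], key=lambda m: -m[2]):
        # Swap the user's filter column for its warehouse translation.
        needed = (required_columns - {user_filter[0]}) | {dw_column}
        for table in available:
            if needed <= summary_tables[table]:
                return table, dw_column, dw_values
    return None

# The yearly summary exists, so it is used directly ...
print(navigate({"Sales", "Customer", "Year"}, ("Year", "2001"),
               ["Sales_by_Customer_by_Year", "Sales_by_Customer_by_Quarter"]))
# ... but if it has been dropped, the query is redirected to the
# quarterly table, with "Year = 2001" translated to the four quarters.
print(navigate({"Sales", "Customer", "Year"}, ("Year", "2001"),
               ["Sales_by_Customer_by_Quarter"]))
```

The fallback in the second call is exactly the behavior described above: the user's query is unchanged, and the navigator silently substitutes “Sales_by_Customer_by_Quarter” with the In-list of quarters.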
Presentation of Information

The final component of a data warehouse is the method of presentation: how the warehouse is presented to the users. Most data warehouse implementations adopt a client-server configuration. The concept of client-server, for our purposes, can be viewed as the separation of the users from the warehouse, in that the users will normally be using a personal computer while the data warehouse resides on a remote host. The connection between the machines is controlled by a computer network.

There are very many client products available for accessing relational databases, many of which you may already be familiar with. Most of these products help the user by using the RDBMS schema tables to generate SQL. Similarly, most have the capability to present the results in various forms such as textual reports, pie charts, scatter diagrams, and two- and three-dimensional bar charts. The choice is enormous. Most products are now available on Web servers, so that all the users need is a Web browser to display their information.

There are, however, some specialized analysis techniques that have largely come about since the invention of data warehouses. The presence of large volumes of time-variant data, hitherto unavailable, has allowed the development of a new process called data mining.

In our exploration into data warehousing and the ways in which it helps with decision support, the onus has always been placed on the user of the warehouse to formulate the queries and to spot any patterns in the results. This leads to more searching questions being asked as more information is returned. Data mining is a technique where the technology does more of the work. The users
describe the data to the data mining product by identifying the data types and the ranges of valid values. The data mining product is then launched at the database and, by applying standard pattern-recognition algorithms, is able to present details of patterns in the data that the user may not be aware of. Figure 2.12 shows how a data mining tool fits into the data warehouse model.

Figure 2.12. Modified data warehouse structure incorporating summary navigation and data mining.
The technique has been used very successfully in the insurance industry, where a particular insurance company wanted to decrease the number of proposals for life assurance that had to be referred to the company for approval. A data mining program was applied to the data warehouse and reported that men between the ages of 30 and 40 whose height-to-weight ratio was within a certain range had an increased risk probability of just 0.015. The company immediately included this profile in their automatic underwriting system, thereby increasing the level of automatic underwriting from 50 percent to 70 percent. Even with a data warehouse, it would probably have taken a human “data miner” a long time to spot that pattern, because a human would follow logical paths, whereas the data mining program is simply searching for patterns.
PROBLEMS WHEN USING RELATIONAL DATABASES It has been stated that the relational model supports the requirements of data warehousing, and it does. There are, however, a number of areas where the relational model struggles to cope. As we are coming to the end of our introduction to data warehousing, we'll conclude with a brief look at some of these issues.
Problems Involving Time

Time variance is one of the most important characteristics of data warehouses. In the section on “Building the Data Warehouse,” we commented on the fact that there appeared to be a certain amount of data redundancy in the warehouse because we were duplicating some of the information, for example, customers' details, which existed in the operational systems. The reason we have to do this is the need to record information over time.

As an example, when a customer changes address, we would expect that change to be recorded in the operational database. When we do that, we lose the old address. So when a query is next executed where that customer's details are included, any sales of wine for that customer will automatically be attributed to the new address. If we are investigating sales by area, the results will be incorrect (assuming the customer moved to a different area) because many of the sales were made when the customer was in another area. That's also the reason why we don't delete customers' details from the data warehouse simply because they are no longer customers. If they have placed any orders at all, then they have to remain within the system.

True temporal models are very complex and are not well supported at the moment. We have to introduce techniques such as “start dates” and “end dates” to ensure that the data warehouse returns accurate results. The problems surrounding the representation of time in data warehousing are many. They are fully explored in Chapter 4.
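The “start date”/“end date” technique can be sketched as follows. This is a minimal illustration in Python with SQLite; the table and column names are hypothetical, not the Wine Club's actual schema. Each address row carries the period for which it was in force, and each sale joins to the address that was current on the order date:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical illustration: address rows are never overwritten;
# instead each carries a StartDate and EndDate.
cur.execute("Create Table CustomerAddress (CustomerCode, AreaCode, StartDate, EndDate)")
cur.executemany("Insert Into CustomerAddress Values (?,?,?,?)", [
    ("C1", "North", "2000-01-01", "2001-06-30"),  # old address
    ("C1", "South", "2001-07-01", "9999-12-31"),  # current address
])
cur.execute("Create Table Sales (CustomerCode, OrderDate, Quantity)")
cur.executemany("Insert Into Sales Values (?,?,?)", [
    ("C1", "2001-03-15", 6),   # placed while the customer lived in the North
    ("C1", "2001-09-10", 12),  # placed after the move South
])

# Each sale joins to the address in force on the order date, so
# sales-by-area results remain historically correct after the move.
rows = cur.execute("""
    Select AreaCode, Sum(Quantity)
    From Sales S, CustomerAddress A
    Where S.CustomerCode = A.CustomerCode
    And S.OrderDate Between A.StartDate And A.EndDate
    Group By AreaCode
""").fetchall()
print(rows)
```

Without the date columns, both sales would be attributed to the South, which is exactly the error described above.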
Problems With SQL
SQL is based on set theory. It treats tables as sets and returns its results in the form of a set. There are cases where the use of procedural logic would improve functionality and performance.

Ranking/Top (n)

While it is possible to get ranked output from SQL, it is difficult to do. It involves a correlated subquery, which is beyond the capability of most SQL users. It is also very time-consuming to execute. Some individual RDBMS vendors provide additional features to enable these types of queries to be executed, but they are not standardized, so what works on one RDBMS probably won't work on others.

Top n Percent

It is not practically possible, for instance, to get a list of the top 10 percent of customers who place the most orders.

Running Balances

It is impossible, in practical terms, to get a report containing a running balance using standard SQL. If you are not clear what a running balance is, it's like a bank statement that lists the payments in one column, receipts in a second column, and the balance, as modified by the receipts and payments, in a third or subsequent column.

Complex Arithmetic

Standard SQL provides basic arithmetic functions but does not support more complex functions. The different RDBMS vendors supply their own augmentations, but these vary. For instance, if it is required to raise a number to a power, in some systems the power has to be an integer, while in others it can be a decimal. Although data warehouses are used for the production of statistics, standard statistical formulas such as deviations and quartiles, as well as standard mathematical modeling techniques such as integral and differential calculus, are not available in SQL.

Variables

Variables cannot be included in a standard SQL query.
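To make the ranking difficulty concrete, here is the correlated-subquery workaround mentioned above, run against SQLite from Python (the table and column names are invented for the illustration). A customer's rank is one more than the count of customers with higher sales, which means the whole table is rescanned for every row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("Create Table CustomerSales (CustomerName, TotalSales)")
cur.executemany("Insert Into CustomerSales Values (?,?)", [
    ("Ada", 500), ("Bob", 300), ("Cy", 800), ("Di", 100),
])

# Rank customers by sales: for each row a, count the rows b that
# outsold it. Standard SQL, but O(n^2) and hardly self-explanatory.
ranked = cur.execute("""
    Select CustomerName, TotalSales,
           (Select Count(*) From CustomerSales b
            Where b.TotalSales > a.TotalSales) + 1 SalesRank
    From CustomerSales a
    Order By SalesRank
""").fetchall()
print(ranked)
```

This illustrates why, at the time the book was written, such queries were usually pushed into procedural code or vendor-specific extensions rather than standard SQL.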
Almost all of these, and other, deficiencies can be resolved by writing 3GL programs in languages such as C or COBOL with embedded SQL. Also, most RDBMS vendors provide a procedural extension to their standard SQL product to assist in resolving the problems. However, the standard interface between the products available at the presentation layer and the RDBMS is a standard called ODBC, which stands for open database connectivity. ODBC, like the more recent JDBC (Java database connectivity), is very useful because it has forced the industry to adopt a standard approach. It does not, at the time of this writing, support the procedural extensions that the RDBMS vendors have provided. It is worth noting that some of these issues are being tackled. We explore the future in Chapter 11.
SUMMARY

Data warehouses are a special type of database, built for the specific purpose of getting information out rather than putting data in, which is the purpose of most application databases. The emphasis is on supporting questions of a strategic nature, to assist the managers of organizations in planning for the future. A data warehouse is:

- Subject oriented
- Nonvolatile
- Integrated
- Time variant

Dimensional analysis is a technique used in identifying the requirements of a data warehouse, and this is often depicted using a star schema. The star schema identifies the facts and the dimensions of analysis. A fact is an attribute, such as sales value or call duration, which is analyzed across dimensions. Dimensions are things like customers and products over which the facts are analyzed. A typical query might be:

Show me the sales value of products by customer for this month and last month.

Time is always a dimension of analysis. The data warehouse is almost always kept separate from the application databases because:

1. Application databases are optimized to execute insert and update type queries, whereas data warehouses are optimized to execute select type queries.
2. Application databases are constantly changing, whereas data warehouses are quiet (nonvolatile).
3. Application databases have large and complex schemas, whereas data warehouses are simplified, often denormalized, structures.
4. Data warehouses need historical information, and this is usually missing from application databases.

There are five main components to a first-generation data warehouse:

1. Extraction of the source data from a variety of application databases. These source applications are often using very different technology.
2. Integration of the data. There are two types of integration. First there is format integration, where logically similar data types (e.g., dates) are converted so that they have the same physical data type. Second, there is semantic integration, so that the meaning of the information is consistent.
3. The database itself. The data warehouse database can become enormous as a new layer of fact data is added each day. The star schema is implemented as a series of tables. The fact table (the center of the star) is long and thin, in that it usually has a large number of rows and a small number of columns. The fact columns must be summable. The dimension tables (the points of the star) are joined to the fact table through foreign keys. Where a dimension participates in a hierarchy, the model is sometimes referred to as a snowflake.
4. Aggregate navigation. This is a technique that enables users to have their queries automatically directed at aggregate tables without being aware that it is happening. This is very important for query performance.
5. Presentation of information. This is how the information is presented to the users of the data warehouse. Most implementations opt for a client-server approach, which gives them the capability to view their information in a variety of tabular or graphical formats.

Data warehouses are also useful data sources for applications such as data mining: software products that scan large databases searching for patterns and reporting the results back to the users. We review products in Chapter 10.

There are some problems that have to be overcome, such as the use of time. Care has to be taken to ensure that the facts in the data warehouse are correctly reported with respect to time. We explore the problems surrounding time in Chapter 4.
Also, many of the queries that users typically like to ask of a data warehouse cannot easily be translated into standard SQL queries, and work-arounds have to be used, such as procedural programs with embedded SQL.
Chapter 3. Design Problems We Have to Face Up To

In this chapter, we will be reviewing the traditional approaches to designing data warehouses. During the review we will investigate whether or not these methods are still appropriate now that the business imperatives have been identified. We begin this chapter by picking up on the introduction to data warehousing.
DIMENSIONAL DATA MODELS

In Chapter 2, we introduced data warehousing and described, at a high level, how we might approach a design. The approach we adopted follows the style of some of the major luminaries in the development of data warehousing generally. This approach to design can be given the general description of dimensional. Star schemas and snowflake schemas are both examples of a dimensional data model. The dimensional approach was adopted in our introduction for the following reasons:

- Dimensional data models are easy to understand. Therefore, they provide an ideal introduction to the subject.
- They are unambiguous.
- They reflect the way that business people perceive their businesses.
- Most RDBMS products now provide direct support for dimensional models.
- Research shows that almost all the literature supports the dimensional approach.

Unfortunately, although the dimensional model is generally acclaimed, there are alternative approaches, and this has tended to result in “religious” wars (but no deaths as far as I know). Even within the dimensional model camp there are differences of opinion. Some people believe that a perfect star should be used in all cases, while others prefer to see the hierarchies in the dimensions and would tend to opt for a snowflake design.

Where deep-rooted preferences exist, it is not the purpose of this book to try to make “road to Damascus” style conversions by making nonbelievers “see the light.” Instead, it is intended to present some ideas and systematic arguments so that readers of this book can make their own architectural decisions based on a sound understanding of the facts of the matter. In any case, I believe there are far more serious design issues that we have to consider once we have overcome these apparent points of principle. Principles aside, we also have to consider any additional demands that customer relationship management might place on the data architecture.
A good objective for this chapter would be to devise a general high-level data architecture for data warehousing. In doing so, we'll discuss the following issues:
1. Dimensional versus third normal form (3NF) models
2. Stars versus snowflakes
3. What works for CRM
Dimensional Versus 3NF

There are two principal arguments in favor of dimensional models:

1. They are easy for business people to understand and use.
2. Retrieval performance is generally good.

The ease of understanding and ease of use is not in dispute. It is a fact that business people can understand and use dimensional models. Most business people can operate spreadsheets, and a dimensional model can be likened to a multidimensional spreadsheet. We'll explore this in Chapter 5 when we start to investigate the dot modeling methodology.

The issue surrounding performance is just as clear cut. The main RDBMS vendors have all tuned their query optimizers to recognize and execute dimensional queries more efficiently, so performance is bound to be good in most instances. Even so, where the dimension tables are massively large, as the customer dimension can be, joins between such tables and an even bigger fact table can be problematic. But this is not a problem that is peculiar to dimensional models. 3NF data structures are optimized for very quick insertion, update, and deletion of discrete data items. They are not optimized for massive extractions of data, and it is nonsensical to argue for a 3NF solution on the grounds of retrieval performance.
What Is Data Normalization?
Normalization is a process that aims to eliminate the unnecessary and uncontrolled duplication of data, often referred to as 'data redundancy'. A detailed examination of normalization is not within the scope of this book. However, a brief overview might be helpful (for more detail see Bruce, 1992, or Batini et al., 1992). Normalization enables data structures to be made to conform to a set of well-defined rules. There are several levels of normalization and these are referred to as first normal form (1NF), second normal form (2NF), third normal form (3NF), and so on. There are exceptions, such as Boyce-Codd normal form (BCNF), but we won't be covering these. Also, we won't explore 4NF and 5NF as, for most purposes, an understanding of the levels up to 3NF is sufficient. In relational theory there exists a rule called the entity integrity rule. This rule concerns the primary key of any given relation and assigns to the key the following two properties:
1. Uniqueness. This ensures that all the rows in a relational table can be uniquely identified.
2. Minimality. The key will consist of one or more attributes. This property ensures that the length of the key is no longer than is necessary to guarantee the first property, uniqueness.
Within any relation, there are dependencies between the key attributes and the nonkey attributes. Take the following Order relation as an example:

Order
- Order number (primary key)
- Item number (primary key)
- Order date
- Customer ID
- Product ID
- Product description
- Quantity

Dependencies can be expressed in the form of "determinant" rules, as follows:

1. The Order Number determines the Customer ID.
2. The Order Number and Item Number determine the Product ID.
3. The Order Number and Item Number determine the Product Description.
4. The Order Number and Item Number determine the Quantity.
5. The Order Number determines the Order Date.

Notice that some of the attributes are functionally dependent on the Order Number (part of the key), whereas others are functionally dependent on the combination of the Order Number and the Item Number (the entire key). Where the dependency exists on the entire key, the dependency is said to be a fully functional dependency. Where all the attributes have at least a functional dependency on the primary key, the relation is said to be in 1NF. This is the case in our example. Where all the nonkey attributes are fully functionally dependent on the primary key, the relation is said to be in 2NF. In order to change our relation to 2NF, we have to split some of the attributes into a separate relation, as follows:

Order
- Order number (primary key)
- Order date
- Customer ID

Order Item
- Order number (primary key)
- Item number (primary key)
- Product ID
- Product description
- Quantity

These relations are now in 2NF, since all the nonkey attributes are fully functionally dependent on their primary key. There is one dependency that we have not picked up so far: the Product ID determines the Product Description. This is known as a transitive dependency, because the Product ID itself can be determined by the combination of Order Number and Item Number (see dependency 2 above). In order for a relation to be classified as a 3NF relation, all transitive dependencies must be removed. So now we have three relations, all of which are in 3NF:

Order
- Order number (primary key)
- Order date
- Customer ID

Order Item
- Order number (primary key)
- Item number (primary key)
- Product ID
- Quantity

Product
- Product ID (primary key)
- Product description

There is one major advantage in a 3NF solution, and that is flexibility. Most operational systems are implemented somewhere between 2NF and 3NF, in that some tables will be in 2NF, whereas most will be in 3NF. This adherence to normalization tends to result in quite flexible data structures. We use the term flexible to describe a data structure that is easy to change should the need arise. The changing nature of business requirements has already been described and, therefore, it must be advantageous to implement a data model that is adaptive in the sense that it can change as the business requirements evolve over time. But what is the difference between dimensional and normalized? Let's have another look at the simple star schema for the Wine Club, from Figure 2.3, which we first produced in our introduction in Chapter 2 (see Figure 3.1).
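The three-relation decomposition just worked through can be sketched in SQL. This is a minimal illustration using SQLite from Python; the sample rows and the exact column names are invented for demonstration.

```python
import sqlite3

# Build the three 3NF relations from the Order example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id   TEXT PRIMARY KEY,
    description  TEXT NOT NULL
);
CREATE TABLE "order" (
    order_number INTEGER PRIMARY KEY,
    order_date   TEXT NOT NULL,
    customer_id  TEXT NOT NULL
);
CREATE TABLE order_item (
    order_number INTEGER REFERENCES "order"(order_number),
    item_number  INTEGER,
    product_id   TEXT REFERENCES product(product_id),
    quantity     INTEGER NOT NULL,
    PRIMARY KEY (order_number, item_number)  -- unique and minimal
);
""")

# The product description now lives in exactly one place: changing it in
# the product relation is reflected in every order item that cites it.
conn.execute("INSERT INTO product VALUES ('P1', 'Chateau Margaux')")
conn.execute('INSERT INTO "order" VALUES (1, \'2000-06-01\', \'C42\')')
conn.execute("INSERT INTO order_item VALUES (1, 1, 'P1', 6)")

row = conn.execute("""
    SELECT o.customer_id, p.description, oi.quantity
    FROM "order" o
    JOIN order_item oi ON oi.order_number = o.order_number
    JOIN product p     ON p.product_id = oi.product_id
""").fetchone()
print(row)  # ('C42', 'Chateau Margaux', 6)
```

The duplication that normalization removes is exactly what the transitive dependency rule identified: the description is stored once per product rather than once per order item.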
Figure 3.1. Star schema for the Wine Club.
Some of the attributes of the customer dimension are:

Customer Dimension
- Customer ID (primary key)
- Customer name
- Street address
- Town
- County
- Zip code
- Account manager ID
- Account manager name

This dimension is currently in 2NF because, although all the non-primary-key columns are fully functionally dependent on the primary key, there is a transitive dependency: the account manager name can also be determined from the account manager ID. So the 3NF version of the customer dimension above would look like this:
Customer Dimension
- Customer ID (primary key)
- Customer name
- Street address
- Town
- County
- Zip code
- Account manager ID

Account Manager Dimension
- Account manager ID (primary key)
- Account manager name

The diagram is shown in Figure 3.2.
Figure 3.2. Third normal form version of the Wine Club dimensional model.
So what we have done is convert a star into the beginnings of a snowflake and, in doing so, have started putting the model into 3NF. If this process is carried out thoroughly with all the dimensions, then we should have a complete dimensional model in 3NF. Well, not quite. We also need to look at the fact table. But first, let's go back to the original source of the Sales fact table: the Order and Order Item tables. They have the following attributes:

Order
- Order number (primary key)
- Customer ID
- Time

Order Item
- Order number (primary key)
- Order item (primary key)
- Wine ID
- Depot ID
- Quantity
- Value

Both these tables are in 3NF. If we were to collapse them into one table, called Sales, by joining them on the Order Number, then the resulting table would look like this:

Sales
- Order number (primary key, part 1)
- Order item (primary key, part 2)
- Customer ID
- Wine ID
- Time
- Depot ID
- Quantity
- Value

This table is now in 1NF because the Customer ID and Time, while being functionally dependent on the primary key, do not display the property of "full" functional dependency (i.e., they are not dependent on the Order Item). In our dimensional model, we have decided not to include the Order Number and Order Item details. If we remove them, is there another candidate key for the resulting table? The answer is, it depends! Look at this version of the Sales table:
Sales
- Customer ID (primary key, part 2)
- Wine ID (primary key, part 1)
- Time (primary key, part 3)
- Depot ID
- Quantity
- Value

The combination of Customer ID, Wine ID, and Time has emerged as a composite candidate key. Is this realistic? The answer is yes, it is. But it all depends on the Time: the granularity of time has to be sufficiently fine to ensure that the primary key has the property of uniqueness. Notice that we did not include the Depot ID as part of the primary key. The reason is that it does not add any further to the uniqueness of the key and, therefore, its inclusion would violate the other property of primary keys, that of minimality. The purpose of this treatise is not to try to convince you that all dimensional models are automatically 3NF models; rather, my intention is to show that it is erroneous to say that the choice is between dimensional models and 3NF models. The following sums up the discussion:
1. Star schemas are not usually in 3NF.
2. Snowflake schemas can be in 3NF.
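The candidate-key point above can be demonstrated concretely. A minimal SQLite sketch (invented sample data) shows what happens when the grain of time is too coarse for the (Customer ID, Wine ID, Time) key to remain unique:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Sales fact table keyed on (customer, wine, time). Depot ID is left out
# of the key because it adds nothing to uniqueness (minimality).
conn.execute("""
CREATE TABLE sales (
    customer_id TEXT,
    wine_id     TEXT,
    sale_time   TEXT,   -- the granularity decides whether this key works
    depot_id    TEXT,
    quantity    INTEGER,
    value       REAL,
    PRIMARY KEY (customer_id, wine_id, sale_time)
)""")

conn.execute("INSERT INTO sales VALUES ('C1', 'W9', '2000-06-01', 'D1', 6, 54.0)")
try:
    # The same customer buys the same wine again on the same day: at
    # day-level granularity the candidate key fails.
    conn.execute("INSERT INTO sales VALUES ('C1', 'W9', '2000-06-01', 'D2', 3, 27.0)")
    collided = False
except sqlite3.IntegrityError:
    collided = True
print(collided)  # True: the grain of time was too coarse
```

Recording the time down to, say, the second (or retaining the order number) would restore uniqueness; the design decision is about the grain, not the key mechanism.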
Stars and Snowflakes

The second religious war is being fought inside the dimensional model camp: the argument about star schemas versus snowflake schemas. Kimball (1996) proscribes the use of snowflake schemas for two reasons. The first is the effect on performance that has already been described. The second is that users might be intimidated by complex hierarchies. My experience has shown his assertion, that users and business people are uncomfortable with hierarchies, to be quite untrue. My experience is, in fact, the opposite. Most business people are very aware of hierarchies and are confused when you leave them out or try to flatten them into a single level. Kimball (1996) uses the hierarchy in Figure 3.3 as his example.
Figure 3.3. Confusing and intimidating hierarchy.
It is possible that it is this six-layer example in Figure 3.3 that is confusing and intimidating, rather than the principle. In practice, hierarchies involving so many layers are almost unheard of. Far more common are hierarchies like the one reproduced from the Wine Club case study, shown in Figure 3.4.
Figure 3.4. Common organizational hierarchy.
These hierarchies are very natural to managers because, in real-life scenarios, customers are organized into geographic areas or market segments, and the managers' desire is to be able to ask questions about the business performance of this type of segmentation. Similarly, managers are quite used to comparing the performance of one department against other departments, or one product range against other product ranges. The whole of the business world is organized in a hierarchical fashion. We live in a structured world. It is when we try to remove it, or flatten it, that business people become perplexed. In any case, if we present the users of our data warehouse with a star schema, all we have done, in most instances, is precisely that: we have flattened the hierarchy in a kind of "denormalization" process. So I would offer a counter principle with respect to snowflake schemas: there is no such thing as a true star schema in the eyes of business people. They expect to see the hierarchies.

Where does this get us? Well, in the dimensional versus 3NF debate, I suspect a certain amount of reading-between-the-lines interpretation is necessary, and that what the 3NF camp is really shooting for is the retention of the online transaction processing (OLTP) schema in preference to a dimensional schema. The reason for this is that the OLTP model will more accurately reflect the underlying business processes and is, in theory at least, more flexible and adaptable to change. While this sounds like a great idea, the introduction of history usually makes it impossible to do. This is part of a major subject that we'll be covering in detail in Chapter 4. It is worth noting that all online analytical processing (OLAP) products also implement a dimensional data model. Therefore, the terms OLAP and dimensional are synonymous.
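The point about hierarchies being natural to query can be sketched with a tiny snowflaked customer hierarchy. The table and attribute names here are invented for illustration; the query simply walks up the hierarchy to answer a region-level question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region     (region_id TEXT PRIMARY KEY, region_name TEXT);
CREATE TABLE sales_area (area_id   TEXT PRIMARY KEY, area_name TEXT,
                         region_id TEXT REFERENCES region(region_id));
CREATE TABLE customer   (customer_id TEXT PRIMARY KEY, name TEXT,
                         area_id TEXT REFERENCES sales_area(area_id));
CREATE TABLE sales      (customer_id TEXT, value REAL);
""")
conn.executemany("INSERT INTO region VALUES (?,?)",
                 [("R1", "North"), ("R2", "South")])
conn.executemany("INSERT INTO sales_area VALUES (?,?,?)",
                 [("A1", "Leeds", "R1"), ("A2", "Bristol", "R2")])
conn.executemany("INSERT INTO customer VALUES (?,?,?)",
                 [("C1", "Lucie", "A1"), ("C2", "Tom", "A2")])
conn.executemany("INSERT INTO sales VALUES (?,?)",
                 [("C1", 100.0), ("C2", 40.0), ("C1", 60.0)])

# A manager's question at the region level: the snowflake keeps the
# hierarchy explicit, and the query walks up it with plain joins.
rows = conn.execute("""
    SELECT r.region_name, SUM(s.value)
    FROM sales s
    JOIN customer c   ON c.customer_id = s.customer_id
    JOIN sales_area a ON a.area_id = c.area_id
    JOIN region r     ON r.region_id = a.region_id
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
print(rows)  # [('North', 160.0), ('South', 40.0)]
```

A flattened star would fold area_name and region_name into the customer dimension; the query gets one join shorter, but the hierarchy disappears from the model.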
This brings us neatly onto the subject of data marts. The question "When is it a data warehouse and when is it a data mart?" has also been the subject of much controversy. Often, it is the commercial interests of the software vendors that carry the greatest influence. The view is held, by some, that the data warehouse is the big, perhaps enterprise-wide, repository and that data marts are much smaller, maybe departmental, extractions from the warehouse that the users get to analyze.

By the way, this is very closely associated with the previous discussion on 3NF versus dimensional models. Even the most enthusiastic supporter of the 3NF/OLTP approach is prepared to recognize the value that dimensional models bring to the party when it comes to OLAP. In a data warehouse that has an OLAP component, it is that OLAP component that the users actually get to use. Sometimes it is the only part of the warehouse to which they have direct access. This means that the part of the warehouse that the users actually use is dimensional, irrespective of the underlying data model. In a data warehouse that implements a dimensional model, the part that the users actually use is, obviously, also dimensional. Therefore, everyone appears to agree that the part that the users have access to should be dimensional. So it appears that the only thing that separates the two camps is some part of the data warehouse that the users do not have access to.

Returning to the issue about data marts, we decline to offer a definition on the basis that a subset of a data warehouse is still a data warehouse. There is a discussion on OLAP products generally in Chapter 10.
WHAT WORKS FOR CRM

Data warehousing is now a mature business solution. However, the evolution of business requires the evolution of data warehouses. That business people have to grasp the CRM nettle is an absolute fact, and in order to do this, information is the key. It is fair to say that an organization cannot be successful at CRM without high-quality, timely, and accurate information. You cannot determine the value of a customer without information. You cannot personalize the message to your customer without information, and you cannot assess the risk of losing a customer without information. If you want to obtain such information, then you really do need a data warehouse.

In order to adopt a personalized marketing approach, we have to know as much as we can about our customers' circumstances and behavior. We described the difference between circumstances and behavior in the section on market segmentation in Chapter 1. The capability to accurately segment our customers is one of the important properties of a data warehouse that is designed to support a CRM strategy. Therefore, the distinction between circumstances and behavior, two very different types of data, is crucial in the design of the data warehouse. Let's look at the components of a "traditional" data warehouse to try to determine how the two different types of data are treated. The diagram in Figure 3.5 is our now familiar Wine Club example.
Figure 3.5. Star schema for the Wine Club.
It remains in its star schema form for the purposes of this examination, but we could just as easily be reviewing a snowflake model.
The first two questions we have to ask are whether or not it contains information about:
1. Customers' behavior
2. Customers' circumstances

Clearly it does. The Sales table (the Fact table) contains details of sales made to customers. This is behavioral information, and it is a characteristic of dimensional data warehouses and data marts that the Fact table contains behavioral information. Sales is a good example, probably the most common, but there are plenty more from all industries:

- Telephone call usage
- Shipments and deliveries
- Insurance premiums and claims
- Hotel stays
- Aircraft flight bookings

These are all examples of the subject of a dimensional model, and they are all behavioral. The customer dimension is the only place where we keep information about customer circumstances. According to Ralph Kimball (1996), the principal purpose of the customer dimension, as with all dimensions in a dimensional model, is to enable constraints to be placed on queries that are run against the Fact table. The dimensions merely provide a convenient way of grouping the facts and appear as row headers in the user's result set. We need to be able to slice and dice the Fact table data "any which way." A solution based on the dimensional model is absolutely ideal for this purpose. It is simply made for slicing and dicing.

Returning to the terms behavior and circumstances, a dimensional model can be described as behavior centric, because its principal purpose is to enable the easy and comprehensive analysis of behavioral data. It is possible to make a physical link between Fact tables by the use of a common dimension table, as the diagram in Figure 3.6 shows.
Figure 3.6. Sharing information.
This “daisy chain” effect enables us to “drill across” from one star schema to another. This common dimension is sometimes referred to as a conformed dimension. We have seen previously how the first-generation data warehouses tended to focus on the analysis of behavioral information. Well, the second generation needs to support big business issues such as CRM and, in order to do this effectively, we have to be able to focus not only on behavior, but circumstances as well.
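Drilling across via a conformed dimension can be sketched in a few lines. This is a minimal SQLite illustration with invented table names: each fact table is aggregated to the shared customer grain first, and the aggregates are then joined through the common dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer   (customer_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE wine_sales (customer_id TEXT, value REAL);
CREATE TABLE trip_sales (customer_id TEXT, value REAL);
""")
conn.execute("INSERT INTO customer VALUES ('C1', 'Lucie Jones')")
conn.executemany("INSERT INTO wine_sales VALUES (?,?)",
                 [("C1", 80.0), ("C1", 20.0)])
conn.execute("INSERT INTO trip_sales VALUES ('C1', 250.0)")

# "Drill across": aggregate each fact table to the conformed customer
# grain, then join the results through the shared dimension.
row = conn.execute("""
    SELECT c.name, w.total, t.total
    FROM customer c
    JOIN (SELECT customer_id, SUM(value) AS total
          FROM wine_sales GROUP BY customer_id) w USING (customer_id)
    JOIN (SELECT customer_id, SUM(value) AS total
          FROM trip_sales GROUP BY customer_id) t USING (customer_id)
""").fetchone()
print(row)  # ('Lucie Jones', 100.0, 250.0)
```

Note that the customer table here still plays the role of a dimension: it links behavioral facts together, which is exactly the behavior-centric usage the text goes on to contrast with a customer-centric model.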
Customer Behavior and Customer Circumstances: The Cause-and-Effect Principle

We have explored the difference between customers' circumstances and their behavior, but why is it important? Most of the time in data warehousing, we have been analyzing behavior. The Fact table in a traditional dimensional schema usually contains information about a customer's interaction with our business, that is, the way they behave toward us. In the Wine Club example we have been using, the Fact table contained information about sales. This, as has been shown, is the normal approach toward the development of data warehouses.

Now let us look again at one of the most pressing business problems, that of customer loyalty and its direct consequence, that of customer churn. For the moment, let us put ourselves in the place of a customer of a cellular phone company and think of some reasons why we, as a customer, may decide that we no longer wish to remain a customer of this company:

- Perhaps we have recently moved to a different area, and the new area has poor reception for this particular company.
- We might have moved to a new employer and have been given a mobile phone as part of the deal, making the old one surplus to requirements.
- We could have a child just starting out at college. The costs involved might require economies to be made elsewhere, and the mobile phone could be the luxury we can do without.
Each of the above situations could be the cause for us, as customers, to appear in next month's churn statistics for this cellular phone company. It would be really neat if the phone company could have predicted that we are a high-risk customer. The only way to do that is to analyze the information that we have gathered and apply some kind of predictive model to the data that yields a score from, say, 1 for a very low-risk customer to 10 for a very high-risk customer.

But what type of information is likely to give us the best indication of a customer's propensity to churn? Remember that, traditionally, data warehouses tend to be organized around behavioral systems. In a mobile telephone company, the most commonly used behavioral information is the call usage. Call usage provides information about:

- Types of calls made (local, long distance, collect, etc.)
- Durations of calls
- Amount charged for the call
- Time of day
- Call destinations
- Call distances

If we analyze the behavior of customers in these situations, what do you think we will find? I think we can safely predict that, just before the customer churned, they stopped making telephone calls! The abrupt change in behavior is the effect of the change in circumstances.

The cause-and-effect principle can be applied quite elegantly to the serious problem of customer churn and, therefore, customer loyalty. What we are seeing when we analyze behavior is the effect of some change in the customer's circumstances. The change in circumstances, either directly or indirectly, is the cause of their churning. If we analyze their behavior, it is simply going to tell us something that we already know and is blindingly obvious: the customer stopped using the phone. By this time it is usually far too late to do anything about it. In view of the fact that most dimensional data warehouses measure behavior, it seems reasonable to conclude that such models may not be much help in predicting which customers we are at risk of losing.
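As a deliberately naive sketch of the scoring idea: score churn risk from changes in circumstances rather than from call behavior. The attribute names, weights, and the mapping onto the 1-to-10 scale below are all invented for illustration; a real model would be fitted from historical data.

```python
def churn_risk(circumstance_changes):
    """Map a list of circumstance changes onto a 1 (low) .. 10 (high) risk score.
    Weights are hypothetical, for illustration only."""
    weights = {
        "moved_to_poor_coverage_area": 4,
        "employer_provides_phone": 5,
        "new_dependent_in_college": 3,
    }
    raw = sum(weights.get(change, 0) for change in circumstance_changes)
    return max(1, min(10, 1 + raw))  # clamp onto the 1..10 scale

print(churn_risk([]))                           # 1
print(churn_risk(["employer_provides_phone"]))  # 6
print(churn_risk(["moved_to_poor_coverage_area",
                  "employer_provides_phone",
                  "new_dependent_in_college"])) # 10
```

The point of the sketch is its input, not its arithmetic: the scorer consumes circumstance changes, which is exactly the data a behavior-centric warehouse fails to keep.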
We need to turn our attention to being very much more rigorous in our approach to tracking changes in circumstances, rather than behavior. Thus, the second-generation data warehouses that are being built as an aid to the development of CRM applications need to be able to model more than just behavior. So instead of being behavior centric, perhaps they should be dimension centric or even circumstances centric. The preferred term is customer centric. Our second-generation data warehouses will be classified as customer centric. Does this mean that we abandon behavioral information? Absolutely not! It's just that we need to
switch the emphasis so that some types of information that are absolutely critical to a successful CRM strategy are more accessible. So what does this mean for the great star schema debate? Well, all dimensional schemas are, in principle, behavioral in nature. In order to develop a customer-centric model, we have to use a different approach.

If we are to build a customer-centric model, then it makes sense to start with a model of the customer. We know that we have two major information types: behavior and circumstances. For the moment, let's focus on the circumstances. Some of the kinds of things we might like to record about customers are:

Customer
- Name
- Address
- Telephone number
- Date of birth
- Sex
- Marital status

Of course, there are many, many more pieces of information that we could hold (check out Appendix D to see quite a comprehensive selection), but this little list is sufficient for the sake of example. At first sight, we might decide that we need a customer dimension as shown in Figure 3.7.
Figure 3.7. General model for customer details.
The customer dimension in Figure 3.7 would have some kind of customer identifier and a set of attributes like those listed above. But that won't give us what we want. In order to implement a data warehouse that supports CRM, one of the things our users must be able to do is analyze, measure, and classify the effect of changes in a customer's circumstances. As far as we, the data architects, are concerned, a change in circumstances simply means a change in the value of some attribute. But, ignoring error corrections, not all attributes are subject to change as part of the ordinary course of business. Some attributes change and some don't. Even if an attribute does change, it does not necessarily mean that the change is of any real interest to our business. There is a business issue to be resolved here.
We can illustrate these points if we look a little more closely at the simple list of attributes above. Ignoring error corrections, which are the attributes that can change? Well, in theory at least, with the exception of the date of birth, they can all change. Now, there are two types of change that we are interested in:
1. Changes where we need to be able to see the previous values of the attribute, as well as the new value
2. Changes where the previous values of the attribute can be lost

What we have to do is group the attributes into these two different types. So we end up with a model with two entities, like the one in Figure 3.8.
Figure 3.8. General model for a customer with changing circumstances.
We are starting to build a general conceptual model for customers. For each customer, we have a set of attributes that can change as well as a set of attributes for which either they cannot change or, if they do, we do not need to know the previous values. Notice that the relationship has a cardinality of one to many. Please note this is not meant to show that there are many attributes that can change; it actually means that each attribute can change many times. For instance, a customer's address can change quite frequently over time. In the Wine Club, the name, telephone number, date of birth, and sex are customer attributes where the business feels that either the attributes cannot change or the old values can be lost. This means that the address and marital status are attributes where the previous values should be preserved. So, using the example, the model should look as shown in Figure 3.9.
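The one-to-many relationship for changing circumstances can be sketched as a pair of tables: losable attributes stay on the customer row, while each preservable attribute group gets a dated history table. This is a minimal SQLite illustration; the column names and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Attributes whose old values can be lost stay on the customer row;
-- attributes whose history must be preserved get a dated history table.
CREATE TABLE customer (
    customer_id   TEXT PRIMARY KEY,
    name          TEXT,
    telephone     TEXT,
    date_of_birth TEXT,
    sex           TEXT
);
CREATE TABLE customer_address (
    customer_id    TEXT REFERENCES customer(customer_id),
    address        TEXT,
    effective_from TEXT,
    PRIMARY KEY (customer_id, effective_from)
);
""")
conn.execute("INSERT INTO customer VALUES "
             "('C1', 'Lucie Jones', '555-0101', '1970-01-01', 'F')")
conn.executemany("INSERT INTO customer_address VALUES (?,?,?)",
                 [("C1", "12 High St, Leeds",  "1998-03-01"),
                  ("C1", "4 Vine Rd, Bristol", "2000-06-15")])

# Every change of circumstances is kept, so we can see the address a
# customer had at any point in time.
history = [a for (a,) in conn.execute("""
    SELECT address FROM customer_address
    WHERE customer_id = 'C1' ORDER BY effective_from""")]
print(history)  # ['12 High St, Leeds', '4 Vine Rd, Bristol']
```

A marital-status history table would follow the same pattern, giving one history table per group of preservable attributes, which is the one-to-many cardinality the model shows.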
Figure 3.9. Example model showing customer with changing circumstances.
So each customer can have many changes of address and marital status, over time. Now, the other main type of data that we need to capture about customers is their behavior. As we have discussed previously, the behavioral information comes from the customers' interaction with our organization. The conceptual general model that we are trying to develop must include behavioral information. It is shown in Figure 3.10.
Figure 3.10. The general model extended to include behavior.
Again the relationship between customers and their behavior is intended to show that there are many behavioral instances over time. The actual model for the Wine Club would look something like the diagram in Figure 3.11.
Figure 3.11. The example model extended to include behavior.
Each of the behavioral entities (wine sales, accessories, and trips) would probably have been modeled previously as part of individual subject areas in separate star or snowflake schemas. In our new model, guess what? They still could be! Nothing we have done so far means that we can't use some dimensional elements if we want to and, more importantly, if we can get the answers we need.

Some sharp-eyed readers at this point might be tempted into thinking, "Just hold on a second, what you're proposing for the customer is just some glorified form of common (or conformed) dimension, right?" Well, no. There is, of course, some resemblance in this model to the common dimension model that was described earlier. But remember this: the purpose of a dimension, principally, is to constrain queries against the fact table. The main purpose of a common dimension is to provide a drill-across facility from one fact table to another. These are still behavior-centric models. That is not the same thing at all as a model that is designed to be inherently customer centric. The emphasis has shifted away from behavior, and more value is attached to the customer's personal circumstances. This enables us to classify our customers into useful and relevant segments. The difference might seem quite subtle, but it is, nevertheless, significant.

Our general model for a customer-centric data warehouse looks very simple: just three main entity types. Is it complete? Not quite. Remember that there were three main types of customer segmentation. The first two were based on circumstances and behavior. We have discussed these now at some length. The third type of segment was referred to as a derived segment. Examples of derived segments are
things like “estimated life time value” and “propensity to churn.” Typically, the inclusion and classification of a customer in these segments is determined by some calculation process such as predictive modeling. We would not normally assign a customer a classification in a derived segment merely by assessing the value of some attribute. It is sensible, therefore, to modify our general model to incorporate derived segments, as shown in Figure 3.12.
Figure 3.12. General conceptual model for a customer-centric data warehouse.
So this is it. The diagram in Figure 3.12 is the boiled-down general model for a customer-centric data warehouse. In theory, it should be able to answer almost any question we might care to throw at it. I say "in theory" because, in reality, the model will be far more complex than this. We will need to be able to cope with customers whose classification changes. For example, we might have a derived segment called "lifetime value" where every customer is allocated an indicator with a value from, say, 1 to 20. Now, Lucie Jones might have a lifetime value indicator of "9." But when Lucie's salary increases, she might be allocated a lifetime value indicator of "10." It might be useful to some companies to invent a new segment called, say, "increasing lifetime values." This being the case, we may need to track Lucie's lifetime value indicator over time. When we bring time into our segmentation processes, the possibilities become endless. However, the introduction of time also brings with it some very difficult problems, and these will be discussed in the next chapter.

Our model can be described as a general conceptual model (GCM) for a customer-centric data warehouse. The GCM provides us with a template from which all our actual conceptual models can be derived in the future. While we are on the subject of conceptual models, I firmly believe it is high time that we reintroduce the conceptual, logical, and physical data model trilogy into our design process.
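The Lucie Jones example of tracking a derived segment over time can be sketched as a dated segment table. Table and segment names are invented for illustration; the key point is that each recomputed score is stored with the date it was assigned rather than overwriting the previous one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Derived segments are computed scores, not raw attributes, so each
# recomputation is stored alongside the date it was assigned.
conn.execute("""
CREATE TABLE derived_segment (
    customer_id TEXT,
    segment     TEXT,
    score       INTEGER,
    assigned_on TEXT,
    PRIMARY KEY (customer_id, segment, assigned_on)
)""")
conn.executemany("INSERT INTO derived_segment VALUES (?,?,?,?)",
                 [("C1", "lifetime_value", 9,  "2000-01-01"),
                  ("C1", "lifetime_value", 10, "2000-07-01")])

# An "increasing lifetime values" segment falls out of comparing each
# customer's latest score with an earlier one.
first, last = [s for (s,) in conn.execute("""
    SELECT score FROM derived_segment
    WHERE customer_id = 'C1' AND segment = 'lifetime_value'
    ORDER BY assigned_on""")]
print(last > first)  # True: Lucie qualifies for the new segment
```

Without the assigned_on column, the 9 would have been overwritten by the 10 and the "increasing" question would be unanswerable, which is exactly the time problem the next chapter takes up.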
Whatever Happened to the Conceptual/Logical/Physical Trilogy?

In the old days there was a tradition of using a three-stage process for designing a database. The first model to be developed was called the conceptual data model, and it was usually represented by an entity relationship diagram (ERD). The purpose of the ERD was to provide an
abstraction that represented the data requirements of the organization. Most people with any experience of database design will be familiar with the ERD approach. One major characteristic of the conceptual model is that it should be capable of being implemented using any type of database. In the 1970s, the relational database was not the most widely used type of DBMS. In those days, the databases tended to be:
1. Hierarchical databases
2. Network databases

The idea was that the DBMS ultimately deployed should not have any influence over the way in which the requirements were expressed. So the conceptual data model should not imply the technology to be used in implementing the solution.

Once the DBMS had been chosen, a logical model would be produced. The logical model was normally expressed as a schema in textual form. So, for instance, where the solution was to be implemented using a relational database, a relational schema would be produced. This consisted of a set of relations, the relationships expressed as foreign key constraints, and a set of domains from which the attributes of the relations would draw their values.

The physical data model consisted of the data definition language (DDL) statements needed to actually build, in a relational environment, the tables, indexes, and constraints. It is sometimes referred to as the implementation model. One of the strengths of the trilogy was that decisions relating to the logical and physical design of the database could be taken and implemented without affecting the abstract model that reflected the business requirements.

The astonishing dominance of relational databases since the mid-1980s has led, in practice, to a blurring of the boundaries between the three models, and it is not uncommon nowadays for a single model to be built, again in the form of an ERD. This ERD is then converted straight into a set of tables in the database; the conceptual model, logical model, and physical model are treated as the same thing. This means that any changes made to the design for, say, performance-enhancing reasons are reflected in the conceptual model as well as the physical model. The inescapable conclusion is that the business requirements are being changed for performance reasons.
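As a hedged illustration of the logical-to-physical step (the table and column names here are invented, not taken from the book), the physical model is simply the DDL that realizes the relational schema. The sketch below uses SQLite via Python purely to make the example self-contained and runnable:

```python
import sqlite3

# Physical model: the DDL that builds the tables, constraints, and indexes
# described by a (hypothetical) relational logical model.
ddl = """
CREATE TABLE department (
    dept_code  TEXT PRIMARY KEY,
    dept_name  TEXT NOT NULL
);
CREATE TABLE sales_person (
    sales_id   INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    dept_id    TEXT REFERENCES department(dept_code)  -- foreign key constraint
);
CREATE INDEX idx_sales_person_dept ON sales_person(dept_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['department', 'sales_person']
```

The index is a purely physical decision: dropping or adding it changes nothing in the conceptual or logical model, which is exactly the separation the trilogy was meant to preserve.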
Under normal circumstances, in OLTP-type databases for instance, we might be able to debate the pros and cons of this approach, because the business users never get near the data model and it is of no interest to them. They can, therefore, be shielded from it. But data warehouses are different. There can be no debate; the users absolutely have to understand the data in the data warehouse, at least the part of it that they use. For this reason, the conceptual data model, or something that can fulfill its intended role, must be reintroduced as a necessary part of the development lifecycle of data warehouses.

There is another reason why we need to reinvent the conceptual data model for data warehouse development. As we observed earlier, in the past 15 years the relational database has emerged as a
de facto standard in most business applications. However, to use the now well-worn phrase, data warehouses are different. Many of the OLAP products are nonrelational, and their logical and physical manifestations are entirely different from the relational model. So the old reasons for having a three-tier approach have returned, and we should respond to this.
The Conceptual Model and the Wine Club

Now that we have the GCM, we can apply its principles to our case study, the Wine Club. We start by defining the information about the customer that we want to store. In the Wine Club we have the following customer attributes:

Customer Information
Title
Name
Address
Telephone number
Date of birth
Sex
Marital status
Children's details
Spouse details
Income
Hobbies and interests
Trade or profession
Employers' details

The attributes have to be divided into two types:
1. Attributes that are relatively static (or where previous values can be lost)
2. Attributes that are subject to change

Customer's Static Information
Title
Name
Telephone number
Date of birth
Sex

Customer's Changing Information
Address
Marital status
Children's details
Spouse details
Income
Hobbies and interests
Trade or profession
Employers' details

We now construct a model like the one in Figure 3.13.
Figure 3.13. Wine Club customer changing circumstances.
This represents the customer static and changing circumstances. The behavior model is shown in Figure 3.14.
Figure 3.14. Wine Club customer behavior.
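One plausible way to realize the static/changing split in tables (a sketch only; the schema and sample values are invented, and the physical design is dealt with properly in Chapter 7) is to hold the static details once per customer and the changing circumstances as one row per change, stamped with the date from which the values applied:

```python
import sqlite3

# Static customer details in one table; changing circumstances in another,
# one row per change, keyed by customer and effective date.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    title TEXT, name TEXT, telephone TEXT, date_of_birth TEXT, sex TEXT
);
CREATE TABLE customer_circumstances (
    customer_id INTEGER REFERENCES customer(customer_id),
    effective_from TEXT,          -- when these circumstances began to apply
    address TEXT, marital_status TEXT, income REAL,
    PRIMARY KEY (customer_id, effective_from)
);
""")
conn.execute("INSERT INTO customer VALUES (1,'Ms','Lucie Jones','555-0100','1970-03-01','F')")
conn.execute("INSERT INTO customer_circumstances VALUES (1,'1999-01-01','12 Vine St','Single',30000)")
conn.execute("INSERT INTO customer_circumstances VALUES (1,'2000-06-01','34 Oak Ave','Married',38000)")

# The full history of changing circumstances is preserved:
rows = conn.execute("""SELECT effective_from, address FROM customer_circumstances
                       WHERE customer_id = 1 ORDER BY effective_from""").fetchall()
print(rows)
```

An update to a changing attribute becomes an insert of a new circumstances row, so no previous value is ever lost.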
Now, thinking about derived segments, these are likely to be very dynamic, in the sense that some derived segments will change over time, some will remain fairly static, and others still will appear for a short while and then disappear. Some examples, as they apply to the Wine Club, are:

Lifetime value. This is a great way of classifying customers, and every organization should try to do this. It is an example of a fairly static form of segmentation. We would not expect dramatic changes to customers' positions here. It would be good to know which customers are on the “generally increasing” and the “generally decreasing” scale.

Recently churned. This is an example of a dynamic classification that will be constantly changing. The customers we lose who had good lifetime value classifications would appear in our “win back” derived segment.

Special promotions. These can be good examples where a kind of “one-off” segment can be used effectively. In the Wine Club there would be occasions when, for instance, it needs to sell off a particular product quickly. The requirement would be to determine the customers most likely to buy the product. This would involve examination of previous behavior as well as circumstances (e.g., income category in the case of an expensive wine). The point is that this is a “use once” segment.

Using the three examples above, our derived segments model looks as shown in Figure 3.15.
Figure 3.15. Derived segment examples for the Wine Club.
There is a design issue with segments generally, and that is their dynamic nature. The marketing organization will constantly want to introduce new segments. Many of them will be of the fairly static and dynamic types that will have long lives in the data warehouse. What we don't want is for the data warehouse administrator to have to get involved in the creation of new tables each time a new classification is invented. That would result in frequent changes to the data warehouse structure, would draw the marketing people into complex change control procedures, and might ultimately stifle creativity. So we need a way of allowing the marketing people to add new derived segments without involving the database administrators too much. Sure, they might need help in expressing the selection criteria, but we don't want to put too many obstacles in their path. This issue will be explored in more detail in Chapter 7, the physical model.
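One possible design that meets this requirement (a sketch, not the book's prescribed solution, which Chapter 7 covers) is a generic pair of tables: one listing the derived segments themselves and one recording segment membership. Marketing can then invent a new segment by inserting rows rather than asking for new tables:

```python
import sqlite3

# Generic segment design: new segments are data, not DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE derived_segment (
    segment_id   INTEGER PRIMARY KEY,
    segment_name TEXT NOT NULL
);
CREATE TABLE segment_membership (
    segment_id  INTEGER REFERENCES derived_segment(segment_id),
    customer_id INTEGER,
    PRIMARY KEY (segment_id, customer_id)
);
""")
# Marketing invents a new segment: just insert rows, no schema change needed.
conn.execute("INSERT INTO derived_segment VALUES (1, 'Recently churned')")
conn.executemany("INSERT INTO segment_membership VALUES (1, ?)", [(101,), (102,)])

count = conn.execute("""SELECT COUNT(*) FROM segment_membership m
                        JOIN derived_segment s ON s.segment_id = m.segment_id
                        WHERE s.segment_name = 'Recently churned'""").fetchone()[0]
print(count)  # 2
```

The selection criteria that populate the membership table would still be written with help from the database team, but the warehouse structure itself never changes.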
SUMMARY

I have a theory, which is that data warehousing has always been about customer relationships. It's just that previously we didn't entirely make the connection because, in the early days, CRM had not really been recognized as any kind of discipline at the systems level. The advent of CRM has put the spotlight firmly back on data warehousing. Data warehouses provide the technology that enables us to perform customer relationship analysis. The management of the relationship is the value that is added by the business. This is the decision-making part. The warehouse is doing what it has always done: providing decision support. That is why this book is about supporting customer relationship management.

In this chapter we have looked at some of the design issues and tried to quantify and rationalize some aspects of data warehouse design that have been controversial in the past. There are good reasons for using dimensional schemas, but there are cases where they can work against us. The best solution for CRM is to use dimensional models where they absolutely do add value, in modeling customers' behavior, but to use more “conventional” approaches when modeling customers' circumstances.

Toward the end of the chapter we developed a general conceptual model (Figure 3.12) for a customer-centric data warehouse. We will develop this model later in the book by applying some of the needs of our case study, the Wine Club. First, however, the awfully thorny subject of time has to be examined.
Chapter 4. The Implications of Time in Data Warehousing

The principal subject of this book is the design of data warehouses. One of the least well understood issues surrounding data warehouse design is the treatment and representation of time. This chapter introduces the characteristics of time and the way that it is used in data warehousing applications. The chapter goes on to describe fully the problems encountered with time. We need to introduce some rigor into the way that time is treated in data warehousing, and this chapter lays the groundwork to enable that to be achieved. We also examine the more prominent solutions that other major practitioners have proposed in the past and that have been used ubiquitously in first-generation data warehouses. We will see that some issues arise when these methods are adopted.

The presence of time, and the dependence upon it, is one of the things that sets data warehouse applications apart from traditional operational systems. Most business applications are suited to operating in the present, where time does not require special treatment. In many cases, dates are no more than descriptive attributes. In a data warehouse, time affects the very structure of the system. The temporal requirements of a data warehouse are very different from those of an operational system, yet it is the operational system that feeds information about changed data to the data warehouse.

In a temporal database management system, support for time would be implicit within the DBMS, and the query language would contain time-specific functions to simplify the manipulation of time. Until such systems are generally available, the data warehouse database has to be designed to take account of time. The support for time has to be explicitly built into the table structures and the queries.
THE ROLE OF TIME

In data warehousing the addition of time enables historical data to be held and queried upon. This means that users of data warehouses can view aspects of their enterprise at any specific point, or over any period, of time for which the historical data is recorded. This enables the observation of patterns of behavior over time so that we can make comparisons between similar or dissimilar periods, for example, this year versus last year, or seasonal trends. Armed with this information, we can extrapolate with the use of predictive models to assist us with planning and forecasting. We are, in effect, using the past to attempt to predict the future:

If men could learn from history, what lessons it might teach us! But passion and party blind our eyes, and the light which experience gives is a lantern on the stern, which shines only on the waves behind us!

—(Coleridge, 1835)

Despite this gloomy warning from the nineteenth century, the use of information from past events and trends is commonplace in economic forecasting, social trend forecasting, and even weather forecasting. The value and importance of historical data are generally recognized. It has been observed that the ability to store historical data is one of the main advantages of data warehousing and that the absence of historical data in operational systems is one of the motivating factors in the development of data warehouses.

Some people argue that most operational systems do keep a limited amount of history, about 60–90 days. In fact, this is not really the case, because the data held at any one time in, say, an order processing system will be orders whose lifecycle has not been completed to the extent that the order can be removed from the system. This means that it may take, on average, 60–90 days for an order to pass through all its states from inception to deletion.
Therefore, at any one time, some of the orders may be up to 90 days old with a status of “invoiced,” while others will be younger, with different status such as “new,” “delivered,” “back ordered,” etc. This is not the same as history in our sense.
Valid Time and Transaction Time

Throughout this chapter and the remainder of the book, we will make frequent reference to the valid times (Jensen et al., 1994) and transaction times of data warehouse records. These two times are defined in the field of temporal database research and have quite precise meanings, which are now explained.

The valid time associated with the value of, say, an attribute is the time when the value is true in modeled reality. For instance, the valid time of an order is the time that the order was taken. Such
values may be associated with:
1. A single instant. Defined to be a time point on an underlying time axis. An event is defined as an instantaneous fact that occurs at an instant.
2. Intervals (periods) of time. Defined to be the time between two instants.

The valid time is normally supplied by the user, although in some cases, such as telephone calls, the valid time can be provided by the recording equipment.

The transaction time associated with the value of, say, an attribute records the time at which the value was stored in the database and is able to be retrieved. Transaction times are system generated and may be implemented using transaction commit times. Transaction times may also be represented by single instants or time intervals. Clearly, a transaction time can provide only an approximation of the valid time. We can say, generally, that the transaction time that records an event will never be earlier than the true valid time of the same event.
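These two times, and the rule relating them, can be illustrated with a single hypothetical order record (the times shown are invented; the transaction time here comes from a nightly batch load):

```python
from datetime import datetime

# Valid time: when the event happened in the real world.
# Transaction time: when the database recorded it; it can never be earlier
# than the valid time of the same event.
order = {
    "order_id": 1001,
    "valid_time": datetime(2000, 3, 1, 14, 30),        # order actually taken
    "transaction_time": datetime(2000, 3, 1, 23, 55),  # loaded by the nightly batch
}

assert order["transaction_time"] >= order["valid_time"]
lag = order["transaction_time"] - order["valid_time"]
print(lag)  # the error introduced if transaction time stands in for valid time
```

The lag is the approximation error incurred when, as often happens in practice, the transaction time has to be used because the valid time was never captured.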
Behavioral Data

In a dimensional data warehouse, the source systems from which the behavioral data is derived are the organization's operational systems, such as order processing, supply chain, and billing. The source systems are not usually designed to record or report upon historical information. For instance, in an order processing system, once an order has satisfactorily passed through its lifecycle, it tends to be removed from the system by some archival or deletion process. After this, for all practical purposes, the order will not be visible. In any case, it will have passed beyond the state that would make it eligible to be captured for information purposes.

The task of the data warehouse manager is to capture the appropriate entities when they achieve the state that renders them eligible to be entered into the data warehouse. That is, when the appropriate event occurs, a snapshot of the entity is recorded. This is likely to be before they reach the end of their lifecycle. For instance, an order is captured into the data warehouse when the order achieves a state of, say, “invoiced.” At this point the invoice becomes a “fact” in the data warehouse. Having been captured from the operational systems, the facts are usually inserted into the fact table using the bulk insertion facility that is available with most database management systems. Once loaded, the fact data is not usually subject to change at all. The recording of behavioral history in a fact table is achieved by the continual insertion of such records over time.

Usually each fact is associated with a single time attribute that records the time the event occurred. The time attribute of the event would, ideally, be the “valid time,” that is, when the event occurred in the real world. In practice, valid times are not always available and transaction times (i.e., the time the data was recorded) have to be used.
The actual data type used to record the time of the event will vary from one application to another depending on how precise the time has to be (the granularity of time might be day, month, and year when recording sales of wine, but would need to be more precise in the case of telephone calls and
would probably include hours, minutes, and seconds).
Circumstantial Data

Operational data, from which the facts are derived, is accompanied by supporting data, often referred to as reference data. The reference data relates to entities such as customers, products, sales regions, etc. Its primary purpose, within the operational processing systems, is to enable, for instance, the right products and documentation to be sent to the right addresses. It is this data that is used to populate the dimensions and the dimensional hierarchies in the data warehouse.

In the same way that business transactions have a lifecycle, these reference entities also have a lifecycle. The lifecycle of reference data entities is somewhat different from that of transactions. Whereas business transactions, under normal circumstances, have a predefined lifecycle that starts at inception and proceeds through a logical path to deletion, the lifecycle of reference data can be much less clear. The existence of some entities can be discontinuous. This is particularly true of customers, who may switch from one supplier to another and back again over time. It is also true of some other reference information, such as products (e.g., seasonal products). Also, the attributes are subject to change due to people moving, changing jobs, etc.
PROBLEMS INVOLVING TIME

There are several areas in data warehousing where time presents a problem. We'll now explore those areas.
The Effect of Time on the Data Model

Organizations wishing to build a data warehouse have often already built a data model describing their operational business data. This model is sometimes referred to as the corporate data model. The database administrator's office wall is sometimes entirely obscured by a chart depicting the corporate data model. When building a data warehouse, practitioners often encounter the requirement to use the customer's corporate data model as the foundation of the warehouse model. The organization has invested considerably in the development of the model, and any new application is expected to use it as the basis for development. The original motivation for the database approach was that data should be entered only once and that it should be shared by any users who were authorized to have access to it.

Figure 4.1 depicts a simple fragment of a data model for an operational system. Although the Wine Club data model could be used, the model in Figure 4.1 provides a clearer example.
Figure 4.1. Fragment of operational data model.
Figure 4.1 is typical of most operational systems in that it contains very little historical data. If we are to introduce a data warehouse into the existing data model, we might consider doing so by the addition of a time variant table that contained the history that is needed. Taking the above fragment of a corporate data model as a starting point, and assuming that the warehouse subject area is “Sales,” a dimensional warehouse might be created as in Figure 4.2.
Figure 4.2. Operational model with additional sales fact table.
Figure 4.2 shows a dimensional model with the fact table (Sales) at the center and three dimensions of analysis. These are time, customer, and salesperson. The salesperson dimension participates in a dimensional hierarchy in which a department employs salespeople and a site contains many departments. Figure 4.2 further shows that the sales fact table is populated by the data contained in the orders table, as indicated by the dotted arrow (not part of standard notation). That is, all new orders that have achieved the state required to enable them to be classified as sales are inserted into the sales table and are appended to the data already contained in the table. In this way the history of sales can be built. At first sight, this appears to be a satisfactory incorporation of a dimensional data warehouse into an existing data model. Upon closer inspection, however, we find that the introduction of the fact table “Sales” has had interesting effects. To explain the effect, the sales dimensional hierarchy is extracted as an example, shown in Figure 4.3. This hierarchy shows that a site may contain many departments and a department may employ many salespeople. This sort of hierarchy is typical of many such hierarchies that exist in all organizations.
Figure 4.3. Sales hierarchy.
The relationships shown here imply that a salesperson is employed by one department and that a department is contained in one site. These relationships hold at any particular point in time.
The addition of a fact table, which contains history, is attached to the hierarchy as shown in Figure 4.4.
Figure 4.4. Sales hierarchy with sales table attached.
The model now looks like a dimensional model with a fact table (sales) and a single dimension (salesperson). The salesperson dimension participates in a dimensional hierarchy involving departments and sites.

Assuming that it is possible, during the course of ordinary business, for a salesperson to move from one department to another, or for a department to move from one site to another, then the cardinality (degree) of the relationships “Contains” and “Employs” no longer holds. The hierarchy, consisting of salespeople, departments, and sites, contains only the latest view of the relationships. Because sales are recorded over time, some of the sales made by a particular salesperson may have occurred when the salesperson was employed by a different department. Whereas the model shows that a salesperson may be employed by exactly one department, this is only true where the relationship is viewed as a “snapshot” relationship. A more accurate description is that a salesperson is employed by exactly one department at a time. Over time, a salesperson may be employed by one or more departments. Similarly, a department is contained by exactly one site at a time. If it is possible for departments to move from one site to another then, over time, a department may be contained by one or more sites.

The introduction of time variance, which is one of the properties of a data warehouse, has altered the degree of the relationships within the hierarchy, and they should now be depicted as many-to-many relationships, as shown in Figure 4.5. This leads to the following observation: The introduction of a time-variant entity into a time-invariant model potentially alters the degree of one or more of the relationships in the model.
Figure 4.5. Sales hierarchy showing altered relationships.
A point worth noting is that it is the rules of the business, not a technical phenomenon, that cause these changes to the model. The degree to which this causes a problem will vary from application to application, but dimensions typically do contain one or more natural hierarchies. It seems reasonable to assume, therefore, that every organization intending to develop a data warehouse will have to deal with the problem of the degree of relationships being altered as a result of the introduction of time.

The above example describes the kind of problem that can occur in relationships that are able to change over time. In effect, the cardinality (degree) of the relationship has changed from “one to many” to “many to many” due to the introduction of time variance. In order to capture the altered cardinality of the relationships, intersection entities would normally be introduced, as shown in Figure 4.6.
Figure 4.6. Sales hierarchy with intersection entities.
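One of these intersection entities might be sketched as follows (the table, names, and dates are invented for illustration): the employment of a salesperson by a department becomes its own table, carrying start and end dates that record the period for which the relationship held:

```python
import sqlite3

# Intersection entity: one row per period of employment, so the many-to-many
# relationship over time is captured without losing history.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employment (
    sales_id   INTEGER,
    dept_code  TEXT,
    start_date TEXT,
    end_date   TEXT,          -- NULL while the employment is current
    PRIMARY KEY (sales_id, start_date)
);
""")
# Sneezy worked in SW, then transferred to NW:
conn.execute("INSERT INTO employment VALUES (7, 'SW', '1998-01-01', '1999-12-31')")
conn.execute("INSERT INTO employment VALUES (7, 'NW', '2000-01-01', NULL)")

# Which department employed Sneezy on a given date?
dept = conn.execute("""SELECT dept_code FROM employment
                       WHERE sales_id = 7 AND start_date <= '1999-06-15'
                         AND (end_date IS NULL OR end_date >= '1999-06-15')""").fetchone()[0]
print(dept)  # prints SW
```

Joining sales to the employment row whose period covers the sale's date would attribute each sale to the department that was responsible at the time.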
This brief introduction to the problem shows that it is not really possible to combine a time-variant data warehouse model with a non-time-variant operational model without some disruption to the original model. If we compare the altered data model to the original model, it is clear that the introduction of the time-variant sales entity has had some repercussions and has forced some changes to be made. This is one of the main reasons that forces data warehouses to be built separately from operational systems. Some practitioners believe that the separation of the two is merely a performance issue in that most database products are not able to be optimized to support the highly disparate nature of operational versus decision support type of queries. This is not the case. The example shows that the structure of the data is actually incompatible. In the future it is likely that operational systems will be
built with more “decision support awareness,” but any attempt to integrate decision support systems into traditional operational systems will not be successful.
The Effect of Time on Query Results

As these entities change over time, in operational processing systems, the new values tend to replace existing values. This gives the impression that the old, now replaced, value never existed. For instance, in the Wine Club example, if a customer moves from one address to another and, at the same time, switches to a new region, there is no reason within the order processing system to record the previous address as, in order to service orders, the new address is all that is required. It could be argued that keeping information about the old address is potentially confusing, with the risk that orders may be inadvertently dispatched to the wrong address.

In a temporal system such as a data warehouse, which is required to record and report upon history faithfully, it may be very important to be able to distinguish the orders placed by the customer while resident at the first address from the orders placed since moving to the new address. An example of where this information would be needed is where regional sales were measured by the organization. In the example described above, the fact that the customer switched regions when moving is important. The orders placed by the customer while they were at the previous address need to have that link preserved so that the previous region continues to receive the credit for those orders. Similarly, the new region should receive credit for any subsequent orders placed by the customer during their period of residence at the new address.

Clearly, when designing a data warehouse in support of a CRM strategy, such information may be very important. If we recall the cause-and-effect principle and how we applied it to changing customer circumstances, this is a classic example of precisely that. So the warehouse needs to record not only the fact that the data has changed but also when the change occurred.
There is a conflict between the system supplying the data, which is not temporal, and the receiving system, which is. The practical problems surrounding this issue are dealt with in detail later on in this chapter. The consequences of the problem can be explored in more detail by the use of some data. Figure 4.7 provides a simple illustration of the problem by building on the example given. We'll start by adding some data to the entities.
Figure 4.7. Sales hierarchy with data.
The example in Figure 4.7 shows a “Relational” style of implementation where the relationships are implemented using foreign key columns. In the data warehouse, the “Salesperson” dimension would be related directly to the sales fact table. Each sales fact would include a foreign key attribute that would contain the sales identifier of the salesperson who was responsible for the sale. In order to focus on the impact of changes to these relationships, time is omitted from the following set of illustrative queries. In order to determine the value of sales by salesperson, the SQL query shown in Query Listing 4.1 could be written:
Listing 4.1 Total sales by sales-person.
Select name, sum(sales_value)
from sales s1, sales_person s2
where s1.sales_id = s2.sales_id
group by name

In order to determine the value of sales by department, the SQL query shown in Query Listing 4.2 could be written:
Listing 4.2 Total sales by department.
Select department_name, sum(sales_value)
from sales s1, sales_person s2, department d
where s1.sales_id = s2.sales_id
and s2.dept_id = d.dept_code
group by department_name

If the requirement was to obtain the value of sales attributable to each site, then the query in Query Listing 4.3 could be used:
Listing 4.3 Total sales by site.
Select address, sum(sales_value)
from sales s1, sales_person s2, department d, site s3
where s1.sales_id = s2.sales_id
and s2.dept_id = d.dept_code
and d.site = s3.site_code
group by address

The result sets from these queries would contain the sum of the sales value grouped by salesperson, department, and site.
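To make the listings concrete, the sketch below runs the salesperson-level query against a few invented rows (SQLite via Python is used only to keep the example self-contained; the table and column names mirror the listings, and the sample data is not from the book's Figure 4.7):

```python
import sqlite3

# Minimal schema and sample data echoing the listings above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (site_code TEXT PRIMARY KEY, address TEXT);
CREATE TABLE department (dept_code TEXT PRIMARY KEY, department_name TEXT,
                         site TEXT REFERENCES site(site_code));
CREATE TABLE sales_person (sales_id INTEGER PRIMARY KEY, name TEXT,
                           dept_id TEXT REFERENCES department(dept_code));
CREATE TABLE sales (sales_id INTEGER REFERENCES sales_person(sales_id),
                    sales_value REAL);
INSERT INTO site VALUES ('BR', 'Bristol');
INSERT INTO department VALUES ('SW', 'South West', 'BR');
INSERT INTO sales_person VALUES (7, 'Sneezy', 'SW');
INSERT INTO sales VALUES (7, 100.0), (7, 250.0);
""")
# Listing 4.1: total sales by salesperson.
by_person = conn.execute("""Select name, sum(sales_value)
                            from sales s1, sales_person s2
                            where s1.sales_id = s2.sales_id
                            group by name""").fetchall()
print(by_person)  # [('Sneezy', 350.0)]
```

The department- and site-level listings roll up through the same joins, which is exactly why a later change to a salesperson's department silently reattributes all of their historical sales.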
The results will always be accurate so long as there are no changes in the relationships between the entities. However, as previously shown, changes in the dimensions are quite common. As an example, if Sneezy were to transfer from department “SW” to department “NW,” the relationship between the salesperson entity and the department entity will have changed. If the same three queries are executed again, the results will be altered. The results of the first query, in Query Listing 4.1, which is at salesperson level, will be the same as before because the sales made by Sneezy are still attributed to him. However, in Query Listing 4.2, which is at the department level, all sales that Sneezy was responsible for when he worked in department “SW” will in future be attributed to department “NW.” This is clearly an invalid result. The result from the query in Query Listing 4.3, which groups by site address, will still be valid because, although Sneezy moved from SW department to NW department, both SW and NW reside at the same address, Bristol. If Sneezy had moved from SW to SE or NE, then the Listing 4.3 results would be incorrect as well.

The example so far has focused on how time alters the cardinality of relationships. There is, equally, an effect on some attributes. If we look back at the salesperson entity in the example, there is an attribute called “Grade.” This is meant to represent the sales grade of the salesperson. If we want to measure the performance of salespeople by comparing volume of sales against grades, this could be achieved by the following query:
Select grade, sum(sales_value)
from sales s1, sales_person s2
where s1.sales_id = s2.sales_id
group by grade

If any salesperson has changed grade during the period covered by the query, then the results will be inaccurate because all of their sales will be recorded against their current grade. In order to produce an accurate result, the periods of validity of the salespeople's grades must be kept. This might be achieved by the introduction of another intersection entity. If no action is taken, the database will produce inaccurate results. Whether the level of inaccuracy is acceptable is a matter for the directors of the organization to decide. Over time, however, the information becomes less and less accurate, and its value is likely to become increasingly questionable. How do the business people know which queries return accurate results and, more importantly, which ones are suspect? Unfortunately for our users, there is no way of knowing.
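The intersection-entity idea can be illustrated with a minimal sketch. Here sqlite3 stands in for the warehouse DBMS, and the grade_history table, its column names, and the sample dates are all assumptions invented for the example, not the book's schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (sales_id INT, sale_date TEXT, sales_value REAL);
CREATE TABLE grade_history (          -- intersection entity: grade per period
    sales_id INT, grade TEXT,
    valid_from TEXT, valid_to TEXT);  -- period of validity of the grade
""")
# Sneezy (id 7) was grade 'B' until 2000-06-30, then promoted to 'A'
con.executemany("INSERT INTO grade_history VALUES (?,?,?,?)", [
    (7, 'B', '2000-01-01', '2000-06-30'),
    (7, 'A', '2000-07-01', '9999-12-31'),
])
con.executemany("INSERT INTO sales VALUES (?,?,?)", [
    (7, '2000-03-15', 100.0),   # made while grade B
    (7, '2000-09-20', 200.0),   # made while grade A
])
# Time-qualified join: each sale is matched to the grade valid at sale time
rows = con.execute("""
    SELECT g.grade, SUM(s.sales_value)
    FROM sales s JOIN grade_history g
      ON s.sales_id = g.sales_id
     AND s.sale_date BETWEEN g.valid_from AND g.valid_to
    GROUP BY g.grade ORDER BY g.grade
""").fetchall()
print(rows)   # grade B is credited 100.0, grade A only 200.0
```

Without the validity period in the join condition, all 300.0 of Sneezy's sales would be credited to his current grade.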
The Time Dimension

The time dimension is a special dimension that contains information about times. For every possible time that may appear in the fact table, an entry must exist in the time dimension table. This time attribute is the primary key to the time dimension. The non-key attributes are application specific and provide a method for grouping the discrete time values. The groupings can be anything that is of interest to the organization. Some examples might be:

Day of week
Weekend
Early closing day
Public holidays/bank holidays
24-hour opening day
Weather conditions
Week of year
Month name
Financial month
Financial quarter
Financial year

Some of the groupings listed above could be derived from date manipulation functions supplied by the database management system, whereas others clearly cannot.
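Generating such a table is mechanical once the groupings are chosen. The sketch below builds one row per day; the April-to-March financial year and the particular set of grouping columns are assumptions for the illustration (a real warehouse would add the organization-specific attributes such as early closing days):

```python
from datetime import date, timedelta

# Build one row per calendar day. The date is the primary key; the other
# columns are grouping attributes derivable from the date itself.
def time_dimension(start: date, end: date):
    rows = []
    d = start
    while d <= end:
        fin_year = d.year if d.month >= 4 else d.year - 1  # assumed Apr-Mar
        rows.append({
            "date_key":       d.isoformat(),
            "day_of_week":    d.strftime("%A"),
            "weekend":        d.weekday() >= 5,
            "month_name":     d.strftime("%B"),
            "week_of_year":   int(d.strftime("%W")),
            "financial_year": fin_year,
        })
        d += timedelta(days=1)
    return rows

rows = time_dimension(date(2000, 12, 30), date(2001, 1, 2))
print(rows[0]["day_of_week"])   # 2000-12-30 was a Saturday
```

Attributes such as weather conditions or public holidays cannot be derived this way and must be supplied from external data, which is exactly the distinction the text draws.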
The Effect of Causal Changes to Data

Upon examination, it appears that some changes are causal in nature, in that a change to the value of one attribute implies a change to the value of some other attribute in the schema. The extent of causality will vary from case to case, but the designer must be aware that a change to the value of a particular attribute, whose historical values have low importance to the organization, may cause a change to occur in the value of another attribute that has much greater importance. While this may be true in all systems, it is particularly relevant to data warehousing because of the disparate nature of the source systems that provide the data used to populate the warehouse. It is possible, for instance, that the source system containing customer addresses may not actually hold information about sales areas. The sales area classification may come from, say, a marketing database or some kind of demographic data. Changes to addresses, which are detected in the operational database, must be
implemented at exactly the same time as the change to the sales area codes. Acknowledgment of the causal relationship between attributes is essential if accuracy and integrity are to be maintained. In the logical model it is necessary to identify the dependencies between attributes so that the appropriate physical links can be implemented.
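The address/sales-area dependency can be sketched as follows. The area_of lookup, the assumed "street, town, county" address format, and the attribute names are all invented for the illustration:

```python
# Causal dependency sketch: a change of address implies a change of
# sales area, so the two updates must be applied together, never apart.
AREA_BY_TOWN = {"Woking": "SE", "Bridport": "SW", "Leeds": "NE"}

def area_of(address: str) -> str:
    # Assumes addresses are formatted "street, town, county"
    town = address.split(",")[-2].strip()
    return AREA_BY_TOWN[town]

def apply_address_change(customer: dict, new_address: str) -> dict:
    """Apply the detected address change AND its causally dependent
    attribute in one step, so the warehouse is never half-updated."""
    customer = dict(customer)
    customer["address"] = new_address
    customer["sales_area"] = area_of(new_address)  # dependent attribute
    return customer

c = {"code": 1136, "address": "9 Broughton Hall Ave, Woking, Surrey",
     "sales_area": "SE"}
c = apply_address_change(c, "44 Sea View Terrace, Bridport, Dorset")
print(c["sales_area"])   # the sales area moves with the address
```

If the address change were applied without recomputing the dependent attribute, the customer's facts would be grouped under the wrong sales area from that point on.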
CAPTURING CHANGES

Let's now examine how changes are identified in the source systems, how they are subsequently captured into the data warehouse, and the problems that can occur.

Capturing Behavior

As has been previously stated, the behavioral facts relate to the business transactions of the organization. Facts are usually derived from some entity having been “frozen” and captured at a particular status in its lifecycle. The process by which this status is achieved is normally triggered by an event. What do we mean by the term event? There are two ways of defining an event. If the data warehouse is viewed in isolation, so that the facts it records are not perceived as related to the source systems from which they were derived, then they can be viewed purely as events that occurred at a single point in time. If, however, the data warehouse is perceived as part of the “enterprise” database systems, then the facts should be viewed within the wider context, and each becomes an entity preserved at a “frozen” state, having been triggered by an event. Either way, the distinguishing feature of facts is that they do not have a lifespan. They are associated with just one time attribute. For the purpose of clarity, the following definition of facts will be adopted:

A fact is a single state entity that is created by the occurrence of some event.

In principle, the processes involved in the capture of behavior are relatively straightforward. The extraction of new behavioral facts for insertion into the data warehouse is performed on a periodic, very often daily, basis. This tends to occur while the operational processing systems are not functioning, typically during the overnight “batch” processing cycle. The main benefit of this approach is that all of the previous day's data can be collected and transferred at one time.
The process of identifying the facts varies from one organization to another, from very easy to almost impossible to accomplish. For instance, the fact data may come from:

Telephone network switches or billing systems, in the case of telecommunications companies
Order processing systems, in the case of mail order companies such as the Wine Club
Cash receipts, in the case of retail outlets

Once the facts have been identified, they are usually stored in sequential files or streams that are appended to during the day. As the data warehouse usually resides on a hardware platform separate from the operational system, the files have to be moved before they can be processed further. The next step is to validate and modify each record to ensure that it conforms to the format and semantic integration rules described in Chapter 2. The actual loading of the data is usually performed using the “bulk” load utility that most database management systems provide.
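The daily flow, append to a sequential file, move it, validate each record, then bulk load, can be sketched as follows. The record layout and the validation rule are invented for the illustration, and executemany stands in for the DBMS bulk load utility:

```python
import csv
import io
import sqlite3

# A day's extract as it might arrive from the operational system:
# sale_id, sale_date, value -- one record per line, comma separated.
extract = io.StringIO(
    "1001,2000-12-01,25.50\n"
    "1002,2000-12-01,not-a-number\n"   # fails validation
    "1003,2000-12-01,12.00\n"
)

def validate(row):
    """Format check before load: three fields, numeric id and value."""
    if len(row) != 3:
        return None
    try:
        return (int(row[0]), row[1], float(row[2]))
    except ValueError:
        return None          # rejected rows would go to an error file

clean, rejected = [], []
for row in csv.reader(extract):
    parsed = validate(row)
    (clean if parsed else rejected).append(parsed or row)

# The load itself: executemany stands in for the DBMS bulk load utility.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (sale_id INT, sale_date TEXT, value REAL)")
con.executemany("INSERT INTO sales VALUES (?,?,?)", clean)
print(len(clean), len(rejected))   # 2 rows loaded, 1 rejected
```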
Once recorded, the values of fact attributes never change, so they should be regarded as single state or stateless. There is a time element that applies to facts, but it is simply the time that the event occurred. It is usually implemented in the form of a single timestamp. The granularity of the timestamp varies from one application to another. For instance, in the Wine Club, the timestamp of a sale records the date of the sale. In a telecommunications application, the timestamp would record not only the date but also the hour, minute, and second that the call was placed.

Capturing Circumstances

The circumstances and dimensions are derived from what have been referred to as the reference entities within the organization. This is information such as customer, product, and market segment. Unlike the facts in a data warehouse, this type of information does have a lifespan. For instance, products may pass through various states during their lifespan: from new, to fast moving, to slow moving, to discontinued, to deleted. The identification and capture of new or changed dimensional information is usually quite different from the capture of facts. For instance, it is often the case that customer details are captured in the operational systems some time after the customer starts using the services of the organization. Also, the date on which the customer is enrolled as a customer is often not recorded in the system; neither is the date when they cease to be a customer. Similarly, when a dimensional attribute changes, such as the address of a customer, the new address is recorded in such a way as to replace the existing address. The date of the change of address is often not recorded, and neither, usually, are the dates of changes to other dimensional attributes. This is only a problem if the attribute concerned is one for which there is a requirement to record the historic values faithfully.
In the Wine Club, for instance, the following example attributes need to have their historic values preserved:

Customers' addresses
Customers' sales areas
Wine sales prices and cost prices
Suppliers of wines
Managers of sales areas

If the time of the change is not recorded in the operational systems, then it is impossible to determine the valid time at which the change occurred. Where the valid time of a change is not available, it may be appropriate to try to ascertain the transaction time of the change event. This is the time the change was recorded in the database, as opposed to the time the change actually occurred. However, in the same way that the valid time of changes is not recorded, the transaction time of changes is usually not recorded explicitly as part of the operational application. In order for the data warehouse to capture the time of changes, there are methods available that can assist us in identifying the transaction times:
Make changes to the operational systems. The degree to which this is possible depends on a number of factors. If the system has been developed specifically for the organization, either by the organization's own IT staff or by some third party, then as long as the skills are available and the costs and timescales are not prohibitive, the operational systems can be changed to accommodate the requirements of the data warehouse. Where the application is a standard package product, it becomes much more difficult to make changes to the system without violating commercial agreements covering such things as upgrades and maintenance. If the underlying database management system supporting the application is relational, then it is possible to capture the changes by introducing such things as database triggers. Experience shows that most organizations are reluctant to alter operational applications in order to service informational systems' requirements, for reasons of cost and complexity. Also, placing additional processing inside operational systems is often seen as a threat to the performance of those systems.

Interrogation of audit trails. Some operational applications maintain audit trails to enable changes to be traced. Where these exist, they can be a valuable source of information for capturing transaction time changes.

Interrogation of DBMS log files. Most database management systems maintain log files for system recovery purposes. It is possible, if the right skills are available, to interrogate these files to identify changes and their associated transaction times. This practice is discouraged by the DBMS vendors, as log files are intended for internal use by the DBMS. If the files are damaged by unauthorized access, the ability of the DBMS to perform a recovery may be compromised. Also, the DBMS vendors always reserve the right to alter the format of the log files without notice. If this happens, processes that have been developed to capture changes may stop working or may produce incorrect results. Obviously, this approach is not available to non-DBMS applications.

File comparison. This involves capturing an entire file, or table, of dimensional data and copying it so that it can be compared to the data already held in the data warehouse. Any changes that are identified can then be applied to the warehouse. The time of the change is taken to be the system time of the detection of the change, that is, the time the file comparison process was executed. Experience shows that the file comparison technique is the one most frequently adopted when data warehouses are developed. It is the approach that has the least impact on the operational environment, and it is the least costly to implement.

It should also be remembered that some dimensions are created by the amalgamation of data from several operational systems and some external systems. This will certainly exacerbate an already complex problem. Where the dimensions in a dimensional model are large (some organizations have several million customers), the capture of the data, followed by the transfer to the data warehouse environment and subsequent comparison, can be very time-consuming. Consequently, most organizations place limits on the frequency with which this process can be executed. At best, the frequency is weekly. The processing can then take place over the weekend, when the systems are
relatively quiet, and the extra processing required to perform this exercise can be absorbed without too much of an impact on other processing. Many organizations permit only monthly updates to the dimensional data, and some update even less frequently than that.

The problem is that the only transaction time available, against which the changes can be recorded, is the date on which the change was discovered (i.e., the file comparison date). So, for example, let us assume that the frequency of comparison is monthly and the changes are captured at the end of the month. If a customer changes address, and geographic region, at the beginning of the month, then any facts recorded for the customer during the month will be credited permanently to the old, incorrect region. It is also possible that, during a single month, more than one change will occur to the same attribute. If the data is collected by the file comparison method, the only values captured are those in existence at the time of capture. All intermediate changes are missed completely. The degree to which this is a problem will vary from application to application.

It is accepted that, in general, valid time change capture for dimensions is not, practically speaking, realistic. However, it is important that practitioners recognize the issue and try to keep the difference between transaction time and valid time as small as possible. The fact that some data relating to time, as well as other attributes, is absent from the source systems can come to dominate data warehouse developments. As a result of these problems, the extraction of data is sometimes the longest and riskiest part of a data warehouse project.

Summary of the Problems Involving Time

So far in this chapter we have seen that maintaining accuracy in a data warehouse presents a challenging set of problems, which are summarized below:

1. Identifying and capturing the temporal requirements.
The first problem is to identify the temporal requirements. There is currently no method for doing this; the present data warehouse modeling techniques do not provide any real support for it.

2. Capture of dimensional updates. What happens when a relationship changes (e.g., a salesperson moves from one department to another)? What happens when a relationship no longer exists (e.g., a salesperson leaves the company)? How does the warehouse handle changes in attribute values (e.g., a product was blue, now it is red)? Is there a need to report on the product's sales performance when it was red or when it was blue, as well as throughout the whole of its lifecycle?

3. The timeliness of capture. It now seems clear that absolute accuracy in a data warehouse is not a practical objective. There is a need to be able to assess the level of inaccuracy so that a degree of confidence can be applied to the results obtained from queries.

4. Synchronization of changes. When an attribute changes, a mechanism is required for identifying dependent attributes that might also need to be changed. The absence of synchronization affects the credibility of the results.
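Returning to the file comparison technique described earlier, its core can be sketched in a few lines. The customer codes and attributes are invented for the illustration, and the recorded "transaction time" is simply the time the comparison runs:

```python
from datetime import date

# Snapshot comparison: detect dimension changes by comparing today's
# extract of the customer table with the copy already in the warehouse.
warehouse = {   # customer_code -> attributes currently in the warehouse
    1136: {"name": "L. Jones", "area": "SE"},
    1315: {"name": "P. Chamberlain", "area": "NE"},
}
todays_extract = {
    1136: {"name": "L. Jones", "area": "SW"},       # this customer moved
    1315: {"name": "P. Chamberlain", "area": "NE"},  # unchanged
    2131: {"name": "Q.E. McCallum", "area": "SW"},   # new customer
}

detected = []
capture_time = date.today().isoformat()   # the only time we can record
for code, attrs in todays_extract.items():
    if code not in warehouse:
        detected.append(("insert", code, attrs, capture_time))
    elif attrs != warehouse[code]:
        detected.append(("update", code, attrs, capture_time))
# Note: rows absent from the extract are NOT deleted from the warehouse;
# they would be processed as logical updates. Any change made and then
# reversed between two comparison runs is missed entirely.
print([(op, code) for op, code, *_ in detected])
```

The sketch makes the two weaknesses of the technique concrete: the transaction time recorded is the comparison time, not the time of the change, and intermediate values between runs are invisible.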
We have seen that obtaining the changed data can involve complex processing and may require sophisticated design to implement in a way that provides for both accuracy of information and reasonable performance.

In this chapter we have explored the various problems associated with time in data warehousing. Some of these problems are inherent in the standard dimensional model, but they can be overcome by changing the way dimensional models are designed. Other problems relate to the way data warehouses interact with operational systems. These are more difficult, and sometimes impossible, to solve. Nevertheless, data warehouse designers need to be fully aware of the extent of the problems and familiar with the various approaches to solving them. These are discussed in the coming chapters.

The biggest set of problems lies in the capture and accurate representation of historical information. The problem is most difficult when changes occur in the lifespan of dimensions and in the relationships within dimensional hierarchies, and where attributes change their values and there is a requirement to faithfully reflect those changes through history.

Having exposed the issues and established the problems, let's have a look at some of the conventional ways of solving them.
FIRST-GENERATION SOLUTIONS FOR TIME

We now go on to describe some solutions to the problems of representing time that have been used in first-generation data warehouses. One of the main problems is that the business requirements with respect to time have not been systematically captured at the conceptual level. This is largely because we are unfamiliar with temporal semantics, having so far not encountered temporal applications. Logical models follow systematically from conceptual models, so a natural consequence of failing to define the requirements in the conceptual model is that they are also absent from the logical model and the physical implementation. As a result, practitioners have subsequently found themselves faced with problems involving time, and some have created solutions. However, the solutions have been developed on a somewhat ad hoc basis and are by no means comprehensive. The problem is sufficiently large that we really do need a rigorous approach to solving it.

As an example of the scale of the problem, there is, as previously mentioned, evidence in a government-sponsored housing survey that, in the United Kingdom, people change their addresses, on average, every 10 years. This means that an organization can expect to have to implement changes of address details for about 10 percent of its customers each year. Over a 10-year period, an organization with one million customers can expect to deal with one million changes of address. Obviously, some people will not move, but others will move more than once in that period. And this covers only address changes; other attributes relating to customers will also change, although perhaps not with the same frequency as addresses.

One of the major contributors to the development of solutions in this area is Ralph Kimball (1996). His collective term for changes to dimensional attributes is slowly changing dimensions.
The term has become well known within the data warehouse industry and has been generally adopted by practitioners. He cites three methods of tracking changes to dimensional attributes with respect to time, which he calls simply Types 1, 2, and 3. Within the industry, practitioners are generally aware of the three types, and where any support for time is provided in dimensional models, these are the approaches normally used. It is common to refer to products and methods as being consistent with Kimball's Type 1, 2, or 3. In a later work, Kimball (1998) recognizes a Type 0, which represents dimensions that are not subject to change.
The Type 1 Approach

The first type of change, known as Type 1, is to replace the old data values with the new values. This means that the previous value is not preserved. The advantage of this approach, from a system perspective, is that it is very easy to implement. Obviously, no temporal support is offered by this solution. However, this method sometimes offers the most appropriate solution. We
don't need to track the historical values of every single database element and, sometimes, overwriting the old values is the right thing to do. In the Wine Club example, attributes like the customer's name are best treated in this way. This is an attribute for which there is no requirement to retain historical values; only the latest value is deemed by the organization to be useful. All data warehouse applications will have some attributes for which the correct approach is to overwrite the previous values with the new values.

It is important that the updating is effected on a per-attribute basis rather than a per-row basis. Each table will have a mixture of attributes, some of which will require the Type 1 replacement approach, while others will require a more sophisticated treatment of value changes over time. The worst scenario is a full table replacement approach, in which the dimension is periodically completely overwritten. The danger here is that any rows that have been deleted in the operational system may be deleted in the data warehouse. Any rows in the fact table that refer to deleted dimension rows will cause a referential integrity violation and will place the database in an invalid state. Thus, the periodic update of dimensions in the data warehouse must involve only inserts and updates. Any logical deletions (e.g., where a customer ceases to be a customer) must be processed as updates in the data warehouse. It is important to know whether a customer still exists as a customer, but the customer record must remain in the database for the whole lifespan of the data warehouse or, at the very least, as long as there are fact table records that refer to the dimensional record.

Because Type 1 is the simplest approach, it is often used as the default approach.
Practitioners will sometimes adopt a Type 1 solution as a short-term expedient, where the application really requires a more complete solution, with the intention of providing proper support for time at a later stage in the project. Too often, the pressures of project budgets and implementation deadlines force changes to the scope of projects and the enhancements are abandoned. Sometimes, Type 1 is adopted due to inadequate analysis of the requirements with respect to time.
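A minimal sketch of a Type 1 update, with a logical deletion handled as an update rather than a physical delete, might look as follows. Here sqlite3 stands in for the warehouse DBMS, and the table layout and status column are invented for the illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE customer (
        customer_code INT PRIMARY KEY,
        customer_name TEXT,
        status        TEXT)     -- 'active' or 'lapsed' (assumed values)
""")
con.execute("INSERT INTO customer VALUES (1136, 'L. Jones', 'active')")

# Type 1: overwrite a single attribute in place; no history is kept.
con.execute("UPDATE customer SET customer_name = 'L. Smith' "
            "WHERE customer_code = 1136")

# A logical deletion is also an update. The row must survive for as
# long as fact rows refer to it, so we never issue a DELETE.
con.execute("UPDATE customer SET status = 'lapsed' "
            "WHERE customer_code = 1136")

row = con.execute("SELECT customer_name, status FROM customer "
                  "WHERE customer_code = 1136").fetchone()
print(row)   # the old name is gone for good; the row itself remains
```

Note that both statements update single attributes of a single row; a full-table reload, by contrast, could silently drop rows that the fact table still references.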
The Type 2 Approach

The second solution to slowly changing dimensions is called Type 2. It is more complex than Type 1 and does attempt to faithfully record historical values of attributes by providing a form of version control. Type 2 changes are best explained with an example. In the case study, the sales area in which a customer lives is subject to change when the customer moves, and there is a requirement to faithfully reflect regional sales performance over time. This means that the sales area prevailing at the time of the sale must be used when analyzing sales. If the Type 1 approach were used when recording changes to the sales area, historical sales would appear to have the same sales area as current ones. A method is needed, therefore, that enables us to reflect history faithfully.

The Type 2 method attempts to solve this problem by the creation of new records. Every time an attribute's value changes, if faithful recording of history is required, an entirely new record is created with all the unaffected attributes unchanged. Only the affected attribute is changed to reflect its new value. The obvious problem with this approach is that it would immediately compromise the
uniqueness property of the primary key, as the new record would have the same key as the previous record. This can be turned into an advantage by the use of surrogate keys. A surrogate key is a system-generated identifier that introduces a layer of indirection into the model. It is good practice to use surrogate keys in all the customer and dimensional data. The main reason is that the production key is subject to change whenever the company reorganizes its customers or products, and carrying such a change through would cause unacceptable disruption to the data warehouse. It is better to create an arbitrary key to provide the property of uniqueness. So each time a new record is created, following a change to the value of an attribute, a new surrogate key is assigned to the record. Sometimes a surrogate approach is forced upon us when we are attempting to integrate data from different source systems whose identifiers are not the same. There are two main approaches to assigning the value of the surrogate:
1. The identifier is lengthened by a number of version digits. So a customer having an identifier of “1234” would subsequently have the identifier “1234001.” After the first change, a new row would be created that would have an identifier of “1234002.” The customer would now have two records in the dimension. Most of the attribute values would be the same. Only the attribute, or attributes, that had changed would be different.
2. The identifier could be truly generalized and bear no relation to the previous identifiers for the customer. So each time a new row is added, a completely new identifier is created.

In a behavioral model, the generalized key is used in both the dimension table and the fact table. Constraining queries using a descriptive attribute, such as the name of the customer, will result in all records for the customer being retrieved. Constraining or grouping the results by the name and, say, the sales area attribute will ensure that history is faithfully reflected in the results of queries, assuming of course that uniqueness of the descriptive attribute can be guaranteed. The Type 2 approach, therefore, will ensure that the fact table is correctly joined to the dimension and that the correct dimensional attributes are associated with each fact. Ensuring that the facts are matched to the correct dimensional attributes is the main concern.

An obvious variation of this approach is to construct a composite identifier by retaining the previous number “1234” and adding a new counter or “version” attribute that, initially, would be “1.” This is similar to approach 1, above. The initial identifier is allocated when the customer is first entered into the database, and subsequent changes require the allocation of new identifiers. It is the responsibility of the data warehouse administrator to control the allocation of identifiers and to maintain the version number, in order to know which version number, or generalized key, to allocate next.

In reality, the requirement would be to combine Type 1 and Type 2 solutions in the same logical row. This is where we have some attributes that we do want to track and some that we do not.
An example of this occurs in the Wine Club where, in the customer's circumstances, we wish to trace the history of attributes like the address and, consequently, the sales area, but we are not interested in the history of the customer's name or their hobby code. So, in a single logical row, an attribute like the address would need to be treated as Type 2, whereas the name would be treated as Type 1. Therefore, if the customer's name changes, we would wish to overwrite it. However, there may be
many records in existence for this customer, due to previous changes to other attributes. Do we have to go back and overwrite the previous records? In practice, it is likely that only the latest record would be updated. This implies that, in dimensions where Type 2 is implemented, attributes for which the Type 1 approach would be preferred are forced to adopt an approach that is nearly, but not quite, Type 2.

In describing the problems surrounding time in data warehousing, we saw how the results of a query could change due to a customer moving. The approach taken was simply to overwrite the old addresses and sales area codes with the new values. This is equivalent to Kimball's Type 1 approach. If we implement the same changes using the Type 2 method, the results are not disrupted, as new records are created with a new surrogate identifier. Future insertions to the sales fact table will be related to the new identifying customer codes, and so the segmentation will remain consistent with respect to time for the purposes of this particular query.

One potential issue here is that, by making use of generalized keys, it becomes impossible to recognize individual customers by use of the identifying attribute. As each subsequent change occurs, a new row is inserted and is identified by a key value that is in no way associated with the previous key value. For example, Lucie Jones's original value for the customer code attribute might be, say, 1136, whereas the customer code for the newly inserted row could be anything, say, 8876, being the next available key in the domain range. This means that, if it were required to extract information on a per-customer basis, the grouping would have to be on a nonidentifying attribute, such as the customer's name, that is:
select customer_name "Name", sum(quantity) "Bottles", sum(value) "Revenue"
from sales s, customer c
where c.customer_code = s.customer_code
group by customer_name
order by customer_name

Constraining and grouping queries using descriptive attributes like names is clearly risky, since names are apt to be duplicated and erroneous results could be produced. Another potential issue with this approach is that, if the keys are truly generalized, as with key hashing, it may not be possible to identify the latest record by simply selecting the highest key. Also, the use of generalized keys means that obtaining the history of, say, a customer's details may not be as simple as ordering the keys into ascending sequence. One solution to this problem is the addition of a constant descriptive attribute, such as the original production key, that is unique to the logical row. Alternatively, a variation as previously described, in which the original key is retained but is augmented by an additional attribute to define the version,
would also provide a solution to this.

The Type 2 method does not include date columns to identify when changes actually took place. For instance, this means that it is not possible to establish with any accuracy when a customer actually moved. The only date available to provide any clue to this is the transaction date in the fact table. There are some problems associated with this. A query such as “List the names and addresses of all customers who have purchased more than twelve bottles of wine in the last three months” might be useful for campaign purposes. Such a query will, however, return incorrect addresses for those customers who have moved but have not since placed an order. The query in Query Listing 4.4 shows this.
Listing 4.4 Query to produce a campaign list.
select c.customer_code, customer_name, customer_address, sum(quantity)
from customer c, sales s, time t
where c.customer_code = s.customer_code
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c.customer_code, customer_name, customer_address
having sum(quantity) > 12

Table 4.1 shows a subset of the result set for the query in Listing 4.4.
Table 4.1. List of Customers to be Contacted

Customer Code   Customer Name    Customer Address                          Sum(Quantity)
1332            A.J. Gordon      82 Milton Ave, Chester, Cheshire          49
1315            P. Chamberlain   11a Mount Pleasant, Sunderland            34
2131            Q.E. McCallum    32 College Ride, Minehead, Somerset       14
1531            C.D. Jones       71 Queensway, Leeds, Yorks                31
1136            L. Jones         9 Broughton Hall Ave, Woking, Surrey      32
2141            J.K. Noble       79 Priors Croft, Torquay, Devon           58
4153            C. Smallpiece    58 Ballard Road, Bristol                  21
1321            D. Hartley       88 Ballantyne Road, Minehead, Somerset    66
The row for customer code 1136 is an example of the point. Customer L. Jones has two entries in the database. Because Lucie has not purchased any wine since moving, the incorrect address was returned by the query. The result of a simple search is shown in Table 4.2.
Table 4.2. Multiple Records for a Single Customer

Customer Code   Customer Name   Customer Address
1136            L. Jones        9 Broughton Hall Ave, Woking, Surrey
8876            L. Jones        44 Sea View Terrace, West Bay, Bridport, Dorset
If it can be assumed that the generalized key is always ascending, then the query could be modified, as follows, to select the highest value for the key.
select customer_code, customer_name, customer_address
from customer
where customer_code =
   (select max(customer_code)
    from customer
    where customer_name = 'L. Jones')

This query would return the second of the two rows listed in Table 4.2. Using the other technique to implement the Type 2 method, we could have altered the customer code from "1136" to "113601" for the original row and, subsequently, to "113602" for the new row containing the changed address and sales area. In order to return the correct addresses, the query in Listing 4.5 has to be executed.
Listing 4.5 Obtaining the latest customer's details using Type 2 with extended identifiers.
select c1.customer_code, customer_name, customer_address, sum(quantity)
from customer c1, sales s, time t
where c1.customer_code = s.customer_code
and c1.customer_code =
   (select max(customer_code)
    from customer c2
    where substr(c1.customer_code,1,4) = substr(c2.customer_code,1,4))
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c1.customer_code, customer_name, customer_address
having sum(quantity) > 12

The query in Listing 4.5 is another correlated subquery and contains the following line:
where substr(c1.customer_code,1,4) = substr(c2.customer_code,1,4)

The query matches the generic parts of the customer code by use of a "substring" function, provided
by the query processor. It is suspected that this type of query may be beyond the capability of most users. The approach also depends on all the codes having the same fundamental format: four digits plus a suffix. If the codes simply ran from 1 through 9,999, with no fixed width, this technique could not be adopted, because the substring function would not produce the right answer. The obvious variation on this approach is to add an extra attribute to distinguish versions. The identifier then becomes a composite of two attributes instead of a single attribute. In this case, the original attribute remains unaltered, and the new attribute is incremented, as shown in Table 4.3.
Table 4.3. A Modification to Type 2 Using Composite Identifiers

Customer Code   Version Number
1136            01
1136            02
1136            03
Using this technique, the following query is executed:
select c1.customer_code, customer_name, customer_address, sum(quantity)
from customer c1, sales s, time t
where c1.customer_code = s.customer_code
and s.counter = c1.counter
and c1.counter =
   (select max(counter)
    from customer c2
    where c1.customer_code = c2.customer_code)
and s.time_code = t.time_code
and t.month in (200010, 200011, 200012)
group by c1.customer_code, customer_name, customer_address
having sum(quantity) > 12

The structure of this query is the same as in Listing 4.5, but this approach does not require the use of substrings to make the comparison. This means that the query will always produce the right answer, irrespective of the consistency, or otherwise, of the encoding procedures within the organization. These solutions still do not resolve the problem of pinpointing when a change occurs. Due to the absence of dates in the Type 2 method, it is impossible to determine precisely when changes occur. The only way to extract any form of alignment with time is via a join to the fact table, which at best gives an approximate time for the change. The degree of accuracy depends on the frequency of fact table entries relating to the dimensional entry concerned: the more frequent the entries in the fact table, the more accurately the history of the dimension can be traced, and vice versa. It is also not
possible to record gaps in the existence of dimensional entries. For instance, in order to track precisely the discontinuous existence of, say, a customer, there must be some kind of temporal reference to record the periods of existence.
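The kind of temporal reference described above can be sketched as follows. This is a minimal illustration, not the author's design: the table layout, column names, and dates are assumptions, and the effective-date columns are one common way of stamping a Type 2 dimension so that changes and gaps in existence can be recorded.

```python
import sqlite3

# A minimal sketch (assumed table and column names): a Type 2 customer
# dimension carrying effective-date columns, so the moment of each change
# and any gap in a customer's existence is recorded explicitly.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE customer (
        customer_key     INTEGER PRIMARY KEY,  -- surrogate key
        customer_code    TEXT,                 -- original production key
        customer_name    TEXT,
        customer_address TEXT,
        effective_from   TEXT,                 -- date the row became current
        effective_to     TEXT                  -- NULL while the row is current
    )
""")
# Lucie Jones moves: the old row is closed off and a new row is inserted.
cur.execute("INSERT INTO customer VALUES (1, '1136', 'L. Jones', "
            "'9 Broughton Hall Ave, Woking, Surrey', '1998-03-01', '2000-11-14')")
cur.execute("INSERT INTO customer VALUES (2, '1136', 'L. Jones', "
            "'44 Sea View Terrace, West Bay, Bridport, Dorset', '2000-11-14', NULL)")

# The current address is simply the open-ended row, and the date of the
# move is available without any join to the fact table.
current_address, moved_on = cur.execute(
    "SELECT customer_address, effective_from FROM customer "
    "WHERE customer_code = '1136' AND effective_to IS NULL").fetchone()
print(current_address)
print(moved_on)
```

With this variation, the campaign query in Listing 4.4 could constrain on `effective_to IS NULL` and always return the current address, whether or not the customer has ordered since moving.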
Problems With Hierarchies
So far in this chapter, attention has focused on so-called slowly changing dimensions and how they might be supported using the Type 2 solution. Now we turn our attention to what we shall call slowly changing hierarchies. As an example, we will use the dimensional hierarchy illustrated in Figure 3.4. The attributes, using a surrogate key approach, are as follows:

Sales_Area(Sales_Area_Key, Sales_Area_Code, Manager_Key, Sales_Area_Name)

Manager(Manager_Key, Manager_Code, Manager_Name)

Customer(Customer_Key, Customer_Code, Customer_Name, Customer_Address, Sales_Area_Key, Hobby_Code, Date_Joined)

Let's say the number of customers and the spread of sales areas in the case study database is as shown in Table 4.4.
Table 4.4. Customers Grouped by Sales Area

Sales Area   Count
North East   18,967
North West   11,498
South East   39,113
South West   28,697
We will assume that we have implemented the Type 2 solution to slowly changing dimensions. If sales area SW experienced a change of manager from M9 to M12, a new sales area record would be inserted with a new surrogate key, together with the new manager code. So if the previous record was (1, SW, M9, "South West"), the new record, with its new key assumed to be 5, would contain (5, SW, M12, "South West"). However, that is not the end of the change. Each of the customers from the SW sales area still has its foreign key reference pointing to the original sales area record containing the reference to the old manager (surrogate key 1). Therefore, we also have to create an entire set of new records for those customers, with each of their sales area key values set to 5. In this case, there are 11,498 new records to be created. It is not valid simply to update the foreign keys with the new value, because the old historical link would be lost.
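The cascade can be sketched as follows. Table and column names are assumptions for illustration, and ten customers stand in for the 11,498 in the example; the point is that one parent change forces one new row per child.

```python
import sqlite3

# Sketch (assumed schema): the SW sales area is re-versioned under Type 2
# (old surrogate key 1, new key 5), so every customer row that referenced
# key 1 must itself be re-versioned to point at key 5.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales_area (sales_area_key INTEGER PRIMARY KEY, "
            "sales_area_code TEXT, manager_code TEXT, sales_area_name TEXT)")
cur.execute("CREATE TABLE customer (customer_key INTEGER PRIMARY KEY, "
            "customer_code TEXT, sales_area_key INTEGER)")

cur.execute("INSERT INTO sales_area VALUES (1, 'SW', 'M9', 'South West')")
# Ten customers stand in for the 11,498 of the example.
cur.executemany("INSERT INTO customer (customer_code, sales_area_key) "
                "VALUES (?, 1)", [(str(c),) for c in range(5000, 5010)])

# The manager change inserts a new sales area version ...
cur.execute("INSERT INTO sales_area VALUES (5, 'SW', 'M12', 'South West')")
# ... which forces one new customer row per existing child, each pointing
# at the new key; the old rows are retained for history.
cur.execute("INSERT INTO customer (customer_code, sales_area_key) "
            "SELECT customer_code, 5 FROM customer WHERE sales_area_key = 1")

cascaded_inserts = cur.execute(
    "SELECT count(*) FROM customer WHERE sales_area_key = 5").fetchone()[0]
print(cascaded_inserts)  # one extraneous insert per child customer
```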
Where there are complex hierarchies involving more levels and more rows, it is not difficult to imagine very large volumes of inserts being generated. For example, in a four-level hierarchy where the relationship is just 1:100 at each level, a single change at the top level will cause over a million new records to be inserted. A ratio of 1:100 is not inordinately high; there are many data warehouses in existence with several million customers in the customer dimension alone. The number of extraneous insertions generated by this approach could cause the dimension tables to grow at a rate that, in time, becomes a threat to performance. For the true star schema advocates, we could try flattening the hierarchies into a single dimension (denormalizing), which converts a snowflake schema into a star schema. If this approach is taken, the effect in the four-level 1:100 example is that the number of insertions reduces from 1.01 million to 1 million. So reducing the number of insertions is not a reason for flattening the hierarchy.
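The arithmetic behind the 1.01 million versus 1 million figures can be checked directly, under the stated assumption of a four-level hierarchy with a 1:100 fan-out at every level:

```python
# Verify the insert counts for a four-level hierarchy, 1:100 at each level.
fanout, levels = 100, 4

# Snowflake (normalized): a Type 2 change at the top cascades down,
# re-versioning 100 + 10,000 + 1,000,000 rows in the levels below.
snowflake_inserts = sum(fanout ** level for level in range(1, levels))

# Flattened star: only the single leaf-level dimension exists, so only
# its 1,000,000 rows need to be re-inserted.
star_inserts = fanout ** (levels - 1)

print(snowflake_inserts)  # 1010100 -- just over 1.01 million
print(star_inserts)       # 1000000
```

The saving from flattening is only the non-leaf rows, about one percent of the total, which is why it does little to relieve the insert volume.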
Browse Queries
The Type 2 approach also causes some problems when it comes to browsing. It is generally reckoned that some 80 percent of data warehouse queries are dimension-browsing queries, meaning they do not access any fact table. A typical browse query is to count the number of occurrences: for instance, how many customers do we have? The standard way to do this in SQL is shown in the following query:
Select count(*) from