5,980 578 48MB
Pages 1133 Page size 612.319 x 792 pts Year 2011
1
INTRODUCTION
In which we try to explain why we consider artificial intelligence to be a subject most worthy ofstudy, and in which we try to decide what exactly it is, this being a good thing to decide before embarking.
INTELLIGENCE
We call ourselves
Homo sapiens-man
the wise-because our
to us. For thousands of years, we have tried to understand
intelligence
how we think;
i s so important
that is, how a mere
handful of matter can perceive, understand, predict, and manipulate a world far larger and ARTIFICIAL INTELLIGENCE
more complicated than itself. The field of attempts not just t o understand but also to
artificial intelligence,
or AI, goes further still: i t
build intelligent entities.
AI is one of the newest fields in science and engineering. Work started in eamest soon after World War II, and the name itself was coined in
1956.
Along with molecular biology,
AI is regularly cited as the "field I would most like to be in" by scientists in other disciplines. A student in physics might reasonably feel that all the good ideas have already been taken by Galileo, Newton, Einstein, and the rest. AI, on the other hand, still has openings for several full-time Einsteins and Edisons. AI currently encompasses a huge variety of subfields, ranging from t11e general (leaming and perception) to the specific, such as playing chess, proving mathematical theorems, writing poetry, driving a car on a crowded street, and diagnosing diseases.
AI is relevant to any
intellectual task; i t is truly a universal field.
1.1
WHAT Is AI?
We have claimed that AI is exciting, but w e have not said what i t
is.
In Figure
1.1
w e see
eight definitions of AI, laid out along two dimensions. The definitions on top are concemed with
thought processes and reasoning,
definitions on the left measure success in terms RATIONALITY
the bottom address behavior. The of fidelity to human performance, whereas
whereas the ones on
the ones on the right measure against an
ideal performance
measure, called
rationality.
A
system is rational if it does the "right thing," given what it knows. Historically, all four approaches to AI have been followed, each by different people with different methods. A human-centered approach must be in prut an empirical science, in-
2
Chapter
1.
Introduction
Thinking Hmnanl y
Thinking Rationally
"The exciting new effort to make comput-
"The study of mental faculties through the
ers think ... machines
use of computational models."
with minds, in the
full and literal sense." (Haugeland,
1985)
(Charniak and McDermott, 1985)
the computations that make
"[The automation of] activities that we
'The study of
associate with human thinking, activities
it possible to perceive, reason, and act."
such as decision-making, problem solv-
(Winston,
1992)
ing, learning ... "(Bellman, 1978)
Acting Humanly
Acting Rationally
'The art of creating machines that per-
"Computational Intelligence is the study
form functions that require intelligence
of the design of intelligent agents." (Poole
when performed by people." (Kurzweil,
et al., 1998)
1990) 'The study of how to make computers do
"AI ...is concemed with intelligent be-
things at which, at the moment, people are
havior in artifacts." (Nilsson, 1998)
better." (Rich and Knight, 1991)
Figure 1.1
Some definitions of artificial intelligence, organized into four categories.
1 volving observations and hypotheses about human behavior. A rationalist approach involves a combination of mathematics and engineering. The various group have bot.h disparaged and helped each other. Let us look at the four approaches in more detail. 1.1.1 TURING lEST
The
Acting humanly: The Turing Test approach
Thring Test, proposed by Alan Turing (1950), was designed to provide a satisfactory
operational definition of intelligence. A computer passes the test if a human interrogator, after posing some written questions, cannot tell whether the written responses come from a person or
from a computer. Chapter 26 discusses the details of the test and whether a computer would
really be intelligent if it passed. For now, we note that programming a computer to pass a rigorously applied test provides plenty to work on. The computer would need to possess the following capabilities: NATURAl IANGUAGF PROCESSING KNCWLEDGE REPRESENTATION AUTOMATED REASONING
•
natm·allanguage processing to enable it to communicate succes sfully i n English;
•
knowledge representation to store what it knows or hears;
•
automated reasoning to use the stored information to answer questions and to draw new conclus.ions;
MACHINE LEARNING
•
machine learning to adapt to new circumstances and to detect and extrapolate patterns.
1 By distinguishing between human and rational behavior, we are not suggesting that humans are necessarily "irrational" in the sense of "emotionally unstable" or "insane." One merely need note that we are not perfect: not all chess players are grandmasters; and, unfortunately, not everyone gets an A on the exam. Some systematic enors in hwnan reasoning are cataloged by Kahneman et al. (1982).
Section 1.1.
What Is AI?
3
Turing's test deliberately avoided direct physical interaction between the interrogator and the computer, because physical simulation of a person is unnecessary for intelligence. However, TOTALlURINGTEST
the so-called
total Turing Test includes a video signal so that the interrogator can test the
subject's perceptual abilities, as well as the opportunity for the interrogator to pass physical objects "through the hatch." To pass the total Turing Test, the computer will need coMPUTERVISION
•
computer vision to perceive objects, and
FOBOTICS
•
robotics to manipulate objects and move about.
These six disciplines compose most of AI, and Turing deserves credit for designing a test that remains relevant 60 years later. Yet AI researchers have devoted little effort to passing the Turing Test, believing that it is more important to study the underlying principles of in telligence than to duplicate an exemplar. The quest for "artificial flight " succeeded when the Wright brothers and others stopped imitating birds and started using wind tunnels and learn ing about aerodynamics. Aeronautical engineering texts do not define the goal of their field as making "machines that fly so exactly like pigeons that they can fool even other pigeons." 1.1.2
Thinking humanly: The cognitive modeling approach
If we are going to say that a given program thinks like a human, we must have some way of determining how hwnans think. We need to get
inside the actual workings of human minds.
There are three ways to do this: through introspection-trying to catch our own thoughts as they go by; through psychological experiments-observing a person in action; and through brain imaging-observing the brain in action. Once we have a sufficiently precise theory of
the mind, it becomes possible to express the theory as a computer program. If the p ro gram' s input-{)utput behavior matches corresponding human behavior, that is evidence that some of the program's mechanisms could also be operating in humans. For example, Allen Newell and Herbert Simon, who developed GPS, the "General Problem Solver " (Newell and Simon, 1961), were not content merely to have their program solve problems correctly. They were more concemed with comparing the trace of its reasoning steps to traces of human subjects coGNITIVE SCIENCE
solving the same problems. The interdisciplinary field of computer models from AI and experimental techniques
cognitive science brings together
from psychology
to construct precise
and testable theories of the human mind. Cognitive science is a fascinating field in itself, worthy of several textbooks and at least one encyclopedia (Wilson and Keil, 1999). We will occasionally comment on similarities or differences between AI techniques and human cognition. Real cognitive science, however, is necessarily based on experimental investigation of actual humans or animals. We will leave that for other books, as we assume the reader has only a computer for experimentation. ln the early days of AI there was often confusion between the approaches: an author would argue that an algorithm performs well on a task and that it is
therefore a good model
of hwnan performance, or vice versa. Modem authors separate the two kinds of claims; tl1is distinction has allowed both AI and cognitive science to develop more rapidly. The two fields continue to fertilize each other, most notably in computer vision, which incorporates neurophysiological evidence into computational models.
4
Chapter
1.1.3
1.
Introduction
Thinking rationally: The "laws of thought" approach
The Greek philosopher Aristotle was one of the first to attempt to codify "right thinking;' that SYLLOGISM
is, irrefutable reasoning processes. His
syllogisms provided
patterns for argument structures
that always yielded correct conclusions when given correct premises-for example, "Socrates is a man; all men are mortal; therefore, Socrates is mortal." These laws of thought were LOGIC
su pposed to govem the operation of the mind; their study initiated the field called
logic.
Logicians in the 19th century developed a precise notation for statements about all kinds of objects in the world and the relations among them. (Contrast this with ordinary arithmetic
numbers.) By 1965, programs existed any solvable problem described in logical notation. (Although
notation, which provides only for statements about that could, in principle, so]ve LOGICIST
if no solution exists, the program might loo p forever.) The so-called
logicist tradition
within
at1ificial intelligence ho pes to build on such programs to create intelligent systems. There are two main obstacles to this approach. First, it is not easy to take informal knowledge and state it in the formal terms required by logical notation, particularly when the knowledge is less than I 00% certain. Second, there is a big difference between solving a problem "in principle" and solving it .in practice. Even problems with just a few hundred facts can exhaust the computational resources of any computer unless it has some guidance as to which reasoning steps to try first. Although both of these obstacles apply to any attempt to build computational reasoning systems, they appeared first in the logicist tradition.
1.1.4 AGENT
Acting rationally: The rational agent approach
An agent
is just something that acts
(agent comes from the Latin agere,
all computer programs do something, but computer agents
are
to do). Of course,
expected to do more: operate
autonomously, perceive their environment, persist over a p rolonged time period, adapt to RATIONALAGENT
change, and create and pursue goals. A
rational agent is
one that acts so as to achieve the
best outcome or, when there is tmcertainty, the best expected outcome. In the "laws of thought" approach to AI, the emphasis was on correct inferences. Mak ing correct inferences is sometimes
part
of being a rational agent, because one way to act
rationally is to reason logically to the conclusion that a given action will achieve one's goals and then to act on that conclusion. On the other hand, correct inference is not
all of
ration
ality; in some situations, there is no provably correct thing to do, but something must still be done. There are also ways of acting rationally that carmot be said to involve inference. For example, recoiling from a hot stove is a reflex action that is usually more successful than a slower action taken after cru·eful deliberation. All the skills needed for the Turing Test also allow an agent to act rationally. Knowledge representation and reasoning enable agents to reach good decisions. We need to be able to generate comprehensible sentences in natu.ral language to get by in a complex society. We need learning not only for erudition, but also because it imp roves our ability to generate effective behavior. The rational-agent approach has two advantages over the other approaches. First, it is more general than the "laws of thought" a pproach because correct inference is just one of several possible mechanisms for achieving rationality.
Second, it is more amenable to
Section
1.2.
5
The Foundations of Artificial Intelligence
scientific development than are approaches based on human behavior or human thought. The standard of rationality is mathematically well defined and completely general, and can be "unpacked" to generate agent designs that provably achieve it. Human behavior, on the other hand, is well adapted for one specific environment and is defined by, well, the sum total
This book therefore concentrates on general principles of rational agents and on components for constructing them. We will see that despite the of all the things that humans do.
apparent simplicity with which the problem can be stated, an enonnous variety of issues come up when we try to solve it. Chapter 2 outlines some of these issues in more detail. One important point to keep in mind: We will see before too long that achieving perfect rationality-always doing the right thing-is not feasible in complicated environments. The computational demands are just too high. For most of the book, however, we will adopt the working hypothesis that perfect rationality is a good starting point for analysis. It simplifies the problem and provides the appropriate setting for most of
�'[J� �LITY
the field.
Chapters
5
and
17
the
deal explicitly with the issue of
fotmdational material in
limited
rationality-ac ting
appropriately when there is not enough time to do all the computations one might like.
1 .2
THE FOUNDATIONS OF ARTIFICIAL INTELLIGENCE
In this section, we provide a brief history of the disciplines that contributed ideas, viewpoints, and techniques to AI. Like any history, this one is forced to concentrate on a small nwnber of people, events, and ideas and to ignore others that also were important. We organize the history around a series of questions. We certainly would not wish to give the impression that these questions are the only ones the disciplines address or that the disciplines have all been working toward AI as their ultimate fruition.
1.2.1
• • • •
Philosophy Can formal rules be used to draw valid conclusions? How does the mind arise from a physical brain? Where does knowledge come from? How does knowledge lead to action?
Aristotle
(384-322 B.c.),
whose bust appears on the front cover of this book, was the first
to formulate a precise set of Jaws governing the rational part of the mind. He developed an informal system of syllogisms for proper reasoning, which in principle allowed one to gener ate conclusions mechanically, given initjal premises. Much later, Ramon Lull (d.
1315)
had
the idea that useful reasoning could actually be carried out by a mechanical artifact. Thomas Hobbes
(1588-1679) proposed
that reasoning was like numerical computation, that "we add
and subtract in our silent thoughts." TI1e automation of computation itself was already well w1der way. Around
1500, Leonardo da Vinci (1452-1519) designed but did not build
a me
chanical calculator; recent reconstructions have shown the design to be functional. The first
1623 by the German scientist Wilhelm Schickard (1 592-1635), although the Pascaline, built in 1642 by Blaise Pascal (1623-1662), known calculating machine was constructed arow1d
6
Chapter
1.
Introduction
is more famous. Pascal wrote that "the arithmetical machine produces effects which appear nearer to thought than all the actions of animals." Gottfried Wilhelm Leibniz
(1646-1716)
built a mechanical device intended to catTy out operations on concepts rather than numbers, but its scope was rather limited. Leibniz did surpass Pascal by building a calculator that could add, subtract, multiply, and take roots, whereas the Pascaline could only add and sub tract. Some speculated that machines might not just do calculations but actually be able to think and act on their own. In his
1651
book
Leviathan, Thomas Hobbes suggested
the idea
of an "artificial animal," arguing "For what is the herut but a spring; and the nerves, but so many strings; and the joints, but so many wheels." It's one thing to say that the mind operates, at least in part, according to logical rules, and to build physical systems that emulate some of those mles; it's another to say that the mind itself
is
such a physical system. Rene Descartes
(1596-1 650)
gave the first clear discussion
of the distinction between mind and matter and of the problems that arise. One problem with a purely physical conception of the mind is that it seems to leave little room for free will: if the mind is govemed entirely by physical Jaws, then it has no more free will than a rock
"deciding" to fall toward the center of the earth. Descartes was a strong advocate ofthe power
rationalism,
RATIONALISM
of reasoning in understanding the world, a philosophy now called
DUALISM
and one that
counts Aristotle and Leibnitz as members. But Descrutes was also a proponent of
dualism.
He held that there is a prut of the human mind (or soul or spirit) that is outside of nature, exempt from physical laws. Animals, on the other hand, did not possess this dual quality; MATERIALISM
they could be treated as machines. An alternative to dualism is that the brain's operation according to the laws of physics
materialism,
constitutes the mind.
which holds Free will is
simply the way that the perception of available choices appears to the choosing entity. Given a physical mind that manipulates knowledge, the next problem is to establish EMPIRICISM
empiricism movement, stru·ting with Francis Bacon's (15612 1626) Novum Organum, is characterized by a dictum of John Locke (1632-1704): "Nothing is in the understanding, which was not first in the senses." David Hume's (1711-1776) A Treatise of Human Nature (Hwne, 1739) proposed what is now known as the principle of
INDUCTION
induction:
the source of knowledge. The
that general mles are acquiJed by exposure to repeated associations between their
(1889-1951) and Bertrand Russell Rudolf Camap (1891-1970), developed the
elements. Building on the work of Ludwig Wittgenstein
(1872-1970), the LOGICALPOSITIVISM OBSERVATION SENTENCES CONFIRMATION THEORY
doctrine of logical
famous Vienna Circle, led by
positivism.
This doctrine holds that all knowledge can be characterized by
logical theories connected, ultimately, to
observation sentences
that correspond to sensory 3 inputs; thus logical positivism combines rationalism and empiJicism. The confirmation the
ory of edge
(1905-1997) attempted to analyze the acquisition of knowl Carnap's book The Logical Structure of the World (1928) defined an
Carnap and Carl Hempel
from experience.
explicit computational procedure for extracting knowledge from elementary experiences. It was probably the first theory of mind as a computational process.
2
The Novum Organum is an update of Aristotle's Organon., or instrument of thought. Thus Aristotle can be seen as both an empiricist and a rationalist. 3 In thls picture, all meaningful statements can be verified or falsified either by experimentation or by analysis of the meaning of the words. Because this rules out most of metaphysics, as was the intention, logical positivism was unpopular in some circles.
Section 1.2.
7
The Foundations of Artificial Intelligence
The final element in the philosophical picture of the mind is the cmmection between knowledge and action. This question is vital to AI because intelligence requires action as well as reasoning. Moreover, only by understanding how actions are justified can we understand how to build an agent whose actions are justifiable (or rational). Aristotle argued (in De Motu Animalium) that actions are justified by a logical connection between goals and knowledge of the action's outcome (the last part of this extract also appears on the front cover of this book, in the original Greek):
But how does it happen that thinking is sometimes accompanied by action and sometimes not, sometimes by motion, and sometimes not? It looks as if almost the same thing happens as in the case of reasoning and making inferences about unchanging objects. But in that case the end is a speculative proposition ... whereas here the conclusion which results from the two premises is an action. ... I need covering; a cloak is a covering. I need a cloak. What I need, I have to make; I need a cloak. I have to make a cloak. And the conclusion, the "I have to make a cloak," is an action. In the
Nicomachean Ethics
(Book 01. 3, 1112b), Aristot1e further elaborates on this topic,
suggesting an algorithm:
We deliberate not about ends, but about means. For a doctor does not deliberate whether he shall heal, nor an orator whether he shall persuade, ... They assume the end and consider how and by what means it is attained, and if it seems easily and best produced thereby; while if it is achieved by one means only they consider how it will be achieved by this and by what means this will be achieved, till they come to the first cause, ... and what is last in the order of analysis seems to be first in the order of becoming. And if we come on an impossibility, we give up the search, e.g., if we need money and this cannot be got; but if a thing appears possible we try to do it. Aristotle's algorithm was implemented 2300 years later by Newell and Simon in their G PS program. We would now call it a regression planning system (see Chapter 10). Goal-based analysis is useful, but does not say what to do when several actions will achieve the goal or when no action will achieve it completely. Antoine Arnauld (1612-1694)
to take in cases like this Utilitarianism (Mill, 1863) promoted
correctly described a quantitative formula for deciding what action (see Chapter 16). John Stuart Mill's (1806-1873) book
the idea of rational decision criteria in all spheres of hwnan activity. The more formal theory of decisions is discussed in the following section.
1.2.2
• • •
Mathematics What are the formal rules
to draw valid conclusions?
What can be computed? How do we reason with uncertain information?
Philosophers staked out some of the fundamental ideas of AI, but the leap
to a formal science
required a level of mathematical formalization in three fundamental areas: logic, computa tion, and probability. The idea of formal logic can be traced back to the philosophers of ancient Greece, but its mathematical development really began with the work of George Boole ( 1815-1864), who
8
Chapter
1.
Introduction
worked out the details of propositional, or Boolean, logic (Boole, 1847). In 1879, Gottlob Frege (1848-1925) extended Boote's logic to include objects and relations, creating the first 4 order logic that is used today. Alfred Tarski (1902-1983) introduced a theory of reference
that shows how to relate the objects in a logic to objects in the real world. The next step was to determine the limits of what could be done with logic and com ALGORITHM
putation. The first nontrivial
algorithm
greatest conunon divisors. The word
is thought to be Euclid's algorithm for computing
algorithm
(and the idea of studying them) comes from
al-Khowarazmi, a Persian mathematician of the 9th century, whose writings also introduced Arabic numerals and algebra to Europe. Boole and others discussed algorithms for logical deduction, and, by the late 19th century, efforts were under way to formalize general mathe matical reasoning as logical deduction. In 1930, Kurt Godel (1906-1978) showed that there exists an effective procedw·e to prove any tme statement in the first-order logic of Frege and Russell, but that first-order logic could not capture the principle of mathematical induction needed to characterize the natural numbers. ln 1931, GOdel showed that limits on deduc INCOMPLETENESS THEOREM
tion do exist. His
incompleteness theorem
showed that in any formal theory as strong as
Peano arithmetic (the elementary theory of natural numbers), there
are
true statements that
are undecidable in the sense that they have no proof within the theory. This fundamental result can also be interpreted as showing that some functions on the integers cannot be represented by an algorithm-that is, they cannot be computed. motivated Alan Turing (1912-1954) to try to characterize exactly which functions COMPUTABLE
putable--capable
the notion
This
are com
of being computed. This notion is actually slightly problematic because
of a computation or effective procedure really ca1mot be given a formal definition.
However, the Church-Turing thesis, which states that the Turing machine (Tw·ing, 1936) is capable of computing any computable function, is generally accepted as providing a sufficient definition. Turing also showed that there were some functions that no Turing machine can compute. For example, no machine can tell an
in general whether a
given program will return
answer on a given input or run forever. Although decidability and computability are important to an understanding of computa
TRACTABIUTY
tion, the notion of tractability has had an even greater impact. Roughly speaking, a problem is called intractable if the time required to solve instances of the problem grows exponentially with the size of the instances. The distinction between polynomial and exponential growth in complexity was first emphasized in the mid-1960s (Cobham, 1964; Edmonds, 1965). It is important because exponential growth means that even moderately large instances ca1mot be solved in any reasonable time. Therefore, one should strive to divide the overall problem of generating intelligent behavior into tractable subproblems rather than intractable ones. How can one recognize an intractable problem? The theory of NP-completeness, pio
NP-COMPLETENESS
neered by Steven Cook (1971) and Richard Karp (1972), provides a method. Cook and Karp showed the existence of large classes of canonical combinatorial search and reasoning prob lems that are NP-complete. Any problem class to which the class of NP-complete problems can be reduced is likely to be intractable. (Although it has not been proved that NP-.'Pioration that must be undertaken by
a vacuurn-deaning agent in an initia11y unknown environment. LEARNING
Our definition requires a rational agent not only to gather infonnation but also to as much as possible
from what
learn
it perceives. The agent's initial configuration could reflect
some prior knowledge of the environment, but as the agent gains experience this may be modified and augmented. There are extreme cases in which the environment is completely known a priori. In such cases, the agent need not perceive or Jearn; it simply acts correctly. Of cow·se, such agents are fragile. Consider the Jowly dung beetle. After digging its nest and laying its eggs, it fetches a ball of dung from a nearby heap to plug the entrance. I f the ball of dung is removed from its grasp en route, the beetle continues its task and pantomimes plug ging the nest with the nonexistent dung ball, never noticing that it is missing. Evolution has built an asswnption into the beetle's behavior, and when it is violated, unsuccessful behavior results. Slightly more intelligent is the sphex wasp. The female sphex will dig a burrow, go out and sting a caterpillar and drag it to the burrow, enter the burrow again to check all is well, drag
the caterpillar inside, and lay its eggs. The caterpillar serves as a food source when
the eggs hatch. So far so good, but if an entomologist moves the caterpillar a few inches away while
the sphex is doing
the check, it will revert to the "drag " step of its plan and will
continue the plan without modification, even after dozens of caterpillar-moving interventions. The sphex is unable to lean1 that its innate plan is failing, and thus will not change it. To the extent that an agent relies on the prior knowledge of its designer rather AUTONOMY
on its own percepts, we say that the agent lacks
than
autonomy. A rational agent should be
autonomous-it should learn what it can to compensate for partial or incorrect prior knowl edge.For example, a vacuum-cleaning agent that learns to foresee where and when additional dirt will appear will do better than one that does not. As a practical matter, one seldom re quires complete autonomy from the start: when the agent has had little or no experience, it would have to act randomly unless the designer gave some assistance. So, just as evolution provides animals with enough built-in reflexes to survive long enough to learn for themselves, it would be reasonable to provide an artificial intelligent agent with some initial knowledge as well as an ability to learn. After sufficient experience of its environment, the behavior of a rational agent can become effectively
independent
of its prior knowledge. Hence, the
incorporation of learning allows one to design a single rational agent that will succeed in a vast va riety of environments.
40 2.3
Chapter
2.
Intelligent Agents
THE NATURE OF ENVIRONMENTS
Now that we have a definition of rationality, w e TASKENVIRONMENT
are
rational agents. First, however, we must think about
almost ready
to think
about building
task environments, which are essen
tially the "problems" to which rational agents are the "solutions." We begin by showing how
to specify
a task envirorunent, illustrating the process with a nwnber of examples. We then
show that task environments come in a variety of flavors. The flavor of the task envirorunent directly affects the approptiate design for the agent program. 2.3.1
Specifying the task environment
In our discussion of the rationality of the simple vacuum-cleaner agent, we had to specify the performance measure, the environment, and the a.gent's actuators and sensors. We group all these under the heading of the task environment. For the acronymically minded, we can PEAS
this the PEAS (Performance, Environment, Actuators, Sensors) description. In designing an agent, the first step must always be to specify the task environment as fully as possible. The vacuum world was a simple example; let us consider a more complex problem: an automated taxi driver. We should point out, before the reader becomes alarmed, that a fully automated taxi is currently somewhat beyond the capabilities of existing technology. (page 28 describes an existing driving robot.) The full driving task is extremely
open-ended.
There is
no limit to the novel combinations of circumstances that can arise-another reason we chose it as a focus for discussion. Figure 2.4 summarizes the PEAS description for the taxi's task environment. We discuss each element in more detail in the following paragraphs. Agent Type
Perfonnance Measure
Environment
Actuators
Sensors
Taxi driver
Safe, fast, legal, comfortable trip,
Roads, other traffic,
Steering, accelemtor,
Catneras, sonar, speedometer,
maximize profits
pedestrians, customers
bmke, signal, horn, display
GPS, odometer, accelerometer, engine sensors, keyboard
Figure 2.4
PEAS description ofthe task environment for an automated taxi.
First, what is the
performance measure to which we would like our automated dtiver
to aspire? Desirable qualities include getting to the correct destination; minimizing fuel con sumption and wear and tear; minimizing the trip time or cost; minimizing violations of traffic Jaws and disturbances
to other drivers;
maximizing safety and passenger comfort; maximiz
ing profits. Obviously, some of these goals conflict, so tradeoffs will be required. Next, what is the driving
environment that the taxi will face? Any taxi driver must
deal with a variety of roads, ranging from rural lanes and urban alleys to 12-Jane freeways. The roads contain other traffic, pedestrians, stray animals, road works, police cars, puddles,
Section 2.3.
41
The Nature ofEnvironments
and potholes. The taxi must also interact with potential and actual passengers. There are also some optional choices. The taxi might need to operate in Southern California, where snow is seldom a problem, or in Alaska, where it seldom is not. It could always be driving on the right, or we might want it
to be flexible enough to drive on the left when in Britain or Japan.
Obviously, the more restricted the envirorunent, the easier the design problem. The actuators for an automated taxi include those available to a hwnan driver: control over the engine through the accelerator and control over steering and braking. In addition, it
to a display screen or voice synthesizer to talk back to the passengers, perhaps some way to commw1icat.e with other vehicles, politely or otherwise.
will need output The basic that it can see
sensors for the taxi will
and
include one or more controllable video cameras
so
the road; it might augment these with infrared or sonar sensors to detect dis
tances to other cars and obstacles. To avoid speeding tickets, the taxi should have a speedome ter, and to control the vehicle properly, especially on curves, it should have an accelerometer. To detetmine the mechanical state of the vehicle, it. will need the usual an·ay of engine, fuel, and electrical system sensors. Like many human drivers, it might. want a global positioning system (GPS) so that it doesn't get lost. Finally, it will need a keyboard or microphone for the passenger to request a destination. In Figure 2.5, we have sketched the basic PEAS elements for a number of additional agent types. Further examples appear in Exercise 2.4. It may come as a surprise ers that our list. of agent types n i cludes some programs that operate in
to some read
the entirely artificial
environment. defined by keyboard input and character output. on a screen. "Surely," one might say, "this is not a real environment, is it?" In fact., what matters is not the distinction between "real" and "artificial" environments, but the complexity of the relationship among the behav
ior of the agent, the percept sequence generated by the environment, and the performance measure. Some "real" environments are actually quite simple. For example, a robot designed to inspect parts as they come by on a conveyor belt can make use of a number of simplifying assumptions: that. the lighting is always just
so,
that. the only thing on the conveyor belt will
be patts of a kind that it knows about, and that only two actions (accept. or reject.) are possible. In contrast., some software agents (or software robots or
soflWAREAGENT soFTBOT
softbots) exist in rich, unlim-
to scan Internet news sources and advertising space to generate revenue.
it.ed domains. Imagine a softbot Web site operator designed show the interesting items to its users, while selling
To do well, that operator will need some natural language processing abilities, it will need to learn what each user and advertiser is interested in, and it will need to change its plans dynamically-for example, when the connection for one news source goes down or when a new one comes online. The Internet is an environment whose complexity rivals that of the physical world and whose inhabitants include many attificial and human agents.
2.3.2
Properties of task environments
The range of task envirorunents that might arise in AI is obviously vast. We can, however, identify a fairly small number of dimensions along which task environments can be catego rized. These dimensions detennine,
to a large
extent, the appropriate agent design and the
applicability of each of the principal families of techniques for agent implementation. First,
42
Chapter
Agent Type
Perfonnance Measure
Environment
Medical
Healthy patient,
diagnosis system
reduced costs
2.
Intelligent Agents
Actuators
Sensors
Patient, hospital,
Display of
Keyboard entry
staff
questions, tests,
of symptoms,
diagnoses,
findings, patient's
treatments,
answers
referrals
Satellite image
Correct image
Downlink from
Display of scene
Color pixel
analysis system
categorization
orbiting satellite
categorization
arrays
Part-picking
Percentage of
Conveyor belt
Jointed arm and
Camera, joint
robot
parts in correct
with parts; bins
hand
angle sensors
bins
Refinery
Purity, yield,
Refinery,
Valves, pumps,
Temperature,
controller
safety
operators
heaters, displays
pressure, chemical sensors
Interactive
Student's score
Set of students,
Display of
English tutor
on test
testing agency
exe.rcisest
Keyboard entry
suggestions, corrections
Figure2.5
Examples of agent types and their PEAS descriptions.
we list the dimensions, then we analyze several task environments to illustrate the ideas. The definitions here are informal; later chapters provide more precise statements and examples of each kind of environment. FULLY OBSERVABLE PJ'.RTIALLY OBSERVABLE
Fully observable vs. partially observable: If an agent's sensors give it access to the complete state of the environment at each point in time, then we say that the task environ ment is fully observable. A task environment is effectively fully observable if the sensors detect all aspects that are relevant to the choice of action; relevance, in turn, depends on the perfonnance measure. Fully observable envirorunents are convenient because the agent need not maintain any internal state to keep track of the world. An envirorunent might be partially observable because of noisy and inaccurate sensors or because parts of the state are simply missing from the sensor data-for example, a vacuum agent with only a local dirt sensor cannot tell whether there is dirt in other squares, and an automated taxi cannot see what other drivers are thinking.
UNOBSERVABLE
If the agent has no sensors at all then the environment is unobserv
able. One might think that in such cases the agent's plight is hopeless, but, as we discuss in Chapter 4, the agent's goals may still be achievable, sometimes with certainty.
SINGLE AGENT MULTIAGENT
Single agent vs. multiagent: The distinction between single-agent and multiagent en-
Section 2.3.
The Nature of Envirorunents
43
vironments may seem simple enough. For example, an agent solving a crossword puzzle by itself is clearly in a single-agent environment, whereas an agent playing chess is in a two agent envirorunent. There are, however, some subtle issues. First, we have described how an entity
may be viewed as an agent, but we have not explained which entities must be viewed
as agents. Does an agent A
(the taxi driver for example) have to treat an object B (another
vehicle) as an agent, or can it be treated merely as an object behaving according to the laws of physics, analogous to waves at the beach or leaves blowing in the wind? The key distinction is whether B's behavior is best described as maximizing a performance measure whose value depends on agent A's behavior. For example, in chess, the opponent entity B is trying to maximize its performance measure, which, by the rules of chess, minimizes agent A's percoMPETITIVE
formance measure. Thus, chess is a competitive multiagent environment. In the taxi-driving envirorunent, on the other hand, avoiding collisions maximizes the performance measure of
cooPERATIVE
all agents, so it is a partially
cooperative multiagent environment. It is also partially com
petitive because, for example, only one car can occupy a parking space. The agent-design problems in multiagent environments are often quite different from those in single-agent en vironments; for example,
communication often emerges as a rational behavior in multiagent environments; in some competitive environments, randomized behavior is rational because it avoids the pitfalls of predictability. DETERNINISTIC srocH>ISTIC
Deterministic vs. stochastic. If the next state of the environment is completely determined by the current state and the action executed by the agent, then we say the environment is deterministic; otherwise, it is stochastic. In principle, an agent need not worry about uncer tainty n i a fully observable, deterministic environment. (In our definition, we ignore uncer tainty that arises purely from the actions of other agents in a multiagent environment; thus,
a game can be deterministic even though each agent may be tmable to predict the actions of the others.) If the environment is partially observable, however, then it could stochastic. Most real situations
are so complex that it is impossible to keep
appear to be
track of all the
unobserved aspects; for practical purposes, they must be treated as stochastic. Taxi driving is clearly stochastic in this sense, because one can never predict the behavior of traffic exactly; moreover, one's tires blow out and one's engine seizes up without warning. The vacuum world as we described it is deterministic, but variations can include stochastic elements such as randomly appearing dirt and an unreliable suction mechanism (Exercise 2.13). We say an UNCERTAIN
envirorunent is
uncertain if it is not fully observable or not deterministic. One final note:
our use of the word "stochastic" generally implies that uncertainty about outcomes is quanNONDETERMINisnc
tified in terms of probabilities; a
nondeterministic envirorunent is one in which actions are
characterized by their possible outcomes, but no probabilities are attached to them. Nonde terministic environment descriptions are usually associated with performance measures that require the agent to succeed for all possible outcomes of its actions. EPISODe sEaUEHTIAL
Episodic vs. sequential: In an episodic task environment, the agent's experience is divided into atomic episodes. In each episode the agent receives a percept and then performs a single action. Crucially, the next episode does not depend on the actions taken in previous episodes.
Many classification tasks are episodic. For example, an agent that has to spot
defective parts on an assembly line bases each decision on the current part, regardless of previous decisions; moreover,
the
current decision doesn't affect whether the next part is
Chapter
44
Intelligent Agents
2.
defective. In sequential environments, on the other hand, the current decision could affect all future decisions.3 Chess and taxi driving are sequential: in both cases, short-tenn actions
can have Jong-tenn
consequences. Episodic environments are much simpler
than sequential
environments because the agent does not need to think ahead.
Static vs. dynamic:
STATIC 0\'NAMIC
If the environment can change while an agent is deliberating, then
we say the environment is dynamic for that agent; otherwise, it is static. Static environments are easy to deal with because the agent need not keep looking at the world while it is deciding on an action, nor need
H worry about the
passage of time. Dynamic environments, on the
other hand, are continuously asking the agent what it wants to do; if it hasn't decided yet,
that counts as deciding to do nothing.
If the environment itself does not change with the
passage of time but the agent's performance score does, then we say the environment is SEMIDI'NAMIC
semidynamic.
Taxi driving is clearly dynamic: the other cars and the taxi itself keep moving
while the driving algorithm dithers about what to do next. Chess, when played with a clock, is semidynarnic. Crossword puzzles are static.
Discrete vs. continuous:
DISCRETE CONTINUOUS
environment, to the way time
The discrete/continuous distinction applies to the state of the
is handled, and to the
percepts and actions of the agent. For
example, the chess environment has a finite number of distinct states (excluding the clock). Chess also has a discrete set of percepts and actions. Taxi driving is a continuous-state and continuous-time problem: the speed and location of the taxi and of the other vehicles sweep through a range of continuous values and do so smoothly over time. Taxi-driving actions are also continuous (steering angles, etc.). Input from digital cameras is discrete, strictly speak ing, but is typically treated as representing continuously varying intensities and locations.
Known vs. unknown:
KNCWN UNKNCWN
Strictly speaking, this distinction refers not to the environment
itself but to the agent's (or designer's) state of knowledge about the "laws of physics" of the environment.
In a known environment, the outcomes (or outcome probabilities if the
environment is stochastic) for all actions
are given.
Obviously, if the environment is unknown,
the agent will have to learn how it works in order to make good decisions.
Note that the
distinction between known and unknown environments is not the same as the one between fully and partially observable environments. to be
partially observable-for example,
It is quite possible for a
known
environment
in solitaire card games, I know the rules but am
still unable to see the cards that have not yet been turned over. Conversely, an
unknown
environment can be fully observable-in a new video game, the screen may show the entire game state but I still don't know what the buttons do until I try them. As one might expect, the hardes[ case is
partially observable, multiagent, stoclu:lstic,
sequential, dynamic, continuous, and unknown. Taxi driving is hard in all these senses, except that for the most part the driver's environment is known. Driving a rented car in a new country with unfamiliar geography and traffic laws is a Jot more exciting. Figure 2.6 lists the properties of a number of familiar environments.
Note that the
answers are not always cut and dried. For example, we describe the part-picking robot as episodic, because it normally considers each part n i isolation. But if one day there is a large 3 The word "sequential" is also used in computer science as the antonym of "parallel." largely unrelated.
The two meanings are
Section 2.3.
The Nature ofEnvirorunents
45
Observable Agents Determn i ist i c
Task Environment Crossword puzzle Chess with a clock
Fully Fully
Episodic
Static
Discrete
Single Deterministic Sequental i Multi Deterministic Sequential
Static Semi
Discrete Discrete
Static Static
Discrete Discrete
Poker Backgammon
Partially
Fully
Multi Multi
Stochastic Stochastic
Sequential Sequential
Taxi driving Medical diagnosis
Partially Partially
Multi Single
Stochastic Stochastic
Sequential Dynamic Continuous Sequential Dynamic Continuous
Image analysis Part-picking robot
Fully Partially
Single Deterministic Single Stochastic
Refinery controller Interact ive English tutor
Partially Partially
Single Multi
Figure2.6
Stochastic Stochastic
Episodic Episodic
Semi Continuous Dynamic Continuous
Sequential Dynamic Continuous Sequential Dynamic Discrete
Examples of task environments and their characteristics.
batch of defective parts, the robot should Jearn from several observations that the distribution of defects has changed, and should modify its behavior for subsequent parts. We have not included a "known/unknown" coltunn because, as explained earlier, this is not strictly a prop erty of the environment. For some environments, such as chess and poker, it is quite easy to supply the agent with full knowledge of the rules, but it is nonetheless interesting to consider how an agent might Jearn to play these games without such knowledge. Several of the answers in the table depend on how the task envirorunent is defined. We
have listed the medical-diagnosis task as single-agent because the disease process in a patient is not profitably modeled as an agent; but a medical-diagnosis system might also have to deal with recalcitrant patients and skeptical staff, so the envirorunent could have a multiagent aspect.. Furthermore, medical diagnosis is episodic if one conceives of the task as selecting a diagnosis given a Jist ofsymptoms; the problem is sequential ifthe task can include proposing a series of tests, evaluating progress over the course of treatment, and so on. Also, many envirorunents are episodic at higher levels than the agent's individual actions. For example, a chess tournament consists of a sequence of games; each game is an episode because (by and large) the contribution of the moves in one game to the agent's overall performance is not affected by the moves in its previous game. On the other hand, decision making within a single game is certainly sequential. The code repository associated with this book (aima.cs.berkeley.edu) includes imple mentations of a number of environments, together with a general-purpose envirorunent simu lator that places one or more agents in
a simulated
environment, observes their behavior over
time, and evaluates them according to a given performance measure. Such experiments are often carried out not for a single environment but for many environments drawn from ENVIAO.iMENT CLASS
vironment class. nm
an en
For example, to evaluate a taxi driver in simulated traffic, we would want to
many simulations with different traffic , lighting, and weather conditions. If we designed
the agent for a single scenario, we might be able to take advantage of specific properties of the pa1ticular case but might not identify a good design for driving in general. For this
Chapter
46 ENVIRONMENT GENERATOR
2.
Intelligent Agents
reason, the code repository also includes an environment generator for each envirorunent class that selects particular environments (with certain likelihoods) in which to run the agent. For example, the vacuum environment generator initializes the dirt pattern and agent location randomly. We are then interested in the agent's average performance over the environment class. A rational agent for a given environment class maximizes this average performance. Exercises 2.8 to 2.13 take you through the process of developing an environment class and evaluating various agents therein.
2.4
THE STRUCTURE OF AGENTS
So far we have talked about agents by describing behavior-the action that is performed after AGENT PROGRAM
any given sequence of percepts. Now we must bite the bullet and talk about how the insides work. The job of AI is to design an agent program that implements the agent fLmction
ARCHITECTURE
the mapping from percepts to actions. We asswne this program will run on some sort of computing device with physical sensors and actuators-we call tllis the architecture:
agent = architecture +program . Obviously, the program we choose has to be one that is appropriate for the architecture. If the program is going to recommend actions like
Walk, the architectw·e had better have legs.
The
architectLLre might be just an ordinary PC, or it might be a robotic car with several onboard computers, cameras, and other sensors. In general, the architecture makes the percepts from the sensors available to the program, nms the program, and feeds the program's action choices to the actuators as they are generated. Most of this book is about designing agent programs, although Chapters 24 and 25 deal directly with the sensors and actuators.
2.4.1
Agent programs
The agent programs that we design in this book all have the same skeleton: they take the
current percept as input from the sensors and return an action to the actuators.4 Notice the difference between the agent program, which takes the current percept as input, and the agent function, which takes the entire percept history. The agent program takes just the current percept as input because nothing more is available from the environment; if the agent's actions need to depend on the entire percept sequence, the agent will have to remember the percepts. We describe the agent programs in the simple pseudocode language that is defined in Appendix B . (The online code repository conta.ins implementations in real programming languages.) For example, Figure 2.7 shows a rather trivial agent program that keeps track of the percept sequence and then uses it to index into a table of actions to decide what to do. The table-an example of which is given for the vacuum world in Figw·e 2.3-represents explicitly the agent function that the agent program embodies. To build a rational agent in 4
There are other choices for the agent program skeleton; for example, we could have the agent programs be coroutines that run asynchronously with d1e environment. Each such coroutine has an input and output port and consists of a loop that rends the input port for percepts and writes actions to the output port.
Section 2.4.
The Structure of Agents
47
ftmction TABLE-DRIVEN-AGENT(pen:ept) returns an action persistent: percepts, a sequence, initially empty table, a table of actions, n i dexed by percept sequences, n i itially fully specified append percept to the end of percepts
action
Figure 3.32
�
3.
Solving Problems by Searching
� � gl� �
The track pieces in a wooden railway set; each is labeled with the number of
copies in the set. Note that curved pieces and "fork" pieces ("switches" or "points") can be flipped over so they can curve n i either direction. Each curve subtends 45 degrees.
3.14
Which of the following are true and which are false? Explain your answers.
a. Depth-first search always expands at least as many nodes as A* search with an admissib. c. d.
e.
ble heuristic. h(n) = 0 is an admissible heuristic for the 8-puzzle. A* is of no use in robotics because percepts, states, and actjons are continuous. Breadth-first search is complete even if zero step costs are allowed. Assume that a rook can move on a chessboard any number of squares in a straight line, vertically or horizontally, but cannot jump over other pieces. Manhattan distance is an admissible heuristic for the problem of moving the rook from square A to square B in the smallest number of moves.
Consider a state space where the start state is number 1 and each state k has two successors: numbers '2k and '2k + 1.
3.15
a. Draw the portion of the state space for states 1 to b. Suppose the goal state is
15.
11. List the order in which nodes will be visited for breadth
first search, depth-limited search with limit 3, and iterative deepening search. c. How well would bidirectional search work on this problem? What is the branching factor in each direction of the bidirectional search? d. Does the answer to (c) suggest a reformulation of the problem that would allow you to solve the problem of getting from state 1 to a given goal state with almost no search? e. Call the action going from k to 2k Left, and the action going to 2k + 1 Right. Can you find an algorithm that outputs the solution to this problem without any search at all? 3.16
A basic wooden railway set contains the pieces shown in Figure 3.32. The task is to
connect these pieces into a railway that has no overlapping tracks and no loose ends where a
train could run off onto the floor. a. Suppose that the pieces fit together exactly with no slack. Give a precise formulation of
the task as a search problem. b. Identify a suitable tminformed search algorithm for this task and explain your choice. c. Explain why removing any one of the "fork" pieces makes the problem tmsolvable.
117
Exercises
d. Give an upper bound on the total size of the state space defined by your formulation.
(Hint:
think about the maximum branching factor for the construction process and the
maximum depth, ignoring the problem of overlapping pieces and loose ends. Begin by pretending that every piece is unique.)
3.17
On page
90, we mentioned iterative lengthening search,
an iterative analog of uni
form cost search. The idea is to use increasing limits on path cost. If a node is generated whose path cost exceeds the current limit, it is immediately discarded. For each new itera tion, the limit is set to the lowest path cost of any node discarded in the previous iteration.
a. Show that this algorithm is optimal for general path costs. b. Consider a uniform tree with branching factor b, solution depth d, and unit step costs. How many iterations will iterative lengthening require?
c. Now consider step costs drawn from the continuous range
[E, 1], where 0 < E < 1 . How
many iterations are required in the worst case?
d. Implement the algorithm and apply it to instances of the 8-puzzle and traveling sales person problems. Compare the algorithm's perfonnance to that of unif01m-cost search, and comment on your results.
3.18
Describe a state space in which iterative deepening search pe1fonns much worse than
depth-first search (for example,
3.19
O(n2) vs. O(n)).
Write a program that will take as input two Web page URLs and find a path of links
from one to the other. What is an appropriate search strategy? Is bidirectional search a good
idea? Could a search engine be used to implement a predecessor function?
3.20
Consider the vacuum-world problem defined in Figure 2.2.
a. Which of the algorithms defined in this chapter would be appropriate for this problem? Should the algorithm use tree search or graph search?
b. Apply your chosen algorithm to compute an optimal sequence of actions for a 3 x 3 world whose initial state has dirt in the three top squares and the agent in the center.
c. Construct a search agent for the vacuum world, and evaluate its performance in a set of
3 x 3 worlds with probability 0.2 of dirt in each square.
Include the search cost as well
as path cost in the performance measure, using a reasonable exchange rate.
d. Compare your best search agent with a simple randomized reflex agent that sucks if there is dirt and otherwise moves randomly.
e. Consider what would happen if the world were enlarged to n x n. How does the per formance of the search a.gent and of the reflex agent vary with n'?
3.21
Prove each of the following statements, or give a cow1terexample:
a. Breadth-first search is a special case of uniform-cost search. b. Depth-first search is a special case of best-first tree search. c. Uniform-cost search is a special case of A* search.
Chapter
118 3.22
3.
Solving Problems by Searching
Compare the pe1formance of A* and RBFS on a set of randomly generated problems
in the 8-puzzle (with Manhattan distance) and TSP (with MST-see Exercise 3.30) domains. Discuss your results. What happens to the performance of RBFS when a small random num ber is added to the hew·istic values in the 8-puzzle domain?
3.23
Trace the operation of A* search applied to the problem of getting to Bucharest from
Lugoj using the straight-line distance heuristic. That is, show the sequence of nodes that the algorithm will consider and the
3.24
Devise a state space in which
with an HEURISTIC PATH ALGORITHM
f, g, and h score for each node.
A* using GRAPH-SEARCH returns a suboptimal solution
h(n) fw1ction that is admissible but inconsistent.
1977) is a best-first search in which the evalu ation function is f(n) = (2 - w)g(n) + wh(n). For what values of w is this complete? For what values is it optimal, assuming that h is admissible? What kind of search does this perform for w = 0, w = 1 , and w = 2'? 3.25
The heuristic
3.26
Consider the unbounded version of the regular 2D grid shown in Figure
state is at the origin, a.
path algorithm (Pohl,
(0,0), and the goal state is at (x, y).
What is the branching factor b in this state space?
b. How many distinct states are there at depth c.
3.9. The start
k (for k > 0)?
What is the maximwn number of nodes expanded by breadth-first tree search?
d. What is the maximum number of nodes expanded by breadth-first graph search? e.
Is
h = iu - xi + lv - Y l
an admissible hew·istic for a state at
(u, v)? Explain.
f. How many nodes are expanded by A* graph search using h? g. Does h remain admissible if some links are removed'? h. Does h remain admissible if some links are added between nonadjacent states?
3.27 n vehicles occupy squares (1, 1) through (n, 1) (i.e., the bottom row) of ann x n grid. The vehicles must be moved to the top row but in reverse order; so the vehicle
i that starts in
(i, 1) must end up in (n- i + 1 , n) . On each time step, every one of the n vehicles can move
one square up, down, left, or right, or stay put; but if a vehicle stays put, one other adjacent vehicle (but not more than one) can hop over it. Two vehicles cannot occupy the same square. a.
Calculate the size of the state space as a function of n.
b. Calculate the branching factor as a fw1ction of n. c.
Suppose that vehicle
i is at (xi, Yi); write a nontrivial admissible heuristic hi for the i + 1 , n), assuming no
number of moves it will require to get to its goal location ( n other vehicles are on the grid.
d. Which of the following heuristics are admissible for the problem of moving all n vehicles to their destinations'? Explain. (i) (ii) (iii)
L::=
hi. max{h1 , . . . , hn} · min {h1 , . . . , hn } · 1
1 19
Exercises 3.28
Invent a heuristic function for the 8-puzzle that sometimes overestimates, and show
how it can lead to a suboptimal solution on a particular problem. (You can use a computer to help if you want.) Prove that if h never overestimates by more than c, A* using h returns a solution whose cost exceeds that of the optimal solution by no more than c. 3.29
Prove that if a heuristic is consistent, it must be admissible. Construct an admissible
heuristic that is not consistent.
I••�
3.30
The traveling salesperson problem (TSP) can be solved with the minimum spanning -
tree (MS1) heuristic, which estimates the cost of completing a tour,
given that a pa rtial tour
has already been constructed. The MST cost of a set of cities is the smallest sum of the link costs of any tree that c01mects all the cities. a.
Show how this heuristic can be derived from a relaxed version of the TSP.
b. Show that the MST heuristic dominates straight-line distance. c.
Write a problem generator for instances of the TSP where cities are represented by random points in the unit square.
d. Find an efficient algorithm in the literature for constructing the MST, and use it with A*
graph search to solve instances of the TSP. On page 105, we defined the relaxation of the 8-puzzle in which a ti'le can move from
3.31
square A to square B if B is blank. The exact solution of this problem defines Gaschnig's
1979). Explain why Gaschnig's heuristic is at least as accurate as h1 (misplaced tiles), and show cases where it is more accurate than both h1 and h2 (Manhattan heuristic (Gaschnig,
distance). Explain how to calculate Gaschnig's heuristic efficiently.
liiii�
3.32
We gave two simple heuristics for the 8-puzzle: Manhattan distance and misplaced
tiles. Several heuristics in the literature purport to improve on this--see, for example, Nils son
(1971), Mostow and Priedi tis (1989), and Hansson eta/. ( 1 992). Test these claims by
implementing the heuristics and comparing the performance of the resulting algorithms.
BEYOND CLASSICAL
4
SEARCH
In which we relax the simplifying assumptions of the previous chapter; thereby getting closer to the real world. Chapter
3 addressed
a single category of problems: observable, deterministic, known envi
ronments where the solution is a sequence of actions. In this chapter, we look at what happens when these assumptions are relaxed. We begin with a fairly simple case: Sections cover algorithms that perfonn purely
4.1 and 4.2
local search in the state space, evaluating and modify
ing one or more current states rather than systematically exploring paths from an initial state. These algorithms are suitable for problems in which all that matters is the solution state, not the path cost to reach it. The family of local search algorithms includes methods inspired by statistical physics
(simulated annealing) and evolutionary biology (genetic algorithms).
Then, in Sections
4.3-4.4, we examine
what happens when we relax the assumptions
ofdeterminism and observability. The key idea is that if an agent cannot predict exactly what percept it will receive, then it will need to consider what to do tmder each
contingency that
its percepts may reveal. With partial observability, the agent will also need to keep track of the states it might be in. Finally, Section
4.5
investigates
online search, in which the agent is faced with a state
space that is initially unknown and must be explored.
4.1
LOCAL SEARCH ALGORITHMS AND OPTIMIZATION PROBLEMS
The search algorithms that we have seen so far are designed to explore search spaces sys tematically. This systematicity is achieved by keeping one or more paths in memory and by recording which alternatives have been explored at each point along the path. When a goal is
the path to that goal also constitutes a solution to the problem. In many problems, how ever, the path to the goal is irrelevant. For example, in the 8-queens problem (see page 71),
found,
what matters is the final configuration of queens, not the order in which they are added. The same general property holds for many important applications such as integrated-circuit de sign, factory-floor layout,job-shop scheduling, automatic programming, telecommunications network optimization, vehicle routing, and portfolio management.
120
Section 4.1.
LOCAL SEARCH CURRENT NODE
OPTIMIZATION PROBLEM OBJECTIVE FUNCTION
STATE-SPliCE LANDSCAPE
GLOBAL MINIMUM GLOBAL MAXIMUM
Local Search Algorithms and Optimization Problems
121
If the path to the goal does not matter, we might consider a different class of algo rithms, ones that do not worry about paths at all. Local search algorithms operate using a single current node (rather than multiple paths) and generally move only to neighbors of that node. Typically, the paths followed by the search are not retained. Although local search algorithms are not systematic, they have two key advantages: (1) they use very little memory-usually a constant ammmt; and (2) they can often find reasonable solutions in large or infinite (continuous) state spaces for which systematic algorithms are unsuitable. In addition to finding goals, local search algorithms are useful for solving pure op timization problems, in which the aim is to find the best state according to an objective function. Many optimization problems do not fit the "standard" search model introduced in Chapter 3. For example, nature provides an objective function-reproductive fitness-that Darwinian evolution could be seen as attempting to optimize, but there is no "goal test" and no "path cost" for this problem. To understand local search, we find it useful to consider the state-space landscape (as in Figure 4.1 ). A landscape has both "location" (defined by the state) and "elevation" (defined by the value of the heuristic cost function or objective function). If elevation corresponds to cost, then the aim is to find the lowest valley-a global minimum; if elevation corresponds to an objective function, then the aim is to find the highest peak-a global maximum. (You can convert from one to the other just by inserting a minus sign.) Local search algorithms explore this landscape. A complete local search algorithm always finds a goal if one exists; an optimal algorithm always finds a global minimum/maximum.
objective func.tion
-- global maximum
shoulder
�
.------
local maximum
'-------1---� state space current state Figure 4.1
A one-dimensional state-space landscape in which elevation corresponds to the
objective function. The aim is to find the global maximum. Hill-climbing search modifies the current state to try to m i prove it, as shown by the arrow. The various topographic features are
defined in the text.
122
Chapter
4.
Beyond Classical Search
function HILL-CLIMBING(pmblem) returns a state that is a local maximum
current ..- MAKE-NODE(problem.lNITIAL-STATE) loop do
neighbor .._ a highest-valued successor of curTent
if neighbor.VALUE ::; current.VALUE then return cun-ent.STATE
cun-ent ....- neighbor
Figure 4.2
The hill-climbing search algorithm, which is the most basic local search tech
nique. At each step the current node is replaced by the best neighbor; in this version, that means the neighbor with the highest VALUE, but if a heuristic cost estimate h is used, we would find the neighbor with the lowest h.
4.1.1
Hill-climbing search
HILLCLIMBING
TI1e
STEEPESTASCENT
simply a loop that continually moves in the direction of increasing value-that is, uphill. It terminates when it reaches a "peak" where no neighbor has a higher value. The algorithm
hill-climbing search algorithm (steepest-ascent version) is shown in Figure 4.2. It is
does not maintain a search tree, so the data structure for the current node need only record the state and the value of the objective function. Hill climbing does not look ahead beyond the immediate neighbors of the current state. This resembles trying to find the top of Mow1t Everest in a thick fog while suffering from amnesia. To illustrate hill climbing, we will use the 8-queens problem introduced on page 71. Local search algorithms typically use a complete-state formulation, where each state has 8 queens on the board, one per column. The successors of a state are all possible states generated by moving a single queen to another square in the same column (so each state has 8 x 7 = 56 successors). The hemistic cost function
h is the number of pairs of queens that
are attacking each other, either directly or indirectly. The global minimum of this function is zero, which occurs only at perfect solutions. Figure 4.3(a) shows a state with h = 17. The figure also shows the values of all its successors, with the best successors having
h = 12.
Hill-climbing algorithms typically choose randomly among the set of best successors if there is more than one. GREEDY LOCAL SEARCH
Hill climbing is sometimes called greedy local search because it grabs a good neighbor state without thinking allead about where to go next. Although greed is considered one of the seven deadly sins, it tums out that greedy algorithms often perform quite well. Hill climbing often makes rapid progress toward a solution because it is usually quite easy to improve a bad state. For example, from the state in Figure 4.3(a), it takes just five steps to reach the state in Figure 4.3(b), which has h = 1 and is very nearly a solution. Unfortunately, hill climbing often gets stuck for the following reasons:
LOCALMAXIMUM
•
Local maxima: a local maximum is a peak that is higher than each of its neighboring states but lower than the global maximum. Hill-climbing algorithms that reach the vicinity of a local maximum will be drawn upward toward the peak but will then be stuck with nowhere else to go. Figure 4.1 illustrates the problem schematically. More
Section 4.1.
Local Search Algorithms and Optimization Problems
123
14 16 14
'if
13
16
13 16
15
�
14
16
16
18
15
'i¥
15
15
15
14
14
� -
14
17
17
� -
18
14
14
14
13
'if 14
17
15
'if
16
� 16
(a)
(b)
(a) An 8-queens state with heuristic cost estimate h = 17, showing the value of
Figure 4.3
h for each possible successor obtained by moving a queen within its column. The best moves (b) A local minimum in the 8-queens state space; the state has h = 1 but every
are marked.
successor has a highercost.
concretely, the state in Figure 4.3(b) is a local maximum (i.e., a local minimum for the cost RIDGE
h); every move of a single queen makes the situation worse.
• Ridges:
a ridge is shown in rigure 4.4. Ridges result in a sequence of local maxima
that is very difficult for greedy algorithms to navigate. PLATEAU SHOULDER
• Plateaux:
a plateau is a flat area of the state-space landscape.
maximum, from which no uphill exit exists, or a
It can be a flat local
shoulder, from
which progress is
possible. (See Figure 4.1.) A hill-climbing search might get lost on the plateau. In each case, the algoritlun reaches a point at which no progress is being made. Stat1ing from a randomly generated 8-queens state, steepest-ascent hill climbing gets stuck 86% of the time, solving only 14% of problem instances. It works quickly, taking just 4 steps on average when 8 it succeeds and 3 when it gets stuck-not bad for a state space with 8 � 17 million states. The algorithm in Figure 4.2 halts if it reaches a plateau where the best successor has the same value as the current state. Might it not be a good idea to keep going-to allow a
SIDEWAYS MOVE
sideways mo,·e n i the hope that the plateau is really a shoulder, as shown in Figure 4.1? The answer is usually yes, but we must take care. If we always allow sideways moves when there are no uphill moves, an infinite loop will occur whenever the algorithm reaches a flat local maximum that is not a shoulder. One common solution is to put a limit on the number of con secutive sideways moves allowed. For example, we could allow up to, say, 100 consecutive sideways moves in the 8-queens problem. This raises the percentage of problem instances solved by hill climbing from 14% to 94%. Success comes at a cost: the algorithm averages roughly 21 steps for each successful instance and 64 for each failure.
124
Chapter
Figure 4.4
4.
Beyond Classical Search
Dlustration of why ridges cause difficulties for hill climbing. The grid of states
(dark circles) is superimposed on a ridge rising from left to right, creating a sequence oflocal maxima that are not directly connected to each other. From each local maximum, all the available actions point downhill. STOCHASriC HILL CLIMBING
Many variants of hill climbing have been invented. Stochastic hill climbing chooses at random from among the uphill moves; the probability of selection can vary with the steepness
FIRST.CHOICE HILL CLIMBING
of the uphill move. This usually converges more slowly than steepest ascent, but in some state landscapes, it finds better solutions.
First-choice hill climbing implements stochastic
hill climbing by generating successors randomly until one is generated that is better than the current state. This is a good strategy when a state has many (e.g., thousands) of successors. The hill-climbing algorithms described so far are incomplete-they often fail to find a goal when one exists because they RANOOM-RESTART HILLCLIMBING
can
get stuck on local maxima.
climbing adopts the well-known adage, "If at first you
Random-restart hill
don't succeed, try, try again." It con
1
ducts a series of hill-climbing searches from randomly generated initial states, until a goal is found. It is trivially complete with probability approaching I , because it will eventually generate a goal state as the initial state. If each hill-climbing search has a probability
p of
1/p. For 8-queens instances with � 0.14, so we need roughly 7 iterations to find a goal (6 fail
success, then the expected number of restarts required is no sideways moves allowed, p
1 success). The expected number of steps is the cost of one successful iteration plus (1 -p)jp times the cost offailure, or roughly 22 steps in all. When we allow sideways moves, 1/0.94 � 1.06 iterations are needed on average and ( 1 x 2 1 ) + (0.06/0.94) x 64 � 25 steps.
ures and
For 8-queens, then, random-restart hill climbing is very effective indeed. Even for three mil 2 lion queens, the approach can find solutions in tmder a minute. 1 Generating a random state from an implicitly specified state space can be a hard problem in itself.
2 Luby et al. ( 1993) prove that it is best, in some cases, to restart a randomized search algorithm after a particular, fixed amount of time and that this can be much more efficient than letting each search continue indefinitely. Disallowing or limiting the number of sideway� moves is an example of this idea.
Section 4.1.
125
Local Search Algorithms and Optimization Problems
The success of hill climbing depends very much on the shape of the state-space land scape: if there are few local maxima and plateaux, random-restatt hill climbing will find a good solution very quickly. On the other hand, many real problems have a landscape that looks more like a widely scattered family of balding porcupines on a flat floor, with miniature porcupines living on the tip of each porcupine needle, ad infinitum. NP-hard problems typi cally have an exponential number of local maxima to get stuck on. Despite this, a reasonably good local maximum can often be found after a smaU number of restarts. 4.1.2
SIMULATED ANNEALING
GRADIENT DESCENT
Simulated annealing
A hill-climbing algorithm that never makes "downhill" moves toward states with lower value (or higher cost) is guaranteed to be incomplete, because it can get stuck on a local maxi mum. In contrast, a purely random walk-that is, moving to a successor chosen uniformly at random from the set of successors-is complete but extremely inefficient. Therefore, it seems reasonable to try to combine hill climbing with a random walk in some way that yields both efficiency and completeness. Simulated annealing is such an algorithm. In metallurgy, annealing is the process used to temper or harden metals and glass by heating them to a high temperature and then gradually cooling them, thus allowing the material to reach a low energy crystalline state. To explain simulated annealing, we switch our point of view from hill climbing to gradient descent (i.e., minimizing cost) and imagine the task of getting a ping-pong ball into the deepest crevice in a bwnpy surface. If we just let the ball roll, it will come to rest at a local minimum. If we shake the swface, we can bounce the ball out of the local minimum. 1l1e trick is to shake just hard enough to bounce the ball out of local min ima but not hard enough to dislodge it from the global minimum. The simulated-annealing solution is to start by shaking hard (i.e., at a high temperature) and then gradually reduce the intensity of the shaking (i.e., lower the temperature). The innermost loop ofthe simulated-annealing algorithm (Figure 4.5) is quite similar to hill climbing. Instead ofpicking the best move, however, it picks a random move. Ifthe move improves the situation, it is always accepted. Otherwise, the algorithm accepts the move with some probability less than 1. The probability decreases exponentially with the "badness" of the move-the amount tJ.E by which the evaluation is worsened. The probability also de creases as the "temperature" T goes down: "bad" moves are more likely to be allowed at the start when T is high, and they become more unlikely as T decreases. If the schedule lowers T slowly enough, the algorithm will find a global optimwn with probability approaching l . Simulated annealing was first used extensively to solve VLSI layout problems in the
early 1980s. It has been applied widely to factory scheduling and other large-scale optimiza tion tasks. In Exercise 4.4, you are asked to compare its performance to that of random-restart hill climbing on the 8-queens puzzle. 4.1.3 LOCAL BEAM SEARCH
Local beam search
Keeping just one node in memory might seem to be an extreme reaction to the problem of memory limitations. The local beam search algorithm3 keeps track of k states rather than 3 Local berun search is an adaptation of beam search, which is a path based algorithm. -
Chapter 4.
126
Beyond Classical Search
function SJMULATED-ANNEALING(problem, schedule) returns a solution state inputs:
pmblem, a problem schedule, a mapping from time to "temperature"
current ..- MAKE-NODE(pmblem.INJTIAL-STATB)
for t = l to oo do
T +- schedule(t) if T = 0 then return current
next..- a randomly selected successor of current f:l.E +- next.VALUE - current. VALUE if f:l.E > 0 then cun-ent ..- next else cun-ent +-next only with probability ei:>.E/T
Figure 4.5
The simulated annealing algorithm, a version of stochastic hill climbing where
some downhill moves are allowed. Downhill moves are accepted readily early in the anneal ing schedule and then less often as time goes on. The schedule input determines the value of the temperature T as a function of time.
STOCHASTIC BEAM SEARCH
just one. It begins with k randomly generated states. At each step, all the successors of a]I k states are generated. If any one is a goal, the algorithm halts. Otherwise, it selects the k best successors from the complete list and repeats. At first sight, a local beam search with k states might seem to be nothing more than mnning k random restarts in parallel instead of in sequence. In fact, the two algorithms are quite different. In a random-restart search, each search process runs independently of the others. In a local beam search, useful information is passed among the parallel search threads. In effect, the states that generate the best successors say to the others, "Come over here, the grass is greener!" The algoritlun quickly abandons unfruitful searches and moves its resources to where the most progress is being made. In its simplest form, local beam search can suffer from a lack of diversity among the k states-they can quickly become concentrated in a small region of the state space, making the search little more than an expensive version of hill climbing. A variant called stochastic beam search, analogous to stochastic hill climbing, helps alleviate this problem. Instead of choosing the best k from the the pool of candidate successors, stochastic beam search chooses k successors at random, with the probability of choosing a given successor being an increasing function of its value. Stochastic beam search bears some resemblance to the process of natural selection, whereby the "successors" (offspring) of a "state" (organism) populate the next generation according to its "value" (fitness). 4.1.4
GENETIC ALGORITHM
Genetic algorithms
i which successor states A genetic algorithm (or GA) is a variant of stochastic beam search n are generated by combining two parent states rather than by modifying a single state. The analogy to natural selection is the same as in stochastic beam search, except that now we are dealing with sexual rather than asexual reproduction.
Section 4.1.
7
Local Search Algorithms and Optimization Problems
24748552 32752411 24415124 1 325432131
327!5 2411 24 7!4 8552 32752�11 24415!124
11
(a)
(b)
(c)
Initial Population
Fitnc.�s Function
Selection
Figure 4.6
12
�1 3274@52 1 247�2411 �I 24752411 1 3275 2l124 1 ·1 32@;2124 �I 2441541rn (d)
Crossover
(e) Mutation
The geneli
symbol is sometimes written in other books as :J or �.
(if and only if). The sentence W1'3 ¢::> write this as =:.
•W2'2 is a biconditional. Some other books
Sentence -+ AtomicSentence I ComplexSentence A tomicSentence -+
T1·ue I False I P I
QI
R
I
ComplexSentence -+ ( Sentence ) I [ Sentence ] -, Sentence Sentence 1\ Sentence Sentence V Sentence Sentence => Sentence
Sentence ¢:} Sentence OPERATOR PRECEDENCE
Figure 7.7
•,/\, V,=?,¢}
A BNF (Backus-Naur Form) grammar of sentences in propositional logic,
along with operator precedences, from highest to lowest.
Section 7.4.
Propositional Logic: A Very Simple Logic
245
Figure 7.7 gives a formal grammar of propositional logic; see page 1060 if you are not familiar with the BNF notation. The BNF grammar by itself is ambiguous; a sentence with several operators can be parsed by the grammar in multiple ways. To eliminate the ambiguity we define a precedence for each operator. The "not" operator ( •) has the highest precedence, which means that in the sentence ·A 1\ B the --, binds most tightly, giving us the equivalent of (·A) 1\ B rather than •(A 1\ B). (The notation for ordinary arithmetic is the same: -2 + 4 is 2, not -6.) When in doubt, use parentheses to make sure of the right interpretation. Square brackets mean the same thing as parentheses; the choice of square brackets or parentheses is solely to make it easier for a human to read a sentence. 7.4.2
TRUTH VALUE
Semantics
Having specified the syntax of propositional logic, we now specify its semantics. The se mantics defines the rules for determining the truth of a sentence with respect to a particular model. In propositional logic, a model simply fixes the truth value-true or false-for ev ery proposition symbol. For example, if the sentences in the knowledge base make use of the proposition symbols P1,2, P2,2, and P3,1, then one possible model is
m1 = {P1,2 =false, P2,2 =false, ?3,1 = true} . 3 With three proposition symbols, there are 2 = 8 possible models-exactly those depicted in Figure 7.5. Notice, however, that the models are purely mathematical objects with no necessary connection to wumpus worlds. P1,2 is just a symbol; it might mean "there is a pit in [1 ,2]" or "I'm in Paris today and tomorrow." The semantics for propositional logic must specify how to compute the truth value of any sentence, given a model. This is done recm·sively. All sentences are constructed from
atomic sentences and the five connectives; therefore, we need to specify how to compute the truth of atomic sentences and how to compute the truth of sentences formed with each of the five connectives. Atomic sentences are easy: • •
True is true in every model and False is false in every model. The truth value of every other proposition symbol must be specified directly in the model. For example, in the model m1 given earlier, P1,2 is false.
For complex sentences, we have five rules, which hold for any subsentences P and Q in any model m (here "iff" means "if and only if''): • --,p is true iff P •
•
P 1\ Q is true iff both P and Q are true in m. P V Q is true iff either P or Q is true in m.
• P� •
TRUTH "!ABLE
is false in m.
Q is true unless P is true and Q is false in m.
P ¢:? Q is true iff P and Q are both true or both false in m.
The rules can also be expressed with truth tables that specify the truth value of a complex sentence for each possible assignment of truth values to its components. Truth tables for the five connectives are given in Figure 7.8. From these tables, the truth value of any sentence s can be computed with respect to any model m by a simple recursive evaluation. For example,
246
Chapter
7.
Logical Agents
p
Q
,p
P I\ Q
PVQ
P � Q
P ¢:? Q
false false true true
false true false true
true true
false false false true
false true t·rue t·rue
true true
true
false false
false t·rue
false false true
Figure 7.8 Truth tables for the five logical connectives. To use the table to compute, for example, the value of P V Q when P is true and Q is false, first look on the left for the row where Pis t1-ue and Q is false (the third row). Then look in that row under the PvQ column to see the result: true.
the sentence •P1,2 /\ (P2,2 V ?3,1), evaluated in m1, gives true 1\ (false V true ) = true 1\ true = true. Exercise 7.3 asks you to write the algorithm PL-TRUE?(s, m), which computes the truth value of a propositional logic sentence s in a model m. The truth tables for "and," "or;' and "not" are in close accord with our intuitions about
the English words. The main point of possible confusion is that P V Q is true when P is true
or Q is true or both. A different connective, called "exclusive or" ("xor" for short), yields false when both disjuncts are true.7 There is no consensus on the symbol for exclusive or; some choices are V or =f. or $. The truth table for � may not quite fit one's intuitive understanding of "P implies Q" or "if P then Q." For one thing, propositional logic does not require any relation ofcausation or relevance between P and Q. The sentence "5 is odd implies Tokyo is the capital of Japan" is a true sentence of propositional logic (under the normal interpretation), even though it is a decidedly odd sentence of English. Another point of confusion is that any implication is true whenever its antecedent is false. For example, "5 is even implies Sam is smart" is true, regardless of whether Sam is smart. This seems bizarre, but it makes sense if you think of "P � Q" as saying, "If P is true, then I am claiming that Q is true. Otherwise I am making no claim." The only way for this sentence to be false is if P is true but Q is false. The biconditional, P ¢:? Q, is true whenever both P � Q and Q � P are true. In English, this is often written as "P if and only if Q." Many of the rules ofthe wumpus world are best written using ¢:?. For example, a square is breezy ifa neighboring square has a pit, and a square is breezy only ifa neighboring square has a pit. So we need a biconditional,
B1,1 ¢:? (P1,2 v P2,1) , where B1 1 means that there is a breeze in [1 ,1]. '
7.4.3
A simple knowledge base
Now that we have defined the semantics for propositional logic, we can construct a knowledge base for the wumpus world. We focus first on the immutable aspects of the wumpus world, leaving the mutable aspects for a later section. For now, we need the following symbols for each [x, ·y] location: 7
Lntin has a separate word, aut, for exclusive or.
Section 7.4.
Propositional Logic: A Very Simple Logic
247
Px,y is true ifthere is a pit in [x, y]. Wx,y is true if there is a wumpus in [x, y], dead or alive. Bx,y is true if the agent perceives a breeze in [x, y]. Sx,y is true if the agent perceives a stench in [x, y]. The sentences we write will suffice to derive •P1,2 (there is no pit in [ 1 ,2]), as was done informally in Section 7.3. We label each sentence Ri so that we can refer to them: • There is no pit in [1,1]:
R1 :
•H' 1 .
• A square is breezy if and only if there is a pit in a neighboring square. This has to be
stated for each square; for now, we n i clude just the relevant squares:
¢:? (P1,2 v P2,1) . R3 : ¢:? (P1,1 V P2,2 V P3,1) . • The preceding sentences are true in all wwnpus worlds. Now we include the breeze R2 :
B1,1 B2,1
percepts for the first two squares visited in the specific world the agent is in, leading up to the situation in Figure 7.3(b).
R4 : Rs : 7.4.4
•B11 ' . B2,1 .
A simple inference procedure
Our goal now is to decide whether KB f= a for some sentence a:. For example, is •P1,2 entailed by our KB'? Our first algorithm for inference is a model-checking approach that is a direct implementation of the definition of entailment: enumerate the models, and check that a is true in every model in which KB is true. Models are assignments of true or false to every proposition symbol. Retuming to our wumpus-world example, the relevant proposi tion symbols are B1,1, B2,1, H,1, H,2, P2,1, P2,2, and P3,1· With seven symbols, there are 27 = 128 possible models; in three of these, KB is true (Figure 7.9). In those three models, --,pl '2 is true, hence there is no pit .in [1,2]. On the other hand, P2'2 is true in two of the three models and false in one, so we cannot yet tell whether there is a pit in [2,2]. Figure 7.9 reproduces in a more precise form the reasoning illustrated in Figure 7.5. A general algorithm for deciding entailment in propositional logic is shown in Figure 7. I 0. Like the BACKTRACKING-SEARCH algorithm on page 215, TT-ENTAILS'? performs a recursive enumeration of a finite space of assignments to symbols. The algorithm is sound because it implements directly the definitjon of entailment, and complete because it works for any KB and a and always terminates-there are only finitely many models to examine. Of course, "finitely many" is not always the same as "few." If KB and a contain n symbols in all, then there are 2n models. Thus, the time complexity of the algorithm is 0(2n). (The space complexity is only O(n) because the emuneration is depth-first.) Later in tllis chapter we show algorithms that are much more efficient in many cases. Unfortunately, propositional entailment is co-NP-complete (i.e., probably no easier than NP-complete-see Appendix A), so every known inference algorithm for propositional logic has a worst-case
complexity that is exponential in the size ofthe input.
248
Chapter
7.
Logical Agents
B1,1
B2,1
11,1
Pt,2
P2,1
P2,2
Ps,1
R1
R2
Rs
I4
Rs
KB
false
false
false
false
false
false
true
t1·ue
true
tme
false
false
false
false
false
false
false
false
false true
true
t1-ue
false
t1·ue
false
false
false
true
false
false
false
false
false
true
tl-ue
false
t1·ue
true
false
false
true
false
false
false
false
true
true
t1-ue
true
t1·ue
true
true
false
true
false
false
false
true
false
true
true
true
true
true
true
false
true
false
false
false
t1-ue
true
true
true
true
t1·ue
true
true
false
true
false
false
true
false
false
true
false
false
tme
true
false
tme
true
true
true
true
true
true
false
t1-ue
true
false
true
false
A truth table constructed for the knowledge base given in the text. KB is true
Figure 7.9
if R1 through Rs are true, which occurs in just 3 of the 128 rows (the ones underlined in the
right-hand column). In all 3 rows, Pt,2 is false, so there is no pit in [I ,2]. On the other hand, there might (or might not) be a pit in [2,2].
function TT-ENTAILS?(KB,a) returns true or false inputs: KB, the knowledge base, a sentence in propositional logic
a, the query, a sentence in propositional logic symbols False
A grammar for conjunctive normal form, Horn clauses, and definite clauses. A clause such as A 1\ B => Cis still a definite clause when it is written as -,A V -,B V C,
Figure 7.14
but only the former is considered the canonical form for definite clauses. One more class is
the k-CNF sentence, which s i a CNF sentence where each clause has at most k literals.
FORWARD-CHAINING BAC�AAD CHAINI�G
2. Inference with Hom clauses can be done through the forward-chaining and backward chaining algorithms, which we explain next. Both of these algoritluns are natural, in that the inference steps are obvious and easy for humans to follow. This type of inference is the basis for logic programming, which is discussed in Chapter 9.
3. Deciding entailment with Hom clauses can be done in time that is linear n i the size of the knowledge base-a pleasant surprise. 7.5.4
Forward and backward chaining
The forward-chaining algorithm PL-FC-ENTAILS?(KB,q) determines if a single proposi tion symbol q-the query-is entailed by a knowledge base of definite clauses. It begins from known facts (positive literals) in lhe knowledge base. If all the premises of an implica tion are known, then its conclusion is added to the set of known facts. For example, if £1,1 and Breeze are known and (£1,1 1\ Breeze) => B1,1 is in the knowledge base, then B1,1 can be added. This process continues until the query q is added or until no further inferences can be made. The detailed algoritlun is shown in Figure 7.15; the main point to remember is that it runs in linear time. The best way to understand the algoritlun is through an example and a picture. Fig ure 7 .16(a) shows a simple knowledge base of Hom clauses with A and B as known facts. Figure 7.16(b) shows the same knowledge base drawn as an AND-OR graph (see Chap ter 4). ln AND-OR graphs, multiple links joined by an arc indicate a conjunction-every link must be proved-while multiple links without an arc indicate a disjunction-any link can be proved. It is easy to see how forward chaining works in the graph. The known leaves (here, A and B) are set, and inference propagates up the graph as far as possible. ¥/her ever a conjunction appeat-s, the propa.gation waits w1til all the conjuncts are known before proceeding. The reader is encouraged to work through the example in detail.
Chapter
258
7.
Logical Agents
function PL-FC-ENTAILS?(KB, q) returns tr·ue or false inputs:
KB, the knowledge base, a set of propositional definite clauses
q, the query, a proposition symbol count � a table, where count[c1 is the number of symbols in c's premise inferred +- a table, where infen-ed[s] is initially false for all symbols agenda � a queue of symbols, initially symbols known to be true in KB while agenda s i not empty do p ..- POP(agenda) if p = q then return tme if inferred[p1 = false then infen-ed [p1 � tme for each clause c in KB where p is in c.PREMISE do decrement count[c1 if count[c)= 0 then add c. CONCLUSION to agenda return false The forward-chaining algorithm for propositional logic. The agenda keeps track of symbols known to be true but not yet "processed." The count table keeps track of Figure 7.15
how many premises of each implication are as yet unknown. Whenever a new symbol p from the agenda is processed, the count is reduced by one for each implication in whose premise
p appears (easily identified in constant time with appropriate indexing.) If a count reaches zero,
all the premises of the m i plication
are known, so its conclusion can
be added to
the
agenda. Finally, we need to keep track of which symbols have been processed; a symbol that is already in the set of inferred symbols need not be added to the agenda again. This avoids
redundant work and prevents loops caused by implications such as P
FIXED POINT
OO.lii·DRIVEN
::::;. Q and Q ::::;. P.
It is easy to see that f01ward chaining is sound: every inference is essentially an appli cation of Modus Ponens. Forward chaining is also complete: every entailed atomic sentence will be derived. The easiest way to see this is to consider the final state of the inferred table (after the algorithm reaches a fixed point where no new inferences are possible). The table contains true for each symbol inferred during the process, and false for all other symbols. We can view the table as a logical model; moreover, every definite clause in the original KB is true in this model. To see this, assume the opposite, namely that some clause a1A . . . Aak => b is false in the model. Then a1 A . . . A ak must be true in the model and b must be false in the model. But this contradicts ow· assumption that the algorithm has reached a fixed point! We can conclude, therefore, that the set of atomic sentences inferred at the fixed point defines a model of the original KB. Fm1hermore, any atomic sentence q that is entailed by the KB must be true in all its models and in this model in particular. Hence, every entailed atomic sentence q must be inferred by the algorithm. Forward chaining is an example of the general concept of data-driven reasoning-that is, reasoning in which the focus of attention starts with the known data. It can be used within an agent to detive conclusions from incoming percepts, often without a specific query in mind. For example, the wumpus agent might TELL its percepts to the knowledge base using
Section 7.6.
Effective Propositional Model Checking
259
Q
P => Q L /\ M => P B /\ L =? M A /\ P => L A /\ B =? L A B A (a) Figure 7.16
B
(b)
(a) A set of Horn clauses. (b) The corresponding AND-OR graph.
an incremental forward-chaining alg01ithm in which new facts can be added to the agenda to initiate new inferences. In htunans, a certain amount of data-driven reasoning occurs as new information arrives. For example, if I am indoors and hear rain starting to fall, it might occw· to me that the picnic will be canceled. Yet it will probably not occur to me that the seventeenth petal on the largest rose in my neighbor's garden will get wet; humans keep forward chaining tmder careful control, lest they be swamped with irrelevant consequences. The backward-chaining algorithm, as its name suggests, works backward from the yut:ry. If lilt: yut:ry q is kuuwu
GOAL·DRECTED REASONING
7.6
lu
bt: lrut:, lht:ll uu wurk is llt:t:tktl. Otltt:rwist:, lilt: algurilluu
finds those implications in the knowledge base whose conclusion is q. If all the premises of one of those implications can be proved true (by backward chaining), then q is true. When applied to the query Q in Figw·e 7.16, it works back down the graph tmtil it reaches a set of known facts, A and B, that forms the basis for a proof. The algorithm is essentially identical to the AND-OR-GRAPH-SEARCH algorithm in Figure 4 . 1 1 . As with forward chaining, an efficient implementation runs in linear time. Backward chaining is a form of goal-directed reasoning. It is useful for answering specific questions such as "What shall I do now?" and "Where are my keys?" Often, the cost of backward chaining is much less than linear in the size of the knowledge base, because the process touches on1y relevant facts.
EFFECTIVE PROPOSITIONAL MODEL CHECKING
In this section, we describe two famities of efficient algotithms for general propositional inference based on model checking: One approach based on backtracking search, and one on local hill-climbing search. These algorithms are part of the "technology" of propositional logic. This section can be skimmed on a first reading of the chapter.
260
Chapter
7.
Logical Agents
The algorithms we desctibe are for checking satisfiability: the SAT problem. (As noted earlier, testing entailment, a: f= f3, can be done by testing unsatisfiability of a: 1\ •f3.) We have already noted the connection between finding a satisfying model for a logical sentence and finding a solution for a constraint satisfaction problem, so it is perhaps not surprising that the two families of algorithms closely resemble the backtracking algorithms of Section 6.3 and the local search algorithms of Section 6.4. They are, however, extremely important in their own right because so many combinatorial problems in computer science can be reduced to checking the satisfiability of a propositional sentence. Any improvement in satisfiability algorithms has huge consequences for our ability to handle complexity in general. 7.6.1 OO.VIS-PUTNAM ALGORITHM
A complete backtracking algorithm
TI1e first algorithm we consider is often called the Davis-Putnam algoritlun, after the sem inal paper by Martin Davis and Hilary Putnam (1960). The algorithm is in fact the version described by Davis, Logemann, and Loveland (1962), so we will call it DPLL after the ini tials of all four authors. DPLL takes as input a sentence in conjunctive normal form-a set of clauses. Like BACKTRACKING-SEARCH and TT-ENTAILS?, it is essentially a recursive, depth-first enumeration of possible models. It embodies three improvements over the simple scheme of TT-ENTAILS? :
• Early termination: The algorithm detects whether the sentence must be true or false, even with a pattially completed model. A clause is true if any literal is true, even if the other literals do not yet have truth values; hence, the sentence as a whole could be judged true even before the model is complete. For example, the sentence (A V B) 1\ (A V C) is true if A is true, regardless of the values of B and C. Similarly, a sentence is false if any clause is false, which occurs when each of its literals is false. Again, this can occur long before the model is complete. Early termination avoids examination of entire subtrees in the search space. PURE SYMBOL
• Pure symbol heuristic: A pure symbol is a symbol that always appears with the same
"sign" in all clauses. For example, in the three clauses (A V B) (-,B V •C), and (C V A), the symbol A is pure because only the positive literal appears, B is pure because only the negative literal appears, and C is impure. It is easy to see that if a sentence has a model, then it bas a model with the pure symbols assigned so as to make their literals true, because doing so can never make a clause false. Note that, in determining the purity of a symbol, the algorithm can ignore clauses that are already known to be true in the model constructed so far. For example, if the model contains B =false, then the clause ( -,B V C) is already true, and in the remaining clauses C appears only as a positive literal; therefore C becomes pw·e. ·
,
·
• Unit clause heuristic: A unit clause was defined earlier as a clause with just one lit eral. In the context of D PLL, it also means clauses in which all literals but one are already assigned false by the model. For example, if the model contains B = true, then (-,B V C) simplifies to -,C, which is a unit clause. Obviously, for this clause to be true, C must be set to false. The unit clause heuristic assigns all such symbols before branching on the remainder. One impottant consequence of the heuristic is that ·
Section 7.6.
Effective Propositional Model Checldng
ftmction
261
DPLL-SATISFIABLE?(s) returns true or false s, a sentence in propositional logic
inputs:
clauses ....- the set of clauses in the CNF representation of s symbols ....- a list of the proposition symbols in s return DPLL(clauses, symbols, { }) ftmction
DPLL(clauses, symbols, model) returns true or false
f i every clause in clauses is true in model then return t1'ue if some clause in
clauses is false in model then return false
P, value < FIND-PUR!l-SYMDOL(symbols, clauses, model)
if Pis non-null then return DPLL(clauses, symbols - P, model
U { P=value})
P, value t- FIND-UNIT-CLAUSE(clauses, model)
if Pis non-null then return DPLL(clauses, symbols - P, model U { P=value})
P t- FIRST(symbols); Test ....- REST(symbols) return DPLL(clauses, rest, model U {P=true}) or DPLL(clauses, rest, model U {P=false}))
The DPLL algorithm for checking satisfiability of a sentence in propositional behind FIND-PURE-SYMBOL and FIND-UNIT-CLAUSE are described in the text; each returns a symbol (or null) and the truth value to assign to that symbol. Like Figure 7.17
logic. The ideas
TT-ENTAILS?, DPLL operates overpartial models.
UNIT PROAIGATION
any attempt to prove (by refutation) a literal that is already in the knowledge base will succeed immediately (Exercise 7.23). Notice also that assigning one unit clause can create another unit clause-for example, when C is set to false, (C V A) becomes a unit clause, causing true to be assigned to A. This "cascade" of forced assignments is called unit propagation. It resembles the process of forward chaining with definite clauses, and indeed, if the CNF expression contains only definite clauses then DPLL essentially replicates forward chaining. (See Exercise 7.24.) The DPLL algorithm is shown in Figure 7.17, which gives the the essential skeleton of the search process. What Figure 7.17 does not show are the tricks that enable SAT solvers to scale up to large problems. It is interesting that most of these tricks are in fact rather general, and we have seen them before in other guises:
1 . Component analysis (as seen with Tasmania in CSPs): As DPLL assigns truth values to variables, the set of clauses may become separated into disjoint subsets, called com ponents, that share no unassigned variables. Given an efficient way to detect when this occurs, a solver can gain considerable speed by working on each component separately.
2. Variable and value ordering (as seen in Section 6.3.1 for CSPs): Our simple imple mentation of DPLL uses an arbitrary variable ordering and always tries the value true before false. The degree heuristic (see page 216) suggests choosing the variable that appears most frequently over all remaining clauses.
262
Chapter
7.
Logical Agents
3. Intelligent backtracking (as seen in Section 6.3 for CSPs): Many problems that can not be solved in hours of run time with chronological backtracking can be solved in seconds with intelligent backtracking that backs up all the way to the relevant point of conflict.. All SAT solvers that do intelligent backtracking use some form of conflict clause learning to record conflicts so that they won't be repeated later in the search. Usually a limited-size set of conflicts is kept, and rarely used ones are dropped.
4. Random restarts (as seen on page 124 for hill-climbing): Sometimes a run appears not to be making progress. In this case, we can start over from the top of the search tree, rather than trying to continue. After restarting, different random choices (in variable and value selection) are made. Clauses that are learned in the first run are retained after the restart and can help prune the search space. Restarting does not guarantee that a solution will be found faster, but it does reduce the variance on the time to solution.
5. Clever indexing (as seen in many algorithms): The speedup methods used in DPLL itself, as well as the tricks used in modem solvers, require fast indexing of such things as "the set of clauses in which variable
Xi
appears as a positive literal." This task is
complicated by the fact that the algorithms are interested only in the clauses that have not yet been satisfied by previous assignments to variables, so the indexing structures must be updated dynamically as the computation proceeds. With these enhancements, modern solvers can handle problems with tens of millions of vari ables. They have revolutionized areas such as hardware verification and security protocol verification, which previously required laborious, hand-guided proofs. 7.6.2
Local search algorithms
We have seen several local search algorithms so far in this book, including HILL-CLIMBING (page 122) and SIMULATED-ANNEALING (page 126). These algorithms can be applied di rectly to satisfiability problems, provided that we choose the right evaluation function. Be cause the goal is to find an assignment that satisfies every clause, an evaluation function that counts the nwnber of unsatisfied clauses will do the job. In fact, this is exactly the measure used by the MIN-CONFLICTS algorithm for CSPs (page 221). All these algorithms take steps in the space of complete assignments, flipping the tmth value of one symbol at a time. The space usually contains many local minima, to escape from which various forms of random ness are required. In recent years, there has been a great deal of experimentation to find a good balance between greediness and randomness. One ofthe simplest and most effective algorithms to emerge from all this work is called WALKSAT (Figure 7.18). On every iteration, the algorithm picks an unsatisfied clause and picks a symbol in the clause to flip. It chooses randomly between two ways to pick which symbol to flip: (I) a "min.-expressions provide a useful notation in whlch new function symbols are constructed "on the fly." For
example, the function that squares its argumem can be written as (>.x x x x) and can be applied to arguments just like any other function symbol. A >.-expression can also be defined and used as a predicate symbol. (See Chapter 22.) The lambda operator in Lisp plays exactly the same role. Notice that the use of>. in thls way does not increase the formal expressive power of first-{)rder logic, because any sentence that includes a >.-expression can be rewritten by "plugging in" its arguments to yield an equivalent sentence.
Section 8.2.
Syntax and Semantics of First-Order Logic
295
ATOMICSENTENCE
sentence (or atom for sh01t) is formed from a predicate symbol optionally followed by a
ATOM
parenthesized Jist of terms, such as
Brother(Richard, John). This states, under the intended interpretation given earlier, that Richard the Lionheart is the brother of King John.6 Atomic sentences can have complex terms as arguments. Thus,
Married (Father(Richard), Mother( John)) states that Richard the Lionheart's father is married to King Jolm's mother (again, under a suitable interpretation). An atomic sentence is true in a given model ifthe relation referred to by the predicate symbol holds among the objects referred to by the arguments. 8.2.5
Complex sentences
We can use logical connectives to construct more complex sentences, with the same syntax and semantics as .in propositional calculus. Here are four sentences that are tme in the model ofFigure 8.2 under our intended interpretation:
·Brother(LeftLeg(Richard), John) Brother(Richard, John) A Brother(John, Richm{Mo)
(N )l I�M { ii Mo) Mi.u
1l jO•n • s(Nano,M)
l ... e ne nl)tx.America)MfflosJi l e(x)
�Mi.'iSils(y)l/l.t...Sdls(lVe.fi,)' are markup directives that specify how the page is displayed. For example, the string Select means to switch to italic font, display the word Select, and then end the use of italic font. A page identifier such as http: I Iexample . com/books is called a uniform resource locator (URL). The markup Books means to create a hypertext link to uri with the anchor text Books.
Web user would see pages displayed as an array of pixels on a screen, the shopping agent will perceive
a
page as a character string consisting of ordinary words interspersed with for
matting commands in the HTML markup language. Figure 12.8 shows a Web page and a corresponding HTML character string. The perception problem for the shopping agent in volves extracting useful information from percepts of this kind. Clearly, perception on Web pages is easier than, say, perception while driving a taxi in Cairo. Nonetheless, there are complications to the Intemet perception task. The Web page in Figure 12.8 is simple compared to real shopping sites, which may include CSS, cookies, Java, Javascript, Flash, robot exclusion protocols, malfonned HTML, sound files, movies, and text that appears only as part of a JPEG image. An agent that can deal with all of the Internet is almost as complex as a robot that can move in the real world. We concentrate on a simple agent that ignores most of these complications. The agent's first task is to collect product offers that are relevant to a query. If the query is "laptops," then a Web page with a review of the latest high-end laptop would be relevant, but if it doesn't provide a way to buy, it isn't an offer. For now, we can say a page is an offer if it contains the words "buy" or "price" or "add to cart" within an HTML link or form on the
464
Chapter
12.
Knowledge Representation
page. For example, if the page contains a string of the form " 100, making 0(2n) impractical. The full joint distribution in tabular form is just not a practical tool for building reasoning systems. Instead, it should be viewed as the theoretical foundation on which more effective approaches may be built, just as truth tables formed a theoretical foundation for more practical algorithms like DPLL. The remainder of this chapter introduces some of the basic ideas required in preparation for the development of realistic systems in Chapter 14. 1 3 .4
INDEPENDENCE
Let us expand the full joint distribution in Figure 13.3 by adding a fourth variable, Weather. The full joint distribution then becomes P( Toothache, Catch, Cavity, Weather), which has 2 x 2 x 2 x 4 = 32 entries. It contains four "editions" of the table shown in Figw·e 13.3, one for each kind of weather. What relationship do these editions have to each other and to the original three-variable table? For example, how are P(toothache, catch, cavity, cloudy) and ?(toothache, catch, cavity) related? We can use the product rule:
P(toothache, catch, cavity, cloudy) = P(cloudy | toothache, catch, cavity) P(toothache, catch, cavity) .
Now, unless one is in the deity business, one should not imagine that one's dental problems influence the weather. And for indoor dentistry, at least, it seems safe to say that the weather does not influence the dental variables. Therefore, the following assertion seems reasonable:
P(cloudy | toothache, catch, cavity) = P(cloudy) .
(13.10)
From this, we can deduce
P(toothache, catch, cavity, cloudy) = P(cloudy) P(toothache, catch, cavity) .
A similar equation exists for every entry in P(Toothache, Catch, Cavity, Weather). In fact, we can write the general equation
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather) .
INDEPENDENCE
Thus, the 32-element table for four variables can be constructed from one 8-element table and one 4-element table. This decomposition is illustrated schematically in Figure 13.4(a). The property we used in Equation (13.10) is called independence (also marginal independence and absolute independence). In particular, the weather is independent of one's dental problems. Independence between propositions a and b can be written as
P(a | b) = P(a)   or   P(b | a) = P(b)   or   P(a ∧ b) = P(a)P(b) .
(13.11)
All these forms are equivalent (Exercise 13.12). Independence between variables X and Y can be written as follows (again, these are all equivalent):
P(X | Y) = P(X)   or   P(Y | X) = P(Y)   or   P(X, Y) = P(X)P(Y) .
Independence assertions are usually based on knowledge of the domain. As the toothache-weather example illustrates, they can dramatically reduce the amount of information necessary to specify the full joint distribution. If the complete set of variables can be divided
Figure 13.4   Two examples of factoring a large joint distribution into smaller distributions, using absolute independence. (a) Weather and dental problems are independent. (b) Coin flips are independent.
into independent subsets, then the full joint distribution can be factored into separate joint distributions on those subsets. For example, the full joint distribution on the outcome of n independent coin flips, P(C1, . . . , Cn), has 2^n entries, but it can be represented as the product of n single-variable distributions P(Ci). In a more practical vein, the independence of dentistry and meteorology is a good thing, because otherwise the practice of dentistry might require intimate knowledge of meteorology, and vice versa.
When they are available, then, independence assertions can help in reducing the size of the domain representation and the complexity of the inference problem. Unfortunately, clean separation of entire sets of variables by independence is quite rare. Whenever a connection, however indirect, exists between two variables, independence will fail to hold. Moreover, even independent subsets can be quite large: for example, dentistry might involve dozens of diseases and hundreds of symptoms, all of which are interrelated. To handle such problems, we need more subtle methods than the straightforward concept of independence.
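To make the factorization concrete, here is a minimal Python sketch (not from the text) that rebuilds the 32-entry joint P(Toothache, Catch, Cavity, Weather) from the 8-entry dental table of Figure 13.3 and a 4-entry weather prior; the weather probabilities used below are illustrative assumptions, not values given in the book.

```python
# A minimal sketch of absolute-independence factorization (Equation 13.10).
from itertools import product

# Eight entries of the dental joint, keyed by (toothache, catch, cavity),
# following Figure 13.3.
dental = {
    (True, True, True): 0.108,  (True, False, True): 0.012,
    (True, True, False): 0.016, (True, False, False): 0.064,
    (False, True, True): 0.072, (False, False, True): 0.008,
    (False, True, False): 0.144, (False, False, False): 0.576,
}
# Illustrative weather prior (assumed values, chosen only to sum to 1).
weather = {'sunny': 0.6, 'rain': 0.1, 'cloudy': 0.29, 'snow': 0.01}

# P(t, c, cav, w) = P(t, c, cav) * P(w) for every combination.
joint = {(t, c, cav, w): dental[(t, c, cav)] * weather[w]
         for (t, c, cav), w in product(dental, weather)}

assert abs(sum(joint.values()) - 1.0) < 1e-9   # still a proper distribution
print(joint[(True, True, True, 'cloudy')])      # 0.108 * 0.29 = 0.03132
```

The 32-entry table never has to be stored explicitly; the 8-entry and 4-entry factors carry all the information.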
13.5 BAYES' RULE AND ITS USE
On page 486, we defined the product rule. It can actually be written in two forms:
P(a ∧ b) = P(a | b)P(b)   and   P(a ∧ b) = P(b | a)P(a) .
Equating the two right-hand sides and dividing by P(a), we get
BAYES' RULE
P(b | a) = P(a | b)P(b) / P(a) .   (13.12)
This equation is known as Bayes' rule (also Bayes' law or Bayes' theorem). This simple equation underlies most modern AI systems for probabilistic inference.
The more general case of Bayes' rule for multivalued variables can be written in the P notation as follows:
P(Y | X) = P(X | Y)P(Y) / P(X) .
As before, this is to be taken as representing a set of equations, each dealing with specific values of the variables. We will also have occasion to use a more general version conditionalized on some background evidence e:
P(Y | X, e) = P(X | Y, e)P(Y | e) / P(X | e) .   (13.13)
13.5.1 Applying Bayes' rule: The simple case
On the surface, Bayes' rule does not seem very useful. It allows us to compute the single term P(b | a) in terms of three terms: P(a | b), P(b), and P(a). That seems like two steps backwards, but Bayes' rule is useful in practice because there are many cases where we do have good probability estimates for these three numbers and need to compute the fourth. Often, we perceive as evidence the effect of some unknown cause and we would like to determine that cause. In that case, Bayes' rule becomes
P(cause | effect) = P(effect | cause)P(cause) / P(effect) .
DIAGNOSTIC
The conditional probability P(effect | cause) quantifies the relationship in the causal direction, whereas P(cause | effect) describes the diagnostic direction. In a task such as medical diagnosis, we often have conditional probabilities on causal relationships (that is, the doctor knows P(symptoms | disease)) and want to derive a diagnosis, P(disease | symptoms). For example, a doctor knows that the disease meningitis causes the patient to have a stiff neck, say, 70% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff neck is 1%. Letting s be the proposition that the patient has a stiff neck and m be the proposition that the patient has meningitis, we have
P(s | m) = 0.7
P(m) = 1/50000
P(s) = 0.01
P(m | s) = P(s | m)P(m) / P(s) = (0.7 × 1/50000) / 0.01 = 0.0014 .   (13.14)
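As a quick sanity check, the calculation in Equation (13.14) can be reproduced in a few lines of Python using only the three numbers quoted above; this is just the arithmetic, nothing more.

```python
# Bayes' rule for the meningitis example, Equation (13.14).
p_s_given_m = 0.7        # P(s | m): stiff neck given meningitis
p_m = 1 / 50000          # P(m): prior probability of meningitis
p_s = 0.01               # P(s): prior probability of a stiff neck

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)       # 0.0014 -- fewer than 1 in 700
```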
That is, we expect less than 1 in 700 patients with a stiff neck to have meningitis. Notice that even though a stiff neck is quite strongly indicated by meningitis (with probability 0.7), the probability of meningitis in the patient remains small. This is because the prior probability of stiff necks is much higher than that of meningitis. Section 13.3 illustrated a process by which one can avoid assessing the prior probability of the evidence (here, P(s)) by instead computing a posterior probability for each value of
the query variable (here, m and ¬m) and then normalizing the results. The same process can be applied when using Bayes' rule. We have
P(M | s) = α ⟨P(s | m)P(m), P(s | ¬m)P(¬m)⟩ .
Thus, to use this approach we need to estimate P(s | ¬m) instead of P(s). There is no free lunch; sometimes this is easier, sometimes it is harder. The general form of Bayes' rule with normalization is
P(Y | X) = α P(X | Y)P(Y) ,   (13.15)
where α is the normalization constant needed to make the entries in P(Y | X) sum to 1.
One obvious question to ask about Bayes' rule is why one might have available the conditional probability in one direction, but not the other. In the meningitis domain, perhaps the doctor knows that a stiff neck implies meningitis in 1 out of 5000 cases; that is, the doctor has quantitative information in the diagnostic direction from symptoms to causes. Such a doctor has no need to use Bayes' rule. Unfortunately,
diagnostic knowledge is often more fragile than causal knowledge. If there is a sudden epidemic of meningitis, the unconditional probability of meningitis, P(m), will go up. The doctor who derived the diagnostic probability P(m | s) directly from statistical observation of patients before the epidemic will have no idea how to update the value, but the doctor who computes P(m | s) from the other three values will see that P(m | s) should go up proportionately with P(m). Most important, the causal information P(s | m) is unaffected by the epidemic, because it simply reflects the way meningitis works. The use of this kind of direct causal or model-based knowledge provides the crucial robustness needed to make probabilistic systems feasible in the real world.
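The normalization route of Equation (13.15) can be sketched the same way. The value chosen below for P(s | ¬m) is an assumption made purely for illustration (meningitis is rare enough that P(s | ¬m) is close to the overall 1% stiff-neck rate); the text does not supply it.

```python
# Normalized form of Bayes' rule: P(M | s) = alpha * <P(s|m)P(m), P(s|~m)P(~m)>.
p_s_given_m, p_m = 0.7, 1 / 50000
p_s_given_not_m, p_not_m = 0.01, 1 - 1 / 50000   # P(s|~m) is an assumed value

unnormalized = [p_s_given_m * p_m, p_s_given_not_m * p_not_m]
alpha = 1 / sum(unnormalized)
p_m_given_s, p_not_m_given_s = (alpha * u for u in unnormalized)
print(round(p_m_given_s, 4), round(p_not_m_given_s, 4))   # ~0.0014 and ~0.9986
```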
13.5.2 Using Bayes' rule: Combining evidence
We have seen that Bayes' rule can be useful for answering probabilistic queries conditioned on one piece of evidence, for example, the stiff neck. In particular, we have argued that probabilistic information is often available in the form
P(effect I cause).
What happens when
we have two or more pieces of evidence? For example, what can a dentist conclude if her nasty steel probe catches in the aching tooth of a patient? If we know the full joint distribution (Figure 13.3), we can read off the answer:
P(Cavity | toothache ∧ catch) = α ⟨0.108, 0.016⟩ ≈ ⟨0.871, 0.129⟩ .
We know, however, that such an approach does not scale up to larger numbers of variables. We can try using Bayes' rule to reformulate the problem:
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity) .   (13.16)
For this reformulation to work, we need to know the conditional probabilities of the conjunction toothache ∧ catch for each value of Cavity. That might be feasible for just two evidence variables, but again it does not scale up. If there are n possible evidence variables (X-rays, diet, oral hygiene, etc.), then there are 2^n possible combinations of observed values for which we would need to know conditional probabilities. We might as well go back to using the full joint distribution. This is what first led researchers away from probability theory toward
approximate methods for evidence combination that, while giving incorrect answers, require fewer numbers to give any answer at all.
Rather than taking this route, we need to find some additional assertions about the domain that will enable us to simplify the expressions. The notion of independence in Section 13.4 provides a clue, but needs refining. It would be nice if Toothache and Catch were independent, but they are not: if the probe catches in the tooth, then it is likely that the tooth has a cavity and that the cavity causes a toothache. These variables are independent, however, given the presence or the absence of a cavity. Each is directly caused by the cavity, but neither has a direct effect on the other: toothache depends on the state of the nerves in the tooth, whereas the probe's accuracy depends on the dentist's skill, to which the toothache is irrelevant.⁵ Mathematically, this property is written as
CONDITIONAL INDEPENDENCE
P(toothache ∧ catch | Cavity) = P(toothache | Cavity)P(catch | Cavity) .   (13.17)
This equation expresses the conditional independence of toothache and catch given Cavity. We can plug it into Equation (13.16) to obtain the probability of a cavity:
P(Cavity | toothache ∧ catch) = α P(toothache | Cavity) P(catch | Cavity) P(Cavity) .   (13.18)
Now the information requirements are the same as for inference using each piece of evidence separately: the prior probability P(Cavity) for the query variable and the conditional probability of each effect, given its cause.
The general definition of conditional independence of two variables X and Y, given a third variable Z, is
P(X, Y | Z) = P(X | Z)P(Y | Z) .
In the dentist domain, for example, it seems reasonable to assert conditional independence of the variables Toothache and Catch, given Cavity:
P(Toothache, Catch | Cavity) = P(Toothache | Cavity)P(Catch | Cavity) .   (13.19)
Notice that this assertion is somewhat stronger than Equation (13.17), which asserts independence only for specific values of Toothache and Catch. As with absolute independence in Equation (13.11), the equivalent forms
P(X | Y, Z) = P(X | Z)   and   P(Y | X, Z) = P(Y | Z)
can also be used (see Exercise 13.17). Section 13.4 showed that absolute independence assertions allow a decomposition of the full joint distribution into much smaller pieces. It turns out that the same is true for conditional independence assertions. For example, given the assertion in Equation (13.19), we can derive a decomposition as follows:
P(Toothache, Catch, Cavity) = P(Toothache, Catch | Cavity)P(Cavity)   (product rule)
  = P(Toothache | Cavity)P(Catch | Cavity)P(Cavity)   (using 13.19).
(The reader can easily check that this equation does in fact hold in Figure 13.3.) In this way, the original large table is decomposed into three smaller tables. The original table has seven
5 We assume that the patient and dentist are distinct individuals.
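For readers who want to verify the decomposition numerically, here is a short sketch that checks Equation (13.19) against the eight entries of the full joint distribution in Figure 13.3; the dictionary of joint values is transcribed from that table.

```python
# Check that P(Toothache, Catch, Cavity) = P(Toothache|Cavity) P(Catch|Cavity) P(Cavity)
# holds for every entry of the Figure 13.3 joint distribution.
joint = {  # keyed by (toothache, catch, cavity)
    (True, True, True): 0.108,  (True, False, True): 0.012,
    (True, True, False): 0.016, (True, False, False): 0.064,
    (False, True, True): 0.072, (False, False, True): 0.008,
    (False, True, False): 0.144, (False, False, False): 0.576,
}

def marginal(**fixed):
    """Sum the joint entries that agree with the fixed variable values."""
    return sum(p for (t, c, cav), p in joint.items()
               if all({'toothache': t, 'catch': c, 'cavity': cav}[k] == v
                      for k, v in fixed.items()))

for t in (True, False):
    for c in (True, False):
        for cav in (True, False):
            p_cav = marginal(cavity=cav)
            factored = (marginal(toothache=t, cavity=cav) / p_cav) * \
                       (marginal(catch=c, cavity=cav) / p_cav) * p_cav
            assert abs(factored - joint[(t, c, cav)]) < 1e-9
print("decomposition holds for all eight entries")
```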
independent numbers (2^3 = 8 entries in the table, but they must sum to 1, so 7 are independent). The smaller tables contain five independent numbers (for a conditional probability distribution such as P(T | C) there are two rows of two numbers, and each row sums to 1, so that's two independent numbers; for a prior distribution like P(C) there is only one independent number). Going from seven to five might not seem like a major triumph, but the point is that, for n symptoms that are all conditionally independent given Cavity, the size of the representation grows as O(n) instead of O(2^n). That means that conditional independence assertions can allow probabilistic systems to scale up; moreover, they are much more commonly available than absolute independence assertions.
SEPARATION
Conceptually, Cavity separates Toothache and Catch because it is a direct cause of both of them. The decomposition of large probabilistic domains into weakly connected subsets through conditional independence is one of the most important developments in the recent history of AI.
a commonly
OCCUlTing pattern in which a single cause
directly influences a number of effects, all of which are conditionally independent, given the cause. The full joint distribution can be written as
P( Cause, Effect1 , . . . , Effect,) NAIVE 1'11\YES
=
P( Cause) IT P(Effecti I Cause) . i
Such a probability distribution is called a
Such a probability distribution is called a naive Bayes model ("naive" because it is often used, as a simplifying assumption, in cases where the "effect" variables are not actually conditionally independent given the cause variable). (The naive Bayes model is sometimes called a Bayesian classifier, a somewhat careless usage that has prompted true Bayesians to call it the idiot Bayes model.) In practice, naive Bayes systems can work surprisingly well, even when the conditional independence assumption is not true. Chapter 20 describes methods for learning naive Bayes distributions from observations.
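A naive Bayes posterior computation for the dental example might look like the following sketch. The conditional probabilities are the ones implied by Figure 13.3 (for instance P(toothache | cavity) = 0.12/0.2 = 0.6), and the result reproduces the ⟨0.871, 0.129⟩ answer obtained earlier from the full joint table.

```python
# Naive Bayes evaluation of Equation (13.18) for the dental domain.
prior = {'cavity': 0.2, 'no_cavity': 0.8}
p_toothache = {'cavity': 0.6, 'no_cavity': 0.1}   # P(toothache | Cavity)
p_catch = {'cavity': 0.9, 'no_cavity': 0.2}       # P(catch | Cavity)

# alpha * P(toothache|Cavity) * P(catch|Cavity) * P(Cavity), for each value of Cavity
unnormalized = {c: p_toothache[c] * p_catch[c] * prior[c] for c in prior}
alpha = 1 / sum(unnormalized.values())
posterior = {c: round(alpha * u, 3) for c, u in unnormalized.items()}
print(posterior)   # {'cavity': 0.871, 'no_cavity': 0.129}
```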
13.6 THE WUMPUS WORLD REVISITED
We can combine the ideas in this chapter to solve probabilistic reasoning problems in the wumpus world. (See Chapter 7 for a complete description of the wumpus world.) Uncertainty arises in the wumpus world because the agent's sensors give only partial information about the world. For example, Figure 13.5 shows a situation in which each of the three reachable squares, [1,3], [2,2], and [3,1], might contain a pit. Pure logical inference can conclude nothing about which square is most likely to be safe, so a logical agent might have to choose
randomly. We will see that a probabilistic agent can do much better than
the logical agent.
Our aim is to calculate the probability that each of the three squares contains a pit. (For this example we ignore the wumpus and the gold.) The relevant properties of the wumpus world are that (1) a pit causes breezes in all neighboring squares, and (2) each square other than [1, 1] contains a pit with probability 0.2. The first step is to identify the set of random variables we need:
• As in the propositional logic case, we want one Boolean variable Pij for each square, which is true iff square [i,j] actually contains a pit.
Figure 13.5   (a) After finding a breeze in both [1,2] and [2,1], the agent is stuck: there is no safe place to explore. (b) Division of the squares into Known, Frontier, and Other, for a query about [1,3].
• We also have Boolean variables Bij that are true iff square [i,j] is breezy; we include these variables only for the observed squares, in this case [1,1], [1,2], and [2,1].
P(P1,1, . . . , P4,4, B1,1, B1,2, B2,1). Ap
plying the product rule, we have
P(P1,1, . . . , P4,4, B1,1, B1,2, B2,1) = P(B1,1, B1,2, B2,1 I P1,1, . . . ) P4,4)P(P1,1, . . . ) P4,4) . nus decomposition makes it easy to see what the joint probability values should be. The first term is the conditional probability distribution of a breeze configuration, given a pit configuration; its values are 1 if the breezes are adjacent to the pits and
0 otherwise.
The
second term is the prior probability of a pit configuration. Each square contains a pit with probability 0.2, independently of the other squares; hence,
4,4 P(P1,1, . . . ) P4,4) = II P(�,j) . i,j 1,1
(13.20)
=
For a particular configuration with exactly n pits,
P(H,b . . . , P4,4) = 0.2n x 0.81 6-n .
In the situation in Figure 13.5(a), the evidence consists of the observed breeze (or its absence) in each square that is visited, combined with the fact that each such square contains no pit. We abbreviate these facts as b = ¬b1,1 ∧ b1,2 ∧ b2,1 and known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1. We are interested in answering queries such as P(P1,3 | known, b): how likely is it that [1,3] contains a pit, given the observations so far?
To answer this query, we can follow the standard approach of Equation (13.9), namely, summing over entries from the full joint distribution. Let Unknown be the set of Pi,j variables for squares other than the Known squares and the query square [1,3]. Then, by Equation (13.9), we have
P(P1,3 | known, b) = α Σ_unknown P(P1,3, unknown, known, b) .
The full joint probabilities have already been specified, so we are done; that is, unless we care about computation. There are 12 unknown squares; hence the summation contains 2^12 = 4096 terms. In general, the summation grows exponentially with the number of squares.
Surely, one might ask, aren't the other squares irrelevant? How could [4,4] affect whether [1,3] has a pit? Indeed, this intuition is correct. Let Frontier be the pit variables (other than the query variable) that are adjacent to visited squares, in this case just [2,2] and [3,1]. Also, let Other be the pit variables for the other unknown squares; in this case, there are 10 other squares, as shown in Figure 13.5(b). The key insight is that the observed breezes are conditionally independent of the other variables, given the known, frontier, and query variables. To use the insight, we manipulate the query formula into a form in which the breezes are conditioned on all the other variables, and then we apply conditional independence:
P(P1,3 | known, b)
  = α Σ_unknown P(P1,3, known, b, unknown)      (by Equation (13.9))
  = α Σ_unknown P(b | P1,3, known, unknown) P(P1,3, known, unknown)      (by the product rule)
  = α Σ_frontier Σ_other P(b | known, P1,3, frontier, other) P(P1,3, known, frontier, other)
  = α Σ_frontier Σ_other P(b | known, P1,3, frontier) P(P1,3, known, frontier, other) ,
where the final step uses conditional independence: b is independent of Other given Known, P1,3, and Frontier. Now, the first term in this expression does not depend on the Other variables, so we can move the summation inward:
P(P1,3 | known, b) = α Σ_frontier P(b | known, P1,3, frontier) Σ_other P(P1,3, known, frontier, other) .
By independence, as in Equation (13.20), the prior term can be factored, and then the terms can be reordered:
P(P1,3 | known, b)
  = α Σ_frontier P(b | known, P1,3, frontier) Σ_other P(P1,3) P(known) P(frontier) P(other)
  = α P(known) P(P1,3) Σ_frontier P(b | known, P1,3, frontier) P(frontier) Σ_other P(other)
  = α′ P(P1,3) Σ_frontier P(b | known, P1,3, frontier) P(frontier) ,
Figure 13.6   Consistent models for the frontier variables P2,2 and P3,1, showing P(frontier) for each model: (a) three models with P1,3 = true, showing two or three pits, and (b) two models with P1,3 = false, showing one or two pits.
where the last step folds P(known) into the normalizing constant and uses the fact that Σ_other P(other) equals 1.
Now, there are just four terms in the summation over the frontier variables P2,2 and P3,1. The use of independence and conditional independence has completely eliminated the other squares from consideration.
Notice that the expression P(b | known, P1,3, frontier) is 1 when the frontier is consistent with the breeze observations, and 0 otherwise. Thus, for each value of P1,3, we sum over the logical models for the frontier variables that are consistent with the known facts. (Compare with the enumeration over models in Figure 7.5 on page 241.) The models and their associated prior probabilities, P(frontier), are shown in Figure 13.6. We have
P(P1,3 | known, b) = α′ ⟨0.2(0.04 + 0.16 + 0.16), 0.8(0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩ .
That is, [1,3] (and [3,1] by symmetry) contains a pit with roughly 31% probability. A similar calculation, which the reader might wish to perform, shows that [2,2] contains a pit with roughly 86% probability.
The wumpus agent should definitely avoid [2,2]!
Note that our logical agent from Chapter 7 did not know that [2,2] was worse than the other squares. Logic can tell us that it is unknown whether there is a pit in [2,2], but we need probability to tell us how likely it is.
What this section has shown is that even seemingly complicated problems can be formulated precisely in probability theory and solved with simple algorithms. To get efficient solutions, independence and conditional independence relationships can be used to simplify the summations required. These relationships often correspond to our natural understanding of how the problem should be decomposed. In the next chapter, we develop formal representations for such relationships as well as algorithms that operate on those representations to perform probabilistic inference efficiently.
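The whole frontier argument fits in a few lines of code. The sketch below (an illustration, not the book's agent) enumerates the four assignments to P2,2 and P3,1, keeps those consistent with the observed breezes, weights them by the 0.2/0.8 prior, and normalizes; it reproduces the 31%/69% split derived above.

```python
# Frontier summation for P(P1,3 | known, b) in the Figure 13.5 situation.
from itertools import product

def prior(*pits):
    """Independent prior: each pit variable is true with probability 0.2."""
    p = 1.0
    for pit in pits:
        p *= 0.2 if pit else 0.8
    return p

posterior = {}
for p13 in (True, False):
    total = 0.0
    for p22, p31 in product((True, False), repeat=2):
        breeze_12 = p13 or p22   # [1,2] is breezy iff one of its unknown neighbours has a pit
        breeze_21 = p22 or p31   # likewise for [2,1]
        # no-breeze at [1,1] is already guaranteed: its neighbours [1,2], [2,1] are known pit-free
        if breeze_12 and breeze_21:
            total += prior(p22, p31)
    posterior[p13] = prior(p13) * total

alpha = 1 / sum(posterior.values())
print({k: round(alpha * v, 2) for k, v in posterior.items()})   # {True: 0.31, False: 0.69}
```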
13.7 SUMMARY
This chapter has suggested probability theory as a suitable foundation for uncertain reasoning and provided a gentle introduction to its use.
• Uncertainty arises because of both laziness and ignorance. It is inescapable in complex, nondeterministic, or partially observable environments.
• Probabilities express the agent's inability to reach a definite decision regarding the truth of a sentence. Probabilities summarize the agent's beliefs relative to the evidence.
• Decision theory combines the agent's beliefs and desires, defining the best action as the one that maximizes expected utility.
• Basic probability statements include prior probabilities and conditional probabilities over simple and complex propositions.
• The axioms of probability constrain the possible assignments of probabilities to propositions. An agent that violates the axioms must behave irrationally in some cases.
• The full joint probability distribution specifies the probability of each complete assignment of values to random variables. It is usually too large to create or use in its explicit form, but when it is available it can be used to answer queries simply by adding up entries for the possible worlds corresponding to the query propositions.
• Absolute independence between subsets of random variables allows the full joint distribution to be factored into smaller joint distributions, greatly reducing its complexity. Absolute independence seldom occurs in practice.
• Bayes' rule allows unknown probabilities to be computed from known conditional probabilities, usually in the causal direction. Applying Bayes' rule with many pieces of evidence runs into the same scaling problems as does the full joint distribution.
• Conditional independence brought about by direct causal relationships in the domain might allow the full joint distribution to be factored into smaller, conditional distributions. The naive Bayes model assumes the conditional independence of all effect variables, given a single cause variable, and grows linearly with the number of effects.
• A wumpus-world agent can calculate probabilities for unobserved aspects of the world, thereby improving on the decisions of a purely logical agent. Conditional independence makes these calculations tractable.
BIBLIOGRAPHICAL AND HISTORICAL NOTES
Probability theory was invented as a way of analyzing games of chance. In about
850 A.D.
the Indian mathematician Mahaviracarya described how to arrange a set of bets that can't lose (what we now call a Dutch book). In Europe, the first significant systematic analyses were produced by Girolamo Cardano around
1565, although publication was posthumous (1663).
By that time, probability had been established as a mathematical discipline due to a series of
results established in a famous correspondence between Blaise Pascal and Pierre de Fermat in 1654. As with probability itself, the results were initially motivated by gambling problems (see Exercise 13.9). The first published textbook on probability was De Ratiociniis in Ludo
Aleae (Huygens, 1657). The "laziness and ignorance" view of uncertainty was described by John Arbuthnot in the preface of his translation of Huygens (Arbuthnot, 1692): "It is impossible for a Die, with such determin'd force and direction, not to fall on such determin'd side, only I don't know the force and direction which makes it fall on such determin'd side, and therefore I call it Chance, which is nothing but the want of art..." Laplace (1816) gave an exceptionally accurate and modern overview of probability; he was the first to use the example "take two urns, A and B, the first containing four white and two black balls, . . . " The Rev. Thomas Bayes (1702-1761) introduced the rule for reasoning about conditional probabilities that was named after him (Bayes, 1763). Bayes only considered the case of uniform priors; it was Laplace who independently developed the general case. Kolmogorov (1950, first published in German in 1933) presented probability theory in a rigorously axiomatic framework for the first time. Renyi (1970) later gave an axiomatic presentation that took conditional probability, rather than absolute probability, as primitive.
Pascal used probability in ways that required both the objective interpretation, as a property of the world based on symmetry or relative frequency, and the subjective interpretation, based on degree of belief: the former in his analyses of probabilities in games of chance, the latter in the famous "Pascal's wager" argument about the possible existence of God. However, Pascal did not clearly realize the distinction between these two interpretations. The distinction was first drawn clearly by James Bernoulli (1654-1705).
Leibniz introduced the "classical" notion of probability as a proportion of enumerated, equally probable cases, which was also used by Bernoulli, although it was brought to prominence by Laplace (1749-1827). This notion is ambiguous between the frequency interpretation and the subjective interpretation. The cases can be thought to be equally probable either because of a natural, physical symmetry between them, or simply because we do not have any knowledge that would lead us to consider one more probable than another. The use of this latter, subjective consideration to justify assigning equal probabilities is known as the
PRINCIPLE OF INDIFFERENCE
PRINCIPLE OF INSUFFICIENT REASON
principle of indifference. The principle is often attributed to Laplace, but he never isolated the principle explicitly. George Boole and John Venn both referred to it as the principle of insufficient reason; the modern name is due to Keynes (1921). The debate between objectivists and subjectivists became sharper in the 20th century. Kolmogorov (1963), R. A. Fisher (1922), and Richard von Mises (1928) were advocates of the relative frequency interpretation. Karl Popper's (1959, first published in German in 1934) "propensity" interpretation traces relative frequencies to an underlying physical symmetry. Frank Ramsey (1931), Bruno de Finetti (1937), R. T. Cox (1946), Leonard Savage (1954), Richard Jeffrey (1983), and E. T. Jaynes (2003) interpreted probabilities as the degrees of belief of specific individuals. Their analyses of degree of belief were closely tied to utili ties and to behavior-specifically, to the willingness to place bets. Rudolf Carnap, following Leibniz and Laplace, offered a different kind of subjective interpretation of probability not as any actual individual's degree of belief, but as the degree of belief that an idealized individual should have in a particular proposition a, given a particular body of evidence
e.
Bibliographical and Historical Notes
CONFIRMATION INDUCTIVE LOGIC
505
Carnap attempted to go further than Leibniz or Laplace by making this notion of degree of confirmation mathematically precise, as a logical relation between a and e. The study of this relation was intended to constitute a mathematical discipline called inductive logic, analogous to ordinary deductive logic (Carnap, 1948, 1950). Carnap was not able to extend his inductive logic much beyond the propositional case, and Putnam (1963) showed by adversarial arguments that some fundamental difficulties would prevent a strict extension to languages capable of expressing arithmetic.
Cox's theorem (1946) shows that any system for uncertain reasoning that meets his set of assumptions is equivalent to probability theory. This gave renewed confidence to those who already favored probability, but others were not convinced, pointing to the assumptions (primarily that belief must be represented by a single number, and thus the belief in ¬p must be a function of the belief in p). Halpern (1999) describes the assumptions and shows some gaps in Cox's original formulation. Horn (2003) shows how to patch up the difficulties. Jaynes (2003) has a similar argument that is easier to read.
The question of reference classes is closely tied to the attempt to find an inductive logic. The approach of choosing the "most specific" reference class of sufficient size was formally proposed by Reichenbach (1949). Various attempts have been made, notably by Henry Kyburg (1977, 1981), to formulate more sophisticated policies in order to avoid some obvious fallacies that arise with Reichenbach's rule, but such approaches remain somewhat ad hoc. More recent work by Bacchus, Grove, Halpern, and Koller (1992) extends Carnap's methods to first-order theories, thereby avoiding many of the difficulties associated with the straightforward reference-class method. Kyburg and Teng (2006) contrast probabilistic inference with nonmonotonic logic.
Bayesian probabilistic reasoning has been used in AI since the 1960s, especially in medical diagnosis. It was used not only to make a diagnosis from available evidence, but also to select further questions and tests by using the theory of information value (Section 16.6) when available evidence was inconclusive (Gorry, 1968; Gorry et al., 1973). One system outperformed human experts in the diagnosis of acute abdominal illnesses (de Dombal et al., 1974). Lucas et al. (2004) gives an overview. These early Bayesian systems suffered from a number of problems, however. Because they lacked any theoretical model of the conditions they were diagnosing, they were vulnerable to unrepresentative data occurring in situations for which only a small sample was available (de Dombal et al., 1981). Even more fundamentally, because they lacked a concise formalism (such as the one to be described in Chapter 14) for representing and using conditional independence information, they depended on the acquisition, storage, and processing of enormous tables of probabilistic data. Because of these difficulties, probabilistic methods for coping with uncertainty fell out of favor in AI from the 1970s to the mid-1980s. Developments since the late 1980s are described in the next chapter.
The naive Bayes model for joint distributions has been studied extensively in the pattern recognition literature since the 1950s (Duda and Hart, 1973). It has also been used, often unwittingly, in information retrieval, beginning with the work of Maron (1961).
The probabilistic foundations of this technique, described further in Exercise 13.22, were elucidated by Robertson and Sparck Jones (1976). Domingos and Pazzani (1997) provide an explanation
for the surprising success of naive Bayesian reasoning even in domains where the independence assumptions are clearly violated.
There are many good introductory textbooks on probability theory, including those by Bertsekas and Tsitsiklis (2008) and Grinstead and Snell (1997). DeGroot and Schervish (2001) offer a combined introduction to probability and statistics from a Bayesian standpoint. Richard Hamming's (1991) textbook gives a mathematically sophisticated introduction to probability theory from the standpoint of a propensity interpretation based on physical symmetry. Hacking (1975) and Hald (1990) cover the early history of the concept of probability. Bernstein (1996) gives an entertaining popular account of the story of risk.
EXERCISES
13.1  Show from first principles that P(a | b ∧ a) = 1.
13.2  Using the axioms of probability, prove that any probability distribution on a discrete random variable must sum to 1.
13.3  For each of the following statements, either prove it is true or give a counterexample.
a. If P(a | b, c) = P(b | a, c), then P(a | c) = P(b | c)
b. If P(a | b, c) = P(a), then P(b | c) = P(b)
c. If P(a | b) = P(a), then P(a | b, c) = P(a | c)
13.4  Would it be rational for an agent to hold the three beliefs P(A) = 0.4, P(B) = 0.3, and P(A ∨ B) = 0.5? If so, what range of probabilities would be rational for the agent to hold for A ∧ B? Make up a table like the one in Figure 13.2, and show how it supports your argument about rationality. Then draw another version of the table where P(A ∨ B) = 0.7. Explain why it is rational to have this probability, even though the table shows one case that is a loss and three that just break even. (Hint: What is Agent 1 committed to about the probability of each of the four cases, especially the case that is a loss?)
13.5
ATOMIC EVENT
This question deals with the properties of possible worlds, defined on page 488 as assignments to all random variables. We will work with propositions that correspond to exactly one possible world because they pin down the assignments of all the variables. In probability theory, such propositions are called atomic events. For example, with Boolean variables X1, X2, X3, the proposition X1 ∧ ¬X2 ∧ ¬X3 fixes the assignment of the variables; in the language of propositional logic, we would say it has exactly one model.
a. Prove, for the case of n Boolean variables, that any two distinct atomic events are mutually exclusive; that is, their conjunction is equivalent to false.
b. Prove that the disjunction of all possible atomic events is logically equivalent to true.
c. Prove that any proposition is logically equivalent to the disjunction of the atomic events that entail its truth.
13.6  Prove Equation (13.4) from Equations (13.1) and (13.2).
13.7  Consider the set of all possible five-card poker hands dealt fairly from a standard deck of fifty-two cards.
a. How many atomic events are there in the joint probability distribution (i.e., how many five-card hands are there)?
b. What is the probability of each atomic event?
c. What is the probability of being dealt a royal straight flush? Four of a kind?
13.8  Given the full joint distribution shown in Figure 13.3, calculate the following:
a. P(toothache) .
b. P(Cavity) .
c. P(Toothache | cavity) .
d. P(Cavity | toothache ∨ catch) .
13.9  In his letter of August 24, 1654, Pascal was trying to show how a pot of money should be allocated when a gambling game must end prematurely. Imagine a game where each turn consists of the roll of a die, player E gets a point when the die is even, and player O gets a point when the die is odd. The first player to get 7 points wins the pot. Suppose the game is interrupted with E leading 4-2. How should the money be fairly split in this case? What is the general formula? (Fermat and Pascal made several errors before solving the problem, but you should be able to get it right the first time.)
13.10  Deciding to put probability theory to good use, we encounter a slot machine with three independent wheels, each producing one of the four symbols BAR, BELL, LEMON, or CHERRY with equal probability. The slot machine has the following payout scheme for a bet of 1 coin (where "?" denotes that we don't care what comes up for that wheel):
BAR/BAR/BAR pays 20 coins
BELL/BELL/BELL pays 15 coins
LEMON/LEMON/LEMON pays 5 coins
CHERRY/CHERRY/CHERRY pays 3 coins
CHERRY/CHERRY/? pays 2 coins
CHERRY/?/? pays 1 coin
a. Compute the expected "payback" percentage of the machine. In other words, for each coin played, what is the expected coin return?
b. Compute the probability that playing the slot machine once will result in a win.
c. Estimate the mean and median number of plays you can expect to make until you go broke, if you start with 10 coins. You can run a simulation to estimate this, rather than trying to compute an exact answer.
13.11  We wish to transmit an n-bit message to a receiving agent. The bits in the message are independently corrupted (flipped) during transmission with probability ε each. With an extra parity bit sent along with the original information, a message can be corrected by the receiver if at most one bit in the entire message (including the parity bit) has been corrupted. Suppose we want to ensure that the correct message is received with probability at least 1 − δ. What is the maximum feasible value of n? Calculate this value for the case ε = 0.001, δ = 0.01.
13.12  Show that the three forms of independence in Equation (13.11) are equivalent.
13.13  Consider two medical tests, A and B, for a virus. Test A is 95% effective at recognizing the virus when it is present, but has a 10% false positive rate (indicating that the virus is present, when it is not). Test B is 90% effective at recognizing the virus, but has a 5% false positive rate. The two tests use independent methods of identifying the virus. The virus is carried by 1% of all people. Say that a person is tested for the virus using only one of the tests, and that test comes back positive for carrying the virus. Which test returning positive is more indicative of someone really carrying the virus? Justify your answer mathematically.
13.14  Suppose you are given a coin that lands heads with probability x and tails with probability 1 − x. Are the outcomes of successive flips of the coin independent of each other given that you know the value of x? Are the outcomes of successive flips of the coin independent of each other if you do not know the value of x? Justify your answer.
13.15
After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease and that the test is 99% accurate (i.e., the probability of testing positive when you do have the disease is 0.99, as is the probability of testing negative when you don't have the disease). The good news is that this is a rare disease, striking only 1 in 10,000 people of your age. Why is it good news that the disease is rare? What are the chances that you actually have the disease?
13.16  It is quite often useful to consider the effect of some specific propositions in the context of some general background evidence that remains fixed, rather than in the complete absence of information. The following questions ask you to prove more general versions of the product rule and Bayes' rule, with respect to some background evidence e:
a. Prove the conditionalized version of the general product rule:
P(X, Y | e) = P(X | Y, e)P(Y | e) .
b. Prove the conditionalized version of Bayes' rule in Equation (13.13).
13.17  Show that the statement of conditional independence
P(X, Y | Z) = P(X | Z)P(Y | Z)
is equivalent to each of the statements
P(X | Y, Z) = P(X | Z)   and   P(Y | X, Z) = P(Y | Z) .
13.18
these coins are normal, with heads on one side and tails on the other, whereas one coin is a fake, with heads on both sides.
a.
Suppose you reach into the bag, pick out a coin at random, flip it, and get a head. What is the (conditional) probability that the coin you chose is the fake coin?
509
Exercises
b.
Suppose you continue flipping the coin for a total of
k times after picking it and see k
heads. Now what is the conditional probability that you picked the fake coin?
c.
Suppose you wanted to decide whether the chosen coin was fake by flipping it The decision procedure returns
normal. 13.19
k times.
fake if all k flips come up heads; otherwise it returns
What is the (unconditional) probability that this procedure makes an error?
In this exercise, you will complete the normalization calculation for the meningitis
P( 8 l •m), and use it to calculate Ulmormalized P(m I 8) and P(•m I 8) (i.e., ignoring the P( 8) term in the Bayes' rule expression,
example. First, make up a suitable value for values for
Equation ( 13.14)). Now normalize these values so that they add to 1.
X, Y, Z be Boolean random variables. Label the eight entries in the joint dis tribution P(X,Y, Z) as through h. Express the statement that X and Y are conditionally independent given Z, as a set of equations relating through h. How many nonredundant 13.20
13.20  Let X, Y, Z be Boolean random variables. Label the eight entries in the joint distribution P(X, Y, Z) as a through h. Express the statement that X and Y are conditionally independent given Z, as a set of equations relating a through h. How many nonredundant equations are there?
13.21  (Adapted from Pearl (1988).) Suppose you are a witness to a nighttime hit-and-run accident involving a taxi in Athens. All taxis in Athens are blue or green. You swear, under oath, that the taxi was blue. Extensive testing shows that, under the dim lighting conditions, discrimination between blue and green is 75% reliable.
a. Is it possible to calculate the most likely color for the taxi? (Hint: distinguish carefully between the proposition that the taxi is blue and the proposition that it appears blue.)
b. What if you know that 9 out of 10 Athenian taxis are green?
13.22
Text categorization is the task of assigning a given document to one of a fixed set of
categories on the basis of the text it contains. Naive Bayes models are often used for this task. In these models, the query variable is the document category, and the "effect" variables are the presence or absence of each word in the language; the assumption is that words occur independently in documents, with frequencies determined by the document category.
a.
Explain precisely how such a model can be constructed, given as "training data" a set of documents that have been assigned to categories.
b. Explain precisely how to categorize a new document.
c. Is the conditional independence assumption reasonable? Discuss.
13.23
In our analysis of the wumpus world, we used the fact that each square contains a pit with probability 0.2, independently of the contents of the other squares. Suppose instead that exactly N/5 pits are scattered at random among the N squares other than [1,1]. Are the variables Pi,j and Pk,l still independent? What is the joint distribution P(P1,1, . . . , P4,4) now? Redo the calculation for the probabilities of pits in [1,3] and [2,2].
13.24  Redo the probability calculation for pits in [1,3] and [2,2], assuming that each square contains a pit with probability 0.01, independent of the other squares. What can you say about the relative performance of a logical versus a probabilistic agent in this case?
13.25  Implement a hybrid probabilistic agent for the wumpus world, based on the hybrid agent in Figure 7.20 and the probabilistic inference procedure outlined in this chapter.
14
PROBABILISTIC REASONING
In which we explain how to build network models to reason under uncertainty according to the laws of probability theory.
Chapter 13 introduced the basic elements of probability theory and noted the importance of independence and conditional independence relationships in simplifying probabilistic representations of the world. This chapter introduces a systematic way to represent such relationships explicitly in the form of Bayesian networks. We define the syntax and semantics of these networks and show how they can be used to capture uncertain knowledge in a natural and efficient way. We then show how probabilistic inference, although computationally intractable in the worst case, can be done efficiently in many practical situations. We also describe a variety of approximate inference algorithms that are often applicable when exact inference is infeasible. We explore ways in which probability theory can be applied to worlds with objects and relations, that is, to first-order, as opposed to propositional, representations. Finally, we survey alternative approaches to uncertain reasoning.
14.1 REPRESENTING KNOWLEDGE IN AN UNCERTAIN DOMAIN
In Chapter 13, we saw that the full joint probability distribution can answer any question about the domain, but can become intractably large as the number of variables grows. Furthermore, specifying probabilities for possible worlds one by one is unnatural and tedious.
BAYESIAN NETWORK
We also saw that independence and conditional independence relationships among variables can greatly reduce the number of probabilities that need to be specified in order to define the full joint distribution. This section introduces a data structure called a Bayesian network¹ to represent the dependencies among variables. Bayesian networks can represent essentially any full joint probability distribution and in many cases can do so very concisely.
¹ This is the most common name, but there are many synonyms, including belief network, probabilistic network, causal network, and knowledge map. In statistics, the term graphical model refers to a somewhat broader class that includes Bayesian networks. An extension of Bayesian networks called a decision network or influence diagram is covered in Chapter 16.
A Bayesian network is a directed graph in which each node is annotated with quantitative probability information. The full specification is as follows:
1. Each node corresponds to a random variable, which may be discrete or continuous.
2. A set of directed links or arrows connects pairs of nodes. If there is an arrow from node X to node Y, X is said to be a parent of Y. The graph has no directed cycles (and hence is a directed acyclic graph, or DAG).
3. Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node.
The topology of the network (the set of nodes and links) specifies the conditional independence relationships that hold in the domain, in a way that will be made precise shortly. The intuitive meaning of an arrow is typically that X has a direct influence on Y, which suggests that causes should be parents of effects. It is usually easy for a domain expert to decide what direct influences exist in the domain, much easier, in fact, than actually specifying the probabilities themselves. Once the topology of the Bayesian network is laid out, we need only specify a conditional probability distribution for each variable, given its parents. We will see that the combination of the topology and the conditional distributions suffices to specify (implicitly) the full joint distribution for all the variables.
Cavity, Catch, and Weather.
Toothache,
Weather is independent of the other vari Toothache and Catch conditionally independent,
We argued that
ables; furthermore, we argued that
are
Cavity. These relationships are represented by the Bayesian network structure shown Fonnally, the conditional independence of Toothache and Catch, given Cavity, is indicated by the absence of a link between Toothache and Catch. Intuitively,the network represents the fact that Cavity is a direct cause of Toothache and Catch, whereas no direct causal relationship exists between Toothache and Catch.
given
in Figure 14.1.
Now consider the following example, which is just a little more complex. You have a new burglar alarm installed at home. It is fairly reliable at detecting a burglary, but also responds on occasion to minor earthquakes. (This example is due to Judea Pearl, a resident of Los Angeles, hence the acute interest in earthquakes.) You also have two neighbors, John and Mary, who have promised to call you at work when they hear the alarm. John nearly always calls when he hears the alarm, but sometimes confuses the telephone ringing with
Figure 14.1   A simple Bayesian network in which Weather is independent of the other three variables and Toothache and Catch are conditionally independent, given Cavity.
Figure 14.2   A typical Bayesian network, showing both the topology and the conditional probability tables (CPTs). In the CPTs, the letters B, E, A, J, and M stand for Burglary, Earthquake, Alarm, JohnCalls, and MaryCalls, respectively.
the alarm and calls then, too. Mary, on the other hand, likes rather loud music and often misses the alarm altogether. Given the evidence of who has or has not called, we would like to estimate the probability of a burglary.
A Bayesian network for this domain appears in Figure 14.2. The network structure shows that burglary and earthquakes directly affect the probability of the alarm's going off, but whether John and Mary call depends only on the alarm. The network thus represents our assumptions that they do not perceive burglaries directly, they do not notice minor earthquakes, and they do not confer before calling. The conditional distributions in Figure 14.2 are shown as a
CONDITIONAL PROBABILITYTABLE
table, or CPT. (This form of table can be used
conditional probability
for discrete variables; other representations,
including those suitable for continuous variables, are described in Section 14.2.) Each row CONDITIONING CASE
in a CPT contains the conditional probability of each node value for a
A conditioning
conditioning case.
case is just a possible combination of values for the parent nodes-a minia
ture possible world, if you like. Each row must sum to 1, because the entries represent an exhaustive set of cases for the variable. For Boolean variables, once you know that the prob
- p, so we often omit the second number, as in Figure 14.2. In general, a table for a Boolean variable with k Boolean parents ability of a true value is contains 2
k
p, the probability of false must be 1
independently specifiable probabilities.
A node with no parents has only one row,
representing the prior probabilities of each possible value of the variable. Notice that the network does not have nodes corresponding to Mary's currently listening to loud music or to the telephone ringing and confusing John. These factors in the uncertainty associated with the links from
are
summarized
Alarm to JohnCalls and MaryCalls. This
shows both laziness and ignorance in operation: it would be a lot of work to find out why those factors would be more or less likely in any particular case,and we have no reasonable way to obtain the relevant information anyway. The probabilities actually summarize a
potentially
Section
14.2.
The Semantics of Bayesian Networks
infinite
513
set of circwnstances in which the alarm might fail
to go off
(high humidity, power
failure, dead battery, cut wires, a dead mouse stuck inside the bell, etc.) or John or Mary might fail
to call and report it (out to lunch, on vacation, temporarily deaf, passing helicopter,
etc.). In this way, a small agent can cope with a very large world, at least approximately. The degree of approximation can be improved if we introduce additional relevant information.
14.2 THE SEMANTICS OF BAYESIAN NETWORKS
The previous section described what a network is, but not what it means. There are two ways in which one can understand the semantics of Bayesian networks. The first is to see the network as a representation of the joint probability distribution. The second is to view it as an encoding of a collection of conditional independence statements. The two views are equivalent, but the first turns out to be helpful in understanding how to construct networks, whereas the second is helpful in designing inference procedures.
Representing the full joint distribution
Viewed as a piece of "syntax," a Bayesian network is a directed acyclic graph with some numeric parameters attached to each node. One way to define what the network means-its semantics-is to define the way in which it represents a specific joint distribution over all the variables. To do this, we first need
to retract (temporarily)
what we said earlier about the pa
rameters associated with each node. We said that those parameters con·espond
to conditional we assign semantics to
P(Xi I Parents(Xi)); this is a true statement, but until the network as a whole, we should think of them just as numbers B(Xi I Parents(Xi)). probabilities
A generic entry in the joint distribution is the probability of a conjunction of particular
assignments to each variable, such as
P(X1 = x1 1\ ... 1\ Xn = Xn)·
We use the notation
P(x1, . . . , Xn) as a n abbreviation for this. The value of this entry is given by the formula n
where
parents(Xi)
(14.1)
i =l
denotes the values o f
Parents(Xi) that appear i n x1, . . . ,xn.
Thus,
each entry in the joint distribution is represented by the product of the appropriate elements of the conditional probability tables (CPTs) in the Bayesian network. From this definition, it is easy
exactly the conditional probabilities (see Exercise
14.2).
to prove
that the parameters
B(Xi I Parents(Xi)) are
P(Xi I Parents(Xi)) implied by the joint distribution
Hence, we can rewrite Equation
(14.1)
as
P(xt, . . . ,xn) = IT P(x. l parents(Xi)) . i= 1
(14.2)
In other words, the tables we have been calling conditional probability tables really ditional probability tables according
to the
semantics defined in Equation
are con
(14.1).
To illustrate this, we can calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occuned, and both John and Mary call. We multiply entries
514
Chapter
14.
Probabilistic Reasoning
from the joint distribution (using single-letter names for the variables):
P(j I a)P(m I a)P(a l •b 1\ •e)P(·b)P(•e) 0.90 0. 70 X 0.001 X 0.999 X 0.998 = 0.000628 .
P(j, m, a, -ob, •e)
X
Section 13.3 explained that the full joint distribution can be used to answer any query about the domain. If a Bayesian network is a representation of the joint distribution, then it too can
be used to answer any query, by summing all the relevant joint entries. Section 14.4 explains
how to do this, but a1so describes methods that are much more efficient.
A method for constructing Bayesian networks Equation ( 14.2) defines what a given Bayesian network means. The next step is to explain how to
construct
a Bayesian network in such a way that the resulting joint distribution is a
good representation of a given domain. We will now show that Equation ( 14.2) implies certain conditional independence relationships that can be used to guide the knowledge engineer in constructing the topology of the network. First, we rewrite the entries in the joint distribution in terms of conditional probability, using the product rule (see page 486):
P(x1, ..., xn) = P(xn | xn−1, ..., x1) P(xn−1, ..., x1) .

Then we repeat the process, reducing each conjunctive probability to a conditional probability and a smaller conjunction. We end up with one big product:

P(x1, ..., xn) = P(xn | xn−1, ..., x1) P(xn−1 | xn−2, ..., x1) ··· P(x2 | x1) P(x1)
             = ∏_{i=1}^{n} P(xi | xi−1, ..., x1) .

CHAIN RULE
This identity is called the chain rule. It holds for any set of random variables. Comparing it with Equation (14.2), we see that the specification of the joint distribution is equivalent to the general assertion that, for every variable Xi in the network,

P(Xi | Xi−1, ..., X1) = P(Xi | Parents(Xi)) ,    (14.3)

provided that Parents(Xi) ⊆ {Xi−1, ..., X1}. This last condition is satisfied by numbering the nodes in a way that is consistent with the partial order implicit in the graph structure.
What Equation (14.3) says is that the Bayesian network is a correct representation of the domain only if each node is conditionally independent of its other predecessors in the node ordering, given its parents. We can satisfy this condition with this methodology:
1. Nodes: First determine the set of variables that are required to model the domain. Now order them, {X1, ..., Xn}. Any order will work, but the resulting network will be more compact if the variables are ordered such that causes precede effects.
2. Links: For i = 1 to n do:
   • Choose, from X1, ..., Xi−1, a minimal set of parents for Xi, such that Equation (14.3) is satisfied.
   • For each parent insert a link from the parent to Xi.
   • CPTs: Write down the conditional probability table, P(Xi | Parents(Xi)).
Intuitively, the parents of node Xi should contain all those nodes in X1, ..., Xi−1 that directly influence Xi. For example, suppose we have completed the network in Figure 14.2 except for the choice of parents for MaryCalls. MaryCalls is certainly influenced by whether there is a Burglary or an Earthquake, but not directly influenced. Intuitively, our knowledge of the domain tells us that these events influence Mary's calling behavior only through their effect on the alarm. Also, given the state of the alarm, whether John calls has no influence on Mary's calling. Formally speaking, we believe that the following conditional independence statement holds:

P(MaryCalls | JohnCalls, Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm) .

Thus, Alarm will be the only parent node for MaryCalls.
Because each node is connected only to earlier nodes, this construction method guarantees that the network is acyclic. Another important property of Bayesian networks is that they contain no redundant probability values. If there is no redundancy, then there is no chance for inconsistency: it is impossible for the knowledge engineer or domain expert to create a Bayesian network that violates the axioms of probability.
Compactness and node ordering
LOCALLY STRUCTURED   SPARSE
As well as being a complete and nonredundant representation of the domain, a Bayesian network can often be far more compact than the full joint distribution. This property is what makes it feasible to handle domains with many variables. The compactness of Bayesian networks is an example of a general property of locally structured (also called sparse) systems. In a locally structured system, each subcomponent interacts directly with only a bounded number of other components, regardless of the total number of components. Local structure is usually associated with linear rather than exponential growth in complexity. In the case of Bayesian networks, it is reasonable to suppose that in most domains each random variable is directly influenced by at most k others, for some constant k. If we assume n Boolean variables for simplicity, then the amount of information needed to specify each conditional probability table will be at most 2^k numbers, and the complete network can be specified by n·2^k numbers. In contrast, the joint distribution contains 2^n numbers. To make this concrete, suppose we have n = 30 nodes, each with five parents (k = 5). Then the Bayesian network requires 960 numbers, but the full joint distribution requires over a billion.
There are domains in which each variable can be influenced directly by all the others, so that the network is fully connected. Then specifying the conditional probability tables requires the same amount of information as specifying the joint distribution. In some domains, there will be slight dependencies that should strictly be included by adding a new link. But if these dependencies are tenuous, then it may not be worth the additional complexity in the network for the small gain in accuracy. For example, one might object to our burglary network on the grounds that if there is an earthquake, then John and Mary would not call even if they heard the alarm, because they assume that the earthquake is the cause. Whether to add the link from Earthquake to JohnCalls and MaryCalls (and thus enlarge the tables) depends on comparing the importance of getting more accurate probabilities with the cost of specifying the extra information.
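As a quick check of the n·2^k versus 2^n arithmetic above (a throwaway sketch, not part of the text):

n, k = 30, 5
print(n * 2**k)   # 960 numbers suffice for the Bayesian network
print(2**n)       # 1,073,741,824 numbers for the full joint distribution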
Figure 14.3 Network structure depends on order of introduction. In each network, we have introduced nodes in top-to-bottom order.
Even in a locally structured domain, we will get a compact Bayesian network only if we choose the node ordering well. What happens if we happen to choose the wrong order? Consider the burglary example again. Suppose we decide to add the nodes in the order MaryCalls, JohnCalls, Alarm, Burglary, Earthquake. We then get the somewhat more complicated network shown in Figure 14.3(a). The process goes as follows:
• Adding MaryCalls: No parents.
• Adding JohnCalls: If Mary calls, that probably means the alarm has gone off, which of course would make it more likely that John calls. Therefore, JohnCalls needs MaryCalls as a parent.
• Adding Alarm: Clearly, if both call, it is more likely that the alarm has gone off than if just one or neither calls, so we need both MaryCalls and JohnCalls as parents.
• Adding Burglary: If we know the alarm state, then the call from John or Mary might give us information about our phone ringing or Mary's music, but not about burglary:
  P(Burglary | Alarm, JohnCalls, MaryCalls) = P(Burglary | Alarm) .
  Hence we need just Alarm as parent.
• Adding Earthquake: If the alarm is on, it is more likely that there has been an earthquake. (The alarm is an earthquake detector of sorts.) But if we know that there has been a burglary, then that explains the alarm, and the probability of an earthquake would be only slightly above normal. Hence, we need both Alarm and Burglary as parents.
The resulting network has two more links than the original network in Figure 14.2 and requires three more probabilities to be specified. What's worse, some of the links represent tenuous relationships that require difficult and unnatural probability judgments, such as
assessing the probability of Earthquake, given Burglary and Alarm. This phenomenon is quite general and is related to the distinction between causal and diagnostic models introduced in Section 13.5.1 (see also Exercise 8.13). If we try to build a diagnostic model with links from symptoms to causes (as from MaryCalls to Alarm or Alarm to Burglary), we end up having to specify additional dependencies between otherwise independent causes (and often between separately occurring symptoms as well). If we stick to a causal model, we end up having to specify fewer numbers, and the numbers will often be easier to come up with. In the domain of medicine, for example, it has been shown by Tversky and Kahneman (1982) that expert physicians prefer to give probability judgments for causal rules rather than for diagnostic ones.
Figure 14.3(b) shows a very bad node ordering: MaryCalls, JohnCalls, Earthquake, Burglary, Alarm. This network requires 31 distinct probabilities to be specified, exactly the same number as the full joint distribution. It is important to realize, however, that any of the three networks can represent exactly the same joint distribution. The last two versions simply fail to represent all the conditional independence relationships and hence end up specifying a lot of unnecessary numbers instead.
14.2.2 Conditional independence relations in Bayesian networks
DESCENDANT
We have provided a "numerical" semantics for Bayesian networks in terms of the representation of the full joint distribution, as in Equation (14.2). Using this semantics to derive a method for constructing Bayesian networks, we were led to the consequence that a node is conditionally independent of its other predecessors, given its parents. It turns out that we can also go in the other direction. We can start from a "topological" semantics that specifies the conditional independence relationships encoded by the graph structure, and from this we can derive the "numerical" semantics. The topological semantics specifies that each variable is conditionally independent of its non-descendants, given its parents. For example, in Figure 14.2, JohnCalls is independent of Burglary, Earthquake, and MaryCalls given the value of Alarm. The definition is illustrated in Figure 14.4(a). From these conditional independence assertions and the interpretation of the network parameters θ(Xi | Parents(Xi)) as specifications of conditional probabilities P(Xi | Parents(Xi)), the full joint distribution given in Equation (14.2) can be reconstructed. In this sense, the "numerical" semantics and the "topological" semantics are equivalent.
MARKOV BLANKET
Another important independence property is implied by the topological semantics: a node is conditionally independent of all other nodes in the network, given its parents, children, and children's parents, that is, given its Markov blanket. (Exercise 14.7 asks you to prove this.) For example, Burglary is independent of JohnCalls and MaryCalls, given Alarm and Earthquake. This property is illustrated in Figure 14.4(b).
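The Markov blanket is easy to read off from the graph. Here is a small sketch (a hypothetical helper, not from the book) that computes it from a dictionary mapping each node to its parents; for Burglary in the burglary network it returns Alarm and Earthquake, as stated above.

def markov_blanket(node, parents):
    """Parents, children, and children's other parents of the given node."""
    children = [x for x, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])      # the children's other parents
    blanket.discard(node)
    return blanket

parents = {'Burglary': [], 'Earthquake': [],
           'Alarm': ['Burglary', 'Earthquake'],
           'JohnCalls': ['Alarm'], 'MaryCalls': ['Alarm']}
print(markov_blanket('Burglary', parents))   # {'Alarm', 'Earthquake'}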
2 There is also a general topological criterion called d-separation for deciding whether a set of nodes X is conditionally independent of another set Y, given a third set Z. The criterion is rather complicated and is not needed for deriving the algorithms in this chapter, so we omit it. Details may be found in Pearl (1988) or Darwiche (2009). Shachter (1998) gives a more intuitive method of ascertaining d-separation.
Figure 14.4 (a) A node X is conditionally independent of its non-descendants (e.g., the Zij's) given its parents (the Ui's shown in the gray area). (b) A node X is conditionally independent of all other nodes in the network given its Markov blanket (the gray area).
14.3 EFFICIENT REPRESENTATION OF CONDITIONAL DISTRIBUTIONS
Even if the maximum number of parents k is smallish, filling in the CPT for a node requires up to O(2^k) numbers and perhaps a great deal of experience with all the possible conditioning cases. In fact, this is a worst-case scenario in which the relationship between the parents and
CANONICAL DISTRIBUTION
DETERMINISTIC NODES
NOISY-OR
the child is completely arbitrary. Usually, such relationships are describable by a canonical distribution that fits some standard pattern. In such cases, the complete table can be specified by naming the pattern and perhaps supplying a few parameters, much easier than supplying an exponential number of parameters.
The simplest example is provided by deterministic nodes. A deterministic node has its value specified exactly by the values of its parents, with no uncertainty. The relationship can be a logical one: for example, the relationship between the parent nodes Canadian, US, Mexican and the child node NorthAmerican is simply that the child is the disjunction of the parents. The relationship can also be numerical: for example, if the parent nodes are the prices of a particular model of car at several dealers and the child node is the price that a bargain hunter ends up paying, then the child node is the minimum of the parent values; or if the parent nodes are a lake's inflows (rivers, runoff, precipitation) and outflows (rivers, evaporation, seepage) and the child is the change in the water level of the lake, then the value of the child is the sum of the inflow parents minus the sum of the outflow parents.
Uncertain relationships can often be characterized by so-called noisy logical relationships. The standard example is the noisy-OR relation, which is a generalization of the logical OR. In propositional logic, we might say that Fever is true if and only if Cold, Flu, or Malaria is true. The noisy-OR model allows for uncertainty about the ability of each parent to cause the child to be true: the causal relationship between parent and child may be
inhibited, and so a patient could have a cold, but not exhibit a fever. The model makes two assumptions. First, it assumes that all the possible causes are listed. (If some are missing,
LEAK NODE
we can always add a so-called leak node that covers "miscellaneous causes.") Second, it assumes that inhibition of each parent is independent of inhibition of any other parents: for example, whatever inhibits Malaria from causing a fever is independent of whatever inhibits Flu from causing a fever. Given these assumptions, Fever is false if and only if all its true parents are inhibited, and the probability of this is the product of the inhibition probabilities q_j for each parent. Let us suppose these individual inhibition probabilities are as follows:

q_cold = P(¬fever | cold, ¬flu, ¬malaria) = 0.6 ,
q_flu = P(¬fever | ¬cold, flu, ¬malaria) = 0.2 ,
q_malaria = P(¬fever | ¬cold, ¬flu, malaria) = 0.1 .
Then, from this information and the noisy-OR assumptions, the entire CPT can be built. The general rule is that

P(xi | parents(Xi)) = 1 − ∏_{j : Xj = true} qj ,
where the product is taken over the parents that are set to true for that row of the CPT. The following table illustrates this calculation:
Cold   Flu   Malaria   P(Fever)   P(¬Fever)
F      F     F         0.0        1.0
F      F     T         0.9        0.1
F      T     F         0.8        0.2
F      T     T         0.98       0.02 = 0.2 × 0.1
T      F     F         0.4        0.6
T      F     T         0.94       0.06 = 0.6 × 0.1
T      T     F         0.88       0.12 = 0.6 × 0.2
T      T     T         0.988      0.012 = 0.6 × 0.2 × 0.1

In general, noisy logical relationships in which a variable depends on k parents can be described using O(k) parameters instead of O(2^k) for the full conditional probability table.
This makes assessment and learning much easier. For example, the CPCS network (Pradhan et al., 1994) uses noisy-OR and noisy-MAX distributions to model relationships among diseases and symptoms in internal medicine. With 448 nodes and 906 links, it requires only 8,254 values instead of 133,931,430 for a network with full CPTs.
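The noisy-OR rule is simple to implement. The sketch below (a hypothetical helper, not the book's code) builds the full Fever CPT from the three inhibition probabilities given above and reproduces the table row by row.

from itertools import product

q = {'cold': 0.6, 'flu': 0.2, 'malaria': 0.1}   # inhibition probabilities from the text

def p_fever(assignment):
    """Noisy-OR: P(fever) = 1 - product of q_j over the parents that are true."""
    p_not_fever = 1.0
    for cause, value in assignment.items():
        if value:
            p_not_fever *= q[cause]
    return 1.0 - p_not_fever

for cold, flu, malaria in product([False, True], repeat=3):
    row = {'cold': cold, 'flu': flu, 'malaria': malaria}
    print(row, round(p_fever(row), 3))   # reproduces the table above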
Bayesian nets with continuous variables

Many real-world problems involve continuous quantities, such as height, mass, temperature, and money; in fact, much of statistics deals with random variables whose domains are continuous. By definition, continuous variables have an infinite number of possible values, so it is impossible to specify conditional probabilities explicitly for each value. One possible way
DISCRETIZATION
to handle continuous variables is to avoid them by using discretization, that is, dividing up the possible values into a fixed set of intervals. For example, temperatures could be divided into (<0°C), (0°C–100°C), and (>100°C). Discretization is sometimes an adequate solution, but often results in a considerable loss of accuracy and very large CPTs. The most common solution is to define standard families of probability density functions (see Appendix A)
PARAMETER
that are specified by a finite number of parameters. For example, a Gaussian (or normal) distribution N(μ, σ²)(x) has the mean μ and the variance σ² as parameters. Yet another
NONPARAMETRIC
solution, sometimes called a nonparametric representation, is to define the conditional distribution implicitly with a collection of instances, each containing specific values of the parent and child variables. We explore this approach further in Chapter 18.

Figure 14.5 A simple network with discrete variables (Subsidy and Buys) and continuous variables (Harvest and Cost).

HYBRID BAYESIAN NETWORK
A network with both discrete and continuous variables is called a hybrid Bayesian network. To specify a hybrid network, we have to specify two new kinds of distributions: the conditional distribution for a continuous variable given discrete or continuous parents; and the conditional distribution for a discrete variable given continuous parents. Consider the simple example in Figure 14.5, in which a customer buys some fruit depending on its cost, which depends in turn on the size of the harvest and whether the government's subsidy scheme is operating. The variable Cost is continuous and has continuous and discrete parents; the variable Buys is discrete and has a continuous parent.
For the Cost variable, we need to specify P(Cost | Harvest, Subsidy). The discrete parent is handled by enumeration, that is, by specifying both P(Cost | Harvest, subsidy) and P(Cost | Harvest, ¬subsidy). To handle Harvest, we specify how the distribution over the cost c depends on the continuous value h of Harvest. In other words, we specify the parameters of the cost distribution as a function of h. The most common choice is the linear
LINEAR GAUSSIAN
Gaussian distribution, in which the child has a Gaussian distribution whose mean μ varies linearly with the value of the parent and whose standard deviation σ is fixed. We need two distributions, one for subsidy and one for ¬subsidy, with different parameters:

P(c | h, subsidy) = N(a_t h + b_t, σ_t²)(c) = (1 / (σ_t √(2π))) exp(−½ ((c − (a_t h + b_t)) / σ_t)²)
P(c | h, ¬subsidy) = N(a_f h + b_f, σ_f²)(c) = (1 / (σ_f √(2π))) exp(−½ ((c − (a_f h + b_f)) / σ_f)²)

For this example, then, the conditional distribution for Cost is specified by naming the linear Gaussian distribution and providing the parameters a_t, b_t, σ_t, a_f, b_f, and σ_f.

Figure 14.6 The graphs in (a) and (b) show the probability distribution over Cost as a function of Harvest size, with Subsidy true and false, respectively. Graph (c) shows the distribution P(Cost | Harvest), obtained by summing over the two subsidy cases.

Figures 14.6(a) and (b) show these two relationships. Notice that in each case the slope is negative, because cost decreases as supply increases. (Of course, the assumption of linearity implies that the cost becomes negative at some point; the linear model is reasonable only if the harvest size is limited to a narrow range.) Figure 14.6(c) shows the distribution P(c | h), averaging over the two possible values of Subsidy and assuming that each has prior probability 0.5. This shows that even with very simple models, quite interesting distributions can be represented.
The linear Gaussian conditional distribution has some special properties. A network containing only continuous variables with linear Gaussian distributions has a joint distribution that is a multivariate Gaussian distribution (see Appendix A) over all the variables (Exercise 14.9). Furthermore, the posterior distribution given any evidence also has this property.3 When discrete variables are added as parents (not as children) of continuous variables, the
CONDITIONAL GAUSSIAN
network defines a conditional Gaussian, or CG, distribution: given any assignment to the discrete variables, the distribution over the continuous variables is a multivariate Gaussian.
3 It follows that inference in linear Gaussian networks takes only O(n³) time in the worst case, regardless of the network topology. In Section 14.4, we see that inference for networks of discrete variables is NP-hard.
Now we turn to the distributions for discrete variables with continuous parents. Consider, for example, the Buys node in Figure 14.5. It seems reasonable to assume that the customer will buy if the cost is low and will not buy if it is high and that the probability of buying varies smoothly in some intermediate region. In other words, the conditional distribution is like a "soft" threshold function. One way to make soft thresholds is to use the integral of the standard normal distribution:

Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt .

Then the probability of Buys given Cost might be

P(buys | Cost = c) = Φ((−c + μ) / σ) ,

which means that the cost threshold occurs around μ, the width of the threshold region is proportional to σ, and the probability of buying decreases as cost increases.
PROBIT DISTRIBUTION
This probit distribution (pronounced "pro-bit" and short for "probability unit") is illustrated in Figure 14.7(a). The form can be justified by proposing that the underlying decision process has a hard threshold, but that the precise location of the threshold is subject to random Gaussian noise.

Figure 14.7 (a) A normal (Gaussian) distribution for the cost threshold, centered on μ = 6.0 with standard deviation σ = 1.0. (b) Logit and probit distributions for the probability of buys given cost, for the parameters μ = 6.0 and σ = 1.0.

LOGIT DISTRIBUTION   LOGISTIC FUNCTION
An alternative to the probit model is the logit distribution (pronounced "low-jit"). It uses the logistic function 1/(1 + e^(−x)) to produce a soft threshold:

P(buys | Cost = c) = 1 / (1 + exp(−2(−c + μ)/σ)) .

This is illustrated in Figure 14.7(b). The two distributions look similar, but the logit actually has much longer "tails." The probit is often a better fit to real situations, but the logit is sometimes easier to deal with mathematically. It is used widely in neural networks (Chapter 20). Both probit and logit can be generalized to handle multiple continuous parents by taking a linear combination of the parent values.
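The hybrid distributions just described are also straightforward to code. The following sketch (ours; the parameter values a, b, sigma, mu are invented for illustration) evaluates a linear Gaussian density for P(Cost | Harvest, Subsidy) and the probit and logit models for P(buys | Cost), using only the standard library.

import math

def linear_gaussian(c, h, a, b, sigma):
    """Density N(a*h + b, sigma^2)(c): the mean varies linearly with the parent h."""
    mu = a * h + b
    return math.exp(-0.5 * ((c - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def probit_buys(c, mu=6.0, sigma=1.0):
    """P(buys | Cost = c) = Phi((-c + mu) / sigma), via the standard normal CDF."""
    z = (-c + mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def logit_buys(c, mu=6.0, sigma=1.0):
    """The logistic alternative: 1 / (1 + exp(-2*(-c + mu)/sigma))."""
    return 1.0 / (1.0 + math.exp(-2 * (-c + mu) / sigma))

# Illustrative (made-up) parameters: cost falls as harvest grows.
print(linear_gaussian(c=5.0, h=4.0, a=-0.5, b=8.0, sigma=1.0))
print(probit_buys(5.0), logit_buys(5.0))   # both well above 0.5 when cost is below mu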
14.4
EXACT INFERENCE IN BAYES IAN NETWORKS
TI1e basic task for any probabilistic EVENT
HIDDEN VARIABLE
inferenc.e system is to compute the posterior probability distribution for a set of query variables, given some observed event-that is, some assign ment of values to a set of evidence variables. To simplify the presentation, we will consider only one query variable at a time; the algorithms can easily be extended to queries with mul tiple variables. We will use the notation from Chapter 13: X denotes the query variable; E denotes the set of evidence variables E1, . . . , Em, and e is a particular observed event; Y will denotes the nonevidence, nonquery variables Y�, . . . , Yj (called the hidden variables). Thus, the complete set of variables is X = {X} U E U Y. A typical query asks for the posterior probability distribution P(X I e).
In the burglary network, we might observe the event in which JohnCalls = true and MaryCalls = true. We could then ask for, say, the probability that a burglary has occurred:
P(Burglary | JohnCalls = true, MaryCalls = true) = (0.284, 0.716) .
In this section we discuss exact algorithms for computing posterior probabilities and will consider the complexity of this task. It turns out that the general case is intractable, so Section 14.5 covers methods for approximate inference.
14.4.1
Inference by enumeration
Chapter 13 explained that any conditional probability can be computed by summing terms from the full joint distribution. More specifically, a query P(X | e) can be answered using Equation (13.9), which we repeat here for convenience:
P(X | e) = α P(X, e) = α Σ_y P(X, e, y) .
Now, a Bayesian network gives a complete representation of the full joint distribution. More specifically, Equation (14.2) on page 513 shows that the terms P(x, e, y) in the joint distribution can be written as products of conditional probabilities from the network. Therefore, a query can be answered using a Bayesian network by computing sums of products of conditional probabilities from the network. Consider the query P(Burglary | JohnCalls = true, MaryCalls = true). The hidden variables for this query are Earthquake and Alarm. From Equation (13.9), using initial letters for the variables to shorten the expressions, we have
P(B IJ,m) = aP(B,j,m) = a L L P(B,j, m,e,a,) . e
a
The semantics of Bayesian networks (Equation (14.2)) then gives us an expression in terms of CPT entries. For simplicity, we do this just for Burglary = true:
P(blj,m)
= a L L P(b)P(e)P(a l b, e)P(j l a)P(m la) . e
a
To compute this expression, we have to add four terms, each computed by multiplying five numbers. In the worst case, where we have to sum out almost all the variables, the complexity of the algorithm for a network with n Boolean variables is O(n 2^n).
An improvement can be obtained from the following simple observations: the P(b) term is a constant and can be moved outside the summations over a and e, and the P(e) term can be moved outside the summation over a. Hence, we have
P(b l j, m)
= a P(b) L P(e) L P(a I b, e)P(j I a)P(m I a) . e
a
(14.4)
This expression can be evaluated by looping through the variables in order, multiplying CPT entries as we go. For each summation, we also need to loop over the variable's possible
4 An expression such as Σ_e P(a, e) means to sum P(A = a, E = e) for all possible values of e. When E is Boolean, there is an ambiguity in that P(e) is used to mean both P(E = true) and P(E = e), but it should be clear from context which is intended; in particular, in the context of a sum the latter is intended.
524
14.
Chapter
values. The structure of this computation is shown in Figure
14.2, we obtain P(b I j, m) ·b yields a x 0.0014919; hence,
Figure
= a x 0.00059224.
Probabilistic Reasoning
14.8.
Using the numbers from
The corresponding computation for
P(B I j, m) = a (0.00059224, 0.0014919) ;:::: (0.284, 0.716) . TI1at is, the chance of a burglary, given calls from both neighbors, is about 28%.
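For readers who want to check these numbers, here is a compact sketch (not the book's ENUMERATION-ASK) that evaluates Equation (14.4) by brute-force summation over the hidden variables E and A, using the CPT values of Figure 14.2; it prints approximately (0.284, 0.716).

P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm = true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}                      # P(MaryCalls = true | Alarm)

def prob(p_true, value):
    return p_true if value else 1.0 - p_true

def query_burglary_given_calls():
    unnormalized = {}
    for b in (True, False):
        total = 0.0
        for e in (True, False):
            for a in (True, False):
                total += (prob(P_B, b) * prob(P_E, e) * prob(P_A[(b, e)], a)
                          * P_J[a] * P_M[a])
        unnormalized[b] = total
    z = sum(unnormalized.values())
    return {b: v / z for b, v in unnormalized.items()}

print(query_burglary_given_calls())   # {True: ~0.284, False: ~0.716}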
(14.4) is shown as an expression algorithm in Figure 14.9 evaluates such trees
The evaluation process for the expression in Equation tree in Figure
14.8.
The ENUMERATION-ASK
using depth-first recursion. The algorithm is very similar in structure to the backtracking al
gorithm for solving CSPs (Figure 6.5) and the DPLL algorithm for satisfiability (Figure 7.17). The space complexity of ENUMERATION-ASK is only linear in the number of variables:
the algorithm sums over the full joint distribution without ever constructing it explicitly. Un
n Boolean variables is always 0(2n) O(n 2n) for the simple approach described earlier, but still rather grim. Note that the tree in Figure 14.8 makes explicit the repeated subexpressions evalu ated by the algorithm. The products P(j I a)P(m I a) and P(j l •a)P(m l •a) are computed
fortunately, its time complexity for a network with better than the
twice, once for each value of
e. The next section describes a general method that avoids such
wasted computations. 14.4.2
The variable elimination algoritlun
The enumeration algorithm can be improved substantially by eliminating repeated calcula tions of the kind illustrated in Figure
14.8.
The idea is simple: do the calculation once and
save the results for later use. This is a form of dynamic progranuning. There are several ver VARIABLE ELIMINATION
sions of this approach; we present the variable elimination algorithm, wnich is the simplest. Variable elimination works by evaluating expressions such
as Equation
(14.4) in right-to-left
order (that is, bottom up in Figure 14.8). Intennediate results are stored, and swnmations over each variable are done only for those portions of the expression that depend on the variable. Let us illustrate this process for the burglary network. We evaluate the expression
P(B IJ,m) = a P(B) L: P(e) L P(a I B,e) P(j I a) P(m I a) . �
f1(B)
e
,___...
f2(E)
a "-,..-''-.,-''-..,-'
f3(A,B,E)
f4(A)
f5(A)
Notice that we have annotated each part of the expression with the name of the corresponding FACTOR
factor; each factor is a matrix indexed by the values of its argument variables. For example, the factors
f4(A) and f5(A) corresponding to P(j I a) and P(m I a) depend just on A because
J and M are fixed by the query. They are therefore two-element vectors:
f4(A) =
P(! I a) ) = ( ) ( P(J l •a)
(
0. 90 0.05
) ( )
P(m I a) = f5(A) = P(m l •a)
0.70 0.01
.
f3(A, B, E) will be a 2 x 2 x 2 matrix, which is hard to show on the printed page. (The "first" element is given by P(a I b, e)= 0.95 and the "last" by P(•a l •b, •e) = 0.999.) In tenns of factors, the query expression is written as
P(B i j,m) = a f1 (B) x L:r2(E) x L:r3(A,B,E) x f4(A) x f5(A) e
a
Section 14.4.
Exact Inference in Bayesian Networks
P(jla)
525
P(jl•a)
P(jl•a) .05
.90
.05
P(mla) .70
Figure 14.8
The structure of the expression shown in Equation
(14.4).
The evaluation
proceeds top down, multiplying values along each path and summing at the "+" nodes. Notice the repetition of the paths for j and m.
ftmction ENUMERATION-ASK(X,e, bn) returns a distribution over X inputs: X, the query variable e, observed values for variables E bn, a Bayes net with variables {X} U E U Y
I* Y
=
hidden variables * I
Q(X) ..- a distribution over X, initially empty for each value Xi of X do Q(xi) t- ENUMERATE-ALL(bn.VARS,e.,,) where e.,, is e extended with X = Xi return NORMALIZE(Q(X))
function ENUMERATE-ALL(vars, e) returns a real number if EMPTY?(va1·s) then return 1.0 Y ..- FIRST(vars) if Y has value y in e
then return P(y I parents(Y)) x ENUMERATE-ALL(REST(vars),e) else return I:Y P(y I parents(Y)) x ENUMERATE-ALL(REST(va1·s),ey) where ey is e extended with Y = y
Figure 14.9
The enumeration algorithm for answering queries on Bayesian networks.
526
POINTWISE PRODUCT
Chapter
14.
Probabilistic Reasoning
where the " x " operator is not ordinary matrix multiplication but instead the pointwise prod uct operation, to be described shortly. The process of evaluation is a process of summing out variables (right to left) from pointwise products of factors to produce new factors, eventually yielding a factor that is the solution, i.e., the posterior distribution over the query variable. The steps are as follows: •
First, we sum out A from the product of f3, f4, and fs. This gives us a new 2 fs(B, E) whose indices range over just B and E:
fs(B,E)
x
2 factor
a
Now we are left with the expression
P(B IJ, m) = af1(B) x Lf2(E) x fs(B,E) . e
• Next, we sum out E from the product of f2 and f6: e
f2(e) x fs(B,e) + f2(•e) x fs(B, •e) . This leaves the expression
P(Bij,m) = a f1(B) x f7(B) which can be evaluated by taking the pointwise product and normalizing the result. Examining this sequence, we see that two basic computational operations are required: point wise product of a pair of factors, and summing out a variable from a product of factors. The next section describes each of these operations.
Operations on factors TI1e pointwise product of two factors f1 and f2 yields a new factor f whose variables are the union of the variables in f1 and f2 and whose elements are given by the product of the corresponding elements in the two factors. Suppose the two factors have variables }]. , . . . , Yk in common. Then we have
f(X1 . . . Xj, Y1 . . . Yk, Z1 . . . Zi) = f1 (X1 . . . Xj, Y1 . . . Yk) f2(Y1 . . . Yk, Z, . . . Zt)· If all the variables are binary, then f1 and f2 have 2Hk and 2k+l entries, respectively, and the pointwise product has 2Hk+l entries. For example, given two factors f1(A, B) and f2(B,C), the pointwise product f1 x f2 = f3(A,B,C) has 21+1+l = 8 entries, as illustrated in Figure 14.10. Notice that the factor resulting from a pointwise product can contain more variables than any of the factors being multiplied and that the size of a factor is exponential in the number of variables. This is where both space and time complexity arise in the variable elimination algorithm.
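The two factor operations can be prototyped in a few lines. In the sketch below (hypothetical helpers, not the book's code), a factor is a dictionary from tuples of truth values to numbers, together with a list naming its variables; the example reproduces entries of Figure 14.10 and of the sum-out example that follows.

from itertools import product

def pointwise_product(f1, vars1, f2, vars2):
    out_vars = list(dict.fromkeys(vars1 + vars2))          # union, order preserved
    out = {}
    for values in product([True, False], repeat=len(out_vars)):
        assignment = dict(zip(out_vars, values))
        v1 = f1[tuple(assignment[v] for v in vars1)]
        v2 = f2[tuple(assignment[v] for v in vars2)]
        out[values] = v1 * v2
    return out, out_vars

def sum_out(var, f, variables):
    i = variables.index(var)
    out_vars = variables[:i] + variables[i + 1:]
    out = {}
    for values, p in f.items():
        key = values[:i] + values[i + 1:]
        out[key] = out.get(key, 0.0) + p
    return out, out_vars

# The factors of Figure 14.10:
f1 = {(True, True): .3, (True, False): .7, (False, True): .9, (False, False): .1}   # f1(A,B)
f2 = {(True, True): .2, (True, False): .8, (False, True): .6, (False, False): .4}   # f2(B,C)
f3, vars3 = pointwise_product(f1, ['A', 'B'], f2, ['B', 'C'])
print(f3[(True, True, True)])                       # 0.06 = 0.3 * 0.2
print(sum_out('A', f3, vars3)[0][(True, True)])     # 0.24 = 0.06 + 0.18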
Section 14.4.
527
Exact Inference in Bayesian Networks
A
B
f1(A, B)
B
c
f2(B,C)
A
B
c
T T F F
T F T F
.3 .7 .9
T T F F
T F T F
.2 .8 .6 .4
T T T T F F F F
T T F F T T F F
T F T F T F T F
Figure
.1
14.10
lllustrating pointwise multiplication: f1(A, B) x f2(B, C)
fs(A, B, C) .3 X .2= .06 .3 X .8= .24 .7 x .6= .42 .7 X .4= .28 .9 x .2= .18 .9 x .8= .72 .I X .6= .06 .I X .4= .04 = fs(A, B, C).
Summing out a variable from a product of factors is done by adding up the submatrices formed by fixing the variable to each of its values in tw-n. For example, to sum out A from fs(A,B, C), we write
f(B,C) = Lfs(A, B,C) = f3(a,B,C) + fs(•a,B,C) =
( .42 .06 .24 ) + ( . 72 ) = ( .24 ) .48 .32 .28 .06 .04 .I8
.96
.
The only trick is to notice that any factor that does not depend on the variable to be summed out can be moved outside the summation. For example, if we were to sum out E first in the burglary network, the relevant part of the expression would be
L f2(E) x fa(A, B, E) x f-t(A) x fs(A) = f-t(A) x fs(A) x L f2(E) x fa(A, B, E) .
e e Now the pointwise product inside the summation is computed, and the variable is summed out of the resulting matrix. Notice that matrices are not multiplied until we need to sum out a variable from the accumulated product. At that point, we multiply just those matrices that include the variable to be summed out. Given functions for pointwise product and summing out, the variable elimination algorithm itself can be written quite simply, as shown in Figure 14.11.
Variable ordering and variable relevance The algorithm in Figure 14.11 includes an unspecified ORDER ftmction to choose an ordering for the variables. Every choice of ordering yields a valid algorithm, but different orderings cause different intermediate factors to be generated during the calculation. For example, in the calculation shown previously, we eliminated A before E; if we do it the other way, the calculation becomes
P(B lj, m) = af1(B) x L f4(A) x fs(A) x L f2(E) x fs(A, B,E) , a
e
duting which a new factor f6(A, B) will be generated. In general, the time and space requirements of variable elimination are dominated by the size of the largest factor constmcted during the operation of the algoritlun. This in tum
528
Chapter
function
14.
Probabilistic Reasoning
ELIMINATION-ASK(X, e, bn) returns a distribution over X X, the query variable
inputs:
e,
observed values forvariables E
bn, a Bayesian network specifyingjoint distribution P(X1 , . . . , Xn) factOI'S
f-
[j
var in ORDER(bn. VARS) do factors .__ [MAKE-FACTOR(vm·, e) lfactors] if var is a hidden variable then factoi'S .__ SUM-OUT(va1·,facto1'S) return NORMALIZE(POINTWISE-PRODUCT(/actors))
for each
Figure 14.11
The variable elimination algorithm for inference in Bayesian networks.
is detennined by the order of elimination of variables and by the structure of the network. It turns out to be intractable to dete1mine the optimal ordering, but several good heuristics are available.
One fairly effective method is a greedy one: eliminate whichever variable
minimizes the size of the next factor to be constructed. Let us consider one more query:
P(JohnCalls I Burglary = true).
As usual, the first
step is to write out the nested summation:
P(J I b) = o: P(b) L P(e) L P(a I b, e)P(J I a) L P(m Ia) .
m Evaluating this expression from right to left, we notice something interesting: I:m P(m I a) e
a
is equal to I by definition! Hence, there was no need to include it in the first place; the vari able M is irrelevant to this query. Another way of saying this is that the result of the query
P( JohnCalls I Burglary= true)
is unchanged if we remove
MaryCalls
from the network
altogether. In general, we can remove any leaf node that is not a query variable or an evidence variable. After its removal, there may be some more leaf nodes, and these too may be irrele vant. Continuing this process, we eventually find that every variable that is not an ancestor
of a
query variable or evidence variable is irrelevant to the query.
A variable elimination
algorithm can therefore remove all these variables before evaluating the query. 14.4.3
The complexity of exact inference
The complexity of exact inference in Bayesian networks depends strongly on the structure of the network. The burglary network of Figure 14.2 belongs to the family of networks in which there is at most one undirected path between any two nodes in the network. These
are
called
SINGLY CONNECTED
singly connected networks or polytrees, and they have a particularly nice property:
POLYTREE
and space complexity ofexact inference in polytrees is linear in the size ofthe network. Here,
�
the size is defined as the number of CPT entries; if the number of parents of each node is
MULTIPLY OONNECTED
The time
botu1ded by a constant, then the complexity will also be linear in the number of nodes. For multiply connected networks, such as that of Figure 14.12(a), variable elimination can have exponential time and space complexity in the worst case, even when the number of parents per node is botu1ded. This is not surprising when one considers that because it
Section 1 4.4.
Exact Inference in Bayesian Networks
529
IP(C)=.51
c I f
�C
M / P(Sl l {!plinlder .I 0
.50
Rai11
� .
s I I f f
R P(W)
I f I f
.99
.90 .90
.00
(a)
1
f
IP(C)=.51
P(R)
�
.80 .20
:I
S+R P(W) I I If f I ff
.99
.90
.90
.00
Spr+Rai11
® -
c I f
P(S+R=x)
liJ tf t fp, Jif .08 .02 .72 . 1 8 .10 .40 .10 .40
.
(b)
Figure 14.12 (a) A multiply connected network with conditional probability tables. (b) A clustered equivalent of the multiply connected network.
includes inference in propositional logic as a special case, inference in Bayesian networks is NP-hard. In fact, it can be shown (Exercise 14.16) that the problem is as hard as that of com puting the number of satisfying assignments for a propositional logic formula. This means that it is #P-hard ("number-P hard")-that is, strictly harder than NP-complete problems. There is a close connection between the complexity of Bayesian network inference and the complexity of constraint satisfaction problems (CSPs). As we discussed in Chapter 6, the difficulty of solving a discrete CSP is related to how "treelike" its constraint graph is. Measures such as tree width, which bound the complexity of solving a CSP, can also be applied directly to Bayesian networks. Moreover, the va.riable elimination algorithm can be generalized to solve CSPs as well as Bayesian networks. 14.4.4
CLUSTERING JOIN TREE
Clustering algorithms
The variable elimination algorithm is simple and efficient for answering individual queries. If we want to compute posterior probabilities for all the variables in a network, however, i1 can be Jess efficient. For example, in a polytree network, one would need to issue O(n) queries 2 costing O(n) each, for a total of O(n ) time. Using clustering algorithms (also known as join tree algorithms), the time can be reduced to O(n). For this reason, these algorithms are widely used in commercial Bayesian network tools. The basic idea of clustering is to join individual nodes of the network to fmm clus ter nodes in such a way that the resulting network is a polytree. For example, the multiply connected network shown in Figure 14.12(a) can be converted into a polytree by combin ing the Sprinkler and Rain node into a cluster node called Sprinkler+Rain, as shown in Figure 14.12(b). The two Boolean nodes are replaced by a "meganode" that takes on four possible values: tt, tf, ft, and ff. The meganode has only one parent, the Boolean variable Cloudy, so there are two conditioning cases. Although this example doesn't show it. the process of clustering often produces meganodes that share some variables.
530
Chapter
14.
Probabilistic Reasoning
Once the network is in polytree form, a special-purpose inference algorithm is required, because ordinary inference methods carmot handle meganodes that share variables with each other. Essentially, the algorithm is a form of constraint propagation (see Chapter 6) where the constraints ensure that neighboring meganodes agree on the posterior probability of any vari ables that they have in common. With careful bookkeeping, this algorithm is able to compute posterior probabilities for all the nonevidence nodes in the network in time linear in the size of the clustered network. However, the NP-hardness of the problem has not disappeared: if a network requires exponential time and space with variable elimination, then the CPTs in the clustered network will necessarily be exponentially large. 14.5
MONTE CARLO
APPROXIMATE INFERENCE IN B AYESIAN NETWORKS Given the intractability of exact inference in large, multiply connected networks, it is essen tial to consider approximate inference methods. This section describes randomized sampling algorithms, also called Monte Carlo algorithms, that provide approximate answers whose accuracy depends on the number of samples generated. Monte Carlo algorithms, of which simulated annealing (page 126) is an example, are used in many branches of science to es timate quantities that are difficult to calculate exactly. In this section, we are interested in sampling applied to the computation of posterior probabilities. We describe two families of algorithms: direct sampling and Markov chain sampling. Two other approaches-variational methods and loopy propagation-are mentioned in the notes at the end of the chapter. 14.5.1
Direct sampling methods
The primitive element in any sampling algorithm is the generation of samples from a known probability distribution. For example, an unbiased coin can be thought of as a random variable Coin with values (heads, tails) and a prior distribution P(Coin) = (0.5, 0.5). Sampling from this distribution is exactly like flipping the coin: with probability 0.5 it will return heads, and with probability 0.5 it will return tails. Given a source of random numbers uniformly distributed in the range [0, 1], it is a simple matter to sample any distribution on a single variable, whether discrete or continuous. (See Exercise 14.17.)
The simplest kind of random sampling process for Bayesian networks generates events from a network that has no evidence associated with it. The idea is to sample each variable in turn, in topological order. The probability distribution from which the value is sampled is conditioned on the values already assigned to the variable's parents. This algorithm is shown in Figure 14.13. We can illustrate its operation on the network in Figure 14.12(a), assuming an ordering [Cloudy, Sprinkler, Rain, WetGrass]:
1. Sample from P(Cloudy)
= (0.5, 0.5); value is true.
2. Sample from P(Sprinkler | Cloudy = true) = (0.1, 0.9); value is false.
3. Sample from P(Rain | Cloudy = true) = (0.8, 0.2); value is true.
4. Sample from P(WetGrass | Sprinkler = false, Rain = true) = (0.9, 0.1); value is true.
In this case, PRIOR-SAMPLE returns the event [true, false, true, true].
Section 14.5.
Approximate Inference in Bayesian Networks
531
ftmction PRIOR-SAMPLE(bn) returns an event sampled from the prior specified by bn inputs: bn, a Bayesian network specifying joint distribution P(X1 , . . . ,Xn)
x .._ an event with n elements foreach variable Xi in X1 , . . . , Xn do x[i]._ a random sample from P(Xi I pm·ents(Xi)) return x
Figure 14.13 A sampling algorithm that generates events from a Bayesian network. Each variable is sampled according to the conditional distribution given the values already sampled for the variable's parents.
It is easy to see that PRIOR-SAMPLE generates samples from the prior joint distribution specified by the network. First, let Sps(xl, . . . , Xn) be the probability that a specific event is generated by the PRIOR-SAMPLE algorithm. Just looking at the sampling process, we have
n
Sps(xl . . . xn) = IT P(xi iparents(Xi)) because each sampling step depends only on the parent values. This expression should look familiar, because it is also the probability of the event according to the Bayesian net's repre sentation of the joint distribution, as stated in Equation (14.2). That is, we have
Sps(xl . . . Xn) = P(x1 . . . Xn) . This simple fact makes it easy to answer questions by using samples. In any sampling algmithm, the answers are computed by counting the actual samples generated. Suppose there are N total samples, and let NPS (x1, . . . , Xn) be the number of times the specific event x1, . . . , Xn occurs in the set of samples. We expect this number, as a fraction of the total, to converge in the limit to its expected value according to the sampling probability:
Nps(xb . . . ,xn) = S ( X1, . . . ,Xn ) = P( X1, . . . 1Xn) . N For example, consider the event produced earlier: [true ,false, true, true:. l. lffi
PS
N-+oo
(14.5) The sampling
probability for this event is
Sps(true,Jalse, true, true) = 0.5 x 0.9 x 0.8 x 0.9 = 0.324 . Hence, in the limit of large N, we expect 32.4% of the samples to be of this event. CONSISTENT
Whenever we use an approximate equality (":=:::::") in what follows, we mean it in exactly this sense-that the estimated probability becomes exact in the large-sample limit. Such an estimate is called consistent. For example, one can produce a consistent estimate of the probability of any partially specified event Xl, . . . , Xm, where m :::; n, as follows:
P(x1, . . . ,xm)
:=:::::
Nps(xb . . . ,xm )/N .
(14.6)
That is, the probability of the event can be estimated as the fraction of all complete events generated by the sampling process that match the partially specified event. For example, if
532
Chapter
14.
Probabilistic Reasoning
we generate 1000 samples from the sprinkler network, and 511 of them have Rain = true, then the estimated probability of rain, written as P(Rain = true), is 0.511.
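A direct rendering of this sampling process (a sketch, not the book's PRIOR-SAMPLE) for the sprinkler network looks like this; the CPT numbers are those of Figure 14.12(a).

import random

parents = {'Cloudy': [], 'Sprinkler': ['Cloudy'], 'Rain': ['Cloudy'],
           'WetGrass': ['Sprinkler', 'Rain']}
cpt = {
    'Cloudy': {(): 0.5},
    'Sprinkler': {(True,): 0.1, (False,): 0.5},
    'Rain': {(True,): 0.8, (False,): 0.2},
    'WetGrass': {(True, True): 0.99, (True, False): 0.90,
                 (False, True): 0.90, (False, False): 0.00},
}

def prior_sample():
    event = {}
    for var in ['Cloudy', 'Sprinkler', 'Rain', 'WetGrass']:   # a topological order
        p_true = cpt[var][tuple(event[u] for u in parents[var])]
        event[var] = random.random() < p_true
    return event

samples = [prior_sample() for _ in range(10000)]
print(sum(s['Rain'] for s in samples) / len(samples))   # roughly 0.5 = P(Rain = true)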
REJECTION SAMPLING
Rejection sampling in Bayesian networks

Rejection sampling is a general method for producing samples from a hard-to-sample distribution given an easy-to-sample distribution. In its simplest form, it can be used to compute conditional probabilities, that is, to determine P(X | e). The REJECTION-SAMPLING algorithm is shown in Figure 14.14. First, it generates samples from the prior distribution specified by the network. Then, it rejects all those that do not match the evidence. Finally, the estimate P(X = x | e) is obtained by counting how often X = x occurs in the remaining samples. Let P̂(X | e) be the estimated distribution that the algorithm returns. From the definition of the algorithm, we have
bution given an easy-to-sample distribution. In its simplest form, it can be used to compute conditional probabilities-that is, to determine P(X I e). The REJECTION-SAMPLING algo rithm is shown in Figure 14.14. First, it generates samples from the prior distribution specified by the network. Then, it rejects all those that do not match the evidence. Finally, the estimate P(X = x e) is obtained by counting how often X = x occurs in the remaining samples. Let P(X I e) be the estimated distribution that the algorithm returns. From the definition of the algorithm, we have
l
, P(X I e) = o:Nps(X,e) = Nps(X,e) Nps(e)
.
From Equation (14.6), this becomes
P̂(X | e) ≈ P(X, e) / P(e) = P(X | e) .
That is, rejection sampling produces a consistent estimate of the true probability.
Continuing with our example from Figure 14.12(a), let us assume that we wish to estimate P(Rain | Sprinkler = true), using 100 samples. Of the 100 that we generate, suppose that 73 have Sprinkler = false and are rejected, while 27 have Sprinkler = true; of the 27, 8 have Rain = true and 19 have Rain = false. Hence,
P(Rain | Sprinkler = true) ≈ NORMALIZE((8, 19)) = (0.296, 0.704) .
The true answer is (0.3, 0.7). As more samples are collected, the estimate will converge to the true answer. The standard deviation of the error in each probability will be proportional to 1/√n, where n is the number of samples used in the estimate.
The biggest problem with rejection sampling is that it rejects so many samples! The fraction of samples consistent with the evidence e drops exponentially as the number of evidence variables grows, so the procedure is simply unusable for complex problems.
Notice that rejection sampling is very similar to the estimation of conditional probabilities directly from the real world. For example, to estimate P(Rain | RedSkyAtNight = true), one can simply count how often it rains after a red sky is observed the previous evening, ignoring those evenings when the sky is not red. (Here, the world itself plays the role of the sample-generation algorithm.) Obviously, this could take a long time if the sky is very seldom red, and that is the weakness of rejection sampling.
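Rejection sampling is then a one-line filter on top of prior sampling. The sketch below assumes the prior_sample function and network definitions from the previous sketch (an assumption, not part of the book's code) and estimates P(Rain | Sprinkler = true); with enough samples it converges to roughly (0.3, 0.7).

def rejection_sample_rain_given_sprinkler(n):
    # Assumes prior_sample() from the PRIOR-SAMPLE sketch above.
    counts = {True: 0, False: 0}
    for _ in range(n):
        s = prior_sample()
        if s['Sprinkler']:               # reject samples inconsistent with the evidence
            counts[s['Rain']] += 1
    total = sum(counts.values())
    return {r: c / total for r, c in counts.items()} if total else None

print(rejection_sample_rain_given_sprinkler(10000))   # roughly {True: 0.3, False: 0.7}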
LIKELIHOOD WEIGHTING   IMPORTANCE SAMPLING
Likelihood weighting

Likelihood weighting avoids the inefficiency of rejection sampling by generating only events that are consistent with the evidence e. It is a particular instance of the general statistical technique of importance sampling, tailored for inference in Bayesian networks. We begin by
ftmction REJECTION-SAMPLING(X ,e, bn, N) returns an estimate of P(X Ie) inputs: X, the query variable e, observed values for variables E bn, a Bayesian network N, the total number of samples to be generated local variables: N, a vector of counts for each value of X, nit ally zero
i i
for j
=
1 to N do
x t-
PRIOR-SAMPLE(bn)
if x is consistent withe then N[x] +--- N[x]+l where x is the value of X in x return NORMALIZE(N) Figure 14.14 The rejection-sampling algorithm for answering queries given evidence in a Bayesian network.
describing how the algorithm works; then we show that it works correctly-that is, generates consistent probability estimates. LIKELIHOOD-WEIGHTING (see Figure 14.15) fixes the values for the evidence vari ables E and samples only the nonevidence variables. This guarantees that each event gener ated is consistent with the evidence. Not all events are equal, however. Before tallying the cmmts in the distribution for the query variable, each event is weighted by the likelihood that the event accords to the evidence, as measured by the product of the conditional probabilities for each evidence variable, given its parents. Intuitively, events in which the actual evidence appears unlikely should be given less weight. Let us apply the algorithm to the network shown n i Figure 14.12(a), with the query P(Rain I Cloudy = true, WetGmss = true) and the ordering Cloudy, Sprinkler, Rain, Wet Grass. (Any topological ordering will do.) The process goes as follows: First, the weight w is set to 1.0. Then an event is generated:
Cloudy is an evidence variable with value true. Therefore, we set P( Cloudy = true) = 0.5 . 2. Sprinkler is not an evidence variable, so sample from P(Sprinkler I Cloudy = true) = (0.1, 0.9); suppose this returns false. 3. Similarly, sample from P(Rain I Cloudy = true) = (0.8, 0.2); suppose this returns true. 4. WetGrass is an evidence variable with value true. Therefore, we set P( WP.UimRR = t·rv.P. I SprinklP.r =fn.lRP., R.n.in = trnP.) = 0.4!'i . Here WEIGHTED-SAMPLE returns the event [true ,false, true, true] with weight 0.45, and this is tallied under Rain= true. 1.
w
0. In the umbrella example, this might mean computing the probability of rain three days
from now, given all the observations to date. Prediction is useful for evaluating possible courses of action based on their expected outcomes.
2 The tenn "filtering" refers to the roots of this problem in early work on signal processing, where the problem is to filter out the noise in signal by estimating its underlying properties. a
Section
15.2.
571
Inference in Temporal Models
SMOOTHING
•
Smoothing: This is the task of computing the posterior distribution over a past state, given all evidence up to the present. That is, we wish to compute P(Xk I el:t) for some k such that 0
� k < t.
In the umbrella example, it might mean computing the probability
that it rained last Wednesday, given all the observations of the tunbrella carrier made up to today. Smoothing provides a better estimate of the state than was available at the 3 time, because it incorporates more evidence. •
Most likely explanation: Given a sequence of observations, we might wish to find the sequence of states that is most likely to have generated those observations. That is, we wish to compute argmax_{x1:t}
P(xt:t I el:t)· For example, ifthe umbrella appears on each
of the first three days and is absent on the fourth, then the most likely explanation is that it rained on the first three days and did not rain on the fourth. Algorithms for this task are useful in many applications, including speech recognition-where the aim is to find the most likely sequence of words, given a series of sounds-and the reconstruction of bit strings transmitted over a noisy charu1el. In addition to these inference tasks, we also have •
Learning:
The transition and sensor models, if not yet known, can be learned from
observations. Just as with static Bayesian networks, dynamic Bayes net learning can be done as a by-product of inference. Inference provides an estimate of what transitions actually occtuTed and of what states generated the sensor readings, and these estimates can be used to update the models. The updated models provide new estimates, and the process iterates to convergence. The overall process is an instance of the expectation maximization or
EM algorithm. (See Section 20.3.)
Note that learning requires smoothing, rather than filtering, because smoothing provides bet ter estimates of the states of the process. Leaming with filtering can fail to converge correctly; consider, for example, the problem of learning to solve murders: unless you are an eyewit ness, smoothing is
always
required to infer what happened at the murder scene from the
observable variables. The remainder of this section describes generic algoritluns for the four inference tasks, independent of the particular kind of model employed. Improvements specific to each model are described in subsequent sections.
15.2.1
Filtering and prediction
As we pointed out in Section
7.7.3, a useful filtering algoritlun needs to maintain
a current
state estimate and update it, rather than going back over the entire history of percepts for each update. (Otherwise, the cost of each update increases as time goes by.) In other words, given
the result of filtering up
ESTIMATION
time t, the agent needs
to
compute the result for t 1 1 from the
et+l • P(Xt+l l el:t+l ) = f (�+l, P(Xt I el:t)) , for some function f. This process is called recursive estimation. We can view the calculation
new evidence
RECURSIVE
to
In particular, when tracking a movingobject with inaccurate position observations, smoothing gives a smoother estimated trajectory than filtering-hence the name. 3
Chapter
572
15.
Probabilistic Reasoning over Time
as being composed of two parts: first, the current state distribution is projected forward from
t to t + 1 ; then it is updated using the new evidence �+1·
This two-part process emerges quite
simply when the formula is rearranged:
P(X_{t+1} | e_{1:t+1}) = P(X_{t+1} | e_{1:t}, e_{t+1})    (dividing up the evidence)
    = α P(e_{t+1} | X_{t+1}, e_{1:t}) P(X_{t+1} | e_{1:t})    (using Bayes' rule)
    = α P(e_{t+1} | X_{t+1}) P(X_{t+1} | e_{1:t})    (by the sensor Markov assumption).
(15.4)
Here and throughout this chapter, a is a normalizing constant used to make probabilities sum
P(Xt+l I e1:t) represents a one-step prediction of the next state, and the first term updates this with the new evidence; notice that P( Ct+l l Xt+1) is obtainable up to
1.
The second term,
directly from the sensor model. Now we obtain the one-step prediction for the next state by
Xt: P(Xt+l I e1:t+l) = a P( Ct+l I Xt+l) L P(Xt+l I Xt, el:t)P(xt I el:t)
conditioning on the current state
= a
P(�+1 I Xt+1) L P(Xt+l I Xt)P(xt I e1:t)
(Markov assumption).
(15.5)
Within the stunmation, the first factor comes from the transition model and the second comes
from the current state distribution. Hence, we have the desired recursive formulation. We can think of the filtered estimate
P(Xt I el:t) as a "message" fl:t that is propagated forward along
the sequence, modified by each transition and updated by each new observation. The process is given by
f1:t+1 = a FORWARD(fl:t, �+1) , where FORWARD implements the update described in Equation (15.5) and the process begins with f1:0 = P(Xo). When all the state variables discrete, the time for each update is are
constant (i.e., independent of t), and the space required is also constant.
(The constants
depend, of course, on the size of the state space and the specific type of the temporal model in question.)
The time and space requirements for updating must be constant ifan agent with limited memory is to keep track ofthe current state distribution over an unbounded sequence ofobservations. Let us illustrate the filtering process for two steps in the basic umbrella example (Fig
ure
15.2.) That is, we will compute P(R2 I U1:2) as follows: On day 0, we have no observations, only the security guard's prior beliefs; Jet's assume that consists ofP(Ro) = (0.5,0.5). On day 1, the umbrella appears, so U1 = true. The prediction from t = 0 to t = 1 is
•
•
P(R1)
L P(R1 I ro)P(ro) ro
(0. 7, 0.3)
X
0.5 + (0.3, 0. 7) X 0.5 = (0.5, 0.5) .
Then the update step simply multiplies by the probability of the evidence for t = 1 and normalizes, as shown in Equation
P(R1 I u1)
a
(15.4):
P(u1 l R1)P(Rl) = a (0.9, 0.2) (0.5, 0.5)
= a (0.45, 0.1)
�
(0.818, 0.182) .
Section 15.2.
Inference in Temporal Models •
573
On day 2, the tunbrella appears, so U2 = true. The prediction from t = 1 to t = 2 is
LP(R2 I rt)P(rt I u1)
P(R2 I u1)
(0.7, 0.3)
X
0.818 + (0.3, 0.7)
X
0.182
�
(0.627, 0.373) ,
and updating it with the evidence for t = 2 gives
P(R2 I u1, U2)
= a P(u2 l �)P(R2 I Ut) = a (0.9, 0.2)(0.627, 0.373) = a (0.565,0.075) � (0.883, 0.117) .
Intuitively, the probability of rain increases from day 1 to day 2 because rain persists. Exer cise 1 5.2(a) asks you to investigate this tendency further. The task of prediction can be seen simply as filtering without the addition of new evidence. In fact, the filtering process already incorporates a one-step prediction, and it is easy to derive the following recursive computation for predicting the state at t + k + 1 from a prediction for t + k:
P(Xt+k+t l el:t) = L P(Xt+k+l l Xt+k)P(xt+k I el:t) .
(15.6)
Xt+k
MIXING TIME
Naturally, this computation involves only the transition model and not the sensor model.

It is interesting to consider what happens as we try to predict further and further into the future. As Exercise 15.2(b) shows, the predicted distribution for rain converges to a fixed point ⟨0.5, 0.5⟩, after which it remains constant for all time. This is the stationary distribution of the Markov process defined by the transition model. (See also page 537.) A great deal is known about the properties of such distributions and about the mixing time, which is, roughly, the time taken to reach the fixed point. In practical terms, this dooms to failure any attempt to predict the actual state for a number of steps that is more than a small fraction of the mixing time, unless the stationary distribution itself is strongly peaked in a small area of the state space. The more uncertainty there is in the transition model, the shorter will be the mixing time and the more the future is obscured.

In addition to filtering and prediction, we can use a forward recursion to compute the likelihood of the evidence sequence, P(e_{1:t}). This is a useful quantity if we want to compare different temporal models that might have produced the same evidence sequence (e.g., two different models for the persistence of rain). For this recursion, we use a likelihood message ℓ_{1:t}(X_t) = P(X_t, e_{1:t}). It is a simple exercise to show that the message calculation is identical to that for filtering:

ℓ_{1:t+1} = FORWARD(ℓ_{1:t}, e_{t+1}) .

Having computed ℓ_{1:t}, we obtain the actual likelihood by summing out X_t:

L_{1:t} = P(e_{1:t}) = Σ_{x_t} ℓ_{1:t}(x_t) .   (15.7)
Notice that the likelihood message represents the probabilities of longer and longer evidence sequences as time goes by and so becomes numerically smaller and smaller, leading to underflow problems with floating-point arithmetic. This is an important problem in practice, but we shall not go into solutions here.
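One common remedy, not developed in this chapter, is to keep the filtered message normalized and accumulate the logarithms of the per-step normalization constants. The sketch below illustrates the idea under the array conventions of the earlier filtering sketch; the helper name is an assumption made for clarity.

import numpy as np

def log_likelihood(evidence_seq, prior, transition, sensor):
    """Return log P(e_1:t) by accumulating the log of each step's
    normalizer instead of multiplying ever-smaller probabilities."""
    f = prior.copy()
    log_lik = 0.0
    for e in evidence_seq:
        unnormalized = sensor[e] * (transition.T @ f)
        norm = unnormalized.sum()          # this is P(e_t | e_1:t-1)
        log_lik += np.log(norm)            # accumulate in log space
        f = unnormalized / norm            # keep the message normalized
    return log_lik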
[Figure: (a) A cycle of exchanges resulting in irrational behavior. (b) The decomposability axiom.]
In other words, once the probabilities and utilities of the possible outcome states are specified, the utility of a compound lottery involving those states is completely determined. Because the outcome of a nondeterministic action is a lottery, it follows that an agent can act rationally, that is, consistently with its preferences, only by choosing an action that maximizes expected utility according to Equation (16.1).

The preceding theorems establish that a utility function exists for any rational agent, but they do not establish that it is unique. It is easy to see, in fact, that an agent's behavior would not change if its utility function U(S) were transformed according to

U'(S) = aU(S) + b ,   (16.2)

where a and b are constants and a > 0; an affine transformation.4 This fact was noted in Chapter 5 for two-player games of chance; here, we see that it is completely general. As in game playing, in a deterministic environment an agent just needs a preference ranking on states; the numbers don't matter. This is called a value function or ordinal utility function.

VALUE FUNCTION
ORDINAL UTILITY FUNCTION
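The invariance under positive affine transformations is easy to check numerically. The following is a small sketch (the numbers and names are made up purely for illustration): for any a > 0 and b, the action that maximizes expected utility under U also maximizes it under aU + b.

import numpy as np

rng = np.random.default_rng(0)
utilities = rng.random(5)            # U(S) for five outcome states
lotteries = rng.random((3, 5))       # three actions, each a distribution over outcomes
lotteries /= lotteries.sum(axis=1, keepdims=True)

def best_action(u):
    """Index of the action with maximum expected utility."""
    return int(np.argmax(lotteries @ u))

a, b = 3.7, -12.0                    # any a > 0 and any b
assert best_action(utilities) == best_action(a * utilities + b)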
It is important to remember that the existence of a utility function that describes an agent's preference behavior does not necessarily mean that the agent is explicitly maximizing that utility function in its own deliberations. As we showed in Chapter 2, rational behavior can be generated in any number of ways. By observing a rational agent's preferences, however, an observer can construct the utility function that represents what the agent is actually trying to achieve (even if the agent doesn't know it).
4 In this sense, utilities resemble temperatures: a temperature in Fahrenheit is 1.8 times the Celsius temperature plus 32. You get the same results in either measurement system.
16.3   UTILITY FUNCTIONS
Utility is a function that maps from lotteries to real numbers. We know there are some axioms on utilities that all rational agents must obey. Is that all we can say about utility functions?
Strictly speaking, that is it: an agent can have any preferences it likes. For e

      if max_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U[s'] > Σ_{s'} P(s' | s, π[s]) U[s'] then do
          π[s] ← argmax_{a ∈ A(s)} Σ_{s'} P(s' | s, a) U[s']
          unchanged? ← false
   until unchanged?
   return π

Figure 17.7   The policy iteration algorithm for calculating an optimal policy.
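For reference, here is a compact sketch of policy iteration in Python, written independently of the book's code repository. The MDP representation (dictionaries of transition lists and rewards, an actions function, exact policy evaluation by solving a linear system) is an assumption made for this sketch.

import numpy as np

def policy_iteration(states, actions, P, R, gamma=0.9):
    """Policy iteration sketch.
    states: list of states; actions(s): available actions in s;
    P[s][a]: list of (probability, next_state) pairs; R[s]: reward of s."""
    idx = {s: i for i, s in enumerate(states)}
    pi = {s: actions(s)[0] for s in states}          # arbitrary initial policy
    while True:
        # Policy evaluation: solve U = R + gamma * T_pi U exactly.
        T = np.zeros((len(states), len(states)))
        for s in states:
            for p, s2 in P[s][pi[s]]:
                T[idx[s], idx[s2]] += p
        U = np.linalg.solve(np.eye(len(states)) - gamma * T,
                            np.array([R[s] for s in states]))
        # Policy improvement: make the policy greedy with respect to U.
        def q(s, a):
            return sum(p * U[idx[s2]] for p, s2 in P[s][a])
        unchanged = True
        for s in states:
            best = max(actions(s), key=lambda a: q(s, a))
            if q(s, best) > q(s, pi[s]):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi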
ASYNCHRONOUS POLICY ITERATION

The algorithms we have described so far require updating the utility or policy for all states at once. It turns out that this is not strictly necessary. In fact, on each iteration, we can pick any subset of states and apply either kind of updating (policy improvement or simplified value iteration) to that subset. This very general algorithm is called asynchronous policy iteration. Given certain conditions on the initial policy and initial utility function, asynchronous policy iteration is guaranteed to converge to an optimal policy. The freedom to choose any states to work on means that we can design much more efficient heuristic algorithms; for example, algorithms that concentrate on updating the values of states that are likely to be reached by a good policy. This makes a lot of sense in real life: if one has no intention of throwing oneself off a cliff, one should not spend time worrying about the exact value of the resulting states.
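As an illustration, the sketch below (not from the book) applies a Bellman-style update and greedy policy improvement only at a chosen subset of states, which is the core of the asynchronous scheme; how the subset is chosen is left to the caller, and the representation follows the earlier policy iteration sketch.

def asynchronous_updates(subset, U, pi, actions, P, R, gamma=0.9):
    """Apply one round of simplified value-iteration updates and greedy
    policy improvement to the states in `subset` only.
    U and pi are dictionaries mapping states to utilities and actions."""
    def q(s, a):
        return R[s] + gamma * sum(p * U[s2] for p, s2 in P[s][a])
    for s in subset:                      # any subset of states will do
        best = max(actions(s), key=lambda a: q(s, a))
        U[s] = q(s, best)                 # simplified value-iteration update
        pi[s] = best                      # policy improvement at s
    return U, pi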
17.4   PARTIALLY OBSERVABLE MDPS
PARTIALLY OBSERVABLE MDP

The description of Markov decision processes in Section 17.1 assumed that the environment was fully observable. With this assumption, the agent always knows which state it is in. This, combined with the Markov assumption for the transition model, means that the optimal policy depends only on the current state. When the environment is only partially observable, the situation is, one might say, much less clear. The agent does not necessarily know which state it is in, so it cannot execute the action π(s) recommended for that state. Furthermore, the utility of a state s and the optimal action in s depend not just on s, but also on how much the agent knows when it is in s. For these reasons, partially observable MDPs (or POMDPs, pronounced "pom-dee-pees") are usually viewed as much more difficult than ordinary MDPs. We cannot avoid POMDPs, however, because the real world is one.
17.4.1   Definition of POMDPs
To get a handle on POMDPs, we must first define them properly. A POMDP has the same elements as an MDP (the transition model P(s' | s, a), actions A(s), and reward function R(s)) but, like the partially observable search problems of Section 4.4, it also has a sensor model P(e | s). Here, as in Chapter 15, the sensor model specifies the probability of perceiving evidence e in state s.3 For example, we can convert the 4 x 3 world of Figure 17.1 into a POMDP by adding a noisy or partial sensor instead of assuming that the agent knows its location exactly. Such a sensor might measure the number of adjacent walls, which happens to be 2 in all the nonterminal squares except for those in the third column, where the value is 1; a noisy version might give the wrong value with probability 0.1.

3 As with the reward function for MDPs, the sensor model can also depend on the action and outcome state, but again this change is not fundamental.

In Chapters 4 and 11, we studied nondeterministic and partially observable planning problems and identified the belief state (the set of actual states the agent might be in) as a key concept for describing and calculating solutions. In POMDPs, the belief state b becomes a probability distribution over all possible states, just as in Chapter 15. For example, the initial
belief state for the 4 x 3 POMDP could be the uniform distribution over the nine nonterminal states, i.e., ⟨1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 0, 0⟩. We write b(s) for the probability assigned to the actual state s by belief state b.

The agent can calculate its current belief state as the conditional probability distribution over the actual states given the sequence of percepts and actions so far. This is essentially the filtering task described in Chapter 15. The basic recursive filtering equation (15.5 on page 572) shows how to calculate the new belief state from the previous belief state and the new evidence. For POMDPs, we also have an action to consider, but the result is essentially the same. If b(s) was the previous belief state, and the agent does action a and then perceives evidence e, then the new belief state is given by
b'(s') = α P(e | s') Σ_s P(s' | s, a) b(s) ,
where α is a normalizing constant that makes the belief state sum to 1. By analogy with the update operator for filtering (page 572), we can write this as

b' = FORWARD(b, a, e) .   (17.11)
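A direct transcription of this belief update is shown below. It is an illustrative sketch rather than the book's code, and it assumes the transition and sensor models are given as arrays indexed by state.

import numpy as np

def belief_update(b, a, e, T, O):
    """POMDP belief update b' = FORWARD(b, a, e).
    T[a][s, s2] = P(s2 | s, a); O[e][s2] = P(e | s2); b is a vector over states."""
    predicted = T[a].T @ b           # sum_s P(s' | s, a) b(s)
    unnormalized = O[e] * predicted  # multiply in P(e | s')
    return unnormalized / unnormalized.sum()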
In the 4 x 3 POMDP, suppose the agent moves Left and its sensor reports 1 adjacent wall; then it's quite likely (although not guaranteed, because both the motion and the sensor are noisy) that the agent is now in (3, 1). Exercise 17.13 asks you to calculate the exact probability values for the new belief state.

The fundamental insight required to understand POMDPs is this: the optimal action depends only on the agent's current belief state. That is, the optimal policy can be described by a mapping π*(b) from belief states to actions. It does not depend on the actual state the agent is in. This is a good thing, because the agent does not know its actual state; all it knows is the belief state. Hence, the decision cycle of a POMDP agent can be broken down into the following three steps:

1. Given the current belief state b, execute the action a = π*(b).
2. Receive percept e.
3. Set the current belief state to FORWARD(b, a, e) and repeat.

Now we can think of POMDPs as requiring a search in belief-state space, just like the methods for sensorless and contingency problems in Chapter 4. The main difference is that the POMDP belief-state space is continuous, because a POMDP belief state is a probability distribution. For example, a belief state for the 4 x 3 world is a point in an 11-dimensional continuous space. An action changes the belief state, not just the physical state. Hence, the action is evaluated at least in part according to the information the agent acquires as a result. POMDPs therefore include the value of information (Section 16.6) as one component of the decision problem.

Let's look more carefully at the outcome of actions. In particular, let's calculate the probability that an agent in belief state b reaches belief state b' after executing action a. Now, if we knew the action and the subsequent percept, then Equation (17.11) would provide a deterministic update to the belief state: b' = FORWARD(b, a, e). Of course, the subsequent percept is not yet known, so the agent might arrive in one of several possible belief states b', depending on the percept that is received. The probability of perceiving e, given that a was
performed starting in belief state b, is given by summing over all the actual states s' that the agent might reach:

P(e | a, b) = Σ_{s'} P(e | a, s', b) P(s' | a, b)
            = Σ_{s'} P(e | s') P(s' | a, b)
            = Σ_{s'} P(e | s') Σ_s P(s' | s, a) b(s) .
Let us write the probability of reaching b' from b, given action a, as P(b' | b, a). Then that gives us

P(b' | b, a) = P(b' | a, b) = Σ_e P(b' | e, a, b) P(e | a, b)
             = Σ_e P(b' | e, a, b) Σ_{s'} P(e | s') Σ_s P(s' | s, a) b(s) ,   (17.12)

where P(b' | e, a, b) is 1 if b' = FORWARD(b, a, e) and 0 otherwise. Equation (17.12) can be viewed as defining a transition model for the belief-state space. We can also define a reward function for belief states (i.e., the expected reward for the actual states the agent might be in):

ρ(b) = Σ_s b(s) R(s) .
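The observation probability and the belief-space reward can be computed directly from these definitions; the sketch below is illustrative only and reuses the array conventions of the earlier belief_update sketch.

import numpy as np

def observation_probability(e, a, b, T, O):
    """P(e | a, b) = sum_{s'} P(e | s') sum_s P(s' | s, a) b(s)."""
    return float(O[e] @ (T[a].T @ b))

def belief_reward(b, R):
    """rho(b) = sum_s b(s) R(s), the expected reward of belief state b."""
    return float(b @ R)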
Together, P(b' | b, a) and ρ(b) define an observable MDP on the space of belief states. Furthermore, it can be shown that an optimal policy for this MDP, π*(b), is also an optimal policy for the original POMDP. In other words, solving a POMDP on a physical state space can be reduced to solving an MDP on the corresponding belief-state space. This fact is perhaps less surprising if we remember that the belief state is always observable to the agent, by definition.

Notice that, although we have reduced POMDPs to MDPs, the MDP we obtain has a continuous (and usually high-dimensional) state space. None of the MDP algorithms described in Sections 17.2 and 17.3 applies directly to such MDPs. The next two subsections describe a value iteration algorithm designed specifically for POMDPs and an online decision-making algorithm, similar to those developed for games in Chapter 5.

17.4.2   Value iteration for POMDPs
Section 17.2 described a value iteration algorithm that computed one utility value for each state. With infinitely many belief states, we need to be more creative. Consider an optimal policy π* and its application in a specific belief state b: the policy generates an action, then, for each subsequent percept, the belief state is updated and a new action is generated, and so on. For this specific b, therefore, the policy is exactly equivalent to a conditional plan, as defined in Chapter 4 for nondeterministic and partially observable problems. Instead of thinking about policies, let us think about conditional plans and how the expected utility of executing a fixed conditional plan varies with the initial belief state. We make two observations:
1. Let the utility of executing a fixed conditional plan p starting in physical state s be α_p(s). Then the expected utility of executing p in belief state b is just Σ_s b(s) α_p(s), or b · α_p if we think of them both as vectors. Hence, the expected utility of a fixed conditional plan varies linearly with b; that is, it corresponds to a hyperplane in belief space.

2. At any given belief state b, the optimal policy will choose to execute the conditional plan with highest expected utility; and the expected utility of b under the optimal policy is just the utility of that conditional plan:

   U(b) = U^{π*}(b) = max_p b · α_p .
If the optimal policy π* chooses to execute p starting at b, then it is reasonable to expect that it might choose to execute p in belief states that are very close to b; in fact, if we bound the depth of the conditional plans, then there are only finitely many such plans and the continuous space of belief states will generally be divided into regions, each corresponding to a particular conditional plan that is optimal in that region. From these two observations, we see that the utility function U(b) on belief states, being the maximum of a collection of hyperplanes, will be piecewise linear and convex.

To illustrate this, we use a simple two-state world. The states are labeled 0 and 1, with R(0) = 0 and R(1) = 1. There are two actions: Stay stays put with probability 0.9 and Go switches to the other state with probability 0.9. For now we will assume the discount factor γ = 1. The sensor reports the correct state with probability 0.6. Obviously, the agent should Stay when it thinks it's in state 1 and Go when it thinks it's in state 0.

The advantage of a two-state world is that the belief space can be viewed as one-dimensional, because the two probabilities must sum to 1. In Figure 17.8(a), the x-axis represents the belief state, defined by b(1), the probability of being in state 1. Now let us consider the one-step plans [Stay] and [Go], each of which receives the reward for the current state followed by the (discounted) reward for the state reached after the action:
α_[Stay](0) = R(0) + γ(0.9 R(0) + 0.1 R(1)) = 0.1
α_[Stay](1) = R(1) + γ(0.9 R(1) + 0.1 R(0)) = 1.9
α_[Go](0)   = R(0) + γ(0.9 R(1) + 0.1 R(0)) = 0.9
α_[Go](1)   = R(1) + γ(0.9 R(0) + 0.1 R(1)) = 1.1
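The numbers above, and the deeper plans discussed next, can be generated mechanically from Equation (17.13). The following sketch is illustrative only: it uses the two-state parameters given in the text and prunes with a simple pointwise-dominance test rather than the full linear-program test for dominated plans.

import itertools
import numpy as np

R = np.array([0.0, 1.0])                                  # rewards for states 0 and 1
T = {'Stay': np.array([[0.9, 0.1], [0.1, 0.9]]),
     'Go':   np.array([[0.1, 0.9], [0.9, 0.1]])}
O = {0: np.array([0.6, 0.4]), 1: np.array([0.4, 0.6])}    # P(percept | state)
gamma = 1.0

def deeper_plans(alphas):
    """Build alpha-vectors of depth d+1 from the depth-d vectors `alphas`
    using Equation (17.13): one choice of subplan for each percept."""
    new = []
    for a in T:
        for subplans in itertools.product(alphas, repeat=len(O)):
            vec = R + gamma * sum(T[a] @ (O[e] * subplans[e]) for e in O)
            new.append(vec)
    return new

def undominated(alphas):
    """Drop vectors that are pointwise dominated by some other vector."""
    return [a for a in alphas
            if not any(np.all(b >= a) and np.any(b > a) for b in alphas)]

alphas = [R.copy()]                                       # depth-0 "empty plan"
for depth in range(1, 3):
    alphas = undominated(deeper_plans(alphas))
    print(depth, len(alphas), [np.round(a, 2) for a in alphas])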
The hyperplanes (lines, in this case) for b · α_[Stay] and b · α_[Go] are shown in Figure 17.8(a) and their maximum is shown in bold. The bold line therefore represents the utility function for the finite-horizon problem that allows just one action, and in each "piece" of the piecewise linear utility function the optimal action is the first action of the corresponding conditional plan. In this case, the optimal one-step policy is to Stay when b(1) > 0.5 and Go otherwise.

Once we have utilities α_p(s) for all the conditional plans p of depth 1 in each physical state s, we can compute the utilities for conditional plans of depth 2 by considering each possible first action, each possible subsequent percept, and then each way of choosing a depth-1 plan to execute for each percept:

[Stay; if Percept = 0 then Stay else Stay]
[Stay; if Percept = 0 then Stay else Go] . . .
Figure 17.8   (a) Utility of two one-step plans as a function of the initial belief state b(1) for the two-state world, with the corresponding utility function shown in bold. (b) Utilities for 8 distinct two-step plans. (c) Utilities for four undominated two-step plans. (d) Utility function for optimal eight-step plans.
DOMINATED PLAN
There are eight distinct depth-2 plans in all, and their utilities are shown in Figure 17.8(b). Notice that four of the plans, shown as dashed lines, are suboptimal across the entire belief space; we say these plans are dominated, and they need not be considered further. There are four undominated plans, each of which is optimal in a specific region, as shown in Figure 17.8(c). The regions partition the belief-state space. We repeat the process for depth 3, and so on. In general, let p be a depth-d conditional plan whose initial action is a and whose depth-(d - 1) subplan for percept e is p.e; then
α_p(s) = R(s) + γ ( Σ_{s'} P(s' | s, a) Σ_e P(e | s') α_{p.e}(s') ) .   (17.13)
This recursion naturally gives us a value iteration algorithm, which is sketched in Figure 17.9. The structure of the algorithm and its error analysis are similar to those of the basic value iteration algorithm in Figure 17.4 on page 653; the main difference is that instead of computing one utility number for each state, POMDP-VALUE-ITERATION maintains a collection of
function POMDP-VALUE-ITERATION(pomdp, ε) returns a utility function
   inputs: pomdp, a POMDP with states S, actions A(s), transition model P(s' | s, a),
              sensor model P(e | s), rewards R(s), discount γ
           ε, the maximum error allowed in the utility of any state
   local variables: U, U', sets of plans p with associated utility vectors α_p

   U' ← a set containing just the empty plan [ ], with α_[](s) = R(s)
   repeat
      U ← U'
      U' ←

n total steps. The agents are thus incapable of representing the number of remaining steps, and must treat it as an unknown. Therefore, they cannot do the induction, and are free to arrive at the more favorable (refuse, refuse) equilibrium. In this case, ignorance is bliss, or rather, having your opponent believe that you are ignorant is bliss. Your success in these repeated games depends on the other player's perception of you as a bully or a simpleton, and not on your actual characteristics.

17.5.3   Sequential games

EXTENSIVE FORM
In the general case, a game consists of a sequence of turns that need not be all the same. Such games are best represented by a game tree, which game theorists call the extensive form. The tree includes all the same information we saw in Section 5.1: an initial state S0, a function PLAYER(s) that tells which player has the move, a function ACTIONS(s) enumerating the possible actions, a function RESULT(s, a) that defines the transition to a new state, and a partial function UTILITY(s, p), which is defined only on terminal states, to give the payoff for each player. To represent stochastic games, such as backgammon, we add a distinguished player, chance, that can take random actions. Chance's "strategy" is part of the definition of the
game, specified as a probability distribution over actions (the other players get to choose their own strategy). To represent games with nondeterministic actions, such as billiards, we break the action into two pieces: the player's action itself has a deterministic result, and then chance has a turn to react to the action in its own capricious way. To represent simultaneous moves, as in the prisoner's dilemma or two-finger Morra, we impose an arbitrary order on the players, but we have the option of asserting that the earlier player's actions are not observable to the subsequent players: e.g., Alice must choose refuse or testify first, then Bob chooses, but Bob does not know what choice Alice made at that time (we can also represent the fact that the move is revealed later). However, we assume the players always remember all their own previous actions; this assumption is called perfect recall.

The key idea of extensive form that sets it apart from the game trees of Chapter 5 is the representation of partial observability. We saw in Section 5.6 that a player in a partially observable game such as Kriegspiel can create a game tree over the space of belief states. With that tree, we saw that in some cases a player can find a sequence of moves (a strategy) that leads to a forced checkmate regardless of what actual state we started in, and regardless of what strategy the opponent uses. However, the techniques of Chapter 5 could not tell a player what to do when there is no guaranteed checkmate. If the player's best strategy depends on the opponent's strategy and vice versa, then minimax (or alpha-beta) by itself cannot
INFORMATION SETS
find a solution. The extensive form does allow us to find solutions because it represents the belief states (game theorists call them information sets) of all players at once. From that representation we can find equilibrium solutions, just as we did with normal-form games.

As a simple example of a sequential game, place two agents in the 4 x 3 world of Figure 17.1 and have them move simultaneously until one agent reaches an exit square, and gets the payoff for that square. If we specify that no movement occurs when the two agents try to move into the same square simultaneously (a common problem at many traffic intersections), then certain pure strategies can get stuck forever. Thus, agents need a mixed strategy to perform well in this game: randomly choose between moving ahead and staying put. This is exactly what is done to resolve packet collisions in Ethernet networks.

Next we'll consider a very simple variant of poker. The deck has only four cards, two aces and two kings. One card is dealt to each player. The first player then has the option to raise the stakes of the game from 1 point to 2, or to check. If player 1 checks, the game is over. If he raises, then player 2 has the option to call, accepting that the game is worth 2 points, or fold, conceding the 1 point. If the game does not end with a fold, then the payoff depends on the cards: it is zero for both players if they have the same card; otherwise the player with the king pays the stakes to the player with the ace.

The extensive-form tree for this game is shown in Figure 17.13. Nonterminal states are shown as circles, with the player to move inside the circle; player 0 is chance. Each action is depicted as an arrow with a label, corresponding to a raise, check, call, or fold, or, for chance, the four possible deals ("AK" means that player 1 gets an ace and player 2 a king). Terminal states are rectangles labeled by their payoff to player 1 and player 2. Information sets are shown as labeled dashed boxes; for example, I1,1 is the information set where it is player 1's turn, and he knows he has an ace (but does not know what player 2 has). In information set I2,1, it is player 2's turn and she knows that she has an ace and that player 1 has raised,
Figure 17.13   Extensive form of a simplified version of poker.
but does not know what card player 1 has. (Due to the limits of two-dimensional paper, this information set is shown as two boxes rather than one.)

One way to solve an extensive game is to convert it to a normal-form game. Recall that the normal form is a matrix, each row of which is labeled with a pure strategy for player 1, and each column by a pure strategy for player 2. In an extensive game a pure strategy for player i corresponds to an action for each information set involving that player. So in Figure 17.13, one pure strategy for player 1 is "raise when in I1,1 (that is, when I have an ace), and check when in I1,2 (when I have a king)." In the payoff matrix below, this strategy is called rk. Similarly, strategy cf for player 2 means "call when I have an ace and fold when I have a king." Since this is a zero-sum game, the matrix below gives only the payoff for player 1; player 2 always has the opposite payoff:

          2:cc     2:cf     2:ff     2:fc
  1:rr     0       -1/6      1       7/6
  1:kr    -1/3     -1/6     5/6      2/3
  1:rk     1/3      0       1/6      1/2
  1:kk     0        0        0        0
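As noted just below, a mixed-strategy solution of such a zero-sum matrix can be found by standard linear programming. The following sketch is not from the book; it assumes scipy is available and computes player 1's maximin strategy and the value of the game for the matrix above.

import numpy as np
from scipy.optimize import linprog

# Payoff matrix for player 1: rows rr, kr, rk, kk; columns cc, cf, ff, fc.
A = np.array([[0, -1/6, 1, 7/6],
              [-1/3, -1/6, 5/6, 2/3],
              [1/3, 0, 1/6, 1/2],
              [0, 0, 0, 0]])

# Variables: x (row mixed strategy, 4 entries) and v (game value).
# Maximize v subject to (A^T x)_j >= v for every column j, sum(x) = 1, x >= 0.
c = np.zeros(5)
c[-1] = -1.0                                       # linprog minimizes, so use -v
A_ub = np.hstack([-A.T, np.ones((4, 1))])          # v - (A^T x)_j <= 0
b_ub = np.zeros(4)
A_eq = np.array([[1, 1, 1, 1, 0.0]])
b_eq = np.array([1.0])
bounds = [(0, None)] * 4 + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print("value of the game:", res.x[-1])             # approximately 0 for this matrix
print("player 1 mixed strategy:", res.x[:4])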
This game is so simple that it has two pure-strategy equilibria: cf for player 2 and rk or kk for player 1. But in general we can solve extensive games by converting to normal form and then finding a solution (usually a mixed strategy) using standard linear programming methods. That works in theory. But if a player has I information sets and a actions per set, then that player will have a^I pure strategies. In other words, the size of the normal-form matrix is exponential in the number of information sets, so in practice the
SEQUENCE FORM
ABSTRACTION
approach works only for very small game trees, on the order of a dozen states. A game like Texas hold'em poker has about 10^18 states, making this approach completely infeasible. What are the alternatives? In Chapter 5 we saw how alpha-beta search could handle games of perfect information with huge game trees by generating the tree incrementally, by pruning some branches, and by heuristically evaluating nonterminal nodes. But that approach does not work well for games with imperfect information, for two reasons: first, it is harder to prune, because we need to consider mixed strategies that combine multiple branches, not a pure strategy that always chooses the best branch. Second, it is harder to heuristically evaluate a nonterminal node, because we are dealing with information sets, not individual states.

Koller et al. (1996) come to the rescue with an alternative representation of extensive games, called the sequence form, that is only linear in the size of the tree, rather than exponential. Rather than represent strategies, it represents paths through the tree; the number of paths is equal to the number of terminal nodes. Standard linear programming methods can again be applied to this representation. The resulting system can solve poker variants with 25,000 states in a minute or two. This is an exponential speedup over the normal-form approach, but still falls far short of handling full poker, with 10^18 states.

If we can't handle 10^18 states, perhaps we can simplify the problem by changing the game to a simpler form. For example, if I hold an ace and am considering the possibility that the next card will give me a pair of aces, then I don't care about the suit of the next card; any suit will do equally well. This suggests forming an abstraction of the game, one in which suits are ignored. The resulting game tree will be smaller by a factor of 4! = 24. Suppose I can solve this smaller game; how will the solution to that game relate to the original game? If no player is going for a flush (or bluffing so), then the suits don't matter to any player, and the solution for the abstraction will also be a solution for the original game. However, if any player is contemplating a flush, then the abstraction will be only an approximate solution (but it is possible to compute bounds on the error).

There are many opportunities for abstraction. For example, at the point in a game where each player has two cards, if I hold a pair of queens, then the other players' hands could be abstracted into three classes: better (only a pair of kings or a pair of aces), same (pair of queens) or worse (everything else). However, this abstraction might be too coarse. A better abstraction would divide worse into, say, medium pair (nines through jacks), low pair, and no pair. These examples are abstractions of states; it is also possible to abstract actions. For example, instead of having a bet action for each integer from 1 to 1000, we could restrict the bets to 10^0, 10^1, 10^2, and 10^3. Or we could cut out one of the rounds of betting altogether. We can also abstract over chance nodes, by considering only a subset of the possible deals. This is equivalent to the rollout technique used in Go programs. Putting all these abstractions together, we can reduce the 10^18 states of poker to 10^7 states, a size that can be solved with current techniques.

Poker programs based on this approach can easily defeat novice and some experienced human players, but are not yet at the level of master players. Part of the problem is that the solution these programs approximate (the equilibrium solution) is optimal only against an opponent who also plays the equilibrium strategy. Against fallible human players it is important to be able to exploit an opponent's deviation from the equilibrium strategy. As
COURNOT COMPETITION
BAYES-NASH EQUILIBRIUM
Gautam Rao (aka "The Count"), the world's leading online poker player, said (Billings et al., 2003), "You have a very strong program. Once you add opponent modeling to it, it will kill everyone." However, good models of human fallibility remain elusive.

In a sense, extensive game form is one of the most complete representations we have seen so far: it can handle partially observable, multiagent, stochastic, sequential, dynamic environments, which covers most of the hard cases from the list of environment properties on page 42. However, there are two limitations of game theory. First, it does not deal well with continuous states and actions (although there have been some extensions to the continuous case; for example, the theory of Cournot competition uses game theory to solve problems where two companies choose prices for their products from a continuous space). Second, game theory assumes the game is known. Parts of the game may be specified as unobservable to some of the players, but it must be known what parts are unobservable. In cases in which the players learn the unknown structure of the game over time, the model begins to break down. Let's examine each source of uncertainty, and whether each can be represented in game theory.

Actions: There is no easy way to represent a game where the players have to discover what actions are available. Consider the game between computer virus writers and security experts. Part of the problem is anticipating what action the virus writers will try next.

Strategies: Game theory is very good at representing the idea that the other players' strategies are initially unknown, as long as we assume all agents are rational. The theory itself does not say what to do when the other players are less than fully rational. The notion of a Bayes-Nash equilibrium partially addresses this point: it is an equilibrium with respect to a player's prior probability distribution over the other players' strategies; in other words, it expresses a player's beliefs about the other players' likely strategies.

Chance: If a game depends on the roll of a die, it is easy enough to model a chance node with uniform distribution over the outcomes. But what if it is possible that the die is unfair? We can represent that with another chance node, higher up in the tree, with two branches for "die is fair" and "die is unfair," such that the corresponding nodes in each branch are in the same information set (that is, the players don't know if the die is fair or not). And what if we suspect the other opponent does know? Then we add another chance node, with one branch representing the case where the opponent does know, and one where he doesn't.

Utilities: What if we don't know our opponent's utilities? Again, that can be modeled with a chance node, such that the other agent knows its own utilities in each branch, but we don't. But what if we don't know our own utilities? For example, how do I know if it is rational to order the Chef's salad if I don't know how much I will like it? We can model that with yet another chance node specifying an unobservable "intrinsic quality" of the salad.

Thus, we see that game theory is good at representing most sources of uncertainty, but at the cost of doubling the size of the tree every time we add another node; a habit which quickly leads to intractably large trees. Because of these and other problems, game theory has been used primarily to analyze environments that are at equilibrium, rather than to control agents within an environment.
Next we shall see how it can help design environments.
17.6   MECHANISM DESIGN

MECHANISM
CENTER
In the previous section, we asked, "Given a game, what is a rational strategy?" In this section, we ask, "Given that agents pick rational strategies, what game should we design?" More specifically, we would like to design a game whose solutions, consisting of each agent pursuing its own rational strategy, result in the maximization of some global utility function. This problem is called mechanism design, or sometimes inverse game theory. Mechanism design is a staple of economics and political science. Capitalism 101 says that if everyone tries to get rich, the total wealth of society will increase. But the examples we will discuss show that proper mechanism design is necessary to keep the invisible hand on track. For collections of agents, mechanism design allows us to construct smart systems out of a collection of more limited systems, even uncooperative systems, in much the same way that teams of humans can achieve goals beyond the reach of any individual.

Examples of mechanism design include auctioning off cheap airline tickets, routing TCP packets between computers, deciding how medical interns will be assigned to hospitals, and deciding how robotic soccer players will cooperate with their teammates. Mechanism design became more than an academic subject in the 1990s when several nations, faced with the problem of auctioning off licenses to broadcast in various frequency bands, lost hundreds of millions of dollars in potential revenue as a result of poor mechanism design. Formally, a mechanism consists of (1) a language for describing the set of allowable strategies that agents may adopt, (2) a distinguished agent, called the center, that collects reports of strategy choices from the agents in the game, and (3) an outcome rule, known to all agents, that the center uses to determine the payoffs to each agent, given their strategy choices.

17.6.1   Auctions

AUCTION
ASCENDING-BID AUCTION
ENGLISH AUCTION
Let's consider auctions first. An auction is a mechanism for selling some goods to members of a pool of bidders. For simplicity, we concentrate on auctions with a single item for sale. Each bidder i has a utility value v_i for having the item. In some cases, each bidder has a private value for the item. For example, the first item sold on eBay was a broken laser pointer, which sold for $14.83 to a collector of broken laser pointers. Thus, we know that the collector has v_i ≥ $14.83, but most other people would have v_j ≪ $14.83. In other cases, such as auctioning drilling rights for an oil tract, the item has a common value: the tract will produce some amount of money, X, and all bidders value a dollar equally, but there is uncertainty as to what the actual value of X is. Different bidders have different information, and hence different estimates of the item's true value. In either case, bidders end up with their own v_i. Given v_i, each bidder gets a chance, at the appropriate time or times in the auction, to make a bid b_i. The highest bid, b_max, wins the item, but the price paid need not be b_max; that's part of the mechanism design.

The best-known auction mechanism is the ascending-bid,8 or English auction, in which the center starts by asking for a minimum (or reserve) bid b_min. If some bidder is

8 The word "auction" comes from the Latin augere, to increase.
EFFICIENT
COLLUSION
STRATEGY-PROOF
TRUTH-REVEALING
REVELATION PRINCIPLE
willing to pay that amount, the center then asks for b_min + d, for some increment d, and continues up from there. The auction ends when nobody is willing to bid anymore; then the last bidder wins the item, paying the price he bid.

How do we know if this is a good mechanism? One goal is to maximize expected revenue for the seller. Another goal is to maximize a notion of global utility. These goals overlap to some extent, because one aspect of maximizing global utility is to ensure that the winner of the auction is the agent who values the item the most (and thus is willing to pay the most). We say an auction is efficient if the goods go to the agent who values them most. The ascending-bid auction is usually both efficient and revenue maximizing, but if the reserve price is set too high, the bidder who values it most may not bid, and if the reserve is set too low, the seller loses net revenue.

Probably the most important thing that an auction mechanism can do is encourage a sufficient number of bidders to enter the game and discourage them from engaging in collusion. Collusion is an unfair or illegal agreement by two or more bidders to manipulate prices. It can happen in secret backroom deals or tacitly, within the rules of the mechanism. For example, in 1999, Germany auctioned ten blocks of cell-phone spectrum with a simultaneous auction (bids were taken on all ten blocks at the same time), using the rule that any bid must be a minimum of a 10% raise over the previous bid on a block. There were only two credible bidders, and the first, Mannesman, entered the bid of 20 million deutschmark on blocks 1-5 and 18.18 million on blocks 6-10. Why 18.18M? One of T-Mobile's managers said they "interpreted Mannesman's first bid as an offer." Both parties could compute that a 10% raise on 18.18M is 19.99M; thus Mannesman's bid was interpreted as saying "we can each get half the blocks for 20M; let's not spoil it by bidding the prices up higher." And in fact T-Mobile bid 20M on blocks 6-10 and that was the end of the bidding. The German government got less than they expected, because the two competitors were able to use the bidding mechanism to come to a tacit agreement on how not to compete. From the government's point of view, a better result could have been obtained by any of these changes to the mechanism: a higher reserve price; a sealed-bid first-price auction, so that the competitors could not communicate through their bids; or incentives to bring in a third bidder. Perhaps the 10% rule was an error in mechanism design, because it facilitated the precise signaling from Mannesman to T-Mobile.

In general, both the seller and the global utility function benefit if there are more bidders, although global utility can suffer if you count the cost of wasted time of bidders that have no chance of winning. One way to encourage more bidders is to make the mechanism easier for them. After all, if it requires too much research or computation on the part of the bidders, they may decide to take their money elsewhere. So it is desirable that the bidders have a dominant strategy. Recall that "dominant" means that the strategy works against all other strategies, which in turn means that an agent can adopt it without regard for the other strategies. An agent with a dominant strategy can just bid, without wasting time contemplating other agents' possible strategies. A mechanism where agents have a dominant strategy is called a strategy-proof mechanism.
If, as is usually the case, that strategy involves the bidders revealing their true value, v_i, then it is called a truth-revealing, or truthful, auction; the term incentive compatible is also used. The revelation principle states that any mecha-
nism can be transformed into an equivalent truth-revealing mechanism, so part of mechanism design is finding these equivalent mechanisms.

It turns out that the ascending-bid auction has most of the desirable properties. The bidder with the highest value v_i gets the goods at a price of b_o + d, where b_o is the highest bid among all the other agents and d is the auctioneer's increment.9 Bidders have a simple dominant strategy: keep bidding as long as the current cost is below your v_i. The mechanism is not quite truth-revealing, because the winning bidder reveals only that his v_i ≥ b_o + d; we have a lower bound on v_i but not an exact amount. A disadvantage (from the point of view
SEALED-BID AUCTION
bidding. An alternative mechanism, which requires much less communication, is the sealed-bid auction. Each bidder makes a single bid and communicates it to the auctioneer, without the other bidders seeing it. With this mechanism, there is no longer a simple dominant strategy. If your value is v_i and you believe that the maximum of all the other agents' bids will be b_o, then you should bid b_o + ε, for some small ε, if that is less than v_i. Thus, your bid depends on your estimation of the other agents' bids, requiring you to do more work. Also, note that the agent with the highest v_i might not win the auction. This is offset by the fact that the auction is more competitive, reducing the bias toward an advantaged bidder.
SEALED-BID SECOND-PRICE AUCTION
VICKREY AUCTION
A small change in the mechanism for sealed-bid auctions produces the sealed-bid second-price auction, also known as a Vickrey auction.10 In such auctions, the winner pays the price of the second-highest bid, b_o, rather than paying his own bid. This simple modification completely eliminates the complex deliberations required for standard (or first-price) sealed-bid auctions, because the dominant strategy is now simply to bid v_i; the mechanism is truth-revealing. Note that the utility of agent i in terms of his bid b_i, his value v_i, and the best bid among the other agents, b_o, is

U_i = { (v_i - b_o)   if b_i > b_o
      { 0             otherwise.
To see that b_i = v_i is a dominant strategy, note that when (v_i - b_o) is positive, any bid that wins the auction is optimal, and bidding v_i in particular wins the auction. On the other hand, when (v_i - b_o) is negative, any bid that loses the auction is optimal, and bidding v_i in
9 There is actually a small chance that the agent with the highest v_i fails to get the goods, in the case in which b_o < v_i < b_o + d. The chance of this can be made arbitrarily small by decreasing the increment d.
10 Named after William Vickrey (1914-1996), who won the 1996 Nobel Prize in economics for this work and died of a heart attack three days later.
particular loses the auction. So bidding v_i is optimal for all possible values of b_o, and in fact, v_i is the only bid that has this property. Because of its simplicity and the minimal computation requirements for both seller and bidders, the Vickrey auction is widely used in constructing distributed AI systems. Also, Internet search engines conduct over a billion auctions a day to sell advertisements along with their search results, and online auction sites handle $100 billion a year in goods, all using variants of the Vickrey auction.

REVENUE EQUIVALENCE THEOREM

Note that the expected value to the seller is b_o, which is the same expected return as the limit of the English auction as the increment d goes to zero. This is actually a very general result: the revenue equivalence theorem states that, with a few minor caveats, any auction mechanism where risk-neutral bidders have values v_i known only to themselves (but know a probability distribution from which those values are sampled) will yield the same expected revenue. This principle means that the various mechanisms are not competing on the basis of revenue generation, but rather on other qualities.

Although the second-price auction is truth-revealing, it turns out that extending the idea to multiple goods and using a next-price auction is not truth-revealing. Many Internet search engines use a mechanism where they auction k slots for ads on a page. The highest bidder wins the top spot, the second highest gets the second spot, and so on. Each winner pays the price bid by the next-lower bidder, with the understanding that payment is made only if the searcher actually clicks on the ad. The top slots are considered more valuable because they are more likely to be noticed and clicked on. Imagine that three bidders, b_1, b_2, and b_3, have valuations for a click of v_1 = 200, v_2 = 180, and v_3 = 100, and that k = 2 slots are available, where it is known that the top spot is clicked on 5% of the time and the bottom spot 2%. If all bidders bid truthfully, then b_1 wins the top slot and pays 180, and has an expected return
of (200 - 180) × 0.05 = 1. The second slot goes to b_2. But b_1 can see that if she were to bid anything in the range 101-179, she would concede the top slot to b_2, win the second slot, and yield an expected return of (200 - 100) × 0.02 = 2. Thus, b_1 can double her expected return by bidding less than her true value in this case. In general, bidders in this multislot auction must spend a lot of energy analyzing the bids of others to determine their best strategy; there is no simple dominant strategy.

Aggarwal et al. (2006) show that there is a unique truthful auction mechanism for this multislot problem, in which the winner of slot j pays the full price for slot j just for those additional clicks that are available at slot j and not at slot j + 1. The winner pays the price for the lower slot for the remaining clicks. In our example, b_1 would bid 200 truthfully, and would pay 180 for the additional 0.05 - 0.02 = 0.03 clicks in the top slot, but would pay only the cost of the bottom slot, 100, for the remaining 0.02 clicks. Thus, the total return to b_1 would be (200 - 180) × 0.03 + (200 - 100) × 0.02 = 2.6.
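The arithmetic in this example is easy to check in code. The sketch below is purely illustrative: it compares b_1's expected return under the next-price rule with truthful bidding, under the next-price rule with a shaded bid, and under the truthful mechanism of Aggarwal et al.

# Click-through rates for the top and bottom slots, and the three valuations.
ctr_top, ctr_bottom = 0.05, 0.02
v1, v2, v3 = 200, 180, 100

# Next-price auction, everyone truthful: b1 takes the top slot at b2's price.
truthful_next_price = (v1 - v2) * ctr_top

# Next-price auction, b1 shades its bid to take the bottom slot at b3's price.
shaded_next_price = (v1 - v3) * ctr_bottom

# Truthful mechanism: pay 180 for the extra clicks available only in the top
# slot and 100 for the clicks also available in the bottom slot.
truthful_mechanism = (v1 - v2) * (ctr_top - ctr_bottom) + (v1 - v3) * ctr_bottom

print(truthful_next_price, shaded_next_price, truthful_mechanism)  # ~1.0, 2.0, 2.6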
Another example of where auctions can come into play within AI is when a collection of agents are deciding whether to cooperate on a joint plan. Hunsberger and Grosz (2000) show that this can be accomplished efficiently with an auction in which the agents bid for roles in the joint plan.
17.6.2   Common goods
TRAGEDY OF THE COMMONS

Now let's consider another type of game, in which countries set their policy for controlling air pollution. Each country has a choice: they can reduce pollution at a cost of -10 points for implementing the necessary changes, or they can continue to pollute, which gives them a net utility of -5 (in added health costs, etc.) and also contributes -1 points to every other country (because the air is shared across countries). Clearly, the dominant strategy for each country is "continue to pollute," but if there are 100 countries and each follows this policy, then each country gets a total utility of -104, whereas if every country reduced pollution, they would each have a utility of -10. This situation is called the tragedy of the commons: if nobody has to pay for using a common resource, then it tends to be exploited in a way that leads to a lower total utility for all agents. It is similar to the prisoner's dilemma: there is another solution to the game that is better for all parties, but there appears to be no way for rational agents to arrive at that solution.
EXTERNALITIES

The standard approach for dealing with the tragedy of the commons is to change the mechanism to one that charges each agent for using the commons. More generally, we need to ensure that all externalities (effects on global utility that are not recognized in the individual agents' transactions) are made explicit. Setting the prices correctly is the difficult part. In the limit, this approach amounts to creating a mechanism in which each agent is effectively required to maximize global utility, but can do so by making a local decision. For this example, a carbon tax would be an example of a mechanism that charges for use of the commons in a way that, if implemented well, maximizes global utility.
VICKREY-CLARKE-GROVES
VCG

As a final example, consider the problem of allocating some common goods. Suppose a city decides it wants to install some free wireless Internet transceivers. However, the number of transceivers they can afford is less than the number of neighborhoods that want them. The city wants to allocate the goods efficiently, to the neighborhoods that would value them the most. That is, they want to maximize the global utility V = Σ_i v_i. The problem is that if they just ask each neighborhood council "how much do you value this free gift?" they would all have an incentive to lie, and report a high value. It turns out there is a mechanism, known as the Vickrey-Clarke-Groves, or VCG, mechanism, that makes it a dominant strategy for each agent to report its true utility and that achieves an efficient allocation of the goods. The trick is that each agent pays a tax equivalent to the loss in global utility that occurs because of the agent's presence in the game. The mechanism works like this (a code sketch follows the list):

1. The center asks each agent to report its value for receiving an item. Call this b_i.
2. The center allocates the goods to a subset of the bidders. We call this subset A, and use the notation b_i(A) to mean the result to i under this allocation: b_i if i is in A (that is, i is a winner), and 0 otherwise. The center chooses A to maximize total reported utility B = Σ_i b_i(A).
3. The center calculates (for each i) the sum of the reported utilities for all the winners except i. We use the notation B_{-i} = Σ_{j≠i} b_j(A). The center also computes (for each i) the allocation that would maximize total global utility if i were not in the game; call that sum W_{-i}.
4. Each agent i pays a tax equal to W_{-i} - B_{-i}.
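The following is a small illustrative sketch (not from the book) of the VCG computation for allocating k identical items; the function names and the simple greedy allocation are assumptions made for clarity.

def vcg_allocate(bids, k):
    """VCG for k identical items. `bids` maps agent -> reported value.
    Returns the set of winners and the tax each winner pays."""
    def best(agents):
        # Allocation maximizing total reported utility among `agents`.
        winners = sorted(agents, key=lambda a: bids[a], reverse=True)[:k]
        return set(winners), sum(bids[a] for a in winners)

    A, B = best(bids)                            # step 2: winning set and total B
    taxes = {}
    for i in A:
        B_minus_i = B - bids[i]                  # step 3: winners' utilities except i
        W_minus_i = best(set(bids) - {i})[1]     # best total if i were absent
        taxes[i] = W_minus_i - B_minus_i         # step 4: i's tax
    return A, taxes

# Example: three neighborhoods, two transceivers.
print(vcg_allocate({'N1': 5, 'N2': 4, 'N3': 2}, k=2))
# Winners N1 and N2; each pays 2, the highest reported value among the losers.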
In this example, the VCG rule means that each winner would pay a tax equal to the highest reported value among the losers. That is, if I report my value as 5, and that causes someone with value 2 to miss out on an allocation, then I pay a tax of 2. All winners should be happy because they pay a tax that is less than their value, and all losers are as happy as they can be, because they value the goods less than the required tax.

Why is it that this mechanism is truth-revealing? First, consider the payoff to agent i, which is the value of getting an item, minus the tax:

v_i(A) - (W_{-i} - B_{-i}) .   (17.14)

Here we distinguish the agent's true utility, v_i, from his reported utility b_i (but we are trying to show that a dominant strategy is b_i = v_i). Agent i knows that the center will maximize global utility using the reported values,
Σ_j b_j(A) = b_i(A) + Σ_{j≠i} b_j(A) ,

whereas agent i wants the center to maximize (17.14), which can be rewritten as

v_i(A) + Σ_{j≠i} b_j(A) - W_{-i} .

Since agent i cannot affect the value of W_{-i} (it depends only on the other agents), the only way i can make the center optimize what i wants is to report the true utility, b_i = v_i.

17.7   SUMMARY
This chapter shows how to use knowledge about the world to make decisions even when the outcomes of an action are uncertain and the rewards for acting might not be reaped until many actions have passed. The main points are as follows:

• Sequential decision problems in uncertain environments, also called Markov decision processes, or MDPs, are defined by a transition model specifying the probabilistic outcomes of actions and a reward function specifying the reward in each state.
• The utility of a state sequence is the sum of all the rewards over the sequence, possibly discounted over time. The solution of an MDP is a policy that associates a decision with every state that the agent might reach. An optimal policy maximizes the utility of the state sequences encountered when it is executed.
• The utility of a state is the expected utility of the state sequences encountered when an optimal policy is executed, starting in that state. The value iteration algorithm for solving MDPs works by iteratively solving the equations relating the utility of each state to those of its neighbors.
• Policy iteration alternates between calculating the utilities of states under the current policy and improving the current policy with respect to the current utilities.
• Partially observable MDPs, or POMDPs, are much more difficult to solve than are MDPs. They can be solved by conversion to an MDP in the continuous space of belief
states; both value iteration and policy iteration algorithms have been devised. Optimal behavior in POMDPs includes information gathering to reduce uncertainty and therefore make better decisions in the future.
• A decision-theoretic agent can be constructed for POMDP environments. The agent uses a dynamic decision network to represent the transition and sensor models, to update its belief state, and to project forward possible action sequences.
• Game theory describes rational behavior for agents in situations in which multiple agents interact simultaneously. Solutions of games are Nash equilibria: strategy profiles in which no agent has an incentive to deviate from the specified strategy.
• Mechanism design can be used to set the rules by which agents will interact, in order to maximize some global utility through the operation of individually rational agents. Sometimes, mechanisms exist that achieve this goal without requiring each agent to consider the choices made by other agents.

We shall return to the world of MDPs and POMDPs in Chapter 21, when we study reinforcement learning methods that allow an agent to improve its behavior from experience in sequential, uncertain environments.
BIBLIOGRAPHICAL AND HISTORICAL NOTES

Richard Bellman developed the ideas underlying the modern approach to sequential decision problems while working at the RAND Corporation beginning in
1949. According to his autobiography (Bellman, 1984), he coined the exciting term "dynamic programming" to hide from a research-phobic Secretary of Defense, Charles Wilson, the fact that his group was doing mathematics. (This cannot be strictly true, because his first paper using the term (Bellman, 1952) appeared before Wilson became Secretary of Defense in 1953.) Bellman's book, Dynamic Programming (1957), gave the new field a solid foundation and introduced the basic algorithmic approaches. Ron Howard's Ph.D. thesis (1960) introduced policy iteration and the idea of average reward for solving infinite-horizon problems. Several additional results were introduced by Bellman and Dreyfus
(1962). Modified policy iteration is due to van (1976) and Puterman and Shin (1978). Asynchronous policy iteration was analyzed by Williams and Baird (1993), who also proved the policy loss bound in Equation (17.9). The analysis of discounting in terms of stationary preferences is due to Koopmans (1972). The texts by Bertsekas (1987), Puterman (1994), and Bertsekas and Tsitsiklis (1996) provide a rigorous introduction to sequential decision problems. Papadimitriou and Tsitsiklis (1987) Nunen
describe results on the computational complexity of MDPs. Seminal work by Sutton
(1988) and Watkins (1989) on reinforcement learning methods
for solving MDPs played a significant role in introducing MDPs into the AI community, as did the later survey by Barto
et al. (1995).
similar ideas, but was not taken up
(Earlier work by Werbos
to the same extent.)
AI planning problems was made first by Sven Koenig
(1977) contained many
The cormection between MDPs and
(1991), who showed how probabilistic
STRIPS operators provide a compact representation for transition models (see also Wellman,
1990b). Work by Dean et al. (1993) and Tash and Russell (1994) attempted to overcome the combinatorics of large state spaces by using a limited search horizon and abstract states. Heuristics based on the value of information can be used to select areas of the state space where a local expansion of the horizon will yield a significant improvement in decision quality. Agents using this approach can tailor their effort to handle time pressure and generate some interesting behaviors such as using familiar "beaten paths" to find their way around the state space quickly without having to recompute optimal decisions at each point.
As one might expect, AI researchers have pushed MDPs in the direction of more expressive representations that can accommodate much larger problems than the traditional atomic representations based on transition matrices. The use of a dynamic Bayesian network to represent transition models was an obvious idea, but work on factored MDPs (Boutilier et al., 2000; Koller and Parr, 2000; Guestrin et al., 2003b) extends the idea to structured representations of the value function with provable improvements in complexity. Relational MDPs (Boutilier et al., 2001; Guestrin et al., 2003a) go one step further, using structured representations to handle domains with many related objects.
The observation that a partially observable MDP can be transformed into a regular MDP over belief states is due to Astrom (1965) and Aoki (1965). The first complete algorithm for the exact solution of POMDPs (essentially the value iteration algorithm presented in this chapter) was proposed by Edward Sondik (1971) in his Ph.D. thesis. (A later journal paper by Smallwood and Sondik (1973) contains some errors, but is more accessible.) Lovejoy (1991) surveyed the first twenty-five years of POMDP research, reaching somewhat pessimistic conclusions about the feasibility of solving large problems. The first significant contribution within AI was the Witness algorithm (Cassandra et al., 1994; Kaelbling et al., 1998), an improved version of POMDP value iteration. Other algorithms soon followed, including an approach due to Hansen (1998) that constructs a policy incrementally in the form of a finite-state automaton. In this policy representation, the belief state corresponds directly to a particular state in the automaton. More recent work in AI has focused on point-based value iteration methods that, at each iteration, generate conditional plans and alpha-vectors for a finite set of belief states rather than for the entire belief space. Lovejoy (1991) proposed such an algorithm for a fixed grid of points, an approach taken also by Bonet (2002). An influential paper by Pineau et al. (2003) suggested generating reachable points by simulating trajectories in a somewhat greedy fashion; Spaan and Vlassis (2005) observe that one need generate plans for only a small, randomly selected subset of points to improve on the plans from the previous iteration for all points in the set. Current point-based methods such as point-based policy iteration (Ji et al., 2007) can generate near-optimal solutions for POMDPs with thousands of states. Because POMDPs are PSPACE-hard (Papadimitriou and Tsitsiklis, 1987), further progress may require taking advantage of various kinds of structure within a factored representation. The online approach, using look-ahead search to select an action for the current belief state, was first examined by Satia and Lave (1973).
The use of sampling at chance nodes was explored analytically by Kearns et al. (2000) and Ng and Jordan (2000). The basic ideas for an agent architecture using dynamic decision networks were proposed by Dean and Kanazawa (1989a). The book Planning and Control by Dean and Wellman (1991) goes
into much greater depth, making connections between DBN/DDN models and the classical control literature on filtering. Tatman and Shachter (1990) showed how to apply dynamic programming algorithms to DDN models. Russell (1998) explains various ways in which such agents can be scaled up and identifies a number of open research issues.
The roots of game theory can be traced back to proposals made in the 17th century by Christiaan Huygens and Gottfried Leibniz to study competitive and cooperative human interactions scientifically and mathematically. Throughout the 19th century, several leading economists created simple mathematical examples to analyze particular examples of competitive situations. The first formal results in game theory are due to Zermelo (1913) (who had, the year before, suggested a form of minimax search for games, albeit an incorrect one). Emile Borel (1921) introduced the notion of a mixed strategy. John von Neumann (1928) proved that every two-person, zero-sum game has a maximin equilibrium in mixed strategies and a well-defined value. Von Neumann's collaboration with the economist Oskar Morgenstern led to the publication in 1944 of the Theory of Games and Economic Behavior, the defining book for game theory. Publication of the book was delayed by the wartime paper shortage until a member of the Rockefeller family personally subsidized its publication.
In 1950, at the age of 21, John Nash published his ideas concerning equilibria in general (non-zero-sum) games. His definition of an equilibrium solution, although originating in the work of Cournot (1838), became known as Nash equilibrium. After a long delay because of the schizophrenia he suffered from 1959 onward, Nash was awarded the Nobel Memorial Prize in Economics (along with Reinhard Selten and John Harsanyi) in 1994. The Bayes-Nash equilibrium is described by Harsanyi (1967) and discussed by Kadane and Larkey (1982). Some issues in the use of game theory for agent control are covered by Binmore (1982).
The prisoner's dilemma was invented as a classroom exercise by Albert W. Tucker in 1950 (based on an example by Merrill Flood and Melvin Dresher) and is covered extensively by Axelrod (1985) and Poundstone (1993). Repeated games were introduced by Luce and Raiffa (1957), and games of partial information in extensive form by Kuhn (1953). The first practical algorithm for sequential, partial-information games was developed within AI by Koller et al. (1996); the paper by Koller and Pfeffer (1997) provides a readable introduction to the field and describes a working system for representing and solving sequential games. The use of abstraction to reduce a game tree to a size that can be solved with Koller's technique is discussed by Billings et al. (2003). Bowling et al. (2008) show how to use importance sampling to get a better estimate of the value of a strategy. Waugh et al. (2009) show that the abstraction approach is vulnerable to making systematic errors in approximating the equilibrium solution, meaning that the whole approach is on shaky ground: it works for some games but not others. Korb et al. (1999) experiment with an opponent model in the form of a Bayesian network. It plays five-card stud about as well as experienced humans. Zinkevich et al. (2008) show how an approach that minimizes regret can find approximate equilibria for abstractions with 10^12 states, 100 times more than previous methods.
Game theory and MDPs are combined in the theory of Markov games, also called stochastic games (Littman, 1994; Hu and Wellman, 1998). Shapley (1953) actually described the value iteration algorithm independently of Bellman, but his results were not widely appreciated, perhaps because they were presented in the context of Markov games.
Evolutionary game theory (Smith, 1982; Weibull, 1995) looks at strategy drift over time: if your opponent's strategy is changing, how should you react? Textbooks on game theory from an economics point of view include those by Myerson (1991), Fudenberg and Tirole (1991), Osborne (2004), and Osborne and Rubinstein (1994); Mailath and Samuelson (2006) concentrate on repeated games. From an AI perspective we have Nisan et al. (2007), Leyton-Brown and Shoham (2008), and Shoham and Leyton-Brown (2009).
The 2007 Nobel Memorial Prize in Economics went to Hurwicz, Maskin, and Myerson "for having laid the foundations of mechanism design theory" (Hurwicz, 1973). The tragedy of the commons, a motivating problem for the field, was presented by Hardin (1968). The revelation principle is due to Myerson (1986), and the revenue equivalence theorem was developed independently by Myerson (1981) and Riley and Samuelson (1981). Two economists, Milgrom (1997) and Klemperer (2002), write about the multibillion-dollar spectrum auctions they were involved in. Mechanism design is used in multiagent planning (Hunsberger and Grosz, 2000; Stone et al., 2009) and scheduling (Rassenti et al., 1982). Varian (1995) gives a brief overview with connections to the computer science literature, and Rosenschein and Zlotkin (1994) present a book-length treatment with applications to distributed AI. Related work on distributed AI also goes under other names, including collective intelligence (Turner and Wolpert, 2000; Segaran, 2007) and market-based control (Clearwater, 1996). Since 2001 there has been an annual Trading Agents Competition (TAC), in which agents try to make the best profit on a series of auctions (Wellman et al., 2001; Arunachalam and Sadeh, 2005). Papers on computational issues in auctions often appear in the ACM Conferences on Electronic Commerce.
EXERCISES

17.1 For the 4 × 3 world shown in Figure 17.1, calculate which squares can be reached from (1,1) by the action sequence [Up, Up, Right, Right, Right] and with what probabilities.
Explain how this computation is related to the prediction task (see Section 15.2.1) for a hidden Markov model.
17.2 Select a specific member of the set of policies that are optimal for R(s) > 0 as shown in Figure 17.2(b), and calculate the fraction of time the agent spends in each state, in the limit, if the policy is executed forever. (Hint: Construct the state-to-state transition probability matrix corresponding to the policy and see Exercise 15.2.)
17.3 Suppose that we define the utility of a state sequence to be the maximum reward obtained in any state in the sequence. Show that this utility function does not result in stationary preferences between state sequences. Is it still possible to define a utility function on states such that MEU decision making gives optimal behavior?
17.4 Sometimes MDPs are formulated with a reward function R(s, a) that depends on the action taken or with a reward function R(s, a, s') that also depends on the outcome state.
a. Write the Bellman equations for these formulations.
b. Show how an MDP with reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
c. Now do the same to convert MDPs with R(s, a) into MDPs with R(s).
17.5 For the environment shown in Figure 17.1, find all the threshold values for R(s) such that the optimal policy changes when the threshold is crossed. You will need a way to calculate the optimal policy and its value for fixed R(s). (Hint: Prove that the value of any fixed policy varies linearly with R(s).)

17.6 Equation (17.7) on page 654 states that the Bellman operator is a contraction.
a. Show that, for any functions f and g,
|max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)| .
b. Write out an expression for |(B Ui − B U'i)(s)| and then apply the result from (a) to complete the proof that the Bellman operator is a contraction.
17.7 This exercise considers two-player MDPs that correspond to zero-sum, turn-taking games like those in Chapter 5. Let the players be A and B, and let R(s) be the reward for player A in state s. (The reward for B is always equal and opposite.)
a. Let UA(s) be the utility of state s when it is A's turn to move in s, and let UB(s) be the utility of state s when it is B's turn to move in s. All rewards and utilities are calculated from A's point of view (just as in a minimax game tree). Write down Bellman equations defining UA(s) and UB(s).
b. Explain how to do two-player value iteration with these equations, and define a suitable termination criterion.
c. Consider the game described in Figure 5.17 on page 197. Draw the state space (rather than the game tree), showing the moves by A as solid lines and moves by B as dashed lines. Mark each state with R(s). You will find it helpful to arrange the states (sA, sB) on a two-dimensional grid, using sA and sB as "coordinates."
d. Now apply two-player value iteration to solve this game, and derive the optimal policy.

17.8 Consider the 3 × 3 world shown in Figure 17.14(a). The transition model is the same as in the 4 × 3 world of Figure 17.1: 80% of the time the agent goes in the direction it selects; the rest of the time it moves at right angles to the intended direction. Implement value iteration for this world for each value of r below. Use discounted rewards with a discount factor of 0.99. Show the policy obtained in each case. Explain intuitively why the value of r leads to each policy.
a. r = 100
b. r = -3
c. r = 0
d. r = +3
Figure 17.14   (a) 3 × 3 world for Exercise 17.8. The reward for each state is indicated. The upper right square is a terminal state. (b) 101 × 3 world for Exercise 17.9 (omitting 93 identical columns in the middle). The start state has reward 0.
17.9 Consider the 101 × 3 world shown in Figure 17.14(b). In the start state the agent has a choice of two deterministic actions, Up or Down, but in the other states the agent has one deterministic action, Right. Assuming a discounted reward function, for what values of the discount γ should the agent choose Up and for which Down? Compute the utility of each action as a function of γ. (Note that this simple example actually reflects many real-world situations in which one must weigh the value of an immediate action versus the potential continual long-term consequences, such as choosing to dump pollutants into a lake.)
17.10 Consider an undiscounted MDP having three states, (1, 2, 3), with rewards -1, -2, 0, respectively. State 3 is a terminal state. In states 1 and 2 there are two possible actions: a and b. The transition model is as follows:
• In state 1, action a moves the agent to state 2 with probability 0.8 and makes the agent stay put with probability 0.2.
• In state 2, action a moves the agent to state 1 with probability 0.8 and makes the agent stay put with probability 0.2.
• In either state 1 or state 2, action b moves the agent to state 3 with probability 0.1 and makes the agent stay put with probability 0.9.
Answer the following questions:
a. What can be determined qualitatively about the optimal policy in states 1 and 2?
b. Apply policy iteration, showing each step in full, to determine the optimal policy and the values of states 1 and 2. Assume that the initial policy has action b in both states.
c. What happens to policy iteration if the initial policy has action a in both states? Does discounting help? Does the optimal policy depend on the discount factor?
17.11 Consider the 4 × 3 world shown in Figure 17.1.
a. Implement an environment simulator for this environment, such that the specific geography of the environment is easily altered. Some code for doing this is already in the online code repository.
b. Create an agent that uses policy iteration, and measure its performance in the environment simulator from various starting states. Perform several experiments from each starting state, and compare the average total reward received per run with the utility of the state, as determined by your algorithm.
c. Experiment with increasing the size of the environment. How does the run time for policy iteration vary with the size of the environment?
17.12 How can the value determination algorithm be used to calculate the expected loss experienced by an agent using a given set of utility estimates U and an estimated model P, compared with an agent using correct values?
17.13 Let the initial belief state b0 for the 4 × 3 POMDP on page 658 be the uniform distribution over the nonterminal states, i.e., (1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 1/9, 0, 0). Calculate the exact belief state b1 after the agent moves Left and its sensor reports 1 adjacent wall. Also calculate b2 assuming that the same thing happens again.
17.14 What is the time complexity of d steps of POMDP value iteration for a sensorless environment?
17.15 Consider a version of the two-state POMDP on page 661 in which the sensor is 90% reliable in state 0 but provides no information in state 1 (that is, it reports 0 or 1 with equal probability). Analyze, either qualitatively or quantitatively, the utility function and the optimal policy for this problem.
17.16 Show that a dominant strategy equilibrium is a Nash equilibrium, but not vice versa.
17.17 In the children's game of rock-paper-scissors each player reveals at the same time a choice of rock, paper, or scissors. Paper wraps rock, rock blunts scissors, and scissors cut paper. In the extended version rock-paper-scissors-fire-water, fire beats rock, paper, and scissors; rock, paper, and scissors beat water; and water beats fire. Write out the payoff matrix and find a mixed-strategy solution to this game.
17.18 The following payoff matrix, from Blinder (1983) by way of Bernstein (1996), shows a game between politicians and the Federal Reserve.

                      Fed: contract     Fed: do nothing   Fed: expand
  Pol: contract       F = 7, P = 1      F = 9, P = 4      F = 6, P = 6
  Pol: do nothing     F = 8, P = 2      F = 5, P = 5      F = 4, P = 9
  Pol: expand         F = 3, P = 3      F = 2, P = 7      F = 1, P = 8

Politicians can expand or contract fiscal policy, while the Fed can expand or contract monetary policy. (And of course either side can choose to do nothing.) Each side also has preferences for who should do what; neither side wants to look like the bad guys. The payoffs shown are simply the rank orderings: 9 for first choice through 1 for last choice. Find the Nash equilibrium of the game in pure strategies. Is this a Pareto-optimal solution? You might wish to analyze the policies of recent administrations in this light.
17.19 A Dutch auction is similar to an English auction, but rather than starting the bidding at a low price and increasing, in a Dutch auction the seller starts at a high price and gradually lowers the price until some buyer is willing to accept that price. (If multiple bidders accept the price, one is arbitrarily chosen as the winner.) More formally, the seller begins with a price p and gradually lowers p by increments of d until at least one buyer accepts the price. Assuming all bidders act rationally, is it true that for arbitrarily small d, a Dutch auction will always result in the bidder with the highest value for the item obtaining the item? If so, show mathematically why. If not, explain how it may be possible for the bidder with the highest value for the item not to obtain it.

17.20 Imagine an auction mechanism that is just like an ascending-bid auction, except that at the end, the winning bidder, the one who bid bmax, pays only bmax/2 rather than bmax. Assuming all agents are rational, what is the expected revenue to the auctioneer for this mechanism, compared with a standard ascending-bid auction?

17.21 Teams in the National Hockey League historically received 2 points for winning a
game and 0 for losing. If the game is tied, an overtime period is played; if nobody wins in overtime, the game is a tie and each team gets 1 point. But league officials felt that teams were playing too conservatively in overtime (to avoid a loss), and it would be more exciting if overtime produced a winner. So in 1999 the officials experimented in mechanism design: the rules were changed, giving a team that loses in overtime 1 point, not 0. It is still 2 points for a win and 1 for a tie.
a. Was hockey a zero-sum game before the rule change? After?
b.
Suppose that at a certain time t in a game, the home team has probability p of winning in regulation time, probability 0.78 - p of losing, and probability 0.22 of going into overtime, where they have probability q of winning, .9 - q of losing, and .1 of tying. Give equations for the expected value for the home and visiting teams.
c.
Imagine that it were legal and ethical for the two teams to enter into a pact where they agree that they will skate to a tie in regulation time, and then both try in earnest to win in overtime. Under what conditions, in terms of p and q, would it be rational for both teams to agree to this pact?
d.
Longley and Sankaran (2005) report that since the rule change, the percentage of games with a winner in overtime went up 18.2%, as desired, but the percentage of overtime games also went up 3.6%. What does that suggest about possible collusion or conservative play after the rule change?
18
LEARNING FROM EXAMPLES
In which we describe agents that can improve their behavior through diligent study of their own experiences.
An agent is learning if it improves its performance on future tasks after making observations about the world. Learning can range from the trivial, as exhibited by jotting down a phone number, to the profound, as exhibited by Albert Einstein, who inferred a new theory of the universe. In this chapter we will concentrate on one class of learning problem, which seems restricted but actually has vast applicability: from a collection of input-output pairs, learn a function that predicts the output for new inputs.
Why would we want an agent to learn? If the design of the agent can be improved, why wouldn't the designers just program in that improvement to begin with? There are three main reasons. First, the designers cannot anticipate all possible situations that the agent might find itself in. For example, a robot designed to navigate mazes must learn the layout of each new maze it encounters. Second, the designers cannot anticipate all changes over time; a program designed to predict tomorrow's stock market prices must learn to adapt when conditions change from boom to bust. Third, sometimes human programmers have no idea how to program a solution themselves. For example, most people are good at recognizing the faces of family members, but even the best programmers are unable to program a computer to accomplish that task, except by using learning algorithms.
This chapter first gives an overview of the various forms of learning, then describes one popular approach, decision tree learning, in Section 18.3, followed by a theoretical analysis of learning in Sections 18.4 and 18.5. We look at various learning systems used in practice: linear models, nonlinear models (in particular, neural networks), nonparametric models, and support vector machines. Finally we show how ensembles of models can outperform a single model.

18.1 FORMS OF LEARNING
Any component of an agent can be improved by learning from data. The improvements, and the techniques used to make them, depend on four major factors:
• Which component is to be improved.
• What prior knowledge the agent already has.
• What representation is used for the data and the component.
• What feedback is available to learn from.
Components to be learned

Chapter 2 described several agent designs. The components of these agents include:
1. A direct mapping from conditions on the current state to actions.
2. A means to infer relevant properties of the world from the percept sequence.
3. Information about the way the world evolves and about the results of possible actions the agent can take.
4. Utility information indicating the desirability of world states.
5. Action-value information indicating the desirability of actions.
6. Goals that describe classes of states whose achievement maximizes the agent's utility.
Each of these components can be learned. Consider, for example, an agent training to become a taxi driver. Every time the instructor shouts "Brake!" the agent might learn a condition-action rule for when to brake (component 1); the agent also learns every time the instructor does not shout. By seeing many camera images that it is told contain buses, it can learn to recognize them (2). By trying actions and observing the results (for example, braking hard on a wet road), it can learn the effects of its actions (3). Then, when it receives no tip from passengers who have been thoroughly shaken up during the trip, it can learn a useful component of its overall utility function (4).
Representation and prior knowledge

We have seen several examples of representations for agent components: propositional and first-order logical sentences for the components in a logical agent; Bayesian networks for the inferential components of a decision-theoretic agent; and so on. Effective learning algorithms have been devised for all of these representations. This chapter (and most of current machine learning research) covers inputs that form a factored representation (a vector of attribute values) and outputs that can be either a continuous numerical value or a discrete value. Chapter 19 covers functions and prior knowledge composed of first-order logic sentences, and Chapter 20 concentrates on Bayesian networks.
There is another way to look at the various types of learning. We say that learning a (possibly incorrect) general function or rule from specific input-output pairs is called inductive learning. We will see in Chapter 19 that we can also do analytical or deductive learning: going from a known general rule to a new rule that is logically entailed, but is useful because it allows more efficient processing.
Feedback to learn from

There are three types of feedback that determine the three main types of learning:
In unsupervised learning the agent learns patterns in the input even though no explicit feedback is supplied. The most common unsupervised learning task is clustering: detecting potentially useful clusters of input examples. For example, a taxi agent might gradually develop a concept of "good traffic days" and "bad traffic days" without ever being given labeled examples of each by a teacher.
In reinforcement learning the agent learns from a series of reinforcements: rewards or punishments. For example, the lack of a tip at the end of the journey gives the taxi agent an indication that it did something wrong. The two points for a win at the end of a chess game tells the agent it did something right. It is up to the agent to decide which of the actions prior to the reinforcement were most responsible for it.
In supervised learning the agent observes some example input-output pairs and learns a function that maps from input to output. In component 1 above, the inputs are percepts and the outputs are provided by a teacher who says "Brake!" or "Turn left." In component 2, the inputs are camera images and the outputs again come from a teacher who says "that's a bus." In 3, the theory of braking is a function from states and braking actions to stopping distance in feet. In this case the output value is available directly from the agent's percepts (after the fact); the environment is the teacher.
In practice, these distinctions are not always so crisp. In semi-supervised learning we are given a few labeled examples and must make what we can of a large collection of unlabeled examples. Even the labels themselves may not be the oracular truths that we hope for. Imagine that you are trying to build a system to guess a person's age from a photo. You gather some labeled examples by snapping pictures of people and asking their age. That's supervised learning. But in reality some of the people lied about their age. It's not just that there is random noise in the data; rather the inaccuracies are systematic, and to uncover them is an unsupervised learning problem involving images, self-reported ages, and true (unknown) ages. Thus, both noise and lack of labels create a continuum between supervised and unsupervised learning.
18.2 SUPERVISED LEARNING

The task of supervised learning is this:

Given a training set of N example input-output pairs
(x1, y1), (x2, y2), . . . , (xN, yN) ,
where each yj was generated by an unknown function y = f(x), discover a function h that approximates the true function f.

Here x and y can be any value; they need not be numbers. The function h is a hypothesis.1 Learning is a search through the space of possible hypotheses for one that will perform well, even on new examples beyond the training set. To measure the accuracy of a hypothesis we give it a test set of examples that are distinct from the training set. We say a hypothesis

1 A note on notation: except where noted, we will use j to index the N examples; xj will always be the input and yj the output. In cases where the input is specifically a vector of attribute values (beginning with Section 18.3), we will use xj for the jth example and we will use i to index the n attributes of each example. The elements of xj are written xj,1, xj,2, . . . , xj,n.
Figure 18.1   (a) Example (x, f(x)) pairs and a consistent, linear hypothesis. (b) A consistent, degree-7 polynomial hypothesis for the same data set. (c) A different data set, which admits an exact degree-6 polynomial fit or an approximate linear fit. (d) A simple, exact sinusoidal fit to the same data set.
generalizes well if it correctly predicts the value of y for novel examples. Sometimes the function f is stochastic: it is not strictly a function of x, and what we have to learn is a conditional probability distribution, P(Y | x). When the output y is one of a finite set of values (such as sunny, cloudy or rainy), the learning problem is called classification, and is called Boolean or binary classification if there are only two values. When y is a number (such as tomorrow's temperature), the learning problem is called regression. (Technically, solving a regression problem is finding a conditional expectation or average value of y, because the probability that we have found exactly the right real-valued number for y is 0.)
Figure 18.1 shows a familiar example: fitting a function of a single variable to some data
points. The examples are points in the (x, y) plane, where y = f(x). We don't know what f is, but we will approximate it with a function h selected from a hypothesis space, H, which for this example we will take to be the set of polynomials, such as x^5 + 3x^2 + 2. Figure 18.1(a) shows some data with an exact fit by a straight line (the polynomial 0.4x + 3). The line is called a consistent hypothesis because it agrees with all the data. Figure 18.1(b) shows a high-degree polynomial that is also consistent with the same data. This illustrates a fundamental problem in inductive learning: how do we choose from among multiple consistent hypotheses? One answer is to prefer the simplest hypothesis consistent with the data. This principle is called Ockham's razor, after the 14th-century English philosopher William of Ockham, who used it to argue sharply against all sorts of complications. Defining simplicity is not easy, but it seems clear that a degree-1 polynomial is simpler than a degree-7 polynomial, and thus (a) should be preferred to (b). We will make this intuition more precise in Section 18.4.3.
Figure 18.1(c) shows a second data set. There is no consistent straight line for this data set; in fact, it requires a degree-6 polynomial for an exact fit. There are just 7 data points, so a polynomial with 7 parameters does not seem to be finding any pattern in the data and we do not expect it to generalize well. A straight line that is not consistent with any of the data points, but might generalize fairly well for unseen values of x, is also shown in (c). In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better. In Figure 18.1(d) we expand the
hypothesis space H to allow polynomials over both x and sin(x), and find that the data in (c) can be fitted exactly by a simple function of the form ax + b + c sin(x). This shows the importance of the choice of hypothesis space. We say that a learning problem is realizable if the hypothesis space contains the true function. Unfortunately, we cannot always tell whether a given learning problem is realizable, because the true function is not known.
In some cases, an analyst looking at a problem is willing to make more fine-grained distinctions about the hypothesis space, to say, even before seeing any data, not just that a hypothesis is possible or impossible, but rather how probable it is. Supervised learning can be done by choosing the hypothesis h* that is most probable given the data:

h* = argmax_{h in H} P(h | data) .

By Bayes' rule this is equivalent to

h* = argmax_{h in H} P(data | h) P(h) .

Then we can say that the prior probability P(h) is high for a degree-1 or -2 polynomial, lower for a degree-7 polynomial, and especially low for degree-7 polynomials with large, sharp spikes as in Figure 18.1(b). We allow unusual-looking functions when the data say we really need them, but we discourage them by giving them a low prior probability.
Why not let H be the class of all Java programs, or Turing machines? After all, every computable function can be represented by some Turing machine, and that is the best we can do. One problem with this idea is that it does not take into account the computational complexity of learning. There is a tradeoff between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within that space. For example, fitting a straight line to data is an easy computation; fitting high-degree polynomials is somewhat harder; and fitting Turing machines is in general undecidable. A second reason to prefer simple hypothesis spaces is that presumably we will want to use h after we have learned it, and computing h(x) when h is a linear function is guaranteed to be fast, while computing an arbitrary Turing machine program is not even guaranteed to terminate. For these reasons, most work on learning has focused on simple representations.
We will see that the expressiveness-complexity tradeoff is not as simple as it first seems: it is often the case, as we saw with first-order logic in Chapter 8, that an expressive language makes it possible for a simple hypothesis to fit the data, whereas restricting the expressiveness of the language means that any consistent hypothesis must be very complex. For example, the rules of chess can be written in a page or two of first-order logic, but require thousands of pages when written in propositional logic.
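To see the tradeoff concretely, the sketch below (our own illustration; the data points are invented and NumPy is assumed to be available) fits both a simple and a maximally expressive polynomial to the same seven points. The degree-6 fit is consistent with the training data, but its behavior between the points is much less constrained than the straight line's.

import numpy as np

# Invented data: seven (x, y) points, roughly linear with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 3.4, 3.9, 4.1, 4.8, 5.2, 5.4])

line = np.polyfit(x, y, deg=1)     # a simple hypothesis space: straight lines
wiggle = np.polyfit(x, y, deg=6)   # expressive enough to fit 7 points exactly

for name, h in [("degree-1", line), ("degree-6", wiggle)]:
    train_mse = float(np.mean((np.polyval(h, x) - y) ** 2))
    between = float(np.polyval(h, 2.5))   # prediction at a point not in the training set
    print(name, "training MSE:", round(train_mse, 4), "h(2.5):", round(between, 2))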
18.3 LEARNING DECISION TREES

Decision tree induction is one of the simplest and yet most successful forms of machine learning. We first describe the representation, the hypothesis space, and then show how to learn a good hypothesis.
18.3.1 The decision tree representation
A decision tree represents a function that takes as input a vector of attribute values and returns a "decision", a single output value. The input and output values can be discrete or continuous. For now we will concentrate on problems where the inputs have discrete values and the output has exactly two possible values; this is Boolean classification, where each example input will be classified as true (a positive example) or false (a negative example).
A decision tree reaches its decision by performing a sequence of tests. Each internal node in the tree corresponds to a test of the value of one of the input attributes, Ai, and the branches from the node are labeled with the possible values of the attribute, Ai = vik. Each leaf node in the tree specifies a value to be returned by the function. The decision tree representation is natural for humans; indeed, many "How To" manuals (e.g., for car repair) are written entirely as a single decision tree stretching over hundreds of pages.
As an example, we will build a decision tree to decide whether to wait for a table at a restaurant. The aim here is to learn a definition for the goal predicate WillWait. First we list the attributes that we will consider as part of the input:
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar: whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full).
6. Price: the restaurant's price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: the wait estimated by the host (0-10 minutes, 10-30, 30-60, or >60).
Note that every variable has a small set of possible values; the value of WaitEstimate, for example, is not an integer, rather it is one of the four discrete values 0-10, 10-30, 30-60, or >60. The decision tree usually used by one of us (SR) for this domain is shown in Figure 18.2. Notice that the tree ignores the Price and Type attributes. Examples are processed by the tree starting at the root and following the appropriate branch until a leaf is reached. For instance, an example with Patrons = Full and WaitEstimate = 0-10 will be classified as positive (i.e., yes, we will wait for a table).
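Such a tree is easy to encode directly. The sketch below is our own illustration, not the tree of Figure 18.2: an internal node is a (attribute, branches) pair, a leaf is simply the Boolean decision, and the tiny fragment shown here is invented for the example.

# A hand-made fragment of a WillWait-style tree (illustrative only).
tree = ("Patrons", {
    "None": False,
    "Some": True,
    "Full": ("WaitEstimate", {
        ">60": False,
        "30-60": ("Alternate", {"No": False, "Yes": True}),
        "10-30": True,
        "0-10": True,
    }),
})

def classify(node, example):
    # Follow the branch named by the example's value for the node's test attribute.
    if not isinstance(node, tuple):
        return node                      # leaf: return the decision itself
    attribute, branches = node
    return classify(branches[example[attribute]], example)

print(classify(tree, {"Patrons": "Full", "WaitEstimate": "0-10"}))   # True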
18.3.2 Expressiveness of decision trees
A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true if and only if the input attributes satisfy one of the paths leading to a leaf with value true. Writing this out in propositional logic, we have

Goal ⇔ (Path1 ∨ Path2 ∨ · · ·) ,

where each Path is a conjunction of attribute-value tests required to follow that path. Thus, the whole expression is equivalent to disjunctive normal form (see page 283), which means that any function in propositional logic can be expressed as a decision tree. As an example, the rightmost path in Figure 18.2 is

Path = (Patrons = Full ∧ WaitEstimate = 0-10) .

For a wide variety of problems, the decision tree format yields a nice, concise result. But some functions cannot be represented concisely. For example, the majority function, which returns true if and only if more than half of the inputs are true, requires an exponentially large decision tree. In other words, decision trees are good for some kinds of functions and bad for others. Is there any kind of representation that is efficient for all kinds of functions? Unfortunately, the answer is no. We can show this in a general way. Consider the set of all Boolean functions on n attributes. How many different functions are in this set? This is just the number of different truth tables that we can write down, because the function is defined by its truth table. A truth table over n attributes has 2^n rows, one for each combination of values of the attributes. We can consider the "answer" column of the table as a 2^n-bit number that defines the function. That means there are 2^(2^n) different functions (and there will be more than that number of trees, since more than one tree can compute the same function). This is a scary number. For example, with just the ten Boolean attributes of our restaurant problem there are 2^1024 or about 10^308 different functions to choose from, and for 20 attributes there are over 10^300,000. We will need some ingenious algorithms to find good hypotheses in such a large space.
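The arithmetic is easy to check; the following two-line sanity check (Python, purely illustrative) confirms that the count for ten Boolean attributes is a number with 309 digits, i.e. roughly 10^308.

n = 10                          # ten Boolean attributes, as in the restaurant problem
print(len(str(2 ** (2 ** n))))  # 309: one distinct function per possible 2**n-bit truth-table column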
18.3.3 Inducing decision trees from examples
An example for a Boolean decision tree consists of an (x, y) pair, where x is a vector of values for the input attributes, and y is a single Boolean output value. A training set of 12 examples
Figure 18.2   A decision tree for deciding whether to wait for a table.
Example  Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est    WillWait
x1       Yes  No   No   Yes  Some   $$$    No    Yes  French   0-10   y1 = Yes
x2       Yes  No   No   Yes  Full   $      No    No   Thai     30-60  y2 = No
x3       No   Yes  No   No   Some   $      No    No   Burger   0-10   y3 = Yes
x4       Yes  No   Yes  Yes  Full   $      Yes   No   Thai     10-30  y4 = Yes
x5       Yes  No   Yes  No   Full   $$$    No    Yes  French   >60    y5 = No
x6       No   Yes  No   Yes  Some   $$     Yes   Yes  Italian  0-10   y6 = Yes
x7       No   Yes  No   No   None   $      Yes   No   Burger   0-10   y7 = No
x8       No   No   No   Yes  Some   $$     Yes   Yes  Thai     0-10   y8 = Yes
x9       No   Yes  Yes  No   Full   $      Yes   No   Burger   >60    y9 = No
x10      Yes  Yes  Yes  Yes  Full   $$$    No    Yes  Italian  10-30  y10 = No
x11      No   No   No   No   None   $      No    No   Thai     0-10   y11 = No
x12      Yes  Yes  Yes  Yes  Full   $      No    No   Burger   30-60  y12 = Yes

Figure 18.3   Examples for the restaurant domain.
is shown in Figure 18.3. The positive examples are the ones in which the goal WillWait is true (x1, x3, . . .); the negative examples are the ones in which it is false (x2, x5, . . .). We want a tree that is consistent with the examples and is as small as possible. Unfortunately, no matter how we measure size, it is an intractable problem to find the smallest consistent tree; there is no way to efficiently search through the 2^(2^n) trees. With some simple heuristics, however, we can find a good approximate solution: a small (but not smallest) consistent tree. The DECISION-TREE-LEARNING algorithm adopts a greedy divide-and-conquer strategy: always test the most important attribute first. This test divides the problem up into smaller subproblems that can then be solved recursively. By "most important attribute," we mean the one that makes the most difference to the classification of an example. That way, we hope to get to the correct classification with a small number of tests, meaning that all paths in the tree will be short and the tree as a whole will be shallow.
Figure 18.4(a) shows that Type is a poor attribute, because it leaves us with four possible outcomes, each of which has the same number of positive as negative examples. On the other hand, in (b) we see that Patrons is a fairly important attribute, because if the value is None or Some, then we are left with example sets for which we can answer definitively (No and Yes, respectively). If the value is Full, we are left with a mixed set of examples. In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem in itself, with fewer examples and one less attribute. There are four cases to consider for these recursive problems:
1. If the remaining examples are all positive (or all negative), then we are done: we can answer Yes or No. Figure 18.4(b) shows examples of this happening in the None and Some branches.
2. If there are some positive and some negative examples, then choose the best attribute to split them. Figure 18.4(b) shows Hungry being used to split the remaining examples.
3. If there are no examples left, it means that no example has been observed for this
combination of attribute values, and we return a default value calculated from the plurality classification of all the examples that were used in constructing the node's parent. These are passed along in the variable parent_examples.

Figure 18.4   Splitting the examples by testing on attributes. At each node we show the positive (light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type brings us no nearer to distinguishing between positive and negative examples. (b) Splitting on Patrons does a good job of separating positive and negative examples. After splitting on Patrons, Hungry is a fairly good second test.
4. If there are no attributes left, but both positive and negative examples, it means that these examples have exactly the same description, but different classifications. This can happen because there is an error or noise in the data; because the domain is nondeterministic; or because we can't observe an attribute that would distinguish the examples. The best we can do is return the plurality classification of the remaining examples.
The DECISION-TREE-LEARNING algorithm is shown in Figure 18.5. Note that the set of examples is crucial for constructing the tree, but nowhere do the examples appear in the tree itself. A tree consists of just tests on attributes in the interior nodes, values of attributes on the branches, and output values on the leaf nodes. The details of the IMPORTANCE function are given in Section 18.3.4. The output of the learning algorithm on our sample training set is shown in Figure 18.6. The tree is clearly different from the original tree shown in Figure 18.2. One might conclude that the learning algorithm is not doing a very good job of learning the correct function. This would be the wrong conclusion to draw, however. The learning algorithm looks at the examples, not at the correct function, and in fact, its hypothesis (see Figure 18.6) not only is consistent with all the examples, but is considerably simpler than the original tree! The learning algorithm has no reason to include tests for Raining and Reservation, because it can classify all the examples without them. It has also detected an interesting and previously unsuspected pattern: the first author will wait for Thai food on weekends. It is also bound to make some mistakes for cases where it has seen no examples. For example, it has never seen a case where the wait is 0-10 minutes but the restaurant is full.
function DECISION-TREE-LEARNING(examples, attributes, parent_examples) returns a tree
  if examples is empty then return PLURALITY-VALUE(parent_examples)
  else if all examples have the same classification then return the classification
  else if attributes is empty then return PLURALITY-VALUE(examples)
  else
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value vk of A do
      exs ← {e : e ∈ examples and e.A = vk}
      subtree ← DECISION-TREE-LEARNING(exs, attributes − A, examples)
      add a branch to tree with label (A = vk) and subtree subtree
    return tree
Figure 18.5   The decision-tree learning algorithm. The function IMPORTANCE is described in Section 18.3.4. The function PLURALITY-VALUE selects the most common output value among a set of examples, breaking ties randomly.
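For readers who want executable code, here is one possible Python rendering of the pseudocode in Figure 18.5. The dictionary representation of examples and the explicit values and importance parameters are our own assumptions; they are not part of the book's code.

import random
from collections import Counter

def plurality_value(examples):
    # Most common output value among the examples, breaking ties randomly.
    counts = Counter(e["output"] for e in examples)
    top = max(counts.values())
    return random.choice([v for v, c in counts.items() if c == top])

def decision_tree_learning(examples, attributes, parent_examples, values, importance):
    # examples: list of dicts mapping attribute names (and "output") to values.
    # values: dict mapping each attribute to its list of possible values.
    # importance: function (attribute, examples) -> score, e.g. information gain.
    if not examples:
        return plurality_value(parent_examples)
    if len({e["output"] for e in examples}) == 1:
        return examples[0]["output"]
    if not attributes:
        return plurality_value(examples)
    A = max(attributes, key=lambda a: importance(a, examples))
    branches = {}
    for v in values[A]:
        exs = [e for e in examples if e[A] == v]
        branches[v] = decision_tree_learning(
            exs, [a for a in attributes if a != A], examples, values, importance)
    return (A, branches)   # internal node: (test attribute, branch dictionary)

A leaf here is just an output value and an internal node is a (attribute, branches) pair, matching the classify sketch given earlier in this chapter.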
Figure 18.6   The decision tree induced from the 12-example training set.
In that case it says not to wait when Hungry is false, but I (SR) would certainly wait. With more training examples the learning program could correct this mistake.
We note there is a danger of over-interpreting the tree that the algorithm selects. When there are several variables of similar importance, the choice between them is somewhat arbitrary: with slightly different input examples, a different variable would be chosen to split on first, and the whole tree would look completely different. The function computed by the tree would still be similar, but the structure of the tree can vary widely.
We can evaluate the accuracy of a learning algorithm with a learning curve, as shown in Figure 18.7. We have 100 examples at our disposal, which we split into a training set and
Figure 18.7   A learning curve for the decision tree learning algorithm on 100 randomly generated examples in the restaurant domain. Each data point is the average of 20 trials.
a test set. We learn a hypothesis h with the training set and measure its accuracy with the test set. We do this starting with a training set of size 1 and increasing one at a time up to size 99. For each size we actually repeat the process of randomly splitting 20 times, and average the results of the 20 trials. The curve shows that as the training set size grows, the accuracy increases. (For this reason, learning curves are also called happy graphs.) In this graph we reach 95% accuracy, and it looks like the curve might continue to increase with more data.
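A curve like this takes only a few lines to produce with standard tools. The sketch below is purely illustrative: it uses scikit-learn's decision tree as a stand-in for the algorithm described in this chapter and random synthetic data in place of the restaurant examples.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 10))            # 100 examples, 10 Boolean attributes
y = ((X[:, 0] & X[:, 1]) | X[:, 2]).astype(int)   # an arbitrary target concept

for train_size in [10, 30, 50, 70, 90]:
    scores = []
    for _ in range(20):                           # average over 20 random splits
        idx = rng.permutation(len(X))
        train, test = idx[:train_size], idx[train_size:]
        model = DecisionTreeClassifier().fit(X[train], y[train])
        scores.append(model.score(X[test], y[test]))
    print(train_size, round(float(np.mean(scores)), 3))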
18.3.4 Choosing attribute tests

The greedy search used in decision tree learning is designed to approximately minimize the depth of the final tree. The idea is to pick the attribute that goes as far as possible toward providing an exact classification of the examples. A perfect attribute divides the examples into sets, each of which are all positive or all negative and thus will be leaves of the tree. The Patrons attribute is not perfect, but it is fairly good. A really useless attribute, such as Type, leaves the example sets with roughly the same proportion of positive and negative examples as the original set. All we need, then, is a formal measure of "fairly good" and "really useless" and we can implement the IMPORTANCE function of Figure 18.5. We will use the notion of information gain, which is defined in terms of entropy, the fundamental quantity in information theory (Shannon and Weaver, 1949).
Entropy is a measure of the uncertainty of a random variable; acquisition of information corresponds to a reduction in entropy. A random variable with only one value, a coin that always comes up heads, has no uncertainty and thus its entropy is defined as zero; thus, we gain no information by observing its value. A flip of a fair coin is equally likely to come up heads or tails, 0 or 1, and we will soon show that this counts as "1 bit" of entropy. The roll of a fair four-sided die has 2 bits of entropy, because it takes two bits to describe one of four equally probable choices. Now consider an unfair coin that comes up heads 99% of the time. Intuitively, this coin has less uncertainty than the fair coin (if we guess heads we'll be wrong only 1% of the time), so we would like it to have an entropy measure that is close to zero, but positive. In general, the entropy of a random variable V with values vk, each with probability P(vk), is defined as

Entropy:   H(V) = Σk P(vk) log2 (1/P(vk)) = −Σk P(vk) log2 P(vk) .
We can check that the entropy of a fair coin flip is indeed 1 bit:

H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 .

If the coin is loaded to give 99% heads, we get

H(Loaded) = −(0.99 log2 0.99 + 0.01 log2 0.01) ≈ 0.08 bits.

It will help to define B(q) as the entropy of a Boolean random variable that is true with probability q:

B(q) = −(q log2 q + (1 − q) log2(1 − q)) .

Thus, H(Loaded) = B(0.99) ≈ 0.08. Now let's get back to decision tree learning. If a training set contains p positive examples and n negative examples, then the entropy of the goal attribute on the whole set is

H(Goal) = B(p/(p + n)) .

The restaurant training set in Figure 18.3 has p = n = 6, so the corresponding entropy is B(0.5) or exactly 1 bit. A test on a single attribute A might give us only part of this 1 bit. We can measure exactly how much by looking at the entropy remaining after the attribute test.
An attribute A with d distinct values divides the training set E into subsets E1, . . . , Ed. Each subset Ek has pk positive examples and nk negative examples, so if we go along that branch, we will need an additional B(pk/(pk + nk)) bits of information to answer the question. A randomly chosen example from the training set has the kth value for the attribute with probability (pk + nk)/(p + n), so the expected entropy remaining after testing attribute A is

Remainder(A) = Σk=1..d [(pk + nk)/(p + n)] B(pk/(pk + nk)) .
The information gain from the attribute test on A is the expected reduction in entropy:

Gain(A) = B(p/(p + n)) − Remainder(A) .

In fact Gain(A) is just what we need to implement the IMPORTANCE function. Returning to the attributes considered in Figure 18.4, we have

Gain(Patrons) = 1 − [(2/12)B(0/2) + (4/12)B(4/4) + (6/12)B(2/6)] ≈ 0.541 bits,
Gain(Type) = 1 − [(2/12)B(1/2) + (2/12)B(1/2) + (4/12)B(2/4) + (4/12)B(2/4)] = 0 bits,

confirming our intuition that Patrons is a better attribute to split on. In fact, Patrons has the maximum gain of any of the attributes and would be chosen by the decision-tree learning algorithm as the root.
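These formulas translate directly into code. The sketch below is our own illustration (with each attribute split summarized by its (pk, nk) counts, taken from Figure 18.3) and reproduces the two gains just computed.

from math import log2

def B(q):
    # Entropy of a Boolean random variable that is true with probability q.
    return 0.0 if q in (0.0, 1.0) else -(q * log2(q) + (1 - q) * log2(1 - q))

def gain(splits):
    # splits: one (p_k, n_k) pair of positive/negative counts per attribute value.
    p = sum(pk for pk, nk in splits)
    n = sum(nk for pk, nk in splits)
    remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
    return B(p / (p + n)) - remainder

print(round(gain([(0, 2), (4, 0), (2, 4)]), 3))          # Patrons: 0.541
print(round(gain([(1, 1), (1, 1), (2, 2), (2, 2)]), 3))  # Type: 0.0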
18.3.5 Generalization and overfitting
On some problems, the DECISION-TREE-LEARNING algorithm will generate a large tree when there is actually no pattern to be found. Consider the problem of trying to predict whether the roll of a die will come up as 6 or not. Suppose that experiments are carried out with various dice and that the attributes describing each training example include the color of the die, its weight, the time when the roll was done, and whether the experimenters had their fingers crossed. If the dice are fair, the right thing to learn is a tree with a single node that says "no." But the DECISION-TREE-LEARNING algorithm will seize on any pattern it can find in the input. If it turns out that there are 2 rolls of a 7-gram blue die with fingers crossed and they both come out 6, then the algorithm may construct a path that predicts 6 in that case. This problem is called overfitting. A general phenomenon, overfitting occurs with all types of learners, even when the target function is not at all random. In Figure 18.1(b) and (c), we saw polynomial functions overfitting the data. Overfitting becomes more likely as the hypothesis space and the number of input attributes grows, and less likely as we increase the number of training examples.
For decision trees, a technique called decision tree pruning combats overfitting. Pruning works by eliminating nodes that are not clearly relevant. We start with a full tree, as generated by DECISION-TREE-LEARNING. We then look at a test node that has only leaf nodes as descendants. If the test appears to be irrelevant, detecting only noise in the data, then we eliminate the test, replacing it with a leaf node. We repeat this process, considering each test with only leaf descendants, until each one has either been pruned or accepted as is.
The question is, how do we detect that a node is testing an irrelevant attribute? Suppose we are at a node consisting of p positive and n negative examples. If the attribute is irrelevant, we would expect that it would split the examples into subsets that each have roughly the same proportion of positive examples as the whole set, p/(p + n), and so the information gain will be close to zero.2 Thus, the information gain is a good clue to irrelevance. Now the question is, how large a gain should we require in order to split on a particular attribute?
2 The gain will be strictly positive except for the unlikely case where all the proportions are exactly the same. (See Exercise 18.5.)

We can answer this question by using a statistical significance test. Such a test begins by assuming that there is no underlying pattern (the so-called null hypothesis). Then the actual data are analyzed to calculate the extent to which they deviate from a perfect absence of pattern. If the degree of deviation is statistically unlikely (usually taken to mean a 5% probability or less), then that is considered to be good evidence for the presence of a significant pattern in the data. The probabilities are calculated from standard distributions of the amount of deviation one would expect to see in random sampling.
In this case, the null hypothesis is that the attribute is irrelevant and, hence, that the information gain for an infinitely large sample would be zero. We need to calculate the probability that, under the null hypothesis, a sample of size v = n + p would exhibit the observed deviation from the expected distribution of positive and negative examples. We can measure the deviation by comparing the actual numbers of positive and negative examples in
each subset, pk and nk, with the expected numbers, p̂k and n̂k, assuming true irrelevance:

    p̂k = p × (pk + nk)/(p + n) ,    n̂k = n × (pk + nk)/(p + n) .

A convenient measure of the total deviation is given by

    Δ = Σk=1..d [ (pk − p̂k)² / p̂k + (nk − n̂k)² / n̂k ] .
Under the null hypothesis, the value of Δ is distributed according to the χ² (chi-squared) distribution with v − 1 degrees of freedom. We can use a χ² table or a standard statistical library routine to see whether a particular Δ value confirms or rejects the null hypothesis. For example, consider the restaurant type attribute, with four values and thus three degrees of freedom. A value of Δ = 7.82 or more would reject the null hypothesis at the 5% level (and a value of Δ = 11.35 or more would reject at the 1% level). Exercise 18.8 asks you to extend the DECISION-TREE-LEARNING algorithm to implement this form of pruning, which is known as χ² pruning.
With pruning, noise in the examples can be tolerated. Errors in the example's label (e.g., an example (x, Yes) that should be (x, No)) give a linear increase in prediction error, whereas errors in the descriptions of examples (e.g., Price = $ when it was actually Price = $$) have an asymptotic effect that gets worse as the tree shrinks down to smaller sets. Pruned trees perform significantly better than unpruned trees when the data contain a large amount of noise. Also, the pruned trees are often much smaller and hence easier to understand.
One final warning: you might think that χ² pruning and information gain look similar, so why not combine them using an approach called early stopping: have the decision tree algorithm stop generating nodes when there is no good attribute to split on, rather than going to all the trouble of generating nodes and then pruning them away. The problem with early stopping is that it stops us from recognizing situations where there is no one good attribute, but there are combinations of attributes that are informative. For example, consider the XOR function of two binary attributes. If there are roughly equal numbers of examples for all four combinations of input values, then neither attribute will be informative, yet the correct thing to do is to split on one of the attributes (it doesn't matter which one), and then at the second level we will get splits that are informative. Early stopping would miss this, but generate-and-then-prune handles it correctly.
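As a concrete illustration, the test just described can be sketched in a few lines of Python. This is a minimal sketch, not code from the book: it assumes the per-child counts (pk, nk) of a candidate test node are already available, and it relies on scipy.stats.chi2 for the critical value.

    from scipy.stats import chi2

    def looks_irrelevant(children, alpha=0.05):
        """chi-squared pruning test. `children` is a list of (pk, nk) counts, one pair
        per child of the candidate test node. Returns True if the observed deviation
        from the null hypothesis (attribute is irrelevant) is NOT significant at level
        alpha, i.e., the node should be pruned to a leaf."""
        p = sum(pk for pk, nk in children)
        n = sum(nk for pk, nk in children)
        delta = 0.0
        for pk, nk in children:
            p_hat = p * (pk + nk) / (p + n)      # expected positives under the null hypothesis
            n_hat = n * (pk + nk) / (p + n)      # expected negatives under the null hypothesis
            if p_hat > 0:
                delta += (pk - p_hat) ** 2 / p_hat
            if n_hat > 0:
                delta += (nk - n_hat) ** 2 / n_hat
        dof = len(children) - 1                  # v - 1 degrees of freedom
        return delta < chi2.ppf(1 - alpha, dof)  # roughly 7.81 for 3 degrees of freedom at 5%

For an attribute with four values (three degrees of freedom), chi2.ppf(0.95, 3) is roughly 7.81, which matches the threshold quoted above.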
18.3.6  Broadening the applicability of decision trees
In order to extend decision tree induction to a wider variety of problems, a number of issues must be addressed. We will briefly mention several, suggesting that a full understanding is best obtained by doing the associated exercises:
• Missing data: In many domains, not all the attribute values will be known for every example. The values might have gone unrecorded, or they might be too expensive to obtain. This gives rise to two problems: First, given a complete decision tree, how should one classify an example that is missing one of the test attributes? Second, how should one modify the information-gain formula when some examples have unknown values for the attribute? These questions are addressed in Exercise 18.9.
• Multivalued attributes: When an attribute has many possible values, the information gain measure gives an inappropriate indication of the attribute's usefulness. In the extreme case, an attribute such as ExactTime has a different value for every example, which means each subset of examples is a singleton with a unique classification, and the information gain measure would have its highest value for this attribute. But choosing this split first is unlikely to yield the best tree. One solution is to use the gain ratio (Exercise 18.10). Another possibility is to allow a Boolean test of the form A = vk, that is, picking out just one of the possible values for an attribute, leaving the remaining values to possibly be tested later in the tree.
• Continuous and integer-valued input attributes: Continuous or integer-valued attributes such as Height and Weight have an infinite set of possible values. Rather than generate infinitely many branches, decision-tree learning algorithms typically find the split point that gives the highest information gain. For example, at a given node in the tree, it might be the case that testing on Weight > 160 gives the most information. Efficient methods exist for finding good split points: start by sorting the values of the attribute, and then consider only split points that are between two examples in sorted order that have different classifications, while keeping track of the running totals of positive and negative examples on each side of the split point (see the sketch at the end of this section). Splitting is the most expensive part of real-world decision tree learning applications.
• Continuous-valued output attributes: If we are trying to predict a numerical output value, such as the price of an apartment, then we need a regression tree rather than a classification tree. A regression tree has at each leaf a linear function of some subset of numerical attributes, rather than a single value. For example, the branch for two-bedroom apartments might end with a linear function of square footage, number of bathrooms, and average income for the neighborhood. The learning algorithm must decide when to stop splitting and begin applying linear regression (see Section 18.6) over the attributes.
A decision-tree learning system for real-world applications must be able to handle all of these problems. Handling continuous-valued variables is especially important, because both physical and financial processes provide numerical data. Several commercial packages have been built that meet these criteria, and they have been used to develop thousands of fielded systems. In many areas of industry and commerce, decision trees are usually the first method tried when a classification method is to be extracted from a data set. One important property of decision trees is that it is possible for a human to understand the reason for the output of the learning algorithm. (Indeed, this is a legal requirement for financial decisions that are subject to anti-discrimination laws.) This is a property not shared by some other representations, such as neural networks.
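The split-point search outlined in the list above can be sketched as follows. This is a minimal illustration, assuming the examples for one numerical attribute are given as (value, label) pairs and that the usual Boolean entropy is used for information gain.

    import math

    def entropy(p, n):
        """Entropy (in bits) of a Boolean distribution with p positive and n negative examples."""
        if p == 0 or n == 0:
            return 0.0
        q = p / (p + n)
        return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

    def best_split_point(examples):
        """examples: list of (value, label) pairs, label is True/False. Returns
        (split_point, gain): the midpoint between two adjacent, differently
        classified examples that gives the highest information gain."""
        examples = sorted(examples)
        p = sum(1 for _, y in examples if y)
        n = len(examples) - p
        base = entropy(p, n)
        best_point, best_gain = None, 0.0
        p_left = n_left = 0
        for (v1, y1), (v2, y2) in zip(examples, examples[1:]):
            # maintain running totals of positives/negatives to the left of the candidate split
            if y1:
                p_left += 1
            else:
                n_left += 1
            if y1 == y2 or v1 == v2:
                continue                  # only class changes between distinct values matter
            left = p_left + n_left
            right = len(examples) - left
            remainder = (left / len(examples)) * entropy(p_left, n_left) \
                      + (right / len(examples)) * entropy(p - p_left, n - n_left)
            gain = base - remainder
            if gain > best_gain:
                best_point, best_gain = (v1 + v2) / 2, gain
        return best_point, best_gain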
18.4  EVALUATING AND CHOOSING THE BEST HYPOTHESIS
We want to learn a hypothesis that fits the future data best. To make that precise we need to define "future data" and "best." We make the stationarity assumption: that there is a probability distribution over examples that remains stationary over time. Each example data point (before we see it) is a random variable Ej whose observed value ej = (xj, yj) is sampled from that distribution, and is independent of the previous examples:

    P(Ej | Ej−1, Ej−2, . . .) = P(Ej) ,

and each example has an identical prior probability distribution:

    P(Ej) = P(Ej−1) = P(Ej−2) = · · · .
Examples that satisfy these assumptions are called independent and identically distributed or i.i.d. An i.i.d. assumption connects the past to the future; without some such connection, all bets are off: the future could be anything. (We will see later that learning can still occur if there are slow changes in the distribution.)
The next step is to define "best fit." We define the error rate of a hypothesis as the proportion of mistakes it makes: the proportion of times that h(x) ≠ y for an (x, y) example. Now, just because a hypothesis h has a low error rate on the training set does not mean that it will generalize well. A professor knows that an exam will not accurately evaluate students if they have already seen the exam questions. Similarly, to get an accurate evaluation of a hypothesis, we need to test it on a set of examples it has not seen yet. The simplest approach is the one we have seen already: randomly split the available data into a training set from which the learning algorithm produces h and a test set on which the accuracy of h is evaluated. This method, sometimes called holdout cross-validation, has the disadvantage that it fails to use all the available data; if we use half the data for the test set, then we are only training on half the data, and we may get a poor hypothesis. On the other hand, if we reserve only 10% of the data for the test set, then we may, by statistical chance, get a poor estimate of the actual accuracy.
We can squeeze more out of the data and still get an accurate estimate using a technique called k-fold cross-validation. The idea is that each example serves double duty: as training data and as test data. First we split the data into k equal subsets. We then perform k rounds of learning; on each round 1/k of the data is held out as a test set and the remaining examples are used as training data. The average test set score of the k rounds should then be a better estimate than a single score. Popular values for k are 5 and 10, enough to give an estimate that is statistically likely to be accurate, at a cost of 5 to 10 times longer computation time. The extreme is k = n, also known as leave-one-out cross-validation or LOOCV.
Despite the best efforts of statistical methodologists, users frequently invalidate their results by inadvertently peeking at the test data. Peeking can happen like this: A learning algorithm has various "knobs" that can be twiddled to tune its behavior, for example, various different criteria for choosing the next attribute in decision tree learning. The researcher generates hypotheses for various different settings of the knobs, measures their error rates on the test set, and reports the error rate of the best hypothesis. Alas, peeking has occurred! The
reason is that the hypothesis was selected on the basis of its test set error rate, so information about the test set has leaked into the learning algorithm.
Peeking is a consequence of using test-set performance to both choose a hypothesis and evaluate it. The way to avoid this is to really hold the test set out: lock it away until you are completely done with learning and simply wish to obtain an independent evaluation of the final hypothesis. (And then, if you don't like the results . . . you have to obtain, and lock away, a completely new test set if you want to go back and find a better hypothesis.) If the test set is locked away, but you still want to measure performance on unseen data as a way of selecting a good hypothesis, then divide the available data (without the test set) into a training set and a validation set. The next section shows how to use validation sets to find a good tradeoff between hypothesis complexity and goodness of fit.
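A minimal sketch of the k-fold procedure, assuming a learner function that maps a training set to a hypothesis h (callable on an input x); the learner and its examples are stand-ins for whatever algorithm is being evaluated.

    import random

    def error_rate(h, examples):
        """Proportion of (x, y) examples for which the hypothesis h gets the label wrong."""
        return sum(1 for x, y in examples if h(x) != y) / len(examples)

    def k_fold_cross_validation(learner, examples, k=10, seed=0):
        """Average held-out error rate of `learner` over k folds; each example is used
        once as test data and k-1 times as training data."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        folds = [examples[i::k] for i in range(k)]          # k roughly equal subsets
        total = 0.0
        for i in range(k):
            test = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            total += error_rate(learner(train), test)
        return total / k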
18.4.1  Model selection: Complexity versus goodness of fit
In Figure 18.1 (page 696) we showed that higher-degree polynomials can fit the training data better, but when the degree is too high they will overfit, and perform poorly on validation data. Choosing the degree of the polynomial is an instance of the problem of model selection. You can think of the task of finding the best hypothesis as two tasks: model selection defines the hypothesis space, and then optimization finds the best hypothesis within that space.
In this section we explain how to select among models that are parameterized by size. For example, with polynomials we have size = 1 for linear functions, size = 2 for quadratics, and so on. For decision trees, the size could be the number of nodes in the tree. In all cases we want to find the value of the size parameter that best balances underfitting and overfitting to give the best test set accuracy.
An algorithm to perform model selection and optimization is shown in Figure 18.8. It is a wrapper that takes a learning algorithm as an argument (DECISION-TREE-LEARNING, for example). The wrapper enumerates models according to a parameter, size. For each size, it uses cross-validation on Learner to compute the average error rate on the training and test sets. We start with the smallest, simplest models (which probably underfit the data), and iterate, considering more complex models at each step, until the models start to overfit. In Figure 18.9 we see typical curves: the training set error decreases monotonically (although there may in general be slight random variation), while the validation set error decreases at first, and then increases when the model begins to overfit. The cross-validation procedure picks the value of size with the lowest validation set error, the bottom of the U-shaped curve. We then generate a hypothesis of that size, using all the data (without holding out any of it). Finally, of course, we should evaluate the returned hypothesis on a separate test set.
This approach requires that the learning algorithm accept a parameter, size, and deliver a hypothesis of that size. As we said, for decision tree learning, the size can be the number of nodes. We can modify DECISION-TREE-LEARNER so that it takes the number of nodes as an input, builds the tree breadth-first rather than depth-first (but at each level it still chooses the highest-gain attribute first), and stops when it reaches the desired number of nodes.
function CROSS-VALIDATION-WRAPPER(Learner, k, examples) returns a hypothesis
   local variables: errT, an array, indexed by size, storing training-set error rates
                    errV, an array, indexed by size, storing validation-set error rates
   for size = 1 to ∞ do
      errT[size], errV[size] ← CROSS-VALIDATION(Learner, size, k, examples)
      if errT has converged then do
         best_size ← the value of size with minimum errV[size]
         return Learner(best_size, examples)

function CROSS-VALIDATION(Learner, size, k, examples) returns two values:
   average training set error rate, average validation set error rate

    h(x) = sign( Σj αj yj (x · xj) − b ) .    (18.14)
A final important property is that the weights αj associated with each data point are zero except for the support vectors, the points closest to the separator. (They are called "support" vectors because they "hold up" the separating plane.) Because there are usually many fewer support vectors than examples, SVMs gain some of the advantages of parametric models.
What if the examples are not linearly separable? Figure 18.31(a) shows an input space defined by attributes x = (x1, x2), with positive examples (y = +1) inside a circular region and negative examples (y = −1) outside. Clearly, there is no linear separator for this problem. Now, suppose we re-express the input data, i.e., we map each input vector x to a new vector of feature values, F(x). In particular, let us use the three features

    f1 = x1² ,    f2 = x2² ,    f3 = √2 x1 x2 .    (18.15)

We will see shortly where these came from, but for now, just look at what happens. Figure 18.31(b) shows the data in the new, three-dimensional space defined by the three features; the data are linearly separable in this space! This phenomenon is actually fairly general: if data are mapped into a space of sufficiently high dimension, then they will almost always be linearly separable; if you look at a set of points from enough directions, you'll find a way to make them line up. Here, we used only three dimensions;¹¹ Exercise 18.16 asks you to show that four dimensions suffice for linearly separating a circle anywhere in the plane (not just at the origin), and five dimensions suffice to linearly separate any ellipse. In general (with some
special cases excepted) if we have N data points then they will always be separable in spaces of N − 1 dimensions or more (Exercise 18.25).
Now, we would not usually expect to find a linear separator in the input space x, but we can find linear separators in the high-dimensional feature space F(x) simply by replacing xj · xk in Equation (18.13) with F(xj) · F(xk). This by itself is not remarkable (replacing x by F(x) in any learning algorithm has the required effect), but the dot product has some special properties. It turns out that F(xj) · F(xk) can often be computed without first computing F
¹¹ The reader may notice that we could have used just f1 and f2, but the 3D mapping illustrates the idea better.
Figure 18.31  (a) A two-dimensional training set with positive examples as black circles and negative examples as white circles. The true decision boundary, x1² + x2² ≤ 1, is also shown. (b) The same data after mapping into a three-dimensional input space (x1², x2², √2 x1 x2). The circular decision boundary in (a) becomes a linear decision boundary in three dimensions. Figure 18.30(b) gives a closeup of the separator in (b).
for each point. In our three-dimensional feature space defined by Equation (18.15), a little bit of algebra shows that

    F(xj) · F(xk) = (xj · xk)² .
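The identity is easy to check numerically. The sketch below compares the explicit three-feature mapping of Equation (18.15) with the squared dot product for two arbitrary points (the particular points are just an example).

    import math

    def F(x1, x2):
        """The feature mapping of Equation (18.15): (x1^2, x2^2, sqrt(2)*x1*x2)."""
        return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    xj, xk = (1.0, 2.0), (3.0, -0.5)
    print(dot(F(*xj), F(*xk)))      # F(xj) . F(xk)  -> 4.0 (up to rounding)
    print(dot(xj, xk) ** 2)         # (xj . xk)^2    -> 4.0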
(That's why the √2 is in f3.) The expression (xj · xk)² is called a kernel function,¹² and is usually written as K(xj, xk). The kernel function can be applied to pairs of input data to evaluate dot products in some corresponding feature space. So, we can find linear separators in the higher-dimensional feature space F(x) simply by replacing xj · xk in Equation (18.13) with a kernel function K(xj, xk). Thus, we can learn in the higher-dimensional space, but we compute only kernel functions rather than the full list of features for each data point.
The next step is to see that there's nothing special about the kernel K(xj, xk) = (xj · xk)². It corresponds to a particular higher-dimensional feature space, but other kernel functions correspond to other feature spaces. A venerable result in mathematics, Mercer's theorem (1909), tells us that any "reasonable"¹³ kernel function corresponds to some feature space. These feature spaces can be very large, even for innocuous-looking kernels. For example, the polynomial kernel, K(xj, xk) = (1 + xj · xk)^d, corresponds to a feature space whose dimension is exponential in d.
¹² This usage of "kernel function" is slightly different from the kernels in locally weighted regression. Some SVM kernels are distance metrics, but not all are.
¹³ Here, "reasonable" means that the matrix Kjk = K(xj, xk) is positive definite.
This then is the clever kernel trick: Plugging these kernels into Equation (18.13), optimal linear separators can be found efficiently in feature spaces with billions of (or, in some cases, infinitely many) dimensions. The resulting linear separators, when mapped back to the original input space, can correspond to arbitrarily wiggly, nonlinear decision boundaries between the positive and negative examples.
In the case of inherently noisy data, we may not want a linear separator in some high-dimensional space. Rather, we'd like a decision surface in a lower-dimensional space that does not cleanly separate the classes, but reflects the reality of the noisy data. That is possible with the soft margin classifier, which allows examples to fall on the wrong side of the decision boundary, but assigns them a penalty proportional to the distance required to move them back on the correct side.
The kernel method can be applied not only with learning algorithms that find optimal linear separators, but also with any other algorithm that can be reformulated to work only with dot products of pairs of data points, as in Equations 18.13 and 18.14. Once this is done, the dot product is replaced by a kernel function and we have a kernelized version of the algorithm. This can be done easily for k-nearest-neighbors and perceptron learning (Section 18.7.2), among others.
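As an illustration of kernelization, here is a minimal kernelized perceptron. It is a sketch under simple assumptions (labels are +1/−1, inputs are numeric tuples), not a full SVM, but it learns and predicts using only kernel evaluations, never explicit features.

    def poly_kernel(x, z, d=2):
        """Polynomial kernel (1 + x . z)^d."""
        return (1 + sum(xi * zi for xi, zi in zip(x, z))) ** d

    def kernel_perceptron(data, labels, kernel=poly_kernel, epochs=20):
        """Train a kernelized perceptron; returns dual coefficients alpha, one per example."""
        alpha = [0.0] * len(data)
        for _ in range(epochs):
            for i, (x, y) in enumerate(zip(data, labels)):
                # the decision value uses only kernel values K(xj, x)
                s = sum(a * yj * kernel(xj, x)
                        for a, xj, yj in zip(alpha, data, labels))
                if y * s <= 0:               # mistake: give this example more weight
                    alpha[i] += 1.0
        return alpha

    def predict(x, alpha, data, labels, kernel=poly_kernel):
        s = sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, data, labels))
        return 1 if s > 0 else -1

Supplying a different kernel changes the implicit feature space without changing a line of the learning loop, which is exactly the point of the kernel trick.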
18.10  ENSEMBLE LEARNING

So far we have looked at learning methods in which a single hypothesis, chosen from a hypothesis space, is used to make predictions. The idea of ensemble learning methods is to select a collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions. For example, during cross-validation we might generate twenty different decision trees, and have them vote on the best classification for a new example.
The motivation for ensemble learning is simple. Consider an ensemble of K = 5 hypotheses and suppose that we combine their predictions using simple majority voting. For the ensemble to misclassify a new example, at least three of the five hypotheses have to misclassify it. The hope is that this is much less likely than a misclassification by a single hypothesis. Suppose we assume that each hypothesis hk in the ensemble has an error of p, that is, the probability that a randomly chosen example is misclassified by hk is p. Furthermore, suppose we assume that the errors made by each hypothesis are independent. In that case, if p is small, then the probability of a large number of misclassifications occurring is minuscule. For example, a simple calculation (Exercise 18.18) shows that using an ensemble of five hypotheses reduces an error rate of 1 in 10 down to an error rate of less than 1 in 100. Now, obviously the assumption of independence is unreasonable, because hypotheses are likely to be misled in the same way by any misleading aspects of the training data. But if the hypotheses are at least a little bit different, thereby reducing the correlation between their errors, then ensemble learning can be very useful.
Another way to think about the ensemble idea is as a generic way of enlarging the hypothesis space. That is, think of the ensemble itself as a hypothesis and the new hypothesis
Figure 18.32  Illustration of the increased expressive power obtained by ensemble learning. We take three linear threshold hypotheses, each of which classifies positively on the unshaded side, and classify as positive any example classified positively by all three. The resulting triangular region is a hypothesis not expressible in the original hypothesis space.
space as the set of all possible ensembles constructable from hypotheses in the original space. Figure 18.32 shows how this can result in a more expressive hypothesis space. If the original hypothesis space allows for a simple and efficient learning algorithm, then the ensemble method provides a way to learn a much more expressive class of hypotheses without incurring much additional computational or algorithmic complexity.
The most widely used ensemble method is called boosting. To understand how it works, we need first to explain the idea of a weighted training set. In such a training set, each example has an associated weight wj ≥ 0. The higher the weight of an example, the higher is the importance attached to it during the learning of a hypothesis. It is straightforward to modify the learning algorithms we have seen so far to operate with weighted training sets.¹⁴
Boosting starts with wj = 1 for all the examples (i.e., a normal training set). From this set, it generates the first hypothesis, h1. This hypothesis will classify some of the training examples correctly and some incorrectly. We would like the next hypothesis to do better on the misclassified examples, so we increase their weights while decreasing the weights of the correctly classified examples. From this new weighted training set, we generate hypothesis h2. The process continues in this way until we have generated K hypotheses, where K is an input to the boosting algorithm. The final ensemble hypothesis is a weighted-majority combination of all the K hypotheses, each weighted according to how well it performed on the training set. Figure 18.33 shows how the algorithm works conceptually. There are many variants of the basic boosting idea, with different ways of adjusting the weights and combining the hypotheses. One specific algorithm, called ADABOOST, is shown in Figure 18.34. ADABOOST has a very important property: if the input learning algorithm L is a weak learning algorithm, which
¹⁴ For learning algorithms in which this is not possible, one can instead create a replicated training set where the jth example appears wj times, using randomization to handle fractional weights.
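The "1 in 10 down to less than 1 in 100" calculation mentioned above is easy to check. The sketch below computes the probability that a majority of K independent hypotheses, each with error rate p, are wrong at the same time.

    from math import comb

    def majority_error(p, K=5):
        """Probability that a majority of K independent hypotheses, each with error p,
        misclassify the same example."""
        need = K // 2 + 1
        return sum(comb(K, k) * p**k * (1 - p)**(K - k) for k in range(need, K + 1))

    print(majority_error(0.1))      # about 0.0086: less than 1 in 100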
Figure 21.7  Performance of the exploratory ADP agent, using R+ = 2 and Ne = 5. (a) Utility estimates for selected states over time. (b) The RMS error in utility values and the associated policy loss.
leads to a good destination, but because of nondeterminism in the environment the agent ends up in a catastrophic state. The TD update rule will take this as seriously as if the outcome had been the normal result of the action, whereas one might suppose that, because the outcome was a fluke, the agent should not worry about it too much. In fact, of course, the unlikely outcome will occur only infrequently in a large set of training sequences; hence in the long run its effects will be weighted proportionally to its probability, as we would hope. Once again, it can be shown that the TD algorithm will converge to the same values as ADP as the number of training sequences tends to infinity.
There is an alternative TD method, called Q-learning, which learns an action-utility representation instead of learning utilities. We will use the notation Q(s, a) to denote the value of doing action a in state s. Q-values are directly related to utility values as follows:

    U(s) = max_a Q(s, a) .    (21.6)
Q-functions may seem like just another way of storing utility information, but they have a very important property: a TD agent that learns a Q-function does not need a model of the form P(s' | s, a), either for learning or for action selection. For this reason, Q-learning is called a model-free method. As with utilities, we can write a constraint equation that must hold at equilibrium when the Q-values are correct:

    Q(s, a) = R(s) + γ Σs' P(s' | s, a) max_a' Q(s', a') .    (21.7)

As in the ADP learning agent, we can use this equation directly as an update equation for an iteration process that calculates exact Q-values, given an estimated model. This does, however, require that a model also be learned, because the equation uses P(s' | s, a). The temporal-difference approach, on the other hand, requires no model of state transitions; all
function Q-LEARNING-AGENT(percept) returns an action
   inputs: percept, a percept indicating the current state s' and reward signal r'
   persistent: Q, a table of action values indexed by state and action, initially zero
               Nsa, a table of frequencies for state-action pairs, initially zero
               s, a, r, the previous state, action, and reward, initially null
   if TERMINAL?(s) then Q[s, None] ← r'
   if s is not null then
      increment Nsa[s, a]
      Q[s, a] ← Q[s, a] + α(Nsa[s, a]) (r + γ max_a' Q[s', a'] − Q[s, a])
   s, a, r ← s', argmax_a' f(Q[s', a'], Nsa[s', a']), r'
   return a

Figure 21.8  An exploratory Q-learning agent. It is an active learner that learns the value Q(s, a) of each action in each situation. It uses the same exploration function f as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.

it needs are the Q-values. The update equation for TD Q-learning is

    Q(s, a) ← Q(s, a) + α (R(s) + γ max_a' Q(s', a') − Q(s, a)) .    (21.8)
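In code, the tabular form of this update is only a few lines. The following is a sketch, with the exploration policy and the environment left out; the state, action, and reward values in the example call are illustrative, not part of any particular problem.

    from collections import defaultdict

    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
        """One TD Q-learning step, as in Equation (21.8): move Q(s, a) toward
        the sampled target r + gamma * max_a' Q(s', a')."""
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    # Q is a table of action values defaulting to zero, as in Figure 21.8
    Q = defaultdict(float)
    q_update(Q, s=(1, 1), a="Right", r=-0.04, s_next=(1, 2),
             actions=["Up", "Down", "Left", "Right"])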
This is called the Widrow-Hoff rule, or the delta rule, for online least-squares. For the linear function approximator Ûθ(s) in Equation (21.10), we get three simple update rules:

    θ0 ← θ0 + α (uj(s) − Ûθ(s)) ,
    θ1 ← θ1 + α (uj(s) − Ûθ(s)) x ,
    θ2 ← θ2 + α (uj(s) − Ûθ(s)) y .
We do know that the exact utility function can be represented in a page or two of Lisp, Java, or C++. That is, it can be represented by a program that solves the game exactly every time it is called. We are interested only in function approximators that use a reasonable amount of computation. It might in fact be better to learn a very simple function approximator and combine it with a certain amount of look-ahead search. The tradeoffs involved are currently not well understood.
We can apply these rules to the example where Ûθ(1, 1) is 0.8 and uj(1, 1) is 0.4. θ0, θ1, and θ2 are all decreased by 0.4α, which reduces the error for (1,1). Notice that changing the parameters θ in response to an observed transition between two states also changes the values of Ûθ for every other state! This is what we mean by saying that function approximation
allows a reinforcement learner to generalize from its experiences.
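For the linear approximator above, the delta-rule updates amount to a few lines of Python. The starting parameter values below are hypothetical, chosen so that Ûθ(1, 1) = 0.8 as in the text.

    def U_hat(theta, x, y):
        """Linear utility approximator of Equation (21.10): theta0 + theta1*x + theta2*y."""
        return theta[0] + theta[1] * x + theta[2] * y

    def delta_rule_update(theta, x, y, u, alpha=0.1):
        """Widrow-Hoff (delta rule) update: adjust each parameter by alpha times the error
        times the partial derivative of U_hat with respect to that parameter."""
        error = u - U_hat(theta, x, y)
        theta[0] += alpha * error          # dU/dtheta0 = 1
        theta[1] += alpha * error * x      # dU/dtheta1 = x
        theta[2] += alpha * error * y      # dU/dtheta2 = y
        return theta

    theta = [0.6, 0.1, 0.1]                     # so that U_hat(theta, 1, 1) = 0.8
    delta_rule_update(theta, x=1, y=1, u=0.4)   # all three parameters decrease by 0.4*alpha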
We expect that the agent will learn faster if it uses a function approximator, provided that the hypothesis space is not too large, but includes some functions that are a reasonably good fit to the true utility function. Exercise
21.5
asks you to evaluate
the performance of
direct utility estimation, both with and without function approximation. The improvement in the 4
x 3 world is noticeable but not dramatic, because this is a very small state space to begin with. The improvement is much greater in a 1 0 x 10 world with a + 1 reward at (10,10). This
world is well suited for a linear utility function because the true utility function is smooth and nearly linear. (See Exercise
21.8.)
If we put the +1 reward at
more like a pyramid and the function approximator in Equation All is not lost, however!
(5,5),
the true utility is
(21.10) will fail miserably.
Remember that what matters for linear function approximation
is that the function be linear in the parameters-the features themselves can be arbitrary nonlinear ftmctions of the state variables. Hence, we can include a term such as
B3..j(x - x9)2+ (y - y9)2 that measures the distance to the goal.
83!3(x, y) =
We can apply these ideas equally well to temporal-difference leamers. All we need do is adjust the parameters to try to reduce the temporal difference between successive states. The new versions of the TD and Q-learning equations
(21.3 on page 836 and 21.8 on page 844)
are given by
, , 8Uo(s) Bi � Bi + a [R(s) + 'Y Uo(s') - Uo (s)] oBi
for utilities and
(21.12)
Q��' a)
Bi � Bi + a [R(s) + 'Y ma�Q0(s' , a') - Q0(s, a)]0
(21.13)
forQ-values. For passive TD learning, the update rule can be shown to converge to the closest 4 possible approximation to the true function when the function approximator is linear in the parameters. With active learning and
nonlinear ftmctions such
as neural networks, all bets
are off: There are some very simple cases in which the parameters can go off to infinity even though there are good solutions in the hypothesis space. There are more sophisticated algorithms that can avoid these problems, but at present reinforcement learning with general function approximators remains a delicate
art.
Function approximation can also be very helpful for learning a model of the environ ment. Remember that learning a model for an
observable environment is a supervised learn
ing problem, because the next percept gives the outcome state. Any of the supervised learning methods n i Chapter
18 can be used, with suitable adjustments for the fact that we need to pre
dict a complete state description rather than just a Boolean classification or a single real value. For a partially
observable environment,
the learning problem is much more difficult. If we
know what the hidden variables are and how they are causally related to each other and to the 4
The definition of distance between utility functions is rather technical; see Tsitsiklis and Van Roy (1997).
848
Chapter
21.
Reinforcement Leaming
obseiVable variables, then we can fix the structure of a dynamic Bayesian network and use the EM algorithm to leam the parameters, as was described in Chapter 20. Inventing the hidden variables and learning the model structure are still open problems. Some practical examples are described in Section 2 1 .6. P OLICY S EARCH
2 1 .5
POLICY SEARCH
The final approach we will consider for reinforcement Jeaming problems is called policy search. In some ways, policy search is the simplest of all the methods in this chapter: the idea is to keep twiddling the policy as long as its performance improves, then stop. Let us begin with the policies themselves. Remember that a policy 1r is a function that maps states to actions. We are interested primarily in parameterized representations of 1r that have far fewer parameters than there are states in the state space Uust as in the preceding section). For example, we could represent 1r by a collection of parameterized Q-functions, one for each action, and take the action with the highest predicted value:
1r(s) = max:Qe(s,a) .
a
STOCHASTic
POLICY
soFTw.x FUNCTION
(21. 14)
Each Q-function could be a linea r function of the parameters B, as in Equation (21.10), or it could be a nonlinear ftmction such as a neural network. Policy search wiU then ad just the parameters B to improve the policy. Notice that if the policy is represented by Q functions, then policy search results in a process that learns Q-functions. This process is not the same as Q-leaming! In Q-Jearning with function approximation, the algorithm finds a value of B such that Qe is ..close" to Q*, the optimal ()-function. Policy search, on the other hand, finds a value of () that results in good performance; the values found by the two methods may differ very substantially. (For example, the approximate Q-function defined by Q9(s, a) =Q*(s, a)/10 gives optimal performance, even though it is not at all close to Q* .) Another clear instance of the difference is the case where 1r(s) is calculated using, say, depth- 10 look-ahead search with an approximate utility function Ue. A value o f () that gives good results may be a long way from making Ue resemble the true utility function. One problem with policy representations ofthe kind given in Equation (21. 14) is that the policy is a discontinuous function of the parameters when the actions are discrete. (For a continuous action space, the policy can be a smooth function of the parameters.) That is, there will be values of B such that an infinitesimal change in () causes the policy to switch from one action to another. This means that the value of the policy may also change discontinuously, which makes gradient-based search difficult. For this reason, policy search methods often use a stochastic policy representation TC9 (s, a), which specifies the probability of selecting action a in state s. One popular representation is the softmax function:
    πθ(s, a) = e^Q̂θ(s,a) / Σa' e^Q̂θ(s,a') .
Softmax becomes nearly deterministic if one action is much better than the others, but it always gives a differentiable ftmction of B; hence, the value of the policy (which depends in
a continuous fashion on the action selection probabilities) is a differentiable function of θ. Softmax is a generalization of the logistic function (page 725) to multiple variables.
Now let us look at methods for improving the policy. We start with the simplest case: a deterministic policy and a deterministic environment. Let ρ(θ) be the policy value, i.e., the expected reward-to-go when πθ is executed. If we can derive an expression for ρ(θ) in closed form, then we have a standard optimization problem, as described in Chapter 4. We can follow the policy gradient vector ∇θρ(θ) provided ρ(θ) is differentiable. Alternatively, if ρ(θ) is not available in closed form, we can evaluate πθ simply by executing it and observing the accumulated reward. We can follow the empirical gradient by hill climbing, i.e., evaluating the change in policy value for small increments in each parameter. With the usual caveats, this process will converge to a local optimum in policy space.
When the environment (or the policy) is stochastic, things get more difficult. Suppose we are trying to do hill climbing, which requires comparing ρ(θ) and ρ(θ + Δθ) for some small Δθ. The problem is that the total reward on each trial may vary widely, so estimates of the policy value from a small number of trials will be quite unreliable; trying to compare two such estimates will be even more unreliable. One solution is simply to run lots of trials, measuring the sample variance and using it to determine that enough trials have been run to get a reliable indication of the direction of improvement for ρ(θ). Unfortunately, this is impractical for many real problems where each trial may be expensive, time-consuming, and perhaps even dangerous.
For the case of a stochastic policy πθ(s, a), it is possible to obtain an unbiased estimate of the gradient at θ, ∇θρ(θ), directly from the results of trials executed at θ. For simplicity, we will derive this estimate for the simple case of a nonsequential environment in which the reward R(a) is obtained immediately after doing action a in the start state s0. In this case, the policy value is just the expected value of the reward, and we have

    ∇θ ρ(θ) = ∇θ Σa πθ(s0, a) R(a) = Σa (∇θ πθ(s0, a)) R(a) .
Now we perform a simple trick so that this summation can be approximated by samples generated from the probability distribution defined by πθ(s0, a). Suppose that we have N trials in all and the action taken on the jth trial is aj. Then

    ∇θ ρ(θ) = Σa πθ(s0, a) (∇θ πθ(s0, a)) R(a) / πθ(s0, a) ≈ (1/N) Σj=1..N (∇θ πθ(s0, aj)) R(aj) / πθ(s0, aj) .
Thus, the true gradient of the policy value is approximated by a sum of terms involving the gradient of the action-selection probability in each trial. For the sequential case, this generalizes to

    ∇θ ρ(θ) ≈ (1/N) Σj=1..N (∇θ πθ(s, aj)) Rj(s) / πθ(s, aj)
for each state s visited, where aj is executed in s on the jth trial and Rj(s) is the total reward received from state s onwards in the jth trial. The resulting algorithm is called REINFORCE (Williams, 1992); it is usually much more effective than hill climbing using lots of trials at each value of θ. It is still much slower than necessary, however.
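The estimator above translates almost directly into code. The sketch below assumes a softmax policy over a small discrete action set and a one-shot reward function supplied by the caller (both are illustrative stand-ins), and uses the fact that ∇π/π = ∇ log π, which for a softmax policy is 1[b = a] − π(b).

    import math
    import random

    def softmax_probs(theta):
        """theta maps each action to a preference; returns softmax action probabilities."""
        z = sum(math.exp(v) for v in theta.values())
        return {a: math.exp(v) / z for a, v in theta.items()}

    def reinforce_gradient(theta, reward, trials=10000, seed=0):
        """Estimate d rho / d theta[b] for each action from sampled one-shot trials."""
        rng = random.Random(seed)
        probs = softmax_probs(theta)
        actions = list(theta)
        grad = {a: 0.0 for a in actions}
        for _ in range(trials):
            a = rng.choices(actions, weights=[probs[x] for x in actions])[0]
            r = reward(a)
            for b in actions:
                # d log pi(a) / d theta[b] for a softmax policy
                grad[b] += r * ((1.0 if b == a else 0.0) - probs[b])
        return {a: g / trials for a, g in grad.items()}

    # hypothetical one-shot problem: action "left" pays 1, "right" pays 0
    print(reinforce_gradient({"left": 0.0, "right": 0.0},
                             lambda a: 1.0 if a == "left" else 0.0))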
Consider the following task: given two blackjack⁵ programs, determine which is best. One way to do this is to have each play against a standard "dealer" for a certain number of hands and then to measure their respective winnings. The problem with this, as we have seen, is that the winnings of each program fluctuate widely depending on whether it receives good
or bad cards. An obvious solution is to generate a certain number of hands in advance and
have each program play the same set of hands.
In this way, we eliminate the measurement
en-or due to differences in the cards received. This idea, called
correlated sampling,
derlies a policy-search algorithm called PEGASUS (Ng and Jordan,
2000).
un
The algorithm is
applicable to domains for which a simulator is available so that the "random" outcomes of actions can be repeated. The algorithm works by generating in advance
N sequences of ran
dom numbers, each of which can be used to run a trial of any policy. Policy search is carried out by evaluating each candidate policy using the same set of random sequences to determine the action outcomes. It can be shown that the nwnber of random sequences required to ensure that the value of every policy is well estimated depends only on the complexity of the policy space, and not at all on the complexity of the underlying domain.
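The idea is simple to sketch: evaluate both candidate policies on the same pre-generated random seeds, so that score differences reflect the policies rather than the luck of the draw. The simulate function and the two policies here are hypothetical stand-ins.

    import random

    def compare_policies(policy_a, policy_b, simulate, n_trials=100, seed=0):
        """Evaluate two policies on the same pre-generated random sequences, so that
        differences in score are due to the policies, not the random outcomes."""
        master = random.Random(seed)
        trial_seeds = [master.randrange(2**32) for _ in range(n_trials)]   # fixed in advance
        score_a = sum(simulate(policy_a, random.Random(s)) for s in trial_seeds) / n_trials
        score_b = sum(simulate(policy_b, random.Random(s)) for s in trial_seeds) / n_trials
        return score_a, score_b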
2 1 .6
APPLICATIONS OF REINFORCEMENT LEARNING
We now tum to examples of large-scale applications of reinforcement learning. We consider applications in game playing, where the transition model is known and the goal is to Jearn the utility function, and in robotics, where the model is usually unknown.
21.6.1
Applications to game playing
The first significant application of reinforcement learning was also the first significant team ing program of any kind-the checkers program written by Arthur Samuel Samuel first used a weighted linear function for the evaluation of positions, terms at any one time. He applied a version of Equation
(1959, 1967). using up to 16
(21.12) to update the weights.
There
were some significant differences, however, between his program and cunent methods. First, he updated the weights using the difference between the cunent state and the backed-up value generated by full look-ahead in the search tree. This works fine, because it amounts
to view
ing the state space at a different granularity. A second difference was that the program did
not use any observed rewards!
That is, the values of terminal states reached in self-play were
ignureu. Tills me::ans that it is the::ureti�o;ally possible:: fur Samud's program nut to �.;unve::rge::, or
to converge on a strategy designed to lose rather than to win.
He
managed
to avoid this fate
by insisting that the weight for material advantage should always be positive. Remarkably, this was sufficient to direct the program into areas of weight space conesponding to good checkers play. Geny Tesauro's backgammon program TD-GAMMON potential of reinforcement learning techniques.
1989),
(1992) forcefully .illustrates
In earlier work (Tesauro and Sejnowski,
Tesauro tried teaming a neural network representation of
5 Also known as twenty-one or pontoon.
the
Q(s, a)
directly from ex-
Section 21.6.
Applications of Reinforcement Learning
851
X
Figure 21.9 Setup for the problem of balancing a long pole on top of a moving cart. The cart can be jerked left or right by a controller that observes x, 8, x, and B. amples of moves labeled with relative values by a human expert.
This approach proved
extremely tedious for the expert. It resulted in a program, called NEUROGAMMON, that was strong by computer standards, but not competitive with human experts. The TO-GAMMON project was an attempt to learn from self-play alone. The only reward signal was given at t.he end of each game. The evaluation function was represented by a fully connected neural network with a single hidden layer containing Equation
40 nodes. Simply by repeated application of
(21.12), TO-GAMMON learned to play considerably better than
NEUROGAMMON,
even though the input representation contained just the raw board position with no computed features. This took about 200,000 training games and two weeks of computer time. Although that may seem like a lot of games, it is only a vanishingly small fraction of the state space. When precomputed features were added to the input representation, a network with 80 hidden nodes was able, after
300,000 training games, to reach a standard of play comparable to that
of the top three human players worldwide. Kit Woolsey, a top player and analyst, said that "There is no question in my mind that its positional judgment is far better than mine."
21.6.2 CARHOLE INVERTED PENDULUM
Application to robot control
The setup for the famous
cart-pole
lum,
21.9. The problem is to control the position x of the cart so that
is shown in Figure
the pole stays roughly upright
(B
�
balancing problem, also known as the
n/2),
inverted pendu
while staying within the limits of the cart track
as shown. Several thousand papers in reinforcement learning and control theory have been published on this seemingly simple problem. The catt-pole problem differs from the prob
x, B, x, and iJ are continuous. The actions are jerk left or jerk right, the so-called bang-bang control regime.
lems described earlier in that the state variables BANG·B.ING CONTROL
usually discrete:
The earliest work on leaming for this problem was carried out by Michie and Cham bers (1968). Their BOXES algorithm was able to balance the pole for over an hour after only about
30 trials. Moreover, tullike many subsequent systems,
BOXES was implemented with a
852
Chapter 21.
Reinforcement Leaming
real cart and pole, not a simulation. The algorithm first discretized the four-dimensional state space into boxes-hence the name. It then ran trials until the pole fell over or the cart hit the end of the track. Negative reinforcement was associated with the final action in the final box and then propagated back through the sequence. It was found that the discretization caused some problems when the apparatus was initialized in a position different from those used in training, suggesting that generalization was not perfect. Improved generalization and faster learning can be obtained using an algorithm that adaptively partitions the state space accord ing to the observed variation in the reward, or by using a continuous-state, nonlinear function approximator such as a neural network. Nowadays, balancing a triple inverted pendulwn is a common exercise-a feat far beyond the capabilities of most humans. Still more impressive is the application of reinforcement leaming to helicopter flight (Figure 21.10). This work has generally used policy search (Bagnell and Schneider, 2001) as well as the PEGASUS algorithm with simulation based on a learned transition model (Ng et at., 2004). Further details are given in Chapter 25.
Figure 21.10 Superimposed time-lapse images of an autonomous helicopter performing a very difficult "nose-in circle" maneuver. The helicopter is under the control of a policy developed by the PEGASUS policy-search algorithm. A simulator model was developed by observing the effects of various control manipulations on the real helicopter; then the algo rithm was run on the simulator model overnight. A variety of controllers were developed for different maneuvers. In all cases, performance far exceeded that of an expert human pilot using remote control. (Image courtesy of Andrew Ng.)
Section 21.7.
2 1 .7
853
Summary
SUMMARY This chapter has examined the reinforcement learning problem: how an agent can become
proficient in an unknown envirorunent, given only its percepts and occasional rewards. Rein forcement learning can be viewed as a microcosm for the entire AI problem, but it is studied in a number of simplified settings to facilitate progress. The major points are: •
•
The overall agent design dictates the kind of information that must be leamed. The three main designs we covered were
the model-based design,
using a model
utility function U; the model-free design, using an action-utility function reflex design, using a policy 1r.
Q;
P and a and the
Utilities can be learned using three approaches:
1.
Direct utility estimation uses
the total observed reward-to-go for a given state as
direct evidence for learning its utility.
2.
Adaptive dynamic programming (ADP) leams a model and a reward function from observations and then uses value or policy iteration to obtain the utilities or
an optimal policy. ADP makes optimal use of the local constraints on utilities of states imposed tlu-ough the neighborhood structure ofthe enviromnent.
3.
(TD) methods update utility estimates to match those ofsuc cessor states. They can be viewed as simple approximations to the ADP approach Temporal-difference
that can learn without requiting a transition model. Using a leamed model to gen erate pseudoexperiences can, however, result in faster leaming. •
ADP appruad1 ur a TD approach. With TD, Q-learning requires no model in either the leaming or action A�.:liuu-ulilily fuu�.:liuns, ur Q-fun�.:liuus, �.:au bt: lt:amt:d by
an
selection phase. This simplifies the leammg problem but potentially restricts the ability to learn in complex environments, because the agent cannot simulate the results of possible courses of action. •
When the leammg agent is responsible for selecting actions while it leatns, it must trade off the estimated value of those actions against the potential for learning useful new information. An exact solution of the exploration problem is mfeasible, but some simple heuristics do a reasonable job.
•
In large state spaces, reinforcement teaming algorithms must use an approximate func tional representation in order to generalize over states. The temporal-difference signal can be used directly to update parameters in representations such as neural networks.
•
Policy-search methods operate directly on a representation of the policy, attempting to improve it based on observed performance. The variation in the performance in a stochastic domain is a serious problem; for simulated domains this can be overcome by fixing the randomness in advance.
Because of its potential for eliminating hand coding ofcontrol strategies, reinforcement learn ing continues to be one of the most active areas of machine leaming research. Applications in robotics promise to be particularly valuable; these will require methods for handling con-
Chapter
854
21.
Reinforcement Leaming
tinuous, high-dimensional, partially obseiVable environments in which successful behaviors may consist of thousands or even millions of primitive actions.
BIBLIOGRAPHICAL AND HISTORICAL NOTES Turing (1948,
1950) proposed the reinforcement-learning approach, although he was not con
vinced of its effectiveness, writing, "the use of punishments and rewards can at best be a part of the teaching process." Arthur Samuel's work (1959) was probably the earliest successful machine learning research. Although this work was informal and had a number of flaws, it contained most of the modern ideas in reinforcement learning, including temporal differ encing and function approximation. Around the same time, researchers in adaptive control theory (Widrow and Hoff,
1960), building on work by Hebb (1949), were training simple net
works using the delta rule. (This early connection between neural networks and reinforcement learning may have led to
the persistent misperception that the latter is
a subfield of the for
mer.) The cart-pole work of Michie and Chambers (1968) can also be seen as a reinforcement learning method with a ftmction approximator. The psychological literature on reinforcement learning is much older; Hilgard and Bower (1975) provide a good survey. Direct evidence for the operation of reinforcement learning in animals has been provided by investigations into the foraging behavior of bees; there is a clear neural correlate of the reward signal in the fonn of a large neuron mapping from tague
et al. , 1995).
the nectar intake
sensors directly to the motor cortex (Mon
Research using single-cell recording suggests that
the dopamine
system
in primate brains implements something resembling value function learning (Schultz
ei at. ,
1997). The neuroscience text by Dayan and Abbott (2001) describes possible neural imple mentations of temporal-difference learning, while Dayan and Niv (2008) sw·vey the latest evidence from neuroscientific and behavioral experiments. The connection between reinforcement learning and Markov decision processes was first made by Werbos
(1977), but the development of reinforcement learning in AI stems from work at the University of Massachusetts in the early 1980s (Barto et at., 1981). The paper by Sutton (1988) provides a good historical oveiView. Equation (21.3) in this chapter is a special case for
A = O of Sutton's general TD(A)
algorithm. TD(A) updates the utility
values of all states in a sequence leading up to each transition by an amount that drops off as
At for states t steps in the past.
TD(1 ) is identical to the Widrow-Hoff or delta rule. Boyan
(2002), building on work by Bradtke and Barto (1996), argues that TD(A) and related algo rithms make inefficient use of experiences; essentially, they are online regression algorithms that converge much more slowly than offline regression. His LSTD (least-squares temporal
differencing) algorithm is an online algorithm for passive reinforcement leaming that gives the same results as offline regression. Least-squares policy iteration, or LSPI (Lagoudakis and Parr,
2003), combines this idea with the policy iteration algmithm, yielding a robust,
statistically efficient, model-free algorithm for learning policies. The combination of temporal-difference learning with the model-based generation of simulated experiences was proposed in Sutton's DYNA architecture (Sutton,
1990). The idea of prioritized sweeping was introduced independently by Moore and Atkeson (1993) and
Bibliographical and Historical Notes
855
Peng and Williams (1993). Q-learning was developed in Watkins's Ph.D. thesis (1989), while SARSA appeared in a technical report by Rummery and Niranjan (1994). Bandit problems, which model the problem of exploration for nonsequential decisions, are analyzed in depth by Berry and Fristedt (1985). Optimal exploration strategies for several settings are obtainable using the technique called Gittins indices (Gittins, 1989). A vari ety of exploration methods for sequential decision problems are discussed by Barto
et al.
(1995). Kearns and Singh (1998) and Brafman and Tennenholtz (2000) describe algorithms that explore unknown environments and are guaranteed to converge on near-optimal policies in polynomial time. Bayesian reinforcement learning (Dearden et al., 1998, 1999) provides another angle on both model uncet1ainty and exploration. Function approximation in reinforcement learning goes back to the work of Samuel, who used both linear and nonlinear evaluation functions and also used feature-selection meth CMAC
ods to reduce the feature space. Later methods include the CMAC (Cerebellar Model Artic ulation Controller) (Albus, 1975), which is essentially a sum of overlapping local kernel
functions, and the associative neural networks of Barto et al. (1983). Neural networks are currently the most popular form of function approximator. The best-known application is TD-Gammon (Tesauro, 1992, 1995), which was discussed in the chapter. One significant problem exhibited by neural-network-based TD learners is that they tend to forget earlier ex
petiences, especially those in parts of the state space that are avoided once competence is achieved. This can result in catastrophic failure if such circumstances reappear. Function ap proximation based on instance-based learning can avoid this problem (Ormoneit and Sen,
2002; Forbes, 2002). The convergence of reinforcement learning algorithms using function approximation is an extremely technical subject. Results for TO learning have been progressively strength ened for the case of linear function approximators (Sutton, 1988; Dayan, 1992; Tsitsiklis and Van Roy, 1997), but several examples of divergence have been presented for nonlinear func tions (see Tsitsiklis and Van Roy, 1997, for a discussion). Papavassiliou and Russell (1999) describe a new type of reinforcement learning that converges with any fonn of ftmction ap proximator, provided that a best-fit approximation can be found for the observed data. Policy search methods were brought to the fore by Williams (1 992), who developed the REINFORCE family of algorithms. Later work by Marbach and Tsitsiklis (1998), Sutton et al. (2000), and Baxter and Bartlett (2000) strengthened and generalized the convergence results
for policy search. The method of correlated sampling for comparing different configurations of a system was described formally by Kahn and Marshall (1953), but seems to have been known long before that. Its use in reinforcement learning is due to Van Roy (1998) and Ng
and Jordan (2000); the latter paper also introduced the PEGASUS algoritlun and proved its fonnal properties.
As we mentioned in the chapter, the performance of a stochastic policy is a continuous function of its
Σk λk fk(xi−1, xi, e, i) .
The λk parameter values are learned with a MAP (maximum a posteriori) estimation procedure that maximizes the conditional likelihood of the training data. The feature functions are the key components of a CRF. The function fk has access to a pair of adjacent states, xi−1 and
x., but also the entire observation (word) sequence e, and the current position in the temporal sequence, i. This gives us a lot of flexibility in defining features. We can define a simple feature function, for example one that produces a value of 1 if the current word is A NDREW
{
and the current state is SPEAKER: .)
f1 (Xi-t , X i , e, z
=
1 if Xi = SPEAKER and e;. = ANDREW . herwiSe 0 ot
How are features like these used? It depends on their corresponding weights. If λ_1 > 0, then whenever f_1 is true, it increases the probability of the hidden state sequence x_{1:N}. This is another way of saying "the CRF model should prefer the target state SPEAKER for the word ANDREW." If, on the other hand, λ_1 < 0, the CRF model will try to avoid this association, and if λ_1 = 0, this feature is ignored. Parameter values can be set manually or can be learned
from data. Now consider a second feature function:

f_2(x_{i-1}, x_i, e, i) =  1  if x_i = SPEAKER and e_{i+1} = SAID
                           0  otherwise
This feature is true if the current state is SPEAKER and the next word is "said." One would therefore expect a positive λ_2 value to go with the feature. More interestingly, note that both f_1 and f_2 can hold at the same time for a sentence like "Andrew said ...". In this case, the two features overlap each other and both boost the belief in x_1 = SPEAKER. Because of the independence assumption, HMMs cannot use overlapping features; CRFs can. Furthermore, a feature in a CRF can use any part of the sequence e_{1:N}. Features can also be defined over transitions between states. The features we defined here were binary, but in general, a feature function can be any real-valued function. For domains where we have some knowledge about the types of features we would like to include, the CRF formalism gives us a great deal of flexibility in defining them. This flexibility can lead to accuracies that are higher than with less flexible models such as HMMs.
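To make this concrete, here is a minimal Python sketch (not the book's code) of how binary feature functions and their weights combine into an unnormalized CRF score. The label names, the weights, and the unnormalized_score helper are illustrative assumptions of the sketch.

import math

# Hypothetical label constants for this sketch.
SPEAKER, OTHER = "SPEAKER", "OTHER"

def f1(x_prev, x_cur, e, i):
    # 1 if the current state is SPEAKER and the current word is "Andrew".
    return 1 if x_cur == SPEAKER and e[i] == "Andrew" else 0

def f2(x_prev, x_cur, e, i):
    # 1 if the current state is SPEAKER and the next word is "said".
    return 1 if (x_cur == SPEAKER and i + 1 < len(e)
                 and e[i + 1] == "said") else 0

# Feature functions paired with illustrative weights lambda_k (made up, not learned).
weighted_features = [(1.5, f1), (0.9, f2)]

def unnormalized_score(x, e):
    """exp of sum_i sum_k lambda_k * f_k(x_{i-1}, x_i, e, i).

    Dividing this value by its sum over all possible state sequences x
    would give the CRF probability P(x | e); normalization is omitted here."""
    total = 0.0
    for i in range(len(e)):
        x_prev = x[i - 1] if i > 0 else None
        for lam, fk in weighted_features:
            total += lam * fk(x_prev, x[i], e, i)
    return math.exp(total)

e = ["Andrew", "said", "the", "talk", "is", "at", "3", "pm"]
x_speaker = [SPEAKER] + [OTHER] * (len(e) - 1)   # tags "Andrew" as SPEAKER
x_other = [OTHER] * len(e)                        # tags nothing as SPEAKER

# Both f1 and f2 fire at position 0 for x_speaker, so its score is higher.
print(unnormalized_score(x_speaker, e) > unnormalized_score(x_other, e))  # True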
22.4.4  Ontology extraction from large corpora
So far we have thought of information extraction as finding a specific set of relations (e.g., speaker, time, location) in a specific text (e.g., a talk announcement). A different application of extraction technology is building a large knowledge base or ontology of facts from a corpus. This is different in three ways: First, it is open-ended: we want to acquire facts about all types of domains, not just one specific domain. Second, with a large corpus, this task is dominated by precision, not recall, just as with question answering on the Web (Section 22.3.6). Third, the results can be statistical aggregates gathered from multiple sources, rather than being extracted from one specific text.

For example, Hearst (1992) looked at the problem of learning an ontology of concept categories and subcategories from a large corpus. (In 1992, a large corpus was a 1000-page encyclopedia; today it would be a 100-million-page Web corpus.) The work concentrated on templates that are very general (not tied to a specific domain) and have high precision (are
almost always correct when they match) but low recall (do not always match). Here is one of the most productive templates:
NP such as NP (, NP)* (,)? ((and | or) NP)? .

Here the boldface words ("such as," "and," "or") and the commas must appear literally in the text, but the parentheses are for grouping, the asterisk means repetition of zero or more, and the question mark means optional. NP is a variable standing for a noun phrase; Chapter 23 describes how to identify noun phrases; for now just assume that we know some words are nouns and other words (such as verbs) that we can reliably assume are not part of a simple noun phrase. This template matches the texts "diseases such as rabies affect your dog" and "supports network protocols such as DNS," concluding that rabies is a disease and DNS is a network protocol. Similar templates can be constructed with the key words "including," "especially," and "or other." Of course these templates will fail to match many relevant passages, like "Rabies is a disease." That is intentional. The "NP is a NP" template does indeed sometimes denote a subcategory relation, but it often means something else, as in "There is a God" or "She is a little tired." With a large corpus we can afford to be picky: to use only the high-precision templates. We'll miss many statements of a subcategory relationship, but most likely we'll find a paraphrase of the statement somewhere else in the corpus in a form we can use.
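As a rough illustration of this kind of high-precision template matching, here is a small Python sketch. The single-word stand-in for NP and the extract_subcategories helper are simplifying assumptions of the sketch, not the noun-phrase chunking approach of Chapter 23.

import re

# Crude stand-in for a noun phrase: a single word token that is not "and"/"or".
# A real system would use noun-phrase chunking; this is an assumption of the sketch.
NP = r"(?!(?:and|or)\b)[A-Za-z][\w-]*"

# The template "NP such as NP (, NP)* (,)? ((and | or) NP)?" rendered as a regex.
SUCH_AS = re.compile(
    rf"(?P<category>{NP}) such as "
    rf"(?P<members>{NP}(?:, {NP})*,?(?: (?:and|or) {NP})?)"
)

def extract_subcategories(text):
    """Yield (member, category) pairs found by the high-precision template."""
    for m in SUCH_AS.finditer(text):
        category = m.group("category")
        for member in re.split(r", | and | or ", m.group("members")):
            member = member.strip(", ")
            if member and member not in ("and", "or"):
                yield member, category

text = ("Diseases such as rabies affect your dog. "
        "It supports network protocols such as DNS.")
print(list(extract_subcategories(text)))
# [('rabies', 'Diseases'), ('DNS', 'protocols')]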
22.4.5  Automated template construction
The subcategory relation is so fundamental that it is worthwhile to handcraft a few templates to help identify instances of it occurring in natural language text. But what about the thousands of other relations in the world? There aren't enough AI grad students in the world to create
and debug templates for all of them. Fortunately, it is possible to learn templates from a few
examples, then use the templates to learn more examples, from which more templates can be learned, and so on. In one of the first experiments of this kind, Brin (1999) started with a data set of just five examples:
("Isaac Asimov", "The Robots of Dawn") ("David Brio", "Startide Rising") ("James Gleick", "Chaos-Making a New Science") ("Charles Dickens", "Great Expectations") ("William Shakespeare", "The Comedy of Errors") Clearly these are examples of the author-title relation, but the learning system had no knowl edge of authors or titles. The words in these examples were used in a search over a Web
corpus, resulting in
199 matches. Each match is defined as a tuple of seven strings,
(Author, Title, Order, Prefix, Middle, Postfix, URL) , where
Order is true
if the author came first and false if the title came first,
Middle
is the
Prefix is the 10 characters before the match, Suffix is URL is the Web address where the match was made.
characters between the author and title, the I 0 characters after the match, and
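Here is a minimal Python sketch of this match representation, assuming plain-text pages and exact string matching. The Match class, the find_match helper, the sample page, and the URL are illustrative, not Brin's actual system.

from typing import NamedTuple, Optional

class Match(NamedTuple):
    author: str
    title: str
    order: bool    # True if the author appeared before the title
    prefix: str    # the 10 characters before the match
    middle: str    # the characters between the author and the title
    postfix: str   # the 10 characters after the match
    url: str       # the Web address where the match was made

def find_match(author: str, title: str, page_text: str, url: str) -> Optional[Match]:
    """Build a match tuple for the first occurrence of both strings, if any."""
    a, t = page_text.find(author), page_text.find(title)
    if a < 0 or t < 0:
        return None
    order = a < t
    first, first_len = (a, len(author)) if order else (t, len(title))
    second, second_len = (t, len(title)) if order else (a, len(author))
    if first + first_len > second:       # the two strings overlap; no clean match
        return None
    return Match(author, title, order,
                 prefix=page_text[max(0, first - 10):first],
                 middle=page_text[first + first_len:second],
                 postfix=page_text[second + second_len:second + second_len + 10],
                 url=url)

page = 'A review of Startide Rising by David Brin, published in 1983.'
m = find_match("David Brin", "Startide Rising", page, "http://example.com/reviews")
print(m.order, repr(m.prefix), repr(m.middle))
# False 'review of ' ' by '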
Given a set of matches, a simple template-generation scheme can find templates to explain the matches. The language of templates was designed to have a close mapping to the matches themselves, to be amenable to automated learning, and to emphasize high precision
(possibly at the risk of lower recall). Each template has the same seven components as a match. The Author and Title are regexes consisting of any characters (but beginning and ending in letters) and constrained to have a length from half the minimum length of the examples to twice the maximum length. The prefix, middle, and postfix are restricted to literal strings, not regexes. The middle is the easiest to learn: each distinct middle string in the set of matches is a distinct candidate template. For each such candidate, the template's Prefix is then defined as the longest common suffix of all the prefixes in the matches, and the Postfix is defined as the longest common prefix of all the postfixes in the matches. If either of these is of length zero, then the template is rejected. The URL of the template is defined as the longest common prefix of the URLs in the matches. In the experiment run by Brin, the first 199 matches generated three templates. The most productive template was