Selected Readings in Vision and Graphics, Volume 60
Daniel Eugen Roth
Real-Time Multi-Object Tracking
Diss. ETH No. 18721
Hartung-Gorre
Diss. ETH No. 18721
Real-Time Multi-Object Tracking A dissertation submitted to the SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH
for the degree of Doctor of Sciences ETH
presented by DANIEL EUGEN ROTH MSc ETH in Electrical Engineering and Information Technology born 28th January 1978 citizen of Zollikon (ZH) and Hemberg (SG)
accepted on the recommendation of Prof. Dr. Luc Van Gool, ETH Zurich and K.U. Leuven, examiner Prof. Dr. Thomas B. Moeslund, Aalborg University, co-examiner
2009
Abstract

New video cameras are installed in growing numbers in private and public places, producing a huge amount of image data. The need to process and analyze these data automatically in real-time is critical for applications such as visual surveillance or live sports analysis. Of particular interest is the tracking of moving objects such as pedestrians and cars. This work presents two visual object tracking methods, multiple prototype systems and an event-based performance evaluation metric. The first object tracker uses flexible 2D environment modeling to track arbitrary objects. The second method detects multiple object classes and uses a camera calibration; it therefore operates in 2.5D and performs more robustly in crowded situations. Both proposed monocular object trackers use a modular tracking framework based on Bayesian per-pixel classification. It segments an image into foreground and background objects based on observations of object appearances and motions. Both systems adapt to changing lighting conditions, handle occlusions, and work in real-time. Multiple prototype systems are presented for privacy applications in video surveillance and for cognitive multi-resolution behavior analysis. Furthermore, a performance evaluation method is presented and applied to different state-of-the-art trackers, based on the successful detection of semantic high-level events. The high-level events are extracted automatically from the different trackers and their varying types of low-level tracking results. This general new event metric is used to compare our tracking method and the other tracking methods against ground truth on multiple public datasets.
Zusammenfassung

The growing number of cameras installed in public and private places produces a steadily increasing flood of images that can be analyzed only to a limited extent, if at all. The automatic evaluation and analysis of these videos in real-time is therefore a prerequisite for applications in video surveillance or for the live analysis of sports. Of particular interest is the tracking of moving objects such as people or cars. This dissertation describes two methods as well as several prototype systems for the automatic detection and tracking of objects. In addition, an evaluation and comparison method for such systems is presented, which assesses how well the systems recognize individual events. The first of the two presented object tracking methods operates purely two-dimensionally (2D) with arbitrary object sizes and shapes. The second method distinguishes several defined object classes. Moreover, it estimates spatial depth in a two-and-a-half-dimensional (2.5D) interpretation of the image with the help of a single calibrated camera. This makes the tracking less error-prone when many overlapping objects occupy a confined space. In both tracking methods the image pixels are segmented into foreground and background. Differences in the motion and appearance of the objects are interpreted and classified as Bayesian probabilities. Both tracking methods run in real-time, continuously adapt to the lighting conditions and recognize occlusions between objects. Several prototype systems emerged from this work. On the one hand, systems for cognitive video surveillance at different resolution levels are presented. On the other hand, a prototype investigates how privacy can be restored despite video surveillance. In addition to the systems, this work describes a suitable quantitative evaluation and comparison method. Individual events with semantic meaning are filtered from the results of the object tracking methods and their detection rate is measured. Comparisons between several state-of-the-art object tracking methods and human-annotated results were carried out on several well-known video sequences.
Acknowledgements

This thesis would not have been possible without the invaluable support from many sides. First and foremost I thank my supervisor Prof. Luc Van Gool for giving me the opportunity to explore this fascinating field and the interesting research projects. My thanks also go to Dr. Esther Koller-Meier for guiding me with her advice and for her scientific and personal support. And I am very grateful to Prof. Thomas B. Moeslund for the co-examination of this thesis. I would like to thank all the project partners for their fruitful collaborations and for helping me build the various prototype systems.

• Blue-C II project: Torsten Spindler, Indra Geys and Petr Doubek.
• HERMES project: University of Oxford: Eric Sommerlade, Ben Benfold, Nicola Bellotto, Ian Reid. Aalborg Universitet: Preben Fihl. Universitat Autònoma de Barcelona: Andrew D. Bagdanov, Carles Fernandez, Dani Rowe, Juan Jose Villanueva. Universität Karlsruhe: Hanno Harland, Nico Pirlo, Hans-Hellmut Nagel.

Finally, I thank all my colleagues at the Computer Vision Laboratory of ETH Zurich for their support, countless table soccer matches and the informal atmosphere at the workplace. In particular Alexander Neubeck, Andreas Ess, Andreas Griesser, Angela Yao, Axel Krauth, Bastian Leibe, Bryn Lloyd, Michael Breitenstein, Michael Van den Bergh, Philipp Zehnder, Robert McGregor, Roland Kehl, Simon Hagler, Stefan Saur, Thibaut Weise, Tobias Jaggli. This work was funded by the ETH Zurich project blue-c-II, the Swiss SNF NCCR project IM2 and the EU project HERMES (FP6 IST-027110). Last but not least, I want to thank my family and friends, especially my mom and dad for their love and support.
Contents

List of Figures
List of Tables

1 Introduction
  1.1 Challenges of Visual Object Tracking
  1.2 Contributions
  1.3 General Visual Surveillance Systems
  1.4 Outline of the Thesis

2 2D Real-Time Tracking
  2.1 Introduction
    2.1.1 Related Work
  2.2 Bayesian Per-Pixel Classification
  2.3 Appearance Models
    2.3.1 Color Modeling with Mixtures of Gaussians
    2.3.2 Background Model
    2.3.3 Finding New Objects
    2.3.4 Foreground Models
  2.4 Motion Model
  2.5 Object Localization
    2.5.1 Connected Components
    2.5.2 Grouping Blobs to Objects and Adding New Objects
    2.5.3 Occlusion Handling
  2.6 2D Tracking Results and Discussion
    2.6.1 PETS 2001 Test Dataset 3
    2.6.2 Occlusion Handling Limitations
    2.6.3 Illumination Changes
    2.6.4 Computational Effort
  2.7 Application: Privacy in Video Surveilled Areas
  2.8 Summary and Conclusion

3 Extended 2.5D Real-Time Tracking
  3.1 Introduction and Motivation
  3.2 Previous Work
  3.3 2.5D Multi-Object Tracking
    3.3.1 Tracking Algorithm
    3.3.2 Iterative Object Placement and Segmentation Refinement
    3.3.3 New Object Detection
    3.3.4 Tracking Models
    3.3.5 Ground Plane Assumption
  3.4 Extended 2.5D Tracking Results and Discussion
    3.4.1 Central Sequence
    3.4.2 HERMES Outdoor Sequence
  3.5 Application: HERMES Demonstrator
  3.6 Summary and Conclusion

4 Event-based Tracking Evaluation
  4.1 Introduction
  4.2 Previous Work
    4.2.1 Tracking Evaluation Programs
  4.3 Event-Based Tracking Metric
    4.3.1 Event Concept
    4.3.2 Event Types
    4.3.3 Event Generation
    4.3.4 Evaluation Metric
    4.3.5 Evaluation Pipeline
  4.4 Experiments
    4.4.1 CAVIAR Data Set
    4.4.2 PETS 2001 Data Set
    4.4.3 HERMES Data Set
    4.4.4 Event Description
    4.4.5 Tracker 1
    4.4.6 Tracker 2a and 2b
    4.4.7 Tracker 3
  4.5 Case Study Results
    4.5.1 CAVIAR
    4.5.2 PETS 2001
    4.5.3 HERMES
    4.5.4 Metric Discussion

5 Conclusion
  5.1 2D and 2.5D Tracking Methods
  5.2 Discussion of the Evaluation Metric
  5.3 Outlook

A Datasets

Bibliography
List of Publications
Curriculum Vitae
List of Figures

1.1 General framework of visual surveillance
2.1 Tracking framework
2.2 Example for a Gaussian mixture with three RGB Gaussians with different mean, variance and weight
2.3 Constructing a Gaussian mixture for a whole slice
2.4 Sliced object model
2.5 Object position priors, from the original image in Figure 2.4(a)
2.6 Two examples of the 3x3 filter masks for the noise filtering. The filter on the left removes single-pixel noise. The right filter mask is one example of a filter removing noise in the case of seven uniform and two differing pixel values
2.7 The connected component algorithm finds closed regions as shown in a) and b). The grouping of these regions to final objects is shown in steps c) and d)
2.8 Eight cases of occlusion are differentiated. The hatched rectangle represents the object in front. The bounding box of the object behind is represented by solid (=valid) and dashed (=invalid) lines
2.9 PETS 2001 test dataset 3, part 1
2.10 PETS 2001 test dataset 3, part 2
2.11 PETS 2001 test dataset 3, part 3
2.12 Tracking during partial and complete occlusion
2.13 Limitation: the background model only adapts where it is visible
2.14 Computational time and the number of objects including the background model of the PETS 2001 Test Dataset 3
2.15 Multi-person tracking in a conference room
2.16 Large change in the background can lead to wrong objects
3.1 Tracking framework: The maximum probability of the individual appearance models results in an initial segmentation using Bayesian per-pixel classification. The white pixels in the segmentation refer to the generic 'new object model' described in Section 3.3.3
3.2 The segmentation is refined while iteratively searching for the exact object position from close to distant objects
3.3 Ground plane calibration
3.4 Segmentation from the Central square sequence. The different sizes of the bounding boxes visualize the different object classes. Unique object IDs are shown in the top left corner. Black pixels = background, white pixels = unassigned pixels M, colored pixels = individual objects
3.5 Tracking results from the Central square sequence. Objects are visualized by their bounding box and unique ID in the top left corner
3.6 Tracking results from the HERMES outdoor sequence
3.7 Computational effort: The blue curve above shows the computation time in milliseconds per frame. Below in red, the number of tracked objects is given
3.8 HERMES distributed multi-camera system sketch. It shows the supervisor computer with SQL database on top, the static camera tracker on the bottom left, and the active camera view to the bottom right
3.9 HERMES indoor demonstrator in Oxford. Static camera view and tracking results
3.10 New CVC Demonstrator, static camera view
4.1 Evaluation scheme
4.2 Example of a distance matrix. It shows the distances between every ground truth event (column) versus every tracker event (row)
4.3 Event matching of same type
4.4 CAVIAR OneLeaveShopReenter1cor sequence with hand-labeled bounding boxes and shop area
4.5 PETS 2001 DS1 sequence
4.6 HERMES sequence
4.7 Hierarchical multiple-target tracking architecture
4.8 Multi-camera tracker by Duizer and Hansen
4.9 Frames/Event plot for the PETS sequence. Stars are ground truth events, squares are from tracker 1 and diamonds show events from tracker 2a
4.10 Frames/Event plot for the HERMES sequence. Stars equal ground truth, squares equal tracker 1, diamonds equal tracker 2b and pentagrams equal tracker 3
4.11 The presented tracking method on the HERMES sequence
4.12 Segmentation on the HERMES sequence
4.13 HERMES sequence: three trackers. Only tracker 1 in the left image detects the small bags (nr. 3 and nr. 23). Tracker 2b in the center and tracker 3 omit the small objects to achieve a higher robustness for the other objects
List of Tables

2.1 Tracking framework algorithm
2.2 Computational effort of the different parts of the algorithm
3.1 2.5D tracking algorithm
4.1 Evaluation metric
4.2 Detected events of the CAVIAR sequence
4.3 Event-based evaluation of the CAVIAR sequence
4.4 Detected events for the PETS sequence
4.5 Event-based evaluation of the PETS sequence
4.6 Object-based evaluation of tracker 1 (PETS)
4.7 Object-based evaluation of tracker 2a (PETS)
4.8 Detected events for the HERMES outdoor sequence
4.9 Event-based evaluation of the HERMES outdoor sequence
4.10 Object-based evaluation of tracker 1 (HERMES)
4.11 Object-based evaluation of tracker 2b (HERMES)
4.12 Object-based evaluation of tracker 3 (HERMES)
A.1 HERMES, Central and Rush Hour dataset comparison
A.2 PETS 2001, CAVIAR and PETS 2006 dataset comparison
A.3 PETS 2007, BEHAVE and Terrascope dataset comparison
1 Introduction

Tracking moving objects in video data is part of a broad domain of computer vision that has received a great deal of attention from researchers over the last twenty years. This gave rise to a body of literature, of which surveys can be found in the works of Moeslund et al. [53; 54], Valera and Velastin [74] and Hu et al. [32]. Computer-vision-based tracking has established its place in many real-world applications; among these are:

• Visual Surveillance of public spaces, roads and buildings, where the aim is to track people or traffic and detect unusual behaviors or dangerous situations.
• Analysis of Sports to extract positional data of athletes during training or a sports match for further analysis of the performance of athletes and whole teams.
• Video Editing where tracking can be used to add graphic content to moving objects for an enhanced visualization.
• Tracking of Laboratory Animals such as rodents with the aim to automatically study interactions and behaviors.
• Human-Computer Interfaces in ambient intelligence, where electronic environments react to the presence, action and gestures of people.
• Cognitive Systems which use tracking over a longer time period in order to learn about dynamic properties.

The growing number of surveillance cameras over the last decade as well as the numerous challenges posed by unconstrained image data make visual surveillance the most general application in the field. The aim of this work is to
develop an intelligent visual detection, recognition and tracking method for a visual surveillance system. It should track multiple objects in image sequences in order to answer questions such as: How many objects are in the scene? Where are they in the image? When did a certain person enter the room? Are there pedestrians on the crosswalk? This thesis presents methods and prototype systems to answer such questions in real-time. Given the challenges of the problem, the research is restricted to real-time tracking methods with static cameras, primarily of people and secondarily of other objects such as vehicles. Furthermore, the presented methods are compared and evaluated with novel measurements and tools. Finally, the methods are designed to run live in prototype systems. Multiple such systems for visual surveillance of different scope are built and presented. One system implements an application for privacy in video surveillance, the other is a cognitive multi-resolution surveillance system.
1.1 Challenges of Visual Object Tracking
Multi-object tracking poses various interconnected challenges, starting with object detection, classification, environment modeling, tracking and occlusion handling. The approach in this thesis addresses each of these challenges with separated and specialized models or modules. The development of a modular framework and the modification of these modules directly affect the computational speed as well as the tracking performance and robustness. The framework is chosen so that results of the visual sensing first lead to a segmentation, regardless of the sensing method. The tracking between frames is formulated as a Bayesian filtering approach which uses the segmentation. The contributions of this thesis are the general framework, the special modifications to each module and their integration in order to achieve a good balance between robustness and speed.
1.2 Contributions
This thesis deals with specialized modules for real-time multi-object tracking and explores probabilistic models concerning the appearance and motion of
objects. The work proposes a tracking framework, multiple real-time prototype systems and an evaluation metric to measure improvements. The contributions of the thesis are:

• Two real-time tracking methods sharing a similar tracking framework for Bayesian per-pixel classification, which combines the appearance and motion of objects in a probabilistic manner.
• Multiple real-time prototype systems for privacy in video surveillance and cognitive multi-resolution behavior analysis.
• An evaluation method to compare and measure tracking performance based on a novel event metric.
1.3 General Visual Surveillance Systems

The first part of this thesis deals with the design and implementation of a visual surveillance system. Therefore, general modules of a visual surveillance system, including references to prior art, are sketched first. Inspired by several general frameworks in the literature, e.g. in Hu et al. [32], Moeslund et al. [54] or Valera and Velastin [74], a similar structure as the one in Hu et al. [32] is adopted. Modifications to this general structure are made due to the focus on monocular tracking, discarding any multi-camera fusion. Furthermore, a more general task of object detection and segmentation is considered instead of simpler motion segmentation. The general structure of a visual surveillance system is shown in Figure 1.1. It combines modules for Visual Observation, Object Classification, Object Tracking, Occlusion Handling and Environment Modeling. Only the Visual Observation module has direct access to the full images of the surveillance camera, although image regions or preprocessed image features are passed on to the different modules in the framework. The tight connections between the modules are essential for any object tracking system to fulfill the needs of a visual surveillance system. While this framework is used throughout the thesis, recent developments and general strategies of all these modules are reviewed first.
Figure 1.1: General framework of visual surveillance.

1. Visual Observation
The visual observation module works directly on the image data and aims at detecting regions corresponding to the objects under surveillance such as humans. Detecting the regions which are likely to contain the tracked object (e.g. by means of background segmentation methods) provides a focus of attention for further analysis in other modules such as tracking, object classification or occlusion handling. The purpose of this module is often twofold in the sense that only these regions will be considered in the later processing and that intermediate computational results (e.g. color likelihoods) will be required by other modules. In the following, several approaches for the visual observation module are outlined; a small illustrative sketch of the simplest variant is given at the end of this module overview.

• Background Subtraction
A popular method to segment objects in an image is background subtraction. It detects regions by taking the difference between the current image and a previously learnt empty background image. In its simplest versions it literally subtracts color or grey scale values pixel-by-pixel and compares the result to a threshold. More complex methods are adaptive, incorporating illumination changes as
they occur during the day. A comparison of state-of-the-art methods can be found in [24] by Hall et al. They compare W4 by Haritaoglu et al. [26], single [76] and multiple Gaussian models [69] as well as LOTS [8] regarding performance and processing time. For the test, they use sequences from the CAVIAR corpus, showing a static indoor shopping mall illuminated by sunlight. In this setting the fast and simpler methods such as single Gaussian and LOTS outperform more complex multi-modal methods such as multiple Gaussian mixtures. However, in complex outdoor settings the background is more dynamic due to wind, waving trees or water. Several methods have been proposed to deal with such multi-modally distributed background pixels. For example, Wallflower by Toyama et al. [71] employs a linear Wiener filter to learn and predict background changes. Alternatively, codebook vectors are used by Tracey [40] to model foreground and background. The codebook model by Kim et al. [38] quantizes and compresses background samples at each pixel into codebooks. A comparison of these methods can be found in the work of Wang et al. [75], where they introduce a statistical background modeling method.

• Temporal Differencing
This method is similar to background subtraction. But instead of computing the pixel-wise difference between the current image and an empty background, two or three consecutive frames of an image sequence are subtracted to extract moving regions. Therefore, temporal differencing does not rely on an accurate maintenance of an empty background image. On the contrary, it is well adapted to dynamic environments. However, due to its focus on temporal differences in a short time frame, the method rarely segments all the relevant pixels, and holes inside moving objects are a common error. An example of this method is presented in Lipton et al. [46], where connected component analysis is used to cluster the segmented pixels into actual motion regions.

• Optical Flow
Characteristics of optical flow vectors can be used for motion segmentation of moving objects. While the computation of accurate flow vectors is costly, it allows the detection of independently moving objects, even in the presence of camera motion. More details about optical flow can be found in Barron's work [2]. Flow-based
tracking methods were presented by Meyer et al. for gait analysis [50; 51] and by Nagel and Haag [55] for outdoor car tracking.

• Feature Detection
Recently, tracking-by-detection approaches have received attention for non-real-time applications. Local invariant features such as SURF [4] or SIFT [47] are extracted from the image and further processed in additional modules such as object classification in order to detect and later track similar regions over time. Their main advantages are the robustness of local features, the data reduction for further processing and their independence of a background model, which allows such techniques to work with a moving camera.

2. Object Classification
Modules for object classification are important to identify regions of interest given by the visual observation module. In visual surveillance this module classifies the moving objects into humans, vehicles and other moving objects. It is a classic pattern recognition problem. Depending on the preprocessing results of the observation layer, different techniques can be considered. However, three main categories of methods are distinguished. They are used individually or in combination.

• Feature-detector-based Classification
Detected local features can be used to train object class detectors such as ISM by Leibe [43] for general feature detectors. Bo Wu and Ram Nevatia use specific human body part detectors for object detection and tracking [77]. Today, these approaches are generally not suitable for real-time operation. GPU-accelerated implementations will benefit from a huge performance boost in the near future, which will allow some detector-based tracking methods to become real-time using high-end graphics hardware.

• Segmentation-based Classification
Object classification based on image blobs is used by VSAM, Lipton et al. [46] and Kuno et al. Different descriptors are used for classification of the object's shape, such as points, boxes, silhouettes and blobs.

• Motion-based Classification
Periodic motion patterns are a useful cue to distinguish non-
rigid articulated human motion from the motion of other rigid objects, such as vehicles. In [46], residual flow is used as the periodicity measurement.

3. Object Tracking
The object tracking module puts the observations of the same target in different frames into correspondence and thus obtains object motion. Information from all modules is combined to initiate, propagate or terminate object trajectories. Prominent mathematical methods are applied, such as the Kalman filter [35], the Condensation algorithm [33] [34] or dynamic Bayesian networks.

• Region-based Tracking
Region-based tracking algorithms track objects according to image regions identified by a previous segmentation of the moving objects. These methods assume that the foreground regions or blobs contain the object of interest. Wren et al. [76] explore the use of small blob features to track a single person indoors. Different body parts such as the head, torso and all four limbs are identified by means of Gaussian distributions of their pixel values. The log-likelihood is used to assign the pixels to the corresponding part, including background pixels. The human is tracked by combining the movements of each small blob. The initialization of a newly entering person starts from a clean background. First, it identifies a large blob for the whole person and then continues to find individual body parts. Skin color priors are used to identify the head and hand blobs. Mean-shift tracking is a statistical method to find local maxima in probability distributions, where the correlation between the image and a shifted target region forms the probability distribution. Its maximum is assumed to locate the target position. Collins et al. [12] presented a method to generalize the traditional 2D mean-shift algorithm by incorporating scale into a difference-of-Gaussian mean-shift kernel.

• Feature-Based Tracking
Feature-based tracking methods combine successive object detections for tracking. General local features are extracted, clustered for object classification and then matched between images. Current methods can be classified into either causal or non-causal ap-
proaches. Non-causal approaches construct trajectories by finding the best association according to some global optimization criterion after all observations of a video have been computed. Examples are the methods from Leibe et al. [44], Berclaz et al. [7], and Wu and Nevatia [79]. Causal methods, in contrast, do not look into the future and only consider information from past frames. Examples for causal feature-based tracking are Okuma et al. [57] and Breitenstein et al. [9].

4. Environment Modeling
Tracking is done in a fixed environment which defines object and background models, coordinate systems and target movements. For static single cameras, environment modeling focuses on the automatic update and recovery of the background image from a dynamic scene. The most prominent background subtraction techniques are described in the visual observation module. The coordinate system for single-camera systems is mostly 2D, avoiding a complex camera calibration. Multi-camera setups, in contrast, usually use more complex 3D models of targets and the background in real-world coordinates. 3D coordinates given by a multi-camera calibration are used in various aspects of the tracking process, such as volumetric target models and movement restrictions to a common ground plane [19]. Many different volumetric 3D models are used for humans, such as elliptical cylinders, cones [52], spheres, etc. These models require more parameters than 2D models and lead to more expensive computation during the matching process. Vehicles are mainly modeled as 3D wire-frames. Research groups at the University of Karlsruhe [59] [39] [23], the University of Reading [70] and the National Laboratory of Pattern Recognition [80] made important contributions to 3D model-based vehicle localization and tracking. Multi-camera methods are not directly applicable to monocular tracking. However, some ideas and subsets of the algorithms can be adopted for single-camera tracking in order to improve the tracking, especially under occlusion.

5. Occlusion Handling
Occlusion handling is the general problem of tracking a temporarily non-visible object by inferring its position from other sources of information besides the image. Closely related are therefore the environment modeling
and the ability to detect an occlusion. Multiple cameras with different viewpoints onto the scene cope with occlusions by data fusion [52] [37] or "best" view selection [73]. This work, however, uses only a single camera and therefore focuses on methods to estimate depth ordering and 3D ground positions from environment modeling and object motion only.
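As announced in the Visual Observation module above, the following sketch illustrates the simplest background subtraction variant: a per-pixel threshold on the difference to an empty background image, combined with a slow running-average update for illumination changes. It is a minimal, illustrative example only; the threshold, the learning rate and the function names are assumptions made for this sketch and are not taken from the thesis.

```python
import numpy as np

def update_background(background, frame, alpha=0.01):
    """Slowly blend the current frame into the background model
    (simple adaptive background; alpha is an illustrative learning rate)."""
    return (1.0 - alpha) * background + alpha * frame.astype(np.float32)

def segment_foreground(frame, background, threshold=30.0):
    """Mark a pixel as foreground if its color differs from the
    background model by more than a fixed threshold."""
    diff = np.linalg.norm(frame.astype(np.float32) - background, axis=2)
    return diff > threshold  # boolean foreground mask

# usage sketch: the background is initialized from an empty reference image
# background = empty_reference_frame.astype(np.float32)
# for frame in video_frames:
#     mask = segment_foreground(frame, background)
#     background = update_background(background, frame)
```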
1.4 Outline of the Thesis

This thesis is structured into the following four chapters. Chapter 2 outlines the 2D real-time tracking method. Chapter 3 extends the previous 2D approach into a 2.5D tracking approach, extending the environment modeling. Chapter 4 introduces the novel event-based tracking evaluation metric. Finally, Chapter 5 summarizes this thesis, discusses the achieved results and provides an outlook for future research in the field of real-time tracking and tracking evaluation. The Appendix lists popular tracking datasets and their specific challenges and properties.
2 2D Real-Time Tracking

2.1 Introduction
This chapter introduces the first of two methods for the detection and tracking of people in non-controlled environments. The focus throughout the thesis is on monocular tracking with a static camera. Furthermore, the proposed method adapts to changing lighting conditions, handles occlusions and newly appearing objects, and works in real-time. It uses a Bayesian approach, assigning pixels to objects by exploiting learned expectations about both motion and appearance of objects. In comparison to the general framework of visual surveillance in Figure 1.1, this 2D method lacks a module for Environment Modeling.
2.1.1 Related Work
Human tracking has a rich history as shown in the introduction in Chapter 1, and therefore we only describe the work most closely related to the presented method in this chapter. Mittal and Davis [52] developed a multi-camera system which also uses Bayesian classification. It calculates 3D positions of humans from segmented blobs and then updates the segmentation using the 3D positions. The approach owes its robustness to the use of multiple cameras and more sophisticated calculations, which cause a problem for a real-time implementation. In comparison, we developed a modular real-time tracker which solves the tracking problem with a single view only. Furthermore, the original tracker from [52] needs a fixed calibrated multi-camera setup and does not scale well due to the pairwise stereo calculations and iterative segmentation.
Capellades et al. [10] implemented an appearance-based tracker which uses color correlograms and histograms for modeling foreground regions. A correlogram is a co-occurrence matrix, thereby including joint probabilities of colors at specific relative positions. In our case, too, taking into account how colors are distributed over the tracked objects is useful, but we prefer a sliced color model instead, as will be explained. Senior et al. [65] use an appearance model based on pixel RGB colors combined with shape probabilities.

The outline of this chapter is as follows. Section 2 presents the overall strategy underlying our tracker, which is based on both appearance and motion models. The appearance models are explained in Section 3 and the motion models in Section 4. Section 5 describes the object localization and how occlusions are handled. Results are discussed in Section 6 and Section 7 concludes this chapter.
2.2 Bayesian Per-Pixel Classification
The proposed method performs a per-pixel classification to assign the pixels to the different objects that have been identified, including the background. The probability of belonging to one of the objects is determined on the basis of two components. On the one hand, the appearance of the different objects is learned and updated, and yields indications of how compatible observed pixel colors are with these models. On the other hand, a motion model makes predictions of where to expect the different objects, based on their previous positions. Combined, these two factors yield a probability that, given its specific color and position, a pixel belongs to one of the objects. The approach is akin to similar Bayesian filtering approaches, but has been slimmed down to strike a good balance between robustness and speed. As previously mentioned, each object, including the background, is addressed by a pair of appearance and motion models. Figure 2.1 sketches this tracking framework. It incorporates different characteristics of objects such as their appearance and their motion via separate and specialized models, each updated by observing them over time. This method has several advantages for object tracking over simple foreground/background segmentation, especially in cases when an object stands still for a while: it will not be mistaken for background or fade into the background over time.
Figure 2.1: Tracking framework

Formally, the classification is described by (2.1) and (2.2). These equations describe how the probabilities to occupy a specific pixel are calculated and compared for different objects. These probabilities are calculated as the product of a prior probability P_prior(object) to find the object there, resulting from its motion model and hence the object's motion history, and the conditional probability P(pixel | object) that, if the object covers the pixel, its specific color would be observed:

P_posterior(object | pixel) ∝ P(pixel | object) · P_prior(object)        (2.1)

segmentation = max_object ( P_posterior(object | pixel) )                (2.2)
The tracker executes the steps in Table 2.1 for every frame. First, the prior probabilities of all objects are computed. Then they are used in the second step to segment the image with the Bayesian per-pixel classification. Each pixel is assigned to the object with the highest probability at this position. In a third step, the object positions are found by applying a connected components analysis to the segmentation image, which groups pixels assigned to
the same object.

1. compute posterior probability of all models
2. segment image (Bayesian per-pixel classification)
3. find connected components and group them to objects
4. add, delete, split objects
5. handle occlusion
6. update all models

Table 2.1: Tracking framework algorithm

The fourth step handles several special situations detected by the connected components algorithm. Missing objects are deleted if they are not occluded. Objects are split if they have multiple new object positions. New objects are initialized from the regions claimed by the generic new-object model (described in Section 2.3.3). The fifth step handles objects which are partially or completely occluded and infers their spatial extent according to the motion model. Occlusion handling is the subject of Section 2.5.3. Finally, all object models are updated. Namely, the appearance models update the color models from the pixels assigned to them in the segmentation image, and the motion models are updated according to the newly found object positions.
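The per-frame loop of Table 2.1 can be summarized in a few lines of Python-like pseudocode. This is only an illustrative sketch of the control flow under stated assumptions, not the thesis implementation: each model object is assumed to expose likelihood(), prior() and update() methods, and the object bookkeeping of steps 4 and 5 is omitted.

```python
import numpy as np
from scipy import ndimage

def track_frame(frame, models):
    """One per-frame iteration in the spirit of Table 2.1 (sketch only).
    Each model in 'models' is assumed to return per-pixel probability images
    from likelihood(frame) and prior(), and to accept update(frame, mask)."""
    # steps 1-2: Bayesian per-pixel classification,
    # posterior ~ P(pixel | object) * P_prior(object at this position)
    posteriors = np.stack([m.likelihood(frame) * m.prior() for m in models])
    segmentation = np.argmax(posteriors, axis=0)

    # step 3: connected components per object label (scipy used here for brevity)
    blobs = {label: ndimage.label(segmentation == label)
             for label in range(len(models))}

    # steps 4-5 (adding, deleting and splitting objects, occlusion handling)
    # are omitted in this sketch; see Sections 2.5.2 and 2.5.3.

    # step 6: update appearance and motion models from the assigned pixels
    for label, model in enumerate(models):
        model.update(frame, segmentation == label)
    return segmentation, blobs
```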
2.3 Appearance Models

In this section we introduce the color-based appearance models responsible for estimating the P(pixel | object) probabilities in (2.1). Different models are used for the background B, for newly detected objects M, and for the objects O_i that are already being tracked. The differences between the models reflect the different expectations of how these objects change their appearance over time. Each model provides a probability image, where pixels which match the model have a high probability and the others a low probability. The appearance models are updated after pixels have been assigned to the different objects, based on (2.2).
2.3.1 Color Modeling with Mixtures of Gaussians

The appearance models B of the background and O_i of all tracked objects are based on Gaussian mixtures in the RGB color space. Methods employing
time-adaptive per-pixel mixtures of Gaussians (TAPPMOGs) have become a popular choice for modeling scene backgrounds at the pixel level, and were proposed by [69]. Other color spaces than simple RGB could be considered, e.g. as described by Collins and Liu [13].

For the presented tracker the color mixture models are separated among our specialized appearance models. The first and dominant Gaussian is part of a pixel-wise background model B, described in the next Section 2.3.2. The remaining Gaussians of the color mixture are assumed to model foreground objects. These colors are separated from B and become part of an O_i as described in Section 2.3.4 about foreground models.

For both appearance models, the probability of observing the current pixel value X_t = [R_t G_t B_t] at time t, given the mixture model built from previous observations, is

P(X_t | X_1, ..., X_{t-1}) = \sum_{k=1}^{K} w_{t-1,k} \, \eta(X_t, \mu_{t-1,k}, \Sigma_{t-1,k})        (2.3)

where w_{t-1,k} is the weight of the kth Gaussian at time t-1, \mu_{t-1,k} is the vector of the RGB mean values, \Sigma_{t-1,k} is the covariance matrix of the kth Gaussian, and \eta is the Gaussian probability density function

\eta(X_t, \mu_{t-1,k}, \Sigma_{t-1,k}) = \frac{1}{(2\pi)^{3/2} |\Sigma_{t-1,k}|^{1/2}} \exp\!\left( -\tfrac{1}{2} (X_t - \mu_{t-1,k})^T \Sigma_{t-1,k}^{-1} (X_t - \mu_{t-1,k}) \right).

For the covariance it is assumed that the red (R), green (G) and blue (B) components are independent. While not true for real image data, such an approximation reduces the computational effort. A diagonal covariance matrix \Sigma = diag(\sigma_R^2, \sigma_G^2, \sigma_B^2) avoids a costly matrix inversion. An example of a Gaussian mixture with K = 3 is shown in Figure 2.2.
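As a concrete illustration of (2.3) under the diagonal-covariance assumption, the following sketch evaluates the mixture probability of a single RGB pixel. The weights, means and per-channel variances are made-up example values; this is only a numerical illustration of the formula, not code from the thesis.

```python
import numpy as np

def gaussian_pdf_diag(x, mean, var):
    """Gaussian density with a diagonal covariance (independent R, G, B)."""
    diff = x - mean
    norm = np.prod(2.0 * np.pi * var) ** -0.5  # 1 / ((2*pi)^(3/2) * sigma_R*sigma_G*sigma_B)
    return norm * np.exp(-0.5 * np.sum(diff * diff / var))

def mixture_probability(x, weights, means, variances):
    """P(X_t | X_1..X_{t-1}) as the weighted sum of K Gaussians, eq. (2.3)."""
    return sum(w * gaussian_pdf_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))

# example with K = 3 components (arbitrary numbers)
weights   = [0.6, 0.3, 0.1]
means     = [np.array([90., 95., 100.]),
             np.array([180., 20., 30.]),
             np.array([40., 200., 60.])]
variances = [np.array([25., 25., 25.])] * 3
print(mixture_probability(np.array([92., 96., 101.]), weights, means, variances))
```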
Figure 2.2: Example for a Gaussian mixture with three RGB Gaussians with different mean, variance and weight (horizontal axis: pixel intensities 0-255).

2.3.2 Background Model

The algorithm by [69] was originally designed to combine foreground and background Gaussians together into one model, where the foreground Gaussians have lower weights. Due to the separation of foreground and background pixels into different models in the presented approach, there is no more need to model foreground colors within B. During experiments, a single Gaussian has shown to be sufficient. The background colors are modeled with one Gaussian per pixel with an evolving mean \mu_t, assuming the camera is static. Such a simplification is beneficial for a faster frame-by-frame computation and it simplifies the initialization and training at the beginning with a single empty background image. Going one step further, the background model uses the same fixed diagonal covariance matrix everywhere, which again speeds up the computations without sacrificing too much accuracy. A fixed color variance is sufficient to handle global image noise and possible compression artifacts. More complex background models are discussed in the related work section. For tracking purposes, the single Gaussian model showed satisfactory results given the limited processing time available in real-time applications. The covariance matrix is set beforehand and the mean vectors are initialized based on the values in an empty background image. Thereafter, the pixels X_t segmented as background are used for updating the corresponding color model, where \alpha is the learning rate:

\mu_{t,k} = (1 - \alpha) \mu_{t-1,k} + \alpha X_t.

The background is updated only in visible parts.

2.3.3 Finding New Objects

The tracker detects newly appearing objects as part of the segmentation process. Their creation is based on a generic 'new object model' M. This appearance model has a uniform, low probability p_N. Thus, when the probabilities of the background and all other objects drop below p_N, the pixel is assigned to M. Typically, this is due to the following reasons:

• A new object appears in front of the background and the background probability drops.
• The pixel is on the "edge" of an object and its value is a mixture of the background and foreground color.
• The foreground model does not contain all colors of the object.

A new foreground model O_i is initialized as soon as a region of connected pixels has a minimal size. Some rules have been established to avoid erroneous initializations:

• Objects entering the image from the sides are not initialized until they are fully visible, i.e. until they are no longer connected with the image border. This prevents objects from being split into multiple parts.
• Pixels assigned to M which are connected with a foreground object O_i are not used for the new object detection. Instead, it is assumed that O_i is not properly segmented. If the number of these pixels exceeds 20% of O_i, they are added to O_i. The threshold is due to the assumption that a smaller number of pixels is a result of pixels on the "edge" of the object.

M also has a mechanism for cleaning up itself by reducing the probability even below p_N for those pixels which are assigned to M for a longer period of time. This mechanism is essential to keep the new object model clean from noise blobs and it increases the accuracy of the background model. M is initialized at start-up with p_N.

2.3.4 Foreground Models