DETECTION AND TRACKING OF
MULTIPLE MOVING OBJECTS IN VIDEO
Wei Huang and Jonathan Wu
Department of Electrical and Computer Engineering, University of Windsor
Windsor, Ontario, N9B 3P4, Canada
Keywords: Motion detection, tracking, partial occlusion, color, texture, DCT, inexact graph matching.
Abstract: This paper presents a method for detecting and tracking multiple moving objects in both outdoor and indoor
environments. The proposed method measures the change of a combined color-texture feature vector in each
image block to detect moving objects. The texture feature is extracted from DCT frequency domain. An
attributed relational graph (ARG) is used to represent each object, in which vertices are associated to an
object’s sub-regions and edges represent spatial relations among the sub-regions. Object tracking and
identification are accomplished by matching the input graph to the model graph. The notion of inexact graph
matching enables us to track partially occluded objects. The experimental results demonstrate the effectiveness of the proposed method.
1 INTRODUCTION
The efficient detection and tracking of multiple
moving objects is currently one of the most active
research topics in computer vision. It has many
applications such as visual surveillance, human-
machine interfaces, video communication, and so
on.
For motion detection, background subtraction is a popular technique. In (Stauffer and Grimson, 2000), the pixel value was modeled by a mixture of K weighted Gaussian distributions to support multiple backgrounds.
(Elgammal et al., 2002) used a nonparametric kernel
density model by estimating the probability of pixel
intensity directly from a set of recent intensity
values.
As to the tracking method, the most widely used
cues in object tracking are color, spatial position,
shape and motion. In (
Xu et al., 2004), five significant
features were used, including velocity, size, elliptic-
fit aspect ratio, orientation, and dominant color.
(Brasnett et al., 2005) demonstrated that the combined
color and texture cues provided a good tracking
result that was more accurate than the two cues
individually.
In this paper we introduce a new motion
detection method which does not compute any
model of the background. We measure the change of
a combined color-texture feature vector in each
image block within a time window and then directly
obtain moving objects by statistically analyzing the
change. For effective tracking, the attributed
relational graph is used to represent each moving
object. A combined color-texture-position feature
vector is used to describe each object’s sub-regions,
which are associated to the vertices of the ARG.
Inexact graph matching enables us to track and
identify partially occluded objects. In the discussion
below, we calculate the combined color-texture feature vector for motion detection in Section 2.1,
and then we explain the details of detecting moving
objects using eigenspace decomposition and
statistical analysis in Section 2.2. Section 2.3
describes how to construct the attributed relational
graph to represent the detected object. Section 2.4
gives the details of identifying objects using the inexact graph matching technique. We show experimental
results for real image sequences in Section 3.
Conclusions are given in Section 4.
2 PROPOSED ALGORITHM
2.1 The Color-Texture Feature
Approach for Motion Detection
In (Latecki et al., 2004), an idea was introduced that
the texture vectors are very likely to have a large
spread when a moving object is passing through a
fixed position. Motivated by this idea, we measure
the change of a combined color-texture feature
vector to detect moving objects. Combining color
and texture as the feature vector can still extract
foreground objects when the color distributions of
the foreground and background are similar, in which
case the Gaussian mixture model will fail. We
assume a stationary camera.
We use some DCT coefficients as the texture
feature. We partition every new frame into blocks of 8 * 8 pixels, where every two neighboring blocks overlap each other by four pixels horizontally or vertically to improve the spatial resolution of the detection results. A feature vector is extracted for
each block. Eleven features are used for detection.
Two of them are the average color components (A_{Cb}, A_{Cr}) in an 8 * 8 block. The other nine
features are the first nine AC coefficients inside each
block along the zigzag scanning. We use the well-known YC_bC_r color space, where Y encodes luminance, and C_b and C_r encode color information (chrominance). To obtain the other nine features, the
DCT is applied to the Y component of the image
block. One color-texture feature vector for a block u
is then expressed as:
f_{u,\tau} = (A_{Cb}, A_{Cr}, A_{2,1}, A_{1,2}, A_{3,1}, A_{2,2}, A_{1,3}, A_{4,1}, A_{3,2}, A_{2,3}, A_{1,4})^T    (1)
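As an illustration, a minimal sketch of this feature extraction is given below. It assumes OpenCV and NumPy; the helper name extract_block_feature, the BGR input format, and the explicit zigzag index list are our own assumptions rather than part of the original implementation.

    import cv2
    import numpy as np

    def extract_block_feature(block_bgr):
        """Return the 11-D color-texture feature of one 8x8 block (Eq. 1)."""
        ycrcb = cv2.cvtColor(block_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
        y, cr, cb = ycrcb[:, :, 0], ycrcb[:, :, 1], ycrcb[:, :, 2]
        a_cb, a_cr = cb.mean(), cr.mean()        # average chrominance components
        coeffs = cv2.dct(y)                      # 8x8 DCT of the luminance block
        # first nine AC coefficients along the zigzag scan (DC term excluded)
        zigzag = [(0, 1), (1, 0), (2, 0), (1, 1), (0, 2),
                  (0, 3), (1, 2), (2, 1), (3, 0)]
        ac = [coeffs[i, j] for (i, j) in zigzag]
        return np.array([a_cb, a_cr] + ac, dtype=np.float32)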
2.2 Detection of Moving Objects by
Measuring the Change of the
Color-Texture Feature Vector
By measuring the change of the color-texture feature
vector over time, we are able to detect whether a
particular block belongs to a background or to a
moving object. We compute the covariance matrix
of the feature vectors in the same block location
within a small number of consecutive frames. The
eigenvalues of the covariance matrix correspond to the variance of the data along the directions of the basis
vectors. We use the largest eigenvalue as a local
change measure. The larger the largest eigenvalue,
the more likely is the presence of a moving object.
In practice, for each block u, we consider the color-texture feature vectors for a symmetric window of size 2S+1 around the temporal instant \tau: f_{u,\tau-S}, ..., f_{u,\tau-1}, f_{u,\tau}, f_{u,\tau+1}, ..., f_{u,\tau+S}. S is set to 1 here. For these vectors, the covariance matrix R_{u,\tau} is:

R_{u,\tau} = \frac{1}{2S+1} \sum_{t=\tau-S}^{\tau+S} (f_{u,t} - \bar{f}_{u,\tau})(f_{u,t} - \bar{f}_{u,\tau})^T    (2)

where \bar{f}_{u,\tau} denotes the mean of the feature vectors in the window.
Then, the covariance matrix R_{u,\tau} is decomposed into its eigenvectors e_{u,\tau}(k) and eigenvalues \lambda_{u,\tau}(k) (k = 1, 2, ..., 11):

R_{u,\tau} \, e_{u,\tau}(k) = \lambda_{u,\tau}(k) \, e_{u,\tau}(k)    (3)
The largest eigenvalue \lambda_{u,\tau}^m is the local change measure C_{u,\tau}:

C_{u,\tau} = \lambda_{u,\tau}^m    (4)
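For concreteness, the change measure of Eqs. (2)-(4) could be computed as in the following sketch; the stacking of the temporal window and the use of numpy.cov (normalized by the window length) and numpy.linalg.eigvalsh are illustrative choices, not the authors' code.

    import numpy as np

    def local_change_measure(window_feats):
        """window_feats: array of shape (2S+1, 11) holding f_{u,t} for t = tau-S..tau+S.
        Returns C_{u,tau}, the largest eigenvalue of the covariance matrix R_{u,tau}."""
        F = np.asarray(window_feats, dtype=np.float64)
        R = np.cov(F, rowvar=False, bias=True)   # covariance over the window (Eq. 2)
        eigvals = np.linalg.eigvalsh(R)          # eigenvalues in ascending order (Eq. 3)
        return eigvals[-1]                       # largest eigenvalue (Eq. 4)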
Finally, we mark each block as part of a moving object or as background according to whether the change measure is larger than a predefined threshold. We assume that the values of the local change measure C_{u,\tau} in every video frame obey a Gaussian distribution. We compute the mean \mu_\tau and variance \sigma_\tau^2 of all C_{u,\tau} for u = 1, 2, ..., L, where L is the total number of sub-blocks in every video frame. A block is labelled as moving if

(C_{u,\tau} - \mu_\tau)^2 > th_1 \, \sigma_\tau^2    (5)

where th_1 is a constant and is set to 0.5 here, and

\mu_\tau = \frac{1}{L} \sum_{u=1}^{L} C_{u,\tau}    (6)

\sigma_\tau^2 = \frac{1}{L} \sum_{u=1}^{L} (C_{u,\tau} - \mu_\tau)^2    (7)
The pixels belonging to an object are connected.
A connected component analysis algorithm is used
to find connected components in the binary images
that we obtained at the motion detection stage. We
use a size filter to remove connected components whose area is below a threshold th_2.
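A possible sketch of the block labelling of Eqs. (5)-(7) together with the connected-component size filter is shown below; scipy.ndimage is our choice of tooling, and the th_2 value is only a placeholder since the paper does not specify it.

    import numpy as np
    from scipy import ndimage

    def moving_block_mask(C, th1=0.5):
        """C: 2-D array of change measures C_{u,tau} over the block grid.
        A block is moving if (C - mu)^2 > th1 * sigma^2 (Eq. 5)."""
        mu, var = C.mean(), C.var()
        return (C - mu) ** 2 > th1 * var

    def remove_small_components(mask, th2=50):
        """Keep only connected components whose area is at least th2 (placeholder value)."""
        labels, _ = ndimage.label(mask)
        sizes = np.bincount(labels.ravel())
        keep = sizes >= th2
        keep[0] = False                     # label 0 is the background
        return keep[labels]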
2.3 Object Representation by the
Attributed Relational Graph
Currently, the color histogram is widely used to represent detected objects (Comaniciu et al., 2003).
However, color histograms have limited
discriminative power. Two images producing
identical color histograms may have totally different
spatial organization of colors. In this paper, we use
an attributed relational graph to represent each
object, in which vertices are associated to an object’s
sub-regions and edges represent spatial relations
among the sub-regions. Therefore, the tracking and
identification of objects amounts to graph matching.
Both input and model graphs are automatically
extracted from video sequences. Usually, the graphs
extracted in the first frame act as the model graphs.
To fragment an object into sub-regions, we first
use a combined color-texture-position feature vector
to describe the detected image blocks which belong
to the object. The two color features are the average
color components (A_{Cb}, A_{Cr}) in an 8*8 image
block. The one texture feature is the average
summation of the first nine squared AC coefficients
along the zigzag scanning. The two position features
are simply the coordinates of the image block. After
obtaining the feature vectors for all the blocks, we
perform normalization on the five features to
eliminate the effects of different feature ranges.
Then the k-means algorithm is used to cluster the
feature vectors into several classes with every class
in the feature space corresponding to one spatial
sub-region of the detected object. The k-means
algorithm does not specify the value of k. To
compute the optimal value of k, we iterate between a minimum value k_{min} = 2 and a maximum value k_{max} = 5 until a stopping criterion is satisfied.
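A sketch of this sub-region segmentation using scikit-learn is given below; the range normalization and the silhouette-based selection of k are our own illustrative assumptions, since the paper only states that k is iterated from 2 to 5 until a stopping criterion is met.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def segment_object_blocks(features, k_min=2, k_max=5):
        """features: (n_blocks, 5) array of [A_Cb, A_Cr, T_AC, x, y] per detected block.
        Normalizes each feature to [0, 1] and clusters the blocks into sub-regions."""
        f = np.asarray(features, dtype=np.float64)
        rng = f.max(axis=0) - f.min(axis=0)
        f = (f - f.min(axis=0)) / np.where(rng > 0, rng, 1.0)   # per-feature range normalization
        best_labels, best_score = None, -1.0
        for k in range(k_min, k_max + 1):
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(f)
            score = silhouette_score(f, labels)   # stand-in selection criterion
            if score > best_score:
                best_labels, best_score = labels, score
        return best_labels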
After the segmentation, we are ready to build the
ARG for each detected object. An ARG is a graph in
which attribute vectors are assigned to vertices and
to edges. Formally, we define an ARG as G = (N, E, \mu, \nu), where N represents the set of vertices of G and E \subseteq N \times N the set of edges. Two vertices a, b of N are said to be adjacent if (a, b) \in E. Furthermore, \mu: N \to L_N assigns an attribute vector to each vertex of G, while \nu: E \to L_E assigns an attribute vector to each edge in G.
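For illustration, the ARG G = (N, E, \mu, \nu) could be held in a minimal container such as the following; the class and method names are hypothetical and not part of the paper.

    from dataclasses import dataclass, field
    import numpy as np

    @dataclass
    class ARG:
        """Attributed relational graph: vertex attributes mu and edge attributes nu."""
        mu: dict = field(default_factory=dict)   # vertex id -> 5-D attribute vector (Eq. 8)
        nu: dict = field(default_factory=dict)   # (a, b) -> edge length attribute

        def add_vertex(self, a, attr):
            self.mu[a] = np.asarray(attr, dtype=np.float64)

        def add_edge(self, a, b, length):
            self.nu[(a, b)] = float(length)
            self.nu[(b, a)] = float(length)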
The structure of an object can be represented as a
collection of sub-regions which are related by their
relative positions within the object. The sub-regions
are represented by vertices in a graph, while
relations between them are represented by edges. Let
us consider any two vertices a, b in N. The vertex
attribute \mu(a) is defined as follows:

\mu(a) = (C_{Cb}, C_{Cr}, T_{AC}, P_x, P_y)^T    (8)
The five terms correspond to the color component C_{Cb}, the color component C_{Cr}, the texture T_{AC}, and the spatial coordinates P_x and P_y at the centroid location of a cluster, respectively. Each cluster obtained by the k-means algorithm corresponds to one sub-region within the object.
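Given the cluster labels from the segmentation step, the vertex attributes of Eq. (8) can be formed, for example, as per-cluster means of the block feature vectors; this is an illustrative sketch, since the paper specifies the values at the cluster centroid, which the cluster mean approximates.

    import numpy as np

    def vertex_attributes(features, labels):
        """features: (n_blocks, 5) [C_Cb, C_Cr, T_AC, P_x, P_y]; labels: cluster id per block.
        Returns one 5-D attribute vector per sub-region."""
        return {k: features[labels == k].mean(axis=0) for k in np.unique(labels)}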
The edge attribute \nu(a, b), for (a, b) \in E, is
defined as the length value of the edge linking the
two vertices a and b.
In practice, the object is represented by a
signature, which is composed of two parts: the first
one is the feature vectors of all sub-regions, which
are called vertex attributes, and the second one is a
representation of the topology of the sub-regions
within the object. Spatial relationships between sub-
regions are characterized by an adjacency matrix of
sub-regions with a value of 1 if both sub-regions
have at least one pixel in common, otherwise 0. For
the pair of adjacent sub-regions, the length value of
the corresponding edge is stored in a distance
matrix.
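The signature described above can be assembled as in the following sketch; here adjacency is decided by whether two sub-regions touch in the label image (one possible reading of "at least one pixel in common"), and the distance matrix stores centroid-to-centroid distances.

    import numpy as np

    def region_matrices(label_img, centroids):
        """label_img: 2-D array of sub-region ids; centroids: dict id -> (x, y).
        Returns the adjacency matrix and the centroid distance matrix."""
        ids = sorted(centroids)
        index = {r: i for i, r in enumerate(ids)}
        n = len(ids)
        adj = np.zeros((n, n), dtype=int)
        dist = np.zeros((n, n))
        # two sub-regions are adjacent if they touch horizontally or vertically
        for a, b in [(label_img[:, :-1], label_img[:, 1:]),
                     (label_img[:-1, :], label_img[1:, :])]:
            for la, lb in zip(a[a != b].ravel(), b[a != b].ravel()):
                if la in index and lb in index:
                    adj[index[la], index[lb]] = adj[index[lb], index[la]] = 1
        for p in ids:
            for q in ids:
                dist[index[p], index[q]] = np.hypot(centroids[p][0] - centroids[q][0],
                                                    centroids[p][1] - centroids[q][1])
        return adj, dist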
2.4 Inexact Graph Matching for
Tracking and Identification of
Moving Objects
When graphs are used to represent objects, the
problem of objects tracking and identification can be
seen as a problem of graph matching. The notion of
inexact graph matching enables us to track partially
occluded objects. Two matched graphs do not have
to be identical but only similar in terms of vertex
number, vertex attributes or edge number. Our
implementation of the matching algorithm is given
below:
In the following, a, b refer to vertices in the input graph I, and a', b' correspond to vertices in the model graph M.
1) For each vertex a in the input graph, a search is conducted to find the best matching vertex a' in the model graph, such that the Euclidean distance between the matching vertex attributes, d(\mu_a, \mu_{a'}), is minimal. The vertex similarity for this pair of vertices is computed as:
S_v(a, a') = e^{-d(\mu_a, \mu_{a'})}    (9)

We have to satisfy two basic constraints during the matching process:
2) A vertex in the input graph cannot match with two different vertices in the model graph.
3) Two different vertices in the input graph cannot match with a single vertex in the model graph.
It is possible that some vertices in the input graph do not have matching vertices in the model graph, because the two graphs may have different numbers of vertices.
4) After the vertices are matched, the total similarity is computed by taking into account the topology of the matched graphs. Let the vertices a and b match a' and b' respectively. Then the topology similarity for this pair of vertices is computed as:

S_e(ab, a'b') = e^{-|\nu_{ab} - \nu_{a'b'}|}    (10)

where \nu_{ab} and \nu_{a'b'} are the length values of the edges ab and a'b'.
5) The total similarity for matching the input graph to the model graph is given by

S(I, M) = \frac{\alpha}{N_s} \sum_{a} S_v(a, a') + \frac{1 - \alpha}{N_e} \sum_{ab} S_e(ab, a'b')    (11)

where N_s is the maximum vertex number between the two matching graphs and N_e is the maximum number of actually existing edges between the two matching graphs. It is possible that edges in the input graph do not have corresponding edges in the model graph and vice versa. \alpha is a scaling parameter which controls the relative importance of the two similarity functions; \alpha is set to 0.6 here.
6) The total similarity is then scaled to reflect the difference in the size and position of the input and model objects:

S_1(I, M) = m \cdot p \cdot S(I, M)    (12)

If m_i and m_m denote the sizes of the input object and the model object respectively, then

m = \min\left(\frac{m_i}{m_m}, \frac{m_m}{m_i}\right)    (13)

p = \begin{cases} 1, & d < d_0 \\ e^{-(d - d_0)}, & \text{otherwise} \end{cases}    (14)

where d is the Euclidean distance between the centroids of the input and model objects, and d_0 is a constant set to 20 here. The motivation for adding the position scaling factor p to the similarity function is that an object will not move far from its last position; the centroid therefore provides a useful cue for tracking objects. To let this factor work properly, we update the model's position once an input object is matched to that model.
7) The best candidate match M^* satisfies

S(I, M^*) = \max_{M} S(I, M)    (15)

When the value of S(I, M^*) is larger than a predefined threshold, we say that the input graph/object is identified with the model graph/object. Otherwise, we assume that a new object has entered the scene; it will be tracked and labelled, and the corresponding ARG of that object is constructed and stored in the model graph/object list. A simplified code sketch of these matching steps is given below.
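To make the matching procedure concrete, a simplified sketch of steps 1)-7) follows. The greedy assignment, the helper names, and the normalization used for N_e are our own simplifications; the scale and position factors follow Eqs. (12)-(14) as reconstructed above.

    import numpy as np

    def match_graphs(mu_in, mu_mod, nu_in, nu_mod, alpha=0.6):
        """mu_*: dict vertex -> attribute vector (Eq. 8); nu_*: dict (a, b) -> edge length.
        Returns the matched-vertex map and the similarity S(I, M) of Eq. (11)."""
        match, used = {}, set()
        for a, fa in mu_in.items():                     # step 1: nearest model vertex
            cands = [(np.linalg.norm(fa - fm), m) for m, fm in mu_mod.items() if m not in used]
            if cands:
                m = min(cands, key=lambda c: c[0])[1]
                match[a] = m
                used.add(m)                             # steps 2-3: keep the match one-to-one
        s_vertex = sum(np.exp(-np.linalg.norm(mu_in[a] - mu_mod[m])) for a, m in match.items())
        s_edge, n_edges = 0.0, 0
        for (a, b), length in nu_in.items():            # step 4: topology similarity (Eq. 10)
            pair = (match.get(a), match.get(b))
            if pair in nu_mod:
                s_edge += np.exp(-abs(length - nu_mod[pair]))
                n_edges += 1
        n_s = max(len(mu_in), len(mu_mod))              # Eq. (11) normalizations
        n_e = max(n_edges, 1)                           # simplified stand-in for N_e
        return match, alpha * s_vertex / n_s + (1.0 - alpha) * s_edge / n_e

    def scaled_similarity(s, size_in, size_mod, d, d0=20.0):
        """Eqs. (12)-(14): scale by relative object size and centroid displacement."""
        m = min(size_in / size_mod, size_mod / size_in)
        p = 1.0 if d < d0 else np.exp(-(d - d0))
        return m * p * s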
3 EXPERIMENTAL RESULTS
Our proposed method is tested on both outdoor and
indoor image sequences: PETS 2001 dataset 2 and
PETS 2006 S3-T7-A dataset. The image sizes of the
PETS 2001 and PETS 2006 datasets are 768*576
and 720*576, respectively. The PETS 2001 dataset
involves swaying trees. The PETS 2006 dataset
involves cast shadows. Results are shown in Fig. 1 to Fig. 4.
Figure 1: (a) Original image from PETS 2001 dataset 2.
Figure 1: (b) Motion detection result.
Figure 2: Original image from PETS 2006 S3-T7-A
dataset with the attributed relational graph.
Figure 3: (a)-(c) Tracking a single object in an indoor environment.
VISAPP 2007 - International Conference on Computer Vision Theory and Applications
496
Figure 4: (a)-(c) Tracking multiple moving objects in an outdoor environment with a dynamic background.
4 CONCLUSIONS
In this paper, we propose a novel method for detecting and tracking multiple moving objects in
both outdoor and indoor environments. To detect the
moving objects, we compute a combined color-
texture feature vector for each image block and
measure the change of the color-texture feature
vector of the image block within a certain time
interval. For tracking and identification of the
detected multiple moving objects, we represent each
object by an ARG, in which vertices are associated
to an object’s sub-regions and edges represent
spatial relations among the sub-regions. The notion
of inexact graph matching enables us to track
partially occluded objects. Future work will address the case in which an object is totally occluded.
REFERENCES
Brasnett, P., Mihaylova, L., Canagarajah, N. and Bull, D.,
2005. Particle filtering with multiple cues for object
tracking in video sequences. Proc. of SPIE-IS&T
Electronic Imaging, vol. 5685, pp. 430-441.
Comaniciu, D., Ramesh, V. and Meer, P., 2003. Kernel-
based object tracking. IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 25, no. 5, pp. 564-577.
Elgammal, A., Duraiswami, R., Harwood, D. and Davis,
L., 2002. Background and foreground modeling using
nonparametric kernel density estimation for visual
surveillance. Proceedings of the IEEE, vol. 90, no. 7,
pp. 1151-1163.
Latecki, L., Miezianko, R., and Pokrajac, D., 2004.
Motion detection based on local variation of
spatiotemporal texture. Proceedings of the 2004 IEEE
Computer Society Conference on Computer Vision
and Pattern Recognition Workshops (CVPRW’04), pp.
135-141.
Stauffer, C. and Grimson, W., 2000. Learning patterns of
activity using real-time tracking. IEEE Trans. Pattern
Analysis and Machine Intelligence, vol. 22, no. 8, pp.
747-757.
Xu, L., Landabaso, J. and Lei, B., 2004. Segmentation and
tracking of multiple moving objects for intelligent
video analysis. BT Technology Journal, vol. 22, no. 3,
pp. 140-150.