Human Detection and Tracking under Complex Activities

B. Cancela, M. Ortega and M. G. Penedo

VARPA Group, University of A Coruña, Campus de Elviña s/n, A Coruña, Spain

Keywords:

Background Subtraction, Cascade Classifier, Histogram of Oriented Gradients, Particle Filter, Collision Detection.

Abstract:

Multiple-target tracking is a challenging problem when dealing with complex activities. Situations such as partial occlusions in grouping events or sudden changes in target orientation make detection difficult. In particular, when dealing with human beings, the head is often the only visible part. Techniques based on the upper body achieve good results in general, but fail to provide good tracking accuracy in these situations. We present a new methodology that provides a full tracking system under complex activities. A combination of three different techniques is used to overcome these problems. Experimental results on sport sequences show both the speed and the performance of this technique.

1 INTRODUCTION

In complex scenes, human detection and tracking is a challenging problem that is far from being solved. Since the use of full-body human detection (Desai et al., 2009) is questionable due to possible occlusions, recent systems focus on the head region.

For instance, Rodriguez et al. (2011) introduce a density function to detect every head in a crowded scene. However, their tracking system, based on local points, does not take into account complex activities such as quick human spins. On the other hand, Li et al. (2009) use the head-shoulder omega-shape feature to detect humans. This system cannot track people during human interaction events, such as people grouping, or when sudden orientation changes occur. It also obtains poor results in the presence of image noise. Many other techniques are surveyed in (Zhan et al., 2008).

In this paper we present a new methodology for detecting and tracking people under complex activities. Based on the work of Li et al., we introduce additional information about the motion within the scene to reduce the region to process. In our work, a combination of a Viola-Jones type classifier with a HOG feature based SVM is used to detect every person in the scene, while a particle filter, supported by the human detection system and by an adaptive filter technique that predicts the target position, performs the tracking, achieving good results in both areas.

This paper is organized as follows. Section 2 explains the detection system; Section 3 describes the tracking system; finally, Section 4 shows some experimental results and Section 5 offers some conclusions.

2 DETECTION SYSTEM

To detect every human in the scene, we use a method based on the omega-shape feature (Li et al., 2009). We combine a Viola-Jones type classifier with a Histogram of Oriented Gradients (HOG) feature based SVM. The main reason for introducing two different classifiers is processing speed. The SVM is the better classifier, but the time needed to compute the HOG feature and classify it is high, which makes it unsuitable for real-time scenarios. The Viola-Jones type classifier, on the other hand, is fast but less accurate, since it introduces several false positive detections.

First, the Viola-Jones type classifier is used to detect every possible human being in the scene. This results in a pool of candidate patches, which are then confirmed as human beings using the HOG feature based SVM. However, running the Viola-Jones classifier over the whole scene also takes a large amount of time. Our detection system is therefore divided into three steps: first, a background subtraction technique is used to detect every moving pixel in the scene;


Cancela B., Ortega M. and Penedo M..

Human Detection and Tracking under Complex Activities.

DOI: 10.5220/0004201503700374

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2013), pages 370-374

ISBN: 978-989-8565-48-8

© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

Figure 1: Detection system example, panels (a) and (b). Once the background subtraction technique is applied, we select a pool of bounding boxes containing the blobs in the scene (a). The Viola-Jones type classifier, in combination with a torso estimation, is used to select the possible detections (a), which are finally evaluated using the HOG feature based SVM (b).

then, both the Viola-Jones type classifier and the HOG feature based SVM are used to detect every moving person. We decided not to use optical flow to detect moving pixels because of its processing cost. Since neither the magnitude nor the orientation of the movement is used in this methodology, a background subtraction technique with online updating is used instead. We use the Mixture of Gaussians (MoG) algorithm (Stauffer and Grimson, 1999) with a short history window, so that stopped targets are quickly absorbed into the background. This also makes the system more robust to sudden illumination changes. After that, opening and closing morphological operators are used to reduce the image noise.
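The effect of a short history window can be illustrated with a much simpler model than MoG. The sketch below keeps a single running Gaussian per pixel, which is a deliberate simplification and not the Stauffer and Grimson algorithm; the class name, `learning_rate`, `k` and `initial_var` are illustrative assumptions. A large learning rate plays the role of a short history: a target that stops moving is absorbed into the background after a few frames.

```python
from typing import List

Frame = List[List[float]]  # grayscale image as rows of pixel intensities


class RunningGaussianBackground:
    """Per-pixel running mean/variance; a simplified stand-in for MoG."""

    def __init__(self, first_frame: Frame, learning_rate: float = 0.1,
                 k: float = 2.5, initial_var: float = 25.0) -> None:
        self.mean = [row[:] for row in first_frame]
        self.var = [[initial_var] * len(row) for row in first_frame]
        self.lr = learning_rate  # larger value behaves like a shorter history
        self.k = k               # foreground threshold in standard deviations

    def apply(self, frame: Frame) -> List[List[int]]:
        """Return a binary foreground mask and update the model online."""
        mask = []
        for y, row in enumerate(frame):
            mask_row = []
            for x, value in enumerate(row):
                diff = value - self.mean[y][x]
                is_fg = diff * diff > (self.k ** 2) * self.var[y][x]
                mask_row.append(1 if is_fg else 0)
                # Every pixel is updated, so stopped targets fade into
                # the background model over time.
                self.mean[y][x] += self.lr * diff
                self.var[y][x] += self.lr * (diff * diff - self.var[y][x])
            mask.append(mask_row)
        return mask
```

The opening and closing operators would then be applied to the returned mask; they are omitted here to keep the sketch short.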

Before using the Viola-Jones type classifier, we have to reduce the regions to process. Using the foreground mask produced by the background subtraction technique, we perform a blob detection and, for each blob, create a bounding box containing all its foreground pixels.
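As a minimal sketch of this step, the function below labels 4-connected blobs in a binary foreground mask with a flood fill and returns one bounding box per blob. The connectivity choice and the function name are assumptions made for illustration.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def blob_bounding_boxes(mask: List[List[int]]) -> List[Box]:
    """Return one bounding box per 4-connected foreground blob."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

Only the pixels inside the returned boxes need to be scanned by the cascade classifier, which is where the speed-up comes from.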

Once we have the bounding boxes, we apply the Viola-Jones type classifier within these regions. Details of how to train the classifier can be found in (Viola and Jones, 2001). As a result, a pool of candidate patches is obtained (green rectangles in Fig. 1-(a)). Before applying the HOG feature based SVM, we introduce a restriction based on knowledge of human movement: a head movement detected by the background subtraction technique is always accompanied by a torso movement. So, using a small rectangle at the bottom of the patch, we check the foreground pixels to see whether movement occurs below the head. If no movement is found, the patch is discarded. The height of this rectangle depends on the size of the patch being checked. The blue rectangles in Fig. 1-(a) show an example of the rectangles used to check the torso.
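The torso restriction can be sketched as a foreground count inside a band at the bottom of the patch. The band height as a fraction of the patch height (`height_ratio`) and the minimum foreground fraction (`min_fill`) are hypothetical parameters; the text only states that the rectangle height depends on the patch size.

```python
from typing import List, Tuple

Patch = Tuple[int, int, int, int]  # (x, y, width, height), top-left origin


def has_torso_motion(mask: List[List[int]], patch: Patch,
                     height_ratio: float = 0.25,
                     min_fill: float = 0.2) -> bool:
    """Check for foreground pixels in a band at the bottom of the patch."""
    x, y, w, h = patch
    band_h = max(1, int(h * height_ratio))  # band scales with the patch size
    top = y + h - band_h                    # band sits at the patch bottom
    total = 0
    foreground = 0
    for row in mask[max(0, top): y + h]:
        cells = row[max(0, x): x + w]
        total += len(cells)
        foreground += sum(1 for c in cells if c)
    # Discard the patch when too few pixels moved below the head.
    return total > 0 and foreground / total >= min_fill
```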

The HOG feature based SVM is applied to every remaining patch to confirm that it corresponds to a person. We use the classic HOG technique; feature extraction details can be found in (Dalal and Triggs, 2005). The computational cost of obtaining this feature is high, but it can be reduced using the integral histogram technique (Porikli, 2005). Fig. 1-(b) shows both accepted and rejected patches.
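The idea behind the integral histogram can be sketched as follows: one cumulative table per orientation bin is built once per frame, after which the histogram of any rectangle costs four lookups per bin. The inputs `bin_map` (orientation bin index per pixel) and `weights` (gradient magnitude per pixel) are assumed to be precomputed; the function names are illustrative.

```python
from typing import List


def build_integral_histogram(bin_map: List[List[int]],
                             weights: List[List[float]],
                             n_bins: int) -> List[List[List[float]]]:
    """integral[y][x][b] sums bin b over pixels with row < y and column < x."""
    height = len(bin_map)
    width = len(bin_map[0]) if height else 0
    integral = [[[0.0] * n_bins for _ in range(width + 1)]
                for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            cell = integral[y + 1][x + 1]
            up, left, diag = integral[y][x + 1], integral[y + 1][x], integral[y][x]
            for b in range(n_bins):
                cell[b] = up[b] + left[b] - diag[b]
            cell[bin_map[y][x]] += weights[y][x]
    return integral


def region_histogram(integral: List[List[List[float]]],
                     x0: int, y0: int, x1: int, y1: int) -> List[float]:
    """Histogram of the half-open rectangle [x0, x1) x [y0, y1)."""
    n_bins = len(integral[0][0])
    return [integral[y1][x1][b] - integral[y0][x1][b]
            - integral[y1][x0][b] + integral[y0][x0][b]
            for b in range(n_bins)]
```

Because many candidate patches overlap, sharing one table across all of them avoids recomputing the same cell histograms.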

3 TRACKING SYSTEM

Although tracking systems mainly rely on particle filters, the quality of that method degrades over long sequences, since an estimation error is carried along the frames with no possibility of correcting it. Our tracking system is defined as follows: first, we use the detection system explained above to track the targets as well. If this method fails, we use a particle filter to locate the best target state in the new frame. Finally, we add the new elements found by the detection system.

After running the detection system, we have a pool of accepted patches. We also have, at time t, a set containing all the targets tracked by the system in previous frames. Thus, for every new patch, we create a set containing all the targets that could match the new detection, using the Euclidean distance. If there is only one candidate target, we assign the new patch to it. If there is more than one candidate, a collision occurs. In that case, we compare their respective HOG features using the Bhattacharyya coefficient, which is a good method for tracking non-rigid objects (Comaniciu et al., 2000). The target that obtains the best coefficient is assigned to the new patch position.
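The association rule above can be sketched as a distance gate followed by an appearance comparison. The gating radius `max_distance` and the dictionary layout of `targets` are assumptions for illustration; histograms are normalized inside the coefficient so that raw HOG vectors can be passed in.

```python
import math
from typing import Dict, Hashable, Optional, Sequence, Tuple

Point = Tuple[float, float]


def bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Similarity in [0, 1] between two histograms (normalized internally)."""
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        return 0.0
    return sum(math.sqrt((a / sp) * (b / sq)) for a, b in zip(p, q))


def assign_patch(patch_center: Point, patch_hog: Sequence[float],
                 targets: Dict[Hashable, Tuple[Point, Sequence[float]]],
                 max_distance: float) -> Optional[Hashable]:
    """Return the id of the target that takes the patch, or None."""
    candidates = [tid for tid, (center, _) in targets.items()
                  if math.dist(center, patch_center) <= max_distance]
    if not candidates:
        return None           # no known target nearby
    if len(candidates) == 1:
        return candidates[0]  # unambiguous match
    # Collision: the most similar appearance wins.
    return max(candidates,
               key=lambda tid: bhattacharyya_coefficient(targets[tid][1],
                                                         patch_hog))
```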

We compute a predicted position for each stored target, using a set of linear filters (Adalines) to estimate the target speed, because of their simplicity and their performance under noisy images (Cancela et al., 2011).
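For illustration, an Adaline is a linear combiner whose weights are adapted with the least-mean-squares (Widrow-Hoff) rule. The sketch below predicts the next velocity of one coordinate from the most recent velocities; the filter order, learning rate, initial weights and the helper `predict_position` are assumptions, not the configuration of (Cancela et al., 2011).

```python
from typing import List, Sequence


class Adaline:
    """Linear filter trained online with the LMS rule."""

    def __init__(self, order: int = 3, learning_rate: float = 0.01) -> None:
        # Start as a moving average of the last `order` inputs.
        self.weights: List[float] = [1.0 / order] * order
        self.lr = learning_rate

    def predict(self, inputs: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(self.weights, inputs))

    def update(self, inputs: Sequence[float], target: float) -> float:
        """Move the weights toward the observed target; return the error."""
        error = target - self.predict(inputs)
        self.weights = [w + self.lr * error * x
                        for w, x in zip(self.weights, inputs)]
        return error


def predict_position(last_position: float, velocity_filter: Adaline,
                     recent_velocities: Sequence[float]) -> float:
    """Predicted coordinate = last coordinate + predicted velocity."""
    return last_position + velocity_filter.predict(recent_velocities)
```

One filter per coordinate would be kept for each target, with `update` called once the new position has been confirmed.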

We propose to use a particle-based system that models the target appearance with the extracted local HOG features. Using the linear filters to predict the velocity of each target, we can assume the new target position is described as $z_t = \tilde{z}_t + \omega$, where $\tilde{z}_t$ is the predicted position of the target $z$ at time $t$ and $\omega \sim N(0, \Sigma)$ is Gaussian noise. In our particle filter, we add Gaussian noise to the predicted position to generate a set of particles. The local HOG feature is extracted for each particle and, using the Bhattacharyya coefficient, we choose the position that obtains the best value.
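A minimal sketch of this particle step is given below, assuming a diagonal covariance (independent noise `sigma` per axis) and a caller-supplied `extract_feature` function that returns the local HOG histogram at a position. Both, along with the default particle count, are hypothetical choices for illustration.

```python
import math
import random
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, float]


def bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Similarity in [0, 1] between two histograms (normalized internally)."""
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        return 0.0
    return sum(math.sqrt((a / sp) * (b / sq)) for a, b in zip(p, q))


def best_particle(predicted: Point, model: Sequence[float],
                  extract_feature: Callable[[Point], Sequence[float]],
                  sigma: Point = (4.0, 4.0), n_particles: int = 100,
                  rng: Optional[random.Random] = None) -> Tuple[Point, float]:
    """Sample particles around the prediction and keep the best match."""
    rng = rng or random.Random()
    best_pos, best_score = predicted, -1.0
    for _ in range(n_particles):
        # Particle = predicted position + zero-mean Gaussian noise.
        particle = (rng.gauss(predicted[0], sigma[0]),
                    rng.gauss(predicted[1], sigma[1]))
        score = bhattacharyya_coefficient(model, extract_feature(particle))
        if score > best_score:
            best_pos, best_score = particle, score
    return best_pos, best_score
```

The returned score is the quantity that is later compared against the lost-target threshold.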

Once the position is located, we have to update the appearance model. Because a badly chosen particle could corrupt it, the model should retain information about previous appearances. So, having the target model and

HumanDetectionandTrackingunderComplexActivities

371

[Figure 2, panels (a) to (d): each panel plots Amount # against Frame # (0 to 100) for two series, Number of people detected and Number of detections.]

Figure 2: Detection system stats. With no background subtraction technique (Li et al., 2009), the false positive rate is high (a). Including the background subtraction technique reduces the number of total and false detections (b). Including the torso estimation, the system locates every moving person in the scene while avoiding wrong detections (c), although stopped persons cannot be detected. Initializing the system with the restrictions disabled for a few frames obtains the best results (d).

the HOG feature of the winning particle, $O_{t+1}$, the model is updated using a learning parameter, $\alpha$ (set to 0.05 in our case).
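One common way to realize such an update is an exponential moving average, $M_{t+1} = (1 - \alpha) M_t + \alpha O_{t+1}$. The symbol $M$ for the target model and this particular form are assumptions made for illustration, not necessarily the exact rule used here. With $\alpha = 0.05$, a single bad particle changes the model by at most 5%, so earlier appearances are retained.

```python
from typing import List, Sequence


def update_model(model: Sequence[float], observation: Sequence[float],
                 alpha: float = 0.05) -> List[float]:
    """Blend the winning particle's feature into the model (assumed EMA form)."""
    return [(1.0 - alpha) * m + alpha * o for m, o in zip(model, observation)]
```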

We always accept the best particle. Thus, we introduce a threshold $\tau$ to detect lost targets. If, over consecutive frames, the Bhattacharyya coefficient of the best particle stays below that threshold, we declare the target as lost.
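The lost-target rule can be sketched as a counter of consecutive low-similarity frames. The number of frames required (`max_bad_frames`) is a hypothetical parameter, since no value is given in the text, and the value of `tau` used in the test is likewise only an example.

```python
class LostTargetMonitor:
    """Flags a target as lost after too many consecutive poor matches."""

    def __init__(self, tau: float, max_bad_frames: int) -> None:
        self.tau = tau
        self.max_bad_frames = max_bad_frames
        self.bad_frames = 0

    def observe(self, coefficient: float) -> bool:
        """Feed the best particle's coefficient; return True once lost."""
        if coefficient < self.tau:
            self.bad_frames += 1
        else:
            self.bad_frames = 0  # a good match resets the streak
        return self.bad_frames >= self.max_bad_frames
```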

Once we have successfully tracked all the targets detected in previous frames, we introduce the new targets appearing in the scene. Basically, we define as a new target every patch that has no overlapping region with any of the targets tracked by the two techniques mentioned above.
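This rule reduces to an axis-aligned rectangle intersection test. In the sketch below, rectangles that only touch along an edge are treated as non-overlapping, which is an assumption; the function names are illustrative.

```python
from typing import List, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # (x, y, width, height)


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share a region of positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def new_targets(patches: Sequence[Rect], tracked: Sequence[Rect]) -> List[Rect]:
    """Accepted patches that do not overlap any already tracked target."""
    return [p for p in patches if not any(overlaps(p, t) for t in tracked)]
```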

4 EXPERIMENTAL RESULTS

To perform our experiments we recorded a sport sequence consisting of a 3 × 3 basketball match. More than 15000 frames were recorded in this video, at a resolution of 640 × 368 and 25 frames per second. We divide the test results into two groups: one covering the detection performance and another focused on the tracking system.

Since our method for adding new targets is embedded in the tracking system, we use the whole system to test the detection, without taking into account any information about the identification system. To train both the Viola-Jones type classifier and the RBF kernel SVM we use the generic dataset introduced in (Li et al., 2008). To validate the advantage of our method on the detection task, we compared it with two different configurations, in terms of detection quality and processing speed. First, the implementation of Li et al. (2009) is used. We also test our method with and without torso estimation. We evaluate the detection system over a sequence. Fig. 2 shows the results obtained.

Figure 3: Tracking system example, panels (a) to (f). Without information from the human detection system, the accuracy of the tracking decreases when partial occlusion events occur (a, b, c). A more accurate solution is achieved by introducing that information (d, e, f).

Figure 4: Examples of tracking using different position prediction techniques (panel labels: FD, KF/AF, KF, AF). A simple finite difference scheme (FD) performs poorly in the presence of noise, while both the Kalman filter (KF) and adaptive filters (AF) can deal with it. When occlusion occurs, AFs perform better when recovering a moving object, while KFs obtain better results tracking the stopped one.

Using both the background subtraction information and the torso estimation reduces the sensitivity of the system. However, in video sequences it is very important to keep the false positive rate as low as possible. Although our system can reject good patches, the specificity is close to 100%. The impact of the lower sensitivity is limited, since video sequences provide good chances to detect those patches in successive frames. Table 1 shows the processing speed results on a Pentium Quad Core running at 2.40 GHz with 4 GB of RAM under a Linux operating system. On these images, we increase the number of frames processed per second from 1 up to 4, while improving the accuracy of the detection procedure.
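The frame rates quoted above follow directly from the detection times reported in Table 1 (962 ms and 245 ms per frame). The short check below only restates that arithmetic; the function name is illustrative.

```python
def frames_per_second(ms_per_frame: float) -> float:
    """Convert a per-frame processing time in milliseconds to a frame rate."""
    return 1000.0 / ms_per_frame


dense_scan_fps = frames_per_second(962.0)   # no background subtraction
full_system_fps = frames_per_second(245.0)  # background subtraction + torso
speedup = 962.0 / 245.0                     # detection time ratio
```

This gives roughly 1.04 and 4.08 frames per second, a speed-up of about 3.9 for the detection stage.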

Regarding the prediction system, three different approaches were tested: a simple finite difference scheme, a Kalman filter (Kalman, 1960) and the adaptive filters. As we can see in Fig. 4, the finite difference scheme is too sensitive to noise. Similar results were obtained using both Kalman and adaptive filters. Since their behavior is similar, we decided to use the adaptive filter, as it demands less memory and computation.

Table 1: Time consumed on a 640 × 368 image by the Viola-Jones type classifier in combination with the HOG feature based SVM, and by the tracking system, with or without taking into account the detection system information.

Stage      Detector type                               Time per frame
Detection  No background subtraction (dense scan)      962 ms
Detection  Background subtraction                      255 ms
Detection  Background subtraction + torso estimation   245 ms
Tracking   Particle filter                             32 ms
Tracking   Particle filter + detection information     16 ms

Figure 5: Partial occlusion example. The system is able to detect every person in the collision, even when sudden orientation changes occur.

The system can successfully track multiple targets under sudden movement changes. It can also detect every person within a scene under partial occlusion events. Fig. 5 shows four different players under partial occlusion. In every frame of that collision, the system successfully detects all the targets involved.

5 CONCLUSIONS

In this paper we present a new methodology for tracking people under uncontrolled and complex scenarios. Three different techniques were tested. We can conclude that the proposed system provides better detection.

A sport sequence is used to test our algorithm, including several challenging situations, such as occlusions, grouping events and sudden changes in speed and orientation. While previous approaches failed to deal with these problems, the results show that our system maintains good tracking quality throughout the sequences while also running faster than those approaches. In future research, we plan to introduce a full tracking system with total occlusion recovery.

ACKNOWLEDGEMENTS

This paper has been partly funded by the Consellería de Industria, Xunta de Galicia through grant contracts 10/CSA918054PR and 10TIC009CT.

REFERENCES

Cancela, B., Ortega, M., Penedo, M. G., and Fernández, A. (2011). Solving multiple-target tracking using adaptive filters. In Lecture Notes in Computer Science (ICIAR 2011), volume 6753, pages 416–425.

Comaniciu, D., Ramesh, V., and Meer, P. (2000). Real-time tracking of non-rigid objects using mean shift. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 142–149.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-

dients for human detection. In Computer Vision and

Pattern Recognition, 2005. CVPR 2005. IEEE Com-

puter Society Conference on, volume 1, pages 886 –

893.

Desai, C., Ramanan, D., and Fowlkes, C. (2009). Discrim-

inative models for multi-class object layout. In Com-

puter Vision, 2009 IEEE 12th International Confer-

ence on, pages 229 –236.

Kalman, R. (1960). A New Approach to Linear Filtering

and Prediction Problems. Transactions of the ASME–

Journal of Basic Engineering, 82(Series D):35–45.

Li, M., Zhang, Z., Huang, K., and Tan, T. (2008). Estimat-

ing the number of people in crowded scenes by mid

based foreground segmentation and head-shoulder de-

tection. In Pattern Recognition, 2008. ICPR 2008.

19th International Conference on, pages 1 –4.

Li, M., Zhang, Z., Huang, K., and Tan, T. (2009). Rapid and

robust human detection and tracking based on omega-

shape features. In 16th IEEE International Confer-

ence on Image Processing (ICIP), pages 2545 – 2548.

Porikli, F. (2005). Integral histogram: a fast way to extract

histograms in cartesian spaces. In Computer Vision

and Pattern Recognition, 2005. CVPR 2005. IEEE

Computer Society Conference on, volume 1, pages

829 – 836.

Rodriguez, M., Sivic, J., Laptev, I., and Audibert, J.-Y.

(2011). Density-aware person detection and tracking

in crowds. In Proceedings of the International Con-

ference on Computer Vision (ICCV).

HumanDetectionandTrackingunderComplexActivities

373

Stauffer, C. and Grimson, W. E. L. (1999). Adaptive back-

ground mixture models for real-time tracking. In

IEEE Computer Society Conference on Computer Vi-

sion and Pattern Recognition, volume 2, pages 246–

252.

Viola, P. and Jones, M. (2001). Rapid object detection using

a boosted cascade of simple features. In Computer Vi-

sion and Pattern Recognition, 2001. CVPR 2001. Pro-

ceedings of the 2001 IEEE Computer Society Confer-

ence on, volume 1, pages I–511 – I–518.

Zhan, B., Monekosso, D., Remagnino, P., Velastin, S., and

Xu, L.-Q. (2008). Crowd analysis: a survey. Machine

Vision and Applications, 19:345–357.

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

374