Real-time Accurate Pedestrian Detection and Tracking in Challenging
Surveillance Videos

Kristof Van Beeck¹ and Toon Goedemé¹,²
¹EAVISE, Campus De Nayer - KU Leuven, J. De Nayerlaan 5, 2860 Sint-Katelijne-Waver, Belgium
²ESAT-PSI, KU Leuven, Kasteel Arenbergpark 10, 3100 Heverlee, Belgium

Keywords: Pedestrian Detection, Tracking, Surveillance, Computer Vision, Real-time.
Abstract:
This paper proposes a novel approach for real-time robust pedestrian tracking in surveillance images. Such images are challenging to analyse since the overall image quality is low (e.g. low resolution and high compression). Furthermore, wide-angle lenses in a birds-eye viewpoint are often used to achieve maximum coverage with a minimal number of cameras. These specific viewpoints make it difficult - or even unfeasible - to directly apply existing pedestrian detection techniques. Moreover, real-time processing speeds are required. To overcome these problems we introduce a pedestrian detection and tracking framework which exploits and integrates these scene constraints to achieve excellent accuracy results. We performed extensive experiments on challenging real-life video sequences concerning both speed and accuracy. We show that our approach achieves excellent accuracy results while still meeting the stringent real-time demands of these surveillance applications, using only a single-core CPU implementation.
1 INTRODUCTION
Reliable pedestrian detection and tracking in surveil-
lance images opens up a wide variety of applica-
tions (e.g. abnormal behaviour detection, path pre-
diction, intruder detection, people safety on e.g. mov-
able bridges and crowd counting). In recent years,
tremendous advances concerning pedestrian detec-
tion were published. Current state-of-the-art detec-
tors achieve excellent accuracy results on publicly
available datasets (see section 2). Unfortunately, di-
rectly applying these existing techniques on challeng-
ing surveillance images is not a trivial task. This
is due to the inherent nature of these surveillance
applications; often a large number of cameras are
utilised since large areas need to be covered com-
pletely. Such scenarios impose severe constraints on
the hardware: low-cost cameras are employed with
wide-angle lenses, mounted high in a partly down-
looking birds-eye view. Consequently, image process-
ing and analysis on these images is challenging.
Indeed, typical surveillance images are often cap-
tured at low-resolution and use high compression.
Classic background subtraction based object detec-
tion methods yield very noisy results at these high
compression ratios. Moreover, these techniques do
not differentiate between people and other objects.
Due to their specific viewpoint (and wide-angle lens)
standard pedestrian detectors - which are trained and
evaluated on forward-looking images - are also unable
to give accurate detection results on these images.
Additionally, due to perspective effects some pedes-
trians to be detected appear very small in the image,
which remains one of the most challenging tasks for
current pedestrian detectors (Dollár et al., 2012). Fur-
thermore, real-time processing speeds are required. In
this paper we propose a flexible and fast pedestrian
detection and tracking framework specifically ad-
dressing these challenging surveillance images. See
figure 1 for a typical example frame of the pub-
licly available surveillance dataset we used (CAVIAR project, 2005). Our approach achieves excellent accuracy results at real-time processing speeds.

Figure 1: Example frame of one of the sequences of the CAVIAR dataset (CAVIAR project, 2005).

We over-
come the above mentioned challenges by the integra-
tion of three modalities: foreground segmentation ap-
proaches, the exploitation of scene constraints and an
accurate pedestrian detector. This is done as follows.
First, candidate regions in the image are generated.
Using a calibrated scene distortion model, an early
rejection of false patches is achieved. Next the can-
didate regions are warped to a standard viewing an-
gle and used as input for a state-of-the-art pedestrian
detector. As explained in section 3 our approach al-
lows for the use of a highly accurate pedestrian de-
tector which would otherwise be too computationally
intensive for real-time applications. Finally, the de-
tections are employed in a tracking-by-detection ap-
proach to further increase the accuracy. Note that with our approach the actual scene calibration is trivial to perform. We demonstrate the effec-
tiveness of our approach on challenging surveillance
video sequences, and present extensive accuracy and
speed results. Our approach is generalisable to other
object classes. The remainder of this paper is struc-
tured as follows. In section 2 we discuss related work
on this topic, and distinguish our approach from exist-
ing work. Section 3 presents our framework in detail.
Next we present experimental results on challenging
sequences in section 4. Finally, in section 5 we con-
clude our work and give final remarks on future work.
2 RELATED WORK
Pedestrian detection and tracking in general is a very
active research topic. (Dalal and Triggs, 2005) ini-
tially proposed the use of Histograms of Oriented
Gradients (HOG) for pedestrian detection. Their
insights paved the way for numerous derived ap-
proaches; even today most state-of-the-art pedestrian
detectors still rely on HOG features albeit in a more
subtle manner (e.g. in combination with other fea-
tures). A well-known example is the work of (Felzen-
szwalb et al., 2008). As opposed to the rigid model
introduced by Dalal and Triggs they propose the in-
clusion of parts (representing e.g. the limbs or head
of a pedestrian) to increase detection accuracy, coined
the Deformable Part Models (DPM). In later work the
authors tackled the inevitable increase in computa-
tional complexity by introducing a cascaded approach
in which a fast rejection of negative detection win-
dows is possible (Felzenszwalb et al., 2010). An ex-
tension was proposed using grammar models to cope
with partial occlusion (Girshick et al., 2012). (Gir-
shick and Malik, 2013) published a new and fast train-
ing methodology for DPM models. As opposed to
enriching the model with parts, (Dollár et al., 2009a)
introduced the use of a rigid model with additional
features, called Channel Features (ChnFtrs).
All previously mentioned approaches employ a
sliding window paradigm: to cope with scale varia-
tions a scale-space feature pyramid is calculated and
each layer is evaluated at each location. To speed up detection, (Dollár et al., 2010) proposed an approach which approximates features at nearby scales to avoid a full feature pyramid calculation. Several other techniques have been proposed to speed up detection: using model scaling instead of image scaling, GPU imple-
mentations (Benenson et al., 2012b) and search space
minimisation techniques (Benenson et al., 2012a; Cho
et al., 2012; Pedersoli et al., 2013). For several years,
the DPM approaches remained among the top per-
forming methods (Dollár et al., 2009b; Dollár et al., 2012). However, the need for parts in pedestrian detection remains unclear (Benenson et al., 2014). Indeed, recent work on optimised rigid models - e.g. Roerei (Benenson et al., 2013) and ACF (Dollár et al., 2014) - in fact outperforms the DPM detectors.
In (Girshick et al., 2014a) the authors present the
use of convolutional neural networks (R-CNN) for
object detection, achieving unprecedented state-of-
the-art accuracy results. This methodology existed for
a long time, but its applicability to image classifica-
tion tasks was highlighted by the work of (Krizhevsky
et al., 2012). Interestingly, their method steps away
from the traditional sliding window approach, and
utilises region proposals as input for the deep learn-
ing classifiers. Although currently not real-time,
their framework is able to classify a large variety of
classes simultaneously, making it ideal for large image database retrieval applications such as ImageNet (Russakovsky et al., 2014). Recently (Gir-
shick et al., 2014b) presented a hybrid approach com-
bining DPMs with CNNs, called DeepPyramid DPM.
Several pedestrian tracking algorithms exist. Due
to recent advances in object detection techniques,
tracking-by-detection has become increasingly popu-
lar. There, an object detector is combined with a re-
liable tracking algorithm (e.g. particle filtering); see
for example (Breitenstein et al., 2011). Concerning
existing work on pedestrian tracking in surveillance
images, many approaches either operate on standard viewpoint
and/or high-resolution images (Benfold and Reid,
2011; Singh et al., 2008), or employ thermal cam-
eras to facilitate segmentation to reduce the search
area (Leykin and Hammoud, 2010).
In previous work we presented a real-time pedes-
trian detection framework for similar viewpoint im-
ages which are captured with a blind-spot camera
mounted on a real truck (Van Beeck et al., 2012).
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
326
Figure 2: Overview of our detection pipeline. After a first foreground segmentation step we extract region proposals which
potentially contain pedestrians. Each region is warped to an upright fixed-height patch. Next, a highly accurate pedestrian
detector is evaluated at a single scale. Finally, the detections are retransformed and tracked.
These images are - apart from the viewpoint - chal-
lenging since the camera is moving. However, in this
work we can fully exploit and integrate foreground
segmentation methods to increase both accuracy and
speed. Furthermore, we work with images captured
from genuine surveillance cameras. These images are
of low-resolution, low-quality and, due to the use of
wide-angle lenses show large amounts of distortion
and contain non-trivial viewpoints. Existing work on
the same dataset either employs clustering algorithms
with GPU optimisation (Orts-Escolano et al., 2014),
or focusses on motion analysis by matching trained
silhouette models (Rogez et al., 2014a). We differ
significantly from these previous works: we devel-
oped an accurate tracking framework in which we can
employ a highly accurate pedestrian detector on these
challenging images, and thus perform much better
than existing methods. We achieve real-time process-
ing speeds on a single-core CPU implementation. Our
approach easily lends itself for multi-threaded imple-
mentation if higher computational speeds are needed.
3 ALGORITHM OVERVIEW
Running standard pedestrian detectors such as the
Deformable Part Models on surveillance images as
shown in figure 1 is unfeasible. Current pedestrian
detectors are only trained on upright pedestrians at
a fixed height. Scale invariance is achieved using a
scale-space pyramid. Thus in order to achieve de-
cent detections on these surveillance images the de-
tectors ought to run on multiple rotations and scales
of the same surveillance image, using both dense rota-
tion and scale steps. Evaluating the total 4D rotation-
scale search space in real-time evidently is impos-
sible. Nonetheless, the use of a pedestrian detec-
tor could significantly increase the accuracy, as op-
posed to standard techniques which, due to time constraints, only rely on e.g. background subtraction with blob analysis. Therefore, to overcome these challenges
we propose the integration of a foreground segmen-
tation approach with a scene model and a highly ac-
curate pedestrian detector. Our approach allows for
the detection of pedestrians in challenging viewpoints
(e.g. rotated) under large lens distortion at low com-
putational complexity, with very high accuracy. To
retrieve the scene model, a simple one-time calibra-
tion procedure is performed; no explicit lens or cam-
era calibration is needed. Our algorithm briefly works
as follows. As seen in figure 1, pedestrians appear
rotated and scaled based on their position in the im-
age. We exploit this scene knowledge throughout our
detection and tracking pipeline. For each input im-
age, after a preliminary segmentation, we generate
region proposals which potentially contain pedestri-
ans. The scene model is used to reduce the number
of region proposals. Next, based on the position in
the image we warp each valid potential region to an
upright and fixed-height image patch. These patches
are given as input to a state-of-the-art pedestrian de-
tector, which evaluates a pedestrian model on a single
scale only. This is the key advantage of our work:
since only one scale and position needs to be evalu-
ated we can use a highly accurate pedestrian detector
which would otherwise be too time-consuming. Fur-
thermore this approach allows for the detection of ex-
tremely small pedestrians, if the detection model is
powerful enough. The detections are retransformed to
the original input image, and employed in a tracking-
by-detection framework to associate pedestrian tracks
and handle missing detections. Since each region
can be evaluated independently, a fast multi-threaded
Real-timeAccuratePedestrianDetectionandTrackinginChallengingSurveillanceVideos
327
implementation of this approach is trivial. Figure 2
shows an overview of our approach. In the next sub-
sections we describe further details of each step in our
pipeline, and motivate important design choices.
3.1 Foreground Segmentation
First we perform a foreground segmentation step to
identify moving regions in the static camera images.
Several segmentation approaches are applicable rang-
ing from basic background subtraction methods to
more advanced motion estimation methods. Since
we employ scene constraints further on to reduce the
number of region proposals, our approach allows for
the use of a coarse segmentation. For this step we
thus prefer low computational complexity over high
accuracy, excluding time-consuming techniques (e.g.
optical flow). Hence, we rely on background esti-
mation techniques, which generate a statistical model
of the scene. Several popular methods exist. Since
a comprehensive comparison of these techniques is
out of the scope of this work, we refer to (Benezeth
et al., 2008) and (Parks and Fels, 2008) for a de-
tailed overview. Concerning background subtraction,
the main challenges in typical surveillance images
arise from changing lighting conditions and cam-
era shake. Based on these comparative works we
opted for the method of (Zivkovic and van der Heij-
den, 2006), which employs Gaussian Mixture Models
(GMM). These methods have proven to cope well
with (limited) background motion. Their proposed
method is an extension of the original GMM where
the number of Gaussian components per pixel is auto-
matically selected. This effectively reduces memory
requirements and increases the computation speed,
making it ideal for this application. A qualitative seg-
mentation output example is shown in the overview
figure (fig. 2).
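For illustration, a minimal sketch of this segmentation step is given below in Python with OpenCV, whose MOG2 subtractor is based on the same method of (Zivkovic and van der Heijden, 2006). The parameter values and the input file name are illustrative only, not those of our actual implementation.

import cv2

# MOG2 implements the adaptive GMM of (Zivkovic and van der Heijden, 2006),
# with an automatically selected number of Gaussian components per pixel.
# history and varThreshold are illustrative values.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=False)

cap = cv2.VideoCapture("surveillance_sequence.mpg")  # hypothetical file name
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Returns a binary foreground mask; a learning rate of -1 lets
    # OpenCV derive it automatically from the history length.
    fg_mask = subtractor.apply(frame, learningRate=-1)
cap.release()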
3.2 Modelling Scene Constraints
As previously mentioned, the pedestrians in the
surveillance images appear rotated and scaled. Since
the position of the surveillance camera is fixed with
respect to the ground plane both parameters only de-
pend on the position in the image. If we know the
rotation and average pedestrian height for each pixel
position x = [x, y] we can exploit this scene knowl-
edge to achieve fast and accurate pedestrian detec-
tion, similar to (Van Beeck et al., 2012). During the
generation of the region proposals this information
can be used to reject regions which diverge too much
from the expected region properties, thus limiting the
search regions. For each valid proposed region, we
use the transformation parameters to warp each patch
to an upright, fixed-scale image patch, allowing the
use of an accurate pedestrian detector whilst being
real-time. To retrieve these transformation parameters
a one-time offline calibration needs to be performed
(see figure 3). However, the scene calibration as pro-
posed here is trivial to perform. For this,
we extracted the rotation and height of each annotated
pedestrian from the dataset, giving the scale and rota-
tion for that specific point. Next we interpolated the
datapoints using a second-order 2D polynomial function f_i(x) for both parameters:

f_i(x) = p_0 + p_1 x + p_2 y + p_3 x^2 + p_4 xy + p_5 y^2    (1)

Both f_scale(x) and f_rotation(x) are used as lookup functions (LUFs): at each position in the image they define the expected region properties and transformation parameters.
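Since equation (1) is linear in its coefficients p_0 to p_5, each LUF can be fitted by plain linear least squares. The sketch below (NumPy; illustrative rather than our actual Matlab implementation) fits one LUF from the annotated calibration points.

import numpy as np

def fit_luf(points, values):
    # points: (N, 2) array of annotated positions [x, y];
    # values: (N,) array of the measured property (height or rotation).
    # Builds the design matrix of equation (1) and solves for p_0..p_5.
    x, y = points[:, 0], points[:, 1]
    design = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
    coeffs, _, _, _ = np.linalg.lstsq(design, values, rcond=None)
    return coeffs

def eval_luf(p, x, y):
    # Evaluate the fitted polynomial of equation (1) at position (x, y).
    return p[0] + p[1]*x + p[2]*y + p[3]*x**2 + p[4]*x*y + p[5]*y**2

At run time, eval_luf(f_scale, x, y) and eval_luf(f_rotation, x, y) then return the expected pedestrian height and rotation at any pixel position.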
3.3 Generation of Region Proposals
In a next step we refine the segmentation and generate
region proposals which need to be warped and evalu-
ated using our single-scale pedestrian detector. Since
we employ a pedestrian detector in the next stage
to validate each region we are allowed to propose
more regions than needed, i.e. regions without
pedestrians. An accurate detector should indeed
negatively classify such patches. However, it is
important to reject false patches early, since they lead
to useless computations and lower processing speeds.
This stage thus strikes a balance between accuracy and speed, generating an optimal number of
search locations. Figure 4 gives an overview of our
region proposal calculation. Let us now discuss each
consecutive step in this pipeline.
First Elimination. As a preprocessing step, we first eliminate noise in the segmentation which remained after the background subtraction step (due to e.g. changing lighting conditions). This is simply done using morphological opening.

Figure 3: A one-time calibration step is needed. The transformation parameters are extracted from the annotations.

Figure 4: Our region proposals pipeline. After foreground segmentation and noise removal a first blob elimination is performed. Next we perform region growing using a distance transform. Finally, we determine the optimal search points.

Next, we perform a connected component analysis (using
8-connectivity), and test the local scene model for
each blob. That is, we construct a bounding box of
the expected scale and rotation around the centroid of
each blob. We reject two types of regions: extremely
small ones (25 pixels or less) due to the low signal-to-noise ratio there (drawn in magenta in the second step of fig-
ure 4), and those that diverge from an area constraint
(drawn in red). For this constraint, we require that
the area of the connected component should be
larger than a minimal percentage of the expected
area (15%). This step eliminates most invalid regions.
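A sketch of this elimination step could look as follows (Python/OpenCV; the assumed 1:2 width-to-height ratio for the expected bounding box area is an illustrative simplification).

import cv2
import numpy as np

MIN_BLOB_AREA = 25     # reject extremely small blobs (25 pixels or less)
MIN_AREA_RATIO = 0.15  # blob must cover at least 15% of the expected area

def first_elimination(fg_mask, expected_height):
    # expected_height: callable (x, y) -> pedestrian height in pixels,
    # i.e. the scale LUF of section 3.2.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    clean = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)  # noise removal

    # Connected component analysis with 8-connectivity.
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(
        clean, connectivity=8)

    candidates = []
    for i in range(1, n):  # label 0 is the background
        area = stats[i, cv2.CC_STAT_AREA]
        if area <= MIN_BLOB_AREA:
            continue  # too small: dominated by noise
        cx, cy = centroids[i]
        h = expected_height(cx, cy)
        expected_area = 0.5 * h * h  # assumes a 1:2 aspect ratio (illustrative)
        if area >= MIN_AREA_RATIO * expected_area:
            candidates.append((cx, cy))
    return clean, candidates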
Region Growing. In the case of insufficient contrast, the foreground segmentation performs suboptimally (i.e. it tends to split a valid pedestrian into multiple blobs, as seen for the largest pedestrian in figure 4). For
each remaining valid region we therefore perform
region growing based on the Euclidean distance
transform, joining regions nearby. This has a second
advantage: multiple pedestrians which are nearby are
joined into a single detection region, even if one of
them was removed after the first elimination. This is
also illustrated in figure 4: after the first elimination
only one of both small pedestrians is maintained.
However, after region growing both are connected.
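This merging step can be sketched with OpenCV's distance transform: background pixels within a given distance of a foreground blob are absorbed, so blobs closer than twice that distance become connected. The fixed radius below is a simplification; in our pipeline the neighbourhood would scale with the expected pedestrian size from the LUFs.

import cv2
import numpy as np

def region_growing(mask, radius=10):
    # Distance of each background pixel to the nearest foreground pixel
    # (cv2.distanceTransform measures the distance to the nearest zero
    # pixel, hence the inverted mask).
    dist = cv2.distanceTransform(cv2.bitwise_not(mask), cv2.DIST_L2, 5)
    # Absorb background pixels closer than `radius`, joining nearby blobs.
    return np.where(dist < radius, 255, 0).astype(np.uint8)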
Defining Search Points. Finally, we define exact
search locations where the pedestrian detector will
be applied. This is done as follows. Each remaining
region is again verified against the scene constraints
since, due to the previous step, these regions could
have grown significantly. This is the case when
multiple (possibly previously invalid) regions are
joined. Note that we do not reject regions at this
stage. We locally evaluate each region and use the
expected height and rotation to estimate the number
of possible pedestrians. Based on the size of the
region we first evaluate if multiple search points
are necessary for this region. If so, we define a
linear grid over the entire region of which the step
size depends on the ratio of the expected and actual
region parameters, and eliminate grid points which
are located outside the segmented region.
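The grid construction can be sketched as follows (illustrative; the actual step-size heuristics in our implementation differ in detail).

def define_search_points(region_mask, bbox, expected_w, expected_h):
    # bbox: (x0, y0, w, h) of the grown region; expected_w/expected_h:
    # expected pedestrian extent at this position (from the LUFs).
    x0, y0, w, h = bbox
    nx = max(1, int(round(w / expected_w)))  # number of grid columns
    ny = max(1, int(round(h / expected_h)))  # number of grid rows
    points = []
    for i in range(nx):
        for j in range(ny):
            px = int(x0 + (i + 0.5) * w / nx)
            py = int(y0 + (j + 0.5) * h / ny)
            if region_mask[py, px]:  # drop points outside the segmentation
                points.append((px, py))
    return points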
The final region proposals are visualised as the
green rectangles shown in the rightmost image in fig-
ure 4. As seen, our regions accurately predict pos-
sible pedestrians in the image. This is the power of
this approach: by combining foreground segmenta-
tion and scene model constraints the search space for
the computationally expensive pedestrian detector can
be enormously restricted. Slight deviations from the
exact pedestrian position are allowed since we employ
a sliding-window approach in the final warped patch.
3.4 Warping Patches
Our scene model has another advantage: for each
image location we know how a pedestrian is locally
distorted. Each region proposal is warped to an up-
right pedestrian at a fixed-scale. Using this approach
we are able to accurately detect even rotated and ex-
tremely small pedestrians, using a single-scale pedes-
trian detector only. The region proposals I are warped such that I_warp = TI, where the transformation matrix T consists of a similarity transformation (rotation, scaling and translation) of which the parameters are extracted from the LUFs:

T = \begin{pmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{pmatrix}    (2)
Note that the optimal scale to which the patches
are warped highly depends on which pedestrian de-
tector is used. This is discussed in the next section,
where we motivate the choice of pedestrian detector
and determine the optimal scale.
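In OpenCV terms, the upper 2x3 rows of T map directly onto an affine warp. The sketch below builds this warp from the LUF outputs; the output patch width of 70 pixels and the centring logic are illustrative assumptions.

import cv2

def warp_patch(image, center, expected_height, angle_deg,
               out_h=140, out_w=70):
    # Scale factor mapping the expected pedestrian height at this position
    # to the fixed model height (140 px, cf. section 3.5).
    s = out_h / float(expected_height)
    # Rotation by angle_deg around `center`, combined with scaling: this
    # gives the upper 2x3 part of T in equation (2).
    M = cv2.getRotationMatrix2D(center, angle_deg, s)
    # Shift so the pedestrian lands centred in the fixed-size output patch.
    M[0, 2] += out_w / 2.0 - center[0]
    M[1, 2] += out_h / 2.0 - center[1]
    return cv2.warpAffine(image, M, (out_w, out_h))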
3.5 Pedestrian Detector
The warped image patches are now classified by a
pedestrian detector. In fact, the method described
in the previous sections is generic and can be com-
bined with each existing pedestrian detection algo-
rithm. As discussed in section 2, recent R-CNN based
detection methods currently achieve top accuracy re-
sults concerning object detection in general. How-
ever, their performance is far from real-time, and they
are more suited for multi-class large database retrieval
tasks. Rigid pedestrian detectors (such as ChnFtrs)
Real-timeAccuratePedestrianDetectionandTrackinginChallengingSurveillanceVideos
329
Figure 5: Pedestrian model used in our implementation. (L)
Root model. (M) Different parts. (R) Deformation costs.
currently offer the best trade-off between speed and
accuracy when a full scale-space pyramid needs to
be constructed. However, since we need to evaluate
a single scale only, no scale-space pyramid needs to
be constructed. Therefore we are able to use an accu-
rate pedestrian detector which would otherwise be too
time-consuming, such as the Deformable Part Mod-
els. Moreover, since a rigid model does not allow
for any deformation, using it in our single-scale ap-
proach is even unfeasible in a direct manner. Since
natural slight height variations exist between pedes-
trians (and due to small calibration errors), the de-
tection accuracy significantly drops when using these
models on a single-scale. Given this information, we
opted to use the cascaded DPM model (Felzenszwalb
et al., 2010). When used out-of-the-box this detec-
tor works as follows. First a scale-space pyramid is
constructed in which for each layer HOG features are
calculated resulting in a feature pyramid. Next, this
pyramid is evaluated using a sliding window with the
pedestrian model shown in figure 5. A pedestrian is
represented as a root model (left), several parts rep-
resenting e.g. the limbs and head which are calcu-
lated at twice the resolution of the root model (mid-
dle), and a deformation cost which penalises large
deviations from the expected part locations (right).
Figure 6: The accuracy versus the pedestrian height and detection threshold for the single-scale cascaded DPM detector.
Figure 7: The optimal threshold slice displaying the accuracy versus the pedestrian height.
The responses of both root filter and part filters are
summed to give a final detection score. We altered
this detector into a single-scale only implementation,
and performed experiments to simultaneously deter-
mine the optimal scale factor to which the region pro-
posals need to be warped, and the optimal detection
threshold. This is done as follows. We extracted
about 6000 annotated pedestrians from the CAVIAR
dataset and warped them to different scales (heights).
Combined with 6000 negative patches we calculated
the accuracy as a function of the height and detection
score threshold. The results are shown in figure 6.
As can be seen, at low resolutions the accuracy drops
significantly, since only very limited spatial informa-
tion is available. At high resolution similar behaviour
is seen, since the pedestrians no longer match the detection model. Concerning the detection threshold, the
detection accuracy is low at both high values (high
false negative rate) and low values (high false pos-
itive rate). Figure 7 displays the optimal threshold
slice extracted from figure 6. The accuracy is almost
constant between 130 and 170 pixels. However, at larger
pedestrian heights the detection time significantly in-
creases. We therefore used 140 pixels as our optimal
rescale height to which the region proposals will be
warped such that a one-scale pedestrian model can be
directly applied.
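The threshold selection itself is straightforward once detector scores on the warped positive and negative patches are available; a small sketch follows (illustrative, with the scores assumed precomputed for one candidate height).

def best_operating_point(pos_scores, neg_scores, thresholds):
    # Given detector scores on warped positive and negative patches at one
    # fixed height, return the threshold maximising accuracy (cf. fig. 6-7).
    best_t, best_acc = None, 0.0
    total = len(pos_scores) + len(neg_scores)
    for t in thresholds:
        tp = sum(s >= t for s in pos_scores)  # correctly accepted pedestrians
        tn = sum(s < t for s in neg_scores)   # correctly rejected negatives
        acc = (tp + tn) / float(total)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

Sweeping this over all candidate warp heights yields the surface of figure 6, from which the 140 pixel operating point was selected.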
3.6 Tracking
The resulting detections are then retransformed to the
input image coordinates. Next a non-maxima sup-
pression step is performed, in which overlapping de-
tections are filtered; only the highest scoring detection
is kept. To link detections over multiple frames and
to cope with occasional missing detections we inte-
grate our approach in a tracking-by-detection frame-
work. For this we employ the well-known Kalman filter (Kalman, 1960).

Figure 8: Qualitative tracking example on two of the evaluation sequences (top and bottom row). See http://youtu.be/kWoKBPQoeQI for a video.

For each new detection, a Kalman filter is initialised. We employ a constant velocity motion model. The state vector x_k consists of the centre of mass of each detection and its velocity: x_k = (x, y, v_x, v_y)^T. Our process matrix A thus equals:

A = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}    (3)
Using this motion model we predict the position of the
pedestrians in the next frame. When a new frame is
processed, we try to match each running tracker with
a new detection as follows. We construct a circular
region - based on the scale of that tracked detection -
around the estimated new centroid. If the centroid of
a new detection is found within that region, the detec-
tion is associated with this track, and the Kalman filter
is updated. If multiple detections are found, we take
the closest one based on the Euclidean distance. If no
detection can be associated with a running track, we
update the Kalman filter with its estimated position.
If this occurs for multiple frames in a row, the track is
discarded. For detections without an associated track,
evidently a new Kalman filter is instantiated. Furthermore, the exact size of the bounding boxes is averaged over multiple frames. See figure 8 for two qualitative tracking sequences of our proposed algorithm.
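A minimal per-track sketch of this logic using OpenCV's cv2.KalmanFilter is shown below; the noise covariances and the gating radius are illustrative values, not tuned parameters from our implementation.

import cv2
import numpy as np

def make_track(cx, cy):
    # 4-state constant velocity filter; we measure only the centroid (x, y).
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)  # matrix A, eq. (3)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2      # illustrative
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1  # illustrative
    kf.statePost = np.array([[cx], [cy], [0], [0]], np.float32)
    return kf

def track_step(kf, detection, gate_radius):
    # Predict, then gate: associate only detections whose centroid falls
    # inside a circular region around the predicted position.
    pred = kf.predict()
    if detection is not None:
        dx, dy = detection[0] - pred[0, 0], detection[1] - pred[1, 0]
        if dx * dx + dy * dy <= gate_radius ** 2:
            kf.correct(np.array([[detection[0]], [detection[1]]], np.float32))
            return pred
    # Missing (or rejected) detection: update with the estimated position.
    kf.correct(pred[:2])
    return pred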
4 EXPERIMENTS AND RESULTS
We performed extensive experiments concerning
both speed and accuracy on the publicly available
CAVIAR dataset (CAVIAR project, 2005). This
dataset was recorded at the entrance lobby of the IN-
RIA labs with a wide-angle camera lens. The images
are taken with a resolution of 384 × 288 at 25 frames
per second, and are compressed using MPEG2. See
figure 1 for an example frame. The dataset is di-
vided into six different scenarios: walk, browse, meet,
leave bags, rest and fight. Each scenario is again
subdivided into multiple sequences, making a total
of 28 sequences. We used all sequences of the first
four scenarios (20 sequences in total) for evaluation,
thus excluding the less common rest and fight sce-
narios. In total, our evaluation set consists of about
19650 frames, containing about 29450 annotations.
Our algorithm is implemented in Matlab, with time-
consuming parts (e.g. the detection and transforma-
tion) in C and OpenCV (using mexopencv as inter-
face). Our test hardware consists of an Intel Xeon
E5 CPU which runs at 3.1 GHz. All speed tests are
performed on a single CPU core. However, a multi-
threaded CPU implementation to further increase the
processing speed is trivial.
4.1 Accuracy
Figure 9 displays the accuracy results of our algo-
rithm, using a precision-recall curve. We give re-
sults for the four scenarios mentioned above, rang-
ing from easy (e.g. walk - limited number of per-
sons) through difficult (e.g. meet - multiple persons
with occlusions). We exclude small pedestrians (less than 20 pixels high) from the annotations, and remove annotations
in the top left corner of the image (on the balcony)
and the bottom left corner of the image (people be-
hind the covered reception desk). Furthermore we
discard annotations close to the image border, since
the pedestrians are not completely visible there (the
annotation is strict and already starts when part of a
pedestrian enters the frame). The solid lines in fig-
ure 9 indicate the accuracy without tracking, whereas the dotted lines show the accuracy with tracking.

Figure 9: The accuracy of our algorithm over the CAVIAR dataset, shown as precision versus recall for the Walk, Browse, Leave bags and Meet scenarios and for all sets combined. Solid lines indicate the results without tracking, dotted lines include tracking.

The black curves indicate the total average accuracy over
the entire evaluation set. To indicate the difficulty,
in figure 10 we display some extracted annotations
which are warped to an upright position. As can be
seen, these low-resolution output images contain se-
vere compression artifacts. Even for humans they
are sometimes difficult to recognize as a pedestrian.
However, we achieve excellent accuracy results given
the strict dataset annotations and the challenging nature
of these images. Since our tracker handles missing
detections, the accuracy significantly improves.
4.2 Speed
The exact calculation time depends on the number of
region proposals per image. Figure 11 therefore dis-
plays the speed of our algorithm (in frames per sec-
ond), versus the number of region proposals. Evi-
dently, the processing speed decreases when multiple
region proposals need to be evaluated. However, even
at e.g. four region proposals we still achieve 17 fps.
Over the entire evaluation set we achieve an average
of 32 frames per second, indicated with the dotted red
line. Note that all experiments are performed
on a single CPU core. In fact, each region proposal
can be evaluated independently, thus allowing for an easy multi-threaded implementation.

Figure 10: Example of warped annotations. Low-resolution and high-compression artifacts are noticeable.

Figure 11: The processing speed of our algorithm (in frames per second) versus the number of region proposals.

Figure 12 visu-
alises the individual calculation times for each impor-
tant step in the entire algorithm pipeline, for a vary-
ing number of region proposals. As visualised, gen-
eration of the region proposals takes about 15-20ms.
The warping operation is very fast: on average 1ms
per region proposal is needed. Concerning the pedes-
trian evaluation step, the average feature calculation
time per region is about 3ms whereas the model eval-
uation takes 4ms. The time needed to retransform the
coordinates is negligible.
4.3 Comparative Evaluation
Figure 13 illustrates the accuracy improvement we
achieved as compared to a basic background subtrac-
tion technique, i.e. interpreting the foreground blobs
that are large enough as pedestrians. As seen, on
these challenging images these naive methods yield
poor results. The inclusion of our scene model and
the application of a state-of-the-art pedestrian detec-
tor raises the accuracy enormously.
Figure 12: An overview of the calculation time for each step in the algorithm (region proposal generation, warping, feature calculation, model evaluation, coordinate retransformation, and tracking and overhead) versus the number of region proposals; total frame times range from 15.7 ms with zero region proposals to 49.4 ms with three.

Figure 13: The obtained accuracy improvement compared to a naive background subtraction approach, shown as precision versus recall for plain background subtraction, our region proposals with pedestrian detection, and our final approach including tracking.

A quantitative comparison with other work us-
ing precision-recall curves on this dataset is difficult,
since to the best of our knowledge no such accuracy
results similar to our work exist. Existing work on
these specific sequences of the CAVIAR dataset of-
ten focusses on activity recognition (e.g. fight) and
anomaly detection. However, (Rogez et al., 2014b)
present accuracy experiments using tracking failure
measurements on 11 tracks of the CAVIAR dataset.
For this, the authors consider a track lost if the track-
ing failed for 20 frames or more. In their work
a multi-hypothesis tracking approach (particle filter)
is used. They achieve a tracking failure percentage
of 33.64% with N = 20 particles and 16.82% when
N = 50. Using our approach we achieve a tracking
failure of 9.1% on the same sequences relying only
on a single hypothesis tracker (Kalman filter).
As a final qualitative analysis, we compare our ap-
proach with a naive detection approach; that is, run-
ning the standard deformable part model detector on
all scales and all rotations. For this, we need to up-
scale the image five times (the smallest pedestrian to
be detected is only 25 pixels high, and the height of
the detection model equals 120 pixels), and use a rota-
tion step size of 10 degrees. Using this approach, the
calculation time for a single frame increases to about
13 minutes. Figure 14 displays the detections found
using this naive approach (left), and the output of our
algorithm (right). As seen, the naive approach yields
several false positives and fails to detect all pedestri-
ans. Our algorithm achieves excellent accuracy re-
sults with minimal computational cost (89ms for this
frame).
Figure 14: Qualitative comparison between running a de-
tector on all scales and rotations (left) versus the output of
our algorithm (right).
5 CONCLUSIONS
We presented a fast and accurate pedestrian detection
and tracking framework targeting challenging surveil-
lance videos. Our proposed algorithm integrates
foreground segmentation methods with scene con-
straints to generate region proposals, which are then
warped and evaluated by a single-scale pedestrian de-
tector. Using this approach we can employ a highly
accurate pedestrian detector for non-trivial camera-
viewpoint images where existing pedestrian detectors
fail, while still achieving real-time performance. We
performed extensive evaluation experiments concern-
ing both accuracy and speed on the publicly avail-
able CAVIAR dataset. This dataset consists of typi-
cal low-resolution high-compression surveillance im-
ages taken with a wide-angle lens from a challeng-
ing viewpoint. We show that our approach achieves
both excellent accuracy and processing speeds using a
single-core CPU implementation only. Furthermore,
our proposed method easily lends itself to a multi-
threaded implementation. Currently, the scene cali-
bration is based on annotation data. In the future we
plan to investigate if an automated calibration method
can be implemented (using e.g. an offline exhaustive
search over all scales and rotations).
ACKNOWLEDGEMENTS
The authors would like to acknowledge that the
dataset used here is from the EC Funded CAVIAR
project/IST 2001 37540 (CAVIAR project, 2005).
REFERENCES
Benenson, R., Mathias, M., Timofte, R., and Van Gool, L.
(2012a). Fast stixels computation for fast pedestrian
detection. In ECCV, CVVT workshop, pages 11–20.
Benenson, R., Mathias, M., Timofte, R., and Van Gool, L.
Real-timeAccuratePedestrianDetectionandTrackinginChallengingSurveillanceVideos
333
(2012b). Pedestrian detection at 100 frames per sec-
ond. In Proceedings of CVPR, pages 2903–2910.
Benenson, R., Mathias, M., Tuytelaars, T., and Van Gool, L.
(2013). Seeking the strongest rigid detector. In Proc.
of CVPR, pages 3666–3673, Portland, Oregon.
Benenson, R., Omran, M., Hosang, J., and Schiele, B.
(2014). Ten years of pedestrian detection, what have
we learned? In ECCV, CVRSUAD workshop.
Benezeth, Y., Jodoin, P.-M., Emile, B., Laurent, H., and
Rosenberger, C. (2008). Review and evaluation of
commonly-implemented background subtraction al-
gorithms. In Pattern Recognition, 2008. ICPR 2008.
19th International Conference on, pages 1–4. IEEE.
Benfold, B. and Reid, I. (2011). Stable multi-target tracking
in real-time surveillance video. In CVPR, pages 3457–
3464.
Breitenstein, M. D., Reichlin, F., Leibe, B., Koller-Meier,
E., and Van Gool, L. (2011). Online multiper-
son tracking-by-detection from a single, uncalibrated
camera. IEEE PAMI, 33(9):1820–1833.
CAVIAR project (2005). The CAVIAR project: Context
aware vision using image-based active recognition.
http://homepages.inf.ed.ac.uk/rbf/CAVIAR/.
Cho, H., Rybski, P., Bar-Hillel, A., and Zhang, W. (2012).
Real-time pedestrian detection with deformable part
models. In IEEE Intelligent Vehicles Symposium,
pages 1035–1042.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In Proceedings of CVPR,
volume 2, pages 886–893.
Dollár, P., Appel, R., Belongie, S., and Perona, P. (2014).
Fast feature pyramids for object detection. IEEE PAMI.
Dollár, P., Belongie, S., and Perona, P. (2010). The fastest
pedestrian detector in the west. In Proceedings of
BMVC, pages 68.1–68.11.
Dollár, P., Tu, Z., Perona, P., and Belongie, S. (2009a). Inte-
gral channel features. In Proc. of BMVC, pages 91.1–
91.11.
Dollár, P., Wojek, C., Schiele, B., and Perona, P. (2009b).
Pedestrian detection: A benchmark. In Proceedings
of CVPR, pages 304–311.
Dollár, P., Wojek, C., Schiele, B., and Perona, P. (2012).
Pedestrian detection: An evaluation of the state of the
art. IEEE PAMI, 34:743–761.
Felzenszwalb, P., Girshick, R., and McAllester, D. (2010).
Cascade object detection with deformable part mod-
els. In Proceedings of CVPR, pages 2241–2248.
Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008).
A discriminatively trained, multiscale, deformable
part model. In Proceedings of CVPR.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014a).
Rich feature hierarchies for accurate object detection
and semantic segmentation. In Computer Vision and
Pattern Recognition.
Girshick, R., Felzenszwalb, P., and McAllester, D. (2012).
Discriminatively trained deformable part models, re-
lease 5. IEEE Transactions on Pattern Analysis and
Machine Intelligence.
Girshick, R. B., Iandola, F. N., Darrell, T., and Malik, J.
(2014b). Deformable part models are convolutional
neural networks. CoRR, abs/1409.5403.
Girshick, R. B. and Malik, J. (2013). Training deformable
part models with decorrelated features. In IEEE Inter-
national Conference on Computer Vision, ICCV 2013,
Sydney, Australia, December 1-8, 2013.
Kalman, R. (1960). A new approach to linear filtering
and prediction problems. Transactions of the ASME
Journal of Basic Engineering, 82:35–45.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in Neural Information Pro-
cessing Systems 25, pages 1097–1105. Curran Asso-
ciates, Inc.
Leykin, A. and Hammoud, R. (2010). Pedestrian tracking
by fusion of thermal-visible surveillance videos. Ma-
chine Vision and Applications, 21(4):587–595.
Orts-Escolano, S., Garcia-Rodriguez, J., Morell, V., Ca-
zorla, M., Azorin, J., and Garcia-Chamizo, J. M.
(2014). Parallel computational intelligence-based
multi-camera surveillance system. Journal of Sensor
and Actuator Networks, 3(2):95–112.
Parks, D. H. and Fels, S. S. (2008). Evaluation of back-
ground subtraction algorithms with post-processing.
In Advanced Video and Signal Based Surveillance,
2008. AVSS’08. IEEE Fifth International Conference
on, pages 192–199. IEEE.
Pedersoli, M., Gonzalez, J., Hu, X., and Roca, X. (2013).
Toward real-time pedestrian detection based on a de-
formable template model. In IEEE ITS.
Rogez, G., Orrite, C., Guerrero, J. J., and Torr, P. H. S.
(2014a). Exploiting projective geometry for view-
invariant monocular human motion analysis in man-
made environments. Computer Vision and Image Un-
derstanding, 120:126–140.
Rogez, G., Rihan, J., Guerrero, J. J., and Orrite, C. (2014b).
Monocular 3D gait tracking in surveillance scenes.
IEEE Transactions on Systems Man and Cybernetics
Part B (Cybernetics), 44(6):894–909.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., Berg, A. C., and Fei-Fei, L. (2014). Ima-
genet large scale visual recognition challenge.
Singh, V. K., Wu, B., and Nevatia, R. (2008). Pedestrian
tracking by associating tracklets using detection resid-
uals. In Motion and video Computing, 2008. WMVC
2008. IEEE Workshop on, pages 1–8. IEEE.
Van Beeck, K., Tuytelaars, T., and Goedemé, T. (2012). A
warping window approach to real-time vision-based
pedestrian detection in a truck’s blind spot zone. In
Proceedings of ICINCO.
Zivkovic, Z. and van der Heijden, F. (2006). Efficient adap-
tive density estimation per image pixel for the task of
background subtraction. Pattern recognition letters,
27(7):773–780.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
334