Open Framework for Combined Pedestrian Detection

Floris De Smedt and Toon Goedemé

EAVISE, KU Leuven, Sint-Katelijne-Waver, Belgium

Keywords:

Pedestrian Detection, Real-time, Framework.

Abstract:

Pedestrian detection is a topic in computer vision of great interest for many applications. As a result, a large number of pedestrian detection techniques have been presented in the literature, each one improving on previous techniques. The improvement in accuracy of recent pedestrian detectors commonly comes with a higher computational requirement. Recently, however, a technique was proposed that improves accuracy by combining multiple detection algorithms instead. Since the evaluation speed of this combination depends on the detection algorithms it uses, we provide an open framework that includes multiple pedestrian detection algorithms and the technique to combine them. We show that our open implementation is superior in speed, accuracy and peak memory use when compared to other publicly available implementations.

1 INTRODUCTION

Pedestrian detection is a subject of great interest in recent literature. Over the years a lot of work has been performed on both speed (Dollár et al., 2010; Benenson et al., 2012; Dollár et al., 2014; De Smedt et al., 2013) and accuracy (Benenson et al., 2013; Park et al., 2010). Although most of these techniques are based on similar detection techniques, the contributions they propose are each applied to a single pedestrian detector. These improvements mostly come with an extra computational requirement, and the required computing power is not always available. Recently, (De Smedt et al., 2014) proposed an alternative technique to improve accuracy by combining the detection results of multiple detectors. Their work uses only the detection results, neglecting the evaluation speed of the detection algorithms themselves. In this paper we propose an open framework that provides the whole pipeline, from the image, through running multiple object detection algorithms, to the combination of their results. By reducing the computational requirement of the pedestrian detection algorithms, the combination they are part of also comes at a limited computational cost. Therefore we compare our algorithms with other publicly available implementations on speed, accuracy and peak memory use, and show that our implementations turn out to be superior on these criteria.

The paper is structured as follows. In section 2 we give an overview of existing literature. In section 3 we discuss the pedestrian detection algorithms we implemented. In section 4 we give a detailed insight on how to combine the detection results. Finally, we conclude in section 5.

2 RELATED WORK

Due to the wide applicability of pedestrian detection in a variety of applications (traffic, surveillance, robotics and safety), there has been a lot of research on this topic. In 2005, (Dalal and Triggs, 2005) proposed a technique that uses gradient information for this task. They use a grid of HOG features with a linear Support Vector Machine as classifier, which yielded impressive detection results on the INRIA pedestrian dataset. Datasets from more realistic conditions (such as the Caltech pedestrian dataset (Dollár et al., 2012b)) showed room for improvement in both accuracy and speed.

We can distinguish two fundamental techniques to improve the accuracy of this detector. The model can be extended, so that instead of using a rigid model that searches only for the object as a whole, the model also includes parts (e.g. the limbs of a person). By allowing a limited position deviation of the parts relative to the root model, a certain pose variation is allowed (Felzenszwalb et al., 2008). On the other hand, one can enrich the features used for pedestrian detection by using colour information next to gradient information. This has been done in Integral Channel Features, proposed by (Dollár et al., 2009).


De Smedt F. and Goedemé T.

Open Framework for Combined Pedestrian Detection.

DOI: 10.5220/0005359205510558

In Proceedings of the 10th International Conference on Computer Vision Theory and Applications (VISAPP-2015), pages 551-558

ISBN: 978-989-758-090-1

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

(Dollár et al., 2012b) and (Benenson et al., 2014) give an overview of over 40 different pedestrian detection algorithms in the literature, discussing their evaluation methodology and accuracy. Here we see that most of these techniques are based on the two algorithms we discussed before. One of the most accurate algorithms is described in (Benenson et al., 2013), where each step of the training process of Integral Channel Features is evaluated and optimised. This detector forms the basis of (Mathias et al., 2013), which copes with occlusion.

To allow the use of pedestrian detection in applications, both speed and accuracy need to be very high. The speed can be improved by (a combination of) three approaches. One can use more capable hardware, such as a GPU or a multi-core CPU, to exploit parallelisation, as has been done by (Benenson et al., 2012; De Smedt et al., 2012; De Smedt et al., 2013). Constraining the appearances to a limited number of sizes and positions by using a ground plane assumption and/or tracking reduces the search space (De Smedt et al., 2013; Benenson et al., 2012), and thereby the calculation time needed. A last option is to optimise the algorithms themselves. Some examples are approximating features from calculated ones (Dollár et al., 2014; Dollár et al., 2010), learning a model for each scale (Benenson et al., 2012), using a cascaded evaluation (Felzenszwalb et al., 2010a; Bourdev and Brandt, 2005) and using Crosstalk Cascades, where the detection results at nearby locations are exploited to reduce the number of locations evaluated (Dollár et al., 2012a).

To allow the use of these algorithms, some of the implementations are made publicly available. The implementations of (Benenson et al., 2012; Benenson et al., 2013; Benenson et al., 2014) are combined in a single framework (https://bitbucket.org/rodrigob/doppia). This framework is mostly directed at using a (modern) GPU for fast processing, while using very accurate detection approaches. (Dubout and Fleuret, 2012) made the code available (http://charles.dubout.ch/en/coding.html) for an optimised version of the Deformable Part Model detector, as described in (Felzenszwalb et al., 2008). In contrast to (Felzenszwalb et al., 2010a), where a cascade approach is used for speed improvement, they make use of multi-threading and apply the convolution in the Fourier domain, where it reduces to an element-wise product. However, this implementation comes at the cost of high memory use, as we will point out in section 3. The need for a GPU or a large amount of memory restricts the applicability of these frameworks on, for example, embedded systems. In this paper we propose a complementary framework focussing on CPU implementations with limited memory use.

Recently (De Smedt et al., 2014) proposed a technique to improve the accuracy even further. Instead of tuning a single object detection algorithm, they combine the strengths of different algorithms by combining the detection results. They use measures of confidence (how trustworthy is a detection of a certain detector) and complementarity (how different are two detectors, and so how likely is it that they result in the same detections) to combine the detection scores in a weighted sum. Since the score range can vary drastically between detectors, they use a normalisation step based on the mean and standard deviation of the scores. In this paper we apply this technique to our implementations as an alternative to a computationally intensive, single tuned algorithm.

3 IMPLEMENTATION OF OBJECT DETECTION ALGORITHMS

Each choice that is made in the construction of an object detection algorithm influences the final results it will obtain. (Benenson et al., 2013), for example, show how optimising each step of the training procedure of the Integral Channel Features detector (Dollár et al., 2009) can lead to a drastic improvement in accuracy (the resulting detector is coined the Roerei detector). The discriminative power of the features and the structure of the model are bound to the dimensionality reduction they impose on the huge search space in which pedestrians have to be found. Next to that, the computational complexity, and by consequence the evaluation speed, is of great importance when real-time execution is required. As is shown in (De Smedt et al., 2014), each detector has its own strengths and weaknesses due to the differences in design, and will also lead to different detection results. They provide a technique to combine the detection results of multiple pedestrian detectors to improve detection accuracy, in contrast to the traditional approach of improving a single detector with a higher computational cost as a consequence. The cost of combining multiple detection algorithms depends on the algorithms themselves. In this section we describe the pedestrian detection algorithms we implemented in our framework, and compare them on speed, peak memory use and accuracy. Based on this comparison, we make pairwise combinations of pedestrian detection algorithms in section 4.

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

552

3.1 Traditional Object Detection Approach

Distinguishing the appearance of an object from the background is a difficult task, mostly solved with the same sequence of steps. The image is rescaled multiple times to obtain detections at multiple scales. Rescaling the model instead (or obtaining a model for each scale at which the object appears) is discussed in (Benenson et al., 2012), and is far more complex than rescaling the image. For each scale, features are calculated that emphasise properties of the image capable of distinguishing the object from the background. The last step is to express the similarity between the features and a pre-trained model as a numeric value. All detections with a score above a chosen threshold are seen as containing the object, whereas the ones below the threshold are treated as background. A high threshold leads to only a few detections, but also fewer false detections (a high precision), whereas a low threshold ensures that more appearances of the object are found, but also that more false detections are made (high recall). The accuracy of an object detection algorithm for different thresholds can be expressed in a precision-recall curve. We use this kind of curve extensively in this paper; an example can be seen in figure 3.
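The effect of the threshold on precision and recall can be illustrated with a minimal sketch. The function name, the toy scores and the labels below are invented for illustration and are not part of the framework; the sketch assumes each detection has already been marked as correct or false.

```python
def precision_recall(scores, is_correct, num_annotations, threshold):
    """Precision and recall when only detections scoring at least
    `threshold` are kept.

    scores          -- detection scores, one per detection
    is_correct      -- True where the detection matches an annotation
    num_annotations -- number of annotated objects in the data
    """
    kept = [ok for score, ok in zip(scores, is_correct) if score >= threshold]
    if not kept:
        # Convention chosen here: no detections means no false detections.
        return 1.0, 0.0
    true_positives = sum(kept)
    return true_positives / len(kept), true_positives / num_annotations


# Toy data (invented for illustration): 5 detections, 4 annotated pedestrians.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
is_correct = [True, True, False, True, False]

print(precision_recall(scores, is_correct, 4, threshold=0.7))  # high threshold
print(precision_recall(scores, is_correct, 4, threshold=0.1))  # low threshold
```

With this toy data, the threshold of 0.7 keeps two detections that are both correct (precision 1.0, recall 0.5), while the threshold of 0.1 keeps all five (precision 0.6, recall 0.75). Sweeping the threshold over all score values gives the points of a precision-recall curve.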

3.2 Histogram of Oriented Gradients

The first algorithm we integrated in our framework is Histogram of Oriented Gradients (HOG). This technique, described in (Dalal and Triggs, 2005) for pedestrian detection, makes use of contrast information of the image to recognise the appearance of pedestrians. The model used here forms the root model for the Deformable Part Models detector and is shown on the left of figure 2. The implementation we use is part of OpenCV. For speed improvement, it uses an approach very similar to ours, evaluating the layers of the scale-space pyramid in parallel. For integration in our framework, we convert the detection results to our format. To allow a fair comparison with the other algorithms, we use our own Non-Maximum Suppression implementation instead of the one provided by OpenCV.

3.3 Integral Channel Features

As described in section 2, the Integral Channel Features detector makes use of multiple features for object detection. Each feature is represented as a channel. In the original implementation, 10 channels are used (6 gradient orientation channels, 1 gradient magnitude channel and 3 colour channels). These are shown in figure 1. The model is constructed from a selection of rectangles, each containing the sum of the intensity values from a channel. The rectangles are selected from a huge pool of random rectangles using AdaBoost. To improve the speed, we use a soft cascade (Bourdev and Brandt, 2005), so that after the evaluation of each feature the current score is required to reach a pre-determined threshold to continue the evaluation at that location. For an annotation to count as found, the detection bounding box is required to have an overlap of at least 50% with the annotation. The selection of the threshold used for the soft cascade determines the resulting balance between accuracy and speed. A high threshold leads to more pruning, and so to a higher speed, whereas a lower threshold is more permissive and allows more detections to reach the final stage at the cost of evaluation speed.
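The pruning behaviour of the soft cascade can be sketched as follows. This is a simplified illustration and not the code of the framework: the function name and the weak-classifier outputs are invented, and a single threshold is assumed for all stages, as in the description above.

```python
def soft_cascade(stage_outputs, threshold):
    """Sum the weak-classifier outputs of one candidate window.

    Evaluation stops as soon as the running score drops below
    `threshold`. Returns (score, stages_evaluated); score is None
    when the window was rejected early.
    """
    score = 0.0
    for stage, output in enumerate(stage_outputs, start=1):
        score += output
        if score < threshold:
            return None, stage
    return score, len(stage_outputs)


# Invented weak-classifier outputs for two candidate windows.
pedestrian_like = [0.5, 0.25, 0.5]
background_like = [-0.5, 0.25, 0.5]

print(soft_cascade(pedestrian_like, threshold=0.0))   # survives all stages
print(soft_cascade(background_like, threshold=0.0))   # pruned after stage 1
print(soft_cascade(background_like, threshold=-1.0))  # permissive: not pruned
```

The second window is rejected after a single stage with the stricter threshold, which saves the remaining evaluations, while the permissive threshold lets it reach the final stage with a low score.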

Figure 1: The channels used in the original implementation of (Dollár et al., 2009).

To calculate the features, we use the code released as part of the ACF implementation of (Dollár et al., 2014), which is heavily optimised for the CPU (Dollár, 2013). To improve accuracy, the model used for evaluation is trained with extra space around the annotation. After detection, the bounding box is adjusted back to the original dimensions, which fit the object better. The accuracy and evaluation speed of our implementation are discussed in subsection 3.5.
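The rectangle sums that make up the model are typically computed with an integral image, so that each sum costs four look-ups regardless of the rectangle size. The sketch below illustrates this idea on a tiny invented channel; it is written in plain Python for readability and is not the optimised feature code used in the framework.

```python
def integral_image(channel):
    """Return a (h+1) x (w+1) table in which entry [y][x] holds the sum
    of all channel values above and to the left of position (x, y)."""
    height, width = len(channel), len(channel[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += channel[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, x, y, width, height):
    """Sum of the channel values inside the rectangle with top-left
    corner (x, y), using four look-ups in the integral image."""
    return (table[y + height][x + width] - table[y][x + width]
            - table[y + height][x] + table[y][x])


# Invented 3x3 channel (e.g. a tiny gradient magnitude map).
channel = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
table = integral_image(channel)
print(rect_sum(table, 1, 1, 2, 2))  # 5 + 6 + 8 + 9
```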

3.4 Deformable Part Models

The Deformable Part Model detector we use is based on the vanilla implementation used by (De Smedt et al., 2012) and (De Smedt et al., 2013). It is a C++ port of the Matlab implementation of DPMv4 released in (Felzenszwalb et al., 2010b). Deformable Part Models can be seen as an extension of the HOG model used by (Dalal and Triggs, 2005) with part models. The evaluation of the model can be divided into the search for the root model, representing the object as a whole, and the search for parts (e.g. the limbs of a person). The location of the part models relative to the position of the root model can deviate a little, which allows a certain pose variation in contrast to rigid-model detectors. The model we use is trained on the

OpenFrameworkforCombinedPedestrianDetection

553

INRIA dataset and is visualized in figure 2. Since the part models are a more detailed element of the object, they are searched for at twice the resolution of the root model. This leaves a choice between 3 different implementation approaches to obtain features at twice the resolution.

Figure 2: The model used by (Felzenszwalb et al., 2008;

Felzenszwalb et al., 2010a). The root-model at the left, the

part-models in the middle and the allowed deviation at the

right.

The most intuitive approach is to upscale the image on which the root model is searched (coined DPM Up). Although the information contained in an image cannot be extended, the stability of the model can benefit from this. Another option is to do the opposite, and downscale the image used for the part filters instead (coined DPM Down). Since downscaling is cheaper than upscaling (both in memory and in computational complexity), this seems a good option. It comes with a pitfall, however: the object can only be found at twice the original size of the model. To obtain the same scale range as the previous design, we have to upscale the image to twice the resolution up front, which takes away the advantage. The last option is to not rescale at all, but to use half the cell size to calculate the histograms (coined DPM Half). These three implementation methods are compared in subsection 3.5. As can be seen there, taking half the cell size comes with only a minimal accuracy loss, while being a lot less computationally intensive.

We optimised our implementation by eliminating redundant work, exploiting locality in memory and avoiding global memory to obtain thread-safe code. This allows us to evaluate the layers of the scale-space pyramid in parallel.

3.5 Speed, Memory and Accuracy

In the previous subsections, we described three algorithms separately. The choice of which algorithm to use, independently or as part of a combination, is based on memory use, evaluation time and accuracy. To evaluate these criteria, we use the evaluation framework of (Dollár et al., 2012b; Dollár, 2013) at the Reasonable setting (pedestrians of 50px and higher, with at most 45% occlusion). The 120px height of the Deformable Part Model model imposes the requirement to perform an initial upscale of 2.4 times (we round to 2.5 times) to obtain detections at 50px. Therefore, we also evaluate the speed and memory use at different image sizes. All experiments are performed on the same platform. The amount of parallelisation influences both the evaluation speed and the memory use, but the accuracy remains constant.
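The upscale factor mentioned above follows directly from the ratio between the model height and the smallest pedestrian height that has to be detected. As a worked example, with $s$ used here as a symbol for the upscale factor:

$$s = \frac{120\,\mathrm{px}}{50\,\mathrm{px}} = 2.4$$

Rounding up to $s = 2.5$ maps a pedestrian of 50px to $50 \cdot 2.5 = 125$px, which is at least the model height of 120px. At this factor, an image of 640x480 pixels grows to $640 \cdot 2.5 = 1600$ by $480 \cdot 2.5 = 1200$ pixels, which illustrates why speed and memory use at larger image sizes matter.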

The HOG implementation of OpenCV (which we use) has the model embedded in the code, in contrast to our algorithms, which read the model from a text file. The accuracy we obtain is visualised in figure 3. As we can observe, the OpenCV implementation improves its accuracy at higher thresholds compared to the original accuracy results.

[Plot: Comparison of standard HOG with OpenCV implementation. Axes: Recall, Precision. Legend: 70.62% HOG-OpenCV, 58.38% HOG.]

Figure 3: Accuracy comparison between the original HOG and the implementation of OpenCV we use.

We trained an Integral Channel Features model ourselves, based on the INRIA pedestrian training set. The complete images are used instead of the normalised ones, and each annotation is rescaled to 41x100 pixels (the size of our inner model). The outer model (annotation plus extra spacing) is chosen to be 64x128 pixels, as has been done in the ACF training code of (Dollár et al., 2014; Dollár, 2013). We used a 2048-stage model, where each stage uses a level-two decision tree as weak classifier. To obtain a complete PR-curve, we used a permissive threshold for the soft cascade. In figure 4 we compare the accuracy results we obtain with the ones available in the evaluation framework, as obtained by (Dollár et al., 2009). We can observe that our implementation performs slightly better than the original Matlab implementation.

The model used for our Deformable Part Model detector is also trained on INRIA, and comes with the original Matlab release (Felzenszwalb et al., 2010b). We used a Matlab script to convert the .mat file to a text file that can be used by our implementation.

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

554

[Plot: Comparison of standard Integral Channel Features with our implementation. Axes: Recall, Precision. Legend: 76.43% ICF-Ours, 75.78% ChnFtrs.]

Figure 4: Accuracy comparison between our implementation of Integral Channel Features and the results from (Dollár et al., 2009).

This is a model of 40x120 pixels. As described in subsection 3.4, there are 3 methods to obtain the features for the evaluation of the root model and the part models. In figure 5 we compare these 3 options with the original Matlab implementation (Latv4-cc) and the implementation of (Dubout and Fleuret, 2012) (FFLD). As we can see, the accuracy results we obtain are all very similar.

[Plot: Comparison of Deformable Part Models implementations. Axes: Recall, Precision. Legend: 78.23% FFLD, 77.49% DPM-Half, 77.38% DPM-Up, 77.36% DPM-Down, 76.53% Latv4-cc.]

Figure 5: Comparison of the accuracy results of Deformable Part Models.

In table 1 and table 2 we compare the evaluation speed and memory use, respectively, of each implementation in our framework. The improvement in accuracy of both Deformable Part Models and Integral Channel Features over Histogram of Oriented Gradients comes at the cost of a lower speed and a higher peak memory use. When we compare our implementations of Deformable Part Models with FFLD, we can point out that we need only a fraction of the memory needed by FFLD, while the variant using half the cell size is more than twice as fast as FFLD at VGA resolution.

Based on the criteria we just evaluated, we can

Table 1: Comparison of evaluation speed of the algorithms we provide.

            640x480      1280x960     1600x1200
HOG         15.1 fps     4.1 fps      2.66 fps
ICF         10.71 fps    2.12 fps     1.37 fps
DPM Half    8.63 fps     1.30 fps     0.82 fps
DPM Up      5.58 fps     0.82 fps     0.45 fps
DPM Down    2.98 fps     0.74 fps     0.48 fps
FFLD        3.828 fps    0.96 fps     0.67 fps

Table 2: Comparison of memory-use when running the pedestrian detection algorithms.

            640x480      1280x960     1600x1200
HOG         42.9 MB      141.5 MB     243.4 MB
ICF         163.1 MB     446.4 MB     659.4 MB
DPM Half    82.72 MB     240.3 MB     429.2 MB
DPM Up      102 MB       253 MB       442 MB
DPM Down    120 MB       332.6 MB     537.2 MB
FFLD        790 MB       3.0 GB       4.7 GB

select the best pair of detectors to combine. For accuracy, it will be better to combine our Integral Channel Features detector with our Deformable Part Models detector, while for speed it may be a better choice to combine Histogram of Oriented Gradients with Integral Channel Features. In section 4 we discuss how to combine detectors, and evaluate the pairwise combinations of our implementations on the same criteria (evaluation speed, peak memory use and accuracy).

4 COMBINATION

In this section, we will discuss how to combine the

detectors we described in section 3. The steps to

obtain a combined detection result are visualized in

ﬁgure 6. The ﬁrst steps are performed as described

in section 3, where a scale-space-pyramid is created

from the source image. Each layer of the scale-space

pyramid is then processed by a pedestrian detection

algorithm. The next step is to normalise the detec-

tion scores. This is required, since each detector has a

different score-range. The normalisation of detection

scores is described in subsection 4.1. After normali-

sation, we have multiple options. We can just throw

the detections (before NMS) in one big pool and treat

them as coming from a single detector. This is de-

scribed in subsection 4.2. A more accurate alternative

is performing a smart combination from the detection

results after NMS, which is described in subsection

4.3.

OpenFrameworkforCombinedPedestrianDetection

555

Figure 6: Overview of combination approach.

4.1 Normalisation

The normalisation of detection scores is necessary,

since the range of detection scores can differ drasti-

cally between different detection algorithms. Here we

use the standard score approach as has been used by

(De Smedt et al., 2014). The equation looks as fol-

lows:

$$S_{norm} = \frac{S - \mu}{\sigma}$$

This results in all detection scores being positioned around zero. For this approach a working point has to be defined, and all detections with a detection score above the working point are taken into account to calculate the mean and standard deviation.
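A minimal sketch of this normalisation step is given below. It assumes that the mean and standard deviation are estimated from the scores above the working point and then applied to all scores of that detector, and it uses the population standard deviation; both choices, as well as the function name and the example scores, are assumptions made for illustration.

```python
from statistics import mean, pstdev


def normalise_scores(scores, working_point):
    """Standard score normalisation of the scores of one detector.

    The mean and standard deviation are estimated only from the scores
    above `working_point` and then applied to every score.
    """
    selected = [s for s in scores if s > working_point]
    if len(selected) < 2:
        raise ValueError("need at least two scores above the working point")
    mu = mean(selected)
    sigma = pstdev(selected)  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("scores above the working point are all equal")
    return [(s - mu) / sigma for s in scores]


# Invented raw scores of one detector.
raw = [1.0, 2.0, 3.0, 4.0, 5.0]
normalised = normalise_scores(raw, working_point=2.5)
print(normalised)
```

In this example only the scores 3.0, 4.0 and 5.0 lie above the working point, so the score 4.0 is mapped to zero and the other scores are expressed in standard deviations around it.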

Next to the score, we also have to equalise the aspect ratio of the bounding boxes between the detectors. The aspect ratios of the bounding boxes of ICF and DPM (Integral Channel Features and Deformable Part Models) differ slightly. We found empirically that altering the bounding boxes while keeping the height constant yields the best accuracy results.
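Equalising the aspect ratio while keeping the height constant could look as follows. The text only states that the height is kept; changing the width symmetrically around the horizontal centre of the box is an assumption of this sketch, as are the function name and the example values.

```python
def equalise_aspect_ratio(box, target_ratio):
    """Give a bounding box the aspect ratio `target_ratio` (width / height).

    box is (x, y, width, height). The height and vertical position are
    kept; the width is changed around the horizontal centre of the box
    (the centring is an assumption made for this example).
    """
    x, y, width, height = box
    new_width = height * target_ratio
    centre_x = x + width / 2.0
    return (centre_x - new_width / 2.0, y, new_width, height)


# A box of 40x100 pixels brought to a width/height ratio of 0.5.
print(equalise_aspect_ratio((10, 20, 40, 100), target_ratio=0.5))
```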

4.2 Pool Combination

With the normalised scores, we can treat the resulting detections of all detectors equally. This means that we can simply put the detections together into the same pool and perform Non-Maximum Suppression over all detections found by the different detectors. As we will show in subsection 4.4, the accuracy results depend on the accuracy of the algorithms that are combined. Due to normalisation, the detectors are treated equally, so the difference in accuracy between them is lost. The combination approach described in subsection 4.3 takes this difference into account in the form of the confidence value.
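The pool combination can be sketched in a few lines. The Non-Maximum Suppression of the framework is not detailed in the text, so a common greedy variant based on intersection over union is used here as a stand-in; the function names, the overlap limit of 0.5 and the example detections are assumptions.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x, y, w, h, ...)."""
    ax, ay, aw, ah = a[:4]
    bx, by, bw, bh = b[:4]
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    return inter / (aw * ah + bw * bh - inter)


def non_maximum_suppression(detections, max_overlap=0.5):
    """Greedy NMS: keep the best-scoring detection of each overlapping group."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(overlap(det, k) <= max_overlap for k in kept):
            kept.append(det)
    return kept


def pool_combination(results_per_detector, max_overlap=0.5):
    """Put the (normalised) detections of all detectors in one pool and
    treat them as coming from a single detector."""
    pool = [det for results in results_per_detector for det in results]
    return non_maximum_suppression(pool, max_overlap)


# Invented detections as (x, y, width, height, normalised score).
icf = [(10, 10, 40, 100, 1.2), (200, 50, 40, 100, 0.3)]
dpm = [(12, 11, 40, 100, 0.9)]
combined = pool_combination([icf, dpm])
print(combined)
```

In this example both detectors fire on the same pedestrian; the pool keeps only the higher-scoring of the two boxes, regardless of which detector produced it.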

4.3 The Combinator

Recently (De Smedt et al., 2014) proposed a technique to combine the results of different object detection algorithms, which they apply to the detection of pedestrians. The combined score they obtain is formed by a weighted sum of normalised detection scores, where each score is weighted by two coefficients, the confidence and the complementarity. The confidence is a measure of how well a certain detector works on its own, while the complementarity measures the differences between detection results. If detectors are based on completely different feature pools, they will most probably result in different detections, meaning that a combined detection has a higher chance of being correct than when very similar detectors lead to the same detection. They use the following equation to

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

556

obtain the final detection score:

$$S_{final} = \sum_{i=1}^{n} conf(i) \cdot compl(i) \cdot \left(S_{norm}(i) + Q\right)$$

The Q added to the normalised score is required when the standard score approach is used for normalisation, to avoid the presence of negative normalised scores, which would decrease the weighted sum.

The same working point is used to acquire the mean and standard deviation for normalisation (as described in subsection 4.1) as to obtain the confidence and complementarity coefficients. The confidence is defined as the area between the origin and the working point. Here, we simplify the weighted sum formula by eliminating the complementarity value, since it carries no additional information when used for a pairwise combination.
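The simplified weighted sum for a pairwise combination can be sketched as follows. The sketch assumes that the detections of the two detectors have already been matched to the same candidate pedestrian, with None marking a detector that did not fire there; the function name, the confidence values and the value of Q are invented for illustration.

```python
def combined_score(normalised_scores, confidences, q):
    """Weighted sum of normalised scores without complementarity.

    normalised_scores -- one score per detector for the same candidate,
                         or None when that detector did not fire
    confidences       -- one confidence coefficient per detector
    q                 -- offset that keeps every term non-negative
    """
    total = 0.0
    for score, confidence in zip(normalised_scores, confidences):
        if score is None:
            continue
        total += confidence * (score + q)
    return total


# Hypothetical confidences for two detectors and a hypothetical Q.
confidences = [0.75, 0.5]
q = 2.0

both = combined_score([0.5, 1.0], confidences, q)   # both detectors fire
single = combined_score([0.5, None], confidences, q)  # only the first fires
print(both, single)
```

With these values a candidate found by both detectors scores 3.375, against 1.875 when only the first detector fires. Without the offset, a matching detection with a negative normalised score would lower the sum instead of raising it, which is what Q is meant to prevent.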

4.4 Evaluation of Speed and Accuracy

Finally, we compare the implementations we have proposed as part of our framework. The accuracy of the object detection algorithms, and of the models we use, is already discussed in subsection 3.5, but is shown again in comparison with the combination techniques we described earlier. In figure 7 we evaluate the use of pool combination as described in subsection 4.2. As we can observe, the combination of Histogram of Oriented Gradients with either Deformable Part Models or Integral Channel Features does not improve on the accuracy of the more accurate detector of the pair. This is due to the fact that the difference in accuracy between the detectors is ignored. When we combine Integral Channel Features and Deformable Part Models on the other hand, which have more or less equal accuracy, we obtain a big improvement in accuracy (80.13% against 77.49% and 76.43% for the individual detectors).

In ﬁgure 8, we compare the accuracy results we

[Plot: Comparison of our implementations of Pool Combination. Axes: Recall, Precision. Legend: 80.13% Pool-ICF+DPM, 77.49% DPM-Half, 76.43% ICF-Ours, 76.04% Pool-HOG+DPM, 71.71% Pool-HOG+ICF, 70.62% HOG-OpenCV.]

Figure 7: Comparison of the accuracy of the implementations we propose.

[Plot: Comparison of our implementations of The Combinator. Axes: Recall, Precision. Legend: 81.48% Combinator-ICF+DPM, 78.20% Combinator-HOG+ICF, 78.00% Combinator-HOG+DPM, 77.49% DPM-Half, 76.43% ICF-Ours, 70.62% HOG-OpenCV.]

Figure 8: Comparison of the accuracy of the implementations we propose.

[Plot: Axes: Recall, Precision. Legend: 81.48% TheCombinatorICFDPM, 81.18% Roerei, 80.23% MultiResC.]

Figure 9: Comparison of our most accurate combination with the state-of-the-art.

obtain by using the smarter combination approach de-

scribed in subsection 4.3. Here we can point out that

the accuracy beneﬁts from combining. The additional

value of Histogram of Oriented Gradients is not that

big though.

Table 3 and table 4 show the evaluation speed and peak memory use, respectively, of our combination approaches. As we can see, the evaluation speed and peak memory use of a combination depend on the slowest algorithm in the combination. When, for example, Integral Channel Features is combined with Deformable Part Models, part of the CPU cores are assigned to the Deformable Part Models algorithm. Since Deformable Part Models does not need as much memory, the peak memory use is less than the sum of that of Integral Channel Features and Deformable Part Models separately.

When we compare the combination of Deformable Part Models and Integral Channel Features with state-of-the-art detectors (figure 9), it can be seen that the combination reaches impressive accuracy results. Although the PR-curve crosses the ones of Roerei (Benenson et al., 2013) and MultiRes (Park et al., 2010), it does so only at the point where only half the detections are correct (a precision of 0.5). For most applications the required precision is significantly higher. For MultiRes, no evaluation speed is mentioned in (Park et al., 2010), and (Benenson et al., 2013) (the Roerei detector) claims an evaluation speed of 5 Hz to 20 Hz while using GPU hardware.

Table 3: Comparison of evaluation speed of the combined algorithms.

                       640x480      1280x960     1600x1200
HOG                    15.1 fps     4.1 fps      2.66 fps
ICF                    10.71 fps    2.12 fps     1.37 fps
DPM Half               8.63 fps     1.30 fps     0.82 fps
Pool HOG+DPM           6.8 fps      0.57 fps     0.31 fps
Pool HOG+ICF           7.65 fps     1.60 fps     1.03 fps
Pool ICF+DPM           7 fps        0.54 fps     0.29 fps
Combinator HOG+DPM     6.8 fps      0.57 fps     0.31 fps
Combinator HOG+ICF     7.57 fps     1.57 fps     1.03 fps
Combinator ICF+DPM     6.78 fps     0.53 fps     0.29 fps

Table 4: Comparison of memory-use when running the pedestrian detection algorithms.

                       640x480      1280x960     1600x1200
HOG                    42.9 MB      141.5 MB     243.4 MB
ICF                    163.1 MB     446.4 MB     659.4 MB
DPM Half               82.72 MB     240.3 MB     429.2 MB
Pool HOG+DPM           80.2 MB      261 MB       424 MB
Pool HOG+ICF           162 MB       428 MB       657 MB
Pool ICF+DPM           105 MB       318 MB       505 MB
Combinator HOG+DPM     82.2 MB      264 MB       424 MB
Combinator HOG+ICF     174 MB       420 MB       694 MB
Combinator ICF+DPM     124 MB       336 MB       482 MB

5 CONCLUSION

In this paper we present, for the first time, a full-pipeline implementation of detection combination as an open framework. In contrast to the traditional approach of improving detection accuracy by optimising a single detector, we combine multiple pedestrian detectors instead, a technique proposed in (De Smedt et al., 2014). For this we use the Histogram of Oriented Gradients implementation of OpenCV together with our own implementations of Integral Channel Features and of Deformable Part Models. On the criteria of evaluation speed, peak memory use and accuracy, we obtained superior results compared to publicly available (CPU) implementations. The accuracy obtained by combining Deformable Part Models with Integral Channel Features is impressive compared to state-of-the-art detectors, which are far more computationally intensive.

Code for this framework is available at http://eavise.be/AbnormalBehaviour, and can be used for research purposes.

REFERENCES

Benenson, R., Mathias, M., Timofte, R., and Van Gool, L. (2012). Pedestrian detection at 100 frames per second. In CVPR. IEEE.

Benenson, R., Mathias, M., Tuytelaars, T., and Van Gool, L. (2013). Seeking the strongest rigid detector. In CVPR. IEEE.

Benenson, R., Omran, M., Hosang, J., and Schiele, B. (2014). Ten years of pedestrian detection, what have we learned?

Bourdev, L. and Brandt, J. (2005). Robust object detection via soft cascade. In CVPR, volume 2. IEEE.

Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR, volume 1. IEEE.

De Smedt, F., Struyf, L., Beckers, S., Vennekens, J., De Samblanx, G., and Goedemé, T. (2012). Is the game worth the candle? Evaluation of OpenCL for object detection algorithm optimization. PECCS.

De Smedt, F., Van Beeck, K., Tuytelaars, T., and Goedemé, T. (2013). Pedestrian detection at warp speed: Exceeding 500 detections per second. In CVPRW. IEEE.

De Smedt, F., Van Beeck, K., Tuytelaars, T., and Goedemé, T. (2014). The combinator: optimal combination of multiple pedestrian detectors. In ICPR.

Dollár, P. (2013). Piotr's image and video Matlab toolbox (PMT). Software available at: http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.

Dollár, P., Appel, R., Belongie, S., and Perona, P. (2014). Fast feature pyramids for object detection. PAMI, 36(8).

Dollár, P., Appel, R., and Kienzle, W. (2012a). Crosstalk cascades for frame-rate pedestrian detection. In ECCV.

Dollár, P., Belongie, S., and Perona, P. (2010). The fastest pedestrian detector in the west. In BMVC.

Dollár, P., Tu, Z., Perona, P., and Belongie, S. (2009). Integral channel features. In BMVC, volume 2.

Dollár, P., Wojek, C., Schiele, B., and Perona, P. (2012b). Pedestrian detection: An evaluation of the state of the art. PAMI, 34.

Dubout, C. and Fleuret, F. (2012). Exact acceleration of linear object detectors. In ECCV. Springer.

Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In CVPR. IEEE.

Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. (2010a). Cascade object detection with deformable part models. In CVPR. IEEE.

Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. (2010b). Discriminatively trained deformable part models, release 4. http://people.cs.uchicago.edu/~pff/latent-release4/.

Mathias, M., Benenson, R., Timofte, R., and Van Gool, L. (2013). Handling occlusions with franken-classifiers. In ICCV. IEEE.

Park, D., Ramanan, D., and Fowlkes, C. (2010). Multiresolution models for object detection. In ECCV. Springer.

VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications

558