Analysis of Regionlets for Pedestrian Detection

Niels Ole Salscheider

, Eike Rehder

and Martin Lauer

FZI Forschungszentrum Informatik at Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe, Germany

Department of Measurement and Control (MRT), Karlsruhe Institute of Technology (KIT), 76131 Karlsruhe, Germany

salscheider@fzi.de, eike.rehder@kit.edu, martin.lauer@kit.edu

Keywords:

Pedestrian Detection, Regionlets, Perception.

Abstract:

Human detection in camera images is an important task for many autonomous robots as well as automated

driving systems. The Regionlets detector was one of the best-performing approaches for pedestrian detection

on the KITTI dataset when we started this work in 2015.

We analysed the Regionlets detector and its performance. This paper discusses the improvements in accuracy

that were achieved by the different ideas of the Regionlets detector. It also analyses what the boosting algo-

rithm learns and how this relates to the expectations.

We found that the random generation of regionlet conﬁgurations can be replaced by a regular grid of region-

lets. Doing so reduces the dimensionality of the feature space drastically but does not decrease detection

performance. This translates into a decrease in memory consumption and computing time during training.

1 INTRODUCTION

Many autonomous robots as well as automated driv-

ing systems require an accurate and fast pedestrian

detector. Such systems can only avoid collisions if

the locations of all pedestrians can be determined pre-

cisely.

Even though there are very reliable techniques for

face detection, pedestrian detection from camera im-

ages remains a challenging problem. This is because

the body of a pedestrian is deformable and the rela-

tive positions of body parts can vary to some degree.

In contrast, faces are mostly rigid.

Another challenge is that pedestrians can be seen

from arbitrary angles in road scenes. This problem

requires either an ensemble of different detectors for

the different viewing angles or an approach that can

cope with the difference in appearance.

Earlier approaches to solve these tasks include

Deformable Part Models (DPM) (Felzenszwalb et al.,

2008) and Bag of Words (BoW) (Vogel and Schiele,

2007) models. Both methods can handle some de-

gree of deformation by breaking objects into parts.

DPM learns a root ﬁlter for the complete object and

ﬁlters for all object parts. For detection, an exhaus-

tive search for all of these ﬁlters is performed. DPM

also learns latent variables to describe the relative lo-

cations of these parts. These are used to calculate the

ﬁnal detection score.

BoW methods extract features from image patches

and cluster the extracted features to determine the

words in the codebook. Then a histogram over the

occurrence of these words is calculated and used as

input for a classiﬁer. In contrast to DPM, BoW meth-

ods ignore the spatial relationships between the im-

age patches. This makes the detector insensitive to

deformation but a precise localisation of the object is

difﬁcult.

A newer approach that also tries to break objects

into parts is Regionlets (Wang et al., 2013). It was

one of the best-performing techniques for pedestrian

detection on the KITTI dataset (Geiger et al., 2012)

when we started this work in 2015. During training,

this approach generates a large number of regions that

cover the complete detection window. A boosting al-

gorithm is used to learn which regions are relevant for

the classiﬁcation problem. The idea behind this is that

an object can be decomposed into parts and each se-

lected region corresponds to the area in which such a

part can be observed. When the number of generated

regions is large enough it is likely that the generated

set also contains regions that are close to optimal.

Each region is divided into regionlets. Classic

features are extracted from each regionlet and max-

pooling is performed on all feature vectors extracted

in a region. This allows the object part to occur in

any regionlet of the region and makes the detector ro-

bust to local deformations. Again, the most relevant

Salscheider, N., Rehder, E. and Lauer, M.

Analysis of Regionlets for Pedestrian Detection.

DOI: 10.5220/0006094100260032

In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2017), pages 26-32

ISBN: 978-989-758-222-6

regionlet conﬁgurations for each region are chosen by

the boosting algorithm from a large number of ran-

domly generated conﬁgurations.

The authors of (Wang et al., 2013) give a thorough

evaluation of the overall detection performance on

different datasets in comparison to other well-known

approaches. However, the original paper does not

contain a break-down of the contributions of the in-

dividual ideas. Also, many implementation details

are omitted. This leads to the questions why Region-

lets achieve good performance and which of the ideas

work well.

We analysed the Regionlets approach and the con-

tributions of the individual ideas to the overall im-

provement in detection performance. Also, we ex-

amined what the boosting classiﬁer learned and how

that compares to the expectations given by the design

ideas.

We could reduce the memory consumption and

computation time during training considerably by re-

placing the randomly generated regionlet conﬁgura-

tions by a regular grid of regionlets. Our proposed

stereo image based candidate bounding box selection

needs little computation time and reduces the number

of detector windows that have to be evaluated by a

factor of 3 to 5.

The remainder of this paper is laid out as follows:

In Section 2, the Regionlets approach is described in

detail. Then, our experiments and an analysis of the

approach is presented in Section 3. Finally, a conclu-

sion is drawn in Section 4.

2 REGIONLETS

The task of object detection in images can be broken

into two sub-tasks. The ﬁrst is to determine the loca-

tions of objects in the image. Then, the class of these

objects is determined by a classiﬁer in order to decide

if they are of interest.

A possible solution to the ﬁrst sub-problem is the

sliding window approach that performs an exhaustive

search of all possible locations and sizes. In the Re-

gionlets approach, however, these candidate bound-

ing box proposals are generated by selective search

(van de Sande et al., 2011). This reduces the number

of candidate bounding boxes that have to be evaluated

to around 1 000-2 000 per image while still achieving

high recall.

The main contribution of the Regionlets approach

is a new descriptor that is calculated for the candidate

bounding boxes. This descriptor contains information

about different scales of the image and is insensitive

to deformation. The former is achieved by calculating

features of regions with different sizes while the latter

is achieved by max-pooling.

The resulting feature vector has a very high di-

mensionality. Therefore, the authors use a cascaded

boosting classiﬁer to select only the most discrimina-

tive features.

2.1 The Regionlets Descriptor

Most objects can be divided into parts. A pedestrian,

for example, might be broken down into the head, the

upper body, arms and legs. Usually, the likelihood

of such a part to appear at a speciﬁc position inside

the bounding box of the object is not uniformly dis-

tributed. Instead, for each part there is a region rela-

tive to the bounding box that covers (nearly) all pos-

sible locations of that part.

The Regionlets descriptor is based on this idea. A

large number of regions with different sizes and dif-

ferent positions is generated in a sliding window fash-

ion. The feature vector for each candidate bounding

box is then calculated by concatenating the feature

vectors of all regions. It is then up to the boosting

classiﬁer to select the relevant regions.

In each region, there are multiple sub-regions

called regionlets that describe a possible location of

the object part in the region. A ﬁxed set of region-

lets in a region is called a regionlet conﬁguration. In

(Wang et al., 2013), regionlet conﬁgurations are gen-

erated randomly. First, the size for all regionlets in

a conﬁguration is ﬁxed randomly and then a random

number of regionlets with this size is positioned ran-

domly in the region.

The feature vector of each regionlet conﬁguration

is calculated by performing max-pooling over all fea-

ture vectors of all regionlets in this conﬁguration. The

idea behind this is that it does not matter in which re-

gionlet the object part is, but only whether it is in (at

least) one of them or not. The feature vector of a re-

gion is the concatenation of the feature vectors of all

regionlet conﬁgurations of this region. Again, it is up

to the boosting classiﬁer to select the relevant region-

let conﬁgurations for each region.

Finally, the feature vector of each regionlet is the

concatenation of classic appearance features extracted

from the image patch corresponding to the region-

let. These appearance features can for example be

the HOG (Dalal and Triggs, 2005) and LBP (Ahonen

et al., 2004) descriptors.

2.2 The Boosting Classiﬁer

Because of the large number of regions and the ran-

domly generated regionlet conﬁgurations, the feature

Analysis of Regionlets for Pedestrian Detection

space is of very high dimensionality. In (Wang et al.,

2013), about 100 million real-valued feature proto-

types are generated for training.

RealBoost (Schapire and Singer, 1999) is used to

train a classiﬁer cascade for the object detector. This

allows to only select relevant features and to reject

candidate bounding boxes that do not contain a rele-

vant object early.

The weak learners for the boosting cascades are

lookup tables similar to (Huang et al., 2004). They

are mathematically deﬁned as

h(x) =

n−1

∑

o=1

1(B(x) = o) , (1)

where n is the number of bins, v

is the table entry at

the oth position and 1 is an indicator function. B(x)

maps the input x to a table entry and is usually chosen

in a way so that the expected range of x is uniformly

distributed to the bins.

During training, the features are ﬁrst sorted by the

Bhattacharyya distance of the posterior probability of

positive and that of negative samples. Then, a feature

is iteratively selected and a weak learner is trained for

it. In each step, all v

in the lookup table of the weak

learner are chosen as v



−



. Here U

is the

sum of weights of all positive training samples that

fall into the oth bin and U

−

is the sum of weights

of all negative samples in that bin. In each step, the

weights of the wrongly classiﬁed training examples

are increased and the weights of the correctly classi-

ﬁed training examples are decreased.

The classiﬁcation output of the learnt strong

classiﬁer for a feature vector f is then given by

sign(

∑

( f

)).

3 ANALYSIS & EXPERIMENTS

We analysed the Regionlets approach and which in-

ﬂuence the different ideas from (Wang et al., 2013)

have.

In our implementation, we chose HOG and LBP

features as basic appearance features for each region-

let. For HOG, we use 9 equally spaced orientation

bins for the unsigned gradient direction. Each region-

let contains one block that is divided into 2 × 2 cells.

The block feature vector is normalised by its L2 norm.

Our LBP descriptor considers 8 neighbours with a ra-

dius of one. Again, each regionlet contains one block

and the corresponding feature vector is normalised by

its L1 norm.

We use RealBoost to train a cascade classiﬁer.

Our weak learner is a lookup table with 4 entries that

Region

(a) Typically rejected Re-

gionlet conﬁguration.

Region

(b) Typically accepted Re-

gionlet conﬁguration.

Figure 1: Examples of regionlet conﬁgurations that were

used during training. The outer box represents the region

while the smaller grey boxes inside represent the regionlets.

correspond to input values in the intervals [0,0.25),

[0.25,0.5), [0.5,0.75) and [0.75, 1].

The set of regions is generated as follows: Let w

the width of the candidate bounding box and h

its

height. The set of possible widths w

of our regions is

then W

= {w

,0.5w

,0.25w

} and the set of possi-

ble heights h

is H

= {0.5h

,0.25h

,0.125h

}. We

generate regions (x

) from each element

) ∈ W

× H

by using a sliding window ap-

proach. The stride is given by

in x-direction and

in y-direction.

3.1 Max-Pooling

In our ﬁrst experiment we generated several regionlet

conﬁgurations randomly as described in (Wang et al.,

2013) and trained the classiﬁer. We then examined

which regionlet conﬁgurations were chosen by the

boosting algorithm.

Figure 1 shows an example of a conﬁguration that

was chosen and one that was not chosen. We found

that most conﬁgurations where the regionlets did not

overlap were discarded. In contrast, most of the cho-

sen conﬁgurations contained regionlets that had a sig-

niﬁcant overlap. This is in line with what is intuitively

expected: Object parts can usually be found at any

position inside the allowed region and not only at a

few distinct positions. The chosen regionlet conﬁg-

uration allows for exactly that by performing max-

pooling over a large area inside the region.

In order to verify this theory, we replaced all re-

gionlet conﬁgurations by an artiﬁcial one that only

performs max-pooling. It contains 9 regionlets in a

3×3 grid where direct neighbours have 75 % overlap.

We found that there was nearly no change in detection

performance which conﬁrms our theory.

This is an important observation. When using a

regular grid, the dimensionality of the feature space

can be reduced by a factor equal to the number of re-

ICPRAM 2017 - 6th International Conference on Pattern Recognition Applications and Methods

0 0.2 0.4

0.6

0.8

0.2

0.4

0.6

0.8

Recall

Precision

Regionlet features, max-pooling

Regionlet features, no max-pooling

Classic features, max-pooling

Classic features, no max-pooling

Figure 2: Precision-recall curves of our Regionlets based

classiﬁer and a classic approach. The evaluation was per-

formed on 730 randomly chosen images of the KITTI

benchmark (Geiger et al., 2012).

gionlet conﬁgurations that would be generated other-

wise. This translates into a corresponding decrease

in memory consumption and computing time during

training.

We also evaluated the beneﬁts of max-pooling.

We used the KITTI benchmark and selected 730 la-

belled images randomly for evaluation. All other la-

belled images were used for training.

Figure 2 shows the precision-recall curves

achieved with this. It is clearly visible that max-

pooling improves the performance considerably. This

is true for both the Regionlets descriptor and a classic

descriptor that we used for comparison. The latter is

described in detail in the following section.

3.2 Different Scales

One of the most important ideas of the Regionlets de-

scriptor is the introduction of regions that cover the

possible locations of all object parts. These regions

have different sizes to account for object parts of dif-

ferent sizes and different variability in their possible

locations. When all regions are ﬁxed to one size and

there is no max-pooling inside the regions, the Re-

gionlet descriptor degrades to a classic block-based

descriptor.

We evaluated the beneﬁts of having regions with

different sizes by also implementing a classic descrip-

tor and comparing the results. For our classic descrip-

tor, we extract HOG and LBP features from a window

Figure 3: Examples of regions selected by the boosting al-

gorithm. Each region is printed in black but with a low

opacity so that overlapping regions can be easily visualised.

Therefore, areas where many regions overlap appear darker

than those with less overlapping regions. The ﬁrst image

shows the 3 most important regions, the second image the

250 most important regions and the last image the 500 most

important regions.

of size 64×128. The block size of the each descriptor

is 16 × 16 and the stride is 8 pixels both in x- and in

y-direction. All other parameters match the ones that

were used in our Regionlets descriptor.

The results are plotted in Figure 2. The Regionlets

descriptor clearly outperforms the classic approach.

It can also be seen that the improvement by max-

pooling over the pure classic approach is considerably

smaller than the improvement by using regions of dif-

ferent sizes.

Figure 3 shows the regions that the boosting clas-

siﬁer selects ﬁrst, i. e. the most important ones. The

three most important regions cover the areas where

the upper body, the head and the feet of the pedes-

trian are expected. Then, the classiﬁer mostly focuses

on the upper body. This can be seen when the 250

most import regions are visualised. The 500 most im-

portant regions already cover the whole body of the

pedestrian. As expected, only very few regions that

cover the background are selected.

3.3 Exhaustive Search

The standard approach for object detection in images

is the sliding window approach. The detector win-

dows is slid over the image with a ﬁxed stride. Each

detector window is evaluated with the classiﬁer to de-

termine whether it contains an object of interest or

not. This procedure is repeated with multiple window

sizes to search for objects at different scales.

The number of candidate bounding boxes that

have to be evaluated can be reduced by using a ground

plane assumption. For many applications in trafﬁc, a

camera is mounted at a ﬁxed position in the vehicle

Analysis of Regionlets for Pedestrian Detection

and the vehicle can only move on the ground. There-

fore, the distance between the camera and the ground

is ﬁxed. This is also true for the KITTI dataset. Since

pedestrians can also only walk on the ground, some

combinations of bounding box position and size are

very unlikely to occur.

However, the assumption that the road is a perfect

plane may not always hold in curved terrain. Also,

pitching of the car may violate a static plane inclina-

tion assumption. We handle this by allowing a cer-

tain deviation from the perfect plane since we do not

want to estimate the ground plane in every shot. This

results in a Region Of Interest (ROI) for each scale

that has to be searched for pedestrian. We learnt these

ROIs from the training images.

The ground plane assumption can be expressed by

the following linear equation

h = m · u + h

(2)

where h is the height of the ROI in pixels, u is the

lower boundary of the ROI in the image, and m and

are the parameters. We learnt m and h

from the

training labels by using linear least square estimation.

The possible ROI locations are then given by

m · u + h

− h

< k

(3)

where k

is a non-negative threshold.

On the KITTI dataset, an exhaustive search in all

ROIs results in about 10 000 bounding boxes per im-

age that have to be evaluated.

3.4 Selective Search

In (Wang et al., 2013), selective search (van de Sande

et al., 2011) is used to reduce the number of candidate

bounding boxes that are evaluated. This can help to

reduce the required detection run-time. But it can also

be beneﬁcial for the precision of the detector since

the chance for false positives is reduced. It is, how-

ever, important that the recall of the candidate bound-

ing box proposals is very high. The selective search

implementation from (van de Sande et al., 2011) gen-

erates around 1 000-2 000 candidate bounding boxes

per image.

We compared the precision reached when using

selective search with a sliding window approach. The

results can be found in Figure 4. The precision in-

creased slightly for high decision value thresholds

(i. e. where recall is low). However, for lower de-

cision value thresholds the precision decreased. The

maximum recall also decreased because the bounding

box candidate proposals do not have perfect recall.

0 0.2 0.4

0.6

0.8

0.2

0.4

0.6

0.8

Recall

Precision

Sliding window approach

Selective search

Stereo based approach

Figure 4: Comparison of precision-recall curves when us-

ing selective search, a stereo based approach and a sliding

window approach. The evaluation was performed on 730

randomly chosen images of the KITTI benchmark (Geiger

et al., 2012).

3.5 Stereo based Bounding Box

Proposals

Bounding box proposals can also be generated from

stereo image pairs. In general, there are two possi-

bilities to do so. More complex methods take advan-

tage of the fact that there is usually a difference in the

depth of an object and its background. Pixels can then

be clustered by their depth values and each cluster can

be used to generate a bounding box proposal.

The other, simpler, possibility is to assume a ﬁxed

size of the searched object in the world. Bounding

box candidates can then be generated in a sliding win-

dow fashion but all candidates that do not satisfy this

assumption are immediately discarded.

We evaluated the performance of the second ap-

proach as an alternative to selective search. We as-

sume that a pedestrian has an average size of H in the

world. Then, the size h (in pixels) of a person in the

image depends on the distance z between camera and

this person. The relationship is given by

(4)

where f is the focal length of the camera in pixels.

Using H = const. and disparity d ∝ f /z, we can derive

z · h

∝

= c (5)

and thus, we only have to estimate the constant c from

ICPRAM 2017 - 6th International Conference on Pattern Recognition Applications and Methods

the training data. This again can be done using least

squares estimation.

The disparity d is not constant for all pixels in a

candidate bounding box. We achieved good results

by computing d as the mean disparity of a local neigh-

bourhood at the centre of the candidate bounding box.

Then, all candidate bounding boxes for which the fol-

lowing equation holds are discarded:



− c



> k

(6)

where again, k

is a non-negative threshold.

We used (Ranft and Strauß, 2014) to obtain the

disparity image. This algorithm focuses on run-

time performance while achieving competitive re-

sults. However, any other stereo matcher with good

run-time performance could be used.

Our stereo based approach reduces the number of

candidate bounding boxes per image to about 2 000-

3 500 per image. The achieved detection performance

is similar to selective search and is plotted in Figure 4.

3.6 Run-time

Our OpenCL implementation needs around 700 ms

per image on an Nvidia Titan graphics card for an ex-

haustive search using a sliding window approach. The

number of evaluated bounding boxes can be drasti-

cally reduced by using image segmentation to gener-

ate candidate bounding box proposals. This promises

to reduce the run-time of the detection algorithm con-

siderably.

However, we found that the code for image seg-

mentation provided with (van de Sande et al., 2011)

takes around 1.6 s to execute on our Intel Xeon E5-

2640. This means that we could not reproduce the

performance of (Wang et al., 2013) which reports an

overall detection rate of 5 frames per second on a sin-

gle 2.1 GHz CPU core. It also means that run-time

performance cannot be improved by using this orig-

inal implementation. Other implementations of this

algorithm or other approaches for image segmenta-

tion can, however, be faster and advantageous for the

overall detection run-time.

Our stereo based method for candidate bounding

box generation takes approximately 72 ms (65 ms for

the stereo matcher and 7 ms for the candidate gener-

ation). It reduces the number of evaluated bounding

boxes compared to an exhaustive search by a factor of

approximately 3 to 5 depending on the image. This re-

sults in a corresponding decrease in detector run-time

that outweighs the cost of the stereo matching.

4 CONCLUSION

We have analysed the different ideas of the Regionlets

approach and determined what makes it work well.

The largest improvement in detection performance is

achieved by dividing the detection window into re-

gions and learning which of these regions are relevant.

We showed that the boosting algorithm learns that the

regions that contain parts of the pedestrian’s body are

the most relevant ones.

Randomly choosing regionlet conﬁgurations and

performing max-pooling on them achieves a simi-

lar performance as max-pooling on a regular grid.

The boosting algorithm tends to select conﬁgurations

that are regular and contain overlapping regionlets.

Performing max-pooling on a regular grid decreases

the memory consumption and computing time during

training.

The performance improvement by max-pooling is

less than the improvement achieved by using regions

but it is still important. However, it is not speciﬁc to

the regionlets approach—the same improvement can

be seen when max-pooling is used together with clas-

sic features (HOG and LBP).

The use of selective search inﬂuences the detec-

tion performance slightly but it does not necessarily

improve it. While it decreases the risk for false pos-

itives it also decreases recall. Concerning the run-

time performance, selective search does not help in

our case. This is because computing the candidate

bounding box proposals alone takes more time than

an exhaustive search with our OpenCL implementa-

tion.

However, our stereo based approach for the gen-

eration of candidate bounding box proposals is fast

enough to improve the overall run-time. With this

approach, the overall performance can meet the soft

real-time constraints of some applications.

The original Regionlets paper (Wang et al., 2013)

omits many implementation details. Therefore, the

results are not necessarily comparable with our im-

plementation. We could however show that the Re-

gionlets approach performs much better than classic

features.

REFERENCES

Ahonen, T., Hadid, A., and Pietik

ainen, M. (2004). Face

Recognition with Local Binary Patterns. In Pajdla,

T. and Matas, J., editors, ECCV (1), volume 3021 of

Lecture Notes in Computer Science, pages 469–481.

Springer.

Dalal, N. and Triggs, B. (2005). Histograms of Oriented

Analysis of Regionlets for Pedestrian Detection

Gradients for Human Detection. pages 886–893.

IEEE Computer Society.

Felzenszwalb, P., Mcallester, D., and Ramanan, D. (2008).

A Discriminatively Trained, Multiscale, Deformable

Part Model. Computer Vision and Pattern Recogni-

tion, 2008. CVPR 2008. IEEE Conference on.

Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready

for autonomous driving? The KITTI vision bench-

mark suite. In CVPR, pages 3354–3361. IEEE Com-

puter Society.

Huang, C., Ai, H., Wu, B., and Lao, S. (2004). Boosting

Nested Cascade Detector for Multi-View Face Detec-

tion. In ICPR (2), pages 415–418. IEEE Computer

Society.

Ranft, B. and Strauß, T. (2014). Modeling Arbitrarily Ori-

ented Slanted Planes for Efﬁcient Stereo Vision based

on Block Matching. pages 1941–1947.

Schapire, R. E. and Singer, Y. (1999). Improved Boosting

Algorithms Using Conﬁdence-rated Predictions. Ma-

chine Learning, 37(3):297–336.

van de Sande, K. E. A., Uijlings, J. R. R., Gevers, T., and

Smeulders, A. W. M. (2011). Segmentation as Se-

lective Search for Object Recognition. In Metaxas,

D. N., Quan, L., Sanfeliu, A., and Gool, L. J. V., edi-

tors, ICCV, pages 1879–1886. IEEE.

Vogel, J. and Schiele, B. (2007). Semantic Modeling of Nat-

ural Scenes for Content-Based Image Retrieval. Inter-

national Journal of Computer Vision, 72(2):133–157.

Wang, X., Yang, M., Zhu, S., and Lin, Y. (2013). Regionlets

for Generic Object Detection. In ICCV, pages 17–24.

IEEE.

ICPRAM 2017 - 6th International Conference on Pattern Recognition Applications and Methods