regionlet configurations for each region are chosen by
the boosting algorithm from a large number of ran-
domly generated configurations.
The authors of (Wang et al., 2013) give a thorough
evaluation of the overall detection performance on
different datasets in comparison to other well-known
approaches. However, the original paper does not
contain a breakdown of the contributions of the
individual ideas, and many implementation details
are omitted. This raises the questions of why
Regionlets achieve good performance and which of
the ideas work well.
We analysed the Regionlets approach and the con-
tributions of the individual ideas to the overall im-
provement in detection performance. Also, we ex-
amined what the boosting classifier learned and how
that compares to the expectations given by the design
ideas.
We were able to reduce memory consumption and
computation time during training considerably by
replacing the randomly generated regionlet
configurations with a regular grid of regionlets. Our
proposed stereo-image-based candidate bounding box
selection needs little computation time and reduces
the number of detector windows that have to be
evaluated by a factor of 3 to 5.
The remainder of this paper is laid out as follows:
In Section 2, the Regionlets approach is described in
detail. Then, our experiments and an analysis of the
approach are presented in Section 3. Finally, a
conclusion is drawn in Section 4.
2 REGIONLETS
The task of object detection in images can be broken
into two sub-tasks. The first is to determine the loca-
tions of objects in the image. Then, the class of these
objects is determined by a classifier in order to decide
if they are of interest.
A possible solution to the first sub-task is the
sliding window approach, which performs an exhaustive
search of all possible locations and sizes. In the Re-
gionlets approach, however, these candidate bound-
ing box proposals are generated by selective search
(van de Sande et al., 2011). This reduces the number
of candidate bounding boxes that have to be evaluated
to around 1 000-2 000 per image while still achieving
high recall.
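
To illustrate the scale of this reduction, the following minimal sketch counts the windows of an exhaustive multi-scale sliding window search. The image size, base window size, stride and scale step are hypothetical values chosen for illustration only; they are not taken from (Wang et al., 2013) or (van de Sande et al., 2011).

```python
def count_sliding_windows(img_w, img_h, base_w, base_h,
                          stride, scale_step, num_scales):
    """Count the windows visited by an exhaustive multi-scale search.

    All parameters are illustrative assumptions, not values from the paper.
    """
    total = 0
    for s in range(num_scales):
        # Window size at this scale level.
        w = round(base_w * scale_step ** s)
        h = round(base_h * scale_step ** s)
        if w > img_w or h > img_h:
            break  # the window no longer fits into the image
        # Number of window positions along each axis.
        nx = (img_w - w) // stride + 1
        ny = (img_h - h) // stride + 1
        total += nx * ny
    return total


# Hypothetical setup: 640x480 image, 32x64 base window, stride of 8 pixels.
n = count_sliding_windows(640, 480, 32, 64,
                          stride=8, scale_step=1.25, num_scales=10)
print(n)
```

Even at the smallest scale alone, such a setup yields more windows than the 1 000-2 000 proposals per image mentioned above.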
The main contribution of the Regionlets approach
is a new descriptor that is calculated for the candidate
bounding boxes. This descriptor contains information
about different scales of the image and is insensitive
to deformation. The former is achieved by calculating
features of regions with different sizes while the latter
is achieved by max-pooling.
The resulting feature vector has a very high di-
mensionality. Therefore, the authors use a cascaded
boosting classifier to select only the most discrimina-
tive features.
2.1 The Regionlets Descriptor
Most objects can be divided into parts. A pedestrian,
for example, might be broken down into the head, the
upper body, arms and legs. Usually, the likelihood
of such a part appearing at a specific position inside
the bounding box of the object is not uniformly
distributed. Instead, for each part there is a region
relative to the bounding box that covers (nearly) all
possible locations of that part.
The Regionlets descriptor is based on this idea. A
large number of regions with different sizes and
different positions are generated in a sliding window
fashion. The feature vector for each candidate bounding
box is then calculated by concatenating the feature
vectors of all regions. It is then up to the boosting
classifier to select the relevant regions.
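
The region generation can be sketched as follows, assuming that regions are defined in coordinates relative to the bounding box so that the same set can be reused for every candidate. The grid resolution, the list of region sizes and the helper names generate_regions and to_pixels are hypothetical; the original paper does not specify these details here.

```python
def generate_regions(grid, sizes, stride=1):
    """Enumerate regions in normalized bounding-box coordinates.

    The bounding box is divided into grid x grid cells (an assumption).
    Each size is (width, height) in cells. A region is returned as
    (x, y, w, h) with all values in [0, 1].
    """
    regions = []
    for w, h in sizes:
        # Slide a w x h window over the cell grid.
        for y in range(0, grid - h + 1, stride):
            for x in range(0, grid - w + 1, stride):
                regions.append((x / grid, y / grid, w / grid, h / grid))
    return regions


def to_pixels(region, box):
    """Map a normalized region into a candidate box given as (x, y, w, h)."""
    bx, by, bw, bh = box
    x, y, w, h = region
    return (bx + x * bw, by + y * bh, w * bw, h * bh)


regions = generate_regions(grid=4, sizes=[(2, 2), (4, 4)])
print(len(regions))  # 9 positions for 2x2 plus 1 position for 4x4 = 10
print(to_pixels(regions[0], (100, 200, 40, 80)))
```

Because the regions are relative to the bounding box, the descriptor has the same length for every candidate regardless of its size in pixels.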
In each region, there are multiple sub-regions
called regionlets that describe a possible location of
the object part in the region. A fixed set of region-
lets in a region is called a regionlet configuration. In
(Wang et al., 2013), regionlet configurations are gen-
erated randomly. First, the size for all regionlets in
a configuration is fixed randomly and then a random
number of regionlets with this size is positioned ran-
domly in the region.
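
A minimal sketch of this random generation is given below, using coordinates relative to the region. The size limits, the maximum number of regionlets and the use of uniform distributions are assumptions made for illustration; (Wang et al., 2013) may use different ranges and distributions.

```python
import random


def random_configuration(rng, max_regionlets=5, min_size=0.2, max_size=0.8):
    """Draw one regionlet configuration in region-relative coordinates.

    Limits and distributions are illustrative assumptions.
    Returns a list of regionlets (x, y, w, h), all with the same size.
    """
    # Step 1: one random size shared by all regionlets of the configuration.
    w = rng.uniform(min_size, max_size)
    h = rng.uniform(min_size, max_size)
    # Step 2: a random number of regionlets at random positions in the region.
    count = rng.randint(1, max_regionlets)
    regionlets = []
    for _ in range(count):
        x = rng.uniform(0.0, 1.0 - w)
        y = rng.uniform(0.0, 1.0 - h)
        regionlets.append((x, y, w, h))
    return regionlets


rng = random.Random(0)  # fixed seed for reproducibility
config = random_configuration(rng)
for regionlet in config:
    print(regionlet)
```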
The feature vector of each regionlet configuration
is calculated by performing max-pooling over all fea-
ture vectors of all regionlets in this configuration. The
idea behind this is that it does not matter in which re-
gionlet the object part is, but only whether it is in (at
least) one of them or not. The feature vector of a re-
gion is the concatenation of the feature vectors of all
regionlet configurations of this region. Again, it is up
to the boosting classifier to select the relevant region-
let configurations for each region.
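
The pooling and concatenation steps can be sketched as follows, assuming that max-pooling is an element-wise maximum over the regionlet feature vectors of one configuration. The function names and the toy two-dimensional feature values are hypothetical.

```python
def max_pool(feature_vectors):
    """Element-wise maximum over the feature vectors of all regionlets."""
    return [max(values) for values in zip(*feature_vectors)]


def region_feature(configurations):
    """Concatenate the pooled feature vectors of all configurations."""
    feature = []
    for regionlet_features in configurations:
        feature.extend(max_pool(regionlet_features))
    return feature


# Toy example: two configurations with 2 and 3 regionlets respectively.
config_a = [[0.1, 0.9], [0.4, 0.2]]
config_b = [[0.3, 0.3], [0.0, 0.7], [0.5, 0.1]]
print(region_feature([config_a, config_b]))  # [0.4, 0.9, 0.5, 0.7]
```

The pooled vector of a configuration has the same length as a single regionlet feature vector, independent of how many regionlets the configuration contains.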
Finally, the feature vector of each regionlet is the
concatenation of classic appearance features extracted
from the image patch corresponding to the region-
let. These appearance features can for example be
the HOG (Dalal and Triggs, 2005) and LBP (Ahonen
et al., 2004) descriptors.
2.2 The Boosting Classifier
Because of the large number of regions and the ran-
domly generated regionlet configurations, the feature
Analysis of Regionlets for Pedestrian Detection