Towards Reliable Real-time Person Detection
Silviu-Tudor Serban, Srinidhi Mukanahallipatna Simha, Vasanth Bathrinarayanan, Etienne Corvee
and Francois Bremond
INRIA Sophia Antipolis - Mediterranee, 2004 route des Lucioles - BP 93, 06902 Sophia Antipolis, France
Keywords:
Random Sampling, AdaBoost, Soft Cascade, LBP Channel Features.
Abstract:
We propose a robust real-time person detection system, which aims to serve as a solid foundation for
developing solutions at an elevated level of reliability. Our belief is that clever handling of input data,
combined with effective training algorithms, is key to obtaining top performance. We introduce a
comprehensive training method based on random sampling that compiles optimal classifiers with minimal
bias and overfit rate. Building upon recent advances in multi-scale feature computations, our approach
attains state-of-the-art accuracy while running at high frame rate.
1 INTRODUCTION
In most applications of person detection a high level
of accuracy is not sufficient, as good detection speed
is equally critical. It is thus essential to choose a
mixture of learning algorithm, features and detection
strategy that satisfies both requirements with minimal
compromise.
We propose a versatile training system which al-
lows automatic training optimization and possesses
the ability to efficiently discriminate training samples,
choose satisfactory subsets and cluster the training
data. We capture substantial information at low com-
putational cost by computing the Local Binary Pattern
operator (T. Ojala and Maenpaa, 2002) and the Mod-
ified Census Transform (B. Froba, 2004) on several
color channels of the training images. We implement
a variant of the AdaBoost classifier that uses soft cas-
cades (C. Zhang, 2007) for lossless reduction of de-
tection time.
1.1 Related Work
We present a list of major contributions in the object
detection field, with a focus on sliding window approaches.
Most of them have influenced our work to some degree,
providing both inspiration and motivation for improvement.
One of the first sliding window detectors was Pa-
pageorgiou et al. (Papageorgiou and Poggio, 2000).
It focused on applying Support Vector Machines
(Cortes and Vapnik, 1995) to a dictionary of mul-
tiscale Haar wavelets. Viola and Jones [VJ] (Viola
and Jones, 2004) improved on the idea by introduc-
ing integral images for fast feature computation and
by using a cascade-like structure of AdaBoost clas-
sifiers for increasing the efficiency of detection. The
wide acceptance for gradient-based features began af-
ter Dalal and Triggs [HOG] (Dalal and Triggs, 2005)
proposed histogram of oriented gradient (HOG) fea-
tures for detection by showing substantial gains over
intensity based features. Zhu et al. (Q. Zhu and
Cheng, 2006) improved the original HOG implemen-
tation by using integral histograms. A vast majority
of modern detectors are still HOG-based.
Shape features have also shown good promise.
Gavrila and Philomin (Gavrila and Philomin,
1999)(Gavrila, 2007) used the Hausdorff distance
transform together with a template hierarchy to
match image edges with a set of shape templates.
Wu and Nevatia (Wu and Nevatia, 2005) aimed to
represent shape locally by using edgelet features,
with boosted classifiers for full-body, head, torso
and legs. A combination of features was used in
order to provide complementary information. Wojek
and Schiele (Wojek and Schiele, 2008) combined
Haar-like features, shapelets (Sabzmeydani and Mori,
2007), shape context (G. Mori and Malik, 2005) and
HOG features obtaining a detector that outperforms
individual features of any kind. Wu and Nevatia
(Wu and Nevatia, 2008) combined HOG, edgelet
and covariance features. Ojala et al. (T. Ojala and
Maenpaa, 2002) combined a texture descriptor based
on LBP and HOG.
Serban S., Mukanahallipatna Simha S., Bathrinarayanan V., Corvee E. and Bremond F.
Towards Reliable Real-time Person Detection.
DOI: 10.5220/0004651302320239
In Proceedings of the 9th International Conference on Computer Vision Theory and Applications (VISAPP-2014), pages 232-239
ISBN: 978-989-758-004-8
Copyright © 2014 SCITEPRESS (Science and Technology Publications, Lda.)
Dollar et al. (P. Dollar and Belongie, 2009) pro-
posed an extension of the Viola and Jones frame-
work where Haar-like features are computed over
multiple channels of visual data including LUV color
channels, grayscale, gradient magnitude and gradient
magnitude quantized by orientation. In the Fastest
Pedestrian Detector in the West (P. Dollar and Perona,
2010), this approach was extended to fast multi-scale
detection, given that features computed at a single
scale can be used to approximate features at nearby
scales.
Tuzel et al. (O. Tuzel and Meer, 2008) utilized
covariance matrices computed locally over various
features as object descriptors. The boosting frame-
work was modified to work on Riemannian man-
ifolds, leading to better performance. Maji et al.
(S. Maji and Malik, 2008) presented a way to approx-
imate the histogram intersection kernel for use with
SVMs, which provided speed-ups significant enough
to enable a non-linear SVM to be used in sliding-
window detection.
Babenko et al. (B. Babenko and Belongie, 2008)
proposed an approach for simultaneously separating
data into coherent groups and training separate classi-
fiers for each; (C. Wojek and Schiele, 2009) showed
that both (S. Maji and Malik, 2008) and (B. Babenko
and Belongie, 2008) gave modest gains over lin-
ear SVMs and AdaBoost for pedestrian detection,
especially when used in combination (S. Walk and
Schiele, 2010).
Several groups worked on efficiently utilizing
large feature spaces. Feature mining was proposed
by (P. Dollar and Belongie, 2007) to explore huge
feature spaces using strategies like steepest descent
search before training a boosted classifier. The no-
tion of pose and body parts was investigated by a
number of authors. Mohan et al. (Mohan and Pog-
gio, 2001) successfully extended (Papageorgiou and
Poggio, 2000) with a two stage approach: supervised
training of head, arm and leg detectors, and detec-
tion that involved combining outputs in a rough ge-
ometric model. Keypoints represent the base for early
contributions in unsupervised part learning, includ-
ing the constellation model (M. Weber and Perona,
2000)(R. Fergus and Zisserman, 2003) and the sparse
representation approach of (Agarwal and Roth, 2002).
Leibe et al. (A. Leibe and Schiele, 2005) adapted the
implicit shape model for detecting pedestrians. How-
ever, as few interest points are detected at lower res-
olutions, unsupervised part based approaches that do
not rely on keypoints have been proposed.
Multiple instance learning (MIL) was employed
in order to automatically determine the position of
parts without part-level supervision (P. Dollar and
Z. Tu, 2008)(Z. Lin and Davis, 2009). In one of the
most successful approaches for general object detec-
tion to date, Felzenszwalb et al. (P. Felzenszwalb and
Ramanan, 2008)(P. F. Felzenszwalb and Ramanan,
2009) proposed a discriminative part based approach
that models unknown part positions as latent vari-
ables in an SVM framework. As part models seem
to be most successful at higher resolutions, Park et al.
(D. Park and Fowlkes, 2010) extended this to a multi-
resolution model that automatically switches to parts
only at sufficiently high resolutions.
In terms of detection speed, recent notable
publications reveal outstanding results. Dollar et al.
(P. Dollar and Kienzle, 2012) build upon previous
contributions (P. Dollar and Belongie, 2009)(P. Dollar
and Perona, 2010) and use crosstalk cascades
to further reduce detection time. Benenson et al.
(Rodrigo Benenson, 2012) propose a method of a
similar nature, but use the GPU for accelerating feature
computation.
2 CLASSIFICATION
Our method combines detection techniques that
greatly reduce computational time without compro-
mising accuracy. We use efficient LBP and MCT
features which we compute on integral images for
optimal retrieval of rectangular region intensity and
nominal scaling error. AdaBoost is used to create
cascading classifiers with significantly reduced detec-
tion time. We further refine detection speed by us-
ing the soft cascades approach and by transferring all-
important computation from detection stage to train-
ing stage.
2.1 LBP Channel Features
Local binary pattern (LBP) is a non-parametric kernel
which summarizes the local spatial structure of an
image and is invariant to monotonic gray-scale
transformations. At a given image location, LBP is
defined as an ordered set of binary comparisons of values
between the center block and its eight surrounding
blocks (Figure 1). Similarly, the Modified
Census Transform (MCT) is defined as an ordered set
of binary comparisons between all nine blocks and
their mean value.
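The two response codes can be illustrated with a minimal Python sketch. The bit ordering (clockwise from the top-left block for LBP, row-major for MCT), the comparison operators and the helper names are assumptions made for illustration, as the text does not fix them.

```python
import numpy as np

def block_means(patch):
    """Average a (3h x 3w) patch into a 3x3 grid of block means."""
    h, w = patch.shape[0] // 3, patch.shape[1] // 3
    return patch[:3 * h, :3 * w].reshape(3, h, 3, w).mean(axis=(1, 3))

# Clockwise order of the eight neighbours, starting at the top-left block.
# This ordering is an assumption; any fixed order yields a valid code.
NEIGHBOURS = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]

def lbp_code(blocks):
    """8-bit LBP: compare each neighbouring block with the centre block."""
    centre = blocks[1, 1]
    code = 0
    for r, c in NEIGHBOURS:
        code = (code << 1) | int(blocks[r, c] >= centre)
    return code

def mct_code(blocks):
    """9-bit MCT: compare all nine blocks with their mean value."""
    mean = blocks.mean()
    code = 0
    for value in blocks.ravel():
        code = (code << 1) | int(value > mean)
    return code

blocks = np.array([[6.0, 5.0, 2.0],
                   [7.0, 6.0, 1.0],
                   [9.0, 8.0, 7.0]])
print(lbp_code(blocks), mct_code(blocks))  # prints "143 311"
```

Because only the ordering of block values matters, applying an increasing transformation such as `2 * blocks + 3` leaves both codes unchanged, which is the invariance property stated above.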
Figure 1: Extracting LBP response code (8-bit and decimal).
Inspired by the logic behind Integral Channel
Features (P. Dollar and Belongie, 2009), at training
stage feature extraction is performed on 5 channels
of the input image: Red, Green, Blue, Grayscale and
Edges (Gradient Magnitude). Our approach combines
two variants of the LBP feature: classic LBP with
8-bit response code and Modified Census Transform
with 9-bit response code. This results in 10 different
types of features (LBP x 5 channels, MCT x 5 channels),
which we refer to as LBP channel features.
The fusion of informative channel
features, along with the usage of integral images,
allows comprehensive object description and fast
feature computation.
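The role of the integral image can be sketched as follows: once the table is built for a channel, the sum (and hence the mean) of any rectangular block is obtained from four lookups, independently of the block size. The function names are hypothetical and the sketch is not the implementation used in our system.

```python
import numpy as np

def integral_image(channel):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii

def block_sum(ii, top, left, height, width):
    """Sum of channel[top:top+height, left:left+width] using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

# Toy 4x5 channel; in practice one table is built per channel.
channel = np.arange(20, dtype=np.float64).reshape(4, 5)
ii = integral_image(channel)
print(block_sum(ii, 1, 2, 2, 3) == channel[1:3, 2:5].sum())  # prints "True"
```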
2.2 AdaBoost
The AdaBoost learning algorithm stands at the core
of our training system. Boosting offers a convenient,
fast approach to learning given a large number of can-
didate features. It has all the desirable attributes that a
linear classifier can provide, has good generalization
properties, automatically selects features based on a
strategy that minimizes error and produces a sequence
of gradually more complex classifiers. AdaBoost con-
structs a strong classifier by linearly combining weak
classifiers.
Algorithm 1: The AdaBoost Algorithm.
Given $(x_1, y_1), \ldots, (x_m, y_m)$; $x_i \in X$, $y_i \in \{-1, +1\}$
Initialise weights $D_1(i) = 1/m$.
For $t = 1, \ldots, T$
  Find $h_t = \arg\min_{h_j \in H} \varepsilon_j = \sum_{i=1}^{m} D_t(i) [y_i \neq h_j(x_i)]$
  If $\varepsilon_t \geq 1/2$ then stop
  Set $\alpha_t = \frac{1}{2} \log\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right)$
  $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
end
Output the final classifier
$H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
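Algorithm 1 can be sketched in Python as below. The weak learners here are simple threshold stumps on raw feature values, which is an assumption made to keep the example short; in our system the weak learners are built on LBP channel features. The clamp on the error is an added numerical safeguard that is not part of Algorithm 1.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Weak classifier h(x) in {-1, +1}: a threshold test on one feature."""
    return np.where(polarity * (X[:, feature] - threshold) >= 0, 1, -1)

def train_adaboost(X, y, T):
    """Discrete AdaBoost over threshold stumps; labels y must be in {-1, +1}."""
    m, n_features = X.shape
    D = np.full(m, 1.0 / m)                        # D_1(i) = 1/m
    ensemble = []                                  # (alpha, feature, threshold, polarity)
    for _ in range(T):
        best = None
        for f in range(n_features):
            for thr in np.unique(X[:, f]):
                for pol in (1, -1):
                    pred = stump_predict(X, f, thr, pol)
                    err = D[pred != y].sum()       # weighted error of candidate h_j
                    if best is None or err < best[0]:
                        best = (err, f, thr, pol, pred)
        err, f, thr, pol, pred = best
        if err >= 0.5:                             # no better than chance: stop
            break
        err = max(err, 1e-12)                      # added safeguard against log(1/0)
        alpha = 0.5 * np.log((1.0 - err) / err)
        ensemble.append((alpha, f, thr, pol))
        D = D * np.exp(-alpha * y * pred)          # raise weights of mistakes
        D = D / D.sum()                            # division by Z_t
    return ensemble

def predict(ensemble, X):
    """Strong classifier H(x) = sign(sum_t alpha_t h_t(x))."""
    score = np.zeros(X.shape[0])
    for alpha, f, thr, pol in ensemble:
        score += alpha * stump_predict(X, f, thr, pol)
    return np.where(score >= 0, 1, -1)

# Toy 1-D problem that no single stump can solve.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1, 1, -1, -1, 1, 1])
model = train_adaboost(X, y, T=3)
print((predict(model, X) == y).all())  # prints "True"
```

On this toy problem every single stump misclassifies at least two samples, while the weighted combination of three stumps separates all six, which illustrates how the strong classifier emerges from a linear combination of weak ones.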
2.3 Cascading Classifiers
A Cascading Classifier represents the concatenation
of several classifiers, using the information collected
from the output of a given classifier as additional
information for the next classifier in the cascade. In
our cascading ensemble we compile classifiers by
boosting LBP channel features.
When we train a classifier, an extensive array of
features is extracted (training images x feature locations
x feature sizes x feature types). The array serves
as input for the AdaBoost meta-learning method that
constructs a classifier in an iterative fashion. At each
iteration, a simple learning algorithm is run in
order to select the feature with the lowest discrimination
error. This feature becomes a weak learner and
is assigned a weight signifying its importance in the
strong classifier. When the accumulated weak learners
can correctly discriminate all the training samples,
iterating stops and the strong classifier is stored.
At each stage we combine the trained classifiers
in a partial cascade which filters out negative samples
for training the next classifier. The cascade is
considered complete when the combined classifiers reach the
desired level of performance. At detection time, the
cascade approach works as a multistage False Positives
filter for a given set of candidates.
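The filtering behaviour at detection time can be sketched as follows. Each stage is represented by a hypothetical scoring function and a threshold; the toy stages below simply read precomputed scores and stand in for boosted classifiers.

```python
def run_cascade(stages, window):
    """Return (accepted, stages_evaluated) for one candidate window."""
    for index, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, index            # rejected: later stages are skipped
    return True, len(stages)

def filter_candidates(stages, windows):
    """Keep only the candidates that pass every stage of the cascade."""
    return [w for w in windows if run_cascade(stages, w)[0]]

# Toy candidates: each window is a tuple of two precomputed stage scores.
stages = [(lambda w: w[0], 0.5), (lambda w: w[1], 0.5)]
windows = [(0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]
print(filter_candidates(stages, windows))  # prints "[(0.9, 0.9)]"
```

A candidate rejected by the first stage costs a single evaluation, which is why a cascade that discards most False Positives early reduces detection time.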
2.4 Cascading and Random Sampling
We use random sampling to construct cascades of optimized
classifiers. A random sampling classification
stage is a seemingly exhaustive process. Repeatedly,
random positive and negative samples are chosen to
be trained via our boosting method, creating a temporary
classifier, which is tested as shown in (Figure 2).
The process iterates until a training goal is met and the
classifier is stored.
Weights can be used to guide the construction
of each classifier. After each test, training samples
will be mapped with a performance table which
keeps count of how many times they were correctly
identified by all temporary classifiers. If samples with
high counts are given more chances to be selected for
training, the final classifier will learn to discriminate
a vast majority of training samples in less time. If
samples with low counts are given more chances to
be selected, the final classifier will very accurately
recognize all True Positives, but will suffer in terms
of speed and robustness.
3 RANDOM SAMPLING TRAINING
Our system was designed to obtain the best possible
classification with little or no supervision while tackling
existing dataset problems (Antonio Torralba, 2011)
by using the statistical method of random sampling.
This minimizes bias, better estimates a general model
and simplifies analysis of results.
Figure 2: a) Random sampling approach for classification optimization. b) Weighted random sampling for dataset clustering.
The approach reveals an unsupervised mechanism
of training sample selection and classification opti-
mization. We describe the general work flow of the it-
erative system, see (Figure 2.a), and highlight its most
important features and advantages.
Automatic Generation and Validation. From a
given positive sample dataset $P$ a subset $P_{rs}$ is randomly
chosen, and from a given negative sample dataset
$N$ a subset $N_{rs}$ is chosen. Feature extraction is performed
on $P_{rs}$ and $N_{rs}$ and a cascade classifier candidate
$C$ is trained via AdaBoost. Upon generation, the
performance of $C$ is validated on the testing sets
$t_P = P - P_{rs}$ and $t_N = N - N_{rs}$. We define the training goal $G$
as a set of rules that include a maximum number of
allowed features per classifier and the threshold percentage
of accuracy obtained by the classifier on $t_P$
and $t_N$. When $G$ is satisfied, $C$ is stored and the system
begins training the next classifier cascade. If $G$ is
not satisfied, $C$ is dismissed, $P_{rs}$ and $N_{rs}$ are regenerated
and the classifier cascade training is restarted.
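The generate-and-validate loop described above can be sketched as follows. The `train` callable, the trial limit and the toy one-dimensional data are hypothetical stand-ins: `train` is assumed to return a classifier predicate together with its number of features, whereas our system trains a boosted cascade on image features.

```python
import random

def random_sampling_stage(P, N, n_pos, n_neg, train, min_accuracy,
                          max_features, max_trials=100, seed=0):
    """Repeat sample / train / validate until the training goal G is met."""
    rng = random.Random(seed)
    for _ in range(max_trials):
        pos_idx = set(rng.sample(range(len(P)), n_pos))        # P_rs
        neg_idx = set(rng.sample(range(len(N)), n_neg))        # N_rs
        classify, n_features = train([P[i] for i in pos_idx],
                                     [N[i] for i in neg_idx])
        t_P = [p for i, p in enumerate(P) if i not in pos_idx]  # P - P_rs
        t_N = [n for i, n in enumerate(N) if i not in neg_idx]  # N - N_rs
        acc_P = sum(classify(p) for p in t_P) / len(t_P)
        acc_N = sum(not classify(n) for n in t_N) / len(t_N)
        if n_features <= max_features and min(acc_P, acc_N) >= min_accuracy:
            return classify, n_features                        # G satisfied: store C
    return None                                                # G never satisfied

def train_threshold(pos, neg):
    """Toy stand-in for boosting: one threshold on 1-D samples, one feature."""
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return (lambda x: x > thr), 1

# Toy data: positives in [10, 14.9], negatives in [0, 4.9].
P = [10 + 0.1 * i for i in range(50)]
N = [0.1 * i for i in range(50)]
result = random_sampling_stage(P, N, 10, 10, train_threshold,
                               min_accuracy=0.95, max_features=5)
```

The subsets are drawn by index so that the validation sets are exactly the samples left out of training, matching the definitions of $t_P$ and $t_N$.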
Dataset Bias and Overfitting Reduction. We
aim to reduce bias by concatenating several person
datasets in order to obtain a diverse and ample dataset.
Also, by randomly selecting a subset of training sam-
ples from the training dataset, we minimize the level
of similarity between training samples and effectively
decrease the overfitting effect.
Efficient Big Dataset Handling. The random
sampling technique allows training on large datasets
without the need for supercomputers. By choosing the
subset that represents the entire set with minimal error,
the computational cost of classification is greatly
reduced and the resulting classifiers are similar or better.
Detection Optimization. An outcome of using a
selected subset of the total samples is that just a handful
of features are needed to correctly discriminate the
object class. During our experiments, we have obtained
up to 10-fold speed-up in feature computation time
while in terms of quality, we show state-of-the-art
detection rates on all the evaluated datasets.
Dataset Clustering. Dataset segregation can be
achieved by using weight vectors in conjunction with
our random sampling approach. In this technique,
weight vectors $W_P$ and $W_N$ store the number of
chances each member of $P$ and $N$ has to be selected
for training. Before training a cascade
classifier, the weights of all members are set to the default,
$W_P[1..\mathrm{size}(P)] = 1$ and $W_N[1..\mathrm{size}(N)] = 1$. When
classification is concluded, the resulting trained classifier
$C$ is validated on $t_P$ and $t_N$ (as shown in Figure 2.b).
Correctly identified samples receive an additional
chance to get randomly selected
($W_P[z] = W_P[z] + 1$, where $z$ is the sample ID). We define the Recall
score $RS$ as a distribution that reveals how many times
each test sample has been correctly identified over
several validation stages (iterations). The clustering
goal $G'$ is composed of a maximum number of iterations
$T$ and a set of thresholds for automatically separating
$P$ and $N$ into subsets with similar $RS$. When
the clustering goal $G'$ is reached, the weight vectors
$W_P$ and $W_N$ will be stored in $RS_P$ (Figure 3) and $RS_N$.
In the case of person detection, samples with low $RS$
represent the images that are difficult to discriminate
using general models (hard positives, see examples
in Figure 4, and hard negatives).
Figure 3: Ordered Recall score distribution (28k positive
samples, 55 iterations of Random Sampling Clustering).
The graph should be read in the following manner: The 1st
sample has been correctly identified by the highest number
of random sample classifiers (53) and holds a recall score of
53/55, while the 28000th sample holds a score of 1/55.
Figure 4: Top row: Hard Positives, Bottom row: Soft Positives.
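The weight update can be sketched for the positive set as follows. Only positives are handled to keep the example short, the `train` callable and the toy data are hypothetical, and the weighted draw without replacement (exponentiated random keys) is one possible choice that the text does not prescribe.

```python
import random

def weighted_subset(rng, weights, k):
    """Draw k distinct indices, favouring large weights (random-key method)."""
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    return {i for _, i in sorted(keys, reverse=True)[:k]}

def cluster_positives(P, k, train, iterations, seed=0):
    """Return the weight vector W_P after the given number of iterations."""
    rng = random.Random(seed)
    W = [1] * len(P)                           # default weights W_P[z] = 1
    for _ in range(iterations):
        chosen = weighted_subset(rng, W, k)    # weighted random sample P_rs
        classify = train([P[i] for i in chosen])
        for z, sample in enumerate(P):
            # Validate on t_P only; a correct answer earns one extra chance.
            if z not in chosen and classify(sample):
                W[z] += 1
    return W                                   # stored as RS_P

def train_mean_threshold(pos):
    """Toy stand-in for cascade training on 1-D samples."""
    thr = sum(pos) / len(pos) - 3.0
    return lambda x: x > thr

# Toy set: 20 typical positives (value 10) and 3 atypical ones (value 0).
P = [10.0] * 20 + [0.0] * 3
RS = cluster_positives(P, k=5, train=train_mean_threshold, iterations=10)
hard = [z for z, score in enumerate(RS) if score == 1]
print(hard)  # prints "[20, 21, 22]"
```

In this toy setting the three atypical samples are never recognised, so their weights stay at the default value and a threshold on the final weights separates them from the rest, mirroring the hard positives of Figure 4.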
Our positive training set consists of 28,000 image
samples which have been extracted from the MIT¹,
DAIMLER², NICTA³ and INRIA⁴ person datasets.
The negative samples are generated from a comprehensive
set of background images.
We train, in parallel, classifiers with the same training
goal but different numbers of samples (see Figure 5).
Training goals serve a great purpose in making training
automatic and enforcing quality. By adjusting
the training goal to a default 95% accuracy threshold,
our system compiles classifier cascades in modest
time (around 6 hours for our configuration) and highlights
the classifiers with the minimal number of features and
the classifiers with the highest accuracy levels.
¹ http://cbcl.mit.edu/software-datasets/PedestrianData.html
² www.gavrila.net/Datasets/datasets.html
³ http://nicta.com.au/research/projects/AutoMap/computer vision datasets
⁴ http://pascal.inrialpes.fr/data/human
Figure 5: Random sampling training is performed with subsets
of 1k, 2k, 3k and 4k size. The 3k subset (10.7% of parent
set) reveals adequate discrimination and requires a low
number of weak learners.
A qualitative improvement of the classifier cas-
cades can be obtained by raising the accuracy thresh-
old, at the cost of increased training time.
Output data such as the distribution in (Figure 3)
is of high value, as it highlights atypical training sam-
ples. For instance, a subset of samples that registers
low recall scores may offer great insight in regard to
the limitations of the current classifier configuration.
3.1 Classifier Comparison
We compare the average performance of our standard
AdaBoost classification, Random Sampling method
and Weighted Random Sampling method. (Figure 6)
reveals the trade-off in terms of detection, training
time and detection speed between the 3 approaches.
The standard boosting approach has a minimal training
time. However, it falls behind the other approaches
in terms of True Positives and detection speed. The
random sampling method maximizes detection accuracy
and improves detection speed (Figure 7), at the cost
of increased training time. With the use of weights,
difficult positive samples are given a lesser chance of
selection in the classification process. This results in
a faster generation of optimal classifiers and reduced
False Positives. A negative, but negligible, side-effect
is a mild decrease in True Positives.
Regarding training time, in our experiments,
generating a classifier using the Random Sampling
approach took, on average, 4 times longer than training
in the classical manner, and only 2 times longer
when using weights. In return, classifiers generated
with either random sampling approach use fewer
weak learners, thus minimizing detection time (up to
15 fps on VGA resolution input).
Figure 6: Classifier comparison.
Figure 7: Cascade performance graph revealing early detec-
tion for the majority of False Positives.
4 EXPERIMENTAL RESULTS
Our object detector is tested on 2 public datasets,
namely PETS⁵ and ETISEO⁶, and 2 private datasets,
namely VANAHEIM⁷ and Hospital⁸. The sequences
contain one to several objects in challenging conditions
that include illumination change, low resolution,
appearance change, pose change and partial occlusions.
All sequences have been processed with default
configuration settings: 10 different scan window
sizes and a search step of 2 pixels. The classifier threshold
is set to 50%, classic for boosting.
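The scanning configuration can be sketched as a generator of candidate windows. The window sizes below are hypothetical, since only their number (10) and the 2-pixel search step are stated; the example merely shows how the candidate count follows from these settings.

```python
def sliding_windows(img_w, img_h, sizes, step=2):
    """Yield (left, top, width, height) for every scan position."""
    for w, h in sizes:
        for top in range(0, img_h - h + 1, step):
            for left in range(0, img_w - w + 1, step):
                yield left, top, w, h

# Ten hypothetical window sizes with a 1:2 aspect ratio.
sizes = [(int(32 * 1.25 ** k), int(64 * 1.25 ** k)) for k in range(10)]

# Candidate count for a single 64x128 window on a 640x480 frame.
count = sum(1 for _ in sliding_windows(640, 480, [(64, 128)]))
print(count)  # prints "51153"
```

With tens of thousands of candidates per window size and frame, rejecting most of them in the first cascade stages is what makes real-time operation feasible.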
The detector presented a commendable behavior
on all testing data and even more so on the PETS
dataset. We have performed comparative tests with
the OpenCV⁹ HoG detector (Dalal and Triggs, 2005)
and the state-of-the-art Deformable Parts Model detector
(DPM) (P. F. Felzenszwalb and Ramanan, 2009).
On the Hospital, ETISEO and PETS datasets our approach
outperforms both OpenCV and DPM, while
on the VANAHEIM dataset it is on par with DPM and
outperforms OpenCV. A visual comparison between
DPM and our algorithm can be seen in (Figure 8).
⁵ http://pets2012.net - Dataset S2.L1-walking.
⁶ http://www-sop.inria.fr/orion/ETISEO/
⁷ http://www.vanaheim-project.eu/
⁸ http://www.demcare.eu/
Table 1: Comparison of detection results.
Dataset     Metric      OpenCV   DPM    Ours
PETS        Precision   0.92     0.95   0.95
            Recall      0.42     0.71   0.83
            F-Score     0.57     0.81   0.88
VANAHEIM    Precision   0.82     0.72   0.86
            Recall      0.41     0.73   0.62
            F-Score     0.54     0.72   0.72
Hospital    Precision   0.66     0.77   0.89
            Recall      0.7      0.76   0.83
            F-Score     0.67     0.76   0.85
ETISEO      Precision   0.91     0.83   0.91
            Recall      0.48     0.77   0.77
            F-Score     0.62     0.8    0.84
The detection results are shown in (Table 1), where:
TP - True Positives, FP - False Positives, FN - False
Negatives, P - Precision, R - Recall, F - F-Score,
$P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, $F = \frac{2 \times P \times R}{P + R}$
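As a worked example of these definitions, the short Python sketch below computes the three metrics. The function names are hypothetical, and the precision and recall values in the usage line are the PETS entries for our detector in Table 1.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def f_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# PETS, our detector: P = 0.95, R = 0.83 (values taken from Table 1).
print(round(f_score(0.95, 0.83), 3))  # prints "0.886"
```

The result agrees with the F-Score of 0.88 reported for PETS up to the rounding used in the table.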
When possible, our approach takes advantage of
context information. Here the context information
refers to the camera calibration details which includes
camera intrinsic, extrinsic information and 3D mea-
surements of the mobile objects.
When performing detection on 640x480 resolution
images, with default configuration, we attain a
constant detection speed of 15 FPS running on a new
generation processor, using a single core with 3.0 GHz
clock speed. For some datasets, context information
allows us to limit the scan window search range. Using
this information, we have reached up to 50 FPS
detection speed without any loss in terms of quality.
5 CONCLUSIONS
Our comprehensive training method based on random
sampling is a powerful tool for training optimization.
Coupled with an efficient configuration of feature
extraction, classification and detection strategy, it
compiles a competent classifier, efficient in both speed and
detection quality. In our evaluation we have shown
that our method outperforms relevant state-of-the-art
approaches.
⁹ http://docs.opencv.org/
Figure 8: First column: DPM, Second Column: Ours. Detection samples from PETS, VANAHEIM, Hospital and ETISEO.
We plan to extend feature descriptors by adding
more feature channels and Local Ternary Patterns
(X. Tan, 2010) and we hope to improve detection
speed by applying techniques presented in (P. Dollar
and Kienzle, 2012), (Rodrigo Benenson, 2012).
ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the European Community’s Seventh
Framework Programme FP7/2007-2013 - Challenge
2 - Cognitive Systems, Interaction, Robotics - under
grant agreement n. 248907 - VANAHEIM.
REFERENCES
A. Leibe, E. S. and Schiele, B. (2005). Pedestrian detection
in crowded scenes. CVPR.
Agarwal, S. and Roth, D. (2002). Learning a sparse repre-
sentation for object detection. ECCV.
Antonio Torralba, A. A. E. (2011). Unbiased Look at
Dataset Bias. CVPR.
B. Babenko, P. Dollar, Z. T. and Belongie, S. (2008). Simul-
taneous learning and alignment: Multi-instance and
multi-pose learning. ECCV.
B. Froba, A. E. (2004). Face detection with the modified
census transform. In Proc. of 6th Int. Conf. on Auto-
matic Face and Gesture Recognition, pages 91–96.
C. Wojek, S. W. and Schiele, B. (2009). Multi-cue onboard
pedestrian detection. CVPR.
C. Zhang, P. A. V. (2007). Multiple-Instance Pruning For
Learning Efficient Cascade Detectors. NIPS.
Cortes, C. and Vapnik, V. (1995). Support-Vector Networks.
Machine Learning.
D. Park, D. R. and Fowlkes, C. (2010). Multiresolution
models for object detection. ECCV.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. CVPR.
G. Mori, S. B. and Malik, J. (2005). Efficient shape match-
ing using shape contexts. TPAMI, pages 1832–1837.
Gavrila, D. M. (2007). A bayesian, exemplar-based ap-
proach to hierarchical shape matching. TPAMI.
Gavrila, D. M. and Philomin, V. (1999). Real-time object
det. for smart vehicles. ICCV.
M. Weber, M. W. and Perona, P. (2000). Unsupervised
learning of models for recognition. ECCV.
Mohan, C. P. and Poggio, T. (2001). Example-based object
det. in images by components. TPAMI, 23, no. 4:349–
361.
O. Tuzel, F. P. and Meer, P. (2008). Ped. det. via clas-
sification on riemannian manifolds. TPAMI, 30 no
10:1713–1727.
P. Dollar, Z. Tu, H. T. and Belongie, S. (2007). Feature
mining for image classification. CVPR.
P. Dollar, Z. Tu, P. P. and Belongie, S. (2009). Integral chan-
nel features. BMVC.
P. Dollar, R. A. and Kienzle, W. (2012). Crosstalk Cascades
for Frame-Rate Pedestrian Detection. ECCV.
P. Dollar, S. B. and Perona, P. (2010). The fastest pedestrian
detector in the west. BMVC.
P. Dollar, B. Babenko, S. B. P. P. and Z. Tu, M. (2008). Mul-
tiple component learning for object detection. ECCV.
P. F. Felzenszwalb, R. B. Girshick, D. M. and Ramanan, D.
(2009). Object detection with discriminatively trained
part based models. TPAMI, 99.
P. Felzenszwalb, D. M. and Ramanan, D. (2008). A
discriminatively trained, multiscale, deformable part
model. CVPR.
Papageorgiou, C. and Poggio, T. (2000). A trainable system
for object detection. IJCV, 38:111–136.
Q. Zhu, S. Avidan, M. Y. and Cheng, K. (2006). Fast human
detection using a cascade of histograms of oriented
gradients. CVPR.
R. Fergus, P. P. and Zisserman, A. (2003). Object class
recognition by unsupervised scale-invariant learning.
CVPR.
Rodrigo Benenson, Markus Mathias, R. T. L. J. V. G.
(2012). Pedestrian detection at 100 frames per sec-
ond. CVPR.
S. Maji, A. B. and Malik, J. (2008). Classification using
intersection kernel SVMs is efficient. CVPR.
S. Walk, K. S. and Schiele, B. (2010). Disparity statistics for
pedestrian detection: Combining appearance, motion
and stereo. ECCV.
Sabzmeydani, P. and Mori, G. (2007). Detecting pedestrians
by learning shapelet features. CVPR.
T. Ojala, M. P. and Maenpaa, T. (2002). Multiresolution
grayscale and rotation invariant texture classification
with local binary patterns. TPAMI, 24 no. 7:971–987.
Viola, P. A. and Jones, M. J. (2004). Robust real-time face
detection. IJCV, 57 no. 2:137–154.
Wojek, C. and Schiele, B. (2008). A performance evaluation
of single and multi-feature people detection. DAGM.
Wu, B. and Nevatia, R. (2005). Detection of multiple, par-
tially occluded humans in a single image by bayesian
combination of edgelet part detection. ICCV.
Wu, B. and Nevatia, R. (2008). Optimizing discrimination-
efficiency tradeoff in integrating heterogeneous local
features for object detection. CVPR.
X. Tan, B. T. (2010). Enhanced Local Texture Feature
Sets for Face Recognition Under Difficult Lighting
Conditions. IEEE Transactions on Image Processing,
19(6):1635–1650.
Z. Lin, G. H. and Davis, L. S. (2009). Multiple instance
feature for robust part-based object detection. CVPR.
TowardsReliableReal-timePersonDetection
239