super-classes. Second, the increased number of linear
classifiers allows the detector to be non-linear during
detection: the non-linear decision boundary between
classes is approximated by a piece-wise linear
classification function.
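As a toy illustration of the piece-wise linear argument above, the following sketch combines two linear scoring functions by taking their maximum. The 1-D data, the weights, and the helper names (`linear`, `piecewise_score`) are assumptions made for this example only and are not taken from our detector.

```python
def linear(w, b):
    """Return the 1-D linear scoring function x -> w * x + b."""
    def score(x):
        return w * x + b
    return score


# Two hypothetical linear scorers. Neither one alone separates the toy
# data below, because the positives lie on both sides of the negative.
scorers = [linear(1.0, -1.0), linear(-1.0, -1.0)]


def piecewise_score(x):
    """Maximum over linear scores: a piece-wise linear function (here |x| - 1)."""
    return max(f(x) for f in scorers)


positives = [-2.0, 2.0]  # toy samples that should receive a positive score
negatives = [0.0]        # toy sample that should receive a negative score

for x in positives + negatives:
    print(x, piecewise_score(x))
```

No single function of the form w * x + b can score both positives above zero and the negative below zero, whereas the maximum of the two linear pieces does.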
We call the linear SVM at each vertex a filter.
There are as many filters as nodes in the tree. Ev-
ery filter is represented by a weight vector. These
weight vectors are learned in a novel joint framework.
During detection, we extract features from the image at
different scales. Every location is then scored by all
the filters in the tree. The filter scores along each
path are added together, producing k final scores. These
scores are ranked and the highest score determines the
foreground class. For a background region, the
score is negative. This allows us to implicitly classify
background regions without modelling them in the
tree. Our algorithm combines ranking and classification
constraints: we rank among the k classes and
are able to distinguish between the foreground and
background labels. This is very different from the
classification task, where all possible classes are
directly modelled and one only needs to rank among the
known classes.
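The scoring of a single location can be sketched as follows. This is a minimal sketch under stated assumptions, not our implementation: the tree layout, the class names, the filter weights, and the 3-D feature vector (whose last entry is a constant 1 acting as a bias) are all hypothetical, and multi-scale feature extraction is omitted.

```python
# Hypothetical toy tree: each node maps to its parent (None for the root).
PARENT = {
    "root": None,
    "vehicle": "root",
    "animal": "root",
    "car": "vehicle",
    "bus": "vehicle",
    "cat": "animal",
}
LEAVES = ["car", "bus", "cat"]  # the k foreground classes (k = 3 here)

# One filter (weight vector) per node; the values are made up for illustration.
FILTERS = {
    "root": [0.0, 0.0, -1.0],
    "vehicle": [1.0, 0.0, 0.0],
    "animal": [0.0, 1.0, 0.0],
    "car": [0.5, -0.5, 0.0],
    "bus": [-0.5, 0.5, 0.0],
    "cat": [0.0, 0.5, 0.0],
}


def dot(w, x):
    """Inner product of a filter and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def path_to_root(leaf):
    """List the nodes on the path from a leaf up to the root."""
    node, path = leaf, []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def classify(features, filters):
    """Score one location and return (label, score).

    Every filter scores the location, the scores along each path are
    summed, and the highest sum wins. A negative best score is treated
    as background.
    """
    node_scores = {n: dot(w, features) for n, w in filters.items()}
    class_scores = {
        leaf: sum(node_scores[n] for n in path_to_root(leaf)) for leaf in LEAVES
    }
    best = max(class_scores, key=class_scores.get)
    score = class_scores[best]
    return ("background" if score < 0 else best), score


print(classify([2.0, 0.0, 1.0], FILTERS))
print(classify([0.0, 0.0, 1.0], FILTERS))
```

Note that the background label never appears as a node in `PARENT`: it is assigned only when every path score is negative.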
In summary, we make the following contributions:
(i) We show that hierarchical learning improves ob-
ject detection performance. (ii) We combine ranking
and classification constraints into one hybrid learn-
ing framework. All the weight vectors in the tree are
learned jointly. (iii) Our approach is independent of
the underlying feature descriptor and many descrip-
tors can be exploited (e.g. (Dalal and Triggs, 2005b;
Felzenszwalb et al., 2010; Zhang et al., 2011)).
In Sec. 2 we give an overview of previous work.
In Sec. 3 we describe our detector, and in Sec. 4
we show how it is trained. We present the evaluation in
Sec. 5 and conclude in Sec. 6.
2 RELATED WORK
Multi-class detection is a challenging task and an im-
portant subject of research. We show an approach that
is generic in the choice of features. This is an impor-
tant constraint on our formulation as feature descrip-
tors change over time. Traditional techniques such as
OvO (Kressel, 1999) or DAGSVM (Platt et al., 2000)
cannot be exploited, as they can only distinguish
between k known categories and thus do not handle
background.
Tree structures are widely used in image classification,
where no negative class has to be handled. Here, we
additionally have to discriminate background regions. To
the best of our knowledge, Salakhutdinov et al.
(Salakhutdinov et al., 2011) were the first to present a
tree structure for object detection, where the weight
vectors are learned iteratively, one after another. In
contrast, we optimize jointly over the tree. Their
objective was to show that balancing the number of
samples between classes helps to improve object
detection. We show that grouping classes based on their
feature similarity improves performance.
In the domain of object detection, the idea of fea-
ture sharing using boosting was proposed by Torralba
et al. (Torralba et al., 2007). The decisions among the
classes are shared using combined weak classifiers.
Opelt et al. (Opelt et al., 2008) enhanced the previous
system by further incorporating geometric part infor-
mation.
Others take a more global approach by sharing
parts instead of features. Razavi et al. (Razavi et al.,
2011) apply a voting scheme in which parts shared
among classes vote through a multi-class Hough
transform to determine the object label. A similar
technique (Ott and Everingham, 2011) applies a modified
version of the DPM (Felzenszwalb et al., 2010)
where common parts are found and grouped together.
Other approaches include sharing part locations and
deformations across the classes such as in (Fidler and
Leonardis, 2007; Fidler et al., 2010; Zhu et al., 2010).
These object detection approaches do not support
a generic feature descriptor, as they are multi-class
extensions of their respective single-class formulations.
As mentioned earlier, hierarchical classification
using SVMs has been a subject of research. Griffin and
Perona (Griffin and Perona, 2008) exploit binary trees
where each node in the tree is learned in a top-down
manner. Structured SVM was used in (Tsochantaridis
et al., 2004; Cai and Hofmann, 2004) to learn the filters
of a tree jointly. Xiao et al. (Xiao et al., 2011)
enforce orthogonality between parent and child in a
tree. Dekel et al. (Dekel et al., 2004) speed up the
training process by updating the vectors in the tree
using an online approach. In (Gao and Koller, 2011;
Marszalek and Schmid, 2008) the strict separation be-
tween categories is relaxed and each class can be fur-
ther split into sub-classes.
Again, these approaches are not suited for multi-
class object detection with a dominant background
label. We show an elegant algorithmic approach to
overcome the previously mentioned limitations.
3 SYSTEM OVERVIEW
In this section, we introduce the notation for our
hierarchical classification module. We present our
multi-class detection procedure and describe how our model
assigns a score to different locations in an image.