Hierarchical Deformable Part Models for Heads and Tails

Fatemeh Shokrollahi Yancheshmeh, Ke Chen and Joni-Kristian K

ainen

Laboratory of Signal Processing, Tampere University of Technology, Finland

Keywords:

Deformable Part Model, Object Detection, Long-tail Distribution, Imbalanced Datasets, Localization, Visual

Similarity Network, Sub-category Discovery.

Abstract:

Imbalanced long-tail distributions of visual class examples inhibit accurate visual detection, which is addressed

by a novel Hierarchical Deformable Part Model (HDPM). HDPM constructs a sub-category hierarchy by al-

ternating bootstrapping and Visual Similarity Network (VSN) based discovery of head and tail sub-categories.

We experimentally evaluate HDPM and compare with other sub-category aware visual detection methods with

a moderate size dataset (Pascal VOC 2007), and demonstrate its scalability to a large scale dataset (ILSVRC

2014 Detection Task). The proposed HDPM consistently achieves signiﬁcant performance improvement in

both experiments.

1 INTRODUCTION

Large intra-class diversities induced by camera pose,

object viewpoint and appearance variations inhibit

accurate object detection. Moreover, ambiguous

bounding box annotations make visual class detection

even more challenging. In the light of this, how to dis-

cover and exploits intra-category variation remains an

open and hot problem in object detection (Gu et al.,

2012; Dong et al., 2013; Zhu et al., 2014). For in-

stance, in Figure 1, the dining table samples consist

of at least four obvious sub-categories: upper view

circular tables, tables with empty plates, tables with

people sitting around, and side views of tables with

chairs visible. Visual appearance of samples belon-

ging to the same class are thus severely varied, e.g.

samples that include mainly the foods and dishes or

just a ﬂower jar rather than the table itself. Moreo-

ver, sub-categories are not balanced but some exam-

ples may occur more frequently and make that subca-

tegory a dominant one. To be more precise, sam-

ples from different sub-categories is long-tail distribu-

ted (Zhu et al., 2014) where dominant sub-categories

are in the head and rare sub-categories in the tail.

Visual detectors can be substantially improved by

capturing ﬁne-grained head sub-categories and, in

particular, capturing rare sub-categories in the tail

which are often omitted. Monolithic learning mo-

dels, such as Deformable Part Model (DPM) (Felzen-

szwalb et al., 2008), mainly capture the dominant sub-

categories (such as “tables with empty plates” in

Figure 1: Distribution of class examples is a long tail dis-

tribution where class examples that happen more frequently

(e.g. table including plates) are located in the head and rare

examples (e.g. tables with chair) and outliers are in the tail.

Learning a monolithic model on this dataset may fail to de-

tect the tails. The goal is to improve learning accuracy by

discovering head and tail sub-categories.

Yancheshmeh, F., Chen, K. and Kämäräinen, J-K.

Hierarchical Deformable Part Models for Heads and Tails.

DOI: 10.5220/0006532700450055

In Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018) - Volume 5: VISAPP, pages

45-55

ISBN: 978-989-758-290-5

Figure 1), but fail with the sub-categories having

sparse samples. Hence, modeling head and tail sub-

categories separately is necessary as suggested in ot-

her works (Gu et al., 2012; Dong et al., 2013; Zhu

et al., 2014). However, selecting right criteria to

group the objects properly remains an open problem.

Often long-tail examples spread among the clusters of

dominant ones and cannot form their own clusters or

outliers distort sub-category discovery.

In this work, we introduce a novel Hierarchi-

cal Deformable Part Model where we construct sub-

category hierarchy by alternating two sub-category

discovery approaches with complementary proper-

ties: bootstrapping and Visual Similarity Network

(VSN) (Shokrollahi Yancheshmeh et al., 2015). Boot-

strapping in the terms of DPM true positives and DPM

false negatives provides a rough division to the head

and tail parts. However, pair-wise similarity provides

better quality sub-categories for both dominant head

and ambiguous long-tail parts. The idea is to learn

strong models for the ones with enough examples and

share examples between rare models to improve de-

tection performance. For this intuitive approach the

hierarchy where all samples are used multiple times

establishes a strong data sharing principle (Salakhut-

dinov et al., 2011). Moreover, the hierarchy model na-

turally adapts to the number of examples - a small da-

taset allows only a shallow hierarchy while more ex-

amples allow deeper hierarchy and discovery of more

diverse sub-categories.

We make the following contributions:

• We propose a Hierarchical Deformable Part Mo-

del (HDPM) to capture long-tail distributions of

visual categories - our hierarchy is based on two

complementary and alternating approaches: i)

DPM bootstrapping and ii) Visual Similarity Net-

work based sub-category discovery.

• We adopt bootstrapping to make a coarse division

between the head and tail parts which are genera-

ted from true positives and false negatives of DPM

detection.

• We develop Visual Similarity Network (VSN)

based sub-category discovery to reﬁne the sub-

categories of both head and tail parts.

• The baseline detector in our hierarchy is DPM.

We compare our method to other sub-category aware

works on the Pascal VOC 2007 benchmark where

our method outperforms other competitors. Moreo-

ver, we demonstrate the scalability of our method with

the ILSVRC 2014 Detection benchmark for which

our HDPM provides substantial performance boost as

compared to the conventional DPM. Our code will be

made publicly available.

2 RELATED WORK

Deformable Part Model (DPM) by Felzenswalb et

al. (Felzenszwalb et al., 2008; Felzenszwalb et al.,

2010) was a state-of-the-art approach in object de-

tection before the rise of convoluational neural net-

works (Girshick et al., 2014; Girshick, 2015; Ren

et al., 2015; Redmon et al., 2016). The seminal work

of Girshick et al. (Girshick et al., 2014) proposed the

Region-based Convolutional Neural Network model

(R-CNN) which has recently inspired many follow-

ups (Girshick, 2015; Ren et al., 2015; Redmon et al.,

2016). Interestingly, DPM achieves comparable accu-

racy to R-CNN if its HOG features are replaced by

activations of the CNN layers (Girshick et al., 2015;

Wan et al., 2015). CNN features require external trai-

ning data for optimization of the network parameters

(ILSVRC 2012 was used in (Girshick et al., 2015)).

In this work, our goal is to learn a model from the

scratch using only a moderate number of training ex-

amples and therefore we select the more conventional

HOG-based DPM (Felzenszwalb et al., 2010) as our

baseline detector. Experimental results of HDPM de-

monstrate the superiority of HDPM over recent met-

hods (Felzenszwalb et al., 2010; Malisiewicz et al.,

2011; Aghazadeh et al., 2012; Gu et al., 2012; Dong

et al., 2013; Zhu et al., 2014). Special cases of DPM

hierarchy have been introduced, for example, Ghiasi

and Fowlkes (Ghiasi and Fowlkes, 2014) used hierar-

chy of occlusion patterns to robust face detection, but

to the authors’ best knowledge our work is the ﬁrst to

explicitly model intra-class diversities by introducing

hierarchical DPMs.

Long-Tail Distributions – Distributions of object

class examples in human captured images is not uni-

form, but certain classes and certain view points do-

minate datasets. For example, most of the people

stand in images, but they can also be assumed to have

a large number of unusual poses (e.g. a person riding

a horse). A practical solution is to balance a data-

set (Russakovsky et al., 2015). However it seems

difﬁcult for collecting sufﬁcient samples for tail sub-

categories. Reducing the sample size of dominant

sub-categories is more practical but it can have ne-

gative effect on detection performance. Performance

boost by modeling these rare sub-categories was rai-

sed by Salakhutdinov et al. (Salakhutdinov et al.,

2011) who introduced the data sharing principle: rare

classes borrow statistical strength from related classes

that have dense examples. The sharing was extended

to sub-categories that depict rare viewpoints by (Zhu

et al., 2014). In this work, we term both dominant

and rare appearance and viewpoints with a common

term of sub-category. While most of the works on

VISAPP 2018 - International Conference on Computer Vision Theory and Applications

long-tail visual detection use more traditional base-

line detection models, recently performance boost has

been reported also for the state-of-the-art CNN appro-

ach (Ouyang et al., 2016). In this work, we capture

sub-categories of the head and tail parts of long-tail

class distributions by adopting two principles: boots-

trapping and Visual Similarity Network (VSN) based

sub-category discovery.

Bootstrapping – Bootstrapping can be used to le-

arn different aspects of training data. For example,

mining hard negatives aims to distinguish two simi-

lar classes and mining hard positives works for co-

vering rarely encountered examples of a category. In

visual class detection bootstrapping is not new. Mi-

ning hard negatives (misclassiﬁed as positives) was

used already in the original HOG detector (Dalal and

Triggs, 2005) and later in DPM as well (Felzenszwalb

et al., 2008). In (Li et al., 2013) a term “relevant ne-

gatives” was introduced, but the principle is the same.

Our DPM learning also re-trains with hard negatives,

but we adopt bootstrapping to identify hard positives.

Zhu et al. (Zhu et al., 2014) construct candidate mo-

dels by training Exemplar-SVMs (Malisiewicz et al.,

2011) for each positive training example and selecting

a ﬁxed amount of best scoring positive examples to

form a new sub-category training set. Greedy se-

arch is applied to select the best combination of the

DPMs, but the method is computationally expensive

due to a large number of the sub-models to be tested.

While their approach is bottom up, our bootstrapping

is top-down. We ﬁrst construct a single DPM and se-

lect the false negatives to form a bootstrap set. Diffe-

rent from Zhu et al. (Zhu et al., 2014) bottom-up data

sharing, we propose a top-down hierarchy to discover

sub-categories by alternating visual similarity graph

based sub-category discovery and bootstrapping.

Sub-category Discovery – Sub-category discovery

and effective data exploitation are important for long-

tail distribution models in view of sparse samples in

long-tail sub-categories. Important research questions

arise: how to ﬁnd sub-categories and how to effecti-

vely exploit small data in model learning? Sivic et

al. (an B.C. Russell et al., 2008) proposed a visual

Bag-of-Words based discovery of class hierarchy, but

their work assumed a number of dominant and balan-

ced categories. Sub-category (subordinate class) mi-

ning was proposed by Hillel and Weinshall (Hillel and

Weinshall, 2006) using generative models, but their

method cannot cope with viewpoint changes. In a

more recent work, Gu et al. (Gu et al., 2012) used a set

of seed images where other train images were aligned

using object boundaries and seed speciﬁc classiﬁers

were trained to represent sub-categories. The draw-

backs of their method are more demanding supervi-

sion in the terms of object boundary annotations and

unsupervised selection of good seeds. The method

by Aghazadeh et al. (Aghazadeh et al., 2012) avoided

strong supervision by using Exemplar-SVMs (Ma-

lisiewicz et al., 2011) to construct a classiﬁer for

each example, but this exhaustive procedure limits

their method suitability only for moderate size data-

sets. (Aghazadeh et al., 2012) and (Malisiewicz et al.,

2011) did not particularly mine sub-categories, but

their methods relied on a large number of detectors

that together represent all aspects of the data. The

Exemplar-SVM based approach was extended by Zhu

et al. (Zhu et al., 2014) who also selected a number

of seed images for which Exemplar-SVM was em-

ployed, but they built stronger models by data sharing

where best scoring examples were added to the seed

sub-categories and DPMs trained as stronger detec-

tors. The DPMs were pruned by greedy search, but

the seed selection remains as a problem as well as

the non-adaptive addition of shared examples (a ﬁxed

number). Dong et al. (Dong et al., 2013) proposed si-

milarity based sub-category mining, but their afﬁnity

measure was based on Exemplar-SVN training with

again limits data set size to moderate. Alternatively,

Yu and Grauman (Yu and Grauman, 2014) postpo-

ned sub-category discovery to the testing phase and

used local neighborhood of each test sample. Pu et

al. (Pu et al., 2014) introduced intra-class grouping

(sub-categories) and data sharing penalties for Sup-

port Vector cost function to improve ﬁne-grained ca-

tegorization.

Visual Similarity Networks – Hard decisions by

classiﬁers and clustering methods are always bia-

sed toward dominating patterns of data and there-

fore in this work we adopt a graph approach which

has recently gained momentum in vision applicati-

ons (Deng et al., 2014; Krause et al., 2015; John-

son et al., 2015; Rematas et al., 2015; Rubinstein

et al., 2016). A graph represents multiple latent and

even subtle connections between samples without the

need of hard decisions. We adopt a special structure

denoted as Visual Similarity Network (VSN) where

the links between vertices denote the strength of a

pair-wise similarity between two images. One of the

ﬁrst VSN models was the matching graph structure

by Philbin and Zisserman (Philbin and Zisserman,

2008), but it was developed for speciﬁc object ma-

tching as link between local features connected two

different viewpoints of the same object. In parallel,

Kim et al. (Kim et al., 2008) seeded the term Vi-

sual Similarity Network and proposed a method to

construct a graph representing visual categories. Re-

cently, more advanced Visual Similarity Network ap-

proaches were proposed in (Isola et al., 2015; Shok-

Hierarchical Deformable Part Models for Heads and Tails

Figure 2: The HDPM workﬂow: bootstrapping and VSN sub-category discovery alternate along the hierarchy levels (the steps

2 and 3 repeat). The best combination of DPM models are selected in section 3.3.

rollahi Yancheshmeh et al., 2015; Zhou et al., 2015).

Isola et al. (Isola et al., 2015) VSN represents ca-

tegory transformations, e.g. from a raw tomato to

a rotten tomato while Shokrollahi et al. (Shokrol-

lahi Yancheshmeh et al., 2015) and Zhou et al. (Zhou

et al., 2015) provided pair-wise similarity in a full

connected manner. They both constructed a network

from local features that align over geometric transfor-

mations and therefore capture viewpoint changes. In

this work, we adopt the VSN approach in (Shokrol-

lahi Yancheshmeh et al., 2015) and adapt it for sub-

category discovery by accumulating local features to

represent category speciﬁc landmarks.

3 HIERARCHICAL DPM

Our Hierarchical DPM (Figure 2) constructs hierar-

chy where at the top we have a root model learned

using all training examples (Step 1). Branches re-

present division of images to sub-categories by two

alternating procedures: 1) Step 2 – bootstrapping

(Section 3.1) and 2) Step 3 – Visual Similarity Net-

work (VSN) sub-category discovery (Section 3.2). As

we move from a root to leaves, the sub-categories re-

present more exclusive distinctions. Intuitively boot-

strapping will provide a coarse division to long-tail

distribution’s head and tail parts. VSN sub-category

discovery reﬁnes both parts by identifying less domi-

nant sub-categories that would otherwise be suppres-

sed by dominating ones.

The input of HDPM is a set of N training examples

S =

{

I, b, c

}

n=1,...,N

where I

denotes the nth training

image, b

bounding box coordinates of a class with

the label c. Each class contains N

training samples

and HDPM

:s are constructed independently. The ba-

seline model used in each tree node is Deformable

Part Model (DPM) (Felzenszwalb et al., 2008; Fel-

zenszwalb et al., 2010)

DPM

= DPM (S

, S

neg

) (1)

where S

∈ S is a set of positive examples for the class

c and S

neg

is a set of negative examples (images of all

other classes S

neg

= S \ S

in our experiments).

3.1 Bootstrapping

A known characteristic of the Latent SVM (Felzen-

szwalb et al., 2008) algorithm in DPM is that it can

effectively learn a dominating sub-category from S

but cannot represent less dominating sub-categories

and suffers from more than one competing domi-

nant sub-category. To cope with multiple domina-

ting sub-categories, DPM clusters box aspect ratios

of the input bounding boxes to M clusters for which

separate DPMs are trained referring to components in

the DPM terminology. Our hierarchy replaces heu-

ristic discovery of components by data-driven sub-

categorization.

The baseline model in (1) is trained for remaining

training examples after each branch in the tree. Then,

we test the trained model M

on the positive exam-

ples to divide them into two bootstrap sets: domi-

nant sub-category examples (true positives) S

and

rare sub-category examples (false negatives) S

−

. In

the next step, the both sets are reﬁned to the next le-

vel sub-categories by Visual Similarity Network sub-

category discovery (Section 3.2). In addition to VSN

we also train two bootstrapped DPMs using the two

sets:

DPM

= DPM



, S

neg



−

DPM

= DPM



−

, S

neg



(2)

VISAPP 2018 - International Conference on Computer Vision Theory and Applications

In our preliminary experiments we evaluated two va-

riants by adding not detected or detected examples

to the negative set respectively, S

neg

= S

neg

∪ S

−

and

−

neg

= S

neg

∪ S

, but found this inferior to using only

neg

3.2 Visual Similarity Network for

Sub-Categories

The motivation behind VSN based sub-category dis-

covery is to identify both appearance sub-categories,

e.g. enduro and scooter motorbikes, and viewpoint

sub-categories, e.g. frontal and side views of cars.

If the number of samples is small for these sub-

categories (long-tail sub-categories), we need to adopt

an approach that retains all pair-wise image similari-

ties. A suitable data structure is the similarity graph

which has been used in similar tasks in the recent

works (Isola et al., 2015; Shokrollahi Yancheshmeh

et al., 2015; Zhou et al., 2015). We adopt Visual Si-

milarity Network (VSN) by Shokrollahi et al. (Shok-

rollahi Yancheshmeh et al., 2015) and adapt it for our

problem by introducing class speciﬁc landmark accu-

mulation.

The main component of (Shokrollahi Yan-

cheshmeh et al., 2015) is a cost function

C(I

, I

) = λ

match

, I

) + λ

dist.

, I

) , (3)

which computes the matching cost of two images, I

and I

, using feature correspondence with matching

cost C

match

and spatial distortion cost C

dist

. The ma-

tching cost denotes similarity in a feature space (e.g.

SIFT) and distortion cost how well the features can

be aligned in the spatial domain. Minimization of the

cost function (3) is cast as a generalized assignment

problem:

maximize

∑

i j

subject to

∑

i j

≤ 1 i = 1, . . . , N

∑

i j

≤ 1 j = 1, . . . , N

i j

∈

{

0, 1

}

, (4)

where a

i j

are binary assignments between the features

i = 1, . . . , N

of the image I

and j = 1, . . . , N

of I

The assignment constraints require at most one match

between each feature and s

i j

:s are similarity values in

[0, 1] that combine the previously deﬁned feature and

distortion costs into a single value

S(i, j) = e

C(i, j)

= e

(i, j)

= S

(i, j)S

(i, j) .

(5)

is the distance in the spatial space and D

denotes

the distance in the feature space. For solving the opti-

mization problem Shokrollahi et al. (Shokrollahi Yan-

cheshmeh et al., 2015) proposed to replace the cost

values with rank-order statistics yielding to fast ap-

proximation sketched in Algorithm 1 where the trans-

formation T is parameterized with a translation vector

), rotation θ and uniform scaling s.

Algorithm 1: Generalized assignment approx. solution.

1: Compute the feature distance matrix D

×N

(e.g.,

dense SIFT).

2: On each row of D

set the K smallest to 1 and 0 other-

wise.

3: S

= 0.

4: for i = 1 : N

(features of I

) do

5: Compute the distance from x

(a)

to T (x

(b)

) for j =

1, . . . , K non-zero entries of D

and if D

(i, j) ≤ τ

then set S

(i, j) = 1 and break.

6: end for

7: return the number of non-zero terms in S

3.2.1 Object Landmark Driven Similarity

The main problem of Algorithm 1 is that in the pre-

sence of background clutter it can get stuck to ma-

tching backgrounds (background-to-background ma-

tching) or object-to-background matching instead of

the desired object-to-object matching. Our solution is

two-fold: i) we mask features outside object region

by bounding boxes and ii) accumulate features that

match in multiple images to produce a reﬁned set of

class speciﬁc landmarks.

We adopt otherwise the parameter settings from

the original work, but run an accumulation procedure

in Algorithm 2. The algorithm retains only the lo-

Algorithm 2: Accumulation of I

landmark scores.

1: S

acc

= 0.

2: for n = 1 : N

in S

3: Compute Nelder-Mead optimization (max search) of

the transformation matrix T parameters using Algo-

rithm 1 as the target function.

4: S

acc

= S

acc

+ S

5: end for

6: return Remove other than the B best features of I

cal features that are veriﬁed by multiple training set

images and therefore removes object-to-background

and background-to-background matches effectively.

Using the veriﬁed set of local features, object speciﬁc

landmarks, we can re-execute Algorithm 1 and com-

pute the afﬁnity matrix representing pair-wise mat-

ches between all images using the veriﬁed landmarks:

Hierarchical Deformable Part Models for Heads and Tails

a f f







k kS

k . . . kS

. . . .

k kS

k . . . kS







(6)

where k · k is the number of non-zero terms in S

and now represents similarity in [0, B]. For our expe-

riments we set B = 80, but this selection does not have

drastic effect to the performance as long as B ≥ 20.

3.2.2 Spectral Sub-categories

The next step is to discover sub-categories in the

full connected graph deﬁned by the afﬁnity ma-

trix A

a f f

in (6). For this task we adopt the self-

tuning spectral clustering by Zelnik-Manor and Pe-

rona (Zelnik-Manor and Perona, 2004) which ex-

tends the original spectral clustering of Ng et al. (Ng

et al., 2001) by adding unsupervised selection for the

spectral scale and number of clusters.

Similarity values in (6) represent pair-wise simi-

larity as an integer where B denotes high similarity

(all landmarks match) and 0 low similarity (no mat-

ches). Self-tuning spectral clustering is based on pre-

computed node distance matrix and therefore we con-

vert integer similarities to rank-order distances (the

super- and the sub-scripts of the afﬁnity matrix omit-

ted for clarity):

i j

rank(A

i j

, sort(A

i,:

))

. (7)

where N

is the highest rank (distance is 1) and 1 is

the lowest (distance N

The scaled afﬁnity matrix is computed from the

rank-order distance matrix

i j

= exp

−D

i j

(8)

where

= D

(9)

where D

is “distance” to the K:th neighbor of i:th

entry (K = 10 in our experiments). Moreover,

= 0.

For the normalized afﬁnity matrix the C largest eigen

vectors are selected and the rotation R that aligns the

matrix formed from the eigen vectors to the canonical

coordinate system and the number of clusters is iden-

tiﬁed by selecting the number of clusters that provides

the minimal alignment cost. The number of clusters

from 2 to 15 were tested in our experiments. See Fi-

gure 3 for illustrations of the found sub-categories.

Figure 3: VSN sub-category discovery for the head and tail

parts of the VOC2007 bicycle class and ILSVRC2014 ca-

mel. Note how the head sub-categories are more evident and

tail sub-categories represent rare examples (upside-down

bicycle, resting camel). Outliers not detected by our system

are marked red and often represent ambiguous annotation

(e.g. only a bicycle handle, heavily occluded camel, extre-

mely poor resolution).

Similar to the bootstrapping in Section 3.1 we

train separate DPMs for each VSN sub-category c(1),

c(2), . . . , c(C)

DPM

c(i)

= DPM



c(i)

, S

neg



. (10)

Since the bootstrapping and VSN sub-category disco-

very steps alternate (Figure 2) the sub-category DPMs

give raise to new division to head and tail sets by boot-

strapping and the process continues similarly until too

few examples to train DPMs (e.g. ≤ 15 images for

each sub-category). In our experiments, we ﬁxed the

maximum level to 8.

3.3 Model-Selection

In validation time, we select the best set of DPM mo-

dels in (Figure 2). This procedure can be very time

consuming since all possible combination for n num-

ber of DPM models is

∑

r=1

r!(n−r)!

and thus not com-

putationally effective. Instead, we take advantage of

the hierarchical tree where we expect to have the mo-

dels of dominant subcategories on the right branch

(green in Figure 2) and long tails’ on the left branch

(orange in Figure 2). Therefore, we should have the

strongest models on the right side and if the left side

learned something new, then it will be added. Thus,

we ﬁrst explore the best models combination from the

right side that gives the maximum score on validation

set, next we give the selected set to the left side and

repeat the process again. The output will be our ﬁnal

selected set of models.

VISAPP 2018 - International Conference on Computer Vision Theory and Applications

Table 1: Pascal VOC 2007 comparison (AP in %) of state-of-the-art sub-category aware methods.

∗

DP-DPM Uses external

training data (1.2M image ILSVRC 2012 training set).

Method aero bike bird boat bott bus car cat chair cow tabl dog horse mbike pers plant sheep sofa train tv mAP

DPM (Felzenszwalb et al., 2010) 28.7 55.1 6.0 14.5 26.5 39.7 50.2 16.3 16.5 16.6 24.5 5.0 45.2 38.7 36.2 9.0 17.4 22.8 34.1 38.4 27.1

DPMv5 Website 32.1 60.2 10.5 14.0 30.0 54.0 57.0 24.7 22.6 26.8 29.1 8.6 59.8 46.7 41.4 13.4 22.1 34.4 44.3 44.5 33.8

DPMv5-reproduced 32.8 59.3 10.9 14.1 29.2 52.6 57.8 27.5 23.1 24.6 30.8 13.0 61.6 46.6 40.2 13.0 19.1 31.2 46.3 44.4 33.9

ESVM (Malisiewicz et al., 2011) 20.8 48.0 7.7 14.3 13.1 39.7 41.1 5.2 11.6 18.6 11.1 3.1 44.7 39.4 16.9 11.2 22.6 17.0 36.9 30.0 22.7

MCIL (Aghazadeh et al., 2012) 33.3 53.6 9.6 15.6 22.9 48.4 51.5 16.3 16.3 20.0 23.8 11.0 55.3 43.8 36.9 10.7 22.7 23.5 38.6 41.0 29.8

MCM (Gu et al., 2012) 33.4 37.0 15.0 15.0 22.6 43.1 49.3 32.8 11.5 35.8 17.8 16.3 43.6 38.2 29.8 11.6 33.3 23.5 30.2 39.6 29.0

DPM-AGS (Dong et al., 2013) 34.7 61.4 11.5 18.6 30.0 53.8 58.8 24.7 24.7 26.8 31.4 13.8 61.4 49.2 42.2 12.9 23.9 38.5 50.8 45.5 35.7

DPM-LTS (Zhu et al., 2014) 34.1 61.2 10.1 18.0 28.9 58.3 58.9 27.4 21.0 32.3 34.6 15.7 54.1 47.2 41.2 18.1 27.2 34.6 49.3 42.2 35.7

HDPM (ours) 35.8 61.6 11.9 17.2 30.5 53.9 59.1 29.2 23.8 27.5 37.0 15.3 62.4 48.4 42.4 16.3 21.2 35.1 47.7 45.8 36.1

DP-DPM (Girshick et al., 2015)

∗

44.6 65.3 32.7 24.7 35.1 54.3 56.5 40.4 26.3 49.4 43.2 41.0 61.0 55.7 53.7 25.5 47.0 39.8 47.9 59.2 45.2

4 EXPERIMENTS

4.1 Datasets and Settings

To make our method comparable with the similar

works (Zhou et al., 2015; Dong et al., 2013), we eva-

luated HDPM on the PASCAL VOC 2007 dataset.

Pascal VOC 2007 contains 20 categories with 9, 963

images in total. The dataset is divided into ‘train-

val’ and ‘test’ subsets including 5011 and 4952 ima-

ges, respectively. In our experiments, we follow the

protocol in (Zhou et al., 2015) and report the results

for the ‘test’ subset. In the experiments on the VOC

2007 benchmark, we set the number of components in

DPM detector to 2 and we continue constructing the

HDPM hierarchy up to the level 8.

We also evaluate the performance of the propo-

sed method on the ILSVRC2014 detection task to

test scalability. ILSVRC 2014 detection set contains

456, 567 training images, 20, 121 validation images,

and 40, 152 test images. There are 200 categories

and the number of positive training images per cate-

gory varies between 461 and 67, 513. The number

of negative training images per category is between

42, 945 and 70, 626. The ground truth are released

only for the training and validation data and for test

data we submitted our results to the evaluation ser-

ver (one submission was made for each class). To

speed up the computation with ILSVRC we made the

following compromises: a) the maximum number of

negative examples was set to the number of positive

images per each class, b) the number of DPM com-

ponents was ﬁxed to 1 and c) HDPM hierarchy was

computed only up to the level L3. These compromi-

ses allowed computation of 200 HDPMs within one

week.

With the both datasets, we ﬁrst constructed

HDPMs for each class using the training exam-

ples, and then selected the best combination of the

sub-category DPMs using the validation examples

(Section 3.3). For constructing the VSNs, we ﬁrst

cropped images inside the bounding boxes and sca-

led them to the size of 200 × 200 pixels keeping as-

pect ratios. Secondly, we executed spectral cluste-

ring (Zelnik-Manor and Perona, 2004) that automa-

tically selected the number of clusters and provided

sub-category assignments for each image. In all ex-

periments, we set the search range for the number

of clusters to 2 − 15, and the method typically pro-

vided 2−4 on each hierarchy level. We employed the

standard Non-Maximum Suppression (NMS) on the

detected candidates of bounding boxes. The evalu-

ation metric was Average Precision (AP) and mean

of Average Precision (mAP) without contextual re-

scoring.

4.2 Comparison with State-of-the-Arts

In this experiment, we compared our method to ot-

her published sub-category aware methods on the

Pascal VOC 2007 dataset and using only the pro-

vided training and validation set images in model

training: Exemplar-SVM (ESVM) by Malisiewicz et

al. (Malisiewicz et al., 2011), Multi-Component Mo-

del (MCM) by Gu et al. (Gu et al., 2012), Mixture

Component Identiﬁcation and Learning (MCIL) by

Aghazadeh et al. (Aghazadeh et al., 2012) Ambiguity

Guided Graph Shift DPM (DPM-AGS) by Dong et

al. (Dong et al., 2013) and Long-Tail Subcategories

DPM (DPM-LTS) by Zhu et al. (Zhu et al., 2014).

The results are shown in Table 1. Since our baseline

model is based on the DPM version 5 (DPMv5), we

also report results achieved in our own experiments

as well as the results reported in (Felzenszwalb et al.,

2010). The results slightly vary due to parallel execu-

tion since computing the cost function gradient varies

with different numbers of threads. In addition, we re-

port the results for the the deep DPM where the HOG

features are replaced with activations of the full Ima-

geNet dataset trained neural network AlexNet (DP-

DPM) by Girshick et al. (Girshick et al., 2015).

Our model achieved the best average precision

(AP) for 8 out of the 20 categories and the best overall

mean average precision (mAP). By comparing our re-

Hierarchical Deformable Part Models for Heads and Tails

Figure 4: Per class boost in ILSVRC2014 detection task using HDPM over DPM (blue: > 1, green: ≈ 1, orange: < 1).

sults to the two similar and recent works, DPM-AGS

and DPM-LTS, it is evident that the upper limit of

performance using HOG features and only the provi-

ded data is almost reached. There is a clear difference

as compared to DP-DPM, but their method uses the

additional massive ImageNet dataset for training the

CNN feature extraction network.

4.3 Large Scale Scalability on the

ImageNet 200

Figure 4 shows the proportional improvement using

HDPM over DPM for the 200 ILSVRC2014 detection

task classes. It is noteworthy, that using our restricted

settings we improved performance for 143 out of 200

categories. For 30 classes the boost was > 2× and

only for 12 classes the performance degraded below

0.5. The mAP values were 9.84% for HDPM and

8.54% for DPM providing average improvement of

15% and median improvement of 40% being clearly

signiﬁcant. The results are on pair with the state-of-

the-art before the era of CNNs. Per class accuracies

are available in appendix.

5 CONCLUSIONS

We proposed HDPM to address the problem of long-

tail distributions of visual class examples. Our mo-

del achieved superior accuracy to other proposed sub-

category aware DPM-based models and provides sca-

lability to large scale problems. In our future work,

we will replace the standard HOG DPM with the more

recent Deep DPM (Girshick et al., 2015) to bene-

ﬁt from the performance of data optimized features

and we will investigate computationally more power-

ful bootstrapping and SVN construction.

REFERENCES

Aghazadeh, O., Azizpour, H., Sullivan, J., and Carlsson, S.

(2012). Mixture component identiﬁcation and lear-

ning for visual recognition. In ECCV.

an B.C. Russell, J. S., Zisserman, A., Freeman, W., and

Efros, A. (2008). Unsupervised discovery of visual

object class hierarchies. In CVPR.

Dalal, N. and Triggs, B. (2005). Histograms of oriented

gradients for human detection. In CVPR.

Deng, J., Ding, N., Jia, Y., Frome, A., Murphy, K., Bengio,

S., Li, Y., Neven, H., and Adam, H. (2014). Large-

scale object classiﬁcation using label relation graphs.

In ECCV.

Dong, J., Xia, W., Chen, Q., Feng, J., Huang, Z., and Yan,

S. (2013). Subcategory-aware object classiﬁcation. In

CVPR.

Felzenszwalb, P., Girshick, R., McAllester, D., and Rama-

nan, D. (2010). Object detection with discriminatively

trained part-based models. IEEE PAMI, 32(9):1627–

1645.

Felzenszwalb, P., McAllester, D., and Ramanan, D. (2008).

A discriminatively trained, multiscale, deformable

part model. In CVPR.

Ghiasi, G. and Fowlkes, C. (2014). Occlusion coherence:

Localizing occluded faces with a hierarchical defor-

mable part model. In CVPR.

Girshick, R. (2015). Fast R-CNN. In ICCV.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).

Rich feature hierarchies for accurate object detection

and semantic segmentation. In CVPR.

Girshick, R., Iandola, F., Darrell, T., and Malik, J. (2015).

Deformable part models are convolutional neural net-

works. In CVPR.

VISAPP 2018 - International Conference on Computer Vision Theory and Applications

Gu, C., Arbelaez, P., Lin, Y., Yu, K., and Malik, J. (2012).

Multi-component models for object detection. In

ECCV.

Hillel, A. and Weinshall, D. (2006). Subordinate class re-

cognition using relational object models. In NIPS.

Isola, P., Lim, J., and Adelson, E. (2015). Discovering states

and transformations in image collections. In CVPR.

Johnson, J., Krishna, R., Stark, M., Li, L.-J., Shamma, D.,

Bernstein, M., and Fei-Fei, L. (2015). Image retrieval

using scene graphs. In CVPR.

Kim, G., Faloutsos, C., and Hebert, M. (2008). Unsupervi-

sed modeling of object categories using link analysis

techniques. In CVPR.

Krause, J., Jin, H., Yang, J., and Fei-Fei, L. (2015).

Fine-grained recognition without part annotations. In

CVPR.

Li, X., Snoek, G., Worring, M., D.Koelma, and Smeulders,

A. (2013). Bootstrapping visual categorization with

relevant negatives. IEEE Trans. on Multimedia, 15(4).

Malisiewicz, T., Gupta, A., and Efros, A. (2011). Ensemble

of exemplar-SVMs for object detection and beyond.

In ICCV.

Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clus-

tering: Analysis and an algorithm. In NIPS.

Ouyang, W., Wang, X., Zhang, C., and Yang, X. (2016).

Factors in ﬁnetuning deep model for object detection

with long-tail distribution. In CVPR.

Philbin, J. and Zisserman, A. (2008). Object mining using

a matching graph on very large image collections. In

Indian Conference on Computer Vision, Graphics and

Image Processing.

Pu, J., Jiang, Y.-G., Wang, J., and Xue, X. (2014). Which

looks like which: Exploring inter-class relationships

in ﬁne-grained visual categorization. In ECCV.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A.

(2016). You only look once: Uniﬁed, real-time ob-

ject detection. In CVPR.

Rematas, K., Fernando, B., Dellaert, F., and Tuytelaars, T.

(2015). Dataset ﬁngerprints: Exploring image col-

lections through data mining. In CVPR.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-

CNN: Towards real-time object detection with region

proposal networks. In NIPS.

Rubinstein, M., Liu, C., and Freeman, W. (2016). Joint in-

ference in weakly-annotated image datasets via dense

correspondence. Int J Comp Vis, 119:23–45.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,

Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-

stein, M., Berg, A. C., and Fei-Fei, L. (2015). Image-

Net Large Scale Visual Recognition Challenge. Int J

Comp Vis, 115(3):211–252.

Salakhutdinov, R., Torralba, A., and Tenenbaum, J. (2011).

Learning to share visual appearance for multiclass ob-

ject detection. In CVPR.

Shokrollahi Yancheshmeh, F., Chen, K., and K

ainen,

J.-K. (2015). Unsupervised visual alignment with si-

milarity graphs. In CVPR.

Wan, L., Eigen, D., and Fergus, R. (2015). End-to-end inte-

gration of a convolutional network, deformable parts

model and non-maximum suppression. In CVPR.

Yu, A. and Grauman, K. (2014). Predicting useful neig-

hborhoods for lazy local learning. In NIPS.

Zelnik-Manor, L. and Perona, P. (2004). Self-tuning

spectral clustering. In NIPS.

Zhou, T., Lee, Y., Yu, S., and Efros, A. (2015). Flow-

Web: Joint image set alignment by weaving consis-

tent, pixel-wise correspondences. In CVPR.

Zhu, X., Anguelov, D., and Ramanan, D. (2014). Captu-

ring long-tail distributions of object subcategories. In

CVPR.

Hierarchical Deformable Part Models for Heads and Tails

APPENDIX

Figure 5: AP over the ﬁrst 200 categories (synsets) of ImageNet test set (part1).

VISAPP 2018 - International Conference on Computer Vision Theory and Applications

Figure 6: AP over next 200 categories of ImageNet test set (part2).

Hierarchical Deformable Part Models for Heads and Tails