Fine-Grained Retrieval with Autoencoders

Tiziano Portenier, Qiyang Hu, Paolo Favaro and Matthias Zwicker

University of Bern, Bern, Switzerland

University of Maryland, College Park, U.S.A.

Keywords:

Image Retrieval, Deep Learning, Autoencoders.

Abstract:

In this paper we develop a representation for ﬁne-grained retrieval. Given a query, we want to retrieve data

items of the same class, and, in addition, rank these items according to intra-class similarity. In our training

data we assume partial knowledge: class labels are available, but the intra-class attributes are not. To com-

pensate for this knowledge gap we propose using an autoencoder, which can be trained to produce features

both with and without labels. Our main hypothesis is that network architectures that incorporate an autoen-

coder can learn features that meaningfully cluster data based on the intra-class variability. We propose and

compare different architectures to construct our features, including a Siamese autoencoder (SAE), a classi-

fying autoencoder (CAE) and a separate classiﬁer-autoencoder (SCA). We ﬁnd that these architectures indeed

improve ﬁne-grained retrieval compared to features trained purely in a supervised fashion for classiﬁcation.

We perform experiments on four datasets, and observe that the SCA generally outperforms the other two. In

particular, we obtain state of the art performance on ﬁne-grained sketch retrieval.

1 INTRODUCTION

In example-based retrieval, given a query represented

by an exemplar the goal is to return data items that are

as similar to the exemplar as possible, usually in a list

ordered by similarity to the exemplar. Similarity be-

tween exemplar and query results may be judged by

whether they belong to the same object class, or by

more ﬁne-grained properties, such as whether query

and result show the same instance of a class (instance-

level retrieval), or by similarity in pose, color, or style

of objects in images. Thanks to the availability of

a large amount of labeled data, neural networks for

classiﬁcation and class-based retrieval can be trained

very successfully in a supervised fashion (He et al.,

2016). The resulting features, however, do not naturally

support fine-grained retrieval, because class-based training

leads to invariance to fine-grained properties such as object

pose. To suppress this invariance, one could enrich the training

data with labels for fine-grained properties (such as instance-level

labels), and train again in a supervised manner (Radenović

et al., 2016; Gordo et al., 2016). However, this

requires much more effort to prepare suitable labeled

training data.

To avoid cumbersome labeling of training data,

one could leverage representations obtained using au-

toencoder networks for ﬁne-grained retrieval. By con-

struction, features extracted from autoencoders retain

as much information as possible about the data, in-

cluding ﬁne-grained properties. Therefore, it seems

attractive to exploit such representations for ﬁne-

grained retrieval. When trained in an unsupervised

manner, however, autoencoder features suffer from

entanglement, which means that information about

class membership and ﬁne-grained properties may be

encoded in overlapping feature dimensions. There-

fore, we cannot expect good class-based retrieval per-

formance when using autoencoder features.

Our goal in this paper is to develop a representa-

tion that supports ﬁne-grained retrieval, but does not

require supervised learning with ﬁne-grained labels.

Given a query, we want to retrieve data items of the

same class, and in addition, rank the query results ac-

cording to intra-class similarity (for example object

pose). Our assumption is that we have training data

containing class labels, but intra-class variability is

not labeled and needs to be taken into account using

unsupervised learning. Our main hypothesis is that

we can achieve ﬁne-grained retrieval by leveraging an

autoencoder, which should learn to capture intra-class

variability in an unsupervised manner.

Portenier, T., Hu, Q., Favaro, P. and Zwicker, M.

Fine-Grained Retrieval with Autoencoders.

DOI: 10.5220/0006602100850095

In Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018) - Volume 5: VISAPP, pages 85-95

ISBN: 978-989-758-290-5

We explore different ways to combine class-based

supervised learning and an autoencoder to construct

our features, and show that indeed this can improve

ﬁne-grained retrieval. As a key contribution of this

paper, we propose and compare three main architec-

tures: (1) a siamese autoencoder (SAE) that learns a

representation that respects class membership using a

contrastive loss; (2) a classifying autoencoder (CAE)

that jointly minimizes the sum of an autoencoding and

a classiﬁcation loss using a single network; (3) a sepa-

rate classiﬁer-autoencoder (SCA) based on separate

training of a supervised classiﬁcation and an unsu-

pervised autoencoding network, followed by conca-

tenation of the respective feature vectors. In our ex-

periments, we ﬁnd that all three architectures indeed

improve ﬁne grained retrieval compared to features

trained purely in a supervised fashion for classiﬁca-

tion. In addition, the SCA outperforms the other two

(SAE and CAE). Intuitively, this is because the SCA

leads to better disentanglement, separating invariant

class properties from ﬁne-grained variability. We ex-

ploit this to construct a feature similarity metric that

is highly effective for ﬁne-grained retrieval.

We evaluate and compare our approaches using

four different datasets, MNIST hand-written di-

gits (Lecun et al., 1998), Google street view house-

numbers (Netzer et al., 2011), images rendered from

ShapeNet (Chang et al., 2015), and sketches from the

Sketchy database (Sangkloy et al., 2016). In gene-

ral, we show that concatenating supervised and unsu-

pervised features that were trained separately outper-

forms the other techniques that we explored.

In particular, we demonstrate that for sketch-based

sketch retrieval we signiﬁcantly outperform the state

of the art on ﬁne-grained retrieval in the Sketchy da-

tabase (Sangkloy et al., 2016). This is remarkable be-

cause the previous state of the art used ﬁne-grained

labels in a supervised training setup (Sangkloy et al.,

2016). In contrast, we do not use ﬁne-grained labels

for training, yet obtain improved ﬁne-grained retrieval

thanks to the inclusion of unsupervised autoencoder

features in our representation.

2 RELATED WORK

Image Retrieval. While our main application in

this paper is sketch data, our problem

is related to image retrieval in general. Tradi-

tionally, descriptors for image retrieval are based

on local features, for example by aggregating lo-

cal gradient-based features and by building bag-of-

visual-words (BOV), Fisher kernel (Perronnin et al.,

2010), or VLAD (vector of locally aggregated descriptors)

(Jégou et al., 2010) representations. More recently,

the success of deep convolutional neural networks

for image classification has inspired holistic

image representations based on these techniques tar-

geted at image retrieval. Babenko et al. (Babenko

et al., 2014) were among the ﬁrst to leverage acti-

vations in convolutional neural networks (CNNs) as

features for image retrieval, demonstrating competi-

tive results compared with traditional hand-crafted fe-

atures. Paulin et al. learn patch-based features using

CNNs and aggregate them using VLAD for image re-

trieval (Paulin et al., 2015).

Babenko and Lempitsky (Babenko and Lempitsky, 2015)

make the interesting observation that a global des-

criptor constructed from local CNN features by sum

pooling aggregation, without high-dimensional em-

bedding, outperforms aggregation using more sophi-

sticated techniques such as Fisher vectors and VLAD.

Similarly, Tolias et al. (Tolias et al., 2016) propose to

build a feature based on a regional maximum activa-

tion of convolutions (R-MAC). They show that their

representation is signiﬁcantly more suitable for ﬁne-

grained retrieval tasks, such as particular object re-

trieval, compared to previous work based on CNN

features (Babenko et al., 2014; Babenko and Lem-

pitsky, 2015). They also develop a re-ranking appro-

ach using approximate object localization and query

expansion, and they show that with these additional

steps, their technique also outperforms the previous

state of the art based on hand crafted features (Mi-

kulik et al., 2013) on standard benchmarks (Philbin

et al., 2007; Philbin et al., 2008).
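The difference between sum pooling and maximum activation of convolutions can be illustrated with a minimal sketch in plain Python. It collapses a small convolutional feature map into one global descriptor per pooling mode. The function names and the toy activations are hypothetical, and the sketch deliberately omits the regional pooling of R-MAC as well as any whitening or dimensionality reduction used in the cited work.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def pool_descriptor(feature_map, mode):
    """Aggregate local CNN activations into one global descriptor.

    feature_map: list of spatial locations, each a list of C channel activations.
    mode: "sum" for sum pooling, "max" for maximum activation per channel.
    """
    channels = list(zip(*feature_map))  # one tuple of activations per channel
    if mode == "sum":
        pooled = [sum(ch) for ch in channels]
    elif mode == "max":
        pooled = [max(ch) for ch in channels]
    else:
        raise ValueError("mode must be 'sum' or 'max'")
    return l2_normalize(pooled)

# Toy feature map with 4 spatial locations and 2 channels (made-up values).
toy_map = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0], [3.0, 0.0]]
sum_descriptor = pool_descriptor(toy_map, "sum")  # built from channel sums [6, 4]
max_descriptor = pool_descriptor(toy_map, "max")  # built from channel maxima [3, 4]
```

Sum pooling rewards activations that occur at many locations, whereas max pooling keeps only the strongest response per channel, which is why the two descriptors differ for the same feature map.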

Instead of relying on hand crafted aggregation

strategies, like in R-MAC (Tolias et al., 2016), it

seems attractive to learn parameters of feature ag-

gregation in an end-to-end manner. Arandjelovic et

al. (Arandjelovic et al., 2016) propose a network ar-

chitecture that includes an aggregation layer inspired

by VLAD (Jégou et al., 2010) that can be trained by

backpropagation. They report state of the art results

on place recognition benchmarks. Gordo et al. (Gordo

et al., 2016) build on R-MAC, but include a region

proposal network that is trained in an end-to-end manner,

instead of using a fixed grid of regions. Radenović

et al. (Radenović et al., 2016) use a representation

also based on maximum activation of convolutions

(MAC), but instead of regionally aggregating,

they propose to fine-tune the network using hard positive

and hard negative examples. Both Gordo et al. and

Radenović et al. achieve excellent results, although

Gordo et al. (Gordo et al., 2016) report the highest mean

average precision scores on standard benchmarks for

image retrieval.

Sketch Retrieval. A main difference between these

image retrieval techniques and our approach is that

VISAPP 2018 - International Conference on Computer Vision Theory and Applications

our goal is to construct a feature representation suita-

ble for ﬁne grained retrieval in a partly unsupervised

manner, that is, when ﬁne-grained labels are not avai-

lable. In addition, our main application is sketch re-

trieval, rather than image retrieval. Deep learning has

been used for sketch classiﬁcation (Yu et al., 2016b)

or retrieval (Su et al., 2015), outperforming classical

techniques based on bags-of-visual-words (Eitz et al.,

2012) or Fisher vectors (Schneider and Tuytelaars,

2014) by a large margin. One could leverage fea-

tures extracted from classiﬁcation networks also for

ﬁne-grained retrieval, but we show that our approach

outperforms this strategy.

Recently, Sangkloy et al. (Sangkloy et al., 2016)

constructed a database (called Sketchy) with sketch-

photo pairs that provide ﬁne-grained instance level

labels. They leverage this data to learn a joint em-

bedding for sketches and images using a triplet loss.

Their approach is very similar to concurrent work by

Yu et al. (Yu et al., 2016a), who collected a similar da-

tabase and also learned a joint embedding with CNNs

and a triplet loss. Both Sangkloy et al. and Yu et al. show

state of the art results for instance-level sketch-based

image retrieval and sketch-based sketch retrieval. In

contrast, we are not using instance-level labels for

training, and we focus on sketch retrieval. We use

the ﬁne-grained labels in the Sketchy database only

to evaluate the ﬁne-grained retrieval performance, and

we show that our approach improves ﬁne-grained ske-

tch retrieval, even though we do not use ﬁne-grained

labels for training.

3 NETWORK ARCHITECTURES FOR FINE-GRAINED RETRIEVAL USING AUTOENCODERS

In this section we propose our approach to learn features

that support fine-grained example-based retrieval.

Our assumption is that the training data contains

semantic object category labels, but no ﬁne-grained

labels that encode intra-class variability such as ob-

ject pose, style, or color. To learn a representation that

includes both class-level semantics and ﬁne-grained

properties, we propose three network architectures as

shown in Figure 1, and all three leverage autoencoders

to capture intra-class variability in a partly unsupervi-

sed manner. We are calling these architectures Sia-

mese autoencoders (SAE), classifying autoencoders

(CAE), and separate classiﬁer-autoencoders (SCA).

The motivation behind choosing these three architectures

is as follows: comparing the SAE and CAE allows

us to evaluate the suitability of contrastive versus

classification loss, and the SCA explores separate versus

joint training (as in SAE and CAE) of the classifier

and the autoencoder. We report on our evaluation

in Section 4.

3.1 Siamese Autoencoder (SAE)

Siamese networks with contrastive loss functions

have been widely used to learn representations that

support classiﬁcation, retrieval, and cross-domain

embeddings in a common feature space (see Wang

et al. (Wang et al., 2015) for an example). If the

loss is purely driven by class membership, howe-

ver, the representations are pushed to become inva-

riant to intra-class variability, which is not desirable

for ﬁne-grained retrieval. Therefore, we propose Sia-

mese autoencoders (SAE), which extend the Siamese

architecture with a pair of decoders as shown in Fi-

gure 1(a). Intuitively, this should force the learned re-

presentation to retain intra-class variability, while still

separating different classes. To the best of our know-

ledge this is novel in the context of image represen-

tations, although a similar architecture has been in-

troduced to learn speaker-speciﬁc representations for

speaker recognition (Chen and Salman, 2011).

A Siamese network (Chopra et al., 2005) consists

of a pair of networks with shared weights. Training is

performed by feeding triplets $(x_i, x_j, l_{i,j})$ that contain

a pair of images $(x_i, x_j)$ and a binary label $l_{i,j} \in \{0, 1\}$

that is zero if $x_i$ and $x_j$ have the same class label and

one otherwise. Siamese networks can be trained by

minimizing a contrastive loss,

$$L_{con}(x_i, x_j, l_{i,j}) = (1 - l_{i,j})\, d(z_i, z_j) + l_{i,j} \max(0, m - d(z_i, z_j)), \qquad (1)$$

where d(x, y) is the Euclidean distance between x and

y, z is the network output for image x, and m is a user-

deﬁned margin. The network acts as an encoder E

that produces a latent representation z = E(x) of x. In

this representation, input images from the same class

are pulled together, and images from different classes

are pushed further apart than the margin m.
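The behaviour of the contrastive loss in Equation 1 can be sketched in a few lines of plain Python. The sketch assumes the loss is linear in the Euclidean distance, as the equation is printed here; the function names and the toy latent codes are hypothetical and not taken from the authors' implementation.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def contrastive_loss(z_i, z_j, label, margin):
    """Contrastive loss of Equation 1.

    label is 0 when both inputs share a class label and 1 otherwise.
    """
    d = euclidean(z_i, z_j)
    return (1 - label) * d + label * max(0.0, margin - d)

# Toy latent codes (made-up values); their distance is 5.
z_first = [0.0, 0.0]
z_second = [3.0, 4.0]

same_class = contrastive_loss(z_first, z_second, label=0, margin=2.0)       # penalizes the distance itself
different_far = contrastive_loss(z_first, z_second, label=1, margin=2.0)    # already beyond the margin
different_near = contrastive_loss(z_first, z_second, label=1, margin=8.0)   # inside the margin: m - d
```

Pairs from the same class are penalized by their distance, while pairs from different classes only contribute when they are closer than the margin m.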

Our Siamese autoencoder adds a pair of Siamese

decoders at the end of the network, see Figure 1(a),

to capture ﬁne-grained properties. Each decoder D

tries to reconstruct the original image from the latent

representation z such that D(z) is as similar to the in-

put x as possible, forcing the representation to retain

intra-class variability. The autoencoder can be trained

by minimizing a reconstruction loss, for example

$$L_{rec}(x) = \lVert x - \tilde{x} \rVert, \qquad (2)$$

where $\tilde{x} = D(E(x))$.

Figure 1: We propose and evaluate three architectures for fine-grained retrieval in a partly unsupervised manner (no fine-grained labels) by leveraging autoencoders: a Siamese autoencoder (SAE), classifying autoencoder (CAE), and separate classifier-autoencoder (SCA).

Hence, we train the SAE using the following loss function

consisting of a weighted sum of contrastive and

reconstruction loss,

$$L_{sae}(x_i, x_j, l_{i,j}) = \gamma L_{con}(x_i, x_j, l_{i,j}) + (1 - \gamma)\left(L_{rec}(x_i) + L_{rec}(x_j)\right), \qquad (3)$$

where γ ∈ [0, 1] is used to balance contrastive vs. re-

construction loss.
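As a rough illustration of how the SAE objective combines its two terms, the following plain-Python sketch evaluates the weighted sum for a single pair of inputs. The encoder and decoder are passed in as ordinary functions; the identity and all-zero toy networks, the Euclidean reconstruction norm, and all names are assumptions made for this example only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def contrastive_loss(z_i, z_j, label, margin):
    """Contrastive term: label 0 = same class, label 1 = different class."""
    d = euclidean(z_i, z_j)
    return (1 - label) * d + label * max(0.0, margin - d)

def reconstruction_loss(x, x_rec):
    """Reconstruction error; using the Euclidean norm is an assumption."""
    return euclidean(x, x_rec)

def sae_loss(x_i, x_j, label, encode, decode, gamma, margin):
    """Weighted sum of the contrastive loss and both reconstruction losses."""
    z_i, z_j = encode(x_i), encode(x_j)  # the same encoder is applied to both inputs
    l_con = contrastive_loss(z_i, z_j, label, margin)
    l_rec = reconstruction_loss(x_i, decode(z_i)) + reconstruction_loss(x_j, decode(z_j))
    return gamma * l_con + (1.0 - gamma) * l_rec

# Toy "networks": a perfect autoencoder and a decoder that always outputs zeros.
identity = lambda v: list(v)
zero_decoder = lambda z: [0.0 for _ in z]

x_first, x_second = [0.0, 0.0], [3.0, 4.0]
perfect = sae_loss(x_first, x_second, 0, identity, identity, gamma=0.5, margin=1.0)
lossy = sae_loss(x_first, x_second, 0, identity, zero_decoder, gamma=0.5, margin=1.0)
```

With a perfect autoencoder only the contrastive term remains; with the lossy decoder the reconstruction error of the second input is added with weight (1 − γ).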

3.2 Classifying Autoencoder (CAE)

We observe in practice that training using a contras-

tive loss can be unstable and requires very careful

initialization. To alleviate these issues, Siamese net-

works can be regularized by including a softmax clas-

siﬁcation loss (Bui et al., 2016; Sangkloy et al., 2016).

To determine whether the contrastive loss provides

any advantage for ﬁne-grained retrieval at all, we pro-

pose a classifying autoencoder (CAE) that only uses

a classiﬁcation network instead of a Siamese setup to

learn the labeled class-level semantics. We again in-

clude a decoder D(z) and reconstruction loss to cap-

ture intra-class variability, as shown in Figure 1(b).

We achieve this by adding a single fully-connected

layer f with linear activation function to the output

of the encoder z. This maps the latent representation

z to a probability distribution over the object catego-

ries f (z). Note that we feed z as input to the decoder

and not f (z). We train the CAE by minimizing the

following loss function,

$$L_{cae}(x, c) = \gamma L_{class}(x, c) + (1 - \gamma) L_{rec}(x), \qquad (4)$$

where c is the category label of image x, $L_{class}$ is the

traditional classiﬁcation loss, i.e. cross entropy with

softmax, and γ is used to balance classiﬁcation vs. re-

construction loss. Note that one can easily combine

SAE and CAE, and we evaluate this option as well.
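The CAE objective in Equation 4 can be sketched for a single training example as follows. The sketch treats the output of the linear layer f(z) as logits that enter a softmax cross entropy, and it uses a Euclidean reconstruction error; both the norm and all function names are assumptions for illustration rather than the authors' code.

```python
import math

def softmax_cross_entropy(logits, target):
    """Cross entropy between softmax(logits) and a one-hot target index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def cae_loss(logits, target, x, x_rec, gamma):
    """Weighted sum of classification and reconstruction loss (Equation 4)."""
    l_class = softmax_cross_entropy(logits, target)
    l_rec = math.sqrt(sum((p - q) ** 2 for p, q in zip(x, x_rec)))
    return gamma * l_class + (1.0 - gamma) * l_rec

# Two classes with equal logits and a perfect reconstruction (made-up values).
example = cae_loss([0.0, 0.0], target=0, x=[1.0, 2.0], x_rec=[1.0, 2.0], gamma=0.5)
```

Because both terms are computed from the same latent code z, a change in γ shifts the whole representation between class separation and faithful reconstruction.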

3.3 Separate Classifier-Autoencoder (SCA)

A disadvantage of the two previous architectures is

that they lead to detrimental competition between the

classiﬁcation and reconstruction losses, for example

when two images belong to different classes, but exhi-

bit similar ﬁne-grained properties. In theory, with

enough training data this issue should resolve itself.

But in practice, we found that we can train the net-

works to be good at classiﬁcation or ﬁne-grained re-

trieval, but not both. Moreover, the weighting para-

meter γ depends on the training data and has to be

ﬁxed at training time, and ﬁnding γ such that the lear-

ned representation provides a desired trade-off is te-

dious.

To mitigate these issues, we propose a representa-

tion based on a separate classiﬁer-autoencoder (SCA),

which consists of two separately trained networks.

Hence, the two loss functions do not directly compete.

The first network is an autoencoder that learns

a representation $z_a$ to encode intra-class variability.

The second network is trained in a supervised fashion

using a classification loss to learn a representation $z_c$

that captures the semantics in the data. We concatenate

these two vectors to form the final representation

$z = (z_a, z_c)$. See Figure 1(c) for a visualization of the

proposed architecture.

The SCA leads to a representation that better di-

sentangles class information from ﬁne-grained pro-

perties compared to SAE and CAE. Since the second

part of our representation encodes only semantic in-

formation, we can design a similarity metric that ena-

bles the user to choose the tradeoff between seman-

tic and ﬁne-grained information during retrieval time.

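Using the cosine similarity as similarity measure, we propose

a weighted sum of normalized dot products for retrieval,

$$\mathrm{sim}(z, z') = \alpha \frac{z_a \cdot z'_a}{\lVert z_a \rVert \lVert z'_a \rVert} + (1 - \alpha) \frac{z_c \cdot z'_c}{\lVert z_c \rVert \lVert z'_c \rVert}, \qquad (5)$$

where $z = (z_a, z_c)$ and $z' = (z'_a, z'_c)$ are the representations

of the two items being compared, split into their autoencoder

and classifier parts, and α is a user-defined parameter for the tradeoff

between semantics and intra-class variability. One

drawback of SCA is that the learned representation

is prone to redundancy, since the autoencoder will encode

some semantic information in $z_a$. Our experiments,

however, show that in practice the benefits of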

separating the two loss functions outweigh this issue.
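Retrieval with the weighted similarity of Equation 5 can be sketched in plain Python as below. Each item is represented as a pair (autoencoder part, classifier part). Which of the two parts α multiplies is an assumption in this sketch (here α weights the autoencoder part), and the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(p * q for p, q in zip(u, v))
    return dot / (math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v)))

def sca_similarity(z, z_other, alpha):
    """Weighted sum of per-part cosine similarities.

    z and z_other are (autoencoder_part, classifier_part) pairs.
    Assumption: alpha weights the autoencoder part, 1 - alpha the classifier part.
    """
    (a_1, c_1), (a_2, c_2) = z, z_other
    return alpha * cosine(a_1, a_2) + (1.0 - alpha) * cosine(c_1, c_2)

def rank(query, database, alpha):
    """Indices of database items ordered from most to least similar."""
    scores = [sca_similarity(query, item, alpha) for item in database]
    return sorted(range(len(database)), key=lambda k: scores[k], reverse=True)

# Toy database (made-up vectors): the query matches item 0 only in the
# autoencoder part, item 1 only in the classifier part, and item 2 in both.
query = ([1.0, 0.0], [1.0, 0.0])
database = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 0.0]),
]
semantic_first = rank(query, database, alpha=0.3)
variability_first = rank(query, database, alpha=0.9)
```

Because α only enters at query time, the same stored features can be re-ranked with a different tradeoff without retraining either network.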


4 EXPERIMENTS

In this section, we present extensive evaluations of

our method on different datasets. First, we quanti-

tatively evaluate our approach on sketch images from

the Sketchy database and compare to previous state of

the art on this dataset. Second, we show a quantita-

tive comparison of the proposed method to two base-

lines on images rendered from ShapeNet. Third, we

show qualitative results on MNIST and SVHN. We

use Tensorﬂow (Abadi et al., 2016) with the Adam

optimizer (Kingma and Ba, 2014) in all our experi-

ments.

4.1 Sketch-based Sketch Retrieval

To provide a quantitative evaluation of our propo-

sed architectures, we consider the problem of sketch-

based sketch retrieval and train using sketch images

from the Sketchy database (Sangkloy et al., 2016).

This dataset is attractive for our evaluation because

it includes both class-level and ﬁne-grained annotati-

ons. The data consists of sketch-photo pairs of 125

categories, collected using crowd sourcing: for each

photo, participants were asked to sketch the object

with a pose similar to that of the object in the photo.

The database contains 12,500 photos and 75,481 ske-

tches with a resolution of 256 × 256, and each photo

has at least 5 associated sketches that show the same

object with similar pose. This information can be le-

veraged to design a benchmark for ﬁne-grained ske-

tch retrieval: we consider a retrieved sketch relevant

if it stems from the same photo as the query sketch,

which implies that the result is of the same category

and has the same pose as the query. We use the test set

proposed by Sangkloy et al. (Sangkloy et al., 2016) for

evaluation (7,063 sketches) and the remaining 68,418

sketches for training.
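The fine-grained relevance criterion and the resulting mean average precision can be sketched as follows. The sketch uses one common definition of average precision (mean of the precision values at the ranks of relevant results) and assumes that the ranked list contains all relevant items; the exact variant used for the reported numbers is not specified in the text, and the photo identifiers are hypothetical.

```python
def fine_grained_relevance(query_photo_id, ranked_photo_ids):
    """A retrieved sketch is relevant if it stems from the same photo as the query."""
    return [photo_id == query_photo_id for photo_id in ranked_photo_ids]

def average_precision(ranked_relevance):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_rankings):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)

# Two toy queries with hypothetical photo identifiers.
ranking_a = fine_grained_relevance("photo_7", ["photo_7", "photo_3", "photo_7"])
ranking_b = fine_grained_relevance("photo_2", ["photo_9", "photo_2"])
toy_map = mean_average_precision([ranking_a, ranking_b])
```

Replacing the photo identifiers by category labels in `fine_grained_relevance` yields a category-level relevance criterion with the same scoring code.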

Inspired by Sangkloy et al. (Sangkloy et al., 2016),

we use GoogLeNet (Szegedy et al., 2015) as enco-

der, initialized with weights pre-trained on Image-

Net (Russakovsky et al., 2015). We use the activa-

tions of the last pooling layer, a 1024-dimensional

vector, as latent representation z. The decoder con-

sists of 11 layers: a fully-connected layer to map z to

a 980-dimensional vector that is reshaped to form a

7 ×7 × 20 tensor, followed by 10 layers of transposed

convolutions, sometimes called deconvolutions, with

3 ×3 kernels. We use ReLU activations for all but the

output layer, and the hyperbolic tangent activation on

the output layer. The input to the network is of size

224 × 224 and we randomly crop and ﬂip input sket-

ches for data augmentation.
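The decoder dimensions quoted above can be checked with a few lines of Python. The sketch confirms that a 7 × 7 × 20 tensor has the 980 entries produced by the fully-connected layer, and it shows one hypothetical stride pattern (five stride-2 layers interleaved with five stride-1 layers, with padding chosen so that each layer scales the spatial size by exactly its stride) under which ten transposed convolutions would bring 7 × 7 up to the 224 × 224 input size. The stride pattern and the function name are assumptions; the text only states the number of layers and the kernel size.

```python
def upsampled_size(start, strides):
    """Spatial size after a chain of transposed convolutions.

    Assumes padding is chosen so that each layer multiplies the
    spatial size by exactly its stride.
    """
    size = start
    for stride in strides:
        size *= stride
    return size

fc_units = 7 * 7 * 20            # entries of the reshaped 7 x 7 x 20 tensor
hypothetical_strides = [2, 1] * 5  # ten layers, five of which upsample by 2
output_size = upsampled_size(7, hypothetical_strides)
```

Any other arrangement of five stride-2 layers among the ten would reach the same output size, since 224 = 7 · 2^5.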

In our experiment, we observed that training a

SAE on this type of data using contrastive loss as

deﬁned in Equation 1 leads to unstable training and

the network often diverges. As proposed by Bui et

al. (Bui et al., 2016), we obtained more stable trai-

ning by adding a classiﬁcation term to our loss, which

results in a combination of our SAE and CAE archi-

tectures. The classiﬁcation term is weighted 20 times

lower than the contrastive term, which is enough to

achieve stable training. We report results for γ = 0.05

and γ = 0.0005. Since the encoder is pre-trained on

ImageNet and the decoder is trained from scratch, we

start training with γ = 0.0001 and continuously incre-

ase γ during training to the ﬁnal value.
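The gradual increase of γ described above could look like the following sketch. The text does not state the shape of the schedule, so the linear ramp, the function name, and the total number of steps are all assumptions for illustration.

```python
def gamma_schedule(step, total_steps, gamma_start, gamma_end):
    """Linearly interpolate gamma from gamma_start to gamma_end.

    The linear shape is an assumption; steps outside [0, total_steps]
    are clamped to the start or end value.
    """
    t = min(max(step / total_steps, 0.0), 1.0)
    return gamma_start + t * (gamma_end - gamma_start)

# Ramp from the initial value 0.0001 to a final value of 0.0005
# over a hypothetical 10,000 training steps.
gamma_at_start = gamma_schedule(0, 10000, 0.0001, 0.0005)
gamma_halfway = gamma_schedule(5000, 10000, 0.0001, 0.0005)
gamma_at_end = gamma_schedule(20000, 10000, 0.0001, 0.0005)
```

Starting with a small γ lets the randomly initialized decoder catch up with the pre-trained encoder before the contrastive term dominates.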

In addition to SAE, we also train a CAE by mi-

nimizing the loss as deﬁned in Equation 4. We start

training with γ = 0.01 and increase γ during training

to γ = 0.5. Note the different magnitude of γ compa-

red to SAE: classiﬁcation loss and reconstruction loss

have comparable magnitudes on our training data,

whereas the contrastive loss is about two orders of

magnitude higher than the other two.

Our third approach is to train a SCA as introduced

in Section 3.3. We use the same architecture as be-

fore but add an additional fully-connected layer with

ReLU at the end of both encoders (in the separate

classiﬁcation and autoencoder branches) to reduce the

dimensionality to 512. After training, we concatenate

the two feature vectors to form a 1024-dimensional

embedding (the same as used in our SAE and CAE

architecture), and we perform retrieval using the similarity

metric in Equation 5 with α = 0.3.

We compare the performance of our SAE, CAE,

and SCA features to four baselines: (1) GoogLeNet

trained solely on classiﬁcation, (2) a fully unsuper-

vised autoencoder, (3) the sketch branch from the

Sketchy network (Sangkloy et al., 2016), and (4) R-

MAC (Tolias et al., 2016). Note that (3) was trained

by leveraging the ﬁne-grained sketch-photo relations

in a supervised manner via a triplet loss. In contrast,

our SAE, CAE, and SCA architectures only use ob-

ject categories and learn the ﬁne-grained similarities

in an unsupervised manner using the autoencoder. For

R-MAC, we use the activations from the last convolu-

tional layer of (1), which is a 7 × 7 × 1024 tensor, to

construct the R-MAC features. In addition to the ﬁne-

grained retrieval benchmark described above, we use

the sketches from the Sketchy database to deﬁne a se-

cond, semantic only retrieval benchmark. In this ben-

chmark, we consider retrieval results relevant if they

are of the same category as the query, regardless of the

pose. For all methods, we report mean average precision

(mAP) on both benchmarks in Table 1.

Table 1: mAP for both fine-grained and semantic only retrieval benchmark on the Sketchy test set.

method                                   fine-grained   semantic only
Autoencoder                              0.2370         0.0405
Classification Network                   0.2334         0.6290
Sketchy Network (Sangkloy et al., 2016)  0.2867         0.5125
R-MAC (Tolias et al., 2016)              0.2809         0.4171
SAE γ = 0.01 (ours)                      0.1313         0.3843
SAE γ = 0.0005 (ours)                    0.2837         0.2396
CAE γ = 0.5 (ours)                       0.4654         0.2652
SCA α = 0.3 (ours)                       0.4946         0.5303

On the fine-grained benchmark, retrieval with R-MAC features

(which were designed to facilitate fine-grained retrieval)

increases mAP by almost 5 percentage points compared to the

classification baseline that serves as input to construct

the R-MAC features. This is roughly on par with

Sketchy (Sangkloy et al., 2016). Our SCA feature

performs best on the ﬁne-grained benchmark, closely

followed by CAE. SAE performs worse than CAE,

which is surprising, since using a Siamese architec-

ture and a triplet loss has been proposed for image re-

trieval (Wang et al., 2014). Yet in our experiment, we

observe that the CAE architecture, which combines

classiﬁcation and an autoencoder, is more effective

for ﬁne-grained retrieval. As a key contribution of our

work, SCA obtains an mAP score more than 20 percentage points

higher than the previous state of the art (Sangkloy et al.,

2016), and CAE is still more than 17 points better, even though we

do not use the ﬁne-grained labels for training our mo-

dels.

On the semantic only benchmark, the classiﬁca-

tion network (unsurprisingly) performs best, followed

by SCA, which still beats Sketchy. Note that in con-

trast to SCA, semantic retrieval performance decrea-

ses drastically for both SAE and CAE, compared to

the classiﬁcation baseline. We believe this is because

(1) the learned embedding is too entangled, and (2)

ﬁnding an optimal γ is not feasible, since it has to

be ﬁxed at training time. Remarkably, R-MAC also

decreases the semantic only retrieval performance by

more than 20 percentage points, compared to the classification

baseline. It seems that SAE, CAE, and R-MAC introduce

a strong tradeoff between ﬁne-grained and semantic

only retrieval, which is undesirable.

Figure 2 plots the mAP for SCA for different va-

lues of α on both benchmarks. Although there is still

a tradeoff between ﬁne-grained and semantic retrie-

val, we can obtain good retrieval performance on both

benchmarks for a wide range of α values. Intuitively,

this is because SCA better enforces the separation of

class-level and fine-grained information in the features.

Moreover, SCA enables the user to choose the

tradeoff by setting the parameter α at retrieval time.

Figure 2: SCA mAP versus α for both fine-grained and semantic only retrieval on the Sketchy test set.

Figure 3: Precision-recall curves on the Sketchy fine-grained and semantic only retrieval benchmarks, averaged over all queries.

Figure 3 shows precision-recall curves on both ben-

chmarks for SCA, our best performing method, in

comparison to R-MAC and Sketchy. Even though

SCA and Sketchy perform very similarly on the semantic

only benchmark in terms of mAP, the difference

is significant at low recall, which is a useful property

in practice. Note that R-MAC performs poorly

on the semantic only benchmark, which is surprising

because the input features for R-MAC perform

much better. Finally, we show some qualitative retrieval

results in Figure 4. Note that our method retrieves

sketches at the top that match both object category

and object pose.

4.2 ShapeNet

In this experiment, we evaluate ﬁne-grained retrieval

of rendered 3D objects according to viewpoint. Gi-

ven a rendered object, the goal is to retrieve images

of other objects of the same class, seen from the same

viewpoint. We obtained an image dataset by rende-

ring objects from 11 ShapeNet categories: airplane,

bed, bench, bus, car, chair, guitar, piano, table, train,

and boat. For each object, we render diffuse RGB

images of resolution 256x256 from 32 discrete view-

points, 8 azimuth and 4 elevation angles. This results

in a dataset of 902,336 images and we split the ob-

jects to form a training set of 812,103 images and a

test set of 90,233 images. We train a SAE, CAE, and

SCA as described in Section 3 using the same network

architectures as in the Sketchy experiments, and

the fine-grained viewpoint labels are not used for training.

An autoencoder and a classification network are

trained as baselines for comparison. All encoders follow

a GoogLeNet architecture and are initialized with

weights pre-trained on ImageNet. Unfortunately, we

are not able to report R-MAC performance in this

experiment. Computing R-MAC on the Sketchy dataset

took more than one week using the publicly available

implementation, and since our ShapeNet dataset

is 10 times larger, applying R-MAC to this dataset is

not feasible.

Figure 4 (panels: R-MAC, Sketchy, Ours (SCA)): Qualitative retrieval results on the Sketchy test set. The query images are on the top row and we show the top-10 retrieval results beneath. Retrieval results that do not match the query class label are marked with a red cross. Blue circles mark retrieval results that match the query class label but do not stem from the same photo, thus they potentially do not match the query pose. Results that stem from the same photo as the query sketch are marked with a green tick. These are the results that we consider relevant for fine-grained retrieval, since they are guaranteed to match both query category label and pose. We use α = 0.3 for SCA.

Table 2: mAP for both fine-grained and semantic retrieval benchmark on the ShapeNet test set.

method                   fine-grained   semantic only
Autoencoder              0.1738         0.2138
Classification Network   0.1890         0.6827
SAE γ = 0.0005 (ours)    0.2910         0.6578
SAE γ = 0.01 (ours)      0.2854         0.6686
CAE γ = 0.5 (ours)       0.2120         0.6827
CAE γ = 0.3 (ours)       0.2545         0.6746
CAE γ = 0.1 (ours)       0.3081         0.6552
SCA α = 0.1 (ours)       0.3310         0.6443

Our test set serves as a ﬁne-grained retrieval ben-

chmark, where we consider results as relevant only if

they match both object category and viewpoint with

the query image. In addition, we also evaluate on a

semantic only retrieval benchmark by considering re-

sults relevant if they have the same object category as

the query, regardless of the pose. Table 2 shows mAP

for all networks on both benchmarks. The classification

network and the autoencoder perform very similarly

on fine-grained retrieval. SAE, CAE, and SCA

all increase fine-grained retrieval performance significantly,

and SCA outperforms both SAE and CAE.

Note that even though the classification network performs

best on semantic only retrieval, SAE, CAE, and

SCA performance is only slightly lower. We

show qualitative results in Figure 5. Note that SCA

retrieves objects that match both query category and

viewpoint, whereas the classiﬁcation network is more

invariant to pose and the autoencoder often retrieves

the wrong categories. Our ShapeNet benchmark con-

tains objects that are almost rotationally symmetric,

such as buses or guitars. In these cases, SCA often

ﬁnds images of objects that are 180° rotated, which

is reasonable for nearly symmetric objects. Howe-

ver, our benchmark considers these results as irrele-

vant (that is, wrong pose), which may explain why

the mAP for ﬁne-grained retrieval (Table 2) is lower

than in the Sketchy benchmark (Table 1).
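The two relevance criteria described above can be illustrated with a small sketch. It is not the evaluation code used for Table 2; it assumes that each image carries a hypothetical `(category, viewpoint)` label pair, that the ranked list covers the whole database, and that average precision is normalized by the number of relevant items in that list. The function names are illustrative only.

```python
def average_precision(relevant_flags):
    """Average precision of one ranked list of booleans (True = relevant)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries, rankings, fine_grained):
    """Mean AP over all queries.

    queries:  list of (category, viewpoint) labels, one per query.
    rankings: list of ranked result labels, one list per query.
    fine_grained: if True, a result must match category AND viewpoint;
                  if False, matching the category is enough (semantic-only).
    """
    ap_values = []
    for (q_cat, q_view), ranked in zip(queries, rankings):
        flags = [
            cat == q_cat and (not fine_grained or view == q_view)
            for cat, view in ranked
        ]
        ap_values.append(average_precision(flags))
    return sum(ap_values) / len(ap_values)


# Toy example with made-up labels (viewpoint given in degrees).
queries = [("chair", 0)]
rankings = [[("chair", 0), ("chair", 90), ("bus", 0), ("chair", 0)]]

print(mean_average_precision(queries, rankings, fine_grained=True))
print(mean_average_precision(queries, rankings, fine_grained=False))
```

In this toy ranking the second result has the right category but the wrong viewpoint, so it counts as relevant only under the semantic-only criterion. This is why the same ranking can score differently on the two benchmarks.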

4.3 MNIST

For a qualitative evaluation of our method, we train an SCA on hand-written digits from
the MNIST (LeCun and Cortes, 2010) dataset. Here we demonstrate that SCA enables
retrieval of digits with similar handwriting style, where sensitivity to style is learned
in an unsupervised manner.

Figure 5: Qualitative retrieval results on the ShapeNet test set. Panels: Classification
Network, Autoencoder, Ours (SCA). The query images are on the top row and we show the
top-10 retrieval results beneath. Retrieval results that do not match the query class
label are marked with a red cross. Blue circles indicate retrieval results that match the
query category but do not match the query pose. Results that we consider relevant for
fine-grained retrieval are marked with a green tick. We use α = 0.1 for SCA.

The official training set of 60,000 images is used for training and we evaluate using the
official test set of 10,000 examples. The two encoders E and E take 28 × 28 grayscale
images as input and consist of four convolutional layers with 3 × 3 kernels, followed by
ReLU activations. The outputs of the last convolutional layers are 4 × 4 × 256 tensors
that are mapped to 16-dimensional latent representations z and z using a single
fully-connected layer with ReLU activation. Finally, we add an additional fully-connected
layer with linear activation f behind E. We employ batch normalization (Ioffe and
Szegedy, 2015) for all convolutional layers but not for the fully-connected layers. The
decoder is the reverse of the encoders: we start with a fully-connected layer that
increases the dimensionality of the latent representation to 4096 and reshape the result
to form a 4 × 4 × 256 tensor. This tensor is mapped back to the input image space using
four layers of transposed convolutions. We employ batch normalization and ReLU
activations for all but the output layer. The output layer uses hyperbolic tangent
activation.
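As a sanity check on the dimensions stated above, the following sketch traces the spatial size of a 28 × 28 input through four convolutional layers. The text does not specify strides or padding, so the strides `(2, 2, 2, 1)` and "same" padding used here are assumptions chosen only because they reproduce the stated 4 × 4 × 256 output; the actual configuration may differ.

```python
import math

# Values taken from the text.
CHANNELS_OUT = 256  # depth of the last convolutional feature map
LATENT_DIM = 16     # size of each latent representation

# Hypothetical strides: the text gives four 3x3 layers but not their strides.
ASSUMED_STRIDES = (2, 2, 2, 1)


def same_conv_output_size(size, stride):
    """Spatial output size of a convolution with 'same' padding."""
    return math.ceil(size / stride)


def encoder_feature_shape(input_size, strides=ASSUMED_STRIDES):
    """Shape (height, width, channels) after the last convolutional layer."""
    size = input_size
    for stride in strides:
        size = same_conv_output_size(size, stride)
    return (size, size, CHANNELS_OUT)


shape = encoder_feature_shape(28)            # 28 -> 14 -> 7 -> 4 -> 4
flattened = shape[0] * shape[1] * shape[2]   # input size of the fully-connected layer
fc_weights = flattened * LATENT_DIM          # weights of the 4096 -> 16 layer (no biases)

print(shape, flattened, fc_weights)
```

Under these assumptions the flattened feature map has 4 · 4 · 256 = 4096 entries, which matches the 4096-dimensional vector that the decoder reshapes into a 4 × 4 × 256 tensor.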

To compare the effect of the proposed SCA on the learned embedding, we trained two
baseline networks: a classification network and a fully unsupervised autoencoder. Both
baselines use the same architecture as our SCA, with the exception of the latent
representation being 32-dimensional.

Figure 6 shows t-SNE (Maaten and Hinton, 2008) embeddings for digits “one” and “seven”
from the MNIST test set, where we plot the digits at the position of their t-SNE
coordinates. The visualization shows that digits are embedded depending on writing style
when training the SCA. For example, straight digits “one” are mapped to the upper right
region of the cluster and more italic styles are mapped to the lower left region. This
also holds for digits “seven”: for example, all digits featuring a cross are embedded
close together (lower left inset). The autoencoder also embeds digits according to
writing style, but it does not separate the different classes as well as SCA and the
classification network. In fact, it maps italic digits and straight digits to two
completely distinct clusters. In contrast, the embedding learned by the classification
network is completely invariant to writing style, as shown in the insets.

Figure 7 shows some retrieval examples on the MNIST test set for all three networks.
Training a classification network leads to retrieval results with the same class label as
the query, but arbitrary writing style. In contrast, training an autoencoder yields
retrieval results with similar writing styles but often wrong class labels. Note that the
proposed SCA learns an embedding where neighboring samples are similar in both class
label and writing style, which enables fine-grained retrieval without any supervision on
writing styles.
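Retrieval by neighborhood in the learned embedding can be sketched as a nearest-neighbor search. The snippet below is only an illustration: the Euclidean distance, the helper names, and the two-dimensional toy embeddings (first coordinate standing in for class, second for style) are assumptions and not taken from the experiments.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_top_k(query_embedding, database_embeddings, k=10):
    """Return the indices of the k database items closest to the query."""
    order = sorted(
        range(len(database_embeddings)),
        key=lambda i: euclidean(query_embedding, database_embeddings[i]),
    )
    return order[:k]


# Toy embeddings: items 0, 1, 3 share the query's "class" coordinate,
# and items 0 and 3 are also close in the "style" coordinate.
database = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.1), (0.1, 0.1)]
query = (0.0, 0.2)

print(retrieve_top_k(query, database, k=2))  # → [3, 0]
```

With an embedding in which nearby points agree in both class and style, the top-ranked neighbors are the same-class, same-style items, which is the behavior described for the SCA.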

4.4 SVHN

Similar to the previous experiment, we also train an SCA on the SVHN (Netzer et al.,
2011) dataset. Training is performed without any data augmentation using the official
training set consisting of 73,257 examples, and we evaluate our networks using the test
set of 26,032 images. The SCA takes 32 × 32 × 3 RGB images as input and follows the exact
same architecture as for the MNIST experiments presented above.

Again, we train two baseline networks for comparison: a classification network and an
autoencoder. Figure 8 shows retrieval examples on the SVHN test set for all three
networks. We can observe the same behavior as for MNIST: the embedding learned by the SCA
enables retrieval of samples that are similar in both content and style, whereas the
embedding learned by the classification network is invariant to style and the autoencoder
does not learn much semantics. Note that for this dataset the learned style is mainly
determined by the color of the digits, the background colors and textures, and adjacent
distractor digits.

VISAPP 2018 - International Conference on Computer Vision Theory and Applications

Figure 6: Learned t-SNE embeddings for digits “one” (black) and “seven” (red) from the
MNIST test set. We plot the digits at their t-SNE coordinates.

Figure 7: Qualitative retrieval results on the MNIST test set. Panels: Classification
Network, Autoencoder, Ours (SCA). The query images are on the top row and we show the
top-10 retrieval results beneath. We use α = 0.2 for SCA.

Figure 8: Qualitative retrieval results on the SVHN test set. Panels: Classification
Network, Autoencoder, Ours (SCA). The query images are on the top row and we show the
top-10 retrieval results beneath. We use α = 0.4 for SCA.

5 CONCLUSIONS

In this paper we have developed features that support fine-grained retrieval in a partly
unsupervised manner, without requiring fine-grained labels. We proposed three different
architectures leveraging autoencoders for this purpose: a Siamese autoencoder (SAE), a
classifying autoencoder (CAE), and a separate classifier-autoencoder (SCA). We found
that, despite its simplicity, the SCA architecture performs best in practice. The SCA
avoids using a contrastive loss, which can be unstable to train. In addition, it retains
fine-grained information by including an autoencoder. Finally, it better separates
semantic class-level information from fine-grained properties compared to the CAE
approach. This avoids detrimental competition between the classification and
reconstruction losses during training. We quantitatively evaluated our approach and
showed that it leads to a significant improvement over the state of the art in a
fine-grained sketch retrieval benchmark. In addition, it reliably retrieves correct
object poses in a benchmark with images rendered from ShapeNet. We further demonstrated
fine-grained retrieval of hand-written digits and images of house numbers based on style,
without requiring style annotations.

A disadvantage of our approach is that various fine-grained properties (color, style,
viewpoint, etc.) of complex data will be entangled in the autoencoding features, and the
user cannot control which of these properties should be considered (ir)relevant for
retrieval. In the future, we would like to investigate interactive techniques that allow
the user to intuitively control the retrieval criteria on the fly and easily obtain
desired results.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., et al. (2016). TensorFlow: Large-scale machine
learning on heterogeneous distributed systems. arXiv
preprint arXiv:1603.04467.

Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., and Sivic,

J. (2016). Netvlad: Cnn architecture for weakly su-

pervised place recognition. In 2016 IEEE Conference

on Computer Vision and Pattern Recognition (CVPR),

pages 5297–5307.

Babenko, A. and Lempitsky, V. (2015). Aggregating local

deep features for image retrieval. In The IEEE Inter-

national Conference on Computer Vision (ICCV).

Babenko, A., Slesarev, A., Chigorin, A., and Lempitsky, V.

(2014). Neural Codes for Image Retrieval, pages 584–

599. Springer International Publishing, Cham.

Bui, T., Ribeiro, L., Ponti, M., and Collomosse, J.

(2016). Generalisation and sharing in triplet con-

vnets for sketch based visual search. arXiv preprint

arXiv:1611.05301.

Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P.,

Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S.,

Su, H., Xiao, J., Yi, L., and Yu, F. (2015). ShapeNet:

An Information-Rich 3D Model Repository. Techni-

cal Report arXiv:1512.03012 [cs.GR], Stanford Uni-

versity — Princeton University — Toyota Technolo-

gical Institute at Chicago.

Chen, K. and Salman, A. (2011). Extracting speaker-

speciﬁc information with a regularized siamese deep

network. In Shawe-Taylor, J., Zemel, R. S., Bartlett,

P. L., Pereira, F., and Weinberger, K. Q., editors, Ad-

vances in Neural Information Processing Systems 24,

pages 298–306. Curran Associates, Inc.

Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learning a

similarity metric discriminatively, with application to

face veriﬁcation. In Computer Vision and Pattern Re-

cognition, 2005. CVPR 2005. IEEE Computer Society

Conference on, volume 1, pages 539–546. IEEE.

Eitz, M., Hays, J., and Alexa, M. (2012). How do humans

sketch objects? ACM Trans. Graph., 31(4):44:1–

44:10.

Gordo, A., Almazán, J., Revaud, J., and Larlus, D. (2016).
Deep image retrieval: Learning global representations
for image search. In ECCV.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep re-

sidual learning for image recognition. In 2016 IEEE

Conference on Computer Vision and Pattern Recogni-

tion (CVPR), pages 770–778.

Ioffe, S. and Szegedy, C. (2015). Batch normalization:

Accelerating deep network training by reducing inter-

nal covariate shift. CoRR, abs/1502.03167.

Jégou, H., Douze, M., Schmid, C., and Pérez, P. (2010).
Aggregating local descriptors into a compact image
representation. In 2010 IEEE Computer Society Conference
on Computer Vision and Pattern Recognition,
pages 3304–3311.

Kingma, D. P. and Ba, J. (2014). Adam: A method for

stochastic optimization. CoRR, abs/1412.6980.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).

Gradient-based learning applied to document recogni-

tion. Proceedings of the IEEE, 86(11):2278–2324.

LeCun, Y. and Cortes, C. (2010). MNIST handwritten digit

database.

Maaten, L. v. d. and Hinton, G. (2008). Visualizing data

using t-sne. Journal of Machine Learning Research,

9(Nov):2579–2605.

Mikulik, A., Perdoch, M., Chum, O., and Matas, J. (2013).

Learning vocabularies over a ﬁne quantization. In-

ternational Journal of Computer Vision, 103(1):163–

175.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and

Ng, A. Y. (2011). Reading digits in natural images

with unsupervised feature learning. In NIPS workshop

on deep learning and unsupervised feature learning,

volume 2011, page 5.

Paulin, M., Douze, M., Harchaoui, Z., Mairal, J., Perronnin,
F., and Schmid, C. (2015). Local convolutional features
with unsupervised training for image retrieval.
In The IEEE International Conference on Computer
Vision (ICCV).

Perronnin, F., Sánchez, J., and Mensink, T. (2010).
Improving the Fisher Kernel for Large-Scale Image
Classification, pages 143–156. Springer Berlin Heidelberg,
Berlin, Heidelberg.

Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman, A.

(2007). Object retrieval with large vocabularies and

fast spatial matching. In 2007 IEEE Conference on

Computer Vision and Pattern Recognition, pages 1–8.

Philbin, J., Chum, O., Isard, M., Sivic, J., and Zisserman,

A. (2008). Lost in quantization: Improving particu-

lar object retrieval in large scale image databases. In

2008 IEEE Conference on Computer Vision and Pat-

tern Recognition, pages 1–8.

Radenović, F., Tolias, G., and Chum, O. (2016). CNN
Image Retrieval Learns from BoW: Unsupervised
Fine-Tuning with Hard Examples, pages 3–20. Springer
International Publishing, Cham.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh,

S., Ma, S., Huang, Z., Karpathy, A., Khosla, A.,

Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015).

ImageNet Large Scale Visual Recognition Challenge.

International Journal of Computer Vision (IJCV),

115(3):211–252.

Sangkloy, P., Burnell, N., Ham, C., and Hays, J. (2016). The

sketchy database: Learning to retrieve badly drawn

bunnies. ACM Transactions on Graphics (proceedings

of SIGGRAPH).

Schneider, R. G. and Tuytelaars, T. (2014). Sketch classi-

ﬁcation and classiﬁcation-driven analysis using ﬁsher

vectors. ACM Trans. Graph., 33(6):174:1–174:9.

Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E.

(2015). Multi-view convolutional neural networks for

3d shape recognition. In Proceedings of the 2015

IEEE International Conference on Computer Vision

(ICCV), ICCV ’15, pages 945–953, Washington, DC,

USA. IEEE Computer Society.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Angue-

lov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A.

(2015). Going deeper with convolutions. In Procee-

dings of the IEEE Conference on Computer Vision and

Pattern Recognition, pages 1–9.

Tolias, G., Sicre, R., and Jégou, H. (2016). Particular object
retrieval with integral max-pooling of CNN activations.
In International Conference on Learning Representations.

Wang, F., Kang, L., and Li, Y. (2015). Sketch-based 3d

shape retrieval using convolutional neural networks.

In 2015 IEEE Conference on Computer Vision and

Pattern Recognition (CVPR), pages 1875–1883.

Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J.,

Philbin, J., Chen, B., and Wu, Y. (2014). Learning

ﬁne-grained image similarity with deep ranking. In

Proceedings of the IEEE Conference on Computer Vi-

sion and Pattern Recognition, pages 1386–1393.

Yu, Q., Liu, F., Song, Y. Z., Xiang, T., Hospedales, T. M.,

and Loy, C. C. (2016a). Sketch me that shoe. In 2016

IEEE Conference on Computer Vision and Pattern Re-

cognition (CVPR), pages 799–807.

Yu, Q., Yang, Y., Liu, F., Song, Y.-Z., Xiang, T., and Hospe-

dales, T. M. (2016b). Sketch-a-net: A deep neural

network that beats humans. International Journal of

Computer Vision, pages 1–15.
