Simple Domain Adaptation for CAD based Object Recognition
Kripasindhu Sarkar and Didier Stricker
DFKI Kaiserslautern, Germany
Technische Universität Kaiserslautern, Germany
Keywords:
3D Object Recognition, Domain Adaptation, CNN.
Abstract:
We present a simple method of domain adaptation between synthetic and real images, based on high quality rendering of the 3D models and correlation alignment. Using this method, we solve the problem of recognizing 3D objects in 2D images by fine-tuning existing pretrained CNN models for the object categories using the rendered images. Experimentally, we show that our rendering pipeline, together with correlation alignment, improves the recognition accuracy of an existing CNN based recognition system trained on images from a canonical renderer by a large margin. Using the same idea, we present a general image classifier of common objects that is trained only on 3D models from publicly available databases, and show that a small number of training models is sufficient to capture the variations within and across the classes.
1 INTRODUCTION
3D model based object recognition is the task of automatically identifying a 3D object from a database, given an input image taken at close range. Often the catalogue contains only 3D CAD models or 3D scans of the different objects, and identifying them in images in the traditional way would require building an image database from actual photographs of each object in the catalogue, which may be impractical. Therefore, existing approaches extract visual information from the 3D models and use it in a standard off-the-shelf recognition pipeline (Collet Romea et al., 2011; Sarkar et al., 2016; Sarkar et al., 2017). This is done either by extracting 2D local features and attaching them to 3D locations on the CAD model, followed by feature-matching based recognition (Sarkar et al., 2016), or by using a powerful CNN to solve a classification problem (with the recognition instances as classes) on rendered images of the 3D models (Sarkar et al., 2017).
Although CNN based methods have advanced the state of the art significantly for image related tasks, the progress on 3D model based object recognition using CNNs has not been as substantial. For example, solving recognition on rendered images of the 3D models in (Sarkar et al., 2017) achieved an 8% increase in recognition recall over local-feature based recognition. This is a considerable improvement over feature-matching based methods, but it is not comparable to the gains achieved by CNN based methods in other areas where only real images are used for training (Krizhevsky et al., 2012; Ren et al., 2015). In this paper, we provide a simple method for bridging the gap between rendered and real images, with the aim of significantly improving recognition results compared to previous CNN based solutions.
Beyond CAD based object recognition, learning image categories using only 3D models as the source of training data is also an important problem, since a single 3D model contains much more information than any particular view or image of that model. However, the rendered images of a 3D model lie in a significantly different domain from its real images. In this work, by using high quality rendering and the simple concept of correlation alignment, we decrease the gap between the domains of real and rendered images and provide a high performing classification system that is trained only on 3D models. We thereby exploit advances in computer graphics rendering to improve the vision tasks of 3D model based recognition and classification.
Our contributions are the following:
1. We reduce the domain gap between synthetic and real images by combining high quality rendering with a simple method of correlation alignment inspired by (Sun et al., 2015).
2. In the presence of accurate 3D models, we use our
domain adaptation method and beat the existing
state-of-the-art CAD based recognition in 2D images by a large margin with respect to both local-feature based and CNN based recognition systems.
3. In the absence of accurate 3D models, we use our domain adaptation method to provide an accurate real-time classification system for common objects found on office desks, trained only on 3D models from a publicly available dataset.
A demonstration video of the real-time classification/recognition system for the common objects is provided in the supplementary materials.
2 RELATED WORK
CNN based Object Classification. AlexNet (Krizhevsky et al., 2012) was the first deep CNN model trained on GPUs for the task of classification, and it is still often used as the base model or feature extractor for other tasks. Other well-known models that are often used as base CNNs are VGG (Simonyan and Zisserman, 2014), GoogLeNet (Szegedy et al., 2015a), ResNet (He et al., 2015) and InceptionV3/V4 (Szegedy et al., 2015b). VGG is a simple network which uses a series of small 3 × 3 convolution filters followed by fully connected layers. Due to its simplicity, we use the AlexNet configuration in our network and fine-tune its weights for our requirements.
Use of Rendered Images in Vision. These methods render 3D models using a computer graphics pipeline and use the rendered images to augment the training data for problems involving 2D images. (Peng et al., 2014) used renderings of similar-looking 3D models of some of the categories of the PASCAL VOC dataset (Everingham et al., 2010) to augment the training dataset, and found it to perform better than training on the given training set alone. (Su et al., 2015b) used images of rendered 3D models to augment the PASCAL 3D+ dataset to improve results on viewpoint detection. (Sarkar et al., 2017) uses only rendered images as the training set for 3D model recognition using a CNN. In contrast to their work, (a) we render highly realistic images using a powerful rendering engine, (b) we further bridge the domain gap using correlation alignment of CNN features, and (c) we achieve a large margin of improvement in recognition accuracy compared to them.
Feature based Object Recognition. Feature-augmented 3D models are created by performing Structure From Motion (SFM) on training images of the object to be recognized. This association of 3D points to 2D descriptors, obtained from SFM, forms the backbone of most feature based detection methods: the features extracted from a given input image are matched to those of the feature-augmented 3D models and subsequently a 6 DOF recognition is made (Skrypnyk and Lowe, 2004; Hao et al., 2013; Collet Romea et al., 2011; Collet Romea and Srinivasa, 2010; Irschara et al., 2009). (Sarkar et al., 2016) provided a new method for creating feature-augmented models in the presence of accurate 3D models at training time. They used the texture map of the 3D models to assign 2D features to the 3D points of the model and, in a second method, took rendered virtual snapshots to group the 2D features assigned to the 3D points. In contrast, we use high quality rendered images to train a CNN and outperform the recognition accuracy of this work.
Domain Adaptation. There have been several works on adapting between domains with different distributions. Geodesic methods bridge the source and
target domain by projecting source and target onto
points along a geodesic path (Gopalan et al., 2011).
DLID trains a joint source and target CNN architec-
ture for domain adaptation (Chopra and Balakrishnan,
2013). DAN (Long et al., 2015) and DDC (Tzeng
et al., 2014) directly optimize the deep representa-
tion for domain invariance. Correlation Alignment, or CORAL (Sun et al., 2015), is one of the simplest domain adaptation methods, in which the whitened source features are recolored using the target covariance. Because of its simplicity, we adapt our method from this idea.
3 APPROACH
The focus of our work is to perform object recognition of a dataset of 3D models in 2D query images. That is, we train using only the 3D models in the database and use the trained model for recognition at test time on query 2D images. As mentioned in the previous section, we use a 2D CNN based pipeline (over a local-feature based pipeline) for this task because of its huge success in the general tasks of classification, recognition and detection.
Given a database of 3D models we take virtual
snapshots of each model from different views with
high quality rendering settings. Using this set of high
quality images and a simple adaptation of the final
features by correlation alignment (with the test im-
ages) we fine-tune a pretrained model for the task
of recognition of the 3D instances. The following
subsection describes the rendering settings in detail, with the aim of bridging the domain gap. Section 3.2 introduces the idea of correlation alignment, and Section 3.3 gives the details of the CNN framework with and without correlation alignment.
Assumption of 3D Models. We assume that the 3D
models in the database are upright oriented along a
consistent axis. This assumption holds true for most
of the publicly available databases (Wu et al., 2015;
Chang et al., 2015a), and has been used extensively by popular shape analysis methods (Su et al., 2015a; Johns et al., 2016; Qi et al., 2016; Maturana and Scherer, 2015). The gravity direction is used in different rendering settings, such as lights and viewpoints.
3.1 Rendering Scheme
We train a CNN on rendered images and use it for the classification of real images. The main goal is to make the rendering as realistic as possible. Therefore, we use a dedicated rendering engine (Blender) with the following aims: (1) the rendered images are realistic, and (2) the collection of rendered images is overfit-resistant. The rendering settings and their motivations are outlined in the following paragraphs.
General Rendering Ideas. Rendering is the pro-
cess of creating a 2D image from a 3D scene. The
final image is based on factors such as camera set-
tings, lighting settings, material and the render set-
tings. We first describe the light settings and then the different types of material settings (shaders). Note that the lights work together with the shaders, so the following sections are not independent.
3.1.1 Lighting
Lighting/shading is one of the most important factors for realistic rendering. Even though rotation and scale invariance are taken care of by data augmentation with random crops and rotations of the training images (Krizhevsky et al., 2012), incorporating lighting invariance is not technically feasible when training is performed with real images. To mitigate this problem, training sets of real images are often enlarged to cover instances with different lighting conditions. The presence of 3D models, along with an advanced rendering pipeline, gives us a big advantage in generating training images with different lighting conditions.
Figure 1: Overview of our rendering pipeline (background, directional rays, lamps, 3D models with different material properties, and camera). See Section 3.1 for details.
Observing the lighting patterns in general scenarios, we experimented with the following lighting settings.
Uniform directional light: We place a directional light of moderate intensity (pointing from top to bottom) which provides uniform light from above on every object. We keep this light on for all rendered images.
Point light at the camera: A point light radiates the same amount of light in all directions, with the intensity/energy decaying with the distance from the light to the object. We fix a low-intensity point light at the location of the camera for all renderings. This light is added to better highlight the textures and provide a well-illuminated environment. For highly accurate 3D models that contain significant texture, this light has been crucial. If we used only this light (and ignored all other lights), the setting would reduce to the default rendering settings of popular rendering toolkits such as VTK and MeshLab.
Random point lights: Along with the two aforementioned lights, which are fixed for all renderings, we add 0 to 6 moderate-intensity point lights (the actual number chosen at random) at random locations at a distance of 8 to 20 times the size of the object. We found this range of distances to add soft shadows similar to those in real images of a model.
All the light sources are added so as to produce both specular highlights and diffuse shading. In addition, we use a ray-tracing mechanism for computing soft shadows.
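The sketch below illustrates how such a lighting setup could be scripted. It is a minimal illustration assuming Blender's Python API (bpy, 2.8x naming), not the rendering code used for this paper; the energy values and function names are placeholders.

import random
import bpy  # Blender's Python API (names follow the 2.8x API; an assumption)

def add_light(name, light_type, energy, location):
    """Create a light datablock, wrap it in an object and place it in the scene."""
    light_data = bpy.data.lights.new(name=name, type=light_type)
    light_data.energy = energy          # intensity values below are placeholders
    light_obj = bpy.data.objects.new(name, object_data=light_data)
    light_obj.location = location
    bpy.context.collection.objects.link(light_obj)
    return light_obj

def setup_lights(camera_location, object_size):
    # 1. Uniform directional light from the top, kept on for every rendering.
    add_light("sun_top", 'SUN', energy=1.0, location=(0.0, 0.0, 10.0))

    # 2. Low-intensity point light fixed at the camera position.
    add_light("camera_point", 'POINT', energy=0.3, location=camera_location)

    # 3. Zero to six moderate-intensity point lights at random positions,
    #    8 to 20 times the object size away, producing soft shadows.
    for i in range(random.randint(0, 6)):
        direction = [random.uniform(-1.0, 1.0) for _ in range(3)]
        norm = sum(d * d for d in direction) ** 0.5 or 1.0
        distance = random.uniform(8.0, 20.0) * object_size
        location = tuple(d / norm * distance for d in direction)
        add_light("random_point_%d" % i, 'POINT', energy=0.6, location=location)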
3.1.2 Materials
In the presence of accurate 3D models (e.g., Wavefront .obj models), we use the material properties present in the models, which include the diffuse colors/intensity, specular exponent, transparency, etc. Based on our observations, we also found flat shading to be more realistic than the default smooth shading in the presence of precise 3D models (scanned
through an accurate 3D scanner). This is due to the high amount of detail present in the models, which needs no further smoothing; the normal maps of the faces are good enough to produce high quality shading. Therefore, we use the following configurations:
Accurate 3D scans: With accurate 3D scans (e.g., the high-texture dataset from (Sarkar et al., 2016)), we use flat shading with the material properties taken from the settings available in the scans. In the absence of material properties, we use a Lambert diffuse shader (with a factor of 1, i.e., maximum intensity) and a low-intensity specular shader of high hardness. This is congruent with the test images of the corresponding 3D models as well as with observations of common objects.
Models from publicly available datasets: Models from publicly available databases are often hand designed and not as accurate as the real objects (in comparison to 3D models produced by a 3D scanner). Therefore, smooth shading (which is also the default setting in many rendering engines) is more suitable for this kind of model. We use an ‘auto smooth’ functionality to combine normal smoothing with the preservation of sharp edges: edges where the angle between the faces is smaller than 30 degrees are smoothed. All materials are chosen to both cast and receive shadows.
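As an illustration of this shading choice, the following minimal sketch again assumes the Blender Python API (bpy); the 30-degree threshold comes from the text, while the function name and control flow are hypothetical.

import math
import bpy  # assumed Blender 2.8x Python API

def apply_shading(obj, is_accurate_scan):
    mesh = obj.data
    if is_accurate_scan:
        # Accurate 3D scans: flat shading, keeping the per-face normals untouched.
        for poly in mesh.polygons:
            poly.use_smooth = False
    else:
        # Hand-designed models: smooth shading plus 'auto smooth', which smooths
        # edges whose face angle is below 30 degrees and keeps sharper edges hard.
        for poly in mesh.polygons:
            poly.use_smooth = True
        mesh.use_auto_smooth = True
        mesh.auto_smooth_angle = math.radians(30.0)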
3.1.3 Background Images
Along with making the rendering realistic, we also make the rendered images overfit-resistant. We do this by adding different backgrounds to the images rendered with the above settings. Following (Sarkar et al., 2017), we use three different backgrounds for each rendered image (which has a transparent background): (a) a white background, (b) a random background from the PASCAL dataset (Everingham et al., 2010) (avoiding any classes that conflict with our instances), and (c) a background involving a table, which resembles the test images. The training set therefore contains three times as many images as there are renderings.
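A possible implementation of this background augmentation is sketched below using Pillow; the directory layout and file names are assumptions, and only the three-background scheme itself comes from the text.

import random
from pathlib import Path
from PIL import Image

def composite_backgrounds(render_path, pascal_dir, table_path, out_dir):
    # The rendering is saved as RGBA with a transparent background.
    fg = Image.open(render_path).convert("RGBA")
    pascal_images = list(Path(pascal_dir).glob("*.jpg"))  # conflicting classes assumed removed

    backgrounds = {
        "white":  Image.new("RGBA", fg.size, (255, 255, 255, 255)),
        "pascal": Image.open(random.choice(pascal_images)).convert("RGBA").resize(fg.size),
        "table":  Image.open(table_path).convert("RGBA").resize(fg.size),
    }
    for name, bg in backgrounds.items():
        out = Image.alpha_composite(bg, fg).convert("RGB")
        out.save(Path(out_dir) / ("%s_%s.jpg" % (Path(render_path).stem, name)))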
3.1.4 Views
When the category, or the category of the recognition instance, belongs to one of the classes in the PASCAL3D+ dataset (Xiang et al., 2014) (for example, chairs), we sample the azimuth, elevation and in-plane rotation of the camera from a distribution estimated from the real-image training set of PASCAL3D+. For the remaining categories, we sample from a uniform distribution with the range adjusted per category. For example, for the instance Totem, which is thin and tall, the elevation angles are chosen between 0 and 45 degrees (so that extreme top views are avoided), whereas for keyboards the elevation angles are chosen between 30 and 80 degrees (so that extreme side views are avoided).
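The following sketch illustrates this view-sampling logic; only the Totem and keyboard elevation ranges come from the text, while the default range, the in-plane range, and the PASCAL3D+ resampling interface are assumptions.

import random

# Elevation ranges (in degrees) for instances without PASCAL3D+ statistics;
# only the Totem and keyboard ranges come from the text.
ELEVATION_RANGES = {
    "totem":    (0.0, 45.0),   # thin and tall: avoid extreme top views
    "keyboard": (30.0, 80.0),  # flat: avoid extreme side views
}

def sample_view(category, pascal3d_views=None):
    """Return (azimuth, elevation, in-plane rotation) in degrees for one rendering."""
    if pascal3d_views:
        # Category present in PASCAL3D+: resample a viewpoint observed in real images.
        return random.choice(pascal3d_views)
    lo, hi = ELEVATION_RANGES.get(category, (0.0, 90.0))   # default range is an assumption
    azimuth = random.uniform(0.0, 360.0)
    elevation = random.uniform(lo, hi)
    inplane = random.uniform(-15.0, 15.0)                  # small in-plane tilt (assumption)
    return azimuth, elevation, inplane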
3.2 Correlation Alignment
In the previous section, we described the details of creating a realistic and overfit-resistant set of images for decreasing the domain gap. We further apply a very simple, yet effective, method of Correlation Alignment of the source and target feature distributions to further minimize the domain gap between rendered and real images. The method, which is motivated by (Sun et al., 2015), aligns the input feature distributions of the source and target domains by minimizing the difference between their second-order
statistics. Simply put, given the source feature set S and the target feature set T, we perform the following steps to align the correlation of S to T and obtain the adapted feature set S'.
Correlation Alignment algorithm:
1. Compute the covariance matrices of both S and T.
2. Whiten S using its covariance matrix to get S_w.
3. Recolor S_w with the target covariance to get S'.
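A minimal NumPy sketch of these three steps, following CORAL (Sun et al., 2015), is given below; S and T are matrices of row-wise feature vectors, and the small regularizer eps is an assumption added for numerical stability.

import numpy as np
from scipy import linalg

def correlation_alignment(S, T, eps=1e-5):
    """Align the source features S to the target features T (both num_samples x dim)."""
    # 1. Covariance matrices of the source and target feature sets.
    cov_s = np.cov(S, rowvar=False) + eps * np.eye(S.shape[1])
    cov_t = np.cov(T, rowvar=False) + eps * np.eye(T.shape[1])

    # 2. Whiten S with the inverse square root of its covariance.
    S_w = S @ linalg.inv(linalg.sqrtm(cov_s))

    # 3. Recolor the whitened features with the target covariance to obtain S'.
    S_prime = S_w @ linalg.sqrtm(cov_t)
    return np.real(S_prime)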
Even though the method is simple, it has been shown to be as good as other specialized methods (Sun et al., 2015). Since we use a CNN for the classification task, we can either treat the base CNN as a feature extractor to obtain source and target features for domain adaptation, or use a specialized version of Correlation Alignment for CNNs, where a dedicated loss (the Frobenius norm of the difference between the source and target feature covariances) is added to the classification (or regression) loss while training the CNN. The exact way the adaptation is used in our CNN is described in the next section.
3.3 CNN Architecture
We use the eight-layer AlexNet (Krizhevsky et al., 2012) as our neural network architecture for training and testing because of its popularity and simplicity. Even though more recent deep networks such as VGGNet (Simonyan and Zisserman, 2014) or ResNet (He et al., 2015) should work better, in our experiments with 5 categories (from the dataset by (Sarkar et al., 2016)) we found AlexNet to be sufficient for the recognition task. This also enables an easy comparison with (Sarkar et al., 2017).
Table 1: Comparison of object recognition recall of our system with the local-feature based method (feature-augmented models using the snap2 and tmap settings in (Sarkar et al., 2016) plus online matching using MOPED (Collet Romea et al., 2011)) and the CNN based method without specialized rendering (Sarkar et al., 2017). Boldface numbers highlight the best and underlined numbers highlight the second best performing values.
Models        snap2 (Sarkar et al., 2016)   tmap (Sarkar et al., 2016)   (Sarkar et al., 2017)   our-r
Milk-carton   0.88                          0.71                         0.81                    0.98
Totem         0.83                          0.21                         0.98                    0.99
Lion          0.89                          0.63                         0.74                    0.98
Whitener      0.48                          0.63                         0.77                    0.95
Matriochka    0.58                          0.62                         0.64                    1.00
Average       0.73                          0.56                         0.79                    0.98
3.3.1 Finetuning with Correlation Alignment
There are specialized methods for using Correlation Alignment in CNNs, which essentially add a term for the correlation mismatch between the source and target domains to the original loss (Sun and Saenko, 2016). This requires processing a batch of test images in every training iteration. Instead, we simplify the process in the following way. First, as a preprocessing step, we compute and store the covariance of the target CNN features (from the last FC layer, using the real test images). Then, during training (using rendered images), we transform the features to the target domain by whitening and subsequently recoloring them with the target distribution, as explained in Section 3.2. Finally, we compute the loss using the transformed features.
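The sketch below shows one way this simplified scheme could look in PyTorch; where exactly the transform is applied (here on the penultimate features, before the final classifier) and all names are our assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def matrix_sqrt(cov, eps=1e-5):
    """Symmetric matrix square root via eigendecomposition."""
    cov = cov + eps * torch.eye(cov.shape[0], device=cov.device)
    eigval, eigvec = torch.linalg.eigh(cov)
    return eigvec @ torch.diag(eigval.clamp(min=0.0).sqrt()) @ eigvec.T

def precompute_coral_transform(source_feats, target_feats):
    # The paper precomputes the target covariance from real test images; computing
    # the source covariance once from rendered-image features is our simplification.
    cov_s = torch.cov(source_feats.T)                # features are (N x D) matrices
    cov_t = torch.cov(target_feats.T)
    whiten = torch.linalg.inv(matrix_sqrt(cov_s))    # removes source correlation
    recolor = matrix_sqrt(cov_t)                     # injects target correlation
    return whiten @ recolor                          # combined D x D transform

def training_step(backbone, classifier, images, labels, coral_transform, optimizer):
    feats = backbone(images)            # features of rendered (source) images
    feats = feats @ coral_transform     # whiten + recolor toward the target domain
    logits = classifier(feats)
    loss = F.cross_entropy(logits, labels)   # classification loss on adapted features
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()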
4 EXPERIMENTAL RESULTS
4.1 3D Model Based Object Recognition
In this section, we provide our results for the prob-
lem of object recognition of 3D models in 2D im-
ages when the models are of high quality and closely
resemble the real objects. We use a subset of the high-texture dataset provided by (Sarkar et al., 2016) for this task. In brief, the dataset contains 5 textured meshes - Lion, Totem, Matriochka, Milk-carton and Whitener - and a set of test images. The test image set contains in total around 3000 images of the real objects corresponding to the provided meshes. The meshes represent accurate versions of the real objects and were reconstructed with the high quality 3D scanner 3Digify (3Digify, 2015).
4.1.1 Rendered Images
We applied the ideas presented in Section 3.1 and rendered the 5 models present in the dataset. Since no view statistics (or training set of real images) are available for this dataset, we manually define the ranges of the camera extrinsics depending on the models (see Section 3.1.4). Since all the models in the dataset are provided in a unit bounding box, the lights and the camera positions are represented in bounding-box units. As mentioned in Section 3.1.3, we added a total of three background images per rendering - white, random, and a table similar to that in the test images. This idea was taken from (Sarkar et al., 2017), which also performs 3D object recognition in 2D images using a CNN.
We render 2000 views per model, which results in 6000 images per category instance after background augmentation.
4.1.2 CNN Architecture and Training Details
In order to evaluate our method against (Sarkar et al., 2017), which uses AlexNet on rendered images, we use AlexNet as our base network. This isolates the effect of our domain adaptation from the CNN design. Because of the simplicity of AlexNet, our training does not require a large amount of GPU memory, and we could easily use a batch size of 64 on a GTX 1070. We finetune our network, initialized from ImageNet pretrained weights, for 30 epochs with a cross entropy loss. We use the Adam optimizer with a learning rate of 0.0001, which we halve after 20 epochs. Finetuning for 30 epochs on a training set of 30k images takes around 30 minutes.
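For concreteness, this recipe could look as follows in PyTorch/torchvision; the dataset path, transforms and the use of ImageFolder are placeholders, while the optimizer, schedule, batch size and epoch count follow the text.

import torch
from torch import nn, optim
from torchvision import datasets, models, transforms

NUM_CLASSES = 5   # the five recognition instances of the high-texture dataset

# Rendered training images arranged one folder per instance (placeholder path).
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("rendered_images/", transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

model = models.alexnet(pretrained=True)              # ImageNet initialization
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)   # replace the final FC layer
model = model.cuda()

optimizer = optim.Adam(model.parameters(), lr=1e-4)  # learning rate 0.0001, Adam
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve after 20 epochs
criterion = nn.CrossEntropyLoss()

for epoch in range(30):                              # 30 epochs of finetuning
    for images, labels in train_loader:
        logits = model(images.cuda())
        loss = criterion(logits, labels.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()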
4.2 Comparison Algorithms
We compare our results with two very different ap-
proaches:
Figure 2: Some examples of positive recognitions from our system.
Figure 3: Some of the erroneous recognitions. The predicted labels (left to right, top to bottom) are Whitener, Whitener, Whitener, MilkBottle, MilkBottle, Matriochka.
(1) Local-feature based Recognition: We use the classical local-feature based object recognition pipeline, where feature-augmented 3D models are created by (a) performing Structure From Motion (SFM) on training images or (b) a model based approach such as (Sarkar et al., 2016). At test/query time, features extracted from the given query image are matched to those of the feature-augmented 3D models and subsequently a recognition is made. For this query phase we use a sophisticated version of PnP + RANSAC (solution of the Perspective-n-Point problem under RANSAC iterations) known as MOPED (Collet Romea et al., 2011). For creating the feature-augmented models we use both the tmap and snap2 techniques as explained in (Sarkar et al., 2016).
(2) CNN based Recognition: We consider a baseline CNN approach where a pretrained model is finetuned on images rendered with minimal rendering configurations. This minimal rendering scheme
has been used by many of the existing model based
systems (Hinterstoisser et al., 2017; Sarkar et al.,
2017; Kehl et al., 2017). We use the results from
(Sarkar et al., 2017) for this task, where the 3D mod-
els are rendered with the default rendering settings of
Visualization Toolkit (VTK) - a directional headlight
located at the center of the camera and Phong shad-
ing interpolation. We use their best configuration of
background and texture for comparison.
The comparison of recognition recall is shown in Table 1. As seen, our simple yet effective method improves on the previous CNN based recognition system by a large margin of 24% (0.79 → 0.98). In fact, the existing CNN based system (without sophisticated rendering) could only improve upon the classical local-feature based system by 8% (0.73 → 0.79). This was mostly because the high quality and texture details of the 3D models in the dataset left little performance difference between the feature-matching based method and the CNN based approach with simple rendering. Our high quality rendering, together with the domain adaptation, improves on the local-feature based system by a margin of 34% (0.73 → 0.98). Figures 2 and 3 respectively show some of the positive and erroneous recognition examples.
Figure 4: Snapshots of our demo application for real-time object classification using 3D models. Even with very high background clutter, our system provides good classification results.
4.3 Real-time Classification System
In this section we provide the details of our real-time AR application for object classification, which uses no real images for training. We chose 2 specialized categories and 6 common office-desk categories (namely Keyboard, Monitor, Headphone, Mug, Bottle and Chair) with the aim of having an appropriate demo application in an office/lab environment.
Our classification system is trained using only rendered images of generic 3D models with the settings described in Section 3.1. No real images are used at any step. Moreover, the 3D models used are from publicly available databases and do not correspond to the exact objects in the real world.
4.3.1 Training Set
We use ShapeNet (Chang et al., 2015b) to get 3D models of the respective categories. ShapeNetCore contains common daily objects with alignments, which we use to determine the gravity direction while placing the lights and camera. We randomly take a maximum of 100 3D models for each category and sample 600 different views for each 3D model, giving a total of 6000 rendered images per category. Along with the background augmentation (3 different backgrounds; Section 3.1.3), we train using a total of 18k images per class.
The details of finetuning the CNN for the classification application are similar to those of Section 4.1.2. We finetune AlexNet using the generated images explained above.
4.3.2 Testing with Real Images and AR
Application
Our application performs impressively even though
our network is not trained using any real images. Fig-
ure 4 shows some screenshots of our application. As
we zoom towards an object, the application shows the
category class. A short video showing the output of
our application is provided in the supplementary ma-
terial.
5 CONCLUSION
In this work we presented a simple yet powerful domain adaptation system based on the combination of high quality rendering and correlation alignment. Using the images rendered by our method, we significantly outperformed an existing system which uses a simple renderer but the same learning technique. We also showed that this idea can be generalized to learn a classifier capturing different variations of categories using only 3D models from publicly available datasets. In the future we would like to extend our classification system to identify 6DOF pose by attaching a regressor, again trained using only 3D models.
ACKNOWLEDGMENTS
This work was partially funded by the BMBF
projects DYNAMICS (01IW15003) and VIDETE
(01IW18002).
REFERENCES
3Digify (2015). 3digify, http://3digify.com/.
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P.,
Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S.,
Su, H., Xiao, J., Yi, L., and Yu, F. (2015a). ShapeNet:
An Information-Rich 3D Model Repository. Techni-
cal Report arXiv:1512.03012 [cs.GR], Stanford Uni-
versity — Princeton University — Toyota Technolog-
ical Institute at Chicago.
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P.,
Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S.,
Su, H., Xiao, J., Yi, L., and Yu, F. (2015b). ShapeNet:
An Information-Rich 3D Model Repository. Techni-
cal Report arXiv:1512.03012 [cs.GR], Stanford Uni-
versity — Princeton University — Toyota Technolog-
ical Institute at Chicago.
Chopra, S. and Balakrishnan, S. (2013). Deep learning for
domain adaptation by interpolating between domains.
Collet Romea, A., Martinez Torres, M., and Srinivasa, S.
(2011). The moped framework: Object recognition
and pose estimation for manipulation. International
Journal of Robotics Research, 30(10):1284 – 1306.
Collet Romea, A. and Srinivasa, S. (2010). Efficient multi-
view object recognition and full pose estimation. In
2010 IEEE International Conference on Robotics and
Automation (ICRA 2010).
Everingham, M., Van Gool, L., Williams, C. K. I., Winn,
J., and Zisserman, A. (2010). The pascal visual ob-
ject classes (voc) challenge. International Journal of
Computer Vision, 88(2):303–338.
Gopalan, R., Li, R., and Chellappa, R. (2011). Domain
adaptation for object recognition: An unsupervised
approach. In Computer Vision (ICCV), 2011 IEEE In-
ternational Conference on, pages 999–1006. IEEE.
Hao, Q., Cai, R., Li, Z., Zhang, L., Pang, Y., Wu, F., and
Rui, Y. (2013). Efficient 2d-to-3d correspondence fil-
tering for scalable 3d object recognition. In Computer
Vision and Pattern Recognition (CVPR), 2013 IEEE
Conference on, pages 899–906.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep
residual learning for image recognition. CoRR,
abs/1512.03385.
Hinterstoisser, S., Lepetit, V., Wohlhart, P., and Kono-
lige, K. (2017). On pre-trained image features
and synthetic images for deep learning. CoRR,
abs/1710.10710.
Irschara, A., Zach, C., Frahm, J.-M., and Bischof, H.
(2009). From structure-from-motion point clouds to
fast location recognition. In Computer Vision and
Pattern Recognition, 2009. CVPR 2009. IEEE Con-
ference on, pages 2599–2606.
Johns, E., Leutenegger, S., and Davison, A. J. (2016). Pair-
wise decomposition of image sequences for active
multi-view recognition. In Computer Vision and Pat-
tern Recognition (CVPR), 2016 IEEE Conference on,
pages 3813–3822. IEEE.
Kehl, W., Manhardt, F., Tombari, F., Ilic, S., and Navab, N.
(2017). Ssd-6d: Making rgb-based 3d detection and
6d pose estimation great again. In Proceedings of the
International Conference on Computer Vision (ICCV
2017), Venice, Italy, pages 22–29.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C. J. C., Bottou,
L., and Weinberger, K. Q., editors, Advances in Neu-
ral Information Processing Systems 25, pages 1097–
1105. Curran Associates, Inc.
Long, M., Cao, Y., Wang, J., and Jordan, M. I. (2015).
Learning transferable features with deep adaptation
networks. In Proceedings of the 32Nd International
Conference on International Conference on Machine
Learning - Volume 37, ICML’15, pages 97–105.
JMLR.org.
Maturana, D. and Scherer, S. (2015). VoxNet: A 3D Convo-
lutional Neural Network for Real-Time Object Recog-
nition. In IROS.
Peng, X., Sun, B., Ali, K., and Saenko, K. (2014). Explor-
ing invariances in deep convolutional neural networks
using synthetic images. CoRR, abs/1412.7122.
Qi, C. R., Su, H., Nießner, M., Dai, A., Yan, M., and
Guibas, L. J. (2016). Volumetric and multi-view cnns
for object classification on 3d data. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 5648–5656.
Ren, S., He, K., Girshick, R. B., and Sun, J. (2015). Faster
R-CNN: towards real-time object detection with re-
gion proposal networks. CoRR, abs/1506.01497.
Sarkar, K., Pagani, A., and Stricker, D. (2016). Feature-
augmented trained models for 6dof object recognition
and camera calibration. In Proceedings of the 11th
Joint Conference on Computer Vision, Imaging and
Computer Graphics Theory and Applications, pages
632–640.
Sarkar, K., Varanasi, K., and Stricker, D. (2017). Trained
3d models for cnn based object recognition. In Pro-
ceedings of the 12th International Joint Conference
on Computer Vision, Imaging and Computer Graphics
Theory and Applications - Volume 5: VISAPP, (VISI-
GRAPP 2017), pages 130–137.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Skrypnyk, I. and Lowe, D. (2004). Scene modelling, recog-
nition and tracking with invariant image features. In
Mixed and Augmented Reality, 2004. ISMAR 2004.
Third IEEE and ACM International Symposium on,
pages 110–119.
Su, H., Maji, S., Kalogerakis, E., and Learned-Miller, E. G.
(2015a). Multi-view convolutional neural networks
for 3d shape recognition. In Proc. ICCV.
Su, H., Qi, C. R., Li, Y., and Guibas, L. J. (2015b). Render
for cnn: Viewpoint estimation in images using cnns
trained with rendered 3d model views. In The IEEE
International Conference on Computer Vision (ICCV).
Sun, B., Feng, J., and Saenko, K. (2015). Return of frustrat-
ingly easy domain adaptation. CoRR, abs/1511.05547.
Sun, B. and Saenko, K. (2016). Deep CORAL: correla-
tion alignment for deep domain adaptation. CoRR,
abs/1607.01719.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015a). Going deeper with convolutions.
In Computer Vision and Pattern Recognition (CVPR).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,
Z. (2015b). Rethinking the inception architecture for
computer vision. CoRR, abs/1512.00567.
Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., and Darrell,
T. (2014). Deep domain confusion: Maximizing for
domain invariance. CoRR, abs/1412.3474.
VTK. Visualization toolkit (vtk), http://www.vtk.org/.
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X.,
and Xiao, J. (2015). 3d shapenets: A deep representa-
tion for volumetric shapes. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recogni-
tion, pages 1912–1920.
Xiang, Y., Mottaghi, R., and Savarese, S. (2014). Beyond
pascal: A benchmark for 3d object detection in the
wild. In IEEE Winter Conference on Applications of
Computer Vision (WACV).