object in a real image with sufficient accuracy to be
fed into a pose refinement process.
2 RELATED WORK
2.1 Local Feature Matching
Local feature extraction has been widely studied in
the literature, as it readily provides candidate matching
pairs between two views. These features were first
constructed from parametric shapes isolated within
the image contours (Honda et al., 1995). However,
industrial parts must exhibit specific shape singularities
to be properly localized this way. More recently, the
well-known SIFT local descriptor was employed to
build a pose description of an object (Lowe, 2004). In
(Gordon and Lowe, 2004), the extracted SIFT features
are matched against this pose description to estimate
the camera pose. Because of the high dimension of
the feature vectors, SIFT severely impacts the
computation time of the algorithm. Later, the SURF
local descriptors, faster to extract, were introduced;
however, they appear to be less robust to rotation and
image distortion than SIFT (Bay et al., 2008). Local
descriptors usually rely on object texture, which
industrial parts lack. Moreover, they suffer from strong
luminosity and contrast variations, making them
unsuitable for a challenging plant environment. Using
the depth channel of RGB-D images, (Lee et al., 2016)
proposes an ICP algorithm fed with 3D SURF features
and Closed Loop Boundaries (CLB) to estimate the
pose of industrial objects. However, the system cannot
deal with the occlusions likely to happen inside a bin.
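To illustrate how such local descriptors are typically matched, the following sketch applies Lowe's ratio test, the ambiguity check commonly used with SIFT, to toy descriptor arrays. The descriptors and the 0.75 threshold are illustrative stand-ins, not values taken from the cited works.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Match rows of desc_a to rows of desc_b with Lowe's ratio test.

    Returns (i, j) index pairs where the best match in desc_b is
    sufficiently better than the second best, i.e. unambiguous.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Euclidean distance from descriptor d to every row of desc_b.
        dists = np.linalg.norm(desc_b - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        # Keep the match only if the best distance clearly beats the runner-up.
        if dists[j1] < ratio * dists[j2]:
            matches.append((i, j1))
    return matches

# Toy example: three 4-D "descriptors" per view; rows 0 and 1 have a
# clear counterpart in the other view, row 2 is ambiguous and dropped.
a = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0]])
b = np.array([[0, 0.9, 0.1, 0], [0.95, 0.05, 0, 0], [0.5, 0.5, 0, 0]])
print(match_descriptors(a, b))  # → [(0, 1), (1, 0)]
```

The ratio test discards exactly the ambiguous matches that texture-less industrial parts tend to produce, which is one way the texture dependence noted above manifests in practice.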
2.2 Template Matching
To tackle the issue of texture-less objects, template
matching methods are employed to match more complex
feature sets between two different points of view.
Early works built an object descriptor composed of
different templates hierarchically extracted from an
object point of view and later compared with an input
image through a distance transform (Gavrila, 1998).
To handle more degrees of freedom in the camera
pose estimation, recent machine learning techniques
have been used to learn the object templates and the
associated camera poses, inferring the position and
then refining it with the distance transform. However,
these algorithms still need a contour extraction process,
which is not suitable for low-contrast, noisy or blurred images.
or blurred images. In (Hinterstoisser et al., 2010) the
discretized gradient directions are used to build tem-
plates compared with an object model through an en-
ergy function robust to small distorsion and rotation.
This method called LINE is yet not suitable for clut-
tered background as it severely impacts the computed
gradient. A similar technique is proposed in (Muja
et al., 2011). The arrival of low-cost RGB-D cameras
led the templates to become multimodal. LINEMOD
presented in (Hinterstoisser et al., 2011) uses a depth
canal in the object template among the gradient from
LINE to easily remove background side effects. The
method is later integrated into Hough forests to im-
prove the occlusion robustness (Tejani et al., 2014).
To handle more objects in the database, (Hodan
et al., 2015) proposes a sliding window algorithm
extracting relevant image areas to build candidate
templates. These templates are verified to get a rough
3D pose estimation later refined with a stochastic
optimization procedure. The camera pose estimation
problem can thus be solved with template matching,
but it remains constrained to RGB-D information to
achieve acceptable results. In this paper we propose
to use 2D images without depth information, which
are therefore not suitable for this type of matching.
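The distance-transform matching idea introduced by (Gavrila, 1998) can be sketched as follows. The binary edge map, the L-shaped template and the exhaustive offset search are simplified toy stand-ins for the hierarchical scheme of the original paper; only the core chamfer-style scoring is shown.

```python
import numpy as np

def distance_transform(edges):
    """Brute-force Euclidean distance transform of a binary edge map."""
    ys, xs = np.nonzero(edges)
    pts = np.stack([ys, xs], axis=1)               # (N, 2) edge coordinates
    h, w = edges.shape
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                indexing="ij"), axis=-1)   # (h, w, 2)
    # Distance from every pixel to its nearest edge pixel.
    d = np.linalg.norm(grid[:, :, None, :] - pts[None, None, :, :], axis=-1)
    return d.min(axis=2)

def chamfer_score(dt, template_pts, offset):
    """Mean distance of the shifted template points on the transform."""
    shifted = template_pts + np.asarray(offset)
    return dt[shifted[:, 0], shifted[:, 1]].mean()

# Template: an L-shaped contour; the image contains it shifted by (2, 3).
template = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]])
image = np.zeros((8, 8), dtype=bool)
image[tuple((template + [2, 3]).T)] = True
dt = distance_transform(image)
scores = {(dy, dx): chamfer_score(dt, template, (dy, dx))
          for dy in range(6) for dx in range(6)}
best = min(scores, key=scores.get)
print(best)  # → (2, 3): the true offset scores zero mean distance
```

Because the score only depends on extracted contours, any failure of the edge detector on low-contrast or blurred images directly corrupts the match, which is the limitation pointed out above.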
2.3 Feature Matching from Neural
Networks
Neural networks offer the ability to automatically
extract the features relevant to a given task. For pose
estimation, (Wohlhart and Lepetit, 2015) uses
convolutional neural networks to compute a database of
object descriptors prior to a kNN search for the closest
camera pose in the set. The loss function is fed with a
triplet formed by one sample from the training dataset
and two others respectively close to and far from the
considered camera position. Forcing the similarity
between features and estimated poses for two close
points of view is also employed in (Doumanoglou
et al., 2016), with a Siamese neural network performing
a true pose regression.
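A minimal sketch of such a triplet objective, assuming Euclidean distances between feature vectors and a unit margin (both illustrative choices, not the exact settings of the cited works):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on feature vectors.

    Pulls the anchor towards the positive (a nearby camera pose) and
    pushes it away from the negative (a distant pose) until the two
    distances differ by at least `margin`.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # feature of a nearby viewpoint
n = np.array([2.0, 0.0])   # feature of a distant viewpoint
print(triplet_loss(a, p, n))  # → 0.0: the margin constraint already holds
```

Minimizing this loss over many such triplets shapes the descriptor space so that Euclidean proximity reflects camera-pose proximity, which is what makes the subsequent kNN pose lookup meaningful.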
Although these methods are quite appealing, they still
take advantage of RGB-D modalities. In (Kendall
et al., 2015) a deep CNN known as PoseNet, based
on the GoogLeNet network, is built to regress a 6D
camera pose from a 2D image. Although it differs
slightly from our work in that it targets urban scenes,
it demonstrates the capability of a CNN to regress
over the continuous 3D space in an end-to-end manner.
Other works use a CNN pre-trained for object
classification to estimate a view pose through SVMs:
the depth channel of the RGB-D image is converted
into an RGB image so that it can be fed into the CNN
to produce relevant 3D features (Schwarz et al., 2015).
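The depth-to-RGB conversion can be sketched as follows; the simple three-channel ramp below is a hypothetical stand-in for the colorization actually used in (Schwarz et al., 2015), shown only to convey the principle of recoding depth into the input range a pre-trained RGB network expects.

```python
import numpy as np

def depth_to_rgb(depth):
    """Normalize a depth map and spread it over three RGB channels.

    Illustrative linear colormap: blue saturates on near pixels,
    red on far pixels, green peaks at mid-range.
    """
    d = (depth - depth.min()) / max(np.ptp(depth), 1e-9)  # scale to [0, 1]
    r = np.clip(2.0 * d, 0.0, 1.0)           # red saturates beyond mid-range
    b = np.clip(2.0 * (1.0 - d), 0.0, 1.0)   # blue saturates before mid-range
    g = 1.0 - np.abs(2.0 * d - 1.0)          # green peaks at mid-range
    return np.stack([r, g, b], axis=-1)

depth = np.array([[0.5, 1.0], [1.5, 2.0]])   # toy depth map in meters
rgb = depth_to_rgb(depth)
print(rgb.shape)  # → (2, 2, 3)
```

Encoding depth this way lets a single RGB-pre-trained CNN process both modalities without architectural changes, at the cost of committing to a fixed depth normalization.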
ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods
410