2 THE PROPOSED METHOD
In order to recognize known objects during normal operation, our system must first be trained with the set of objects of interest, i.e., we need to populate a database with the models of these objects. The proposed method describes each known object by means of a set of statistical distributions of its visual features, which also embed structural information such as the keypoint locations and the view-points from which they are extracted.
Given an object of interest, we collect a set of image, depth map and view-point tuples $\{I_i, D_i, \omega_i\}$, where $\omega_i \in \mathbb{R}^3$ is the orientation vector of the view-point from which the image $I_i$ is taken and $D_i$ is the depth image. We can express a rigid body transformation $g = [T\,|\,\Omega] \in SE(3)$ in terms of a translation vector $T = [t_x\ t_y\ t_z]^T$ and an orientation vector $\Omega = [r_x\ r_y\ r_z]^T$, both in $\mathbb{R}^3$. We make this fact explicit using the notation $g(T,\Omega) = g$. $R(\Omega) \doteq \exp(\widehat{\Omega})$ is the rotation matrix corresponding to the rotation vector $\Omega$, where $\widehat{\Omega}$ is the skew-symmetric matrix corresponding to $\Omega$, and $\mathrm{Log}_{SO(3)}(R(\Omega)) \doteq \Omega$ is the rotation vector corresponding to the rotation matrix $R(\Omega)$.
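To make this notation concrete, the following is a minimal NumPy sketch (not code from the paper; the function names are ours) of the two conversions it defines: $R(\Omega) = \exp(\widehat{\Omega})$ via the Rodrigues formula, and $\mathrm{Log}_{SO(3)}$ back to the rotation vector.

```python
import numpy as np

def skew(omega):
    """Skew-symmetric matrix corresponding to a rotation vector Omega."""
    x, y, z = omega
    return np.array([[0., -z,  y],
                     [ z, 0., -x],
                     [-y,  x, 0.]])

def rot_from_vec(omega):
    """R(Omega) = exp(skew(Omega)), computed with the Rodrigues formula."""
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(omega / theta)
    return np.eye(3) + np.sin(theta) * K + (1. - np.cos(theta)) * (K @ K)

def vec_from_rot(R):
    """Log_SO(3)(R): rotation vector corresponding to a rotation matrix.
    Valid for rotation angles strictly between 0 and pi (the near-pi
    special case is omitted in this sketch)."""
    theta = np.arccos(np.clip((np.trace(R) - 1.) / 2., -1., 1.))
    if theta < 1e-12:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2. * np.sin(theta))
    return theta * axis
```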
A feature detector (e.g., SIFT) is then run over each image $I_i$ to extract a set of keypoints and their associated descriptors $\{k^i_j, d^i_j\}$, with $k^i_j \in \mathbb{R}^6$ the 6 DoF coordinates of the extracted keypoint and $d^i_j$ the descriptor tuple. The keypoint 3D position is extracted from the depth map $D_i$, and its 3D orientation is obtained through the cross product of the SIFT orientation with the normal to the keypoint surface patch. Each keypoint $k^i_j$ collected in image $I_i$ votes for a 6 DoF object position $v^i_j$, expressed by
$$v^i_j = (-k^i_j)\,(c_i, \omega_i)$$
where $c_i \in \mathbb{R}^3$ are the 3D coordinates of the object center (computed as the centroid of the point cloud).
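The vote equation is given without further detail in this excerpt; as a hedged illustration only, the sketch below composes the inverse of the 6 DoF keypoint pose with the object-center pose $(c_i, \omega_i)$ of the current view, which is our reading of $(-k^i_j)(c_i, \omega_i)$. The helper names and the SciPy-based pose arithmetic are ours, not the paper's.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(t, omega):
    """Homogeneous 4x4 matrix for a pose given as (translation, rotation vector)."""
    g = np.eye(4)
    g[:3, :3] = Rotation.from_rotvec(omega).as_matrix()
    g[:3, 3] = t
    return g

def object_vote(keypoint_pose, center_c, view_omega):
    """6 DoF vote v^i_j: invert the keypoint pose k^i_j, then compose it with
    the object-center pose (c_i, omega_i) of the current view (our reading of
    the vote equation, used here for illustration only)."""
    k_t, k_omega = keypoint_pose[:3], keypoint_pose[3:]
    g_vote = np.linalg.inv(pose_to_matrix(k_t, k_omega)) @ pose_to_matrix(center_c, view_omega)
    t = g_vote[:3, 3]
    omega = Rotation.from_matrix(g_vote[:3, :3]).as_rotvec()
    return np.concatenate([t, omega])   # the vote lives in R^6
```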
For the sake of efficiency, and to reduce the number of distributions that compose an object model, we cluster the visual descriptors $d^i_j$ into simpler visual words. The Bag-of-Words $\{w_k\}_{k=1..N}$ we employ is created with the k-means clustering method from a large, random set of feature descriptors extracted from a collection of natural images. In this way, keypoints with similar descriptors are represented by a single visual word $w_k$ and grouped together to populate a single 6-dimensional voting space $V_k$, represented by a Mixture of Gaussians (MoG) distribution: each object position hypothesis $v^i_j$ contributes to this multi-modal PDF. The MoG is efficiently computed online using an integration-simplification based method (see Sec. 2.1). When a new keypoint is extracted along with its voting position $v^i_j$, the visual word $w_k$ nearest to $d^i_j$ is searched for. In case of success, $v^i_j$ contributes to the voting space $V_k$ as described in Sec. 2.1. To improve recognition performance, a vote $v^i_j$ is generated only if the assignment of $d^i_j$ to $w_k$ is not ambiguous. Let $w_{k_1}, w_{k_2}$ be the two words nearest to $d^i_j$ and let $d_h = \|d^i_j - w_{k_h}\|_2$ be their distances; $v^i_j$ is accepted only if $d_2 / d_1 > 0.8$. At the end of the training step, each object model contains $N$ voting spaces $V_k$ (MoGs), one for each visual word $w_k$.
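A possible sketch of this training-side bookkeeping, assuming the vocabulary is stored as an array of k-means word centers: the acceptance test reproduces the $d_2/d_1 > 0.8$ criterion stated above, while the per-word MoG update is deferred to Sec. 2.1 (votes are simply appended to a per-word list here). The class and parameter names are ours.

```python
import numpy as np

class ObjectModel:
    """One trained object: a voting space V_k (here, a plain list of 6 DoF
    votes, later summarized by a MoG as in Sec. 2.1) for each visual word w_k."""

    def __init__(self, words):
        self.words = np.asarray(words)                 # (N, D) k-means word centers
        self.voting_spaces = {k: [] for k in range(len(self.words))}

    def add_keypoint(self, descriptor, vote, ratio_threshold=0.8):
        """Assign the descriptor to its nearest visual word and, if the
        assignment is unambiguous (d2 / d1 > ratio_threshold, as stated in
        the text), let the 6 DoF vote contribute to that word's voting space."""
        dists = np.linalg.norm(self.words - descriptor, axis=1)
        k1, k2 = np.argsort(dists)[:2]                 # two nearest words
        d1, d2 = dists[k1], dists[k2]
        if d2 / max(d1, 1e-12) > ratio_threshold:
            self.voting_spaces[k1].append(np.asarray(vote, dtype=float))
```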
During the online recognition step, the process described above is used to dynamically build a model of the scene $M_S$, using as input the frames gathered by an RGB-D camera and the poses obtained through a structure-from-motion algorithm. After each update, a set of candidate object models $M_{O_i}$ is selected from all learned object models. The candidate set includes all models that contain a non-empty MoG for at least one of the visual words detected in the last video frames. Each candidate model $M_{O_h}$ is then matched against $M_S$ to verify whether the object is actually present in the scene, and where. To determine whether $M_{O_h}$ is embedded in $M_S$, and to evaluate the best embedding points $p_h$, for each visual word we select the corresponding MoGs from the two models $M_{O_h}$ and $M_S$. Then, we compute their Cross-Correlation MoG as described in Sec. 2.2. The result of this operation is a set of MoGs that are merged together. We apply a mode-finding algorithm to this final MoG, and the modes represent the embedding points $p_h$. The points $p_h$ are interpreted as hypotheses for the possible locations of $M_O$ in $M_S$, and their embedding quality is the probability of these hypotheses.
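At a purely structural level, the matching step can be organized as in the sketch below; the per-word cross-correlation, the merging of the resulting MoGs and the mode finding are hidden behind hypothetical helpers (cross_correlate_mog, merge_mogs, find_modes) whose contents depend on Sec. 2.2 and are not defined in this excerpt.

```python
def match_object(scene_model, object_model,
                 cross_correlate_mog, merge_mogs, find_modes):
    """Match a candidate object model M_Oh against the scene model M_S.

    Both models are assumed to map visual words w_k to MoG voting spaces V_k.
    For each word present in both models, the Cross-Correlation MoG is computed
    (Sec. 2.2); the resulting MoGs are merged, and the modes of the merged MoG
    are returned as candidate embedding points p_h with their probabilities as
    embedding quality."""
    shared_words = set(scene_model) & set(object_model)
    correlations = [cross_correlate_mog(scene_model[k], object_model[k])
                    for k in shared_words]
    if not correlations:
        return []                       # the object is not supported by the scene
    combined = merge_mogs(correlations)
    return find_modes(combined)         # [(p_h, embedding quality), ...]
```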
2.1 MoG Online Training
The most common method for fitting a set of data points (in our case, the object positions $v^i_j$) with a MoG is based on Expectation Maximization (Dempster et al., 1977). Unfortunately, this is an offline method and does not suit a scenario in which new data (i.e., new keypoints) arrives continuously and cannot be stored indefinitely. Many solutions have been proposed to address this issue (Hall et al., 2005), (Song and Wang, 2005), (Ognjen and Cipolla, 2005), but most of them are based on a split-and-merge criterion and are too slow or too constrained for our application. In order to fit the object positions with a MoG, we employ a continuous integration-simplification loop that relies on a fidelity criterion to guarantee the required accuracy (Declercq and Piater, 2008).
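The excerpt ends before the integration-simplification loop is fully specified, so the following sketch only conveys the general idea under our own assumptions: every incoming vote is integrated as a new narrow Gaussian component, and the mixture is simplified by moment-matching the two closest components whenever a fixed component budget is exceeded. This budget stands in for the paper's fidelity criterion and is not the actual method.

```python
import numpy as np

class OnlineMoG:
    """Toy integration-simplification loop for a MoG over 6 DoF votes.
    NOTE: illustration only; the fidelity criterion of the paper is replaced
    here by a fixed component budget."""

    def __init__(self, max_components=20, init_var=1e-2):
        self.weights, self.means, self.vars = [], [], []
        self.max_components = max_components
        self.init_var = init_var

    def integrate(self, vote):
        """Integration: add the new vote as a narrow Gaussian component."""
        vote = np.asarray(vote, dtype=float)
        self.weights.append(1.0)
        self.means.append(vote)
        self.vars.append(np.full(len(vote), self.init_var))
        if len(self.weights) > self.max_components:
            self._simplify()

    def _simplify(self):
        """Simplification: merge the two closest components by moment matching."""
        means = np.array(self.means)
        d = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        wi, wj = self.weights[i], self.weights[j]
        w = wi + wj
        m = (wi * means[i] + wj * means[j]) / w
        v = (wi * (self.vars[i] + (means[i] - m) ** 2)
             + wj * (self.vars[j] + (means[j] - m) ** 2)) / w
        for idx in sorted((i, j), reverse=True):
            del self.weights[idx], self.means[idx], self.vars[idx]
        self.weights.append(w)
        self.means.append(m)
        self.vars.append(v)
```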