2 THE PROPOSED METHOD
In order to recognize known objects during normal operation, our system must first be trained with the set of objects of interest, i.e., we need to populate a database with the models of these objects. The proposed method describes the known objects by means of a set of statistical distributions of the objects' visual features, which also embed structural information such as the keypoint locations and the viewpoint from which they are extracted.
Given an object of interest, we collect a set of image, depth map and viewpoint tuples $\{I_i, D_i, \omega_i\}$, where $\omega_i \in \mathbb{R}^3$ is the orientation vector of the viewpoint from which the image $I_i$ is taken and $D_i$ is the depth image. We can express a rigid body transformation $g = [T\,\Omega] \in SE(3)$ in terms of a translation vector $T = [t_x\ t_y\ t_z]^T$ and an orientation vector $\Omega = [r_x\ r_y\ r_z]^T$, both in $\mathbb{R}^3$. We make this fact explicit using the notation $g(T,\Omega) = g$. $R(\Omega) \doteq \exp(\hat{\Omega})$ is the rotation matrix corresponding to the rotation vector $\Omega$, where $\hat{\Omega}$ is the skew-symmetric matrix corresponding to $\Omega$, and $\mathrm{Log}_{SO(3)}(R(\Omega)) \doteq \Omega$ is the rotation vector corresponding to the rotation matrix $R(\Omega)$. A feature detector (e.g., SIFT) is then run over each image $I_i$ to extract a set of keypoints and the relative descriptors $\{k^i_j, d^i_j\}$, with $k^i_j \in \mathbb{R}^6$ the 6-DoF coordinates of the extracted keypoint and $d^i_j$ the descriptor tuple. The keypoint 3D position is extracted from the depth map $D_i$, and its 3D orientation is obtained through the cross product of the SIFT orientation with the normal to the keypoint surface patch. Each keypoint $k^i_j$ collected in the image $I_i$ votes for a 6-DoF object position $v^i_j$ expressed by:
$$v^i_j = g^{-1}(k^i_j)\, g(c_i, \omega_i)$$

where $c_i \in \mathbb{R}^3$ are the 3D coordinates of the object center (computed as the centroid of the point cloud).
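The geometric machinery above (the rotation vector/matrix maps and the 6-DoF vote) can be sketched with a minimal implementation. The helper names are ours, and the reading of the vote as the composition of the inverse keypoint pose with the object-center pose is our interpretation of the equation, not code from the paper:

```python
import numpy as np

def hat(omega):
    """Skew-symmetric matrix corresponding to a rotation vector Omega."""
    x, y, z = omega
    return np.array([[0, -z, y],
                     [z, 0, -x],
                     [-y, x, 0]])

def exp_so3(omega):
    """Rodrigues' formula: R(Omega) = exp(hat(Omega))."""
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(omega / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def log_so3(R):
    """Log_SO(3): rotation vector Omega corresponding to rotation matrix R."""
    theta = np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros(3)
    return theta / (2 * np.sin(theta)) * np.array([R[2, 1] - R[1, 2],
                                                   R[0, 2] - R[2, 0],
                                                   R[1, 0] - R[0, 1]])

def g(T, Omega):
    """Homogeneous 4x4 matrix of the rigid transformation g(T, Omega)."""
    G = np.eye(4)
    G[:3, :3] = exp_so3(Omega)
    G[:3, 3] = T
    return G

def vote(k, c, omega):
    """Vote v = g(k)^-1 . g(c, omega), returned as 6-DoF (T, Omega);
    k is the 6-DoF keypoint [position, rotation vector]."""
    V = np.linalg.inv(g(k[:3], k[3:])) @ g(c, omega)
    return np.concatenate([V[:3, 3], log_so3(V[:3, :3])])
```

A keypoint located exactly at the object center with the same orientation would then vote for the identity pose.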
For the sake of efficiency, and to reduce the number of distributions that compose an object model, we cluster the visual descriptors $d^i_j$ into simpler visual words. The Bag-of-Words $\{w_k\}_{k=1..N}$ we employ is created using the k-means clustering method from a large, random set of feature descriptors extracted from a set of natural images. In this way, keypoints with descriptors close to each other are expressed by a single visual word $w_k$ and grouped together in order to populate a single 6-dimensional voting space $V_k$, represented by a Mixture of Gaussians distribution: each object position hypothesis $v^i_j$ contributes to generate this multimodal PDF. The MoG is efficiently computed online using an integration-simplification based method (see Sec. 2.1). When a new keypoint along with its voting position $v^i_j$ is extracted, the visual word $w_k$ nearest to $d^i_j$ is searched for. If the search succeeds, $v^i_j$ contributes to modify the voting space $V_k$ as described in Sec. 2.1. To improve recognition performance, a vote $v^i_j$ is generated only if the assignment of $d^i_j$ to $w_k$ is not ambiguous. Let $w_{k_1}, w_{k_2}$ be the two words nearest to $d^i_j$ and let $d_h = \|d^i_j - w_{k_h}\|_2$ be their distances; $v^i_j$ is accepted only if $d_1 / d_2 < 0.8$, i.e., only if the nearest word is markedly closer than the second nearest. At the end of the training step, each object model contains $N$ voting spaces $V_k$ (MoGs), one for each visual word $w_k$.
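The nearest-word search with the ambiguity check can be sketched as follows. This is a minimal illustration: `assign_word` and the brute-force distance computation are ours, and a real system would typically index the vocabulary with a k-d tree:

```python
import numpy as np

def assign_word(d, words, ratio=0.8):
    """Return the index of the nearest visual word for descriptor d,
    or None when the assignment is ambiguous (ratio test).
    words is an (N, D) array of visual-word centroids."""
    dists = np.linalg.norm(words - d, axis=1)   # distance to every word
    k1, k2 = np.argsort(dists)[:2]              # two nearest words
    d1, d2 = dists[k1], dists[k2]
    # Accept only if the nearest word is clearly closer than the runner-up.
    if d2 > 0 and d1 / d2 < ratio:
        return k1
    return None
```

A descriptor that falls roughly midway between two words yields `None`, and its vote is simply discarded.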
During the online recognition step, the process described above is used to dynamically create a model of the scene $M_S$, using as input frames gathered by an RGB-D camera and poses obtained through a structure-from-motion algorithm. After each update, a set of candidate object models $M_{O_h}$ is selected from all learned object models. The candidate set includes all models that contain a non-empty MoG for at least one of the visual words detected in the last video frames. Each candidate model $M_{O_h}$ is then matched against $M_S$ to verify whether the object is actually present in the scene, and where. To figure out if $M_{O_h}$ is embedded in $M_S$, and to evaluate the best embedding points $p_h$, for each visual word we select the corresponding MoGs from the two models $M_{O_h}$ and $M_S$. Then, we compute their cross-correlation MoG as described in Sec. 2.2. The result of this operation is a set of MoGs that are merged together. We apply a mode-finding algorithm to this final MoG, and the modes represent the embedding points $p_h$. The points $p_h$ are regarded as hints of the possible locations of $M_{O_h}$ in $M_S$; their embedding quality is the actual probability of these guesses.
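The last step, extracting the modes $p_h$ from the merged MoG, can be illustrated with a simple fixed-point (mean-shift-style) iteration started from each component mean. This is a generic sketch, not necessarily the paper's own mode-finding algorithm:

```python
import numpy as np

def mog_modes(means, covs, weights, tol=1e-6, max_iter=200):
    """Find modes of a Mixture of Gaussians by running a fixed-point
    iteration from each component mean and de-duplicating the results."""
    inv_covs = [np.linalg.inv(C) for C in covs]
    norms = [w / np.sqrt(np.linalg.det(2 * np.pi * C))
             for w, C in zip(weights, covs)]

    def step(x):
        # Responsibility (unnormalized density) of each component at x ...
        r = np.array([n * np.exp(-0.5 * (x - m) @ iC @ (x - m))
                      for n, m, iC in zip(norms, means, inv_covs)])
        # ... defines a precision-weighted mean update toward a stationary
        # point of the mixture density.
        A = sum(ri * iC for ri, iC in zip(r, inv_covs))
        b = sum(ri * iC @ m for ri, iC, m in zip(r, inv_covs, means))
        return np.linalg.solve(A, b)

    modes = []
    for x in means:
        for _ in range(max_iter):
            x_new = step(x)
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        # Keep only distinct fixed points.
        if not any(np.linalg.norm(x - m) < 1e-3 for m in modes):
            modes.append(x)
    return modes
```

For two well-separated components the iteration returns two modes, each close to one component mean; in the recognition setting each surviving mode is one embedding hypothesis $p_h$.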
2.1 MoG Online Training
The most common method for fitting a set of data points (in our case, the object positions $v^i_j$) with a MoG is based on Expectation Maximization (Dempster et al., 1977). Unfortunately, this is an offline method and does not suit a scenario in which new data (i.e., new keypoints) arrives continuously and cannot be stored indefinitely. Many solutions have been proposed to address this issue (Hall et al., 2005; Song and Wang, 2005; Ognjen and Cipolla, 2005), but most of them are based on the split-and-merge criterion and are too slow or constrained for our application. In order to fit the object positions with a MoG, we employ a continuous integration-simplification loop that relies on a fidelity criterion to guarantee the required accuracy (Declercq and Pi-
GRAPP 2014 - International Conference on Computer Graphics Theory and Applications
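The integration-simplification idea can be sketched as follows: each new vote is integrated as a small Gaussian component, and the mixture is simplified by merging the closest pair of components whenever it grows past a budget. The moment-preserving merge and the fixed component cap are illustrative assumptions, not the paper's fidelity criterion:

```python
import numpy as np

def integrate(components, x, sigma0=0.5, max_components=10):
    """Integration: add the new sample x as a small Gaussian component;
    simplification: merge the two closest components while the model
    exceeds the allowed size. Each component is (weight, mean, cov)."""
    d = len(x)
    components = components + [(1.0, np.array(x, float),
                                sigma0 ** 2 * np.eye(d))]
    while len(components) > max_components:
        # Find the pair of components with the closest means.
        i, j = min(((a, b) for a in range(len(components))
                    for b in range(a + 1, len(components))),
                   key=lambda p: np.linalg.norm(components[p[0]][1] -
                                                components[p[1]][1]))
        (wi, mi, Ci), (wj, mj, Cj) = components[i], components[j]
        # Moment-preserving merge of the pair into a single Gaussian.
        w = wi + wj
        m = (wi * mi + wj * mj) / w
        C = (wi * (Ci + np.outer(mi - m, mi - m)) +
             wj * (Cj + np.outer(mj - m, mj - m))) / w
        components = [c for k, c in enumerate(components) if k not in (i, j)]
        components.append((w, m, C))
    return components
```

Because the merge preserves total weight, mean and covariance, the simplified mixture keeps the overall mass and spread of the votes while its size stays bounded, which is what makes the online update tractable.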