points are shown in Figure 4(a) as small colored spheres. The software calculates the rigid transformation that aligns the object model to the input scene by aligning the picked points, and renders the transformed object model onto the input scene as an image representation. Users refine the transformation with an ICP algorithm (Besl and McKay, 1992) by pressing a key, as shown in Figure 4(b). The final transformation $T_{M_i} = [R, t; 0, 1]$ can be checked in the 3D point cloud viewer, as shown in Figure 4(c).
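A minimal sketch of this step using Open3D is given below, assuming the model and scene are available as point clouds and the picked correspondences are given as index pairs; the file names and indices are placeholders, not the authors' code:

```python
import numpy as np
import open3d as o3d

# Placeholder inputs: object model M_i and reference scene I as point clouds.
model = o3d.io.read_point_cloud("model.pcd")
scene = o3d.io.read_point_cloud("scene.pcd")

# Indices of the manually picked corresponding points (hypothetical values).
pairs = np.array([[10, 45], [250, 1333], [811, 902]], dtype=np.int32)
corres = o3d.utility.Vector2iVector(pairs)

# Coarse rigid alignment from the picked point pairs (no scaling).
est = o3d.pipelines.registration.TransformationEstimationPointToPoint(False)
T_coarse = est.compute_transformation(model, scene, corres)

# ICP refinement (Besl and McKay, 1992) starting from the coarse estimate.
result = o3d.pipelines.registration.registration_icp(
    model, scene, max_correspondence_distance=0.01, init=T_coarse)
T_Mi = result.transformation  # 4x4 matrix of the form [R, t; 0, 1]
```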
In Step 2, the relative camera poses $T_{C_j}$ between the reference image $I$ and the remaining frames $I_j$, $j \in \{1, \ldots, n\}$, are calculated using the ArUco library, which accurately computes the poses of Augmented Reality (AR) markers.
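As a rough illustration, such a relative pose can be composed from the marker's pose in the reference image and in frame $I_j$; the sketch below uses OpenCV's ArUco module, where the intrinsics, distortion, marker size, and image names are assumptions and the exact ArUco API varies slightly across OpenCV versions:

```python
import cv2
import numpy as np

K = np.array([[600.0, 0.0, 320.0],   # placeholder camera intrinsics
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)                   # placeholder distortion coefficients

def marker_to_camera(image, marker_len=0.05):
    """4x4 pose of the first detected AR marker in the camera frame."""
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    corners, ids, _ = cv2.aruco.detectMarkers(image, dictionary)
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(corners, marker_len, K, dist)
    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvecs[0])
    T[:3, 3] = tvecs[0].ravel()
    return T

# Relative camera pose T_{C_j}: maps the reference camera frame to frame j.
T_ref = marker_to_camera(cv2.imread("I_ref.png"))
T_j = marker_to_camera(cv2.imread("I_j.png"))
T_Cj = T_j @ np.linalg.inv(T_ref)
```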
Finally, in Step 3, multiple ground-truth annotations (including bounding boxes, 6DoF poses, affordance labels, and object class labels) are generated from the pose data computed in Steps 1 and 2. According to these results, the pose of the object $M_i$ in $I_j$ can be denoted by $T_{C_j} T_{M_i}$. However, the two transformations may include slight errors that should be corrected. The object models $M_i$ appearing in the scene are merged into a single point cloud after applying the transformation $T_{C_j} T_{M_i}$. A correction transformation $T_r$ that minimizes the error between the merged point cloud and the scene $I_j$ is then calculated by an ICP algorithm. The final transformation is therefore $T_r T_{C_j} T_{M_i}$, and it is applied to all object models $M_i$.
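A sketch of this correction step, again assuming an Open3D-based ICP (illustrative only; the variable names and correspondence distance are assumptions):

```python
import copy
import numpy as np
import open3d as o3d

def refine_frame(scene_j, models, poses, T_Cj):
    """Estimate T_r for frame j and return the corrected object poses.

    scene_j: point cloud of frame I_j
    models:  list of object model point clouds M_i
    poses:   list of 4x4 poses T_{M_i} in the reference frame
    T_Cj:    4x4 relative camera pose of frame j
    """
    # Merge all models after placing them into frame j with T_{C_j} T_{M_i}.
    merged = o3d.geometry.PointCloud()
    for model, T_Mi in zip(models, poses):
        merged += copy.deepcopy(model).transform(T_Cj @ T_Mi)

    # T_r minimizes the residual between the merged models and the scene.
    result = o3d.pipelines.registration.registration_icp(
        merged, scene_j, max_correspondence_distance=0.01, init=np.eye(4))
    T_r = result.transformation

    # Final per-object poses T_r T_{C_j} T_{M_i}.
    return [T_r @ T_Cj @ T_Mi for T_Mi in poses]
```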
Annotations are generated by rendering the transformed object models onto the images $I_j$. For the affordance and object class labels, the label assigned to each point of $M_i$ is rendered. Bounding boxes are calculated from the silhouette of each model. The 6DoF pose of each model $M_i$ in an image $I_j$ is $T_r T_{C_j} T_{M_i}$.
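For instance, a 2D bounding box can be obtained by projecting the posed model points into the image and taking the extent of the resulting silhouette, as in the following sketch (the intrinsics $K$ and the image size are assumed inputs):

```python
import numpy as np

def project_bounding_box(model_points, T_final, K, width, height):
    """Bounding box of model M_i in image I_j under T_final = T_r T_{C_j} T_{M_i}.

    model_points: Nx3 array of model points
    T_final:      4x4 final transformation
    K:            3x3 camera intrinsics
    """
    # Transform the points into the camera frame of I_j.
    pts = T_final[:3, :3] @ model_points.T + T_final[:3, 3:4]
    pts = pts[:, pts[2] > 0]          # keep points in front of the camera
    # Pinhole projection to pixel coordinates.
    uv = K @ pts
    uv = uv[:2] / uv[2:3]
    u = np.clip(uv[0], 0, width - 1)
    v = np.clip(uv[1], 0, height - 1)
    return int(u.min()), int(v.min()), int(u.max()), int(v.max())
```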
4 EXPERIMENTS
4.1 Annotation Cost
This section discusses the annotation cost of the proposed annotation software. One important difference from the previous method (Akizuki and Hashimoto, 2019) is Step 1, the 6DoF annotation shown in Figure 4(a). The previous method incrementally transformed the object model via keyboard and mouse operations, whereas our method only requires a few corresponding points to be picked from the object model and the input scene. We confirmed that the annotation time for this process improved from 15–20 min to 8 min.
4.2 Object Recognition Task: Setting
We benchmarked our dataset using modern object recognition algorithms. Three tasks were tested: affordance segmentation, object class segmentation, and object detection. The dataset was split into a training set and a testing set: of the 100 layouts, 80 were used for training and 20 for testing, yielding 43,831 and 11,036 images, respectively.
4.3 Task 1: Affordance Segmentation
In this experiment, we employed the following segmentation models for comparison:
1. Fully convolutional networks (FCN-8s) (Long et al., 2015)
2. SegNet (Badrinarayanan et al., 2017)
3. Full-resolution residual networks (FRRN-A) (Pohlen et al., 2017)
We trained each model for 150k iterations with a batch size of 1. To evaluate recognition performance, we used intersection over union (IoU), which is widely used as an index for segmentation tasks. The IoU scores of each method are shown in Table 4, and Figure 5 shows the results of affordance segmentation.
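For reference, per-class IoU can be computed from a pixel-level confusion matrix as in the short sketch below (a generic formulation, not tied to a particular framework):

```python
import numpy as np

def per_class_iou(confusion):
    """confusion[i, j]: number of pixels of true class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1.0)  # IoU = TP / (TP + FP + FN)

# The mean IoU reported per method is the average of these per-class scores.
```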
FRRN-A achieved a higher mean IoU score than the other models. The IoU scores of affordance classes that occur frequently in Figure 3(b) were relatively higher than those of the other classes; this tendency was common to all three network models.
4.4 Task 2: Object Class Segmentation
In this experiment, we used the same algorithms as in Task 1. The IoU scores of each method are shown in Table 5. The classes mug and shovel are composed of the affordances contain, grasp, and wrap-grasp, whose IoU scores were relatively high in the Task 1 experiment. Therefore, the recognition performance of the two kinds of labels appears to be correlated.
4.5 Task 3: Object Detection
In this experiment, we evaluated object-detection performance by comparing mean average precision (mAP) scores for each object class, using SSD (Liu et al., 2016). The mAP scores of each object class are shown in Table 6.
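As a reminder of how the metric is computed, the average precision for a single class can be obtained from scored detections as sketched below (a generic, non-interpolated formulation; the matching protocol, e.g. the IoU threshold, is an assumption and may differ from the exact evaluation used here):

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP for one class; mAP is the mean of AP over all object classes.

    scores:           confidences of the detections for this class
    is_true_positive: 1 if the detection matches an unmatched ground-truth box
    num_gt:           number of ground-truth boxes of this class
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # Area under the precision-recall curve (step-wise, no interpolation).
    return float(np.sum(precision * np.diff(np.concatenate(([0.0], recall)))))
```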