segmentation that we have created, called the NEDO Part Affordance Dataset v1. In section 4, we describe the low-cost ground-truth annotation procedure based on 6DoF pose estimation, which we propose for creating this dataset. In section 5, we describe benchmark tests of de-facto standard semantic segmentation methods on the NEDO Part Affordance Dataset v1, and in section 6, we summarize the results of this research.
2 RELATED WORK
2.1 Affordance Estimation
Various earlier methods have been proposed for estimating the affordances of objects appearing in a scene. One approach is to model human poses associated with an affordance and to evaluate their consistency with real scenes. For example, one procedure for handling the sittable affordance is to prepare a human model in sitting poses and then compare it with input scenes (Grabner et al., 2011). This approach is also closely related to robotic grasping; a method has been proposed for estimating graspability by comparing a hand model's state with measurement data (Domae et al., 2014).
Recently, the main approach has been to learn the correspondence between affordances and local features extracted from the input scene, so that multiple affordances can be identified. In other words, affordance estimation is treated as a multi-class classification problem. Myers et al. proposed a method that estimates seven affordances (Grasp, Cut, Scoop, Contain, Pound, Support, Wrap-grasp) for everyday objects at the pixel level. The method learns from hand-crafted local features such as depth, color, and curvature, which are extracted from depth images of the everyday objects (Myers et al., 2015).
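To make this pixel-wise multi-class formulation concrete, the following is a minimal sketch that classifies every pixel from hand-crafted features. It is an illustrative simplification, not the pipeline of Myers et al.: the placeholder feature arrays and the choice of a random-forest classifier are assumptions.

```python
# Minimal sketch of pixel-wise multi-class affordance classification from
# hand-crafted features (depth, color, curvature). Illustrative only; the
# placeholder data and classifier choice are assumptions, not Myers et al.'s method.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

H, W = 120, 160
depth = np.random.rand(H, W)               # depth map (placeholder data)
color = np.random.rand(H, W, 3)            # RGB image (placeholder data)
curvature = np.random.rand(H, W)           # per-pixel curvature (placeholder data)
labels = np.random.randint(0, 7, (H, W))   # 7 affordance classes (placeholder)

# Stack the hand-crafted features into one 5-dimensional vector per pixel.
features = np.concatenate(
    [depth[..., None], color, curvature[..., None]], axis=-1
).reshape(-1, 5)

# Train a multi-class classifier on the per-pixel feature vectors.
clf = RandomForestClassifier(n_estimators=50)
clf.fit(features, labels.reshape(-1))

# Predict an affordance label for every pixel and reshape back to image form.
pred = clf.predict(features).reshape(H, W)
```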
Prompted by efforts such as SegNet (Badrinarayanan et al., 2017), which have successfully applied Deep Learning (DL) to semantic segmentation, there has also been active research applying DL to affordance estimation. Nguyen et al. proposed a method that estimates the affordances defined by Myers et al. using an encoder-decoder network, and demonstrated its superiority over hand-crafted features (Nguyen et al., 2016). Another proposed method performs object detection as a prior step and then uses a network to estimate the object class and affordances for each detected object region (Do et al., 2018). Other methods implement affordance segmentation by taking RGB images as input and using networks to estimate mid-level features, depth information, normal vectors, and object classes (Roy and Todorovic, 2016).
2.2 Semantic Segmentation
Semantic segmentation is the task of estimating the
class to which each pixel of an input image belongs.
Fully Convolutional Networks (Long et al., 2015) implement the semantic segmentation task by replacing the fully-connected layers of classification networks such as VGG (Simonyan and Zisserman, 2014) and AlexNet (Krizhevsky et al., 2012) with convolutional layers that output 2D heatmaps. SegNet (Badrinarayanan et al., 2017) is an encoder-decoder network whose decoder gradually restores resolution, enabling accurate per-pixel discrimination. U-Net reflects fine shape details of the input image in the segmentation result by concatenating the feature maps of each encoder stage with the corresponding decoder feature maps (Ronneberger et al., 2015).
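As a rough illustration of these encoder-decoder designs, the following is a minimal sketch of a tiny network with one U-Net-style skip connection and an FCN-style per-pixel output head. It is a toy example with assumed layer sizes, not the actual SegNet or U-Net architectures.

```python
# Minimal encoder-decoder sketch with a U-Net-style skip connection.
# Illustrative only; layer sizes are assumptions, not SegNet or U-Net.
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)   # restore resolution
        # The decoder sees upsampled features concatenated with encoder features,
        # so fine shape detail from the encoder is preserved (the U-Net idea).
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)              # per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                       # full-resolution encoder features
        b = self.bottleneck(self.pool(e))     # downsampled bottleneck features
        d = self.up(b)                        # upsample back to input resolution
        d = self.dec(torch.cat([d, e], dim=1))  # skip connection by concatenation
        return self.head(d)                   # 2D heatmap per class (FCN-style output)

scores = TinySegNet(n_classes=10)(torch.randn(1, 3, 64, 64))  # -> (1, 10, 64, 64)
```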
3 PROPOSED DATASET
3.1 Dataset Overview
We now describe the NEDO Part Affordance Dataset v1 proposed for affordance segmentation in this research. The dataset is composed of the following data.
• 3D models created from measurements of real
everyday objects with point-wise annotation.
• Multiple RGB-D images of these everyday ob-
jects.
• Ground-truth images with pixel-wise annotation,
corresponding to the RGB-D images.
Examples of data in the NEDO Part Affordance Dataset v1 are shown in Figure 1. Examples of object models are shown in Figure 1(a). Object models are composed of meshes with an affordance label on each vertex. The colors on the 3D models indicate the affordance labels defined in Table 2, which comprise the labels defined by Myers et al. plus the additional labels Stick, None, and Background. All objects in the dataset were measured using a precise 3D sensor and modeled at actual size. A total of 74 3D models of 10 types of kitchen and DIY tools were measured. A breakdown of the objects is given in Table 1.
Figure 1(b) gives examples of input images and ground truth. A total of 10125 images were prepared, which were divided into a training set of 8706 images and a test set of 1419 images.
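As a brief illustration of how such pixel-wise ground truth might be consumed, the following sketch converts a ground-truth annotation image into per-pixel class indices. The file name and the color-to-label table here are hypothetical; the dataset's actual label colors are those defined in Table 2.

```python
# Hypothetical sketch of reading one ground-truth annotation image and
# converting its label colors into per-pixel affordance class indices.
# The file name and color-to-label table are assumptions for illustration.
import numpy as np
from PIL import Image

# Assumed mapping from annotation color to affordance class index.
COLOR_TO_CLASS = {
    (0, 0, 0): 0,      # Background (assumed color)
    (255, 0, 0): 1,    # Grasp (assumed color)
    (0, 255, 0): 2,    # Cut (assumed color)
    # ... remaining affordance labels from Table 2
}

def load_label_image(path):
    """Convert an RGB ground-truth image into an (H, W) array of class ids."""
    rgb = np.array(Image.open(path).convert("RGB"))
    labels = np.zeros(rgb.shape[:2], dtype=np.int64)
    for color, cls in COLOR_TO_CLASS.items():
        labels[np.all(rgb == color, axis=-1)] = cls
    return labels

# Example usage with a hypothetical file name:
# gt = load_label_image("ground_truth/knife_01_view_003.png")
```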