and the Locally Convex Connected Patches (LCCP)
segmenter (Stein et al., 2014) available in Point Cloud
Library (PCL; Rusu and Cousins, 2011).
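For illustration, the following is a minimal C++ sketch of how the LCCP segmenter can be invoked through PCL's supervoxel pipeline (assuming PCL 1.8 or later; the input file name and the parameter values are placeholders, not the settings used in our experiments):

#include <cstdint>
#include <map>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
#include <pcl/segmentation/supervoxel_clustering.h>
#include <pcl/segmentation/lccp_segmentation.h>

using PointT = pcl::PointXYZRGBA;

int main()
{
  pcl::PointCloud<PointT>::Ptr cloud(new pcl::PointCloud<PointT>);
  pcl::io::loadPCDFile<PointT>("scene.pcd", *cloud);  // placeholder input file

  // Over-segment the cloud into supervoxels; resolutions are in meters
  // and are illustrative values only.
  pcl::SupervoxelClustering<PointT> super(0.0075f, 0.03f);
  super.setInputCloud(cloud);
  std::map<std::uint32_t, pcl::Supervoxel<PointT>::Ptr> clusters;
  super.extract(clusters);
  std::multimap<std::uint32_t, std::uint32_t> adjacency;
  super.getSupervoxelAdjacency(adjacency);

  // Merge supervoxels across locally convex connections (LCCP).
  pcl::LCCPSegmentation<PointT> lccp;
  lccp.setConcavityToleranceThreshold(10.0f);  // degrees; placeholder value
  lccp.setInputSupervoxels(clusters, adjacency);
  lccp.segment();

  // Relabel the supervoxel cloud with the final segment labels.
  pcl::PointCloud<pcl::PointXYZL>::Ptr labeled = super.getLabeledCloud();
  lccp.relabelCloud(*labeled);
  return 0;
}

The concavity tolerance threshold (in degrees) determines which supervoxel connections are treated as convex and hence merged into a single segment.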
Section 2 puts our contribution in context by re-
viewing existing RGB-D datasets. Section 3 describes
in detail our proposed dataset and some of its char-
acteristics. In Section 4 we briefly discuss how the three state-of-the-art segmentation methods function, and present the experiments and results on the toy-dataset. Section 5 analyzes the results, and Section 6 concludes the paper.
2 RELATED WORK
Several datasets exist for benchmarking segmentation algorithms. Most of these previous datasets have concentrated on relatively simple objects composed of primitive shapes, such as cuboids, spheres, cylinders, and combinations of these.
Among the most popular RGB-D datasets are the
Willow Garage dataset and the OSD (Richtsfeld et al.,
2012), both created with a Kinect v1. The Willow
Garage dataset contains 176 images of household objects with little or no occlusion, as well as pixel-based ground truth annotations. The OSD consists of the same type of household objects as the Willow Garage dataset, and contains a total of 111 images of stacked and occluding objects on a table, along with their pixel-based annotated ground truth images. Both of the aforementioned datasets include roughly 20-30 objects with relatively simple cylindrical and cuboidal shapes and diverse textures.
Two popular datasets, the RGB-D Object Dataset
(Lai et al., 2011) and BigBIRD (Singh et al., 2014),
utilize a turntable to obtain RGB-D images of com-
mon household items. In addition, both datasets have
been recorded using two cameras, an RGB-D camera
and a higher resolution RGB camera. The data is gen-
erated by capturing multiple synchronized images of
an object while it spins on the turntable for one revo-
lution.
The RGB-D Object Dataset is one of the most ex-
tensive RGB-D datasets available. It comprises 300
common household objects and 22 annotated video
sequences of natural scenes. These natural scenes in-
clude common indoor environments, such as office
workspaces and kitchen areas, as well as objects from
the dataset. The dataset was recorded with a proto-
type RGB-D camera manufactured by PrimeSense to-
gether with a higher resolution Point Grey Research
Grasshopper RGB camera. Each object was recorded
with the cameras mounted at three different heights to obtain views of the objects from different angles.
The BigBIRD dataset provides high quality RGB-
D images along with pose information, segmenta-
tion masks and reconstructed meshes for each ob-
ject. Each one of the dataset’s 125 objects has been
recorded using PrimeSense Carmine 1.08 depth sen-
sors and high resolution Canon Rebel T3 cameras.
The objects in the RGB-D Object Dataset and Big-
BIRD dataset are, however, mainly similar to the ge-
ometrically simple objects in the Willow Garage and
OSD datasets. In addition, apart from the video se-
quences of the RGB-D Object Dataset, the images do
not contain occlusion.
The datasets provided by Hinterstoisser et al.
(Hinterstoisser et al., 2012) and Mian et al. (Mian
et al., 2006; Mian et al., 2010) contain more compli-
cated objects than the previously mentioned datasets.
The dataset of Hinterstoisser et al. consists of 15
different texture-less objects on a heavily cluttered
background and with some occlusion. The dataset
includes video sequences of the scenes, and a total
of 18,000 Kinect v1 images along with ground truth
poses for the objects. As this dataset is aimed more at object recognition and pose estimation, it does not include pixel-based annotations of the objects. The dataset proposed by Mian et al. comprises five
complicated toy-like objects with occlusion and clut-
ter. The 50 depth-only images have been created us-
ing a Minolta Vivid 910 scanner to get a 2.5D view of
the scene. The dataset also includes pose information
for the objects and pixel-based annotated ground truth
images.
As opposed to the completely texture-less and uniformly colored objects in the dataset by Hinterstoisser et al.,
the objects in our dataset retain some texture and
many are also multicolored. Also, the dataset by Mian
et al. contains only depth data, and is considerably
smaller than our proposed dataset.
Multiple RGB-D datasets have been gathered
from the real world as well. For instance, the video
sequences of the RGB-D Object Dataset, Cornell-
RGBD-Dataset (Anand et al., 2011; Koppula et al.,
2011) and NYU Dataset v1 (Silberman and Fergus,
2011) and v2 (Silberman et al., 2012) contain altogether hundreds of indoor scene video sequences; the latter three datasets were captured with a Kinect v1. These sequences are recorded in typi-
cal home and office scenes, such as bedrooms, liv-
ing rooms, kitchens and office spaces. The images
are highly cluttered; all the scenes in the Cornell-RGBD-Dataset are labeled, and a subset of the images in the NYU Datasets contains labeled ground truth. However, these scenes generally involve larger objects, such as tables, chairs, desks and sofas, which are not the kind of objects a lightweight robot typi-