that small objects, on the order of 1-5 cm, appear very small, making it difficult for present algorithms to recognize them from far away. A better approach is to guess from far and recognize from near. It is difficult to recognize a single object from a distance, but it is easy to recognize a group of objects placed together. In most scenarios, small objects are placed on support structures. When a support structure is not visible, it is mostly because it is occluded by indistinguishable objects, which we define as clutter. Clutter can thus act as a cue for adding small objects to the search space. These image regions give a strong prior for object search by a robot in an indoor environment. The primary problem is the computation time that previous approaches take: a vision-based autonomous robot needs to tackle the problem of object search efficiently, in the shortest possible time.
Our proposed method demonstrates a simple yet efficient strategy for object class segmentation, exploiting the rich geometric information in the 3D point cloud. In summary, our main contributions are:
• We propose a method for segmenting clutter and support structures from RGBD data using a dense CRF formulation over appearance, with SVM features extracted from the geometry of the scene (a pipeline sketch follows this list).
• We use clutter as a cue for recognizing areas where objects could be present even when the support structure is completely absent from a scene. This absence is mainly due to two reasons: 1) the support structure being occluded, and 2) unreliable depth beyond 3 m from the Kinect. Clutter also indicates the presence of an assortment of objects, which can be included in the search space.
• Our quantitative results on the NYU and LAB datasets show that, unlike ALE, our model is reliable across datasets without being trained on each of them. We show considerable improvements over ALE on both datasets.
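As a rough illustration of how such a pipeline can be assembled, the sketch below trains an SVM on per-pixel geometric features and uses its class probabilities as unary potentials for a dense CRF over appearance. This is a minimal sketch under our own assumptions, not the exact implementation: the geometric feature choices, the label set, and the use of the pydensecrf bindings are all illustrative.

```python
import numpy as np
from sklearn.svm import SVC
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

N_LABELS = 3  # hypothetical label set: clutter, support structure, background


def train_geometry_svm(geom_feats, labels):
    """Train an SVM on per-pixel geometric features.

    geom_feats: (N, D) array of features such as surface-normal angle and
    height above the floor (hypothetical choices); labels: (N,) ints.
    """
    svm = SVC(probability=True)  # class probabilities will feed the CRF unaries
    svm.fit(geom_feats, labels)
    return svm


def segment(rgb, geom_feats, svm, n_iters=5):
    """Dense CRF over appearance with SVM-derived unary potentials.

    rgb: (h, w, 3) uint8 image; geom_feats: (h, w, D) geometric features.
    """
    h, w, _ = rgb.shape
    # Per-pixel class probabilities from geometry -> (N_LABELS, h, w)
    probs = svm.predict_proba(geom_feats.reshape(h * w, -1))
    probs = np.ascontiguousarray(probs.T.reshape(N_LABELS, h, w))

    crf = dcrf.DenseCRF2D(w, h, N_LABELS)
    crf.setUnaryEnergy(unary_from_softmax(probs))
    # Smoothness kernel over pixel positions only
    crf.addPairwiseGaussian(sxy=3, compat=3)
    # Appearance kernel over position and colour (fully connected)
    crf.addPairwiseBilateral(sxy=60, srgb=13,
                             rgbim=np.ascontiguousarray(rgb), compat=10)
    q = crf.inference(n_iters)
    return np.argmax(q, axis=0).reshape(h, w)
```

The kernel widths and compatibility weights above are placeholders; in practice these parameters would be tuned or learned on the training data.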
2 RELATED WORK
Scene labeling, aiming to densely label everything in a scene, has been extensively studied in computer vision. Methods based on a single color image have been extremely successful, especially in outdoor scenes (Shotton et al., 2009), (Gould et al., 2009), (Ladicky et al., 2010), (Zheng et al., 2015). (Shotton et al., 2009) proposed a segmentation method incorporating boosting-based unary classifiers into a conditional random field (CRF). (Ladicky et al., 2010) showed that global potentials such as co-occurrence statistics can be defined over all variables in the CRF to obtain a significant improvement in accuracy. More recently, methods combining CRFs with convolutional neural networks (CNNs) have been shown to obtain effective results (Zheng et al., 2015). But purely image-based approaches do not perform as well in the harder case of indoor scenes (Quattoni and Torralba, 2009). They reveal little about the physical relationships between objects, the possible actions that can be performed, or the geometric structure of the scene.
The work by Silberman and Fergus (Silberman and Fergus, 2011) was one of the most extensively tested methods demonstrating that incorporating depth data gives a significant performance gain over methods limited to intensity information for the task of indoor scene labeling. A large variety of other RGB-D based segmentation works have been proposed (Ren et al., 2012), (Reza and Kosecka, 2014), (Koppula et al., 2011), (Gupta et al., 2015), (Kim et al., 2013). The work by (Reza and Kosecka, 2014) combines an AdaBoost classifier on combined RGB-D features with a CRF framework to obtain binary segmentation (particular object vs. background). Ren et al. (Ren et al., 2012) use kernel features and solve a standard MRF over superpixels. Gupta et al. (Gupta et al., 2015) propose a framework that exploits depth data in the related tasks of contour detection, object detection, and semantic segmentation. In contrast to these approaches, which consider a large number of scene labels, our approach focuses only on predicting the areas where objects are more likely to be present (clutter) or where new objects could be placed (support structure). This allows us to avoid using a large number of complicated features.
In one such work by Koppula et al. (Koppula et al., 2011), the point clouds obtained from the Kinect sensor are merged using RGBDSLAM and segmented using Euclidean clustering. These segments form the underlying basic structures of an MRF and are labeled with different categories. 3D geometry has also been exploited in the voxel-based approach proposed by Kim et al. (Kim et al., 2013). Although these approaches have shown impressive results, the algorithm can take up to 18 minutes to run on a single stitched point cloud (Koppula et al., 2011), which is unacceptable in most robotic tasks.
In this paper, we extend the framework by Krähenbühl and Koltun (Krähenbühl and Koltun, 2012) to incorporate depth information.
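For reference, the fully connected model of (Krähenbühl and Koltun, 2012) defines a Gibbs energy over all pixel pairs,

$$E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j),$$

with pairwise potentials given by a contrast-sensitive mixture of Gaussian kernels,

$$\psi_p(x_i, x_j) = \mu(x_i, x_j)\Big[\, w^{(1)} \exp\Big(-\tfrac{\|p_i - p_j\|^2}{2\theta_\alpha^2} - \tfrac{\|I_i - I_j\|^2}{2\theta_\beta^2}\Big) + w^{(2)} \exp\Big(-\tfrac{\|p_i - p_j\|^2}{2\theta_\gamma^2}\Big)\Big],$$

where $p_i$ and $I_i$ are the position and color of pixel $i$ and $\mu$ is a label compatibility function. One natural way to fold in depth, stated here only as an illustrative assumption, is to augment the appearance kernel with a term $-\|d_i - d_j\|^2 / 2\theta_\delta^2$ inside the first exponential, where $d_i$ is the depth (or 3D point) at pixel $i$.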
Previous methods have used basic CRF or MRF models composed of unary potentials on individual pixels or superpixels and pairwise potentials on neighboring pixels or superpixels. It has been shown in the past (Toy-