3D Region Proposals For Selective Object Search
Sheetal Reddy, Vineet Gandhi and Madhava Krishna
International Institute of Information Technology, Hyderabad, India
sheetal.reddy@research.iiit.ac.in, vgandhi@iiit.ac.in, mkrishna@iiit.ac.in
Keywords:
RGB-D Scene Classification, RGB-D Semantic Segmentation, RGB-D Object Search.
Abstract:
The advent of indoor personal mobile robots has clearly demonstrated their utility in assisting humans at
various places such as workshops, offices and homes. One of the most important tasks in such autonomous
scenarios is searching for specific objects in large rooms. Exploring the whole room would be extremely
expensive in terms of both computing power and time. To address this issue, we demonstrate a fast algorithm
to reduce the search space by identifying possible object locations as two classes, namely support structures
and clutter. Support structures are plausible object containers in a scene, such as tables, chairs and sofas.
Clutter refers to places where several objects appear to be present but cannot be clearly distinguished; it can
also be seen as an unorganized region of interest for tasks such as robot grasping, fetching and placing objects.
The primary contribution of this paper is to quickly identify potential object locations using a Support Vector
Machine (SVM) learnt over features extracted from the depth map and the RGB image of the scene, which
further culminates in a densely connected Conditional Random Field (CRF) formulated over the image of the
scene. Inference over the CRF assigns one of the labels support structure, clutter or other to each pixel. The
method produces reliable results even in challenging scenarios, such as the support structures being far from
the robot. The experiments demonstrate the efficacy and speed of the algorithm irrespective of changes in
camera angle, appearance, lighting and distance from the locations.
1 INTRODUCTION
The ability to locate a specific object in an indoor en-
vironment is a fundamental problem in creating fully
autonomous mobile robotic systems. This requires
the robot to 1) locate the object in the exploration environment, 2) plan a path to reach the object, and
3) perform the desired operation on the object, such as servoing to a desired pose or grasping.
In indoor scenes, objects are likely to be placed
over raised flat surfaces like tables, which we call sup-
port structures. Moreover, the objects are often sur-
rounded by several other related articles, which can
be termed as clutter. The aim of this work is to locate
all support structures and cluttered areas in a given
scene. More formally, given a depth and RGB im-
age pair, the proposed method classifies each pixel
into one of the three categories i.e. clutter, support
structure or other. An example output of the proposed
approach is shown in Figure 1, where all the objects
(keyboard, mouse, computer screen etc.) are marked
as clutter and the rest of the table is marked as support
structure. The robot can now move close to the ar-
eas marked as clutter to search for the desired object.
Figure 1: Sample results of the proposed method. (a) Input RGB image taken from the LAB dataset.
(b) Input depth image from the Kinect. (c) Ground truth labelling for object search in indoor
environments. (d) Results obtained with our method.
Furthermore, the obtained result can also be useful for the problem of finding likely locations for
placing an object (connected support structure pixels are candidate positions). The motivation behind
addressing this as a three-label problem is to use the labels as a prior for object search. Our work is
inspired by the idea
that small objects, of the order of 1 cm to 5 cm, appear very small from a distance, making it difficult
for present algorithms to recognize them from far away. A better approach is to guess from far and
recognize from near. It is difficult to recognize a single object from far, but it is easy to recognize a
group of objects placed together. In most scenarios, small objects are placed on support structures. If
a support structure is not visible, it is mostly because it is occluded by non-distinguishable objects,
which we define as clutter. Clutter can therefore act as a clue for adding the small object to the search
space. These image regions give a strong prior for object search for a robot in an indoor environment.
The primary problem with previous approaches is their computation time: a vision-based autonomous
robot needs to tackle the problem of object search efficiently in the shortest possible time. Our proposed
method demonstrates a simple yet efficient strategy for object class segmentation exploiting the rich
geometric information from the 3D point cloud. In summary, our main contributions are:
• We propose a method for segmenting clutter and support structures from RGB-D data using a dense
CRF formulation over appearance cues and SVM features extracted from the geometry of the scene.
• We use clutter as a clue for recognizing areas where objects could be present even when the support
structure is completely absent from a scene, mainly because it is either occluded or beyond the
reliable depth range of the Kinect (about 3 m). Clutter also indicates the presence of an assortment
of objects, which can be included in the search space.
• Our quantitative results on the NYU and LAB datasets show that our model is reliable across
datasets without training on every dataset, unlike ALE. We show considerable improvements over
ALE on both datasets.
2 RELATED WORK
Scene labeling, aiming to densely label everything in
a scene, has been extensively studied in computer vi-
sion. Single color image based methods have been
extremely successful, especially in outdoor scenes
(Shotton et al., 2009),(Gould et al., 2009),(Ladicky
et al., 2010), (Zheng et al., 2015). (Shotton et al., 2009) proposed a segmentation method incorporating
boosting-based unary classifiers into a conditional random field (CRF). (Ladicky et al., 2010) showed
that global potentials such as co-occurrence statistics can be defined over all variables in the CRF to
obtain a significant improvement in accuracy. More recently, methods combining CRFs with convolutional
neural networks (CNNs) have been shown to obtain effective results (Zheng et al., 2015). However, purely
image based approaches do not perform equally well in the harder case of indoor scenes (Quattoni and
Torralba, 2009): color images alone tell only a little about the physical relationships between objects,
the possible actions that can be performed, or the geometric structure of the scene.
The work by Silberman and Fergus (Silberman and Fergus, 2011) was one of the most extensively tested
methods demonstrating that incorporating depth data gives a significant performance gain over methods
limited to intensity information for the task of indoor scene labeling. A large variety of other RGB-D
based segmentation methods have been proposed (Ren et al., 2012), (Reza and Kosecka, 2014), (Koppula
et al., 2011), (Gupta et al., 2015), (Kim et al., 2013). The work by (Reza and Kosecka, 2014) combines
an AdaBoost classifier on combined RGB-D features with a CRF framework to obtain a binary segmentation
(particular object vs background). Ren et al. (Ren et al., 2012) use kernel features and solve a standard
MRF over superpixels. Gupta et al. (Gupta et al., 2015) propose a framework to exploit depth data in
multiple related tasks, from contour detection and object detection to semantic segmentation. In contrast
to these approaches, where a large number of scene labels are considered, our approach focuses only on
predicting the areas where objects are more likely to be present (clutter) or where a new object could be
placed (support structure). This allows us to avoid using a large number of complicated features.
In one such work by Koppula et al. (Koppula et al., 2011), the point clouds obtained from the Kinect
sensor are merged together using RGBD-SLAM and segmented using Euclidean clustering. These segments are
the underlying basic structures for an MRF and are labelled with different categories. The idea of
exploiting 3D geometry also appears in the voxel based approach proposed by Kim et al. (Kim et al., 2013).
Although these approaches have shown impressive results, the algorithm can take up to 18 minutes to run
on a single stitched point cloud (Koppula et al., 2011), which is unacceptable for most robotic tasks.
In this paper, we extend the framework by Krähenbühl and Koltun (Krähenbühl and Koltun, 2012) to
incorporate depth information. Previous methods have used basic CRF or MRF models composed of unary
potentials on individual pixels or superpixels and pairwise potentials on neighbouring
pixels or superpixels. It has been shown in the past (Toyoda and Hasegawa, 2008) that fully connected
CRFs can improve the accuracy of semantic labelling over standard CRFs. (Hermans et al., 2014) and (Wolf
et al., 2015) use a segmentation pipeline similar to ours, but unlike us they train a Random Forest
classifier on appearance features and do not use height or normal pairwise kernels in the dense CRF.
Figure 2: The flow chart shows the stages of our system pipeline. The input RGB-D data consists of the
RGB image and the depth image from the Kinect sensor. The input is preprocessed to obtain superpixels
using the SLIC algorithm and the 3D point cloud. From this preprocessed information we extract features:
the entropy, the point cloud normals and the height image. The extracted features are fed to an RBF-kernel
classifier, which outputs per-class probabilities that serve as input to the CRF. Finally, a densely
connected CRF with multiple pairwise terms is built and a mean-field based inference algorithm is run to
segment the scene into support structure and clutter. (Figure best viewed in color and enlarged.)
3 SYSTEM OVERVIEW
The motivation behind our model is to speed up the scene labelling process for a selective search rather
than resorting to an exhaustive search. The system architecture we follow is explained in Figure 2. The
inputs to the model are the RGB image and the depth image, captured by a Microsoft Kinect sensor, of an
indoor scene containing multiple support structures and grouped objects. The RGB image is first
over-segmented into superpixels using the SLIC algorithm (Achanta et al., 2012) and the corresponding
depth data for each segment is extracted. We use the RGB and depth data to compute the normals of each
segment. We train a Support Vector Machine (SVM) over these features for support structure and clutter
detection. The SVM probabilities are taken as the initialization of the fully connected CRF model, which
is inferred using the mean-field approximation.
The structure of the paper is as follows. Section 4 gives the formulation for detecting support
structures from the input RGB and depth images. Section 5 uses the probabilities estimated in Section 4
to formulate a CRF. The evaluation of the algorithm on the datasets is explained in Section 6.
4 OBJECT CLASS DETECTION
This section explains the computation of the features for object class segmentation and their
classification using a kernel SVM. We first preprocess the input RGB-D image to reduce the computational
time of the algorithm.
4.0.1 Superpixels
In our approach, rather than performing classification on every pixel, we consider small regions or
patches called superpixels as the basic units of classification to speed up the process. We compute
superpixels over the image using the Simple Linear Iterative Clustering (SLIC) method (Achanta et al.,
2012). Over-segmenting the image allows us to work with a few hundred data points per image rather than
with 640×480 pixels per image. Figure 3 shows an example of a superpixelated image.
Figure 3: A sample scenario. (a) RGB image. (b) SLIC
superpixelled image.
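For concreteness, the following is a minimal sketch of this preprocessing step using the SLIC
implementation in scikit-image; the library choice, file names and parameter values are illustrative
assumptions and not necessarily the authors' exact setup.

```python
# Sketch of the superpixel preprocessing step using scikit-image's SLIC implementation
# (library choice, file names and parameters are assumptions for illustration).
import numpy as np
from skimage.io import imread
from skimage.segmentation import slic

rgb = imread("scene.png")             # 640x480 RGB image from the Kinect (hypothetical file)
depth = np.load("scene_depth.npy")    # aligned depth map of the same resolution (hypothetical file)

# Over-segment the RGB image into a few hundred superpixels.
labels = slic(rgb, n_segments=400, compactness=10)

# Gather the depth values falling inside each superpixel for later feature computation.
depth_per_segment = {s: depth[labels == s] for s in np.unique(labels)}
print(labels.shape, len(depth_per_segment))
```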
4.1 Feature Computation
The 3D point cloud is extracted from the depth map and the RGB image of the Kinect sensor. We use the PCL
library (Rusu and Cousins, 2011) for the computation of the point cloud. We compute 3D features which
capture the geometry, shape and texture of the support structures on which objects can be placed. To
support an object, a surface should be horizontal and ideally parallel to the ground, a constraint we
exploit. We have not used appearance features, to avoid sensitivity to color parameters. The features we
use are listed in Table 2. We deliberately keep the feature set small to keep the algorithm fast.
4.1.1 Entropy Map
Entropy is a statistical measure of randomness that
can be used to characterize the texture of an image. It
is defined as
E = −Σ_{p=k_1}^{k_n} p log_2(p)    (1)

where k_1, k_2, ..., k_n are the histogram counts. We take a
9×9 neighbourhood around each pixel and compute the histogram counts for each window. We compute the
entropy values at every pixel in each of the three channels R, G, B and also on the depth image. The
entropy map gives high values at inconsistent depth changes. We then compute the average entropy value
for each superpixel.
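A sketch of how such an entropy feature could be computed is given below, assuming a 16-bin histogram per
9×9 window and SciPy's generic_filter for the sliding window; these choices are illustrative rather than
the authors' exact implementation.

```python
# Sketch of the entropy feature: local 9x9 histogram entropy per pixel, averaged per superpixel.
# The bin count and the use of scipy.ndimage.generic_filter are assumptions.
import numpy as np
from scipy.ndimage import generic_filter

def window_entropy(values, bins=16):
    # values: flattened 9x9 window; normalize histogram counts to probabilities
    hist, _ = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def entropy_map(channel):
    # per-pixel entropy over a 9x9 neighbourhood
    return generic_filter(channel.astype(np.float64), window_entropy, size=9)

def superpixel_entropy(channel, labels):
    # mean entropy value inside each superpixel
    emap = entropy_map(channel)
    n = labels.max() + 1
    sums = np.bincount(labels.ravel(), weights=emap.ravel(), minlength=n)
    counts = np.bincount(labels.ravel(), minlength=n)
    return sums / counts
```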
4.1.2 Point Cloud Normals
For each superpixel, the surface normal is computed at its centroid. The normal at a point is computed by
reducing it to the problem of estimating the normal of a plane tangent to the surface, i.e. by fitting a
plane to the point's local neighbourhood.
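A minimal sketch of this tangent-plane normal estimation is shown below, using a PCA of the k nearest
neighbours of the centroid (the same idea underlying PCL's normal estimation); the neighbourhood size and
the camera-orientation convention are assumptions.

```python
# Sketch of normal estimation at a superpixel centroid: fit a tangent plane to the k nearest
# 3D points and take the eigenvector of the smallest covariance eigenvalue as the normal.
import numpy as np

def estimate_normal(points, centroid, k=50):
    # points: (N, 3) scene point cloud; centroid: (3,) centroid of the superpixel
    d = np.linalg.norm(points - centroid, axis=1)
    neighbours = points[np.argsort(d)[:k]]
    cov = np.cov(neighbours - neighbours.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    normal = eigvecs[:, 0]                 # eigenvector of the smallest eigenvalue
    # orient towards the camera (assumed at the origin)
    if np.dot(normal, -centroid) < 0:
        normal = -normal
    return normal
```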
4.1.3 Height
We consider the height (h_x) of the centroid as one of the 3D cues. Our locations of interest are flat
surfaces and grouped objects raised to a certain height. The height feature gives us a good demarcation
when segmenting the regions of interest. We can justify this choice by the intuition that objects cannot
be found close to the ceiling of an indoor scene and, similarly, support structures do not lie on the
ground.
4.2 Classification
Based on the above features, a support vector machine assigns a label and a score to every superpixel. We
use the SVM implementations from LibSVM and LibLinear (Chang and Lin, 2001). We tested our model with
both a linear and a kernel SVM; the RBF kernel performed better than the linear kernel, so we use LibSVM
(Chang and Lin, 2001) with the radial basis function kernel. The training set is small and imbalanced, as
the positive samples in an image are far fewer than the negative samples. Appropriate values of the cost
parameter C and the kernel width γ were found by a grid search on a cross-validation set. An SVM predicts
class labels without probability information; to incorporate the SVM output into a conditional random
field we follow the method given in (Wu et al., 2004), which extends the SVM to give probability
estimates. These probability estimates are given to the conditional random field as unary potentials, as
shown in Section 5.
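The following sketch illustrates this classification stage using scikit-learn's SVC, which wraps libsvm
and exposes the probability estimates of (Wu et al., 2004); the grid values, class weighting and file
names are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the superpixel classifier: RBF-kernel SVM with probability estimates and a grid
# search over C and gamma (parameter values and file names are assumptions).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X: (n_superpixels, 8) feature matrix (F1-F4 from Table 2), y: labels in {0, 1, 2}
X = np.load("train_features.npy")
y = np.load("train_labels.npy")

grid = GridSearchCV(
    SVC(kernel="rbf", probability=True, class_weight="balanced"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1, 1]},
    cv=5,
)
grid.fit(X, y)

# Per-class probabilities for the test superpixels, later used as CRF unary potentials.
X_test = np.load("test_features.npy")
probs = grid.best_estimator_.predict_proba(X_test)   # shape (n_superpixels, 3)
```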
5 CONDITIONAL RANDOM FIELD
We formulate the labelling problem as a conditional random field (CRF) in the image space. A CRF is a
graph-based model commonly used for segmentation problems. It is defined over a set of random variables
X = {X_1, X_2, ..., X_N}, each taking a state from the label space ζ = {l_1, l_2, l_3}. These random
variables correspond to the image pixels i ∈ ν = {1, 2, ..., N}. Let η be the neighbourhood system of the
random field defined by the sets η_i, i ∈ ν, where η_i denotes the neighbourhood of the variable X_i.
Here l_1 corresponds to the support structure, l_2 to the clutter, and l_3 to regions belonging to
neither the support structure nor the clutter. The energy takes the form
E(X) = Σ_{i∈ν} ψ_u(x_i) + Σ_{i∈ν, j∈η_i} ψ_p(x_i, x_j)    (2)
Here ψ_u is the unary potential; it represents whether a pixel belongs to the support structure, the
clutter or neither. ψ_p is the pairwise potential, which exploits the consistency of labels in the image
space.
5.1 Unary Potential
We compute the unary potentials using the probability estimates from Section 4.2. Given k classes of
data, the goal of the classification algorithm proposed by (Wu et al., 2004) is to estimate
p_i = P(y = i | x), i = 1, ..., k. The algorithm follows the one-against-one approach for multi-class
classification.
Table 1: Quantitative evaluation on the LAB dataset. The table shows the percentage of correctly
classified support structures. The columns present results with ALE, the fully connected CRF (FULLY-C)
and our proposed method (OURS).

ALE    FULLY-C    OURS
36%    46%        69%
We train the SVM for the three classes proposed earlier and feed the resulting probabilities to the CRF
unary potential as follows:

ψ_u(x_i) = p_i    (3)

where p_i represents the probability of each label. The unary potential thus gives the probability of the
pixel belonging to each class.
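A small sketch of how the per-superpixel probabilities can be expanded into per-pixel unary terms is given
below; the array names are hypothetical, and the negative-log form shown is the reparameterization
commonly expected by dense CRF implementations rather than the literal form of Equation 3.

```python
# Sketch: expand per-superpixel SVM probabilities into per-pixel unary terms.
# `labels` is the SLIC superpixel map, `probs` the (n_superpixels, 3) classifier output.
import numpy as np

def pixelwise_probabilities(labels, probs):
    # labels: (H, W) superpixel ids, probs: (n_superpixels, n_classes)
    return probs[labels]                     # (H, W, n_classes)

def unary_energy(labels, probs, eps=1e-8):
    # Dense-CRF libraries usually expect negative log-probabilities as the unary energy;
    # this is an assumed reparameterization of Equation 3, not the paper's exact form.
    p = pixelwise_probabilities(labels, probs)
    return -np.log(p + eps)                  # (H, W, n_classes)
```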
5.2 Pairwise Potential
The pairwise potential exploits the consistency of labels in the image space. We use multiple constraints
to exploit label continuity. The formulation of the pairwise potential is given in (Krähenbühl and
Koltun, 2012):
ψ_p(x_i, x_j) = µ(x_i, x_j) [ Σ_{m=1}^{K} w^(m) k^(m)(f_i, f_j) ]    (4)
where each k^(m) is a Gaussian kernel, the vectors f_i and f_j are feature vectors for pixels i and j in
an arbitrary feature space, the w^(m) are linear combination weights, and µ is a label compatibility
function. Since ours is a multi-class image segmentation and labelling problem, we follow (Krähenbühl and
Koltun, 2012) and use contrast-sensitive kernel potentials:
k(f_i, f_j) = w^(1) exp(−|p_i − p_j|²/(2θ_β²) − |I_i − I_j|²/(2θ_ν²))    (appearance kernel)
            + w^(2) exp(−|p_i − p_j|²/(2θ_p²))    (smoothness kernel)
            + w^(3) exp(−|p_i − p_j|²/(2θ_β²) − |h_i − h_j|²/(2θ_q²))    (height kernel)
            + w^(4) exp(−|p_i − p_j|²/(2θ_β²) − |n_i − n_j|²/(2θ_n²))    (normal kernel)

where p_i, p_j are the pixel positions, I_i, I_j are the intensity vectors, n_i, n_j are the surface
normals and h_i, h_j are the heights.
Table 2: Features used and their corresponding count.

No.  Feature set of the superpixel                                    Count
F1   Vertical position of the centroid c_z                            1
F2   Vertical, horizontal and z components of the normal: n_x, n_y, n_z   3
F3   Entropy on RGB (3 channels)                                      3
F4   Entropy on depth                                                 1
Table 3: Recall accuracy on the LAB dataset.

Method    Clutter    Support Structure    Others
ALE       43.28      36.71                98.63
FULLY-C   20.28      51.13                93.99
OURS      32.63      59.4                 96.5

Table 4: Intersection over union accuracy on the LAB dataset.

Method    Clutter    Support Structure    Others
ALE       30.2       29.99                94.19
FULLY-C   15.11      39.87                94.9
OURS      19.10      39.6                 93.99
w^(1), w^(2), w^(3) and w^(4) are the corresponding weights for each kernel. We have fine-tuned the CRF
parameters by empirical evaluation of qualitative results. The appearance kernel is inspired by the
observation that nearby pixels with similar appearance are likely to have the same class. The smoothness
kernel removes small isolated regions. For support structure and clutter, we additionally exploit height
constraints when segmenting the image: the height kernel encodes that pixels belonging to the same label
should have similar heights, and similarly the normal kernel encodes that pixels belonging to the same
label should have similar normal orientations.
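As an illustration of how such a multi-kernel dense CRF can be assembled, the sketch below uses the
pydensecrf package, passing the height map and normal map as extra bilateral channels alongside position
and intensity; the package choice, kernel widths and compatibility weights are assumptions and not the
authors' exact parameters.

```python
# Sketch of a dense CRF with appearance, smoothness, height and normal kernels, built with
# the pydensecrf package (assumed); kernel widths and weights are illustrative only.
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import (create_pairwise_bilateral, create_pairwise_gaussian,
                              unary_from_softmax)

def build_crf(rgb, height_map, normal_map, pixel_probs, n_labels=3):
    H, W, _ = rgb.shape
    d = dcrf.DenseCRF(H * W, n_labels)

    # Unary term: negative log of the per-pixel class probabilities, shape (H, W, n_labels).
    d.setUnaryEnergy(unary_from_softmax(pixel_probs.transpose(2, 0, 1)))

    # Smoothness kernel: pixel positions only.
    d.addPairwiseEnergy(create_pairwise_gaussian(sdims=(3, 3), shape=(H, W)), compat=3)

    # Appearance kernel: positions and RGB intensities.
    d.addPairwiseEnergy(create_pairwise_bilateral(sdims=(60, 60), schan=(13, 13, 13),
                                                  img=rgb, chdim=2), compat=5)

    # Height kernel: positions and per-pixel height, treated as a one-channel "image".
    d.addPairwiseEnergy(create_pairwise_bilateral(sdims=(60, 60), schan=(0.1,),
                                                  img=height_map[..., None], chdim=2), compat=5)

    # Normal kernel: positions and the three surface-normal components.
    d.addPairwiseEnergy(create_pairwise_bilateral(sdims=(60, 60), schan=(0.2, 0.2, 0.2),
                                                  img=normal_map, chdim=2), compat=5)
    return d

# Mean-field inference (Section 5.3): Q has shape (n_labels, H*W).
# d = build_crf(rgb, height_map, normal_map, pixel_probs)
# Q = np.array(d.inference(5))
# seg = Q.argmax(axis=0).reshape(rgb.shape[:2])
```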
5.3 Inference
We follow (Krähenbühl and Koltun, 2012), which uses a mean-field approximation for inference. In this
approach we find a mean-field approximation Q(x) that minimizes the KL-divergence D(Q||P) among all
distributions Q that can be expressed as a product of independent marginals, Q(x) = Π_i Q_i(x_i). The
marginal updates take the form
Q_i(x_i = l) = (1/Z_i) exp{ −ψ_u(x_i) − Σ_{l'∈ζ} Σ_{j≠i} Q_j(x_j = l') ψ_p(x_i, x_j) }
where Z_i is a constant that normalizes the marginal at pixel i. If the updates are made in sequence
across pixels, the KL-divergence is guaranteed to decrease.
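The update above can be read directly as code. The following deliberately naive sketch applies one
mean-field step over N variables with L labels; `pairwise(i, j)` is a placeholder for ψ_p evaluated for
all label pairs, and real dense CRF implementations replace the double loop with fast Gaussian filtering.

```python
# Illustrative (not efficient) mean-field update for the marginals Q, following the rule above.
# `unary` is (N, L) with psi_u values; `pairwise(i, j)` returns the (L, L) matrix psi_p(x_i, x_j).
import numpy as np

def mean_field_step(Q, unary, pairwise):
    N, L = Q.shape
    Q_new = np.empty_like(Q)
    for i in range(N):
        # message from all other variables, weighted by their current marginals
        msg = np.zeros(L)
        for j in range(N):
            if j != i:
                msg += Q[j] @ pairwise(i, j).T   # sum over l' of Q_j(l') * psi_p(l, l')
        logits = -unary[i] - msg
        Q_new[i] = np.exp(logits - logits.max())
        Q_new[i] /= Q_new[i].sum()               # Z_i normalization
    return Q_new
```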
Figure 4: Qualitative results: our method is able to detect clutter in cluttered indoor scenes where
support structures are not visible. The 2nd and 3rd columns show the ground-truth labelling and the
labelling from our method, respectively. (Figure best viewed in color and enlarged.)
6 EXPERIMENTAL RESULTS
6.1 Dataset
LAB Dataset. We consider labelling support structures and clutter in 3D scenes captured with a Kinect
sensor. Data has been collected from 5 labs with varying compositions of the three labels; we refer to
this as the LAB dataset. Each scene in the LAB dataset has an image resolution of 640×480 and contains
about 300,000 depth points. These scenes are challenging as they contain objects which cannot be grouped
or classified using existing computer vision algorithms. Each scene from the LAB dataset was manually
annotated with the 3 classes, as shown in the second column of Fig. 5. We split the dataset into 35
training examples and 60 testing examples.
NYU Dataset. To test the algorithm on a publicly available dataset, we used the NYUv2 RGB-D dataset
(Silberman et al., 2012), which comprises 795 training and 654 testing images. NYU is one of the most
widely used RGB-D datasets for semantic segmentation, and its images are semantically labelled into
multiple classes. We selected 65 testing samples containing challenging scenarios for support structure
detection. The ground truth provided by NYU cannot be used directly for our problem, as we segment support
structure tops rather than whole support structures to reduce computation. Therefore, manual annotation of
all three classes was performed on these selected images.
6.2 Evaluation
In this section we show an extensive evaluation of our algorithm on the datasets described in Section
6.1. We quantitatively compare our results with other state-of-the-art scene understanding algorithms.
Table 5: Intersection over union accuracy on NYU.

Method    Clutter    Support Structure    Others
ALE       7.40       5.18                 85.12
FULLY-C   30.04      36.01                84.45
OURS      31.68      38.14                83.13

Table 6: Recall accuracy on NYU.

Method    Clutter    Support Structure    Others
ALE       7.34       5.41                 99.0
FULLY-C   51.61      35.68                88.99
OURS      57.17      37.57                87.14
All currently available datasets, such as NYU v2, contain scenes where the support structures are close
to the camera. We created the LAB dataset with challenging scenes where the support structures and
clutter are relatively far away and more difficult to segment than in other publicly available datasets.
To find the best labelling algorithm over the SVM-trained potentials, we experimented with multiple CRF
formulations and inference schemes. We used ALE (Ladicky et al., 2009), trained and tested over the RGB-D
images using texton features as the unary potential, which we refer to as ALE. We also evaluated the
fully connected CRF model of (Krähenbühl and Koltun, 2012) using the SVM potentials from Section 4.2 but
without the additional height and normal kernels in the pairwise term, which we refer to as FULLY-C.
Finally, we compared the accuracy of our proposed method against the above methods and show an
improvement in segmentation accuracy. We would like to emphasize that both ALE and our system are trained
on the LAB dataset and tested on both datasets. As ALE is trained on texture features, it performed well
on the LAB dataset but not on the NYU dataset. Our algorithm is not trained on appearance cues, to avoid
sensitivity to color; this allows us to test on a variety of datasets without needing to train on each of
them. Our algorithm aims at labelling only the part of a structure which supports objects, for example
the table top, rather than the whole structure. For this reason the annotations provided in the NYU
dataset for structures cannot be used for the problem we address. Labelling only the support surfaces of
a structure can serve as a prior for faster object search in robotic applications. We show a qualitative
evaluation of the proposed algorithm on the LAB dataset in Fig. 5. Image 2 of Fig. 5 shows how we improve
upon FULLY-C by adding the normal and height kernels. We report support structure level accuracies in
Table 1, which gives the percentage of support structures correctly classified.
Figure 5: The columns from left to right show the original RGB image from our LAB dataset, its
corresponding manually annotated ground truth, the labels predicted by the superpixel clique CRF (ALE)
and the labels predicted by the fully connected CRF (FULLY-C). The rightmost column shows the labels
predicted by our proposed CRF-based learning approach (OURS). The locations of interest here are the
support structures and clutter.
Figure 6: The columns from left to right show the original image from the NYU v2 dataset, its
corresponding ground truth, the labels predicted by the superpixel clique CRF (ALE) and the dense CRF
output (FULLY-C). The rightmost column shows the labels predicted by our proposed method (OURS) for
different NYU scenes. The locations of interest here are the support structures and clutter.
We show an improvement of 10% in support structure detection with our proposed method compared to
FULLY-C, and a 33% increase compared to ALE. All the aforementioned methods are trained on the LAB
dataset and tested on both the LAB and NYU datasets. Standard algorithms like ALE performed poorly on the
NYU dataset, and also on LAB for support structure detection, because of their dependence on texture
features. Our proposed method scales well to both datasets because its features do not depend on the
texture of the image.
We summarize the intersection over union accuracies of object class segmentation on the LAB dataset in
Table 4 and on the NYU dataset in Table 5. The intersection over union measure is defined as
TP/(TP + FP + FN), where TP denotes true positives, FP false positives and FN false negatives. Similarly,
we evaluate the recall accuracies for each label and summarize them in Table 6 for the NYU dataset and in
Table 3 for the LAB dataset. Recall is defined as TP/(TP + FN), the probability of retrieving a specific
label with respect to its query. We observe that our algorithm performs better than the standard dense
CRF based method on both datasets.
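For reference, below is a small sketch of how these per-class scores can be computed from predicted and
ground-truth label maps; the function is generic and not tied to the authors' evaluation code.

```python
# Sketch of the per-class metrics used in Tables 3-6: intersection over union TP/(TP+FP+FN)
# and recall TP/(TP+FN), computed from predicted and ground-truth label maps.
import numpy as np

def per_class_metrics(pred, gt, n_classes=3):
    scores = {}
    for c in range(n_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        iou = tp / float(tp + fp + fn) if (tp + fp + fn) else 0.0
        recall = tp / float(tp + fn) if (tp + fn) else 0.0
        scores[c] = {"iou": iou, "recall": recall}
    return scores
```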
In the supplementary video we show a 3D reconstruction of a lab environment using RTAB-Map (Real-Time
Appearance-Based Mapping), an RGB-D SLAM approach based on visual odometry. We use a Kinect mounted on a
P3DX robot, and the system is built on ROS (Robot Operating System). We run our algorithm on the live 3D
stream and label the support structures present in the scene. The segmented regions can help the robot
with the task of object search and can also be used for path planning for faster area coverage while
searching for objects.
7 CONCLUSION
We have proposed an algorithm which uses geometric 3D cues and texture cues to classify a scene into
support structures and clutter, which serve as a prior for reducing the object search space. We propose a
generic method which works for a variety of scenes without training on every dataset. In Figure 4 we show
that clutter can be used as a cue to locate regions of interest when support structures are absent or
occluded. The experiments performed on NYU show the robustness of the algorithm to drastic changes in the
appearance of the support structures and clutter in the scene. We show a 7% and 2% increase in pixel-wise
recall accuracy for support structure on LAB and NYU respectively. The performance can be attributed to
the use of geometric features from the 3D point cloud, which would not be possible if only texture cues
were considered. Since the algorithm is fast, it can be implemented on a multi-processor architecture for
real-time performance, which makes it easy to use in robotic environments for region proposals. From the
evaluation we conclude that our proposed method scales well across datasets. As part of future research,
we intend to segment clutter into individual objects for object recognition and to formulate an optimized
path planning strategy for the robot to simultaneously explore and navigate large rooms efficiently
depending on its assigned task. Further, by assigning a confidence to each pixel being a support
structure or clutter, a more robust and optimal search strategy can be derived.
8 FUTURE WORK
We would like to extend this work further and use these region proposals for faster object search in
indoor environments. Further, we would like to investigate the performance of convolutional neural
networks for the same task.
REFERENCES
Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and
Susstrunk, S. (2012). Slic superpixels compared to
state-of-the-art superpixel methods. Pattern Analy-
sis and Machine Intelligence, IEEE Transactions on,
34(11):2274–2282.
Chang, C. and Lin, C. (2001). LIBSVM: a library for sup-
port vector machines.
Gould, S., Fulton, R., and Koller, D. (2009). Decomposing
a scene into geometric and semantically consistent re-
gions. In Computer Vision, 2009 IEEE 12th Interna-
tional Conference on, pages 1–8. IEEE.
Gupta, S., Arbeláez, P., Girshick, R., and Malik, J. (2015). Indoor scene understanding with rgb-d
images: Bottom-up segmentation, object detection and semantic segmentation. International Journal of
Computer Vision, 112(2):133–149.
Hermans, A., Floros, G., and Leibe, B. (2014). Dense 3d se-
mantic mapping of indoor scenes from rgb-d images.
In 2014 IEEE International Conference on Robotics
and Automation (ICRA), pages 2631–2638. IEEE.
Kim, B.-s., Kohli, P., and Savarese, S. (2013). 3d scene un-
derstanding by voxel-crf. In Proceedings of the IEEE
International Conference on Computer Vision, pages
1425–1432.
Koppula, H. S., Anand, A., Joachims, T., and Saxena, A.
(2011). Semantic labeling of 3d point clouds for
indoor scenes. In Shawe-Taylor, J., Zemel, R. S.,
Bartlett, P. L., Pereira, F., and Weinberger, K. Q., edi-
tors, Advances in Neural Information Processing Sys-
tems 24, pages 244–252. Curran Associates, Inc.
Krähenbühl, P. and Koltun, V. (2012). Efficient inference in fully connected crfs with gaussian edge
potentials. arXiv preprint arXiv:1210.5644.
Ladicky, L., Russell, C., Kohli, P., and Torr, P. H. (2009).
Associative hierarchical crfs for object class image
segmentation. In Computer Vision, 2009 IEEE 12th
International Conference on, pages 739–746. IEEE.
Ladicky, L., Russell, C., Kohli, P., and Torr, P. H. (2010).
Graph cut based inference with co-occurrence statis-
tics. In Computer Vision–ECCV 2010, pages 239–253.
Springer.
Quattoni, A. and Torralba, A. (2009). Recognizing indoor
scenes. In Computer Vision and Pattern Recognition,
2009. CVPR 2009. IEEE Conference on, pages 413–
420. IEEE.
Ren, X., Bo, L., and Fox, D. (2012). Rgb-(d) scene label-
ing: Features and algorithms. In Computer Vision and
Pattern Recognition (CVPR), 2012 IEEE Conference
on, pages 2759–2766. IEEE.
Reza, M. A. and Kosecka, J. (2014). Object recognition
and segmentation in indoor scenes from rgb-d images.
In Robotics Science and Systems (RSS) conference-
5th workshop on RGB-D: Advanced Reasoning with
Depth Cameras.
Rusu, R. B. and Cousins, S. (2011). 3D is here: Point Cloud
Library (PCL). In IEEE International Conference on
Robotics and Automation (ICRA), Shanghai, China.
Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2009).
Textonboost for image understanding: Multi-class ob-
ject recognition and segmentation by jointly modeling
texture, layout, and context. International Journal of
Computer Vision, 81(1):2–23.
Silberman, N. and Fergus, R. (2011). Indoor scene segmen-
tation using a structured light sensor. In Computer
Vision Workshops (ICCV Workshops), 2011 IEEE In-
ternational Conference on, pages 601–608. IEEE.
Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012).
Indoor segmentation and support inference from rgbd
images. In Proceedings of the 12th European Confer-
ence on Computer Vision - Volume Part V, ECCV’12,
pages 746–760, Berlin, Heidelberg. Springer-Verlag.
Toyoda, T. and Hasegawa, O. (2008). Random field model
for integration of local information and global infor-
mation. Pattern Analysis and Machine Intelligence,
IEEE Transactions on, 30(8):1483–1489.
Wolf, D., Prankl, J., and Vincze, M. (2015). Fast semantic
segmentation of 3d point clouds using a dense crf with
learned parameters. In 2015 IEEE International Con-
ference on Robotics and Automation (ICRA), pages
4867–4873. IEEE.
Wu, T.-F., Lin, C.-J., and Weng, R. C. (2004). Probabil-
ity estimates for multi-class classification by pairwise
coupling. The Journal of Machine Learning Research,
5:975–1005.
Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V.,
Su, Z., Du, D., Huang, C., and Torr, P. H. (2015). Con-
ditional random fields as recurrent neural networks. In
Proceedings of the IEEE International Conference on
Computer Vision, pages 1529–1537.