Joint Segmentation and Tracking of Object Surfaces in Depth Movies
along Human/Robot Manipulations
Babette Dellen, Farzad Husain and Carme Torras
Institut de Robòtica i Informàtica Industrial, CSIC-UPC, Llorens i Artigas 4-6, 08028 Barcelona, Spain
Keywords:
Range Data, Segmentation, Motion, Shape, Surface Fitting.
Abstract:
A novel framework for the joint segmentation and tracking of object surfaces in depth videos is presented. Initially, the 3D colored point cloud obtained with the Kinect camera is used to segment the scene into surface patches, defined by quadratic functions. The computed segments, together with their functional descriptions, are then used to partition the depth image of the subsequent frame in a manner consistent with the preceding frame. This way, solutions established in previous frames can be reused, which improves both the efficiency of the algorithm and the coherency of the segmentations along the movie. The algorithm is tested on scenes showing human and robot manipulations of objects. We demonstrate that the method can successfully segment and track the human/robot arm and object surfaces along the manipulations. The performance is evaluated quantitatively by measuring the temporal coherency of the segmentations and the segmentation covering with respect to ground truth. The method provides a visual front-end designed for robotic applications, and can potentially be used in the context of manipulation recognition, visual servoing, and robot-grasping tasks.
1 INTRODUCTION
During human or robotic manipulations, we face the
challenge of having to interpret a large amount of vi-
sual data within a short period of time. The data from
the sensors needs to be structured in a way that makes
task-relevant visual information more accessible. The
recognition of objects and scene context in a temporally consistent manner plays a central role here.
Moreover, in manipulation tasks, the use of 3D in-
formation is of particular importance, since accurate
grasping and object manipulation require knowledge
about both the 3D shape of the objects and their 3D
context, e.g., to avoid collisions. For depth acquisi-
tion, stereo set-ups, laser-range scanners, or time-of-
flight depth sensors are commonly used. Recently, the
release of the Kinect camera (Kinect, 2010), a depth
sensor based on a structured light system, has opened
new possibilities for acquiring depth information in
real time.
A traditional way to process the visual data in ma-
nipulation tasks is to use geometric models for recog-
nizing objects in the image and to track them using
conventional tracking paradigms along the manipula-
tion (Kragic, 2001). In this case, exact object models
need to be defined prior to the task, which, consid-
ering the variability of an object’s appearance in the
image, has the drawback that the system may not eas-
ily adapt to new scenarios.
In this work, we approach the problem from a dif-
ferent angle. Our main contribution is the creation of consistent segmentations of depth images into geometric surfaces along a depth video, together with the tracking of the resulting segments along the movie. Starting from
a known initial segmentation of the first frame into
surface segments, we show in this paper how this in-
formation can be exploited in a consecutive frame to
group the current depth values into segments. This
way, information from the previous frame can be ef-
ficiently recycled, and segment labels can be kept
throughout the sequence, enabling tracking of surface
patches.
The robot can use such a representation to draw
conclusions about scene content (Aksoy et al., 2011),
to guide its own movements (visual servoing), or to
use surface information for the planning of grasping
movements (Taylor and Kleeman, 2002), or even in
a learning-by-demonstration context (Agostini et al.,
2011; Rozo et al., 2011). At a later stage, higher-level
information about objects may enter the task by de-
scribing objects through their composite 3D surfaces
(Hofman and Jarvis, 2000).
The paper is structured as follows: In Section 2, we discuss related work. The proposed algorithm is
introduced in Sections 3 and 4. Then, in Section 5, the results for different human/robot manipulations are presented. Future work is sketched in Section 6.
2 RELATED WORK
Joint segmentation and tracking has previously
been performed mostly for color image sequences
(Abramov et al., 2010; Deng and Manjunath, 2001;
Patras et al., 2001; Wang, 1998; Wang et al., 2009;
Grundmann et al., 2010). In a recent work, the color
images were segmented by finding the equilibrium
states of a Potts model (Abramov et al., 2010). Con-
sistency of segmentations obtained along the movie
and the tracking of segments were achieved through
label transfer from one frame to the next using optic
flow information. This way, the equilibrium states in
the current frame could be encountered more rapidly.
The resulting segments represent regions of uniform
color and usually do not coincide with the object sur-
faces in a geometric sense, which we would desire for
our system. The solutions found by Abramov et al.
(2010) cannot be easily adapted to our problem, be-
cause color segmentation and depth segmentation are
inherently different problems. Surfaces cannot be de-
fined based on local properties only, which increases
the difficulty of the problem considerably.
Other methods for video segmentation usually perform independent segmentations of each frame and then try to match segments (Deng and Manjunath,
2001; Patras et al., 2001; Wang, 1998; Grundmann
et al., 2010). This is problematic because segmen-
tations have to be computed from scratch for every
frame, which has consequences on both the computa-
tional efficiency of the method and the temporal con-
sistency of the results. For cluttered scenes, the par-
tition of the segmentation tends to change from one
frame to the next, and temporal coherence of the seg-
mentations is prone to be impaired because of this ef-
fect.
In another work, segmentation and multi-object
tracking were performed simultaneously using graph-
ical models (Wang et al., 2009). Observed and hid-
den variables of interest describing the appearance
and the states of objects are jointly considered and
used to formulate the objective as a Markov random
field energy minimization problem. Different from
our method, depth measurements do not enter the
framework, and objects are defined based on their 2D
appearance alone. Also, objects of interest are de-
fined in the first frame and are then tracked along the
sequence. While the method delivers convincing re-
sults, energy minimization is computationally expen-
sive and efficient optimizations would have to be de-
veloped to make the approach more practical.
To the authors’ knowledge, little work has been
done in this field using depth information as the
primary vision cue for segmentation and tracking.
Parvizi and Wu performed multiple object tracking
using an adaptive depth segmentation method (Parvizi
and Wu, 2008). Time-of-flight depth was used to seg-
ment each frame independently by finding the con-
nected components based on an absolute depth dis-
tance measure. The segments of adjacent frames were
then associated with each other using a depth his-
togram distribution. However, this depth segmenta-
tion method is rather simple and does not partition the
data into distinct surfaces. As a consequence, bound-
aries defined by changes in 3D shape (curvature) can-
not be detected, which constitutes a major difference
in comparison to our method. In addition, each movie
frame is segmented from scratch. In the case of sur-
face segmentation, this can be rather costly. Further-
more, the temporal consistency of the segmentations
will degrade with increasing clutter in the scene.
In Lopez-Mendez et al. (2011), upper body track-
ing of a human using a range sensor (Microsoft
Kinect) is performed. Their technique is limited to humans, however, since it relies on a prior model of the human body.
3 OVERVIEW OF THE METHOD
Our method for depth-video segmentation consists of
three parts: (i) Segment transfer and seeding, (ii) re-
estimation of surface models and grouping, and (iii) a
consistency check and respective re-grouping of pix-
els (see Fig. 1).
In the first part (i), labels of the depth image obtained at frame F_t are transferred to the next frame F_{t+1}. Surface models that have been fitted to the depth of frame F_t for all segments are transferred as well. A seed is created for each label by comparing the predicted depth with the measured depth in the projected segment area. If the distance between the measured depth and the model depth is smaller than a threshold, the respective pixel is accepted and used as a seed for constructing the full segment region in the current frame.
In part (ii), the surface models are re-estimated for
each segment using the current depth values of the re-
spective seed. Non-seed points are grouped into connected components and then assigned to the closest surface in the neighborhood. The connectedness of the found segments is evaluated, and the labeling is adjusted ac-
cordingly.
JointSegmentationandTrackingofObjectSurfacesinDepthMoviesalongHuman/RobotManipulations
245
Figure 1: Schematic of the method. Segment regions obtained for frame F_t are transferred to frame F_{t+1}. The points lying inside a given segment region are compared with the respective surface model, and only those points which fit the surface model are marked as seed points of the given segment in frame F_{t+1}. Then, the surface parameters of the respective segment model are re-calculated using the depth values of the seed. Using these models, the depth of points outside the seed region can be predicted for each segment, and the remaining points are assigned to the closest segment surface, taking some proximity constraints into account. Finally, the obtained segmentation for frame F_{t+1} is compared with the previous segmentation for frame F_t. Only if an inconsistency is detected, the affected segments are re-grouped using region growing and shrinking until the problem is resolved.
Finally, in part (iii), the temporal consistency of
segments along the video is checked. Because of the
high frame rate of the Kinect, it can be assumed that
changes between frames (at least in the given scenar-
ios) are small, implying that a segment cannot grow or
shrink out of proportion from one frame to the next.
If such a temporal consistency problem is detected, the points of the affected segments are re-grouped until the problem is resolved, using a clearly defined termination criterion.
A Microsoft Kinect sensor along with the Kinect package of ROS (Robot Operating System) is used to acquire sequences of depth images F_1, ..., F_t, F_{t+1}, ..., F_{t+n} for different scenarios. The algorithm is implemented in Matlab. Each frame contains the color values (r, g, b) and the (x, y, z) values from the depth sensor, resulting in a matrix of size m × n × 6, where m and n are the spatial dimensions of the image grid. However, only the (x, y, z) values are used by the proposed algorithm.
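As an illustration of this data layout, a minimal Python/NumPy sketch (rather than the authors' Matlab code; the function name is ours) of holding one frame as an m × n × 6 array and separating its channels could look as follows:

```python
import numpy as np

def split_frame(frame):
    """Split an (m, n, 6) frame into its color (r, g, b) and depth (x, y, z) channels."""
    rgb = frame[:, :, :3]   # color values, only needed for the initial segmentation
    xyz = frame[:, :, 3:]   # 3D coordinates from the depth sensor, used by the algorithm
    return rgb, xyz

# Example with a dummy 480 x 640 frame.
rgb, xyz = split_frame(np.zeros((480, 640, 6)))
```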
4 ALGORITHM
Our algorithm for joint segmentation and tracking
consists of the following consecutive steps (Fig. 1):
1. Initial Labeling. A labeling l^t(u, v) of the initial frame at t = 1 is computed using the algorithm proposed in Dellen et al. (2011). Here u and v are the indexes of the image grid. Color segments are extracted from the color image using a standard algorithm (Felzenszwalb and Huttenlocher, 2004) at different resolutions. Quadratic surfaces are fitted to the color segments using the depth data, and the best patches are selected from the hierarchy of resolutions, creating a new segmentation. This segmentation is further improved by merging those patches that are considered to describe the same surface. This is achieved by a recent graph-based clustering method for surfaces based on Kruskal's algorithm (Kruskal, 1956). This gives a segmentation of the image into k disjoint segments s_1, ..., s_j, ..., s_k with s_i ∩ s_j = ∅ for i ≠ j and respective labels 1, ..., j, ..., k.
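The initial labeling follows Dellen et al. (2011) and is not reproduced here. The sketch below, in Python with scikit-image, only illustrates its first stage, the multi-resolution color over-segmentation with the Felzenszwalb-Huttenlocher algorithm; the scale values and the helper name are illustrative assumptions, and the surface fitting, patch selection, and Kruskal-based merging stages are omitted.

```python
from skimage.segmentation import felzenszwalb

def multiscale_color_segments(rgb, scales=(100, 300, 500)):
    """Color over-segmentation of the first frame at several resolutions.

    rgb: (m, n, 3) float image in [0, 1]; returns one label image per scale.
    The scale values are illustrative only.
    """
    return [felzenszwalb(rgb, scale=s, sigma=0.8, min_size=50) for s in scales]

# Quadratic surfaces would then be fitted to these patches (see step 2), the best
# patches selected across resolutions, and patches describing the same surface merged.
```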
2. Model Fitting. A quadratic surface model f_j^t(x, y) of the form

z = a x^2 + b y^2 + c x + d y + e ,   (1)

with surface parameters a, b, c, d, and e, is fitted to each segment s_j by performing a Levenberg-Marquardt minimization of the mean square distance

E_j = \frac{1}{n_j} \sum_{(u,v) \in s_j} [z_e(u, v) - z(u, v)]^2   (2)

of the measured depth points z(u, v) from the estimated model depth z_e(u, v) = f_j^t[x(u, v), y(u, v)]. Here, n_j is the number of measured depth points in the area of segment s_j. The chosen model type allows modeling of planar and curved surfaces, e.g., cylinders and spheres. The iterative solver (Levenberg-Marquardt minimization) enables us to use the solution obtained for the previous time step as the starting location, which promotes temporal consistency. For the initial frame, we set the starting location to zero. In our case, the algorithm converged in an average of 4 iterations.
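As a minimal sketch of the model fitting of Eqs. 1-2 (illustrative Python; the paper's implementation is in Matlab), SciPy's Levenberg-Marquardt solver can be used with the warm start described above; the function names are ours.

```python
import numpy as np
from scipy.optimize import least_squares

def quadric(params, x, y):
    """Evaluate the surface model z = a*x**2 + b*y**2 + c*x + d*y + e of Eq. 1."""
    a, b, c, d, e = params
    return a * x**2 + b * y**2 + c * x + d * y + e

def fit_surface(x, y, z, x0=None):
    """Fit the quadratic model to one segment's depth points (x, y, z are 1D arrays).

    x0 is the starting location, e.g. the parameters found for the same segment in
    the previous frame; zeros are used for the initial frame, as in the paper.
    """
    if x0 is None:
        x0 = np.zeros(5)
    residuals = lambda p: quadric(p, x, y) - z
    fit = least_squares(residuals, x0, method='lm')   # Levenberg-Marquardt
    mse = np.mean(fit.fun ** 2)                       # mean square distance E_j of Eq. 2
    return fit.x, mse
```

The same routine can be reused in step 4 below to update each segment's model from its seed points, with the parameters of frame t as the starting location.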
3. Seeding. In order to update the segmentation grid according to the current frame, the first step should be to unlabel the points (u, v) that do not fit the surface. We achieve this by generating seeds. For each point (u, v) of frame F_{t+1}, we find the projected label p = l^t(u, v) from the previous segmentation and define a seed labeling for F_{t+1} according to

m^{t+1}(u, v) = \begin{cases} p & \text{if } |z_e(u, v) - z(u, v)| < \tau_p , \\ 0 & \text{otherwise,} \end{cases}   (3)

with z_e(u, v) = f_p^t[x(u, v), y(u, v)], and

\tau_p = \sum_{(u,v) \in s_p^{t+1}} |z_e(u, v) - z(u, v)| \, / \, (\rho \, n_p) ,   (4)

where s_p^{t+1} is the segment s_p^t projected into the current frame F_{t+1}, n_p is the number of pixels in the area of s_p, and ρ is a constant. This defines a labeling l^{t+1}(u, v) = p for all (u, v) ∈ s_p^{t+1}.
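The seeding rule of Eqs. 3-4 can be sketched as follows (illustrative Python, not the authors' code). Here pred is assumed to hold, for every label p, the model depth z_e predicted by f_p^t on the current image grid.

```python
import numpy as np

def seed_labels(labels_prev, z, pred, rho=1.7):
    """Seed labeling m^{t+1}(u, v) of Eq. 3 (sketch).

    labels_prev: (m, n) labels projected from frame t (0 = unlabeled).
    z: (m, n) measured depth of frame t+1.
    pred: dict mapping label p -> (m, n) model depth z_e = f_p^t[x(u, v), y(u, v)].
    rho: constant of Eq. 4 (the paper reports rho = 1.7).
    """
    seeds = np.zeros_like(labels_prev)
    for p, z_e in pred.items():
        region = labels_prev == p                          # projected segment area s_p^{t+1}
        if not region.any():
            continue
        err = np.abs(z_e - z)                              # |z_e(u, v) - z(u, v)|
        tau_p = err[region].sum() / (rho * region.sum())   # adaptive threshold of Eq. 4
        seeds[region & (err < tau_p)] = p                  # Eq. 3: keep only well-fitting points
    return seeds
```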
4. Updating Models. Now the surface model parameters need to be updated, so that they can model the current state of the surfaces. For each label j we obtain a surface model f_j^{t+1}(x, y) by applying the fitting procedure of step (2) to the seed of s_j, consisting of all the points (u, v) for which m^{t+1}(u, v) = j holds.
5. Grouping of Non-seed Points. Once we have updated the model parameters, we can determine the new labels of non-seed points. All points (u, v) with m^{t+1}(u, v) = 0 are grouped into connected components. For each connected component c_i, we search the neighborhood of all boundary points (u, v) within a radius r_1 for seed points. If a seed point is found, its label is added to the list of potential labels L_i = {l_1, l_2, ...} for c_i. For each label q ∈ L_i, we compute the distance

d_q(u, v) = | f_q^{t+1}[x(u, v), y(u, v)] - z(u, v) | .   (5)

For all (u, v) ∈ c_i, we can set

l^{t+1}(u, v) = \arg\min_{q \in L_i} d_q(u, v) ,   (6)

defining the labeling for the non-seed points.
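A sketch of the grouping of non-seed points (Eqs. 5-6) under the same assumptions is given below; it uses SciPy's connected-component labeling, pred now holds the predicted depths of the updated models f^{t+1}, and the dilation radius stands in for the search radius r_1.

```python
import numpy as np
from scipy import ndimage

def assign_nonseed(seeds, z, pred, r1=10):
    """Assign non-seed points to the closest neighboring surface (step 5, sketch).

    seeds: (m, n) seed labeling from step 3 (0 = non-seed).
    z: (m, n) measured depth. pred: dict label -> (m, n) predicted model depth (frame t+1).
    """
    labels = seeds.copy()
    components, n_comp = ndimage.label(seeds == 0)           # connected non-seed components c_i
    for i in range(1, n_comp + 1):
        comp = components == i
        near = ndimage.binary_dilation(comp, iterations=r1)  # neighborhood within radius r_1
        candidates = np.unique(seeds[near & (seeds > 0)])    # list of potential labels L_i
        if candidates.size == 0:
            continue
        # Distance to each candidate surface (Eq. 5); label of minimum distance (Eq. 6).
        dists = np.stack([np.abs(pred[q] - z) for q in candidates])
        best = candidates[np.argmin(dists, axis=0)]
        labels[comp] = best[comp]
    return labels
```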
6. Ensuring Connectedness of the Current Labeling l^{t+1}(u, v). The assignment of new labels does not guarantee that the segments defined by the new labeling represent connected components. Segments should only become disconnected if the surface is occluded by another surface (or surfaces). For example, in Fig. 2(b) it can be observed that at some point in time the background becomes disconnected. In order to avoid false non-connected segments, we unlabel fragments whose size is less than the minimum allowable segment size (we use 800 pixels, but this could be changed adaptively depending on the scenario) and assign them the label of the segment with which they share the largest boundary. The current labeling l^{t+1}(u, v) is updated accordingly. Since connectedness is ensured, l^{t+1}(u, v) represents a segmentation of frame F_{t+1} into k segments.
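Step 6 can be sketched as the following clean-up pass (illustrative Python, assuming the 800-pixel minimum segment size mentioned above): spurious disconnected fragments of a segment are relabeled with the neighboring segment sharing the longest boundary.

```python
import numpy as np
from scipy import ndimage

def enforce_connectedness(labels, min_size=800):
    """Relabel small disconnected fragments of each segment (step 6, sketch)."""
    out = labels.copy()
    for j in np.unique(labels):
        comps, n = ndimage.label(labels == j)
        if n <= 1:
            continue                                       # segment is already connected
        for i in range(1, n + 1):
            frag = comps == i
            if frag.sum() >= min_size:
                continue                                   # large fragments may be real occlusion splits
            ring = ndimage.binary_dilation(frag) & ~frag   # one-pixel boundary around the fragment
            neighbors, counts = np.unique(out[ring], return_counts=True)
            keep = neighbors != j
            if keep.any():
                out[frag] = neighbors[keep][np.argmax(counts[keep])]   # longest shared boundary
    return out
```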
7. Regrouping to Maintain Temporal Consistency. Since the Kinect camera can deliver a high frame rate of up to 30 fps, we can reasonably assume relatively small motion of objects between consecutive frames. This implies that a segment cannot grow or shrink in size out of proportion from one frame to the next. For each segment s_j of frame F_{t+1}, we compute the segment size ratio Δa_j = a_j^{t+1} / a_j^t, where a_j^{t+1} and a_j^t are the sizes of s_j in frames F_{t+1} and F_t, respectively. If Δa_j > 1 + δ or Δa_j < 1 - δ, a label assignment error is assumed. In this case, we compute the relative change for all direct segment neighbors of s_j. If the relative changes are approximately equal (the number of pixels added to one segment s_j matches the number of pixels removed from the other segment s_i), we extract the contact line between the two segments and assign all points (u, v) within a radius r_2 of the contact line to s_i until the ratio Δa_j ≈ 1, which provides the termination criterion.
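The consistency test of step 7 reduces to a per-segment size-ratio check; a sketch of the detection part is given below (illustrative Python; the re-grouping along the contact line is omitted).

```python
import numpy as np

def inconsistent_segments(labels_prev, labels_curr, delta=0.2):
    """Return labels j whose size ratio Delta a_j falls outside [1 - delta, 1 + delta]."""
    flagged = []
    for j in np.unique(labels_prev):
        a_prev = np.count_nonzero(labels_prev == j)
        a_curr = np.count_nonzero(labels_curr == j)
        if a_prev == 0:
            continue
        ratio = a_curr / a_prev                      # Delta a_j = a_j^{t+1} / a_j^t
        if ratio > 1 + delta or ratio < 1 - delta:
            flagged.append(j)                        # label assignment error assumed
    return flagged
```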
8. Steps 2-7 are repeated for the next frame using l^{t+1}(u, v) as the initial labeling.
5 RESULTS
5.1 Segmentation Results
We tested the algorithm for several depth movies
showing human and robot manipulations of objects.
Videos are provided as supplementary material at
http://www.iri.upc.edu/people/bdellen/Movies.html.
As an example of a typical manipulation action,
we show segmentation results for a human hand
grasping a carton box and placing it on top of a cylin-
drically shaped paper roll (see Fig. 2). In Fig. 2(a),
selected calibrated depth images acquired with the
Kinect are shown. In Fig. 2(b), the segmentation re-
sults obtained by our method are shown. Fig. 2(c)
shows a ground truth segmentation of surfaces as per-
ceived by a human. Fig. 2(d) shows, for comparison, results obtained with a color-based video-segmentation algorithm (Grundmann et al., 2010). The segments are color-coded, where each
color corresponds to a unique segment label. With
our method not all surfaces could be completely re-
covered in the initial segmentation, due to the lim-
ited depth resolution. However, in subsequent frames, a change in the position of the carton box allows the surfaces to be correctly segmented and tracked (see Fig. 2(b)).
JointSegmentationandTrackingofObjectSurfacesinDepthMoviesalongHuman/RobotManipulations
247
Figure 2: Hand grasping a carton box. (a) Depth image (Kinect). (b) Video segmentation results using our method. (c) Human segmentation used as ground-truth. (d) Color image video segmentation using (Grundmann et al., 2010).
In comparison with the ground truth (Fig. 2(c)), it can be seen that a small percentage of false label assignments occurs during the manipulation of the carton box, since local depth information becomes insuffi-
cient. The problem is resolved to a certain degree by
regrouping (see step (7) of the algorithm). In compar-
ison, in the color-based approach (shown in Fig. 2(d)),
the hand is merged with both the background and the
carton box. This is an inherent problem in algorithms
which rely on color information alone, because dif-
ferent surfaces cannot be guaranteed to always have a
different color.
Next, we present results for a human hand rolling
a green ball forward and then backwards with its fin-
gers (see Fig. 3). The ball and the hand are correctly
segmented and tracked along the image sequence,
even though the hand is changing its shape during the
motion (see Fig. 3 (b)). Ground-truth segmentations
and the results of the color-based video segmentation proposed by Grundmann et al. (2010) are shown for comparison in Fig. 3(c-d), respectively.

Figure 3: Hand rolling a ball. (a) Depth image (Kinect). (b) Video segmentation results using our method. (c) Human segmentation used as ground-truth. (d) Color video segmentation using (Grundmann et al., 2010).
We further show segmentation results for a movie
where a robot arm grasps a cylindrically-shaped paper
roll and moves it to a new position (see Fig. 4). Dur-
ing the movement, objects in the background become
occluded. Nevertheless, the sequence is correctly seg-
mented and both the robot arm and the paper roll are
tracked along the movie, as can be seen in Fig. 4(b).
In the color-based video segmentation (see Fig. 4(d)),
the robot arm gets over-segmented, and the carton box, which is lying on the table, gets merged with the table (under-segmented).
Finally, we demonstrate a scenario in which mul-
tiple segments are tracked simultaneously. Fig. 5
shows selected frames of a plant movie. It can be
seen that as the plant is being displaced, multiple seg-
ments are tracked jointly through the scene. Notice
that two leaves were under-segmented in the initial segmentation, so they remain under-segmented as they are tracked through the upcoming frames.
5.2 Quantitative Evaluation
Figure 4: WAM robotic arm grasping and displacing a paper roll. (a) Depth image (Kinect). (b) Video segmentation results using our method. (c) Human segmentation used as ground-truth. (d) Color video segmentation using (Grundmann et al., 2010).

We use the segmentation covering metric described in Arbelaez et al. (2009) to determine how closely the segmentation results match the ground truth segmentation. Human-annotated color images are used as ground truth (column (c) of Figs. 2, 3, and 4). For one
frame, the segmentation covering metric is defined as
C(S' \to S) = \frac{1}{N} \sum_{R \in S} |R| \cdot \max_{R' \in S'} O(R, R') ,   (7)

where N is the total number of pixels in the image, |R| is the number of pixels in region R, and O(R, R') is the overlap between the regions R and R', defined as

O(R, R') = \frac{|R \cap R'|}{|R \cup R'|} .   (8)

Figure 5: Plant being displaced. (a) Depth image (Kinect). (b) Video segmentation results using our method.
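To make Eqs. 7-8 concrete, a small illustrative Python sketch of the covering computation for a single frame is given below; seg and gt are label images of the machine segmentation S' and the human ground truth S.

```python
import numpy as np

def segmentation_covering(seg, gt):
    """Segmentation covering C(S' -> S) of Eq. 7 for one frame (sketch)."""
    n_pixels = gt.size
    total = 0.0
    for r in np.unique(gt):                          # regions R of the ground truth S
        region = gt == r
        best = 0.0
        for rp in np.unique(seg[region]):            # only segments overlapping R can maximize O
            region_p = seg == rp
            inter = np.count_nonzero(region & region_p)
            union = np.count_nonzero(region | region_p)
            best = max(best, inter / union)          # overlap O(R, R') of Eq. 8
        total += np.count_nonzero(region) * best     # weighted by |R|
    return total / n_pixels
```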
Fig. 6 (blue line) shows a plot of the segmenta-
tion covering metric for the segmentation results of
our algorithm, corresponding to the videos of Figs. 2, 3, and 4.
The segmentation covering metric is computed for
frames taken at fixed time intervals. The segmenta-
tion covering metric of the segmentation result shown
in Fig. 6(c) (blue line) has a lower value compared to
the other examples because in this case the back-
ground got over-segmented initially and the algo-
rithm tries to track these over-segmented surfaces
in upcoming frames, as can be seen in Fig. 4(b).
We also plotted the segmentation covering metric for
the color-video-segmentation results obtained with a
graph-based video segmentation algorithm (Grund-
mann et al., 2010) in red for comparison. For the
given set of movies, our method clearly outperforms
the method described in (Grundmann et al., 2010).
For the plant movie (Fig. 5) similar results were ob-
tained. The segmentation covering metric gave a
value of 0.89 and 0.83 for the initial and the final
frame, respectively, when compared to a human-perceived segmentation.
We evaluated the coherence of the segmentations by measuring the segment size ratio Δa_i of the segments s_i as a function of time. For the example of
Fig. 2, results are shown in Fig. 7(a). The line-plot
colors correspond to the label colors used in Fig. 2(b).
The segment size ratios fluctuate between 0.8 and 1.2, indicating that temporal consistency is maintained.
Figure 6: Segmentation covering metric for the results obtained with our segmentation algorithm (in blue) and with the graph-based method (Grundmann et al., 2010) (in red) for the scenes shown in (a) Fig. 2, (b) Fig. 3, and (c) Fig. 4. The vertical axis shows C(S' → S); the horizontal axis shows the iterations.
Figure 7: Evaluation of segmentation coherence and surface fitting error. (a) Segment size ratio Δa_i as a function of the frame number for all segments s_i shown in Fig. 2. (b) Surface fitting errors E_i (in meters) as a function of the frame number. Line colors correspond to the segment-label colors of Fig. 2(c).
We further compared the depth predicted by the
surface models of the segments with the measured
(ground truth) depth and computed the fitting error for
each segment (see Fig. 7(b)). Except for the hand, the
fitting error remains below 0.02 meters. The larger
errors measured for the hand are caused by small segmentation errors along its large boundary due to its fast motion, which negatively affects the tracking procedure.
5.3 Parameter Choices
The algorithm contains two important parameters,
i.e., ρ and δ, which may require tuning. The remain-
ing parameters r_1 and r_2 are less critical.
For our set-up, the parameter ρ required for the
segmentation of consecutive frames was determined
only once and not altered during the different experi-
ments, except for one experiment, for which it needed
to be increased. With our chosen value of ρ = 1.7, an
average of 74% of points per segment served as seed
points.
By evaluating the segment-size ratio over time
(see Fig. 7(a)), which, in case of successful track-
ing, stayed between 0.8 and 1.2, we set the parameter δ equal to 0.2, providing a reasonable bound of 20% on the segment-size change. This parameter, once
set, was not varied throughout the experiments. The
remaining parameters were set to r_1 = 10 pixels and r_2 = 15 pixels.
6 CONCLUSIONS
We presented a novel algorithm for joint segmentation
and tracking of object surfaces, defined by their geo-
metric shapes. Segments obtained for the first frame
are used to initialize the segmentation procedure of
the next frame, and so on. The main novelties of
the proposed method are (i) the use of quadratic surface models for seeding in the context of a tracking problem (steps 2-3), (ii) a labeling technique for non-seed points, defined by Eq. 6 (see step 5 of the algorithm), and (iii) a re-grouping strategy enforcing temporal consistency between frames (step 7). We tested the al-
gorithm for several movies acquired with the Kinect
showing human and robot manipulations of objects.
The algorithm allowed us to segment and track the
main object surfaces in the scene, despite frequently
occurring occlusions, limited resolution of the depth
images, and shape changes of the hand and the robot
gripper. However, some problems still remain. Oc-
casionally, depth differences between surfaces are too
small, resulting in assignment conflicts that cannot be
resolved by the method as it is. In the future, we aim
to incorporate additional mechanisms for improving
the robustness of the method in this respect. Further-
more, we are currently developing mechanisms for
generating new segments in addition to the ones that
have been determined in the first frame, which will be
important in case new objects enter the scene.
This will also allow the initial frame to be segmented from scratch in the future. The segment con-
sistency check and the following re-grouping proce-
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
250
dure are currently conducted using hard thresholds,
which we plan to make adaptive in the future. We plan
to make our tracking algorithm more robust to occlu-
sions and noise by using shape information from all
the previous time steps. A way to achieve this would be to build dynamic shape models (Cremers, 2006).
We provided a quantitative evaluation of the
method using human-annotated ground truth. Obtain-
ing ground truth for video is, however, a very tedious procedure and thus limits the extent of the evaluation. Since there is
no implementation of a similar algorithm performing
joint segmentation and tracking in depth space avail-
able, we compared our method to a standard color-
video segmentation algorithm (Grundmann et al.,
2010). We could show that our method outperformed
color-video segmentation for the videos analyzed.
However, this comparison may not be entirely fair,
since we are using a different feature, i.e., depth, and
not color.
Currently, the method needs 1.92 seconds to
process one frame of size 430 × 282 pixels in Matlab
on an Intel 3.3 GHz processor. With an efficient C/C++ implementation of the method, we expect to achieve real-time performance, which is one of our next goals.
ACKNOWLEDGEMENTS
This research is partially funded by the EU
projects GARNICS (FP7-247947) and IntellAct
(FP7-269959), and the Grup consolidat 2009
SGR155. B. Dellen acknowledges support from the
Spanish Ministry of Science and Innovation through
a Ramon y Cajal program.
REFERENCES
Abramov, A., Aksoy, E. E., Dörr, J., Wörgötter, F., Pauwels, K., and Dellen, B. (2010). 3D semantic representation of actions from efficient stereo-image-sequence segmentation on GPUs. In 5th Intl. Symp. 3D Data Processing, Visualization and Transmission.
Agostini, A., Torras, C., and Wörgötter, F. (2011). Integrating task planning and interactive learning for robots to work in human environments. In IJCAI, Barcelona, pages 2386–2391.
Aksoy, E. E., Abramov, A., Dörr, J., Ning, K., Dellen, B., and Wörgötter, F. (2011). Learning the semantics of object-action relations by observation. Int. J. Rob. Res., 30(10):1229–1249.
Arbelaez, P., Maire, M., Fowlkes, C., and Malik, J. (2009). From contours to regions: An empirical evaluation. In CVPR, pages 2294–2301.
Cremers, D. (2006). Dynamical statistical shape priors for level set-based tracking. IEEE TPAMI, 28(8):1262–1273.
Dellen, B., Alenyà, G., Foix, S., and Torras, C. (2011). Segmenting color images into surface patches by exploiting sparse depth data. In IEEE Workshop on Applications of Computer Vision, pages 591–598.
Deng, Y. and Manjunath, B. (2001). Unsupervised segmentation of color-texture regions in images and video. IEEE TPAMI, 23(8):800–810.
Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. Intl. J. of Computer Vision, 59(2):167–181.
Grundmann, M., Kwatra, V., Han, M., and Essa, I. (2010). Efficient hierarchical graph-based video segmentation. In CVPR, pages 2141–2148.
Hofman, I. and Jarvis, R. (2000). Object recognition via
attributed graph matching. In Proc. Australian Conf. on
Robotics and Automation, Melbourne, Australia.
Kinect (2010). Kinect for Xbox 360. http://www.xbox.com/en-US/kinect.
Kragic, D. (2001). Visual Servoing for Manipulation: Ro-
bustness and Integration Issues. PhD thesis, Computa-
tional Vision and Active Perception Laboratory, Royal
Institute of Technology, Stockholm, Sweden.
Kruskal, J. B. (1956). On the Shortest Spanning Subtree of
a Graph and the Traveling Salesman Problem. In Proc.
of the American Mathematical Society.
Lopez-Mendez, A., Alcoverro, M., Pardas, M., and Casas, J. (2011). Real-time upper body tracking with online initialization using a range sensor. In IEEE Intl. Conf. on Computer Vision Workshops, pages 391–398.
Parvizi, E. and Wu, Q. (2008). Multiple object tracking based on adaptive depth segmentation. In Canadian Conf. on Computer and Robot Vision, pages 273–277.
Patras, I., Hendriks, E., and Lagendijk, R. (2001). Video segmentation by MAP labeling of watershed segments. IEEE TPAMI, 23(3):326–332.
Rozo, L., Jimenez, P., and Torras, C. (2011). Robot learning from demonstration of force-based tasks with multiple solution trajectories. In 15th Intl. Conf. on Advanced Robotics, pages 124–129.
Taylor, G. and Kleeman, L. (2002). Grasping unknown objects with a humanoid robot. In Proc. of Australasian Conf. on Robotics and Automation, pages 191–196.
Wang, C., de La Gorce, M., and Paragios, N. (2009). Segmentation, ordering and multi-object tracking using graphical models. In IEEE 12th Intl. Conf. on Computer Vision, pages 747–754.
Wang, D. (1998). Unsupervised video segmentation based on watersheds and temporal tracking. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):539–546.
JointSegmentationandTrackingofObjectSurfacesinDepthMoviesalongHuman/RobotManipulations
251