dure are currently conducted using hard thresholds,
which we plan to make adaptive in the future. We also
intend to make our tracking algorithm more robust to
occlusions and noise by using shape information from
all previous time steps. One way to achieve this would
be to build dynamic shape models (Cremers, 2006).
We provided a quantitative evaluation of the
method using human-annotated ground truth. Obtaining
ground truth for video is, however, a very tedious
procedure and thus limits the extent of the evaluation.
Since no implementation of a similar algorithm performing
joint segmentation and tracking in depth space is
available, we compared our method to a standard
color-video segmentation algorithm (Grundmann et al.,
2010). Our method outperformed color-video
segmentation on the videos analyzed.
However, this comparison may not be entirely fair,
since we use a different feature, i.e., depth rather
than color.
Currently, the method needs ∼1.92 seconds to
process one frame of size 430 × 282 pixels in Matlab
on an Intel 3.3 GHz processor. With an efficient C/C++
implementation of the method, we expect to achieve
real-time performance, which is one of our next goals.
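To put the reported runtime in perspective, the short calculation below converts 1.92 seconds per frame into a frame rate and estimates the speedup a faster implementation would have to deliver. The target of 30 frames per second is an assumption made here for illustration only (the text does not define what real-time means), and the function names are hypothetical.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    """Convert a per-frame processing time into a frame rate."""
    return 1.0 / seconds_per_frame


def required_speedup(seconds_per_frame: float, target_fps: float) -> float:
    """Factor by which processing must accelerate to reach target_fps."""
    return seconds_per_frame * target_fps


# Runtime reported in the text for one 430 x 282 frame.
SECONDS_PER_FRAME = 1.92
# Assumption for illustration: real-time is taken to mean 30 frames per second.
TARGET_FPS = 30.0

current_fps = frames_per_second(SECONDS_PER_FRAME)
speedup = required_speedup(SECONDS_PER_FRAME, TARGET_FPS)

print(f"current throughput: {current_fps:.2f} frames/s")
print(f"speedup needed for {TARGET_FPS:.0f} frames/s: {speedup:.1f}x")
```

Under this assumption the method currently runs at roughly 0.5 frames per second, so a speedup of about 58 times would be needed. A lower target frame rate would reduce that factor proportionally.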
ACKNOWLEDGEMENTS
This research is partially funded by the EU
projects GARNICS (FP7-247947) and IntellAct
(FP7-269959), and the Grup consolidat 2009
SGR155. B. Dellen acknowledges support from the
Spanish Ministry of Science and Innovation through
a Ramon y Cajal program.
REFERENCES
Abramov, A., Aksoy, E. E., Dörr, J., Wörgötter, F., Pauwels,
K., and Dellen, B. (2010). 3D semantic representation of
actions from efficient stereo-image-sequence segmentation
on GPUs. In 5th Intl. Symp. 3D Data Processing,
Visualization and Transmission.
Agostini, A., Torras, C., and Wörgötter, F. (2011).
Integrating task planning and interactive learning for robots
to work in human environments. In IJCAI, Barcelona,
pages 2386–2391.
Aksoy, E. E., Abramov, A., Dörr, J., Ning, K., Dellen, B.,
and Wörgötter, F. (2011). Learning the semantics of
object-action relations by observation. Int. J. Rob. Res.,
30(10):1229–1249.
Arbelaez, P., Maire, M., Fowlkes, C., and Malik, J. (2009).
From contours to regions: An empirical evaluation. In
CVPR, pages 2294–2301.
Cremers, D. (2006). Dynamical statistical shape priors for
level set-based tracking. IEEE TPAMI, 28(8):1262–1273.
Dellen, B., Alenya, G., Foix, S., and Torras, C. (2011). Seg-
menting color images into surface patches by exploiting
sparse depth data. In IEEE Workshop on Applications of
Computer Vision, pages 591–598.
Deng, Y. and Manjunath, B. (2001). Unsupervised segmen-
tation of color-texture regions in images and video. IEEE
TPAMI, 23(8):800–810.
Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient
graph-based image segmentation. Intl. J. of Computer
Vision, 59(2):167–181.
Grundmann, M., Kwatra, V., Han, M., and Essa, I. (2010).
Efficient hierarchical graph-based video segmentation.
In CVPR, pages 2141–2148.
Hofman, I. and Jarvis, R. (2000). Object recognition via
attributed graph matching. In Proc. Australian Conf. on
Robotics and Automation, Melbourne, Australia.
Kinect (2010). Kinect for Xbox 360.
http://www.xbox.com/en-US/kinect.
Kragic, D. (2001). Visual Servoing for Manipulation: Ro-
bustness and Integration Issues. PhD thesis, Computa-
tional Vision and Active Perception Laboratory, Royal
Institute of Technology, Stockholm, Sweden.
Kruskal, J. B. (1956). On the shortest spanning subtree of
a graph and the traveling salesman problem. In Proc.
of the American Mathematical Society.
Lopez-Mendez, A., Alcoverro, M., Pardas, M., and Casas,
J. (2011). Real-time upper body tracking with online ini-
tialization using a range sensor. In IEEE Intl. Conf. on
Computer Vision Workshops, pages 391–398.
Parvizi, E. and Wu, Q. (2008). Multiple object tracking
based on adaptive depth segmentation. In Canadian
Conf. on Computer and Robot Vision, pages 273–277.
Patras, I., Hendriks, E., and Lagendijk, R. (2001). Video
segmentation by MAP labeling of watershed segments.
IEEE TPAMI, 23(3):326–332.
Rozo, L., Jimenez, P., and Torras, C. (2011). Robot learn-
ing from demonstration of force-based tasks with multi-
ple solution trajectories. In 15th Intl. Conf. on Advanced
Robotics, pages 124–129.
Taylor, G. and Kleeman, L. (2002). Grasping unknown ob-
jects with a humanoid robot. In Proc. of Australasian
Conf. on Robotics and Automation, pages 191–196.
Wang, C., de La Gorce, M., and Paragios, N. (2009).
Segmentation, ordering and multi-object tracking using
graphical models. In IEEE 12th Intl. Conf. on Computer
Vision, pages 747–754.
Wang, D. (1998). Unsupervised video segmentation based
on watersheds and temporal tracking. IEEE Transactions
on Circuits and Systems for Video Technology, 8(5):539–546.