5 CONCLUSIONS
In this paper, we have proposed a method for extracting the region sequences of multiple objects using image and depth sequences. The proposed method extracts subregions from each frame, constructs subregion sequences by matching subregions between successive frames, and merges the subregion sequences into the region sequences of individual objects. To make effective use of depth features and 3D motion features in these processes, the method employs a depth feature similarity adjusted for the movement of each object and a 3D motion feature similarity computed only for adjacent parts. Through experiments, we demonstrated the effectiveness of the proposed method in extracting the region sequences of multiple moving objects, whose depth varies from frame to frame, and of articulated objects, whose motion varies from part to part.
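The three stages summarized above can be illustrated with a small toy sketch. It is not the implementation evaluated in this paper: each subregion is reduced to a 3D centroid, matching between successive frames is a greedy nearest-centroid search, and two subregion sequences are merged only when they are adjacent in some shared frame and their mean 3D displacements agree. The function names and the thresholds `max_dist`, `motion_tol`, and `adjacency_dist` are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def link_subregions(frames, max_dist):
    """Chain subregions across successive frames (toy greedy matching).

    frames: list of frames, each a list of (x, y, z) subregion centroids.
    Returns sequences, each a list of (frame_index, centroid) pairs.
    """
    sequences = [[(0, c)] for c in frames[0]]
    active = list(range(len(sequences)))  # sequences alive in the previous frame
    for t in range(1, len(frames)):
        unused = list(frames[t])
        next_active = []
        for s in active:
            if not unused:
                break
            last = sequences[s][-1][1]
            best = min(unused, key=lambda c: dist(last, c))
            if dist(last, best) <= max_dist:
                sequences[s].append((t, best))
                unused.remove(best)
                next_active.append(s)
        for c in unused:  # unmatched subregions start new sequences
            sequences.append([(t, c)])
            next_active.append(len(sequences) - 1)
        active = next_active
    return sequences


def mean_motion(seq):
    """Average frame-to-frame 3D displacement of one subregion sequence."""
    steps = [tuple(b - a for a, b in zip(p, q))
             for (_, p), (_, q) in zip(seq, seq[1:])]
    if not steps:
        return (0.0, 0.0, 0.0)
    return tuple(sum(s[i] for s in steps) / len(steps) for i in range(3))


def merge_sequences(sequences, motion_tol, adjacency_dist):
    """Group sequences that are adjacent and move alike (union-find)."""
    parent = list(range(len(sequences)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(sequences)), 2):
        pos_i, pos_j = dict(sequences[i]), dict(sequences[j])
        shared = pos_i.keys() & pos_j.keys()
        # Motion is compared only for sequences that are adjacent in some frame.
        adjacent = any(dist(pos_i[t], pos_j[t]) <= adjacency_dist for t in shared)
        if adjacent and dist(mean_motion(sequences[i]),
                             mean_motion(sequences[j])) <= motion_tol:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(sequences)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two parts of one moving object, a static neighbor, and a distant object.
frames = [
    [(0.0, 0.0, 5.0), (2.0, 0.0, 5.0), (4.0, 0.0, 5.0), (10.0, 0.0, 8.0)],
    [(0.5, 0.0, 5.0), (2.5, 0.0, 5.0), (4.0, 0.0, 5.0), (10.0, 0.5, 8.0)],
    [(1.0, 0.0, 5.0), (3.0, 0.0, 5.0), (4.0, 0.0, 5.0), (10.0, 1.0, 8.0)],
]
sequences = link_subregions(frames, max_dist=1.0)
print(merge_sequences(sequences, motion_tol=0.1, adjacency_dist=2.5))
# [[0, 1], [2], [3]]
```

In this example the static subregion (index 2) lies next to the moving object but is kept separate because its motion differs, while the distant object (index 3) is never compared on motion at all. A real system would replace the centroid distance with the depth and 3D motion feature similarities described in the paper.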
Currently, the proposed method extracts object region sequences from the whole input sequence (i.e., it cannot process input frames serially), and the average processing time per frame is more than two seconds. In future work, we would like to extend the method not only to improve the accuracy of object region sequence extraction but also to process the input in real time, either in sets of a few frames or frame by frame. Furthermore, we plan to conduct a quantitative evaluation of the proposed method on various scenes.
ACKNOWLEDGEMENT
This work was supported in part by the Japan Society for the Promotion of Science (JSPS) under a Grant-in-Aid for Scientific Research (C) (No. 15K00171).
VISAPP 2016 - International Conference on Computer Vision Theory and Applications