Region Extraction of Multiple Moving Objects

with Image and Depth Sequence

Katsuya Sugawara

, Ryosuke Tsuruga

, Toru Abe

and Takuo Suganuma

NTT Resonant Inc., Granparktower, 3-4-1 Shibaura, Minato-ku, Tokyo 108-0023, Japan

Cyberscience Center, Tohoku University, 2-1-1 Katahira, Aoba-ku, Sendai 980–8577, Japan

Keywords:

Region Extraction, Multiple Moving Objects, Image Sequence, Depth Sequence.

Abstract:

This paper proposes a novel method for extracting the regions of multiple moving objects with an image and

a depth sequence. In addition to image features, diverse types of features, such as depth and image-depth-

derived 3D motion, have been used in existing methods for improving the accuracy and robustness of object

region extraction. Most of the existing methods determine individual object regions according to the spatial-

temporal similarities of such features, i.e., they regard a spatial-temporal area of uniform features as a region

sequence corresponding to the same object. Consequently, the depth features in a moving object region, where

the depth varies with frames, and the motion features in a nonrigid or articulated object region, where the

motion varies with parts, cannot be effectively used for object region extraction. To deal with these difﬁculties,

our proposed method extracts the region sequences of individual moving objects according to depth feature

similarity adjusted by each object movement and motion feature similarity computed only in adjacent parts.

Through the experiments on scenes where a person moves a box, we demonstrate the effectiveness of the

proposed method in extracting the regions of multiple moving objects.

1 INTRODUCTION

Determining the regions of individual moving objects

is a requisite preprocessing step for various applica-

tions including visual surveillance, control (e.g. man-

aging something by object movements), and analysis

(e.g. diagnosing something from object movements).

Thus, many object region extraction methods have

been proposed for use in such applications. Most of

the existing methods determine individual object re-

gions in an image sequence according to the spatial

and temporal similarity of image features, i.e., these

methods regard a spatial-temporal area of uniform im-

age features as a region sequence corresponding to the

same object (Grundmann et al., 2010; Galasso et al.,

2012; Xu and Corso, 2012).

Recently, along with the popularization of image

and range sensing cameras such as Kinect (Microsoft,

2015), not only image sequences but also depth se-

quences can be easily acquired. Consequently, be-

sides image features, depth features have been used in

several methods for improving the accuracy and ro-

bustness of object region extraction (Fern

andez and

Aranda, 2000; C¸ i

gla and Alatan, 2008; Xia et al.,

2011; Abramov et al., 2012; Bergamasco et al., 2012).

As is the case in image features, the spatial-temporal

similarity of depth features is usually used for deter-

mining individual object regions. However, as shown

in Figure 1, the depth in a moving object region varies

with frames; therefore, depth features cannot be ef-

fectively used in such cases for determining a region

sequence corresponding to the same object.

t = T (frame)

3.0m

1.5m

t = T + 15

3.0m

1.5m

(a) (b)

Figure 1: Example of (a) Image sequence and (b) Depth

sequence acquired by Kinect.

Sugawara, K., Tsuruga, R., Abe, T. and Suganuma, T.

Region Extraction of Multiple Moving Objects with Image and Depth Sequence.

DOI: 10.5220/0005782402550262

In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016) - Volume 4: VISAPP, pages 255-262

ISBN: 978-989-758-175-5

255

Meanwhile, 2D motion (optical ﬂow) features are

derived from an image sequence, and 3D motion

(scene ﬂow) features are derived from an image and a

depth sequence as shown in Figure 2. The similarities

of such motion features also can be used as clues for

extracting object regions. Unlike the 2D motion fea-

tures, the 3D motion features are less subject to the

difference in the distances between a camera and ob-

jects; therefore the similarity of 3D motion features

is more suitable than that of 2D motion features for

determining a region sequence corresponding to the

same object. However, in a nonrigid or articulated

object region, motion varies with part, and then it de-

creases the effectiveness of both 2D and 3D motion

features for object region extraction.

To deal with these difﬁculties, this paper proposes

a novel method for extracting the region sequences of

multiple moving objects with an image and a depth

sequence. The proposed method employs the simi-

larities of image, depth, and image-depth-derived 3D

motion features as clues for extracting object regions.

In this method, depth feature similarity is adjusted by

each object movement for adapting to the change of

depth features in moving object regions, and 3D mo-

tion feature similarity is computed only in adjacent

parts for adapting to the nonuniformity of motion fea-

tures in nonrigid or articulated object regions.

The remainder of this paper is organized as fol-

lows. Section 2 presents the existing object region

extraction methods based on several types of features.

t = T (frame)

t = T + 1

(a)

→

(b)

→

(c)

height [m]

width [m]

depth [m]

height [m]

t = T

Figure 2: Example of (a) Image sequence, (b) Depth se-

quence, and (c) 3D motion features derived from them.

Section 3 explains the details of our proposed method

for extracting the region sequences of multiple mov-

ing objects with an image and a depth sequence. Sec-

tion 4 presents the results of object region extraction

experiments. Finally, Section 5 concludes the paper.

2 RELATED WORK

Many methods have been proposed for extracting ob-

ject regions from an image. Most of them regard

a uniform area in the image as a subregion corre-

sponding to the same object, and determine individ-

ual object regions according to the spatial similar-

ity of image features, such as intensities and colors

in the image (Comaniciu and Meer, 1999; Felzen-

szwalb and Huttenlocher, 2004; Achanta et al., 2012).

To improve the accuracy and robustness of object re-

gion extraction, diverse types of features are used to-

gether with the image features. For example, 2D mo-

tion features (C¸ i

gla and Alatan, 2008), which can be

computed from an image sequence, and depth fea-

tures (Fern

andez and Aranda, 2000; C¸ i

gla and Ala-

tan, 2008; Bergamasco et al., 2012), which can be

acquired by a range sensing camera, are widely used.

Subregions extracted in each individual image

frame are not corresponded with those in the other

frames. Thus, various methods have been proposed

for extracting a subregion sequence corresponding

to the same object in a frame sequence. Their ap-

proaches are classiﬁed into two main types. One ap-

proach performs subregion extractions in each frame,

and then carries out subregion matching between suc-

cessive frames based on the temporal similarity of

features (Xia et al., 2011; Abramov et al., 2012; Cou-

prie et al., 2013). The other approach regards the

frame sequence as 3D data, and extracts a 3D region

(volume) corresponding with the same object accord-

ing to the spatial-temporal similarity of features (De-

Menthon and Megret, 2002; Grundmann et al., 2010).

In both the approaches, to improve the accuracy and

the robustness, 2D motion features (DeMenthon and

Megret, 2002; Grundmann et al., 2010), depth fea-

tures (Xia et al., 2011; Abramov et al., 2012), and

3D motion features (Xia et al., 2011) are also used in

combination with image features.

The entire region of an object is commonly com-

posed of different subregions with dissimilar proper-

ties. For such a object region, a number of different

subregion sequences are constructed. Accordingly,

several methods have been proposed for merging sub-

region sequences and extracting the sequence of an

entire object region (Lezama et al., 2011; Trichet and

Nevatia, 2013). Those methods use mainly motion

VISAPP 2016 - International Conference on Computer Vision Theory and Applications

256

features for subregion sequence merging.

As mentioned above, in the existing methods,

spatial-temporal similarities are computed from di-

verse types of features, and employed for extracting

object region sequences. However, some types of fea-

tures cannot contribute to the object region extraction

in particular conditions. For example, the depth in a

moving object region varies with frames; therefore,

even if a series of subregions corresponds to the same

moving object, its depth features are not always tem-

porally similar under such a condition. Furthermore,

the motion in a nonrigid or articulated object region

varies with parts; therefore, although in the same ob-

ject region, the spatial similarity of motion features is

not always kept under such a condition.

3 REGION EXTRACTION OF

MULTIPLE MOVING OBJECTS

We proposes a novel method for extracting the region

sequences of multiple moving objects with an image

and a depth sequence. A brief overview of the pro-

posed method is depicted in Figure 3.

Our proposed method ﬁrstly extracts subregions

from each frame, secondly constructs subregion se-

quences through subregion matching between succes-

Input

(image and depth sequence)

t (frame)

Output

(object region sequences)

Subregion

extracting

Subregion

matching

Sequence

merging

Figure 3: Overview of the proposed method.

sive frames, and ﬁnally merges subregion sequences

into the region sequences of individual moving ob-

jects. To effectively make use of depth features

and motion features in these processes, the proposed

method employs depth feature dissimilarity adjusted

by each object movement and motion feature dissim-

ilarity computed only in adjacent parts.

3.1 Subregion Extracting

To extract subregions from each frame, our pro-

posed method uses a graph-based image segmentation

method (Felzenszwalb and Huttenlocher, 2004) with

image and depth features. Example of subregion ex-

traction from an image and a depth frame is shown

in Figure 4. In this ﬁgure, (c) shows extracted subre-

gions using (a) and (b), where each extracted subre-

gion is assigned a random color.

In graph-based image segmentation methods, an

entire image is represented by a graph G = (V, E)

with vertices v

∈V corresponding to pixels and edges

, v

) ∈ E corresponding to pairs of neighboring ver-

tices. Each edge (v

, v

) has a weight w

, v

) based

on some property of neighboring vertices v

and v

A segmentation S is a partition of V into components

(subregions) C

∈ S such that each C

corresponds to a

connected component in a graph G

= (V, E

), where

⊆ E.

The method of (Felzenszwalb and Huttenlocher,

2004) uses the dissimilarity of features between

neighboring vertices v

and v

as their edge weight

, v

). From such edge weights, the internal dif-

ference Int(C

) of a component C

⊆ V is determined

to be the largest edge weight in the minimum span-

ning tree MST (C

, E) of C

Int(C

) = max

)∈MST (C

,E)

, v

), (1)

and the difference Di f (C

, C

) between two compo-

nents C

, C

⊆ V is determined to be the minimum

edge weight connecting C

and C

Di f (C

, C

) = min

∈C

,(v

)∈E

, v

). (2)

If Di f (C

, C

) is small compared with Int(C

) and

Int(C

), then those C

and C

are merged into one

(a) (b) (c)

Figure 4: Example of (a) Image frame, (b) Depth frame,

and (c) Extracted subregions using (a) and (b).

Region Extraction of Multiple Moving Objects with Image and Depth Sequence

257

component. Through the iteration of this procedure,

an image is segmented into ﬁnal components, each of

which is an area of spatially uniform features, which

are extracted from a frame.

Our proposed method deﬁnes each edge weight

, v

) between neighboring pixels v

and v

from

the dissimilarities of their image features I (RGB in-

tensities) and depth features D by

, v

) = kI(v

) − I(v

)k + σ

kD(v

) − D(v

)k,

(3)

where σ

is a coefﬁcient for the dissimilarity of depth

features.

3.2 Subregion Matching

For constructing subregion sequences, the proposed

method carries out graph-based matching (Couprie

et al., 2013) on extracted subregions in successive

frames. Example of subregion matching is shown in

Figure 5, where random colors are assigned to each

extracted subregion in (a) and each constructed se-

quence (moving object sequence only) in (b).

In the method of (Couprie et al., 2013), two suc-

cessive frames t and t + 1 are represented by a graph

G = (V, E). A set V of vertices comprises two ver-

tex sets V (t) and V (t + 1); vertices v

∈ V (t) corre-

spond to subregions extracted in the frame t, and ver-

tices v

∈ V (t + 1) correspond to those in the frame

t + 1. Edges (v

, v

) ∈ E correspond to pairs of ver-

tices in different frames, and each (v

, v

) has a weight

, v

) associated with the dissimilarity of features

between v

and v

. Using a minimum spanning for-

est algorithm based on these edge weights w

, v

the correspondences between vertices v

∈ V (t) and

∈ V (t + 1) are determined. Through successively

applying this procedure to all frames, subregion se-

quences, each of which is a series of subregions with

temporally uniform features, are constructed.

The method of (Couprie et al., 2013) deﬁnes each

edge weight w

, v

) from the differences in shape,

position (centroid), and appearance (image) features

between subregions v

and v

. In addition to those

t = T (frame)

t = T + 15

t = T + 30

(a)

(b)

Figure 5: Example of (a) Subregions extracted in each

frame, and (b) Subregion sequences constructed from (a).

differences, for improving the accuracy and robust-

ness of subregion matching, our proposed method

also takes into account the differences in depth fea-

tures, and determines each edge weight w

, v

) by

, v

) = d

, v

)

+ σ

, v

) + σ

T d

, v

), (4)

where d

, d

, and d

represent the differences in

shape, position, appearance, and depth features, re-

spectively, while σ

and σ

T d

denote coefﬁcients for

, and d

. The feature differences in Equation (4) are

deﬁned as

, v

) =

| + |v

∩ v

, (5)

, v

) = k(C(v

) + M2(v

)) −C(v

)k, (6)

, v

) = kI(v

) − I(v

)k, (7)

, v

) = k(D(v

) + M3

)) − D(v

)k. (8)

In Equation (5), |v

|, |v

|, and |v

∩ v

| represent the

number of pixels of subregions v

, v

, and overlap-

ping area between v

and v

. In Equations (6)∼(8),

C, I, and D denote the centroid of pixels, the mean

of image features (RGB intensities), and the mean of

depth features in each subregion, respectively.

As shown in Equation (8), the proposed method

adds M3

) to the depth feature D(v

), and de-

termines the depth feature difference d

, v

) from

D(v

) + M3

) in the frame t and D(v

) in the

frame t + 1. Here, M3

) denotes the depth di-

rectional component of M3(v

), which is the median

of 3D motion (scene ﬂow) in v

. This is intended

to adapt for depth feature changes in moving object

regions and then use depth features effectively for

subregion matching. In the same way, as shown in

Equation (6), the centroid C(v

) is adjusted by adding

M2(v

), which is the median of 2D motion (optical

ﬂow) in v

, and used for determining the position fea-

ture difference d

, v

To obtain M2(v

) and M3

), the 2D motion

of each pixel in the frame t is determined from the

two successive image frames t and t + 1 (Farneb

ack,

2003). Using the 2D motion in the frame t, the me-

dian of 2D motion M2(v

) is computed for each sub-

region v

. Besides, the correspondences between the

pixels in the frame t and those in the frame t + 1 are

determined from the 2D motion in the frame t. By in-

corporating the pixel correspondences into the depth

frames t and t + 1, the 3D motion in the frame t also

can be determined. Using the 3D motion in the frame

t, the median of 3D motion M3(v

) and its depth di-

rectional components M3

) are computed for each

subregion v

VISAPP 2016 - International Conference on Computer Vision Theory and Applications

258

3.3 Sequence Merging

The proposed method merges subregion sequences

corresponding with the same object into a single ob-

ject region sequence. Example of subregion sequence

merging is shown in Figure 6, where random colors

are assigned to each subregion sequence in (a) and

each individual object region sequence in (b).

To begin with, subregions of large 3D motion fea-

tures are determined in all frames, and each sequence

including such subregions is chosen as a target sub-

region sequence S

corresponding to part of a moving

object. Our proposed method computes the dissim-

ilarity between two target subregion sequences, and

merges them into a single sequence if their dissimi-

larity is less than a given threshold. Iterating this pro-

cess for all target subregion sequences until all sim-

ilar sequences unify, our method obtains the region

sequences of multiple moving objects individually.

Let subregion sequences S

and S

consist of sub-

regions s

(t) ∈ S

ranging in frame T s

≤ t ≤ Te

and

subregions s

(t) ∈ S

ranging in frame T s

≤ t ≤ Te

respectively. Suppose that the frames of these two se-

quences S

and S

overlap each other from T s to Te as

shown in Figure 7 (a). Our proposed method deﬁnes

the dissimilarity between S

and S

DS(S

, S

) =

Te − T s + 1

∑

t=T s

ds(s

(t), s

(t)), (9)

where ds(s

(t), s

(t)) is the dissimilarity, which is de-

ﬁned from the differences of 3D motion features, be-

tween subregions s

(t) and s

(t) in the frame t. Al-

though motion features are not always spatially uni-

form in a nonrigid or articulated object region, adja-

cent pixels in the same object region are thought to be

similar to each other in their motion features as shown

in Figure 7 (b). Therefore, to improve the accuracy

and robustness of sequence merging, the proposed

method determines the dissimilarity ds(s

(t), s

(t))

from the differences of 3D motion features in adja-

cent pixels between subregions s

(t) and s

(t).

To determine adjacent pixels between subregions

(t) and s

(t), for each pixel pair (v

(t), v

(t)) made

t = T (frame)

t = T + 15

t = T + 30

(a)

(b)

Figure 6: Example of (a) Subregion sequences, and (b) In-

dividual moving object region sequences obtained from (a).

t (frame)

(t)

(t), v

(t))

Ts Te

(t)

(c)(b)

(a)

Figure 7: (a) Subregion sequences S

, S

, (b) Subregions

(t), s

(t) in the frame t, (c) Pixels v

(t), v

(t), and their

pixel pair (v

(t), v

(t)).

from pixels v

(t) ∈ s

(t) and v

(t) ∈ s

(t), its inter-

pixel 3D distance d

(t), v

(t)) is computed by

(t), v

(t)) = kP3(v

(t)) − P3(v

(t))k, (10)

where P3(v

(t)) and P3(v

(t)) are 3D positions cor-

responding to v

(t) and v

(t), respectively. Among all

pixel pairs, N pixel pairs (v

(t), v

(t)) ∈ AP

i j

(t) with

the shortest inter-pixel 3D distances d

(t), v

(t))

are chosen for the pairs of adjacent pixels as shown

in Figure 7 (c). For each adjacent pixel pair

(t), v

(t)) ∈ AP

i j

(t), the difference between its pix-

els v

(t) and v

(t) is deﬁned as

dv(v

(t), v

(t)) = d

(t), v

(t))

+ σ

(t), v

(t)), (11)

where d

(t), v

(t)) represents the difference in

3D motion features deﬁned as

(t), v

(t)) = kM3(v

(t)) − M3(v

(t))k, (12)

and σ

is a coefﬁcient for d

(t), v

(t)). The me-

dian of dv(v

(t), v

(t)) computed for (v

(t), v

(t)) ∈

i j

(t) is used as the dissimilarity ds(s

(t), s

(t)) be-

tween subregions s

(t) and s

(t) in the frame t.

4 EXPERIMENTAL RESULTS

To demonstrate the effectiveness of our proposed

method, we conducted experiments on the region se-

quence extraction of multiple moving objects.

Scenes where a person moves a box were taken by

Kinect (Microsoft, 2015), and three pairs (Scenes 1,

2, and 3) of an image and a depth sequence were

used in the experiments. Both the image and depth

sequences were 640 × 480 pixels in size and 30 fps

Region Extraction of Multiple Moving Objects with Image and Depth Sequence

259

in frame rate. Scenes 1, 2, and 3 contained 500,

400, and 400 frames, respectively. In these sequences,

pixel positions on the depth frame were converted to

those on the image frame by using Kinect for Win-

dows SDK (Microsoft, 2013).

The experiments were carried out on a PC (In-

tel Core i7-3770@3.40GHz, 8GB, Windows 7 Pro

x64), and a program was implemented with Visual

C++ 2010. Through preliminary experiments, the pa-

rameters in the proposed method were determined as

= 3.0, σ

= 10.0, σ

T d

= 30.0, and σ

= 37.5.

4.1 Experiments of Subregion Matching

Firstly, to illustrate the effectiveness of our proposed

method in subregion matching, we carried out exper-

iments on the construction of subregion sequences.

As shown in Figure 8 (a), (b), and (c), subregions

are extracted using the image and the depth sequence

of Scene 1. To these subregions extracted from each

frame, three subregion matching methods are applied,

(a) Image sequence

t = 0 (frame)

t = 15

t = 30

t = 45

(b) Depth sequence

(d) Subregion sequences constructed w/o depth features

(e) With non-adjusted depth features

(f) With adjusted depth features (proposed method)

Figure 8: Experimental results of constructing subregion se-

quences (Scene 1).

and their results are shown in Figure 8 (d), (e), and (f),

where random colors are assigned to each constructed

sequence (moving object sequence only).

Figure 8 (d) shows the subregion sequences con-

structed by a subregion matching method without

depth feature dissimilarity (with only image feature

dissimilarity), whose approach corresponds to that of

(Couprie et al., 2013). Figure 8 (e) shows the result by

a method with non-adjusted depth feature dissimilar-

ity, which corresponds to the approach of (Abramov

et al., 2012). Figure 8 (f) shows the result by the pro-

posed method, which uses adjusted depth feature dis-

similarity.

As can be seen from Figure 8 (d), some parts

of the background are mistakenly regarded as subre-

gions of the person or box by the subregion matching

method without depth feature dissimilarity. Conse-

quently, subregions of the person’s head and the back-

ground are incorporated into the light blue sequence,

and subregions of the box and the background are in-

corporated into the dark green sequence. Compared to

this, as shown in Figure 8 (e), such inaccurate corre-

spondences between subregions can be reduced by us-

ing depth feature dissimilarity. However, subregions

of the box in t = 0, 15 and t = 30, 45 are assigned dif-

ferent colors. This is because, subregions correspond-

ing to the same object are regarded as corresponding

to different objects according to the change of their

depth features.

In contrast to those results, as shown in Fig-

ure 8 (f), the proposed method constructs three differ-

ent sequences from subregions of the person’s head,

the person’s body, and the box individually. These

experimental results indicate the effectiveness of our

proposed method, which uses the temporal dissimi-

larity of depth features adjusted by each object move-

ment, in subregion matching.

(a) Object region sequences using mode of 3D motion fea-

tures in each subregion

t = 0 (frame)

t = 15

t = 30

t = 45

(b) Using 3D motion features in adjacent parts (proposed

method)

Figure 9: Experimental results of merging subregion se-

quences (Scene 1).

VISAPP 2016 - International Conference on Computer Vision Theory and Applications

260

4.2 Experiments of Sequence Merging

Secondly, to illustrate the effectiveness of our pro-

posed method in subregion sequence merging, we car-

ried out experiments on the extraction of individual

object region sequences.

From Scenes 1, 2, and 3, subregion sequences

are constructed by the proposed method with adjusted

depth feature dissimilarity as shown in Figures 8 (f),

10 (d), and 11 (d). To these subregion sequences, two

sequence merging methods are applied, and their re-

sults are shown in Figures 9 (a), (b), 10 (e), (f), 11 (e),

and (f), where random colors are assigned to each in-

dividual object region sequence.

Figures 9 (a), 10 (e), and 11 (e) show the region

sequences extracted by a sequence merging method

which computes the mode of 3D motion features in

each subregion and uses the dissimilarity of 3D mo-

tion feature modes between subregions. This ap-

(a) Image sequence

t = 0 (frame)

t = 30

t = 60

t = 90

(b) Depth sequence

(d) Subregion sequences constructed with adjusted depth

feature

(e) Object region sequences merged using mode of 3D mo-

tion features in each subregion

(f) Using 3D motion features in adjacent parts (proposed

method)

Figure 10: Experimental results of merging subregion se-

quences (Scene 2).

proach corresponds to that of (Trichet and Nevatia,

2013). As can be seen from those results, although the

region sequences of a person and a box are extracted

separately in all scenes, person’s head and body are

also extracted as different region sequences because

the movement of the head is substantially different

from that of the body.

Figures 9 (b), 10 (f), and 11 (f) show the results

by the proposed method using 3D motion feature dis-

similarity computed only in adjacent parts. Almost

the entire region of the person is correctly extracted as

a single region sequence in all scenes. These exper-

imental results indicate the effectiveness of our pro-

posed method, which uses the spatial dissimilarity of

3D motion features computed only in adjacent parts,

in subregion sequence merging.

(a) Image sequence

t = 0 (frame)

t = 30

t = 60

t = 90

(b) Depth sequence

(d) Subregion sequences constructed with adjusted depth

feature

(e) Object region sequences merged using mode of 3D mo-

tion features in each subregion

(f) Using 3D motion features in adjacent parts (proposed

method)

Figure 11: Experimental results of merging subregion se-

quences (Scene 3).

Region Extraction of Multiple Moving Objects with Image and Depth Sequence

261

5 CONCLUSIONS

In this paper, we have proposed a method for extract-

ing the region sequences of multiple objects using an

image and a depth sequence. The proposed method

extracts subregions from each frame, constructs sub-

region sequences through subregion matching be-

tween successive frames, and merges subregion se-

quences into the region sequences of individual ob-

jects. To effectively make use of depth features and

3D motion features in these processes, our proposed

method employs depth feature similarity adjusted by

each object movement and 3D motion feature similar-

ity computed only in adjacent parts. Through the ex-

periments, we demonstrated the effectiveness of our

proposed method in extracting the region sequences

of multiple moving objects, where the depth varies

with frames, and articulated objects, where the mo-

tion varies with parts.

Currently, our proposed method extracts object re-

gion sequences from a whole input sequence (i.e. it

cannot process every input frame serially), and the

average processing time of every frame is more than

two seconds. In future work, we would like to in-

vestigate extending our method not only to improve

the accuracy of object region sequence extraction but

also to process every set of a few input frames or ev-

ery input frame serially in real time. Furthermore,

we plan to conduct quantitative evaluation of the pro-

posed method for various scenes.

ACKNOWLEDGEMENT

This work was supported in part by the Japan Society

for the Promotion of Science (JSPS) under a Grant-

in-Aid for Scientiﬁc Research (C) (No.15K00171).

REFERENCES

Abramov, A., Pauwels, K., Papon, J., W

org

otter, F., and

Dellen, B. (2012). Depth-supported real-time video

segmentation with the Kinect. In Proc. IEEE Work-

shop Appl. Comput. Vision, pages 457–464.

Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and

Susstrunk, S. (2012). SLIC superpixels compared to

state-of-the-art superpixel methods. IEEE Trans. Pat-

tern Anal. Machine Intell., 34(11):2274–2282.

Bergamasco, F., Albarelli, A., Torsello, A., Favaro, M., and

Zanuttigh, P. (2012). Pairwise similarities for scene

segmentation combining color and depth data. In

Proc. 21st Int. Conf. Pattern Recognit., pages 3565–

3568.

C¸ i

gla, C. and Alatan, A. A. (2008). Object segmentation

in multi-view video via color, depth and motion cues.

In Proc. IEEE Int. Conf. Image Process., pages 2724–

2727.

Comaniciu, D. and Meer, P. (1999). Mean shift analysis

and applications. In Proc. Int. Conf. Comput. Vision,

volume 2, pages 1197–2003.

Couprie, C., Farabet, C., LeCun, Y., and Najman, L. (2013).

Causal graph-based video segmentation. In Proc.

IEEE Int. Conf. Image Process., pages 4249–4253.

DeMenthon, D. and Megret, R. (2002). Spatio-temporal

segmentation of video by hierarchical mean shift anal-

ysis. Technical Report TR-4388, Center for Automat.

Res., U. of Md, College Park.

Farneb

ack, G. (2003). Two-frame motion estimation based

on polynomial expansion. In Proc. Scand. Conf. Im-

age Anal., pages 363–370.

Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efﬁcient

graph-based image segmentation. Int. J. Comput. Vi-

sion, 59(2):167–181.

Fern

andez, J. and Aranda, J. (2000). Image segmentation

combining region depth and object features. In Proc.

15th Int. Conf. Pattern Recognit., volume 1, pages

618–621.

Galasso, F., Cipolla, R., and Schiele, B. (2012). Video seg-

mentation with superpixels. In Proc. 11th Asian Conf.

Comput. Vision, volume 1, pages 760–774.

Grundmann, M., Kwatra, V., Han, M., and Essa, I. (2010).

Efﬁcient hierarchical graph-based video segmenta-

tion. In Proc. IEEE Conf. Comput. Vision Pattern

Recognit., pages 2141–2148.

Lezama, J., Alahari, K., Sivic, J., and Laptev, I. (2011).

Track to the future: Spatio-temporal video segmenta-

tion with long-range motion cues. In Proc. IEEE Conf.

Comput. Vision Pattern Recognit., pages 3369–3376.

Microsoft (2013). Kinect for Windows SDK v1.8.

http://www.microsoft.com/en-us/download/

details.aspx?id=40278. Online; accessed 1–Sep.–

2015.

Microsoft (2015). Kinect – Windows app development.

https://dev.windows.com/en-us/kinect. Online; ac-

cessed 1–Sep.–2015.

Trichet, R. and Nevatia, R. (2013). Video segmentation with

spatio-temporal tubes. In Proc. IEEE Int. Conf. Adv.

Video Signal Based Surv., pages 330–335.

Xia, L., Chen, C.-C., and Aggarwal, J. K. (2011). Human

detection using depth information by Kinect. In Proc.

IEEE Conf. Comput. Vision Pattern Recognit. Work-

shops, pages 15–22.

Xu, C. and Corso, J. J. (2012). Evaluation of super-voxel

methods for early video processing. In Proc. IEEE

Conf. Comput. Vision Pattern Recognit., pages 1202–

1209.

VISAPP 2016 - International Conference on Computer Vision Theory and Applications

262