Low Cost Video Animation of People using a RGBD Sensor
Cathrine J. Thomsen, Thomas B. Moeslund and Troels H. P. Jensen
Analysis of People Lab, Aalborg University, Rendsburggade 14, 9000, Aalborg, Denmark
Keywords:
Kinect Sensor, Video Animation, Performance Capture, Segmentation, Optical Flow.
Abstract:
This paper investigates a low-cost solution for video animation using a Kinect v2 for
Windows, where skeleton, depth and colour data are acquired for three different characters. Segmentation of
the colour and depth frames was based on establishing the range of the person in the depth frame using the skeleton
information, then fitting a plane to the floor and excluding points close to it. Transitions between motions
were based on minimizing the Euclidean distance between all feasible transition frames, from which a source
and a target frame were found. Intermediate frames were created to make the transitions seamless, where new
poses were found by moving pixels along the optical flow between the transitioning frames. A user study
verified that the proposed animation has a higher rate of preference and perceived realism than both
no animation and animation using alpha blending.
1 INTRODUCTION
Animating people using marker-based motion capture
is widely used and is perhaps best known
from the Gollum character in The Lord of the Rings,
where the actor wears a tight suit with reflective markers.
Tracking each of these markers in a multiple-view
studio while the actor performs different motions
is then used to control the joints of an animated
character performing the same motions. Not
only does the marker-based approach require a significant
setup time, it also takes the actor or actress
out of their natural environment and lacks surface details,
such as the dynamics of hair and clothing.
Instead, marker-less animation has been introduced,
where different motion sequences are captured
in a multiple-view studio, as in (Xu et al., 2011) and
(De Aguiar et al., 2008), using HD cameras. Using a
multiple-view stereo approach, 3D meshes are then
reconstructed independently for each frame, which
means that both the shape and appearance of the captured
person are preserved.
Recent work within reconstruction and modelling
of people using multiple cameras can result in realistic
yet expensive video animation. The focus of
this paper is to investigate a low-cost solution using
a consumer video and depth camera, in this case a
Kinect v2 for Windows, capturing sequences to make
a realistic animation of a person, building on
previous work within 4D animation. The main contributions
of this work are the segmentation and animation
methods in this low-cost solution.
The next section presents the related work within
this topic, and the rest of the paper is divided into five
sections. The first describes how the datasets
are captured; the segmentation and the animation
methods are then described. The paper is finalized by a
user study and a conclusion.
2 RELATED WORK
By capturing various motions of a person, and thus building a
library of different motions, new animations can
be created by combining and transitioning between
related motions in the library. Using a motion
graph (Kovar et al., 2002), a user has the possibility
to control the different movements that a character
should perform. The motion graph is an animation
synthesis structure that encodes feasible transition
points between chosen motions, where each state in the
graph corresponds to a small clip in a video library of that
particular motion.
In order to establish the best transition between
two motions, a similarity measure between the feasible
transition frames is needed. According to a
study of different similarity measures in (Huang et al.,
2010), the shape histogram is the similarity measure
that proved to give the best performance across different
people and motions, and it was also used in (Budd et al.,
2013). The shape histogram subdivides a spherical
coordinate system into radial and angular bins, where
each bin represents the part of the mesh in that specific area. The
similarity measure between two meshes compares all
rotations of one mesh around its centroid and takes the rotation for which
the L2 distance is minimal, thus being rotation invariant.
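As a rough illustration of the idea (not the authors' implementation), the following Python sketch bins a centred 3D point cloud into radial and angular bins; the function name and bin counts are assumptions. Rotation-invariant comparison would then take the minimum L2 distance over cyclic shifts of the azimuth bins.

```python
import numpy as np

def shape_histogram(points, n_r=5, n_theta=8, n_phi=4, r_max=1.0):
    """Minimal shape histogram sketch: bin 3D points (centred on their
    centroid) into radial and angular (spherical) bins."""
    p = points - points.mean(axis=0)              # centre on the centroid
    r = np.linalg.norm(p, axis=1)
    theta = np.arctan2(p[:, 1], p[:, 0])          # azimuth in [-pi, pi]
    phi = np.arccos(np.clip(p[:, 2] / np.maximum(r, 1e-9), -1, 1))
    # Digitise each spherical coordinate into its bin index.
    ri = np.clip((r / r_max * n_r).astype(int), 0, n_r - 1)
    ti = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    pi_ = np.clip((phi / np.pi * n_phi).astype(int), 0, n_phi - 1)
    hist = np.zeros((n_r, n_theta, n_phi))
    np.add.at(hist, (ri, ti, pi_), 1)             # scatter-add point counts
    return hist / max(len(points), 1)             # normalise by point count
```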
By having a reference mesh, it is possible to compare
how well the surface and texture of a character are
reconstructed (Casas et al., 2013). But
whether the animations appear realistic
depends on how they are perceived. This means it is hard
to establish a measure for the realism of the animated
videos without testing how people perceive them.
Hence, the approach in (Casas et al., 2014), where a
user study was conducted to test the realism of their
videos, is an appropriate way of assessing an
animation procedure.
3 DATA CAPTURE
A dataset consisting of four characters performing various
motions is used in this work; the characters are shown
in figure 1 and figure 2, where the character in figure 1 is used for
evaluation and the others are used for testing. The
data was captured at 30 fps using the Kinect v2 for
Windows sensor, which provides a colour data stream
with a resolution of 1920x1080, a depth data stream
with a resolution of 512x424, and the skeletal
tracking from the Windows SDK v2.0, consisting of
25 joints per person.
Figure 1: Colour frame 100 of the character used for evaluation,
alternately waving the left and right hand.
Having one or multiple video clips for one particular
motion, a motion graph as in (Starck and Hilton,
2007) is constructed, giving a database of
motions for each character. This motion graph is
then used for controlling the transitions between the
motions of each character by keeping track of the
current state and the possible transitions. Since the
captured datasets in this work consist of a full video
of a person performing various motions, the motion
sequences are divided and labelled manually by determining
the end and start frame in each of the videos.
The video is divided in such a manner that the
motion sequences do not overlap.

Figure 2: Colour frame 100 of the characters used for testing.
From the top, character 1 performs dance motions,
character 2 performs stretches, and character 3
performs goalkeeper motions.
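A minimal sketch of such a motion graph, assuming each motion is stored as a manually labelled, non-overlapping frame range (all names and frame numbers below are illustrative, not from the paper):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Motion:
    name: str
    start_frame: int   # first frame of the clip in the full capture
    end_frame: int     # last frame of the clip (inclusive)

@dataclass
class MotionGraph:
    motions: Dict[str, Motion] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)  # feasible transitions
    current: Optional[str] = None                              # current state

    def add_motion(self, motion: Motion, successors=()):
        self.motions[motion.name] = motion
        self.edges[motion.name] = list(successors)

    def possible_transitions(self) -> List[Motion]:
        """Motions reachable from the current state."""
        return [self.motions[n] for n in self.edges.get(self.current, [])]

# Manually labelled, non-overlapping clips from one capture.
graph = MotionGraph()
graph.add_motion(Motion("wave_left", 0, 120), successors=["wave_right"])
graph.add_motion(Motion("wave_right", 121, 260), successors=["wave_left"])
graph.current = "wave_left"
```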
4 SEGMENTATION
Since the background is of no interest when creating
an animation of a person, a segmentation must be performed.
The segmentation can be done using either the colour
or the depth frames. If the colour frames are used, illumination
changes between frames as well as background
clutter must be taken into account. Another
solution is to use the depth frames instead, as in
(Fechteler et al., 2014), which avoids the clutter
and suffers less from illumination issues. Since the camera
is static, a depth segmentation could be done by
modelling the background of an empty scene. In
this case, however, it does not provide a good enough segmentation,
and therefore another approach must be used.

Figure 3: Segmented depth frame using the information from the skeletal joints on the left, and the resulting segmented depth
and colour frame.
4.1 Plane Segmentation
With inspiration from the ground plane segmentation
in (Møgelmose et al., 2015), a similar depth-based
approach for the segmentation is proposed. Instead
of establishing a bounding box around the person by
e.g. HOG detectors on RGB images, a simpler and
faster approach using only the depth image is proposed,
since the skeleton of the person is available. The
range of the person is established from
the minimum and maximum positions of the skeleton joints
in each direction in the depth image,
giving the output shown in figure 3.
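A possible form of this step, assuming the skeleton joints have already been mapped to depth-image pixel coordinates (the margin and function names are assumptions):

```python
import numpy as np

def person_range_mask(depth, joints_xy, margin=15):
    """Mask a depth frame to the bounding range of the skeleton joints.

    depth:     HxW depth image (e.g. 512x424 from the Kinect v2)
    joints_xy: (25, 2) array of joint positions in depth-image pixels
    margin:    extra pixels around the joints so the whole body is kept
    """
    h, w = depth.shape
    x0 = max(int(joints_xy[:, 0].min()) - margin, 0)
    x1 = min(int(joints_xy[:, 0].max()) + margin, w - 1)
    y0 = max(int(joints_xy[:, 1].min()) - margin, 0)
    y1 = min(int(joints_xy[:, 1].max()) + margin, h - 1)
    mask = np.zeros_like(depth, dtype=bool)
    mask[y0:y1 + 1, x0:x1 + 1] = True
    return np.where(mask, depth, 0)   # zero out everything outside the range
```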
Having the range of the person, a plane representing the
floor is created from the span between the minimum and maximum
values in the x-direction and the minimum value
in the z-direction, as illustrated in figure 4.
Figure 4: Illustration of the extraction of the floor plane.
The distance from the plane, α, with normal
vector (a, b, c), to a point (x₀, y₀, z₀) is evaluated
from equation 1, where the points close to the plane
are excluded so that only the person remains:

$$\operatorname{dist}(\alpha, \text{point}) = \frac{|a(x - x_0) + b(y - y_0) + c(z - z_0)|}{\sqrt{a^2 + b^2 + c^2}} \qquad (1)$$

where (x, y, z) is a point on the plane.
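A minimal sketch of this exclusion step, assuming the floor plane is fitted from three points on the floor and that depth pixels have been back-projected to 3D points (the 2 cm threshold is an assumption):

```python
import numpy as np

def fit_floor_plane(p1, p2, p3):
    """Plane through three 3D floor points; returns (unit normal,
    point on plane)."""
    n = np.cross(p2 - p1, p3 - p1)
    return n / np.linalg.norm(n), p1

def remove_floor(points, normal, plane_point, threshold=0.02):
    """Keep only 3D points farther than `threshold` (metres, assumed)
    from the plane. With a unit normal, the denominator of equation 1
    is 1, so the distance is just |n . (p - p0)|."""
    dist = np.abs((points - plane_point) @ normal)
    return points[dist > threshold]
```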
Since small pixel changes can occur between
depth frames, an average of planes is used instead of
calculating a new plane for each frame. Experimental
results showed that using 20 frames was sufficient to
construct an average plane. Additional noise left in
the frames was removed by preserving only the biggest
BLOB found using a 4-connectivity connected component
analysis.
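One way to implement the largest-BLOB step, assuming a binary person mask; note that connectedComponentsWithStats requires OpenCV 3.0 or newer, while the paper otherwise uses OpenCV 2.4.10:

```python
import cv2
import numpy as np

def keep_largest_blob(mask):
    """Keep only the largest 4-connected component of a binary mask."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(
        mask.astype(np.uint8), connectivity=4)
    if n <= 1:                 # only background found, nothing to prune
        return mask
    # Row 0 of stats is the background; pick the largest remaining area.
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return (labels == largest).astype(np.uint8)
```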
The final segmentation is shown in figure 3. Here,
a boolean bitwise AND operation is applied between the
segmented output frame and the original depth frame.
This depth frame is then further used for mapping the
colour frame onto the depth frame. Thereby, the resulting
segmented colour frame has the same resolution
as the depth frame, 512x424.
5 TRANSITIONING BETWEEN MOTIONS
In order to create an animated video where a segmented
person shifts between different motions, good
transitions between the motions need to be determined.
The three modalities captured with the Kinect v2
for Windows (skeleton, depth and colour) are used for
creating a similarity measure. This similarity measure
is used for determining good transitions between
different frames in the video, where shape and appearance
match well.
For each modality, the similarity is calculated as
the Euclidean distance between two feature vectors.
For the colour and depth data, the feature vectors
correspond to the unravelled pixels of a frame,
whereas the skeleton's feature vector corresponds to
the locations of the 25 joints.
After individual normalization of each similarity
measure, by dividing all values by the maximum value,
they are combined into one similarity measure
with weights α, β and γ:

$$S = \alpha S_{\text{skel}} + \beta S_{\text{depth}} + \gamma S_{\text{colour}}, \qquad \alpha, \beta, \gamma \in [0, 1] \qquad (2)$$
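A minimal sketch of equation 2, assuming one pairwise distance matrix per modality has already been computed over all feasible transition frames (function names are assumptions):

```python
import numpy as np

def modality_similarity(f1, f2):
    """Euclidean distance between two feature vectors (unravelled
    pixels for depth/colour, stacked joint positions for the skeleton)."""
    return np.linalg.norm(f1.ravel() - f2.ravel())

def combined_similarity(s_skel, s_depth, s_colour, alpha, beta, gamma):
    """Equation 2: weighted sum of the per-modality similarity matrices,
    each normalised by its maximum value first."""
    norm = lambda s: s / s.max() if s.max() > 0 else s
    return alpha * norm(s_skel) + beta * norm(s_depth) + gamma * norm(s_colour)
```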
When determining a possible transition between
two motions in a motion graph, a source and a target
motion are established. Minimizing the similarity
measure, S, between the two motions then finds
the two frames giving the best transition, as illustrated
in figure 5.

Figure 5: Illustration of searching for the best transition
between two motions, where the similarity measure is minimized
over all frames.
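With the combined matrix S in hand, the search itself reduces to an argmin over all frame pairs, e.g.:

```python
import numpy as np

def best_transition(S):
    """Find the (source frame, target frame) pair minimising the combined
    similarity matrix S from equation 2 (rows: frames of the source
    motion, columns: frames of the target motion)."""
    i, j = np.unravel_index(np.argmin(S), S.shape)
    return i, j, S[i, j]
```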
If multiple motions are added to a video, the target
motion becomes the new source motion, and,
going through the motion graph and the database of
motions, a new set of one or more possible target motions
becomes available. Finding the best transition to the
next motion is then done in the same way, as illustrated in figure
6, where all frames in the dark red bar are part of the
resulting video.
Figure 6: Illustration of searching for the best transitions
through multiple motions and the resulting video sequence
marked with dark red.
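A sketch of this chaining, reusing the MotionGraph structure from section 3 and a hypothetical similarity(src, tgt) helper that returns the best (source frame, target frame, score) between two motions; everything here is illustrative:

```python
def build_sequence(graph, similarity, start_motion, n_transitions):
    """Walk the motion graph, at each step picking the successor motion
    and frame pair with the lowest combined similarity S."""
    sequence = [(start_motion, 0)]        # (motion name, entry frame)
    current = start_motion
    for _ in range(n_transitions):
        candidates = graph.edges.get(current, [])
        if not candidates:
            break
        # Best transition over all feasible target motions.
        best = min((similarity(current, nxt) + (nxt,) for nxt in candidates),
                   key=lambda t: t[2])
        src_frame, dst_frame, _, nxt = best
        sequence.append((nxt, dst_frame))
        current = nxt                      # target becomes the new source
    return sequence
```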
5.1 Smoothing Transitions
Since the source and target frames can differ in
shape and appearance, transitioning directly between
them creates a jump in the resulting video,
which is unwanted. An example is given in figure 7.

Figure 7: A source and target frame for the best transition
between two motions, where α = β = 1 and γ = 0.

Smoothing the transitions is therefore needed,
which can be done by creating new intermediate
frames between the source and target frames. Creating
intermediate frames using alpha blending results
in ghosting (Casas et al., 2014) when the shapes
are not similar, as shown in the top of figure 8, which
is also unwanted.
Instead, the dense optical flow between the source
and target frame is estimated using the implementation
of Farnebäck's optical flow (Farnebäck, 2003)
in OpenCV 2.4.10. Each pixel is then
gradually moved along its displacement vector between
the source and target frame to perform the animation,
as shown in figure 9.
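A rough sketch of this forward warping with OpenCV; the Farneback parameters are generic guesses, not the paper's settings, and the signature shown is the OpenCV 3+ Python binding:

```python
import cv2
import numpy as np

def intermediate_frame(src, tgt, t):
    """Warp `src` a fraction t in [0, 1] of the way towards `tgt` by
    forward-mapping pixels along the dense Farneback flow."""
    g1 = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(tgt, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g1, g2, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = g1.shape
    out = np.zeros_like(src)
    ys, xs = np.mgrid[0:h, 0:w]
    # Zeroth-order (rounded) forward mapping along the scaled flow.
    xd = np.clip(np.round(xs + t * flow[..., 0]).astype(int), 0, w - 1)
    yd = np.clip(np.round(ys + t * flow[..., 1]).astype(int), 0, h - 1)
    # Colliding destinations overwrite; unmapped pixels stay empty (holes).
    out[yd, xd] = src[ys, xs]
    return out
```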
Due to rounding errors in the zeroth-order
interpolation of the flow vectors, holes appear in the
resulting frame. Holes also appear when a pixel is
not the destination of any flow vector from another
pixel. This makes sense when, e.g., moving the
arm, where the previous position of the arm should
be empty in the new frame, but it can also happen
inside the body, which must be handled. One approach
to avoid this would be backward mapping using
the inverse transformation; however, if a pixel is
not the destination of any flow vector, it has no
transformation and thus no inverse transformation.

Therefore, the empty pixels that need to be filled
after the forward mapping are found and filled by evaluating the
surrounding pixels in the output frame via a median
filter. The resulting animation compared to the alpha
blending is shown in figure 8.
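A possible form of this hole filling, assuming a boolean mask of written pixels was recorded during the forward mapping (function and argument names are assumptions):

```python
import cv2
import numpy as np

def fill_holes(frame, written_mask, ksize=3):
    """Fill pixels that received no value during forward mapping with
    the median of their neighbourhood (frame assumed 8-bit BGR)."""
    filled = cv2.medianBlur(frame, ksize)
    holes = ~written_mask              # pixels never written to
    out = frame.copy()
    out[holes] = filled[holes]         # replace only the empty pixels
    return out
```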
6 USER STUDY
In order to evaluate the perceived realism when transitioning
between motions, a user study was performed,
in which 10 different settings were compared for each of 3 different
characters.

These settings, described in table 1, include different
weightings of the three modalities in the similarity
measure, as well as transitions with no
animation and with animation based on either alpha
blending or optical flow. The first, second and third
columns give the values of α, β and γ, respectively,
in equation 2, and the last corresponds to the number
of intermediate animated frames inserted at
a transition, where 5* denotes animation using alpha
blending.
21 non-experts took part in a web-based survey,
where the test subject was presented with two randomized
videos at a time, i.e. 45 video pairs for each character.
Each pair was to be rated from 1 (top preference)
to 5 (bottom preference), where 3 was chosen if there
was no preference.

Figure 8: Animation between the source and target frame from figure 7; the top is based on alpha blending and the bottom is
based on the proposed method.

Figure 9: Illustration of how an intermediate frame is made
based on the proposed method by using optical flow.

Table 1: The parameter settings for each video for each character.
*Animated frames based on alpha blending.

α (skeleton)   β (depth)   γ (colour)   frames
      1            -            -          -
      1            -            -          5
      1            1            -          -
      1            1            -          5
      1            1            -          5*
      1            -            1          5
      1            1            1          5
      -            1            -          5
      -            -            1          5
      -            1            1          5
The results obtained for each character are shown
in figure 10, where each setting from table 1 is
labelled with four digits on the x-axis.

Figure 10: Results for each character obtained from the user
study, showing the mean and variance for each setting.
Interestingly, the results show that the transitions
with animation based on optical flow, using only the
colour data in the similarity measure, received a very
low score, and for character 2 it even got the
lowest score. The results also show that using the
skeleton information alone, or adding it to the colour
and depth information, gave a higher perceived realism
than the other settings.
The results also show that the animation based on alpha
blending was consistently rated as one of the lowest
for all three characters, followed by
transitioning with no animation, which likewise received
a low score. As an example, using only the skeleton
and depth information with no animation, as in figure
7, shows a difference in shape and therefore creates
a jump in the output video.
7 CONCLUSIONS
Four datasets of people performing various motions
were captured by acquiring the skeleton, depth and
colour data from a Kinect v2 for Windows sensor. The
motion sequences in each capture were manually divided
and inserted into a motion graph to ensure that only
feasible transitions between motions would occur.
Segmenting the colour and depth frames by learning
a background model proved to give a poor segmentation,
so another approach was suggested. Using
the skeleton information to establish the range of the
body in the depth frame gave a segmentation in which
the person and a part of the floor remain. The floor
was then removed by averaging a plane over the first
20 frames of a captured video and excluding points
close to the plane in the rest of the video.
The best transition between a source and a target
motion was found by a similarity measure, minimizing
the Euclidean distance of a combination of skeleton,
depth and colour data over all possible frames.
In order to create seamless transitions between
motions, intermediate frames were needed. The suggested
approach was to create new poses by estimating
the dense optical flow between the transitioning
frames in each direction, moving the pixels stepwise
along the flow, and blending the frames from each
direction.
The suggested animation based on optical flow
was compared with animations using a direct alpha
blending between the transitioning frames and with
performing no animation at all. This was evaluated
through a web-based user study across three different
characters. The results were consistent over
all characters: the suggested animation, with
different similarity measure settings, had a higher rate
of preference and perceived realism than animations
based on alpha blending and no animation.
Creating an adaptive number of intermediate frames,
varying according to the pose similarity
and the speed of motion before and after the transition,
could be a possible area for future investigation.
REFERENCES
Budd, C., Huang, P., Klaudiny, M., and Hilton, A. (2013).
Global non-rigid alignment of surface sequences.
International Journal of Computer Vision, 102(1-3):256–270.

Casas, D., Tejera, M., Guillemaut, J.-Y., and Hilton, A.
(2013). Interactive animation of 4D performance capture.
IEEE Transactions on Visualization and Computer Graphics, 19(5):762–773.

Casas, D., Volino, M., Collomosse, J., and Hilton, A.
(2014). 4D video textures for interactive character appearance.
In Computer Graphics Forum, volume 33, pages 371–380. Wiley Online Library.

De Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.-P.,
and Thrun, S. (2008). Performance capture from sparse multi-view video.
In ACM Transactions on Graphics (TOG), volume 27, page 98. ACM.

Farnebäck, G. (2003). Two-frame motion estimation based
on polynomial expansion. In Scandinavian Conference
on Image Analysis, pages 363–370. Springer.

Fechteler, P., Paier, W., and Eisert, P. (2014). Articulated
3D model tracking with on-the-fly texturing. In 2014
IEEE International Conference on Image Processing
(ICIP), pages 3998–4002. IEEE.

Huang, P., Hilton, A., and Starck, J. (2010). Shape similarity
for 3D video sequences of people. International
Journal of Computer Vision, 89(2-3):362–381.

Kovar, L., Gleicher, M., and Pighin, F. (2002). Motion
graphs. In ACM Transactions on Graphics (TOG),
volume 21, pages 473–482. ACM.

Møgelmose, A., Bahnsen, C., and Moeslund, T. B. (2015).
Comparison of multi-shot models for short-term re-identification
of people using RGB-D sensors. In International
Joint Conference on Computer Vision, Imaging
and Computer Graphics Theory and Applications 2015.

Starck, J. and Hilton, A. (2007). Surface capture
for performance-based animation. IEEE Computer
Graphics and Applications, 27(3):21–31.

Xu, F., Liu, Y., Stoll, C., Tompkin, J., Bharaj, G., Dai, Q.,
Seidel, H.-P., Kautz, J., and Theobalt, C. (2011).
Video-based characters: creating new human performances
from a multi-view video database. ACM Transactions
on Graphics (TOG), 30(4):32.