Dance Analysis using Multiple Kinect Sensors
Alexandros Kitsikidis¹, Kosmas Dimitropoulos¹, Stella Douka² and Nikos Grammalidis¹
¹Informatics and Telematics Institute, ITI-CERTH, 1st Km Thermi-Panorama Rd, Thessaloniki, Greece
²Department of Physical Education and Sport Science, Aristotle University of Thessaloniki, Thessaloniki, Greece
Keywords: Body Motion Analysis and Recognition, Conditional Random Fields, Skeletal Fusion, Dance Analysis.
Abstract: In this paper we present a method for body motion analysis in dance using multiple Kinect sensors. The
proposed method applies fusion to combine the skeletal tracking data of multiple sensors in order to solve
occlusion and self-occlusion tracking problems and increase the robustness of skeletal tracking. The fused
skeletal data is split into five different body parts (torso, left hand, right hand, left leg and right leg), which
are then transformed to allow view invariant posture recognition. For each part, a posture vocabulary is
generated by performing k-means clustering on a large set of unlabeled postures. Finally, body part postures
are combined into body posture sequences, and a Hidden Conditional Random Fields (HCRF) classifier is
used to recognize motion patterns (e.g. dance figures). For the evaluation of the proposed method, Tsamiko
dancers are captured using multiple Kinect sensors and experimental results are presented to demonstrate
the high recognition accuracy of the proposed method.
1 INTRODUCTION
Dance is an immaterial art as it relies on the motion
of the performer’s body. Dance can convey different
messages according to the context, focusing on aesthetic or artistic aspects (contemporary dance, ballet), cultural and social aspects (folk dances, traditional dances), storytelling (symbolic
dances), spiritual meanings (whirling dervishes), etc.
Traditional dances in particular are strongly linked to
local identity and culture. The know-how of these
dances survives at the local level through small
groups of people who gather to learn, practice and
preserve these traditional dances. Therefore, there is
always a risk that certain elements of this form of
intangible cultural heritage could die out or
disappear if they are not safeguarded and transmitted
to the next generation.
ICT can play an important role in this direction. Specifically, the development
of a system for the capturing, analysis and modelling
of rare dance interactions could significantly
contribute to this transfer of knowledge. However,
the main challenge of this task lies in the accurate
recognition of human body movements. Today, the
major advantages over earlier systems include the
ability to make more precise measurements with a
wider array of sensing strategies, the increased
availability of processing power to accomplish more
sophisticated interpretations of data, and a greatly
enhanced flexibility in the area of media rendering
(Aylward, 2006).
Depending on the degree of precision of the
captured motion and the constraints posed, different
sensing technologies are used. They can be broadly
divided into three main categories: optical motion
capture, inertial motion capture and markerless
motion capture. Optical motion capture is the most
accurate technique but it is also expensive and
constraining. Inertial motion capture is less accurate
and less stable. Finally, markerless motion capture
based on real-time depth sensing systems, such as
Microsoft Kinect, is relatively cheap and offers a balance between usability and cost compared to optical and inertial motion capture systems. To this end, this approach is considered the most promising one
and has attracted particular attention recently
(Alexiadis et al., 2011).
Existing approaches to human action and gesture
recognition using markerless motion capture
technologies can be coarsely grouped into two
classes. The first uses 3D depth maps / silhouettes
which form a continuous evolution of body pose in
time. Action descriptors, which capture both spatial
and temporal characteristics, are extracted from
those sequences and conventional classifiers can be
used for recognition. The second category extracts features from each silhouette and models the dynamics of the action explicitly. Bag of
Words (BoW) are often employed as an intermediate
representation with subsequent use of statistical
models such as hidden Markov models (HMM),
graphical models (GM) and conditional random
fields (CRF) (Li et al., 2010); (Wang et al., 2012).
Another more recent approach is to use the
skeletal data acquired from the depth maps (Shotton
et al., 2011). The subsequent use of skeletal data for
action detection can be divided into two categories.
The methods of the first category are based on 3D
joints feature trajectories (Waithayanon and
Aporntewan, 2011). Those features are either joint
position, rotation data, or some transformation of the
above. These methods are mainly based on Dynamic Time Warping (DTW) variants, such as multi-dimensional dynamic time warping (MD-DTW) (ten Holt et
al., 2007). Recognition is based on aligning the movement trajectories with the ‘oracle’ move being detected. Another approach is to extract features (e.g. histograms) from the whole skeleton and to use statistical models, as in the case of silhouette-based methods (Xia et al.,
2012).
In this paper, we present a method for dance
capture, analysis and recognition using a multiple depth sensor set-up and a skeleton fusion technique to
address occlusion problems and increase the
robustness of the skeletal tracking. Subsequently, we
propose the splitting of the skeleton into five
different parts, and the automatic generation of a
posture vocabulary (codebook) for each part.
Finally, a Hidden State Conditional Random Field
(HCRF) (Quattoni et al., 2004); (Wang et al., 2006)
is applied for the recognition of the dance figures
(motion patterns). Experimental results with real Tsamiko dancers (Tsamiko is a traditional Greek
dance) have shown the great potential of the
proposed method.
2 SYSTEM OVERVIEW
The flowchart of the data acquisition and motion
recognition process is presented in Figure 1. Several
Kinect sensors placed around the subject are used to
acquire skeletal animation data. The Microsoft Kinect SDK (Kinect for Windows, 2013) has been used for skeletal tracking and acquisition. It
provides 3D position and rotation data (relative to a
reference coordinate system centred at the origin of
the sensor) of 20 predefined skeletal joints of a
human body. In addition, a tracking confidence level (low/medium/high) is provided per joint. A skeletal fusion procedure is proposed to combine the data coming from multiple sensors into a single fused skeleton, which is then provided to the Motion
Analysis Module. Specifically, the skeleton is split
into five body parts (torso, left/right hand, left/right
foot), which are then transformed to allow view
invariant posture recognition. The next step is to
recognize each body part posture appearing in a
frame, based on a predefined vocabulary of postures,
obtained from a set of training sequences. Finally,
body part postures are combined into body posture
sequences and an HCRF is used to recognize a
motion pattern (e.g. a dance move) from a
predefined set of motion patterns on which the HCRF was previously trained.
Figure 1: System overview.
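To make the data flow concrete, the following minimal Python sketch shows one way the per-frame skeletal input described above could be represented. The type and field names are illustrative assumptions of this sketch, not the actual Kinect SDK types.

    from dataclasses import dataclass
    from enum import IntEnum
    from typing import List
    import numpy as np

    class Confidence(IntEnum):
        # Per-joint tracking confidence, as used in the text (illustrative
        # encoding of the SDK's tracked/inferred/not-tracked states).
        LOW = 0
        MEDIUM = 1
        HIGH = 2

    @dataclass
    class Joint:
        position: np.ndarray    # (3,) position in the sensor coordinate system
        confidence: Confidence

    @dataclass
    class Skeleton:
        joints: List[Joint]     # the 20 predefined joints, in the SDK's order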
2.1 Calibration
In order to improve the robustness of skeleton
tracking provided by the Microsoft Kinect SDK, to
reduce occlusion and self-occlusion problems and to
increase the area of coverage, multiple Kinect
devices were used. Prior to fusion, skeletal data from
all sensors have to be transformed to a common
reference coordinate system. One sensor is selected
as the reference sensor providing the reference
frame. A calibration procedure is then required to
estimate the transformations between the coordinate
systems of each sensor and the reference sensor. The
proposed calibration procedure does not require any
checker boards or similar patterns. Instead, the only
requirement is that a person is visible from multiple sensors, whose FOVs need to partially
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
790
overlap. The skeleton joint positions are then fed
into the Iterative Closest Point algorithm (Besl and
McKay, 1992) to estimate the rigid transformation
(Rotation-Translation) that minimizes the distance
between the transformed positions in the reference
frame. This transformation is then used to register
the skeletons acquired from each sensor in the
reference coordinate system. The implementation of the ICP algorithm found in the Point Cloud Library (PCL, http://pointclouds.org/) (Rusu and Cousins, 2011) was used.
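Because corresponding joints share the same index on every sensor, the correspondence search in ICP is trivial here, and each alignment reduces to the closed-form least-squares rigid transform that ICP computes internally. The following NumPy sketch of that step is illustrative; the actual system uses the PCL implementation cited above.

    import numpy as np

    def rigid_transform(src, dst):
        """Closed-form least-squares rigid transform mapping src to dst.

        src, dst: (N, 3) arrays of corresponding joint positions from the
        two sensors. Returns rotation R (3x3) and translation t (3,) such
        that dst is approximately src @ R.T + t.
        """
        c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
        H = (src - c_src).T @ (dst - c_dst)        # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        sign = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
        D = np.diag([1.0, 1.0, sign])
        R = Vt.T @ D @ U.T
        t = c_dst - R @ c_src
        return R, t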
Since the skeleton frame data are sparse, each
containing at most 20 points, and the estimation of
the joint positions can be erroneous, the calibration
procedure is iterated until two convergence criteria
are both met. The first criterion is that the number of
joints tracked with high confidence on both devices
needs to be higher than a threshold T_joints. The higher this number is, the better the expected accuracy of
the calibration. The second criterion is that the
fitness score of the ICP algorithm needs to be lower
than a threshold T_ICP. These thresholds can be
adjusted to accommodate various setups and
recording conditions.
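A sketch of how the two convergence criteria might be checked follows. The threshold values are examples only, and taking the fitness score as the mean squared residual after alignment is an assumption in the spirit of, but not necessarily identical to, PCL's fitness score.

    import numpy as np

    T_JOINTS = 12   # example minimum number of high-confidence joints
    T_ICP = 0.05    # example maximum acceptable fitness score

    def calibration_converged(src, dst, conf_src, conf_dst, R, t):
        """Check the two criteria from the text for one calibration pass.

        src, dst: (20, 3) joint positions; conf_src, conf_dst: (20,)
        per-joint confidences (2 == high); R, t: estimated transform.
        """
        both_high = (conf_src == 2) & (conf_dst == 2)
        if both_high.sum() < T_JOINTS:             # criterion 1: enough joints
            return False
        residual = dst[both_high] - (src[both_high] @ R.T + t)
        fitness = (residual ** 2).sum(axis=1).mean()
        return fitness < T_ICP                     # criterion 2: tight fit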
2.2 Skeleton Fusion
Once all sensors are calibrated, skeleton registration
is performed, i.e. the representation of each skeleton
is transformed to the reference coordinate system.
This is accomplished by multiplying the skeleton
joint positions obtained from each sensor by the
corresponding RT matrix, estimated in the
calibration process. Then, a skeletal fusion
procedure is used to combine these registered
skeletons into a single skeleton representation
(Figure 2).
Specifically, the following fusion strategy has
been used on joint positional data, but could easily be extended to joint rotations as well. Initially, the
sum of all joint confidence levels of each skeleton is
computed and the skeleton with the highest total is
selected. Since this is the skeleton with the most
successfully tracked joints, it is expected to be the
most accurate representation of the real person's pose.
We use this skeleton as a base, and enrich it with
data provided from the remaining skeletons.
Specifically, the confidence of each joint of the base
skeleton is examined. If the confidence is medium or
low, the joint position is corrected by taking into
account the position of this joint in the remaining
skeletons. If corresponding joints with high
confidence are found in any of the remaining
skeletons, their average position is used to replace
the position value of the joint. Otherwise, the same
procedure is applied for joints containing medium
confidence values. Finally, if only low confidence
values exist, the same procedure is applied using the
available skeleton data for the joint.
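This strategy can be summarized in a few lines of Python. The integer confidence encoding (0 = low, 1 = medium, 2 = high) and array shapes are assumptions of this sketch, which follows a literal reading of the procedure above.

    import numpy as np

    def fuse_skeletons(positions, confidences):
        """Fuse registered skeletons into a single skeleton (sketch).

        positions:   (S, 20, 3) joint positions from S sensors, already
                     registered to the reference coordinate system.
        confidences: (S, 20) per-joint confidences (0=low, 1=med, 2=high).
        """
        base = int(confidences.sum(axis=1).argmax())  # most confident skeleton
        fused = positions[base].copy()
        for j in range(positions.shape[1]):
            if confidences[base, j] == 2:             # high confidence: keep
                continue
            # Replace with the average of corresponding joints from the
            # other skeletons, at the highest confidence level available.
            for level in (2, 1, 0):
                mask = confidences[:, j] == level
                mask[base] = False
                if mask.any():
                    fused[j] = positions[mask, j].mean(axis=0)
                    break
        return fused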
Figure 2: Colour maps, depth maps and skeleton frames
from 3 Kinect sensors and a fused skeleton result.
Finally, a stabilization filter is applied in order to overcome problems due to rapid
changes in joint position from frame to frame which
may occur because of the use of joint position
averaging in our fusion strategy. We use a time
window of three frames, to keep the last three high-
confidence positions for each joint. The centroid of
these three previous positions is calculated and
updated for each frame. If the Euclidean distance
between a joint position and this centroid is higher
than a certain threshold, then we replace the joint
position with the value of the centroid, so as to avoid
rapid changes in joint positions. We have used
different thresholds for each joint since hands and
feet are expected to move more rapidly.
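The filter can be sketched as a small stateful class, one instance per joint. The names and the exact rule for updating the history are our reading of the description above, not a specification from the paper.

    from collections import deque
    import numpy as np

    class JointStabilizer:
        """3-frame stabilization filter for one joint (illustrative sketch)."""

        def __init__(self, threshold):
            self.threshold = threshold      # per joint; larger for hands/feet
            self.history = deque(maxlen=3)  # last 3 high-confidence positions

        def filter(self, pos, high_confidence):
            if len(self.history) == self.history.maxlen:
                centroid = np.mean(list(self.history), axis=0)
                if np.linalg.norm(pos - centroid) > self.threshold:
                    pos = centroid          # reject the abrupt jump
            if high_confidence:
                self.history.append(np.asarray(pos, dtype=float))
            return pos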
3 MOTION ANALYSIS
The motion analysis subsystem (Figure 3) can use as
input a skeleton animation stream, either provided
from a single Kinect device, or from multiple Kinect
devices, after using the skeleton fusion procedure
described in Section 2.2.
Initially, to achieve view invariance of motion
recognition, the skeleton joint positions are
translated relative to the root of the skeleton (Hip
Center) and rotated around the y axis so that the skeleton faces a fixed direction. Next, the
skeleton is divided into five parts, shown in Figure 4
(torso, left hand, right hand, left foot, right foot).
Each part has a root joint and children joints. For
each skeleton part we generate a feature vector
consisting of positions of each joint relative to the
root of the part (also shown in Figure 4).
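A minimal sketch of this normalization step follows. Since the paper does not specify how the facing direction is estimated, deriving the yaw angle from the hip line is our assumption, as are the function name and the default joint indices (Kinect SDK v1 ordering).

    import numpy as np

    def view_invariant(joints, hip_center=0, hip_left=12, hip_right=16):
        """Translate joints relative to the Hip Center and rotate about
        the vertical (y) axis so the skeleton faces a fixed direction.

        joints: (20, 3) array of joint positions.
        """
        p = joints - joints[hip_center]     # root-relative coordinates
        d = p[hip_right] - p[hip_left]      # left-to-right hip vector
        theta = np.arctan2(d[2], d[0])      # yaw of the hip line
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])        # rotation about the y axis
        return p @ R.T                      # hip line aligned with +x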
DanceAnalysisusingMultipleKinectSensors
791
Figure 3: Motion analysis subsystem.
Specifically, the root of the torso part is the Hip
Center and the children joints are: Spine, Shoulder
Center and Head. The root of the left hand part is
the Shoulder Center and the children are: Left
Shoulder, Left Elbow, Left Wrist and Left Hand. The
root of the left foot part is the Hip Center and the
children are: Left Hip, Left Knee, Left Ankle and Left
Foot. The right hand and right foot parts consist of
the symmetrical joints of their left counterparts.
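This decomposition can be encoded as a small lookup table; the joint names follow the Kinect SDK, while the table and helper names below are our own.

    import numpy as np

    # (part root, children) for each of the five parts described above.
    SKELETON_PARTS = {
        "torso":      ("HipCenter",      ["Spine", "ShoulderCenter", "Head"]),
        "left_hand":  ("ShoulderCenter", ["ShoulderLeft", "ElbowLeft",
                                          "WristLeft", "HandLeft"]),
        "right_hand": ("ShoulderCenter", ["ShoulderRight", "ElbowRight",
                                          "WristRight", "HandRight"]),
        "left_foot":  ("HipCenter",      ["HipLeft", "KneeLeft",
                                          "AnkleLeft", "FootLeft"]),
        "right_foot": ("HipCenter",      ["HipRight", "KneeRight",
                                          "AnkleRight", "FootRight"]),
    }

    def part_feature_vector(joints, part):
        """Concatenate child positions relative to the part root.

        joints: mapping from joint name to a (3,) position array.
        """
        root, children = SKELETON_PARTS[part]
        return np.concatenate([joints[c] - joints[root] for c in children])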
3.1 Posture Codebook
For each of the 5 skeleton parts described above, we
construct a codebook of basic postures of a
predefined size k. The identification of these basic
postures is performed automatically by using k-
means clustering of a large set of postures obtained
from one or more recorded training sequences.
Clustering essentially divides the ‘posture space’
into k discrete posture subspaces. After building a
posture codebook for each body part, we train a
multiclass SVM classifier to classify each incoming
feature vector as a specific posture from this posture
codebook. Thus we obtain five posture classifiers,
one per body part.
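A compact sketch of codebook construction and posture classification for one body part follows, using scikit-learn; the k-means settings and the SVM kernel are assumptions, since the paper does not report them.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def build_posture_classifier(features, k=20):
        """Build the posture codebook and classifier for one body part.

        features: (N, d) part feature vectors from the training sequences.
        k-means defines the k basic postures; the cluster index of each
        training vector then serves as its posture label for the SVM.
        """
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        svm = SVC(kernel="rbf")             # kernel choice is an assumption
        svm.fit(features, kmeans.labels_)
        return kmeans, svm

At run time, svm.predict maps each incoming feature vector to a codebook index; compared with plain nearest-centroid assignment, an SVM can learn posture boundaries that are not spherical around the cluster centres.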
Figure 4: Skeleton parts.
3.2 Motion Pattern Recognition using a
HCRF Model
For the motion detection step, we have selected the
Hidden-state Conditional Random Fields (HCRF)
classifier (Quattoni et al., 2004). A set of M basic
motion patterns, i.e. sequences of frames of skeleton
data describing a specific movement, is first
identified. We then train a multi-class HCRF model to distinguish between these basic motion patterns. Specifically,
for the training phase, we use labelled sequences of
the basic motion patterns. Each training sample is a sequence of skeleton part posture vectors,
i.e. vectors of five elements, each being the index of
a basic posture from the codebook corresponding to
the specific skeleton part. For the testing phase, a
similar vector is initially estimated for each frame of
the input skeleton data sequence and is then used as
input to the HCRF classifier. The identification of each motion pattern is then based on the likelihood that the HCRF model assigns to each observation sequence. For the implementation
of HCRF, the Hidden-state Conditional Random
Fields Library v2 was used
(http://sourceforge.net/projects/hcrf/).
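Putting the pieces together, the observation sequences fed to the HCRF are per-frame vectors of five posture indices, one per body part. The following sketch assembles them using per-part classifiers such as those from the previous section; all names are illustrative, and the HCRF training and inference themselves are performed with the C++ library cited above.

    import numpy as np

    PART_NAMES = ["torso", "left_hand", "right_hand", "left_foot", "right_foot"]

    def observation_sequence(frames, classifiers, feature_fn):
        """Build the (T, 5) observation sequence for one motion sample.

        frames:      list of T skeleton frames.
        classifiers: part name -> trained posture classifier.
        feature_fn:  (frame, part name) -> feature vector for that part.
        """
        seq = np.empty((len(frames), len(PART_NAMES)), dtype=int)
        for t, frame in enumerate(frames):
            for p, part in enumerate(PART_NAMES):
                feat = feature_fn(frame, part).reshape(1, -1)
                seq[t, p] = classifiers[part].predict(feat)[0]
        return seq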
4 EXPERIMENTAL RESULTS
To evaluate our methodology, a data recording
session took place, in which several dancers were
recorded performing the Tsamiko dance (Figure 5).
Tsamiko is a popular traditional folk dance of
Greece, danced to music in ¾ meter. It is a mostly masculine circular dance, performed with smoother steps by women. It is danced in an open circle
where the first dancer performs variations while the
others follow the basic steps. Tsamiko is danced in
various areas of Greece, such as the Peloponnese,
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
792
Central Greece, Thessaly and Western Macedonia, with variations in kinesiological structure (10, 12, 8 or 16
steps). The dance follows a strict and slow tempo
with emphasis on the "attitude, style and grace" of
the dancer. The steps are relatively easy but have to
be precise and strictly on beat. Its variations consist
of both smooth and leaping steps, which give the
dance a triumphant air. Dancers hold each other's hands with elbows bent in a 'W' position. The dance is accompanied by various
songs.
Figure 5: Recording session.
Figure 6: Tsamiko dance steps.
For the evaluation of our methodology we
recorded three male dancers, dancing the single step
version of the Tsamiko dance (Figure 6). The main
dance pattern in Tsamiko can be split into three basic dance moves, which were used as the basic motion
patterns that we tried to detect. The recordings were
manually annotated to mark the beginning and end
of each move. Each dancer was recorded separately
and was required to perform the basic moves of the
dance several times.
4.1 Sensors Setup
For the final recording we used three Kinect sensors
placed in front of the dancer in an arc topology (one
in front and two at the sides), as seen in Figure 7(b).
One additional setup was tested, but was rejected for
the final recording. We tried placing four Kinect
sensors all around the dancer, at 90-degree angles
between them, as seen in Figure 7(a). This setup
allowed for approximately 2×2 m of active space for the
dancer to move. The interference due to infrared
emission from the sensors was minimal, but only the
two frontal sensors provided useful skeletal data,
since the skeletal tracking of the Microsoft SDK is designed to work on people facing the sensor. Since
the dancers were moving on a small arc, they were
always facing in the same direction. Thus, our final
setup proved to be more effective since we had
skeletal tracking data from three sensors. In addition,
having a smaller angle between adjacent sensor
FOVs allowed for increased precision of calibration.
Adding more sensors proved to be problematic since
interference caused by the infrared pattern emitted by each sensor increased significantly, which
had a negative impact on skeletal tracking.
Figure 7: Sensor setups (a) Setup A (b) Setup B (final
setup).
4.2 Evaluation Results
The recorded data consisted of eight repetitions of the basic Tsamiko dance pattern (three dance moves per
repetition) executed by each of the three dancers.
We split the recorded data into training and test sets by using half of the repetitions of the basic dance pattern of each dancer (12 repetitions per move) for training and the remainder for testing. Initially, the posture
DanceAnalysisusingMultipleKinectSensors
793
codebooks were created, with k=20 basic postures for each body part, using the training motion sequences. Then, we trained an HCRF using
the train sequences to be able to distinguish between
the three basic Tsamiko dance moves. HCRFs with a varying number of hidden states were trained, as can be seen in Table 1, in which the dance move
detection accuracies of the test set are presented, per
dancer and overall. The best overall detection accuracy achieved is 93.9%, using an HCRF with 11 hidden states. In Table 2, detection
accuracies are presented for each dance move.
Table 1: Recognition accuracies (%) of Tsamiko dance moves per person, and overall, for a varying number of hidden states in the HCRF classifier.

Hidden states    5     8     11    12    15    20
Dancer A         38.4  61.5  84.6  76.9  76.9  69.2
Dancer B         90.9  90.9  100   100   90.9  72.7
Dancer C         66.6  88.8  100   100   100   77.7
Overall          63.6  78.7  93.9  90.9  87.8  72.7
Table 2: Recognition accuracies (%) of each Tsamiko dance move for a varying number of hidden states in the HCRF classifier.

Hidden states    5     8     11    12    15    20
Dance move 1     83.3  66.6  91.6  100   83.8  100
Dance move 2     27.2  81.8  90.9  81.8  90.9  36.3
Dance move 3     80    90    100   90    90    80
Overall          63.6  78.7  93.9  90.9  87.8  72.7
5 CONCLUSIONS AND FUTURE
WORK
This paper presents a study on recognizing
predefined dance motion patterns from skeletal
animation data captured by multiple Kinect sensors.
As can be seen from the experimental results, our method gave quite promising results, providing high recognition accuracy for the three Tsamiko dance moves. In future work, we aim to experiment with the recognition of different styles of these dance moves and to add more complex dance patterns and variations. In addition, we plan to extend our skeleton fusion algorithm to joint rotation data (both absolute and hierarchical), which will allow the construction of posture codebooks based on both position and rotation data.
ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the European Community's Seventh
Framework Programme (FP7-ICT-2011-9) under
grant agreement no FP7-ICT-600676 ''i-Treasures:
Intangible Treasures - Capturing the Intangible
Cultural Heritage and Learning the Rare Know-How
of Living Human Treasures''.
REFERENCES
Aylward, R., “Sensemble: A Wireless Inertial Sensor System for Interactive Dance and Collective Motion Analysis”, Master of Science thesis, Media Arts and Sciences, Massachusetts Institute of Technology, 2006.
Alexiadis, D., Kelly, P., Daras, P., O'Connor, N.,
Boubekeur, T., and Moussa, M., Evaluating a dancer's
performance using Kinect-based skeleton tracking. In
Proceedings of the 19th ACM international conference
on Multimedia (MM '11). ACM, New York, NY,
USA, pp. 659-662, 2011.
Li, W., Zhang, Z., Liu, Z., “Action Recognition Based on
A Bag of 3D Points”, IEEE International Workshop
on CVPR for Human Communicative Behavior
Analysis (in conjunction with CVPR2010), San
Francisco, CA, June, 2010.
Wang, J., Liu, Z., Wu, Y., and Yuan, J., “Mining actionlet
ensemble for action recognition with depth cameras,”
in CVPR'12, 2012.
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T.,
Finocchio, M., Moore, R., Kipman, A., and Blake, A.,
“Real-time human pose recognition in parts from
single depth images”, in CVPR, pp. 1297-1304, June
2011.
Waithayanon, C. and Aporntewan, C., “A motion
classifier for Microsoft Kinect,” in Computer Sciences
and Convergence Information Technology (ICCIT),
2011 6th International Conference on, 2011.
ten Holt, G. A., Reinders, M. J. T., and Hendriks, E. A., “Multi-Dimensional Dynamic Time Warping for Gesture Recognition”, conference paper, 2007.
Xia, L., Chen, C.-C., and Aggarwal, J., “View invariant
human action recognition using histograms of 3D
joints,” in Computer Vision and Pattern Recognition
Workshops (CVPRW), 2012 IEEE Computer Society
Conference on, 2012.
Wang, S., Quattoni, A., Morency, L.-P., Demirdjian, D., and Darrell, T., “Hidden Conditional Random Fields for Gesture Recognition”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2006.
Kinect for Windows | Voice, Movement & Gesture Recognition Technology, 2013. [Online]. Available at: http://www.microsoft.com/en-us/kinectforwindows/
Besl, Paul J.; N.D. McKay (1992)."A Method for
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
794
Registration of 3-D Shapes". IEEE Trans. on Pattern
Analysis and Machine Intelligence (Los Alamitos,
CA, USA: IEEE Computer Society) 14 (2): 239–256.
Rusu, B., Cousins, S., "3D is here: Point Cloud Library
(PCL)," Robotics and Automation (ICRA), 2011 IEEE
International Conference on , vol., no., pp.1,4, 9-13
May 2011
Quattoni, A., Collins, M., Darrell, T., “Conditional Random Fields for Object Recognition”, in Neural Information Processing Systems, 2004.
DanceAnalysisusingMultipleKinectSensors
795