Action Graph
A Versatile Data Structure for Action Recognition
Jan Baumann, Raoul Wessel, Björn Krüger and Andreas Weber
Department of Computer Science II, University of Bonn, Bonn, Germany
Keywords:
Animation, Action Recognition, Accelerometer, Motion Capture.
Abstract:
This work presents a novel and generic data-driven method for recognizing human full body actions from
live motion data originating from various sources. The method queries an annotated motion capture database
for similar motion segments and is capable of handling temporal deviations from the original motion. The approach
is online-capable, works in realtime, requires virtually no preprocessing and is shown to work with a vari-
ety of feature sets extracted from input data including positional data, sparse accelerometer signals, skeletons
extracted from depth sensors and even video data. Evaluation is done by comparing against a frame-based Sup-
port Vector Machine approach on a freely available motion capture database as well as a database containing
Judo referee signal motions and concludes by demonstrating the applicability of the method in a vision-based
scenario using video data.
1 INTRODUCTION
Consumer motion capture systems (like the Kinect,
WiiMote, EyeToy and accelerometers) have received a lot
of attention in recent years, primarily because they en-
able the user to interact with an application in a very
natural way using low-cost hardware. Their usage
extends beyond replacing the classic game controller,
as new applications outside of computer games are
emerging. This paper is
motivated by such a novel example application: The
automated detection of Judo referee signals, i.e. the
recognition of full body movements as belonging to
a set of small motion segments which are detected
as certain referee signals (usually denoted by their
Japanese names). Taking the developed method to the
gym would allow for cost-effective automatic score
counting and time keeping and greatly reduce the ad-
ministrative overhead required at Judo competitions.
Technically, a fully data driven action recognition
scheme is devised, where motion sequences can be
detected in real time. The method is very flexible con-
cerning the used sensor input data, which can range
from high quality optical motion capture data, over
medium quality Kinect skeletons to highly noisy ac-
celerometer readings. Adding robust feature extrac-
tion from video data to the recognition pipeline even
enables the approach to detect actions from video in-
put. All of these input data can be compared
in real-time with previously recorded sample mo-
tions in a motion database. The framework detects
if the performed motion is similar (and possibly time-
warped) to one of the annotated clips contained in the
database.
The method requires very little preprocessing—
only sample motions for each action to be recognized
have to be labeled by the name of the action. No fur-
ther explicit learning phases are required. Additional
flexibility comes from the ability to add action tem-
plates to the used action database in an online manner,
requiring only minimal processing.
For the purpose of evaluating different aspects of
the method it is applied to prerecorded high qual-
ity motion capture data, to live captured low qual-
ity motion data obtained by a Microsoft Kinect sen-
sor, to features extracted from video data and to a
sparse accelerometer sensor setup with only four sen-
sors attached to the actor’s body. Interestingly, even
for these very sparsely distributed accelerometers, the
method is able to detect some actions, making it very
effective in a low-cost sensor setup.
2 RELATED WORK
Related work for the method can be divided into four
groups, image-based action recognition, 3D point tra-
jectory action recognition methods, methods using
accelerometers as sensors and data-driven techniques
in the field of computer animation.
The first group of techniques uses 2D informa-
tion such as images coming from a video camera
to infer information about the actions taking place.
The work by (Bobick et al., 2001) presents a view-
based approach to action recognition using temporal
templates, which are static vector-images where the
vector value at each point is a function of the mo-
tion properties at the corresponding spatial location
in an image sequence. (Schuldt et al., 2004) use local
space-time features in combination with SVM classi-
fication schemes for action recognition.
The second group works directly on 3D point tra-
jectory data. (Barbič et al., 2004) show methods for
automatically segmenting motion capture data into
distinct behaviours. The work by (Campbell and
Bobick, 1995) presents techniques for representing
movements based on space curves in subspaces called
phase spaces, recognizing actions by calculating dis-
tances between these curves at every time step.
(Arikan et al., 2003) use an interactively guided
Support Vector Machine to generalize example anno-
tations made by a user to the entire motion capture
database. Their approach works well on the small (7
minutes) motion capture database presented in their
paper. The method presented in this paper uses a simi-
lar SVM approach for comparison with the developed
method.
Data-driven k-nearest neighbor approaches have
been quite popular in the field of computer anima-
tion in recent years. In the context of synthesiz-
ing motions, (Chai and Hodgins, 2005) show how to
transform the positions of a small number of mark-
ers to full body poses. For nearest neighborhood pose
searches, they construct a neighbor graph, allowing
approximate NN-searches and requiring a quadratic
preprocessing time in the size of the number of poses
in the database. (Krüger et al., 2010) improve the
method presented in (Chai and Hodgins, 2005) by
querying a kd-tree for determining the neighborhood
of a query pose resulting in exact neighborhoods for
arbitrary query poses.
A novel and very intuitive puppet interface is used
by (Numaguchi et al., 2011) to retrieve motions from
a motion capture database. By sketching actions
with the 17-degree-of-freedom puppet, the method
matches the puppet’s sensor readings retargeted to hu-
man motion to behaviour primitives stored in the mo-
tion database.
(Raptis et al., 2011) develop a method to clas-
sify human dance gestures by using a special an-
gular skeleton representation designed for recogni-
tion robustness under noisy input. They use a
cascaded correlation-based classifier for multivariate
time-series data in combination with a dynamic-time
warping based distance metric to evaluate the differ-
ence in motion between a performed gesture and an
oracle for the matching gesture. Although the clas-
sification accuracy of their approach is very good, it
assumes that the input motion adheres to the under-
lying musical beat, whereas the approach presented
here does not rely on such assumptions.
Another class of methods is about using ac-
celerometers for activity recognition. (Bao and In-
tille, 2004) present a system designed for context-
aware activity recognition detecting everyday physi-
cal activities from acceleration data. They use a
semi-naturalistic data collection protocol to train a
set of classifiers, finding that decision tree classifiers
perform best. Along the same
lines, (Maurer et al., 2006) study the effectiveness
of activity classifiers also within a multi-sensor sys-
tem. Their analysis of the proposed activity recog-
nition and monitoring system concludes it is able to
identify and record a subject’s activity in real-time.
(Ravi et al., 2005) also study activity
recognition techniques, presenting a solution using
only a single triaxial accelerometer worn within dif-
ferent data collection setups. Within this context, they
analyze the quality of known classifiers for recog-
nizing activities with particular emphasis on the im-
portance of selected features and level of difficulty
of recognizing specific activities. The system de-
veloped by (Khan et al., 2010) is capable of recog-
nizing a broad set of human physical activities us-
ing only a single triaxial accelerometer. The ap-
proach is of higher accuracy than the previous works
due to a novel augmented-feature vector. Addition-
ally, they provide a data acquisition protocol using
data collected by the subjects at home without the re-
searcher’s supervision.
Since activity-aware systems have inspired novel
user interfaces and new applications, recognizing hu-
man activities in smart environments becomes in-
creasingly important. In this spirit, (Choudhury et al.,
2008) propose an automatic activity recognition sys-
tem using on-body sensors. Several real-world de-
ployments and user studies show how the results can
be used to improve the hardware, software design,
and activity recognition algorithms of context-aware
ubiquitous computing applications.
This paper presents a method which is data-driven
and uses motions from a motion capture database
to construct a prior-database. Publicly available
datasets, like the Carnegie Mellon University motion
capture database (Carnegie Mellon University Graph-
ics Lab, 2004) and the HDM05 (Müller et al., 2007)
GRAPP2014-InternationalConferenceonComputerGraphicsTheoryandApplications
326
library, recorded at the Hochschule der Medien in
Stuttgart, contain large amounts of motion capture
data.
In this paper we used data from the HDM05 li-
brary, which contains more than three hours of sys-
tematically recorded and well documented motion
capture data. Of great benefit in the evaluation of
the action recognition method are the manually cut
out motion clips that were arranged into 64 differ-
ent classes and styles. Each such motion class con-
tains 10 to 50 different realizations of the same type
of motion covering a broad spectrum of semantically
meaningful variations. The resulting motion class
database contains 457 motion clips of a total length
corresponding to roughly 50 minutes of motion data.
3 OVERVIEW
The workflow of the proposed action recognition
method can be divided into three distinct processes.
First, in an offline step, the motion capture database
is created from motion data, where the quality can
range from sparse and noisy data, such as that of a sen-
sor setup using only a single accelerometer, to high accu-
racy optical motion capture data. All such data sets
can easily be handled and are manually or automati-
cally annotated by specifying start and end frames as
well as a keyword for labeling. This is followed by a
preprocessing phase in which a kd-tree is created us-
ing a specific feature set allowing fast k-nearest neigh-
borhood searches on the poses in the database. This
feature set depends on the application, which in turn
depends on the specific type of motion capture system
or the employed sensor setup.
From this point on, the workflow splits into the ac-
tion graph-based (dark blue in Fig. 1) and the SVM-
based (green in Fig. 1) action recognition workflow.
The SVM-based method is implemented for compar-
ison with the action graph during evaluation. In the
online phase of the action graph method, actions are
recognized from any type of input motion sequence by
feeding new frames of the input motion into the anno-
tation module. This module then uses similar poses
retrieved from the kd-tree built up from the motion
capture database in an action aware neighborhood
graph (see Subsection 4.3) to output all recognized
actions as soon as they are detected. In the SVM-
based case, the preprocessing phase consists of
learning SVM parameters on a training set, whereas in
the online phase, the SVM classifier checks whether
a frame derived from the input query motion belongs
to a previously annotated action.
Figure 1: Workflow of the proposed method. Starting out with a motion capture database annotated with actions of interest in an offline phase, the method builds up a kd-tree from this data in the preprocessing phase. The online phase then consists of feeding new frames of the input motion into the annotation module, recognizing actions as soon as the actor finishes executing them.

Since the approach at hand is generic, the input need not be of a specific data type and may even cover cross-modal signals.
4 ACTION RECOGNITION
METHODS
Evaluation is done by comparing two different action
recognition methods side-by-side: the online-capable
action graph, where the input motion sequence is ef-
ficiently compared to annotated motions in the mo-
tion capture database, and an offline Support Vector
Machine (SVM) approach similar to one that was in-
troduced for motion capture data by (Arikan et al.,
2003).
4.1 Data Preparation
Since the preparation of motion capture data for our
method is highly dependent on the application and
sensors used, one cannot give general rules for prepar-
ing the data in the database or for processing the poses
of the query motion. Various applications presented
in the results in Section 5 show realistic examples.
4.2 Data Annotations
For the training phase of the action recognition meth-
ods, as well as for evaluations during the testing
phase, accurate annotations are needed. Annotations
inform the system at which time in a motion sequence
specific actions are performed by the actor. For this
reason, all datasets used in this work were anno-
tated by hand. Another possibility would be to use
ActionGraph-AVersatileDataStructureforActionRecognition
327
automatically annotated mocap data, e.g. by meth-
ods presented in (Arikan et al., 2003) or (Wyatt et al.,
2005). In this work, the decision was made in favor of
a complete, reliable, manual annotation procedure to ensure
that the results are not affected by false, automatically
computed annotations. For each relevant action, an
annotator gives a start frame, an end frame and a key-
word that describes the performed motion, ultimately
creating a mapping from database frame f to the an-
notations stored in f.
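To make this concrete, the following minimal Python sketch shows one way such a frame-to-annotation mapping could be stored; the data layout and names are illustrative assumptions, not the authors' actual implementation.

from collections import defaultdict

def build_annotation_map(annotations):
    # annotations: list of (start_frame, end_frame, keyword) tuples --
    # exactly the three pieces of information an annotator provides.
    frame_to_actions = defaultdict(set)
    for start, end, keyword in annotations:
        for f in range(start, end + 1):
            frame_to_actions[f].add(keyword)
    return frame_to_actions

# Example: a jumping jack spanning database frames 120-180 that
# contains a clap above the head in its middle.
ann = build_annotation_map([(120, 180, "jumpingJack"),
                            (140, 160, "clapAboveHead")])
assert ann[150] == {"jumpingJack", "clapAboveHead"}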
4.3 Action Graph Based Recognition
The presented action recognition method searches for
motion segments that are similar to annotated actions
in the motion database by taking into account the tem-
poral continuity of the underlying motion. This is
in contrast to the SVM-based approach presented in
Subsection 4.4 for comparison, which ignores this in-
formation and decides whether a frame belongs to an
action on a frame-by-frame basis, leading to many
possible ambiguities.
To this end, the action graph detects if an action
ends at the current frame and then tries to find possi-
bly time-warped motion segments spanning the action
in its entirety. Looking at the individual pose neigh-
borhoods of the knn-search alone can lead to many
different, ambiguous annotations. By using the action
graph, paths representing motion segment matches
can be found through the annotated neighborhoods,
resolving the ambiguity.
Basically, the method presented in this paper ex-
tends the Lazy Neighborhood Graph (LNG) proposed
by (Krüger et al., 2010) to find motion segments simi-
lar to the currently performed motion, enabling its use
for action recognition. In its original implementation,
the LNG first retrieves the k nearest neighbors from
a motion database for every pose in the query mo-
tion. To bridge the gap from these locally matching
poses of the retrieved pose neighborhoods to glob-
ally matching similar motions in the database, their
method constructs a directed acyclic graph by regard-
ing the retrieved local neighboring poses from the mo-
tion database as vertices in the graph. Now, an edge
connects a pair of neighbors of temporally adjacent
pose neighborhoods, if certain stepsize conditions are
satisfied, similar to Dynamic Time Warping. In its
simplest form, the step tuple (step_pose, step_time) =
(1, 1) connects pose p at time t to pose p + 1 at time
t + 1. By allowing additional step tuples, the results
could also possibly be time warped. After having
made all possible connections, a single source vertex
s is connected to all the pose neighbors in the first
neighborhood. The problem of finding a motion con-
tained in the database which is most similar to the
query motion can now be reduced to solving a single-
source shortest paths problem. Starting the search at
vertex s, one only has to check whether there exists
a path which terminates at a vertex in the last neigh-
borhood. The entire global matching can be solved
in O(km log(n)), where k is the number of retrieved
nearest neighbors, m the number of frames contained
in the query motion and n the number of frames in the
motion capture database.
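As a small illustration of the edge condition described above, the following Python sketch checks whether two retrieved neighbors of temporally adjacent query frames may be connected; the step tuples shown are the ones used later in Section 5.1, and the function name is illustrative.

# A neighbor at database pose p in the neighborhood of query time t connects
# to a neighbor at pose p + dp at query time t + dt iff (dp, dt) is allowed.
ALLOWED_STEPS = {(1, 1), (2, 1), (1, 2), (2, 2)}

def connects(p_from, t_from, p_to, t_to, steps=ALLOWED_STEPS):
    return (p_to - p_from, t_to - t_from) in steps

assert connects(100, 7, 101, 8)      # straight (1, 1) step
assert connects(100, 7, 102, 8)      # time-warped (2, 1) step
assert not connects(100, 7, 104, 8)  # jump too large, no edge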
In contrast to the original implementation of the
LNG, the developed action recognition framework
tries to find motion segments which start close to the
beginning of an annotated action having the currently
processed frame close to the terminating frame of an
annotation (see Fig. 2). This is accomplished by first
inspecting the pose neighborhood of the current frame
f_n for annotated ending poses. For each of these
found action ending poses, a search for a starting pose
annotated with the same action is performed in all
neighborhoods up to frame f_{n-1}. All of these start-
ing poses are then connected with the single source
vertex mentioned above. If an annotated action in the
database is similar to the currently performed motion
and is contained in the pose neighborhood queue of
length w, the single-source shortest paths algorithm is
able to find paths from the start of an action to the end
of the action, containing only the specified annota-
tion. The found actions are possibly time-warped ac-
cording to the allowed time-steps, making the method
very flexible and robust to time-variations in motion
performance.
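The following Python sketch outlines this detection step under stated assumptions: each neighborhood in the window is a set of database frame indices returned by the knn search, and ann, starts and ends map database frames to annotation keywords, annotated start frames and annotated end frames, respectively. All names are illustrative; for clarity it tests path existence by walking edges backwards rather than solving the full weighted single-source shortest paths problem.

ALLOWED_STEPS = {(1, 1), (2, 1), (1, 2), (2, 2)}

def detect_actions(window, ann, starts, ends, steps=ALLOWED_STEPS):
    """Return all actions with a consistently annotated path ending at f_n."""
    detected = set()
    last = len(window) - 1
    for end_pose in window[last]:
        for action in ends.get(end_pose, set()):
            # Walk edges backwards from the ending pose, staying on poses
            # annotated with the same action, until a start pose is reached.
            stack, visited = [(end_pose, last)], set()
            while stack:
                pose, t = stack.pop()
                if (pose, t) in visited:
                    continue
                visited.add((pose, t))
                if action in starts.get(pose, set()):
                    detected.add(action)
                    break
                for dp, dt in steps:
                    prev_pose, prev_t = pose - dp, t - dt
                    if (prev_t >= 0 and prev_pose in window[prev_t]
                            and action in ann.get(prev_pose, set())):
                        stack.append((prev_pose, prev_t))
    return detected

A real-time implementation would additionally weight edges by pose distance and run the single-source shortest paths search of the LNG; the path-existence test above only captures the annotation-consistency idea.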
In the technical part of the motion detection sce-
nario, we test a query motion take, e.g. from a Kinect
recording, against a set of query action classes A,
that is, annotated classes which are present in the
database. As a result of searching the action graph
for motions associated with these class annotations,
we get a set of path candidates C = {s_i} consisting
of similar motion segments s_i. The size of this set
depends on the employed database, but may range
from zero to several thousand retrieved segments.
Consequently, we compute the percentage of detected
paths s_i from the action graph which agree with the
query set A, thus automatically addressing the fun-
damental question of whether we came across any
action contained in A. We collect the annotations
which are represented most strongly, i.e. make up
more than 10% of the whole set C, and compute the
respective start and end frames of these as follows.
We calculate the frame window which contains the
intersection of the annotation ground truth and a min-
imum of 75% of all corresponding paths retrieved
from the action graph. The start and end of this win-
dow mark
GRAPP2014-InternationalConferenceonComputerGraphicsTheoryandApplications
328
the start and end frame of the annotation at hand. Note
that in case we aim at evaluation rather than detection,
we follow a slightly different protocol. This will be
addressed in Section 5.
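A hedged sketch of this aggregation step follows, under the assumption that each path candidate is reduced to a (label, start frame, end frame) triple; the percentile-based trimming is one plausible reading of the 75% window criterion, not the authors' exact computation.

from collections import Counter
import math

def aggregate(candidates, label_share=0.10, path_share=0.75):
    """candidates: list of (label, start_frame, end_frame) path candidates."""
    counts = Counter(label for label, _, _ in candidates)
    results = {}
    for label, n in counts.items():
        if n <= label_share * len(candidates):
            continue  # keep only labels making up more than 10% of C
        starts = sorted(s for l, s, _ in candidates if l == label)
        ends = sorted(e for l, _, e in candidates if l == label)
        # Trim the most extreme spans on both sides so that the reported
        # window is still supported by at least 75% of the according paths.
        cut = math.floor(n * (1 - path_share) / 2)
        results[label] = (starts[cut], ends[n - 1 - cut])
    return results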
4.4 SVM-based Action Recognition
We additionally evaluate frame classification based
on a Support Vector Machine (SVM) with a stan-
dard Radial Basis Function kernel (RBF). To this end
we use the LibSVM implementation (Chang and Lin,
2011). The optimal SVM parameter C, balancing the
minimization of the hyperplane norm against the in-
fluence of slack variables, as well as the RBF kernel
width γ are determined using grid search with cross
validation. To reduce the training time, we use only a
randomly chosen 30% of the frames of each training
sequence. To take the possible influence of this
random selection into account, we conduct four runs
each time using a different training frame selection
and average the resulting classification accuracies.
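A sketch of this training protocol using scikit-learn, whose SVC class wraps the LibSVM implementation cited above; the parameter grids, fold count and random seed are illustrative assumptions, not values from the paper.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_one_run(X, y, rng):
    # Randomly keep 30% of the training frames to reduce training time.
    idx = rng.choice(len(X), size=int(0.3 * len(X)), replace=False)
    grid = GridSearchCV(SVC(kernel="rbf"),
                        param_grid={"C": 2.0 ** np.arange(-5, 16, 2),
                                    "gamma": 2.0 ** np.arange(-15, 4, 2)},
                        cv=5)  # grid search with cross validation
    grid.fit(X[idx], y[idx])
    return grid.best_estimator_

def averaged_accuracy(X_tr, y_tr, X_te, y_te, runs=4, seed=0):
    # Four runs with different random frame selections, accuracies averaged.
    rng = np.random.default_rng(seed)
    return float(np.mean([train_one_run(X_tr, y_tr, rng).score(X_te, y_te)
                          for _ in range(runs)]))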
5 RESULTS
Applications Used for Evaluation
For evaluation purposes, we considered five applica-
tions. First, in Section 5.2.1, we performed ac-
tion recognition tests on a cut dataset taken from the
HDM05 motion capture database. For this reason we
separated the cut sequence database into a training
part that contained exactly nine realizations of each
motion class, and a testing part that contained at least
three realizations. The same motion capture database is
then used to test our algorithm on a sparse accelerom-
eter setup, detecting actions using a total of four sim-
ulated accelerometers on the wrists and ankles.
In Section 5.2.2, we tested the behavior of the
methods with Judo referee signal movements in an
online scenario, using query motions coming from an
optical mocap system. In this scenario, the database
contained typical referee signals performed by three
different actors with at least three repetitions. This
database was captured with a Vicon motion capture
system and the motion capture data was stored in the
skeleton based .v file format.
We also modified the previous scenario to a cross-
modal scenario, where the query motion was captured
with a Microsoft Kinect sensor, obtaining the skeletal
data using the Microsoft Kinect SDK. For this reason,
the Judo database had to be resampled to the native
frequency of the Kinect sensor (30 Hz).
In the last application example (Section 5.2.3), in-
terest points extracted from video data serve as input
for the proposed algorithm, demonstrating applicabil-
ity in a vision-based context.
For each of the evaluated applications, we build
up two databases, one including motion clips of the
actor and the other with clips of the actor excluded.
Some applications which require poses in the
database to be comparable need to perform a nor-
malization step on each pose, making them scale-
and view-invariant. Along the lines of (Krüger et al.,
2010), the root node of the skeleton is transformed
such that the skeleton faces forward and is anchored
at the global coordinate frame origin. If the skeleton
is given in a hierarchical representation (e.g. HDM05
and Judo database skeletons), the root node’s position
is translated to (0, 0, 0)^T and its orientation is set to the
multiplicative identity quaternion, followed by a for-
ward kinematics calculation to update the remaining
skeleton nodes. When normalizing skeletons where
joint positions are given in absolute world coordinates
with no rotational information (e.g. Kinect skeletons),
the orientation of the root node is estimated by ex-
ploiting rigid connectivity between the pelvis and its
neighboring joints, similar to the normalization step
used for raw optical marker data in (Baumann et al.,
2011).
To obtain scale-invariance, the bones of any query
skeleton are resized to match the skeleton that was
used to build up the database.
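A minimal numpy sketch of such a normalization for skeletons given as absolute joint positions (the Kinect case): the root is moved to the origin and the heading is removed with a yaw rotation estimated from the hip axis. The joint indices and the y-up convention are assumptions; the additional bone-rescaling step for scale invariance is omitted.

import numpy as np

def normalize_pose(joints, l_hip=1, r_hip=6):
    """joints: (n_joints, 3) world positions, root at row 0, y axis up."""
    p = joints - joints[0]                  # anchor the root at the origin
    across = p[l_hip] - p[r_hip]            # hip axis lies in the ground plane
    yaw = np.arctan2(across[2], across[0])  # heading angle around the y axis
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])            # rotation undoing the heading
    return p @ R.T                          # the pose now faces forward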
Description of the Evaluation
Allowing detection of more than one particular action
at a time does not make sense for evaluation of the de-
tection method, especially when this is done by means
of confusion matrices. The decision criteria presented
in Section 4.3, which allow several strongly rep-
resented action classes to contribute to the detection
results, are clearly not suitable for evaluation purposes.
Instead, we decide for the single most strongly rep-
resented action class found in the action graph paths.
To evaluate the quality of the decision method we dis-
tinguish the following cases:
1. The retrieved motion paths lie completely within
the relevant ground truth interval, in which case
the method is regarded as properly working.
2. The motion paths lie outside the ground truth in-
terval as a whole, in which case the method is dis-
missed as incorrect (this was rarely observed to be
the case).
3. The retrieved path set intersects the ground truth
interval, in which case we further differentiate: if
this intersection includes more than 90% of the
total retrieved paths, the method is considered to
work well; otherwise this hypothesis is dismissed.

Figure 2: Detecting actions in the current frame f_n using the action graph. In this example, we illustrate detecting a 'Jumping Jack' motion which is performed in the last six frames (f_{n-5} to f_n) using a window of w = 6 frames. The poses of the 'Jumping Jack' are color coded, ranging from green to red, representing the start and end of the action, respectively. First, all poses annotated with starting poses of actions (green) are connected to the single source vertex required for the single-source shortest path algorithm, regarding all past neighborhoods up to window size w. Now, for every neighborhood, poses are connected with edges according to the allowed time and pose steps S (S = {(1, 1), (2, 2)} in this example). After running the single-source shortest path algorithm, we check for every candidate path terminating at an action ending pose (red pose in neighborhood f_n) whether the nodes on the path are consistently annotated with the same action, in which case this action is reported as found. Note that every 'JumpingJack' motion contains a 'ClapAboveHead' in its middle, as can be seen in the pose neighborhoods (dashed circles). Consequently, this clap is also detected, but at an earlier stage (frame f_{n-2}).

Figure 3: Example action recognition run on frames 0–1686 of motion HDM_bd_03-05_02_120 from the HDM05 motion capture library, comprising four jumping jack motions followed by three complete and one half skiing motion starting with the left foot. Note that our method also detects sub-actions like the clap above the head, which is contained in the middle of each jumping jack. The half-executed skiing motion at the end is not detected, because the action graph is unable to find an annotated end frame in this case.
According to the above, matrices similar to con-
fusion matrices are used to visualize the performance
of the action recognition algorithm. The columns of
the matrix represent instances of the recognized ac-
tions while the rows represent the actual actions. Tak-
ing into account the cases in which the algorithm fails
to detect any action, a column labeled none is added.
A perfect action recognition would have a confusion
matrix with 1 on the diagonal and 0 for every other
element.
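This layout can be summarized with a short sketch using row-normalized counts; the function name is illustrative.

import numpy as np

def confusion_matrix(actual, recognized, classes):
    """Rows: actual actions; columns: recognized actions plus 'none'."""
    cols = classes + ["none"]
    M = np.zeros((len(classes), len(cols)))
    for a, r in zip(actual, recognized):
        # r is None whenever the algorithm failed to detect any action.
        M[classes.index(a), cols.index(r if r is not None else "none")] += 1
    return M / M.sum(axis=1, keepdims=True)  # perfect result: identity block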
5.1 Details on knn Search
Choosing a feature set for the HDM05 cut database
was straightforward. The results from (Krüger et al.,
2010) indicated that feature set F_E^15, which includes
the positions of the head, hands and feet, would work
very well on this database. The confusion matrices
presented below (Fig. 4) confirm that this assumption
holds. Instead of encoding temporal information (e.g.
velocities) in the feature set directly, this information
is represented in the structure of the action graph: Edges are
inserted between successive database indices, accord-
ing to the allowed step size conditions.
According to (Krüger et al., 2010), we use the step
sizes (1,1), (2,1), (1,2) and (2,2). Thus, the action
graph easily runs in real-time when searching for 256
nearest neighbors, achieving an average frame rate
above 75 Hz in a multi-threaded implementation on
the regarded motion capture databases. The described
results were obtained on a system with an Intel hexa-
core CPU running at 3.33 GHz and 24 GB of memory.
The knn search used in our approach can be re-
placed by a fixed radius search. This variation does
not produce convincing results, due to the following
reasons: First, a fixed radius can mean that we do not
find any neighbors. Second, we found in our tests
of this variant that the variability between motions
in some classes (cartwheel) is larger than the variability
in other classes (walk two steps). Therefore it is not
possible to specify a uniform radius for all regarded
motion classes.
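The contrast between the two search strategies can be sketched with scipy's kd-tree; the random data stands in for what would be 15-dimensional F_E^15 vectors, and the radius value is illustrative.

import numpy as np
from scipy.spatial import cKDTree

features = np.random.rand(10000, 15)  # stand-in for database pose features
tree = cKDTree(features)
query = np.random.rand(15)

_, knn_idx = tree.query(query, k=256)            # always yields 256 neighbors
ball_idx = tree.query_ball_point(query, r=0.25)  # may be empty or huge,
                                                 # depending on class variability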
5.2 Discussion of Results
5.2.1 Action Recognition Tests on HDM05
Motion Classes
We conducted action recognition tests on the HDM05
cut library, which contains manually cut out motion
clips that were arranged into several different classes
and styles, having multiple realizations of the same
motion. These motions were divided into a train-
ing set, containing 142 motions, and a test set, con-
taining 273 motions. The confusion matrices for the
two action recognition methods on this dataset using
k = 1024 in the k-nearest neighborhood search can
be seen in Fig. 4. Examining the confusion matri-
ces shown in Fig. 4, the SVM-based approach shows
a good performance for a pose-based approach, hav-
ing a clearly visible diagonal with a few outliers, pri-
marily confusing walking motions. The action graph
shows a crisp diagonal, with only two major out-
liers, namely recognizing a sideways punch instead
of a cartwheel starting with the right hand and rec-
ognizing a clap above the head instead of a jump-
ing jack. However, in both cases the recognized ac-
tions are sub-actions of the correct ones, as illus-
trated in Fig. 2 and Fig. 3 for the jumping jack mo-
tion. Also, when visually comparing the cartwheel
instances with the sideways punches, the starting
phases of the cartwheels show strong similarities with
the sideways punches, leading to false recognitions.
Despite these confusions, the method remains broadly
suitable: inspecting the accuracy plot for the HDM05
library in Fig. 7, the recognition method detects roughly
90% of actions correctly, its accuracy peaking at
k = 1024.
Testing of the method was also done by evaluating
performance on a sparse accelerometer setup consist-
ing of only four accelerometers attached to the wrists
and ankles. Again, the HDM05 cut library was used
for this experiment. As can be seen in Fig. 5, the pose-
based SVM approach mislabels many action classes,
whereas the action graph method shows a dark di-
agonal with fewer mislabelings, indicating that the
method works well on this sensor setup. Taking a
closer look at the results also shows that the method
often confuses very similar actions, from which many
are not distinguishable from each other using ac-
celerometers alone. One such example is the pair of clap
above head and clap hands actions, which produce the
same sensor readings in the given sensor setup.
5.2.2 Judo Referee Signals
In a cross-modal scenario, we used skeletons ex-
tracted from a Microsoft Kinect device to query for
similar motions in the optical motion capture database
containing the Judo referee signal motions. Skele-
tal data obtained from this sensor contains positional
data only and is of much lower quality than the opti-
cal mocap data, meaning the positional noise is much
more noticeable and the accuracy of the system is not
on par with optical systems. Since the Microsoft
Kinect delivers skeletal motion data at 30 Hz, whereas
the optical motion capture system has a frame rate
of 119.88 Hz, the Vicon data is downsampled to the
lower rate to obtain temporal comparability.
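One straightforward way to perform this downsampling, sketched here under the assumption that joint positions can be interpolated linearly per dimension (the paper does not state its resampling scheme):

import numpy as np

def resample(positions, src_hz=119.88, dst_hz=30.0):
    """positions: (n_frames, n_dims) array sampled at src_hz."""
    t_src = np.arange(positions.shape[0]) / src_hz
    t_dst = np.arange(0.0, t_src[-1], 1.0 / dst_hz)
    # Interpolate each coordinate at the target timestamps.
    return np.stack([np.interp(t_dst, t_src, positions[:, d])
                     for d in range(positions.shape[1])], axis=1)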
In order to improve the probability of finding
paths through the pose neighborhoods using the ac-
tion graph, we ran additional tests with an increased
number of allowed steps (see Fig. 8). Interestingly,
feature set F_E^15 gains in accuracy when allowing 8
steps and 2^10 neighbors.
5.2.3 Action Recognition from Video Data
To show that the action recognition concept easily ap-
plies to other sorts of data, we refer to the example
of video data. In order to keep emphasis on our ac-
tion recognition method, we use a simple setup to
demonstrate the concept. To this end, we annotate
positions of the hands, feet and the head in the first frame
of video data and use standard feature detection meth-
ods (MSCR and SURF) to track the relevant features
used in our algorithm. Based on these we obtain a ten-
dimensional feature set F_{E-2d}^10 consisting of five
two-dimensional positions. Camera parameters are derived
ActionGraph-AVersatileDataStructureforActionRecognition
331
Figure 4: Confusion matrices for SVM-based and action graph recognition methods, calculated on the HDM05 cut database
using feature set F_E^15 and the precision-recall diagram. We further distinguish the results between two different databases. In
the actor included scenario, motion clips of the actor of the currently tested action are allowed to be present in the database.
In the actor excluded scenario, the database is cleared of all clips of the actor in a preprocessing step.
Figure 5: Confusion matrices for SVM-based and action graph recognition methods, calculated on the HDM05 cut database
using data obtained from accelerometers attached to the wrists and ankles. On the right we show the precision-recall diagram.
Figure 6: Confusion matrices for SVM-based and action graph recognition methods, calculated on Judo Motions using feature
set F_E^15 for Kinect and V-Files and the corresponding precision-recall diagram.
Figure 7: Confusion matrices for different values of the parameter k (128, 256, 512) using feature set F_E^15 on the HDM05
database and the corresponding precision-recall diagram.
by incorporating knowledge about the scene and ac-
tors.
Since the motion database consists of three-
dimensional positional data in this example and the
feature extraction from video yields two-dimensional
interest points, we perform parallel projections of all
poses contained in the database. To handle different
viewing directions, projections were performed from
different viewing angles in 20 degree steps. All re-
sulting two-dimensional features were used to con-
struct a kd-tree for knn search. The back-projection
of kd-tree indices results in database indices in the
motion’s original space, enabling the use of the ac-
tion graph to detect the performed actions. Since the
tracked features are very noisy in our case, the action
graph does not return paths in all relevant cases. To
alleviate this we adapted step size conditions for this
scenario to allow for steps (1,3),(3,1),(4,1) and (1,4)
in addition to the previously mentioned ones.
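The projection step can be sketched as follows; the joint count, the simple parallel camera model and the back-projection bookkeeping are illustrative assumptions.

import numpy as np
from scipy.spatial import cKDTree

def project_views(pose3d, step_deg=20):
    """pose3d: (5, 3) joint positions; returns (360/step_deg, 10) features."""
    feats = []
    for a in np.deg2rad(np.arange(0, 360, step_deg)):
        c, s = np.cos(a), np.sin(a)
        R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        feats.append((pose3d @ R.T)[:, :2].ravel())  # drop depth: parallel projection
    return np.array(feats)

poses = np.random.rand(100, 5, 3)  # stand-in for database poses
tree = cKDTree(np.concatenate([project_views(p) for p in poses]))
# Back-projection: kd-tree index // 18 recovers the original database pose,
# since each pose contributes 18 views (360 / 20 degrees).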
5.2.4 Comprehensive Analysis of the Results
As demonstrated by the confusion matrices in Fig. 4
and Fig. 5, the results of the above-mentioned tests
show that the proposed detection method works very
GRAPP2014-InternationalConferenceonComputerGraphicsTheoryandApplications
332
Figure 8: Confusion matrix and precision-recall diagram for
Kinect queries in the Judo scenario where larger step sizes
were used.
well on optical motion capture data and still well for
accelerometer data. However, the Judo results lag far
behind this good score both for motion capture data
as well as data from Kinect recordings (see Fig. 6).
In both cases this is partly due to the fact that the
recorded referee motion repertoire turns out to be a
challenge for the method in itself: For one, most of
the gestures typical for Judo referees are fairly static
and do not display the continual movement a sensible
action recognition method relies on. Additionally,
referee gestures with different meanings often differ
only marginally, which is especially problematic for
the noisy Kinect data and its poorly aligned skeletons,
further deteriorating recognition conditions.
Fig. 7 illustrates how the resulting precision and
recall evolve for increasing choices of the number k
of nearest neighbors in the action recognition test. As
can be seen, the results for k = 512 already display
satisfactorily high precision. Achieving this is easy,
however, when so little recall is required. A more
reliable framework forces the recall to be higher by
employing a parameter of k = 1024, although this re-
sults in some loss of precision.
6 CONCLUSION AND FUTURE
WORK
This paper examined methods to automatically detect
human full body motions using motion capture data
obtained from various setups. This includes working
with high quality optical motion capture data, skele-
tons obtained from the Microsoft Kinect within a cross-
modal setup as well as sparse and noisy data obtained
from accelerometers. Moreover, the method extends
to features extracted from video data. In particular,
the presented data-driven motion-based detector was
found to be superior to support vector machines in
terms of performance.
The approach at hand is parameterized by the em-
ployed feature sets and hence will also work with
other feature sets. It will therefore be a matter of future work to
use and to evaluate the very recently proposed more
robust feature sets (Ofli et al., 2012) within our frame-
work. There are certain areas which turn out to serve
as fertile grounds for future work: From one point of
view, the application of the fixed radius search method
has revealed there is a striking amount of variation in
the respective retrieved pose neighborhoods of certain
queries. In particular, the gaps occurring within these
neighborhoods seem noteworthy. Analyzing neigh-
borhood variation phenomena should provide inter-
esting new insight.
Although our method is already real-time capable
in many scenarios, it allows for modifications to in-
crease this capability to scenarios with many allowed
time steps and large pose neighborhoods: It is clearly
not necessary to create a complete graph structure for
every single frame throughout the process. Working
out a more efficient solution which avoids discarding
previously acquired information in the spirit of (Taut-
ges et al., 2011) will contribute significantly to greater
efficiency.
Moreover, since all significant processes involved in
our method are easy to parallelize, they come with
even more advantages when executed on highly par-
allel units. In particular, implementing the proposed
techniques on a GPU seems a logical step which shall
be taken in the future.
Another line of future research is the exploration
of other consumer electronic devices—such as con-
tact sensors, simple one- or two-axis accelerometers, al-
timeters, etc.—and their combinations. We refer to
(Latré et al., 2011) for a recent survey of wireless
body area networks and to (Khan et al., 2010) for
a recent accelerometer based physical activity sys-
tem. Although not all of these are suitable for our
approach, many of them present promising perspec-
tives. Especially cell phones come with an increasing
variety of sensors and hence become popular objects
of study. Combining the information from differ-
ent sensors at different body locations with
Bayesian a priori knowledge on the temporal evolu-
tion of human motions taken from databases—as our
approach can be summarized—might be beneficial in
this more general context as well.
REFERENCES
Arikan, O., Forsyth, D. A., and O’Brien, J. F. (2003). Mo-
tion synthesis from annotations. ACM Trans. Graph.,
22:402–408.
Bao, L. and Intille, S. (2004). Activity recognition from
user-annotated acceleration data. In Ferscha, A. and
Mattern, F., editors, Pervasive Computing, volume
3001 of Lecture Notes in Computer Science, pages 1–
17. Springer Berlin / Heidelberg.
ActionGraph-AVersatileDataStructureforActionRecognition
333
Barbič, J., Safonova, A., Pan, J.-Y., Faloutsos, C., Hodgins,
J. K., and Pollard, N. S. (2004). Segmenting mo-
tion capture data into distinct behaviors. In Heidrich,
W. and Balakrishnan, R., editors, Proceedings of the
Graphics Interface 2004 Conference, pages 185–194.
Canadian Human-Computer Communications Soci-
ety.
Baumann, J., Krüger, B., Zinke, A., and Weber, A. (2011).
Data-driven completion of motion capture data. In
Workshop on Virtual Reality Interaction and Physical
Simulation (VRIPHYS). Eurographics Association.
Bobick, A. F. and Davis, J. W. (2001). The recognition of human movement using
temporal templates. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 23:257–267.
Campbell, L. W. and Bobick, A. F. (1995). Recognition of
human body motion using phase space constraints. In
Proceedings of the Fifth International Conference on
Computer Vision, ICCV ’95, pages 624–, Washington,
DC, USA. IEEE Computer Society.
Carnegie Mellon University Graphics Lab (2004). CMU
Motion Capture Database.
Chai, J. and Hodgins, J. K. (2005). Performance animation
from low-dimensional control signals. ACM Trans.
Graph., 24:686–696.
Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library
for support vector machines. ACM Transactions on
Intelligent Systems and Technology, 2:27:1–27:27.
Choudhury, T., Borriello, G., Consolvo, S., Haehnel, D.,
Harrison, B., Hemingway, B., Hightower, J., Klasnja,
P., Koscher, K., LaMarca, A., Landay, J. A., LeGrand,
L., Lester, J., Rahimi, A., Rea, A., and Wyatt, D.
(2008). The mobile sensing platform: An embedded
system for activity recognition. IEEE Pervasive
Computing, Special Issue on Activity-Based
Computing, 7(2):32–41.
Khan, A. M., Lee, Y.-K., Lee, S. Y., and Kim, T.-S.
(2010). A triaxial accelerometer-based physical-
activity recognition via augmented-signal features
and a hierarchical recognizer. IEEE Transactions
on Information Technology in Biomedicine,
14(5):1166–1172.
Krüger, B., Tautges, J., Weber, A., and Zinke, A. (2010).
Fast local and global similarity searches in large mo-
tion capture databases. In Proceedings of the 2010
ACM SIGGRAPH/Eurographics Symposium on Com-
puter Animation, SCA ’10, pages 1–10, Madrid,
Spain. Eurographics Association.
Latré, B., Braem, B., Moerman, I., Blondia, C., and De-
meester, P. (2011). A survey on wireless body area
networks. Wireless Networks, 17(1):1–18.
Maurer, U., Smailagic, A., Siewiorek, D., and Deisher, M.
(2006). Activity recognition and monitoring using
multiple sensors on different body positions. In Wear-
able and Implantable Body Sensor Networks, 2006.
BSN 2006. International Workshop on, pages
113–116.
Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger,
B., and Weber, A. (2007). Documentation: Mocap
Database HDM05. Computer Graphics Technical Re-
port CG-2007-2, Universität Bonn.
Numaguchi, N., Nakazawa, A., Shiratori, T., and Hodgins,
J. K. (2011). A puppet interface for retrieval of motion
capture data. In Proc. ACM SIGGRAPH/Eurographics
Symposium on Computer Animation.
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., and Bajcsy,
R. (2012). Sequence of the most informative joints
(smij): A new representation for human skeletal ac-
tion recognition. In CVPR Workshops, pages 8–13.
IEEE.
Raptis, M., Kirovski, D., and Hoppe, H. (2011). Real-time
classification of dance gestures from skeleton anima-
tion. In Symposium on Computer Animation, pages
147–156.
Ravi, N., Dandekar, N., Mysore, P., and Littman, M. L.
(2005). Activity recognition from accelerometer data.
In Proceedings of the 17th conference on Innova-
tive applications of artificial intelligence - Volume 3,
IAAI’05, pages 1541–1546. AAAI Press.
Schuldt, C., Laptev, I., and Caputo, B. (2004). Recognizing
human actions: A local svm approach. In Proceedings
of the Pattern Recognition, 17th International Confer-
ence on (ICPR’04) Volume 3 - Volume 03, ICPR ’04,
pages 32–36, Washington, DC, USA. IEEE Computer
Society.
Tautges, J., Zinke, A., Krüger, B., Baumann, J., Weber, A.,
Helten, T., Müller, M., Seidel, H.-P., and Eberhardt,
B. (2011). Motion reconstruction using sparse ac-
celerometer data. ACM Trans. Graph., 30:18:1–18:12.
Wyatt, D., Philipose, M., and Choudhury, T. (2005). Un-
supervised activity recognition using automatically
mined common sense. In Proceedings of the 20th na-
tional conference on Artificial intelligence - Volume 1,
AAAI’05, pages 21–27. AAAI Press.
GRAPP2014-InternationalConferenceonComputerGraphicsTheoryandApplications
334