motion capture files in BVH format and classify them according to labels defined by a simplified ontology composed of 9 action tags.
Our approach used a separate data set of carefully edited mocap files to train the network to recognize each action. These data sets were adapted from the freely available CMU Motion Capture Library. After training, the network was tested on a different set of files that had not been used during training. For assessment, each of these files was manually annotated with the expected label.
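To make the training setup concrete, the following is a minimal sketch of such a sequence classifier, assuming a Keras LSTM and joint-rotation features already extracted from the BVH files; the framework choice, layer sizes, sequence length, and feature count are illustrative assumptions, not the exact configuration of the system described here.

    # Minimal sketch (illustrative, not the paper's exact architecture):
    # an LSTM that maps a fixed-length mocap clip to one of 9 action tags.
    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    NUM_TAGS = 9        # action tags in the simplified ontology
    SEQ_LEN = 120       # frames per clip (assumed; clips padded or cut)
    NUM_FEATURES = 60   # joint-rotation channels per frame (skeleton-dependent)

    model = Sequential([
        LSTM(128, input_shape=(SEQ_LEN, NUM_FEATURES)),
        Dense(NUM_TAGS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # x_train: (num_clips, SEQ_LEN, NUM_FEATURES) mocap feature array
    # y_train: (num_clips,) integer labels in [0, NUM_TAGS)
    # model.fit(x_train, y_train, epochs=50, validation_split=0.1)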
Comparing the results obtained by the classification software against the expected, manually annotated tags, the system achieved, in several of the tests, an accuracy better than 95%, which suggests that the original hypothesis has been satisfied.
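Continuing the sketch above, the assessment step amounts to comparing the predicted tag for each held-out clip with its manually annotated label and reporting the fraction of matches; the variable names below are illustrative.

    # Evaluation sketch: accuracy of predicted tags vs. manual annotations.
    # x_test, y_test are the held-out clips and their annotated labels.
    pred = np.argmax(model.predict(x_test), axis=1)   # predicted action tags
    accuracy = np.mean(pred == y_test)                # fraction of correct tags
    print(f"classification accuracy: {accuracy:.2%}")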
In future work, we expect to improve training by adding other actions to the ontology, for instance affective body postures, and/or by considering other kinds of media that might be of interest in a production pipeline, such as textures and sounds.
The ultimate goal would be to design an extendable, modular content annotator capable of annotating different types of media, based on a general-purpose ontology.
Another possible application that might benefit from this automatic motion capture action recognition technology is the authoring of character animations for retargeting crowd behaviors to different scenarios. In theory, such an AI system could help to understand each character's movements in a given situation, adapt the animations to new target scenarios, and thus facilitate the authoring of crowd simulations.
ACKNOWLEDGMENT
This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number 15/RP/2776 and in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No 780470.
REFERENCES
Brownlee, J. (2018). Long Short-Term Memory Networks
with Python - Develop Sequence Prediction Models
With Deep Learning. Machine Learning Mastery.
[eBook].
Bütepage, J., Black, M. J., Kragic, D., and Kjellström, H. (2017). Deep representation learning for human motion prediction and classification. CoRR, abs/1702.07486. http://arxiv.org/abs/1702.07486.
CMU Graphics Lab (2018). CMU graphics lab motion capture database. http://mocap.cs.cmu.edu/.
Delbridge, M. (2015). Motion Capture in Performance - An Introduction. Palgrave Macmillan UK, first edition.
Du, Y., Wong, Y., Liu, Y., Han, F., Gui, Y., Wang, Z., Kankanhalli, M., and Geng, W. (2016). Marker-less 3D human motion capture with monocular image sequence and height-maps. In European Conference on Computer Vision, pages 20–36. Springer.
Fabbri, M., Lanzi, F., Calderara, S., Palazzi, A., Vezzani, R., and Cucchiara, R. (2018). Learning to detect and track visible and occluded body joints in a virtual world. CoRR, abs/1803.08319. http://arxiv.org/abs/1803.08319.
Gupta, A., Martinez, J., Little, J. J., and Woodham, R. J. (2014). 3D pose from motion for cross-view action recognition via non-linear circulant temporal encoding. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2601–2608.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780. D.O.I.: 10.1162/neco.1997.9.8.1735, https://doi.org/10.1162/neco.1997.9.8.1735.
Kleinsmith, A., Bianchi-Berthouze, N., and Steed, A.
(2011). Automatic recognition of non-acted affective
postures. Trans. Sys. Man Cyber. Part B, 41(4):1027–
1038. D.O.I.: 10.1109/TSMCB.2010.2103557.
Martinez, J., Black, M. J., and Romero, J. (2017). On human motion prediction using recurrent neural networks. CoRR, abs/1705.02445. http://arxiv.org/abs/1705.02445.
Menache, A. (2011). Understanding Motion Capture for Computer Animation. Morgan Kaufmann, second edition.
SAUCE Project (2018). Smart asset re-use in creative environments - SAUCE. http://www.sauceproject.eu.
Tekin, B., Rozantsev, A., Lepetit, V., and Fua, P. (2016). Direct prediction of 3D body poses from motion compensated sequences. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 991–1000.
Toshev, A. and Szegedy, C. (2014). DeepPose: Human pose estimation via deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).