Recognising Actions for Instructional Training using Pose Information:

A Comparative Evaluation

S. Bruton and Gerard Lacey

Graphics, Vision and Visualisation (GV2), Trinity College Dublin, University of Dublin, Ireland

Keywords:

Action Recognition, Deep Learning, Pose Estimation.

Abstract:

Humans perform many complex tasks involving the manipulation of multiple objects. Recognition of the constituent actions of these tasks can be used to drive instructional training systems. The identities and poses of the objects used during such tasks are salient for the purposes of recognition. In this work, 3D object detection and registration techniques are used to identify and track objects involved in an everyday task of preparing a cup of tea. The pose information serves as input to an action classification system that uses Long Short-Term Memory (LSTM) recurrent neural networks as part of a deep architecture. An advantage of this approach is that it can represent the complex dynamics of object and human poses at hierarchical levels without the need to design specific spatio-temporal features. By using such compact features, we demonstrate the feasibility of using the hyperparameter optimisation technique of Tree-structured Parzen Estimators to identify optimal hyperparameters as well as network architectures. The resulting recognition accuracy of 83% shows that this approach is viable for similar pervasive computing scenarios where prior scene knowledge exists.

1 INTRODUCTION

Humans perform many complex tasks with their hands. These tasks often involve manipulating objects in a specific manner to achieve a goal. One way in which humans learn the motor skills necessary to perform these tasks is by repetition with evaluative feedback from an expert supervisor (Debarnot et al., 2014; Ericsson et al., 1993). A formalised method of this type of training, known as Direct Observation of Procedural Skills (DOPS), has proven effective for improving undergraduate medical skills (Profanter and Perathoner, 2015). This observational training has drawbacks, however. These include the subjectivity and biases of the supervisor, the logistical requirement of the physical presence of the supervisor, and the cost associated with providing supervisors for all students. To overcome these issues, we wish to develop techniques to automate the supervisory role in learning physical tasks. A key challenge of implementing such a system is the recognition of the individual actions that are part of the task being performed. In this work, we use an example task of preparing a cup of tea in place of a medical skill to explore methods of constructing such a system.

Using a fixed arrangement for the system allows greater flexibility in the types of camera sensors that can be used. The extra depth information available from consumer RGB-D cameras has been used to significant advantage in tackling the problem of tracking humans at various granularities (Shotton et al., 2013; Tang et al., 2016). Another area where the availability of depth data has improved results is object pose estimation (Hinterstoisser et al., 2012). These continuing advances could allow for accurate real-time tracking of objects and people, providing valuable input information for a system that determines whether a complex object manipulation task has been performed correctly.

The problem remains, however, of how to use this information to understand the interactions between people and objects. In this work, we use recurrent neural networks to recognise human-object interactions based on tracked object poses. Training against compact pose data permits training to be completed in a reasonable timeframe. We use this property to perform a hyperparameter search for architectural and algorithmic parameters, automating the costly process of identifying an optimal architecture.

As part of our work, we also make available the dataset used to evaluate this system. This dataset comprises performances of an activity recorded with a multi-camera RGB-D set-up, as well as 3D scans of all the objects used (https://www.scss.tcd.ie/gerard.lacey).


Bruton, S. and Lacey, G.

Recognising Actions for Instructional Training using Pose Information: A Comparative Evaluation.

DOI: 10.5220/0007395304820489

In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2019), pages 482-489

ISBN: 978-989-758-354-4

Figure 1: The high-level overview of the stages of the action recognition system.

2 RELATED WORK

Our task is to recognise a sequence of human-object interactions as part of a fixed activity. In the literature, systems for providing situational support for such goal-directed activities have been developed. Many of these systems make use of inertial sensors, such as accelerometers, attached to the objects in use. Sensor information has been combined with image features to train discriminative classifiers, such as random forests, to recognise actions involved in the preparation of a salad (Stein and McKenna, 2013). A drawback of sensor-based approaches is that the presence of sensors is less natural for users and may hinder them in performing the task using their typical technique.

Other methods for recognition of actions as part of a complex task have focussed on the tracking of objects and the design of features based on associated motion patterns. On a dataset of cooking activities, a number of object tracking methods were tested in performing fine-grained action recognition (Rohrbach et al., 2012). It was found that a pose-based descriptor approach, based on Fourier transform features, underperformed relative to dense trajectories (Wang et al., 2011). Other authors (Stein and McKenna, 2017) attempt to recognise component fine-grained actions of the complex activity of preparing a salad by devising a custom feature descriptor based on histograms of tracklets described relative to the object in use.

As an alternative to tracking individual objects, other approaches use global image features to perform segmentation and recognition across an entire activity sequence. One recent work (Kuehne et al., 2016) used a generative framework to segment fine-grained activities. Similarly based on language and grammar models, other authors (Richard and Gall, 2016) use a probabilistic model that models the segmentation and classification of actions jointly. However, because these segmental approaches require observation of the entire sequence, they cannot be used to provide supervisory feedback during an activity.

The resurgence of neural networks has seen CNNs used to produce global image features. A recent work (Lea et al., 2016) utilised a CNN to extract image features, and used a 1D convolution over the output image features to classify fine-grained actions of goal-directed activities. The authors also utilise a semi-Markov model to segment a performance video into actions; however, this improves accuracy only marginally for the 50 Salads dataset (Stein and McKenna, 2013).

This variety of approaches shows that there is no consensus on the best techniques for recognising actions for situational support systems.

3 SYSTEM DESIGN

Here, we detail the component stages of the entire recognition pipeline. The pipeline itself is illustrated in Figure 1, with references to the following sections.

3.1 Sensor Fusion

We assume that the task can be performed on the surface of a table. To eliminate occlusions of information that is salient for action recognition, we use an array of three RGB-D cameras, arranged as per Figure 2. For each camera, we construct point cloud representations from each RGB-D frame.

It is necessary to synchronise the point clouds received from the different cameras to allow for fusion into a single merged point cloud. To do so, we associate point clouds by inspecting the time differences between receipt of the data from the different cameras and determining whether the difference is within a specified threshold.
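The association step described above can be sketched as follows. This is a minimal illustration for one pair of cameras, assuming each camera delivers a sorted list of receipt timestamps in seconds; the function name `associate_frames` and the parameter `max_diff` are hypothetical and do not come from the implementation used in this work.

```python
from typing import List, Tuple


def associate_frames(
    stamps_a: List[float],
    stamps_b: List[float],
    max_diff: float,
) -> List[Tuple[int, int]]:
    """Pair indices of two sorted timestamp lists whose difference is within max_diff."""
    pairs = []
    j = 0
    for i, ta in enumerate(stamps_a):
        # Advance j while the next timestamp in b is at least as close to ta.
        while j + 1 < len(stamps_b) and abs(stamps_b[j + 1] - ta) <= abs(stamps_b[j] - ta):
            j += 1
        # Accept the closest candidate only if it lies within the threshold.
        if stamps_b and abs(stamps_b[j] - ta) <= max_diff:
            pairs.append((i, j))
    return pairs
```

With three cameras, one option would be to treat the centre camera as the reference and keep a merged frame only when both other cameras yield a match for it.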

Interference is an undesirable side effect of using multiple structured infra-red sensors to observe the same target. This results in holes and noise in the depth maps. The “Shake ‘N’ Sense” technique (Butler et al., 2012) is used to address this problem. This technique involves vibrating the affected sensors, which causes the patterns of other cameras to appear blurred relative to each sensor's own pattern. The advantages of this technique are that it maximises the amount of data available and has zero computational cost.

Figure 2: The arrangement of the recording set-up. The top-left image shows the vibration motor attached to an RGB-D camera (Asus Xtion). The right image shows the placement of the cameras in relation to the task table. The bottom-left image shows a colour image taken from the centre camera during a recording.

To fuse the point clouds, we estimate the rigid transformation between camera pairs. The registration pipeline detects discriminative local features in each of the point clouds, finds matches for these features and estimates a registration based on these matches. Intrinsic Shape Signature (ISS) keypoints (Zhong, 2009) were determined for each sensor's point cloud and Signatures of Histograms of Orientations (SHOT) descriptors were calculated (Salti et al., 2014) at these keypoints. To provide highly discriminative features, a number of target objects with characteristic geometry were included in the scene. The rigid transformation is then estimated from the matched correspondences using the Levenberg-Marquardt algorithm. This alignment is refined using the Generalised Iterative Closest Point algorithm (Segal et al., 2009), which estimates a dense registration after the coarse feature-based registration.

We use the presence of the task table to extract from the merged cloud all points above the table, isolating the points of interest for our purposes. The RANSAC algorithm is used to find the table in the point cloud (Schnabel et al., 2007). Once the largest plane has been identified, Euclidean clustering (Rusu, 2010) is used to isolate the table points from other points that lie on this 3D plane. To identify all the points above the table, we construct a convex hull around the table cluster and check whether a point lies within a volume extruded above this hull.
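A simplified sketch of the plane-finding and point-selection steps is given below. It is not the efficient RANSAC variant of Schnabel et al. and it omits both the Euclidean clustering and the convex hull test; it only fits the dominant plane with a basic RANSAC loop and keeps points on the upper side of that plane. The function names, the thresholds and the `up` direction are assumptions made for illustration.

```python
import numpy as np


def fit_plane_ransac(points, threshold=0.01, iterations=200, seed=0):
    """Return (normal, d, inlier_mask) for the plane n.x + d = 0 with the most inliers."""
    rng = np.random.default_rng(seed)
    best_mask = np.zeros(len(points), dtype=bool)
    best_plane = (np.array([0.0, 0.0, 1.0]), 0.0)
    for _ in range(iterations):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-12:  # degenerate (collinear) sample, try again
            continue
        normal = normal / norm
        d = -normal.dot(sample[0])
        mask = np.abs(points @ normal + d) < threshold
        if mask.sum() > best_mask.sum():
            best_mask, best_plane = mask, (normal, d)
    return best_plane[0], best_plane[1], best_mask


def points_above_plane(points, normal, d, up=np.array([0.0, 0.0, 1.0]), margin=0.01):
    """Keep points farther than `margin` from the plane on the side that `up` points to."""
    if normal.dot(up) < 0:  # orient the plane normal towards `up`
        normal, d = -normal, -d
    return points[points @ normal + d > margin]
```

Restricting the result to the footprint of the table, as the convex hull test does, would additionally discard points that are above the plane but beside the table.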

3.2 Object Pose Estimation

The technique for estimating object poses is composed of two stages: estimating the initial pose at the beginning of a video; and tracking the object through the remainder of the video. The first stage identifies an object on the task table using the Linemod technique (Hinterstoisser et al., 2012). This technique combines colour gradients and surface normals to generate templates for known objects which can be efficiently searched. This technique has the benefit of working for objects that may have little surface texture, or are partially occluded. The templates are collected by performing three-dimensional scans of the objects, and calculating the templates of the objects at different possible poses.

Figure 3: A rendering of scanned objects as meshes.

The second stage tracks an object as the video progresses. As a result of the merging and segmenting of the point clouds, we have isolated the points that should belong only to either objects or the subject's arms. Based on the understanding that objects move small distances between frames, poses are registered frame to frame using the Generalised Iterative Closest Point (GICP) algorithm (Segal et al., 2009). This algorithm uses a probabilistic model for a point-to-point cost function, which has been shown to be more robust to incorrect correspondences than other iterative closest point algorithms. Point clouds of scanned versions of the objects, for example those in Figure 3, are used for registration. To ensure consistent densities of points across the source and target point clouds, the points are filtered using a voxel grid of fixed cell dimensions.
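The voxel grid filtering can be illustrated with the sketch below, which replaces all points that fall within the same cubic cell by their centroid. This is one common form of voxel filter and is given under the assumption that a centroid-based filter is acceptable; the function name and the `cell_size` parameter are hypothetical.

```python
import numpy as np


def voxel_downsample(points, cell_size):
    """Replace all points falling in the same cubic cell by their centroid."""
    # Integer cell coordinates of every point.
    cells = np.floor(points / cell_size).astype(np.int64)
    _, inverse, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.reshape(-1)  # guard against shape differences between NumPy versions
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)  # accumulate the points of each occupied cell
    return sums / counts[:, None]
```

Applying the same cell size to both the scanned object cloud and the observed cloud gives the two inputs to registration a comparable point density.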

Given the initial pose of the object, estimated using the Linemod algorithm, and the incremental poses, found via registration of the scanned version of an object transformed to the previous pose, a series of poses for each object for a set of merged point clouds will be produced. The poses are encoded as a transformation matrix, $T : \mathbb{R}^3 \to \mathbb{R}^3$. In homogeneous form, this transformation matrix is composed of a rotation, $R : \mathbb{R}^3 \to \mathbb{R}^3$, and a translation, $t \in \mathbb{R}^3$, as $T = [R, t; 0^\top, 1]$. A more dimensionally compact representation of a transformation can be found by using unit quaternions to encode the rotation. Thus each pose is encoded as $[t, a, b, c, d]$, where the rotation is encoded by the unit quaternion $z = a + bi + cj + dk$, with $a, b, c, d \in \mathbb{R}$.
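As an illustration of this encoding, the sketch below converts a 4x4 homogeneous transformation into the seven-value vector [t, a, b, c, d]. The conversion follows the standard trace-based method for rotation matrices; the function names are hypothetical, and the sign convention (choosing the quaternion with a >= 0) is an assumption, since a unit quaternion and its negation encode the same rotation.

```python
import numpy as np


def rotation_to_quaternion(R):
    """Convert a 3x3 rotation matrix to a unit quaternion (a, b, c, d) with a >= 0."""
    trace = np.trace(R)
    if trace > 0:
        s = 2.0 * np.sqrt(1.0 + trace)
        q = [0.25 * s, (R[2, 1] - R[1, 2]) / s, (R[0, 2] - R[2, 0]) / s, (R[1, 0] - R[0, 1]) / s]
    else:
        # Use the largest diagonal element for numerical stability.
        i = int(np.argmax(np.diag(R)))
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + R[i, i] - R[j, j] - R[k, k])
        q = [0.0] * 4
        q[0] = (R[k, j] - R[j, k]) / s
        q[1 + i] = 0.25 * s
        q[1 + j] = (R[j, i] + R[i, j]) / s
        q[1 + k] = (R[k, i] + R[i, k]) / s
    q = np.array(q)
    q /= np.linalg.norm(q)
    return q if q[0] >= 0 else -q


def pose_vector(T):
    """Encode a 4x4 homogeneous transform as [t, a, b, c, d] (7 values)."""
    return np.concatenate([T[:3, 3], rotation_to_quaternion(T[:3, :3])])
```

This reduces each object pose from the twelve free entries of the matrix form to seven values.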

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications


Figure 4: Left, a cluster centroid, $\bar{q}$, and the three eigenvectors, $e_1$, $e_2$ and $e_3$, are shown for an arm cluster. Right, the centroid is subtracted and the eigenvectors are axis-aligned. The centres of the two axis-aligned bounding box faces perpendicular to the principal axis, $p_1$ and $p_2$, are used as the two end effectors for the arm pose.

3.3 Arm Pose Estimation

To estimate an arm pose, we use the segmented merged point cloud to identify arm points. Using the estimated object poses, we remove points that lie within a threshold distance of any points of the transformed object point clouds. Due to noise and slight misalignments of objects, it is necessary to perform further segmentation. Under the assumption that arm points all lie within a certain distance of each other, Euclidean clustering (Rusu, 2010) is used to identify the two largest clusters of the remaining points. A lower bound on possible cluster size ensures that only clusters large enough to be arms are selected.

We estimate the arm poses from the cluster information. We denote a candidate arm cluster as $Q = \{q_i \in \mathbb{R}^3, i = 1, 2, \ldots\}$. We represent a pose using three components: $a_1 \in \mathbb{R}^3$, the inner arm point in the task area; $a_2 \in \mathbb{R}^3$, a point representing the position of the subject's wrist; and $r \in \mathbb{R}$, the width of the subject's arm as it appears to the central camera sensor. To calculate these features, the extents of an oriented bounding box over the cluster set $Q$ are examined.

To determine these bounding box extents, Principal Component Analysis (PCA) is used. The centroid, $\bar{q}$, is subtracted from each point, $q_i$, in a candidate cluster. A data matrix, $X$, is constructed, with each row representing a mean-subtracted point. The eigenvalues of the matrix $X^\top X$ are calculated along with the corresponding eigenvectors, $e_1$, $e_2$ and $e_3$, representing the principal components of the cluster set. These eigenvectors can be composed into an orthogonal rotation matrix that transforms the three principal axes of greatest variance to the Cartesian axes, $R = [e_1, e_2, e_3]^\top$. The new data matrix, $Y^\top = R X^\top$, facilitates the determination of the bounding box extents. The maximal and minimal extents along each row of this matrix define the axis-aligned bounding box for the set of transformed arm points. The centres of the two faces perpendicular to the principal axis (see Figure 4), $p_1$ and $p_2$, can thus be found. The inverse transformation, $a_j = R^\top p_j + \bar{q}$ for $j \in \{1, 2\}$, can then be applied to these points to transform them back to the original arm cluster and obtain the desired end effectors. The final feature component is the width of the observed arm cluster, $r$, corresponding to the bounding box width.
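The bounding box computation above can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation used in this work: the function name is hypothetical, the width is taken as the extent along the second principal axis (an assumption), and because the sign of an eigenvector is arbitrary the sketch does not decide which end point is the inner arm point and which is the wrist.

```python
import numpy as np


def arm_pose_from_cluster(Q):
    """Return two end points (a1, a2) and a width r for an (N, 3) arm cluster."""
    q_bar = Q.mean(axis=0)
    X = Q - q_bar                                # rows are mean-subtracted points
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    R = eigvecs[:, order].T                      # rows are e1, e2, e3
    Y = X @ R.T                                  # equivalent to Y^T = R X^T
    lo, hi = Y.min(axis=0), Y.max(axis=0)
    centre = 0.5 * (lo + hi)
    # Centres of the two box faces perpendicular to the principal axis.
    p1 = np.array([lo[0], centre[1], centre[2]])
    p2 = np.array([hi[0], centre[1], centre[2]])
    # Inverse transformation back to the original cluster frame.
    a1 = R.T @ p1 + q_bar
    a2 = R.T @ p2 + q_bar
    r = hi[1] - lo[1]                            # assumed definition of the box width
    return a1, a2, r
```

In practice, scene knowledge such as the side of the table from which the arm enters could be used to order the two end points.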

To distinguish between a left and right arm, the centroids of the two clusters are inspected. If a single cluster is detected, geometrical rules based on the position of the centroid and the angle of the arm are used to determine handedness. The threshold position and angle for these rules are defined based on the known scene arrangement.

3.4 Action Recognition

Given the individual feature components described in Sections 3.2 and 3.3, the feature vector is defined as the concatenation of the object features, $[t_i, a_i, b_i, c_i, d_i]$ for each $i \in \{\text{cup}, \text{pot}, \text{bowl}, \text{jug}\}$, and the arm features, $[a_{1,o}, a_{2,o}, r_o]$ for each $o \in \{\text{left}, \text{right}\}$. Here, $a_{1,o}$, $a_{2,o}$, and $r_o$ represent the inner point, outer point and width of the estimated pose of arm $o$, respectively. To capture temporal dynamics, a sequence of pose features is used for classification.
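The concatenation can be sketched as below. The dictionary layout and the function name are hypothetical. Note that the resulting length of 42 is derived here from the component sizes given in the text (four objects with seven values each, two arms with seven values each) and is not a figure stated elsewhere in this paper.

```python
import numpy as np

OBJECTS = ("cup", "pot", "bowl", "jug")
ARMS = ("left", "right")


def frame_features(object_poses, arm_poses):
    """Concatenate per-object [t, a, b, c, d] and per-arm [a1, a2, r] into one vector."""
    parts = []
    for name in OBJECTS:
        t, quat = object_poses[name]   # t: 3 values, quat: (a, b, c, d)
        parts.extend([t, quat])
    for side in ARMS:
        a1, a2, r = arm_poses[side]    # a1, a2: 3 values each, r: scalar width
        parts.extend([a1, a2, [r]])
    return np.concatenate([np.asarray(p, dtype=float) for p in parts])
```

A classifier input is then a stack of such vectors over consecutive frames.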

In the area of natural language processing, recurrent neural networks (RNNs) have proven useful in performing tasks such as machine translation and information modelling. More recent work has looked at utilising recurrent neural networks to learn from sequences of image features extracted from videos (Donahue et al., 2016).

Long Short-Term Memory (LSTM) networks are an RNN structure designed to overcome the vanishing gradient problem (Hochreiter and Schmidhuber, 1997) by allowing for old (potentially useless) information to be forgotten and new (useful) information to be recorded in the cell state.
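To make the role of the cell state concrete, the following sketch implements a single time step of a commonly used LSTM formulation with input, forget and output gates. It is an illustration only: the stacked weight layout, the gate order and the function names are assumptions, and deep learning frameworks may organise their parameters differently.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,), gate order i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to record
    f = sigmoid(z[H:2 * H])        # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate
    g = np.tanh(z[3 * H:4 * H])    # candidate cell update
    c = f * c_prev + i * g         # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c
```

Applying `lstm_step` to each pose feature vector in a sequence, carrying `h` and `c` forward, yields the recurrent representation that is passed to later layers.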

The LSTM cells are used as part of a larger neural network architecture to perform classification of sequences of pose features. The recognition performance is dependent upon the architecture. Decisions include whether to train LSTM cells on forward sequences or both forward and reverse sequences, illustrated by the ‘reverse sequence’ operation in Figure 5. Other decisions include the number of LSTM layers to use and the number of cell units per layer. These decisions are made using an automated procedure for identifying an optimal architecture, which will be discussed in Section 4.3.

In our architectures, the output of the LSTM layers undergoes further transformations via a series of fully connected layers. The outputs of each of these layers pass through an activation function, with the outputs of intermediate dense layers undergoing a Rectified Linear Unit (ReLU) activation. The outputs of the final layer undergo a softmax activation to produce an output distribution across the possible label classes.

Figure 5: An example architecture used to classify sequences of pose information into action labels.

Dropout is applied to the inputs of each of the dense layers for regularisation purposes. For the LSTM cells, a variant of dropout for recurrent neural networks is used, known as recurrent dropout (Gal and Ghahramani, 2016), whereby the dropout mask is identical across each of the timesteps in a sequence. An $L_2$ loss is added based on the weights of the dense layers to prevent over-dependence on individual neurons for classifications. Finally, batch normalisation is used between the layers of the network.

4 EXPERIMENTAL SET-UP

4.1 Dataset Collection

To evaluate our system, a dataset was collected of RGB-D videos of people performing the task of preparing a cup of tea. The dataset is composed of 24 samples recorded using three Asus Xtion Pro Live RGB-D cameras. A total of eight subjects were recorded performing the task three times. There were no restrictions imposed on how they prepared the cup of tea. The videos were all manually labelled with one of the five actions for each frame: ‘pour tea’; ‘pour milk’; ‘add sugar’; ‘stir’; and ‘background’. A total of 25,913 frames were recorded, equating to an average video length of 38 seconds.

4.2 Training

We train each candidate neural network architecture to recognise the actions in this dataset. Given a sequence of pose features for a time $t$, $(x_{t-n}, \ldots, x_t)$, where $n \in \mathbb{N}$ and $x \in \mathbb{R}^d$, and $d$ is the length of the pose feature vector, we wish to estimate the output probability mass function, $\hat{p}_t(y)$, where $y \in \mathcal{C}$, the set of action labels. Each neural network we train is a nonlinear differentiable function of the inputs, parametrised by the weights, $\hat{p}_t(y) = F(x_{t-n}, \ldots, x_t; \{W_i\})$, where $\{W_i\}$ is the set of weights contained in the network.

To train the network, back propagation is used. We minimise a loss function based on the categorical cross entropy, for each minibatch. The stochastic optimisation technique of Adam was used for this purpose.
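For reference, the categorical cross entropy over a minibatch can be written in its standard textbook form as follows. The notation is introduced here for illustration and is not taken from the implementation: $B$ is the minibatch size, $y_b$ is the ground-truth label of the $b$-th sequence in the minibatch, and $\hat{p}_b(y_b)$ is the probability that the network assigns to that label. The $L_2$ penalty on the dense layer weights described in Section 3.4 would be added to this term.

```latex
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{B} \sum_{b=1}^{B} \log \hat{p}_b(y_b)
```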

Each network is trained for 250 epochs, with a minibatch size of 1024. The weights of the dense layers and LSTMs in the network were initialised with Xavier uniform initialisation, and the offset biases were initialised with zeros. The TensorFlow deep learning framework was used for implementation. We utilise a leave-one-subject-out cross validation, testing on an individual subject for each fold. This has the benefit of characterising the performance of the system for an unseen subject, identifying cases which the system may find challenging to classify correctly.
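The leave-one-subject-out scheme can be sketched in plain Python as follows. The function name and the representation of the dataset as a list of subject identifiers, one per recording, are assumptions made for illustration.

```python
def leave_one_subject_out(sample_subjects):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for subject in sorted(set(sample_subjects)):
        # All recordings of the held-out subject form the test split.
        test = [i for i, s in enumerate(sample_subjects) if s == subject]
        train = [i for i, s in enumerate(sample_subjects) if s != subject]
        yield subject, train, test
```

With eight subjects and three recordings each, this produces eight folds, each training on 21 recordings and testing on the remaining 3.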

4.3 Hyperparameter Optimisation

The selection of optimal algorithm parameters can be time-consuming and involve expert knowledge that is difficult to convey. Due to the long training times of neural networks, techniques such as grid search and random search can be infeasible. However, there exist methods of hyperparameter searching that can reduce the number of search iterations required. One approach utilises Tree-Structured Parzen Estimators (TPE) to model the target function, whereby each of the sampled points is represented with a Gaussian distribution in the hyperparameter space (Bergstra et al., 2011).

The next sample point is selected based on Expected Improvement. This can be defined as the expectation, under some model $M$ of a fitness function $f : \mathcal{X} \to \mathbb{R}$, that $f(x)$ will negatively exceed some threshold $y^*$:

$$\mathrm{EI}_{y^*}(x) := \int_{-\infty}^{\infty} \max(y^* - y, 0)\, p_M(y \mid x)\, dy. \tag{1}$$
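Equation 1 can be evaluated numerically for any predictive density. The sketch below approximates the integral with a midpoint rule and uses a Gaussian density purely as a stand-in for the model's predictive distribution; TPE constructs its densities differently, so this illustrates the quantity being maximised and not the TPE algorithm itself. All function names and integration bounds are assumptions.

```python
import math


def expected_improvement(y_star, density, lower, upper, steps=20000):
    """Midpoint-rule approximation of the integral of max(y* - y, 0) * p(y | x) dy."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        y = lower + (k + 0.5) * width
        total += max(y_star - y, 0.0) * density(y)
    return total * width


def gaussian_density(mean, std):
    """Return a Gaussian probability density function with the given mean and std."""
    return lambda y: math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))
```

Only outcomes below the threshold contribute to the integral, so lowering the threshold for a fixed density reduces the expected improvement.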

In our case, as well as model parameters, we wish to find optimal architectural parameters, such as the number of LSTM layers, and whether to additionally train on reversed sequences (i.e. bidirectionally). Other hyperparameters searched over are shown in Table 1, along with the prior distributions for each. The HyperOpt library (Bergstra et al., 2011) was utilised to perform this hyperparameter search.

Table 1: The prior distributions selected for the hyperparameter searches. U(x, y) denotes the uniform distribution between x and y. Uniform distributions are used in all of the other discrete cases. These distributions are chosen based on a number of initial test runs.

Parameter                      Prior
Stride                         1, 2 or 4
Sequence Length                8, 16, 32 or 64
Recurrent Units                64 or 128
Number LSTM layers             1, 2 or 3
Reverse sequences              True or False
Dense Kernel L2                U(0.0001, 0.01)
LSTM input dropout rate        U(0.0, 0.3)
LSTM recurrent dropout rate    U(0.0, 0.3)
Softmax dropout rate           U(0.0, 0.5)
Initial learning rate          U(0.0001, 0.01)
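The priors in Table 1 can be written down as a simple sampler, sketched below with the Python standard library. This only draws independent samples from the priors, which corresponds to random search; it does not implement TPE and does not use the HyperOpt API. The dictionary keys are hypothetical names chosen for illustration.

```python
import random


def sample_hyperparameters(rng):
    """Draw one configuration from the prior distributions listed in Table 1."""
    return {
        "stride": rng.choice([1, 2, 4]),
        "sequence_length": rng.choice([8, 16, 32, 64]),
        "recurrent_units": rng.choice([64, 128]),
        "num_lstm_layers": rng.choice([1, 2, 3]),
        "reverse_sequences": rng.choice([True, False]),
        "dense_kernel_l2": rng.uniform(0.0001, 0.01),
        "lstm_input_dropout": rng.uniform(0.0, 0.3),
        "lstm_recurrent_dropout": rng.uniform(0.0, 0.3),
        "softmax_dropout": rng.uniform(0.0, 0.5),
        "initial_learning_rate": rng.uniform(0.0001, 0.01),
    }
```

A model-based search such as TPE starts from the same priors but biases later draws towards regions of the space that produced better scores.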

4.4 Performance Benchmarking

We compare our recurrent neural network recognition approach against benchmark recognition algorithms. These algorithms are Random Forest (Breiman, 2001) and Gradient Boosted Decision Trees (Friedman, 2001). These are selected due to their performance on high-dimensional recognition tasks (Shotton et al., 2013; Tang et al., 2016) and their use in related works (Stein and McKenna, 2017). The hyperparameter optimisation technique is also used to optimise the parameters of these classifiers. The hyperparameters that were searched over in the optimisation schedule were: the sequence stride; the sequence length; the number of tree estimators to use; and the maximum depth of the individual trees. In each optimisation, we maximise the $F_1$ score, calculated across all of the splits. This is calculated as the weighted harmonic mean of the precision and recall. For each classifier, fifty hyperparameter search iterations were performed, with each iteration tested using the cross-validation scheme.
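The per-class precision, recall and $F_1$ computation can be sketched as below, using the equally weighted harmonic mean $F_1 = 2PR / (P + R)$. The function name is hypothetical, and how the per-class scores are combined into a single figure is left open here, since the averaging scheme is an assumption that would need to match the evaluation protocol.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from two label sequences."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # Harmonic mean of precision and recall; zero when both are zero.
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores
```

Because the harmonic mean is dominated by the smaller of its two arguments, a classifier cannot obtain a high $F_1$ score for a class by favouring precision at the expense of recall, or the reverse.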

5 RESULTS AND DISCUSSION

5.1 Object Pose Estimation

To evaluate the object pose estimation, a qualitative analysis is performed. We analyse the estimated poses and determine correctness based on their correspondence to the perceived object pose, from images as per Figure 6. Overall, the object pose estimation performs reliably under this analysis. Of the 24 recordings, with four objects tracked in each recording, there are two instances of an object's pose tracking being irrecoverably lost. In each instance, the object briefly becomes fully occluded during the performance by a combination of the subject's arms and other objects. There are four further instances where the alignment of an object's pose is perceptibly incorrect for a portion of the recording. During the task performance, the cup is filled with tea, which changes the perceived shape of the object, causing the registration algorithm to incorrectly identify correspondences. A potential method to overcome this issue would be to introduce a classification stage to determine whether the cup has become filled, and once it has, switch to registering an object scan of a filled cup.

5.2 Arm Pose Estimation

Similarly to the evaluation of object pose estimation, ground truth data for the arm pose estimation technique is unavailable and so we adopt a qualitative assessment methodology. Overall, the arm poses are estimated to an acceptable level for the majority of the performance videos. As the technique relies on a single merged cloud, errors do not accumulate. Observed errors include confusion between arms and inclusion of object parts or other body parts in the pose estimation, as shown in Figure 6. Many of these errors are caused by upstream errors in the object pose estimation stage. Other errors occur due to incorrect segmentations caused by subjects moving their arms close to their bodies.

5.3 Action Recognition

The optimal LSTM network uses a temporal stride of a single frame and a sequence length of 64. It contains one LSTM layer with 64 units for each sequence order. It performs better than the benchmark classifiers by a significant margin, as can be seen in Table 2. The accuracy is 82.9%, calculated over all of the test splits. The $F_1$ metric, which we optimised against, is 81.72%, which is approximately 8 percentage points above the other classifier results. The classifier also has a smaller standard deviation across the action classes, indicating that it learns to discriminate actions more evenly.

Table 2: The classification metrics for the tested classifiers.

Classifier           Accuracy (%)   Precision (%)    Recall (%)       F1 score (%)
Random Forest        76.21          82.54 ± 17.13    71.11 ± 20.59    73.55 ± 14.91
Gradient Boosting    75.68          85.06 ± 18.12    70.45 ± 20.94    73.23 ± 13.89
LSTM                 82.90          81.46 ± 8.65     82.20 ± 7.62     81.72 ± 7.49

Figure 7: Confusion matrices for the three tested classification techniques. The LSTM classifier has a more even distribution of correct predictions over all of the classes than the Random Forest or Gradient Boosting classifiers.

Figure 6: Pose estimation results for different sample recordings. In the first three rows, hand pose and object poses are estimated reliably. In the fourth row, object pose estimation errors are shown. These are due to misalignments of the cup due to changing topology and tracking losses due to complete occlusions. In the final row, arm pose estimation errors are shown. These errors are due to incorrect segmentation of left and right arm clusters when the hands are close together, and incorrect estimation based on inclusion of points from the subject's body.

The results indicate that the LSTM network is able to represent the dynamics of the human-object interactions to determine the current action. Analysing the confusion matrices, as shown in Figure 7, we observe that the actions ‘Background’, ‘Place Sugar’ and ‘Stir Tea’ are mistakenly predicted for each other. These actions may be more difficult to disambiguate as the arm pose dynamics present in them are more subtle and do not necessarily involve the movement of an object.

To gain a deeper understanding of where the LSTM underperforms, we inspect its performance for individual test splits and observe that there is a large difference between the maximum $F_1$ score (93%) and minimum $F_1$ score (60%) for the splits. The worst performing splits contain the most significant pose estimation errors, as detailed in Sections 5.1 and 5.2. As such, improvements to these upstream methods should result in better performance for these splits.

6 CONCLUSIONS AND FUTURE WORK

In this work, we demonstrated a system that can classify human-object interactions for a goal-directed task to a high degree of accuracy. For the classification method, we proposed the use of an optimised neural network architecture involving LSTMs. Analysing the results, we identified areas for further improvement in the pipeline and proposed potential methods to overcome these weaknesses. We release the multi-camera RGB-D video dataset of all task performances, including 3D scan data for each of the objects used. The system could potentially be applied to numerous real-world problems that require the understanding of human-object interactions, such as smart assembly line monitoring. We intend to further develop this system to handle more complex interactions, such as those involved in procedural medical skills, as part of an automated instructional training system.

ACKNOWLEDGEMENTS

This research has been conducted under an Irish Research Council Enterprise Partnership Scholarship with Intel Ireland.

REFERENCES

Bergstra, J. S., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 24, pages 2546–2554. Curran Associates, Inc.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Butler, D. A., Izadi, S., Hilliges, O., Molyneaux, D., Hodges, S., and Kim, D. (2012). Shake'n'sense: Reducing interference for overlapping structured light depth cameras. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '12, pages 1933–1936, New York, NY, USA. ACM.

Debarnot, U., Sperduti, M., Di Rienzo, F., and Guillot, A. (2014). Experts bodies, experts minds: How physical and mental training shape the brain. Frontiers in Human Neuroscience, 8:280.

Donahue, J., Hendricks, L. A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., and Darrell, T. (2016). Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Ericsson, K. A., Krampe, R. T., and Tesch-Römer, C. (1993). The role of deliberate practice in the acquisition of expert performance. Psychological Review, 100(3):363–406.

Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189–1232.

Gal, Y. and Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 1019–1027. Curran Associates, Inc.

Hinterstoisser, S., Cagniart, C., Ilic, S., Sturm, P., Navab, N., Fua, P., and Lepetit, V. (2012). Gradient response maps for real-time detection of textureless objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(5):876–888.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.

Kuehne, H., Gall, J., and Serre, T. (2016). An end-to-end generative framework for video segmentation and recognition. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–8.

Lea, C., Reiter, A., Vidal, R., and Hager, G. D. (2016). Segmental spatiotemporal CNNs for fine-grained action segmentation. In Computer Vision - ECCV 2016, Lecture Notes in Computer Science, pages 36–52. Springer, Cham.

Profanter, C. and Perathoner, A. (2015). DOPS (direct observation of procedural skills) in undergraduate skills-lab: Does it work? Analysis of skills-performance and curricular side effects. GMS Zeitschrift für Medizinische Ausbildung, 32(4).

Richard, A. and Gall, J. (2016). Temporal action detection using a statistical language model. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3131–3140.

Rohrbach, M., Amin, S., Andriluka, M., and Schiele, B. (2012). A database for fine grained activity detection of cooking activities. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1194–1201.

Rusu, R. B. (2010). Semantic 3D object maps for everyday manipulation in human living environments. KI - Künstliche Intelligenz, 24(4):345–348.

Salti, S., Tombari, F., and Di Stefano, L. (2014). SHOT: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding, 125:251–264.

Schnabel, R., Wahl, R., and Klein, R. (2007). Efficient RANSAC for point-cloud shape detection. In Computer Graphics Forum, volume 26, pages 214–226. Wiley Online Library.

Segal, A., Haehnel, D., and Thrun, S. (2009). Generalized-ICP. In Robotics: Science and Systems, volume 2.

Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., Cook, M., and Moore, R. (2013). Real-time human pose recognition in parts from single depth images. Commun. ACM, 56(1):116–124.

Stein, S. and McKenna, S. J. (2013). Combining embedded accelerometers with computer vision for recognizing food preparation activities. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp '13, pages 729–738, New York, NY, USA. ACM.

Stein, S. and McKenna, S. J. (2017). Recognising complex activities with histograms of relative tracklets. Computer Vision and Image Understanding, 154:82–93.

Tang, D., Chang, H., Tejani, A., and Kim, T. K. (2016). Latent regression forest: Structured estimation of 3D hand poses. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1.

Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2011). Action recognition by dense trajectories. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3169–3176.

Zhong, Y. (2009). Intrinsic shape signatures: A shape descriptor for 3D object recognition. In 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pages 689–696.
