Semi-supervised Discovery of Time-series Templates for Gesture Spotting in Activity Recognition

Héctor F. Satizábal, Julien Rebetez and Andres Perez-Uribe

IICT, University of Applied Sciences (HES-SO), Yverdon-les-Bains, Switzerland

Keywords: Time Series Clustering, Template Discovery, Dynamic Time Warping, Activity Recognition.

Abstract: In human activity recognition, gesture spotting can be achieved by comparing the data from on-body sensors with a set of known gesture templates. This work presents a semi-supervised approach to template discovery in which the Dynamic Time Warping distance measure has been embedded in a classic clustering technique. Clustering is used to find a set of template candidates in an unsupervised manner, which are then evaluated by means of a supervised assessment of their classification performance. A cross-validation test over a benchmark dataset showed that our approach yields good results with the advantage of using a single sensor.

1 INTRODUCTION

Activity recognition is a research and application topic that has been gaining interest in recent years within the machine learning and pattern recognition community. Identifying the activities performed by a person is useful for many practical applications. In general, the activities executed by a user give a meaningful context to the surrounding devices, which could therefore adapt their behaviour to that context. A large number of previous works focus on activity recognition using many sensors, either in a single body location (Lara et al., 2012) or in multiple body locations (Sagha et al., 2011; Stiefmeier et al., 2008). Our work is motivated by the use of as few sensors as possible; hence we tested the use of a single sensor located on the right forearm. Moreover, we consider the recent “history” of the sensor data instead of basing the activity recognition on the features of the raw sensor data during a single window of time. We propose a semi-supervised approach for finding gesture “fingerprints”, which we then use as templates to perform gesture spotting.

Several works in the literature have addressed this issue. Ko et al. (2005) compared a group of pre-defined templates with the incoming signal using the Dynamic Time Warping distance measure to perform context recognition. Stiefmeier et al. (2008) used time-series quantization and an approach based on a distance between sequences similar to the Longest Common Sub-Sequence similarity measure. In this work, we use a two-step approach in which (i) the incoming sequences are grouped in an unsupervised manner in order to find good candidates for gesture templates, and (ii) the candidates are evaluated in a supervised manner by comparing them with the ground truth (i.e., the labels in the database).

The paper is organized as follows. Section 2 gives an overview of the dataset, Section 3 details the features computed from the acceleration data, Section 4 explains the semi-supervised approach for gesture template discovery, Section 5 presents the results obtained with our approach, and Section 6 presents the conclusions drawn from our experiments.

2 THE DATASET

We used the OPPORTUNITY Activity Recognition Dataset (Roggen et al., 2010), a dataset from the UCI repository (Frank and Asuncion, 2010) devised to benchmark human activity recognition algorithms. For each of the 4 subjects in the dataset there are five daily activity sessions and one drill session containing about 20 repetitions of some pre-defined actions. We used the drill session of the first subject (S1) for our tests. Moreover, we used only one of the 72 available sensors: the RLA inertial unit located on the right forearm. The classification goals were the following mid-level annotations:

Open/close fridge
Open/close dishwasher
Open/close 3 drawers
Open/close two doors
Turn light on/off
Clean the table
Drink standing/seated

F. Satizábal H., Rebetez J. and Perez-Uribe A. (2013).
Semi-supervised Discovery of Time-series Templates for Gesture Spotting in Activity Recognition.
In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods, pages 573-576
DOI: 10.5220/0004268805730576
SciTePress

3 DATA TRANSFORMATION AND FEATURE EXTRACTION

As a first step we used a band-pass filter between 0.1 Hz and 1 Hz to extract the motility component of the acceleration (Mathie et al., 2003). We then used a sliding window of 16 samples with 50% overlap to compute two features per axis: the average of the signal and the angle of the best linear segment approximating the acceleration signal within the window.
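The windowed feature extraction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the signal has already been band-pass filtered, that the "best linear segment" is an ordinary least-squares line fitted against the sample index, and that the angle is the arctangent of its slope. The function name `window_features` and the synthetic ramp signal are hypothetical.

```python
import math

def window_features(signal, size=16, overlap=0.5):
    """Return one (mean, angle) pair per sliding window of a 1-D signal.

    Assumption: the angle is atan(slope) of the least-squares line fitted
    to the window samples, using the sample index as the abscissa.
    """
    step = max(1, int(size * (1 - overlap)))      # 8 samples for 50% overlap
    xs = range(size)
    x_mean = sum(xs) / size
    sxx = sum((x - x_mean) ** 2 for x in xs)
    features = []
    for start in range(0, len(signal) - size + 1, step):
        window = signal[start:start + size]
        y_mean = sum(window) / size
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, window))
        features.append((y_mean, math.atan(sxy / sxx)))
    return features

# Synthetic single-axis signal with a constant slope of 0.5 per sample.
ramp = [0.5 * i for i in range(32)]
feats = window_features(ramp)
```

For a three-axis accelerometer the same function would be applied to each axis separately, giving the six feature values per window that are quantized in Section 4.2.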

4 SEMI-SUPERVISED DISCOVERY OF TEMPLATES

A template $T_i$ for a given gesture $G_i$ is a sequence of values $S$ which is closer to the sequences $S_i$ generated when gesture $G_i$ is executed than to the sequences $S_k$ generated when other gestures $G_k$ ($k \neq i$) are executed. This section describes the main steps of our approach for discovering gesture templates:

1. Use raw data from a single accelerometer as input.

2. Band-pass filter to extract motility components.

3. Extract average and angle features.

4. K-means to create an alphabet.

5. K-medoids to find a vocabulary of common words.

6. Select templates using the f(0.5) score.

7. Perform gesture spotting, evaluated with the f(1) score.

4.1 Measures of Distance between Sequences

There are several ways of assessing the similarity or the distance between two sequences $x = (x_0, \ldots, x_n)$ and $y = (y_0, \ldots, y_n)$. The simplest is to treat $x$ and $y$ as ordinary vectors in $\mathbb{R}^{n+1}$ and to use the Euclidean distance. Although appropriate for short sequences, this measure does not cope well with the small delays or phase changes that are common in longer sequences. The Dynamic Time Warping (DTW) distance, on the other hand, is robust to small local variations in the speed of the sequences (Berndt and Clifford, 1994), allowing the comparison of time series that are similar but locally out of phase.

Footnote (window of 16 samples, Section 3): ∼0.5 s given that signals are sampled at 30 Hz.
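To make the contrast between the two measures concrete, the sketch below implements both with the textbook dynamic-programming recurrence for DTW. It is an illustration under assumptions: the local cost (absolute difference), the absence of a warping-window constraint, and the example sequences are choices made here, not details taken from the paper.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y, cost=lambda a, b: abs(a - b)):
    """Unconstrained DTW distance computed by dynamic programming.

    d[i][j] holds the cheapest cumulative cost of aligning the first i
    elements of x with the first j elements of y.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
            d[i][j] = cost(x[i - 1], y[j - 1]) + step
    return d[n][m]

# Two sequences with the same shape, shifted by one sample.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]
shifted_euclidean = euclidean(a, b)
shifted_dtw = dtw(a, b)
```

On this pair the Euclidean distance penalises the one-sample shift, whereas DTW aligns the two bumps and reports no difference.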

4.2 Quantization of the Time-series

We employed vector quantization to transform the multi-dimensional signal of feature values (i.e., 3 axes, 2 features per axis) into a sequence of “symbols”. This process makes the system less sensitive to noise and outliers. We tested different codebook sizes, from 50 to 200. The size of the codebook affects the precision of the comparisons made by the measures of distance. We used the k-means algorithm (MacQueen, 1967) to build the codebook.
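As a rough sketch of this quantization step, the code below builds a codebook with a plain Lloyd-style k-means and maps each feature vector to the index of its nearest codeword. The initialisation from the first k vectors, the fixed iteration count, and the two-dimensional toy vectors are simplifying assumptions for illustration; the paper's feature vectors have six dimensions and its codebooks are far larger.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(codebook, vector):
    """Index of the codeword closest to the given vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(codebook[i], vector))

def kmeans(vectors, k, iterations=20):
    """Lloyd-style k-means; naive initialisation from the first k vectors."""
    codebook = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(codebook, v)].append(v)
        for i, members in enumerate(groups):
            if members:  # keep the old codeword if a group is empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def quantize(vectors, codebook):
    """Replace each feature vector by the symbol (index) of its codeword."""
    return [nearest(codebook, v) for v in vectors]

# Toy 2-D feature vectors forming two well-separated groups.
vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2)]
codebook = kmeans(vectors, k=2)
symbols = quantize(vectors, codebook)
```

The resulting list of symbols is the kind of discrete sequence that the clustering of Section 4.3 operates on.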

4.3 Unsupervised Sequence Clustering

We used unsupervised clustering to discover common sequences which are good candidates for gesture templates. Unsupervised clustering groups similar sequences and links each group of sequences to a prototype sequence. This natural grouping of similar sequences may or may not correspond to the gestures performed by the person wearing the accelerometers.

We used the k-medoids algorithm (Kaufman and Rousseeuw, 1987) to create the groups of sequences. We searched for groups of sequences of different lengths, from 12 to 24 symbols, in order to obtain template candidates of different lengths.
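A compact sketch of k-medoids with a pluggable distance is given below. It is a simplified alternating variant written for illustration, not the authors' code: the initialisation from the first k items, the Hamming distance used in the example, and the toy symbol sequences are all assumptions. Any sequence distance, such as DTW, could be passed in its place.

```python
def k_medoids(items, k, distance, iterations=20):
    """Alternate assignment and medoid update until the medoids stop changing.

    Returns the prototype items (medoids). Naive initialisation: the first
    k items serve as the initial medoids.
    """
    n = len(items)
    dist = [[distance(a, b) for b in items] for a in items]
    medoids = list(range(k))
    for _ in range(iterations):
        # Assignment step: attach every item to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)
                continue
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [items[m] for m in medoids]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for s, t in zip(a, b) if s != t)

# Toy symbol sequences forming two obvious groups.
sequences = [(1, 1, 2, 3), (7, 8, 8, 9), (1, 2, 2, 3), (7, 7, 8, 9), (1, 1, 2, 2)]
prototypes = k_medoids(sequences, k=2, distance=hamming)
```

Because each prototype is an actual observed sequence rather than an average, it can be used directly as a template candidate.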

4.4 Supervised Template Discovery

The prototype sequences found in an unsupervised manner are candidates for gesture templates. Each candidate was therefore evaluated as a template for each gesture. We used the f(β) score (Rijsbergen, 1979), which estimates how accurately a candidate classifies a particular gesture as such.
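For reference, the f(β) score can be computed from true positives, false positives and false negatives as in the sketch below, which also shows a threshold-based selection of candidates. The function names, the counts, and the scores in the example are made up for illustration and are not results from the paper.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F(beta) score; beta < 1 weights precision more heavily than recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def select_templates(scores, th_templates):
    """Keep, for each gesture, the prototypes scoring above the threshold.

    `scores` maps (prototype_id, gesture) to an f(0.5) value.
    """
    selected = {}
    for (prototype, gesture), score in scores.items():
        if score > th_templates:
            selected.setdefault(gesture, []).append(prototype)
    return selected

# Hypothetical counts: precision 0.75, recall 0.5.
f_half = f_beta(tp=6, fp=2, fn=6, beta=0.5)
f_one = f_beta(tp=6, fp=2, fn=6, beta=1.0)

# Hypothetical scores for a handful of (prototype, gesture) pairs.
example_scores = {
    (12, "drink from cup"): 0.70,
    (13, "drink from cup"): 0.68,
    (6, "toggle switch"): 0.60,
    (3, "open fridge"): 0.20,
}
chosen = select_templates(example_scores, th_templates=0.55)
```

With precision higher than recall, f(0.5) is larger than f(1), which reflects the extra weight that β = 0.5 places on precision.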

As an example, Figure 1 shows the f(0.5) score obtained by each one of a set of 20 prototypes when detecting the gestures in the S1 Drill dataset. We use β = 0.5 to give more emphasis to precision than to recall: since several templates per gesture are allowed, we are interested in finding templates that match the gestures correctly with as few false positives as possible, paying less attention to false negatives. As can be seen in Figure 1, prototypes 12 and 13 have high scores (close to 0.7) for the gesture “drink from cup”, and prototype 6 is a good template for the “toggle switch” gesture. Moreover, prototype 2 is a good template for two activities (“close door 2” and “close door 1”), which suggests that these activities may be too similar and should be merged.

The automatic selection of gesture templates is achieved by setting a threshold value $th_{templates}$ that is compared with the f(0.5) score of the candidate prototypes. Prototypes with an f(0.5) score higher than $th_{templates}$ are selected as templates for a given gesture. More than one template per gesture is allowed.

Footnote (k-medoids, Section 4.3): This algorithm is very similar to the k-means algorithm, with the difference that prototypes are not computed by averaging the members of the group, but by choosing the observation (sequence) that is closest to all the other observations in the group.

[Figure 1: colour matrix of f(0.5) scores. Horizontal axis: prototypes 1 to 20. Vertical axis: Null, Close Dishwasher, Close Drawer 3, Close Drawer 2, Close Door 1, Close Door 2, Close Drawer 1, Close Fridge, Toggle Switch, Open Dishwasher, Open Drawer 3, Open Drawer 2, Open Door 1, Open Door 2, Open Drawer 1, Open Fridge, Drink from Cup, Clean Table. Colour scale ticks: 0.1 to 0.9.]

Figure 1: Supervised discovery of templates. The colours in the matrix represent the f(0.5) score of each one of the prototypes when used for matching the annotated gestures. The vertical axis corresponds to the gestures, and the horizontal axis corresponds to the prototype sequences.

4.5 Gesture Spotting

Gestures in an incoming acceleration signal can be spotted by comparing the incoming sequences with the pre-defined templates. If an incoming sequence is found to be sufficiently similar to one of the templates of a given gesture, then this sequence can be labelled as an occurrence of that gesture. We computed the distance threshold as

$$\mathrm{threshold} = \mathrm{mean}(\mathrm{distance}) - th \cdot \mathrm{std}(\mathrm{distance})$$

where $th$ is a parameter that sets how many standard deviations below the mean distance, computed over the whole dataset, the threshold is located.
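The thresholding rule above can be sketched in a few lines. This is an illustration under assumptions: the population standard deviation is used (the paper does not say which estimator was applied), a sequence is flagged when its distance is strictly below the threshold, and the distance values are invented for the example.

```python
import statistics

def spotting_threshold(distances, th):
    """threshold = mean(distance) - th * std(distance).

    Assumption: population standard deviation (statistics.pstdev).
    """
    return statistics.mean(distances) - th * statistics.pstdev(distances)

def spot(distances, th):
    """Indices of the sequences whose distance to the template falls below
    the threshold, i.e. the positions labelled as occurrences of the gesture."""
    limit = spotting_threshold(distances, th)
    return [i for i, d in enumerate(distances) if d < limit]

# Hypothetical distances between one template and ten incoming sequences:
# only the last sequence is close to the template.
distances = [10, 10, 10, 10, 10, 10, 10, 10, 10, 1]
limit = spotting_threshold(distances, th=2.0)
hits = spot(distances, th=2.0)
```

A larger value of the parameter pushes the threshold further below the mean, so fewer sequences are accepted and false positives become rarer at the cost of more missed gestures.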

5 RESULTS

In this section we present some of the results of the

semi-supervised discovery of templates.

5.1 Parameter Exploration

The whole setup involves several parameters, and the performance of the spotting depends on them. Table 1 lists the parameters that were modified during the experiments and the values we tested.

Footnote (Figure 1 example, Section 4.4): In this example there are some gestures without a valid template. That is because, for the sake of exemplification, we intentionally kept a small number of groups.

Table 1: A list of the parameters that were explored during the tests, their description and their values.

Parameter          Description                     Value
$k$                k-means alphabet size           [100, 250]
$k_{word}$         k-medoids vocabulary size       [50, 100]
$th_{templates}$   Template selection threshold    [0.3, 0.6]
$th$               Gesture spotting threshold      [1.5, 2.5]

Figure 2 shows the average f(1) score over all the gestures in the S1 Drill dataset. We tested different values of $k$, $k_{word}$, $th_{templates}$ and $th$. In the case of the Euclidean distance, varying $th$ (vertical axis of each coloured grid) changes the f(1) score more than varying $th_{templates}$ (horizontal axis of each coloured grid). Moreover, having more candidate templates produces higher average values of f(1). When using the DTW distance, on the other hand, both $th$ and $th_{templates}$ change the f(1) score and, as in the case of the Euclidean distance, having more candidate templates produces higher average values of f(1). As expected, the DTW distance produces higher f(1) scores than the Euclidean distance.

[Figure 2: two colour matrices, (a) Euclidean and (b) DTW, each titled “f1 ab 150 d 100”. Horizontal axis (“th template”): 0.3 to 0.6. Vertical axis (“th classification”): 1.5 to 2.5. Colour scale ticks: 0.1 to 0.6.]

Figure 2: Some results of parameter exploration with $k = 150$ and $k_{word} = 100$. Each matrix of colours is built by using the corresponding distance for comparing sequences, and by changing two parameters: $th_{templates}$ on the horizontal axis, and $th$ on the vertical axis. The scale of colours goes from 0 to 0.7.

5.2 Intra Drill Cross-validation

We used the results shown in Section 5.1 to select appropriate parameters for the selection of gesture templates, and performed cross-validation tests to assess the performance of the approach for semi-supervised gesture spotting. We split the S1 Drill dataset into 20 parts, one per run, and performed 20 tests (i.e., leave-one-run-out cross-validation). Each time, we discovered the templates with 19 parts and kept the remaining part of the dataset for validating the approach by computing the f(1) score. Table 2 summarizes the results of the cross-validation test.

Table 2: Average f(1) score for the cross-validation test. T stands for training and V stands for validation. The parameters used are shown in the table.

Distance    $k$   $k_{word}$   $th_{templates}$   $th$   T (µ, σ)       V (µ, σ)
Euclidean   200   75           0.55               2.2    (0.61, 0.09)   (0.43, 0.34)
DTW         200   75           0.55               1.9    (0.68, 0.08)   (0.59, 0.32)

As can be seen from Table 2, using the Euclidean distance for comparing sequences gives poor results on the validation data (an average f(1) score of 0.43, against 0.59 for DTW). As expected, using the Dynamic Time Warping distance yields better results, since DTW is more appropriate for comparing sequences given that it allows a non-uniform alignment between the two sequences being compared.
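The leave-one-run-out protocol can be sketched as below. The helper names, the `fit` and `score` callables, and the use of the population standard deviation for σ are assumptions made for the sake of the example; in the paper's setting `fit` would stand for template discovery on 19 runs and `score` for the f(1) score of the resulting gesture spotting.

```python
import statistics

def leave_one_run_out(runs):
    """Yield (training_runs, held_out_run) pairs, one per run."""
    for i, held_out in enumerate(runs):
        training = [run for j, run in enumerate(runs) if j != i]
        yield training, held_out

def cross_validate(runs, fit, score):
    """Return ((mean, std) on training, (mean, std) on validation).

    `fit(training_runs)` returns a model; `score(model, runs)` returns a
    number. Both are placeholders supplied by the caller.
    """
    train_scores, val_scores = [], []
    for training, held_out in leave_one_run_out(runs):
        model = fit(training)
        train_scores.append(score(model, training))
        val_scores.append(score(model, [held_out]))

    def summarise(values):
        return statistics.mean(values), statistics.pstdev(values)

    return summarise(train_scores), summarise(val_scores)

# 20 placeholder runs identified by their index.
runs = list(range(20))
folds = list(leave_one_run_out(runs))

# Dummy callables, used only to exercise the loop.
train_summary, val_summary = cross_validate(
    runs, fit=lambda training: None, score=lambda model, data: 1.0)
```

Holding out a complete run, rather than random windows, keeps overlapping windows of the same gesture repetition from appearing on both sides of the split.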

6 CONCLUSIONS

This paper presents a method for discovering gesture templates in a semi-supervised manner. The experiments showed that using Dynamic Time Warping as the distance measure for the k-medoids algorithm gave the best results when spotting gestures. Our results for the cross-validation test are comparable to the ones obtained by Sagha et al. (2011). Both contributions use a window of 16 samples with 50% overlap and the average as a feature. The main difference between the two contributions is that Sagha et al. use the information in a single window for inferring the gestures, whereas we use the information in a sequence of feature values from multiple adjacent windows. Amongst the tested classifiers, they found that 1-NN has the best performance, with an average f(1) score of 0.53 on subject 1. It is important to note that they use all the upper body sensors, while we only use one sensor located on the right forearm (RLA). We obtained a validation performance of 0.59 on subject 1, which is close to the best result reported by Sagha et al. Moreover, our approach has the advantage of using fewer sensors and of not requiring the whole dataset in memory to classify new gestures.

Finally, by requiring ground truth only for the template selection stage (but not for finding the candidates), our approach for template discovery could be used in a system that builds its set of templates incrementally, for example a system that asks for user annotation when a previously unseen template is discovered. The method is therefore more flexible than a fully supervised method in which all the possible gestures would have to be defined beforehand.

ACKNOWLEDGEMENTS

The authors would like to thank Daniel Roggen and the team of the wearable computing laboratory at ETHZ for their valuable input. This work is part of the project SmartDays funded by the Hasler Foundation.

REFERENCES

Berndt, D. J. and Clifford, J. (1994). Using Dynamic Time Warping to Find Patterns in Time Series. In Proc. of KDD-94: AAAI Workshop on Knowledge Discovery in Databases, pages 359–370.

Frank, A. and Asuncion, A. (2010). UCI machine learning repository.

Kaufman, L. and Rousseeuw, P. J. (1987). Clustering by means of medoids. Statistical Data Analysis Based on the L1-Norm and Related Methods, pages 405–416.

Ko, M. H. et al. (2005). Online context recognition in multi-sensor systems using dynamic time warping. In Proc. of the International Conference on Intelligent Sensors, Sensors Networks and Information Processing, pages 283–288.

Lara, O. D. et al. (2012). Centinela: A human activity recognition system based on acceleration and vital sign data. Pervasive and Mobile Computing, 8(5):717–729.

MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. In Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297.

Mathie, M. et al. (2003). Detection of daily physical activities using a triaxial accelerometer. Medical and Biological Engineering and Computing, 41:296–301.

Rijsbergen, C. J. V. (1979). Information Retrieval. Butterworth-Heinemann, 2nd edition.

Roggen, D. et al. (2010). Collecting complex activity datasets in highly rich networked sensor environments. In 2010 Seventh International Conference on Networked Sensing Systems, pages 233–240.

Sagha, H. et al. (2011). Benchmarking classification techniques using the Opportunity human activity dataset. In IEEE International Conference on Systems, Man, and Cybernetics.

Stiefmeier, T. et al. (2008). Wearable activity tracking in car manufacturing. Pervasive Computing, IEEE, 7(2):42–50.

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

576