videos using cnn-extracted features. The Visual Computer.
Alfaifi, R. and Artoli, A. (2020). Human action prediction with 3d-cnn. SN Computer Science.
Bacharidis, K. and Argyros, A. (2020). Improving deep learning approaches for human activity recognition based on natural language processing of action labels. In IJCNN. IEEE.
Bochkovskiy, A., Wang, C., and Liao, H. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv:2004.10934.
Cao, K., Ji, J., Cao, Z., Chang, C.-Y., and Niebles, J. C. (2020). Few-shot video classification via temporal alignment. In CVPR.
Chang, C.-Y., Huang, D.-A., Sui, Y., Fei-Fei, L., and Niebles, J. C. (2019). D3tw: Discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In CVPR.
Cuturi, M. and Blondel, M. (2017). Soft-dtw: a differentiable loss function for time-series. arXiv:1703.01541.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In IEEE CVPR.
Dvornik, N., Hadji, I., Derpanis, K. G., Garg, A., and Jepson, A. D. (2021). Drop-dtw: Aligning common signal between sequences while dropping outliers. arXiv preprint arXiv:2108.11996.
Ghoddoosian, R., Sayed, S., and Athitsos, V. (2021). Action duration prediction for segment-level alignment of weakly-labeled videos. In IEEE WACV.
Hadji, I., Derpanis, K. G., and Jepson, A. D. (2021). Representation learning via global temporal alignment and cycle-consistency. arXiv preprint arXiv:2105.05217.
Haresh, S., Kumar, S., Coskun, H., Syed, S. N., Konin, A., Zia, M. Z., and Tran, Q.-H. (2021). Learning by aligning videos in time. arXiv preprint arXiv:2103.17260.
Kim, D., Jang, M., Yoon, Y., and Kim, J. (2015). Classification of dance motions with depth cameras using subsequence dynamic time warping. In SPPR. IEEE.
Koppula, H., Gupta, R., and Saxena, A. (2013). Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research.
Manousaki, V., Papoutsakis, K., and Argyros, A. (2018). Evaluating method design options for action classification based on bags of visual words. In VISAPP.
Manousaki, V., Papoutsakis, K., and Argyros, A. (2021). Action prediction during human-object interaction based on dtw and early fusion of human and object representations. In ICVS. Springer.
Ofli, F., Chaudhry, R., Kurillo, G., Vidal, R., and Bajcsy, R. (2013). Berkeley mhad: A comprehensive multimodal human action database. In IEEE WACV.
Panagiotakis, C., Papoutsakis, K., and Argyros, A. (2018). A graph-based approach for detecting common actions in motion capture data and videos. Pattern Recognition.
Papoutsakis, K., Panagiotakis, C., and Argyros, A. (2017a). Temporal action co-segmentation in 3d motion capture data and videos. In CVPR.
Papoutsakis, K., Panagiotakis, C., and Argyros, A. A. (2017b). Temporal action co-segmentation in 3d motion capture data and videos. In CVPR.
Park, A. S. and Glass, J. R. (2007). Unsupervised pattern discovery in speech. IEEE Transactions on Audio, Speech, and Language Processing.
Reily, B., Han, F., Parker, L., and Zhang, H. (2018). Skeleton-based bio-inspired human activity prediction for real-time human–robot interaction. Autonomous Robots.
Rius, I., González, J., Varona, J., and Roca, F. (2009). Action-specific motion prior for efficient bayesian 3d human body tracking. Pattern Recognition.
Roditakis, K., Makris, A., and Argyros, A. (2021). Towards improved and interpretable action quality assessment with self-supervised alignment.
Tavenard, R., Faouzi, J., Vandewiele, G., Divo, F., Androz, G., Holtz, C., Payne, M., Yurchak, R., Rußwurm, M., Kolar, K., and Woods, E. (2020). Tslearn, a machine learning toolkit for time series data. Journal of Machine Learning Research.
Sakoe, H. and Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing.
Schez-Sobrino, S., Monekosso, D. N., Remagnino, P., Vallejo, D., and Glez-Morcillo, C. (2019). Automatic recognition of physical exercises performed by stroke survivors to improve remote rehabilitation. In MAPR.
Schuldt, C., Laptev, I., and Caputo, B. (2004). Recognizing human actions: a local svm approach. In ICPR. IEEE.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Tormene, P., Giorgino, T., Quaglini, S., and Stefanelli, M. (2009). Matching incomplete time series with dynamic time warping: an algorithm and an application to post-stroke rehabilitation. Artificial Intelligence in Medicine.
Wang, J., Liu, Z., Wu, Y., and Yuan, J. (2012). Mining actionlet ensemble for action recognition with depth cameras. In IEEE CVPR.
Xia, L. and Aggarwal, J. (2013). Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera. In IEEE CVPR.
Yang, C.-K. and Tondowidjojo, R. (2019). Kinect v2 based real-time motion comparison with re-targeting and color code feedback. In IEEE GCCE.
Segregational Soft Dynamic Time Warping and Its Application to Action Prediction