4.5 CBF: Similarity Measure Aggregation
Some SMs reliably predict the class, depending on the selected n-gram and s-split. 175 N-SM/S metrics are computed from the possible (SM, N, S) combinations. The proposed voting scheme aims to capture the strength and the complementarity of each SM and to combine them effectively. This can be done either by vector manipulation (mean or product) or by a voting scheme.
A step-forward selection process on the training set leads to a single measure, 2-Jaccard/3, with an accuracy of 100% on the training set but only 98.7% (F1-measure = 0.987) on the testing set.
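The step-forward selection above can be pictured as a greedy forward search that adds the metric giving the largest gain in combined training accuracy and stops as soon as no metric improves the score. A minimal sketch, with illustrative metric names and accuracy values (the scoring function is a stand-in, not the paper's exact procedure):

```python
# Hypothetical sketch of a "step-forward" (greedy forward) metric selection.
# Metric names and accuracies are toy values for illustration only.

def forward_selection(metrics, score):
    """Greedily add the metric that most improves the combined training
    score; stop when no remaining metric improves it."""
    selected, best = [], 0.0
    while True:
        candidates = [m for m in metrics if m not in selected]
        if not candidates:
            break
        top_score, top_metric = max((score(selected + [m]), m) for m in candidates)
        if top_score <= best:
            break  # no improvement: stop (here, often after a single step)
        selected.append(top_metric)
        best = top_score
    return selected, best

# Toy scoring: suppose "2-Jaccard/3" alone already reaches 100% on the train set.
acc = {"2-Jaccard/3": 1.0, "1-Jaccard/1": 0.9, "3-Cosine/2": 0.85}
chosen, train_acc = forward_selection(list(acc), lambda s: max(acc[m] for m in s))
print(chosen, train_acc)  # selection stops after the first step
```

With this toy scoring, a single dominant metric is kept, mirroring the behaviour observed on CBF where the selection stops at the first step.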
To enforce complementarity, we decided to keep all the metrics whose training accuracy lies within 10% relative difference of the best one, and to use them in a weighted voting scheme, with the weight of each metric m defined as follows: Accuracy(m)² / √Occurrence(m).
Accuracy(m) is the accuracy of the metric m on the training set (validation step). Squaring it widens the gap between strong and weak metrics.
Occurrence(m) is the number of times the SM of metric m occurs among the metrics kept from the training set. Dividing by its square root lowers the influence of over-represented SMs and allows more complementarity: each SM brings a different kind of information.
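The selection criterion and the weight formula above can be sketched as follows. The metric names, accuracies, and occurrence counts are made-up values for illustration; only the 10% relative-difference rule and the weight Accuracy(m)² / √Occurrence(m) come from the text:

```python
import math

def select_metrics(train_acc, rel_tol=0.10):
    """Keep metrics within 10% relative difference of the best training accuracy."""
    best = max(train_acc.values())
    return {m: a for m, a in train_acc.items() if (best - a) / best <= rel_tol}

def weight(acc, occurrence):
    """weight(m) = Accuracy(m)^2 / sqrt(Occurrence(m))."""
    return acc ** 2 / math.sqrt(occurrence)

# Toy metric accuracies on the training set (names follow the N-SM/S pattern).
train_acc = {"2-Jaccard/3": 1.00, "1-Jaccard/1": 0.96, "3-Cosine/2": 0.85}
kept = select_metrics(train_acc)        # drops 3-Cosine/2 (15% below the best)
occ = {"Jaccard": 2, "Cosine": 1}       # SM occurrences among the kept metrics
w = {m: weight(a, occ[m.split("-")[1].split("/")[0]]) for m, a in kept.items()}
print(kept, w)
```

Note how the occurrence term penalizes the SM that appears twice: both Jaccard-based metrics are divided by √2, so neither dominates the vote on its own.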
A hard vote over the 175 metrics (all weights = 1) achieves 93.3% of good recognition on the testing set based on the training profiles. With the weighted vote combining the 175 metrics, test accuracy increases to 98.8%. With the metrics selected by the 10% relative-difference criterion, 99.5% test accuracy is reached (F1-measure = 0.995), close to the state-of-the-art result of 99.8% (Bagnall et al., 2017). Moreover, the weighted similarity vote we propose obtains better accuracy than 7 of the 9 algorithms cited in (Bagnall et al., 2017).
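The difference between the hard and weighted votes compared above comes down to how per-metric predictions are aggregated. A minimal sketch, with toy predictions and weights (the class labels and values are illustrative, not taken from the experiments):

```python
from collections import defaultdict

def vote(predictions, weights=None):
    """predictions: {metric: predicted_class}. With weights=None every
    metric counts for 1 (hard vote); otherwise its weight is added."""
    scores = defaultdict(float)
    for metric, cls in predictions.items():
        scores[cls] += 1.0 if weights is None else weights[metric]
    return max(scores, key=scores.get)

# Toy example: two weak metrics outvote one strong metric in a hard vote,
# but the weighted vote follows the strong metric.
preds = {"m1": "cylinder", "m2": "bell", "m3": "bell"}
weights = {"m1": 0.90, "m2": 0.30, "m3": 0.25}
print(vote(preds))           # hard vote: "bell" (2 votes vs 1)
print(vote(preds, weights))  # weighted vote: "cylinder" (0.90 vs 0.55)
```

This illustrates why down-weighting redundant or weak metrics can flip the outcome relative to a plain majority vote.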
As noted previously, n-grams are highly sensitive to noise; by dismissing n-gram-based metrics from this voting scheme, the prediction reaches 99.8% accuracy and an F1-measure of 0.998, matching Bag of SFA Symbols and COTE for CBF in (Bagnall et al., 2017).
4.6 ORIENTOI’s Results
For the ORIENTOI dataset (ORIENTOI in the sequel), using only the best metric (1-Jaccard/1), an accuracy of 85.3% (F1-measure = 0.844) is obtained and, once again, the "step forward" selection stops at the first step. Our voting method reaches an accuracy of 97.2% (F1-measure = 0.971). The n-gram process seems useful for classifying this dataset: only 95.9% accuracy is obtained without n-grams (F1-measure = 0.961). This could be explained by the fact that, in an application, multiple choices exist and n-grams help the method take into account the importance of transitions between pages. The s-split is less important for ORIENTOI: due to the high redundancy of the core loop (the cycle of main interest actions), the s-split is less effective.
4.7 UCR Results
In (Bagnall et al., 2017), 9 algorithms are tested on
85 UCR datasets. We used the same benchmark to
validate our approach on 28 of them.
Table 6 details accuracy results for the 6 types of data in UCR and recalls the state-of-the-art best scores (p0 in Table 6). The relevance of the n-gram process (p1, p1a) depends on the dataset. Mean accuracy with the previous SM vote protocol is 66.3% without the n-gram and s-split processes. Adding n-grams raises this accuracy to 70.9%, and using only the s-split raises it to 78.1% (p2, p2a). Combining both processes gives a similar score: 78.2%. This shows the importance of the s-split, but does not mean that metrics with n-grams are unreliable. Furthermore, metrics without n-grams reach the state of the art for 3 of the 28 datasets used.
Our weighted vote scheme outperforms a hard vote scheme (mean accuracy of 60.8%) on these 28 datasets without n-gram and s-split. With the full set of metrics (N-SM/S), the weighted vote scheme reaches 78.2%, versus 77.7% for the hard vote scheme, confirming the usefulness of lowering the weight of over-represented SMs among the kept metrics.
Our method seems efficient on some kinds of data, such as motion capture, ECG, or simulated data, and less so on others, such as spectrographs, as can be seen in Table 6.
4.8 UCR Results: Adapted
Quantization Step
A fixed quantization (0.5 step) may not be relevant for every dataset, so an adapted number of quantization steps was explored for each UCR dataset among the values 5, 10, 15, 20 and 30 (p1a, p2a). Results with this adapted quantization are noted p1a and p2a in Table 6. A few datasets show lower results with the adapted step, while some show notably better results, such as SonyAIBORobotSurface1, Wine, BirdChicken and SyntheticControl. Note that only 5 adapted step counts were tested here; an optimal quantization could be learnt from the training set. Symbolic Aggregate approXimation on CBF has also been compared without better success, but could be investigated for the other datasets.
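The adapted quantization amounts to symbolizing each series with a dataset-specific number of levels instead of a fixed 0.5 step. A minimal sketch, assuming a uniform binning over the series' value range (the series and the printed step counts are illustrative; the paper's exact binning may differ):

```python
# Sketch of quantizing a series into n_steps symbols over its value range.
# The input series is a toy example, not taken from the UCR datasets.

def quantize(series, n_steps):
    """Map each value to one of n_steps uniform levels over [min, max]."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_steps or 1.0   # guard against a constant series
    return [min(int((x - lo) / width), n_steps - 1) for x in series]

series = [0.0, 0.2, 0.5, 0.9, 1.0]
for n in (5, 10):  # two of the tested step counts
    print(n, quantize(series, n))
```

A finer quantization (more steps) preserves more amplitude detail but produces a larger symbol alphabet, which is why the best step count varies per dataset and could itself be learnt on the training set.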