CNN-Trans: A Two-Branch CNN Transformer Model for Multivariate Time Series Classification

Sarra Hassine¹, Sourour Ammar¹,² and Ilef Ben Slima²,³

¹Digital Research Center of Sfax, B.P. 275, Sakiet Ezzit, 3021 Sfax, Tunisia
²SM@RTS: Laboratory of Signals, systeMs, aRtificial Intelligence and neTworkS, Sfax, Tunisia
³ISMAIK, University of Kairouan, Kairouan, 3100 Tunisia
benhassinesar1999@gmail.com, {sourour.ammar, ilef.benslima}@crns.rnrt.tn
Sourour Ammar: https://orcid.org/0000-0002-5851-9206
Ilef Ben Slima: https://orcid.org/0000-0003-0442-425X
Keywords: Multivariate Time Series Classification, Transformer, CNN, Deep Learning.
Abstract: The extensive presence of sensors in multiple domains has led to the generation of enormous amounts of multivariate time series data, presenting significant challenges for efficient classification. Although contemporary artificial intelligence methods show promising performance in addressing such data, they often struggle to capture both long-range dependencies and intricate local patterns within the sequences. This paper introduces CNN-Trans, an innovative deep learning model designed specifically for multivariate time series classification to address this challenge. CNN-Trans combines the strengths of transformers and convolutional neural networks (CNN). The proposed model uses a parallel strategy with both a transformer encoder and a CNN encoder working simultaneously on the time series data. The transformer captures global relationships through self-attention, while the CNN extracts localized spatial features tailored to each variable. We evaluate CNN-Trans on various benchmark datasets encompassing diverse sensor applications. The results show that our model is robust and highly effective for complex data. CNN-Trans achieves strong results, reaching 93.33% on NATOPS and 98.37% on PenDigits, and excels on high-dimensional datasets such as Kitchen (95.74%) and HAR (87.41%). Additionally, CNN-Trans exhibits robustness and generalizability across different input features, showcasing its practical utility in real-world scenarios.
1 INTRODUCTION
Time series data, capturing the dynamic evolution
of phenomena across time, pervades countless do-
mains. From human activity recognition to med-
ical diagnoses and financial forecasting, analyzing
these sequences empowers critical decisions in real-
world applications. Notably, multivariate time series
(MTS), where data are collected across multiple vari-
ables simultaneously, pose distinct challenges due to
their inherent complexity. Deep learning approaches
like convolutional neural networks (CNN) have es-
tablished themselves as valuable tools for tackling
multivariate time series classification (MTSC) (Is-
mail Fawaz et al., 2019). Their capability to extract
localized temporal and spatial correlations has led to
significant advancements in various tasks. However,
traditional CNNs encounter limitations in capturing
long-range dependencies, which often hold crucial information for accurate classification. Additionally,
their focus on local data interactions neglects the
global context within the entire sequence. Seeking to
address these limitations, recent research has explored
the potential of recurrent neural networks (RNNs)
like LSTMs (Karim et al., 2019). While capable
of modeling long-range dependencies, their compu-
tational complexity and limited capacity hinder their
widespread adoption (Vaswani, 2017). In contrast, at-
tention models offer intriguing capabilities to capture
long-range interactions efficiently. Their broader re-
ceptive fields allow for rich contextual information,
enhancing the overall learning capacity of models.
Not surprisingly, the success of attention models in
natural language processing (Vaswani, 2017; Devlin
et al., 2019) has spurred their adaptation to other do-
mains like computer vision and, increasingly, time
series analysis (Zerveas et al., 2021). At the heart
of this revolution lies the transformer, a deep learn-
ing architecture that leverages powerful self-attention
mechanisms (Vaswani, 2017). This mechanism ex-
cels at modeling relationships within the input time
series, uncovering intricate dependencies that influ-
ence classification outcomes. However, self-attention
inherently neglects the crucial ordering information
embedded within sequential data like time series. Ad-
dressing this poses a critical challenge, as the lack of
explicit positional encoding can hinder the model’s
ability to fully understand the data’s temporal dynam-
ics. This issue is especially amplified in time se-
ries data, where context is often weaker compared
to domains like text or image data. To overcome
the limitations of existing methods and capitalize on
the strengths of both transformers and CNNs, this pa-
per introduces CNN-Trans, a novel deep learning ar-
chitecture specifically designed for multivariate time
series classification. CNN-Trans operates in paral-
lel, employing a transformer encoder to capture long-
range dependencies and a CNN encoder to extract lo-
calized features from each variable. This synergis-
tic approach leads to a comprehensive representation
that combines global and local contexts, enabling su-
perior classification performance. Extensive evalua-
tions on public and private datasets demonstrate that
CNN-Trans consistently outperforms state-of-the-art
approaches, highlighting its robustness and general-
izability, particularly when handling complex, high-
dimensional data.
Our study offers the following key contributions:
- Novel architecture: We propose CNN-Trans, the first architecture to combine transformers and CNNs for parallel processing of multivariate time series data.
- Enhanced long-range dependency modeling: Through the transformer encoder, we effectively capture long-range relationships within the time series, overcoming a limitation of traditional CNNs.
- Robust feature extraction: The CNN encoder extracts fine-grained features from each individual variable, providing valuable localized information for accurate classification.
- Improved performance: We demonstrate that CNN-Trans achieves competitive performance on various benchmark datasets across diverse domains, and its performance excels as dataset size increases, especially with high-dimensional data.
The remainder of this paper is structured as fol-
lows: Section 2 provides a comprehensive review of
related research, highlighting previous approaches to
MTSC and situating our work within the existing lit-
erature. Section 3 begins by introducing the basic
concepts and properties of univariate and multivariate
time series data and provides a mathematical formu-
lation of the MTSC problem. Section 4 details our
proposed method, including the architectural design
of the model, the specific innovations introduced to
handle MTS data effectively, and the integration of
these components into a unified framework. Section 5
presents the experiments we conducted and discusses
the findings. Finally, Section 6 concludes the paper
while suggesting directions for future research.
2 RELATED WORK
2.1 CNN-Based Models
The success of CNNs in various domains, includ-
ing computer vision, speech recognition, and natu-
ral language processing, has led to their adoption
in time series classification (TSC) tasks. Since the
breakthrough of AlexNet in 2012 (Krizhevsky et al.,
2012), CNN architectures have undergone signifi-
cant improvements, such as the use of deeper net-
works, smaller and more efficient convolutional fil-
ters, and batch normalization to enhance training sta-
bility (Gu et al., 2018). These advancements have en-
abled CNNs to achieve state-of-the-art performance
in numerous applications (Gu et al., 2018; Foumani
and Nickabadi, 2019), and have paved the way for
their application to TSC problems. For instance, the
authors in (Wang et al., 2017) proposed a robust base-
line model for TSC based on a Fully Convolutional
Network (FCN). This approach operates end-to-end,
involving training CNNs from scratch while requir-
ing minimal preprocessing of raw data. Additionally,
this approach utilizes Class Activation Maps to high-
light significant regions in the data related to specific
labels. Another significant contribution in the field
of TSC is InceptionTime (Ismail Fawaz et al.,
2020). Inspired by Inception networks used in com-
puter vision, InceptionTime is an ensemble of deep
CNN models designed specifically for TSC. The au-
thors showed that InceptionTime not only matches but
often surpasses the accuracy of the HIVE-COTE algo-
rithm, which was previously considered the state-of-
the-art in TSC, while addressing its high training time
complexity.
Although CNNs are designed to capture local pat-
terns in data, they may struggle with long-range de-
pendencies, especially when the dependencies span
across many time steps. This can affect their perfor-
mance in TSC where the temporal relationships are
crucial.
2.2 Attention-Based Models
Attention-based models have shown promise in ad-
dressing some of the limitations of CNNs in TSC.
Attention mechanisms allow models to dynamically
weigh the importance of different time steps in the
input sequence, which enables the models to focus
on critical moments that contribute significantly to
the classification task. Models like the Multivariate Attention LSTM Fully Convolutional Network (MALSTM-FCN) (Karim et al., 2019) and TapNet (Zhang et al.,
2020) have been specifically designed to leverage
attention mechanisms in the context of multivariate
time series classification. MALSTM-FCN is an ex-
tension of the ALSTM-FCN (Karim et al., 2017)
model, proposed to deal with multivariate time series
data. It integrates attention within the LSTM frame-
work, allowing it to capture long-range dependencies
while simultaneously focusing on the most pertinent
time steps (Karim et al., 2019). This results in im-
proved accuracy and robustness in classifying com-
plex time series data. Similarly, TapNet (Zhang et al.,
2020) employs an attention mechanism to enhance
feature extraction from multivariate inputs, ensuring
that the model can adaptively learn which variables
and time steps are most influential for the classifica-
tion outcome. By focusing on the most relevant fea-
tures, the attention mechanism of TapNet helps im-
prove the model’s ability to distinguish between dif-
ferent classes, especially in scenarios where labeled
data is limited.
Attention-based models represent a significant ad-
vancement in the field of time series classification,
as they not only improve classification performance
but also enhance interpretability. As mentioned in
(Hsu et al., 2019), the attention weights offer valu-
able insights into the model’s decision-making pro-
cess, highlighting which time steps and features are
considered significant for specific classifications.
2.3 Transformers for MTS Classification
Transformers are a more recent development in
attention-based models (Zerveas et al., 2021; Devlin
et al., 2019; Liu et al., 2021; Zhang et al., 2023).
Unlike previous attention mechanisms that were of-
ten paired with RNNs (such as ALSTM-FCN), Trans-
formers rely entirely on self-attention mechanisms.
The self-attention mechanism is a variant of the at-
tention mechanism which allows the model to weigh
the significance of different parts of the input rel-
ative to a specific position (Vaswani, 2017). This
mechanism enables the model to process all ele-
ments in a sequence simultaneously, which is cru-
cial for understanding context and semantics (Zhang
et al., 2023). This architecture allows Transform-
ers to capture long-range dependencies and global
context more effectively than CNNs or traditional
attention-augmented models. In the context of MTS
classification, the authors in (Liu et al., 2021) pro-
posed a transformer-based approach named "Gated Transformer Networks (GTN)". This approach combines the strengths of Transformer networks with a gating mechanism that merges two towers of Transformer networks, capturing both channel-wise and step-wise correlations in multivariate time series data
(Liu et al., 2021). Other transformer-based models
have been proposed in the context of MTS classifi-
cation such as (Zerveas et al., 2021) which proposed
a transformer-based framework for unsupervised rep-
resentation learning of multivariate time series and
(Yang et al., 2024) which proposes a transformer-
based dynamic architecture with a hierarchical pool-
ing layer to decompose time series into subsequences
representing different frequency components to fa-
cilitate time series classification. Although these
transformer-based architectures are effective, they are
complex and require large amounts of data for train-
ing. In addition, they involve an unsupervised learn-
ing phase to achieve optimal performance.
3 PROBLEM FORMULATION
In time series analysis, understanding the structure
and characteristics of the data is crucial before
tackling the problem of classification. Below, we
first define univariate and multivariate time series,
followed by a detailed explanation of the multivariate
time series classification problem.
A Univariate Time Series: is a sequence of observations collected over time from a single variable or feature. Mathematically, it can be represented as a one-dimensional vector $x = (x_1, x_2, \ldots, x_T) \in \mathbb{R}^T$, where $T$ denotes the sequence length (i.e., the number of time steps). Each value $x_t$ corresponds to the observation at time step $t \in \{1, 2, \ldots, T\}$.
A multivariate time series (MTS), on the other hand, consists of multiple variables or features recorded simultaneously over time. Each sample in an MTS dataset is structured as a two-dimensional array with a shape of $(n_f, T)$, where $n_f$ denotes the number of features (variables), and $T$ indicates the sequence length (number of time steps). A data sample can be represented as $X = (x_1, x_2, \ldots, x_{n_f}) \in \mathbb{R}^{n_f \times T}$, where each feature vector $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,T})$ represents the sequence of observations for the $i$-th feature over $T$ time steps (see Figure 1 (left)).

Figure 1: Representation of a multivariate time series sample and a multivariate time series dataset.
In the context of multivariate time series data, the model's input consists of two-dimensional samples. Each sample, represented as $X = (x_1, x_2, \ldots, x_{n_f}) \in \mathbb{R}^{n_f \times T}$, is an ordered sequence with $T$ time steps and $n_f$ features. Each sample $X$ is associated with a class label $y \in \mathcal{Y}$, where $\mathcal{Y}$ is a predefined set of possible labels.
Given a collection of $N$ multivariate time series samples (see Figure 1 (right)), represented as $\mathbf{X} = (X_1, X_2, \ldots, X_N) \in \mathbb{R}^{N \times n_f \times T}$, along with their corresponding true labels $Y = (y_1, y_2, \ldots, y_N) \in \mathcal{Y}^N$, our goal is to classify each input sample $X_i$ into one of the classes in $\mathcal{Y}$. The task of multivariate time series classification (MTSC) is thus to predict the label $y$ for a given MTS data sample $X$. Figure 1 (right) illustrates an MTS dataset, consisting of $N$ samples along with their labels. In this study, we employ our proposed CNN-Trans model to learn the mapping between the input data $\mathbf{X}$ and the target labels $Y$.
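To make these shapes concrete, the following minimal PyTorch sketch builds a synthetic dataset with the dimensions of NATOPS from Table 1; the tensor values are random placeholders, not real data:

```python
import torch

# Shapes taken from the NATOPS dataset in Table 1: 180 training samples,
# 24 variables, 51 time steps, 6 classes. Values are random placeholders.
N, n_f, T, num_classes = 180, 24, 51, 6

X = torch.randn(N, n_f, T)               # dataset tensor of shape (N, n_f, T)
y = torch.randint(0, num_classes, (N,))  # one class label per sample

print(X.shape, y.shape)  # torch.Size([180, 24, 51]) torch.Size([180])
```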
4 CNN-TRANS FOR MULTIVARIATE TIME SERIES CLASSIFICATION
This paper presents a novel architectural design that
utilizes the strengths of both convolutional neural net-
works and transformers. Figure 2 illustrates the pro-
posed model architecture for MTS classification. The
design consists of a two-branch model in a parallel
configuration, with 1) a CNN-based branch dedicated
to extracting local features and 2) a Transformer-
based branch focused on capturing temporal depen-
dencies. The outputs of the two branches are then
concatenated and passed into a classification head.
The transformer encoder plays a vital role in the
proposed model for classifying multivariate time se-
ries data, as it captures correlations between variables
using an attention mechanism. This mechanism en-
ables the model to selectively focus on relevant parts
of the input sequence and assigns different weights to
different variables based on their importance. By in-
corporating the transformer, the model can effectively
interpret relationships between variables and capture
intricate dependencies within the data.
In our proposed model, the CNN is utilized to au-
tomatically learn and capture important local features
from the input data. This helps in identifying key
patterns that are essential for distinguishing between
different classes. Time series data often exhibit local
patterns, such as peaks, troughs, or recurring shapes,
which are important for classification. CNNs use con-
volutional filters to capture these local dependencies
and patterns across time steps, allowing the model to
focus on the most relevant parts of the data.
The combination of a CNN branch and a Trans-
former encoder branch in the proposed architecture
enables simultaneous analysis of temporal features
and correlation between variables, thereby effectively
overcoming a key limitation in existing classification
models.
Figure 2: Overview of the CNN-Trans Architecture. CNN-Trans consists of a two-branch model in a parallel configuration: a
CNN-based branch and a Transformer encoder-based branch.
4.1 Network Input
Our proposed model utilizes two distinct processing
branches to handle multivariate time series data: a
CNN-based branch, and a Transformer-based branch.
The CNN branch directly processes the data in its original format, where it expects a specific number of time steps $T$ with a certain number of distinct variables $n_f$ per step (we consider these variables as channels). On the other hand, the Transformer-based branch requires the data to be reshaped into a 3D tensor. This tensor organizes the data into dimensions $[B, T, n_f]$, where $B$ represents the batch size, $T$ the number of time steps, and $n_f$ the number of features per step. This restructuring is aligned with the strengths of the Transformer model, enabling it to effectively capture intricate relationships between features across different time steps within each sequence in the batch.
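In PyTorch terms, the two views differ only by a transpose of the last two dimensions; a minimal sketch with a synthetic batch:

```python
import torch

B, n_f, T = 64, 24, 51                        # batch size, variables, time steps

batch = torch.randn(B, n_f, T)                # channels-first layout, as stored

cnn_input = batch                             # CNN branch: (B, n_f, T), variables as channels
transformer_input = batch.permute(0, 2, 1)    # Transformer branch: (B, T, n_f)

print(cnn_input.shape, transformer_input.shape)
```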
4.2 CNN-Based Branch
The CNN component of the proposed architecture is
inspired by the work (Karim et al., 2019). It consists
of a sequence of three stacked temporal convolutional
blocks. These blocks progressively extract character-
istics from the input data using filters with varying
sizes (128, 256, 128). Each block employs a tempo-
ral convolutional layer to capture temporal dependen-
cies, followed by batch normalization for enhanced
training stability and a ReLU activation function to
introduce non-linearity. To improve feature represen-
tation, the first two convolutional blocks incorporate
Squeeze-and-Excitation (SE) blocks (Hu et al., 2019).
These SE blocks capture dependencies between chan-
nels within the feature maps and dynamically recal-
ibrate the importance of each channel, which may
highlight more relevant features. Finally, a global av-
erage pooling layer is applied after the final convo-
lutional block to decrease the number of parameters
before passing the data to the subsequent layers for
classification.
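A minimal PyTorch sketch of this branch is given below. The filter counts (128, 256, 128), batch normalization, ReLU activations, SE placement, and global average pooling follow the description above; the kernel sizes (8, 5, 3) and the SE reduction ratio of 16 are assumptions borrowed from (Karim et al., 2019) and (Hu et al., 2019), since the paper does not state them, and the class names are hypothetical:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation over the channel dimension of a (B, C, T) map."""
    def __init__(self, channels, reduction=16):  # reduction ratio: assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        w = x.mean(dim=-1)            # squeeze: global average over time, (B, C)
        w = self.fc(w).unsqueeze(-1)  # excite: per-channel weights, (B, C, 1)
        return x * w                  # recalibrate channel importance

class CNNBranch(nn.Module):
    def __init__(self, n_f):
        super().__init__()
        def block(c_in, c_out, k, use_se):
            layers = [nn.Conv1d(c_in, c_out, k, padding="same"),
                      nn.BatchNorm1d(c_out), nn.ReLU()]
            if use_se:
                layers.append(SEBlock(c_out))
            return nn.Sequential(*layers)
        # filter counts (128, 256, 128) from the paper; kernel sizes 8/5/3 assumed
        self.blocks = nn.Sequential(
            block(n_f, 128, 8, use_se=True),
            block(128, 256, 5, use_se=True),
            block(256, 128, 3, use_se=False),
        )

    def forward(self, x):                    # x: (B, n_f, T)
        return self.blocks(x).mean(dim=-1)   # global average pooling -> (B, 128)
```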
4.3 Transformer-Based Branch
Since multivariate time series classification focuses on identifying patterns within sequences of numerical data, we utilize a Transformer encoder, which excels at capturing long-range dependencies and is therefore well-suited for this task. Our
Transformer-based branch begins with a linear in-
put layer that transforms the raw feature space into
a higher-dimensional representation. To account for
sequence order, positional encoding is added to the in-
put. The core of the encoder consists of a stack of two
layers, each containing a multi-head attention mech-
anism with four heads and a subsequent feedforward
neural network. The multi-head attention layer cap-
tures dependencies across different positions in the
sequence, while the feedforward network enhances
the model’s ability to discern intricate patterns within
the data.
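A corresponding sketch of this branch follows. The linear input projection, sinusoidal positional encoding, two encoder layers, and four attention heads match the description above; the model dimension (128), the feedforward width (256), mean pooling over time before the classification head, and the class name are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

class TransformerBranch(nn.Module):
    def __init__(self, n_f, d_model=128, n_heads=4, n_layers=2, d_ff=256, dropout=0.1):
        super().__init__()
        self.proj = nn.Linear(n_f, d_model)   # lift raw features to d_model
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.d_model = d_model

    def positional_encoding(self, T, device):
        # standard sinusoidal positional encoding (Vaswani, 2017)
        pos = torch.arange(T, device=device).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.d_model, 2, device=device)
                        * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(T, self.d_model, device=device)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, x):                 # x: (B, T, n_f)
        h = self.proj(x) + self.positional_encoding(x.size(1), x.device)
        h = self.encoder(h)               # (B, T, d_model)
        return h.mean(dim=1)              # pool over time -> (B, d_model); an assumption
```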
4.4 Classification Head
The classification head takes as input the concate-
nated output from the two branches. It consists of
a Dense layer, of size equal to the number of classes,
with softmax as the activation function to produce a
probability distribution across the classes. We use the
categorical cross-entropy as the loss function, and we
train the network by minimizing this loss between the
predicted class probabilities ˆy and the true labels y, as
expressed in the following:
$$\mathcal{L} = - \sum_{i=1}^{n} y_i \log(\hat{y}_i)$$

where $n$ is the number of classes.
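Putting the pieces together, the following sketch assembles the full model and its loss, reusing the hypothetical CNNBranch and TransformerBranch classes from the previous sketches; note that in PyTorch the softmax is folded into CrossEntropyLoss, so the head emits logits:

```python
import torch
import torch.nn as nn

class CNNTrans(nn.Module):
    """Two-branch sketch using CNNBranch and TransformerBranch defined above."""
    def __init__(self, n_f, num_classes, d_model=128):
        super().__init__()
        self.cnn = CNNBranch(n_f)
        self.trans = TransformerBranch(n_f, d_model=d_model)
        self.head = nn.Linear(128 + d_model, num_classes)  # Dense layer sized to the classes

    def forward(self, x):                  # x: (B, n_f, T)
        z = torch.cat([self.cnn(x), self.trans(x.permute(0, 2, 1))], dim=1)
        return self.head(z)                # logits; softmax is folded into the loss

model = CNNTrans(n_f=24, num_classes=6)
criterion = nn.CrossEntropyLoss()          # categorical cross-entropy
logits = model(torch.randn(64, 24, 51))
loss = criterion(logits, torch.randint(0, 6, (64,)))
```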
5 EXPERIMENTS
In this section, we describe our experiments designed
to evaluate the performance of CNN-Trans in terms
of its accuracy in multivariate time series classifica-
tion. We first describe the datasets used in our ex-
periments. We then summarize the implementation
details. Finally, we present the obtained results and
provide a comparison of our proposed method with
state-of-the-art ones.
5.1 Datasets
We evaluate the proposed method using 7 datasets
from the latest multivariate time series classification
archive (Bagnall et al., 2018) (the first 7 rows in Ta-
ble 1). This archive features real-world multivariate
time series data from diverse applications, including
Human Activity Recognition, Motion Classification,
and ECG/EEG Signal Classification. The datasets
vary in dimensionality from 2 dimensions in trajec-
tory classification to 28 dimensions in EEG classifi-
cation. The time series lengths range from 8 to 144,
and the dataset sizes span from 80 to 10,992 samples.
Additionally, we tested our method on the Human
Activity Recognition dataset (HAR dataset) from the
UCI repository (Lichman, 2013). This dataset con-
sists of recordings from 30 subjects performing var-
ious activities of daily living while carrying a waist-
mounted smartphone equipped with inertial sensors.
The subjects, aged between 19 and 48, performed
six activities (Walking, Walking-Upstairs, Walking-
Downstairs, Sitting, Standing, and Laying) while
wearing a Samsung Galaxy S II on their waist.
Finally, since Transformer-based architectures re-
quire large amounts of data for effective training, we
consider two private datasets (the last 2 rows in Ta-
ble 1) to ensure sufficient data availability, thereby
demonstrating the potential of the proposed CNN-
Trans model for the MTSC task compared to other
methods when there is sufficient training data. These
two datasets are collected in two different contexts
(one in a Kitchen and the other in a Meeting room),
and for two different tasks: Action recognition and
Activity recognition, respectively. They are collected
from 8 ambient sensors, leading to 93 and 85 fea-
tures, respectively for the Kitchen and the Meet-
ingRoom datasets. For the Kitchen dataset, there
are 3 classes, while there are 4 classes for the Meet-
ingRoom dataset.
All datasets are split into training and testing sets
as described in Table 1. For the purpose of ensuring
comparability with previous studies, we retained the
original training and testing splits provided in each
MTS dataset. Each dataset is normalized to have zero
mean and unit standard deviation, and the time series
are padded with zeros to ensure that each time series
has the same length as the longest series in the train-
ing set.
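A minimal sketch of this preprocessing follows, assuming global z-normalization with training-set statistics (the paper does not say whether normalization is global or per channel) and the hypothetical helper name preprocess:

```python
import numpy as np

def preprocess(train, test):
    """train/test: lists of arrays of shape (n_f, T_i) with possibly varying lengths.
    Z-normalize with training statistics, then zero-pad every series to the
    longest length seen in the training set."""
    mean = np.mean(np.concatenate([x.ravel() for x in train]))
    std = np.std(np.concatenate([x.ravel() for x in train]))
    T_max = max(x.shape[1] for x in train)

    def transform(series):
        x = (series - mean) / (std + 1e-8)
        pad = T_max - x.shape[1]
        # pad with zeros to T_max (and truncate any longer test series)
        return np.pad(x, ((0, 0), (0, max(pad, 0))))[:, :T_max]

    return [transform(x) for x in train], [transform(x) for x in test]
```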
5.2 Implementation Details
All our experiments were implemented using Pytorch
and run on a single Tesla T4 GPU (16GB). The CNN-
Trans model is trained using the Adam optimizer with
a learning rate of 0.00001 and a dropout rate of 0.1.
We use the categorical cross-entropy loss function and
implement a learning rate schedule that adjusts for
plateaus, as recommended by (Ismail Fawaz et al.,
2019). Training uses a batch size of 64. We
assess the model’s performance on both the training
and validation sets at regular intervals, recording the
best test results along with their corresponding hyper-
parameters. To ensure a fair comparison, we report
the test accuracy of the model that achieves the low-
est training loss, following the approach outlined by
(Ismail Fawaz et al., 2019).
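The training setup can be sketched as follows, reusing the model and criterion from the earlier sketches; the scheduler's factor and patience, the epoch budget, and the synthetic stand-in data are assumptions, as the paper does not specify them:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data with NATOPS-like shapes; `model` and `criterion`
# refer to the CNNTrans sketch above.
dataset = TensorDataset(torch.randn(512, 24, 51), torch.randint(0, 6, (512,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)  # batch size 64, per the paper

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # Adam, lr = 1e-5, per the paper
# Plateau-based schedule; factor and patience are illustrative assumptions.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=50)

for epoch in range(1000):                 # epoch budget: an assumption
    epoch_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    scheduler.step(epoch_loss)            # reduce lr when the training loss plateaus
```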
We evaluate the proposed method using accuracy
(ACC) on the test set as the metric. ACC is calculated
by the following formula:
$$\text{ACC} = \frac{\text{Correct predictions}}{\text{All predictions}}$$
We also provide the confusion matrices of the pro-
posed CNN-Trans method for each dataset to offer
a detailed view of its performance across different
classes and highlight areas where misclassifications
occur.
Table 1: Properties of all datasets considered in this work. HAR: Human Activity Recognition. EEG: ECG/EEG Signal Classification. AR: Activity Recognition.

Datasets | Classes | Variables | Length | Train | Test | Type
BasicMotions | 4 | 6 | 100 | 40 | 40 | HAR
RacketSports | 4 | 6 | 30 | 151 | 152 | HAR
NATOPS | 6 | 24 | 51 | 180 | 180 | HAR
Libras | 15 | 2 | 45 | 180 | 180 | HAR
FingerMovements | 2 | 28 | 50 | 316 | 100 | EEG
ArticularyWordRecognition | 25 | 9 | 144 | 275 | 300 | Motion
PenDigits | 10 | 2 | 8 | 7494 | 3498 | Motion
HAR dataset | 6 | 9 | 128 | 7352 | 2947 | AR
Kitchen | 3 | 93 | 50 | 2988 | 470 | Action
MeetingRoom | 4 | 85 | 50 | 8217 | 921 | AR
5.3 Results
We evaluate the performance of the proposed model
CNN-Trans by comparing its classification accuracy
with state-of-the-art methods: FCN (Wang et al.,
2017), MLSTM-FCN (Karim et al., 2019), TapNet
(Zhang et al., 2020), and ResNet (Wang et al., 2017).
We provide in Table 2 a performance comparison
in terms of classification accuracy of the CNN-Trans
model compared with state-of-the-art methods on 10
datasets. We highlight the highest accuracy score
in red and the second-highest score in blue for each
dataset.
The obtained results show that the proposed CNN-
Trans method, based on a two-branch architecture
combining a CNN and a Transformer, demonstrates
competitive performance across various MTS datasets
compared to other models. CNN-Trans consistently
shows strong performance across most datasets. For
instance, it achieves high accuracy on datasets such
as RacketSports, Libras, FingerMovements, Kitchen,
and MeetingRoom, outperforming the other methods.
On NATOPS, PenDigits and ArticularyWordRecogni-
tion datasets, CNN-Trans is very close to the highest
scores among the compared methods.
On the HAR dataset, although the MLSTM-FCN
model achieved the highest accuracy of 96.71%
compared to all the models, the CNN-Trans model
remains competitive (achieving an accuracy of
87.41%), demonstrating its effectiveness across var-
ious time series classification tasks.
5.3.1 Performance Analysis Based on Dataset
Size
Here we analyze the performance of the CNN-Trans
model in relation to the size of the datasets, includ-
ing the number of variables, the sequence length, and
the amount of training data. This analysis offers ad-
ditional insights into its strengths and potential limi-
tations.
CNN-Trans exhibits strong scalability and effec-
tiveness, particularly as dataset size increases from
medium to large, making it well-suited for complex,
high-dimensional time series data. For instance, on
medium-sized datasets like NATOPS (180 training
samples) and Kitchen (2988 training samples), CNN-
Trans achieves high accuracy, outperforming other
models such as ResNet, and demonstrating its ability
to learn from a sufficient but not excessive amount of
data. The confusion matrices illustrated in Figure 3(a)
show that failure cases are related to the most confus-
ing activities, such as classes 1 (All clear) and 2 (Not
clear) in the NATOPS dataset.
On large datasets like PenDigits (7494 training
samples) and MeetingRoom (8217 training samples),
CNN-Trans performs exceptionally well, with accura-
cies of 98.37% and 84.25%, respectively, showing its
capacity to handle extensive data and extract mean-
ingful patterns. Figure 4(b) shows that for the HAR
dataset, despite strong performance, there is notice-
able misclassification between adjacent classes (class
3: sitting and class 4: standing). This reflects difficul-
ties in distinguishing highly complex or overlapping
activity patterns.
However, on smaller datasets like BasicMotions
(40 training samples) and FingerMovements (316
training samples), while CNN-Trans remains compet-
itive, it occasionally falls short compared to simpler
models like FCN, which might be more efficient with
limited data. This suggests that while CNN-Trans is
highly effective in more complex scenarios, its per-
formance can vary depending on the dataset size, par-
ticularly when data is scarce (see Figure 5).
In summary, CNN-Trans is a flexible model that
performs well as the size of the dataset grows,
especially for handling intricate, high-dimensional
data. Its competitiveness on smaller datasets remains
strong, but it may not consistently outperform simpler
models that are more appropriate for limited data.
Table 2: Performance comparison of the proposed model with state-of-the-art methods. For each dataset, the red result denotes
the model with the best performance and the blue result indicates the model with the second best accuracy. The symbol ’-’
means that the accuracy of the corresponding model is not available for the dataset.
Datasets | CNN-Trans | FCN | MLSTM-FCN | TapNet | ResNet
BasicMotions | 90.00 | 100 | 95.00 | 100 | 100
RacketSports | 84.86 | 82.23 | - | - | 82.23
NATOPS | 93.33 | 87.78 | 88.9 | 93.9 | 89.44
Libras | 86.66 | 85.00 | - | - | 83.89
FingerMovements | 59.00 | 53.00 | - | - | 54.00
ArticularyWordRecognition | 98.33 | 98.00 | 97.30 | 98.70 | 98.00
PenDigits | 98.37 | 98.57 | 97.80 | 98.00 | 97.71
HAR | 87.41 | 84.45 | 96.71 | - | 87.11
Kitchen | 95.74 | 66.17 | 94.25 | 81.7 | 53.19
MeetingRoom | 84.25 | 63.95 | 70.14 | 82.23 | 68.08
Figure 3: Confusion matrices of the CNN-Trans method on NATOPS, Libras, FingerMovements, and ArticularyWordRecognition datasets.
Figure 4: Confusion matrices of the CNN-Trans method on PenDigits, HAR, MeetingRoom and Kitchen datasets.
5.3.2 Performance Analysis Based on Data Type
Here we analyze the behavior of the CNN-Trans
model in relation to the type of dataset. By group-
ing datasets based on their type, namely HAR,
EEG/ECG, motion-based tasks, and complex AR, we
highlight the strengths and challenges of CNN-Trans
model when dealing with diverse data characteristics.
Table 2 shows that CNN-Trans achieves consis-
tently high accuracy in HAR tasks, particularly on
NATOPS (93.33%) and BasicMotions (90%), indicat-
ing its effectiveness in capturing temporal and spa-
tial features for activity recognition. For more chal-
lenging datasets like RacketSports, the performance
is slightly lower (84.86%) but still the highest accuracy across all tested methods.
On the FingerMovements dataset, which represents
EEG signals, the accuracy is relatively low across
all models, with CNN-Trans achieving the highest
(59%). This underscores the complexity of EEG
data, which requires sophisticated feature extrac-
tion and may benefit from preprocessing or domain-
specific adaptations. The confusion matrix, illustrated
in Figure 3, highlights the confusion between the
two classes, emphasizing the challenge of extracting
meaningful features from the EEG signal.
On the other hand, on motion data type, all models
perform well, with CNN-Trans achieving 98.33% and
98.37% on ArticularyWordRecognition and PenDig-
its datasets, respectively. This finding shows the
strength of CNN-Trans in tasks with structured, con-
tinuous motion data.
Figure 5: Confusion matrices of the CNN-Trans method on BasicMotions and RacketSports datasets.

On datasets that are collected from ambient sensors, Kitchen and MeetingRoom, CNN-Trans excels, achieving 95.74% (Kitchen) and 84.25% (MeetingRoom), significantly outperforming others. This
highlights its robustness and adaptability, particularly
in datasets with diverse and noisy features.
6 CONCLUSION AND FUTURE WORK
In this paper, we introduced CNN-Trans, a novel
deep learning model designed specifically for multi-
variate time series classification. By integrating the
strengths of convolutional neural networks and trans-
formers, CNN-Trans effectively addresses the chal-
lenges posed by the extensive and complex nature
of multivariate time series data. The model’s par-
allel architecture allows for simultaneous process-
ing of local features through CNNs and global rela-
tionships through transformers, leading to enhanced
classification accuracy. Our extensive evaluations on
both benchmark and private datasets demonstrate that
CNN-Trans consistently outperforms state-of-the-art
approaches, showcasing its robustness and generaliz-
ability across diverse sensor applications. CNN-Trans
excels particularly when dealing with complex, noisy,
and high-dimensional data, confirming its adaptabil-
ity and potential for practical applications.
While CNN-Trans offers improvements in inter-
pretability through attention weights, further research
could delve into enhancing the explainability of the
model’s decisions. In future work, we aim to develop
methods to visualize and interpret the contributions of
different features and time steps, thereby enhancing
trust in the model’s predictions.
To further enhance CNN-Trans in future work, we
aim to integrate a self-supervised learning phase to
leverage vast unannotated datasets. This could in-
volve pre-training with unlabeled data, extracting fea-
tures through self-supervised methods. Such integra-
tion aims to improve model performance and adapt-
ability by utilizing both labeled and unlabeled data
more effectively.
REFERENCES
Bagnall, A. J., Dau, H. A., Lines, J., Flynn, M., Large,
J., Bostrom, A., Southam, P., and Keogh, E. J.
(2018). The UEA multivariate time series classifica-
tion archive, 2018. CoRR, abs/1811.00075.
Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2019).
BERT: pre-training of deep bidirectional transformers
for language understanding. In Burstein, J., Doran,
C., and Solorio, T., editors, Proceedings of the 2019
Conference of the North American Chapter of the As-
sociation for Computational Linguistics: Human Lan-
guage Technologies, NAACL-HLT 2019, Minneapolis,
MN, USA, volume 1, pages 4171–4186.
Foumani, S. N. M. and Nickabadi, A. (2019). A probabilis-
tic topic model using deep visual word representation
for simultaneous image classification and annotation.
Journal of Visual Communication and Image Repre-
sentation, 59:195–203.
Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B.,
Liu, T., Wang, X., Wang, G., Cai, J., et al. (2018). Re-
cent advances in convolutional neural networks. Pat-
tern recognition, 77:354–377.
Hsu, E.-Y., Liu, C.-L., and Tseng, V. S. (2019). Multivari-
ate time series early classification with interpretabil-
ity using deep learning and attention mechanism. In
Pacific-Asia Conference on Knowledge Discovery and
Data Mining, pages 541–553. Springer.
Hu, J., Shen, L., Albanie, S., Sun, G., and Wu, E. (2019).
Squeeze-and-excitation networks.
Ismail Fawaz, H., Forestier, G., Weber, J., Idoumghar, L.,
and Muller, P.-A. (2019). Deep learning for time series
classification: a review. Data mining and knowledge
discovery, 33(4):917–963.
Ismail Fawaz, H., Lucas, B., Forestier, G., Pelletier, C.,
Schmidt, D. F., Weber, J., Webb, G. I., Idoumghar, L.,
Muller, P.-A., and Petitjean, F. (2020). Inceptiontime:
Finding alexnet for time series classification. Data
Mining and Knowledge Discovery, 34(6):1936–1962.
Karim, F., Majumdar, S., Darabi, H., and Chen, S. (2017).
LSTM fully convolutional networks for time series clas-
sification. IEEE access, 6:1662–1669.
Karim, F., Majumdar, S., Darabi, H., and Harford, S.
(2019). Multivariate LSTM-FCNs for time series classifi-
cation. Neural networks, 116:237–245.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
Imagenet classification with deep convolutional neu-
ral networks. In Pereira, F., Burges, C., Bottou, L.,
and Weinberger, K., editors, Advances in Neural In-
formation Processing Systems, volume 25. Curran As-
sociates, Inc.
Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml [Accessed: August 21, 2024].
Liu, M., Ren, S., Ma, S., Jiao, J., Chen, Y., Wang, Z.,
and Song, W. (2021). Gated transformer networks for
multivariate time series classification. arXiv preprint
arXiv:2103.14438.
Vaswani, A. (2017). Attention is all you need. Advances in
Neural Information Processing Systems.
Wang, Z., Yan, W., and Oates, T. (2017). Time series clas-
sification from scratch with deep neural networks: A
strong baseline. In 2017 International joint confer-
ence on neural networks (IJCNN), pages 1578–1585.
IEEE.
Yang, C., Wang, X., Yao, L., Long, G., and Xu, G. (2024).
Dyformer: A dynamic transformer-based architecture
for multivariate time series classification. Information
Sciences, 656:119881.
Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., and
Eickhoff, C. (2021). A transformer-based framework
for multivariate time series representation learning. In
Proceedings of the 27th ACM SIGKDD conference
on knowledge discovery & data mining, pages 2114–
2124.
Zhang, E. Y., Cheok, A. D., Pan, Z., Cai, J., and Yan, Y.
(2023). From turing to transformers: A comprehen-
sive review and tutorial on the evolution and applica-
tions of generative transformer models. Sci, 5(4):46.
Zhang, X., Gao, Y., Lin, J., and Lu, C.-T. (2020). Tapnet:
Multivariate time series classification with attentional
prototypical network. In Proceedings of the AAAI
conference on artificial intelligence, volume 34, pages
6845–6852.