Sleep-Stage Efﬁcient Classiﬁcation Using a Lightweight Self-Supervised

Model

Eldiane Borges dos Santos Dur

aes and Jo

ao Batista Florindo

Institute of Mathematics, Statistics and Scientiﬁc Computing, University of Campinas, Street Sergio Buarque de Holanda,

651, Campinas, Brazil

ﬂ

Keywords:

Sleep-Stage Classiﬁcation, mulEEG, Self-Supervised Learning, Linear SVM, EEG Signal.

Abstract:

Background and Objective: Accurate classiﬁcation of sleep stages is crucial for diagnosing sleep disorders

and automating this process can signiﬁcantly enhance clinical assessments. This study aims to explore the

use of a self-supervised model (more speciﬁcally, an adapted version of mulEEG) combined with a Linear

SVM classiﬁer to improve sleep stage classiﬁcation. Methods: The mulEEG model, which learns electroen-

cephalogram signal representations in a self-supervised manner, was simpliﬁed here by replacing ResNet-50

with 1D-convolutions used as time series encoder by a ResNet-18 backbone. Two other adaptations were

conducted: the ﬁrst one evaluated different conﬁgurations of the model and data volume for training, while

the second tested the effectiveness of time series features, spectrogram features, and their concatenation as

inputs to a Linear SVM classiﬁer. Results: The results showed that reducing the volume of data offered a

better cost-beneﬁt ratio compared to simplifying the model. Using the concatenated features with ResNet-18

also outperformed the linear evaluations of the original mulEEG model, achieving higher classiﬁcation perfor-

mance. Conclusions: Simplifying the mulEEG model to extract features and pairing it with a robust classiﬁer

leads to more efﬁcient and accurate sleep stage classiﬁcation. This approach holds promise for improving

clinical sleep assessments and can be extended to other biological signal classiﬁcation tasks.

1 INTRODUCTION

In the human body, sleep is divided into cycles, each

consisting of two distinct phases: Rapid Eye Move-

ment (REM) and Non-Rapid Eye Movement (NREM)

sleep, which is further divided into stages N1, N2, and

N3 (Patel et al., 2024). Each phase is characterized

by variations in muscle tone, brain wave patterns, and

eye movements. Furthermore, the human body un-

dergoes each sleep cycle 4 to 6 times per night, with

each cycle lasting approximately 90 minutes. How-

ever, sleep quality and time spent at each stage can

be affected by sleep-related or mental disorders, such

as obstructive sleep apnea, depression, schizophre-

nia, and dementia, as well as traumatic brain injuries,

medications, and circadian rhythm disorders. As a re-

sult, the identiﬁcation of sleep stages plays a crucial

role in the diagnosis of these conditions, and automat-

ing this classiﬁcation can signiﬁcantly improve clini-

cal sleep assessments.

The mulEEG model described in (Kumar et al.,

2022) is an example of a modern successful ap-

https://orcid.org/0000-0002-0071-0227

proach in the deep learning category. It aims to auto-

mate the classiﬁcation of sleep stages using electroen-

cephalogram (EEG) signals collected during sleep.

To achieve this, it performs a pretext task, which in-

volves learning effective representations of these EEG

signals from multiple data views in a self-supervised

manner. Despite the great results achieved by the

mulEEG model in ideal scenarios, with large amounts

of data and computational resources for pretraining,

we notice that the literature still lacks self-supervised

architectures adapted for contexts where the availabil-

ity of data and computational power is limited.

In this context, and inspired by self-supervised

models for time series analysis like mulEEG, here

we propose a self-supervised deep learning frame-

work for the identiﬁcation of sleep stages. We take

mullEEG as our starting point, but leverage it with

strategies aiming at offering alternatives to simplify

both the model and its training process, while also

maximizing the utility of the learned representations

by using them as input to a Linear SVM algorithm for

the classiﬁcation task.

The ﬁrst strategy involves replacing ResNet-50

972

Durães, E. B. S. and Florindo, J. B.

Sleep-Stage Efﬁcient Classiﬁcation Using a Lightweight Self-Supervised Model.

DOI: 10.5220/0013367900003912

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2025) - Volume 2: VISAPP, pages

972-979

ISBN: 978-989-758-728-3; ISSN: 2184-4321

with 1D-convolutions in the time series encoder with

a ResNet-18, and alternating the usage of the pretext

dataset between 20% and 100%. The second one in-

volves several training of the Linear SVM using the

representations learned by the self-supervised module

with either ResNet-50 or ResNet-18, which are the

time series features, spectrogram features, and their

concatenation.

To assess the performance of the proposed model,

the main evaluation metrics were accuracy (Acc), Co-

hen’s kappa (κ), and macro-averaged F1 score (MF1).

The reference for comparison was the linear evalu-

ation of mulEEG presented in (Kumar et al., 2022),

where a linear classiﬁer was trained using only the

time series features as input. However, the training of

the Linear SVM with the concatenation of both fea-

tures learned by mulEEG using ResNet-18 was sufﬁ-

cient to surpass the results reported in (Kumar et al.,

2022).

2 RELATED WORKS

As stated in (Sekkal et al., 2022), the methods

used to automate the classiﬁcation of sleep stages

are based on two main strategies: (i) conven-

tional machine learning methods and (ii) deep learn-

ing approaches based on artiﬁcial neural networks.

The ﬁrst category includes algorithms such as K-

Nearest Neighbors (KNN), Support Vector Machines

(SVM), Random Forests (RF), Decision Trees, and

Bayesian rule-based classiﬁers. The second cate-

gory comprises Multi-Layer Perceptrons (MLP) and

their modern reﬁnements, including Recurrent Neural

Networks (RNNs), Long Short-Term Memory Net-

works (LSTMs), Gated Recurrent Units (GRUs), Bi-

directional LSTMs, and Convolutional Neural Net-

works (CNNs). Our study lies in the second category.

A recent review on deep learning techniques used for

sleep stage classiﬁcation can be found in (Liu et al.,

2024).

Deep learning can be further divided into differ-

ent learning paradigms, such as supervised, unsuper-

vised, and, more recently, self-supervised learning.

This last group is particularly interesting in scenar-

ios where the access to annotated data for training

is limited. Several works have investigated the use

of self-supervised learning on EEG signals, for ex-

ample, the study in (Xiao et al., 2024), where an

algorithm based on contrastive learning is used for

seizure detection. Speciﬁcally on sleep stage clas-

siﬁcation, we have (Eldele et al., 2023), which pro-

vides a systematic evaluation of SSL in few-label set-

tings. The authors in (Yuan et al., 2024) opt for an-

alyzing non-polysomnography data, particularly ac-

quired by wrist-worn accelerometers. In our study,

we focus speciﬁcally on mulEEG approach (Kumar

et al., 2022), considering the richness of the deep EEG

representation that the model provides by combining

SSL with a multi-view description. We differentiate

from the original model with respect to the focus on

computational efﬁciency and in solving a real-world

task, which is classiﬁcation, instead of feature repre-

sentation learning, which is the focus in (Kumar et al.,

2022).

3 BACKGROUND

In this section, the fundamental elements to be used

in our methodology are described. They are the

mulEEG, a multi-view representation learning model

on EEG signals (Kumar et al., 2022), and the Lin-

ear Support Vector Machine (Bishop and Nasrabadi,

2006), which is chosen here as the classiﬁcation head.

3.1 mulEEG

The mulEEG model, presented in (Kumar et al.,

2022), is a self-supervised multi-view method to learn

the representation of EEG signals. The objective

is to effectively utilize the complementary informa-

tion of the EEG multi-view signals. Also, this self-

supervised method follows the contrastive learning

approach.

In order to obtain the multi-view of EEG signals,

ﬁrst data augmentation is applied. The family of aug-

mentations T

uses jittering, in which uniform noise

is added to the EEG signal, and masking, in which

signals are masked randomly, ending up with the time

series t

. On the other hand, in the family of aug-

mentations T

, the EEG signals are randomly ﬂipped

in horizontal direction and then scaled with Gaussian

noise, resulting in the time series t

. Additionally,

and t

are converted into their respective spectro-

grams, s

and s

, by a Short-Time Fourier Transform

S. In this way, we have all the EEG signal views to be

used.

Thereby, the mulEEG model is composed by the

time series encoder E

, which is a ResNet-50 with 1D

convolutions, and the spectrogram encoder E

as fea-

ture extractors. They are responsible for obtaining the

effective EEG signal representations.

Once the time series and spectrogram features

have been obtained, they are passed into projection

heads, which map those representations to the space

where the contrastive loss will be applied. This struc-

ture is a fully connected neural network whose layer

Sleep-Stage Efﬁcient Classiﬁcation Using a Lightweight Self-Supervised Model

973

sequence corresponds to: Linear, Batch Norm, ReLU,

and Linear. For each family of augmentations T

i = 1,2, there are three projection heads, one for each

kind of feature: time series ( f

), spectrogram (h

and the concatenation of both (g

). The correspond-

ing contrastive losses are named L

T T

, L

, and L

which gives more ﬂexibility to optimize each feature.

The diagram of mulEEG is presented in Figure 1.

The model employs a variant of contrastive loss

called NT-Xent, which maximizes the similarity be-

tween two augmented views while minimizing its

similarity with other samples (Kumar et al., 2022),

given by Equation (2). Notice that N is the batch size,

τ is the temperature parameter, and cosine similarity

is used.

l(i, j) = −log

exp(cos(z

)/τ)

∑

k=1

[k̸=i]

exp(cos(z

)/τ)

(1)

L(z

) =

∑

k=1

l(2k − 1,2k) + l(2k,2k − 1). (2)

We also have the diverse loss L

, which forces the

complementary information between time series and

spectrogram views. This loss is applied over the time

series and spectrogram features from both families of

augmentations of a single sample instead of the entire

batch, ignoring the concatenated features, which tend

to maximize the mutual information between those

features. The diverse loss is represented in Equation

(4), where z

= [z

] is taken with respect to

a single sample and τ

is the temperature parameter.

The total loss, a linear combination of all the losses

previously described with parameters λ

and λ

, is

represented by Equation 5.

,a,b) = − log



exp(cos(z

[a],z

[b])/τ

)

∑

i=1

[i̸=a]

exp(cos(z

[a],z

[i])/τ

)



(3)

∑

k=1

,1,2) +l

,2,1) +l

,3,4) +l

,4,3),

(4)

tot

= λ

T T

+ L

) + λ

. (5)

For the ﬁnal outcome of the model, the pre-trained

encoders are submitted to a linear layer, which is at-

tached right after the frozen time series encoder and

only this speciﬁc layer is trained, evaluating the cho-

sen metrics as shown in Figure 2. Therefore, mulEEG

does not explore the sleep-stage classiﬁcation task but

learns effective representations of it.

3.2 Linear SVM

Support Vector Machine (SVM) is a supervised ma-

chine learning algorithm that can be used in binary

classiﬁcation problems. The SVM goal is to obtain

the maximum margin that separates hyperplanes cor-

responding to decision boundaries over the classes.

Next, the Linear SVM description is presented ac-

cording to (Pedregosa et al., 2011).

Given a sample (x, y), with input vector x ∈ R

n×1

and label y ∈ {−1, 1}, consider the prediction given

by (w

x+b), where w ∈ R

n×1

is the weight and b ∈ R

is the bias term. The hinge loss used in Linear SVM

can be deﬁned as

max(0,1 − y(w

x + b)). (6)

Note that if the prediction is correct, the hinge loss is

equal to zero.

Thereby, given m samples {(x

(i)

)}

i=1

, the

Linear SVM solves the following problem:

argmin

w,b

∑

i=1

max(0,1 −y

(i)

+ b))+

(7)

In Equation (7), a regularization term is included via

C > 0, which acts as the inverse of the regularization

penalty.

One can also increase the complexity of SVM by

the addition of a kernel function, which transforms the

input vectors in such a way that the class separation is

performed by complicated non-linear decision bound-

aries. However, in the Linear SVM, the kernel func-

tion is the identity. Additionally, in a multi-class clas-

siﬁcation problem, strategies like “one-vs-the-rest” or

“one-vs-one” can be applied.

4 PROPOSED METHOD

Taking into consideration that, despite its effective-

ness in EEG analysis, mulEEG is a complex model

and its training algorithm has a high computational

cost, the proposed method adapts the original archi-

tecture, with the objective of obtaining similar ac-

curacy with a singniﬁcantly reduced computational

overhead. In a ﬁrst stage, we substitute the ResNet-

50 in the time series encoder by a ResNet-18, also us-

ing 1D-convolutions. This adaptation aims to verify

whether a simpler model can also be effective in the

classiﬁcation task. The amount of data used to train

the model was also varied to check the need for a large

dataset for this task. The architectures for ResNet-50

and ResNet-18 with 1D convolutions in the encoder

are presented in Tables 1 and 2, respectively. The

architecture of the ResNet-18 was adapted with 1D-

convolutions based on (He et al., 2015).

Additionally, once the EEG signal representations

were learned, a Linear SVM is trained for sleep-stage

classiﬁcation. This step has the goal of checking how

VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications

974

Figure 1: MulEEG structure. First, the EEG signal goes through data augmentation, in which we get the time series views

of the families of augmentations T

and T

and their conversions to spectrograms by the Short Time Fourier Transform

S. Then, these views are passed to encoders E

for time series and E

for spectrograms. Finally, the outputs go through

distinct projection heads depending on the type of view (time series, spectrogram or their concatenation) and the family of

augmentations, where the contrastive loss is applied between the projections of the same type.

Table 1: Architecture of ResNet-50 with 1D-convolutions used in mulEEG. Consider the input dimension as [256× 1 ×3000],

where 256 is the batch size, 1 is the number of channels, and 3000 is the length of each sample. The residual blocks are

represented in brackets and their numbers of repetition are indicated at right. Additionally, k is the kernel size, f is the

number of ﬁlters, s is the stride, and p is the padding of the 1D-convolution. Notice that s = 2 means that the stride is

actually equal to 2 only in the ﬁrst repetition of the corresponding bottleneck block, so there is a down-sampling. In the next

repetitions, the stride is equal to 1.

ResNet-50 with 1D-convolutions used in mulEEG.

Layer Output dimension Architecture

Conv0

[256 × 16 × 1500] k = 71, f = 16, s = 2, p = 35

[256 × 16 × 750] k = 71, s = 2, p = 35 Max-Pooling

Conv1 x [256 × 32 × 750]





k = 1 f = 8 s = 1 p = 0

k = 25 f = 8 s = 1 p = 12

k = 1 f = 32 s = 1 p = 0





× 3

Conv2 x [256 × 64 × 375]





k = 1 f = 16 s = 1 p = 0

k = 25 f = 16 s = 2 p = 12

k = 1 f = 64 s = 1 p = 0





× 4

Conv3 x [256 × 128 × 188]





k = 1 f = 32 s = 1 p = 0

k = 25 f = 32 s = 2 p = 12

k = 1 f = 128 s = 1 p = 0





× 6

Conv4 x [256 × 256 × 94]





k = 1 f = 64 s = 1 p = 0

k = 25 f = 64 s = 2 p = 12

k = 1 f = 256 s = 1 p = 0





× 3

effective those features are in a more robust classiﬁ-

cation algorithm, since the baseline architecture had

a simple linear classiﬁer to provide its output. The

choice of using a linear kernel on SVM is justiﬁed by

the huge data volume used in the experiments. Ac-

cording to (Pedregosa et al., 2011), an SVM with

non-linear kernel scales at least quadratically with the

number of samples, while the SVM with linear ker-

nel can scale almost linearly to millions of samples.

Thereby, the Linear SVM is a sufﬁcient algorithm for

Sleep-Stage Efﬁcient Classiﬁcation Using a Lightweight Self-Supervised Model

975

Table 2: Architecture of the ResNet-18 adapted in our method with 1D-convolutions. The notation is identical to that in Table

ResNet-18 adapted with 1D-convolutions

Layer Output dimension Architecture

Conv0

[256 × 16 × 1500] k = 71, f = 16, s = 2, p = 35

[256 × 16 × 750] k = 71, s = 2, p = 35 Max-Pooling

Conv1 x [256 × 8 × 750]



k = 25 f = 8 s = 1 p = 12



× 2

Conv2 x [256 × 16 × 375]



k = 25 f = 16 s = 1 p = 12

k = 25 f = 16 s = 2 p = 12



× 2

Conv3 x [256 × 32 × 188]



k = 25 f = 32 s = 1 p = 12

k = 25 f = 32 s = 2 p = 12



× 2

Conv4 x [256 × 64 × 94]



k = 25 f = 64 s = 1 p = 12

k = 25 f = 64 s = 2 p = 12



× 2

Figure 2: Linear evaluation of mulEEG, in which the EEG

signal goes through the pre-trained encoder E

and the rep-

resentation obtained is used as input to train the linear clas-

siﬁer, getting the sleep-stage classiﬁcation at the end.

the goal of testing the learned EEG features in the

classiﬁcation task. Finally, in the proposed frame-

work, the input of the Linear SVM can be the time

series features, spectogram features, or the concate-

nation of both, as shown in Figure 3.

Figure 3: Proposed method. First, the EEG signal goes

through the chosen pre-trained encoders, which could vary

depending on the model for E

. Notice that before E

, there

is the Short Time Fourier Transform S. Then, the Linear

SVM is trained with one of the possible inputs: (1) time

series representation, (2) spectrogram representation or (3)

their concatenation. Finally, the sleep-stage classiﬁcation is

obtained.

5 EXPERIMENTAL SETUP

5.1 Dataset

The proposed method was evaluated on the Sleep-

EDF database presented in (Kemp et al., 2000) and

publicly available in (Goldberger et al., 2000). The

data consists of 153 whole-night polysomnography

sleep recordings sampled at 100 Hz and by the EEG

method (from Fpz-Cz and Pz-Oz electrode locations),

presented in the *PSG.edf ﬁles. Each one contains

the respective *Hypnogram.edf ﬁle with annotations

of the sleep patterns (hypnograms), which are the

sleep stages ‘W’ (Wake), ‘R’ (REM), ‘1’ (N1), ‘2’

(N2), ‘3’ (N3), ‘4’ (N3), ‘M’ (moviment time) and

‘?’ (not scored) scored by well-trained technicians.

These data come from the Sleep Cassette Study con-

ducted between 1987 and 1991, which is about the

age effects on sleep in healthy Caucasian adults aged

25-101, without any sleep-related medication (Gold-

berger et al., 2000).

In terms of data processing, as in (Kumar et al.,

2022), 58 patients were chosen to compose the un-

labeled pretext group for training the mulEEG and

20 were left for cross-validation (5-fold) in the linear

evaluation. Each recording, sampled at 100 Hz, was

split into 30-second segments named epoch, which

corresponds to a 3000 components array. To train the

model, we randomly select 20% and 100% of those

samples from the pretext group. Also, the same data

used in the linear evaluation was used to train the Lin-

ear SVM, except that in this last case they are normal-

ized.

VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications

976

5.2 Implementation Details

In general, the training protocol was kept similar to

that in (Kumar et al., 2022). Then, we have N = 256

(batch size), the temperature parameters τ = 1 for

T T

, L

, and L

and τ

= 10 for L

, λ

= 1, and

= 2. An important difference in the protocol was

with respect to the number of epochs, which here was

deﬁned as 200, while the authors in (Kumar et al.,

2022) used 140. From epoch 80 on, the linear eval-

uation starts to be done at each 4 epochs, in which

the linear classiﬁer is trained for 100 epochs with the

corresponding pre-trained encoder for time series E

frozen. The computational setup for training com-

prised an Intel Core-i7 8700, 16 GB of RAM, Nvidia

Titan V graphics card, Python 3.9/PyTorch 2.5, Linux

Ubuntu 24.10.

In the Linear SVM, the loss function is the squared

hinge loss, with C = 1, tolerance for stopping equals

−4

, and the maximum number of iterations equals

1000. The pre-trained encoders used for this task are

the ones with the highest MF1 during trainig with

100% of data and varyng the ResNet architectures of

5.3 Evaluation Metrics

The evaluation metrics used for both the linear evalu-

ation of mulEEG and the proposed method are accu-

racy (Acc), Cohen’s kappa (κ), and macro-averaged

F1 score (MF1). The computational time for training

was also analyzed.

6 RESULTS AND DISCUSSION

Table 3 presents the linear evaluation metrics for

various mulEEG training conﬁgurations compared to

(Kumar et al., 2022). It should be noted that the use

of ResNet-50 results in signiﬁcantly longer training

time compared to ResNet-18, despite delivering su-

perior results. However, the metrics obtained with

ResNet-50 and 100% of the pretext group were the

best in all experiments, although slightly lower than

those reported in (Kumar et al., 2022). However, this

conﬁguration proved to be the most computationally

expensive. Conversely, training with ResNet-18 and

100% of the data produced metrics very similar to

those achieved with ResNet-50 and 20% of the data,

with the latter one requiring approximately one-third

of the training time. In this way, we conﬁrm that train-

ing with ResNet-50 and 20% of the data provides the

best cost-beneﬁt ratio. In other words, reducing the

data volume of the pretext group is signiﬁcantly more

advantageous than simplifying the model to acceler-

ate training.

For the linear SVM classiﬁcation task, the met-

rics obtained by varying the time series encoder and

classiﬁer input are presented in Table 4. First, it is ev-

ident that the metrics obtained by SVM when using

the EEG signal were signiﬁcantly inferior compared

to the other conﬁgurations. This suggests that the rep-

resentations learned by mulEEG are indeed effective

for the classiﬁcation task, serving as a proﬁcient fea-

ture extractor for EEG signals.

It is also noteworthy that when using spectrogram

features, the metrics remain similar despite variations

in the encoder. This aligns with expectations, as the

spectrogram encoder E

remains unchanged in both

cases, indicating that its training is unaffected by the

time series encoder, as intended by the use of L

loss

function. However, the use of this feature proved to

be the least beneﬁcial to the classiﬁcation task, out-

performing only the direct use of the EEG signal.

Furthermore, when using ResNet-18, the time se-

ries feature yields metrics that are highly comparable

with those reported in (Kumar et al., 2022), while the

concatenated features surpass the metrics of the afore-

mentioned reference. These results demonstrate that

simplifying the mulEEG model by using ResNet-18

in E

is highly advantageous for classifying EEG sig-

nals with a more robust classiﬁer, particularly when

utilizing concatenated features, demonstrating the ef-

fective use of the complementary information from

both views.

Moreover, when employing ResNet-50, i.e.,

mulEEG in its original conﬁguration, the use of both

time series and concatenated features yields nearly

identical results, surpassing all other metrics, includ-

ing those of (Kumar et al., 2022). Unlike the ﬁnd-

ings with ResNet-18, there is no signiﬁcant improve-

ment when using the concatenated features, with only

the F1 score showing an increase of approximately

2%. From this we can infer that as the complexity

of the temporal encoder increases, more information

is extracted from the time series, making the spectro-

gram feature less contributory to the classiﬁer’s learn-

ing process.

In conclusion, the feature extractors trained within

mulEEG play a pivotal role in the classiﬁcation of

EEG signals into sleep stages. Furthermore, it can

be posited that the superior performance of the SVM

in classifying EEG signals arises from using the con-

catenated features, which outperform the linear eval-

uation presented in (Kumar et al., 2022). Despite the

superior metrics of using ResNet-50 as the encoder in

this context, ResNet-18 still offers a compelling cost-

beneﬁt ratio when paired with a more robust classiﬁer.

Sleep-Stage Efﬁcient Classiﬁcation Using a Lightweight Self-Supervised Model

977

Table 3: Evaluation metrics of various mulEEG training conﬁgurations.

Method Acc κ MF1 Training time

20% data + ResNet-18 0.6979 0.5705 0.5252 3h 2m 55s

20% data + ResNet-50 0.7483 0.6469 0.6056 4h 27m 26s

100% data + ResNet-18 0.7549 0.6528 0.6189 13h 21m 22s

100% data + ResNet-50 0.7653 0.6704 0.6546 21h 2m 56s

Linear evaluation of (Kumar et al., 2022) 0.7806 0.6850 0.6782 -

Table 4: Metrics observed in the classiﬁcation task using Linear SVM with EEG signal and time series, spectrogram, and

concatenated features as the input. The encoder E

was varied and the linear evaluation of (Kumar et al., 2022) is also

presented for comparison purposes.

Linear SVM training

model Input of SVM Acc κ F1

Linear evaluation of (Kumar et al., 2022) 0.7806 0.6850 0.6782 (MF1)

- EEG signal 0.2984 0.0260 0.2143

ResNet-18

time series feature 0.7732 0.6812 0.6657

Spectrogram feature 0.7452 0.6415 0.6426

Concatenated feature 0.7909 0.7074 0.6972

ResNet-50

time series feature 0.8090 0.7328 0.7239

Spectrogram feature 0.7413 0.6357 0.6366

Concatenated feature 0.8079 0.7323 0.7373

After all, a more streamlined model achieved compet-

itive performance by effectively exploiting the com-

plementary information between the time series and

spectrogram. This also conﬁrms that both data repre-

sentations offer useful viewpoints and that the combi-

nation of a self-supervised feature description allows

the effective use of lighter supervised models in the

target task. This ﬁnding can also help in other do-

mains to be explored in future works, for time series

classiﬁcation in general.

7 CONCLUSIONS

This work presents an investigation on the use of

a self-supervised model alongside with Linear SVM

classiﬁer, with the aim of exploring the classiﬁca-

tion of EEG signals into sleep-stages. Initially, we

observed that the complexity of deep learning mod-

els known to be well-succeeded in this task, com-

bined with the large volume of data, resulted in a

high computational cost during training. To address

this, we propose a model that uses as baseline the

well-established mulEEG architecture, but simpliﬁes

it by replacing the ResNet-50 with 1D-convolutions

by ResNet-18, using it as the time series encoder. Fur-

thermore, training this model involved experimenting

with both 20% and 100% of the data from the pre-

text group. The linear evaluation of these training

conﬁgurations revealed that although the ResNet-50

model with 100% of the data achieved the best met-

rics, it also incurred in the highest computational cost.

We thus discovered that reducing the volume of data

yielded a better cost-beneﬁt ratio than simplifying the

temporal encoder model.

It is important to notice that the primary goal of

models like mulEEG is to learn effective representa-

tions of EEG signals, rather than to directly address

the classiﬁcation task. This is evident in the linear

evaluation, where a simple linear classiﬁer receives

only the time series features as input. In this con-

text, the Linear SVM was chosen here as a more

robust classiﬁer, trained with varying inputs derived

from the EEG signal representations obtained by the

self-supervised model, which receives the time series,

spectrogram, and their concatenation as input. Ad-

ditionally, the temporal encoder was again varied be-

tween ResNet-50 and ResNet-18 in order to compare

their performance with a more robust classiﬁer. The

metrics obtained with ResNet-50, using both time se-

ries and concatenated features, were very similar and

resulted in the best performance, surpassing the linear

evaluation in (Kumar et al., 2022). It is worth not-

ing that the complexity of the temporal encoder limits

the amount of complementary information provided

by the spectrogram features. In contrast, when using

ResNet-18, the concatenated features emerge as par-

ticularly effective, and the results once again outper-

form those in (Kumar et al., 2022). This highlights

the signiﬁcant cost-beneﬁt advantage of simplifying

the mulEEG model and pairing it with a more robust

classiﬁer, thereby effectively capitalizing on the com-

VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications

978

plementary perspectives of both the raw time series

and the spectrogram.

In conclusion, this study highlights the potential

of associating a robust classiﬁer after extracting fea-

tures from EEG signals for classiﬁcation into sleep

stages, as well as leveraging the different perspec-

tives these data provide. Indeed, this approach holds

promise for further exploration in a variety of prob-

lems involving biological signals in general, such as

the detection of anomalies in electrocardiograms, for

example.

ACKNOWLEDGEMENTS

J. B. Florindo gratefully acknowledges the ﬁnan-

cial support of the S

ao Paulo Research Foundation

(FAPESP) (Grants #2024/01245-1 and #2020/09838-

0) and from National Council for Scientiﬁc and

Technological Development, Brazil (CNPq) (Grant

#306981/2022-0).

REFERENCES

Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recog-

nition and machine learning, volume 4. Springer.

Eldele, E., Ragab, M., Chen, Z., Wu, M., Kwoh, C.-K.,

and Li, X. (2023). Self-supervised learning for label-

efﬁcient sleep stage classiﬁcation: A comprehensive

evaluation. IEEE Transactions on Neural Systems and

Rehabilitation Engineering, 31:1333–1342.

Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov,

P. C., Mark, R., Mietus, J. E., Moody, G. B., Peng,

C. K., and Stanley, H. E. (2000). Physiobank, phys-

iotoolkit, and physionet: Components of a new re-

search resource for complex physiologic signals. Cir-

culation [Online], 101 (23):pp. e215–e220.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep

residual learning for image recognition. CoRR,

abs/1512.03385.

Kemp, B., Zwinderman, A. H., Tuk, B., Kamphuisen, H.

A. C., and Oberye, J. J. L. (2000). Analysis of a

sleep-dependent neuronal feedback loop: the slow-

wave microcontinuity of the eeg. IEEE Transactions

on Biomedical Engineering, 47(9):1185–1194.

Kumar, V., Reddy, L., Sharma, S. K., Dadi, K., Yarra,

C., Bapi, R. S., and Rajendran, S. (2022). muleeg:

A multi-view representation learning on eeg signals.

Lecture Notes in Computer Science, 13433.

Liu, P., Qian, W., Zhang, H., Zhu, Y., Hong, Q., Li, Q.,

and Yao, Y. (2024). Automatic sleep stage classiﬁca-

tion using deep learning: signals, data representation,

and neural networks. Artiﬁcial Intelligence Review,

57(11):301.

Patel, A. K., Reddy, V., Shumway, K. R., and Araujo, J. F.

(2024). Physiology, sleep stages.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer,

P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,

A., Cournapeau, D., Brucher, M., Perrot, M., and

Duchesnay, E. (2011). Scikit-learn: Machine learning

in Python. Journal of Machine Learning Research,

12:2825–2830.

Sekkal, R. N., Bereksi-Reguig, F., Ruiz-Fernandez, D., Dib,

N., and Sekkal, S. (2022). Automatic sleep stage clas-

siﬁcation: From classical machine learning methods

to deep learning. Biomedical Signal Processing and

Control, 77:103751.

Xiao, T., Wang, Z., Zhang, Y., Wang, S., Feng, H., Zhao, Y.,

et al. (2024). Self-supervised learning with attention

mechanism for eeg-based seizure detection. Biomedi-

cal Signal Processing and Control, 87:105464.

Yuan, H., Plekhanova, T., Walmsley, R., Reynolds, A. C.,

Maddison, K. J., Bucan, M., Gehrman, P., Rowlands,

A., Ray, D. W., Bennett, D., et al. (2024). Self-

supervised learning of accelerometer data provides

new insights for sleep and its association with mor-

tality. NPJ digital medicine, 7(1):86.

Sleep-Stage Efﬁcient Classiﬁcation Using a Lightweight Self-Supervised Model

979