Non Contact Stress Assessment Based on Deep Tabular Method

Urmila

and Avantika Singh

Dr. Shyama Prasad Mukherjee International Institute of Technology, Naya Raipur, Raipur, Chhattisgarh, India

Keywords:

Stress Detection, Remote Photoplethysmography (rPPG), Physiological Features.

Abstract:

In today’s competitive world, stress is the major factor that inﬂuences human health negatively. In the long

term, stress can lead to serious health problems such as diabetes, depression, anxiety, and various heart dis-

eases. Thus, timely stress recognition is important for efﬁcient stress management. Currently, for stress as-

sessment various wearable devices are used to capture physiological signals. However, these devices although

accurate are cost-sensitive and requires direct physical contact which may lead to discomfort in long run. In

this work we have introduced a tabular based deep learning architecture for detecting stress by analyzing phys-

iological features. The architecture extracts physiological features from remote photoplethysmography (rPPG)

signals computed from facial videos. The proposed architecture is validated on publicly available UBFC-Phys

dataset for two sets of experiments (i) Stress task classiﬁcation and (ii) Multi-level stress classiﬁcation. For

both set of experiments the proposed methodology outperforms the current state-of-art method. The code is

available at https://github.com/Heeya2205/Deep_tabular_methods.

1 INTRODUCTION

Stress is a physical, emotional, or mental response

to external pressures or demands, which can arise

from various factors. Life situations like work chal-

lenges, family issues, and environmental changes of-

ten arouse stress conditions in an individual. It is a

natural response of the body to perceived threats and

can manifest as tension, anxiety or other psycholog-

ical or physical reactions (Giannakakis et al., 2019).

Stress causes the body to release hormones like corti-

sol and adrenaline, which raise your heart rate, blood

pressure, and blood sugar to help you deal with chal-

lenges. However, if stress lasts a long time, it can lead

to serious health problems like heart disease, diabetes,

anxiety, and depression (Fink, 2010).

In human body autonomic nervous system (ANS)

is responsible for controlling the number of heart-

beats per minute (Barazi et al., 2021). ANS can be

further divided into two sub-parts: (a) Sympathetic

nervous system, (b) Parasympathetic nervous system.

Sympathetic nervous system is active when an indi-

vidual encounters situation like stress, anxiety, fear or

undergoes laborious exercises, as a result it increases

the heart rate. On the other hand parasympathetic ner-

vous system is active when an individual is in calm

https://orcid.org/0009-0003-8642-8525

https://orcid.org/0000-0002-2606-5959

state of mind particularly when an individual feels

compassion or love and thus, it decreases the heart-

rate. As mentioned above heart rate is directly inﬂu-

enced by ANS which further depends upon our state-

of-mind. Thus in this work we aim to estimate stress

based on heart-rate (HR) and its related factors like

Heart rate variability (HRV), Peak detection etc.

Traditionally, heart rate can be measured by var-

ious methods like electrocardiography (ECG), elec-

tromyography (EMG), and photoplethysmography

(PPG). However, these methods although accurate

are cost-sensitive and requires direct physical contact.

Henceforth, in this work we explore, non-invasive

method of heart-rate estimation by utilizing remote

photoplethysmography (rPPG) signals.

Here, in this work for stress assessment, ﬁrstly

rPPG signals are extracted from recorded video

frames. Later, physiological features like HR, HRV

etc are calculated from observed rPPG signals. Fur-

ther, the extracted physiological features are encoded

for learning discriminative stress levels by employ-

ing deep tabular data learning architecture (Arik and

Pﬁster, 2021) based on attention mechanism. Main

contributions of this work are as follows:

• Efﬁcient Tabular Learning with TabNet: The

proposed method harnesses TabNet network

(Arik and Pﬁster, 2021) as the foundational back-

bone to directly process physiological features ex-

Urmila, and Singh, A.

Non Contact Stress Assessment Based on Deep Tabular Method.

DOI: 10.5220/0013150300003905

In Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2025), pages 597-604

ISBN: 978-989-758-730-6; ISSN: 2184-4313

597

tracted from rPPG signals, incorporating built-

in feature prioritization and sparsity mechanisms.

This eliminates the need for extensive preprocess-

ing while enhancing model interpretability and

performance.

• Improved Generalization Across Classiﬁcation

Tasks: The method demonstrates lower variance

in accuracy across binary and multi-class stress

classiﬁcation tasks, highlighting superior gener-

alization compared to existing approaches, which

often exhibit signiﬁcant performance variability.

• Robust Feature Extraction: To effectively man-

age the dynamic motion of faces across video

frames and improve the precision of facial re-

gion extraction, the proposed method integrates

the Haar Cascade (Choi et al., ), and Mediapipe

library (Lugaresi et al., 2019).This combination

facilitates robust face detection and accurate lo-

calization of facial landmarks.

2 RELATED WORK

Several works have been reported in the literature for

rPPG signal analysis (Das et al., 2023) (Speth et al.,

2024) but the work done in the ﬁeld of stress analy-

sis using rPPG signal is still limited. In one of the

notable work (Ziaratnia et al., 2024) author’s pro-

posed deep learning-based method that utilize Com-

pact Convolutional Transformers (CCT) for feature

extraction and Long Short-Term Memory (LSTM) for

temporal pattern recognition. This study presents a

non-contact approach for recognizing stress by uti-

lizing rPPG signals derived from RGB facial videos.

In another work (Xu et al., 2024) a multi-task atten-

tional convolutional neural network (MTASR) is de-

ployed, which integrates peak detection and heart rate

estimation to enhance stress recognition. By employ-

ing peak detection as a physiological parameter, the

researchers trained their network to identify more re-

liable physiological indicators. In another work au-

thors (Casado et al., 2023) recognized depression

based on a pipeline that extract rPPG signals in a

full unsupervised manner, and calculate 60 statisti-

cal, geometrical, and physiological features. These

extracted features are further used to train several

machine learning regressors to recognize different

level of depressions. In (Ntalampiras, 2023) authors

proposed a non-intrusive, low-cost, and automatic

stress monitoring framework. This framework ex-

tracts multi-domain speech features to reveal comple-

mentary stress-related characteristics. In (Pan et al.,

2024) a deep network for stress assessment is pro-

posed that focuses on facial features such as expres-

sions, movements, and speciﬁc areas like the eyes,

nasolabial folds, and jaw, which are related to depres-

sion. The proposed model emphasizes dynamic facial

features through its attention mechanism.

3 METHODOLOGY

This section conceptualizes our proposed approach.

Figure 1 gives a generic overview of the proposed sys-

tem which consists of mainly ﬁve parts: (i) ROI selec-

tion and facial landmark detection (ii) Color space ex-

traction from cropped facial regions (iii) rPPG signal

extraction (iv) Physiological features extraction from

rPPG signals (v) Tabular learning on physiological

features.

3.1 ROI Selection and Facial Landmark

Detection

In the pre-processing phase, video frames are ex-

tracted from facial video sequences. Speciﬁcally, we

utilize a frame rate of 30 fps for temporal segmen-

tation of the video. Each extracted frame is main-

tained at a resolution of 513 × 513 pixels. Following

the frame extraction, the next step involves generat-

ing a bounding box around the facial region of inter-

est. To achieve this, we implement the Haar Cascade

algorithm (Choi et al., ), which operates by detecting

facial features through the application of rectangular

ﬁlters that compute the intensity differences between

adjacent regions in the image. Once the bounding

box is delineated, the subsequent step is the detection

of facial landmarks. For this, we employ the Medi-

apipe Face Mesh (Lugaresi et al., 2019) framework,

which facilitates the identiﬁcation of 478 facial land-

marks (Ziaratnia et al., 2024).

By integrating both techniques in our proposed

method, we address the challenges inherent in extract-

ing the facial region of interest. Figures 2a and 2b

demonstrate these challenges and highlight the im-

provements achieved through our approach.

Furthermore, convex hull based masking tech-

nique has been employed to isolate and emphasize the

facial region. Mathematically, the convex hull is the

smallest convex shape that can enclose all the points

representing the facial landmarks. For this we have

ﬁrst created a binary mask with the same resolution

as our extracted video frame resolution (513 × 513).

Formally, binary mask is described as:

mask(x, y) =

(

255 if (x, y) ∈ convex hull

0 otherwise

ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods

598

Figure 1: Proposed architecture comprising of ﬁve main parts: (i) ROI selection and facial landmark detection (ii) Color space

extraction (iii) rPPG signal extraction (iv) Physiological features extraction (v) Tabular learning on extracted physiological

features.

where (x, y) denotes the pixel coordinates. The last

step is to generate cropped facial region, which is gen-

erated as:

Cropped

= F

Mediapipe

∧ mask

where, F

Mediapipe

is the output representing 478 facial

landmarks. Here, the bitwise AND operation is used

to retain only the pixel values of the facial region that

fall within the convex hull, effectively setting all non-

facial pixels to black. This selective masking tech-

nique highlights the facial features by suppressing ir-

relevant background elements.

3.2 Color Space Extraction from

Cropped Facial Regions

Inspired by the prior study (Kim et al., 2021) we apply

YCgCr color space (F

YCgCr

) from F

Cropped

(cropped

facial region). By successfully decoupling the lu-

minance component (Y) from the chrominance com-

ponents (Cg) and (Cr), this color space facilitates

the identiﬁcation of subtle color changes that signify

physiological conditions. By concentrating on the lu-

minance channel and working with various skin tones,

this color space lessens lighting variations (Panigrahi

and Sharma, 2022).

3.3 rPPG Signal Extraction

For extracting rPPG signals from F

YCgCr

we have

utilized state-of-the-art POS (Plane-Orthogonal-to-

Skin) (Wang et al., 2016) algorithm. Algorithm 1 il-

lustrates POS detailed steps.

Inspired by the prior study (Xu et al., 2024) to

normalize extracted rPPG signals (S

pos

) we have ﬁl-

tered it with Butterworth ﬁlter (Selesnick and Burrus,

(a)

(b)

Figure 2: Preprocessing steps. (a) Without proper prepro-

cessing . (b) After preprocessing. The image is taken from

UBFC-Phys dataset.

1998). While ﬁltering low cut-off frequency was set

as 0.7 Hz and high cut-off frequency was set as 2.5

Hz. Once the signals are normalized, we divide them

into 2-second windows (30 frames per second) with a

10% overlap between segments to account for tempo-

ral differences across frames.

Non Contact Stress Assessment Based on Deep Tabular Method

599

Algorithm 1: rPPG Signal Extraction.

1: Input: F

YCgCr

2: Output: S

pos

(extracted rPPG Signal)

3: Step 1: Compute Color Signals

4: C

← Y −Cg ▷ Color Signal 1

5: C

← Cr −

Y +Cg

▷ Color Signal 2

6: Step 2: Compute Means of Color Signals

7: µ

←

∑

i=1

▷ Mean of Color Signal 1

8: µ

←

∑

i=1

▷ Mean of Color Signal 2

9: Step 3: Adjust Color Signals by Subtracting

the Mean

10: S

← C

− µ

▷ Adjusted Signal 1

11: S

← C

− µ

▷ Adjusted Signal 2

12: Step 4: Compute Final rPPG Signal

13: S

pos

← S

+ S

▷ rPPG Signal

14: Return S

pos

Figure 3: Single window depicting signal peak and interval

detection.

3.4 Physiological Features Extraction

From the normalized rPPG signals (S

npos

) we have

extracted signal peaks as shown in Figure 3. From

the extracted peaks, signal peak intervals are com-

puted as depicted in Figure 3. Later on the basis of

aforementioned signal intervals 7 physiological fea-

tures are computed as follows:

(i) Heart Rate (HR):

HR =

mean interval between peaks (seconds)

where intervals are the time differences between con-

secutive S

npos

peaks.

(ii) Heart Rate Variability (HRV):

HRV = Standard deviation of intervals (seconds)

which represents the variation in time between heart-

beats.

(iii) Standard Deviation of Heart Rate

(SD HR):

SD HR = Standard deviation of computed heart rates

(bpm) calculated for each interval.

(iv) Root Mean Square of Successive Differ-

ences (RMSSD):

RMSSD =

N − 1

N−1

∑

i=1

(interval

i+1

− interval

)

where N is the number of intervals, indicating short-

term HRV.

(v) Ups Count: In a sympathetic activity in Au-

tonomic nervous system(ANS) this ups count refer to

the number of times S

npos

signal transit from a lower

value to higher value. It indicates the onset of a heart-

beat.

(vi) Downs Count: It is the moment when the

heart ﬁnishes pumping blood and start to relax, which

is captured as a downward transition in the S

npos

sig-

nal.

(vii) Cycles: Mathematically, it is deﬁned as the

minimum of Ups Count and Downs Count.

3.5 Tabular Learning on Physiological

Features

The extracted physiological features are numeric val-

ues; thus, we store them in a tabular structure. To

extract meaningful information from it, we have in-

corporated the state-of-the-art tabular learning archi-

tecture, TabNet (Arik and Pﬁster, 2021). In our pro-

posed methodology, we have extracted discriminative

information from physiological features by utilizing

the TabNet encoder architecture, as depicted in Fig-

ure 4. This tabular-based deep learning architecture

operates through a series of sequential decision steps,

where each step reﬁnes its focus based on the infor-

mation processed from the previous step. Each deci-

sion step employs non-linear transformations to reﬁne

the feature representations. Furthermore, it includes a

sparsity regularization mechanism to control the num-

ber of features selected at each step.

As depicted in Figure 4, this architecture mainly

comprises three modules: (i) Feature Transformer,

(ii) Attentive Transformer, and (iii) Feature Selection

Mask. All these are described below:

Feature Transformer: This module is responsi-

ble for transforming the input physiological features

into meaningful representations. These representa-

tions are later divided into two parts using a split

block. The ﬁrst part, denoted as d[i] (as depicted in

Figure 4), contributes to the current decision output,

ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods

600

Figure 4: Encoder architecture for extracting discriminative information from physiological features. This architecture op-

erates through a series of sequential decision steps mainly comprising of three parts (i) Feature Transformer (ii) Attentive

Transformer (iii) Feature Selection Mask. Here, a[i] serves as the input for the attention transformer in the next decision step

while d[i], denotes the current decision output.

while the other part, denoted as a[i] (as depicted in

Figure 4), serves as the input for the attentive trans-

former in the next decision step.

Attentive Transformer: This module generates a

trainable mask that identiﬁes the most important fea-

tures for each decision step using a sparsemax func-

tion, which creates a sparse, interpretable feature se-

lection mechanism. A prior scale term is also em-

ployed to regulate how often each feature is selected

across decision steps.

Feature Selection Mask: This module is used for

selecting globally important feature attributions.

To develop a deep tabular learning model capa-

ble of robustly identifying and prioritizing speciﬁc

features derived from rPPG signals that play a criti-

cal role in our analysis, we leverage the TabNet ar-

chitecture. This approach eliminates the need for

preprocessing techniques by allowing the model to

adaptively focus on essential features. If a feature is

deemed less supportive at any stage, the model ad-

vances to another level of evaluation to reassess its

importance.

4 EXPERIMENT AND RESULTS

4.1 Dataset Description

To validate the performance of proposed approach

in this work we have worked on UBFC-Phy

dataset (Sabour et al., 2021). This dataset is pub-

licly available and consists of 56 subjects in total.

Out of these 56 subjects 47 subjects are female and

rest are male. This dataset contains facial videos of

enrolled subjects. Each subject underwent three dif-

ferent tasks: resting task (T1), speech task (T2) and

arithmetic task (T3) resulting in generation of 168

videos in total. Furthermore, this dataset is divided

into two groups based on the challenging nature of

the task namely: control group and test group. In

control group subjects faced less challenging task as

compared to subjects in test group.

4.2 Model Validation and Experimental

Setup

For validating our proposed approach we have con-

ducted two set of experimentation as conducted in

previous state of the art work (Ziaratnia et al., 2024)

namely: (i) Stress task classiﬁcation, (ii) Multi-level

stress classiﬁcation.

In stress task classiﬁcation subjects are classiﬁed

on the basis of tasks performed. Under this we have

performed both binary (T 1 vs. T 2, T 2 vs. T 3, and T 1

vs. T 3) as well as ternary classiﬁcation (T1 vs. T 2

vs. T 3). While in case of multi-level stress classiﬁ-

cation subjects are classiﬁed on the basis of the chal-

lenging nature of the task. Out of the three tasks T 3

(arithmetic task) is considered as the most demand-

ing task (Ziaratnia et al., 2024). Thus, this set of

experimentation is focused towards classifying stress

into three levels: (i) no stress ( T 1 task for both

control and test group), (ii) low stress ( T 3 control

group), and (iii) high stress ( T 3 test group). For stress

task classiﬁcation seven-fold cross-validation tech-

nique and for multi-level stress classiﬁcation stratiﬁed

ﬁve-fold cross-validation technique has been adopted

as in (Ziaratnia et al., 2024).

All the experiments have been conducted on a sys-

tem with speciﬁcations as: Intel(R) Xeon(R) Silver

4114 CPU, 64 GB of RAM, 4 TB SSD, NVIDIA

GeForce GTX 1080 Ti GPU. For the software envi-

ronment, we have utilized libraries such as OpenCV

(Mordvintsev and Abid, 2017), PyTorch (Imambi

Non Contact Stress Assessment Based on Deep Tabular Method

601

et al., 2021), and Mediapipie (Lugaresi et al., 2019).

Table 1: Stress task 7-fold cross-validation classiﬁcation

results. Binary classiﬁcation T1 (resting task) vs. T2

(speech task).

K-folds Accuracy Precision Recall F1 score

Fold-1 0.750 0.833 0.625 0.714

Fold-2 0.681 0.615 1.000 0.761

Fold-3 0.812 0.727 1.000 0.842

Fold-4 1.000 1.000 1.000 1.000

Fold-5 0.937 0.888 1.000 0.941

Fold-6 0.937 0.888 1.000 0.941

Fold-7 0.875 1.000 0.750 0.857

7-Fold Mean 0.857 0.850 0.9107 0.865

Table 2: Stress task 7-fold cross-validation classiﬁcation re-

sults. Binary classiﬁcation T1 (resting task) vs. T3 (arith-

metic task).

K-folds Accuracy Precision Recall F1 score

Fold-1 0.812 0.857 0.750 0.800

Fold-2 1.000 1.000 1.000 1.000

Fold-3 0.937 1.000 0.875 0.933

Fold-4 0.930 1.000 0.875 0.933

Fold-5 1.000 1.000 1.000 1.000

Fold-6 1.000 1.000 1.000 1.000

Fold-7 0.937 0.888 1.000 0.941

7-Fold Mean 0.946 0.963 0.928 0.944

Table 3: Stress task 7-fold cross-validation classiﬁcation re-

sults. Binary classiﬁcation T2 (speech task) vs. T3 (arith-

metic task).

K-folds Accuracy Precision Recall F1 score

Fold-1 0.750 0.833 0.625 0.714

Fold-2 0.875 0.875 0.875 0.875

Fold-3 0.812 0.857 0.750 0.800

Fold-4 0.875 1.000 0.750 0.857

Fold-5 0.812 0.857 0.750 0.800

Fold-6 0.875 1.000 0.750 0.857

Fold-7 0.937 0.888 1.000 0.941

7-Fold Mean 0.848 0.901 0.785 0.835

Table 4: Stress task 7-fold cross-validation classiﬁcation

results. Ternary classiﬁcation T1 (resting task) vs. T2

(speech task) vs. T3 (arithmetic task).

K-folds Accuracy Precision Recall F1 score

Fold-1 0.750 0.757 0.750 0.752

Fold-2 0.750 0.795 0.750 0.752

Fold-3 0.875 0.891 0.875 0.873

Fold-4 0.916 0.933 0.916 0.918

Fold-5 0.833 0.838 0.833 0.829

Fold-6 0.916 0.933 0.916 0.918

Fold-7 0.916 0.933 0.916 0.918

7-Fold Mean 0.851 0.869 0.851 0.851

4.3 Performance Metrics

The proposed framework’s performance is evaluated

using common classiﬁcation evaluation criteria, in-

cluding accuracy, precision, recall, and F1 score. Ac-

curacy is calculated as the ratio of correctly predicted

instances to total instances, while precision is the ratio

Table 5: Multi-Level stress 5-fold cross-validation classiﬁ-

cation results. Ternary classiﬁcation: no stress (T1 task

for both control and test group), low stress (T3 control

group) and high stress (T3 test group).

K-folds Accuracy Precision Recall F1 score

Fold-1 0.826 0.896 0.826 0.816

Fold-2 0.869 0.864 0.869 0.864

Fold-3 0.863 0.877 0.863 0.857

Fold-4 0.954 0.961 0.954 0.953

Fold-5 0.863 0.865 0.863 0.861

5-Fold Mean 0.875 0.882 0.875 0.870

Table 6: Multi-Level stress 5-fold cross-validation classi-

ﬁcation results. Mean value (± standard deviations) for

three class classiﬁcation individually: no stress (T1 task

for both control and test group), low stress (T3 control

group) and high stress (T3 test group) .

Class Accuracy Precision Recall F1 score

No Stress 0.981(±0.040) 0.910(±0.092) 0.981(±0.040) 0.943(±0.061)

Low Stress 0.866(±0.139) 0.812(±0.059) 0.866(±0.139) 0.835(±0.088)

High Stress 0.653(±0.086) 0.910(±0.124) 0.653(±0.086) 0.756(±0.081)

of accurately predicted positive cases to all instances

predicted as positive. Recall is the proportion of accu-

rately predicted positive instances relative to all actual

positive instances, including those incorrectly classi-

ﬁed as negative. F1 Score is calculated as the har-

monic mean of precision and recall.

4.4 Stress Task Classiﬁcation

Performance

As mentioned in section 4.2 for stress task classiﬁ-

cation experimentation we have computed results for

both binary as well as ternary classiﬁcation. Table 1,

Table 2, and Table 3 illustrates binary stress classi-

ﬁcation results for T1 (resting task) vs. T2 (speech

task), T1 (resting task) vs. T3 (arithmetic task), and

T2 (speech task) vs. T3 (arithmetic task) tasks respec-

tively. Table 4 illustrates ternary stress task classiﬁca-

tion results for T1 (resting task) vs. T2 (speech task)

vs. T3 (arithmetic task). Major observations from Ta-

ble 1, Table 2, Table 3, and Table 4 are as follows:

• Highest accuracy is observed while differentiat-

ing subjects undergoing resting task versus arith-

metic task as depicted in Table 2. As stated in

(Ziaratnia et al., 2024) arithmetic task is the most

challenging task and our proposed approach ef-

fectively discriminates it which depicts our model

superiority.

• Lowest recall is observed while differentiating

subjects undergoing speech task versus arithmetic

task as depicted in Table 3. This phenomenon is

quite obvious because speech task that involves

expressing views in front of others and solving

arithmetic problems both involved inducing stress

in individuals.

ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods

602

Table 7: Experimental results comparing our approach to other cutting-edge techniques for stress task categorisation on the

UBFC-Phys dataset. The best ﬁndings are bolded in the table, which displays the mean values (± standard deviations) of the

7 cross-validation. This table’s values are derived from (Ziaratnia et al., 2024).

Methods T1 vs. T2 T1 vs. T3 T1 vs. T2 vs. T3

Accuracy F1 Score Accuracy F1 Score Accuracy F1 Score

MLP (Dolmans et al., 2021) 0.709(±0.061) 0.706(±0.064) 0.599(±0.040) 0.587(±0.037) 0.440(±0.028) 0.434(±0.031)

LIT (Dolmans et al., 2021) 0.701(±0.063) 0.699(±0.066) 0.625(±0.042) 0.622(±0.052) 0.447(±0.027) 0.443(±0.031)

DFAF (Gao et al., 2019) 0.758(±0.035) 0.756(±0.033) 0.689(±0.040) 0.686(±0.037) 0.478(±0.034) 0.477(±0.036)

CAM (Praveen et al., 2021) 0.726(±0.060) 0.722(±0.063) 0.650(±0.052) 0.645(±0.054) 0.494(±0.028) 0.487(±0.033)

MFN (Yu et al., 2021) 0.769(±0.035) 0.763(±0.043) 0.684(±0.050) 0.670(±0.049) 0.501(±0.038) 0.500(±0.036)

BCSA (Zhang et al., 2023) 0.818(±0.063) 0.817(±0.063) 0.723(±0.039) 0.722(±0.039) 0.558(±0.052) 0.552(±0.048)

Multimodal CCT-LSTM (Ziaratnia et al., 2024) 0.981(±0.016) 0.981(±0.016) 0.924(±0.037) 0.924(±0.037) 0.832(±0.058) 0.834(±0.056)

Proposed Approach 0.857(±0.104) 0.865(±0.095) 0.946(±0.062) 0.943(±0.065) 0.851(±0.069) 0.851(±0.070)

• In case of ternary stress task classiﬁcation task

as depicted in Table 4 the mean accuracy, pre-

cision and recall across 7-folds are almost same.

This phenomenon indicates fairness of our model

which focuses not only on minimizing false neg-

atives but also focuses on minimizing false posi-

tives.

4.5 Multi-Level Stress Classiﬁcation

Performance

Table 5 and Table 6 depicts results for multi-level

stress classiﬁcation. According to state-of-the-art re-

search, we employed a 5-fold stratiﬁed cross valida-

tion strategy in this experiment because the number

of subjects in the control and test groups was unbal-

anced (Ziaratnia et al., 2024). As depicted in Table 5

the highest accuracy reported across all folds was in

fold-4 while the mean accuracy achieved was 87.5%.

Furthermore, as depicted in Table 6 the highest and

lowest multi-level stress accuracy was achieved in

No-Stress and High-Stress class respectively.

4.6 Comparative Analysis

To validate the efﬁcacy of proposed approach we have

compared our computed results with other state-of-

the-art methods working on same dataset (UBFC-

Phy) as ours. It is evident from Table 7 that, with

one exception (T1 vs. T2), our suggested methodol-

ogy performs better than any previous state-of-the-art

effort. When compared to the most recent state-of-

the-art work (Ziaratnia et al., 2024), the ﬁndings for

(i) T1 vs. T3 and (ii) T1 vs. T2 vs. T3 demon-

strate an accuracy improvement of 2.2% and 1.9%,

respectively. For our proposed approach the results

obtained in case of T1 vs. T2 case is less as compared

to the available state-of-the-art approach. It should be

noted that in case of current available state-of-the-art

work (Ziaratnia et al., 2024) there is variation of about

15% in accuracy (as reported in Table 7) while con-

sidering various stress task classiﬁcation cases. In this

work our objective is not only to develop an approach

that achieves high accuracy but that also generalizes

well to different cases. Henceforth, we have focused

towards reducing the accuracy variation across differ-

ent stress task classiﬁcation cases. For our proposed

approach the variation in accuracy across different

stress task classiﬁcation cases was reported around

9% which is approximately 6% less than the current

state-of-the-art work (Ziaratnia et al., 2024).

For multi-level stress classiﬁcation task we have

compared our proposed approach with two state-of-

the-art approaches (Ziaratnia et al., 2024), (Xu et al.,

2024). In (Ziaratnia et al., 2024) authors have com-

puted results by using 5 fold stratiﬁed cross validation

approach for computing results. Using the same strat-

egy in our case we have achieved accuracy of 0.875

with F1-score as 0.870 while in (Ziaratnia et al.,

2024) computed accuracy was 0.805 with F1-score

as 0.803. Clearly, we have achieved an improvement

of 7% in accuracy as compared to (Ziaratnia et al.,

2024). In (Xu et al., 2024), authors used 10-fold

cross-validation for the categorization of low stress

(T2) vs. high stress (T3) and achieved an accuracy

of 83.83% . Following the same strategy, for our pro-

posed architecture we have achieved comparable ac-

curacy of 83.50%.

5 CONCLUSION AND FUTURE

SCOPE

Stress assessment is crucial for applications such as

driver condition monitoring, workplace productivity

analysis, and tailored healthcare. The development

of precise and economical stress task and level clas-

siﬁcation techniques is urgently needed. Therefore,

in this work we have presented a deep tabular non-

contact stress measurement technique. The suggested

method concentrates on extracting the physiological

parameters from the rPPG signal and improving ROI

selection while dynamically moving the face inside

the video frame. The suggested architecture per-

Non Contact Stress Assessment Based on Deep Tabular Method

603

formed better than earlier approaches on the UBFC-

Phy dataset in both sets of experiments. Future re-

search may examine the possibilities of combining

speech and eye gaze data with rPPG signals to assess

stress.

REFERENCES

Arik, S.

O. and Pﬁster, T. (2021). Tabnet: Attentive inter-

pretable tabular learning. In Proceedings of the AAAI

conference on artiﬁcial intelligence, volume 35, pages

6679–6687.

Barazi, N., Polidovitch, N., Debi, R., Yakobov, S., Lakin,

R., and Backx, P. H. (2021). Dissecting the roles of

the autonomic nervous system and physical activity

on circadian heart rate ﬂuctuations in mice. Frontiers

in physiology, 12:692247.

Casado, C.

A., Ca

nellas, M. L., and L

opez, M. B. (2023).

Depression recognition using remote photoplethys-

mography from facial videos. IEEE Transactions on

Affective Computing, 14(4):3305–3316.

Choi, C.-H., Kim, J., Hyun, J., Kim, Y., and Moon, B. Face

detection using haar cascade classiﬁers based on ver-

tical component calibration.

Das, M., Bhuyan, M. K., and Sharma, L. N. (2023).

Time–frequency learning framework for rppg signal

estimation using scalogram-based feature map of fa-

cial video data. IEEE Transactions on Instrumenta-

tion and Measurement, 72:1–10.

Dolmans, T. C., Poel, M., van’t Klooster, J.-W. J., and

Veldkamp, B. P. (2021). Perceived mental work-

load classiﬁcation using intermediate fusion multi-

modal deep learning. Frontiers in human neuro-

science, 14:609096.

Fink, G. (2010). Stress: Deﬁnition and history. Stress sci-

ence: neuroendocrinology, 3(9):3–14.

Gao, P., Jiang, Z., You, H., Lu, P., Hoi, S. C., Wang, X., and

Li, H. (2019). Dynamic fusion with intra-and inter-

modality attention ﬂow for visual question answer-

ing. In Proceedings of the IEEE/CVF conference on

computer vision and pattern recognition, pages 6639–

6648.

Giannakakis, G., Grigoriadis, D., Giannakaki, K., Simanti-

raki, O., Roniotis, A., and Tsiknakis, M. (2019). Re-

view on psychological stress detection using biosig-

nals. IEEE transactions on affective computing,

13(1):440–460.

Imambi, S., Prakash, K. B., and Kanagachidambaresan, G.

(2021). Pytorch. Programming with TensorFlow: so-

lution for edge computing applications, pages 87–104.

Kim, N. H., Yu, S.-G., Kim, S.-E., and Lee, E. C. (2021).

Non-contact oxygen saturation measurement using

ycgcr color space with an rgb camera. Sensors,

21(18):6120.

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja,

E., Hays, M., Zhang, F., Chang, C.-L., Yong, M. G.,

Lee, J., et al. (2019). Mediapipe: A framework

for building perception pipelines. arXiv preprint

arXiv:1906.08172.

Mordvintsev, A. and Abid, K. (2017). Opencv-python tuto-

rials documentation.

Ntalampiras, S. (2023). Model ensemble for predicting

heart and respiration rate from speech. IEEE Internet

Computing, 27(3):15–20.

Pan, Y., Shang, Y., Liu, T., Shao, Z., Guo, G., Ding, H., and

Hu, Q. (2024). Spatial–temporal attention network for

depression recognition from facial videos. Expert Sys-

tems with Applications, 237:121410.

Panigrahi, A. and Sharma, H. (2022). Non-contact hr ex-

traction from different color spaces using rgb camera.

In National Conference on Communications (NCC),

pages 332–337.

Praveen, R. G., Granger, E., and Cardinal, P. (2021). Cross

attentional audio-visual fusion for dimensional emo-

tion recognition. In 16th IEEE International Con-

ference on Automatic Face and Gesture Recognition,

pages 1–8.

Sabour, R. M., Benezeth, Y., De Oliveira, P., Chappe, J., and

Yang, F. (2021). Ubfc-phys: A multimodal database

for psychophysiological studies of social stress. IEEE

Transactions on Affective Computing, 14(1):622–636.

Selesnick, I. W. and Burrus, C. S. (1998). Generalized dig-

ital butterworth ﬁlter design. IEEE Transactions on

signal processing, 46(6):1688–1694.

Speth, J., Vance, N., Sporrer, B., Niu, L., Flynn, P., and

Czajka, A. (2024). Mspm: A multi-site physio-

logical monitoring dataset for remote pulse, respira-

tion, and blood pressure estimation. arXiv preprint

arXiv:2402.02224.

Wang, W., Den Brinker, A. C., Stuijk, S., and De Haan,

G. (2016). Algorithmic principles of remote

ppg. IEEE Transactions on Biomedical Engineering,

64(7):1479–1491.

Xu, J., Song, C., Yue, Z., and Ding, S. (2024). Facial video-

based non-contact stress recognition utilizing multi-

task learning with peak attention. IEEE Journal of

Biomedical and Health Informatics.

Yu, H., Vaessen, T., Myin-Germeys, I., and Sano, A. (2021).

Modality fusion network and personalized attention in

momentary stress detection in the wild. In 9th Inter-

national Conference on Affective Computing and In-

telligent Interaction (ACII), pages 1–8.

Zhang, X., Wei, X., Zhou, Z., Zhao, Q., Zhang, S., Yang,

Y., Li, R., and Hu, B. (2023). Dynamic alignment and

fusion of multimodal physiological patterns for stress

recognition. IEEE Transactions on Affective Comput-

ing.

Ziaratnia, S., Laohakangvalvit, T., Sugaya, M., and Srip-

ian, P. (2024). Multimodal deep learning for remote

stress estimation using cct-lstm. In Proceedings of

the IEEE/CVF Winter Conference on Applications of

Computer Vision, pages 8336–8344.

ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods

604