Facial Empathy Analysis Through Deep Learning and Computer Vision Techniques in Mixed Reality Environments

Insaf Setitra¹, Domitile Lourdeaux¹ and Louenas Bounia¹,²

¹UMR CNRS 7253 Heudiasyc, Sorbonne Université, Université de Technologie de Compiègne, 57 avenue de Landshut, Compiègne, France
²Université Sorbonne Paris Nord, Laboratoire d'Informatique de Paris-Nord (LIPN) - UMR-CNRS 7030, France
{isetitra, dlourdeaux, lbounia}@hds.utc.fr
Keywords:
Empathy, Facial Expression, Emotion Detection, Valence and Arousal.
Abstract:
This paper introduces a novel approach for facial empathy analysis using deep learning and computer vision techniques within mixed reality environments. The primary objective is to detect and quantify empathic responses based on facial expressions, establishing the link between empathy and facial expressions. We propose a Deep Convolutional Neural Network with the Exponential Linear Unit activation function (ELU-DCNN). Moreover, we design an augmented reality platform with two main features: (i) virtual overlay of a VR headset on the user's face and (ii) facial emotion recognition for users wearing the VR headset. Our aim is to analyse facial expressions in immersive environments in order to assess users' empathy while they are immersed in specific scenarios. Our results assess the feasibility and effectiveness of these models in detecting and quantifying empathy through facial expressions. This work contributes to the growing field of affective computing and highlights the potential of integrating advanced computer vision techniques in mixed reality applications to better understand human emotional responses.
1 INTRODUCTION
Empathy is a fundamental element of human interac-
tions and plays a crucial role in communication and
interpersonal relationships. Understanding and an-
alyzing facial expressions associated with empathy
can offer profound perspectives not only in the field
of psychology, but also in various technological ap-
plications such as virtual reality, games and human-
machine interactions.
The state of the art covers both empathy analysis and facial expression recognition, but few studies explore their interrelation in the context of historical empathy. Some studies explore how empathy can improve
human-centered design by understanding user needs
and developing solutions accordingly (Zhu and Luo,
2023), (Somarathna et al., 2023), (Gareth W. Young
and Smolic, 2022), (Ventura and Martingano, 2023),
(Mathur et al., 2021), (Bang and Yildirim, 2018),
(Shin, 2018). In contrast, our work diverges by con-
centrating on the use of computer vision and deep
learning techniques to analyze empathic responses
in historical scenarios, aiming to adapt and enrich
these scenarios through advanced technological ap-
proaches. The main goal is to create a system that can
detect and analyze facial expressions to measure em-
pathy. The project contributes to the improvement of
virtual reality scenarios in a historical setting (mainly
in the memorial of Compiègne) based on detected empathy.
1.1 Empathy Detection
Empathy is divided into three distinct components:
affective, cognitive and associative empathy (Shen,
2010), (Ventura and Martingano, 2023). Affective
empathy is the ability to feel the emotions of an-
other person, cognitive empathy is the ability to un-
derstand the thoughts and feelings of others, and as-
sociative empathy combines the emotional and cog-
nitive aspects, allowing individuals to put themselves
completely in the other’s place. In our study, we are
mainly interested in cognitive empathy. Indeed, the
purpose of visiting a historical museum is to under-
stand historical events without directly feeling what
the individuals (deportees for instance) could have
felt. The study in (Hasan et al., 2024) revealed that
while empathy detection systems use various types of
signals, there is a predominance of empathy analysis
from texts as opposed to facial expressions. Question-
naires make it possible to label the collected data. The Toronto
empathy questionnaire (Spreng et al., 2009) is fre-
quently cited in the literature. Shin (Shin, 2018) pro-
posed a model examining the relationships between
immersion, presence, flow, embodiment and empathy
in a virtual environment. Bang and Yildirim (Bang
and Yildirim, 2018) conducted a study to assess the
effectiveness of virtual reality storytelling in build-
ing user empathy. To do this, they compared two
groups of participants: the first watched the documen-
tary After Solitary
1
using a VR headset, while the sec-
ond one watched the same video in 360
o
format on a
desktop computer. In the context of measuring em-
pathy, researchers used the State Empathy Question-
naire (SEQ) (Shen, 2010). Mathur et al. (Mathur et al.,
2021) conducted an experiment to collect a new set
of empathy data in an innovative interaction context,
where participants listened to stories told by the sto-
ryteller robot LuxAI. After listening to the three sto-
ries, participants completed the SEQ Questionnaire to
assess their empathic reactions. Video information
of participants was first extracted by OpenFace 2.0
(Baltrusaitis et al., 2018) to identify certain facial fea-
tures, including expression movement, eye angle and
head position. (Gareth W. Young and Smolic, 2022) states that good immersion creates an "illusion of body exchange" that facilitates adopting the perspective of the embodied character. Among scenarios that evoke empathy stands The Last Goodbye², a virtual reality experience that allows users to visit
the remains of a Nazi concentration camp in the com-
pany of a Holocaust survivor, Pinchas Gutter. This
scenario is intended to evoke intense emotions such
as sadness, anger and deep reflection. "VR World War II"³ is another virtual reality scenario that immerses users in the events of World War II. In (Xue et al., 2023)⁴, a database is presented which includes 73 extracts that induce valence variations and emotional activation. A similar dataset is presented in (Li et al., 2017)⁵ with a set of 360° videos. AVDOS-VR (Gnacek et al., 2024)⁶ contains 30-second videos with activation and valence information evaluated at each second. Arousal and valence (positive/negative) are often used to evaluate induced emotions. Self-assessment by participants via the SAM scale is a widespread method. Correlations are observed between upward head movements and high activation levels (Somarathna et al., 2023). A subset of AVDOS-VR is also presented in (Xue et al., 2023).

¹ https://www.youtube.com/watch?v=G7_YvGDh9Uc&t=14s
² USC Shoah Foundation (2020, September 18). The Last Goodbye [VR documentary]. Gabo Arora and Ari Palitz. https://sfi.usc.edu/lastgoodbye
³ World War II Foundation (2021-2024). VR video series. https://www.youtube.com/playlist?list=PL2A7-aRM5qjU7KKRIdL-LsPObYfZHT7Od
⁴ https://www.dis.cwi.nl/ceap-360vr-dataset/, https://github.com/cwi-dis/CEAP-360VR-Dataset/tree/master?tab=readme-ov-file
⁵ https://vhil.stanford.edu/public-database-360-videos-corresponding-ratings-arousal-and-valence
⁶ https://github.com/michalgnacek/AVDOS-VR/tree/main, https://www.gnacek.com/affective-video-database-online-study
In our study, we choose four scenarios from AVDOS-VR (Gnacek et al., 2024), shown using a desktop application we developed, and two VR scenarios (360°) from (Li et al., 2017). We use the Toronto questionnaire (Spreng et al., 2009) before the experiment and the SEQ questionnaire (Shen, 2010) after the experience. We provide more details in the following sections.
1.2 Facial Expression Recognition
According to Hasan et al. (Hasan et al., 2024), for empathy detection in the deep learning category, models based on Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) are the most frequently used, while in the classical machine learning category, SVM is the most common. Traditionally, seven basic emotions are classified: fear, anger, disgust, happiness, neutrality, sadness and surprise, while the main datasets used in the literature are FER2013, AffectNet, CK+ and eNTERFACE'05 (Martin et al., 2006). Li and Deng (Li and Deng, 2022)
present a comprehensive review on Facial Expres-
sion Recognition (FER) and describe the standard
process of a deep FER system, including prepro-
cessing, deep feature learning and classification. (Mohamed et al., 2022) presents a method based on the association of pre-trained CNN models (VGG16, ResNet50), used as feature extractors, with a Multi-Layer Perceptron (MLP) classifier. (De-
mochkina and Savchenko, 2021) presents a method
to recognize facial expressions in videos, using the
proposed MobileEmotiFace network. Kas et al. (Kas et al., 2021) present a framework that combines tex-
ture and shape characteristics from 49 landmarks de-
tected on a facial image. The shape is extracted us-
ing Histogram of Oriented Gradient (HOG) and the
texture using an Orthogonal and Parallel-based Di-
rections Generic Quad Map Binary Patterns (OPD-
GQMBP). Ezati et al. in (Ezati et al., 2024) highlight
the challenges posed by high computational com-
plexity and variations of multi-view poses in real-
world contexts and propose a Lightweight Attentional
Network incorporating Multi-Scale Feature Fusion
(LANMSFF). Ma et al. (Ma et al., 2024) propose
the FER-YOLO-Mamba model, which integrates the
principles of Mamba and YOLO technologies to facil-
itate efficient recognition and localization of facial ex-
pressions. Sun et al. (Sun et al., 2023) propose a new
self-supervised approach called SVFAP that uses self-
supervised learning to overcome challenges related to
overfitting and the high cost of creating datasets. An-
other significant area of research examines how dif-
ferent facial regions contribute to the recognition of
facial expression. The observations of Wegrzyn et al.
(Wegrzyn et al., 2017) show that the lower part of the
face informs about joy and disgust, while the upper part informs about anger, fear, surprise and sadness. Win-
genbach (Wingenbach, 2023) shows the relationship
between facial muscles and facial expression. Using
facial electromyography (a medical technique that studies the function of nerves and muscles), it is possible to detect the slightest muscle contractions and identify the characteristic Action Units (AUs). In addition, it is possi-
ble to combine these AUs to determine an emotion.
The AU method has been used in several other stud-
ies such as (Yao et al., 2021). Huc et al. (Huc et al., 2023) study the effects of emotional attribution errors on people with or without masks. This work shows that the lower part of the face contributes to the identification of joy, sadness and surprise, while the upper part of the face allows fear to be recognized, which somewhat contradicts (Wingenbach, 2023). Other confusions are also described, such as between fear and surprise, anger and disgust, and sadness and fear, and of all of these with the neutral emotion.
(Poux et al., 2020) is based on the propagation of fa-
cial movement to overcome difficulties related to oc-
clusions. (Minaee et al., 2021) presents an innovative
approach for facial expression recognition based on a
convolutional attentional network with a spatial atten-
tion mechanism. To validate the attentional approach,
a saliency map of important regions is generated. The
results confirmed that different expressions are indeed
sensitive to different parts of the face, for example the
mouth for joy and the eyes for anger.
2 OUR APPROACH
Our approach to facial empathy analysis is structured
into three key parts. The first part (Subsection 2.1)
focuses on facial expression recognition. We employ
a proposed neural network that predicts the expression directly from an image. This network is combined
with various feature extractors, including pixels, His-
togram of Oriented Gradients (HOG), MobileNet, and
VGG, along with classification algorithms such as K-
Nearest Neighbors (KNN) and Support Vector Ma-
chines (SVM). In the second part (Subsection 2.2),
we introduce a novel aspect of our dataset by adding a
mask to the expressions. This modification simulates
an environment where individuals are immersed in
Virtual Reality (VR), allowing us to analyze their fa-
cial expressions under these conditions. The last part
(Subsection 3.2) outlines our experimental setup. We
describe a series of scenarios designed to evoke em-
pathy, during which we capture video recordings of
participants’ reactions, both with and without wearing
a VR headset. During the experiment, participants complete empathy questionnaires to provide additional insights into their empathic responses. This
comprehensive approach enables us to analyze and
understand the relationship between facial expres-
sions and empathy in both traditional and VR envi-
ronments.
2.1 Facial Expression Recognition
The network we propose to classify facial expressions is inspired by the work of Debnath et al. (Debnath et al., 2022). Our proposed model consists of six convolution layers organized into three blocks:
- The first convolution block contains two layers using 5 × 5 filters, with 64 filters each.
- The second block consists of two layers using 3 × 3 filters, with 128 filters.
- The third block has two convolution layers using 3 × 3 filters, with 256 filters.
- Each convolution layer is followed by batch normalization, which helps to stabilize and accelerate training. Moreover, an ELU (Exponential Linear Unit) activation function is applied after each layer to improve network convergence.
- The model integrates MaxPooling layers after each convolution block to reduce dimensionality.
- Dropout layers prevent overfitting by providing regularization as the number of parameters increases.
- A Flatten layer is finally used to convert the feature maps into a one-dimensional vector, followed by a dense layer of 128 neurons with ELU activation and batch normalization.
- The output layer uses a softmax activation for multi-class classification.
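As an illustration, the Keras sketch below assembles a model matching this description; the placement of pooling and dropout within each block is an assumption where the text leaves it implicit, so this is a sketch rather than the exact implementation.

```python
from tensorflow.keras import layers, models

def build_elu_dcnn(input_shape=(48, 48, 1), num_classes=7):
    """Sketch of the ELU-DCNN: three blocks of two convolutions (64, 128, 256 filters)."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for filters, kernel in [(64, 5), (128, 3), (256, 3)]:
        for _ in range(2):
            model.add(layers.Conv2D(filters, kernel, padding="same"))
            model.add(layers.BatchNormalization())   # stabilizes and accelerates training
            model.add(layers.Activation("elu"))      # ELU activation after each layer
        model.add(layers.MaxPooling2D(2))            # dimensionality reduction per block
        model.add(layers.Dropout(0.6))               # regularization against overfitting
    model.add(layers.Flatten())
    model.add(layers.Dense(128))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation("elu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model
```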
In addition, to improve the diversity and balance
of our training dataset, we have used a data augmen-
tation strategy shown in Figure 1. The set of augmen-
tations used are as follows, where (x, y) and (x′, y′) are the image coordinates before and after the augmentation, respectively:
- Gaussian Blur: each pixel is replaced by a weighted average of its neighboring pixels, computed with the Gaussian kernel G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²)).
- Affine Transformation: a combination of translations, rotations, scaling and shears (bias distortions) is applied: (x′, y′)ᵀ = [a b; c d] (x, y)ᵀ + (e, f)ᵀ.
- Euclidean Transformation: includes only rotations and translations: (x′, y′)ᵀ = [cos θ −sin θ; sin θ cos θ] (x, y)ᵀ + (t_x, t_y)ᵀ, where θ is the angle of rotation and (t_x, t_y) is the translation vector.
- Total Transformation: a projective transformation mapping quadrilaterals onto other quadrilaterals: (x′, y′, w′)ᵀ = [a b c; d e f; g h i] (x, y, 1)ᵀ.
- Contrast Modification: higher contrast makes light areas brighter and dark areas darker: I′(x, y) = α I(x, y) + β, with I the image intensity and α, β the contrast and brightness factors respectively.
- Image Flipping (flip): either horizontal (x′ = −x, y′ = y) or vertical (x′ = x, y′ = −y).

Figure 1: Examples of the used augmentations.
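For illustration, the sketch below applies these augmentations with OpenCV; the specific parameter values (kernel size, rotation angle, contrast factor, corner offsets) are arbitrary examples, not the exact settings of our pipeline.

```python
import cv2
import numpy as np

def augment(img):
    """Return a dictionary of augmented versions of a single image."""
    h, w = img.shape[:2]
    out = {}
    out["blur"] = cv2.GaussianBlur(img, (5, 5), sigmaX=1.0)
    # Euclidean transformation: rotation + translation around the image center
    M_rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle=10, scale=1.0)
    out["euclidean"] = cv2.warpAffine(img, M_rot, (w, h))
    # Total (projective) transformation between two quadrilaterals
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = np.float32([[5, 5], [w - 5, 0], [w, h], [0, h - 5]])
    out["projective"] = cv2.warpPerspective(
        img, cv2.getPerspectiveTransform(src, dst), (w, h))
    # Contrast/brightness modification: I' = alpha * I + beta
    out["contrast"] = cv2.convertScaleAbs(img, alpha=1.5, beta=10)
    out["hflip"] = cv2.flip(img, 1)  # horizontal flip
    return out
```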
For the feature-based facial expression recognition, we extract four types of features and use two main classifiers. The feature extractors are briefly described in the following:
- Pixels: each image is flattened into a vector of pixels. This vector is then normalized to have a mean of 0 and a standard deviation of 1, which improves the performance of machine learning algorithms.
- Histogram of Oriented Gradients (HOG): the histograms of all cells are concatenated and used as a feature vector.
- VGG Feature Extractor: for VGG16 (Simonyan and Zisserman, 2014) we adopt the last layer (a 4096-dimensional vector) as the feature vector, without any additional training.
- MobileNet Feature Extractor: similarly to VGG, we use the last layer of MobileNet (Howard et al., 2017) as the feature vector.
Finally, we use SVM and KNN for classification, relying on cross-validation to determine the optimal number of neighbors.
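A minimal sketch of this feature-based pipeline is given below, combining HOG features with an SVM and a KNN whose number of neighbors is selected by cross-validation; the HOG cell sizes and the candidate values of k are illustrative assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

def hog_features(images):
    """images: array of shape (N, 48, 48) in grayscale -> HOG feature matrix."""
    return np.array([hog(im, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                     for im in images])

def train_classifiers(X_train, y_train):
    svm = SVC(kernel="rbf").fit(X_train, y_train)
    # Cross-validation to select the optimal number of neighbors for KNN
    knn = GridSearchCV(KNeighborsClassifier(),
                       {"n_neighbors": [3, 5, 7, 9, 11]}, cv=5)
    knn.fit(X_train, y_train)
    return svm, knn.best_estimator_
```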
2.2 Augmented Reality for Facial
Expression Recognition in
Immersion
Overlaying a virtual reality headset on faces via Augmented Reality (AR) offers an innovative approach to studying emotion recognition in immersive contexts. This work also allows us to assess the potential impact of facial occlusion by physical devices on the recognition of emotions.
The AR algorithm relies on two pre-prepared VR
headset images: a front view and a profile (sagittal)
view. The first crucial step in the process is to ana-
lyze the input image to detect and understand the face
geometry. This phase is fundamental because it pro-
vides the necessary information to position and ori-
ent the virtual VR headset correctly. The algorithm
operates in the following stages. (i) Face Detection: this step uses a machine learning-based method known as BlazeFace (Bazarevsky et al., 2019), a
lightweight and efficient face detector designed for
real-time applications. BlazeFace leverages a single-
shot detection (SSD) architecture that quickly identi-
fies the bounding box of the face within the image,
enabling real-time face detection with high precision.
(ii). Face Mesh Generation: Within the Region Of
Interest (ROI), a mesh grid is generated, mapping out
the facial structure. This grid consists of numerous
points, known as landmarks, strategically positioned
to capture critical facial features such as the eyes,
nose, and mouth. (iii). Landmark Localization: The
algorithm uses the convolutional neural network Mo-
bileNetV2 (Sandler et al., 2019) to predict the precise
locations of these landmarks. (iv). Temporal Fil-
tering: For real-time applications, temporal filtering
is applied to stabilize the landmark positions across
consecutive frames. This reduces jitter and ensures
smooth tracking of facial movements.
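The snippet below is a minimal sketch of stages (i)-(iii) using MediaPipe's FaceMesh solution, which packages BlazeFace-style detection and landmark regression; it is not necessarily the exact configuration of our platform.

```python
import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh

def detect_landmarks(image_bgr):
    """Return a list of (x, y) pixel coordinates of the detected facial landmarks."""
    h, w = image_bgr.shape[:2]
    with mp_face_mesh.FaceMesh(static_image_mode=True,
                               max_num_faces=1,
                               refine_landmarks=True) as face_mesh:
        # MediaPipe expects RGB input
        results = face_mesh.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_face_landmarks:
        return []
    landmarks = results.multi_face_landmarks[0].landmark
    return [(int(l.x * w), int(l.y * h)) for l in landmarks]
```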
From the set of detected landmarks, we select four specific points that serve as a reference for positioning the VR headset. These points are strategically chosen to frame the area where the headset will be placed, usually around the eyes and temples, and they determine the geometric transformation applied to the headset image. To determine the pitch, roll, and yaw angles of the head, the pose can be calculated from 3D-2D point correspondences using the facial landmarks, i.e., by solving for the rotation (r) and translation (t) that minimize the projection error.
The rotation vector r represents the axis of rotation in
3D space, and its magnitude represents the angle of
rotation. To convert this rotation vector into a rota-
tion matrix R, we use the Rodrigues’ rotation formula
(Hartley and Zisserman, 2003). The Rodrigues’ rota-
tion formula converts a rotation vector into a rotation
matrix through the following steps:
- Compute the angle of rotation θ as the magnitude of the rotation vector: θ = ‖r‖.
- Compute the unit vector k = r / θ.
- Construct the skew-symmetric cross-product matrix K of k:
  K = [ 0, −k_z, k_y; k_z, 0, −k_x; −k_y, k_x, 0 ].
- Compute the rotation matrix R as R = I + sin(θ) K + (1 − cos(θ)) K², with I the identity matrix.
- Finally, compute the Euler angles from the rotation matrix R. The Euler angles (roll, pitch, and yaw) describe the orientation of an object in three-dimensional space. With R = [ R_11, R_12, R_13; R_21, R_22, R_23; R_31, R_32, R_33 ] being the rotation matrix, they are extracted as follows:
  Roll α: α = atan2(R_32, R_33)
  Pitch β: β = atan2(−R_31, sqrt(R_32² + R_33²))
  Yaw γ: γ = atan2(R_21, R_11)
Once calculated, the transformation matrix is applied to the headset image. This warps the original image according to the estimated parameters, producing a version of the headset aligned with the face in the image.
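As an illustration of this pose step, the sketch below estimates the head orientation with OpenCV's solvePnP and the Rodrigues conversion; the camera intrinsics are rough assumptions, and the 3D model points with their matching 2D landmarks must be supplied by the caller.

```python
import cv2
import numpy as np

def head_pose_angles(image_points, model_points, frame_size):
    """Estimate (roll, pitch, yaw) in degrees from 2D-3D landmark correspondences."""
    h, w = frame_size
    focal = w  # crude focal-length approximation in pixels
    camera_matrix = np.array([[focal, 0, w / 2],
                              [0, focal, h / 2],
                              [0, 0, 1]], dtype=np.float64)
    dist_coeffs = np.zeros((4, 1))  # assume no lens distortion
    _, rvec, _ = cv2.solvePnP(model_points, image_points, camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)  # Rodrigues formula: rotation vector -> rotation matrix
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    pitch = np.degrees(np.arctan2(-R[2, 0], np.sqrt(R[2, 1] ** 2 + R[2, 2] ** 2)))
    yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return roll, pitch, yaw

# The headset image can then be warped onto the selected reference points, e.g.:
# M = cv2.getPerspectiveTransform(headset_corners, face_corners)
# warped = cv2.warpPerspective(headset_img, M, (w, h))
```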
2.3 Immersion Protocol and Empathy
Analysis
For the study of reactions, we have chosen four short
2D videos from AVDOS-VR (Gnacek et al., 2024)
and two virtual reality videos from (Li et al., 2017).
The scenarios are as follows:
- Police Helicopter Captures Armed Confrontation (video 13 of AVDOS): a 30-second 2D video showing a police drill where an officer comes to the aid of a colleague who was shot, with 2.659 and 6.652 for valence and arousal respectively.
- Sick Boy Crying During an Interview (video 19 of AVDOS): a 30-second 2D video showing an excerpt from an interview with a child suffering from a serious illness, with 2.434 and 5.623 for valence and arousal respectively.
- Soldiers Marching and Singing a Pop Song (video 56 of AVDOS): a 30-second 2D video showing a group of soldiers parading while singing the song "Barbie Girl", with 6.283 and 5.975 for valence and arousal respectively.
- Toddler Laughing at Torn Paper Pages (video 51 of AVDOS): a 30-second 2D video showing a young child laughing when an adult tears up a piece of paper, with 7.07 and 6.429 for valence and arousal respectively.
- Survive a Bear Attack in VR: a 90-second 360° video showing a bear approaching dangerously close to a group of 3 campers, who decide to run away in their car after having distracted the bear with a cookie, with 5.22 and 5 for valence and arousal respectively.
- Solitary Confinement: a 221-second 360° video that puts the viewer in the shoes of an inmate in isolation while listening to a testimony from a former prisoner, with 2.38 and 4.25 for valence and arousal respectively.
Moreover, for the experimental protocol, we have developed a Graphical User Interface (GUI) that allows visualization of the videos while simultaneously capturing the webcam video stream. We use the Toronto questionnaire (Spreng et al., 2009) before the experiment to assess the empathy of the participant, followed by the SEQ questionnaire (Shen, 2010) after the experience. To collect the responses, we use Google Forms reproducing the same questions as the questionnaires. We
also specify in each questionnaire which individual in the video to relate to for the empathic response. For example, in the "Police helicopter cap-
tures armed confrontation” scenario, we ask the par-
ticipant to choose which actor in the video to relate
to (the first policeman, the second, or the confronting
actor). More specifically, we provide the participants
with the following information:
For the Preliminary Phase, follow these steps:
- Inform the participant that they will be viewing a series of videos, each followed by a questionnaire. Also mention that some videos may contain graphic scenes (notably video 13).
- Provide participants with consent forms to be filled in, signed and returned.
- Place the participant in front of a computer in a quiet room with a neutral background.
- Ask the participant to complete the Toronto Questionnaire.
- Explain the operation of the graphical interface used to view the videos.
For each 2D video, repeat the following steps:
- The experimental staff leaves the room.
- The participant starts the video using the GUI.
- At the end of the video, the experimental staff returns to the room and asks the participant to complete the State Empathy Questionnaire corresponding to the video.
For each 3D (360°) video, repeat the following steps:
- The experimental staff prepares the headset with the video ready to be launched.
- The experimental staff leaves the room.
- At the end of the video, the experimental staff returns to the room and asks the participant to complete the State Empathy Questionnaire corresponding to the video.
After the experiment, we annotate the captured
videos with the appropriate facial expression. We
mainly select a representative image of the expres-
sions and reactions of the person being filmed, and
annotate the emotion observed among the following
seven labels: anger, disgust, fear, happiness, sadness,
surprise and neutrality. Finally, we crop the image to
retain only the participant’s face.
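The cropping step is not tied to a specific tool in the text; as one possible implementation, the sketch below keeps the largest face detected by OpenCV's bundled Haar cascade.

```python
import cv2

def crop_face(frame_bgr):
    """Return the largest detected face region of a frame, or None if no face is found."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
    return frame_bgr[y:y + h, x:x + w]
```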
3 EXPERIMENTS
3.1 Facial Expression Recognition
Experiments
Experiments for facial expression recognition are
conducted on the Facial Expression Recognition 2013
(FER2013) database (Goodfellow et al., 2013). The
database includes 35,887 grayscale images of faces
with dimensions of 48 × 48 pixels. The images are
labeled in seven categories of emotions: anger, dis-
gust, fear, happiness, sadness, surprise and neutral.
Although FER2013 is particularly useful for training
and validating facial recognition models due to its
large size and the diversity of emotions represented,
classes in the database are not balanced. Particularly
the class ‘disgust’ has more than 16 times fewer sam-
ples than the class ‘happiness’. Figure 2 highlights the
imbalance in the distribution of emotion classes. We
therefore applied data augmentation described previ-
ously in order to balance the classes. 10,000 images were obtained for each class. We summarize
in Table 1 the different settings along with the ob-
tained accuracy. For our ELU-DCNN, the Dropout layer has a rate of 0.6 to avoid overfitting and the Adam optimizer is used. We used an early stopping callback (EarlyStopping) monitoring the validation accuracy, with a patience of 11 epochs and restoration of the best weights. Finally, we used a learning rate scheduler (ReduceLROnPlateau) that halves the learning rate after 7 epochs without improvement. The model was trained for 100 epochs with a batch size of 32.
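The sketch below reproduces this training configuration in Keras, under the assumption that the model and the FER2013 training/validation splits are provided by the caller.

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

def train(model, x_train, y_train, x_val, y_val):
    """Train a compiled-from-scratch model with the callbacks described in the text."""
    callbacks = [
        # Stop when validation accuracy no longer improves, restoring the best weights
        EarlyStopping(monitor="val_accuracy", patience=11, restore_best_weights=True),
        # Halve the learning rate after 7 epochs without improvement
        ReduceLROnPlateau(monitor="val_accuracy", factor=0.5, patience=7),
    ]
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     epochs=100, batch_size=32, callbacks=callbacks)
```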
Table 1: Results of facial expression recognition using our
approach on the FER2013 dataset.
Approach             | Details                                                       | Accuracy
Basic approach       | batch size = 64, early stopping with p=5, imbalanced classes | 0.63
Data augmentation    | 10,000 images per class                                       | 0.69
Reduced batch size   | batch size = 32, early stopping with p=11, Nadam optimizer   | 0.78
Parameter refinement | Adam optimizer                                                | 0.81
As can be seen from the table, we obtain an ac-
curacy of 81% on our test data sample. Figure 3 also
shows that the accuracy is increasing and that it even-
tually stabilizes around 80% after the 40th epoch. Fi-
nally Figure 4 shows the confusion matrix for the val-
idation set. As can be seen, the model predicts the different expression classes with comparable accuracy: it correctly classifies 74% of anger, 97% of disgust, 68% of fear, 85% of happiness, 72% of sadness, 90% of surprise and 77% of neutral samples.
Figure 2: Distribution of Emotion Classes in the Dataset:
(left) Table of Emotion Names and Image Counts with
(right) Corresponding Histogram Illustrating Class Imbal-
ance.
Figure 3: Accuracy (left) and loss (right) with respect to
number of epochs (x-axis).
Figure 4: Confusion matrix for the validation set of
FER2013 dataset using our approach.
3.2 Empathy Analysis Experiments
In order to retrain our model on occluded faces and
hence simulate an environment where individuals are
immersed, we apply the augmented reality transfor-
mations on the FER dataset. This has the advantage
of training a facial expression classifier without the
need to re-annotate the data (as FER2013 is already
annotated with facial expression classes). However,
the accuracy drops to 67%. Without information from the upper part of the face, some emotions become difficult to recognize: emotions such as anger are strongly expressed in the eye and eyebrow area, which is typically masked by a VR headset. This occlusion can significantly reduce the accuracy of emotion recognition.
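As a simplified illustration of this occlusion effect (not the full AR overlay of Subsection 2.2), the sketch below masks an approximate eye and eyebrow band on 48 × 48 FER images; the band coordinates are rough assumptions.

```python
import numpy as np

def occlude_eye_region(img48, value=0):
    """Simulate a VR headset by masking the upper face band of a 48x48 image."""
    occluded = img48.copy()
    occluded[8:26, 2:46] = value  # approximate eye/eyebrow band covered by a headset
    return occluded

def occlude_dataset(images):
    """Apply the occlusion to every image of a (N, 48, 48) array."""
    return np.stack([occlude_eye_region(im) for im in images])
```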
We simultaneously carried out the test protocol for empathy analysis described in Subsection 2.3. To summarize, the protocol consists of the following steps:
- Ask participants to sign the consent form to use their videos for this research.
- Ask participants to fill in the Toronto questionnaire.
- Show the scenarios (either 2D or 360° videos) to the participants and capture videos of them while they watch.
- After each scenario, ask participants to fill in the SEQ empathy questionnaire that we reproduced in the Google form.
By the end of the tests, we obtained a set of faces of 8 individuals who participated in the experiments. Each face was associated with emotions that we labelled manually and on which we were able to test our different facial expression recognition models. Our classifier achieved an accuracy of 20%. The model correctly identified the expression of happiness but failed to recognize the expression of surprise, wrongly classifying it as anger. The main limitations are glasses on the face, poor image quality and varying viewing angles. Since the model did not perform well on the captured videos, we analyzed empathy
based on the facial expressions that we manually an-
notated. We used the results of the questionnaires to
assess the level of empathy associated with different
emotions. As illustrated in Figure 5, different emo-
tions correspond to varying levels of empathic reso-
nance.
Figure 5: Empathy scores associated with the different annotated emotions.
It is clear that fear and surprise have the high-
est empathy scores, while anger has the lowest score.
However, the differences in scores are not particularly
marked. This situation can be explained by several
factors. We depict some of them in the following:
- Participants' Familiarity with the Videos: many participants had already taken part in the selection of the videos; their familiarity with the video content may lessen the intensity of their empathic response, as they know what to expect and are less surprised by the emotions depicted.
- Data Set Limitation: our data set is still limited in terms of diversity and quantity.
- Individual Variability: empathic responses can vary greatly from person to person based on individual factors such as personal experiences, emotional sensitivity, and innate empathic skills.
- Nature of Emotions: some emotions naturally trigger stronger empathic reactions. For example, fear and surprise are intense and often immediate emotions, which may explain why they score high. Other emotions, such as anger, are more complex and cause various reactions: some people may feel empathy while others feel resistance or rejection.
In order to overcome these limitations and achieve more representative and generalizable results, we propose the following improvements. (i) Increase the scale of the experiment: by expanding the number of participants and diversifying demographic groups, we can obtain a better representation of empathic reactions. (ii) Diversification of emotional stimuli: by using a greater variety of videos and images to represent emotions, we can reduce the familiarity effect and capture a wider range of reactions. (iii) Improvement of evaluation conditions: by standardizing the viewing conditions and minimizing distractions, we can ensure that the reactions of participants are as natural and authentic as possible. (iv) Use of advanced empathy measurement techniques: by integrating psychophysiological measures such as eye movement tracking, analyzing facial expressions in real time, and monitoring physiological responses, we can obtain more accurate and objective data on levels of empathy.
By implementing these strategies, we hope to gain
more accurate and reliable insights into the levels of
empathy associated with different emotions, and thus
improve our understanding of the mechanisms of em-
pathetic resonance.
4 CONCLUSION
In this work, we developed a comprehensive system
for facial empathy analysis using computer vision
and machine learning techniques. Our experimen-
tation setup included scenarios that evoke empathy,
capturing participants’ facial reactions through video
recordings, and subsequently measuring their empa-
thy levels using questionnaires. The findings from our
experiments indicate at some extent the correlation
between specific facial expressions and empathic re-
sponses. Overall, this work contributes to the field by
bridging the gap between facial expression analysis
and empathy detection, offering a novel approach that
can be applied in various domains, including psychol-
ogy, human-computer interaction, and virtual reality.
Future research could focus on refining the models
and algorithms for even greater accuracy and explor-
ing additional applications of this technology in real-
world scenarios.
REFERENCES
Baltrusaitis, T., Zadeh, A., Lim, Y. C., and Morency, L.-
P. (2018). Openface 2.0: Facial behavior analysis
toolkit. In 2018 13th IEEE International Confer-
ence on Automatic Face and Gesture Recognition (FG
2018), pages 59–66.
Bang, E. and Yildirim, C. (2018). Virtually empathetic?:
Examining the effects of virtual reality storytelling
on empathy. In Chen, J. Y. and Fragomeni, G., ed-
itors, Virtual, Augmented and Mixed Reality: Inter-
action, Navigation, Visualization, Embodiment, and
Simulation, pages 290–298, Cham. Springer Interna-
tional Publishing.
Bazarevsky, V., Kartynnik, Y., Vakunov, A., Raveendran,
K., and Grundmann, M. (2019). Blazeface: Sub-
millisecond neural face detection on mobile gpus.
Debnath, T., Reza, M., Rahman, A., Beheshti, A., Band, S.,
and Alinejad-Rokny, H. (2022). Four-layer convnet to
facial emotion recognition with minimal epochs and
the significance of data diversity. Scientific Reports,
12:1–18.
Demochkina, P. and Savchenko, A. V. (2021). Mobileemo-
tiface: Efficient facial image representations in video-
based emotion recognition on mobile devices. In Pat-
tern Recognition. ICPR International Workshops and
Challenges: Virtual Event, January 10–15, 2021, Pro-
ceedings, Part V, page 266–274, Berlin, Heidelberg.
Springer-Verlag.
Ezati, A., Dezyani, M., Rana, R., Rajabi, R., and Ayatol-
lahi, A. (2024). A lightweight attention-based deep
network via multi-scale feature fusion for multi-view
facial expression recognition. ArXiv, abs/2403.14318.
Gareth W. Young, N. O. and Smolic, A. (2022). Exploring
virtual reality for quality immersive empathy building
experiences. Behaviour and Information Technology,
41(16):3415–3431.
Gnacek, M., Quintero, L., Mavridou, I., Balaguer-Ballester,
E., Kostoulas, T., Nduka, C., and Seiss, E. (2024).
AVDOS-VR: Affective Video Database with Physio-
logical Signals and Continuous Ratings Collected Re-
motely in VR. Scientific Data, 11(1).
Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A.,
Mirza, M., Hamner, B., Cukierski, W., Tang, Y.,
Thaler, D., Lee, D.-H., Zhou, Y., Ramaiah, C., Feng,
F., Li, R., Wang, X., Athanasakis, D., Shawe-Taylor,
J., Milakov, M., Park, J., Ionescu, R., Popescu, M.,
Grozea, C., Bergstra, J., Xie, J., Romaszko, L., Xu,
B., Chuang, Z., and Bengio, Y. (2013). Challenges
in representation learning: A report on three machine
learning contests. In Lee, M., Hirose, A., Hou, Z.-
G., and Kil, R. M., editors, Neural Information Pro-
cessing, pages 117–124, Berlin, Heidelberg. Springer
Berlin Heidelberg.
Hartley, R. and Zisserman, A. (2003). Multiple View Geom-
etry in Computer Vision. Cambridge University Press,
New York, NY, USA, 2 edition.
Hasan, M. R., Hossain, M. Z., Ghosh, S., Krishna, A., and
Gedeon, T. (2024). Empathy detection from text, au-
diovisual, audio or physiological signals: Task formu-
lations and machine learning methods.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D.,
Wang, W., Weyand, T., Andreetto, M., and Adam,
H. (2017). Mobilenets: Efficient convolutional neu-
ral networks for mobile vision applications.
Huc, M., Bush, K., Atias, G., Berrigan, L., Cox, S., and
Jaworska, N. (2023). Recognition of masked and un-
masked facial expressions in males and females and
relations with mental wellness. Frontiers in Psychol-
ogy, 14.
Kas, M., Merabet, Y. E., Ruichek, Y., and Messoussi, R.
(2021). New framework for person-independent facial
expression recognition combining textural and shape
analysis through new feature extraction approach. In-
formation Sciences, 549:200–220.
Li, B. J., Bailenson, J. N., Pines, A., Greenleaf, W. J., and
Williams, L. M. (2017). A public database of immer-
sive vr videos with corresponding ratings of arousal,
valence, and correlations between head movements
and self report measures. Frontiers in Psychology, 8.
Li, S. and Deng, W. (2022). Deep facial expression recogni-
tion: A survey. IEEE Transactions on Affective Com-
puting, 13(3):1195–1215.
Ma, H., Lei, S., Celik, T., and Li, H.-C. (2024). Fer-yolo-
mamba: Facial expression detection and classification
based on selective state space.
Martin, O., Kotsia, I., Macq, B., and Pitas, I. (2006). The
eNTERFACE'05 audio-visual emotion database. In 22nd
International Conference on Data Engineering Work-
shops (ICDEW’06), pages 8–8.
Mathur, L., Spitale, M., Xi, H., Li, J., and Matarić, M. J.
(2021). Modeling user empathy elicited by a robot sto-
ryteller. In 2021 9th International Conference on Af-
fective Computing and Intelligent Interaction (ACII),
pages 1–8.
Minaee, S., Minaei, M., and Abdolrashidi, A. (2021). Deep-
emotion: Facial expression recognition using atten-
tional convolutional network. Sensors, 21(9).
Mohamed, B., Daoud, M., Mohamed, B., and Taleb-Ahmed,
A. (2022). Improvement of emotion recognition from
facial images using deep learning and early stopping
cross validation. Multimedia Tools and Applications,
81.
Poux, D., Allaert, B., Mennesson, J., Ihaddadene, N., Bi-
lasco, I. M., and Djeraba, C. (2020). Facial expres-
sions analysis under occlusions based on specificities
of facial motion propagation. Multimedia Tools and
Applications, 80(15):22405–22427.
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and
Chen, L.-C. (2019). Mobilenetv2: Inverted residuals
and linear bottlenecks.
Shen, L. (2010). On a scale of state empathy during mes-
sage processing. Western Journal of Communication,
74:504–524.
Shin, D. (2018). Empathy and embodied experience in vir-
tual environment: To what extent can virtual reality
stimulate empathy and embodied experience? Com-
puters in Human Behavior, 78:64–73.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Somarathna, R., Bednarz, T., and Mohammadi, G. (2023).
Virtual reality for emotion elicitation – a review. IEEE
Transactions on Affective Computing, 14(4):2626–
2645.
Spreng, R. N., Mckinnon, M., Mar, R., and Levine, B.
(2009). The toronto empathy questionnaire: Scale de-
velopment and initial validation of a factor-analytic
solution to multiple empathy measures. Journal of
personality assessment, 91:62–71.
Sun, L., Lian, Z., Wang, K., He, Y., Xu, M., Sun, H., Liu,
B., and Tao, J. (2023). Svfap: Self-supervised video
facial affect perceiver.
Ventura, S. and Martingano, A. J. (2023). Roundtable:
Raising empathy through virtual reality. In Ventura,
S., editor, Empathy, chapter 3. IntechOpen, Rijeka.
Wegrzyn, M., Vogt, M., Kireclioglu, B., Schneider, J., and
Kissler, J. (2017). Mapping the emotional face. how
individual face parts contribute to successful emotion
recognition. PLOS ONE, 12.
Wingenbach, T. S. H. (2023). Facial EMG Investigating
the Interplay of Facial Muscles and Emotions, pages
283–300. Springer International Publishing, Cham.
Xue, T., Ali, A. E., Zhang, T., Ding, G., and Cesar, P.
(2023). Ceap-360vr: A continuous physiological and
behavioral emotion annotation dataset for 360° VR videos. IEEE Transactions on Multimedia, 25:243–
255.
Yao, L., Wan, Y., Ni, H., and Xu, B. (2021). Action unit
classification for facial expression recognition using
active learning and svm. Multimedia Tools and Appli-
cations, 80.
Zhu, Q. and Luo, J. (2023). Toward artificial empathy for
human-centered design: A framework.