Multi-Face Emotion Detection for Effective Human-Robot Interaction
Mohamed Ala Yahyaoui 1, Mouaad Oujabour 1, Leila Ben Letaifa 1 a and Amine Bohi 2 b
1 CESI LINEACT Laboratory, UR 7527, Vandœuvre-lès-Nancy, 54500, France
2 CESI LINEACT Laboratory, UR 7527, Dijon, 21800, France
a https://orcid.org/0000-0002-0474-3229
b https://orcid.org/0000-0002-2435-3017
{mayahyaoui, moujabour, lbenletaifa, abohi}@cesi.fr
Keywords: Emotion Detection, Facial Expression Recognition, Human-Robot Interaction, Deep Learning, Graphical User Interface.
Abstract:
The integration of dialogue interfaces in mobile devices has become ubiquitous, providing a wide array of
services. As technology progresses, humanoid robots designed with human-like features to interact effectively
with people are gaining prominence, and the use of advanced human-robot dialogue interfaces is continually
expanding. In this context, emotion recognition plays a crucial role in enhancing human-robot interaction by
enabling robots to understand human intentions. This research proposes a facial emotion detection interface
integrated into a mobile humanoid robot, capable of displaying real-time emotions from multiple individuals
on a user interface. To this end, various deep neural network models for facial expression recognition were
developed and evaluated under consistent computer-based conditions, yielding promising results. Afterwards,
a trade-off between accuracy and memory footprint was carefully considered to effectively implement this
application on a mobile humanoid robot.
1 INTRODUCTION
The rapid advancement of technology in recent years
has accelerated research in robotics, with a particu-
lar emphasis on humanoid robots. Designed to re-
semble humans in body, hands, and head, humanoid
robots are increasingly capable of sophisticated inter-
actions with people, including recognizing individu-
als and responding to commands. This human-like
form and behavior make them particularly well-suited
for applications in human-computer interaction, serv-
ing as effective platforms for studying and improv-
ing user engagement and interaction dynamics. Cur-
rent examples of humanoid robots include Honda’s
ASIMO (Hirose and Ogawa, 2007), known for its ad-
vanced mobility and dexterity; Blue Frog Robotics’
Buddy (Peltier and Fiorini, 2017), designed for so-
cial interaction and domestic assistance; and Alde-
baran Robotics’ NAO (Gouaillier et al., 2009), rec-
ognized for its versatility in research and educational
settings. These robots showcase the diversity of roles
humanoid robots can play, from companionship and
entertainment to education and beyond.
Developing emotional intelligence in robots is relevant as they increasingly participate in social settings. Indeed, beyond performing physical tasks, enhancing robots' ability to perceive, interpret and respond to human needs is essential for effective social Human-Robot Interaction (HRI) and Human-Robot Collaboration (HRC).
In the realm of social robotics, integrating sensors such as a microphone ('mouth') or a camera ('eyes') into the humanoid robot enables it to capture human emotions in real-time and to adapt its response and behavior accordingly (Justo et al., 2020; Olaso et al., 2021; Palmero et al., 2023). This capability enhances their utility in various applications and facilitates engagement and intuitive interaction experiences between robots and humans. Detecting emotions from a camera starts with face detection, which involves identifying and locating human faces within images or video frames. This process includes preprocessing images, extracting distinct facial features, classifying regions as faces or non-faces, refining detection accuracy, and handling variations in lighting, occlusions, poses and scales. Facial emotion recognition (FER) then employs computer vision and machine learning techniques to analyze human emotions from the face.
Often, emotion recognition systems deal with only one user while he or she is communicating with a machine. However, multiple users can communicate
simultaneously with it. Multi-face emotion recognition is particularly valuable in various scenarios. For instance, at a comedy club, it can provide real-time feedback to comedians, manage lighting and sound, support interaction with the audience, and detect disruptions.
In this work, we present a complete facial emo-
tion recognition interface and its deployment in a mo-
bile humanoid robot. The proposed interface can dis-
play emotions from multiple individuals in real-time
within an advanced user interface. To achieve this,
several deep neural network models have been devel-
oped and evaluated under the same conditions. Then, a trade-off between system accuracy and model size has been considered in order to implement the optimal solution on a humanoid robot. The model's performance and its confidence interval also guided this choice of solution.
The remainder of this paper is structured as fol-
lows. Section 2 reviews the state of the art related
to emotion detection for Human-Robot Interaction
(HRI) and Facial Emotion Recognition (FER) sys-
tems. Section 3 presents the design and implemen-
tation of the proposed emotional interface, detailing
the multi-face detection, emotion recognition system,
and the graphical user interface. Section 4 describes
the integration of the facial emotion recognition sys-
tem into the Tiago++ humanoid robot, highlighting
the processes of face tracking and real-time emotion
detection. Section 5 outlines the experimental setup
and presents the results, including performance met-
rics, model comparisons, and user interaction anal-
ysis. Finally, Section 6 concludes the paper with a
discussion of the findings, limitations of the current
approach, and potential directions for future work.
2 RELATED WORK
Although emotions have been investigated in the context of HRI, emotion recognition remains a significant challenge. In this section, we report recent research on HRI as well as on FER systems.
2.1 Emotion Detection for HRI
In social robotics, robots rely on emotion detection to interact naturally and harmoniously with humans. Several studies have focused on implementing
facial emotion recognition in robots. For instance,
the study (Zhao et al., 2020) applied facial emotion
recognition on three datasets: FER2013, FERPLUS
and FERFIN. The system was implemented on a NAO
robot, which responds with actions based on the de-
tected emotions. However, this study has some limita-
tions, as it does not provide details on the robot’s im-
plementation. Additionally, the research (Dwijayanti
et al., 2022) integrated a facial detection system with a
facial emotion recognition system and implemented it
in a robot. They also explored automatic detection of
the distance between the camera of the robot and the
person. One drawback is that the robot is stationary,
so mobility is not considered. The study (Spezialetti
et al., 2020) serves as a survey of emotion recogni-
tion research for human-robot interaction. It reviews
emotion recognition models, datasets, and modalities,
with a particular emphasis on facial emotion recogni-
tion. However, it does not include any research utiliz-
ing deep learning models for facial emotion recogni-
tion.
2.2 Facial Emotion Recognition
Deep learning has revolutionized computer vision
tasks, including Facial Emotion Recognition (FER),
with numerous studies proposing various methodolo-
gies to achieve high classification accuracy using well-known benchmark datasets (Farhat et al., ; Goodfel-
low et al., 2013; Letaifa et al., 2019; Mollahosseini
et al., 2017; Justo et al., 2021; Lucey et al., 2010).
Several recent studies have proposed innovative
approaches for FER. Farzaneh et al. (Farzaneh
and Qi, 2021) introduced the Deep Attentive Cen-
ter Loss (DACL) method, which integrates an atten-
tion mechanism to enhance feature discrimination,
showing superior performance on RAF-DB and Af-
fectNet datasets. Similarly, Pecoraro et al. (Pec-
oraro et al., 2022) proposed the LHC-Net architec-
ture, which employs a multi-head self-attention mod-
ule tailored for FER tasks, achieving state-of-the-art
results on FER2013 with lower computational com-
plexity. In another work, Han et al. (Han et al., 2022)
presented a triple-structure network model based on
MobileNet V1, which captures inter-class and intra-
class diversity features, demonstrating strong results
on KDEF, MMI, and CK+ datasets. Fard et al. (Fard
and Mahoor, 2022) introduced the Adaptive Correla-
tion (Ad-Corre) Loss, which improved performance
on AffectNet, RAF-DB, and FER2013 datasets when
applied to Xception and ResNet50 models. Other no-
table contributions include the Segmentation VGG-19
model (Vignesh et al., 2023), which enhanced FER
on FER2013 using segmentation-inspired blocks, and
the DDAMFN network by Zhang et al. (Zhang et al.,
2023), which incorporated dual-direction attention to
achieve excellent results on AffectNet and FERPlus.
Lastly, in our recent work, we introduced EmoNeXt
(El Boudouri and Bohi, 2023), a deep learning frame-
work that has set new state-of-the-art benchmarks on
the FER2013 dataset. EmoNeXt integrates a Spa-
tial Transformer Network (STN) for handling fa-
cial alignment variations, along with Squeeze-and-
Excitation (SE) blocks for channel-wise feature recal-
ibration. Additionally, a self-attention regularization
term was introduced to enhance compact feature gen-
eration, further improving accuracy.
This brief review shows that many FER models
have focused exclusively on improving accuracy. As
a result, today's leading models can reach memory sizes on the order of gigabytes, which poses challenges for deployment in memory-constrained environments such as robots.
3 THE EMOTIONAL INTERFACE
One of the challenges in the domain of emotion detection for HRI is the simultaneous detection of emotions from multiple faces, which is useful when robots interact with groups of people.
3.1 Multi-Face Detection
We chose the Haarcascade classifier, proposed by
Paul Viola and Michael Jones in their seminal paper
(Viola and Jones, 2001), as a highly effective method
for face detection. Other notable methods include the
Histogram of Oriented Gradients (HOG) combined
with Support Vector Machines (SVM) and deep learn-
ing approaches such as the Multi-task Cascaded Con-
volutional Networks (MTCNN). While these meth-
ods have shown promising results in various applica-
tions, the Haarcascade classifier is particularly advan-
tageous for real-time scenarios.
The general principle of the Haarcascade ap-
proach is illustrated in Figure 1. This machine
learning-based method involves training a cascade
function using a large dataset of positive (face) and
negative (non-face) images. The classifier relies on
Haar features, which are similar to convolutional ker-
nels, to extract distinguishing characteristics from im-
ages. Each Haar feature is a single value calculated by
subtracting the sum of pixels under a white rectangle
from the sum of pixels under a black rectangle. To
efficiently compute these features, the concept of in-
tegral images is utilized, reducing the calculation to
an operation involving just four pixels, regardless of
the feature’s size.
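To make the integral-image trick concrete, the following NumPy sketch computes an integral image and evaluates a single two-rectangle Haar feature. The 24x24 window and the rectangle layout are illustrative assumptions, not the detector's actual feature set.

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of all pixels above and to the left of (y, x), inclusive
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, h, w):
    # Sum over the rectangle [top, top+h) x [left, left+w) using four lookups
    a = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    b = ii[top - 1, left + w - 1] if top > 0 else 0
    c = ii[top + h - 1, left - 1] if left > 0 else 0
    d = ii[top + h - 1, left + w - 1]
    return d - b - c + a

# Hypothetical 24x24 detection window and a two-rectangle (edge-like) feature
window = np.random.randint(0, 256, (24, 24)).astype(np.int64)
ii = integral_image(window)
white = rect_sum(ii, top=0, left=0, h=12, w=24)    # bright region (top half)
black = rect_sum(ii, top=12, left=0, h=12, w=24)   # dark region (bottom half)
feature_value = black - white  # white-rectangle sum subtracted from black-rectangle sum
```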
During training, all possible sizes and positions of
these features are applied to the training images, re-
sulting in over 160,000 potential features. To select
the most relevant features, the AdaBoost algorithm is
utilized, which iteratively adjusts the weights of mis-
Figure 1: Cascade structure for Haar classifiers (Kim et al.,
2015).
classified images and selects features with the lowest
error rates, thereby creating a strong classifier from a
combination of weak classifiers. Despite the high ini-
tial number of features, this process narrows it down
significantly (e.g., from 160,000 to around 6,000).
For detection, the image is scanned with a 24x24
pixel window, applying these selected features. To en-
hance efficiency, the authors introduced a cascade of
classifiers. This means that features are grouped into
stages, and if a window fails at any stage, it is imme-
diately discarded as a non-face region. This hierarchi-
cal approach ensures that only potential face regions
undergo the full, more complex evaluation process,
allowing for real-time face detection with high accu-
racy.
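For illustration, a minimal OpenCV sketch of this multi-face detection step is given below; it loads OpenCV's bundled frontal-face cascade, and the detectMultiScale parameters are common defaults rather than the exact values used in our system.

```python
import cv2

# OpenCV ships the standard frontal-face Haar cascade with its data files
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # scaleFactor and minNeighbors are typical values, not necessarily ours
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```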
3.2 Emotion Recognition System
Pretrained deep learning models have demonstrated
exceptional effectiveness for feature extraction across
various domains (Palmero et al., 2023). In our
emotion recognition system (Figure 2), we lever-
age a pretrained convolutional neural network (CNN)
model to apply transfer learning using the FER2013
dataset (Goodfellow et al., 2013). Specifically, we uti-
lize pretrained CNN models, initially trained on the ImageNet dataset, which encompasses millions of images from various categories (Deng et al., 2009). This
extensive training enables these models to extract
highly relevant and general visual features through
their convolutional layers. These layers detect funda-
mental elements such as edges, textures, and shapes,
which are essential for understanding facial struc-
tures. We utilize these convolutional layers to pro-
cess our input images, leaving out the top portion of
the model, specifically the fully connected layers ini-
tially designed for the ImageNet classification tasks.
Instead, by passing our facial images through the pre-
trained model’s convolutional layers, we generate a
feature stack that encapsulates essential visual infor-
mation. This feature stack, representing a rich set of
features extracted from the images, is then flattened
into a format suitable for further processing. Subse-
quently, we introduce additional fully connected lay-
Multi-Face Emotion Detection for Effective Human-Robot Interaction
93
ers tailored to the FER2013 dataset to recognize and
classify seven distinct emotions: anger, disgust, fear,
happiness, sadness, surprise, and neutrality.
Figure 2: The architecture of the emotion recognition sys-
tem using transfer learning on the FER2013 dataset.
These newly added layers are trained to fine-tune
the model specifically for emotion recognition, lever-
aging the robust feature extraction capabilities of the
pretrained model’s convolutional layers.
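A minimal Keras sketch of this transfer-learning setup is shown below, assuming EfficientNetV2-B0 as the backbone (any of the backbones evaluated in Section 5 could be substituted); the sizes of the added dense layers and the dropout rate are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 7  # anger, disgust, fear, happiness, sadness, surprise, neutrality

# Pretrained ImageNet backbone without its original fully connected head
base = keras.applications.EfficientNetV2B0(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional feature extractor (optionally unfreeze later)

# New fully connected layers tailored to FER2013
model = keras.Sequential([
    base,
    layers.Flatten(),                       # flatten the convolutional feature stack
    layers.Dense(256, activation="relu"),   # illustrative layer size
    layers.Dropout(0.3),                    # illustrative dropout rate
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
```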
3.3 Graphical Interface
The graphical interface of our emotion recognition
system integrates multiple advanced technologies to
provide a seamless and responsive user experience.
Upon launching the application, the interface is built using the Tkinter library (https://docs.python.org/3/library/tkinter.html), creating a user-friendly graphical environment. The system activates the webcam through the OpenCV library (https://docs.opencv.org/4.x/), capturing a live
video feed for real-time analysis. Captured video
frames undergo face detection using the HaarCascade
classifier, a robust method for identifying faces under
various lighting conditions and angles (see descrip-
tion in subsection 3.1).
Once a face is detected, the region of interest is ex-
tracted and subjected to preprocessing to ensure com-
patibility with the model’s input size. The processed
image is then fed into a pretrained CNN model that has been fine-tuned on the FER2013 dataset. This
model analyzes the facial image to predict the user’s
emotional state, categorizing it into distinct emotions
such as anger, fear, disgust, happiness, sadness, sur-
prise, and neutrality. The predicted emotion is then
displayed on the graphical interface, providing imme-
diate feedback to the user. All these steps are illus-
trated by Fig. 3.
Figure 3: Global architecture of our real-time multi-face
emotion recognition user interface.
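The sketch below shows how such a Tkinter/OpenCV pipeline can be wired together: webcam capture, Haar-cascade face detection, per-face prediction, and display in a Tkinter widget. The model path, class order, input size, and normalization are hypothetical placeholders, and the real interface additionally renders progress bars and an avatar.

```python
import cv2
import numpy as np
import tkinter as tk
from PIL import Image, ImageTk       # Pillow bridges OpenCV frames and Tkinter
from tensorflow import keras

# Class order must match the training generator; path is a hypothetical placeholder
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutrality"]
model = keras.models.load_model("fer_model.h5")
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)

root = tk.Tk()
root.title("Multi-face emotion recognition")
canvas = tk.Label(root)
canvas.pack()

def update():
    ok, frame = cap.read()
    if ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in cascade.detectMultiScale(gray, 1.1, 5):
            roi = cv2.resize(frame[y:y + h, x:x + w], (224, 224))
            # Normalization must match the backbone's expected preprocessing
            probs = model.predict(roi[np.newaxis] / 255.0, verbose=0)[0]
            label = EMOTIONS[int(np.argmax(probs))]
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(frame, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
        image = ImageTk.PhotoImage(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        canvas.configure(image=image)
        canvas.image = image          # keep a reference so Tkinter does not discard it
    root.after(30, update)            # re-run roughly every 30 ms

update()
root.mainloop()
```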
4 THE HUMANOID ROBOT
The Tiago robot, developed by PAL Robotics (Pages et al., ), is a humanoid mobile robot (https://pal-robotics.com/robots/tiago/). Its modular design allows for customization to meet specific needs. In this section, we outline our approach to equipping the Tiago++ model of the robot with face emotion recognition capabilities. By using the Robot Operating System (ROS, https://wiki.ros.org) for communication and process-
ing, and integrating a Tkinter-based GUI for real-time
visualization, we enhance the ability of the robot to
interact with humans. This implementation is divided
into two primary tasks: face tracking and emotion de-
tection, each described in the following subsections.
4.1 Face Tracking Integration on
Tiago++ Robot
We implemented a face tracking module on the
Tiago robot by integrating ROS with a Tkinter-
based GUI application. The process begins with
initializing a ROS node named Tiago_FER and set-
ting up essential publishers and subscribers to fa-
cilitate communication between the robot and the
software. We use the CvBridge library (https://wiki.ros.org/cv_bridge/Tutorials) to convert images from ROS format to OpenCV format.
Meanwhile, the MediaPipeRos instance processes
these images to detect regions of interest (ROI)
for face tracking. The application’s main loop re-
ceives images from the robot’s camera through the
/xtion/rgb/image ROS topic, processes these im-
ages to detect faces, and generates commands to ad-
just the robot’s yaw and pitch. These commands,
which control head movements, are published to the
head_controller/increment/goal topic using the
IncrementActionGoal message type, enabling the
robot to track the detected faces. These steps are out-
lined in the diagram generated by ROS, as shown in
Figure 4.
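A structural rospy sketch of such a node is given below. The face-offset gains are illustrative assumptions, and the PAL-specific IncrementActionGoal publishing step is only indicated in comments because its message package is robot-specific.

```python
#!/usr/bin/env python
import cv2
import rospy
from sensor_msgs.msg import Image
from cv_bridge import CvBridge

bridge = CvBridge()
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def image_cb(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # track the largest face
    # Offset of the face centre from the image centre drives yaw/pitch increments
    dx = (x + w / 2.0) - frame.shape[1] / 2.0
    dy = (y + h / 2.0) - frame.shape[0] / 2.0
    yaw_inc, pitch_inc = -0.001 * dx, -0.001 * dy        # illustrative gains
    rospy.loginfo("head increment: yaw=%.4f pitch=%.4f", yaw_inc, pitch_inc)
    # In the actual node these increments are wrapped in an IncrementActionGoal
    # and published to /head_controller/increment/goal.

rospy.init_node("Tiago_FER")
rospy.Subscriber("/xtion/rgb/image", Image, image_cb, queue_size=1)
rospy.spin()
```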
4.2 Emotion Detection and GUI Display
on Tiago++ Screen
Following face tracking, the processed images are
analyzed to predict emotions. The detected emo-
tions are displayed on a Tkinter GUI, which fea-
tures a canvas for image display and progress bars
to visualize emotion scores. The processed im-
ages and emotion data are published back to the
/imagesBack ROS topic. Additionally, incremen-
tal commands for torso movements are sent to the
/torso_controller/safe_command topic using the
JointTrajectory message type, allowing the robot
to dynamically respond to detected emotions (see Figure 4).
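As a sketch of this torso command path, the snippet below publishes a single-point JointTrajectory to the topic named above; the joint name, target position, and duration are illustrative assumptions about the Tiago++ torso controller.

```python
import rospy
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint

rospy.init_node("torso_command_demo")  # in practice this runs inside the Tiago_FER node
torso_pub = rospy.Publisher("/torso_controller/safe_command", JointTrajectory, queue_size=1)

def send_torso_command(position, duration=2.0):
    """Publish a single-point trajectory for the torso lift joint (joint name assumed)."""
    msg = JointTrajectory()
    msg.joint_names = ["torso_lift_joint"]
    point = JointTrajectoryPoint()
    point.positions = [position]
    point.time_from_start = rospy.Duration(duration)
    msg.points = [point]
    torso_pub.publish(msg)

rospy.sleep(1.0)            # give the publisher time to connect
send_torso_command(0.20)    # raise the torso slightly (value is illustrative)
```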
5 EXPERIMENTS AND RESULTS
Developing a human-robot interface for FER involves
detecting faces and emotions, implementing the user
interface, and integrating it into the robot platform.
The robot’s camera captures images of individuals
interacting with it, processes these images to detect
emotions, and then displays the detected emotions on
the user interface. This interface is visible on the
tablet mounted on the robot's chest. Several challenges need to be addressed, particularly concerning the accuracy of the models and the feasibility of implementing them on the robot.
5.1 Face Emotion Detection
In this work, we fine-tuned several pretrained models from the Keras library (https://keras.io/api/applications/), initially trained on the ImageNet-1K dataset. These models were se-
lected based on their strong performance in the Im-
ageNet classification task and their ability to general-
ize well for FER tasks. We applied transfer learning,
as explained in subsection 3.2, to the following mod-
els: MobileNet (Howard et al., 2017), DenseNet201
(Huang et al., 2017), ResNet152V2 (He et al., 2016b),
ResNet101 (He et al., 2016a), Xception (Chollet,
2017), EfficientNetV2-B0 (Tan and Le, 2021), In-
ceptionResNetV2 and InceptionV3 (Szegedy et al.,
2017), VGG16 and VGG19 (Simonyan and Zisserman, 2014), and Con-
vNeXt (from Tiny to XLarge version) (Liu et al., ).
For training, we consistently used data augmenta-
tion techniques such as rotation, shift, zoom, horizon-
tal flip and adjustments in brightness and contrast to
improve the model’s robustness. Additionally, Ran-
dom Erasing was used to simulate occlusions, while
resizing and recropping variations improved robust-
ness to differences in face positioning. The models
were optimized using Adam with a learning rate of
0.0001, combined with strategies like EarlyStopping
and ReduceLROnPlateau to prevent overfitting and
dynamically adjust the learning rate.
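The training setup can be sketched in Keras as follows, assuming the FER2013 images are arranged in class sub-folders; the model file, augmentation ranges, patience values, and directory paths are illustrative assumptions (Random Erasing, contrast jitter, and recropping would be added through a custom preprocessing function).

```python
from tensorflow import keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical path: the transfer-learning network built as in Section 3.2
model = keras.models.load_model("fer_backbone_model.h5")

# Augmentation roughly matching the settings described above (ranges are illustrative)
datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True,
    brightness_range=(0.8, 1.2),
)

callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=8, restore_best_weights=True),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
]

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(
    datagen.flow_from_directory("fer2013/train", target_size=(224, 224), class_mode="categorical"),
    validation_data=datagen.flow_from_directory("fer2013/val", target_size=(224, 224), class_mode="categorical"),
    epochs=100,
    callbacks=callbacks,
)
```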
The accuracy and memory footprint of each fine-
tuned model on the FER2013 dataset are reported in
Table 1. While ConvNeXt XLarge achieved the high-
est accuracy at 72.27%, it comes with a significantly
larger memory footprint than the other models.
5.2 Confidence Interval
Accuracy is an estimate of the performance of a system, and its reliability depends on the number of tests conducted, which in our case is the number of emotion samples to be recognized. The confidence interval is introduced to assess the trustability of our recognition rate. In (Zouari, 2007), the successes are
modeled by a binomial distribution. If N is the number of tests and P is the recognition rate, then the confidence interval [P-, P+] at x% is given by

P± = P ± z_x% · sqrt( P (1 − P) / N ),
(a) Face tracking integration on Tiago++ robot. (b) Emotion detection and GUI display on Tiago++ robot.
Figure 4: ROS-based Tiago++ face emotion recognition integration process: the diagram on the left (a) depicts the steps involved in face tracking integration, while the diagram on the right (b) shows the emotion detection and GUI display process.
Table 1: Pretrained models fine-tuned on the FER2013
dataset: accuracy (%) and memory footprint (Megabytes).
Model name Accuracy Model size
MobileNet 66.11 14.5
ResNet152V2 67.28 611.3
DenseNet201 67.84 221.0
InceptionV3 68.43 268.6
Xception 68.93 346.9
ConvNeXt Tiny 69.43 362
EfficientNetV2-B0 70.00 139.0
ConvNeXt Small 70.15 566
InceptionResNetV2 70.29 648.2
ConvNeXt Base 70.32 1120
VGG16 71.18 171.0
ResNet101 71.30 549.8
VGG19 71.46 262.5
ConvNeXt Large 71.57 2733
ConvNeXt XLarge 72.27 3900
with z95% = 1.96 and z98% = 2.33. This means that there is an x% chance that the true rate falls within the interval [P-, P+].
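A small sketch of this computation, using the standard binomial approximation given above; the example values correspond to EfficientNetV2-B0's reported accuracy on the 3,589 test images.

```python
import math

def binomial_confidence_interval(p, n, z=2.33):
    """Confidence interval around recognition rate p estimated from n test samples
    (z = 1.96 for 95%, z = 2.33 for 98%)."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width

low, high = binomial_confidence_interval(0.7000, 3589)
print(f"98% CI: [{low:.4f}, {high:.4f}]")   # roughly [0.682, 0.718]
```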
The FER2013 dataset consists of 35,887 grayscale images, divided into training (80%), test (10%) and validation (10%) sets. Hence, for each model, 3,589 samples have been evaluated on the test set. We compute the confidence interval with z98% for all models and report the results in Figure 5. We notice that several models, including VGG16, InceptionResNetV2, ConvNeXt Base, EfficientNetV2-B0, and VGG19, show overlapping intervals. In terms of accuracy, these models demonstrate similar performance. However, there is a notable difference in their sizes, with EfficientNetV2-B0 being the most compact. Due to its smaller size, EfficientNetV2-B0 has been chosen for implementation on the robot.
Figure 5: Accuracy and confidence intervals of the models.
5.3 User Interface Development
Our emotion recognition application features an in-
tuitive and user-friendly graphical interface designed
for both single-face and multi-face emotion detec-
tion. The interface allows users to utilize their de-
vice’s camera to capture live video streams, which
are then processed in real-time to detect and classify
facial expressions. For single-face emotion recogni-
tion, the application highlights the detected face and
displays the identified emotion with corresponding
confidence levels. In multi-face scenarios, the inter-
face efficiently detects multiple faces within the same
frame, assigning emotions to each detected face in-
dividually. The results are visually presented using
bounding boxes and emotion labels directly on the
video feed, providing clear and immediate feedback.
Figure 6: The user interface displays face and emotion de-
tection for a single person. Progress bars indicate the confi-
dence score for each recognized emotion.
Additionally, the interface includes progress bars
for the detected emotion, visually representing the
confidence level of each prediction. An avatar further
enhances user interaction by imitating the predicted
emotion in real-time, offering an engaging and dy-
namic way to understand the results. This comprehen-
sive and interactive interface ensures that users can
easily interpret the emotion detection outcomes, mak-
ing the application practical for various real-world
settings, including human-robot interaction and affec-
tive computing. Figure 6 and Figure 7 show some examples of the user interface applied to single- and multi-face emotion detection.
Figure 7: Face detection is followed by emotion detection for multiple individuals present in the same image.
5.4 FER Deployment on Tiago++ Robot
The Tiago++ is a humanoid mobile robot with constrained resources (CPU, memory, and storage). Besides interacting with humans, the robot must concurrently perform critical tasks such as navigation and detection, which are also resource-intensive. Conse-
quently, for deploying our application on the Tiago++
robot, it is essential to select a model not only based
on its test accuracy but also on the memory footprint
of the model. The Tiago++ robot has a maximum ca-
pacity of about 150 MB for model files to ensure real-
time inference without disrupting other processes run-
ning on the robot. According to Table 1 and the previ-
ous subsection, EfficientNetV2-B0 stands out with a
good balance between accuracy (70.00%) and model
size (139 MB), meeting the robot’s constraints.
To illustrate the system’s effectiveness, we con-
ducted two sets of experiments. In the first set, a sin-
gle participant interacted with the robot, displaying
a range of emotions. The system’s ability to accu-
rately detect the face and classify the emotional state
of the participant in real-time was meticulously ob-
served and documented. In the second set, two partic-
ipants were present simultaneously, engaging in var-
ious interactions with the robot. This scenario tested
the system’s robustness in detecting multiple faces
and correctly identifying each individual’s emotional
state in real-time. The results of these experiments are depicted through a series of images captured during the interactions, as shown in Figure 8.
Figure 8: Multi-face emotion detection deployed on the robot.
6 CONCLUSIONS
In this paper, we presented a facial emotion detection
interface implemented on a mobile humanoid robot.
This interface is capable of displaying emotions from
multiple individuals in real-time video. To achieve
this, we developed and evaluated several deep neu-
ral network models under consistent conditions, care-
fully considering factors such as model size and accu-
racy to ensure compatibility with both personal com-
puters and mobile robots like the Tiago++.
While our system demonstrates strong perfor-
mance, it is important to note the limitations of rely-
ing solely on facial expressions for emotion detection,
particularly in contexts where communication may be
impaired. Emotions are complex and multifaceted,
often requiring the integration of multiple modali-
ties for more accurate recognition. Therefore, future
work will focus on incorporating additional modali-
ties, such as voice, text, gestures, and biosignals, to
enhance the performance and reliability of emotion
recognition systems. Additionally, we will focus on
optimizing large models used in FER tasks to ensure
their efficiency for deployment on the Tiago++ robot,
considering the balance between model size and ac-
curacy.
REFERENCES
Chollet, F. (2017). Xception: Deep learning with depthwise
separable convolutions. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 1251–1258.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical image
database. In 2009 IEEE conference on computer vi-
sion and pattern recognition, pages 248–255.
Dwijayanti, S., Iqbal, M., and Suprapto, B. Y. (2022). Real-
time implementation of face recognition and emotion
recognition in a humanoid robot using a convolutional
neural network. IEEE Access, 10:89876–89886.
El Boudouri, Y. and Bohi, A. (2023). Emonext: an adapted
convnext for facial emotion recognition. In 2023 IEEE
25th International Workshop on Multimedia Signal
Processing (MMSP), pages 1–6. IEEE.
Fard, A. P. and Mahoor, M. H. (2022). Ad-corre: Adap-
tive correlation-based loss for facial expression recog-
nition in the wild. IEEE Access, 10:26756–26768.
Farhat, N., Bohi, A., Letaifa, L. B., and Slama, R. Cg-
mer: a card game-based multimodal dataset for emo-
tion recognition. In Sixteenth International Confer-
ence on Machine Vision (ICMV 2023).
Farzaneh, A. H. and Qi, X. (2021). Facial expression recog-
nition in the wild via deep attentive center loss. In
Proceedings of the IEEE/CVF winter conference on
applications of computer vision, pages 2402–2411.
Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A.,
Mirza, M., Hamner, B., Cukierski, W., Tang, Y.,
Thaler, D., Lee, D.-H., et al. (2013). Challenges in
representation learning: A report on three machine
learning contests. In Neural information processing.
20th international conference, ICONIP. Springer.
Gouaillier, D., Hugel, V., Blazevic, P., and Kilner, C.
(2009). Mechatronic design of nao humanoid. In
IEEE International Conference on Robotics and Au-
tomation ICRA.
Han, B., Hu, M., Wang, X., and Ren, F. (2022). A triple-
structure network model based upon mobilenet v1 and
multi-loss function for facial expression recognition.
Symmetry, 14(10):2055.
He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Iden-
tity mappings in deep residual networks. In Computer
Vision–ECCV 2016: 14th European Conference, Am-
sterdam, The Netherlands, October 11–14, 2016, Pro-
ceedings, Part IV 14, pages 630–645. Springer.
Hirose, M. and Ogawa, K. (2007). Honda humanoid robots
development. Philosophical Transactions of the Royal
Society A: Mathematical, Physical and Engineering
Sciences, 365(1850):11–19.
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D.,
Wang, W., Weyand, T., Andreetto, M., and Adam,
H. (2017). Mobilenets: Efficient convolutional neu-
ral networks for mobile vision applications. arXiv
preprint arXiv:1704.04861.
Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger,
K. Q. (2017). Densely connected convolutional net-
works. In Proceedings of the IEEE conference on
computer vision and pattern recognition.
Justo, R., Letaifa, L. B., Olaso, J. M., López-Zorrilla, A., Develasco, M., Vázquez, A., and Torres, M. I. (2021). A Spanish corpus for talking to the elderly. Conversational Dialogue Systems for the Next Decade.
Justo, R., Letaifa, L. B., Palmero, C., Fraile, E. G., Jo-
hansen, A., Vazquez, A., Cordasco, G., Schlogl, S.,
Ruanova, B. F., Silva, M., Escalera, S., Velasco,
M. D., Laranga, J. T., Esposito, A., Kornes, M., and
Torres, M. I. (2020). Analysis of the interaction be-
tween elderly people and a simulated virtual coach.
Journal of Ambient Intelligence and Humanized Com-
puting, 11:6125–6140.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Kim, M., Lee, D., and Kim, K.-Y. (2015). System archi-
tecture for real-time face detection on analog video
camera. International Journal of Distributed Sensor
Networks, 11(5):251386.
Letaifa, L. B., Develasco, M., Justo, R., and Torres, M. I.
(2019). First steps to develop a corpus of interac-
tions between elderly and virtual agents in spanish
with emotion. In International Conference on Statis-
tical Language and Speech Processing.
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T.,
and Xie, S. A convnet for the 2020s. In Proceedings
of the IEEE/CVF conference on computer vision and
pattern recognition.
Lucey, P., Cohn, J. F., Kanade, T., Saragih, J., Ambadar,
Z., and Matthews, I. (2010). The extended cohn-
kanade dataset (ck+): A complete dataset for action
unit and emotion-specified expression. In 2010 ieee
computer society conference on computer vision and
pattern recognition-workshops, pages 94–101. IEEE.
Mollahosseini, A., Hasani, B., and Mahoor, M. H. (2017).
Affectnet: A database for facial expression, valence,
and arousal computing in the wild. IEEE Transactions
on Affective Computing, 10(1):18–31.
Olaso, J., Vázquez, A., Letaifa, L. B., de Velasco, M., Mtibaa, A., Hmani, M. A., Petrovska-Delacrétaz, D., Chollet, G., Montenegro, C., López-Zorrilla, A., et al. (2021). The empathic virtual coach: a demo. In The 2021 International Conference on Multimodal Interaction (ICMI'21), pages 848–851. ACM.
Pages, J., Marchionni, L., and Ferro, F. Tiago: the mod-
ular robot that adapts to different research needs. In
International workshop on robot modularity, IROS.
Palmero, C., DeVelasco, M., Hmani, A., Mtibaa, M. A.,
Letaifa, L. B., et al. (2023). Exploring emotion ex-
pression recognition in older adults interacting with a
virtual coach. arXiv preprint arXiv:2311.05567.
Pecoraro, R., Basile, V., and Bono, V. (2022). Local
multi-head channel self-attention for facial expression
recognition. Information, 13(9):419.
Peltier, A. and Fiorini, L. (2017). Buddy: A companion
robot for living assistance. Journal of Robotics and
Automation, 3(2):75–81.
Spezialetti, M., Placidi, G., and Rossi, S. (2020). Emotion
recognition for human-robot interaction: Recent ad-
vances and future perspectives. Frontiers in Robotics
and AI, 7:145.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. (2017).
Inception-v4, inception-resnet and the impact of resid-
ual connections on learning. In Proceedings of the
AAAI conference on artificial intelligence, volume 31.
Tan, M. and Le, Q. (2021). Efficientnetv2: Smaller mod-
els and faster training. In International conference on
machine learning, pages 10096–10106. PMLR.
Vignesh, S., Savithadevi, M., Sridevi, M., and Sridhar, R.
(2023). A novel facial emotion recognition model us-
ing segmentation vgg-19 architecture. International
Journal of Information Technology, pages 1–11.
Viola, P. and Jones, M. (2001). Rapid object detection using
a boosted cascade of simple features. In Proceedings
of the IEEE computer society conference on computer
vision and pattern recognition. CVPR.
Zhang, S., Zhang, Y., Zhang, Y., Wang, Y., and Song, Z.
(2023). A dual-direction attention mixed feature net-
work for facial expression recognition. Electronics,
12(17):3595.
Zhao, G., Yang, H., Tao, Y., Zhang, L., and Zhao, C. (2020). Lightweight CNN-based expression recognition on humanoid robot. KSII Transactions on Internet and Information Systems, 14(3):1188–1203.
Zouari, L. (2007). Vers le temps réel en transcription automatique de la parole grand vocabulaire. PhD thesis, Télécom ParisTech.