FlowAct: A Proactive Multimodal Human-Robot Interaction System
with Continuous Flow of Perception and Modular Action Sub-Systems
Timothée Dhaussy, Bassam Jabaian and Fabrice Lefèvre
Laboratoire Informatique d’Avignon, Avignon University, France
{timothee.dhaussy, bassam.jabaian, fabrice.lefevre}@univ-avignon.fr
Keywords:
HRI, Robotics, Multimodal Perceptions, Proactive.
Abstract:
The evolution of autonomous systems in the context of human-robot interaction systems requires a synergy
between the continuous perception of the environment and the potential actions to navigate or interact with it.
In this paper we present FlowAct, a proactive multimodal human-robot interaction architecture, working as an
asynchronous endless loop from robot sensors to actuators, organized by two controllers, the Environment
State Tracker (EST) and the Action Planner. Through a series of real-world experiments, we demonstrate the
efficacy of the system in maintaining a continuous perception-action loop, substantially enhancing the respon-
siveness and adaptability of autonomous pro-active agents. The modular architecture of the action subsystems
facilitates easy extensibility and adaptability to a broad spectrum of tasks and scenarios. The experiments
demonstrate the ability of a Pepper robot governed by FlowAct to intervene proactively in laboratory tests
and in the field in a hospital waiting room to offer participants various services (appointment management,
information, entertainment, etc.).
1 INTRODUCTION
Human-robot Interaction (HRI) has undergone a
transformative journey, evolving from basic, task-
oriented engagements to sophisticated, context-aware
interactions that mirror human-like dynamism (Grau
et al., 2021). As robots become an integral part of
our daily environments, there is a growing demand for
systems that can continuously perceive, comprehend,
and act within their surroundings in a way that is both
intuitive and adaptive (Chen et al., 2018). The con-
cept of continuous perception, where an autonomous
entity is perpetually sensing and interpreting its envi-
ronment, has become a fundamental pillar for mod-
ern HRI systems (Salomon, 1997). This is a depar-
ture from traditional systems that predominantly op-
erated in a reactive mode, responding to stimuli based
on preset algorithms or rules.
The compartmentalization of specific functionali-
ties into distinct modules, such as those for movement
or speech, has been recognized as a crucial advance
in the field (Tekülve et al., 2019). This modular ap-
proach not only ensures that the system remains rele-
vant in various scenarios, but also facilitates the seam-
less integration of new functionalities and the opti-
mization of existing ones (Garrell et al., 2017).
In this paper, we present FlowAct, a proactive
multimodal system that exemplifies the fusion of con-
tinuous perception with action planning and moni-
toring. Anchored in the Environment State Tracker,
FlowAct offers a representation of its surroundings,
setting the stage for more informed and dynamic in-
teractions, which can be proactively triggered.
2 RELATED WORK
In the recent decades, interactive robots designed for
interacting with humans have found widespread ap-
plications across diverse sectors. They are increas-
ingly being used in service-oriented roles, such as
serving as waiters in restaurants (Gasteiger et al.,
2023), working as customer guides in shopping
malls (Kanda et al., 2009b), or assisting passengers
in train stations (Shiomi et al., 2011) to name a
few. Moreover, these systems have made significant
inroads into the healthcare sector (Barakova, 2011;
Diehl et al., 2014; Molina et al., 2018).
To build an HRI system, the architecture should
integrate various software components to facilitate
efficient and concurrent execution of multiple tasks.
These systems must possess key capabilities, in-
cluding recording historical events (Prescott et al.,
2019b), and constructing representations of others’
actions, beliefs, desires, and intentions (McCann and
Bratman, 1991). In their paper, Moulin-Frier et al. (Moulin-Frier et al., 2018) propose a cognitive
architecture organization based on Distributed Adap-
tive Control (DAC) (Verschure et al., 2003; Verschure
et al., 2014) that deals with the processing of states
of the world (exteroception), the self (interoception), and action. Furthermore, the RoboCog model
of the ADAPTA project (Romero-Garcés et al., 2015)
enabled a sales robot to persuade potential customers
to approach a sales booth. This robot was capable
of identifying customers, gauging their willingness to
follow, and responding to specific queries.
The BRILLO (Rossi et al., 2022) architecture for
a bartender social robot follows a three-layer organi-
zation for its architecture: the execution layer, con-
text awareness and decision-making layer, and per-
cepts layer, all implemented within the Robot Op-
erating System (ROS), a framework providing libraries and tools to build robot applications. Although
the architectures described above are specifically de-
signed and adapted to a particular application task,
they share a common processing structure with three
layers: the perception layer, the representation layer,
and the action layer. Our model was developed with
this foundational concept in mind. Nevertheless, it
was designed to be entirely task-independent, modu-
lar, and oriented towards process flow. Consequently,
the stream of multimodal perception-actions is con-
tinuously processed, facilitating proactive action de-
cisions through persistent monitoring of the perceived
environment.
3 FlowAct MODEL
3.1 Overview
In this section, the specific features of FlowAct, a sys-
tem designed to serve as a continuous conduit for sen-
sory inputs and their resulting actions, are introduced.
The next four sections provide a more precise insight into
the implementation of the FlowAct layers.
In FlowAct, the agent interacts with the world
by executing an infinite loop, with sensors providing
inputs and actuators delivering outputs. The cogni-
tive architecture comprises three stages: perceiving
the world through visual and audio sensors, repre-
senting the internal scene for the agent, and mak-
ing action decisions to act on the world (move,
speak, touch, etc.), as illustrated in Figure 1. As
such, this overall structure follows the standard loop
of cognitive architectures (perception, representation,
and action) (Moulin-Frier et al., 2018; Rossi et al.,
2022). Moreover, a modular approach is imple-
mented, wherein distinct modules are connected with
controllers that fulfill essential functions within the
architecture. This configuration facilitates the adap-
tation of modules to meet any particular cognitive re-
quirements. The information required and dissemi-
nated at each stage is transmitted via dedicated mem-
ory zones or blackboards (referred to as ”topics”),
rendering it accessible to all controllers and permit-
ting stringent regulation of production and consump-
tion conditions, such as creation and modification
timestamps, as well as access priority lists.
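To make the blackboard mechanism concrete, the following minimal sketch (in Python, assuming a ROS 1 / rospy setup) shows how a refined perception could be published with its creation timestamp and consumed by the EST controller; the topic name, the JSON message encoding and the freshness check are illustrative assumptions, not the actual FlowAct implementation.

# Minimal sketch of the topic/blackboard mechanism (hypothetical topic and
# message layout), assuming ROS 1 with rospy.
import json
import rospy
from std_msgs.msg import String

def publish_refined_perception():
    """Perception-refiner side (runs as its own node): publish a
    timestamped refined perception on a shared topic."""
    rospy.init_node("perception_refiner_example")
    pub = rospy.Publisher("/flowact/refined_perceptions", String, queue_size=10)
    rate = rospy.Rate(10)  # 10 Hz perception stream (illustrative)
    while not rospy.is_shutdown():
        msg = {
            "stamp": rospy.Time.now().to_sec(),  # creation timestamp
            "type": "person_tracker",
            "payload": {"person_id": 3, "bbox": [120, 80, 210, 320]},
        }
        pub.publish(String(data=json.dumps(msg)))
        rate.sleep()

def est_callback(msg):
    """EST side: consume perceptions and keep only sufficiently fresh ones."""
    perception = json.loads(msg.data)
    age = rospy.Time.now().to_sec() - perception["stamp"]
    if age < 0.5:  # discard stale perceptions (illustrative threshold)
        rospy.loginfo("EST received %s (age %.2fs)", perception["type"], age)

def run_est_listener():
    """EST controller side (runs as its own node)."""
    rospy.init_node("est_example")
    rospy.Subscriber("/flowact/refined_perceptions", String, est_callback)
    rospy.spin()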
FlowAct separates the perception layer into two
sub-levels, ”raw perceptions” and ”refined percep-
tions”. The concept of raw perception refers to the
agent’s raw sensory data reflecting the quintet of hu-
man senses, especially auditory and visual, as well as
perceptions directly derived from these senses, such
as depth vision. These elementary perceptions are in-
tercepted by the ”perception refinement modules”, a
set in which raw data are distilled, producing an en-
hanced interpretative layer of the environment, called
”refined perceptions”.
The EST controller functions in a continuous way,
assimilating raw and refined perceptions to build a dy-
namic representation of the environment. It is also
tasked with memory management and can interface
with a database to store or retrieve specific knowl-
edge, helping to interpret the current state of the
world, such as the tracking of individuals or objects.
This controller employs various environmental mod-
ules, including the re-identification of individuals and
the allocation of perceptions, to update the scene anal-
ysis while taking into account its historical context.
The environment state is available to the Action
Planner controller, the strategic core of FlowAct. This
module is responsible for the analysis of environmen-
tal data and the planning of action strategies, inform-
ing the action modules asynchronously. It has the
unique ability to reflect on future actions while ex-
ecuting current actions. The action modules, acting
as the actuators of the system, transmit precise be-
havioral directives to the robotic agent, thus realizing
the transition from the environmental state to action
within this autonomous loop.
3.2 Robot Perceptions
Although the FlowAct architecture is not contingent
on a particular robotic platform, the experiments con-
ducted in this study employ Pepper, a robot devel-
oped by Softbank Robotics. Consequently, the imple-
mentation details provided will be based on this plat-
form, while maintaining general applicability.
Figure 1: Overview of the FlowAct architecture, a continu-
ous perception/action loop.
Pepper is equipped with an array of four microphones,
two loudspeakers, and three tactile sensors. Nonver-
bal communication is activated by LED clusters lo-
cated in the eyes, on the shoulders, and around the ear
speakers. Robot perception is aided by two 640x480
resolution cameras, strategically placed on the fore-
head and mouth, and an ASUS Xtion 3D sensor in one
eye, essential for localization and navigation. Given
the robot’s height, the forehead camera is well suited for HRI, aligned
with the average human body height. Pepper runs un-
der the NAOqi OS, a GNU/Linux
distribution based on Gentoo. For our experiment,
NAOqi 2.5.5 is used, coupled with ROS (https://wiki.ros.org/), so that the
perceptions received and sent by NAOqi pass through
topics (the message-passing framework implemented
within ROS). These initial sensory data collected by
the robot’s primary sensors are defined as raw per-
ceptions.
3.3 Perception Refiner Modules
The FlowAct methodology encompasses the aug-
mentation of the principal raw perceptions acquired
by the agent, thereby facilitating the generation of a
comprehensive scene analysis. This is accomplished
through the implementation of perception refinement
modules. These modules systematically process
and enhance the initial dataset, resulting in a more
sophisticated and insightful collection of perceptions,
referred to as ’refined perceptions’. Subsequently, the
refined perceptions are amalgamated with the raw
perceptions and transmitted to the environment state
tracker, which constructs a scene representation based
on this synthesized data. In our implementation, the
following modules are used:
Voice Activity Detection: detects speech seg-
ments in raw data, based on an adaptive threshold on the energy of the sound signal (a generic sketch of such a detector is given after this list).
Person Tracker: combines the location of the
person and the extraction of body keypoint fea-
tures from an implementation of Yolov7 (Wang
et al., 2022) with a Deepsort algorithm (Wojke
et al., 2017) to track the identity of the people
present in the images captured by the camera.
Gaze Tracker: uses the RT-GENE ROS pack-
age (Fischer et al., 2018) which transforms the
image stream into various facial descriptors, gaze
orientation, head pose, and the position of key fa-
cial points.
Speaker Diarization: is based on a temporal
audio-visual fusion model for multi-user speaker
diarization (Dhaussy et al., 2023b). The method
identifies dominant speakers and tracks them over
time by measuring the spatial coincidence be-
tween sound locations and visual presence.
Interaction Acceptance Belief (IAB): infers the
level of IAB (Dhaussy et al., 2023a) which com-
monly answers the question ’What are the chances
that my interaction will be accepted by the targeted user?’ and is mainly based on the user’s gaze.
Automatic Speech Recognition (ASR): based
on API calls sending the speech segments (de-
tected by the VAD module) to recognizers (such
as Google Cloud speech recognition (https://cloud.google.com/speech-to-text) or a local OpenAI Whisper (https://github.com/openai/whisper) whenever the user’s privacy is at stake).
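As an illustration of the Voice Activity Detection module listed above, the following sketch implements a generic energy-based detector with an adaptive noise-floor threshold; the frame length, smoothing factor and margin are illustrative defaults, not the parameters of the deployed module.

import numpy as np

def energy_vad(signal, sample_rate, frame_ms=30, alpha=0.98, margin=3.0):
    """Generic energy-based VAD with an adaptive noise-floor threshold.

    Returns a boolean array with one flag per frame (True = speech).
    All parameter values are illustrative defaults, not FlowAct settings.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    flags = np.zeros(n_frames, dtype=bool)
    noise_floor = None
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame.astype(np.float64) ** 2))
        if noise_floor is None:
            noise_floor = energy  # bootstrap the noise floor from the first frame
        # Speech if the energy exceeds the adaptive threshold by a margin.
        flags[i] = energy > margin * noise_floor
        if not flags[i]:
            # Update the noise floor only on non-speech frames.
            noise_floor = alpha * noise_floor + (1 - alpha) * energy
    return flags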
3.4 Representation of the Environment
The EST controller is tasked with the construction
of an accurate temporal representation of the scene.
Consequently, it collects refined perceptual data,
maintains a historical record, orchestrates the flow of
perceptions, and delivers a comprehensive represen-
tation of the environment. Within this asynchronous
gathering of perceptions, it systematically aligns and
organizes these perceptions to synchronously deliver
its representation of the environment.
EST is therefore responsible for associative mem-
ory (Prescott et al., 2019a), a concept that defines
the ability to link two initially unrelated elements,
such as a name and an unknown face. In this con-
text, we employ specific ”Environment analysis mod-
ules,” like person re-identification, which tracks users
in the scene, or perception assigner, tasked with link-
ing each detected user to a known or new identity.
In particular, in the task targeted in our experi-
ments (Section 4), the environment representation fo-
cuses on a single subject type: users detected within
images. Each user is associated with attributes such
as their IAB value, the number of times they have
been engaged, their corresponding utterances, and the
agent’s state. The latter includes information about
its spatial position, current and past actions, most re-
cent utterance, the user it is currently interacting with,
and its current state (observation, engaged, return to
its watchtower position, engaging user).
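A possible shape for this environment representation is sketched below using Python dataclasses; the field names simply mirror the attributes described in this section and are illustrative, not the actual data structures of the system.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TrackedUser:
    """One user detected in the images (fields mirror Section 3.4)."""
    user_id: int
    iab: float = 0.0              # current Interaction Acceptance Belief
    times_engaged: int = 0        # number of past engagements
    utterances: List[str] = field(default_factory=list)

@dataclass
class AgentState:
    """The agent's own state as tracked by the EST."""
    position: tuple = (0.0, 0.0, 0.0)        # x, y, theta
    current_action: str = "observe"
    past_actions: List[str] = field(default_factory=list)
    last_utterance: Optional[str] = None
    interacting_with: Optional[int] = None   # user_id or None
    mode: str = "observation"  # observation / engaging / engaged / returning

@dataclass
class EnvironmentState:
    """Snapshot delivered synchronously by the EST controller."""
    stamp: float = 0.0
    users: List[TrackedUser] = field(default_factory=list)
    agent: AgentState = field(default_factory=AgentState)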
3.5 Decision Layer
The agent’s behavioral dynamics are orchestrated
through the synergistic operation of the Action Plan-
ner controller and a suite of Action modules. The Ac-
tion Planner serves as the cerebral core, permanently
rendering decisions to either stay put, initiate interac-
tion, return to a predefined (observation) position, or
continue the current interaction. Decisions are seam-
lessly transmitted to the Action modules, each de-
signed to spring into operation in response to the delin-
eated behavior. Building upon the framework estab-
lished by Kanda et al. (Kanda et al., 2009a), we con-
ceptualize the outputs of the action planner as ’global
behavior’. To adhere to this global behavior, the ’local
behaviors’ are relayed via action modules. These lo-
cal behaviors are characterized by their more granular
and atomic nature. For example, the action ’continue
interaction’ implies that the speaking module answers
when it detects a user utterance. The engagement be-
havior in the tested system is governed by a series of
rules based on the level of the IAB model in the re-
fined perceptions, coupled with implicit engagement
requests from the user (a raised hand for instance).
The action planner runs at a frequency of 0.5 Hz.
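A minimal sketch of such a rule-based planning loop is given below; the IAB threshold, the hand-raised flag and the helper functions are hypothetical, and serve only to illustrate how the four global behaviors could be selected at 0.5 Hz from the environment state.

import time

IAB_THRESHOLD = 0.8   # hypothetical activation level
PERIOD_S = 2.0        # 0.5 Hz planning frequency

def decide(env):
    """Map the environment state to one of the four global behaviors."""
    agent = env.agent
    if agent.mode == "engaged":
        return ("continue_interaction", agent.interacting_with)
    # Proactive engagement: high IAB, or implicit request (raised hand).
    candidates = [u for u in env.users
                  if u.iab >= IAB_THRESHOLD or getattr(u, "hand_raised", False)]
    if candidates:
        target = max(candidates, key=lambda u: u.iab)
        return ("initiate_interaction", target.user_id)
    if agent.mode != "observation":
        return ("return_to_observation", None)
    return ("stay_put", None)

def planner_loop(get_environment_state, publish_behavior):
    """Run forever, publishing a global behavior to the action modules."""
    while True:
        behavior, target = decide(get_environment_state())
        publish_behavior(behavior, target)
        time.sleep(PERIOD_S)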
For our experimental setup, we have implemented
two pivotal Action modules:
Moving Module: capable of executing two distinct
actions. Firstly, it engages with a specified individ-
ual by aligning the robot’s orientation towards the tar-
get and proceeding to within 0.7 meters. Secondly,
it navigates the robot back to its initial observational
position. The system is equipped with internal states
that track whether the robot is stationed at its standby location or engaged in a navigational sequence;
Speaking Module: This module regulates interaction
with users, initiating a conversational cycle once en-
gagement is achieved and the communication param-
eters are defined by the action modules. Responses
are generated through an API call to a large lan-
guage model (LLM), such as Vicuna (https://lmsys.org/blog/2023-03-30-vicuna/), configured as
a conversational assistant, such as with the role-play
zero-shot setting (Njifenjou et al., 2024). Termination
of the conversation is dependent upon the identifica-
tion of specific keywords or user disengagement, in-
cluding a 10-second interval of non-response. Addi-
tionally, this module updates internal state indicators,
namely ’is speaking’ and ’in a conversation’, to reflect
active speech and participation by the robot, thereby
ensuring continuity in action planning decisions.
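The conversation cycle of the Speaking Module can be sketched as follows; the closing-keyword list, the LLM call and the state flags are placeholders illustrating the keyword/10-second-timeout termination described above, not the deployed code.

CLOSING_KEYWORDS = {"goodbye", "bye", "thank you"}   # illustrative list
NO_RESPONSE_TIMEOUT_S = 10.0

def speaking_cycle(listen_once, llm_reply, say, state):
    """One conversation cycle: reply until a keyword or a 10 s silence."""
    state["in_conversation"] = True
    while state["in_conversation"]:
        utterance = listen_once(timeout=NO_RESPONSE_TIMEOUT_S)  # ASR text or None
        if utterance is None:
            break  # user disengaged: 10 s without a response
        if any(k in utterance.lower() for k in CLOSING_KEYWORDS):
            break  # explicit termination keyword detected
        state["is_speaking"] = True
        say(llm_reply(utterance))   # LLM-generated answer, spoken by the robot
        state["is_speaking"] = False
    state["in_conversation"] = False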
These modules collectively embody an agent’s
behavior as a sophisticated interaction of decision-
making processes, supported by a feedback mecha-
nism that enhances the agent’s environmental aware-
ness and adaptability progressively.
4 EXPERIMENTS
The experiments conducted to evaluate the implemen-
tation of FlowAct are executed in two distinct phases.
The initial phase involves a controlled experiment
within a laboratory setting, where individuals simu-
late the role of patients. This phase facilitates the test-
ing and refinement of the system’s functionalities, en-
abling the observation of interactions and responses
of the proactive social robot within a controlled en-
vironment. The subsequent phase is carried out in a
real-world setting, specifically within a hospital, in-
volving actual patients. This phase seeks to assess the
system under real-world conditions, considering un-
predictable variables and the diversity of human in-
teractions that may arise. This bifurcated approach
will not only technically validate the system but also
collect feedback on the user experience, ensuring the
robot’s capability to effectively perform its social role
in a hospital context. The modules are slightly mod-
ified between the laboratory experiment and the hos-
pital experiment to ensure the anonymity of hospital
patients; Google Speech-to-Text is replaced with the
local version of Whisper.
4.1 Laboratory Experiments
4.1.1 Scenario
To evaluate the effectiveness of the proposed model,
we implemented the proactive interaction loop within
a real-world setting, specifically designed to replicate
Figure 2: Screenshot of a scene, with anonymised faces,
showing the IAB value for each person detected. Red box
indicates a person available for engagement.
a scenario where patients await their appointments in
a hospital waiting room. To emulate the role of pa-
tients for this investigation, a cohort of diverse partic-
ipants was enrolled, including 20 individuals, 14 men
and 6 women, all in the age range [22, 52]. Only 3 of
them were familiar with robotics.
The participants, who consisted mainly of aca-
demics and students from our institution, gave their
written consent to participate in this study. Further-
more, written informed consent was obtained from
individuals for the publication of any identifiable im-
ages or data that could be included in publications re-
lated to the experiment.
The participants were positioned in the vicinity of
the robotic entity (3-4 meters radius), either seated or
standing. Thereafter, they were instructed to adopt
one of the following behaviors, which are intended
to represent different levels of signals for proactive
engagement:
Engage in dialogues with individuals seated adja-
cent to them;
Engage in active utilization of mobile devices, for
instance, playing games or browsing the web;
Exhibit a passive behavior, maintaining a stance
of idleness and portraying a waiting state devoid
of any particular engagement or activity;
Display cues of interest and attentiveness towards
the robotic agent;
Initiate interaction with the robotic agent by seek-
ing its attention or assistance, and requesting in-
formation, guidance, or support.
Upon concluding an interaction, the robot returns
to its observation position to initiate another engage-
ment. The participants are then instructed to resume
their designated scenarios, which may have been in-
terrupted during the interaction, once the robot reverts
to its original position. Participants were instructed to
speak only during their interaction with the agent or
when communicating with an individual seated next
to them as part of the scenario. Their speaking turn
during an interaction is indicated by an image dis-
played on the agent’s tablet.
Each scenario encompasses a blend of passive be-
havior towards the agent, along with active behaviors
such as showing interest or requesting interaction. A
scenario is considered complete either when a prede-
termined time limit is reached or after each participant
has been engaged by the robot.
4.1.2 Evaluation
To validate the functionality and usability of
FlowAct for proactive robotic interaction, we per-
formed a thorough evaluation focusing on user experi-
ence to check the efficiency of the setup (all data collected during these evaluations can be made available for research purposes upon request to the main author). The experiment is framed as a task in which the robot has to display
a proactive engagement behavior toward the humans
gathered in the room. The participants are informed
about the task of the robot and rate the questionnaire
according to their feelings after the session.
The User Experience Questionnaire, as outlined
in Finstad’s study (Finstad, 2010), was employed to
gauge users’ interactions with the system. Each ques-
tion represents a usability component evaluation of
the system. Following the order of the questionnaire,
we can measure effectiveness, satisfaction, overall
quality, and efficiency. The Usability Metric for User
Experience (UMUX), a concise and effective four-
item Likert scale, was utilized for the subjective eval-
uation of the application’s usability. This scale is
strategically crafted to yield insights comparable to
those derived from the more extensive 10-item sys-
tem usability scale (SUS), ensuring a thorough and
reliable assessment of user experience.
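For reference, the sketch below computes a UMUX score following the standard Finstad (2010) formulation (four items on a 7-point scale, odd items positively worded, even items negatively worded, rescaled to 0-100); it is the generic formula, not code used in this study.

def umux_score(responses):
    """Standard UMUX scoring: 4 items on a 1-7 Likert scale.

    Odd items (1, 3) are positively worded, even items (2, 4) negatively;
    the total (0-24) is rescaled to 0-100.
    """
    assert len(responses) == 4
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (7 - r)
    return total * 100.0 / 24.0

# Example: a neutral respondent (all items rated 4) scores 50.
print(umux_score([4, 4, 4, 4]))  # -> 50.0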
4.1.3 Laboratory Results
The average UMUX score of approximately 71 sug-
gests that, on a scale of 0 to 100, the overall usability
and user experience of the system being evaluated are
good. Regarding task performance, a significant ma-
jority of participants perceived the agent as effectively
fulfilling its intended role, as in the example shown in
Figure 2. The robot exhibited its ability to participate
in interactions within a hospital setting, demonstrat-
ing both reactive engagement in response to explicit
user cues and proactive engagement prompted by the
actor’s exhibited interest in the agent.
Yet the general satisfaction with the system was
mixed. Although most of the participants did not
experience excessive frustration, notable instances of
frustration were primarily attributed to the response
latency of the agent, which typically ranged from 5
to 10 seconds. This delay was a consequence of the
computational demand for the LLM-based conversa-
tional agent. In particular, participants with previous
robotics experience expressed higher levels of frustra-
tion related to this latency.
In terms of usability, the consensus was that the
system was user-friendly and did not require specific
prerequisites for operation. During the experiments,
it was observed that the efficiency score decreased
in situations where participants had to maintain eye
contact with the agent for longer than anticipated (of-
ten exceeding 10 seconds) or when they were required
to repeat themselves due to the robot’s inability to
comprehend their initial speech. In the course of
the conducted trials, each experiment was success-
fully executed, demonstrating engagement and inter-
action with each participant. In particular, in two in-
stances, the robot initiated interaction with individu-
als before they exhibited the reactive sign, typically a
raised hand signaling the agent. The users perceived
this preemptive interaction by the robot as proactive,
as it occurred in response to their demonstrated inter-
est prior to the conventional signal for engagement.
4.2 Hospital Experiments
4.2.1 Scenario and Evaluation
Subsequent to the laboratory experiments confirm-
ing operational performance, the experimental proce-
dures were implemented in a hospital setting over a
consecutive three-day period. A total of 11 patients
were involved, 4 men and 7 women, resulting in a
total of 13 recorded interactions. The mean age of
the participants was 75 years, with ages ranging from
52 to 89 years. Of these 11 individuals, seven had
prior experience interacting with robotic systems. All
procedures conducted within the hospital setting were
previously approved by the hospital’s ethics commit-
tee.
The scenario at the hospital slightly differs
from the laboratory setup. To evaluate the system’s
usability and proper functioning of the observation-
engagement-interaction loop, participants were
asked, in turn, to first demonstrate interest in engag-
ing the robot. If the robot did not engage proactively,
participants were instructed to call it reactively (using
a hand signal, for instance). Each scene involved one
or two individuals positioned in front of the robot.
The evaluation setup for the hospital experiments
differs slightly. The experiments are carried out us-
ing the SUS questionnaire (Brooke, 1995), preferred
by our psychologist partners at the hospital. The SUS
questionnaire is used to evaluate the perceptions of
robot performance by participants in the context of in-
teraction. The SUS questionnaire, similar to the pre-
viously used UMUX questionnaire, is a standardized
tool that is used to assess the usability of a system. It
consists of ten statements rated on a five-point Likert
scale, ranging from ”Strongly Disagree” to ”Strongly
Agree”. This scale collects quantitative data on var-
ious aspects of the interaction, such as ease of use,
perceived complexity, user confidence, and the learn-
ing curve. The scores are then converted into an over-
all score out of 100, making it easier to compare and
interpret the results.
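As a reference for how the overall score out of 100 is obtained, the sketch below implements the standard SUS computation (odd items contribute the score minus one, even items five minus the score, and the sum is multiplied by 2.5); it is the generic formula rather than code from this study.

def sus_score(responses):
    """Standard SUS scoring: 10 items on a 1-5 Likert scale.

    Odd items contribute (score - 1), even items (5 - score);
    the sum (0-40) is multiplied by 2.5 to obtain a 0-100 score.
    """
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5

# Example: a middling set of answers lands between "fair" and "good".
print(sus_score([4, 2, 4, 3, 3, 2, 4, 3, 3, 2]))  # -> 65.0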
4.2.2 Hospital Results
Despite our efforts and due to factors beyond our con-
trol related to the hospital context, this sample is still
too small to draw firm conclusions from the collected
observations. However, within the framework of this
paper, it allows us to establish the operational context
of the study and its practical implementation. Even
though additional experimental sessions are needed to comple-
ment this very preliminary set, in the meantime, ana-
lyzing the current results may still offer some insights
into the system’s current state and the possibility of
some immediate improvements before further collect-
ing real-world interactions.
The average SUS score obtained from the evalua-
tion is 59, indicating a moderate level of usability. Ac-
cording to the SUS scoring framework, scores close to
51 are considered ”fair” or ”so-so”, while scores around 71 are gen-
erally interpreted as ”good” (Bangor et al., 2009). A
score of 59, therefore, suggests that the system under
evaluation presents notable usability challenges. This
score implies that users may experience difficulties in
interacting with the system, potentially affecting over-
all user satisfaction and engagement.
The average duration of the interactions is 4’2”.
The responses to the first question about general ac-
ceptance show that users are not yet ready to use the
robot regularly. This is despite the fact that the per-
ceived complexity is relatively low (Q2) and the ease
of use is acknowledged (Q3). Patients feel capable
of using the robot independently, but they believe that
the integration of services could be improved. This
sentiment is understandable given the observed slow-
ness of the robot’s movements and the response la-
tency of the language module. Although inherent er-
rors of the LLMs were present, the participants did
not overly penalize inconsistencies in the dialogues
Figure 3: Example sequence of anonymised images showing patients’ engagements at the hospital.
(Q6). In general, participants find this human-robot
interaction system simple and effective in terms of us-
ability, capable of building trust with the patient, but
they are not inclined to use it or interact with it frequently.
4.3 Data Analysis
During the experiments, the detection of a high IAB
value was prioritized to ensure patient engagement.
However, if patients had difficulty engaging, they had
the option to raise their hand to signal their willing-
ness to engage. As a result, only 23% of the inter-
actions were initiated by detecting a high IAB value,
while the remaining 77% were initiated at the implicit
request of the patients, through a hand gesture.
One primary reason was the overly long activation time of the IAB (in laboratory settings, the IAB could take up to 5 s before activation), so participants were prone
to display a reactive hand gesture signal immediately
after showing (unanswered) signs of interest, such as
gazing toward the agent. Also, the hospital experiments took place in three different locations with highly varying lighting conditions, which made the IAB prediction
more complex because the gaze orientation module,
which feeds into the main module, is rather sensitive
to light. Thus, after analyzing the data, we observed
that successful engagement using proper IAB detec-
tion worked only in less than half of the interactions.
Additionally, the person detection module was also
challenged, as face detection is also sensitive to light-
ing conditions. These observations highlight the lack
of robustness in the current IAB model.
Figure 3 illustrates a sequence of successive pa-
tient engagements. Following the image sequence
from left to right and top to bottom, we observe that
the patient on the right is not initially detected cor-
rectly, preventing any interaction at first. The robot
starts in observation mode, where it detects a signal
from the user to initiate engagement, in this case, a
hand gesture. Once the conversation begins, the robot
decides to continue until the dialogue module consid-
ers the conversation to be completed. The robot then
returns to observation mode, waiting for a new signal
to engage a patient in a new conversation. This pattern
of ’observation post, engagement detection, interac-
tion, return to observation post’ represents the stan-
dard loop for the proactive objective of the agent in a
hospital waiting room. Interestingly, the person in the
white coat in the background is correctly detected but
is not considered engageable in photos 2 and 3. This
is likely because she is occupied with her phone, which does
not mark her as a potential target. In the last photo, we
Table 1: Example of a conversation patient-robot.
Speaker Dialogue turns
Patient Hello, I’m X, and you?
Pepper Hello, I am Pepper, the conversational robot of the hospital.
How can I help you today?
Patient I am too early for my appointment, what should I do while
I wait?
Pepper You are early for your appointment. You can take a mo-
ment to relax and wait for the medical staff to take you to
the consultation rooms at the scheduled time. This may
take between 5 minutes and half an hour, depending on the
number of consultations for the day.
Patient But how can I relax? What can I do to relax while I wait?
Pepper You can take a moment to relax while waiting for your con-
sultation by flipping through a book or listening to music.
The medical staff will take you to the consultation rooms at
the scheduled time.
Patient Can you play some music?
Pepper I can suggest playing music, but as a robot, I don’t have
access to a music system. Would you like me to talk to you
about music while you wait for your consultation?
Patient Sure.
Pepper How are you today?
observe that she moves closer to the engaged person
and is then detected as engageable by her IAB value,
which is consistent as she is directly looking at the
agent while assisting the user.
The dialogue in Table 1 is an example of a con-
versation between a patient and the agent. Pepper
demonstrates a good understanding of the user’s re-
quests and offers suggestions on how to pass the time
while waiting for a medical appointment. When the
user asks Pepper to play music, the robot explains
its technical limitations (lack of access to a music
system), but offers an alternative by suggesting they
could talk about music. This response shows that the
language model can handle technological limitations
while maintaining an engaging interaction. However,
it fails to follow up on the patient’s confirmation of
their desire to talk about music. This difficulty in
maintaining a coherent dialogue beyond a few ex-
changes indicates that improvements are needed in
how the chatbot handles conversation history. Despite
knowing all previous exchanges, it still struggles to
pursue a conversation to its logical conclusion.
During this trial, patient-robot conversations were
difficult due to the low volume and insufficient clarity
of the patients’ speech. Sometimes a simple repeti-
tion was enough, but often the transcriptions differed
significantly from the user’s actual utterance. This
complicated interactions with patients, highlighting
the need to improve the sound capture system for ef-
fective use with elderly individuals.
5 CONCLUSION
In summary, this research presents a system imple-
menting the FlowAct architecture, a preliminary ap-
proach in the field of continuous perception-action
systems within a robotic context for pro-active mul-
timodal HRI. Central to this study is the development
of a ROS architecture for a socially assistive robot, en-
gineered to provide efficient services while dynami-
cally and personally engaging users. The comprehen-
sive evaluation conducted in a controlled environment
not only validated the functionality of each module
but also the efficacy of the architecture as a whole.
Initial experimentation in simulated and real hospi-
tal settings has highlighted the system’s proficiency
in both proactive and responsive interactions with hu-
man participants. Some conditions, however, will require improvements to some individual perception
and action modules to ensure greater public accep-
tance.
REFERENCES
Bangor, A., Kortum, P. T., and Miller, J. T. (2009). De-
termining what individual SUS scores mean: adding an adjective rating scale. Journal of Usability Studies, 4:114–123.
Barakova, E. I. (2011). Robots for social training of autistic
children. In 2011 World Congress on Information and
Communication Technologies, pages 14–19.
Brooke, J. (1995). SUS: A quick and dirty usability scale.
Usability Eval. Ind., 189.
Chen, M., Herrera, F., and Hwang, K. (2018). Cognitive
computing: Architecture, technologies and intelligent
applications. IEEE Access, 6:19774–19783.
Dhaussy, T., Jabaian, B., and Lefèvre, F. (2023a). Interaction acceptance modelling and estimation for a proactive engagement in the context of HRI. In Proceedings
of the IEEE/CVF International Conference on Com-
puter Vision (ICCV) Workshops, pages 3069–3074.
Dhaussy, T., Jabaian, B., Lefèvre, F., and Horaud, R. (2023b). Audio-visual speaker diarization in the framework of multi-user HRI. In ICASSP 2023 - 2023
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 1–5.
Diehl, J., Crowell, C., Villano, M., Wier, K., Tang, K., and
Riek, L. (2014). Clinical Applications of Robots in
Autism Spectrum Disorder Diagnosis and Treatment,
pages 411–422. Springer New York, New York, NY.
Finstad, K. (2010). The usability metric for user experience.
Interacting with Computers, 22:323–327.
Fischer, T., Chang, H. J., and Demiris, Y. (2018). RT-GENE: Real-time eye gaze estimation in natural envi-
ronments. In Proceedings of the European Conference
on Computer Vision (ECCV).
Garrell, A., Villamizar, M., Moreno-Noguer, F., and San-
feliu, A. (2017). Teaching robot’s proactive behavior
using human assistance. International Journal of So-
cial Robotics, 9(2):231–249.
Gasteiger, N., Hellou, M., Lim, J. Y., MacDonald, B., and
Ahn, H. S. (2023). A theoretical approach to design-
ing interactive robots, using restaurant assistants as an
example. In 2023 20th International Conference on
Ubiquitous Robots (UR), pages 980–985.
Grau, A., Indri, M., Lo Bello, L., and Sauter, T. (2021).
Robots in industry: The past, present, and future of a
growing collaboration with humans. IEEE Industrial
Electronics Magazine, 15(1):50–61.
Kanda, T., Glas, D. F., Shiomi, M., and Hagita, N. (2009a).
Abstracting people’s trajectories for social robots to
proactively approach customers. IEEE Transactions
on Robotics, 25(6):1382–1396.
Kanda, T., Shiomi, M., Miyashita, Z., Ishiguro, H., and
Hagita, N. (2009b). An affective guide robot in a
shopping mall. In 4th ACM/IEEE Intl Conference on
Human-Robot Interaction (HRI), pages 173–180.
McCann, H. J. and Bratman, M. E. (1991). Intention, plans,
and practical reason. Noûs, 25(2):230.
Molina, J., Sierra Marín, S., Munera, M., and Cifuentes G.,
C. (2018). Human robot interaction interface for a mo-
bility assistance device. In 1st International Seminar
on Rehabilitation and Assistive Robotics RAR2018.
Moulin-Frier, C., Fischer, T., Petit, M., Pointeau, G.,
Puigbò, J.-Y., Pattacini, U., et al. (2018). DAC-h3: A
proactive robot cognitive architecture to acquire and
express knowledge about the world and the self. IEEE
Transactions on Cognitive and Developmental Sys-
tems, 10:1005–1022.
Njifenjou, A., Sucal, V., Jabaian, B., and Lefèvre, F. (2024). Role-play zero-shot prompting with LLMs for open-
domain human-machine conversation.
Prescott, T., Camilleri, D., Martinez-Hernandez, U., Dami-
anou, A., and Lawrence, N. (2019a). Memory and
mental time travel in humans and social robots. Philo-
sophical Transactions B: Biological Sciences, 374.
Prescott, T. J., Camilleri, D., Martinez-Hernandez, U.,
Damianou, A., and Lawrence, N. D. (2019b). Mem-
ory and mental time travel in humans and social
robots. Philos. Trans. R. Soc. Lond. B Biol. Sci.,
374(1771):20180025.
Romero-Garcés, A., Calderita, L. V., Martínez-Gómez, J.,
Bandera, J. P., and Marfil, R. (2015). Testing a fully
autonomous robotic salesman in real scenarios. In
2015 IEEE International Conference on Autonomous
Robot Systems and Competitions, pages 124–130.
Rossi, A., Maro, M. D., Origlia, A., Palmiero, A., and
Rossi, S. (2022). A ROS architecture for personalised HRI with a bartender social robot.
Salomon, R. (1997). Scaling behavior of the evolution strat-
egy when evolving neuronal control architectures for
autonomous agents. In Angeline, P. J., Reynolds,
R. G., McDonnell, J. R., and Eberhart, R., editors,
Evolutionary Programming VI, pages 47–57, Berlin,
Heidelberg. Springer Berlin Heidelberg.
Shiomi, M., Sakamoto, D., Kanda, T., Ishi, C. T., Ishiguro,
H., and Hagita, N. (2011). Field trial of a networked
robot at a train station. International Journal of Social
Robotics, 3(1):27–40.
Tekülve, J., Fois, A., Sandamirskaya, Y., and Schöner, G.
(2019). Autonomous sequence generation for a neu-
ral dynamic robot: Scene perception, serial order,
and object-oriented movement. Frontiers in Neuro-
robotics, 13.
Verschure, P. F. M. J., Pennartz, C. M. A., and Pezzulo,
G. (2014). The why, what, where, when and how
of goal-directed choice: neuronal and computational
principles. Philos. Trans. R. Soc. Lond. B Biol. Sci.,
369(1655):20130483.
Verschure, P. F. M. J., Voegtlin, T., and Douglas, R. J.
(2003). Environmentally mediated synergy between
perception and behaviour in mobile robots. Nature,
425(6958):620–624.
Wang, C.-Y., Bochkovskiy, A., and Liao, H.-Y. M. (2022).
YOLOv7: Trainable bag-of-freebies sets new state-of-
the-art for real-time object detectors.
Wojke, N., Bewley, A., and Paulus, D. (2017). Simple on-
line and realtime tracking with a deep association met-
ric.