Intuitive Human-Robot Interface: A 3-Dimensional Action Recognition
and UAV Collaboration Framework
Akash Chaudhary (1), Tiago Nascimento (1,2) and Martin Saska (1)
(1) Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 166 27 Prague, Czech Republic
(2) Universidade Federal da Paraíba, Brazil
{chaudaka, pereiti1, martin.saska}@fel.cvut.cz
Keywords:
Human-Robot Interaction, Unmanned Aerial Vehicles, Gesture Recognition.
Abstract:
Harnessing human movements to command an Unmanned Aerial Vehicle (UAV) holds the potential to revo-
lutionize their deployment, rendering it more intuitive and user-centric. In this research, we introduce a novel
methodology adept at classifying three-dimensional human actions, leveraging them to coordinate on-field
with a UAV. Utilizing a stereo camera, we derive both RGB and depth data, subsequently extracting three-
dimensional human poses from the continuous video feed. This data is then processed through our proposed
k-nearest neighbour classifier, the results of which dictate the behaviour of the UAV. It also includes mech-
anisms ensuring the robot perpetually maintains the human within its visual purview, adeptly tracking user
movements. We subjected our approach to rigorous testing in multiple trials with real robots. The
ensuing results, coupled with comprehensive analysis, underscore the efficacy and inherent advantages of our
proposed methodology.
1 INTRODUCTION
In the rapidly evolving field of robotics, intuitive
human-robot interaction (HRI) remains a pivotal chal-
lenge. The ability for robots to accurately interpret
and respond to human actions is crucial for advanc-
ing their integration into diverse applications, from in-
dustrial automation (Vysocky and Novak, 2016) and
healthcare (Mohebbi, 2020) to agriculture (Vasconez
et al., 2019) and autonomous vehicles (Mokhtarzadeh
and Yangqing, 2018). Traditional control interfaces,
such as joysticks and remote controllers, often fail to
provide the natural, seamless interaction that users re-
quire. This gap underscores the need for more intu-
itive and user-friendly methods to enhance human-
robot collaboration, particularly in the context of Un-
manned Aerial Vehicles (UAVs).
Recent advancements in action recognition and
human-robot collaboration have shown significant
promise in addressing these challenges. For instance,
research on annotating human actions in 3D point
clouds has demonstrated the importance of precise
and flexible data for collaborative tasks with industrial
robots, emphasizing the potential of 3D data to improve HRI systems (Krusche et al., 2023).

Figure 1: A Group of UAVs controlled by a human operator in an open field.

Similarly,
the integration of natural language instructions and
3D gesture recognition has enhanced the intuitiveness
of human-robot interaction, making it more effective
for industrial applications by facilitating a more natu-
ral communication interface (Park et al., 2024). Stud-
ies focusing on end-to-end systems for human-UAV
interaction highlight the relevance of intuitive con-
trol mechanisms in field applications, showing how
3D gestures can be effectively used to control UAVs
in real-time scenarios (Jiao et al., 2020). Addition-
ally, the use of intuitive interaction systems, such as
RFHUI, highlights the significance of gesture recog-
nition in enhancing the ease of operation and control
of UAVs in 3D space (Zhang et al., 2018).
Despite these advancements, interpreting human
gestures and translating them into robotic actions re-
main significant hurdles. The complexity of human
movements and the variability in their execution pose
challenges for robotic systems, particularly those with
limited computational power, such as UAVs. Our
research aims to bridge this gap by proposing a
novel methodology for real-time, computationally inexpensive, three-dimensional action recognition
and UAV collaboration. By leveraging stereo cam-
eras to capture both RGB and depth data, we can
extract and classify three-dimensional human poses
from continuous video feeds. This approach enables
the UAV to accurately interpret human movements
and respond appropriately (Fig. 1), thereby enhancing the
intuitiveness and effectiveness of human-UAV inter-
action. Our main contributions are:
1. A new method to estimate three-dimensional full-
body pose from available 2D poses.
2. A proposed feature vector space tailored for Hu-
man Motion Recognition.
3. A unique, fast and lightweight human motion
classifier suitable for UAVs with limited comput-
ing power.
2 RELATED WORKS
Human-robot interaction (HRI) offers a myriad of
methodologies. Among these, the most intuitive is the
teleoperation of a robot through a physical controller.
Yamada et al.(Yamada et al., 2015) employ this strat-
egy by integrating it with virtual reality to direct
robots in construction scenarios. Conversely, Sathiya-
narayanan et al.(Sathiyanarayanan et al., 2015) har-
ness a wearable armband, translating its gestures into
commands for robot systems. For individuals with
disabilities, voice-controlled systems present an in-
valuable solution. Gundogdu et al.(Gundogdu et al.,
2018) pioneered such a system, facilitating the op-
eration of prosthetic robot arms. Our proposed ac-
tion classification method adeptly amalgamates RGB
video data with depth video output, enabling the clas-
sification of 3D human movements. Our prior re-
search (Chaudhary et al., 2022) explored a similar do-
main, but was constrained to 2D data, thereby limiting
the user's control over the robot and impeding
optimal performance. Our tailored feature space for
k-Nearest Neighbor further enhances its effectiveness
even amidst intricate actions by using depth informa-
tion, a custom feature space, and fast lookup times
during the classification process.
In the realm of sequence classification for human
motion classification, Celebi et al. (Celebi et al.,
2013) marked a significant advancement by intro-
ducing a weighted Dynamic Time Warping (DTW)
methodology that achieved a remarkable accuracy of
96%, a substantial improvement from the preceding
state-of-the-art’s 62.5%. Following this, Rwigema
et al. (Rwigema et al., 2019) refined the system by
integrating a differential evolution strategy to opti-
mize DTW’s weightings. This enhanced approach
achieved a stellar accuracy of 99.40%. However, its
extended processing time posed challenges for real-
time classification applications. In parallel, Schneider
et al. (Schneider et al., 2019) melded DTW with the
one-nearest neighbour technique for movement clas-
sification. Their methodology closely aligns with our
approach, prompting a comparative analysis between
their method and ours to discern our method’s effi-
cacy. Additionally, Yoo et al. (Yoo et al., 2022)
demonstrated rapid processing capabilities in their
classification system, although it necessitated an aux-
iliary IMU sensor for optimal performance. Addition-
ally, their system is limited to palm-action classifica-
tion. Our proposed approach endeavours to address
the trifecta of challenges: processing speed, accuracy,
and sensor dependency, offering a holistic solution in
the domain of human motion recognition.
3 METHODOLOGY
Our overarching ambition is to architect a method
that is apt for deployment on flying robots and pro-
ficient in human detection, action classification, and
the subsequent translation of these actions into robotic
tasks. Given the inherent computational limitations of
UAVs, the challenge lies in devising a classifier that
synergizes accuracy with computational efficiency.
A salient feature of our methodology is its
prowess in classifying 3D actions. This capability
augments the spectrum of detectable movements, of-
fering an enriched, intuitive user interaction. It not
only mitigates potential classification errors where 2D
projections might be misleading but also empowers
the system to discern directional nuances from ac-
tions, granting users refined control over the trajectory
of the robot.
Structured methodically, our approach is seg-
Figure 2: Flowchart depicting the action classifier and its use in UAV control, with the green blocks depicting our contribution. The blocks span the pose estimator (RealSense camera, MediaPipe 2D pose estimation, 2D-to-3D pose converter), the pose normalizer, feature vector encoder and autoencoder, the DTW distance calculator and classifier, and the human-robot coordinator commanding the leader drone.
mented into three core modules: Pose Estimation, Ac-
tion Classification and UAV Control. An overview of
our proposed approach can be seen in Fig. 2.
4 POSE ESTIMATION
The recognition of static gestures can be achieved ei-
ther directly from RGB images via neural networks,
or by distilling skeletal data and applying subse-
quent functions to this data for classification. For the
3D representation, we initially harness a pre-existing
model to derive 2D poses from live video and subse-
quently use depth images to gauge the distance of the
pose joints from the camera.
4.1 2D Pose Estimation
Our choice was to use MediaPipe, an implementa-
tion rooted in BlazePose (Bazarevsky et al., 2020).
This Google-based pose estimator is adept at extract-
ing 2D pose information from an RGB image in a live
video stream, capturing 33 key landmarks of human
anatomy. Calibration in MediaPipe is anchored on the
’Vitruvian Man’, ensuring accurate scale and orienta-
tion recognition. This precision is paramount, especially in aerial robotics, where UAV movement can
result in a non-horizontal camera orientation relative
to the ground.
MediaPipe operates in a separate fixed-rate loop,
processing the current image. If image callbacks
outpace the MediaPipe loop rate, some frames might
be skipped. Despite this, our onboard system man-
aged a commendable 30 Hz MediaPipe loop fre-
quency, ensuring near-real-time responsiveness.
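As a concrete illustration of this stage, the sketch below shows how the 33 2D landmarks can be pulled from a live RGB stream with MediaPipe's Python API. The camera source, confidence thresholds, and loop structure are illustrative assumptions rather than the exact onboard configuration.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# Illustrative settings; the onboard thresholds are not specified in the paper.
pose = mp_pose.Pose(model_complexity=1,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5)

cap = cv2.VideoCapture(0)  # placeholder camera source
while cap.isOpened():
    ok, frame_bgr = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV delivers BGR frames.
    results = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        h, w, _ = frame_bgr.shape
        # 33 landmarks, normalized to [0, 1]; convert to pixel coordinates.
        keypoints_2d = [(lm.x * w, lm.y * h)
                        for lm in results.pose_landmarks.landmark]
        # keypoints_2d feeds the 2D-to-3D pose converter described below.
cap.release()
pose.close()
```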
4.2 3D Pose Estimation
Our methodology leverages 2D poses to predict the
z-coordinate for each landmark. Assuming the cam-
era plane as the X-Y plane, each body pose keypoint
corresponds to a coordinate (x,y). Beyond just RGB,
the Intel Realsense also provides depth images, albeit
with a slight misalignment. By aligning the depth im-
age with the RGB counterpart, we ensure a precise
correspondence between them.
The z-coordinate is deduced by overlaying the 2D
poses on the depth image. An area surrounding each
landmark is considered, factoring in the surrounding
pixels for the best z-coordinate estimate. The area
of consideration inversely correlates with the distance
from the camera. This necessitates an initial estima-
tion of this distance, achieved by superimposing the
coordinates of the shoulders and hips on the depth im-
age, creating a bounding quadrilateral. Given the po-
tential for background inclusion, we opt for the first
quartile average of the depth values, excluding back-
ground measurements.
Given the output from an RGB-D camera like
Realsense, each pixel’s correspondence to an actual
physical area varies with the distance. At a dis-
tance of 1 meter, each pixel represents an area of
1.5mm × 1.5mm, whereas at 6 meters, it is 9mm ×
9mm. With human proportions in consideration, we
scale the number of pixels to match a 90mm × 90mm
area. The subsequent z-coordinate estimation, com-
bined with the MediaPipe output, yields a compre-
hensive 3D pose.
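A minimal sketch of this depth lookup is given below, assuming the depth image is already aligned to the RGB frame and expressed in millimetres. The helper names, the reuse of the first-quartile filter for the per-landmark window, and the constants (1.5 mm per pixel at 1 m, a 90 mm target window) follow the text but are otherwise assumptions.

```python
import numpy as np

def estimate_person_distance(depth_mm, torso_px):
    """Rough distance estimate from the first-quartile average of depth values
    inside the shoulder/hip bounding quadrilateral (background excluded)."""
    (x_min, y_min), (x_max, y_max) = torso_px
    patch = depth_mm[y_min:y_max, x_min:x_max].astype(float)
    patch = patch[patch > 0]                      # drop invalid depth readings
    q1 = np.percentile(patch, 25)
    return patch[patch <= q1].mean()              # first-quartile average, in mm

def landmark_z(depth_mm, u, v, distance_mm,
               target_area_mm=90.0, mm_per_px_at_1m=1.5):
    """Estimate the z-coordinate of one landmark by averaging a depth window
    whose pixel size shrinks as the person moves away from the camera."""
    # Pixel footprint grows roughly linearly with distance (1.5 mm at 1 m).
    mm_per_px = mm_per_px_at_1m * (distance_mm / 1000.0)
    half = max(1, int(round(target_area_mm / mm_per_px / 2)))
    window = depth_mm[max(0, v - half):v + half,
                      max(0, u - half):u + half].astype(float)
    window = window[window > 0]
    q1 = np.percentile(window, 25)
    return window[window <= q1].mean()            # robust z estimate, in mm
```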
5 FULL-BODY ACTION
CLASSIFICATION
Given the computational constraints on our fleet of
UAVs, the emphasis is on a lightweight algorithm that
retains accuracy. A balance is struck with a custom
k-nearest Neighbor (kNN) classifier. Our approach
to action classification is holistic, beginning with the
careful design of a feature space. By leveraging the
relationship between several joints in 3D space and
in time, we derive a representative feature vector that
encapsulates the dynamics of human posture. The
core of our methodology adopts a two-stage classi-
fication approach. In the initial stage, Feature embed-
ding translates pose landmarks into meaningful vec-
tor space. Autoencoders are then employed to reduce
the dimensionality of the embeddings. Subsequently,
Dynamic Time Warping (DTW) is used for a refined
comparison, accounting for potential temporal vari-
ations and guaranteeing the precise alignment of se-
quences. This comprehensive strategy ensures a bal-
ance between computational efficiency and classifica-
tion accuracy, providing a robust solution for human
action recognition. The steps of classification are de-
tailed in sections 6 and 7.
6 FEATURE SPACE DESIGN
In the realm of human pose estimation and analysis,
the extraction of meaningful features from detected
landmarks is of prime importance. The proposed
methodology focuses on extracting embeddings from
13 key anatomical landmarks, which are represented
by their 3D coordinates (x,y,z).
6.1 Landmarks Identification
The identified landmarks in the human body are as
follows: Nose, Left and Right Shoulder, Left and
Right Elbow, Left and Right Wrist, Left and Right
Hip, Left and Right Knee, and Left and Right Heel.
6.2 Pose Normalization
To ensure consistency across different poses and indi-
viduals, the detected landmarks undergo a normaliza-
tion process. The normalization is performed in three
stages:
1. Translation Normalization: The landmarks are
translated such that the centre of the pose (mid-
point between the hips) is at the origin.
2. Scale Normalization: The landmarks are scaled
based on the size of the torso or the maximum dis-
tance from any landmark to the pose centre, mul-
tiplied by a given torso size multiplier.
3. Orientation Normalization (optional): The
landmarks are rotated to align the vector connect-
ing the hip centre to the shoulder centre with a pre-
defined target direction, ensuring an upright orien-
tation of the pose.
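A sketch of the three normalization stages is shown below, assuming poses arrive as an (N, 3) array of 3D landmarks indexed in MediaPipe order; the torso-size multiplier and the choice of +Y as the upright target direction are assumptions, not values taken from the paper.

```python
import numpy as np

LEFT_HIP, RIGHT_HIP = 23, 24            # MediaPipe indices (illustrative)
LEFT_SHOULDER, RIGHT_SHOULDER = 11, 12
TORSO_SIZE_MULTIPLIER = 2.5             # assumed value, not from the paper

def normalize_pose(landmarks, align_upright=False):
    """landmarks: (N, 3) array of 3D keypoints."""
    pts = np.asarray(landmarks, dtype=float)

    # 1. Translation: move the hip midpoint (pose centre) to the origin.
    centre = 0.5 * (pts[LEFT_HIP] + pts[RIGHT_HIP])
    pts = pts - centre

    # 2. Scale: divide by max(torso size * multiplier, furthest landmark distance).
    shoulders = 0.5 * (pts[LEFT_SHOULDER] + pts[RIGHT_SHOULDER])
    torso = np.linalg.norm(shoulders)                    # hip centre is at the origin
    scale = max(torso * TORSO_SIZE_MULTIPLIER, np.linalg.norm(pts, axis=1).max())
    pts = pts / scale

    # 3. Optional orientation: rotate the hip-to-shoulder vector onto +Y (upright).
    if align_upright:
        v = shoulders / (np.linalg.norm(shoulders) + 1e-9)
        target = np.array([0.0, 1.0, 0.0])
        axis = np.cross(v, target)
        s, c = np.linalg.norm(axis), np.dot(v, target)
        if s > 1e-9:
            k = axis / s
            K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
            R = np.eye(3) + s * K + (1 - c) * (K @ K)    # Rodrigues' rotation formula
            pts = pts @ R.T
    return pts
```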
6.3 Embedding Calculation
For the efficient extraction of features from these
landmarks, an EmbeddingCalculator class has been
utilized. In constructing the feature space for action
classification, we prioritized features with inherent re-
silience to noise and occlusions, essential in 3D pose
estimation. Selection focused on relative positions
and orientations between joints, as these are less sen-
sitive to occlusions and provide a stable reference in
noisy data. Additionally, incorporating temporal fea-
tures like joint velocity and acceleration helps smooth
out noise over time. Depth information plays a crucial
role in enhancing occlusion handling, allowing for a
more accurate estimation of partially visible actions.
This strategic selection ensures our classification remains robust across diverse and challenging conditions. The embedding calculator groups its computations into three categories:
1. Single Joint Operations: Processes features de-
rived from individual landmarks.
2. Joint Pair Operations: Processes features de-
rived from pairs of landmarks.
3. Tri Joint Operations: Processes features derived
from groups of three landmarks.
6.3.1 Single Joint Operations
For each landmark, the following are calculated:
Joint Vector: Directly takes the 3D coordinates
of the landmark. Provides the spatial positioning
necessary for accurate pose recognition, crucial
for interpreting directional UAV commands based
on limb orientation.
Joint Vector = landmarks[i] (1)
where i is the index of the landmark in the prede-
fined list.
Joint Velocity (v): The rate of change of the
joint’s position with respect to time. Both veloc-
ity and acceleration are essential for distinguish-
ing dynamic gestures from static poses, enabling
the UAV to interpret the urgency or intended pace
of human commands. It is calculated as:
$v = \dfrac{\text{current joint vector} - \text{previous joint vector}}{\text{current timestamp} - \text{previous timestamp}}$ (2)
Joint Acceleration (a): The rate of change of a
joint’s velocity with respect to time. It is com-
puted as:
$a = \dfrac{v - \text{previous } v}{\text{current timestamp} - \text{previous timestamp}}$ (3)
Joint Vector Angle: The angle between the joint
vector and each of the coordinate axes (x, y, z).
Offers insights into the limb orientation, critical
for understanding gesture directionality and en-
suring precise UAV response to commands like
vertical takeoff or horizontal movement. For a
given joint vector v and axis a, the angle θ is com-
puted using the dot product:
$\theta = \arccos\left(\dfrac{v \cdot a}{\|v\|\,\|a\|}\right)$ (4)
Joint Angular Velocity: The rate of change of
a joint’s vector angle with respect to time. Joint
angular velocity and acceleration help the sys-
tem gauge the smoothness or abruptness of move-
ments, aiding in the interpretation of gesture ur-
gency for immediate or deliberate UAV actions.
Joint Angular Acceleration: The rate of change
of a joint’s angular velocity with respect to time.
Displacement Vector: The change in position of
the joint from its previous position. Indicates the
trajectory of joint movements, guiding the UAV in
adjusting its flight path to align with the operator’s
intended direction.
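The single-joint features above reduce to simple finite differences between consecutive normalized poses, as in the condensed sketch below, which mirrors Equations (1)-(4); the handling of the first frame (zero previous velocity) is an assumption.

```python
import numpy as np

def joint_vector_angles(v):
    """Angle between a joint vector and each coordinate axis, cf. Eq. (4)."""
    norm = np.linalg.norm(v) + 1e-9
    # For unit basis axes, cos(angle with axis i) is simply v[i] / |v|.
    return np.arccos(np.clip(v / norm, -1.0, 1.0))

def single_joint_features(curr, prev, t_curr, t_prev, prev_velocity=None):
    """curr, prev: (3,) joint positions at t_curr and t_prev (seconds)."""
    dt = max(t_curr - t_prev, 1e-9)
    displacement = curr - prev                        # displacement vector
    velocity = displacement / dt                      # Eq. (2)
    if prev_velocity is None:
        prev_velocity = np.zeros(3)                   # assumed first-frame behaviour
    acceleration = (velocity - prev_velocity) / dt    # Eq. (3)
    angles = joint_vector_angles(curr)                # Eq. (4), one angle per axis
    return np.concatenate([curr, velocity, acceleration, angles, displacement])
```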
6.3.2 Joint Pair Operations
For each pair of landmarks, the following are com-
puted:
Joint Pair Vector: The difference in the 3D co-
ordinates of the two landmarks. Provides a rela-
tional understanding of body posture by examin-
ing vectors between pairs of joints, aiding in the
nuanced differentiation of gestures for accurate
UAV command interpretation.
$\text{Joint Pair Vector} = \text{landmark}[j] - \text{landmark}[k]$ (5)
where j and k are the indices of the two land-
marks.
Joint Pair Velocity, Acceleration, Vector Angle,
Angular Velocity, and Angular Acceleration:
These are calculated similarly to the single joint
operations, but are applied to the joint pair vector.
6.4 Tri Joint Operations
Given the three landmarks A, B, and C, we can define two vectors:
$\vec{AB} = B - A$ (6)
$\vec{BC} = C - B$ (7)
6.4.1 Tri Joint Angle
The angle $\theta$ between the two vectors $\vec{AB}$ and $\vec{BC}$ is given by:

$\theta = \arccos\left(\dfrac{\vec{AB} \cdot \vec{BC}}{\|\vec{AB}\|\,\|\vec{BC}\|}\right)$ (8)

where $\vec{AB} \cdot \vec{BC}$ is the dot product of the two vectors.
The normal to the plane containing A, B, and C is given by the cross product of the vectors $\vec{AB}$ and $\vec{BC}$:

$\vec{N} = \vec{AB} \times \vec{BC}$ (9)

The unit normal vector $\hat{N}$ is then:

$\hat{N} = \dfrac{\vec{N}}{\|\vec{N}\|}$ (10)

The tri joint angle $\Theta$ (or the feature we are considering) is then a combination of $\theta$ and $\hat{N}$, which could be represented as:

$\Theta = \theta\,\hat{N}$ (11)
Our selection of the feature vector was grounded
in its capability to uniquely represent the anatomi-
cal structure and dynamics. It captures the geometric
configuration of poses involving bends or twists, en-
abling complex gesture recognition for sophisticated
UAV manoeuvre commands. The unit normal vector
distinctively identifies the plane in which rays con-
necting a landmark to its neighbouring joints reside.
Concurrently, the cosine of the angle effectively cap-
tures the relative positioning of these rays. By taking
the product of these two entities, we obtain a singular,
robust feature vector. This vector augments the fea-
ture space, bolstering our ability to discern and clas-
sify sequences with heightened precision.
6.4.2 Tri Joint Angular Velocity
The angular velocity ω for the tri joint angle is the rate
of change of Θ with respect to time:
$\omega = \dfrac{\Delta\Theta}{\Delta t}$ (12)
6.4.3 Tri Joint Angular Acceleration
The angular acceleration α for the tri joint angle is the
rate of change of ω with respect to time:
$\alpha = \dfrac{\Delta\omega}{\Delta t}$ (13)
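Equations (8)-(13) translate almost directly into code; the sketch below computes the tri-joint feature and its temporal derivatives, with small epsilon terms added only to avoid division by zero.

```python
import numpy as np

def tri_joint_feature(A, B, C):
    """Tri-joint feature: unit plane normal scaled by the bend angle, Eqs. (8)-(11)."""
    AB, BC = B - A, C - B
    cos_theta = np.dot(AB, BC) / (np.linalg.norm(AB) * np.linalg.norm(BC) + 1e-9)
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))          # Eq. (8)
    N = np.cross(AB, BC)                                      # Eq. (9)
    N_hat = N / (np.linalg.norm(N) + 1e-9)                    # Eq. (10)
    return theta * N_hat                                      # Eq. (11)

def tri_joint_dynamics(Theta, Theta_prev, omega_prev, dt):
    """Angular velocity and acceleration of the tri-joint feature, Eqs. (12)-(13)."""
    omega = (Theta - Theta_prev) / dt
    alpha = (omega - omega_prev) / dt
    return omega, alpha
```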
6.5 Feature Vector Extraction and
Normalization
Upon processing the normalized landmarks with the
embedding calculator, we obtain the primary feature
vectors, as explained in the previous sub-sections.
These vectors, imbued with the dynamics of human
movement, are central to our classification scheme.
To ensure a consistent representation across the
dataset, we calculate certain parameters. Specifically,
for each embedded sample, we ascertain its minimum
(min) and maximum (max) values. These extremi-
ties are extracted from a concatenated array of em-
beddings, which is aggregated frame by frame from
each sample.
Post parameter estimation, we normalize the fea-
ture vectors using min-max scaling. This normaliza-
tion is pivotal in ensuring that no specific feature over-
shadows others during the classification process. By
mapping the features to a range between -1 and 1, we
achieve uniformity in their magnitudes while main-
taining their sign.
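The paper does not spell out the exact sign-preserving scaling, so the sketch below shows one plausible reading that divides each feature dimension by its largest absolute value, which maps values into [-1, 1] while keeping their sign.

```python
import numpy as np

def scale_features(embeddings):
    """Map each feature dimension into [-1, 1] while preserving its sign,
    by dividing by the per-dimension maximum absolute value (assumed scheme)."""
    emb = np.asarray(embeddings, dtype=float)       # shape: (frames, features)
    max_abs = np.abs(emb).max(axis=0)
    max_abs[max_abs == 0] = 1.0                     # avoid division by zero
    return emb / max_abs
```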
Integrating these features into our classification
framework allows for an advanced, nuanced under-
standing of human motions, ensuring the UAV ac-
tions are tightly coupled with the operator’s intent.
This synergy between human gestures and UAV re-
sponse is fundamental for applications requiring intu-
itive, real-time robot control.
7 ENHANCED K-NEAREST
NEIGHBOR CLASSIFIER
THROUGH DIMENSIONALITY
REDUCTION
Our innovative approach to action classification com-
bines the strengths of dimensionality reduction via
autoencoders and an augmented k-nearest Neigh-
bor (kNN) algorithm integrated with Dynamic Time
Warping (DTW). This two-step methodology is tai-
lored for varying computational efficiency and accu-
racy requirements.
7.0.1 Dimensionality Reduction with
Autoencoders
In the first pathway, we deploy a deep autoencoder
for significant dimensionality reduction of the input
space. The autoencoder architecture comprises a se-
ries of dense layers that form an encoding phase,
transitioning from an input dimension $D_{\text{input}}$ to a reduced latent dimension $D_{\text{latent}}$, where $D_{\text{latent}} \ll D_{\text{input}}$.
Formally, the encoder function $E: \mathbb{R}^{D_{\text{input}}} \to \mathbb{R}^{D_{\text{latent}}}$ compresses the data, and the decoder function $D: \mathbb{R}^{D_{\text{latent}}} \to \mathbb{R}^{D_{\text{input}}}$ aims to reconstruct the original data. The reconstruction loss $L_{\text{reconstruction}} = \| X - D(E(X)) \|_2^2$ is minimized, where $X$ denotes the input data.
After dimensionality reduction, the latent repre-
sentations are processed through a kNN classifier aug-
mented with DTW as the similarity metric, enhancing
accuracy and making it suitable for precision-critical
scenarios.
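A compact PyTorch sketch of this kind of dense autoencoder is given below; the layer widths, activation functions, latent size, and training hyperparameters are assumptions, since the paper does not specify the architecture.

```python
import torch
import torch.nn as nn

class PoseAutoencoder(nn.Module):
    """Dense autoencoder compressing feature vectors from D_input to D_latent."""
    def __init__(self, d_input, d_latent, d_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_input, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_input),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_autoencoder(model, loader, epochs=50, lr=1e-3):
    """Minimize the reconstruction loss ||X - D(E(X))||^2 over the training set."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for batch in loader:                # batch: (B, D_input) feature vectors
            opt.zero_grad()
            loss = loss_fn(model(batch), batch)
            loss.backward()
            opt.step()
    return model
```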
7.1 Dynamic Time Warping (DTW)
Following the encoding of the candidate set C, we
refine our matches employing the DTW algorithm.
The DTW distance between two sequences $S$ and $S'$ is computed as:

$D_{\text{DTW}}(S, S') = \min_{\text{path}} \sum_{(i,j) \in \text{path}} d(s_i, s'_j)$ (14)

Here, $d(s_i, s'_j)$ denotes the Euclidean distance between the respective feature vectors, while the "path" symbolizes the optimal alignment between the sequences.
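Equation (14) can be evaluated with the standard dynamic-programming recursion, as in the sketch below; no windowing or pruning is applied here, although a real-time implementation would likely add one.

```python
import numpy as np

def dtw_distance(S, S_prime):
    """DTW distance between two sequences of feature vectors, cf. Eq. (14).
    S: (n, d) array, S_prime: (m, d) array."""
    n, m = len(S), len(S_prime)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(S[i - 1] - S_prime[j - 1])   # Euclidean frame distance
            # Accumulate along the cheapest of the three admissible moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```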
7.2 Classification
Post the DTW filtering, the sequences in our final
shortlist dictate the classification outcome. Given the
frequency distribution of each class within the short-
list, the class exhibiting the highest prevalence is des-
ignated to the incoming sequence. Mathematically, for an incoming sequence $S$, the assigned class $C^*$ is:

$C^* = \arg\max_{c \in \text{Classes}} \text{Frequency}(c, \text{Shortlist})$ (15)
The selection of k in the kNN classifier is critical for
the balance between noise sensitivity and the classi-
fier’s generalization ability. We determined the opti-
mal k empirically using a cross-validation approach
on various dataset segments to achieve a balance that
maximizes classification performance while minimiz-
ing error rates. The augmentation of kNN with Dy-
namic Time Warping (DTW) further enhances its sen-
sitivity to the temporal dynamics of actions, ensur-
ing that the dimensionality reduction does not com-
promise the classifier’s ability to distinguish between
similar movements.
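Combining the encoded sequences, the DTW distance, and the majority vote of Equation (15) gives the classification step sketched below; the function reuses the dtw_distance sketch above, and the value of k is illustrative since the paper tunes it by cross-validation.

```python
from collections import Counter

def classify_sequence(query, train_sequences, train_labels, k=5):
    """query: encoded (n, d_latent) sequence; train_sequences: list of encoded
    reference sequences with matching train_labels. Returns the majority class
    of the k DTW-nearest neighbours, cf. Eq. (15)."""
    distances = [dtw_distance(query, ref) for ref in train_sequences]
    shortlist = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    votes = Counter(train_labels[i] for i in shortlist)
    return votes.most_common(1)[0][0]
```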
7.3 Mathematical Formulation and
Empirical Evaluation
Our methodology was empirically evaluated against
the UTD-MHAD(Chen et al., 2015) dataset to
quantify the performance metrics of accuracy and
computational efficiency. The autoencoder-based
method focused on reconstruction loss and classi-
fication accuracy, utilizing the formula $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$, where $TP$, $TN$, $FP$, and $FN$ represent true positives, true negatives, false positives, and false negatives, respectively.
8 UAV CONTROL
The UAV continuously monitors the position of the
human and does so in two steps. In the first step, it
tracks the operator’s position on the video feed and
perpetually corrects its heading, such that the human
remains in the centre of its field of view. In the second step, it monitors the distance of the human from the drone and tries to maintain a set distance, thereby
eliminating the need for the human to stay in one
place. The UAV is able to do this while simultane-
ously receiving commands from the human’s actions
and performing them.
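Both tracking behaviours can be expressed as simple proportional corrections, one on heading from the operator's horizontal pixel offset and one on forward velocity from the distance error, as in the sketch below; the gains, set-point distance, and command interface are illustrative assumptions rather than the MRS UAV system's actual API.

```python
def track_operator(pixel_x, image_width, distance_m,
                   desired_distance_m=6.0, k_yaw=0.002, k_fwd=0.5):
    """Keep the operator centred in the image and at a set stand-off distance.
    Returns a (yaw_rate, forward_velocity) command tuple (illustrative units)."""
    # Step 1: heading correction from the horizontal pixel offset of the operator.
    pixel_error = pixel_x - image_width / 2.0
    yaw_rate = -k_yaw * pixel_error            # turn towards the operator

    # Step 2: maintain the set distance to the operator.
    distance_error = distance_m - desired_distance_m
    forward_velocity = k_fwd * distance_error  # advance if too far, back off if too close

    return yaw_rate, forward_velocity
```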
9 RESULTS
9.1 Method Verification and
Benchmarking
To ensure the robustness and reliability of our pro-
posed method, we initially subjected it to rigorous
benchmarking. Several performance metrics were
employed to verify the efficiency and accuracy of our
classifier.
Performance Metrics:
Accuracy: The overall rate of correctly classified
human movements among all classifications.
F1 Score: A weighted average of precision and
recall, providing a balance between false positives
and false negatives.
Confusion Matrix: A detailed breakdown of true
positives, false positives, true negatives, and false
negatives.
Computational Time: The time taken for the
classifier to process an input and generate an out-
put.
The performance metrics are summarized in Ta-
ble 1. These tests were run on the same Intel NUC
that is present in our UAVs (Hert et al., 2023), and
therefore accurately reflect real onboard deployment conditions. The ap-
proach was tested on the UTD-MHAD Dataset (Chen
et al., 2015), which consists of 27 classes. We also
conducted tests with a reduced number of classes to
directly compare our method with those in (Schneider
et al., 2019). Six classes were chosen for this compar-
ison, with one additional class (a8) added to evaluate
its effect on algorithm performance. These 7 actions were: a1: Arm swipe to the left, a6: Cross arms in the chest, a7: Basketball shoot, a8: Hand draw X, a9: Hand draw circle (clockwise), a24: Sit to stand, and a26: Forward lunge.
Two versions of the proposed approach were
tested on the dataset. The Heavy version utilizes the
full set of encodings and is expected to be very accu-
rate, but considerably slower. The encoded version in-
corporates dimensionality reduction, providing faster
classification at slightly lower accuracy. This version
offered a good balance between accuracy and compu-
tational time for real-time classification. The insights
drawn from this evaluation illuminate the trade-offs
between accuracy and computational demands, guid-
ing the selection of the optimal configuration for spe-
cific application scenarios.
Heavy Configuration: Precision at the Cost of
Computational Efficiency. The Heavy configura-
tion (characterized by its utilization of the full set
of encodings) demonstrated remarkable accuracy and
F1 scores across all evaluated class groupings. It
achieved a pinnacle of classification precision in the
6-class setup, with an accuracy of 98.72% and an
F1 score of 99. However, this high degree of preci-
sion comes at a significant computational cost. The
total time for processing 27 classes was recorded at
360.44 seconds, with a per-case time exceeding 2000
ms. This considerable computational demand demon-
strates the Heavy configuration’s limited applicability
in real-time or resource-constrained scenarios.
Encoded Configuration: A Pragmatic Balance.
Emerging as the balanced contender, the Encoded
configuration significantly reduces computational
time without drastically compromising on accuracy.
For 6 classes, it maintained an impressive accuracy of
97.44% and an F1 score of 97, with a markedly re-
duced per-case time of approximately 38 ms. This
configuration adeptly balances computational effi-
ciency and precision, making it an ideal candidate
for real-time applications. While notable, the dip in
accuracy to 83.24% for 27 classes still positions the
Encoded configuration as a robust option, capable of
handling a diverse range of movements with consid-
erable accuracy.
Discussion on Trade-offs and Configuration Selec-
tion. The analysis of the two configurations high-
lights a fundamental trade-off between computational
efficiency and accuracy. The Heavy configuration,
while highly accurate, may not be feasible for real-
time applications due to its significant computational
demands. The Encoded configuration stands out as
the optimal choice for applications requiring a harmo-
nious balance between accuracy and computational
speed, offering high performance without substantial
Table 1: Performance metrics of the proposed method.

Configuration | Classes | Test/Train cases | Accuracy (%) | Total Time (s) | Per-case Time (ms) | F1 Score
Heavy         | 27      | 173/688          | 98.27        | 360.44         | 2083               | 98
Heavy         | 6       | 39/153           | 98.72        | 59.03          | 1513               | 99
Heavy         | 7       | 45/178           | 95.56        | 71.71          | 1594               | 96
Encoded       | 27      | 173/688          | 83.24        | 23.82          | 138                | 84
Encoded       | 6       | 39/153           | 97.44        | 1.4896         | 38                 | 97
Encoded       | 7       | 45/178           | 86.67        | 1.7458         | 39                 | 87
sacrifices. For real-time applications, the Encoded
configuration’s balanced performance profile makes
it exceptionally suitable. It provides a viable solu-
tion that accommodates the need for quick processing
times, while still maintaining a high level of accuracy.
This balance is crucial for deploying efficient and
responsive systems in dynamic environments where
both precision and speed are essential.
Table 2: Comparison of accuracy with state-of-the-art methods.

Method                                     | Accuracy (%)
Classical                                  | 60
Schneider et al. (Schneider et al., 2019)  | 63-76
Rwigema et al. (Rwigema et al., 2019)      | 99.4
Celebi et al. (Celebi et al., 2013)        | 96
Proposed method (Encoded)                  | 86-97
Additionally, the confusion matrices of the En-
coded variant of the approach for 6 and 7 ges-
tures respectively are displayed in Fig 3 and Fig
4. The model exhibits a high degree of accuracy
for gestures, such as a24 sit to stand, a26 lunge, and
a7 basketball shoot, which are likely to have distinct
starting and ending poses or unique motion patterns
that are easily distinguishable. Confusions are primar-
ily seen with a1 swipe left which is sometimes mis-
taken for a8 draw X and vice versa, suggesting that
the horizontal component of the swipe is similar to
part of the ”draw X” motion. Similarly, misclassifi-
cations between a1 swipe left and a9 draw circle cw
imply that certain segments of the swipe and circu-
lar gestures may be indistinguishable to the model.
The model’s difficulty in differentiating between
a1 swipe left and gestures involving complex hand
trajectories (a8 draw X, a9 draw circle cw) indicates
a potential area for improvement. Refinement of the
feature set and the inclusion of more granular tempo-
ral data could enhance the model’s ability to discern
between these gestures with overlapping features.
9.2 Comparative Analysis with
State-of-the-Art Methods
In this section, we juxtapose the performance of
our proposed Encoded configuration against various
state-of-the-art methods, as summarized in the ac-
companying table. The comparative analysis is cru-
cial to positioning our work within the broader land-
scape of action classification methodologies, empha-
sizing its competitive advantages and identifying ar-
eas for further refinement.
Overview of Comparative Performance
1. Classical Methods: These approaches, typically
involving hand-engineered features and classical
machine learning algorithms, show a base ac-
curacy of 60%. Our method significantly sur-
passes this benchmark, demonstrating the efficacy
of modern, data-driven approaches in handling
complex classification tasks.
2. Schneider et al.: With accuracies ranging between
63% to 76%, the work by Schneider et al. closely
aligns with the initial performance metrics our
study aimed to exceed. By achieving accuracy be-
tween 86% to 97% in the Encoded configuration,
our method not only surpasses Schneider et al.'s
performance, but also showcases the potential of
embedding calculators and dimensionality reduc-
tion techniques in enhancing classification accu-
racy.
3. Rwigema et al.: Although Rwigema et al.'s
method achieves an impressive accuracy of
99.4%, it is noted for its unsuitability for real-
time applications due to substantial computational
requirements. This highlights a critical aspect of
our research focus—balancing high accuracy with
computational efficiency to enable real-time clas-
sification.
4. Celebi et al.: The method by Celebi et al.
presents a high accuracy of 96%, situating it
as a leading approach within the field. Our
proposed method’s performance falls within this
high-accuracy bracket while emphasizing real-
time applicability.
9.3 Real-World Deployment
Upon verification, we proceeded to deploy our system
on our UAV platform (Hert et al., 2022; Hert et al., 2023) and our UAV control system (Baca et al., 2021),
Figure 3: Confusion Matrix of Encoded Variant with 6 Ges-
ture Classes.
to assess its real-world applicability. The primary fo-
cus of this phase was to determine how well our clas-
sifier could translate laboratory results into practical,
actionable commands in an outdoor environment.
During the trials, the UAV was subjected to a se-
ries of predefined human actions. The 6 gestures that
were used for lab validation (a1, a6, a7, a9, a24, a26)
were performed for testing, with the human standing
between 4-8 meters away from the UAV. The UAV
correctly recognized and responded to 19 out of 20
actions, yielding a real-world accuracy rate of 95%.
Notably, the proposed approach adeptly handled dy-
namic environmental factors, such as changing light
conditions and background noise due to the presence
of clouds, occluding sunlight sporadically throughout
the experiment, showcasing its adaptability and ro-
bustness. Additionally, we encountered no false pos-
itives, which is crucial as the performance of unin-
tended actions is undesirable. The one action that was
not correctly identified was classified as a null action,
leading to no command being sent to the drone. This
is the intended behaviour that we want our approach
to adopt. Missing an action is preferable to misidenti-
fying an action and behaving erratically. Fig 5 shows
a UAV being controlled by a human operator using
gestures.
10 CONCLUSION
In conclusion, the feature space design offers a com-
prehensive approach to extracting rich embeddings
from human pose landmarks. These embeddings,
grounded in both anatomical significance and math-
ematical rigour, are poised to enhance the capabili-
ties of pose-based analysis systems.

Figure 4: Confusion Matrix of Encoded Variant with 7 Gesture Classes.

Figure 5: A UAV being controlled by a human operator in an open field.

The comparative analysis underscores the Encoded configuration as the
preferred choice for a wide range of applications, par-
ticularly those necessitating real-time processing. It
embodies a practical compromise, delivering high ac-
curacy and F1 scores with considerably lower com-
putational times compared to the Heavy configura-
tion. The comparative analysis also elucidates the po-
sitioning of our proposed method within the action
classification domain. By offering a substantial im-
provement over classical methods and some contem-
porary approaches, as well as by providing a viable al-
ternative to high-accuracy, computationally intensive
methods, our work carves out a niche in the pursuit
of real-time, efficient, and accurate action classifica-
tion. It underscores the importance of methodological
advancements that do not sacrifice practical applica-
bility for theoretical precision, thereby aligning with
the evolving needs of real-world applications. The
proposed method performed exceptionally when de-
ployed on a real UAV, proving its capability in real-
world applications.
ACKNOWLEDGEMENTS
This work has been supported by CTU grant no
SGS23/177/OHK3/3T/13, by the Czech Science
Foundation (GAČR) under research project No. 23-07517S, and by the European Union under the project
Robotics and advanced industrial production (reg.
no. CZ.02.01.01/00/22 008/0004590), by TACR
project no. FW03010020, by the National Council for
Scientific and Technological Development - CNPq,
by the National Fund for Scientific and Technological Development - FNDCT, and by the Ministry of
Science, Technology and Innovations - MCTI from
Brazil under research project No. 304551/2023-6 and
407334/2022-0, by the Paraiba State Research Sup-
port Foundation - FAPESQ under research project No.
3030/2021.
REFERENCES
Baca, T., Petrlik, M., Vrba, M., Spurny, V., Penicka, R.,
Hert, D., and Saska, M. (2021). The mrs uav system:
Pushing the frontiers of reproducible research, real-
world deployment, and education with autonomous
unmanned aerial vehicles. Journal of Intelligent &
Robotic Systems, 102:26.
Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T.,
Zhang, F., and Grundmann, M. (2020). Blazepose:
On-device real-time body pose tracking.
Celebi, S., Aydin, A. S., Temiz, T. T., and Arici, T. (2013).
Gesture recognition using skeleton data with weighted
dynamic time warping. volume 1, pages 620–625.
Chaudhary, A., Nascimento, T., and Saska, M. (2022). Con-
trolling a swarm of unmanned aerial vehicles using
full-body k-nearest neighbor based action classifier.
pages 544–551. IEEE.
Chen, C., Jafari, R., and Kehtarnavaz, N. (2015). Utd-mhad:
A multimodal dataset for human action recognition
utilizing a depth camera and a wearable inertial sen-
sor. pages 168–172. IEEE.
Gundogdu, K., Bayrakdar, S., and Yucedag, I. (2018). De-
veloping and modeling of voice control system for
prosthetic robot arm in medical systems. Journal of
King Saud University - Computer and Information
Sciences, 30:198–205.
Hert, D., Baca, T., Petracek, P., Kratky, V., Penicka, R.,
Spurny, V., Petrlik, M., Vrba, M., Zaitlik, D., Stoudek,
P., Walter, V., Stepan, P., Horyna, J., Pritzl, V.,
Sramek, M., Ahmad, A., Silano, G., Licea, D. B.,
Stibinger, P., Nascimento, T., and Saska, M. (2023).
Mrs drone: A modular platform for real-world deploy-
ment of aerial multi-robot systems. Journal of Intelli-
gent & Robotic Systems, 108:64.
Hert, D., Baca, T., Petracek, P., Kratky, V., Spurny, V., Petr-
lik, M., Vrba, M., Zaitlik, D., Stoudek, P., Walter, V.,
Stepan, P., Horyna, J., Pritzl, V., Silano, G., Licea,
D. B., Stibinger, P., Penicka, R., Nascimento, T., and
Saska, M. (2022). Mrs modular uav hardware plat-
forms for supporting research in real-world outdoor
and indoor environments. pages 1264–1273. IEEE.
Jiao, R., Wang, Z., Chu, R., Dong, M., Rong, Y., and Chou,
W. (2020). An intuitive end-to-end human-uav inter-
action system for field exploration. Frontiers in Neu-
rorobotics, 13.
Krusche, S., Al Naser, I., Bdiwi, M., and Ihlenfeldt, S.
(2023). A novel approach for automatic annotation
of human actions in 3d point clouds for flexible col-
laborative tasks with industrial robots. Frontiers in
Robotics and AI, 10.
Mohebbi, A. (2020). Human-robot interaction in rehabili-
tation and assistance: a review. Current Robotics Re-
ports, 1:131–144.
Mokhtarzadeh, A. A. and Yangqing, Z. J. (2018). Human-
robot interaction and self-driving cars safety integra-
tion of dispositif networks. pages 494–499. IEEE.
Park, S., Wang, X., Menassa, C. C., Kamat, V. R., and
Chai, J. Y. (2024). Natural language instructions for
intuitive human interaction with robotic assistants in
field construction work. Automation in Construction,
161:105345.
Rwigema, J., Choi, H. R., and Kim, T. (2019). A differential
evolution approach to optimize weights of dynamic
time warping for multi-sensor based gesture recogni-
tion. Sensors (Switzerland), 19.
Sathiyanarayanan, M., Mulling, T., and Nazir, B. (2015).
Controlling a robot using a wearable device (myo).
Schneider, P., Memmesheimer, R., Kramer, I., and Paulus,
D. (2019). Gesture recognition in RGB videos using human body keypoints and dynamic time warping.
Vasconez, J. P., Kantor, G. A., and Cheein, F. A. A. (2019).
Human–robot interaction in agriculture: A survey and
current challenges. Biosystems Engineering, 179:35–
48.
Vysocky, A. and Novak, P. (2016). Human robot collab-
oration in industry. MM Science Journal, 2016:903–
906.
Yamada, H., Muto, T., and Ohashi, G. (2015). Development
of a telerobotics system for construction robot using
virtual reality. pages 2975–2979. Institute of Electri-
cal and Electronics Engineers Inc.
Yoo, M., Na, Y., Song, H., Kim, G., Yun, J., Kim, S., Moon,
C., and Jo, K. (2022). Motion estimation and hand
gesture recognition-based human–uav interaction ap-
proach in real time. Sensors, 22:2513.
Zhang, J., Yu, Z., Wang, X., Lyu, Y., Mao, S., Periaswamy,
S. C., Patton, J., and Wang, X. (2018). Rfhui: An in-
tuitive and easy-to-operate human-uav interaction sys-
tem for controlling a uav in a 3d space. In Proceedings
of the 15th EAI International Conference on Mobile
and Ubiquitous Systems: Computing, Networking and
Services, MobiQuitous ’18, page 69–76, New York,
NY, USA. Association for Computing Machinery.