segmented the hand region of the image into different sub-regions and obtained Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP) features from each region. Afterwards, they combined k-means and Support Vector Machines (SVM) to classify the hand poses, obtaining an F1-score near 96% using data from 25 subjects and 16 different hand poses. Similarly, a previous work (Bao, Maqueda, del-Blanco, & Garcia, 2017) fed images directly to a deep Convolutional Neural Network (CNN) to classify hand poses without any previous segmentation. They classified the hand poses with an average accuracy of 97.1% on images with simple backgrounds and 85.3% on images with complex backgrounds, using a dataset with data from 40 subjects and seven different hand poses. Another work (Gil-Martín, San-Segundo, & de Córdoba, 2023) applied a normalization to the hand landmarks to detect the same poses, achieving robust performance even on images with complex backgrounds. This approach has the advantage of passing less information to the recognition module than traditional computer vision approaches (landmark coordinates instead of a full image). Another previous work (Benitez-Garcia, Olivares-Mercado, Sanchez-Perez, & Yanai, 2021) used the raw images to classify hand gestures by training a ResNeXt-101 model, achieving an accuracy of 86.32% when evaluating on 13 subjects.
This paper focuses on exploring the impact of different landmark-based input formats on the gesture recognition task, rather than on optimizing the deep learning architecture to obtain the maximum performance. This work uses a state-of-the-art deep learning architecture to understand which input formats (specifically raw hand landmark coordinates, speed of coordinate movement, and anthropometric measures) yield the most informative representations for gesture recognition. The primary contributions of this research are as follows:
- Analyze different landmark input formats: raw coordinates, speed of coordinate movement, and anthropometric measures (see the sketch after this list).
- Investigate the effects of various normalization techniques on raw landmark coordinates.
- Minimize the number of landmarks used in the recognition process while keeping a high recognition performance.
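As a minimal illustrative sketch (our own illustration, not the exact implementation evaluated in this paper), the input formats above and one plausible normalization can be computed from a landmark sequence as follows. It assumes the sequence is stored as a NumPy array of shape (T, 21, 2), that speed is defined as the frame-to-frame displacement of each landmark, and that the anthropometric descriptor and the wrist-centered normalization shown are example choices rather than the specific ones studied here:

```python
import numpy as np

def raw_coordinates(seq):
    """Flatten per-frame landmark coordinates: (T, 21, 2) -> (T, 42)."""
    return seq.reshape(seq.shape[0], -1)

def coordinate_speed(seq):
    """Frame-to-frame landmark displacement: (T, 21, 2) -> (T - 1, 42)."""
    return np.diff(seq, axis=0).reshape(seq.shape[0] - 1, -1)

def anthropometric_measures(seq):
    """Example anthropometric descriptor: per-frame distances from the
    wrist (landmark 0) to the other 20 landmarks, (T, 21, 2) -> (T, 20)."""
    wrist = seq[:, :1, :]
    return np.linalg.norm(seq - wrist, axis=-1)[:, 1:]

def normalize_landmarks(seq):
    """Example normalization: center each frame on the wrist and scale
    by the wrist-to-middle-finger-MCP distance (landmark 9)."""
    centered = seq - seq[:, :1, :]
    scale = np.linalg.norm(centered[:, 9, :], axis=-1)  # shape (T,)
    return centered / np.maximum(scale, 1e-8)[:, None, None]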
2 MATERIAL AND METHODS
This section describes the dataset used, the landmark information extraction, and the proposed model architecture.
2.1 Dataset Description
We used the public IPN Hand dataset to evaluate our system.
The IPN Hand dataset (Benitez-Garcia et al.,
2021) includes hand gestures performed by 50
subjects for interaction with touchless screens. It
contains 4,218 gesture instances and 800,000
frames. It includes 13 gestures performed with one
hand: pointing with one or two fingers, clicking with
one or two fingers, throwing up/down/left/right,
opening twice, double-clicking with one or two
fingers, zooming in, and zooming out. During data collection, each subject performed the different gestures, with three random breaks, in a single video. The subjects used their own PC or laptop to record the RGB videos, which were captured at a resolution of 640x480 pixels and a frame rate of 30 fps.
This dataset is well suited for hand gesture recognition due to its extensive and varied content in terms of instances, gestures, and subjects, ensuring a diverse representation of hand motions. This
diversity enhances the dataset's applicability for real-
world human-machine interaction applications, such
as touchless screens and virtual reality interfaces,
where accurate gesture recognition is crucial. Its
realistic data collection, with subjects using their
own devices to record gestures in varied
environments, ensures that models trained on the
dataset can generalize well to practical usage
scenarios.
2.2 Landmark-Based Representations
The original images from the dataset were processed with the MediaPipe library to extract the x and y coordinates of specific points of the hand called landmarks. MediaPipe (Lugaresi et al., 2019; Quinonez, Lizarraga, & Aguayo, 2022) is a powerful framework capable of tracking the pose and the hands from input frames or video streams. It can extract 21 landmarks from the hand (the wrist plus four points along each of the five fingers).
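The following minimal sketch (our illustration based on MediaPipe's Python hands solution; the exact detector settings used in this work may differ) shows how these 21 landmarks can be extracted from a single frame:

```python
import cv2
import mediapipe as mp

# One hand per frame is assumed, matching the one-handed gestures
# of the IPN Hand dataset.
hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def extract_landmarks(frame_bgr):
    """Return the 21 (x, y) hand landmarks of a frame, normalized by
    MediaPipe to [0, 1], or None when no hand is detected."""
    results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None
    return [(lm.x, lm.y) for lm in results.multi_hand_landmarks[0].landmark]
```

Applied frame by frame to a gesture video, this yields the (T, 21, 2) coordinate sequences used in the sketch above.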
To standardize the input data and ensure consistent processing, we applied zero padding at the beginning of each gesture sequence so that all examples, whose original durations ranged from 25 to 250 timesteps, share the same length. This padding ensures that a uniform