Authors:
Sergio Esteban-Romero (1); Romeo Lanzino (2); Marco Raoul Marini (2) and Manuel Gil-Martín (1)
Affiliations:
(1) Grupo de Tecnología del Habla y Aprendizaje Automático, ETSI Telecomunicación, Universidad Politécnica de Madrid, Av. Complutense 30, 28040, Madrid, Spain
(2) VisionLab, Department of Computer Science, Sapienza University of Rome, Via Salaria 113, Rome 00198, Italy
Keyword(s):
Multi-View Hand Pose Recognition, Leap Motion Controller 2, Multimodal Data, Multimodal Fusion, Deep Learning.
Abstract:
This paper presents a novel approach to multi-view hand pose recognition that combines image embeddings with hand landmarks. The method integrates raw image data with structural hand landmarks derived from the Leap Motion Controller 2. A pretrained Vision Transformer (ViT) model was used to extract visual features from dual-view grayscale images. These features were fused with the corresponding Leap 2 hand landmarks, creating a multimodal representation that captures both visual and landmark data for each sample. The fused embeddings were then classified using a multi-layer perceptron to distinguish among 17 distinct hand poses from the Multi-view Leap2 Hand Pose Dataset, which includes data from 21 subjects. Using a Leave-One-Subject-Out Cross-Validation (LOSO-CV) strategy, we demonstrate that this fusion approach offers robust recognition performance (F1 score of 79.33 ± 0.09%), particularly in scenarios where hand occlusions or challenging angles may limit the utility of single-modality data.
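The fusion and evaluation protocol described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that fusion is a simple concatenation of the two per-view embeddings with the landmark vector, uses random arrays in place of real ViT embeddings and Leap 2 landmarks, and picks hypothetical feature sizes (`emb_dim`, `lm_dim`) and helper names (`fuse_features`, `loso_splits`). Only the subject count of 21 is taken from the abstract.

```python
import numpy as np


def fuse_features(view_a, view_b, landmarks):
    """Concatenate both view embeddings and the landmark vector per sample.

    Assumption: plain concatenation; the abstract does not state the fusion operator.
    """
    return np.concatenate([view_a, view_b, landmarks], axis=1)


def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


rng = np.random.default_rng(0)
n_subjects, samples_per_subject = 21, 4   # 21 subjects as in the abstract; 4 is arbitrary
n_samples = n_subjects * samples_per_subject
emb_dim, lm_dim = 768, 63                 # hypothetical sizes, not from the paper

# Random stand-ins for the per-view image embeddings and the landmark features.
view_a = rng.normal(size=(n_samples, emb_dim))
view_b = rng.normal(size=(n_samples, emb_dim))
landmarks = rng.normal(size=(n_samples, lm_dim))
subject_ids = np.repeat(np.arange(n_subjects), samples_per_subject)

fused = fuse_features(view_a, view_b, landmarks)
folds = list(loso_splits(subject_ids))
```

In each fold, a classifier such as a multi-layer perceptron would be trained on `fused[train_idx]` and scored on `fused[test_idx]`, so that no samples from the held-out subject are seen during training.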