Authors: Hajer Essefi (1); Olfa Ben Ahmed (1); Christel Bidet-Ildei (2); Yannick Blandin (2) and Christine Fernandez-Maloigne (1)

Affiliations:
(1) XLIM Research Institute, UMR CNRS 7252, University of Poitiers, France
(2) Centre de Recherches sur la Cognition et l'Apprentissage (UMR CNRS 7295), Université de Poitiers, Université de Tours, Centre National de la Recherche Scientifique, France
Keyword(s):
Deep Learning, Computer Vision, Human Action Recognition, Action Perception, RGB Video.
Abstract:
Human Action Recognition (HAR) is an important task for numerous computer vision applications. Recently, deep learning approaches have shown proficiency in recognizing actions in RGB video. However, existing models rely mainly on global appearance and may underperform in real-world applications such as sports events and clinical settings. Drawing on domain knowledge about how humans perceive action, we hypothesize that observing the dynamics of 2D human body joints extracted from RGB video frames is sufficient to recognize an action in video. Moreover, body joints carry structural information with strong spatial (intra-frame) and temporal (inter-frame) correlations between adjacent joints. In this paper, we propose a psychology-inspired twin-stream Gated Recurrent Unit (GRU) network for action recognition based on the dynamics of 2D human body joints in RGB videos. The proposed model achieves a classification accuracy of 89.97% in a subject-specific experiment and outperforms the baseline method that fuses depth and inertial sensor data on the UTD-MHAD dataset. The proposed framework is more cost-effective than depth-based 3D skeleton solutions while remaining highly competitive with them, and can therefore be used outside motion capture labs in real-world applications.
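To make the twin-stream idea concrete, the following is a minimal numpy sketch, not the authors' implementation: one GRU processes the sequence of 2D joint coordinates per frame (spatial stream), a second GRU processes frame-to-frame joint displacements (temporal stream), and the two final hidden states are concatenated for late fusion. All sizes (15 joints, 30 frames, hidden size 32) and the random inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with randomly initialized weights (illustrative only)."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # Each gate sees the concatenation [x_t, h_{t-1}].
        self.Wz = rng.normal(0, scale, (hidden_size, input_size + hidden_size))
        self.Wr = rng.normal(0, scale, (hidden_size, input_size + hidden_size))
        self.Wh = rng.normal(0, scale, (hidden_size, input_size + hidden_size))
        self.hidden_size = hidden_size

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                              # update gate
        r = sigmoid(self.Wr @ xh)                              # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

    def run(self, seq):
        h = np.zeros(self.hidden_size)
        for x in seq:
            h = self.step(x, h)
        return h  # final hidden state summarizes the sequence

# Hypothetical input: 30 frames of 15 body joints, each joint a 2D point.
n_joints, n_frames, hidden = 15, 30, 32
positions = np.random.default_rng(1).normal(size=(n_frames, n_joints * 2))
motion = np.diff(positions, axis=0)  # inter-frame joint displacements

pos_gru = GRUCell(n_joints * 2, hidden, seed=2)  # spatial stream
mot_gru = GRUCell(n_joints * 2, hidden, seed=3)  # temporal stream
fused = np.concatenate([pos_gru.run(positions), mot_gru.run(motion)])
print(fused.shape)  # fused feature for a downstream action classifier
```

A classifier head (e.g. a softmax layer over action classes) would then map the fused vector to action labels; in practice the weights are trained end-to-end rather than sampled randomly as here.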