Hand Pose Estimation (HPE) is one of the prominent
areas of pose estimation in CV, with several
real-world applications such as Virtual/Augmented
Reality (VR/AR), sign language recognition, and remote
surgery. In addition to the aforementioned
challenges of CNNs, HPE poses new challenges such as
self/object occlusion, size variability, high
dexterity, and depth ambiguity. Researchers have
therefore turned their attention to resolving these
issues. The model complexity of 2D HPE is a further
obstacle to making it applicable in the real world.
Numerous HPE approaches have been proposed, including
2D and 3D HPE based on RGB (Wang et al., 2018;
Chen et al., 2020; Pan et al., 2022), video (Khaleghi
et al., 2022; Ren et al., 2022), and depth (Ren et al.,
2022; Cheng et al., 2021), but they still struggle to
overcome these issues.
In this research, we propose a multi-stage
deformable convolution network named Deformable
Pose Network (DPN) for 2D HPE, designed with the
above challenges in mind. The deformable convolution
(Dai et al., 2017; Chen et al., 2021) focuses on
incorporating geometrical constraints into the
convolutional operation, while the backbone deals with
the hidden information needed to overcome the other
issues. The approach consists of two modules: the
backbone and the Deformable Convolution Block (DCB).
We use EfficientNet (EN) B0 as the backbone for
feature extraction, to strike a balance between
computational cost and model efficiency. For the DCB,
we follow the concept of the Convolutional Pose
Machine (CPM) (Wei et al., 2016), which uses a
six-stage Convolutional Block (CB) for information
processing; to deal with the geometrical constraints,
we replace the six-stage CB with a four-stage DCB.
These changes make the proposed model computationally
efficient and enhance its capability to learn the
unknown hidden information, including the geometrical
constraints, resulting in accurate 2D HPE.
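To make the idea of a deformable convolution concrete, the following is a minimal single-channel sketch in plain Python. It is not the DPN implementation: the function names, the 3x3 kernel size, and the hand-supplied offsets are assumptions made for illustration. In an actual network such as the one of Dai et al. (2017), the offsets are predicted by an additional convolutional layer and learned end to end. Each kernel tap samples the feature map at its regular grid position plus a fractional offset, using bilinear interpolation.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map at a fractional (y, x); positions outside the map count as zero."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                # Bilinear weight shrinks linearly with distance to the grid point.
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value

def deformable_conv_at(feature, kernel, offsets, cy, cx):
    """Output of a 3x3 deformable convolution at one location (cy, cx).

    offsets[k] = (dy, dx) is the shift of the k-th kernel tap (row-major order).
    Here the offsets are passed in by hand (an assumption for this sketch);
    a real layer would predict them from the input features.
    """
    out = 0.0
    k = 0
    for ky in (-1, 0, 1):
        for kx in (-1, 0, 1):
            dy, dx = offsets[k]
            out += kernel[ky + 1][kx + 1] * bilinear_sample(
                feature, cy + ky + dy, cx + kx + dx)
            k += 1
    return out

# Toy input: a 4x4 ramp, an all-ones kernel, and two offset settings.
feature = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
zero_offsets = [(0.0, 0.0)] * 9      # reduces to an ordinary 3x3 convolution
shifted_offsets = [(0.0, 0.5)] * 9   # every tap moved half a pixel to the right

plain = deformable_conv_at(feature, kernel, zero_offsets, 1, 1)
moved = deformable_conv_at(feature, kernel, shifted_offsets, 1, 1)
```

With all offsets set to zero the operation reduces to a standard convolution, which is why a deformable layer can be seen as a generalization of the regular one.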
The proposed approach is summarized below:
• We use a customized version of EfficientNet B0 as
the backbone for feature extraction, with the fully
connected layer removed; EfficientNet B0 is one
of the best models at balancing computational
efficiency and accuracy.
• The multi-stage deformable convolution network
deals with the geometrical constraints and helps
the model generalize by learning geometrical
transformations.
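The multi-stage arrangement can be sketched as a simple data flow. The snippet below assumes a CPM-style wiring in which every stage receives the backbone features together with the heatmaps of the previous stage; this wiring, the name `run_stages`, and the toy stage function are illustrative assumptions and are not taken from the DPN implementation.

```python
def run_stages(features, stages):
    """Apply refinement stages in sequence, CPM style.

    features: backbone output (any object the stages understand).
    stages:   list of callables; each is called as stage(features, previous),
              where previous is None for the first stage.
    Returns the output of every stage, so each one can receive its own loss.
    """
    outputs = []
    previous = None
    for stage in stages:
        previous = stage(features, previous)
        outputs.append(previous)
    return outputs

def toy_stage(features, previous):
    # Hypothetical stand-in for one DCB stage: blends features with earlier heatmaps.
    if previous is None:
        return list(features)
    return [0.5 * f + 0.5 * p for f, p in zip(features, previous)]

# Four stages, mirroring the four-stage DCB described above.
heatmaps_per_stage = run_stages([0.0, 1.0, 0.0], [toy_stage] * 4)
```

Keeping the output of every stage, rather than only the last one, allows intermediate supervision of the kind used in CPM.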
The article is organized as follows. Section 2
reviews the related work on 2D HPE, Section 3
explains the detailed network flow, Section 4
describes the experimental setups, Section 5
presents the experimental results and analysis, and
Section 6 summarizes the conclusion and future work.
2 RELATED WORK
Hand Pose Estimation (HPE) is a CV task that
involves localizing and identifying the keypoints
(joints) of a hand in a video or an image. As CNNs
(Schnürer et al., 2019; Charco et al., 2022) play a
crucial role in CV, researchers have actively proposed
different approaches to tackle the challenges in HPE.
To address the problem of self/object occlusion,
multi-view RGB models (Simon et al., 2017a; Joo et al.,
2015; Panteleris and Argyros, 2017) were proposed,
but they remain constrained by the requirement of
specific camera setups. On the other hand, depth-based
pose estimation models (Schnürer et al., 2019; Cheng
et al., 2021) achieve better accuracy based on depth
values, resulting in a fast process. However, these
models can be sensitive to the environment (e.g., noise
and lighting conditions). Owing to the widespread
adoption of RGB cameras for HPE tasks in recent years,
driven by their affordability, anti-interference
capabilities, and portability, many CNN-based
approaches using RGB images were proposed. CPM (Wei
et al., 2016) uses CNNs to generate heatmaps indicating
the location of each keypoint. Although CNNs tackle
some of the key challenges, they still struggle to deal
with geometrical constraints, self/object occlusion,
and high dexterity. To resolve these, we use the idea
of deformable convolution (Chen et al., 2021) in our
network to make it more generalized.
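Heatmap-based methods such as CPM output one heatmap per keypoint, and a common way to read off the 2D location is to take the position of the maximum response. The snippet below is a minimal sketch of that decoding step; the function name and the toy heatmaps are assumptions for illustration, and practical systems often add sub-pixel refinement, which is omitted here.

```python
def decode_keypoints(heatmaps):
    """Return one (row, col, score) triple per heatmap, taken at its maximum response."""
    keypoints = []
    for heatmap in heatmaps:
        best = (0, 0, heatmap[0][0])
        for r, row in enumerate(heatmap):
            for c, score in enumerate(row):
                if score > best[2]:
                    best = (r, c, score)
        keypoints.append(best)
    return keypoints

# Two toy 3x3 heatmaps standing in for two hand keypoints.
toy_heatmaps = [
    [[0.1, 0.2, 0.1],
     [0.0, 0.3, 0.9],
     [0.1, 0.2, 0.1]],
    [[0.7, 0.1, 0.0],
     [0.2, 0.1, 0.0],
     [0.0, 0.0, 0.0]],
]
decoded = decode_keypoints(toy_heatmaps)
```

The score returned with each location can serve as a rough confidence, for example to flag keypoints that are likely occluded.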
Recently, researchers have tried to reduce the
computational complexity of 2D HPE models while
striking a balance between accuracy and
computational cost (Salman et al., 2023a). CPM (Wei et al.,
2016) was one of the state-of-the-art lightweight
base models a few years ago. Yifei Chen et al.
(Chen et al., 2020) proposed an architecture based
on cascade structure regularization, consisting of
two lightweight modules, the Limb Deterministic Mask
(LDM) and the Limb Probabilistic Mask (LPM), each
of which can be used separately for 2D HPE.
Hinqing Yang et al. tried to improve those modules in
terms of accuracy and computational efficiency, with
some success. Tianhong Pan et al. (Pan et al., 2022)
optimized the CPM, reducing the complexity of the
model and improving the accuracy. Although the
above-mentioned methods are state-of-the-art
lightweight models, they are still not applicable in
many cases because of the computational complex-
Deformable Pose Network: A Multi-Stage Deformable Convolutional Network for 2D Hand Pose Estimation