Deformable Pose Network: A Multi-Stage Deformable Convolutional

Network for 2D Hand Pose Estimation

Sartaj Ahmed Salman

1 a

, Ali Zakir

1 b

and Hiroki Takahashi

1,2

Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications,

Tokyo, Japan

Artiﬁcial Intelligence Exploration/Meta-Networking Research Center, The University of Electro-Communications, Tokyo,

Japan

Keywords:

Deformable Convolution, Multi-Stage DC, EfﬁcientNet, 2D HPE.

Abstract:

Hand pose estimation undergoes a signiﬁcant advancement with the evolution of Convolutional Neural Net-

works (CNNs) in the ﬁeld of computer vision. However, existing CNNs fail in many scenarios in learning

the unknown transformations and geometrical constraints along with the other existing challenges for accu-

rate estimation of hand keypoints. To tackle these issues we proposed a multi-stage deformable convolutional

network for accurate 2D hand pose estimation from monocular RGB images while considering the computa-

tional complexity. We utilized EfﬁcientNet as a backbone due to its powerful feature extraction capability, and

deformable convolution to learn about the geometrical constraints. Our proposed model called Deformable

Pose Network (DPN) outperforms in predicting the 2D keypoints in complex scenarios. Our analysis on the

Panoptic studio hand dataset shows that our proposed model improves the accuracy by 2.36% and 7.29% as

compared to existing methods i.e., OCPM and CPM respectively.

1 INTRODUCTION

Convolutional Neural Networks (CNNs) have under-

gone considerable advancements and achieved sub-

stantial success in several applications such as visual

recognition tasks such as pose estimation (Salman

et al., 2023c; Salman et al., 2023b; Simon et al.,

2017a; Kong et al., 2020; Zakir et al., 2024), ob-

ject detection (Girshick et al., 2014) semantic seg-

mentation (Long et al., 2015), and image classiﬁca-

tion (Krizhevsky et al., 2017). Their capability of

modeling geometric transformation comes from ex-

tensive data augmentation, the large model capacity,

and some hand-crafted modules (e.g., max pooling

(Boureau et al., 2010)). Despite the merits, CNNs

underperform in terms of modeling geometric trans-

formations in object pose, viewpoint, scale, and part

deformation. First, they are assumed to be known and

ﬁxed the data augmentation, features, and algorithms

were designed on these assumptions which prevent

generalization of a new task processing the unknown

geometric transformation, which is not properly mod-

eled. Second, even when the transformations are un-

https://orcid.org/0000-0001-9344-6658

https://orcid.org/0000-0002-3187-9551

known, hand-crafted designs of invariant features and

algorithms are not feasible and it’s difﬁcult to overlay

these transformations.

However, CNNs are limited to unknown transfor-

mations and large models and the origination of these

limitations is from the ﬁxed geometric structures of

the CNN modules. Speciﬁcally, the convolution unit

samples information from distinct points in the in-

put feature maps, while reducing spatial resolution

by a ﬁxed ratio using pooling layers. Similarly, a

RoI (region-of-interest) pooling layer segments a RoI

into a set of spatial bins. There the model fails to

handle the geometric transformations causing a no-

ticeable problem i.e., the ﬁeld sizes of the activation

units of the same CNN layers are the same, which

is quite undesirable for the high-level layers that en-

code the semantic over spatial locations. These differ-

ent locations may correspond to the object with dif-

ferent scales or deformations, for visual recognition

with ﬁne localization adaptive determination of scales

is favorable (Long et al., 2015). In object detection

(Girshick et al., 2014), they rely on the features that

are extracted based on the primitive bounding boxes,

and in pose estimation (Zakir et al., 2024; Zakir et al.,

2023) the geometric constraints of the keypoints.

814

Salman, S., Zakir, A. and Takahashi, H.

Deformable Pose Network: A Multi-Stage Deformable Convolutional Network for 2D Hand Pose Estimation.

DOI: 10.5220/0012569000003660

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2024) - Volume 3: VISAPP, pages

814-821

ISBN: 978-989-758-679-8; ISSN: 2184-4321

In pose estimation, Hand Pose Estimation (HPE)

is one of the prominent areas of CV with several

real-world applications such as Virtual/Augmented

Reality (VR/AR), sign language recognition, remote

surgery, and so on. In addition to the aforemen-

tioned challenges of CNNs, HPE poses some new

challenges such as self/object occlusion, size variabil-

ity, high dexterity, and depth ambiguity. As a re-

sult, researchers turned their attention to resolving the

above-mentioned issues, the model complexity in 2D

HPE is also one of the issues causing trouble in mak-

ing it more applicable in the real world. Despite these,

numerous HPE approaches were proposed, including

2D and 3D HPE based on RGB (Wang et al., 2018;

Chen et al., 2020; Pan et al., 2022), video (Khaleghi

et al., 2022; Ren et al., 2022), and depth(Ren et al.,

2022; Cheng et al., 2021) but still struggling to over-

come these issues.

In this research, we proposed a multi-stage de-

formable convolution network named Deformable

Pose Network (DPN) for 2D HPE keeping in mind the

above challenges, the deformable convolution (Dai

et al., 2017; Chen et al., 2021) especially focuses

on incorporating the geometrical constraints into the

convolutional operation and the backbone deals with

the hidden information overcoming the other issues.

This approach consists of two modules one is the

backbone and the other is the Deformable Convolu-

tion Block (DCB), we utilized the EfﬁcientNet (EN)

B0 as a backbone for feature extraction, to strike

the balance between the computational cost and the

model efﬁciency. As a DCB, we used the concept

of Convolutional Pose Machine (CPM) (Wei et al.,

2016) that utilizes a six-stage Convolutional Block

(CB) for information processing, instead of the CB to

deal with the geometrical constraints we replaced the

six-stage CB with a four-stage DCB. These changes

make our proposed model computationally efﬁcient

and enhance the model’s capability to learn the un-

known hidden information including the geometrical

constraints, resulting in accurate 2D HPE.

The proposed approach is summarized below:

• We utilized the customized EfﬁcientNet B0 ver-

sion as a backbone by removing the fully con-

nected layer for feature extraction, which is one

of the best models striking the balance between

computation efﬁciency and accuracy.

• The multi-stage deformable convolution network

deals with the geometrical constraints and helps

the model to be more generalized to learn the ge-

ometrical transformations.

The article consists of the following sections, Sec-

tion 2 includes the related work on 2D HPE, the de-

tailed network ﬂow is explained in Section 3, exper-

imental setups are explained in Section 4, Section 5

presents the experimental results and analysis, and

the conclusion and the future work are summarized

in Section 6.

2 RELATED WORK

Hand Pose Estimation (HPE) is a CV task that in-

volves localizing and identifying the hand keypoints

(joints) of a hand in a video or an image. As CNNs

(Schn

urer et al., 2019; Charco et al., 2022) play a

crucial role in CV, researchers have actively proposed

different approaches to tackle the challenges in HPE,

to address the problem of self/object occlusion multi-

view RGB models (Simon et al., 2017a; Joo et al.,

2015; Panteleris and Argyros, 2017) were proposed,

but still constrained with a requirement of speciﬁc

camera setups. On the other hand, depth-based pose

estimation models (Schn

urer et al., 2019; Cheng et al.,

2021) achieve better accuracy based on depth values,

resulting in a fast process. However, these models can

be sensitive to the environment (i.e., noise, lightning

conditions, and so on). Widespread adoption of RGB

cameras in recent years for HPE tasks due to their af-

fordability, anti-inference capabilities, and portability

many approaches were proposed based on CNNs us-

ing RGB images. CPM (Wei et al., 2016), enforces

CNNs to generate heatmaps indicating the location of

each keypoints. Although CNNs tackle some of the

key challenges but still struggle to deal with the geo-

metrical constraints, self/object occlusion, and high-

dexterity, to resolve these we utilized the idea of de-

formable convolutional (Chen et al., 2021) in our net-

work to make it more generalized.

In recent days, researchers tried to reduce the

computational complexity of 2D HPE models, while

striking the balance between accuracy and computa-

tional cost (Salman et al., 2023a). CPM (Wei et al.,

2016) was one of the state-of-the-art lightweight

base models a few years back. Yifei Chen et al.

(Chen et al., 2020) proposed an architecture based

on cascade structure regularization, consisting of

two lightweight modules Limb Deterministic Mask

(LDM) and Limb Probabilistic Mask (LPM), and

each module can be utilized separately for 2D HPE.

Hinqing Yang et al. tried to improve those modules in

terms of accuracy and computational efﬁciency and

somehow succeeded in this. In (Pan et al., 2022)

Tianhong Pan et al. optimized the CPM reducing the

complexity of the models and improving the accuracy.

However, the above-mentioned methods are state-of-

the-art lightweight models but still not applicable in

many cases because of the computational complex-

Deformable Pose Network: A Multi-Stage Deformable Convolutional Network for 2D Hand Pose Estimation

815

Figure 1: Detailed overview of Deformable Pose Network.

ity and high energy consumption. To tackle this we

utilize the stages idea of CPM and reduce the num-

ber of stages to reduce the model complexity without

affecting the accuracy (balancing the computational

cost and accuracy).

3 DEFORMABLE POSE

NETWORK

Generally, 2D HPE using heatmaps involves the key-

points detection to get the actual hand pose P, from

an RGB image or a video frame I. Consider, K as

a set of keypoints, wherein each keypoint k

repre-

sents a distinct region on the hand such as joints or

ﬁngertips. These keypoint k

are symbolized by indi-

vidual heatmaps H, forming the objective to predict

the heatmaps of each keypoint {H

, ., ., H

}. Conse-

quently, the pose P = {(x

, y

), (x

, y

), . . . , (x

, y

)}

denotes the coordinates with the highest probability

in each heatmap. The count of keypoints K varies

across the datasets, commonly comprising 21 key-

points. Therefore, for the given input we seek to esti-

mate the pose P, expressed as the set of keypoints as

described in Algorithm 1.

We proposed a new approach named Deformable

Pose Network (DPN) for efﬁcient and accurate 2D

HPE. A multi-stage deformable convolution is uti-

lized in our work inspired by the workﬂow of CPM

stages, combining the power of EN as a backbone for

feature extraction. Figure 1 shows the detailed archi-

tecture of our proposed method.

3.1 Modiﬁed EfﬁcientNet Version B0

for Enhanced Feature Extraction

EN, a state-of-the-art network is utilized as a back-

bone of our proposed approach for feature extraction.

EN is known for its ability to balance the model accu-

Data: RGB image or video frame I

Result: Estimated hand pose P represented as

keypoints

Initialize P =

0 (Set to store keypoints);

Detect keypoints K representing distinct hand

regions in I;

for i = 1 to K do

Generate heatmap H

for k

in I;

Extract coordinates (x

, y

) with highest

probability from H

;

Add (x

, y

) to P as a keypoint;

end

Return P as the estimated hand pose;

Algorithm 1: 2D Hand Pose Estimation using Heatmaps.

racy and computational cost, based on this there are

many versions of EN (B0-B7) each version varies in

depth, and B0 the lightest version is utilized in our

framework. Figure 2 shows the architecture of the

modiﬁed EN acting as a backbone in our network. We

employed the B0 version of EN to reduce the com-

plexity in comparison with its variants and the other

feature extraction networks (i.e. RestNet, VGG, and

more). The modiﬁed B0 consists of seven blocks,

containing varying numbers of MBConvs which are

the structure of MobileNetv2, it further includes the

removal of ﬁnal fully connected convolution layers,

reducing the model’s parameters and enabling it for

feature extraction. The input data goes through sev-

eral layers in a sequential process, the input is sub-

jected to a 3 × 3Conv, followed by the MBConvs op-

erations. The ﬁnal layer of the EN outputs 64 feature

maps and passes to the deformable convolution block

for further processing as shown in Figure 2.

3.2 Information Processing DCB

The CPM is one of the baseline CNN-based pose

estimation models, which deals with the complexi-

ties involved in HPE. However, it encounters lim-

VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications

816

Figure 2: Overall architecture of modiﬁed EfﬁcientNet BO.

itations in HPE due to unknown geometrical con-

straints and other mentioned challenges. To address

these issue within the CNN-based models we inte-

grated the DC, which focus on managing geometrical

constraints and enhancing the model’s adaptability in

learning the unknown features during the information

processing. Our proposed approach is a four-stage

network, The initial stage consists of two 3 × 3 DCBs

with a channel count of 256. Subsequent stages con-

sist of seven 3 × 3 DCBs, each with 128 channels.

The detailed overview of this information processing

DCB is shown in Figure 3.

The output feature maps generated by the back-

bone are directed to the DCB initial stage of our net-

work for subsequent information processing. Within

each stage, the DC, comprising two Convolutional

Layers (CL)layers offset CL and a modulator CL, and

a DC operation which is discussed in detail below:

3.2.1 Offset CL

It computes spatial offsets through learnable param-

eterization from the input feature map x which is the

output feature map of the backbone, denoted by OF.

It can be mathematically represented as Eq 1.

OF = OFC(x) (1)

Where OFC denotes the convolutional operation on

the input x for the computation of the offsets. Which

helps to determine the sampling location in x, making

it ﬂexible to receptive ﬁelds.

3.2.2 Modulator CL

It governs the signiﬁcance or modulation of sampled

regions generating modulation weights by leveraging

the output of a sigmoid function as shown in Eq 2:

M = 2 × σ(MC(x)) (2)

Here, M represents the modulator, σ, and MC denotes

the sigmoid function and the convolutional operation

respectively. This factor helps in adaptive feature ad-

justments according to their importance.

3.2.3 DC Operation

After the ﬁrst two CLs, DC operations play the role

that is the core of DC, integrating the offset and mod-

ulator with the regular CL. Mathematically this oper-

ation can be expressed as in Eq 3:

x = de f orm 2d(x, OF, w, b, M) (3)

Here, x denotes the input feature map, OF signiﬁes

the spatial offsets, w, b, and M represents the con-

volutional weights, bias, and the modulating factor

respectively. The incorporation dynamically adjusts

the receptive ﬁelds, enabling the model’s capabilities

to learn adaptive features and geometrical constraints,

and the ﬁnal output from the initial stage progresses

to the second stage.

The sequence iterates across all four stages and in

the ﬁnal stage we got the 21 ﬁnal keypoints. Along

with the DC, we reduced the number of stages and

channels in contrast to CPM, enhancing the over-

all adaptability and computational efﬁciency of our

model.

Deformable Pose Network: A Multi-Stage Deformable Convolutional Network for 2D Hand Pose Estimation

817

Figure 3: Detailed overview of stages of deformable convolution block.

4 EXPERIMENTAL SETUP

4.1 Dataset

In our research, we utilized a publicly available data

set The Carnegie Mellon University Panoptic Hand

Dataset (CMU) (Simon et al., 2017b) from Panoptic

Studio to evaluate our proposed model. The dataset

includes 14,817 annotations of the right hand of in-

dividuals captured at the studio, the distribution is

shown in Table 1. As our research is HPE to achieve

this objective the annotated image patches were ex-

tracted from the full image using a box size of 2.2

times larger than the hand. The dataset is randomly

divided into three subgroups by a random sampling

technique for training, validation, and testing com-

prised of 80%, 10%, and 10% of the dataset respec-

tively.

Table 1: CMU panoptic hand dataset distribution.

Dataset Training Validation Test

CMU Panoptic 11,853 1482 1482

4.2 Implementation Details

We implemented our model using the PyTorch frame-

work, with a batch size of 64 and a learning

rate of 0.0001. The model is trained up to 100

epochs. The input images were scaled to [0, 1]

and normalized using a mean and standard deviation

of (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225) re-

spectively. The Mean Squared Error (MSE) is utilized

as a loss function. To prevent the loss from decreas-

ing to an extremely low value, the loss function is ad-

justed using a scaling factor of 35.

4.3 Activation Function and Model

Optimizer

To incorporate nonlinear aspects into the network,

various activation functions were proposed such as

ReLU (Banerjee et al., 2019), Softmax (Sharma et al.,

2017), and, Mish (Misra, 2020). However, Mish out-

performs others notably, due to its nonlinear nature,

its mathematical representation is as follows:

f (x) = x tanh(ln(1 + e

)) (4)

Experimental results highlight Mish’s superior ef-

ﬁciency over other activation functions.

The model optimizers aim to decrease the loss

function and enhance network performance by ﬁnd-

ing the best parameter values. We adopted a newly

derived version of the Adam optimizer called AdamW

can signiﬁcantly bolster model optimization tech-

niques. In contrast to the Adam optimizer, the

AdamW algorithm separates the weight decay com-

ponent from the learning rate, enabling individual-

ized optimization of each component. This feature

effectively addresses the issue of excessive overﬁt-

ting. The results indicate that the model optimized

with AdamW demonstrates better generalization per-

formance. The AdamW optimizer was employed in

the training of our proposed approach.

4.4 Evaluation Metric

As an evaluation metric commonly used for pose es-

timation Percentage of Correct Keypoints (PCK) was

utilized in this study. It measures the probability that

the predicted keypoints fall in a speciﬁed threshold

distance, represented as σ from the ground truth. σ

was uniformly distributed in a range of 0.04 to 0.12,

VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications

818

Table 2: Numerical comparison of DPM with other models on CMU panoptic hand dataset.

Threshold σ 0.04 0.06 0.08 0.10 0.12 Average Improvement

CPM(Wei et al., 2016) 56.76 74.66 82.50 86.67 89.45 78.01 –

LDM-6(Chen et al., 2020) 59.51 76.19 83.77 87.83 90.27 79.51 1.50

LPM-6(Chen et al., 2020) 60.71 77.60 84.93 88.76 91.10 80.62 2.61

OCPM(Pan et al., 2022) 63.67 80.26 87.10 90.65 93.01 82.94 4.93

DPN 67.19 82.81 89.27 92.63 94.48 85.30 7.29

it isformulated as:

PCK

||D||

∑



||p

− p

max(w, h)

≤ σ



(5)

Here p

represents the keypoints ground truth, 1 is

the indicator function, p

denotes the predicted key-

points, k for the number of keypoints, D refers to the

number of test or validation sample, and w and h indi-

cates the height and width of the input image respec-

tively.

5 RESULTS AND ANALYSIS

In this section, we discuss the performance analysis

of our proposed network and compare it with various

HPE methodologies.

5.1 Quantitative Results

The results presented in Table 2 are the quantita-

tive analysis of our proposed model numerically and

graphically. The results indicate that our proposed

model achieves an improvement of 5.13 % at σ 0.12

and an average improvement of 7.29 % in comparison

to CPM (Wei et al., 2016). Against OCPM (Pan et al.,

2022) it achieves 1.57 % at σ 0.12 and 2.36 % on an

average. Figure 4 depicts a PCK comparison of DPM

with CPM, LDM-6, LPM-6, and OCPM, demonstrat-

ing its superior performance over existing lightweight

methods.

To compare the computational complexity we did

a parameter comparison, excluding LDM-6 and LPM-

6 due to the absence of parameters. Table 3 indi-

cates that our proposed architecture has fewer param-

eters in comparison with CPM and OCPM, signifying

the reduction of the computational complexity of our

methodology.

Table 3: Parameters comparison.

Model Parameters(M) Flops(G)

CPM(Wei et al., 2016) 36.80 103.23

OCPM(Pan et al., 2022) 29.28 80.53

DPN 8.55 16.38

Figure 4: PCK comparison with other lightweight 2D HPE

models.

5.2 Qualitative Results

To evaluate the effectiveness of DPN visually, we ran-

domly select the images from the test set as input for

the visualization. Figure 5 illustrates that our pro-

posed network shows effective results the model’s ef-

ﬁciency on low light and blurred images is notewor-

thy. The ﬁndings suggested that our proposed DPM

tends to perform better than the other lightweight

state-of-the-art models.

5.3 Ablation Study

To demonstrate the effectiveness of DC in the stages

we perform an ablation study by training the network

without DC, the results reveal that DC performs bet-

ter in comparison with the convolution even the model

we trained without is a six-stage network with more

parameters. The numerical results in Table 4 show the

incorporation of DC promisingly improves the net-

work performance in terms of accuracy.

Table 4: Comparison of six-stages without DC and four-

stage with DC.

Threshold σ 0.04 0.06 0.08 0.10 0.12 Average

Six-stage without DC 62.09 78.82 85.64 90.27 92.42 81.85

Four-stage with DC 67.19 82.81 89.27 92.63 94.48 85.30

Deformable Pose Network: A Multi-Stage Deformable Convolutional Network for 2D Hand Pose Estimation

819

Figure 5: Visual illustration of predicted hand keypoints.

6 CONCLUSIONS

In this paper, we proposed a lightweight multi-stage

deformable convolutional network for 2D hand pose

estimation. To learn the hidden information Efﬁcient-

Net was used as a backbone for enhanced feature ex-

traction. To deal with the geometrical constraints we

utilized deformable convolution in each stage instead

of traditional convolutions. Evaluation on a publicly

available CMU hand dataset, our proposed approach

outperformed the state-of-the-art networks in terms of

accuracy and computational complexity. With the po-

tential of real-world application of hand pose estima-

tion in AR, VR, HCI and so on we will extend our

work to 3D HPE.

REFERENCES

Banerjee, C., Mukherjee, T., and Pasiliao Jr, E. (2019). An

empirical study on generalizations of the relu activa-

tion function. In Proceedings of the 2019 ACM South-

east Conference, pages 164–167.

Boureau, Y.-L., Ponce, J., and LeCun, Y. (2010). A theoret-

ical analysis of feature pooling in visual recognition.

In Proceedings of the 27th international conference on

machine learning (ICML-10), pages 111–118.

Charco, J. L., Sappa, A. D., and Vintimilla, B. X. (2022).

Human pose estimation through a novel multi-view

scheme. In VISIGRAPP (5: VISAPP), pages 855–862.

Chen, F., Wu, F., Xu, J., Gao, G., Ge, Q., and Jing, X.-Y.

(2021). Adaptive deformable convolutional network.

Neurocomputing, 453:853–864.

Chen, Y., Ma, H., Kong, D., Yan, X., Wu, J., Fan, W., and

Xie, X. (2020). Nonparametric structure regulariza-

tion machine for 2D hand pose estimation. In Pro-

ceedings of the IEEE/CVF Winter Conference on Ap-

plications of Computer Vision, pages 381–390.

Cheng, W., Park, J. H., and Ko, J. H. (2021). Handfold-

ingnet: A 3d hand pose estimation network using

multiscale-feature guided folding of a 2d hand skele-

ton. In Proceedings of the IEEE/CVF Conference on

Computer Vision, pages 11260–11269.

Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and

Wei, Y. (2017). Deformable convolutional networks.

In Proceedings of the IEEE Conference on Computer

Vision, pages 764–773.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014).

Rich feature hierarchies for accurate object detection

and semantic segmentation. In Proceedings of the

IEEE Conference on Computer Vision and Pattern

Recognition, pages 580–587.

Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews,

I., Kanade, T., Nobuhara, S., and Sheikh, Y. (2015).

Panoptic studio: A massively multiview system for

social motion capture. In Proceedings of the IEEE

VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications

820

International Conference on Computer Vision, pages

3334–3342.

Khaleghi, L., Sepas-Moghaddam, A., Marshall, J., and

Etemad, A. (2022). Multi-view video-based 3d hand

pose estimation. IEEE Transactions on Artiﬁcial In-

telligence.

Kong, D., Ma, H., Chen, Y., and Xie, X. (2020). Rotation-

invariant mixed graphical model network for 2D hand

pose estimation. In Proceedings of the IEEE/CVF

winter Conference on Applications of Computer Vi-

sion, pages 1546–1555.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). Im-

agenet classiﬁcation with deep convolutional neural

networks. Communications of the ACM, 60(6):84–90.

Long, J., Shelhamer, E., and Darrell, T. (2015). Fully con-

volutional networks for semantic segmentation. In

Proceedings of the IEEE Conference on Computer Vi-

sion and Pattern Recognition, pages 3431–3440.

Misra, D. (2020). Mish: A self regularized non-monotonic

activation function. BMVC.

Pan, T., Wang, Z., and Fan, Y. (2022). Optimized convo-

lutional pose machine for 2D hand pose estimation.

Journal of Visual Communication and Image Repre-

sentation, 83:103461.

Panteleris, P. and Argyros, A. (2017). Back to rgb: 3d

tracking of hands and hand-object interactions based

on short-baseline stereo. In Proceedings of the IEEE

International Conference on Computer Vision Work-

shops, pages 575–584.

Ren, P., Sun, H., Hao, J., Wang, J., Qi, Q., and Liao, J.

(2022). Mining multi-view information: a strong self-

supervised framework for depth-based 3d hand pose

and mesh estimation. In Proceedings of the IEEE/CVF

Conference on Computer Vision and Pattern Recogni-

tion, pages 20555–20565.

Salman, S. A., Zakir, A., Benitez-Garcia, G., and Taka-

hashi, H. (2023a). Acenet: Attention-driven contex-

tual features-enhanced lightweight efﬁcientnet for 2d

hand pose estimation. In 2023 38th International Con-

ference on Image and Vision Computing New Zealand

(IVCNZ), pages 1–6.

Salman, S. A., Zakir, A., and Takahashi, H. (2023b). Cas-

caded deep graphical convolutional neural network for

2D hand pose estimation. In International Workshop

on Advanced Imaging Technology (IWAIT) 2023, vol-

ume 12592, pages 227–232. SPIE.

Salman, S. A., Zakir, A., and Takahashi, H. (2023c). SDF-

PoseGraphNet: spatial deep feature pose graph net-

work for 2d hand pose estimation. Sensors, 23(22).

Schn

urer, T., Fuchs, S., Eisenbach, M., and Groß, H.-M.

(2019). Real-time 3d pose estimation from single

depth images. In VISIGRAPP (5: VISAPP), pages

716–724.

Sharma, S., Sharma, S., and Athaiya, A. (2017). Activa-

tion functions in neural networks. Towards Data Sci,

6(12):310–316.

Simon, T., Joo, H., Matthews, I., and Sheikh, Y. (2017a).

Hand keypoint detection in single images using multi-

view bootstrapping. In Proceedings of the IEEE Con-

ference on Computer Vision and Pattern Recognition,

pages 1145–1153.

Simon, T., Joo, H., and Sheikh, Y. (2017b). Hand keypoint

detection in single images using multiview bootstrap-

ping. CVPR.

Wang, Y., Peng, C., and Liu, Y. (2018). Mask-pose cas-

caded cnn for 2d hand pose estimation from single

color image. IEEE Transactions on Circuits and Sys-

tems for Video Technology, 29(11):3258–3268.

Wei, S.-E., Ramakrishna, V., Kanade, T., and Sheikh, Y.

(2016). Convolutional pose machines. In Proceed-

ings of the IEEE Conference on Computer Vision and

Pattern Recognition, pages 4724–4732.

Zakir, A., Salman, S. A., Benitez-Garcia, G., and Taka-

hashi, H. (2023). Aeca-prnetcc: Adaptive efﬁ-

cient channel attention-based poseresnet for coordi-

nate classiﬁcation in 2d human pose. In 2023 38th

International Conference on Image and Vision Com-

puting New Zealand (IVCNZ), pages 1–6.

Zakir, A., Salman, S. A., and Takahashi, H. (2024). Sahf-

lightposeresnet: Spatially-aware attention-based hier-

archical features enabled lightweight poseresnet for

2d human pose estimation. In Park, J. S., Takizawa,

H., Shen, H., and Park, J. J., editors, Parallel and Dis-

tributed Computing, Applications and Technologies,

pages 43–54, Singapore. Springer Nature Singapore.

Deformable Pose Network: A Multi-Stage Deformable Convolutional Network for 2D Hand Pose Estimation

821