
human pose and shape recovery by a temporal convo-
lutional transformer network. IET Computer Vision,
17(4):379–388.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An image is worth 16x16 words: Trans-
formers for image recognition at scale. In Interna-
tional Conference on Learning Representations.
Einfalt, M., Ludwig, K., and Lienhart, R. (2023). Up-
lift and upsample: Efficient 3d human pose estima-
tion with uplifting transformers. In Proceedings of
the IEEE/CVF Winter Conference on Applications of
Computer Vision (WACV), pages 2903–2913.
Fang, H.-S., Xie, S., Tai, Y.-W., and Lu, C. (2017). Rmpe:
Regional multi-person pose estimation. In 2017 IEEE
International Conference on Computer Vision (ICCV),
pages 2353–2362.
Hossain, M. R. I. and Little, J. J. (2018). Exploiting tem-
poral information for 3d human pose estimation. In
Ferrari, V., Hebert, M., Sminchisescu, C., and Weiss,
Y., editors, Computer Vision – ECCV 2018, pages 69–
86, Cham. Springer International Publishing.
Joo, H., Simon, T., Li, X., Liu, H., Tan, L., Gui, L.,
Banerjee, S., Godisart, T. S., Nabbe, B., Matthews,
I., Kanade, T., Nobuhara, S., and Sheikh, Y. (2017).
Panoptic studio: A massively multiview system for so-
cial interaction capture. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
Kang, Y., Liu, Y., Yao, A., Wang, S., and Wu, E. (2023). 3d
human pose lifting with grid convolution. In Proceed-
ings of the Thirty-Seventh AAAI Conference on Artifi-
cial Intelligence and Thirty-Fifth Conference on Inno-
vative Applications of Artificial Intelligence and Thir-
teenth Symposium on Educational Advances in Artifi-
cial Intelligence, AAAI’23/IAAI’23/EAAI’23. AAAI
Press.
Lea, C., Flynn, M. D., Vidal, R., Reiter, A., and Hager,
G. D. (2017). Temporal convolutional networks for
action segmentation and detection. In Proceedings of
the IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
Li, W., Du, R., and Chen, S. (2022). Skeleton-based spatio-
temporal u-network for 3d human pose estimation in
video. Sensors, 22(7).
Li, W., Liu, H., Ding, R., Liu, M., Wang, P., and Yang,
W. (2023). Exploiting temporal contexts with strided
transformer for 3d human pose estimation. IEEE
Transactions on Multimedia, 25:1282–1293.
Liu, J., Guang, Y., and Rojas, J. (2020). Gast-net:
Graph attention spatio-temporal convolutional net-
works for 3d human pose estimation in video. CoRR,
abs/2003.14179.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierarchi-
cal vision transformer using shifted windows. In Pro-
ceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV), pages 10012–10022.
Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Elgharib,
M., Fua, P., Seidel, H.-P., Rhodin, H., Pons-Moll, G.,
and Theobalt, C. (2020). XNect: Real-time multi-person
3D motion capture with a single RGB camera. ACM
Transactions on Graphics, volume 39.
Mehta, D., Sridhar, S., Sotnychenko, O., Rhodin, H.,
Shafiei, M., Seidel, H.-P., Xu, W., Casas, D., and
Theobalt, C. (2017). Vnect: Real-time 3d human pose
estimation with a single rgb camera. ACM Transactions
on Graphics, volume 36.
Min, Y., Chai, X., Zhao, L., and Chen, X. (2019). Flicker-
net: Adaptive 3d gesture recognition from sparse point
clouds. In BMVC, page 105.
Min, Y., Zhang, Y., Chai, X., and Chen, X. (2020). An ef-
ficient pointlstm for point clouds based gesture recog-
nition. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR).
Nie, Q., Liu, Z., and Liu, Y. (2023). Lifting 2d human pose
to 3d with domain adapted 3d body concept. Int. J.
Comput. Vision, 131(5):1250–1268.
Pavllo, D., Feichtenhofer, C., Grangier, D., and Auli, M.
(2019). 3d human pose estimation in video with tem-
poral convolutions and semi-supervised training. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017). Pointnet:
Deep learning on point sets for 3d classification and
segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR).
Sun, K., Xiao, B., Liu, D., and Wang, J. (2019). Deep high-resolution
representation learning for human pose estimation. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tang, Z., Qiu, Z., Hao, Y., Hong, R., and Yao, T. (2023).
3d human pose estimation with spatio-temporal criss-
cross attention. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 4790–4799.
Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M.,
and Solomon, J. M. (2019). Dynamic graph cnn for
learning on point clouds. ACM Trans. Graph., 38(5).
Wu, M. and Shi, P. (2023). Human pose estimation based on
a spatial temporal graph convolutional network. Ap-
plied Sciences, 13(5).
Yang, Y., Ren, Z., Li, H., Zhou, C., Wang, X., and Hua, G.
(2021). Learning dynamics via graph neural networks
for human pose estimation and tracking. In Proceed-
ings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 8074–8084.
Zhao, Q., Zheng, C., Liu, M., Wang, P., and Chen, C.
(2023). Poseformerv2: Exploring frequency domain
for efficient and robust 3d human pose estimation. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
8877–8886.
Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., and
Ding, Z. (2021). 3d human pose estimation with spa-
tial and temporal transformers. In Proceedings of the
IEEE/CVF International Conference on Computer Vi-
sion (ICCV), pages 11656–11665.
Škorvánková, D. and Madaras, M. (2021). Human pose
estimation using per-point body region assignment.
Computing and Informatics, 40(2):387–407.
IMPROVE 2024 - 4th International Conference on Image Processing and Vision Engineering