Kanazawa, A., Black, M. J., Jacobs, D. W., and Malik,
J. (2017). End-to-end recovery of human shape and
pose.
Kanazawa, A., Zhang, J. Y., Felsen, P., and Malik, J. (2018).
Learning 3d human dynamics from video.
Kendall, A., Grimes, M., and Cipolla, R. (2015). Posenet: A
convolutional network for real-time 6-dof camera
relocalization. In 2015 IEEE International Conference
on Computer Vision (ICCV), pages 2938–2946.
Kingma, D. P. and Ba, J. (2014). Adam: A method for
stochastic optimization.
Kocabas, M., Athanasiou, N., and Black, M. J. (2019).
Vibe: Video inference for human body pose and shape
estimation.
Kocabas, M., Huang, C.-H. P., Hilliges, O., and Black, M. J.
(2021a). Pare: Part attention regressor for 3d human
body estimation.
Kocabas, M., Huang, C.-H. P., Tesch, J., Müller, L.,
Hilliges, O., and Black, M. J. (2021b). Spec github
repository. GitHub repository.
Kocabas, M., Huang, C.-H. P., Tesch, J., Müller, L.,
Hilliges, O., and Black, M. J. (2021c). Spec: Seeing
people in the wild with an estimated camera.
Kolotouros, N., Pavlakos, G., Black, M. J., and Daniilidis,
K. (2019). Learning to reconstruct 3d human pose and
shape via model-fitting in the loop.
Lee, J., Go, H., Lee, H., Cho, S., Sung, M., and Kim, J.
(2021). Ctrl-c: Camera calibration transformer with
line-classification.
Lee, J.-H. and Kim, C.-S. (2019). Monocular depth
estimation using relative depth maps. pages 9721–9730.
Li, J., Wang, C., Liu, W., Qian, C., and Lu, C. (2020a).
Hmor: Hierarchical multi-person ordinal relations for
monocular multi-person 3d pose estimation.
Li, J., Wang, C., Zhu, H., Mao, Y., Fang, H.-S., and Lu,
C. (2018). Crowdpose: Efficient crowded scenes pose
estimation and a new benchmark.
Li, Y.-L., Liu, X., Lu, H., Wang, S., Liu, J., Li, J., and Lu,
C. (2020b). Detailed 2d-3d joint representation for
human-object interaction.
Li, Y.-L., Xu, L., Liu, X., Huang, X., Xu, Y., Wang, S.,
Fang, H.-S., Ma, Z., Chen, M., and Lu, C. (2020c).
Pastanet: Toward human activity knowledge engine.
Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C.,
and Freeman, W. T. (2019). Learning the depths of
moving people by watching frozen people.
Li, Z. and Snavely, N. (2018). Megadepth: Learning
single-view depth prediction from internet photos.
Lin, J. and Lee, G. H. (2020). Hdnet: Human depth
estimation for multi-person camera-space localization.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick,
R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L.,
and Dollár, P. (2014). Microsoft coco: Common
objects in context.
Liu, R., Lehman, J., Molino, P., Such, F. P., Frank, E.,
Sergeev, A., and Yosinski, J. (2018). An intriguing
failing of convolutional neural networks and the
coordconv solution.
Lo Presti, L. and La Cascia, M. (2016). 3d skeleton-based
human action classification: A survey. Pattern
Recognition, 53:130–147.
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., and
Black, M. (2015). Smpl: a skinned multi-person linear
model. volume 34.
Luo, B. and Hancock, E. R. (1999). Feature matching with
procrustes alignment and graph editing. In Image
Processing And Its Applications, 1999. Seventh International
Conference on (Conf. Publ. No. 465), volume 1,
pages 72–76.
Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko,
O., Xu, W., and Theobalt, C. (2016). Monocular 3d
human pose estimation in the wild using improved cnn
supervision.
Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Sridhar,
S., Pons-Moll, G., and Theobalt, C. (2017). Single-shot
multi-person 3d pose estimation from monocular
rgb.
Moon, G., Chang, J. Y., and Lee, K. M. (2019). Camera
distance-aware top-down approach for 3d multi-person
pose estimation from a single rgb image.
Patel, P., Huang, C.-H. P., Tesch, J., Hoffmann, D. T.,
Tripathi, S., and Black, M. J. (2021). AGORA: Avatars in
geography optimized for regression analysis. In
Proceedings IEEE/CVF Conf. on Computer Vision and
Pattern Recognition (CVPR).
Qi, S., Wang, W., Jia, B., Shen, J., and Zhu, S.-C. (2018).
Learning human-object interactions by graph parsing
neural networks.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster
r-cnn: Towards real-time object detection with region
proposal networks.
Rogez, G., Weinzaepfel, P., and Schmid, C. (2019).
LCR-net++: Multi-person 2d and 3d pose detection in
natural images. IEEE Transactions on Pattern Analysis
and Machine Intelligence, pages 1–1.
Sun, Y., Bao, Q., Liu, W., Fu, Y., Black, M. J., and Mei, T.
(2020). Monocular, one-stage, regression of multiple
3d people.
von Marcard, T., Henschel, R., Black, M., Rosenhahn, B.,
and Pons-Moll, G. (2018). Recovering accurate 3d
human pose in the wild using imus and a moving camera.
In European Conference on Computer Vision (ECCV).
Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao,
Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., and
Xiao, B. (2019). Deep high-resolution representation
learning for visual recognition.
Workman, S., Greenwell, C., Zhai, M., Baltenberger, R.,
and Jacobs, N. (2015). Deepfocal: A method for direct
focal length estimation. In 2015 IEEE International
Conference on Image Processing (ICIP), pages 1369–1373.
Workman, S., Zhai, M., and Jacobs, N. (2016). Horizon
lines in the wild.
Xu, Y., Zhu, S.-C., and Tung, T. (2019). Denserac: Joint
3d pose and shape estimation by dense
render-and-compare.
VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications