and identically distributed (i.i.d.) with regard to the
training set. Instead, one ought to use one or several
datasets for testing and evaluation that differ systematically
from the training dataset, making them
out-of-distribution (o.o.d.).
As mentioned previously, this is infeasible here,
which means that the final model evaluation is susceptible
to shortcut learning. The evaluation data is
related to the training data in two ways. First, the
evaluation data comes from the same dataset, making
it i.i.d. with respect to the motions it contains.
Second, the noise used to generate the input motions
is not the actual noise introduced by a webcam-based
pose estimation pipeline, but the same noise model
used when training the network.
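The overlap described above can be illustrated with a minimal sketch. The Gaussian noise model and the function names below are assumptions for illustration only, not the paper's actual noise estimation; the point is that when the same synthetic noise model corrupts both splits, the evaluation inputs remain i.i.d. with the training inputs and never probe o.o.d. noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_synthetic_noise(motion, std=0.05, rng=rng):
    """Corrupt a (frames, joints, 3) motion with i.i.d. Gaussian noise.

    A stand-in for a synthetic noise model; a real webcam-based
    pose estimation pipeline would instead introduce structured,
    temporally correlated errors.
    """
    return motion + rng.normal(scale=std, size=motion.shape)

clean = np.zeros((120, 22, 3))  # placeholder motion clip

# Both splits are corrupted by the *same* noise model, so the
# evaluation inputs are i.i.d. with the training inputs -- the
# model is never tested on out-of-distribution noise.
train_input = add_synthetic_noise(clean)
eval_input = add_synthetic_noise(clean)
```

A truly o.o.d. evaluation would replace `add_synthetic_noise` on the evaluation split with motions passed through an actual pose estimation pipeline.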
As such, the validation data used represents the
best possible effort, given the limited data availability
for this task. However, should datasets for skinned
human motion augmentation become generally available,
it would be desirable to re-evaluate the
model on those o.o.d. datasets.
5 CONCLUSION
An LSTM-based prediction model is constructed and
shown to be competitive with prior work on the task
of predicting human motion. The same approach is
then used to train an augmentation model that is capable
of cleaning up and merging two noisy motions
into a single motion. This shows that an LSTM-based
architecture is viable for augmenting human motions
when evaluated on generated data. The lack of annotated
data to evaluate on means that it is unclear
how the model performs on real-life data. Overcoming
this limitation and implementing various potential
improvements is a topic for future work.
Neural Network-based Human Motion Smoother