Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. In Advances in Neural Information Processing Systems (NeurIPS).
Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., and Black, M. (2019). Capture, learning, and synthesis of 3D speaking styles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10101–10111.
Denton, E. L. and Fergus, R. (2018). Stochastic video generation with a learned prior. In International Conference on Machine Learning (ICML).
Fan, R., Xu, S., and Geng, W. (2012). Example-based automatic music-driven conventional dance motion synthesis. IEEE Trans. Visualization and Computer Graphics (TVCG), 18(3):501–515.
Foote, J. T. (1997). Content-based retrieval of music and audio. In Multimedia Storage and Archiving Systems II, volume 3229, pages 138–147. International Society for Optics and Photonics.
Hamanaka, M., Hirata, K., and Tojo, S. (2016). deepGTTM-I&II: Local boundary and metrical structure analyzer based on deep learning technique. In International Symposium on Computer Music Multidisciplinary Research, pages 3–21. Springer.
Hamanaka, M., Hirata, K., and Tojo, S. (2017). deepGTTM-III: Multi-task learning with grouping and metrical structures. In International Symposium on Computer Music Multidisciplinary Research, pages 238–251. Springer.
Hamanaka, M., Nakatsuka, T., and Morishima, S. (2019). Melody slot machine. In ACM SIGGRAPH Emerging Technologies, page 19. ACM.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., et al. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
Jiang, H., Sun, D., Jampani, V., Yang, M.-H., Learned-Miller, E., and Kautz, J. (2018). Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 9000–9008.
Karras, T., Aila, T., Laine, S., Herva, A., and Lehtinen, J. (2017). Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graphics (TOG), 36(4):94.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lerdahl, F. and Jackendoff, R. S. (1996). A generative theory of tonal music. MIT Press.
Li, Y., Roblek, D., and Tagliasacchi, M. (2019). From here to there: Video inbetweening using direct 3D convolutions. arXiv preprint arXiv:1905.10240.
Logan, B. and Chu, S. (2000). Music summarization using key phrases. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages II749–II752. IEEE.
Logan, B. et al. (2000). Mel frequency cepstral coefficients for music modeling. In International Symposium on Music Information Retrieval (ISMIR).
Meyer, S., Djelouah, A., McWilliams, B., Sorkine-Hornung, A., Gross, M., and Schroers, C. (2018). PhaseNet for video frame interpolation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 498–507.
Niklaus, S. and Liu, F. (2018). Context-aware synthesis for video frame interpolation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1701–1710.
Niklaus, S., Mai, L., and Liu, F. (2017). Video frame interpolation via adaptive separable convolution. In IEEE International Conference on Computer Vision (ICCV).
Ofli, F., Erzin, E., Yemez, Y., and Tekalp, A. M. (2012). Learn2Dance: Learning statistical music-to-dance mappings for choreography synthesis. IEEE Trans. Multimedia (TMM), 14(3-2):747–759.
Pumarola, A., Agudo, A., Martinez, A. M., Sanfeliu, A., and Moreno-Noguer, F. (2018). GANimation: Anatomically-aware facial animation from a single image. In European Conference on Computer Vision (ECCV), pages 818–833.
Ruggero Ronchi, M. and Perona, P. (2017). Benchmarking and error diagnosis in multi-instance pose estimation. In IEEE International Conference on Computer Vision (ICCV), pages 369–378.
Shlizerman, E., Dery, L. M., Schoen, H., and Kemelmacher-Shlizerman, I. (2018). Audio to body dynamics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Suwajanakorn, S., Seitz, S. M., and Kemelmacher-Shlizerman, I. (2017). Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graphics (TOG), 36(4):95.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., and Catanzaro, B. (2018a). Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS).
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Tao, A., Kautz, J., and Catanzaro, B. (2018b). High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wang, T.-H., Cheng, Y.-C., Hubert Lin, C., Chen, H.-T., and Sun, M. (2019). Point-to-point video generation. In IEEE International Conference on Computer Vision (ICCV).
Wilcoxon, F. (1992). Individual comparisons by ranking methods. In Breakthroughs in statistics, pages 196–202. Springer.
Xu, Q., Zhang, H., Wang, W., Belhumeur, P. N., and Neumann, U. (2018). Stochastic dynamics for video infilling. arXiv preprint arXiv:1809.00263.
Zhou, Y., Xu, Z., Landreth, C., Kalogerakis, E., Maji, S., and Singh, K. (2018). VisemeNet: Audio-driven animator-centric speech animation. ACM Trans. Graphics (TOG), 37(4):161.