Ho, J., Kalchbrenner, N., Weissenborn, D., and Salimans, T. (2019). Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180.
Hsieh, J.-T., Liu, B., Huang, D.-A., Fei-Fei, L., and Niebles, J. C. (2018). Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, pages 517–526.
Kaiser, Ł., Roy, A., Vaswani, A., Parmar, N., Bengio, S., Uszkoreit, J., and Shazeer, N. (2018). Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., and Kavukcuoglu, K. (2017). Video pixel networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1771–1779. JMLR.org.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al. (2017). The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., and Kingma, D. (2019). VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434.
Lee, A. X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., and Levine, S. (2018). Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523.
Luc, P., Clark, A., Dieleman, S., de Las Casas, D., Doron, Y., Cassirer, A., and Simonyan, K. (2020). Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035.
Mathieu, M., Couprie, C., and LeCun, Y. (2015). Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440.
Menick, J. and Kalchbrenner, N. (2018). Generating high fidelity images with subscale pixel networks and multidimensional upscaling. arXiv preprint arXiv:1812.01608.
Nam, S., Ma, C., Chai, M., Brendel, W., Xu, N., and Kim, S. J. (2019). End-to-end time-lapse video synthesis from a single outdoor image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1409–1418.
Ohnishi, K., Yamamoto, S., Ushiku, Y., and Harada, T. (2018). Hierarchical video generation from orthogonal information: Optical flow and texture. In Thirty-Second AAAI Conference on Artificial Intelligence.
Pan, J., Wang, C., Jia, X., Shao, J., Sheng, L., Yan, J., and Wang, X. (2019). Video generation from single semantic label map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742.
Patraucean, V., Handa, A., and Cipolla, R. (2015). Spatio-temporal video autoencoder with differentiable memory. arXiv preprint arXiv:1511.06309.
Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., and Chopra, S. (2014). Video (language) modeling: A baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604.
Razavi, A., van den Oord, A., and Vinyals, O. (2019). Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pages 14837–14847.
Saito, M., Matsumoto, E., and Saito, S. (2017). Temporal generative adversarial nets with singular value clipping. In Proceedings of the IEEE International Conference on Computer Vision, pages 2830–2839.
Saito, M. and Saito, S. (2018). TGANv2: Efficient training of large models for video generation with multiple subsampling layers. arXiv preprint arXiv:1811.09245.
Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517.
Shaham, T. R., Dekel, T., and Michaeli, T. (2019). SinGAN: Learning a generative model from a single natural image. In Proceedings of the IEEE International Conference on Computer Vision, pages 4570–4580.
Srivastava, N., Mansimov, E., and Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pages 843–852.
Tulyakov, S., Liu, M.-Y., Yang, X., and Kautz, J. (2018). MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. (2016). Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798.
van den Oord, A., Vinyals, O., et al. (2017). Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Vondrick, C., Pirsiavash, H., and Torralba, A. (2016). Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pages 613–621.
Wang, T.-C., Liu, M.-Y., Tao, A., Liu, G., Kautz, J., and Catanzaro, B. (2019). Few-shot video-to-video synthesis. arXiv preprint arXiv:1910.12713.
Wang, T.-C., Liu, M.-Y., Zhu, J.-Y., Liu, G., Tao, A., Kautz, J., and Catanzaro, B. (2018a). Video-to-video synthesis. arXiv preprint arXiv:1808.06601.
Wang, Y., Gao, Z., Long, M., Wang, J., and Yu, P. S. (2018b). PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. arXiv preprint arXiv:1804.06300.
Wang, Y., Jiang, L., Yang, M.-H., Li, L.-J., Long, M., and Fei-Fei, L. (2018c). Eidetic 3D LSTM: A model for video prediction and beyond. In International Conference on Learning Representations.
Weissenborn, D., Täckström, O., and Uszkoreit, J. (2019). Scaling autoregressive video models. arXiv preprint arXiv:1906.02634.