
for a scheduling strategy that is well-suited to video
prediction.
An ablation study using similarity computed from vanilla frames (i.e., frames without any processing) reveals that determining the probability parameter ε while accounting for changes caused by motion is crucial, and it further confirms the effectiveness of leveraging difference frames.
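To make the role of difference frames concrete, the sketch below shows one possible way the similarity, and the resulting ε, could be computed. The function names (difference_frame, average_hash, epsilon_from_similarity), the choice of an average hash, the linear mapping to ε, and the particular frames being compared are illustrative assumptions, not the exact procedure of the paper.

```python
import numpy as np

def difference_frame(frame_t, frame_prev):
    """Absolute pixel-wise difference between two consecutive frames."""
    return np.abs(frame_t.astype(np.float32) - frame_prev.astype(np.float32))

def average_hash(img, hash_size=8):
    """Toy average hash: collapse colour, block-average to hash_size x hash_size,
    then threshold each block at the global mean to obtain a binary hash."""
    if img.ndim == 3:
        img = img.mean(axis=2)
    h, w = img.shape
    bh, bw = h // hash_size, w // hash_size
    small = (img[:bh * hash_size, :bw * hash_size]
             .reshape(hash_size, bh, hash_size, bw)
             .mean(axis=(1, 3)))
    return (small > small.mean()).flatten()

def similarity(hash_a, hash_b):
    """1 minus the normalised Hamming distance between two binary hashes."""
    return 1.0 - np.mean(hash_a != hash_b)

def epsilon_from_similarity(sim, eps_min=0.0, eps_max=1.0):
    """Map a similarity in [0, 1] to the sampling probability (assumed linear)."""
    return eps_min + (eps_max - eps_min) * sim

# Example pairing (an assumption): compare the motion in the prediction with the
# motion in the ground truth at the same step.
# eps = epsilon_from_similarity(
#     similarity(average_hash(difference_frame(gt_t, gt_prev)),
#                average_hash(difference_frame(pred_t, gt_prev))))
```

In this sketch, the hash length discussed in the limitations below roughly corresponds to hash_size: a shorter hash is cheaper to compare but captures less of the motion pattern.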
Limitations of the proposed method include the higher computational cost of calculating similarity compared with previous methods, and the difficulty of determining optimal settings for the hash length and for how far training should proceed in order to improve the quality of the model's output.
5 CONCLUSION
In this paper, we have introduced similarity-based scheduled sampling, which utilizes the similarity calculated from difference frames. This approach addresses the challenge of setting a scheduling strategy suited to video prediction tasks, and the proposed method outperforms previous methods. Furthermore, an ablation study demonstrates the importance of determining the probability parameter ε while considering changes caused by motion.
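As a usage illustration, the following minimal sketch shows how such an ε could drive one scheduled-sampling training step. The recurrent one-step interface pred, state = model(prev, state) and the convention that ε is the probability of feeding back the model's own prediction are assumptions made for illustration; the paper may use the opposite convention.

```python
import torch

def predict_with_scheduled_sampling(model, frames, eps):
    """frames: (B, T, C, H, W) ground-truth clip; returns predictions for steps 1..T-1.
    `eps` is assumed to be the probability of feeding back the model's own prediction."""
    preds, state = [], None
    prev = frames[:, 0]                          # always start from the first real frame
    for t in range(1, frames.shape[1]):
        pred, state = model(prev, state)         # assumed recurrent one-step predictor
        preds.append(pred)
        use_pred = torch.rand(()).item() < eps   # scheduled-sampling coin flip
        prev = pred.detach() if use_pred else frames[:, t]
    return torch.stack(preds, dim=1)
```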
In future work, we plan to explore alternatives to difference frames and to reduce the computational cost.