
[Figure 5: two rows of qualitative results on the DAVIS dataset — "With Cosine" and "Without Cosine" — each shown at three encoder downsampling stages (Down 1, Down 2, Down 3).]
Figure 5: Qualitative results illustrating how object information and texture propagate through the encoder network, evaluated on the DAVIS dataset. The first row shows the complete model architecture, while the second row shows the variant without cosine similarity. Details are propagated effectively through feature extraction when cosine similarity is employed, which explains why the results with cosine similarity are superior.
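The per-location cosine similarity between encoder feature maps that the caption refers to can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name, the (C, H, W) feature-map shapes, and the epsilon stabilizer are assumptions for the sketch:

```python
import numpy as np

def cosine_similarity_map(feat_a, feat_b, eps=1e-8):
    """Per-location cosine similarity between two feature maps.

    feat_a, feat_b: arrays of shape (C, H, W), e.g. encoder features at
    one downsampling stage (shapes are assumed for this sketch).
    Returns an (H, W) map with values in [-1, 1]: +1 where the channel
    vectors at a spatial location point in the same direction.
    """
    # Dot product over the channel axis at every spatial location.
    num = np.sum(feat_a * feat_b, axis=0)
    # Product of channel-vector norms; eps guards against division by zero.
    denom = np.linalg.norm(feat_a, axis=0) * np.linalg.norm(feat_b, axis=0)
    return num / (denom + eps)

# A feature map compared with itself yields similarity close to 1
# everywhere; scaling does not change the direction of the vectors.
f = np.random.rand(64, 8, 8)
sim_self = cosine_similarity_map(f, f)
sim_scaled = cosine_similarity_map(f, 3.0 * f)
```

Because the measure normalizes out magnitude, it matches feature *directions* rather than raw activations, which is one plausible reason structural detail survives through successive downsampling stages.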
VISAPP 2024 - 19th International Conference on Computer Vision Theory and Applications