
6 CONCLUSIONS
We have extended the mathematical formulation of cycle-consistency to partial overlaps between views. Building on these insights, we developed a self-supervised training setting that employs multiple new cycle variants and a pseudo-masking approach to steer the loss function. The cycle variants expose different cycle-inconsistencies, making the self-supervised learning signal more diverse and therefore stronger. We also presented a time-divergent batch sampling approach for self-supervised cycle-consistency. Combined, our methods improve the cross-camera matching performance of the current self-supervised state-of-the-art on the challenging DIVOTrack benchmark by 4.3 percentage points overall, and by 4.7-9.1 percentage points on the most challenging scenes.
Our method is also effective in other multi-camera downstream tasks such as Re-ID and cross-view multi-object tracking. One limitation of self-supervision with cycle-consistency is its dependence on annotated bounding boxes in the training data. Detections from a detector not trained on the target data could be used instead, but this would likely degrade performance. Another avenue for improvement is to take location and relative distances into account during both training and testing, as these carry informative identity cues.
Self-supervision through cycle-consistency is applicable to many more settings than just learning view-invariant object features. We believe the techniques introduced in this paper also benefit works that use cycle-consistency to learn image, patch, or keypoint features from videos or overlapping views.
VISAPP 2025 - 20th International Conference on Computer Vision Theory and Applications