
robust features, improving its performance on real-world data with varying color and lighting conditions. One potential issue in the model's accuracy metrics warrants further investigation: both the cross-validation accuracy and the held-out test set accuracy are higher than the training accuracy. The higher cross-validation accuracy may simply reflect random fluctuation, as the two accuracy curves frequently cross during training, as shown in Figure 4a. The reason the held-out test set outperforms the cross-validation set by over 1% is less clear; it may stem from differences in data distribution or from the smaller sample size of the held-out set. However, steps were taken to rule out systematic errors that could account for the observed performance gains.
gains. Future research could explore expanding the
network to include various types of pets. This might
involve using the DETR to identify the specific pet in
an image, such as a cat or dog, and then passing it to
a fine-tuned model that specializes in comparing pets
within each category. This method would combine
the strengths of both DETR and ViT models, result-
ing in a more robust system through enhanced con-
trastive data. Additionally, while this study focused
on dog images, the described contrastive learning ap-
proach can be applied to other datasets. Training on
diverse images enables the model to differentiate be-
tween various classes, with potential applications in
medical image classification, wildlife species identi-
fication, and handwriting comparison. Overall, the
study suggests that contrastive learning can signifi-
cantly improve image classification accuracy.
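As an illustration of this future direction only, the sketch below shows how such a two-stage pipeline could be wired together. It is a sketch under stated assumptions: it relies on publicly available DETR and ViT checkpoints from the HuggingFace transformers library, and the label set, similarity threshold, and helper names are hypothetical rather than part of the implementation described in this paper.

# Hypothetical sketch of the proposed two-stage pipeline: DETR localizes and
# labels the pet, then a ViT embedding model compares the detected pets.
# Checkpoints, threshold, and helper names are illustrative assumptions.
import torch
from PIL import Image
from transformers import (DetrForObjectDetection, DetrImageProcessor,
                          ViTImageProcessor, ViTModel)

detr_proc = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()

PET_LABELS = {"cat", "dog"}  # COCO classes that DETR can already localize

def detect_pet(image):
    """Return (label, crop) for the highest-scoring pet box, or (None, None)."""
    inputs = detr_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detr(**inputs)
    sizes = torch.tensor([image.size[::-1]])  # (height, width)
    result = detr_proc.post_process_object_detection(
        outputs, threshold=0.7, target_sizes=sizes)[0]
    detections = sorted(zip(result["scores"], result["labels"], result["boxes"]),
                        key=lambda t: -float(t[0]))
    for score, label, box in detections:
        name = detr.config.id2label[label.item()]
        if name in PET_LABELS:
            return name, image.crop(tuple(box.tolist()))
    return None, None

def embed(crop):
    """Mean-pooled ViT features; the full system would instead use the
    category-specific, contrastively fine-tuned model described above."""
    inputs = vit_proc(images=crop, return_tensors="pt")
    with torch.no_grad():
        feats = vit(**inputs).last_hidden_state.mean(dim=1)
    return torch.nn.functional.normalize(feats, dim=-1)

def same_pet(img_a, img_b, threshold=0.8):
    """Route both images through DETR, then compare same-category crops."""
    label_a, crop_a = detect_pet(img_a)
    label_b, crop_b = detect_pet(img_b)
    if label_a is None or label_a != label_b:
        return False  # no pet found, or the two images show different species
    return torch.cosine_similarity(embed(crop_a), embed(crop_b)).item() > threshold

In this sketch the generic ViT embedding stands in for the category-specific comparator; the routing step is what the DETR stage would add on top of the contrastive model presented here.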
REFERENCES
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-End Object Detection with Transformers. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J.-M., editors, Computer Vision – ECCV 2020, Lecture Notes in Computer Science, pages 213–229, Cham. Springer International Publishing.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). A simple framework for contrastive learning of visual representations. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. (2019). AutoAugment: Learning Augmentation Strategies from Data. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 113–123. IEEE Computer Society.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).
Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality Reduction by Learning an Invariant Mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735–1742.
Kinship Partners, Inc. and Affiliates (2023). AdoptAPet: Search for local pets in need of a home. https://www.adoptapet.com/, last visited: 28 April 2023.
Koch, G., Zemel, R., Salakhutdinov, R., et al. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, Lille.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, 30.
Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan, Z., Tomizuka, M., Gonzalez, J., Keutzer, K., and Vajda, P. (2020). Visual Transformers: Token-based Image Representation and Processing for Computer Vision.
APPENDIX
Ablation Study Results
As described in Section 4, an ablation study was per-
formed where the entire ViT model was trained along-
side the final layers using the contrastive loss func-
tion. Comparing these results to those obtained with
the fixed layers, we can see that the fully-trained
model has a higher type I error rate and a lower F1 score on both the cross-validation and held-out test
sets. This indicates that the fully-trained model is
overfitting the data to some extent, which is expected
given the increased flexibility of the model. Overall,
the results of the ablation study support our decision
to use a fixed ViT backbone with contrastive learn-
ing. This approach appears to be more effective at
learning a robust representation of the data and gen-
eralizing to new samples, as demonstrated by the su-
perior performance of the fixed layers model on both
the cross-validation and held-out test sets.
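To make the difference between the two ablation settings concrete, the sketch below shows how the fixed-backbone and fully-trained variants differ in code. It is a minimal sketch, assuming a HuggingFace ViT backbone and a margin-based pairwise contrastive loss in the spirit of Hadsell et al. (2006); the checkpoint name, head sizes, margin, and learning rate are illustrative assumptions rather than the exact configuration used in our experiments.

# Minimal sketch of the two ablation settings: a frozen ("fixed layers") ViT
# backbone with a trainable head vs. a fully trained model. Hyperparameters
# and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import ViTModel

class PetEmbedder(nn.Module):
    def __init__(self, freeze_backbone=True, embed_dim=128):
        super().__init__()
        self.backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        if freeze_backbone:  # "fixed layers" setting: only the head is trained
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(self.backbone.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, pixel_values):
        cls_token = self.backbone(pixel_values=pixel_values).last_hidden_state[:, 0]
        return nn.functional.normalize(self.head(cls_token), dim=-1)

def contrastive_loss(z1, z2, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs at least `margin` apart."""
    d = torch.norm(z1 - z2, dim=-1)
    return (same * d.pow(2) + (1 - same) * torch.clamp(margin - d, min=0).pow(2)).mean()

# Fixed-backbone configuration (favoured by the ablation) vs. the fully trained
# variant (freeze_backbone=False); only trainable parameters reach the optimizer.
model = PetEmbedder(freeze_backbone=True)
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

Setting freeze_backbone=False reproduces the fully-trained ablation, in which every ViT parameter is updated by the contrastive objective and the model is correspondingly more prone to overfitting.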
Model Demonstration
To make the contrastive learning model available to
a broader audience, we developed a web application