LostPaw: Finding Lost Pets Using a Contrastive Learning-Based
Transformer with Visual Input
Andrei Voinea, Robin Kock and Maruf A. Dhali
Department of Artificial Intelligence, Bernoulli Institute, University of Groningen, 9747 AG Groningen, The Netherlands
Keywords:
Contrastive Learning, Neural Networks, Object Detection, Transformers.
Abstract:
Losing pets can be highly distressing for pet owners, and finding a lost pet is often challenging and time-
consuming. An artificial intelligence-based application can significantly improve the speed and accuracy of
finding lost pets. To facilitate such an application, this study introduces a contrastive neural network model
capable of accurately distinguishing between images of pets. The model was trained on a large dataset of dog
images and evaluated through 3-fold cross-validation. Following 350 epochs of training, the model achieved a
test accuracy of 90%. Furthermore, overfitting was avoided, as the test accuracy closely matched the training
accuracy. Our findings suggest that contrastive neural network models hold promise as a tool for locating
lost pets. This paper presents the foundational framework for a potential web application designed to assist
users in locating their missing pets. The application will allow users to upload images of their lost pets and
provide notifications when matching images are identified within its image database. This functionality aims
to enhance the efficiency and accuracy with which pet owners can search for and reunite with their beloved
animals.
1 INTRODUCTION
Experiencing the loss of a pet can be deeply traumatic
for owners, who frequently have difficulty locating
them even after utilizing flyers, online searches, and
private investigators. Since pets can roam far from
home, conventional search strategies often prove in-
effective. Without unified communication channels,
assistance can be limited. Artificial intelligence can
help identify lost pets through image analysis, though
comparing images can still be challenging for vol-
unteers. In recent years, contrastive learning has
emerged as a promising solution to the problem of
differentiating between two or more input data classes
in computer vision (Chen et al., 2020). This approach
involves training a machine learning model to identify
subtle differences between images by comparing pairs
of data samples. This technique has demonstrated
notable efficacy in various visual recognition tasks,
such as image classification, where models are trained
to differentiate between objects or scenes based on
visual features. Contrastive learning methods effi-
ciently learn high-dimensional data representations
by comparing classes, enabling them to handle trans-
formations like rotation and scaling. For instance, a
model trained on pet images can help identify miss-
ing animals, aiding owners and shelters. Analyzing
image pairs allows a contrastive neural network to dis-
tinguish between breeds and individuals, improving
lost pet searches. This paper will explore the technical
aspects and applications of this approach to develop a
complete web-based solution.
2 RELATED WORKS
Several components are required to create a con-
trastive learning model capable of differentiating be-
tween images of pets. A fundamental part of such a
model is a neural network architecture that can learn a
robust and effective data representation. In this study,
we employed the Vision Transformer model as the
foundation of our contrastive learning model. In ad-
dition, we used the Detection Transformer model to
extract the pets from the images and the AutoAug-
ment feature to augment the images. Finally, to opti-
mize the model, we used a contrastive loss function,
which allows the model to learn the underlying struc-
ture of the data by contrasting similar and dissimi-
lar examples. In the following sections, we provide
a more in-depth description of these technologies and
their implementation in our contrastive learning trans-
former model.
2.1 Transformer Models
Transformer models are a type of neural network ar-
chitecture widely successful in various natural lan-
guage processing tasks and have achieved state-of-
the-art results on a large selection of benchmarks
(Vaswani et al., 2017). Self-attention mechanisms en-
able the model to focus on different parts of input data
at various times. This allows it to capture long-range
dependencies, especially for tasks like language trans-
lation, where a word’s meaning depends on context.
Additionally, transformers utilize multi-headed atten-
tion, allowing simultaneous focus on multiple data
parts. This enhances processing capabilities and over-
all performance on numerous tasks, including natural
language processing, image classification, and object
detection.
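For illustration, the minimal PyTorch sketch below applies multi-headed self-attention to a toy token sequence; the tensor sizes and head count are arbitrary choices for demonstration and are not settings used in this work.

```python
import torch
import torch.nn as nn

# Toy input: a batch of 2 sequences, each with 5 tokens of dimension 64.
tokens = torch.randn(2, 5, 64)

# Multi-headed attention lets the model attend to several parts of the input at once.
attention = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

# Self-attention uses the same tensor as query, key, and value.
output, weights = attention(tokens, tokens, tokens)

print(output.shape)   # torch.Size([2, 5, 64]) - contextualized token embeddings
print(weights.shape)  # torch.Size([2, 5, 5])  - attention weights, averaged over heads
```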
2.2 Detection Transformer
The Detection Transformer (DETR) is an end-to-
end object detector that uses a Transformer encoder-
decoder architecture (Carion et al., 2020). It fea-
tures a convolutional neural network (CNN) back-
bone that extracts image features, which are flattened
and combined with positional encoding before be-
ing processed by the Transformer encoder to gener-
ate feature maps representing objects. The output
from the encoder is fed into a Transformer decoder
that uses learned positional embeddings called object
queries. The decoder creates embeddings for each
query, which are then classified as detections or “no
object” via a shared feedforward network. When de-
tection occurs, the model provides the object’s class
(e.g., cat or dog) and the bounding box indicating its
location in the image.
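As a rough illustration of this pipeline, the sketch below runs a publicly available DETR checkpoint through the Hugging Face transformers library and crops a detected dog; the checkpoint name, confidence threshold, and file paths are assumptions for demonstration, not settings reported in this paper.

```python
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("pet_photo.jpg").convert("RGB")  # hypothetical input file
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a confidence threshold and convert boxes to pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    if model.config.id2label[label.item()] == "dog":
        xmin, ymin, xmax, ymax = (int(v) for v in box.tolist())
        image.crop((xmin, ymin, xmax, ymax)).save("cropped_dog.jpg")
```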
2.3 Vision Transformer
The Vision Transformer (ViT) is a neural network
architecture for image classification that processes
raw pixel values instead of using convolutional lay-
ers (Dosovitskiy et al., 2021). It consists of trans-
former blocks with self-attention mechanisms that
analyze 16×16 pixel patches from images, allowing
the model to determine the importance of different
patches based on their relationships. The patches
are embedded into a high-dimensional space and pro-
cessed through twelve transformer blocks, which out-
put a new sequence of patches. The output is then
passed through a linear layer and a softmax function
to produce final class probabilities. ViT is trained
with supervised learning, using ground truth labels to
calculate cross-entropy loss for weight updates. This
enables models trained for classification tasks to be
fine-tuned for various applications, including image
pair comparisons.
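To make the feature-extraction role of ViT concrete, the sketch below pulls patch-level embeddings from a pretrained checkpoint via the Hugging Face transformers library; the checkpoint name and input file are assumptions, chosen only because the checkpoint accepts 384 × 384 inputs split into 16 × 16 patches.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Checkpoint name is an assumption; it expects 384x384 inputs and 16x16 patches.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-384")
model = ViTModel.from_pretrained("google/vit-base-patch16-384")

image = Image.open("cropped_dog.jpg").convert("RGB")  # hypothetical cropped pet image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# 384 / 16 = 24 patches per side -> 24 * 24 = 576 patch tokens plus one [CLS] token.
print(outputs.last_hidden_state.shape)       # torch.Size([1, 577, 768])
cls_embedding = outputs.last_hidden_state[:, 0]  # (1, 768) summary of the whole image
```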
2.4 AutoAugment
AutoAugment is a method that automates data aug-
mentation, which involves applying transformations
to images in a dataset to increase its size and enhance
the robustness of machine learning models (Cubuk
et al., 2019). It frames the task of finding the opti-
mal augmentation policy as a discrete search prob-
lem, utilizing a recurrent neural network as a con-
troller to sample policies that dictate which image
operations to apply, their probabilities, and their in-
tensity. The algorithm is trained using policy gradi-
ent methods, allowing it to adjust based on the val-
idation accuracy of a neural network trained with a
fixed architecture. AutoAugment offers various pre-
configured policies with transformation functions, in-
cluding shearing, translation, rotation, and adjust-
ments in contrast, brightness, and sharpness. This
approach effectively enhances data augmentation and
improves the performance of models, especially in
image classification tasks.
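The torchvision implementation of AutoAugment exposes these preconfigured policies directly. The sketch below applies the ImageNet policy twice to produce two augmented variants of one image, mirroring the augmentation step used later in this paper; the file paths are placeholders.

```python
from PIL import Image
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

image = Image.open("cropped_dog.jpg").convert("RGB")  # placeholder path

# Pre-configured policies learned on CIFAR10, ImageNet, and SVHN are available.
augment = AutoAugment(AutoAugmentPolicy.IMAGENET)

# Each call samples a (possibly different) sub-policy, so two calls give two variants.
augment(image).save("dog_aug_1.jpg")
augment(image).save("dog_aug_2.jpg")
```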
2.5 Contrastive Loss
Contrastive loss is a loss function widely used in
machine learning for unsupervised learning (Hadsell
et al., 2006). It aims to learn data representations
highlighting class relationships and differences be-
tween similar and dissimilar examples. This loss
function is often applied with Siamese networks,
which consist of two or more identical subnetworks
that process different inputs (Koch et al., 2015).
Trained with identical weights, these subnetworks
learn a shared data representation. The contrastive
loss is then computed from this representation, guid-
ing updates to the network’s weights. As proposed
by Hadsell et al. (2006), the contrastive loss function,
shown in Equation 1, uses d(X1, X2), the Euclidean
distance between two pet feature vectors, and a margin
m that controls how sensitively images are classified
as similar. The function minimizes the distance be-
tween feature vectors of the same pet while maximiz-
ing the distance between those of different pets. Fur-
thermore, the loss function ensures that the distance
between the feature vectors of dissimilar examples ex-
ceeds the margin given by the hyperparameter m.
L(X_1, X_2) = \frac{1}{2}\,\mathbb{E}\left[\begin{cases} \max\{m - d(X_1, X_2),\, 0\}^2 & \text{different pet} \\ d(X_1, X_2)^2 & \text{same pet} \end{cases}\right] \tag{1}
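Equation 1 translates directly into PyTorch. The snippet below is a minimal sketch rather than the training code used in this study; the default margin value is taken from Table 1.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x1: torch.Tensor, x2: torch.Tensor,
                     same_pet: torch.Tensor, margin: float = 1.66) -> torch.Tensor:
    """Equation 1: x1, x2 are (batch, latent) feature vectors, same_pet is a boolean mask."""
    d = F.pairwise_distance(x1, x2)                    # Euclidean distance per pair
    same_term = d.pow(2)                               # pull same-pet pairs together
    diff_term = torch.clamp(margin - d, min=0).pow(2)  # push different pets beyond the margin
    return 0.5 * torch.where(same_pet, same_term, diff_term).mean()
```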
3 METHODOLOGY
In this study, we develop a contrastive neural network
model to distinguish between pet images. Trained on
a large dataset of various dog breeds, the model is
evaluated on a separate test set. Implemented using
the PyTorch framework (Paszke et al., 2017), it uti-
lizes supervised learning techniques. The following
subsections cover the dataset, model architecture,
training procedure, and evaluation metrics used to as-
sess performance.
3.1 Dataset
To create the dataset used in this study, we obtain im-
ages of pets from adoption websites such as Adop-
tAPet (Inc and Affiliates, 2023). Each image is fed
through the DETR model, and the resulting bounding
boxes of pets are used to crop them from the image.
We focus only on images of dogs, resulting in 31,860
pets being stored with an average of 2.47 images per
pet (78,702 total images). The cropped images are
then resized to fit a square of 384 × 384 pixels, and if
the image is not wide or tall enough, the missing area
is filled with black. Each image is then augmented
twice using a pre-trained AutoAugment model, which
follows the policies CIFAR10, ImageNet, and SVHN.
A test set is created by setting aside images extracted
from 3,595 pets, totaling 8,854 images. The augmented
dataset contains 236,106 training images and 26,562
test images, which are used to train and evaluate the
contrastive neural network model developed in this
study. For a schematic view of the data pipeline, see
Figure 1.
Figure 1: Data collection process (scrape images from pet adoption sites, object detection network, resize/crop images, augment images). The top nodes represent the individual steps that are taken for each image. The diagrams at the bottom show possible configurations of each step.
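The resize-and-pad step can be sketched with Pillow as follows, assuming the cropped pet is available as a PIL image; centering the crop on the black canvas is our assumption, since the paper only states that the missing area is filled with black.

```python
from PIL import Image, ImageOps

def to_square(crop: Image.Image, size: int = 384) -> Image.Image:
    """Fit a cropped pet image into a size x size square, filling missing area with black."""
    crop = ImageOps.contain(crop, (size, size))          # scale so the longer side equals `size`
    canvas = Image.new("RGB", (size, size), (0, 0, 0))   # black background
    offset = ((size - crop.width) // 2, (size - crop.height) // 2)
    canvas.paste(crop, offset)
    return canvas
```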
To enable the use of our data for contrastive learn-
ing, we need to further combine the images into pairs,
forming a pairwise dataset. Each pair is labeled as ei-
ther different or same and contains two images of size
384 × 384. The pairwise dataset is compiled using a
random number generator to select labels and images,
and a seed value is used to ensure that the dataset is re-
producible between runs. To sample a pair during the
dataset generation, we first choose whether the label is
same or different based on a similarity probability. A
value of 50% is chosen for this probability to create an
approximately equal number of the same and different
pairs. By selecting pairs to be the same or different
with equal probability, we ensure that the contrastive
ViT model is exposed to a balanced set of training la-
bels. This can help prevent the model from becoming
biased towards one type of example or the other and
improve the model’s generalization performance on
the held-out test set. Following this, the images for
the pair are selected from the cropped image set de-
scribed earlier. If the label is different, we randomly
select two different pets from the dataset (using the
same seed described earlier) and then select an image
for each pet. If the label is same, a random pet is cho-
sen, and the pair of images comprises two different
images of the same pet. Furthermore, as each image
in the dataset is augmented twice, we ensure that we
never choose two augmentations of the same image.
The process of selecting pairs is repeated to generate
an extensive set of training pairs for the contrastive
ViT model, an example of which can be observed in
Figure 2.
Figure 2: Example data pairs with labels underneath. Some
of the images have been augmented.
As the dataset is dynamically generated during
training, validating a model requires special attention
due to its stochastic nature. In this study, we em-
ploy k-fold cross-validation with a pairwise dataset,
assigning each pair to one of k distinct folds. This en-
sures that during testing, each fold is used only once
while the remaining folds are for training, prevent-
ing exposure to the same pairs. However, since pairs
are formed from images, there is a possibility that the
same images may appear in both training and cross-
validation. Due to the random selection process, it is
possible for an image to be compared to one it has en-
countered before. Nevertheless, the large size of the
dataset makes duplicate pairs unlikely. To mitigate
this, we present additional results from a held-out test
set featuring entirely novel pets.
Figure 3: Architecture of the contrastive Vision Transformer model. Two input images pass through a shared ViT backbone (patch embedding, twelve encoder layers with self-attention, and pooling to a 768-dimensional feature vector), followed by two 1024-unit ELU dense layers and a 512-unit output layer; the resulting feature vectors are compared through contrastive loss learning.
3.2 Model Architecture
To develop the contrastive ViT model, we use ViT as
the backbone of the model (Wu et al., 2020). The ViT
output is flattened and followed by three fully con-
nected layers to achieve the desired latent vector size.
The model features two hidden layers with twice the
number of neurons as the latent space and a final out-
put layer matching the latent space size. These lay-
ers transform the ViT backbone’s output into a com-
pact representation of the image’s structure. We use
the Exponential Linear Unit (ELU) activation func-
tion between the layers to propagate negative val-
ues for the contrastive loss function. For a graphi-
cal overview of the model architecture, see Figure 3.
During training, we fine-tune only the last three lay-
ers of the model while keeping the backbone parame-
ters frozen. This approach, while limiting the model’s
ability to learn new lower-level features, results in
a more stable training process and reduces the risk
of overfitting. An ablation study where all param-
eters are updated revealed a significant drop in per-
formance, which is discussed in the Results section.
We select the hyperparameters for the contrastive ViT
model through parameter sweeps. By training with
various values and evaluating the model on a vali-
dation set, we identify the optimal hyperparameters.
This process helps us fine-tune the model to better dif-
ferentiate between pictures of pets. Table 1 lists these
hyperparameters.
Table 1: Hyperparameters for the contrastive ViT model.
Name Value
Epochs 350
Latent Space Size 512
Batch Size 8
Batch Count per Epoch 128
Test Batch Size 8
Test Batch Count 128
Optimizer AdamW
Learning Rate 5.0e-5
Weight Decay 2.0e-4
Contrastive Margin 1.66
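Read together with Figure 3 and Table 1, the model can be sketched roughly as follows; the checkpoint name and the use of the pooled ViT output are assumptions, and only the three projection layers receive gradient updates.

```python
import torch
import torch.nn as nn
from transformers import ViTModel

class ContrastiveViT(nn.Module):
    """Frozen ViT backbone followed by a three-layer projection head (see Figure 3)."""

    def __init__(self, latent_size: int = 512):
        super().__init__()
        # Checkpoint name is an assumption; any ViT-Base variant with 768-dim features fits.
        self.backbone = ViTModel.from_pretrained("google/vit-base-patch16-384")
        for param in self.backbone.parameters():
            param.requires_grad = False   # backbone stays fixed during training
        hidden = 2 * latent_size          # hidden layers have twice the latent size
        self.head = nn.Sequential(
            nn.Linear(768, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, latent_size),  # latent vector compared by the contrastive loss
        )

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        features = self.backbone(pixel_values).pooler_output  # (batch, 768)
        return self.head(features)

model = ContrastiveViT()
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=5.0e-5, weight_decay=2.0e-4,  # values from Table 1
)
```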
3.3 Evaluation Metrics
We used k-fold cross-validation to evaluate the per-
formance of the contrastive learning model. We chose
k = 3, which resulted in three different models being
trained and evaluated. We trained
the model for a fixed number of epochs for each fold
and used the validation set to tune the model’s hyper-
parameters. Once the model was trained, we evalu-
ated it on the test set and recorded its performance in
terms of accuracy, the type I and type II errors, and
the F1 score. The type I error represents the propor-
tion of false positives, while the type II error repre-
sents the probability of false negatives. Using these
values, we calculate the precision and recall of our
model, which are used to obtain the F1 score value
(see Equation 2). The precision and recall values rep-
resent the performance of the classification model on
the given dataset.
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{2}
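Since the reported type I and type II errors are fractions of all test pairs and the pair labels are balanced (Section 3.1), precision, recall, and the F1 score can be recovered as sketched below. This reflects our reading of the metrics, not code from the original study.

```python
def metrics_from_error_rates(type1: float, type2: float, positive_fraction: float = 0.5):
    """Derive precision, recall, and F1 from error rates expressed as fractions of all pairs."""
    true_positives = positive_fraction - type2  # "same pet" pairs classified correctly
    precision = true_positives / (true_positives + type1)
    recall = true_positives / (true_positives + type2)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Held-out test set values reported later in Table 2 reproduce an F1 score of roughly 0.911.
print(metrics_from_error_rates(0.0966, 0.0006))
```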
4 RESULTS
Throughout 350 epochs, an average F1 score of 88.8%
on the cross-validation set was achieved. In addition,
the model was trained on a large dataset and did not
appear to be overfitting, as the validation accuracy
closely followed the training accuracy (see Figure 4a).
In addition to the accuracy results, we further exam-
ined the loss values of the model during training (see
Figure 4b). We observed that the loss value steadily
dropped from a starting value of approximately 1.16
to a final value of approximately 0.04. This trend gen-
erally indicates the model is steadily learning a better
representation of the data throughout training. Fur-
thermore, the low loss value suggests that the model
was able to learn an adequate representation of the
data, which could potentially allow the model to make
accurate decisions on unseen samples.
Figure 4: Mean train accuracy (a) and loss (b) of the contrastive ViT model, averaged over three model runs. The data for accuracy was smoothed by averaging the values every five epochs.
When examining the errors of the model (see Fig-
ure 5), we observed that the model initially classified
every pair of pet images as the same pet. However,
over the course of training, the model learned to dif-
ferentiate between different pets, and the type I error
decreased. Furthermore, the type II error was very
close to zero for most of the training period. These re-
sults suggest that the model could learn a robust and
relatively effective representation of the data, which
could distinguish between different pets.
Table 2: Accuracy & errors of the model for various sets.

Metric          Training   Cross-val.   Held-out   Held-out std.
Accuracy        0.8687     0.8737       0.9028     0.0036
Type I error    0.1309     0.1261       0.0966     0.0036
Type II error   0.0004     0.0002       0.0006     0.0003
F1 score        0.8838     0.8880       0.9108     0.0041
When examining the outcomes of the models on
the held-out test set, as illustrated in Table 2, we noted
that the average F1 score was 91.1% (SD = 0.41%).
Figure 5: Type I and II errors of the model on the test set
at every epoch. The data for the errors were smoothed by
averaging the values every five epochs.
Similarly, the mean type I error was 9.7%, and the
type II error was 0.06%. These outcomes indicate an
improvement over the metric values recorded on the
train and validation sets. This may be attributed to
various reasons, which are discussed in detail in the
subsequent section. However, the results still suggest
that the model has effectively generalized to new data.
Furthermore, during the ablation study, we observed
that a fully-trained contrastive ViT model achieved
a cross-validation F1 score of 80.0% and a held-out
test set F1 score of 78.6%. This provides strong support
for our decision to set the layers of the backbone
model as fixed. In particular, the model appears to
overfit the data more than it does with frozen layers
since the held-out test set performance is inferior to
the validation performance. For more results of the
ablation study, see Figure 6 in the Appendix.
5 CONCLUSIONS
This study involved the development of a contrastive
neural network model to distinguish between images
of dogs, with its performance evaluated on a held-out
test set. Results from a three-fold cross-validation in-
dicated that the model can accurately differentiate be-
tween pet images and generalize well to unseen data.
The remainder of this section provides a detailed discussion
of the evaluation results and the implications of these
findings for applying artificial intelligence in search-
ing for lost pets. One concern is the relatively high
incidence of false positives. While this may initially
appear to be a limitation, it could be beneficial in lo-
cating lost pets, particularly when only a few pets are re-
ported missing in a given area, allowing for easy dis-
missal of incorrect identifications. Moreover, the use
of the AutoAugment feature, which sometimes alters
the colors of pet images, may influence the accuracy
of the model. However, this variation could enhance
generalization by allowing the model to learn more
robust features, improving its performance on real-
world data with varying color and lighting conditions.
A potential issue has been identified in the model’s
accuracy metrics: both the cross-validation accuracy
and the held-out test set accuracy are higher than the
training accuracy, which needs further investigation.
The higher cross-validation accuracy may result from
random fluctuations, as the two accuracies often inter-
sect during training, as shown in Figure 4a. The im-
provement of over 1% in the held-out test set’s perfor-
mance compared to the cross-validation set is unclear
and could be due to differences in data distribution or
the smaller sample size in the held-out set. However,
measures have been taken to eliminate any system-
atic errors that might affect the observed performance
gains. Future research could explore expanding the
network to include various types of pets. This might
involve using the DETR to identify the specific pet in
an image, such as a cat or dog, and then passing it to
a fine-tuned model that specializes in comparing pets
within each category. This method would combine
the strengths of both DETR and ViT models, result-
ing in a more robust system through enhanced con-
trastive data. Additionally, while this study focused
on dog images, the described contrastive learning ap-
proach can be applied to other datasets. Training on
diverse images enables the model to differentiate be-
tween various classes, with potential applications in
medical image classification, wildlife species identi-
fication, and handwriting comparison. Overall, the
study suggests that contrastive learning can signifi-
cantly improve image classification accuracy.
REFERENCES
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov,
A., and Zagoruyko, S. (2020). End-to-End Object De-
tection with Transformers. In Vedaldi, A., Bischof,
H., Brox, T., and Frahm, J.-M., editors, Computer Vi-
sion – ECCV 2020, Lecture Notes in Computer Sci-
ence, pages 213–229, Cham. Springer International
Publishing.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020).
A simple framework for contrastive learning of visual
representations. In III, H. D. and Singh, A., editors,
Proceedings of the 37th International Conference on
Machine Learning, volume 119 of Proceedings of Ma-
chine Learning Research, pages 1597–1607. PMLR.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le,
Q. V. (2019). AutoAugment: Learning Augmentation
Strategies from Data. In 2019 IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 113–123. IEEE Computer Society.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer,
M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby,
N. (2021). An Image is Worth 16x16 Words: Trans-
formers for Image Recognition at Scale.
Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimension-
ality Reduction by Learning an Invariant Mapping.
In 2006 IEEE Computer Society Conference on Com-
puter Vision and Pattern Recognition (CVPR’06), vol-
ume 2, pages 1735–1742.
Inc, K. P. and Affiliates (2023). AdoptAPet: Search for local
pets in need of a home. https://www.adoptapet.com/,
last visited: 28 April 2023.
Koch, G., Zemel, R., Salakhutdinov, R., et al. (2015).
Siamese neural networks for one-shot image recog-
nition. In ICML deep learning workshop, volume 2.
Lille.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,
DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and
Lerer, A. (2017). Automatic differentiation in Py-
Torch.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I.
(2017). Attention is all you need. Advances in neural
information processing systems, 30.
Wu, B., Xu, C., Dai, X., Wan, A., Zhang, P., Yan, Z.,
Tomizuka, M., Gonzalez, J., Keutzer, K., and Vajda,
P. (2020). Visual Transformers: Token-based Image
Representation and Processing for Computer Vision.
APPENDIX
Ablation Study Results
As described in Section 4, an ablation study was per-
formed where the entire ViT model was trained along-
side the final layers using the contrastive loss func-
tion. Comparing these results to those obtained with
the fixed layers, we can see that the fully-trained
model has a higher type I error rate and a lower F1
score on both the cross-validation and held-out test
sets. This indicates that the fully-trained model is
overfitting the data to some extent, which is expected
given the increased flexibility of the model. Overall,
the results of the ablation study support our decision
to use a fixed ViT backbone with contrastive learn-
ing. This approach appears to be more effective at
learning a robust representation of the data and gen-
eralizing to new samples, as demonstrated by the su-
perior performance of the fixed layers model on both
the cross-validation and held-out test sets.
Model Demonstration
To make the contrastive learning model available to
a broader audience, we developed a web application
Figure 6: Results of the ablation study averaged over three model runs: (a) mean train accuracy and validation accuracy; (b) type I and II errors of the model on the test set; (c) loss of the contrastive ViT model. The data for accuracy was smoothed by averaging the values every five epochs. The data was collected and processed in the same manner as for the plots presented in Section 4.
that allows users to upload pictures of dogs and dis-
cover whether any similar dogs are present in the
system. The web application processes the image us-
ing the contrastive learning model and returns a list of
pets along with their similarity score.
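A possible way to produce such a ranked list is sketched below; the function and variable names are hypothetical, and the ranking uses the same Euclidean metric as the contrastive loss, with smaller distances indicating more similar pets.

```python
import torch

def rank_matches(query_features: torch.Tensor, database_features: torch.Tensor,
                 pet_ids: list, top_k: int = 5):
    """Return (pet_id, distance) pairs for the database embeddings closest to the query."""
    # query_features: (latent,); database_features: (N, latent) from the contrastive model.
    distances = torch.cdist(query_features.unsqueeze(0), database_features).squeeze(0)
    best = torch.topk(-distances, k=min(top_k, len(pet_ids)))  # largest -d == smallest d
    return [(pet_ids[i], distances[i].item()) for i in best.indices.tolist()]
```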
Figure 7: Screenshots of the web application showing how users might interact with the website: (a) map of pet sightings; (b) image of a sighted pet; (c) user-uploaded picture; (d) first hit of similar pets. Clicking the symbol highlighted by the red dashed outline (in 7a) opens the pop-up shown in 7b. Similarly, clicking the symbol highlighted by the solid red outline opens a picture upload dialog and displays the results as in 7c and 7d. For code availability, please follow this link: https://github.com/vandrw/lostpaw-transformer.