
triplet loss - DMTL), where the margin is dynamically controlled by the distances between the word embeddings of class textual descriptors (see the sketch after this list);
• We created a new dataset (OGYEIv2) of 112 pill classes with 4480 images. The dataset is available at: https://www.kaggle.com/datasets/richardradli/ogyeiv2
• We evaluated our model on our dataset with 5-fold cross-validation.
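To make the dynamic-margin idea concrete, the sketch below shows one way a per-sample triplet-loss margin could be derived from the word embeddings of class descriptors; the names and parameter values (base_margin, scale) are illustrative assumptions, and the actual DMTL formulation is given in Section 4.

```python
# Illustrative sketch only: derive a per-sample triplet-loss margin from the
# word embeddings of class textual descriptors. base_margin and scale are
# assumed hyperparameters, not the paper's exact values (see Section 4).
import torch.nn.functional as F

def dynamic_margin(anchor_text_emb, negative_text_emb,
                   base_margin=0.2, scale=0.5):
    """Inputs: word embeddings of the anchor and negative class
    descriptors, shape (B, D). Returns a margin of shape (B,)."""
    text_dist = 1.0 - F.cosine_similarity(anchor_text_emb, negative_text_emb)
    # Textually dissimilar classes receive a larger margin, so their image
    # embeddings are pushed further apart during training.
    return base_margin + scale * text_dist
```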
2 RELATED WORKS
Generic deep neural network (DNN) object detectors
have been applied for pill recognition in several recent
articles, such as (Tan et al., 2021), (Nguyen et al.,
2022), and (Heo et al., 2023).
In (Tan et al., 2021) three well-known DNN object detectors (YOLOv3, RetinaNet, and SSD) were compared on a custom dataset, resulting in only small differences in mAP (∼ 2%, all above 0.80). More
specific approaches are described in (Nguyen et al.,
2022) and (Heo et al., 2023). In (Nguyen et al.,
2022) the proposed solution used a prescription-based
knowledge graph, representing the relationship be-
tween pills. A graph embedding network extracted
the relational features of pills and a framework was
applied to fuse the graph-based relational information
with the image-based visual features for the final clas-
sification. The drawback of this method is that it requires medical prescriptions or, equivalently, that it can only be applied when there are multiple pills in the image.
In (Heo et al., 2023) the authors trained not only on RGB images of the pills but also on their imprinted characters. In the pill recognition step, separate modules recognize the features of the pills and their imprints, while correcting the recognized imprint to be consistent with the other recognized features. A trained language model was also applied for imprint correction. An ablation study showed that the language model could significantly improve the pill identification ability of the system. The drawback of this solution is that a specific language model (including an optical character recognition (OCR) module) is required for the application.
In contrast to these approaches, our solution avoids the use of specific language models and uses only generic models to process the information leaflets of pills. In the above models, training would require the processing of printed textual information and/or language-specific OCR modules. They face problems when texts are not visible (see Fig. 1 for an illustration) or when new classes are to be added to the model, since these texts must then be added manually. Our primary purpose is to elaborate a more general and easily extensible framework.
For the above reasons, we followed the approaches of (Zeng et al., 2017) and (Ling et al., 2020), where the utilization of metric learning was demonstrated for embedding pill images.
The winner (Zeng et al., 2017) of an algorithm challenge on pill recognition in 2016, announced by the United States National Library of Medicine (NLM-NIH) (Yaniv et al., 2016), used a multi-stream technique. In (Zeng et al., 2017) the visual information (e.g. colour, gray-scale, and gradient images of already localized pills) is processed by so-called 'training CNNs'. A knowledge distillation model compression framework then condensed the training CNNs into smaller-footprint CNNs ('student CNNs') employed during inference. The CNNs were designed to embed features in a metric space, where cosine distance was utilized to determine how similar the features produced by the CNNs are. During the training of the streams, Siamese neural networks (SNNs) were used with three inputs: the anchor image, a positive, and a negative sample, while the applied triplet loss was responsible for minimizing the distance between the anchor image and positive samples and increasing the distance between the anchor image and negative samples.
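For illustration, a minimal version of this triplet scheme over cosine distances might look as follows; the margin value of 0.2 is our assumption, not a value reported in (Zeng et al., 2017).

```python
# Minimal sketch of a triplet loss over cosine distances, as in the
# multi-stream training described above. The margin value is an assumption.
import torch
import torch.nn.functional as F

def triplet_loss_cosine(anchor, positive, negative, margin=0.2):
    """anchor/positive/negative: CNN embeddings of shape (B, D)."""
    d_ap = 1.0 - F.cosine_similarity(anchor, positive)  # anchor-positive
    d_an = 1.0 - F.cosine_similarity(anchor, negative)  # anchor-negative
    return F.relu(d_ap - d_an + margin).mean()

# Usage with random stand-in embeddings:
a, p, n = (torch.randn(8, 128) for _ in range(3))
loss = triplet_loss_cosine(a, p, n)
```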
The winner model was improved in (Ling et al., 2020) with better accuracy, tested on the CURE dataset. The teacher-student compression approach was dropped, and a separate OCR stream and a stream fusion network were introduced. The OCR stream was responsible for text localization, geometric normalization, and feature embedding with the Deep TextSpotter (Busta et al., 2017). In addition to the OCR stream, RGB, texture, and contour streams were used; segmentation was performed using an improved U-Net model to generate the stream inputs.
Our approach has a similar structure to (Ling et al., 2020) but with a few modifications: we replaced the OCR method with an LBP (local binary pattern) (Ojala et al., 1994) stream, we utilize a more refined backbone in the streams, we use a state-of-the-art YOLO network for object detection, and we added attention mechanisms to the models. The performance of our multi-stream framework was compared to the architecture of (Ling et al., 2020) in (Rádli et al., 2023b), using the CURE dataset, showing an advantage of a few percentage points in all test settings. The main contribution of this article is the improvement of our previous model by the introduction of a new triplet loss, which utilizes textual information about medicines. Details of our custom model are given in Section 4.
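For reference, an LBP representation of the kind fed to such a texture stream can be computed with standard tooling; the following sketch uses scikit-image, and the parameter choices (neighbourhood size P, radius R, 'uniform' method) are illustrative rather than our implementation's exact settings.

```python
# Sketch of computing a local binary pattern (LBP) image and histogram
# (Ojala et al., 1994) with scikit-image. P, R, and the 'uniform' method
# are illustrative choices, not necessarily our implementation's settings.
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_features(gray_image, P=8, R=1):
    """gray_image: 2-D array. Returns the LBP image and its histogram."""
    lbp = local_binary_pattern(gray_image, P, R, method="uniform")
    n_bins = P + 2  # the 'uniform' method produces P + 2 distinct codes
    hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)
    return lbp, hist
```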