current dataset. The parameters were set as follows: both models used SGDM as the gradient descent algorithm, the number of epochs was set to 30, and the initial learning rate was 0.005 with a drop period of 20. We further fine-tuned the models on the validation set. Then, predicted concepts with probabilities above a predefined threshold were selected as the preferred labels for a given test image. The concept selection rules applied to the output score matrix include the term frequency, the probability threshold, and the top rank of probabilities. Based on the validation set, we gradually adjusted the threshold from zero to 0.5, set the lowest term frequency to 5, and varied the top rank of probabilities from 1 to 5, increasing it by 1 at each iteration.
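A minimal sketch of this selection procedure is given below. The function and variable names are hypothetical, and the example values are purely illustrative; the actual thresholds were tuned on the validation set as described above.

import numpy as np

def select_concepts(scores, vocabulary, train_freq,
                    threshold=0.3, min_term_freq=5, top_k=5):
    """Select preferred concept labels for one image from a row of
    the MLC output score matrix, applying the three rules described
    above: top rank, probability threshold, and term frequency."""
    ranked = np.argsort(scores)[::-1]   # indices by descending probability
    selected = []
    for idx in ranked[:top_k]:          # top-rank rule
        if scores[idx] < threshold:     # threshold rule
            break
        concept = vocabulary[idx]
        if train_freq.get(concept, 0) < min_term_freq:  # term-frequency rule
            continue
        selected.append(concept)
    return selected

# Hypothetical usage on one image's score vector:
vocab = ["CT", "MRI", "Appendix", "Mass of body structure"]
freq = {"CT": 120, "MRI": 95, "Appendix": 7, "Mass of body structure": 3}
probs = np.array([0.91, 0.04, 0.42, 0.38])
print(select_concepts(probs, vocab, freq))  # ['CT', 'Appendix']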
As comparative experiments, both the DenseNet- and Inception v3-based MLC models were separately trained and verified on the three secondary annotated subsets. The parameters were set as follows: the gradient descent algorithms included SGDM, Adam, and RMSProp; the number of epochs was set to 20, and the initial learning rate was 0.001 with a drop period of 20. The threshold was gradually increased from zero to 0.5 with an interval of 0.1, the term frequency was set to 10, and the top rank ranged from 1 to 5 with an interval of 1. Then, following the late fusion strategy, the best results of the above methods were combined as the predicted concepts for test images, as sketched below.
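The late fusion step can be sketched as a simple union of the per-model predictions; this is an illustrative reading of the strategy, not necessarily the authors' exact implementation, and all names are hypothetical.

def late_fusion(predictions_per_model):
    """Combine the best concept predictions of several MLC models
    (e.g. DenseNet- and Inception v3-based) by taking their union,
    preserving first-seen order."""
    fused = []
    for preds in predictions_per_model:
        for concept in preds:
            if concept not in fused:
                fused.append(concept)
    return fused

# Hypothetical example: predictions from two backbones for one image.
densenet_preds = ["CT", "Abdomen"]
inception_preds = ["CT", "Appendix"]
print(late_fusion([densenet_preds, inception_preds]))
# ['CT', 'Abdomen', 'Appendix']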
Finally, the preferred concepts were filled into the sentence template to form a comprehensible description. A classical Dual path CNN model (Zheng et al. 2020) was taken as a comparison method.
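As an illustration of the template-filling step, the sketch below slots predicted concepts into a fixed sentence pattern; the pattern itself is a hypothetical example for illustration, not the paper's actual template.

def fill_template(imaging_type, anatomy, findings):
    """Fill preferred concepts into a fixed sentence pattern to form
    a readable caption (hypothetical pattern)."""
    caption = f"{imaging_type} image of the {anatomy}"
    if findings:
        caption += " showing " + ", ".join(findings)
    return caption + "."

print(fill_template("CT", "abdomen", ["appendix", "mass of body structure"]))
# CT image of the abdomen showing appendix, mass of body structure.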
4.3 Evaluation Criteria
In this study, the evaluation criteria follow the ImageCLEFmedical 2021 track (Pelka et al. 2021). For the concept detection task, the balanced precision-recall trade-off was measured in terms of the F1 score between predicted and ground-truth concepts, calculated with Python's scikit-learn library. The caption evaluation is based on the BLEU score (Papineni et al. 2002), an automatic evaluation method for machine translation, implemented with the BLEU scoring method of Python's NLTK (v3.2.2).
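Since both libraries are named in the text, a minimal sketch of the two metrics might look as follows; the binary encoding of concept sets over a shared vocabulary and the smoothing choice for short captions are our assumptions.

from sklearn.metrics import f1_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# F1 between predicted and ground-truth concepts for one image,
# encoded as binary indicator vectors over a shared vocabulary.
gt = [1, 0, 1, 1]    # e.g. CT, Appendix, Mass of body structure
pred = [1, 0, 1, 0]  # e.g. CT, Appendix
print(f1_score(gt, pred))  # 0.8

# BLEU between a predicted caption and its reference caption;
# smoothing is an assumption to handle short sentences.
reference = "ct image of the abdomen showing appendix".split()
candidate = "ct image showing appendix".split()
print(sentence_bleu([reference], candidate,
                    smoothing_function=SmoothingFunction().method1))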
5 RESULTS
Based on the re-annotated subsets, we validated the performance of the fine-grained MLC models separately. Preliminary results show that the Inception V3 model outperforms DenseNet in predicting Imaging Type labels, with an F1 score of 0.9273. However, the identification of other types of concepts, such as Findings, is far from satisfactory. One possible reason is that the hundreds of candidate labels in a training subset are still too many to train an effective MLC model on only a few thousand images.
Intuitively, images of similar cases may share similar anatomical structure or findings labels. However, the images in the original ImageCLEF dataset come from PMC articles, and the diversity and heterogeneity of their content and context make the dataset ill-suited to specific disease detection tasks, which in turn makes it difficult to predict accurate body parts, organs, or findings.
Table 4 shows the experimental results of our MLC models on the concept detection task. Among them, MLC_baseline denotes the MLC model trained on the overall concept set with the Inception-V3 backbone network. MedCC_FD denotes the fine-grained MLC model trained on the subset containing the Findings (FD) concepts; similarly, MedCC_* denotes the combination of the concepts predicted by different fine-grained MLC models. We also combined the fine-grained predicted concepts with the baseline.
Unexpectedly, the fine-grained MLC model trained on the subset of Imaging Types, i.e., MedCC_IT, obtained the best F1 score of 0.419, indicating that concepts of this type have high coverage in radiology images and are relatively concentrated, making them suitable for training an effective classification model. In contrast, MedCC_FD and MedCC_AS, which predicted body-related concepts or clinical findings, introduced more unmentioned words and reduced the overall score. However, previous experience gained through manual annotation suggests that some unmentioned terms are still worth considering when interpreting a given medical image. Figure 4 shows an example from the validation set. For a given medical image, MedCC produced a few medical concepts as well as a concise caption, in which the red concepts are consistent with the Ground Truth (GT), while unmatched concepts such as ‘Appendix’ and ‘Mass of body structure’ are also meaningful and related to the given image.
Table 5 shows the performance of MedCC on the caption prediction task. A Dual path CNN model was taken as the baseline and achieved a BLEU score of 0.137. Due to the limited predefined sentence patterns and the influence of the concept detection results, our pattern-based caption prediction model achieved a BLEU score of only 0.257. Case analysis shows that