
posed CAMMA model we highlighted that multi-task
learning is useful for improving generalization and
reducing overfitting in models developed to solve
the MVQA task. Through the experiments performed
on the OVQA and VQA-Med 2019 datasets it was
empirically shown, in response to RQ2, that embedding
additional information into the model through multiple
classification heads is beneficial for improving model
performance. Regarding RQ3, the experimental results
showed that using additional tasks leads to a significant
improvement in model accuracy compared to the
single-task model.
5 CONCLUSIONS
In this paper we presented CAMMA, a cascading multi-
task architecture created for Medical Visual Question
Answering that obtained state-of-the-art results on the
OVQA dataset. In our experimental setup, multi-task
learning proved effective for the MVQA task, leading
to improved performance and reduced overfitting.
Although our choice of tasks is limited to the categories
for which annotations are available in the OVQA and
VQA-Med 2019 datasets, embedding additional
information into the model through multiple
classification heads is a useful technique that allows us
to deal with data scarcity. A consistent observation,
however, is that for this task a cascaded approach yields
increased performance, suggesting that answer
classification benefits from knowledge of the question
type and the answer type.
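To make the cascading idea concrete, the following is a minimal sketch in PyTorch of a multi-head classifier in which the question-type and answer-type predictions are fed, together with the fused image-question features, into the final answer head. The module and argument names are illustrative assumptions and do not correspond to the exact CAMMA implementation.

import torch
import torch.nn as nn

class CascadedHeads(nn.Module):
    """Illustrative cascaded multi-task head: auxiliary predictions
    (question type, answer type) condition the final answer classifier."""
    def __init__(self, dim, n_question_types, n_answer_types, n_answers):
        super().__init__()
        self.question_type_head = nn.Linear(dim, n_question_types)
        self.answer_type_head = nn.Linear(dim, n_answer_types)
        # The answer head also receives the auxiliary predictions (the cascade).
        self.answer_head = nn.Linear(dim + n_question_types + n_answer_types, n_answers)

    def forward(self, fused):
        # `fused` is assumed to be the joint image-question representation.
        qt_logits = self.question_type_head(fused)
        at_logits = self.answer_type_head(fused)
        cascade_input = torch.cat(
            [fused, qt_logits.softmax(dim=-1), at_logits.softmax(dim=-1)], dim=-1)
        answer_logits = self.answer_head(cascade_input)
        return qt_logits, at_logits, answer_logits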
Although we achieved impressive results, the complexity
of the problem leaves room for further improvement. A
direction for future work is to treat the task weights as
tunable hyperparameters, allowing the best balance
between the tasks to be found. Additionally, new tasks
could be added to the framework and their impact on the
proposed approach measured. Since extra classification
annotations are not available, a self-supervised candidate
task such as image or question reconstruction could be an
interesting approach.
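As a hypothetical illustration of how such task weights and an additional self-supervised term could be combined, the sketch below builds a weighted multi-task objective from cross-entropy losses over the three classification heads plus an optional reconstruction loss; the weight names and default values are assumptions, not settings from the paper.

import torch.nn.functional as F

def multitask_loss(qt_logits, at_logits, ans_logits,
                   qt_labels, at_labels, ans_labels,
                   lambda_qt=0.3, lambda_at=0.3,
                   rec_loss=None, lambda_rec=0.1):
    # Main task: answer classification.
    loss = F.cross_entropy(ans_logits, ans_labels)
    # Auxiliary tasks, weighted by tunable hyperparameters.
    loss = loss + lambda_qt * F.cross_entropy(qt_logits, qt_labels)
    loss = loss + lambda_at * F.cross_entropy(at_logits, at_labels)
    # Optional self-supervised term, e.g. image or question reconstruction.
    if rec_loss is not None:
        loss = loss + lambda_rec * rec_loss
    return loss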