making it challenging to understand and improve the model's decision-making process. Its outputs are also less controllable than those of models such as BERT, which is a drawback, particularly in situations where precise output is essential. As with BERT, a GPT-based model can be fine-tuned for specific tasks, but this, too, can require extensive computational resources and data.
Our study has limitations. First, the evaluation was performed on a relatively small sample of just 153 spine reports, which may not accurately reflect the processing capabilities of GPT-4 for spine radiology reports in real-world clinical settings. Second, using only the MIMIC-III and MTSamples databases restricts the diversity of the data; different databases may use different language and terminology, which might affect the model's performance. Third, each report was evaluated only once, yet the output of the model is not deterministic, so repeated evaluations could provide a more accurate assessment.
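Because the model's output is stochastic, one simple way to stabilize repeated evaluations is majority voting over several runs of the same prompt. The sketch below is our illustration of that idea, not a procedure used in the paper; the label values are hypothetical.

```python
from collections import Counter

def majority_label(labels):
    """Return the most frequent label across repeated runs;
    ties are broken by first-seen order."""
    return Counter(labels).most_common(1)[0][0]

# Three hypothetical runs of the same prompt on one sentence.
runs = ["negative", "negative", "neutral"]
print(majority_label(runs))  # negative
```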
6 CONCLUSION
In our study, we utilized GPT-4 for processing radiology reports, completing the entire task with a single prompt. We classified the sentences, determined the sentiment of each spine-related sentence, and extracted the level of anatomy, anatomy, and disorder triplets. Finally, we evaluated the method on two databases: 100 radiology spine reports from the MIMIC-III database and 53 radiology spine reports from the MTSamples collection. These results highlight how prompt-learning large language models can extract information from free-text radiology reports without expert knowledge or task-specific fine-tuning. According to our findings, GPT-4 achieved accuracy and F-score values above 91% in each of our five information-extraction subtasks. Our MTSamples input and output data, as well as our final prompt, are available in our online appendix.
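The single-prompt pipeline summarized above can be sketched as follows. The prompt wording, the JSON output schema, and the helper names are illustrative assumptions, not the authors' exact prompt; the model call is mocked so only the prompt construction and response parsing are shown.

```python
import json

# Illustrative single prompt covering the subtasks at once:
# sentence classification, sentiment, and triplet extraction.
# (Assumed wording -- not the paper's actual prompt.)
PROMPT_TEMPLATE = """You are given a spine radiology report.
For every sentence, return a JSON list of objects with keys:
  "sentence"  : the original sentence,
  "spine"     : true if the sentence is spine-related,
  "sentiment" : "positive", "negative", or "neutral" (spine sentences only),
  "triplets"  : list of [level, anatomy, disorder] triplets found.
Report:
{report}
"""

def build_prompt(report: str) -> str:
    """Fill the report text into the single extraction prompt."""
    return PROMPT_TEMPLATE.format(report=report)

def parse_response(raw: str):
    """Parse the model's JSON answer into (level, anatomy, disorder) tuples."""
    records = json.loads(raw)
    triplets = []
    for rec in records:
        for level, anatomy, disorder in rec.get("triplets", []):
            triplets.append((level, anatomy, disorder))
    return triplets

# A mocked model response standing in for an actual GPT-4 call.
mock_reply = json.dumps([
    {"sentence": "Disc herniation at L4-L5.",
     "spine": True, "sentiment": "negative",
     "triplets": [["L4-L5", "disc", "herniation"]]}
])
print(parse_response(mock_reply))
```

In practice, `build_prompt` would be sent to the GPT-4 API and the returned text passed to `parse_response`; a structured output format like this makes the extracted triplets directly comparable against a gold standard.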
ACKNOWLEDGEMENTS
The research presented in this paper was supported in part by the European Union project RRF-2.3.1-21-2022-00004 within the framework of the Artificial Intelligence National Laboratory. This work was also supported by national project TKP2021-NVA-09, implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the TKP2021-NVA funding scheme.
A Deep Dive into GPT-4’s Data Mining Capabilities for Free-Text Spine Radiology Reports
91