accuracy with few-shot learning, which provided 20 sets of example sentences, and with RAG, which used only the chief complaints and ICD-10 codes from the old EMR as external data. Limiting the context in this way likely contributed to the improvement. For RAG without the correct cases, the reduced set of reference documents yielded lower accuracy than few-shot learning, highlighting the importance of data quality over quantity.
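As an illustration of the retrieval step described above, the following is a minimal sketch in which (chief complaint, ICD-10 code) pairs from the old EMR serve as the external database and the most similar past cases are retrieved for a new chief complaint. The TF-IDF retriever, scikit-learn, and all variable names are assumptions made for this sketch, not the study's actual implementation.

# Minimal RAG retrieval sketch: (chief complaint, ICD-10 code) pairs
# from the old EMR act as the external database. TF-IDF similarity is
# an illustrative stand-in for whatever retriever the study used.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical external data drawn from the old EMR.
emr_records = [
    ("fever and cough for three days", "J06"),
    ("chest pain radiating to the left arm", "I20"),
    ("persistent headache and nausea", "G43"),
]

def retrieve(query: str, k: int = 15) -> list[tuple[str, str]]:
    """Return the k most similar (chief complaint, ICD-10) chunks."""
    texts = [complaint for complaint, _ in emr_records]
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(texts + [query])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    return [emr_records[i] for i in top]

# The retrieved chunks would then be placed in the GPT-4 prompt as
# context before asking for candidate disease names.
context = retrieve("sudden chest pain and shortness of breath")

The default k = 15 mirrors the optimal number of reference chunks reported below.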
The evaluation set was limited to the top 20 disease names, and GPT-4 generated five candidate disease names per case. Expanding the evaluation set to a wider range of disease names and conducting evaluations using external data are necessary. In addition, subjective evaluation of the validity of the candidates and of the diagnostic reasoning by experienced physicians is important.
8 CONCLUSIONS
This study compared disease name estimation methods using semantic representation learning + machine learning, BERT, and GPT-4, and evaluated their accuracy. Despite being trained on only 1,605 chief complaints, semantic representation learning + machine learning showed slightly higher accuracy under certain conditions than BERT, which was fine-tuned on over 10,000 progress summaries. However, this approach was found to have limitations in estimating disease names from chief complaints.
For GPT-4, evaluation data were created from the top 20 disease names with the highest occurrence frequency in the new EMR, targeting cases with chief complaints of more than 10 characters. Evaluations using zero-shot learning, few-shot learning, and RAG demonstrated that RAG achieved the highest performance. When all chief complaints, including the evaluation data, were included in the database, the highest Top-5 accuracy of 84.5% was achieved, whereas excluding the evaluation data reduced the accuracy to 65.5%. The optimal number of reference chunks for RAG was 15. Even with the evaluation data excluded, limiting the database to the 20 diagnostic disease names improved the Top-5 accuracy to 82.5%. Furthermore, the latest GPT-4o model, evaluated under the same RAG conditions, further improved the Top-5 accuracy to 90.0%.
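For clarity, Top-5 accuracy as used above counts a case as correct when the true disease name appears among the model's five candidates. The following is a minimal sketch with hypothetical data; the actual evaluation pipeline is not shown here.

# Top-5 accuracy: a case counts as correct if the true disease name
# appears among the model's five candidates. All data are hypothetical.
def top_k_accuracy(candidates: list[list[str]], truths: list[str], k: int = 5) -> float:
    hits = sum(truth in preds[:k] for preds, truth in zip(candidates, truths))
    return hits / len(truths)

preds = [["influenza", "common cold", "pneumonia", "bronchitis", "asthma"]]
truth = ["pneumonia"]
print(top_k_accuracy(preds, truth))  # 1.0: the true name is among the top 5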
In future work, we aim to expand the benchmark to cover additional middle categories of ICD-10, conduct more extensive evaluations, and perform subjective evaluations by experienced physicians. The goal is to realize disease name estimation from chief complaints as a practical diagnostic support tool in clinical settings.
ACKNOWLEDGMENTS
Part of this study was conducted by Shuta Asai and Tatsuki Sakata as their graduation research in 2023 and is currently being continued by Mikio Osaki as part of his graduation research in 2024, all at Fukui University of Technology. We thank them for their contributions. This work was supported by JSPS KAKENHI Grant Numbers 24K14964 and 20K11833. This study was approved by the Ethical Review Committees of Fukui University of Technology and Toyama University Hospital.