Integrated Evaluation of Semantic Representation Learning, BERT, and Generative AI for Disease Name Estimation Based on Chief Complaints

Ikuo Keshi¹,², Ryota Daimon², Yutaka Takaoka³,⁵ and Atsushi Hayashi⁴,⁵

¹ AI & IoT Center, Fukui University of Technology, 3-6-1, Gakuen, Fukui, Japan
² Electrical, Electronic and Computer Engineering Course, Department of Applied Science and Engineering, Fukui University of Technology, 3-6-1, Gakuen, Fukui, Japan
³ Data Science Center for Medicine and Hospital Management, Toyama University Hospital, 2630 Sugitani, Toyama, Japan
⁴ Department of Ophthalmology, University of Toyama, 2630 Sugitani, Toyama, Japan
⁵ Center for Data Science and Artificial Intelligence Research Promotion, Toyama University Hospital, 2630 Sugitani, Toyama, Japan
Keywords:
Generative AI, Electronic Medical Record (EMR), Chief Complaints, Disease Name Estimation, Medical AI,
Medical Diagnostic Support Tool, Semantic Representation Learning, BERT, GPT-4.
Abstract:
This study compared semantic representation learning + machine learning, BERT, and GPT-4 for estimating disease names from chief complaints and evaluated their accuracy. Semantic representation learning + machine learning showed high accuracy for chief complaints of at least 10 characters classified at the middle-category level of the International Classification of Diseases, 10th Revision (ICD-10), slightly surpassing BERT. For GPT-4, the Retrieval-Augmented Generation (RAG) method achieved the best performance, with a Top-5 accuracy of 84.5% when all chief complaints, including the evaluation data, were used. Additionally, the latest GPT-4o model further improved the Top-5 accuracy to 90.0%. These results suggest the potential of these methods as diagnostic support tools. Future work aims to enhance disease name estimation through more extensive evaluations by experienced physicians.
1 INTRODUCTION
We developed a method for estimating disease names
based on learning semantic representations of medi-
cal terms to improve both accuracy and interpretabil-
ity (Keshi et al., 2022). While semantic representation
learning provides high interpretability for discharge
summaries, it struggles with texts with poor context,
such as a patient’s chief complaint. Therefore, we
aimed to improve the accuracy and interpretability of
disease name estimation by evaluating generative AI
techniques like GPT-4.
In this study, we first evaluated semantic representation learning to determine the chief-complaint conditions (character length and ICD-10 category level) under which generative AI would be evaluated. We then conducted a reference evaluation using BERT models (Devlin et al., 2019; Kawazoe et al., 2021) pretrained on Japanese clinical texts and on Wikipedia. Finally, we used an integrated approach to infer disease names from chief complaints, applying zero-shot learning, few-shot learning, and RAG with GPT-4. We comprehensively evaluated the accuracy of these approaches and explored their potential application to medical diagnosis.
This study highlights the importance of combining traditional supervised learning and generative AI techniques to improve the accuracy of disease name estimation, especially from data with minimal context such as chief complaints, which is crucial for addressing the challenges of medical diagnosis.
2 RELATED RESEARCH
The field of medical AI is rapidly advancing with
the application of large language models. Gen-
erative AI is being widely adopted in the medi-
cal field, and its democratization has the potential
to enhance diagnostic accuracy (Chen et al., 2024).
Google’s Med-PaLM 2, fine-tuned with medical texts, has shown high performance on the US medical licensing exam (Singhal et al., 2023). OpenAI’s GPT-4 can pass the Japanese national medical exam but still faces challenges in professional medical applications (Kasai et al., 2023).
Table 1: Number of cases in the old EMR corresponding to
the top 20 ICD-10 codes in the new EMR.
ICD-10 code new EMR old EMR
C34.1 1127 210
H25.1 929 123
C61 912 2216
C34.3 893 158
C22.0 864 1501
I20.8 698 75
I35.0 690 70
I50.0 545 166
C16.2 536 231
I67.1 515 387
C25.0 503 111
C15.1 483 253
I48 483 253
C34.9 468 1579
P03.4 432 399
C56 393 1276
M48.06 373 845
H35.3 368 1060
H33.0 361 625
C20 357 343
In the 2022 National Medical Licensing Examination (NMLE) in Japan, GPT-4 achieved a correct response rate of 81.5%, significantly higher than GPT-3.5’s 42.8%, and exceeded the passing standard of 72%, showing its potential to support diagnostic and therapeutic decisions (Yanagita et al., 2023).
Given these advancements, this study focuses on
utilizing these models to establish evaluation criteria
for estimating disease names from chief complaints.
3 DATASET
Developing disease-estimation AI models from electronic medical records (EMRs) faces the challenge of an accuracy drop when a model is applied across different hospitals. This study aims to create models that maintain high accuracy across two types of EMRs with different data distributions.
3.1 Progress Summary Dataset
The training data consist of discharge summaries from Toyama University Hospital stored in the old EMR (2004-2014, 94,083 cases), and the evaluation data consist of summaries from the new EMR (2015-2019, 61,772 cases). Data cleansing involved excluding cases with missing values, unused fields, rare disease names (frequency below 0.02%), and short progress summaries (fewer than 50 words).
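As an illustrative sketch (not the authors' pipeline), the cleansing criteria above can be expressed with pandas as follows; the column names and toy records are hypothetical, and counting words in Japanese text would in practice require a morphological analyzer.

import pandas as pd

# Hypothetical schema: the actual EMR export field names are not published.
df = pd.DataFrame({
    "icd10":      ["C34.1", "H25.1", None, "C61"],
    "summary":    ["...",   "...",   "...", "..."],
    "department": ["Resp",  "Ophth", "Uro", "Uro"],
    "gender":     ["F",     "M",     "M",   "M"],
    "age":        [72,      80,      65,    70],
})

# 1) Exclude cases with missing values in the fields used for modelling.
df = df.dropna(subset=["icd10", "summary", "department", "gender", "age"])

# 2) Exclude rare disease names (relative frequency below 0.02%).
freq = df["icd10"].value_counts(normalize=True)
df = df[df["icd10"].isin(freq[freq >= 0.0002].index)]

# 3) Exclude short progress summaries (fewer than 50 words; whitespace
#    splitting is only a stand-in for proper Japanese tokenization).
df = df[df["summary"].str.split().str.len() >= 50]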
Table 1 shows the number of cases in both EMRs
for the top 20 disease codes. Despite distribution dif-
ferences, the top 20 disease codes in the new EMR
appear in the old EMR, ensuring sufficient cases for
model training and evaluation.
The records include the ICD-10 code, the first 500 characters of the progress summary, department, gender, and age.
Table 2: The number of cases according to different chief complaint conditions.

Condition | old EMR | new EMR
Before data cleansing | 94,083 cases | 61,772 cases
After data cleansing | 73,150 cases | 48,911 cases
Subcategories with any chief complaint | 35,509 cases | 28,787 cases
Subcategories with chief complaints of more than 10 characters | 8,300 cases | 5,876 cases
Middle categories with chief complaints of more than 10 characters | 6,766 cases | 4,949 cases
Table 3: The number of cases for benchmarks focusing on the top 20 ICD-10 codes.

Condition | old EMR | new EMR
Subcategories with any chief complaint | 4,205 cases | 5,547 cases
Subcategories with chief complaints of more than 10 characters | 1,013 cases | 1,054 cases
Middle categories with chief complaints of more than 10 characters | 1,605 cases | 1,715 cases
3.2 Chief Complaint Dataset
Chief complaints were extracted from both EMRs.
Table 2 shows the variation in case numbers under
different conditions. Table 3 presents benchmarks for
the top 20 ICD-10 codes in the new EMR.
In the chief complaint dataset, restricting the minimum number of characters significantly reduces the number of cases but retains sufficient data for machine learning. Each record includes the ICD-10 code, chief complaint, department, gender, and age.
4 PROPOSED METHOD
We developed a model that estimates disease names from chief complaints by extending GPT-4 with EMR data. GPT-4 can pass the Japanese national examination for physicians, but its performance can be further improved using the chief complaint dataset described in Section 3. For comparative validation, this study also employs supervised learning (semantic representation learning + machine learning) and a BERT model pretrained on medical documents.
4.1 Semantic Representation Learning
of Medical Terms
The semantic representation learning process (Figure 1) involves using the first 500 characters of the progress summary. The step of obtaining a weight vector of the progress summary includes generating a paragraph vector (Le and Mikolov, 2014) with initial weights based on the medical-term semantic vector dictionary (Keshi et al., 2022). The resulting paragraph vector, which captures the semantic meaning of the text, is then combined with other explanatory variables such as gender, age, and department. The learning model subsequently uses linear SVM and logistic regression to classify the ICD-10 codes based on these features.

Figure 1: Semantic representation learning process based on the medical-term semantic vector dictionary.

Figure 2: Distribution of weights by ICD-10 code for the disease feature word “neonatal disorder” (P03.4: fetuses and neonates affected by cesarean delivery).
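The following sketch illustrates this pipeline, assuming gensim's Doc2Vec as the paragraph-vector implementation and scikit-learn for classification. The toy data, the 299-dimensional vector size (matching the number of feature words), and training the embedding from scratch instead of initializing it from the medical-term semantic vector dictionary are simplifications, not the authors' implementation.

import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins: tokenized progress summaries (first 500 characters each),
# case-level explanatory variables, and ICD-10 middle-category labels.
texts  = [["発熱", "咳嗽", "呼吸困難"], ["腹痛", "嘔吐", "食欲不振"], ["視力", "低下"]]
meta   = [["F", "70s", "Respiratory"], ["M", "60s", "Surgery"], ["F", "80s", "Ophthalmology"]]
labels = ["J18", "K35", "H25"]

# Paragraph vectors (Le and Mikolov, 2014). In the authors' method the initial
# weights come from the medical-term semantic vector dictionary (299 feature
# words); here the model is trained from scratch for brevity.
docs = [TaggedDocument(words=t, tags=[i]) for i, t in enumerate(texts)]
pv = Doc2Vec(docs, vector_size=299, min_count=1, epochs=40)
X_text = np.vstack([pv.dv[i] for i in range(len(texts))])

# Combine the paragraph vector with gender, age, and department.
X_meta = OneHotEncoder(handle_unknown="ignore").fit_transform(meta).toarray()
X = np.hstack([X_text, X_meta])

# Linear classifier over the combined features (LinearSVC can be swapped in).
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X[:1]))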
4.1.1 Structure of Medical-Term Semantic
Vector Dictionary
The structure of the medical-term semantic vector
dictionary is based on the disease thesaurus named
T-dictionary*1. It associates 299 feature words (264
disease feature words + 35 main symptoms) with ba-
sic disease names to provide semantic information for
interpretable disease name estimation (Figure 1).
4.1.2 Classification and Visualization
Figure 2 shows the top 20 ICD-10 codes on the ver-
tical axis and the weight distribution of the disease
feature word ”neonatal disorder” on the horizontal
axis. For ICD-10 code P03.4, the mean of the weight distribution is greater than 1.0, indicating fetuses and neonates affected by cesarean delivery. This
visualization facilitates the interpretation of how the
model arrived at a particular diagnosis by highlight-
ing the significance of specific disease feature words
in the classification process.
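A plot of this kind can be produced as in the following sketch; the weights are synthetic and serve only to illustrate the expected pattern, in which only P03.4 cases carry large weights for this feature word.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
codes = ["P03.4", "C34.1", "H25.1", "C61"]        # subset of the top-20 codes
# Synthetic per-case weights of the feature word "neonatal disorder".
weights = [rng.normal(1.2 if c == "P03.4" else 0.1, 0.2, size=50) for c in codes]

plt.boxplot(weights, vert=False)
plt.yticks(range(1, len(codes) + 1), codes)
plt.xlabel('Weight of disease feature word "neonatal disorder"')
plt.ylabel("ICD-10 code")
plt.tight_layout()
plt.savefig("feature_word_weights.png")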
4.2 Disease Name Estimation Using
BERT
We evaluated BERT models pretrained on general-domain and medical documents. Each pretrained model was fine-tuned to achieve accurate disease name estimation. Table 4 provides information on the BERT models used in this study.
*1 https://www.tdic.co.jp/products/tdic
*2 https://github.com/cl-tohoku/bert-japanese
*3 https://ai-health.m.u-tokyo.ac.jp/home/research/uth-bert
*4 https://github.com/ou-medinfo/medbertjp
Table 4: Information on the BERT Models Used.

Model Name | Model Size | Training Data
TU-BERT*2 (Tohoku University BERT) | Base | Japanese Wikipedia (approximately 17 million sentences)
UTH-BERT*3 (University of Tokyo Hospital BERT) | Base | Clinical texts (120 million records)
MedBERTjp*4 (Osaka University Graduate School of Medicine BERT) | Base | Japanese Wikipedia + corpus scraped from “Today’s Diagnosis and Treatment: Premium”
4.3 Estimation of Disease Names Using
GPT-4
We used GPT-4 (model version: 1106-Preview) from the Azure OpenAI Service*5. The chief complaint dataset was used for training and evaluation to avoid handling personal information. Additionally, we conducted an evaluation using the latest GPT-4o (model version: 2024-05-13) under the same conditions that yielded the best performance in the earlier evaluation.
4.3.1 Zero-Shot Learning
In zero-shot learning, GPT-4 estimated disease names
based solely on a system prompt, without any specific
training on the target dataset. This approach leverages
the model’s pre-existing knowledge to make predic-
tions, demonstrating its ability to infer disease names
from chief complaints even in the absence of domain-
specific data.
4.3.2 Few-Shot Learning
In few-shot learning, one pair of a chief complaint and its corresponding ICD-10 code was taken from the old EMR for each of the top 20 ICD-10 codes in the new EMR, providing 20 example responses to GPT-4.
4.3.3 RAG
The RAG approach used three databases:
RAG1: A database of chief complaints and ICD-
10 codes excluding the chief complaints of the top
20 ICD-10 codes in the new EMR.
RAG2: A database of chief complaints and ICD-
10 codes from the old EMR corresponding to the
top 20 ICD-10 codes from the new EMR.
RAG3: A database linking all chief complaints
with corresponding ICD-10 codes, including the
evaluation data.
*5 https://portal.azure.com/#view/Microsoft_Azure_ProjectOxford/CognitiveServicesHub/~/OpenAI
Figure 3: Experimental flow of semantic representation learning.
5 EXPERIMENTAL SETUP
5.1 Semantic Representation Learning
+ Machine Learning
We used vectors of disease feature words from se-
mantic representation learning to create models using
machine learning. Statflex*6 was employed for inter-
pretability evaluation to graph the variance and mean
of the vectors. Figure 3 shows the experimental flow
of disease name estimation from chief complaints
using semantic representation learning and machine
learning.
The datasets of all chief complaints shown in Ta-
ble 2 (35,509 cases in the old EMR and 28,787 cases
in the new EMR) were used for semantic representa-
tion learning. We evaluated each benchmark shown
in Table 3. Both linear SVM and logistic regression
were evaluated due to the shorter text length of chief
complaints.
We determined the optimal chief-complaint conditions based on the overall accuracy and the macro-averaged F1 score for the top 20 ICD-10 codes. These conditions were then used in the subsequent BERT and GPT-4 experiments.
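As an illustrative sketch (not the authors' code), the grid search over the regularization parameter C reported in Section 6.1 and the two selection metrics can be computed with scikit-learn as follows; the synthetic data and the grid values are placeholders.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the chief-complaint feature matrix and ICD-10 labels.
X, y = make_classification(n_samples=400, n_features=50, n_informative=20,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=0)

# Grid search over C (grid values illustrative; the paper reports tuned
# values such as C = 34.0 and C = 49.0 for logistic regression).
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [1, 10, 34, 49, 60, 100, 250]},
                    scoring="f1_macro", cv=5)
grid.fit(X_train, y_train)

pred = grid.predict(X_eval)
print("Overall accuracy:", accuracy_score(y_eval, pred))
print("Macro-averaged F1:", f1_score(y_eval, pred, average="macro"))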
5.2 BERT
All training data for fine-tuning BERT were taken from the progress summary dataset in the old EMR; a condensed fine-tuning sketch follows the two evaluation methods below. The evaluation consisted of two methods:
Extracting progress summaries related to the top
20 ICD-10 codes from the new EMR and classi-
fying them as evaluation data.
Extracting chief complaints related to the top 20
ICD-10 codes from the new EMR and classifying
them as evaluation data.
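The sketch below shows fine-tuning with Hugging Face Transformers. The publicly released Tohoku University model identifier, the toy examples, and the hyperparameters are assumptions for illustration (UTH-BERT and MedBERTjp are distributed separately), not the authors' training configuration.

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy pairs: clinical text and an index into the 20 target ICD-10 codes.
# The tokenizer for this model additionally needs fugashi and ipadic installed.
texts, labels = ["咳嗽と発熱が2週間続いている", "右眼の視力低下を自覚した"], [0, 1]

name = "cl-tohoku/bert-base-japanese-whole-word-masking"   # assumed checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=20)

enc = tok(texts, truncation=True, max_length=512, padding=True, return_tensors="pt")

class SummaryDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-icd10", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=SummaryDataset(),
)
trainer.train()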
*6 https://www.statflex.net/
5.3 GPT-4
For GPT-4 experiments, we used the chief complaint
dataset to avoid personal information.
5.3.1 Zero-Shot Learning
GPT-4 estimated disease names based solely on a sys-
tem prompt, without any specific training on the target
dataset.
System Prompt Example

# Role
You are an experienced doctor at a hospital. You will answer questions from young doctors and medical staff in Japanese.

# Objective
Based on the input of the patient's chief complaint, you will perform the following tasks:
- Estimate the patient's disease and provide up to five possible diagnoses along with their ICD-10 codes of middle categories.

# Data Specifications
For each chief complaint, display the ICD-10 code of the middle categories and the top five candidate diagnoses.

# Output Format
The output should be in the following JSON format:
(format details omitted)
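A minimal call sketch using the openai Python SDK (v1.x) against the Azure OpenAI Service is shown below; the endpoint, deployment name, API version, and the abbreviated prompt are placeholders rather than the authors' exact settings.

import os
from openai import AzureOpenAI  # openai Python SDK v1.x

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],   # placeholder settings
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

SYSTEM_PROMPT = (
    "# Role\nYou are an experienced doctor at a hospital. ...\n"
    "# Output Format\nReturn the top five candidate ICD-10 middle-category "
    "codes as JSON."   # abbreviated version of the prompt shown above
)

resp = client.chat.completions.create(
    model="gpt-4-1106-preview",        # the Azure *deployment* name; a placeholder
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "食欲不振、全身倦怠感"},   # chief complaint
    ],
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)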
5.3.2 Few-Shot Learning
Few-shot learning involved providing example sen-
tences to GPT-4 to enable in-context learning.
Few-shot Learning Example

{"role": "user", "content": "Loss of appetite, generalized fatigue, pain in dark surroundings"},
{"role": "assistant", "content": "[{"Estimated Disease": "C25", "Diagnosis": "Cancer of the pancreas"}]"}
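Continuing the sketch above, the 20 example pairs can simply be prepended to the chat messages as alternating user/assistant turns; the pairs shown here are illustrative, not the actual examples used in the study.

# One (chief complaint, ICD-10 middle-category code) pair per top-20 code,
# drawn from the old EMR in the paper; illustrative values here.
examples = [
    ("食欲不振、全身倦怠感、背部痛", "C25"),
    ("労作時の胸痛", "I20"),
    # ... one pair for each of the remaining top-20 ICD-10 codes
]

messages = [{"role": "system", "content": SYSTEM_PROMPT}]
for complaint, code in examples:
    messages.append({"role": "user", "content": complaint})
    messages.append({"role": "assistant",
                     "content": f'[{{"Estimated Disease": "{code}"}}]'})
messages.append({"role": "user", "content": "目のかすみと視力低下"})  # query case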
5.3.3 RAG
In the experiment, the three configurations RAG1,
RAG2, and RAG3 described in the proposed method
were used to evaluate the performance of the model.
Each configuration was designed to test the model un-
der different conditions, focusing on the availability
and relevance of reference data.
RAG External Data Example
Di a g no s is C ode : C34
C34 , B ac k pain , a b dom i nal pain , liv er
dys f u nc t i on
C34 , Ab n or m al s e n sa t ion in the ri gh t
up pe r arm , sw e ll i ng in the ri gh t
s upr a c l avi c u l a r f oss a
For RAG, new and old EMR chief complaints were written to text files for each ICD-10 middle-category code and managed in an Azure Blob Storage container. The data was chunked into 512-token segments with a 128-token overlap. The search used
Azure AI Search’s hybrid (keyword + vector) search
and semantic ranking features (Berntson et al., 2023).
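A sketch of this chunking step is shown below; tiktoken and the cl100k_base encoding are assumptions for token counting (the paper does not specify the tokenizer), and the indexing and hybrid retrieval themselves are handled by Azure AI Search.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer choice is an assumption

def chunk(text: str, size: int = 512, overlap: int = 128) -> list[str]:
    """Split text into `size`-token chunks that overlap by `overlap` tokens."""
    ids = enc.encode(text)
    step = size - overlap
    return [enc.decode(ids[i:i + size])
            for i in range(0, max(len(ids) - overlap, 1), step)]

# One text file per ICD-10 middle-category code; a synthetic stand-in here.
sample = "C34, Back pain, abdominal pain, liver dysfunction\n" * 200
print(len(chunk(sample)), "chunks")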
Based on the results of the semantic representation learning experiments, RAG was constructed targeting chief complaints of more than 10 characters in the ICD-10 middle categories. RAG1 and RAG3 included 872 types of ICD-10 codes, while RAG2 focused on the top 20 ICD-10 codes from the new EMR. For evaluation, the zero-shot learning, few-shot learning, and RAG methods used the same 200 sets of evaluation data, constructed by randomly selecting 10 chief complaints from each of the top 20 ICD-10 codes in the new EMR. Each evaluation data set had exactly one correct ICD-10 code. The results of these evaluations are presented in the following section.
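As a sketch of how such an evaluation set could be drawn (hypothetical column names and toy data, assuming a pandas DataFrame of new-EMR chief complaints):

import pandas as pd

# Hypothetical stand-in for the new-EMR chief-complaint table.
new_emr = pd.DataFrame({
    "icd10_middle": ["C34"] * 12 + ["H25"] * 12,
    "complaint": [f"complaint {i}" for i in range(24)],
})
top20 = ["C34", "H25"]                      # in the paper: the top 20 codes

eval_set = (new_emr[new_emr["icd10_middle"].isin(top20)]
            .groupby("icd10_middle")
            .sample(n=10, random_state=0))  # 10 complaints per code -> 200 cases
print(len(eval_set))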
6 EVALUATION RESULTS
6.1 Semantic Representation Learning
+ Machine Learning
The evaluation results of disease name estimation
using semantic representation learning and machine
learning (logistic regression and linear SVM) based
on the chief complaint benchmarks are shown in the
first six rows of Table 5. The regularization parameter C was determined using a grid search. With logistic regression, the highest overall accuracy was 62.0%, obtained when chief complaints had more than 10 characters and the ICD-10 codes were categorized at the middle level, and the highest macro-averaged F1 score was 51.7 points, obtained at the subcategory level under the same character condition. Linear SVM achieved its best results (accuracy: 56.1%, F1-score: 49.1) with chief complaints of more than 10 characters and ICD-10 codes categorized at the middle level.
Table 5: Evaluation results of disease name estimation from chief complaints and progress summaries.

Model Name | Type of Evaluation Data | C value | Accuracy | F1-score
Semantic Representation Learning + Logistic Regression | Chief Complaints (Any chars, Subcategories) | 60.0 | 36.0% | 29.5
Semantic Representation Learning + Logistic Regression | Chief Complaints (10+ chars, Subcategories) | 49.0 | 49.4% | 51.7
Semantic Representation Learning + Logistic Regression | Chief Complaints (10+ chars, Middle Categories) | 34.0 | 62.0% | 49.2
Semantic Representation Learning + Linear SVM | Chief Complaints (Any chars, Subcategories) | 250 | 26.2% | 22.7
Semantic Representation Learning + Linear SVM | Chief Complaints (10+ chars, Subcategories) | 130 | 44.5% | 48.6
Semantic Representation Learning + Linear SVM | Chief Complaints (10+ chars, Middle Categories) | 41.0 | 56.1% | 49.1
Semantic Representation Learning + Linear SVM | Progress Summaries (500 chars, Subcategories) | N/A | 69.5% | 72.1
TU-BERT | Progress Summaries (500 chars, Subcategories) | N/A | 77.5% | 80.0
UTH-BERT | Progress Summaries (500 chars, Subcategories) | N/A | 83.8% | 85.3
MedBERTjp | Progress Summaries (500 chars, Subcategories) | N/A | 77.1% | 80.4
TU-BERT | Chief Complaints (10+ chars, Middle Categories) | N/A | 52.2% | 44.1
UTH-BERT | Chief Complaints (10+ chars, Middle Categories) | N/A | 61.1% | 53.7
MedBERTjp | Chief Complaints (10+ chars, Middle Categories) | N/A | 53.4% | 45.7
Figures 4 and 5 show the evaluation results of
ICD-10 codes categorized at the middle and subcat-
egory levels for chief complaints with more than 10
characters when using logistic regression. For the
middle categories, three ICD-10 codes (I20, L40,
M47) had an F1 score of 0, while no subcategory dis-
ease names had an F1 score of 0. This suggests a
higher overfitting risk for subcategories. Therefore,
the condition of chief complaints with more than 10
characters at the middle category level will be used
for BERT and GPT-4 evaluations.
6.2 BERT
The middle four rows of Table 5 show the evaluation results of classifying progress
summaries (up to 500 characters) extracted from the
top 20 ICD-10 codes (subcategories) in the new EMR
as evaluation data. The macro-average F1-score for
semantic representation learning was 72.1, while the
fine-tuned large language model using UTH-BERT
achieved a macro-average F1-score of 85.3, sur-
passing semantic representation learning by over 10
points.
For the evaluation based on chief complaints, as
shown in the last three rows of Table 5, UTH-BERT
had the highest accuracy and macro-average F1 score
among the BERT models. However, the accuracy of
semantic representation learning combined with lo-
gistic regression slightly exceeded that of the BERT models.
Figure 4: Disease name estimation using semantic represen-
tation learning and logistic regression for ICD-10 codes cat-
egorized at the middle level with chief complaints of more
than 10 characters.
6.3 GPT-4
Table 6 shows the evaluation results of GPT-4 in es-
timating disease names from chief complaints (200
sets of evaluation data). The Top-5 accuracy was
measured, considering a result correct if the correct ICD-10 code was among the top five candidates.
Figure 5: Disease name estimation using semantic repre-
sentation learning and logistic regression for ICD-10 codes
categorized at the subcategory level with chief complaints
of more than 10 characters.
Table 6: Evaluation results of disease name estimation from chief complaints (200 sets of evaluation data).

Method | Top-5 Acc. | Top-1 Acc.
Zero-shot Learning | 52.5% | 22.0%
Few-shot Learning | 61.0% | 20.0%
RAG1: All cases except the benchmark cases in the new EMR (15 reference documents) | 65.5% | 19.5%
RAG2: Only the benchmark cases in the old EMR (5 reference documents) | 82.5% | 24.0%
RAG3: All cases, including the benchmark cases in the new EMR (15 reference documents) | 84.5% | 25.0%
RAG3: GPT-4o | 90.0% | 26.5%
Zero-shot learning achieved a Top-5 accu-
racy of 52.5%, while few-shot learning improved it
to 61.0%. RAG1 achieved 65.5% with 15 reference
documents, RAG2 reached 82.5% with 5 reference
documents, and RAG3 achieved the highest Top-5 ac-
curacy of 84.5% with 15 reference documents.
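For reference, the Top-5 accuracy used here can be computed in a few lines; the candidate lists below are illustrative, not drawn from the evaluation data.

def top_k_accuracy(predictions: list[list[str]], gold: list[str], k: int = 5) -> float:
    """predictions[i]: candidate ICD-10 codes (best first) returned for case i."""
    hits = sum(g in cands[:k] for cands, g in zip(predictions, gold))
    return hits / len(gold)

# A single toy case: the correct code C25 appears among the five candidates.
print(top_k_accuracy([["C34", "J18", "C25", "I20", "I50"]], ["C25"]))   # 1.0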
Additionally, the latest GPT-4o was evaluated un-
der the same conditions as RAG3, achieving the high-
est Top-5 accuracy of 90.0%. Excluding one chief
complaint where a response was not generated due
to content filtering, GPT-4o’s Top-5 accuracy reached
90.5%.
Figure 6 illustrates the relationship between the
number of reference documents and the Top-5 accu-
racy for RAG1. The accuracy improves as the num-
ber of reference documents increases, with the best
performance achieved at 15 reference documents.
Figure 6: Top-5 Accuracy vs Number of Reference Docu-
ments.
7 DISCUSSION
This study confirmed that the accuracy of disease
name estimation significantly decreases when chang-
ing the target from progress summaries to chief
complaints. However, using semantic representa-
tion learning, logistic regression achieved an accu-
racy of 62.0% for chief complaints of more than
10 characters classified at the middle category level.
This slightly exceeded the accuracy of UTH-BERT,
which was fine-tuned with over 10,000 progress sum-
maries, while semantic representation learning used
only 1,605 chief complaints. However, for 3 out of the
20 ICD-10 codes, the estimation accuracy was 0%.
This is because chief complaints often consist of gen-
eral symptoms like “fever” or “dizziness,” which do
not include disease names registered in the medical-
term semantic vector dictionary. If the chief com-
plaint does not include a disease name, the feature
vector does not change, leading to estimation failure.
In cases where the data is rich in context, such
as progress summaries of up to 500 characters, SVM
tends to perform better due to its ability to capture
complex relationships within the data. However, for
datasets like chief complaints, which are often lack-
ing in context, logistic regression may be more suit-
able. This is because logistic regression is a simpler
model that is less prone to overfitting, making it bet-
ter suited to handle sparse and less informative data.
The results suggest that logistic regression was bet-
ter suited for the chief complaint dataset due to its
simplicity and robustness. Similarly, this may also
explain why semantic representation learning slightly
outperformed BERT, as the former was better able to
handle the limited context and information present in
the chief complaints.
GPT-4 showed significant improvement in Top-5
accuracy with few-shot learning, which provided 20 sets of example sentences, and with RAG using only the chief complaints and ICD-10 codes from the old EMR as external data (RAG2). Supplementing the limited context of the chief complaints likely contributed to this improvement. For RAG without the correct cases (RAG1), fewer reference documents resulted in lower accuracy than few-shot learning, highlighting the importance of data quality over quantity.
The evaluation set was limited to the top 20 dis-
ease names, and GPT-4 generated 5 candidate disease
names. Expanding the evaluation set to a wider range
of disease names and conducting evaluations using
external data is necessary. Additionally, subjective evaluation of the validity of the estimates and of the diagnostic rationale by experienced physicians is important.
8 CONCLUSIONS
This study compared disease name estimation meth-
ods using semantic representation learning + machine
learning, BERT, and GPT-4, and evaluated their ac-
curacy. Despite being trained on only 1,605 chief
complaints, semantic representation learning + ma-
chine learning showed slightly higher accuracy than
BERT, which was fine-tuned on over 10,000 progress
summaries, under certain conditions. However, it was
found to have limitations in disease name estimation
based on chief complaints.
For GPT-4, evaluation data were created based on
the top 20 disease names with the highest occurrence
frequency in the new EMR, targeting cases with chief
complaints of more than 10 characters. Evaluations
using zero-shot learning, few-shot learning, and RAG
demonstrated that RAG achieved the highest perfor-
mance. When all chief complaints, including the eval-
uation data, were used, the highest Top-5 accuracy
of 84.5% was achieved, while excluding the evalua-
tion data decreased the accuracy to 65.5%. The op-
timal number of reference chunks for RAG was 15.
Even when excluding the evaluation data, limiting the
database to the 20 diagnostic disease names improved
the Top-5 accuracy to 82.5%. Furthermore, the latest
GPT-4o model was evaluated under the same condi-
tions as RAG, and it further improved the Top-5 ac-
curacy to 90.0%.
In the future, we aim to expand the benchmark to
cover additional middle categories of ICD-10, con-
duct more extensive evaluations, and perform subjec-
tive evaluations by experienced physicians. The goal is to implement disease name estimation from chief complaints as a practical diagnostic support tool in medical settings.
ACKNOWLEDGMENTS
Part of this study was conducted by Shuta Asai and Tatsuki Sakata as their graduation research in 2023, and is currently being continued by Mikio Osaki as part of his graduation research in 2024, all at Fukui University of Technology. We thank them for their contributions. This work was supported by JSPS KAKENHI Grant Numbers 24K14964 and 20K11833.
This study was approved by the Ethical Review Com-
mittee of the Fukui University of Technology and the
Toyama University Hospital.
REFERENCES
Berntson, A. et al. (2023). Azure AI Search: Outperforming vector search with hybrid retrieval and ranking capabilities. https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-search-outperforming-vector-search-with-hybrid/ba-p/3929167. Accessed: 2024-05-18.
Chen, A., Liu, L., and Zhu, T. (2024). Advancing the democratization of generative artificial intelligence in healthcare: a narrative review. Journal of Hospital Management and Health Policy, 8.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019.
Kasai, J., Kasai, Y., Sakaguchi, K., Yamada, Y., and Radev, D. (2023). Evaluating GPT-4 and ChatGPT on Japanese medical licensing examinations. arXiv preprint.
Kawazoe, Y., Shibata, D., Shinohara, E., Aramaki, E., and Ohe, K. (2021). A clinical specific BERT developed using a huge Japanese clinical text corpus. PLoS One, 16(11).
Keshi, I., Daimon, R., and Hayashi, A. (2022). Interpretable disease name estimation based on learned models using semantic representation learning of medical terms. In Coenen, F., Fred, A. L. N., and Filipe, J., editors, Proceedings of the 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, IC3K 2022, Volume 1: KDIR, Valletta, Malta, October 24-26, 2022, pages 265-272. SCITEPRESS.
Le, Q. V. and Mikolov, T. (2014). Distributed representations of sentences and documents. In Proc. of ICML, pages 1188-1196.
Singhal, K. et al. (2023). Towards expert-level medical question answering with large language models. arXiv preprint.
Yanagita, Y., Yokokawa, D., Uchida, S., Tawara, J., and Ikusaka, M. (2023). Accuracy of ChatGPT on medical questions in the National Medical Licensing Examination in Japan: Evaluation study. JMIR Form Res, 7:e48023.