Medical Chatbot for Disease Prediction Using Machine Learning and

Symptom Analysis

Oltean Anisia Veronica

, Ioan Daniel Pop

and Adriana Mihaela Coroiu

“Babes-Bolyai” University, Department of Computer Science, 400084, Cluj-Napoca, Romania

Keywords:

Disease Prediction, Logistic Regression, Random Forests, Decision Trees, Naive Bayes, Multilayer

Perceptron.

Abstract:

This paper emphasizes the transformational role of artiﬁcial intelligence in the medical ﬁeld by studying not

only various machine learning algorithms used for symptoms-based disease prediction, but also methods used

in conversational artiﬁcial intelligence. At its core, the research was carried out as a ﬁrst step in the devel-

opment of a medical chatbot that allows patients to receive diagnosis and advice related to various diseases

and their possible treatments. In our paper, various machine learning algorithms were compared for predict-

ing diseases based on symptoms, such as Logistic Regression, Random Forests, Decision Trees, Naive Bayes

and Multilayer Perceptron, which were evaluated on multiple datasets. Given the lack of publicly available

datasets for such a task, a ﬁnal dataset was generated, achieving satisfactory accuracy values of approximately

80%.

1 INTRODUCTION

Health is one of the most important things in a per-

son’s life. Studies on improving health date back

thousands of years. Considering the advance of tech-

nology in the last decades, it is not at all surprising

that various studies have been carried out regarding

its integration into the medical system. Chatbots are

among the technological tools most used in the med-

ical ﬁeld. Chatbots are used in various ﬁelds, among

the most well-known being health, customer relations

and ﬁnance services, with the main advantage of im-

proving interaction with customers through less hu-

man intervention (an example would be virtual agents

that assist in placing an order and are available 24/7,

without the need to call a call center). They have be-

come increasingly popular in recent years due to ad-

vancements in artiﬁcial intelligence and natural lan-

guage processing, they are proving increased skills

and performance in understanding users and provid-

ing coherent answers. As the source (Siddique and

Chow, 2021) mentions, in the ﬁeld of healthcare, chat-

bots have a huge development potential, because they

can improve the patient experience through remote

https://orcid.org/0009-0003-7706-9583

https://orcid.org/0000-0002-3740-6579

https://orcid.org/0000-0001-5275-3432

monitoring, being able to provide quick and timely

responses. Access to health is essential for everyone,

but scheduling a medical consultation is not always

a viable solution due to long waiting times and high

costs, a medical chatbot can be the solution to such

problems. For such a conversational agent to be effec-

tive, it needs to process and analyze user queries, ex-

tract relevant information such as symptoms and pro-

vide a diagnosis as accurate as possible, accompanied

by any advice that should be followed (treatments,

scheduling a consultation or even an emergency visit

to the doctor). It is important that the chatbot has a

natural language dialogue with the patient and pro-

vides personalized answers, depending on the condi-

tion and symptoms identiﬁed. In essence, chatbots

represent valuable tools in medicine, whose purpose

is not to replace medical staff, but rather to highlight

an interdisciplinary approach between technology and

specialists to beneﬁt patients (Altamimi et al., 2023).

Considering all of this, our paper aims to highlight

the role of artiﬁcial intelligence in medicine, address-

ing two main topics, namely the use of algorithms to

predict diseases based on symptoms and the develop-

ment of a medical chatbot to facilitate dialogue with

patients through natural language. Therefore, the cre-

ated chatbot has as main objectives:

• recognizing symptoms from user phrases

• collection of associated symptoms, simulating the

600

Veronica, O. A., Pop, I. D. and Coroiu, A. M.

Medical Chatbot for Disease Prediction Using Machine Learning and Symptom Analysis.

DOI: 10.5220/0013357900003928

In Proceedings of the 20th International Conference on Evaluation of Novel Approaches to Software Engineering (ENASE 2025), pages 600-607

ISBN: 978-989-758-742-9; ISSN: 2184-4895

dialogue between doctor and patient

• providing a diagnosis for the symptoms found (as-

sumes training the classiﬁcation model)

• to provide explanations and possible treatments

for the diseases identiﬁed

2 LITERATURE REVIEW

In light of the importance of the ﬁeld, there are a va-

riety of studies that address this topic, in the follow-

ing we will analyze the studies that we considered the

most relevant.

The ﬁrst analyzed case study (Tode et al., 2021),

the chatbot CureBot proposes a relatively simple ar-

chitecture in building the bot, the preprocessing of the

message from the user is done with the Natural Lan-

guage Toolkit (NLTK) library to ignore the punctua-

tion. Phrases/sentences are turned into tokens, after

which the bag of words technique is applied to en-

code words into numerical data. The neural network

model is created using the Keras framework (sequen-

tial, with 2 hidden starts), softmax activation function

and SGD (stochastic gradient descent) optimization

for sentence intent classiﬁcation. The training takes

place based on a ﬁle in .json format called ’intents’

that contains the tag, followed by a series of phrases

for matching (pattern matching) and a series of sen-

tences to generate the response. Although the source

of the data or the accuracy of the model is not men-

tioned, the author mentions that the model is built to

identify the presence or absence of Covid-19 infec-

tion based on user inputted symptoms. In a similar

way, the HELPI chatbot (Karuna et al., 2023) is built,

which uses the data set ”The Disease Symptom Pre-

diction Dataset”, available on Kaggle (approx. 5000

rows) with 133 columns (the ﬁrst 132 represent the

symptoms, and the last one is the disease), there be-

ing 42 possible diseases (diagnoses) in total. The pre-

diction of the diagnosis takes place through decision

trees, using the CART (Classiﬁcation and Regression

Trees) algorithm, but even in this case no metrics are

mentioned regarding the performance of the model.

The next model analyzed is Diabot (Bali et al.,

2019), a chatbot built speciﬁcally for diabetes pre-

diction based on user-input symptoms. On the back-

end, the RASA framework is used, and the classiﬁca-

tion problem (presence/absence of the disease) is the

result of an ensemble learning approach.The dataset

used contains training on 768 women from a popu-

lation in Phoenix, Arizona, USA, of which 258 had

diabetes and the remaining 500 did not. There are 9

attributes in total (8 represent the factors considered in

disease prediction, the last one being the target vari-

able – 0 or 1). When training the model, the result of

6 models (Multinomial Na

ıve Bayes, Decision Trees,

Random Forest, k-NN, Logistic Regression and Gra-

dient Boost) are combined, each of them being trained

individually. The ﬁnal decision is made by voting, us-

ing the majority voting algorithm – of the two possi-

ble classes (0 or 1), the one predicted by most of the

models wins. The proportion of the data set used for

training and testing is 80 – 20%, with the accuracy of

the model reaching 82%.

The Kiwi chatbot (Chakraborty et al., 2022) is

a purpose-built model to answer questions/queries

about the Covid-19 infection using information from

a dataset built by the author. Similar to the ﬁrst ana-

lyzed model (Tode et al., 2021), the data is organized

in a JSON ﬁle containing several tags representing the

categories of information in which the user’s message

will have to be classiﬁed. Each category contains

several ’patterns’, phrases that describe examples of

possible queries that can come from the user. The

model will choose the most appropriate response from

a set of predeﬁned responses. Before building the

model, language processing techniques are applied

to pre-process the text such as punctuation ignoring

and lemmatization, and the bag of words technique

is used to map the words/phrases as numbers. The

built model is based on a neural network with 3 layers,

among which the hidden layer has the ReLu activation

function, while the output layer has softmax. Categor-

ical Crossentropy or Adam were used as optimizers

for the model. When testing the model, the author

uses an improved model with encoder-decoder archi-

tecture (assumes the addition of LSTM - Long Short

Term Memory layers), but the conﬁguration is not dis-

closed. The accuracy of the model reaches 94% and

is compared to other possible methods tried (recurrent

neural networks RNN or decision trees), but this is the

variant with the best results.

The following analyzed article (Vasileiou and Ma-

glogiannis, 2022) proposes the development of an in-

telligent system based on dialogue with application

in telemedicine. Thus, the authors propose a chatbot

model developed with the DialogFlow conversational

AI platform. The NLP component of the platform

deals with analyzing the text, establishing the user’s

intention (intent classiﬁcation) and also identifying

keywords. The system response is either predeﬁned

(a response is chosen from the training set introduced

in the platform) or from an ML engine (ML Engine

– used for diagnostics). The Accuracy of the model

reaches 98.3%. The second model, Heart Disease,

uses as data set the Cleveland Heart Dataset (UCI)

– composed of 303 medical records with 14 attributes

Medical Chatbot for Disease Prediction Using Machine Learning and Symptom Analysis

601

that are taken into account when predicting the dis-

ease). The data set split is 33% testing and 67% train-

ing, here using several classiﬁcation models based on

the sklearn library – logistic regression, SVC (sup-

port vector classiﬁer), Gaussian Naive Bayes Classi-

ﬁer, Decision Tree Classiﬁer (decision trees) and Ran-

dom Forest. The best performing model was the lo-

gistic regression model, with 82% accuracy.

In the paper (Polignano et al., 2020), the authors

propose a personal medical assistant called HealthAs-

sistantBot (HOB), specialized for the Italian lan-

guage. The interaction with the conversational agent

takes place through the Telegram platform, the built

system having 2 main tasks: Intent Recognition and

Entity Recognition. Classiﬁcation algorithms such

as Naive Bayes, Logistic Regression, Decision Tree

Forest and a Multilayer Perceptron Network were use

to create the model. The performance of the model

was tested considering metrics such as accuracy and

F1 score, and it was found that the Naive Bayes

model performed best for k between 1000 and 2500,

with values for accuracy and F1 score equal to 0.942

(k=1000) and 0.87 ( k=2500).

In the paper (Shedthi B et al., 2023), the authors

propose the development of a website where users

can communicate different aspects related to health,

as well as with an integrated chatbot that has the

task of identifying the user’s symptoms and provid-

ing a diagnosis based on machine learning algorithms.

It is considered that a minimum of 3 symptoms are

needed for the prediction of the disease to occur. The

dataset used is available on Kaggle (133 columns and

41 symptoms), where values of 1 (symptom present)

or 0 (symptom absent) can appear on each row. All

the algorithms used in the work provide over 90% ac-

curacy. In Table 1 it can be seen the result for each

algorithm. In the table below, the following abbrevi-

ations were used: Support Vector Machine as SVM,

Random Forest as RF, K-Nearest Neighbors as KNN,

Bayesian Network as BN and Logistic Regression as

LG.

Table 1: The performances of the models presented in the

paper (Shedthi B et al., 2023).

ACC Precision Recall F1 Score

SVM 0.9079 0.91 0.91 0.90

RF 0.9737 0.98 0.97 0.97

KNN 0.9079 0.89 0.91 0.89

NB 0.9605 0.98 0.96 0.97

LR 0.9474 0.97 0.95 0.94

The last work, (Babu and Boddu, 2024), uses a

pre-trained BERT (Bidirectional Encoder Represen-

tations from Transformers) model used to obtain con-

textualized embeddings that are then used in tasks

such as entity recognition and intent classiﬁcation.

Fine-tuning is done on a set to generate responses

of approximately 11,000 question-answer pairs col-

lected from various sources (MIMIC-III, PubMed,

BioASQ, etc.). The model achieves high accuracy

(98%) for processing medical queries, but despite the

obtained metrics, the authors highlight some prob-

lems such as the high time required for training and

the possible decrease in model efﬁciency if there is

not enough training data for some medical cases.

3 THEORETICAL BACKGROUND

Symptom-based disease diagnosis, a central topic

in our paper, is an application of supervised learn-

ing–namely multi-class classiﬁcation. The input vari-

ables (features) are represented by a list of symptoms

that will need to be processed into numerical form in

order to be processed by the AI-based classiﬁcation

algorithms. The target variable y is a discrete variable

and it can take values from the set of all diagnoses

(diseases), the number of which varies depending on

the data set used. Formally, the diagnosis problem

involves ﬁnding a model that, for an input data set

, x

, ..., x

) where x

represents a symptom and m is

the total number of symptoms, assigns a label y rep-

resenting the diagnosis (disease). y is part of a ﬁnite

set of diagnoses D with card(D) = n, so y can take any

value from the set of n possible diagnoses. An ex-

ample for D could be {appendicitis, pneumonia, ﬂu,

indigestion}, where there are n=4 possible diagnoses,

and for x=(cough, fever, chest pain) a possible list

of symptoms from the set {cough, abdominal pain,

fever, chest pain, constipation, ﬂatulence} etc.

3.1 Methods

In this work, we used ﬁve algorithms for disease

prediction based on symptoms: Logistic Regression,

Gaussian Naive Bayes, Random Forest, Decision Tree

and MultiLayer Perceptron. All these algorithms

were used in the multiclass classiﬁcation task.

Logistic Regression (LR) is a statistical model

used for binary classiﬁcation, predicting the probabil-

ity that a given input belongs to one of two classes.

Unlike linear regression, which predicts continuous

values, logistic regression uses the logistic function

(sigmoid) to output probabilities between 0 and 1 (Za-

bor et al., 2022). The model estimates the param-

eters (weights) by maximizing the likelihood of the

observed data. It is widely used in machine learning

and statistics for problems like spam detection, medi-

ENASE 2025 - 20th International Conference on Evaluation of Novel Approaches to Software Engineering

602

cal diagnosis, and credit scoring, where the goal is to

classify inputs into two distinct categories based on

various features (Zabor et al., 2022).

A variation of the Naive Bayes classiﬁer that

makes the assumption that the features have a nor-

mal (Gaussian) distribution is called Gaussian Naive

Bayes (GNB). Based on the idea that features are in-

dependent of one another, it computes the probability

of each class given the input features. This method

is known as Bayes’ Theorem. The likelihood of the

features is represented by a Gaussian distribution for

Gaussian Naive Bayes, which is determined by its

mean and variance. It offers simplicity and efﬁciency

and is especially useful for classiﬁcation tasks where

the input data is continuous, like medical diagnosis

spam detection and text classiﬁcation (Reddy et al.,

2022).

Random Forest (RF) is an ensemble learning

method used for both classiﬁcation and regression

tasks. It operates by constructing multiple decision

trees during training and combining their outputs to

improve accuracy and reduce overﬁtting. Each tree

is built using a random subset of the data, and each

node in a tree splits on the best feature from a random

subset of features. The ﬁnal prediction is typically

made by averaging the predictions of all trees (for

regression) or by majority voting (for classiﬁcation).

Random Forests are robust, handle large datasets well,

and are resistant to noise and overﬁtting (Probst et al.,

2019).

A Decision Tree (DT) is a supervised machine

learning algorithm used for classiﬁcation and regres-

sion tasks. It models decisions as a tree-like structure,

where each internal node represents a test on a fea-

ture, each branch represents the outcome of the test,

and each leaf node represents a ﬁnal decision or pre-

diction. The tree is built by recursively splitting the

data into subsets based on the feature that provides the

highest information gain or the lowest impurity. Deci-

sion Trees are intuitive, easy to visualize, and handle

both numerical and categorical data, though they can

be prone to overﬁtting without pruning (Costa and Pe-

dreira, 2023).

A Multilayer Perceptron (MLP) is a type of arti-

ﬁcial neural network composed of multiple layers of

nodes: an input layer, one or more hidden layers, and

an output layer. Each node (neuron) in the network,

except for the input layer, applies a nonlinear activa-

tion function to its inputs and passes the result to the

next layer. MLPs are trained using backpropagation

to minimize a loss function, adjusting the weights be-

tween neurons. They are capable of learning complex

patterns and are commonly used for tasks like classi-

ﬁcation, regression, and pattern recognition, but can

be computationally expensive (Almeida, 2020).

All ﬁve models have been tested, to see which one

best ﬁts our problem.

4 DATASET

The collection of data sets has been a major impedi-

ment precisely because of their lack, as there are few

publicly available data sets. Thus, when establish-

ing the classiﬁcation algorithms, we considered sev-

eral data sets collected from various platforms such

as Kaggle, HuggingFace or GitHub. The examples

described in this section include the analysis of the

collected data sets, and then the necessary process-

ing on them (elimination of duplicates, encoding of

symptoms in numerical form).

The ﬁrst dataset used is (KaggleDataset) (Patil,

2024) and consists of a total of 132 symptoms and 42

possible diseases in a csv ﬁle. In total, there are 4920

rows, each simulating an example patient represented

by a list of symptoms and the associated disease. At

ﬁrst glance, the distribution of patients per disease is

balanced – each disease has exactly 120 patient ex-

amples associated with it, but upon further analysis,

following the elimination of duplicate rows, the num-

ber of examples decreases from 4920 to 348, losing

at the same time and balancing the number of patients

per disease.

The second dataset used (LargeDatasetHF) (Anh,

2024) contains information on 392 diseases, in total

there are up to 892 distinct symptoms in the dataset.

Here, however, for each disease there is only one ex-

ample with associated symptoms (there is 1 sample

per class), and to test the accuracy of the classiﬁca-

tion models we formed a test case with 83 diseases.

The next dataset analyzed is one of the datasets

synthetically generated by the authors of the paper

(Yuan and Yu, 2024) (MedlinePlus) available at (Yuan

and Yu, 2021). The set contains 893 diseases and

1556 symptoms in total and is similar to that of (Anh,

2024) in that there is only one example per disease.

For the ﬁnal project, we manually selected 38 dis-

eases from the data obtained by the authors of the

paper (Yuan and Yu, 2024) from the other data ﬁles

available in the public Github repository (Symcat),

they being similar to those used in the paper (Polig-

nano et al., 2020) which makes possible similarity

of performance, where a similar data set adapted to

the Italian language is used. And here, in the case

of maintaining the large dimensions of the number

of diseases and symptoms, the accuracy of the mod-

els drops to approximately 60%, a phenomenon also

noted in the work described. For this reason we came

Medical Chatbot for Disease Prediction Using Machine Learning and Symptom Analysis

603

to the decision to reduce the data set (Symcat38), the

results improved.

An advantage of this generated data set is the oc-

currence of prevalence for each symptom associated

with the diseases, which makes it easier to generate

patients. The generation of a ”patient” consists in se-

lecting some symptoms from a list associated with

each disease in order to obtain several ”examples”

of various symptoms for a certain diagnosis. When

generating the patients, we chose to eliminate dupli-

cates in order not to have data from the training set

in the test set. However, this choice has the disadvan-

tage of some classes (diseases) that are represented

by fewer examples in training, these diseases being

predicted with a low probability in testing. Applying

tutor methods to a dataset that contains 100% real pa-

tients, could reduce the accuracy of the system a little,

but certainly the adaptability of the system to real in-

puts would increase.

5 EXPERIMENTAL RESULTS

5.1 Encoding Symptoms

5.1.1 One-Hot Encoding

For all datasets used, a technique described in

(Hapke et al., 2019) called One-Hot, adapted for

symptom encoding, was used: if there are a to-

tal of M symptoms in a dictionary D, a list l =

(symptom 1, symptom 2, ..., symptom N) of symp-

toms will be represented as a vector of size M which

will have values of 1 only for indices in dictionary

D of symptoms in list l, and 0 otherwise. This tech-

nique is useful when working with categorical vari-

ables in machine learning models, since many algo-

rithms only work with numeric data, and categorical

variables must be converted to numbers to be pro-

cessed correctly.

5.1.2 Vector Embeddings

Since if the number of symptoms in an example is

low (3 or 4) this representation can lead to a sparse

array, we also tried transforming a list of symptoms

into a vector of real numbers of size 200 using word

embeddings (the vector representations) from the pa-

per (Zhang et al., 2019). Since a symptom can consist

of several words, for the representation we chose to

do the arithmetic mean of the vectors for each word.

For example, for ”sore throat” there are embeddings

for the words ”sore” and ”throat”, denoting by em(s)

the encoding vector obtained for the symptom (word)

s, then em(”sore throat”) we calculated it according to

the formula

em(”sore”)+ em(”throat”)

(1)

where dividing by 2 the resulting vector means divid-

ing by 2 each component of the vector. However, this

method gives very poor results for some algorithms

(see Table 2). LargeDatasetHF represents the origi-

nal dataset (where each disease has only one patient),

and LargeDatasetHFAugumented represents the same

dataset only that we generated patients. The compar-

ison between the two data sets highlights that large

data sets cannot achieve high accuracy results com-

pared to using other small data sets. When we talk

about the performance of the models, we are not only

referring to accuracy, considering the fact that we are

solving a classiﬁcation problem, for the performance

evaluation, besides accuracy, we decided to use met-

rics such as precision, recall and F1-score. In Table 2

Table 2: Prediction using Word Embeddings.

Kaggle LargeDatasetHF Augumented

LR 0.98 0.78 0.65

RF 0.98 0.79 0.61

DT 0.63 0.10 0.21

MLPC 1.0 0.85 0.65

GNB 0.95 0.01 0.51

5.2 Results and Discussion

For the classiﬁcation of symptoms in diseases, we

used 5 algorithms from the scikit-learn library (Pe-

dregosa et al., 2011), namely Logistic Regression

(LogisticRegression), Decision Tree Forest (Ran-

domForestClassiﬁer), Naive Bayes (GaussianNaive-

BayesClassiﬁer), decision trees (DecisionTreeClas-

siﬁer) and Multi-layer Perceptron (MLPCLassiﬁer).

When testing the algorithms, the proportion of train-

ing and testing sets was considered to be 80% training

and 20% testing, regardless of the data set used. The

obtained results are available in the Table 3, which

contains the comparative analysis of the metrics ob-

tained by us in relation to the analyzed works, and

Figure 1 contains the results obtained for the ﬁnal data

set (Symcat38).

For the ﬁrst data set addressed (Kaggle (Patil,

2024)), the accuracy of the algorithms used is 1.0,

which suggests overﬁtting (overestimation). We

compared the result thus obtained with the work

(Shedthi B et al., 2023), in which the authors observe

the same problem with overﬁtting, adapting the data

set through a procedure described brieﬂy, in which the

symptoms ”that do not weigh majorly in the predic-

tion of diseases” are removed from dataset, also tak-

ENASE 2025 - 20th International Conference on Evaluation of Novel Approaches to Software Engineering

604

Table 3: Results obtained in the experiments.

Paper

DataSet

M1 M2 M3 M4 M5

Paper1

DataSet1

0.94 0.97 - - 0.96

Our Results

DataSet1

1.0 1.0 1.0 1.0 1.0

Paper2

DataSet2

- - - 0.55 -

Our Results

DataSet2

0.64 0.61 0.37 0.64 0.82

Paper3

DataSet3

0.6 0.58 - 0.60 0.60

Our Results

DataSet3

0.63 0.59 0.45 0.64 0.71

Our Results

DataSet4

0.72 0.71 0.53 0.72 0.80

ing into account the correlation between symptoms,

determined by a correlation matrix.

Algorithm testing on large datasets was done with

LargeDatasetHFAugumented (Anh, 2024), Medline-

Plus (Yuan and Yu, 2024) and Symcat (Yuan and Yu,

2024) (available at (Yuan and Yu, 2021)). The ac-

curacy of the algorithms is low. In the paper (Yuan

and Yu, 2024), the authors use their own disease pre-

diction algorithm based on Reinforcement Learning,

which aims to collect symptoms starting from an ini-

tial data set, ﬁnally performing the disease prediction

using MLPClassiﬁer. The accuracy is 55% and they

are consistent with the similar results used in both the

paper (Polignano et al., 2020) and the experiment pro-

posed in this paper. Although other algorithms are

used in (Polignano et al., 2020), the accuracy does

not exceed 60%. The authors of the paper justify this

low level of accuracy on the grounds that in a data

set with many diseases and symptoms, the presented

models are not able to distinguish between diseases

that have a subset of common symptoms, but also the

fact that some diseases in the data set have vaguely de-

ﬁned symptoms. A summative analysis of the exper-

iments is presented in Table 3 (with the mention that

”-” has been entered in the table for the untested algo-

rithms). In Table 3 the following notations were used:

DataSet1 is Kaggle, DataSet2 is Medline, DataSet3

is Symcat, DataSet4 is LargeDatasetHFAugumented

and Paper1 is (Shedthi B et al., 2023), Paper2 is

(Yuan and Yu, 2024) and Paper3 is (Polignano et al.,

2020). The notations from M1 to M5 represent the

methods applied in the order of their presentation in

the ﬁrst paragraph of this chapter.

To avoid the problem of low accuracy in the case

of datasets with a large number of symptoms and dis-

eases, we manually selected a dataset consisting of

38 diseases and 167 symptoms (Symcat38), gathered

from the data available in the Git repository of the

paper (Yuan and Yu, 2024), several data ﬁles gener-

ated in Symcat, similar to the data used in (Polignano

et al., 2020). The selection of samples (patients) per

disease was carried out using a procedure similar to

that described in (Polignano et al., 2020), borrowed

from (Yuan and Yu, 2024). In this case, for a disease,

the prevalence of the disease is also known, i.e. for

each symptom in the list, a number between 0 and 1

is known, which represents the prevalence of the dis-

ease. To generate a synthetic patient, a list of values

following the uniform distribution is generated and a

boolean vector with the value 1 is formed if the gen-

erated value is lower than the prevalence value. Each

symptom present in the symptom list has an associated

real numerical value in the prevalence list. It is worth

mentioning that the prevalence value for a symptom is

different if it appears associated with several diseases.

To generate a synthetic patient for a disease, the pro-

cedure proceeds as follows: we assume that the list

of symptoms S has length l; This generates l real val-

ues between 0 and 1, by drawing l values from [0,

1) following the uniform distribution (ie each number

in the range has the same probability of being cho-

sen). Then, the prevalence value is subtracted from

each obtained value, and if the subtraction results in

a number less than 0 (the generated value for a symp-

tom is less than the prevalence value), then the symp-

tom will be added to the list of symptoms for the gen-

erated patient. The process is repeated until such a

list of length at least 2 is generated, since it would not

make sense to have diseases without associated symp-

toms in the data set, although it is possible in reality

in the case of asymptomatic patients, the situation is

difﬁcult to model in the case classiﬁcation algorithms

that require numerical data for training.

To generate patients for each disease, we gener-

ated 30 such patients, but after removing duplicates,

the number of patients per disease is varied. The re-

sulting dataset has 921 patients generated in total.

The algorithms used are the same, and the results

obtained are:

• Decision Trees - 0.61

• Random Forest - 0.720

• MultiLayer Perceptron - 0.78

• Gaussian Naive Bayes - 0.800

• Logistic regression - 0.805

Since the Logistic Regression algorithm has the

highest accuracy value, it is and the model we inte-

grated into the ﬁnal application.

An important aspect to mention in the case of this

data set is the presence of two additional characteris-

Medical Chatbot for Disease Prediction Using Machine Learning and Symptom Analysis

605

Figure 1: The accuracy obtained for each algorithm.

tics compared to the others: age and sex. Experimen-

tally, we tried the problem of disease prediction on the

same data set, taking into account the generation of

age and sex. To encode the categories, in addition to

the one-hot vector obtained for the symptoms, the two

numerical values corresponding to the age and gender

categories are added. The performance of the algo-

rithms is comparable, in the case of GaussianNaive-

Bayes the accuracy even reaches 83%. However,

the obtained models are not capable of capturing all

the relationships between symptoms and sex/age, this

fact being highlighted in an example where the dis-

ease Uterine ﬁbroids is predicted with a high thresh-

old (over 65%) in the case of all algorithms, even if

the selected gender is in category 0 (male) and given

that there is no training data in the dataset except for

symptoms and female gender. The problem can be ex-

plained by the representation of the data (symptoms,

age and sex), which is sparse in the case of symptoms,

and the other two values added to the obtained list do

not have such a signiﬁcant weight in the model train-

ing phase.

For the construction of the initial model of the

chatbot, we proposed a model that starts from a set

of sentences/phrases, classiﬁes them (using multi-

class classiﬁcation) into categories of intentions (in-

tents), identiﬁes entities with names (symptoms and

diseases), after which generates answers using the al-

gorithm for classifying symptoms into diseases and

additional information from a manually compiled

knowledge base.

Intent classiﬁcation is a text classiﬁcation problem

where the user’s phrases need to be classiﬁed into cer-

tain categories in order for the chatbot to provide an-

swers based on the task/question it needs to answer.

We manually generated a dataset where each intent

(tag) contains a list of sentences in json format. The

task of text classiﬁcation starts with text preprocess-

ing, at which stage we used the NLTK (Natural Lan-

guage Toolkit) library (Bird et al., 2009) to remove

punctuation from texts, tokenize sentences and bring

words to a basic (dictionary-derived – “ stem”) us-

ing SnowBallStemmer. For the actual classiﬁcation

model we used the Tensorﬂow platform (Abadi et al.,

2015). In the ﬁrst phase, we used the Tokenizer class

to build a vocabulary and transform the processed

sentences into numerical form, and for the neural net-

work we chose a sequential architecture consisting of

layers of embedding, bidirectional lstm, dense (with

relu-enabled function), dropout and dense on the ﬁnal

start (with the softmax activation function to obtain

the class/intent probabilities). Model training used

the crossentropy categorical loss function and Adam

as optimizer and a number of 100 epochs. The model

achieves 1.0 accuracy on the training and 0.77 on the

test data set.

6 CONCLUSIONS

The medical ﬁeld is one of the most important ﬁelds

in everyday life, in this study we tried to demon-

strate that this vital ﬁeld can be improved using ma-

chine learning together with other artiﬁcial intelli-

gence techniques. In the application developed in this

work, we demonstrated the construction of a medi-

cal chatbot capable of understanding the user’s intent,

extracting entities from texts and providing answers

based on information available in the form of csv ﬁles.

The chatbot made in this way represents a promising

application in the medical ﬁeld, but which can be sig-

niﬁcantly improved. First, the algorithms used to pre-

dict diseases based on symptoms could be replaced

by more sophisticated algorithms that ensure high ac-

curacy even when applied to larger datasets. Second,

the performance of such a model can be improved by

using qualitative data sets that contain more character-

istics besides symptoms and age/sex (gender), such as

symptom duration and risk factors. Furthermore, gen-

erative artiﬁcial intelligence models specialized in the

medical ﬁeld could be used to generate the answers so

that the answers obtained are both scientiﬁcally cor-

rect and diverse to engage the user in conversation.

This paper underlines the signiﬁcance of incorpo-

rating technology breakthroughs into medical prac-

tices by demonstrating the potential of machine learn-

ing approaches. It is essential to acknowledge this

study’s limitations. The quality and representative-

ness of the given datasets determine how accurate and

generalizable the classiﬁcation models are.

ENASE 2025 - 20th International Conference on Evaluation of Novel Approaches to Software Engineering

606

Within this paper it was obtained satisfactory re-

sults, making a comparison with related work it can

be seen that the results obtained are good. The ﬁnd-

ings of this research contribute to the growing body

of knowledge about machine learning applications in

the medical ﬁeld and provide a base for future studies

aimed at improving medical practices and improving

communication with patients.

Future work would consist of creating a bigger

data set and testing and validating the models cre-

ated in this paper on this new data set, respectively,

trying to check what performance could be obtained

with other ML approaches.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,

Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,

M., Ghemawat, S., Goodfellow, I., Harp, A., Irving,

G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kud-

lur, M., Levenberg, J., Man

e, D., Monga, R., Moore,

S., Murray, D., Olah, C., Schuster, M., Shlens, J.,

Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Van-

houcke, V., Vasudevan, V., Vi

egas, F., Vinyals, O.,

Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and

Zheng, X. (2015). TensorFlow: Large-scale machine

learning on heterogeneous systems. Software avail-

able from tensorﬂow.org.

Almeida, L. B. (2020). Multilayer perceptrons. In Hand-

book of Neural Computation, pages C1–2. CRC Press.

Altamimi, I., Altamimi, A., Alhumimidi, A. S., Altamimi,

A., and Temsah, M.-H. (2023). Artiﬁcial intelligence

(ai) chatbots in medicine: a supplement, not a substi-

tute. Cureus, 15(6).

Anh, B. H. Q. (2024). Disease symptoms, @ONLINE.

Babu, A. and Boddu, S. B. (2024). Bert-based medical chat-

bot: Enhancing healthcare communication through

natural language understanding. Exploratory Re-

search in Clinical and Social Pharmacy, 13:100419.

Bali, M., Mohanty, S., Chatterjee, S., Sarma, M., and Pura-

vankara, R. (2019). Diabot: a predictive medical chat-

bot using ensemble learning. International Journal of

Recent Technology and Engineering, 8(2):6334–6340.

Bird, S., Klein, E., and Loper, E. (2009). Natural language

processing with Python: analyzing text with the natu-

ral language toolkit. ” O’Reilly Media, Inc.”.

Chakraborty, S., Paul, H., Ghatak, S., Pandey, S. K., Ku-

mar, A., Singh, K. U., and Shah, M. A. (2022). An

ai-based medical chatbot model for infectious disease

prediction. Ieee Access, 10:128469–128483.

Costa, V. G. and Pedreira, C. E. (2023). Recent advances

in decision trees: An updated survey. Artiﬁcial Intel-

ligence Review, 56(5):4765–4800.

Hapke, H., Howard, C., and Lane, H. (2019). Natural

Language Processing in Action: Understanding, an-

alyzing, and generating text with Python. Simon and

Schuster.

Karuna, G., Reddy, G. G., Sushmitha, J., Gayathri, B.,

Sharma, S. D., and Khatua, D. (2023). Helpi–an auto-

mated healthcare chatbot. In E3S Web of Conferences,

volume 430, page 01040. EDP Sciences.

Patil, P. (2024). Symptoms and diseases dataset @ON-

LINE.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,

Thirion, B., Grisel, O., Blondel, M., Prettenhofer,

P., Weiss, R., Dubourg, V., et al. (2011). Scikit-

learn: Machine learning in python. Journal of ma-

chine learning research, 12(Oct):2825–2830.

Polignano, M., Narducci, F., Iovine, A., Musto, C.,

De Gemmis, M., and Semeraro, G. (2020). Healthas-

sistantbot: a personal health assistant for the italian

language. IEEE Access, 8:107479–107497.

Probst, P., Wright, M. N., and Boulesteix, A.-L. (2019). Hy-

perparameters and tuning strategies for random for-

est. Wiley Interdisciplinary Reviews: data mining and

knowledge discovery, 9(3):e1301.

Reddy, E. M. K., Gurrala, A., Hasitha, V. B., and Kumar, K.

V. R. (2022). Introduction to naive bayes and a review

on its subtypes with applications. Bayesian reasoning

and gaussian processes for machine learning applica-

tions, pages 1–14.

Shedthi B, S., Shetty, V., Chadaga, R., Bhat, R., Bangera, P.,

and Kini K, P. (2023). Implementation of chatbot that

predicts an illness dynamically using machine learn-

ing techniques. International Journal of Engineering,

(Articles in Press).

Siddique, S. and Chow, J. C. (2021). Machine learning in

healthcare communication. Encyclopedia, 1(1):220–

239.

Tode, V., Gadge, H., Madane, S., Kachare, P., and Deokar,

A. (2021). A chatbot for medical purpose using deep

learning. International Journal of Engineering Re-

search & Technology.

Vasileiou, M. V. and Maglogiannis, I. G. (2022). The health

chatbots in telemedicine: Intelligent dialog system for

remote support. Journal of Healthcare Engineering,

2022(1):4876512.

Yuan, H. and Yu, S. (2021). Efﬁcient symptom inquiring

and diagnosis via adaptive alignment of reinforcement

learning and classiﬁcation.

Yuan, H. and Yu, S. (2024). Efﬁcient symptom inquiring

and diagnosis via adaptive alignment of reinforcement

learning and classiﬁcation. Artiﬁcial Intelligence in

Medicine, 148:102748.

Zabor, E. C., Reddy, C. A., Tendulkar, R. D., and Patil, S.

(2022). Logistic regression in clinical studies. In-

ternational Journal of Radiation Oncology* Biology*

Physics, 112(2):271–277.

Zhang, Y., Chen, Q., Yang, Z., Lin, H., and Lu, Z. (2019).

Biowordvec, improving biomedical word embeddings

with subword information and mesh. Scientiﬁc data,

6(1):52.

Medical Chatbot for Disease Prediction Using Machine Learning and Symptom Analysis

607