creases the number of minority class members by re-
sampling the data set. We have selected this data level
approach to address imbalanced data because it lets us
benefit from the complete initial data set (no loss of
information) and because previous comparisons with other
techniques on our data set showed its advantage.
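The oversampling idea described above can be sketched in a few lines. SMOTE, the technique named in this article's title, creates synthetic minority examples by linear interpolation between a minority example and one of its nearest minority neighbours. The code below is a minimal illustration using only the Python standard library; the function name `smote`, the toy coordinates, and the values of `k` and `seed` are assumptions made for the example and are not taken from the study.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Linear interpolation: base + gap * (neighbour - base), gap in [0, 1)
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy imbalanced data (hypothetical): 3 minority and 9 majority points.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.2)]
majority = [(float(i), float(j)) for i in range(5, 8) for j in range(5, 8)]

# Generate enough synthetic minority points to match the majority class size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
```

No original example is discarded: the minority class grows until it matches the majority class, which is the "no loss of information" property mentioned above. Every synthetic point lies on a segment joining two existing minority points.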
We use three well-known classifiers (random forest,
artificial neural network and logistic regression)
along with a Bayesian network. The appeal of this
probabilistic graphical model is that it is explainable,
which is important when developing a fall prevention
application.
In Section 2, we present an overview of previous
work on the use of imbalanced data in the medical
field. Sections 3 to 6 present the data set, the
pre-processing steps, and the description of the
selected and target variables, respectively. Section 7
discusses the methodology, and Section 8 presents the
results and discussion. Finally, we conclude the article.
2 RELATED WORKS
Data mining combined with machine learning is a
powerful tool for resolving a wide range of issues.
Healthcare data is difficult to manually handle due
to the large number of data sources. Artificial in-
telligence advancements have introduced precise and
accurate systems for medical applications that deal
with sensitive medical data (Ahmed et al., 2020). We
present an overview of some of the work on the use of
imbalanced data in the medical field.
In (Shuja et al., 2020), the authors use data mining
techniques to build a model for diabetes prediction.
First, they preprocess the data using the Synthetic
Minority Oversampling Technique, and then feed the
preprocessed data to five classifiers (Bagging, Support
Vector Machine, Multi-Layer Perceptron, Simple Logistic,
and Decision Tree) in order to select the best classifier
for predicting diabetes from a balanced data set.
In another study (Ishaq et al., 2021),
the authors classify survivors of heart failure
using a data set of 299 hospitalised patients. The
goal is to identify key characteristics and data mining
techniques that can improve the accuracy of survival
prediction for cardiovascular patients. The study
uses nine classification models to predict patient
survival: Decision Tree, Adaptive Boosting Classifier,
Logistic Regression, Stochastic Gradient classifier,
Random Forest, Gradient Boosting classifier, Extra
Tree Classifier (ETC), Gaussian Naive Bayes classifier,
and Support Vector Machine. The Synthetic Minority
Oversampling Technique (SMOTE) is used to address
the class imbalance. To deal with the
problem of classifying imbalanced data, the authors of
(Jeatrakul et al., 2010) proposed a method that
combines SMOTE with a Complementary Neural Network.
Three classification algorithms, Artificial Neural
Network, k-Nearest Neighbor and Support Vector
Machine, were used for comparison. Benchmark data
sets with various ratios between the minority and
majority classes were obtained from the machine
learning repository at the University of California,
Irvine. The findings show that the proposed combination
of techniques is effective and improves classification
performance. The authors of (Guan et al.,
2021) proposed a hybrid resampling method, combining
SMOTE with the weighted edited nearest neighbour rule
(WENN), to address the problems of small sample size
and class imbalance. First, SMOTE uses linear
interpolation to create synthetic minority class
examples. Then WENN uses a weighted distance
function and the k-nearest neighbour rule to detect
and delete unsafe majority and minority class
examples. The weighted distance function scales up
a commonly used distance by taking into account local
imbalance and spatial sparsity.
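The cleaning step of such a hybrid method can be illustrated with the plain edited nearest neighbour rule, which deletes every example whose label disagrees with the majority vote of its k nearest neighbours. The sketch below is a deliberate simplification and is not the WENN method of (Guan et al., 2021): it uses an unweighted Euclidean distance, whereas WENN scales the distance according to local imbalance and spatial sparsity. The function name, the toy points, and the choice of `k` are assumptions made for the example.

```python
import math
from collections import Counter


def edited_nearest_neighbours(points, labels, k=3):
    """Keep only examples whose label matches the vote of their k nearest neighbours."""
    keep_points, keep_labels = [], []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Indices of the k closest other examples (plain Euclidean distance)
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        votes = Counter(labels[j] for j in neighbours)
        majority_label = votes.most_common(1)[0][0]
        # An example is treated as "unsafe" when its neighbours outvote its label
        if majority_label == label:
            keep_points.append(p)
            keep_labels.append(label)
    return keep_points, keep_labels


# Toy data (hypothetical): two clusters, each containing one mislabelled point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (5.5, 5.5),
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

clean_points, clean_labels = edited_nearest_neighbours(points, labels, k=3)
```

On this toy data the two points sitting inside the opposite cluster are removed and the eight cluster points are kept. Examples from either class can be deleted, which matches the description of removing unsafe majority and minority examples.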
3 DATA SOURCE
The study included the 1810 patients who attended the
Lille University Hospital Falls Clinic between January
2005 and December 2018. Patient ages range from 51 to
100 years, with a mean of 81 years; 28% of the patients
are male and 72% are female. Each patient is admitted to
the clinic for a full day and meets several medical
professionals, each of whom explores a set of factors
such as history of falls, nutrition, physical activity,
and medical tests such as balance tests.
The data collected about the patient are recorded at
each step. A team of specialists in falls among the
elderly then reviews the patient's case file and
discusses the most appropriate recommendations based on
the person's observed risk factors. At the end of the
day, a small number of appropriate recommendations are
selected and explained to the patient. The patient is
invited to return to the hospital 6 months later for a
short consultation, during which the recommendations and
the number of falls during the preceding 6 months are
assessed. This information is added to the data file
that was provided to us for our analysis.
Evaluation of Risk Factors for Fall in Elderly People from Imbalanced Data using the Oversampling Technique SMOTE