of Cardiovascular Risk” (Yancy et al., 2013) provides
detailed recommendations for estimating cardiovascular
disease risk in clinical practice, considering
several factors including age, gender, race, cholesterol
and blood pressure levels, diabetes and smoking status,
and the use of blood pressure-lowering medications.
In Europe, the 10-year risk of fatal CVD
is estimated from charts established by
the European Society of Cardiology for high-risk and
low-risk populations across Europe, which may be
further adapted into national or regional charts
based on published mortality data.
In the literature, CVD risk prediction is addressed
either with dedicated risk tools or with the aid of
machine learning. In (Gale et al., 2014), the Framingham
cardiovascular disease risk score and incident
frailty were studied on English cohort data of ageing
participants. Moreover, the Systematic Coronary
Risk Evaluation (SCORE) has been suggested for
predicting the 10-year risk of cardiovascular death in
Europe, and QRISK for predicting the composite outcome
of coronary heart disease and ischaemic stroke.
Others employ machine learning techniques, likewise
aiming to predict the potential risk of CVDs (Mohan et al.,
2019), (Yang et al., 2020).
Machine learning (ML) is a branch of artificial
intelligence (AI) and a powerful tool in the medical field,
as it can help predict various diseases. In (Dinesh et al.,
2018), various data-driven approaches are presented to
predict diabetes and cardiovascular disease with ML models.
Here, we focus solely on the application of ML to
cardiovascular medicine (Haq et al., 2018). Our purpose is
to identify predictive data patterns and high-risk CVD
groups among the elderly. Moreover, we aim to create
personalized risk models that will be part of the
predictive AI tools integrated into the SmartWork
(Kocsis et al., 2019) and GATEKEEPER systems. The
presented method for predicting the risk of CVD
occurrence was developed and validated independently
on a publicly available dataset and, in parallel, on
pilot data as part of these projects. The incorporation
of the ML models into the Long-term Risk Prediction
tools of the SmartWork system supports the design of a
smart, age-friendly, healthy living and working environment
for office workers. The GATEKEEPER system aims to keep
older people living at home as healthy as possible by
preventing the occurrence of CVD, type 2 diabetes
mellitus (T2DM) (Fazakis et al., 2021), high cholesterol,
hypertension (Dritsas et al., 2021), and chronic
obstructive pulmonary disease (COPD) (Hussain et al.,
2021), all of which are chronic conditions related to
Metabolic Syndrome (MetS).
Given that MetS combines risk factors that promote
the development of cardiovascular disease
(CVD) and type 2 diabetes (T2DM) (Hoyas and Leon-Sanz,
2019), as a first approach, our paper presents
a methodology for correctly identifying those
at risk of being diagnosed with CVD in the long term. For
this purpose, the classification performance of various
ML models is evaluated on the test instances of a
CVD dataset. The ML models that achieve the highest
recall (i.e., sensitivity) and Area Under
Curve (AUC) are those that predict the CVD class
most reliably. The main contribution of this work
is a comparative evaluation of different ML models
on a balanced dataset and the proposal of a Logistic
Regression model for long-term CVD risk prediction.
The main steps of the employed process are
presented in the following sections.
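The two evaluation measures named above can be illustrated with a minimal sketch. It assumes the standard definitions, i.e., recall as TP / (TP + FN) for the CVD class and AUC as the probability that a randomly chosen CVD instance receives a higher score than a randomly chosen Non-CVD instance. The function names, the 0.5 decision threshold and the toy labels and scores are illustrative assumptions and are not taken from the paper's experiments.

```python
from typing import Sequence


def recall(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Recall (sensitivity) = TP / (TP + FN) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via pairwise ranking: fraction of (positive, negative) pairs
    in which the positive is scored higher; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = "CVD", 0 = "Non-CVD" (made-up values, not the paper's data).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
# Assumed decision threshold of 0.5 on the predicted score.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f"recall = {recall(y_true, y_pred):.3f}")
print(f"AUC    = {auc(y_true, scores):.3f}")
```

Recall depends on the chosen decision threshold, whereas AUC summarizes the ranking quality over all thresholds, which is one reason the two measures are reported together.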
The rest of this paper is organized as follows.
Section 2 presents the main parts of the methods for the
long-term risk prediction of CVD. Section 3 analyzes
the dataset features, and Section 4 describes
the pre-processing steps for the design of the
training and testing datasets and the feature ranking.
Section 5 presents the experimental setup and the
classification performance of the ML techniques. Finally,
Section 6 concludes the paper and outlines future
directions.
2 MACHINE LEARNING METHODS
Data science, and especially machine learning, has
been widely used in the field of medicine for the risk
analysis of several chronic conditions. The most common
application of these models is to determine
the most suitable factors for long-term risk prediction,
in order to avoid serious health complications (due to
certain symptoms) and to support health care
management.
In this study, the predictive performance of four
different machine learning models is presented. In
particular, Naive Bayes, SVM, Logistic Regression
and Random Forest are utilized to estimate the
long-term risk of an older person being diagnosed
with cardiovascular disease.
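A comparison of the four model families named above can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn, a synthetic balanced dataset generated with `make_classification` as a stand-in for a CVD dataset, a 70/30 stratified split, and default or arbitrarily chosen hyperparameters. None of these choices are taken from the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, balanced stand-in for a CVD dataset (assumption, not the paper's data).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           weights=[0.5, 0.5], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Hyperparameters below are illustrative defaults, not tuned values.
models = {
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)               # hard class labels for recall
    y_score = model.predict_proba(X_test)[:, 1]  # positive-class scores for AUC
    results[name] = {
        "recall": recall_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_score),
    }
    print(f"{name}: recall={results[name]['recall']:.3f}, "
          f"AUC={results[name]['auc']:.3f}")
```

The SVM and Logistic Regression models are wrapped in a scaling pipeline here because both are sensitive to feature scale; whether the original study scaled its features is not stated in this section.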
The dataset is separated into a training set of size
$M$ and a test set of size $N$. A categorical variable $c$
captures the class label of an instance $i$ in the dataset.
In the context of this work, the problem under investigation
has two possible classes, i.e., $c$ = “CVD” (or “Yes”)
and $c$ = “Non-CVD” (or “No”). The feature vector of
an instance $i$ is denoted by
$\mathbf{f}_i = [f_{i1}, f_{i2}, f_{i3}, \ldots, f_{in}]^T$
(with $M \gg n$).
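The notation above can be made concrete with a small sketch. It assumes that each instance is stored as a pair of a feature vector and a class label, and that the split is a simple shuffled hold-out. The helper `split_dataset`, the 80/20 ratio, and the toy data with three features are hypothetical and serve only to show what M, N and n refer to.

```python
import random
from typing import List, Tuple

# One instance: (feature vector f_i of length n, class label c).
Instance = Tuple[List[float], str]


def split_dataset(data: List[Instance], test_fraction: float,
                  seed: int = 0) -> Tuple[List[Instance], List[Instance]]:
    """Shuffle a copy of the data and split it into (training set, test set)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: 100 instances with n = 3 made-up features each.
rng = random.Random(42)
data: List[Instance] = [
    ([rng.random() for _ in range(3)], "CVD" if i % 2 == 0 else "Non-CVD")
    for i in range(100)
]

train, test = split_dataset(data, test_fraction=0.2)
M, N = len(train), len(test)  # training and test set sizes
n = len(train[0][0])          # number of features per instance
print(f"M = {M}, N = {N}, n = {n}")
```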
Our aim is to achieve high recall (sensitivity)
and Area Under Curve (AUC) through supervised ma-
ICT4AWE 2022 - 8th International Conference on Information and Communication Technologies for Ageing Well and e-Health