find any of the categorical variables to be significantly
associated with the death event. Similarly, the
Information Value (IV) analysis rated the categorical
variables (anaemia, diabetes, high blood pressure, sex,
and smoking) as weak predictors, with IV values below
0.02 for all except anaemia and high blood pressure.
Based on these analyses, it can be
concluded that the categorical variables are relatively
weak predictors for the occurrence of a death event
compared to the selected continuous variables.
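As a rough illustration of how an Information Value can be obtained for a categorical variable, the sketch below sums (share of non-events minus share of events) times the weight of evidence over the categories. The function name `information_value`, the 0.5 smoothing constant, and the toy vectors are assumptions made for illustration; they are not the study's code or data.

```python
import math
from collections import Counter


def information_value(feature, target, smoothing=0.5):
    """Information Value of a categorical feature against a binary target (1 = event).

    `smoothing` is added to every cell count so that empty cells do not
    produce log(0); the value 0.5 is an arbitrary illustrative choice.
    """
    events = Counter(f for f, t in zip(feature, target) if t == 1)
    non_events = Counter(f for f, t in zip(feature, target) if t == 0)
    categories = set(events) | set(non_events)

    total_events = sum(events.values()) + smoothing * len(categories)
    total_non_events = sum(non_events.values()) + smoothing * len(categories)

    iv = 0.0
    for c in categories:
        p_event = (events[c] + smoothing) / total_events
        p_non_event = (non_events[c] + smoothing) / total_non_events
        woe = math.log(p_non_event / p_event)  # weight of evidence for category c
        iv += (p_non_event - p_event) * woe
    return iv


# Hypothetical toy data: 1 = death event, 0 = survival.
death = [0, 1, 0, 1, 0, 1, 0, 1]
unrelated_flag = [0, 0, 1, 1, 0, 0, 1, 1]   # same distribution in both outcome groups
associated_flag = [0, 1, 0, 1, 0, 1, 0, 0]  # mostly 1 when the event occurs

iv_unrelated = information_value(unrelated_flag, death)
iv_associated = information_value(associated_flag, death)
```

A variable whose categories are distributed identically among events and non-events contributes nothing to the sum, which is why low IV values are read as weak predictive power.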
Based on the comprehensive analysis, the final set
of variables chosen for modeling includes the
following continuous variables: age, ejection fraction,
serum creatinine, serum sodium, and time. No
categorical variable was strong enough to be included
in the final model.
3 PREDICTION OF HEART FAILURE INCIDENCE
3.1 Methodology and Model Establishment
To predict heart failure-related death events, this
paper employed two machine learning models, Logistic
Regression and Random Forest, to assess the predictive
power of the selected clinical features.
The Logistic Regression model was trained with the
default optimization algorithm and then used to predict
the test set. Its key advantages are interpretability
and computational efficiency. This paper evaluated the
model's performance using accuracy, F1 score,
precision, recall, and AUC.
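The baseline workflow described above might look roughly like the following sketch. It assumes scikit-learn, a synthetic stand-in for the five selected features, an 80/20 stratified split, and an AUC computed from predicted probabilities; none of these details are confirmed by the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with five features (assumption, not the clinical records).
# The 70/30 class balance is an arbitrary illustrative choice.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=0)

# Hold out 20% of the rows as a test set (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Default settings, so the library's default solver is used.
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)               # hard class labels
y_score = log_reg.predict_proba(X_test)[:, 1]  # probability of the positive class

lr_metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_score),
}
```

Accuracy, F1, precision, and recall are computed from the hard labels, whereas AUC here uses the predicted probabilities, so the two groups of metrics can rank models differently.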
The Random Forest model, by contrast, is more
complex: it is an ensemble of multiple decision trees.
Using grid search with 10-fold cross-validation, this
paper identified the combination of hyperparameters
that gave the best predictive performance. The Random
Forest model not only captures potential nonlinear
patterns in the data but also provides insight into
feature importance, helping to show which variables
play a crucial role in predicting heart failure-related
death events.
The Logistic Regression model served as a
straightforward yet robust baseline for the predictions,
while the Random Forest model was added to capture
potential nonlinear relationships and interactions
among the features. In the grid search with 10-fold
cross-validation, the best-performing Random Forest
configuration had n_estimators=100, max_depth=10,
min_samples_split=5, and min_samples_leaf=1. These
hyperparameters control the number, depth, and
complexity of the decision trees within the Random
Forest and were tuned to this dataset.
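A minimal sketch of such a tuning step is given below, assuming scikit-learn's `GridSearchCV` and `RandomForestClassifier`. The candidate grid, the accuracy scoring criterion, the synthetic data, and the column labels are all assumptions for illustration; the text reports only the winning parameter values, not the grid that was searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical labels for the five synthetic columns (not the real records).
feature_names = ["age", "ejection_fraction", "serum_creatinine",
                 "serum_sodium", "time"]
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Assumed candidate grid; it contains the reported best values among others.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1],
}

# cv=10 gives stratified 10-fold cross-validation for a classifier.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X_train, y_train)

best_rf = search.best_estimator_  # refitted on the full training split
importances = dict(zip(feature_names, best_rf.feature_importances_))
```

After fitting, `search.best_params_` holds the selected combination, and `importances` maps each column to its impurity-based importance, which is one way to read off which variables the forest relies on most.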
3.2 Analysis of Results
Both Logistic Regression and Random Forest models
were used to predict heart failure-related deaths. As
shown in Table 5, the Logistic Regression model, a
linear algorithm, shows reasonable predictive
performance, particularly in terms of accuracy (78.3%)
and AUC (0.746).
The Random Forest model, an ensemble learning
method, performed slightly worse than the Logistic
Regression model on several metrics, in particular AUC
(0.703) and accuracy (73.3%). However, both models
achieved the same recall (0.52), meaning that they were
equally able to identify positive cases (deaths from
heart failure).
As shown in Figure 4 and Figure 5, these
performance indicators further support the
effectiveness of the feature set selected through
rigorous statistical testing, which consists of the
continuous variables age, ejection fraction, serum
sodium, serum creatinine, and time. None of the
categorical variables was strong enough to be included
in either model. This observation underscores the
importance of these parameters in predicting mortality
associated with heart failure. As a result, this analysis
helps healthcare professionals identify the key clinical
features that significantly affect patient outcomes.
Table 5: Metrics of Test Data.
Clinical Record Analysis of Heart Failure Identification of Key Features and Disease Prediction