3 METHOD
In this project, the author employed exploratory data analysis (EDA) to preprocess the data and then applied logistic regression, random forest, and XGBoost models for the prediction task.
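As a rough illustration of the preprocessing stage, the sketch below assumes the stroke data sits in a CSV file with numeric and categorical columns; the file name and the bmi column are placeholders, not the paper's actual dataset.

    import pandas as pd

    # Hypothetical file and column names, used only for illustration.
    df = pd.read_csv("stroke_data.csv")

    print(df.describe())                               # summary statistics (EDA)
    print(df.isna().sum())                             # count missing values per column
    df["bmi"] = df["bmi"].fillna(df["bmi"].median())   # impute a numeric column
    df = pd.get_dummies(df, drop_first=True)           # one-hot encode categorical columns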
3.1 Algorithm
(1) Logistic regression is a statistical approach used
to investigate data, considering the influence of one
or more independent factors on a particular outcome.
It finds its niche in tasks where the outcome is binary,
meaning it has only two possible categories, typically
referred to as 0 and 1.
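A minimal sketch of fitting such a model with scikit-learn, assuming a feature matrix X and a 0/1 target y have already been prepared (these names are placeholders carried over from the preprocessing sketch above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # X (features) and y (binary stroke label) are assumed from the preprocessing step.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    clf = LogisticRegression(max_iter=1000)    # raise max_iter to help convergence
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]    # estimated probability of the positive class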
(2) The random forest algorithm is a machine learning technique that builds upon the principles of decision trees. It also allows feature selection, because the importance of each feature can be computed from the fitted forest. The random forest algorithm first uses the bootstrap aggregation (bagging) method to generate training sets, and a decision tree is built for each training set. When sampling with the bootstrap, one sample is drawn at random, with replacement, from the original set of N samples; a training set is generated by repeating this draw N times. The probability that a given sample is selected at least once across the N draws is:
    P = 1 − (1 − 1/N)^N        (1)
When N goes to infinity:
    1 − (1 − 1/N)^N ≈ 1 − 1/e ≈ 0.632        (2)
This means that, in expectation, about 63.2% of the samples are drawn into the bootstrap training set for each modeling iteration, while the remaining 36.8% or so are not drawn and do not contribute to training that tree. These unused data points are commonly referred to as “out-of-bag” (OOB) data.
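The 63.2% figure can be checked numerically; the short snippet below evaluates expression (1) for a few sample sizes and compares it with 1 − 1/e:

    import math

    # Evaluate expression (1) for several sample sizes N and compare with 1 - 1/e.
    for N in (10, 100, 1000, 100000):
        p = 1 - (1 - 1 / N) ** N      # probability a given sample is drawn at least once
        print(N, round(p, 4))
    print("limit 1 - 1/e =", round(1 - 1 / math.e, 4))   # approx. 0.6321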
Let G−n(xn) denote the prediction for the data point xn made by the trees of the random forest whose bootstrap samples did not include xn, that is, the trees for which xn is out-of-bag. With N samples in the data set, the out-of-bag error, conventionally denoted r1, is defined by averaging the prediction errors over the N data points, comparing each actual value yn with the corresponding prediction G−n(xn).
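In symbols, keeping the notation above and writing err(·,·) for the chosen per-sample error measure (a label introduced here for readability, not taken from the paper), this definition reads:

    r1 = (1/N) Σ_{n=1}^{N} err(yn, G−n(xn))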
To view this from another angle, let r2 denote the corresponding error measured on the out-of-bag (OOB) samples after the values of one feature have been randomly permuted. The importance I of that feature, say xn, is then obtained by averaging, across the N iterations, the increase in error caused by the permutation, that is, the difference between r2 and r1.
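Under the same notation, with r1 and r2 recomputed in each iteration, the description above corresponds to:

    I(xn) = (1/N) Σ (r2 − r1)

where the sum runs over the N iterations and (r2 − r1) is the increase in out-of-bag error caused by permuting xn.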
(3) XGBoost, which stands for Extreme Gradient Boosting, is a powerful and widely used machine learning technique that is particularly well suited to supervised learning tasks on structured or tabular data. It operates as an ensemble learning method that combines the predictions of many decision trees, which are built sequentially so that each new tree corrects the errors of the ones before it.
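A minimal sketch of training such a model with the xgboost Python package, again reusing the split data from the earlier examples (the hyperparameter values are illustrative, not the paper's actual settings):

    from xgboost import XGBClassifier

    # Illustrative hyperparameters only; the paper's settings are not stated here.
    xgb = XGBClassifier(n_estimators=300, max_depth=4,
                        learning_rate=0.1, eval_metric="logloss")
    xgb.fit(X_train, y_train)          # X_train, y_train from the earlier split
    y_pred = xgb.predict(X_test)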
3.2 Evaluation Criteria
(1) Confusion matrices are useful in the context of
stroke prediction (or any binary classification
problem) for evaluating the performance of predictive
models. A confusion matrix presents a detailed
summary of how well a model’s predictions match
the real outcomes in the dataset. It is particularly
valuable for assessing the model’s ability to make
accurate predictions and for understanding the types
of errors it makes.
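With scikit-learn, the matrix for a fitted model can be obtained directly; y_test and y_pred are assumed to come from the sketches above:

    from sklearn.metrics import confusion_matrix

    # Rows are actual classes, columns are predicted classes (0 = no stroke, 1 = stroke).
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print("TN, FP, FN, TP =", tn, fp, fn, tp)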
(2) Accuracy measures the model’s ability to make
correct predictions by considering the total correct
predictions (TP + TN) in relation to all predictions
made. It provides a holistic assessment of the model’s
overall effectiveness.
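In terms of the confusion-matrix counts, this is:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)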
(3) Precision is a performance metric that assesses the
reliability of a model’s positive predictions. It is
calculated by taking the ratio of True Positives (TP)
to the sum of True Positives (TP) and False Positives
(FP). In essence, precision indicates how often the model’s positive predictions are correct.
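In symbols:

    Precision = TP / (TP + FP)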
(4) Recall measures how well the model finds all the positive cases. It is calculated by dividing the number of true positives (correctly identified positives) by the sum of true positives and false negatives. In the context of stroke prediction, recall is crucial to avoid missing high-risk stroke cases.
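In symbols:

    Recall = TP / (TP + FN)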
(5) The F1-score is a way to express both Precision and Recall in a single number, utilizing the harmonic mean of the two.
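In symbols:

    F1 = 2 · (Precision · Recall) / (Precision + Recall)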