machine learning methodologies, including Logistic
Regression models, researchers are committed to
providing a reliable reference for the early prediction
of stroke and offering new perspectives and
methodological foundations for future research in this
field. Given the severe global health challenge posed
by stroke, this project aims not only to pursue
technological innovations and methodological
advancements but also to provide practical tools for
clinical use. Such tools would facilitate early
diagnosis and timely treatment, thereby improving
patients' survival rates and quality of life.
2 RELATED WORK
Stroke, which afflicts 17 million people annually
(Murphy, 2020), is
recognized as the second leading cause of death
worldwide and a primary source of long-term
disability (Silva, 2018), prompting extensive research
into predictive methods for stroke using various
approaches. Some researchers have used SPSS software
and traditional statistical models, such as multiple
linear regression, to study common causes of stroke,
including age, occupation, and the climate of the
living environment (Li, 2016). These traditional
statistical approaches rest on mature techniques and
comprehensive theory and are easy to apply and
interpret, yet they are gradually being superseded as
machine learning and deep learning technologies
develop. There are three reasons: their model metrics
are relatively simple and their predictive performance
is lower; they cannot self-optimize and have weaker
adaptability and generalization capacity; and they are
constrained by the limits of manual computation and
analysis, so they struggle to process large-scale,
high-dimensional data.
Deep learning has also garnered significant
attention in recent predictive research, demonstrating
superior performance in many studies. However, its
reliance on hardware and the opacity of deep neural
networks (DNNs) (Wu, 2022) remain challenges, as
does the critical issue that DNNs can fail entirely
under adverse dataset conditions (e.g., extreme
imbalance between positive and negative instances).
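The imbalance problem can be made concrete with a minimal
sketch. It is illustrative only and is not drawn from any
cited study: the label counts (980 negative, 20 positive)
and the helper names accuracy and recall are assumptions
chosen for the example. It shows how a degenerate model that
always predicts the majority class can score high accuracy
while detecting no positive cases.

```python
# Illustrative sketch: why accuracy is misleading under extreme class imbalance.
# The label counts below are hypothetical, not taken from any stroke dataset.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive cases that the classifier detects."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# 980 negative cases (0) and 20 positive cases (1): a 2% positive rate.
y_true = [0] * 980 + [1] * 20

# A degenerate model that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)
majority_recall = recall(y_true, y_pred)

print(f"accuracy = {majority_accuracy:.2f}, recall = {majority_recall:.2f}")
```

Under these assumed counts the accuracy is 0.98 while the
recall is 0, which is why imbalanced problems are usually
assessed with recall-oriented metrics rather than accuracy
alone.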
In contrast, machine learning models offer strong
predictive performance and a comprehensive range of
model metrics. Because they are based on adaptive
algorithms, they generalize well and are versatile,
and they can process large, high-dimensional datasets
efficiently.
Notably, most machine learning models are mature and
rest on a self-consistent, comprehensive theoretical
foundation. In earlier predictive studies, researchers
trained machine learning models with various sampling
strategies, including Logistic Regression, Gradient
Boosting Machine, Extreme Gradient Boosting, Random
Forests, Support Vector Machines, and Decision Trees.
These studies indicated that laboratory variables were
significantly associated with stroke recurrence in
these models. The models predicted stroke recurrence
stably within a five-year time frame, highlighting the
importance of laboratory variables in periodic
predictions. Additionally, researchers have applied
various feature selection strategies and evaluated the
performance of six interpretable algorithms,
showcasing the potential of machine learning models
for predicting long-term stroke recurrence
(Zhang, 2021; Boukhennoufa, 2022; Song, 2022).
Beyond the application of foundational models,
there have been many fascinating interdisciplinary
studies in recent years. For instance, research by
Pritam Chakraborty, Anjan Bandyopadhyay, Sricheta
Parui, and Sujata Swain from the Kalinga Institute of
Industrial Technology combined machine learning
and game theory in stroke prediction investigations
(Chakraborty, 2024). Other work has explored methods
for handling specific, representative stroke datasets.
For example, Han Zhaoyi and Lian Gaoshe of Taiyuan
University of Technology reported the highest
efficiency in training on imbalanced datasets with the
combination "SMOTEENN sampling + Recursive Feature
Elimination with Random Forests (RFRFE) + XGBoost
classification algorithm" (Han, 2023).
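As a rough illustration of the oversampling idea behind the
SMOTE half of SMOTEENN, the sketch below creates synthetic
minority samples by interpolating between a minority point
and one of its nearest minority neighbours. It is not the
cited authors' implementation: the function name smote_like,
the toy points, and the parameters k and seed are all
hypothetical, and the Edited Nearest Neighbours cleaning
step, feature elimination, and XGBoost are omitted.

```python
# Illustrative sketch of SMOTE-style oversampling (the "SMOTE" half of SMOTEENN).
# Not the cited authors' implementation; data and parameters are hypothetical.
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(b + lam * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Four hypothetical minority-class samples with two features each.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.6)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))
```

In practice a maintained library implementation would
normally be preferred over a hand-written version such as
this one. The sketch only shows that each synthetic sample
lies on the line segment between two existing minority
samples.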
This research focuses on the study of machine
learning models for stroke prediction.
3 METHODOLOGY
Based on the "Stroke Prediction Dataset," this study
conducted a comprehensive analysis of a wide range
of clinical patient characteristics and medical
indicators using various machine learning models.
The aim was to identify the most effective model for
predicting stroke. We first preprocessed the data with
label encoding. Next, imbalanced-learning techniques
and feature selection were applied to the experimental
data; finally,
several machine learning models were constructed
and trained, including the Logistic Regression model,