2.3 Method Introduction
2.3.1 Data Processing
In this experiment, Python is used to process the
data systematically, with the Pandas library reading
the diabetes dataset. The first step is feature
encoding: LabelEncoder converts binary categorical
features (such as gender and the presence of
polyuria) into numerical representations, while
OneHotEncoder encodes multi-category features
independently. The categorical labels are likewise
converted into numerical form with LabelEncoder.
The dataset is then split into a training set and a
test set, typically in an 80-20 ratio. Finally, the
features are standardized with StandardScaler so that
they share consistent scales, which improves the
stability of the model during training.
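The preprocessing steps above can be sketched as follows with scikit-learn. The column names and values here are illustrative stand-ins, not the actual dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Tiny synthetic stand-in for the diabetes dataset (illustrative columns).
df = pd.DataFrame({
    "Age": [45, 60, 33, 52, 29, 41],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Polyuria": ["Yes", "No", "Yes", "Yes", "No", "No"],
    "class": ["Positive", "Negative", "Positive",
              "Positive", "Negative", "Negative"],
})

# Binary categorical features -> numeric via LabelEncoder.
for col in ["Gender", "Polyuria"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Categorical labels -> numeric.
y = LabelEncoder().fit_transform(df["class"])
X = df.drop(columns=["class"])

# 80-20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Standardize features; the scaler is fitted on the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training portion only keeps test-set information out of the preprocessing step.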
2.3.2 Support Vector Machine Model
Support Vector Machine (SVM) is a supervised
learning algorithm well suited to binary classification
problems. Its main idea is to find an optimal
hyperplane that separates data points of different
classes. A kernel function can map the data into a
higher-dimensional space where separation is easier;
in this work a linear kernel is used.
In the SVM model, the data was first divided into
feature variables (X) and target variables (y).
Categorical features were processed with one-hot
encoding, while LabelEncoder converted the target
variable into numerical form. The data was split into
training and test sets in an 80-20 ratio. An SVM
model was constructed with a linear kernel and a
regularization parameter C=1, and its performance
was verified using 5-fold cross-validation. The
learning_curve function was used to visualize how the
model's accuracy changes with the number of
training samples.
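A minimal sketch of this SVM setup, assuming scikit-learn; the synthetic data below stands in for the preprocessed diabetes features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.svm import SVC

# Synthetic binary-classification data in place of the real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Linear kernel with regularization parameter C=1, as in the text.
svm = SVC(kernel="linear", C=1)

# 5-fold cross-validation accuracy.
scores = cross_val_score(svm, X, y, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Accuracy trends for increasing numbers of training samples.
train_sizes, train_scores, val_scores = learning_curve(
    svm, X, y, cv=5, train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0])
```

The arrays returned by `learning_curve` (one row per training size, one column per fold) are what get plotted as the training and validation curves.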
2.3.3 Random Forest Model
Random Forest is an ensemble learning method that
builds a collection of decision trees, each trained on a
bootstrap sample of the data and considering a
random subset of features at every split. The final
prediction is obtained by majority voting over the
trees, which reduces variance and makes the model
less prone to overfitting than a single decision tree.
In the random forest model, categorical variables
were first mapped to numerical representations.
The dataset was then partitioned into feature
variables (X) and target variables (y), and a random
forest classifier was employed for binary
classification. Model performance was assessed
through 5-fold cross-validation, followed by an
analysis of learning curves to examine the model's
performance at various training-set sizes. Finally, the
cross-validation accuracy was reported, and the trend
of the model's training and validation performance
with increasing training data was visualized using
learning-curve plots.
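The random-forest step can be sketched the same way, again assuming scikit-learn and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, learning_curve

# Synthetic binary-classification data in place of the real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Random forest classifier for binary classification.
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation accuracy.
scores = cross_val_score(rf, X, y, cv=5)
print("CV accuracy: %.3f" % scores.mean())

# Training vs. validation accuracy as the training set grows.
train_sizes, train_scores, val_scores = learning_curve(
    rf, X, y, cv=5, train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0])
```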
2.3.4 Multilayer Perceptron Model
A Multilayer Perceptron (MLP) is an artificial neural
network composed of multiple layers of neurons: an
input layer, one or more hidden layers, and an output
layer. Every neuron is connected to all neurons in the
preceding layer and is associated with weights and a
bias. The MLP captures complex relationships between
input features by learning these weights and biases.
In the MLP model, categorical variables were first
converted into numerical representations using
LabelEncoder, and the feature variables (X) and
target variables (y) were extracted. StandardScaler
was then applied to standardize the features, ensuring
uniform scales. The MLP was defined with two
hidden layers of 100 and 50 neurons, respectively,
and a maximum iteration limit of 500. 5-fold
cross-validation was performed using StratifiedKFold,
recording the training and testing accuracies as well
as the loss for each fold. Finally, charts of the
training loss and the training/testing accuracies
against the number of iterations were generated to
analyze the model's performance.
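A minimal sketch of this cross-validation loop with scikit-learn's MLPClassifier; the data is synthetic, and the per-fold bookkeeping mirrors the description above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data in place of the real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_train_acc, fold_test_acc, fold_losses = [], [], []

for train_idx, test_idx in skf.split(X, y):
    # Standardize features, fitting the scaler on each fold's training part.
    scaler = StandardScaler().fit(X[train_idx])
    X_tr = scaler.transform(X[train_idx])
    X_te = scaler.transform(X[test_idx])

    # Two hidden layers (100 and 50 neurons), at most 500 iterations.
    mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500,
                        random_state=0)
    mlp.fit(X_tr, y[train_idx])

    fold_train_acc.append(mlp.score(X_tr, y[train_idx]))
    fold_test_acc.append(mlp.score(X_te, y[test_idx]))
    fold_losses.append(mlp.loss_curve_)  # training loss per iteration
```

The recorded `loss_curve_` values and per-fold accuracies are what the loss and accuracy charts are drawn from.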
2.3.5 Logistic Regression Model
Logistic regression is a linear model that is widely
used for binary classification. It passes the weighted
sum of the input features through a sigmoid function,
mapping the result to a probability between 0 and 1,
which is then thresholded to produce a class
prediction.
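The sigmoid mapping just described can be illustrated directly; the weights and inputs below are arbitrary example values:

```python
import math

def sigmoid(z):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Weighted sum of features plus bias, passed through the sigmoid."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(z)

# z = 0.5*1.0 + (-0.25)*2.0 + 0.1 = 0.1, so the probability is just above 0.5.
p = predict_proba([1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
```

Classification then amounts to comparing `p` against a threshold, typically 0.5.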
In the logistic model, the data was initially
preprocessed and features were standardized.
Subsequently, the data was split into a training and a
testing set. This model was defined as a logistic
regression model with a linear layer, a Sigmoid