2 METHODOLOGIES
This research followed a structured approach
consisting of five main steps. Initially, a preliminary
analysis and visualization of the dataset were
conducted. The second step involved data
preprocessing to address any inconsistencies or
imbalances. Subsequently, the third step focused on
feature engineering to enhance the dataset's predictive
capabilities. The fourth step entailed selecting the
appropriate machine learning model for training. This
study employed six models: Logistic Regression (LR),
Decision Tree (DT), Random Forest (RF), Gradient
Boosting Decision Trees (GBDT), Extreme Gradient
Boosting (XGBoost), and Deep Neural Network
(DNN).
The final step encompassed a comparative
analysis of the model performance using five
evaluation indicators, leading to the identification of
the most suitable model. The workflow of the
research is illustrated in Figure 1 below.
2.1 Data Set Exploration
The first step of this research was to inspect the first
few rows of the dataset to understand its basic
structure, features, and samples. Python was then used
to compute basic descriptive statistics of the dataset,
such as the mean, median, and standard deviation, in
order to gain a preliminary understanding of the
distribution of the data. At the same time, statistical
charts and a correlation heat map of the data features
were also drawn. The detailed chart analysis is
presented in the Experimental Setup and Results
section in the fourth part of the paper.
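This exploration step can be sketched with pandas; the column names and values below are illustrative stand-ins, not the actual credit dataset:

```python
import pandas as pd

# Illustrative stand-in for the credit dataset; the real data has
# 30,000 rows and 24 features such as age, sex, and education.
df = pd.DataFrame({
    "age": [25, 40, 33, 52, 29, 47],
    "bill_amount": [3913, 2682, 29239, 46990, 8617, 64400],
    "pay_amount": [0, 1000, 1518, 2000, 5000, 2500],
    "default": [1, 1, 0, 0, 0, 0],
})

print(df.head())       # basic structure and sample rows
print(df.describe())   # mean, std, quartiles, etc.

# Correlation matrix underlying the heat map; a plotting library
# such as seaborn would render it, e.g. seaborn.heatmap(corr).
corr = df.corr()
print(corr)
```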
2.2 Data Processing
Questionnaire data illustrates the problem: respondents
often leave some survey questions blank, which shows
that datasets in general suffer from missing and
unreliable values. To prevent missing values and data
anomalies from degrading the efficiency and
performance of machine learning, the dataset must be
preprocessed. In this study, synthetic oversampling,
random oversampling, and undersampling were used to
deal with class imbalance. The dataset consists of
30,000 observations; after the preprocessing step, 70%
was used as the training set and 30% as the test set.
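The split and a simple random-oversampling step can be sketched with scikit-learn; synthetic oversampling (e.g. SMOTE from the imbalanced-learn package) follows the same pattern but generates new minority samples instead of duplicating existing ones. The data here is randomly generated for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(0)

# Imbalanced toy data: roughly 20% defaults, mimicking the skew of
# real credit datasets.
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "default": (rng.random(1000) < 0.2).astype(int),
})

# 70% training / 30% test split, as used after preprocessing.
train, test = train_test_split(df, test_size=0.3, random_state=42,
                               stratify=df["default"])

# Random oversampling: duplicate minority-class rows in the training
# set until both classes are the same size.
majority = train[train["default"] == 0]
minority = train[train["default"] == 1]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["default"].value_counts())
```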
2.3 Feature Engineering
Feature engineering is an important part of machine
learning. In this research it involves two tasks: feature
selection and feature label encoding. The dataset has
24 features, such as age, sex, and education. Some of
these features have little relevance to credit
forecasting, so feature selection is necessary: screening
out low-correlation features reduces the noise they
introduce during machine learning, improving both
efficiency and accuracy. The second task is feature
label encoding. Some feature types in the dataset carry
high-dimensional categorical information, which needs
to be reduced; in simple terms, the different values
within the same feature type are represented by
different numbers.
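Both tasks can be sketched with pandas; the column names and the correlation threshold are illustrative assumptions, not the study's actual choices:

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["graduate", "university", "high school", "university"],
    "pay_amount": [0, 1000, 1518, 2000],
    "noise_col": [1, 1, 1, 1],  # constant column: carries no information
    "default": [1, 0, 0, 1],
})

# Label encoding: map each category of a feature to an integer code.
codes, categories = pd.factorize(df["education"])
df["education"] = codes
print(dict(enumerate(categories)))

# Feature selection: keep features whose absolute correlation with the
# target exceeds an illustrative threshold (constant columns yield NaN
# and are dropped automatically).
corr = df.corr()["default"].abs()
keep = corr[corr > 0.1].index
print(list(keep))
```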
2.4 Model Selection and Construction
In this study, the feature types include both high-
dimensional categorical information and continuous
data such as credit card consumption and repayment
amounts. To make the credit predictions more
comprehensive, the six models used in this study
therefore span linear models, tree models (including
three ensemble learning methods), and a deep learning
model. Ensemble learning is a machine learning
approach that combines multiple learners; aggregating
the predictions of multiple learners improves the
performance and generalization ability of the model as
a whole. By incorporating multi-level nonlinear
learning, deep neural networks can autonomously
acquire intricate feature representations. This model
proves highly effective for credit forecasting,
particularly when considering multiple criteria.
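The scikit-learn members of the model set can be instantiated as below; XGBoost and the DNN come from separate libraries (the xgboost package and a deep learning framework such as Keras), so they are only noted in comments. The hyperparameters shown are library defaults, not the settings used in the study:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification

# LR, DT, RF, and GBDT from scikit-learn; XGBoost would come from the
# xgboost package (xgboost.XGBClassifier) and the DNN from a deep
# learning framework, both following the same fit/predict interface.
models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),
}

# Tiny synthetic stand-in for the preprocessed credit data.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))
```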
● Logistic Regression
Logistic regression is based on the basic assumption
that there is a linear relationship between the input
features and the output target. This means that output
targets can be predicted by linear combinations of
Figure 1: Research Workflow (Picture credit: Original).