2 METHODOLOGY 
This research followed a structured approach 
consisting of five main steps. Initially, a preliminary 
analysis and visualization of the dataset were 
conducted. The second step involved data 
preprocessing to address any inconsistencies or 
imbalances. Subsequently, the third step focused on 
feature engineering to enhance the dataset's predictive 
capabilities. The fourth step entailed selecting the 
appropriate machine learning model for training. This 
study employed six models: Logistic Regression (LR), 
Decision Tree (DT), Random Forest (RF), Gradient 
Boosting Decision Trees (GBDT), Extreme Gradient 
Boosting (XGBoost), and Deep Neural Network 
(DNN). 
The final step encompassed a comparative analysis of model performance using five evaluation indicators, leading to the identification of the most suitable model. The workflow of the research is illustrated in Figure 1.
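The five indicators themselves are detailed later in the paper. Purely as a hedged sketch, assuming the common choice of accuracy, precision, recall, F1 score, and AUC (this set is an assumption, not the paper's confirmed list), such a comparison could be scored as follows:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(model, X_test, y_test):
    """Score a fitted classifier on five assumed indicators
    (accuracy, precision, recall, F1, AUC -- an assumed set)."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # predicted default probability
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
    }
```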
2.1  Data Set Exploration 
The first step of this research was to examine the first few rows of the dataset in order to understand its basic structure, features, and samples. Python was then used to compute basic descriptive statistics of the data, such as the mean, median, and standard deviation, to gain a preliminary understanding of its distribution. At the same time, several statistical charts and a correlation heat map of the data features were drawn. The detailed chart analysis is presented in the Experimental Setup and Results section in the fourth part of this paper.
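A minimal sketch of this exploration step, assuming pandas, seaborn, and matplotlib; the file name is a placeholder, not the paper's actual data source:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the credit dataset; "credit_default.csv" is a hypothetical file name
df = pd.read_csv("credit_default.csv")

print(df.head())      # basic structure, features, and sample rows
print(df.describe())  # mean, std, median (50%), and other statistics

# Correlation heat map of the numeric features
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm")
plt.tight_layout()
plt.show()
```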
2.2 Data Processing 
Taking data collected through questionnaire surveys as an example, respondents often leave some survey questions blank. This simple observation illustrates that datasets generally suffer from missing and unreliable values. To prevent missing values and data anomalies from degrading the efficiency and performance of machine learning, the dataset must be preprocessed. In this study, synthetic minority oversampling, random oversampling, and undersampling were used to deal with class imbalance. 
The dataset consists of 30,000 observations. After the preprocessing step, 70% of the data was used as the training set and 30% as the test set. 
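A minimal sketch of the imbalance handling and the 70/30 split, assuming scikit-learn and the imbalanced-learn library, with a synthetic stand-in for the real dataset (the 22% minority share shown is illustrative, not a figure from the paper):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30,000-row credit dataset; the 22% minority
# share is illustrative only
X, y = make_classification(n_samples=30000, n_features=24,
                           weights=[0.78], random_state=42)

# 70% training set, 30% test set, stratified to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Synthetic minority oversampling applied to the training set only; random
# oversampling/undersampling would use imblearn's RandomOverSampler or
# RandomUnderSampler in the same way
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_bal))
```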
2.3 Feature Engineering 
Feature engineering is an important part of machine learning. In this research it comprises two operations: the selection of features and the label encoding of features. The dataset has 24 features, such as age, sex, and education. Some of these features have little relevance to credit forecasting, so feature selection is necessary; such feature screening reduces the noise that weakly correlated features introduce into machine learning and thereby improves research efficiency and accuracy. The second operation is feature label encoding. Some feature types in the dataset carry high-dimensional categorical information, which needs to be reduced; put simply, the different categories within the same feature type are represented by different numbers. 
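A minimal sketch of both operations, assuming pandas and scikit-learn; the column names are hypothetical, and the choice of SelectKBest is an illustrative screening method rather than the paper's confirmed procedure:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Tiny illustrative frame; the real dataset has 24 features and the
# column names here are hypothetical
df = pd.DataFrame({
    "AGE": [24, 35, 41, 29, 52, 33],
    "SEX": ["male", "female", "female", "male", "female", "male"],
    "EDUCATION": ["graduate", "university", "high school",
                  "university", "graduate", "high school"],
    "default": [1, 0, 0, 1, 0, 1],
})

# Label encoding: represent the categories of a feature type as integers
for col in ["SEX", "EDUCATION"]:
    df[col] = df[col].astype("category").cat.codes

# Feature screening: keep the k features most associated with the label
X, y = df.drop(columns=["default"]), df["default"]
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()].tolist())
```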
2.4  Model Selection and Construction 
In this study, the feature types include both high-dimensional categorical information and continuous data, such as credit card consumption and repayment amounts. To make the credit predictions more comprehensive, the six models used in this study therefore span linear models, tree models (including three ensemble learning methods), and deep learning models. Ensemble learning is a machine learning approach that combines multiple base learners; aggregating their predictions improves the performance and generalization ability of the overall model. By incorporating multiple levels of nonlinear learning, deep neural networks can autonomously acquire intricate feature representations, which makes them highly effective for credit forecasting, particularly when multiple criteria are considered. 
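A sketch of how the six models could be instantiated, assuming scikit-learn and the xgboost package; the hyperparameters are illustrative defaults, and MLPClassifier merely stands in for the paper's DNN, whose architecture is not specified here:

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# The six model families named above; all hyperparameters are
# illustrative defaults rather than the study's tuned settings
models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(n_estimators=100, random_state=42),
    "GBDT": GradientBoostingClassifier(random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
    # MLPClassifier is only a stand-in for the paper's DNN
    "DNN": MLPClassifier(hidden_layer_sizes=(64, 32), random_state=42),
}
```

Each model would then be fitted on the balanced training data from Section 2.2 and compared using the evaluation indicators described above.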
● Logistic Regression 
Logistic regression is based on the assumption that the log-odds of the output target are a linear function of the input features. This means that the probability of the output target can be predicted from a linear combination of the input features.
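As a standard textbook formulation (not reproduced from the paper), the model passes a linear combination of the feature vector through the sigmoid function to obtain a default probability:

\[
P(y = 1 \mid \mathbf{x}) = \sigma\left(\mathbf{w}^{\top}\mathbf{x} + b\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]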
Figure 1: Research Workflow (Photo/Picture credit: Original).