model of customer churn (Li, 2019). Scholars around
the world have carried out extensive research in this
area so that such models can predict churn more
accurately (Wang, 2022; Chandar, 2006).
Hu proposed that the customer churn problem of
retail banks can be studied and addressed with data
mining technology. Chandar et al. used three
algorithms (CART, C5.0 and TreeNet) to predict
bank customer churn, and ultimately found that the
CART algorithm gave the best classification
performance. To deal with imbalanced data, Liao
(2012) used an improved Boosting method, showing
that it enhances the model's ability to handle
imbalanced data and reduces the predictive bias
caused by the imbalance of the data set. Huang et al. proposed
an interpretable support vector machine and, at the
same time, used a naive Bayes tree to build a
customer churn model with high predictive accuracy
(Huang, 2014). Building on support vector machine
algorithms, He et al. also explored the prediction of
commercial bank customer churn. Focusing on data
imbalance, they further improved the model with a
random sampling method, and the results show that
this method can significantly improve the model's
predictive accuracy (He, 2017). Huang et al.
proposed an algorithm that combines Particle Swarm
Optimization and Back Propagation to establish an
early-warning model of corporate customer churn
(Huang, 2018). However, Back Propagation has
notable disadvantages, such as slow convergence and
a high risk of getting trapped in local
minima. Swetha P and Dayananda B proposed
the Improvized-XGBOOST model with feature
functions for the prediction of customer churn. The
results show that the model is more efficient and
well suited to complex data sets (Swetha,
2020).
It can be seen from the above that many scholars
have conducted related studies on customer churn.
However, most of these studies used individual
single models to build the customer churn prediction
model, and have achieved some results.
3 METHODOLOGIES
3.1 Data Preprocessing
The complexity of data types, their internal
correlations, and uneven data quality can negatively
affect data interpretation and analysis. Data
preprocessing is therefore a crucial step in machine
learning, because the quality of the data greatly
affects the outcome of the model. It includes data
cleaning, data integration, data conversion, data
reduction and other steps. Through these steps, the
accuracy, interpretability and robustness of the model
can be improved. In this research, the process of data
preprocessing includes five parts: deleting redundant
features, performing one-hot encoding of text
information, processing missing values, scaling
features, and using the SMOTE method to balance
the data set. In addition, the original dataset is split
into a training set (80%) and a test set (20%).
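The core idea behind the SMOTE step above is to synthesize new minority-class samples by interpolating between an existing minority sample and one of its k nearest minority-class neighbours. The following is a minimal NumPy sketch of that interpolation idea only; in practice a library implementation (e.g. imbalanced-learn's SMOTE) would be used, and the function name, toy data, and sizes here are illustrative, not part of the study:

```python
import numpy as np

def smote_sample(minority, k=5, n_new=100, seed=None):
    """Synthesize n_new samples by interpolating between each chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(minority)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-distance
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per sample
    base = rng.integers(0, n, size=n_new)       # random base samples
    nb = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return minority[base] + gap * (minority[nb] - minority[base])

# toy minority class: 20 samples, 3 features
X_min = np.random.default_rng(0).normal(size=(20, 3))
X_new = smote_sample(X_min, k=5, n_new=50, seed=1)
print(X_new.shape)  # (50, 3)
```

Note that oversampling of this kind is normally applied only to the training split, so that the test set keeps the original class distribution.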
Deleting Redundant Features
Redundant features increase the computational
complexity of model training; deleting them reduces
the computational cost and improves the efficiency
of the model. In this research, Surname and
CustomerID were removed from the features
because, like RowNumber, they serve only as record
identifiers.
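In pandas, this removal is a single `drop` call. The column names below come from the text; the toy frame itself is illustrative:

```python
import pandas as pd

# toy frame containing the identifier columns named in the text
df = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerID": [101, 102],
    "Surname": ["Smith", "Jones"],
    "Age": [35, 42],
})
# drop the redundant identifier features before training
df = df.drop(columns=["Surname", "CustomerID"])
print(list(df.columns))  # ['RowNumber', 'Age']
```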
Performing One-Hot Encoding of Text
Information
The text information among the features is one-hot
encoded, converting it into numerical form. In
addition, one-hot encoding prevents the model from
developing a preference for larger category codes,
which would otherwise affect the accuracy of the
model. In this research, one-hot encoding is applied
to the geographical location and gender features to
help the model better understand and utilize the
categorical information.
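A minimal sketch of this step with `pandas.get_dummies`, assuming the two categorical columns are named `Geography` and `Gender` (the text names the attributes but not the exact column labels; the sample values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Geography": ["France", "Spain", "France"],
    "Gender": ["Male", "Female", "Male"],
    "Age": [35, 42, 29],
})
# one-hot encode the categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=["Geography", "Gender"])
print(sorted(encoded.columns))
```

Each category becomes its own 0/1 column (e.g. `Geography_France`, `Gender_Male`), so no category is represented by a larger number than another.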
Processing Missing Values
Missing values cause the model to lack effective
information during training and prediction, thereby
reducing model performance. Filling in missing
values in the data set can therefore improve the
stability and interpretability of the model. In this
research, missing values appear only in
HasCreditCard and IsActiveMember. Considering
their actual meaning, that is, whether the users have
credit cards and whether they are active users, 0 is
used to fill the missing values.
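This fill can be sketched with `fillna`; the column names follow the text and the toy values are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "HasCreditCard": [1.0, np.nan, 0.0],
    "IsActiveMember": [np.nan, 1.0, 1.0],
})
# a missing flag is treated as "no" (0), matching the features' yes/no meaning
cols = ["HasCreditCard", "IsActiveMember"]
df[cols] = df[cols].fillna(0)
print(df["HasCreditCard"].tolist())  # [1.0, 0.0, 0.0]
```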
Scaling Features
Feature scaling prevents some feature values from
being so large that they dominate the prediction
results. In this research, StandardScaler is used to
standardize the numerical features so that they share
the same scale.
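A minimal sketch of this step with scikit-learn's `StandardScaler`, which rescales each column to zero mean and unit variance; the toy matrix (e.g. a credit score and an age column) is illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy numerical features: column 0 ~ credit score, column 1 ~ age
X = np.array([[600.0, 40.0],
              [850.0, 25.0],
              [700.0, 60.0]])
scaler = StandardScaler()
# in a full pipeline: fit on the training split, then transform the test split
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # ~0 mean, ~1 std per column
```

Fitting the scaler on the training set only, and reusing it on the test set, keeps the two splits on a consistent scale without leaking test statistics into training.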