to find anomalies, and potential health insurance
claims fraud. Waghade and Karandikar (2018) also
discussed the need for better machine learning and
data mining methods to improve the effectiveness of
fraud detection systems in health insurance.
In 2014, Tianqi Chen introduced XGBoost, a library implementing optimized gradient tree boosting. It is designed to be a highly efficient, flexible, and portable model. The fundamental idea of XGBoost is to optimize the gradient tree boosting model so that it handles sparse data, processes large amounts of data efficiently, and scales well. XGBoost is often claimed to be the most successful machine learning library, since XGBoost-based models often outperform other models and dominate data science competitions (Chen, 2016). It has also been applied to many problems across different fields, including insurance. Fauzan & Murfi (2018) implemented XGBoost for insurance claim prediction and obtained better accuracy than other models, including ensemble methods such as AdaBoost, Stochastic Gradient Boosting, and Random Forest, as well as Neural Networks. Rusdah & Murfi (2020) also showed that XGBoost can learn directly from a dataset with missing values and achieve accuracy comparable to an XGBoost model trained on the imputed dataset.
However, XGBoost is not optimized for imbalanced datasets in classification problems. In many cases, the base XGBoost model did not give desirable results, as in the simulations of Ruisen et al. (2018). Several common strategies have been developed and implemented to handle imbalanced classes. Dhankhad et al. (2018) and Rio et al. (2015) used undersampling and oversampling with increased ratios to alter the dataset's composition before training the machine learning model. Another widely used approach to the imbalanced class problem is the Synthetic Minority Oversampling Technique (SMOTE), which Varmedja et al. (2019) implemented to predict credit card fraud.
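To make these resampling strategies concrete, the following minimal sketch uses the imbalanced-learn library on a synthetic dataset; the toy data, the default 1:1 SMOTE ratio, and the 0.5 undersampling ratio are illustrative assumptions rather than the configurations used in the cited studies.

# Resampling an imbalanced dataset before training a classifier.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced dataset: roughly 5% positive (fraud) class.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
print(Counter(y))                      # approximately {0: 950, 1: 50}

# Oversampling: SMOTE synthesizes new minority-class examples.
X_over, y_over = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_over))                 # classes balanced 1:1 by default

# Undersampling: randomly drop majority-class examples to a chosen ratio.
X_under, y_under = RandomUnderSampler(sampling_strategy=0.5,
                                      random_state=42).fit_resample(X, y)
print(Counter(y_under))                # minority:majority ratio of 0.5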
Some researchers have also developed and implemented strategies to handle imbalanced classes without altering the dataset; for example, Wei et al. (2012) integrated contrast pattern mining, a neural network, and a decision forest to predict online banking fraud activities.
Wang et al. (2020) proposed modifying the base XGBoost model and called it Imbalance-XGBoost. The improvement is made by adding either a weighted function or a focal loss function to the boosting
machine. The fundamental idea of the weighted
function is to increase the penalty if the model
wrongly predicts the minority class. Meanwhile, the
focal loss function adds a multiplier factor to the
cross-entropy function for the same purpose as the
weighted function. These modifications are expected
to improve the XGBoost base model's performance.
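A minimal sketch of the weighted-function idea is given below, written as a custom training objective for the standard XGBoost library; the ALPHA weight and the training parameters are illustrative assumptions, not the exact Imbalance-XGBoost implementation of Wang et al. (2020).

# Weighted cross-entropy as a custom XGBoost objective: the weight ALPHA
# enlarges the gradient (penalty) when the minority (positive) class is
# predicted poorly; ALPHA = 1 recovers the ordinary logistic loss.
import numpy as np
import xgboost as xgb

ALPHA = 10.0   # illustrative minority-class weight, not a tuned value

def weighted_logloss(preds, dtrain):
    # preds are raw margin scores; labels are 0/1, with 1 = minority class.
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))                   # sigmoid
    grad = p * (1.0 + (ALPHA - 1.0) * y) - ALPHA * y   # first derivative
    hess = p * (1.0 - p) * (ALPHA * y + 1.0 - y)       # second derivative
    return grad, hess

# Usage, assuming X_train and y_train hold the (imbalanced) claims data:
# dtrain = xgb.DMatrix(X_train, label=y_train)
# booster = xgb.train({"max_depth": 6, "eta": 0.1}, dtrain,
#                     num_boost_round=200, obj=weighted_logloss)

The focal loss variant instead multiplies the cross-entropy by a modulating factor that shrinks the loss of well-classified examples, so the boosting machine concentrates on the hard, typically minority-class, observations.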
This paper examines the imbalanced class
handling of XGBoost in predicting insurance fraud.
The comparative analysis of the existing methods is based on several metrics: accuracy, precision, recall, F1-score, and AUC. Our
implementation shows that the weighted-XGBoost
outperforms other approaches in handling the
imbalanced class problem. The Imbalance-XGBoost models are reliable for improving the base model: they achieve up to a 28% improvement in recall on the minority class compared to the base XGBoost model. The precision scores of both Imbalance-XGBoost models decrease, whereas the weighted-XGBoost model improves precision and recall simultaneously.
The rest of the paper is organized as follows:
Section 2 presents the materials and methods, explaining the theoretical foundations of the machine learning models implemented in this research. In Section 3, we discuss the process and results of the simulations. Finally, we present the conclusions of this research in Section 4.
2 MATERIALS AND METHODS
2.1 XGBoost
XGBoost is a popular model that optimizes gradient tree boosting and learns from tabular data. Its high scalability allows XGBoost to run ten times faster than other conventional models and makes it robust to high-dimensional datasets. This high scalability is achieved through a tree-learning algorithm optimized for sparse data, a weighted quantile sketch algorithm for more efficient computation, and a cache-aware block structure for parallelizing the tree-learning process across all processor cores (Chen & Guestrin, 2016).
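As a brief illustration of these properties, the sketch below trains XGBoost on a tiny tabular dataset containing missing values; the data and parameter values are illustrative assumptions only.

# XGBoost on tabular data with missing values: the sparsity-aware tree
# learner assigns missing entries a default direction at each split, and
# training runs in parallel over the available processor cores.
import numpy as np
import xgboost as xgb

X = np.array([[1.0, np.nan, 3.0],
              [0.5, 2.0, np.nan],
              [np.nan, 1.5, 0.0],
              [2.0, 0.5, 1.0],
              [1.5, np.nan, 2.5],
              [0.2, 1.0, 0.5]])
y = np.array([1, 0, 0, 1, 1, 0])

dtrain = xgb.DMatrix(X, label=y, missing=np.nan)   # missing values allowed
params = {"objective": "binary:logistic",
          "max_depth": 3, "eta": 0.3,
          "nthread": 4}                            # use multiple cores
booster = xgb.train(params, dtrain, num_boost_round=10)
print(booster.predict(dtrain))                     # predicted probabilities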
An important improvement in XGBoost is how it handles overfitting. Overfitting is a condition in which a machine learning model captures not only the underlying trend but also the noise in the training data. Consequently, model performance on the training data will be very high, while performance on observations outside the training data will be far worse (Ying, 2019). The first method implemented in XGBoost is a regularized learning objective, in which penalty terms on the tree complexity and leaf weights are added to prevent the model from overlearning the data. Equation 1 shows the regularized learning objective used in XGBoost.