An unequal distribution of data between the majority class, i.e. patients that are less likely to survive, and the minority class, i.e. patients that are likely to survive, can bias a model towards the majority class, so that minority-class samples are often misclassified. Misclassification of the minority class can lead to unnecessarily aggressive postoperative treatment procedures, high dosages of recommended drugs and accelerated health follow-ups and diagnostic tests, which can cause both physical and psychological stress. The ability of a clinician to predict the survival status of a patient at a given time can alleviate this stress. Hence, for machine learning models to be used in clinical practice, they should be designed to be robust against the bias induced by the majority class. These models can also be used as risk assessment tools to help determine which patients should be offered imaging. However, all of these tools suffer from the aforementioned challenge of bias towards the majority class. Furthermore, they are dynamic in nature and need to be updated continuously as the environment changes. Hence, the model should be constructed and designed so that it can adjust to changes in the underlying patient population.
In this paper, we investigate different approaches for predicting the survival status of patients suffering from non-small cell lung cancer. In Section 2 we present the signal models, i.e. the different classifiers on which our analysis is performed, and later in the paper we list the evaluation metrics used to measure performance. In addition, we define a fusion algorithm that can be used to combine the decisions of the different machine learning algorithms. In Section 3, the dataset and the results of the different tests performed on the training data are discussed. In Section 4 we conclude our findings for this study and present suggestions for future work.
2 SIGNAL MODELS
2.1 Data Set
The dataset used for evaluation of the proposed model is from the MAASTRO Clinic (Maastricht, The Netherlands). The dataset is open source and can be found at TCIA (The Cancer Imaging Archive) under NSCLC (Aerts, 2019). Four hundred and twenty-two consecutive patients were included (132 women and 290 men), with inoperable, histologically or cytologically confirmed NSCLC, UICC stages I-IIIb, treated with radical radiotherapy alone (n = 196) or with chemoradiation (n = 226). The mean age was 67.5 years (range: 33-91 years). The study was approved by the institutional review board, and all research was carried out in accordance with Dutch law. The Institutional Review Board of the Maastricht University Medical Centre (MUMC+) waived review due to the retrospective nature of this study. Out of 422 records, only 365 patients have complete information. The survival time (in days) in the dataset is measured from the start of treatment, and the recorded status of a patient may not be exact, i.e. the clinicians may not have received the information at the moment the event outcome occurred.
2.2 Machine Learning Models
Training a model that predicts the survival status at a given time means forecasting the odds of the outcome rather than a point estimate of its occurrence. In our case there are two disease outcomes, alive and dead, defined so that if the predicted odds are greater than 50% the predicted class is assigned the value 1 (alive), and otherwise 0 (dead). We investigate the applicability of several models: gradient boosting, XGBoost and random forest. The main difficulty in this particular application is the unbalanced data set, since the number of patients surviving lung cancer after a certain period of time is relatively small. To address this, we propose to fuse the proposed machine learning algorithms using the information fusion algorithm proposed in (Liu et al., 2007).
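To make the idea of decision fusion concrete, the sketch below combines the binary decisions of several classifiers by a plain majority vote. This is only an illustration of how per-classifier labels can be combined, not the information fusion algorithm of (Liu et al., 2007); the prediction arrays are made-up placeholder values.

```python
import numpy as np

def majority_vote(decisions):
    """decisions: array of shape (n_classifiers, n_samples) with 0/1 labels
    (1 = alive, 0 = dead). Returns the fused label per sample."""
    decisions = np.asarray(decisions)
    votes = decisions.sum(axis=0)                    # number of "alive" votes
    return (votes > decisions.shape[0] / 2).astype(int)

preds = [[1, 0, 1],   # hypothetical gradient boosting outputs
         [1, 1, 0],   # hypothetical XGBoost outputs
         [0, 1, 1]]   # hypothetical random forest outputs
print(majority_vote(preds))   # [1 1 1]
```

A weighted vote, with weights reflecting each classifier's validation performance, would follow the same structure with a weighted sum in place of the plain count.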
2.3 Gradient Boosting
Boosting is a strategy that combines multiple simple models into an overall stronger model; the simple models are called weak learners. The flow chart in Figure 1 below illustrates the gradient boosting method for $N$ trees. Tree 1 is trained using a feature matrix $X$ and a target variable $y$. The predictions labelled $\hat{y}_1$ are used to determine the training-set residuals $r_1$. Tree 2 is then trained using the feature matrix $X$ and the residuals $r_1$ of Tree 1 as labels. The predicted results $\hat{r}_1$ are then used to determine the residuals $r_2$. The process is repeated until all $N$ trees forming the ensemble are trained.
In other words, instead of fitting a new model to the data at each iteration, gradient boosting fits each new model to the residual errors made by the previous one. The details of the gradient boosting method are outlined in (Ke et al., 2017).
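The residual-fitting loop described above can be sketched as follows. This is a minimal illustration on hypothetical toy regression data with a squared-error loss (for which the residuals equal the negative gradients), not the tuned implementation used in our experiments, which relies on a library such as LightGBM (Ke et al., 2017).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: y depends on the first two features plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=100)

n_trees, learning_rate = 50, 0.1
prediction = np.full_like(y, y.mean())   # initial constant model
trees = []
for _ in range(n_trees):
    residual = y - prediction            # r: errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)  # fit weak learner to residuals
    trees.append(tree)
    prediction += learning_rate * tree.predict(X)   # shrink and add its contribution

print(np.mean((y - prediction) ** 2))    # training MSE shrinks as trees are added
```

The learning rate shrinks each tree's contribution, trading more trees for better generalisation; for classification, the same loop is applied to the gradients of a log-loss rather than raw residuals.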