consultations with an oncologist. Several data mining
algorithms are compared in order to select the
technique, or ensemble thereof, that gives the best results.
2 RELATED WORK
The existing predictive models have used data mining
techniques such as artificial neural networks, decision
trees and statistical methods to predict cancer
survival. Two data mining techniques, artificial
neural networks and decision trees (C5), and one
statistical technique, logistic regression, were
compared using the SEER public-use database
(SEER, n.d.) for the period 1973-2000 (Delen,
Walker, & Kadam, 2005). The cleansed,
preprocessed dataset consisted of 202,932 records.
Only 17 out of 72 variables were selected; these
comprised 1 dependent variable and 16 predictor
variables including race, age, grade, marital status,
primary site code, histology, behavior, extension of
disease, lymph node involvement, radiation, stage of
cancer and tumor size. Comparative performance was
measured by accuracy, sensitivity and specificity,
estimated with k-fold cross-validation. The results
showed that the decision tree (C5) was the best
predictor with the highest accuracy of 93%, followed
by artificial neural networks with an accuracy of
91.2%, and logistic regression with an accuracy of
89.2%. The study is
based on the assumption that all patients died due to
breast cancer, which may not be the case (Riihimäki,
Thomsen, Brandt, Sundquist, & Hemminki, 2012).
Several spin-offs of this work followed over the
years. Bellaachia and Guven (Bellaachia & Guven,
2006) added the Vital Status Recode (VSR) and Cause
of Death (COD) variables to their study.
A new dependent variable, Survivability, was derived
using Survival Time Recode (STR) and VSR.
Accuracy, precision, and recall were used as
performance measures to evaluate the data mining
techniques. The experiments ranked the Naïve Bayes
technique best, followed by neural networks and the
C4.5 algorithm. One limitation of this study, as
stated by the authors, is the exclusion of records with
missing data (Extent of Disease and Site Specific
Surgery). Endo et al. (Endo, Takeo, & Tanaka, 2008)
compared seven algorithms to predict breast cancer
survival. Among these methods, Logistic Regression
showed the highest accuracy (85%), Decision tree
(J48) showed the highest sensitivity and ANN
displayed the highest specificity. Wang et
al. (Wang, Bunjira, Wu, & Lin, 2013) predicted 5-year
breast cancer patient survivability using two data
mining techniques, logistic regression and decision
trees, and concluded that logistic regression was
comparatively superior. A few studies have focused
on developing models to predict the presence of
cancer in addition to comparing data mining
techniques (Chaurasia & Pal, 2017; Senturk & Kara,
2014).
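The performance measures that recur across these studies (accuracy, sensitivity or recall, specificity and precision) all follow from the counts of a binary confusion matrix, and k-fold cross-validation is the procedure used to estimate them on held-out records. The following is a minimal Python sketch of both ideas, not the procedure of any cited study. The function names, the 0/1 label encoding (1 = positive class) and the toy labels are assumptions made for illustration only.

```python
from typing import Iterator, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn). Assumes labels are encoded as 0/1, with 1 = positive."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:  # actual == 1 and predicted == 0
            fn += 1
    return tp, tn, fp, fn


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the measure is undefined (empty denominator)."""
    return numerator / denominator if denominator else float("nan")


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> dict[str, float]:
    """Compute the four measures from the confusion-matrix counts."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),  # also called recall
        "specificity": _ratio(tn, tn + fp),
        "precision": _ratio(tp, tp + fp),
    }


def k_fold_indices(n_samples: int, k: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_idx, test_idx) for k contiguous folds of near-equal size.

    No shuffling or stratification is done here; a real study would
    normally shuffle (and often stratify) before splitting.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Toy labels (hypothetical): 1 = positive class, 0 = negative class.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(evaluate(actual, predicted))
```

In a cross-validated comparison, a model would be fitted on each `train_idx` split and `evaluate` applied to its predictions on the matching `test_idx` split, with the k results averaged per technique.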
A hybrid scheme based on fuzzy decision trees was
investigated as an alternative approach to breast
cancer prognosis (Khan, Choi, Shin, & Kim, 2008). The
final dataset of 162,500 records with 16 variables and
a binary target variable was used for experimentation.
It was concluded that the hybrid fuzzy decision tree
classification technique (accuracy 85%) is more
powerful and fair than the independently applied
decision tree classification technique (accuracy 82%).
Three
different models for cancer prognosis were examined:
Bayesian Network (BN) model, Artificial Neural
Network (ANN) model and hybrid BN/ANN model
(Choi, Han, & Park, 2009). The SEER public-use
database (SEER, n.d.) for the period 1973-2003 with
294,275 records and 9 input variables was used. For
a threshold of 60 months, the proposed hybrid BN/ANN
model and the ANN model performed better than the
Bayesian network. The results also showed that the
ANN contributed most to the better performance of the
hybrid model.
Ensembles combine prediction outcomes of
individual classification techniques in order to
achieve better accuracy (Alpaydin, 2004). Common
ensemble techniques include bagging, boosting,
voting and stacking (IBM Knowledge Centre, n.d.).
Ensemble modeling techniques combine only
classification techniques, unlike hybrid modeling
techniques, which can combine classification and
clustering, or clustering and association techniques.
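Of the ensemble techniques listed above, voting is the simplest to illustrate: each classifier predicts a label for every record, and the ensemble outputs the label chosen by the majority. The sketch below shows hard (majority) voting in plain Python. The function name, the tie-breaking rule (the smallest label wins) and the three hypothetical prediction lists are assumptions for illustration and are not taken from any of the cited studies.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> list[int]:
    """Combine per-classifier label lists by hard (majority) voting.

    predictions[i][j] is the label classifier i assigns to record j.
    Ties are broken in favour of the smallest label, an arbitrary
    choice made for this sketch.
    """
    if not predictions:
        raise ValueError("need at least one classifier")
    n_records = len(predictions[0])
    if any(len(p) != n_records for p in predictions):
        raise ValueError("all classifiers must predict the same records")

    combined = []
    for votes in zip(*predictions):  # the votes cast for one record
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(min(label for label, c in counts.items() if c == top))
    return combined


# Hypothetical outputs of three classifiers on five records (1 = positive class).
tree_pred = [1, 0, 1, 1, 0]
bayes_pred = [0, 1, 1, 1, 0]
logit_pred = [1, 1, 0, 1, 1]
print(majority_vote([tree_pred, bayes_pred, logit_pred]))
```

Bagging, boosting and stacking differ in how the individual classifiers are trained and how their outputs are weighted, but they share this final step of merging several predictions into one.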
Agrawal et al. (Agrawal, Misra, Narayanan,
Polepeddi, & Choudhary, 2012) used an ensemble of
several data mining algorithms to develop an online
lung cancer outcome calculator. The predictive model
was built with 64 variables, and the online calculator
used 13 of these variables, selected
on the basis of predictive power. Overall, the
ensemble voting classification technique performed
best, with the highest prediction accuracy (91.4%) and
AUC (94%). This was later extended to develop a
Breast Cancer Outcome (BOSOM) calculator
(Meren, 2014) for online survival measurement using
data mining and predictive modeling on the SEER
public-use database (SEER, n.d.) (1973-2010). The
study reported average accuracies of 88.27% for the
calculator (which uses a subset of variables) and
90.71% for the complete dataset.
HEALTHINF 2020 - 13th International Conference on Health Informatics