[Figure 1: three panels, (a) p = 20, (b) p = 400, (c) p = 1000, each plotting AUC against Lambda.]
Figure 1: Examples of the λ solution path produced by the glmnet algorithm in the setting with the lasso penalty. The dashed (solid) line denotes the AUC estimated from the training (test) data set. The vertical line denotes λ_OPT.
For better readability of the results, we denote this approach LOG+B.
Similarly, we combine the logistic regression and regularized logistic regression models from the elastic net; we denote this approach LOG+EN.
3 RESULTS
We tested the described approaches with simulated and publicly available breast cancer data sets. However, the results with simulated data are not shown due to the conference page limit. The performances of the individual models were evaluated as well, to provide a baseline for comparison with the combined models.
In the R environment, we used the glm function from the ‘base’ package to fit the logistic regression models with clinical data, and the ‘mboost’ and ‘glmnet’ packages to fit the logistic regression models with gene expression data.
The shrinkage factor ν and the number of iterations of the base procedure are the main tuning parameters of boosting. Following the recommendation of (Bühlmann and Hothorn, 2007), we set ν = 0.1, the standard default value in the ‘mboost’ package. The numbers of iterations were estimated with Akaike’s information criterion (AIC) (Akaike, 1974) as a stopping rule. To verify this stopping criterion, we also evaluated performances with fixed numbers of iterations in the range 50-800 and compared them with the AIC-estimated performances (data not shown). The maximal number of iterations was set to m_max = 700, which proved sufficient.
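The componentwise boosting idea behind the ‘mboost’ fits can be sketched as follows. This is a minimal squared-error (L2) variant in Python rather than the logistic-loss variant actually used in the paper, it assumes centered predictors, and the function name componentwise_l2_boost is ours; ν and m_max play the same roles as in the text.

```python
import numpy as np

def componentwise_l2_boost(X, y, nu=0.1, m_max=700):
    """Componentwise L2 boosting with shrinkage nu.

    At each iteration, fit every single (centered) predictor to the
    current residuals by simple least squares, keep the best-fitting
    one, and add a shrunken version of it to the ensemble fit.
    """
    n, p = X.shape
    intercept = y.mean()
    f = np.full(n, intercept)              # start from the offset
    coefs = np.zeros(p)
    for _ in range(m_max):
        r = y - f                          # current residuals
        betas = (X.T @ r) / (X ** 2).sum(axis=0)   # per-predictor slopes
        sse = ((r[:, None] - X * betas) ** 2).sum(axis=0)
        j = int(np.argmin(sse))            # best-fitting component
        coefs[j] += nu * betas[j]
        f += nu * betas[j] * X[:, j]
    return intercept, coefs
```

In practice the loop would be stopped early at the AIC-optimal iteration rather than run to m_max; the cap only bounds the search.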
The choice of the penalty (5) is the main tuning parameter of the elastic net. The algorithm from the ‘glmnet’ package computes a group of solutions (the regularization path) for a decreasing sequence of values of λ (Friedman et al., 2010). We evaluated all solutions on the training and test data sets with different penalties α (1, 0, 0.5) and with different numbers of variables (20, 400, 1000) to inspect performance in different dimensional settings. Based on the results, we chose the lasso penalty (α = 1) for further experiments. Figure 1 depicts examples of this experiment with the lasso penalty. The vertical lines in the subfigures denote the estimated values of λ_OPT, which were estimated via cross-validation (CV) on the training data set. The subfigures were generated from a simulated gene expression data set of moderate power, see (Šilhavá and Smrž, 2010), and depict one Monte Carlo cross-validation (MCCV) iteration (the same for all subfigures). MCCV was applied as the validation strategy.
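The selection of λ_OPT by training-set CV can be sketched as below. This is a Python illustration, not the ‘glmnet’ code: closed-form ridge regression stands in for the lasso solver so that the sketch has no external dependency, and the function names auc and lambda_opt are ours.

```python
import numpy as np

def auc(scores, y):
    """Rank-based AUC (Mann-Whitney): probability that a random
    positive sample is scored above a random negative one."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

def lambda_opt(X, y, lambdas, k=5, seed=0):
    """Pick the lambda maximizing mean CV AUC along a decreasing path."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % k     # roughly balanced folds
    p = X.shape[1]
    mean_auc = []
    for lam in lambdas:
        fold_auc = []
        for f in range(k):
            tr, te = folds != f, folds == f
            # ridge solution as a stand-in for the lasso fit at lam
            beta = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(p),
                                   X[tr].T @ y[tr])
            fold_auc.append(auc(X[te] @ beta, y[te]))
        mean_auc.append(np.mean(fold_auc))
    return lambdas[int(np.argmax(mean_auc))], mean_auc
```

The actual experiments evaluate the whole path returned by glmnet on both the training and test sets, as in Figure 1, rather than refitting per λ.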
MCCV generates learning sets by drawing samples from {1, ··· , n} randomly and without replacement; the test data set consists of the remaining samples. The random splitting into non-overlapping learning and test data sets was repeated 100 times, with a training-to-test splitting ratio of 4 : 1. Responses consisting of predicted class probabilities were evaluated with the area under the ROC curve (AUC).
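The MCCV procedure just described can be sketched as follows; this is a generic Python sketch in which the fit and predict callbacks are hypothetical stand-ins for whichever model (clinical, expression-based, or combined) is being validated.

```python
import numpy as np

def mccv_auc(X, y, fit, predict, n_splits=100, ratio=0.8, seed=0):
    """Monte Carlo CV: repeated random train/test splits without
    replacement (ratio=0.8 gives the 4:1 split used in the text);
    returns the test AUC of each repetition."""
    rng = np.random.default_rng(seed)
    n = len(y)
    aucs = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        cut = int(ratio * n)
        tr, te = perm[:cut], perm[cut:]
        model = fit(X[tr], y[tr])
        s = predict(model, X[te])
        n1 = y[te].sum()
        n0 = len(te) - n1
        if n1 == 0 or n0 == 0:
            continue                       # skip degenerate splits
        # rank-based AUC of the predicted scores on the test samples
        order = np.argsort(s)
        ranks = np.empty(len(s))
        ranks[order] = np.arange(1, len(s) + 1)
        aucs.append((ranks[y[te] == 1].sum() - n1 * (n1 + 1) / 2)
                    / (n0 * n1))
    return np.array(aucs)
```

Averaging the returned vector (and reporting its spread) gives the MCCV performance estimate used throughout the results.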
We tested the described approaches in different settings to simulate data sets of various quality. We considered redundant and non-redundant settings of the data and different predictive powers of the gene expression and clinical data. In the simulations, the combined models made more accurate predictions than, or at least matched, the better of the individual models.
We also evaluated the described approaches with two publicly available breast cancer data sets. The van’t Veer data set (van’t Veer et al., 2002) includes breast cancer patients after curative resection. cDNA Agilent microarray technology was used to obtain the expression levels of 22483 genes for 78 breast cancer patients. The 44 patients classified into the good prognosis group did not suffer from a recurrence during the first 5 years after resection; the remaining 34 patients belong to the poor prognosis group. The data set was prepared as described in (van’t Veer et al., 2002) and is included in the R package ‘DEN-
ICAART 2012 - International Conference on Agents and Artificial Intelligence