LOG: logistic regression
LDA: Linear Discriminant Analysis, a traditional classification method that finds a linear combination of features to separate two or more classes. LDA assumes continuous, normally distributed predictors, but in practice this restriction can be relaxed.
The chosen classifiers are either state-of-the-art (e.g.
XGB and RF) or well-proven classification
algorithms. They are also substantially different in
their theoretical underpinnings and should therefore
yield non-identical prediction errors.
3.3 Computational Details
Enrollment management data was stored in a MongoDB database, as its structure tended to vary the most. Extraction scripts were used to generate flat structures for export to the final data stores. These were combined with the flattened structures from the student information system, which uses an Oracle database, and with the housing information, which is stored in MS SQL Server. The extracted data was ultimately loaded into a MariaDB database, where SQL scripts performed the final ETL steps that generated the unit-of-analysis exports.
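As an illustration, the extraction and flattening step could look something like the sketch below; the collection, table, column, and connection names are hypothetical placeholders, not the actual schema used in the study.

import pandas as pd
from pymongo import MongoClient
from sqlalchemy import create_engine

# Pull nested enrollment-management documents from MongoDB
# (connection string and collection name are placeholders)
client = MongoClient("mongodb://localhost:27017")
docs = list(client["enrollment"]["applications"].find({}, {"_id": 0}))

# Flatten the nested documents into a rectangular structure
flat = pd.json_normalize(docs)

# Load the flattened records into MariaDB, where SQL scripts complete the ETL
engine = create_engine("mysql+pymysql://user:password@localhost/etl_staging")
flat.to_sql("unit_of_analysis", engine, if_exists="replace", index=False)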
The two-stage boosted framework was coded using a combination of Python 3.6 with the scikit-learn and pandas libraries, and SPSS Modeler 18.2 for rapid prototyping, given the number of experiments conducted in this preliminary exploration. We used the Bayesian optimization library scikit-optimize (skopt) for hyperparameter tuning in the first stage, and the rbfopt library (https://rbfopt.readthedocs.io) for hyperparameter optimization of XGBtree and Random Forests in SPSS Modeler in the second stage.
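As a rough illustration, first-stage tuning with scikit-optimize might resemble the following sketch; the synthetic data, search space, and estimator settings are assumptions for illustration rather than the actual configuration used in the experiments.

from skopt import BayesSearchCV
from skopt.space import Real
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder for the Fall feature matrix produced by the ETL described above
X_fall, y_fall = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

search = BayesSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    {"C": Real(1e-3, 1e3, prior="log-uniform")},  # regularization strength
    n_iter=30,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,  # use all available cores, as in the experiments
    random_state=0,
)
search.fit(X_fall, y_fall)
print(search.best_params_, search.best_score_)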
The experiments were run on an Intel Xeon server (2.90 GHz, 8 processors, 64 GB RAM). Parallel processing was built into the system to make use of all available cores during training and tuning.
4 RESULTS AND DISCUSSION
Table 2 displays the assessment of predictive
performance of the two-stage classification
framework for the sixteen experiments described in
section 3.2.
Accuracy and ROC AUC are reported, although ROC AUC is the primary predictive performance metric in this case, given the unbalanced nature of the datasets. Predictive performance in the first stage is slightly higher when using logistic regression rather than XGBtree, but both values (0.66 and 0.64) are rather low, which confirms the challenges researchers face when trying to predict Fall semester freshmen attrition.
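For reference, the two reported metrics can be computed with scikit-learn as in the small sketch below; the synthetic imbalanced dataset and classifier are placeholders, not the study's data or models.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for the freshman retention data
X, y = make_classification(n_samples=2000, weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("Accuracy:", accuracy_score(y_te, clf.predict(X_te)))
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))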
When analyzing the results on Spring predictions, we can verify that the inclusion of the error measure in the list of Spring predictors enhances the predictive performance of the classification models. The improvement was moderate but
consistent. For error measures derived with a first
stage (Fall) using logistic regression, three out of four
classifiers had better predictive performance when
the error measure is included as a predictor. The AUC
value for XGBtree is 0.78, greater than the AUC
value when the error measure is excluded (0.759).
Similarly, the AUC value for Random Forests is
0.802, greater than 0.796. In the case of LDA, the difference in AUC is much more substantial: 0.817 vs. 0.639. For logistic regression, instead, the results are reversed: the AUC is higher when the error measure is excluded (0.816) than when it is included (0.808). When using XGBtree in
the first stage we have similar results: the AUC values
are either higher when including the error measure, or
at least remain the same. The AUC value for XGBtree
is 0.782, greater than the AUC value when the error
measure is excluded (0.766). For Random Forests and
Logistic Regression, the inclusion of the error
measure does not change the AUC value (0.792 and
0.816 respectively). For LDA, we see a considerable
drop in predictive performance, but still, the inclusion
of the error measure improves the AUC value (0.684
vs. 0.639).
Figure 4 depicts the feature importance charts for
each of the sixteen experiments. The error measure
plays a prominent role as a predictor in all but one
scenario, ranking among the five most relevant
predictors (the only exception is the case in which
XGBtree is used for Fall prediction, and logistic
regression for Spring prediction).
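For context, importance rankings such as those in Figure 4 can be obtained directly from fitted tree ensembles; the sketch below uses synthetic data and a hypothetical column name (fall_error_measure) for the first-stage error measure.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder Spring design matrix; in the study the first-stage (Fall)
# error measure would be one of the columns
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
cols = [f"x{i}" for i in range(9)] + ["fall_error_measure"]
X_spring = pd.DataFrame(X, columns=cols)

rf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0).fit(X_spring, y)
ranking = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(ranking.head(5))  # the five most relevant predictors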
These results suggest that the inclusion of the error measure can be beneficial and tends to increase predictive performance. It could certainly be meaningful to consider its inclusion when implementing an ensemble of classifiers: some classifiers could be trained with the error measure included and others without it, and the ensemble would then produce the final prediction, either through voting or through stacking, as sketched below. For details of this approach, see (Lauría et al., 2018).
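A minimal sketch of that idea with scikit-learn follows, assuming a pandas design matrix whose error-measure column is named fall_error_measure; the estimators and column name are illustrative only.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One ensemble member sees the error measure, the other is trained without it
drop_error = ColumnTransformer(
    [("drop_err", "drop", ["fall_error_measure"])], remainder="passthrough"
)

ensemble = VotingClassifier(
    estimators=[
        ("rf_with_error", RandomForestClassifier(n_estimators=500, random_state=0)),
        ("log_without_error", make_pipeline(drop_error, LogisticRegression(max_iter=1000))),
    ],
    voting="soft",  # stacking via StackingClassifier would be the alternative
)
# ensemble.fit(X_spring, y_spring)  # X_spring: DataFrame containing fall_error_measure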
A surprising outcome is that logistic regression outperformed both XGBtree and Random Forests, two state-of-the-art classifiers. This may be due to limited hyperparameter optimization.