convincingly argued that the sum-to-one constraint will be approximately enforced as long as some low-prediction-error models exist in the ensemble. Note that Dawes (1979) suggested that non-negativity constraints are required: assuming a model's performance is not anti-correlated with the true behaviour, its weight should be non-negative in a linear fit.
In general, by imposing a sum-to-one condition on the weights we ensure that our final model will also be approximately equal to the expectation of the underlying distribution. The non-negativity constraint is self-evident as long as models are not in severe error (anti-correlated), and such models are excluded by construction. Selection based on optimality (e.g., relative predictive strength, as measured by lift here) is likely to have a lesser effect, but can be argued for on adaptive grounds: such weightings adjust to model performance, so, for example, if the selection criteria drift over time, selection of weights via optimization will downweight models that become ineffective and reward models that show improved performance, thereby reducing model misspecification risk. Moreover, as we use lift for various values of N, we adaptively reweight to take relative performance into account.
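As a concrete illustration of such a constrained fit, the sketch below obtains non-negative, sum-to-one ensemble weights by constrained least squares with SciPy. This is a simplified stand-in for the objective actually used in the paper (which is based on lift); the names `fit_ensemble_weights`, `P` (matrix of per-model predictions), and `y` (observed outcomes) are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def fit_ensemble_weights(P, y):
    """Non-negative, sum-to-one ensemble weights via least squares.

    P: (n_samples, n_models) matrix of per-model predictions.
    y: (n_samples,) observed outcomes.
    """
    n_models = P.shape[1]
    w0 = np.full(n_models, 1.0 / n_models)      # start from simple averaging
    res = minimize(
        lambda w: np.sum((P @ w - y) ** 2),     # squared prediction error
        w0,
        method="SLSQP",
        bounds=[(0.0, None)] * n_models,        # non-negativity
        constraints=[{"type": "eq",
                      "fun": lambda w: w.sum() - 1.0}],  # sum-to-one
    )
    return res.x
```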
4.2 Implementation Details
The algorithm was implemented in Python 3.6, and no implementation or computational issues were found. The LR model was implemented in SAS 9.4, again without computational difficulties.
In contrast, implementing the neural network models was somewhat problematic on the Windows laptop used (dual-core 2.9 GHz i7, 8 GB of Random Access Memory (RAM), Windows 7 Enterprise). Python was used with the Scikit-Learn 0.19.2 package (Pedregosa et al., 2011) for training the MLP and the TensorFlow 1.12.0 package (Abadi et al., 2015) for training the DNN.
For the DNN model, due to the large dimension of the training data (760 attributes) and the number of nodes, our limited computational resources led to slow training, and system stability was compromised to the extent that restarts were necessary. ML libraries often target Graphics Processing Units (GPUs) to allow more efficient computation, and the laptop used lacked both GPUs and adequate RAM. Despite the computational load stressing our machine, training was successful, although moving to larger dimensions (more attributes) or larger training set sizes would be difficult.
4.3 Potential Extensions
We briefly raise a few items of interest for extending
the approach taken:
• We do not perform dimension reduction or any other feature engineering, other than the stepwise reduction used for the LR and MLP; such techniques can improve the performance of the underlying models in the ensemble, and additional work in this area could be beneficial.
• There are many variations one can make to our algorithm. For example, instead of taking the intersection in step A1, majority voting can be used to form the agreement set (see the first sketch after this list), and model accuracy can be used to determine the weights in step A2. The crucial aspects are a winnowing of the data, keeping the 'top'-rated postal codes while voting between models to ensure enough high-value data is retained, and the integrated use of budget and error considerations when selecting and using this subset of data to determine the weights. In addition, the algorithm is generic, in that any number of models can be used. If M, the number of models, is large, then the agreement set is expected to be too conservative, and majority voting becomes increasingly attractive. This is particularly true if we want to use an ensemble of weak learners.
• In Canada, postal codes are categorized into urban and rural regions, and splitting the data along these lines may be beneficial. If the number of postal codes in rural regions is small, eliminating them from the original data set may improve model performance; if it is large enough, developing separate urban and rural models is another option.
• A different direction for research is to explore the saturation and frequency of mailings in a fixed period of time. The final selection N can be adjusted to account for these aspects.
• It should be noted that the approach explored here is related to stacked generalization (Wolpert, 1992), the generic idea of using model outputs (here, predicted probabilities of success) as features to construct a meta-model. We are effectively selecting a linear meta-model corresponding to averaging, with weights found by an algorithm that accounts for model error and a finite N, but other meta-models can be used: logistic regression is a reasonable choice, as probabilities will be the output, and neural networks or any other suitable machine learning algorithm can also be considered (see the second sketch below).
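To illustrate the majority-voting variant mentioned in the second bullet, the following sketch generalizes the strict intersection of step A1 to a vote threshold. The names `agreement_set`, `top_codes_per_model`, and `min_votes` are our own illustrative choices, not part of the paper's implementation:

```python
from collections import Counter

def agreement_set(top_codes_per_model, min_votes=None):
    """Postal codes rated 'top' by at least min_votes models.

    top_codes_per_model: list of sets, one per model, each holding
    the postal codes that model places in its top-rated group.
    min_votes defaults to the number of models, which recovers the
    strict intersection used in step A1.
    """
    if min_votes is None:
        min_votes = len(top_codes_per_model)   # strict intersection
    votes = Counter()
    for codes in top_codes_per_model:
        votes.update(codes)
    return {code for code, v in votes.items() if v >= min_votes}

# Majority voting: keep a code if more than half of the M models
# rate it 'top'; attractive when M is large or learners are weak.
# agreement = agreement_set(tops, min_votes=len(tops) // 2 + 1)
```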
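As a minimal sketch of the stacked-generalization alternative raised in the final bullet, a logistic-regression meta-model can be fitted on the base models' predicted probabilities. The data below are stand-ins; `P_train` and `y_train` are assumed names for the base-model output probabilities and the observed outcomes, not quantities from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in data: success probabilities from M = 4 base models
# (columns) for 1000 postal codes, plus binary observed outcomes.
P_train = rng.uniform(size=(1000, 4))
y_train = rng.integers(0, 2, size=1000)

# Logistic-regression meta-model: combines base-model probabilities
# into a single predicted probability of success per postal code.
meta = LogisticRegression().fit(P_train, y_train)
p_ensemble = meta.predict_proba(P_train)[:, 1]
```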