responsive to new data, leading to diverse output sim-
plex vectors. Combining these features into a single
number that measures capacity is the goal of the LDM
process, a goal at which it is only partially successful.
2 SHORTCOMINGS
As noted in our Results section, the full LDM process
seems to struggle to combine the two aspects
of flexible models in an unambiguous way. Further-
more, we observe some values and trends which dis-
agree with our traditional understanding of the rel-
ative flexibility of various methods. For example,
KNN-1 should be the most prone to overfitting, hav-
ing the greatest flexibility, yet its average entropy
value is lower than that of KNN-10, which should be
far more constrained and thus far less flexible.
The problem could stem from one or more aspects
of the procedure. Perhaps crucial information was lost
as a result of averaging the values of the simplex vectors,
as suggested in the previous section. In addition,
requiring algorithms to output probabilities under
conditional independence of test-instance labelings
allows a method like KNN-10 to place positive
probability mass on many more individual test instances
(since it will likely have a nonzero number of neighbors
with any chosen class label), whereas KNN-1 can only
ever assign positive probability to the label of its
single neighbor. Treating arbitrary simplex
vectors as parameters for a Dirichlet model may also
be problematic, since this modeling assumption was
made for simplicity.
Lastly, the entropy values produced by the LDM
process can be negative, making them difficult to
interpret as positive capacity values and undermining the purpose for
which the LDM was proposed. Negative entropy val-
ues can arise when using differential entropy, as when
estimating the entropy of a continuous Dirichlet dis-
tribution. For the LDM-Dirichlet process to be usable,
one would still need a way of mapping entropy
scores to storage capacity in bits.
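To make the negative-entropy issue concrete, the differential entropy of a Dirichlet distribution is straightforward to compute and is often negative; the following illustration (ours, using scipy, not part of the LDM implementation) shows that even the uniform Dirichlet over three classes has negative differential entropy.

```python
from scipy.stats import dirichlet

# Differential entropy of a Dirichlet can be negative. The uniform
# distribution on the 2-simplex (alpha = [1, 1, 1]) has density 2
# over a region of area 1/2, so its entropy is -log(2) < 0.
h_uniform = float(dirichlet([1.0, 1.0, 1.0]).entropy())
print(h_uniform)  # about -0.693

# More concentrated distributions have even lower (more negative) entropy:
h_peaked = float(dirichlet([50.0, 50.0, 50.0]).entropy())
print(h_peaked)
```

Since a Shannon (discrete) entropy is always nonnegative, no direct reading of these differential-entropy values as a bit count is available.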
3 FUTURE WORK
Given the aforementioned shortcomings of the LDM
process and the continued need for methods of esti-
mating algorithm capacity, other approaches should
continue to be pursued. The question of how to estimate
algorithm capacity is important, and the lack of
a general solution so far does not make it any less so.
One particularly promising idea, inspired by re-
search in deep neural networks, is to use a form of
autoencoder (Doersch, 2016; Olshausen and Field,
1996; Lee et al., 2007; Bengio et al., 2014; Bengio
et al., 2013; Kingma and Welling, 2014) as applied
to training data with labels that are independent of
features. Generalization requires being able to pre-
dict labels given knowledge of the true relationship
between features and labels. For a dataset with no
relationship (in other words, with independence be-
tween features and labels), the only way an algorithm
can reproduce the labels from the training dataset con-
sistently is to memorize them, which it can do only in
proportion to its capacity. Thus, for binary labels, the
number of labels the algorithm correctly retrieves at
test time equals the number of labels it could memorize
(its capacity in bits), plus a small number of luckily
guessed labels (a number that can be bounded
with high probability).
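The luckily-guessed portion admits a simple binomial tail bound. The sketch below (our illustration with hypothetical numbers, not a calculation from the paper) finds a threshold that the number of correct random guesses on unmemorized binary labels exceeds with probability at most 1%.

```python
from scipy.stats import binom

# Suppose n binary labels were not memorized and must be guessed
# at random (success probability 0.5). Hypothetical n:
n = 100

# Smallest threshold t with P(correct guesses > t) <= 0.01:
t = int(binom.ppf(0.99, n, 0.5))
tail = float(binom.sf(t, n, 0.5))  # P(X > t)
print(t, tail)
```

Subtracting such a threshold from the observed number of correctly reproduced labels yields a high-probability lower bound on the memorized portion.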
A label recorder takes randomly generated labels
and independent training features comprising a train-
ing dataset, trains on that dataset, then tests on the
same training set. The number of correctly repro-
duced labels will give a point estimate of the capacity,
subject to random variation. Repeating this process
and taking the average of the observed capacities will
allow one to obtain an increasingly tight estimate of the
true algorithm capacity, arguably with fewer assump-
tions and steps than the LDM process.
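The label-recorder procedure described above can be sketched in a few lines. The following is a minimal illustration (the function name and details are ours, not the paper's implementation), using scikit-learn KNN models and random Gaussian features as a stand-in for the Iris features.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def label_recorder_estimate(make_model, X, n_trials=50, seed=0):
    """Average number of random binary labels a model can reproduce
    when trained and then tested on the same dataset."""
    rng = np.random.default_rng(seed)
    correct = []
    for _ in range(n_trials):
        y = rng.integers(0, 2, size=len(X))  # labels independent of features
        model = make_model().fit(X, y)
        correct.append(int((model.predict(X) == y).sum()))
    return float(np.mean(correct))

# 150 instances with 4 continuous random features:
X = np.random.default_rng(1).normal(size=(150, 4))
cap_knn1 = label_recorder_estimate(lambda: KNeighborsClassifier(n_neighbors=1), X)
cap_knn10 = label_recorder_estimate(lambda: KNeighborsClassifier(n_neighbors=10), X)
print(cap_knn1, cap_knn10)  # KNN-1 memorizes far more labels than KNN-10
```

With continuous features, each training point is its own nearest neighbor, so KNN-1 reproduces every label, while KNN-10's majority vote over mostly-random neighbor labels recovers far fewer, matching the intuition that smaller K means greater flexibility.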
Table 2 shows preliminary label recorder results
for the models tested. Each method was tested on
a set of 150 instances from the Iris dataset (Fisher,
1936), with labels generated independently and uni-
formly at random. The point estimates were the aver-
age number of labels correctly recovered at test time,
averaged over 1000 independent trials for each model.
As the table shows, unpruned Decision Trees and
the Random Forest Classifier have the highest estimated
capacity, while more bias-heavy models such as
Quadratic Discriminant Analysis and Gaussian Naïve
Bayes have less capacity. Furthermore, the estimated
capacities for KNN as a function of the regularization
parameter K decrease with increasing K, aligning better
with our intuition than the LDM-inferred entropies. Thus,
label recorders present a promising avenue for esti-
mating algorithm capacity. Creating label recorders
and using them to provide rigorous bounds on algo-
rithm capacity is the subject of future work, which
we hope will complement (if not supersede) the work
presented here.
The Labeling Distribution Matrix (LDM): A Tool for Estimating Machine Learning Algorithm Capacity