Table 1. Basic information on the datasets used in the experiments. (Singletons are words that
occur once; Class n-tons refers to words that occur in exactly n classes.)
                         Job Category   20 Newsgroups   Industry Sector   4 Universities   7 Sectors
Type of documents        job titles &   newsgroup       web pages         web pages        web pages
                         descriptions   messages
Number of documents      131643         19974           9629              4199             4573
Running words            11221K         2549K           1834K             1090K            864K
Average document length  85             128             191               260              189
Vocabulary size          84212          102752          64551             41763            39375
Singletons (Vocab. %)    34.9           36.0            41.4              43.0             41.6
Classes                  65             20              105               4                48
Class 1-tons (Vocab. %)  49.2           61.1            58.7              61.0             58.8
Class 2-tons (Vocab. %)  14.0           12.9            11.6              17.1             11.7
After preprocessing, ten random train-test splits were created from each dataset,
with 20% of the documents held out for testing. Both conventional and conditional
maximum likelihood training of the naive Bayes model were compared on each split,
using a training vocabulary comprising the top D most informative words according
to the information gain criterion [9] (D was varied over 100, 200, 500, 1000, ...,
up to the full training vocabulary size). We used Laplace smoothing with ε = 10^-5
for conventional training [5], and the GIS algorithm without smoothing for
conditional maximum likelihood training through maximum entropy [10]. The results
are shown in Figure 1. Each plotted point in this figure is an error rate averaged
over the corresponding ten data splits. Note that each plot contains one curve for
the conventional training method and two curves for GIS training: one corresponds
to the parameters obtained after the best iteration and the other to the parameters
returned after GIS convergence. This “best iteration” curve may be interpreted as
a (tight) lower bound on the error rate curve we could obtain by early stopping of
GIS to avoid overfitting.
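To make the setup concrete, the following is a minimal Python sketch of the
conventional side of the pipeline: vocabulary selection by information gain and
multinomial naive Bayes training with Laplace (add-ε) smoothing. It assumes a
dense document-by-word count matrix X and integer class labels y; all function
and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def information_gain(X, y, n_classes):
    """Score each word by information gain over the class labels,
    based on word presence/absence in a document (criterion [9])."""
    n_docs = X.shape[0]
    p_c = np.bincount(y, minlength=n_classes) / n_docs
    ig = np.full(X.shape[1], -(p_c * np.log2(p_c + 1e-12)).sum())  # H(C)
    for mask in (X > 0, X == 0):                    # word present / absent
        p_v = mask.mean(axis=0)                     # P(w = v), one per word
        for c in range(n_classes):
            p_cv = mask[y == c].sum(axis=0) / n_docs        # P(c, w = v)
            p_c_given_v = p_cv / np.maximum(p_v, 1e-12)
            # ig = H(C) - H(C|w); add P(v) * P(c|v) * log2 P(c|v)
            ig += p_v * p_c_given_v * np.log2(p_c_given_v + 1e-12)
    return ig

def train_naive_bayes(X, y, n_classes, eps=1e-5):
    """Conventional (joint) maximum-likelihood training of multinomial
    naive Bayes with Laplace (add-eps) smoothing, eps = 1e-5 as in [5]."""
    counts = np.stack([X[y == c].sum(axis=0) for c in range(n_classes)])
    log_prior = np.log(np.bincount(y, minlength=n_classes) / len(y))
    log_cond = np.log((counts + eps) /
                      (counts.sum(axis=1, keepdims=True) + eps * X.shape[1]))
    return log_prior, log_cond

# Keep the top-D words and train on the reduced counts (D was swept from
# 100 up to the full vocabulary in the experiments); a document d is then
# classified as argmax_c [log_prior[c] + d @ log_cond[c]].
# top = np.argsort(information_gain(X, y, n_classes))[::-1][:D]
# log_prior, log_cond = train_naive_bayes(X[:, top], y, n_classes)
```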
From the results in Figure 1, we may say that conditional maximum likelihood
training of the naive Bayes model provides results similar to or better than
those of conventional training. In particular, it is significantly better on the
Job Category and 4 Universities tasks, where it is also worth noting that maximum
entropy does not suffer from overfitting (the best-iteration GIS curve is almost
identical to the curve after GIS convergence). In the 20 Newsgroups, Industry
Sector and 7 Sectors tasks, however, the results are similar. Note that, in these
tasks, the error curve for relative frequencies tends to lie between the two GIS
curves, which are parallel and separated by a non-negligible offset (2% in
20 Newsgroups, and 4% in Industry Sector and 7 Sectors). This is a clear
indication of overfitting, which may be alleviated by early stopping of GIS
(sketched below) and, as done for relative frequencies, by parameter smoothing.
Another interesting conclusion we may draw from Figure 1 is that, with the sole
exception of the 4 Universities task, the best results are obtained at full
vocabulary size; this was previously observed in [5] for relative frequencies.
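The early-stopping remedy is straightforward to emulate: run GIS, evaluate on
held-out data after every iteration, and keep the best-iteration parameters. Below
is a rough sketch under assumed conventions: class-conditional word-count features
f_{w,c}(d, y), with the GIS slack feature folded away for brevity; it is
illustrative, not the authors' implementation.

```python
import numpy as np

def gis_train(Xtr, ytr, Xval, yval, n_classes, n_iters=200):
    """Generalized Iterative Scaling for conditional maximum-likelihood
    (maximum-entropy) training with features f_{w,c}(d, y) = (count of
    word w in d) if y == c, else 0. The GIS slack feature is omitted
    here for brevity -- an assumed simplification."""
    n_docs, vocab = Xtr.shape
    C = Xtr.sum(axis=1).max()           # GIS constant: maximum document length
    lam = np.zeros((n_classes, vocab))
    # Empirical feature expectations: per-class word counts on training data
    emp = np.stack([Xtr[ytr == c].sum(axis=0)
                    for c in range(n_classes)]) / n_docs
    best_lam, best_err = lam.copy(), 1.0
    for _ in range(n_iters):
        # Model expectations under the current parameters
        scores = Xtr @ lam.T
        scores -= scores.max(axis=1, keepdims=True)    # numerical stability
        p = np.exp(scores)
        p /= p.sum(axis=1, keepdims=True)              # P(c | d)
        model = (p.T @ Xtr) / n_docs
        # GIS update: lam_i += (1/C) * log(empirical_i / model_i); the tiny
        # constant guards log(0) for (class, word) pairs unseen in training
        lam += np.log((emp + 1e-12) / (model + 1e-12)) / C
        # Track held-out error: the best iteration emulates early stopping
        err = np.mean(np.argmax(Xval @ lam.T, axis=1) != yval)
        if err < best_err:
            best_err, best_lam = err, lam.copy()
    return best_lam, lam    # best-iteration and after-convergence parameters
```

Returning both parameter sets mirrors the two GIS curves plotted in Figure 1.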
Summarising, the best test-set error rates obtained in the experiments are given
in Table 2. These results match previous results using the same techniques on the five