Table 3: Comparing the F-measure of the n-Gram LM and SVM (all-gram features) classifiers for the DMOZ dataset. All F1 values are multiplied by 100.
Topic SVM all-gram n-Gram LM/LLO
Adult 87.6% 87.58%
Arts 81.9% 82.03%
Business 82.9% 82.71%
Computers 82.5% 82.79%
Games 86.7% 86.43%
Health 82.4% 82.49%
Home 81% 81.13%
Kids 80% 81.09%
News 80.1% 79.01%
Recreation 79.7% 80.22%
Reference 84.4% 83.37%
Science 80.1% 82.52%
Shopping 83.1% 82.48%
Society 80.2% 81.66%
Sports 84% 85.30%
Average 82.44% 82.72%
There is a trade-off between smaller and larger values of n. Higher values of n imply sparser data and a larger number of n-grams in the testing phase that have not been seen during the training phase. On the other hand, for lower values of n it is harder for the model to capture character dependencies (Peng et al., 2003). The quantity of unseen n-grams in the testing phase also depends on the class distributions and the homogeneity of the class vocabularies: classes with more samples are more likely to cover a larger share of the n-gram vocabulary.
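To make the sparsity argument concrete, the short sketch below extracts character n-grams from a URL at two different orders; the helper name and the example URL are ours, and any lower-casing or boundary handling used in the paper is not reproduced here.

```python
def char_ngrams(url, n):
    """Character n-grams of a URL; higher n yields longer, rarer grams."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]

url = "www.example.com/sports"
print(char_ngrams(url, 2)[:4])  # ['ww', 'ww', 'w.', '.e'] -- short, frequent grams
print(char_ngrams(url, 7)[:2])  # ['www.exa', 'ww.exam'] -- long, much sparser grams
```

The longer grams are far less likely to reappear verbatim in test URLs, which is exactly why higher orders of n suffer more from unseen n-grams.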
In this context, smoothing is needed to estimate the likelihood of unseen n-grams. The value of γ controls the amount of probability mass that is discounted from seen n-grams and re-assigned to the unseen ones; the higher the value of γ, the more probability mass is assigned to unseen n-grams.
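Since equations 4 and 5 are not reproduced in this section, the following is only a minimal sketch of how a γ-smoothed n-gram estimate could be computed, assuming an additive (Lidstone-style) scheme; the function name and the way the counts are aggregated are our own illustration rather than the paper's exact formulation.

```python
from collections import Counter

def smoothed_prob(gram, counts_n, counts_n1, gamma, vocab_size):
    """P(last char | preceding n-1 chars) with additive smoothing.

    counts_n / counts_n1 are Counters of n-grams and (n-1)-grams for one
    class; gamma > 0 shifts probability mass towards unseen n-grams, so a
    larger gamma gives unseen grams a higher estimate.
    """
    num = counts_n.get(gram, 0) + gamma
    den = counts_n1.get(gram[:-1], 0) + gamma * vocab_size
    return num / den
```

Here counts_n and counts_n1 would be built by summing Counter(char_ngrams(url, n)) and Counter(char_ngrams(url, n - 1)) over all training URLs of one class.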
Figure 1 shows the variation of the F-measure with the value of n in the n-gram LM for the different class labels in the WebKB dataset. The macro-average F-measure is also shown in the figure. It is clear that the best results are achieved at n=4.
Similarly, the effect of the smoothing parameter (γ) is shown in Figure 2. The figure also shows that relaxing the model's independence assumption, by using the Log-likelihood Odds model, results in better performance and greater robustness to variations in the smoothing parameter.
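For intuition, a generic log-likelihood odds scorer over character n-grams might look like the sketch below, reusing smoothed_prob from the previous sketch; a URL is assigned to the positive class when the score is positive. This is the standard log-odds form and may differ in detail from the model defined earlier in the paper.

```python
import math

def llo_score(url, pos_counts, neg_counts, n, gamma, vocab_size):
    """Sum of per-n-gram log-likelihood odds for one URL.

    pos_counts and neg_counts are (counts_n, counts_n1) Counter pairs for
    the positive and negative class; a positive total favours the positive
    class.
    """
    score = 0.0
    for i in range(len(url) - n + 1):
        gram = url[i:i + n]
        p_pos = smoothed_prob(gram, *pos_counts, gamma, vocab_size)
        p_neg = smoothed_prob(gram, *neg_counts, gamma, vocab_size)
        score += math.log(p_pos) - math.log(p_neg)
    return score
```

Because both class likelihoods are smoothed with the same γ, the per-gram odds of unseen n-grams tend towards zero, which is one way to read the model's reduced sensitivity to the smoothing parameter.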
When the model encounters a high percentage of n-grams that were never seen during the training phase, the precision of the model suffers. Smoothing, on the other hand, tries to compensate for this effect by moving some of the probability mass to the unseen n-grams. As stated earlier, the amount of probability mass assigned to the unseen n-grams is controlled by the value of γ.
Figure 3 shows the correlation between the precision and the percentage of seen n-grams for the different classes. It is also clear that the correlation gets stronger for lower values of γ: for the models shown, the Pearson correlation coefficients between the precision values and the percentages of seen n-grams are 0.51, 0.65 and 0.74 for γ = 1, 0.1 and 0.01 respectively.
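For reference, such a coefficient can be computed as in the snippet below; the per-class arrays are hypothetical placeholders, not the measurements behind Figure 3.

```python
import numpy as np

# Hypothetical per-class values; the actual measurements are those in Figure 3.
precision = np.array([0.81, 0.84, 0.79, 0.86, 0.80])
seen_ngram_pct = np.array([62.0, 71.0, 58.0, 74.0, 60.0])

r = np.corrcoef(precision, seen_ngram_pct)[0, 1]  # Pearson correlation coefficient
print(f"Pearson r = {r:.2f}")
```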
6 N-GRAM LM SCALABILITY
The storage size needed for the n-gram LM is a function of the number of n-grams and classes we have, while for the all-grams approach used by Baykan et al. (Baykan et al., 2009) the storage requirements are a function of the number of URLs in the training set as well as the different orders of 'n' used in the all-grams. This means that the memory and storage requirements of the n-gram LM can be 100,000 times smaller than those needed by the conventional approaches. During our tests, this reduction was also shown to have a big impact on the classification processing time.
Let us use one of the binary classifiers used for the DMOZ dataset to explain this in more detail. We have about 100,000 URLs in the 'Sports' category; thus, as shown in Baykan et al. (Baykan et al., 2009), we build a balanced training set of positive and negative cases of about 200,000 URLs.
As we have seen in equations 4 and 5, for an n-gram language model we need to store the counts of the n-grams and (n-1)-grams for each class. Since we can achieve slightly better results than Baykan et al. (Baykan et al., 2009) with n=7, we base our calculations here on the 7-gram LM. The numbers of 7-grams in the positive and negative classes are 746,024 and 1,037,419 respectively, while the numbers of 6-grams for the same two classes are 568,162 and 795,192. Thus the total storage needed is the summation of the above four values, i.e. 3,146,797 entries.
For the approach used by Baykan et al. (Baykan et al., 2009), we need to construct a matrix of all features and training-data records. The features in this case are the all-grams, i.e. the 4-, 5-, 6-, 7- and 8-grams, and the training-data records are the 200,000 URLs in the training set. This matrix is later used by a Naive Bayes or SVM classifier. The counts of the 4-, 5-, 6-, 7- and 8-grams are 222,649, 684,432, 1,198,689, 1,628,422 and 2,008,153 respectively. Thus, the total number of features is the summation of the above five values.
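To make the comparison concrete, the sketch below recomputes both totals from the counts reported above; the variable names are ours, while the numbers are exactly those listed in the text.

```python
# Entries stored by the 7-gram LM: 7-gram and 6-gram counts per class.
lm_entries = (746_024 + 1_037_419      # 7-grams, positive / negative class
              + 568_162 + 795_192)     # 6-grams, positive / negative class
print(lm_entries)                      # 3,146,797

# Feature space of the all-gram approach: pooled 4- to 8-grams.
allgram_features = 222_649 + 684_432 + 1_198_689 + 1_628_422 + 2_008_153
print(allgram_features)                # 5,742,345

# The all-gram classifier additionally needs a feature matrix over the
# 200,000 training URLs, which is what drives its much larger footprint.
```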