racy was also achieved using Web page profiles con-
structed of the most frequent 700 byte 5-grams; in
this case the data set had both the HTML tags and
the JavaScript code removed during preprocessing.
5 CONCLUSIONS
This paper reports research on the automatic
classification of Web pages by genre, using a
distance function classification model. The goal
of this phase of the research was threefold. First,
we wanted to investigate the effects of typical data
preprocessing steps when using an n-gram approach
to Web page representation. As discussed in Sec-
tion 4.2.1, we did not find a particular level of prepro-
cessing to be superior on every data set. Our model is
able to achieve very high classification accuracy using
byte n-gram Web page profiles even when no prepro-
cessing of the data set is performed.
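As an illustration of the byte n-gram Web page profiles referred to above, the following is a minimal sketch, not the implementation used in the experiments. The function name `byte_ngram_profile`, its parameters, and the choice to normalize each count by the total number of n-grams in the page are assumptions made for this example; the defaults mirror the 700 most frequent byte 5-grams mentioned in the results.

```python
from collections import Counter


def byte_ngram_profile(data: bytes, n: int = 5, top_l: int = 700) -> dict[bytes, float]:
    """Build a profile of the top_l most frequent byte n-grams in a page.

    Assumption: frequencies are normalized by the total number of
    n-grams in the page. No preprocessing is applied to the raw bytes.
    """
    # Slide a window of n bytes across the raw page content.
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    # Keep only the top_l most frequent n-grams.
    return {gram: count / total for gram, count in counts.most_common(top_l)}


if __name__ == "__main__":
    page = b"<html><body>Hello, hello, hello</body></html>"
    profile = byte_ngram_profile(page, n=5, top_l=10)
    for gram, freq in profile.items():
        print(gram, round(freq, 3))
```

Because the profile is built directly from the raw bytes, the same function applies whether or not HTML tags and JavaScript have been stripped beforehand.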
Second, we wanted to do an initial exploration of
whether the strength of our classification model lies
in the byte n-gram representation of the Web pages
and genres, or in the architecture of the model itself.
We achieved high classification accuracy on each data
set using Web page profiles constructed of the most
frequent L word 1-grams in each Web page, indicat-
ing that the strength of the model comes less from the
byte n-gram approach, and more from the model’s ar-
chitecture. Further investigation is needed.
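The separation between representation and architecture can be sketched as follows. This is a hedged example under stated assumptions: words are obtained by lowercasing and splitting on whitespace, genre profiles are taken as given, and the dissimilarity is the relative-difference measure of Kešelj et al. (2003), used here only as one plausible distance function. All function names are hypothetical.

```python
from collections import Counter


def word_profile(text: str, top_l: int) -> dict[str, float]:
    """Profile of the top_l most frequent word 1-grams.

    Assumption: words are lowercased, whitespace-separated tokens.
    """
    words = text.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.most_common(top_l)}


def profile_distance(p: dict, q: dict) -> float:
    """Relative-difference dissimilarity between two profiles.

    Assumption: the measure of Keselj et al. (2003); a token missing
    from a profile is treated as having frequency 0 in that profile.
    """
    total = 0.0
    for token in p.keys() | q.keys():
        fp = p.get(token, 0.0)
        fq = q.get(token, 0.0)
        # fp + fq > 0 because the token occurs in at least one profile.
        total += ((fp - fq) / ((fp + fq) / 2)) ** 2
    return total


def nearest_genre(page_profile: dict, genre_profiles: dict[str, dict]) -> str:
    """Assign the genre whose profile is closest to the page profile."""
    return min(
        genre_profiles,
        key=lambda genre: profile_distance(page_profile, genre_profiles[genre]),
    )
```

Nothing in `profile_distance` or `nearest_genre` depends on the token type, so byte n-gram profiles and word 1-gram profiles can be passed through the same classification step.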
Finally, we wanted to evaluate our Web page genre
classifier by comparing the results of our experiments
to the results of other researchers, as reported in the
literature, on the same or similar data sets. As
discussed in Section 4.2.3, our classification accuracy
is either in the same range as, or higher than, the
results reported by other researchers who use different
features and different machine learning methods.
These results are very encouraging, and indicate that
our model warrants further investigation.
The major contribution of this research is to show
that our distance function classification model is a vi-
able approach to the classification of Web pages by
genre, and that achieving high classification accuracy
with this model is not dependent on the use of byte
n-grams, on the level of preprocessing of the data, or
on the use of a particular data set. Future work will
include refining the model and expanding the scope of
the work by using more challenging data sets, includ-
ing highly unbalanced and multi-labeled data sets.
REFERENCES
Asirvatham, A. and Ravi, K. (2001). Web page classifica-
tion based on document structure. IEEE Nat. Conv.
Boese, E. and Howe, A. (2005). Effects of web docu-
ment evolution on genre classification. In Proc. 14th
ACM International Conf. on Information and Knowl-
edge Management (CIKM ’05) , pages 632–639.
Cavnar, W. and Trenkle, J. (1994). N-gram-based text cat-
egorization. Proc. 3rd Annual Symposium on Docu-
ment Analysis and Information Retrieval, SDAIR-94.
Crowston, K. and Kwasnik, B. (2004). A framework for
creating a facetted classification for genres: address-
ing issues of multidimensionality. Proc. 37th Annual
Hawaii International Conf. on System Sciences.
Dong, L., Watters, C., Duffy, J., and Shepherd, M. (2008).
An examination of genre attributes for web page clas-
sification. Proc. 41st Annual Hawaii International
Conf. on System Sciences (HICSS-41).
Houvardas, J. and Stamatatos, E. (2006). N-gram feature
selection for authorship identification. Proc. 12th In-
ternational Conf. on Artificial Intelligence: Method-
ology, Systems, Applications, pages 77–86.
Kanaris, I. and Stamatatos, E. (2007). Webpage genre
identification using variable-length character n-grams.
19th IEEE International Conf. on Tools with Artificial
Intelligence (ICTAI 2007), 2:3–10.
Kešelj, V., Peng, F., Cercone, N., and Thomas, T. (2003).
N-gram-based author profiles for authorship attribution.
In Proc. Conf. Pacific Association for Computational
Linguistics, PACLING’03, pages 255–264.
Lim, C., Lee, K., and Kim, G. (2005). Multiple sets of fea-
tures for automatic genre classification of web doc-
uments. Information Processing and Management,
41(5):1263–1276.
Mason, J., Shepherd, M., and Duffy, J. (2009). An n-gram
based approach to automatically identifying web page
genre. Proc. 42nd Annual Hawaii International Conf.
on System Sciences (HICSS-42).
Meyer zu Eissen, S. and Stein, B. (2004). Genre classi-
fication of web pages. Proc. 27th German Conf. on
Artificial Intelligence (KI-2004), Ulm, Germany.
Rehm, G. (2002). Towards automatic web genre identifi-
cation. Proc. 35th Annual Hawaii International Conf.
on System Sciences (HICSS-35), 04:101.
Santini, M. (2007). Automatic identification of genre in web
pages. PhD thesis, University of Brighton, U.K.
Shannon, C. (1948). A mathematical theory of
communication. Bell System Tech. J., 27:379–423, 623–656.
Shepherd, M. and Watters, C. (1998). The evolution of cy-
bergenres. Proc. 31st Annual Hawaii International
Conf. on System Sciences (HICSS-31), 02:97.
Stein, B. and Meyer zu Eissen, S. (2008). Retrieval mod-
els for genre classification. Scandinavian Journal of
Information Systems, 20(1):93–119.
Swales, J. (1990). Genre analysis. Cambridge University
Press, New York.
CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach