CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach

Jane E. Mason, Michael Shepherd, Jack Duffy

2009

Abstract

The research reported in this paper is part of a larger project on the automatic classification of Web pages by their genres, using a distance function classification model. In this paper, we investigate the effect of several commonly used data preprocessing steps, explore the use of byte and word n-grams, and test our classification model on three Web page data sets. Our approach is to represent each Web page by a profile that is composed of fixed-length n-grams and their normalized frequencies within the document. Similarly, each of the genres in a data set is represented by a profile that is constructed by combining the n-gram profiles for each exemplar Web page of that genre, forming a centroid profile for each Web page genre. We use a distance function approach to determine the similarity between two profiles, assigning each Web page the label of the genre profile to which its profile is most similar. Our results compare very favorably to those of other researchers.
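The profile-and-centroid scheme described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the default n-gram length of 3, the averaging used to build a centroid, and the relative-difference distance are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter, defaultdict


def ngram_profile(data: bytes, n: int = 3) -> dict:
    """Map each fixed-length byte n-gram in `data` to its normalized frequency."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def centroid_profile(profiles: list) -> dict:
    """Combine exemplar profiles into one genre profile by averaging
    each n-gram's frequency over the exemplars (an assumed combination rule)."""
    sums = defaultdict(float)
    for profile in profiles:
        for gram, freq in profile.items():
            sums[gram] += freq
    return {gram: s / len(profiles) for gram, s in sums.items()}


def distance(p: dict, q: dict) -> float:
    """Assumed dissimilarity: squared relative difference summed over the
    union of n-grams. It is 0 for identical profiles and grows as they diverge."""
    total = 0.0
    for gram in set(p) | set(q):
        fp, fq = p.get(gram, 0.0), q.get(gram, 0.0)
        # The denominator is never zero: gram occurs in at least one profile.
        total += ((fp - fq) / ((fp + fq) / 2)) ** 2
    return total


def classify(page: bytes, genre_profiles: dict, n: int = 3) -> str:
    """Label the page with the genre whose centroid profile is closest."""
    page_profile = ngram_profile(page, n)
    return min(genre_profiles,
               key=lambda genre: distance(page_profile, genre_profiles[genre]))
```

In use, each genre's exemplar pages would be passed through `ngram_profile`, combined with `centroid_profile`, and stored in a dictionary keyed by genre label; `classify` then returns the label of the nearest centroid. Word n-grams could be handled the same way by tokenizing the page and sliding the window over a list of tokens instead of bytes.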



Paper Citation


in Harvard Style

Mason J. E., Shepherd M. and Duffy J. (2009). CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 646-653. DOI: 10.5220/0001837706460653


in Bibtex Style

@conference{webist09,
author={Jane E. Mason and Michael Shepherd and Jack Duffy},
title={CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={646-653},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001837706460653},
isbn={978-989-8111-81-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - CLASSIFYING WEB PAGES BY GENRE - A Distance Function Approach
SN - 978-989-8111-81-4
AU - Mason J. E.
AU - Shepherd M.
AU - Duffy J.
PY - 2009
SP - 646
EP - 653
DO - 10.5220/0001837706460653