A TWO-LEVEL APPROACH TO WEB GENRE CLASSIFICATION

Ulli Waltinger, Alexander Mehler, Armin Wegner

Abstract

This paper presents a two-level approach to the categorization of web pages. In contrast to related approaches, the model additionally explores and categorizes functionally and thematically demarcated segments of the hypertext types to be categorized. By classifying these segments, conclusions can be drawn about the type of the corresponding compound web document.
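The two-level idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the segment classifier, the segment types, or how segment labels are combined. The sketch assumes a keyword-overlap classifier at the segment level and a majority vote at the document level. All names in it (`SEGMENT_CUES`, `SEGMENT_TO_GENRE`, `classify_segment`, `classify_document`) and all labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical cue words per segment type (assumption, not from the paper).
SEGMENT_CUES = {
    "contact": {"email", "phone", "address"},
    "publications": {"proceedings", "journal", "pages"},
    "products": {"price", "cart", "shipping"},
}

# Hypothetical mapping from a segment type to the web genre it votes for.
SEGMENT_TO_GENRE = {
    "contact": "personal_homepage",
    "publications": "personal_homepage",
    "products": "online_shop",
}


def classify_segment(text):
    """Level 1: label a segment by the cue set sharing the most words with it."""
    words = set(text.lower().split())
    scores = {label: len(words & cues) for label, cues in SEGMENT_CUES.items()}
    best = max(scores, key=scores.get)
    # Return None when no cue word matches at all.
    return best if scores[best] > 0 else None


def classify_document(segments):
    """Level 2: majority vote over the genres suggested by the segment labels."""
    votes = Counter()
    for segment in segments:
        label = classify_segment(segment)
        if label is not None:
            votes[SEGMENT_TO_GENRE[label]] += 1
    # Return None when no segment could be labeled.
    return votes.most_common(1)[0][0] if votes else None
```

Calling `classify_document` with a list of segment texts returns the genre that most labeled segments vote for, or `None` if no segment matches any cue. A trained classifier could replace the keyword overlap without changing the two-level structure.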



Paper Citation


in Harvard Style

Waltinger U., Mehler A. and Wegner A. (2009). A TWO-LEVEL APPROACH TO WEB GENRE CLASSIFICATION. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 680-683. DOI: 10.5220/0001834806800683


in Bibtex Style

@conference{webist09,
author={Ulli Waltinger and Alexander Mehler and Armin Wegner},
title={A TWO-LEVEL APPROACH TO WEB GENRE CLASSIFICATION},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={680-683},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001834806800683},
isbn={978-989-8111-81-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - A TWO-LEVEL APPROACH TO WEB GENRE CLASSIFICATION
SN - 978-989-8111-81-4
AU - Waltinger U.
AU - Mehler A.
AU - Wegner A.
PY - 2009
SP - 680
EP - 683
DO - 10.5220/0001834806800683