AISLES THROUGH THE CATEGORY FOREST - Utilising the Wikipedia Category System for Corpus Building in Machine Learning

Rüdiger Gleim, Alexander Mehler, Matthias Dehmer, Olga Pustylnikov

Abstract

The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods developed in order to tackle the problem of finding and organising relevant information. It has often been argued that semantic classification of input documents helps solve this task. But while approaches to supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to successfully develop approaches to web mining, suitable corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. It is time-consuming to build large corpora of good quality because web pages typically lack reliable meta information. Wikipedia, along with similar approaches to collaborative text production, offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source for corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views to browse through Wikipedia and to construct domain-specific corpora for machine learning.


Paper Citation


in Harvard Style

Gleim R., Mehler A., Dehmer M. and Pustylnikov O. (2007). AISLES THROUGH THE CATEGORY FOREST - Utilising the Wikipedia Category System for Corpus Building in Machine Learning. In Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-972-8865-78-8, pages 142-149. DOI: 10.5220/0001267101420149


in Bibtex Style

@conference{webist07,
author={Rüdiger Gleim and Alexander Mehler and Matthias Dehmer and Olga Pustylnikov},
title={AISLES THROUGH THE CATEGORY FOREST - Utilising the Wikipedia Category System for Corpus Building in Machine Learning},
booktitle={Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2007},
pages={142-149},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001267101420149},
isbn={978-972-8865-78-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Third International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - AISLES THROUGH THE CATEGORY FOREST - Utilising the Wikipedia Category System for Corpus Building in Machine Learning
SN - 978-972-8865-78-8
AU - Gleim R.
AU - Mehler A.
AU - Dehmer M.
AU - Pustylnikov O.
PY - 2007
SP - 142
EP - 149
DO - 10.5220/0001267101420149