5 APPLICATION SCENARIOS
So far, we have introduced an approach to the representational
and algorithmic issues of exploring wiki category systems.
The implementation of the WikiCEP reflects these
considerations. It supports researchers who need to gather
corpora for their machine learning tasks. In this section,
we outline three such application scenarios:
• Text categorisation is the task of automatically assigning
category labels to a set of input texts (Sebastiani, 2002).
It hinges on the availability of positive and negative
training samples in order to train reliable classifiers.
One way is to split the input corpus into training and test
data and to compensate for its limited size by means of
cross-validation methods (Hastie et al., 2001). We propose
using the WikiCEP as a means to select additional data or to
enlarge the feature space by exploring similarly categorised
articles.
• Lexical chaining is the task of exploring chains of
semantically related words in a text, that is, tracking
semantically related tokens (Budanitsky and Hirst, 2006).
It hinges on the availability of terminological ontologies
like WordNet. We propose using the WikiCEP as a means to
explore the Wikipedia category system as a social
terminological ontology instead, that is, we propose using
Wikipedia as a source for defining the semantic relatedness
and similarity of lexical units.
• In lexicology, corpora are widely used for various
applications. These include, for example, harvesting new
lexical terms, word sense disambiguation and the extraction
of exemplary phrases. Kilgarriff et al. (2005) describe the
development of a corpus to support the creation of an
English-Irish dictionary which, besides print media,
incorporates web documents. Further, Baroni and Bernardini
(2004) propose an approach to incrementally build
specialised corpora from the web based on a set of seed
terms. WikiCEP marks a complementary approach which enables
lexicographers to incorporate Wikipedia articles into their
work.
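The lexical chaining scenario above treats the category system as a source of semantic relatedness. One simple way to make this concrete is an edge-counting measure over the category graph. The following Python sketch is an illustration under stated assumptions, not the WikiCEP implementation: the toy graph PARENTS, the function names and the scoring 1 / (1 + d) are all hypothetical and chosen only to show the idea.

```python
from collections import deque

# Hypothetical toy category graph (not real Wikipedia data):
# each category maps to the list of its supercategories.
PARENTS = {
    "Violin": ["String instruments"],
    "Cello": ["String instruments"],
    "String instruments": ["Musical instruments"],
    "Trumpet": ["Brass instruments"],
    "Brass instruments": ["Musical instruments"],
    "Musical instruments": ["Music"],
    "Music": [],
}


def undirected_neighbours(parents):
    """Build an undirected adjacency map from the child -> parents map."""
    adj = {node: set() for node in parents}
    for child, supers in parents.items():
        for sup in supers:
            adj.setdefault(sup, set()).add(child)
            adj[child].add(sup)
    return adj


def path_length(parents, source, target):
    """Number of edges on a shortest path, or None if no path exists."""
    adj = undirected_neighbours(parents)
    if source not in adj or target not in adj:
        return None
    seen = {source}
    queue = deque([(source, 0)])
    while queue:  # plain breadth-first search
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def relatedness(parents, a, b):
    """Assumed scoring: 1 / (1 + shortest path length); 0.0 if unconnected."""
    d = path_length(parents, a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)
```

In this toy graph, "Violin" and "Cello" share a direct supercategory and therefore score higher than "Violin" and "Trumpet", which are only connected via "Musical instruments". Edge counting is only one of several possible measures; it is used here because it needs nothing beyond the graph itself.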
6 CONCLUSION
This article addressed the potential of the social tagging
that Wikipedia offers for classifying articles, both to
enhance browsing for readers and to support the composition
of domain-specific corpora. We mapped the category system
onto a forest of generalised trees as an enhanced
representation format for graph-like structured ontologies.
This representation nevertheless allows tree-like processing
of the data while retaining full information and overcoming
flaws such as cycles and multiple root categories (by
introducing a virtual root to the kernel structure if
necessary). Sections 3 and 4 showed an exemplary application
of the enhanced representation of the category system, which
addresses the composition of domain-specific corpora and
enhanced browsing. Future work will address the use of more
sophisticated heuristics to build the kernel hierarchical
structure.
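To illustrate how a graph with cycles and several root categories can still be processed in a tree-like manner, the following Python sketch extracts a kernel tree by breadth-first search from a virtual root and keeps all remaining edges separately, so that no information is lost. This is a minimal sketch under stated assumptions: breadth-first search stands in for the heuristic that builds the kernel structure, and the names kernel_tree, VIRTUAL_ROOT and the toy graph are hypothetical.

```python
from collections import deque

VIRTUAL_ROOT = "<virtual root>"  # hypothetical label for the added root


def kernel_tree(subcats):
    """Split a category graph into kernel tree edges and remaining edges.

    subcats maps each category to a list of its subcategories. The graph
    may contain cycles and several root categories. Returns (parent, extra):
    parent maps each category to its parent in the kernel tree, and extra
    lists every (category, subcategory) edge that is not a kernel edge.
    """
    nodes = set(subcats)
    for children in subcats.values():
        nodes.update(children)
    has_parent = {c for children in subcats.values() for c in children}
    roots = sorted(nodes - has_parent)

    parent = {}
    visited = set()
    queue = deque()

    def attach_to_virtual_root(node):
        parent[node] = VIRTUAL_ROOT
        visited.add(node)
        queue.append(node)

    def drain():
        # Breadth-first expansion; the first edge reaching a node wins.
        while queue:
            node = queue.popleft()
            for child in subcats.get(node, []):
                if child not in visited:
                    visited.add(child)
                    parent[child] = node
                    queue.append(child)

    for root in roots:
        attach_to_virtual_root(root)
    drain()
    # Components that consist only of cycles have no root category;
    # pick the alphabetically first unvisited node as an entry point.
    for node in sorted(nodes):
        if node not in visited:
            attach_to_virtual_root(node)
            drain()

    extra = [(p, c) for p, cs in subcats.items() for c in cs
             if parent.get(c) != p]
    return parent, extra


# Toy example with two roots, a shared subcategory and a cycle.
graph = {
    "Science": ["Physics", "Biology"],
    "Arts": ["Music"],
    "Physics": ["Biophysics"],
    "Biology": ["Biophysics"],
    "Biophysics": ["Physics"],
}
parent, extra = kernel_tree(graph)
```

Here "Science" and "Arts" both hang below the virtual root, "Biophysics" is placed under "Physics" in the kernel tree, and the edges from "Biology" to "Biophysics" and from "Biophysics" back to "Physics" are retained as non-kernel edges.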
REFERENCES
Baroni, M. and Bernardini, S. (2004). BootCaT:
Bootstrapping corpora and terms from the web. In Proceedings
of the LREC, Lisbon.
Budanitsky, A. and Hirst, G. (2006). Evaluating
WordNet-based measures of semantic distance. Computational
Linguistics, 32(1):13–47.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The
Elements of Statistical Learning. Data Mining, Inference,
and Prediction. Springer, Berlin/New York.
Holt, R. C., Schürr, A., Elliott Sim, S., and Winter, A.
(2006). GXL: A graph-based standard exchange format for
reengineering. Science of Computer Programming,
60(2):149–170.
Kilgarriff, A., Rundell, M., and Dhonnchadha, E. U. (2005).
Corpus creation for lexicography. In Proceedings of
the Asialex, Singapore, June.
Leuf, B. and Cunningham, W. (2001). The Wiki Way. Quick
Collaboration on the Web. Addison Wesley.
Mehler, A. (2006). Text linkage in the wiki medium – a
comparative study. In Proceedings of the EACL Workshop
on New Text – Wikis and blogs and other dynamic
text sources, Trento, Italy, April 3-7.
Mehler, A. and Gleim, R. (2006). The net for the graphs
– towards webgenre representation for corpus linguistic
studies. In Baroni, M. and Bernardini, S., editors,
WaCky! Working Papers on the Web as Corpus, pages
191–224. Gedit, Bologna.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill,
New York.
Newman, M. E. J. (2003). The structure and function of
complex networks. SIAM Review, 45:167–256.
Sebastiani, F. (2002). Machine learning in automated text
categorization. ACM Computing Surveys, 34(1):1–47.
Shapiro, A. (2002). Touchgraph wikibrowser.
http://www.touchgraph.com/index.html.
Zlatic, V., Bozicevic, M., Stefancic, H., and Domazet, M.
(2006). Wikipedias: Collaborative web-based encyclopedias
as complex networks.
http://www.citebase.org/cgi-bin/citations?id=oai:arXiv.org:physics/0602149.
AISLES THROUGH THE CATEGORY FOREST - Utilising the Wikipedia Category System for Corpus Building in
Machine Learning