2.6 Natural Language Independence
This dimension is closely related to the extraction dimension, since it refers to the applicability of the approach to sources in a language different from the one the approach was originally designed to work with.
2.7 Degree of Automation
This dimension is concerned with the degree of manual intervention required for the approach to work and has the following possible values: manual, semi-automatic, or fully automatic.
2.8 Evaluation Method
This dimension classifies the approach used to evaluate the generated ontology. These approaches can be classified into four main categories (Brank et al., 2005):
Comparative methods, where the ontology is compared with a "gold standard".
Proxied methods, where the results of using the ontology in an application (such as document classification) are compared.
Data-based methods, where the "fit" of the ontology to a domain is measured from a source of data (such as a collection of documents).
Human-assessed methods, where the quality of the ontology is evaluated by a group of people against some predefined criteria.
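To make the comparative category concrete, the following minimal Python sketch checks an extracted relation set against a hypothetical gold standard and reports precision and recall. The relation triples are invented for illustration and are not taken from any of the surveyed systems.

# Minimal sketch of a "comparative" evaluation: extracted ontology relations
# are checked against a hypothetical gold-standard set (both sets invented).
def precision_recall(extracted, gold):
    """Return (precision, recall) of the extracted triples w.r.t. the gold standard."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

gold = {("Paris", "capitalOf", "France"),
        ("Berlin", "capitalOf", "Germany"),
        ("Danube", "flowsThrough", "Vienna")}
extracted = {("Paris", "capitalOf", "France"),
             ("Berlin", "locatedIn", "Germany")}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.33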
3 USING OUR FRAMEWORK
A considerable amount of research has been devoted to the extraction of semantic information from Wikipedia using various approaches (a good review can be found in (Medelyan et al., 2009)). Our framework is useful for classifying ontology learning systems that use Wikipedia as their main source of information. Using the framework makes it possible to understand how these systems work and to identify gaps that could be exploited to improve the content of the generated ontologies. This section shows the results of using the framework to classify six systems. Table 1 summarizes the results.
3.1 DBpedia
The DBpedia project focuses on extracting simple semantic information from Wikipedia’s structure and templates in the form of RDF triples (Auer et al., 2008). The DBpedia dataset contains about 103 million RDF triples. Some of these include very specific information (mainly from the data extracted from the infoboxes) and some include metadata (such as the page links between Wikipedia articles). The dataset is available on the group’s web page.
The goal of DBpedia is to create a knowledge repository with general knowledge extracted from Wikipedia. It uses two main sources of information: database dumps and the page templates. The relationships extracted from the database dumps are untyped and only indicate that an article is related somehow to the articles to which it is linked. The templates allow extracting both attributes and several typed relationships, mainly from individuals.
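As a rough illustration of this kind of template-based extraction (a sketch only, not DBpedia’s actual extraction framework), the following Python fragment turns an already-parsed infobox into RDF triples with rdflib. The infobox content and the way attribute values are mapped to resources or literals are assumptions made for the example.

# Sketch of infobox-to-RDF extraction in the spirit of DBpedia, using rdflib.
# The infobox dictionary and the property/resource namespaces are illustrative only.
from rdflib import Graph, Literal, Namespace

DBP = Namespace("http://dbpedia.org/property/")
RES = Namespace("http://dbpedia.org/resource/")

def infobox_to_triples(article_title, infobox):
    g = Graph()
    subject = RES[article_title.replace(" ", "_")]
    for attribute, value in infobox.items():
        predicate = DBP[attribute]
        # Values that look like article titles become resource links (typed
        # relationships); everything else is kept as a plain literal attribute.
        obj = RES[value.replace(" ", "_")] if value.istitle() else Literal(value)
        g.add((subject, predicate, obj))
    return g

g = infobox_to_triples("Innsbruck", {"country": "Austria", "population": "117693"})
print(g.serialize(format="turtle"))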
3.2 Wikipedia Thesaurus
The Wikipedia Laboratory group created an asso-
ciation thesaurus from Wikipedia by using several
techniques that calculate the semantic relatedness be-
tween articles (Ito et al., 2008). The thesaurus is avail-
able on the group’s web page.
An association thesaurus contains concepts and relationships between them, with a numeric value that indicates how close the concepts are semantically. The researchers use several techniques to achieve the same result. One of these, known as pfibf (Path Frequency - Inversed Backward link Frequency), uses Wikipedia’s internal links to derive the relatedness measure. It is similar to the traditional tfidf (Term Frequency - Inverse Document Frequency) method used in data mining, but specifically designed to deal with Wikipedia’s structure. Another method is based on link co-occurrence analysis. In this approach the relatedness measures are calculated based on the co-occurrence of pairs of links in the articles.
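The following minimal Python sketch illustrates the link co-occurrence idea with a Dice coefficient over the sets of articles that link to each target; it is not the authors’ pfibf implementation, and the tiny link table is invented.

# Rough sketch of link co-occurrence relatedness (not the pfibf algorithm):
# two articles are considered related in proportion to how often links to
# both appear in the same article. The link table below is invented.
def dice_relatedness(target_a, target_b, outgoing_links):
    """Dice coefficient over the sets of articles that link to each target."""
    linking_to_a = {art for art, links in outgoing_links.items() if target_a in links}
    linking_to_b = {art for art, links in outgoing_links.items() if target_b in links}
    if not linking_to_a or not linking_to_b:
        return 0.0
    return 2 * len(linking_to_a & linking_to_b) / (len(linking_to_a) + len(linking_to_b))

outgoing_links = {
    "Vienna":   {"Austria", "Danube", "Mozart"},
    "Salzburg": {"Austria", "Mozart"},
    "Budapest": {"Hungary", "Danube"},
}
print(dice_relatedness("Austria", "Mozart", outgoing_links))  # 1.0: they co-occur in both articles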
The resulting thesaurus was assessed by humans
and also compared with a “gold standard” for word
similarity.
3.3 WikiNet
The EML Research group created a large-scale, multi-lingual concept network (Nastase et al., 2010). The
resource consists of language-independent concepts,
relationships between them, and their corresponding
lexical representations in different languages. The
dataset is available on the group’s web page.
The multi-lingual concept network is similar to
WordNet, but derived automatically from Wikipedia.
It is not intended to replace WordNet but rather to
complement it, since WordNet has limited coverage
despite its high quality.
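One way to picture this data model is the small Python sketch below: language-independent concept identifiers carry per-language lexicalizations and typed relations to other concepts. The identifiers, labels, and relation names are invented for illustration and do not reflect WikiNet’s actual format.

# Illustrative data structure for a WikiNet-style multilingual concept network:
# language-independent concept IDs, per-language lexicalizations, and typed
# relations between concepts. All identifiers and labels here are invented.
from dataclasses import dataclass, field

@dataclass
class Concept:
    concept_id: int
    labels: dict[str, list[str]] = field(default_factory=dict)      # language -> lexicalizations
    relations: list[tuple[str, int]] = field(default_factory=list)  # (relation type, target concept ID)

network = {
    1: Concept(1, {"en": ["Danube"], "de": ["Donau"], "hu": ["Duna"]}),
    2: Concept(2, {"en": ["river"], "de": ["Fluss"]}),
}
network[1].relations.append(("is_a", 2))

print(network[1].labels["de"], network[1].relations)  # ['Donau'] [('is_a', 2)]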