ONTOLOGIES DERIVED FROM WIKIPEDIA
A Framework for Comparison
Alejandro Metke-Jimenez, Kerry Raymond and Ian MacColl
Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia
Keywords:
Ontology, Wikipedia.
Abstract:
Since its debut in 2001, Wikipedia has attracted the attention of many researchers in different fields. In recent years, researchers in the area of ontology learning have realised the huge potential of Wikipedia as a source
of semi-structured knowledge and several systems have used it as their main source of knowledge. However,
the techniques used to extract semantic information vary greatly, as do the resulting ontologies. This paper
introduces a framework to compare ontology learning systems that use Wikipedia as their main source of
knowledge. Six prominent systems are compared and contrasted using the framework.
1 INTRODUCTION
Ontologies play a central role not only in the Seman-
tic Web vision but also in other fields such as natu-
ral language processing and document classification.
However, there have been several criticisms of the
traditional, engineering-oriented approach to ontol-
ogy building and, therefore, considerable research has
been done in the area of automatic ontology learning
(Hepp, 2007).
Many researchers have started using Wikipedia
as the main source for ontology learning systems.
Wikipedia is an interesting resource because it is built
by the community so its overall structure has not been
imposed but, rather, reflects the consensus reached
by its users. Several approaches have been used to
extract semantic information from Wikipedia and the
type and quality of the ontologies that have been gen-
erated varies greatly. This paper addresses the re-
search question of how to compare these approaches.
Section 2 introduces the framework and explains its dimensions in detail. Section 3 describes the six systems selected for this analysis, all of which use Wikipedia as their main source of information, and shows how they can be classified using the framework. Section 4 discusses future work. Finally, Section 5 summarises the paper's research contributions.
2 A FRAMEWORK FOR
COMPARISON
Other frameworks to compare ontology learning systems have been developed in the past (Shamsfard and Abdollahzadeh Barforoush, 2003; Zhou, 2007). However, their dimensions fail to capture key aspects that must be considered when dealing with a resource such as Wikipedia. The purpose of the proposed framework is to make it easy to compare current approaches that rely specifically on Wikipedia. We propose a framework with the following eight dimensions:
1. Type of ontology being generated
2. Wikipedia features used
3. Derived ontology elements
4. Additional sources used
5. Extraction mechanism
6. Natural language independence
7. Degree of automation
8. Evaluation method
Each dimension is explained in detail in the rest of
this section.
2.1 Type of Ontology being Generated
The term ontology comes from the field of philoso-
phy but it has been borrowed by the computer sci-
ence community. Depending on the field, the exact
meaning of ontology can be slightly different. Many researchers use the term loosely, usually citing Gruber's definition, which states that an ontology is “a specification of a set of shared concepts and their relationships in some domain” (Gruber, 1993). We believe that, in the context of the proposed framework, a much more specific definition is necessary. We propose classifying ontologies by their goal, since we believe this is what defines their characteristics. The following is the classification used in the framework; a brief illustrative sketch follows the definitions.
An Association Ontology is the most basic type of
ontology and its goal is to provide measures of
semantic relatedness between its resources (con-
cepts, individuals, or both). Its relationships have
no type but typically include a numeric value to
indicate a degree of relatedness between the con-
nected resources.
Lexical Ontologies focus on including several lexical representations of concepts and individuals, as well as is-a relationships between them. These ontologies are used to assist natural language processing tasks, where the word-sense disambiguation problem is a major challenge.
Classification Ontologies are used to classify re-
sources. These ontologies include is-a relation-
ships and a large set of concepts and individuals.
When a concept in an ontology is used to classify
some resource, the user is indicating that the re-
source is about that concept.
Representational Ontologies are used to provide an
abstract representation of some reality. These
are usually fine-grained and specific to some do-
main. Relationships are usually more complex
than the ones found in classification ontologies,
since modelling a domain often requires express-
ing additional conditions and restrictions that go
beyond the simple taxonomic ones. However,
these ontologies do not require complex reason-
ing since it is not the intention of the modellers to
be able to derive new knowledge from the ontol-
ogy.
Knowledge Repositories are used to answer seman-
tically rich queries. These ontologies enable auto-
mated reasoning, comparable to a human expert,
and therefore require different types of relation-
ships, axioms, and complex reasoning. In prac-
tice, complex reasoning is very expensive compu-
tationally so these ontologies are restricted either
in size or in complexity.
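To make the distinctions concrete, the following is a minimal Python sketch of the relation shapes that characterise these ontology types. The data structures and facts are invented for illustration only and are not taken from any of the systems discussed later.

```python
# Illustrative data structures only; the facts below are invented and the
# shapes are simplified to highlight the differences between ontology types.

# Association ontology: untyped relations carrying a relatedness score.
association = [("Jaguar", "Panthera", 0.83), ("Jaguar", "Land Rover", 0.41)]

# Lexical ontology: alternative lexicalisations plus is-a relationships.
lexical = {
    "concept:big_cat": {"labels": ["big cat", "large feline"],
                        "is_a": ["concept:felid"]},
}

# Classification ontology: is-a and instance-of relationships over a large
# set of concepts and individuals.
classification = [("Jaguar", "instance-of", "Felid"),
                  ("Felid", "is-a", "Mammal")]

# Representational ontology / knowledge repository: richer typed relations
# (and, for repositories, axioms that support reasoning).
representational = [("Jaguar", "native-to", "South America")]
```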
2.2 Wikipedia Features Used
This dimension is concerned with the Wikipedia fea-
tures being used as sources of semantic information.
A good review of Wikipedia’s most important fea-
tures can be found in (Medelyan et al., 2009).
2.3 Derived Ontology Elements
Ontologies are composed of different elements and
also have different levels of expressiveness. Every ap-
proach deals at least with the extraction of concepts
and relationships between them. For example, a sim-
ple way of deriving the concepts of an ontology is by
creating one for each article in Wikipedia.
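As a simple illustration of this strategy, the following Python sketch maps hypothetical article titles to concept identifiers; the naming scheme is an assumption made for illustration only.

```python
# A sketch of the simplest concept-derivation strategy: one concept per
# Wikipedia article. The titles and naming scheme are hypothetical.
def derive_concepts(article_titles):
    """Map each article title to a concept identifier."""
    return {title: "concept:" + title.replace(" ", "_")
            for title in article_titles}

print(derive_concepts(["Semantic Web", "Ontology (information science)"]))
```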
2.4 Additional Sources Used
Even though Wikipedia is a good source of semantic
information, many research projects have chosen to
use additional sources to improve the quality of the
generated ontology.
2.5 Extraction Mechanism
This dimension is concerned with the actual mecha-
nism used to extract the information from Wikipedia
(and possibly other sources) and turn it into an on-
tology. Several different techniques have been used
for this purpose and can be classified in the following
major categories:
Linguistic Analysis relies on natural language processing and typically does not make use of Wikipedia's special features (articles are treated just like regular text).
Rule-based Methods define rules based on common
patterns observed in Wikipedia.
Statistical Methods are based on statistical information (such as word co-occurrence).
Machine Learning Approaches use either super-
vised or unsupervised machine learning tech-
niques to extract or derive new semantic informa-
tion.
Connectivity-based Algorithms make use of the internal links in Wikipedia in order to treat it as a graph in which the articles represent the nodes and the links represent the edges (see the sketch after this list).
Transformation-based Methods use structured or semi-structured information and define rules to transform it into elements of the target ontology.
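The following Python sketch illustrates the connectivity-based view, under the stated assumption that articles are nodes and internal links are directed edges; the link data is hypothetical.

```python
# A sketch of the connectivity-based view: articles are nodes, internal
# links are directed edges. The link pairs below are hypothetical.
from collections import defaultdict

links = [("Brisbane", "Australia"), ("Brisbane", "Queensland"),
         ("Queensland", "Australia")]

graph = defaultdict(set)
for source, target in links:
    graph[source].add(target)  # source article links to target article

# Simple graph statistics (here, out-degree) can feed relatedness measures.
print({article: len(targets) for article, targets in graph.items()})
```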
2.6 Natural Language Independence
This dimension is closely related to the extraction dimension, since it refers to the applicability of the approach to sources in a language different from the one the approach was originally designed to work with.
2.7 Degree of Automation
This dimension is concerned with the degree of man-
ual intervention required for the approach to work
and has the following possible values: manual, semi-
automatic, or fully automatic.
2.8 Evaluation Method
This dimension classifies the approach used to evalu-
ate the generated ontology. These approaches can be
classified in four main categories (Brank et al., 2005):
Comparative methods, where the ontology is compared with a “gold standard” (see the sketch after this list).
Proxied methods, where the ontology is evaluated indirectly through the results of using it in an application (such as document classification).
Data-based methods, where the “fit” of the ontology to a domain is measured from a source of data (such as a collection of documents).
Human-assessed methods, where the quality of the
ontology is evaluated by a group of people against
some predefined criteria.
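As an illustration of the comparative approach, the following Python sketch scores a set of derived is-a relations against a hypothetical gold standard using precision and recall; both sets are invented for illustration.

```python
# A sketch of comparative evaluation: derived is-a relations are scored
# against a hand-built gold standard. Both sets are hypothetical.
derived = {("Cat", "Mammal"), ("Cat", "Animal"), ("Car", "Mammal")}
gold = {("Cat", "Mammal"), ("Cat", "Animal"), ("Dog", "Mammal")}

correct = derived & gold
precision = len(correct) / len(derived)  # fraction of derived facts correct
recall = len(correct) / len(gold)        # fraction of gold facts recovered
print(f"precision={precision:.2f} recall={recall:.2f}")
```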
3 USING OUR FRAMEWORK
A considerable amount of research has been de-
voted to the extraction of semantic information from
Wikipedia using various approaches (a good review
can be found in (Medelyan et al., 2009)). Our frame-
work is useful to classify ontology learning systems
that use Wikipedia as their main source of informa-
tion. Using the framework makes it possible to understand how these systems work and to identify gaps that might be exploited to improve the content of the generated ontologies. This section shows the results of using the
framework to classify six systems. Table 1 shows a
summary of the results.
3.1 DBpedia
The DBpedia project focuses on extracting simple se-
mantic information from Wikipedia’s structure and
templates in the form of RDF triples (Auer et al.,
2008). The DBpedia dataset contains about 103 mil-
lion RDF triples. Some of these include very specific
information (mainly from the data extracted from the
infoboxes) and some include metadata (such as the
page links between Wikipedia articles). The dataset
is available on the group’s web page.
The goal of DBpedia is to create a knowledge
repository with general knowledge extracted from
Wikipedia. It uses two main sources of information:
database dumps and the page templates. The relation-
ships extracted from the database dumps are untyped
and only indicate that an article is related somehow to
the articles to which it is linked. The templates allow extracting both attributes and several typed relationships, mainly for individuals.
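The following is a minimal sketch of the transformation-based idea behind infobox extraction, not DBpedia's actual extraction code; the infobox dictionary and the example.org URI scheme are assumptions made for illustration.

```python
# A sketch of transformation-based infobox extraction, not DBpedia's code.
# The infobox dict and the example.org URI scheme are assumptions.
def infobox_to_triples(article, infobox):
    """Turn infobox attribute-value pairs into RDF-style triples."""
    subject = "http://example.org/resource/" + article.replace(" ", "_")
    for attribute, value in infobox.items():
        yield (subject, "http://example.org/property/" + attribute, value)

for triple in infobox_to_triples("Brisbane",
                                 {"country": "Australia",
                                  "population": "2,000,000"}):
    print(triple)
```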
3.2 Wikipedia Thesaurus
The Wikipedia Laboratory group created an asso-
ciation thesaurus from Wikipedia by using several
techniques that calculate the semantic relatedness be-
tween articles (Ito et al., 2008). The thesaurus is avail-
able on the group's web page.
An association thesaurus contains concepts and
relationships between them, with a numeric value that
indicates how close the concepts are semantically.
The researchers use several techniques to compute this measure. One of these, known as pfibf (Path Frequency - Inversed Backward link Frequency), uses Wikipedia's internal links to derive the relatedness measure. It is similar to the traditional tf-idf (Term Frequency - Inverse Document Frequency) method used in data mining, but is specifically designed to deal with Wikipedia's structure. Another method is based on link co-occurrence analysis. In this approach, the relatedness measures are calculated based on the co-occurrence of pairs of links in the articles.
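A minimal sketch of link co-occurrence counting in this spirit (not the authors' pfibf implementation, and with hypothetical link data) might look as follows.

```python
# A sketch of link co-occurrence counting, not the authors' implementation.
# article_links maps each (hypothetical) article to its outgoing links.
from collections import Counter
from itertools import combinations

article_links = {
    "Article A": {"Brisbane", "Queensland"},
    "Article B": {"Brisbane", "Queensland", "Australia"},
    "Article C": {"Brisbane", "Australia"},
}

cooccurrence = Counter()
for links in article_links.values():
    for pair in combinations(sorted(links), 2):
        cooccurrence[pair] += 1  # the two link targets co-occur once more

# Frequently co-occurring targets are taken to be semantically related.
print(cooccurrence.most_common(3))
```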
The resulting thesaurus was assessed by humans
and also compared with a “gold standard” for word
similarity.
3.3 WikiNet
The EML Research group created a large-scale, multilingual concept network (Nastase et al., 2010). The
resource consists of language-independent concepts,
relationships between them, and their corresponding
lexical representations in different languages. The
dataset is available on the group’s web page.
The multilingual concept network is similar to
WordNet, but derived automatically from Wikipedia.
It is not intended to replace WordNet but rather to
complement it, since WordNet has limited coverage
despite its high quality.
Table 1: Results of applying the framework to the six selected systems.

DBpedia. Type of ontology: knowledge repository. Derived elements: concepts, individuals, attributes, untyped relations, other relations. Wikipedia features: articles, internal links, templates. Additional sources: none. Extraction mechanism: transformation-based. Language independence: partial. Degree of automation: automatic. Evaluation method: not specified.

Wikipedia Thesaurus. Type of ontology: association. Derived elements: concepts, numeric relations. Wikipedia features: articles, article text, internal links. Additional sources: none. Extraction mechanism: statistical methods, connectivity-based. Language independence: yes. Degree of automation: automatic. Evaluation method: human assessed, comparative.

WikiNet. Type of ontology: lexical. Derived elements: concepts, individuals, is-a relations, instance-of relations, other relations. Wikipedia features: articles, categories, internal links, cross-language links, disambiguation pages. Additional sources: none. Extraction mechanism: linguistic analysis, transformation-based, statistical methods. Language independence: yes. Degree of automation: automatic. Evaluation method: human assessed, comparative.

YAGO. Type of ontology: knowledge repository. Derived elements: concepts, attributes, is-a relations, instance-of relations, other relations. Wikipedia features: infoboxes, categories, redirects. Additional sources: WordNet. Extraction mechanism: rule-based, linguistic analysis. Language independence: yes. Degree of automation: automatic. Evaluation method: human assessed.

WikiOnto. Type of ontology: representational. Derived elements: concepts, individuals, is-a relations. Wikipedia features: articles, internal links, sections. Additional sources: WordNet. Extraction mechanism: transformation-based, machine learning. Language independence: partial. Degree of automation: assisted. Evaluation method: not specified.

WikiTaxonomy. Type of ontology: classification. Derived elements: concepts, individuals, is-a relations, instance-of relations. Wikipedia features: categories, internal links, article text. Additional sources: none. Extraction mechanism: linguistic analysis, connectivity-based, rule-based. Language independence: partial. Degree of automation: automatic. Evaluation method: comparative.
Several Wikipedia features are used to derive the
content of the network. Articles and categories are
used to derive concepts. To derive relations, syntactic analysis is performed on category names, and the extracted relations are propagated to the categories' articles. Information found in infoboxes is also used
to further type these relationships. Finally, cross-
language links are used to create an index that in-
cludes the different language representations of each
concept.
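A minimal sketch of category-name analysis in this spirit (not WikiNet's implementation) is shown below; taking the last word as the lexical head and stripping a plural "s" are crude assumptions made for illustration.

```python
# A sketch of syntactic analysis over category names, not WikiNet's code.
# Taking the last word as the lexical head and stripping a plural "s" are
# crude assumptions made for illustration.
def head_of(category_name):
    head = category_name.split()[-1]
    return head[:-1] if head.endswith("s") else head  # naive singularisation

def propagate(category_name, member_articles):
    """Propagate the relation suggested by the category to its articles."""
    head = head_of(category_name)
    return [(article, "is-a", head) for article in member_articles]

print(propagate("Australian rock singers", ["Johnny O'Keefe"]))
```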
The resulting concept network was evaluated by
comparing it with Cyc and YAGO. The comparison
was performed both manually and automatically.
3.4 YAGO
YAGO (Yet Another Great Ontology) is a project that
aims to create a huge general-purpose ontology that
includes both concepts and named entities (Suchanek
et al., 2008). The ontology can be accessed through
several front ends and can also be downloaded from
the project’s web page.
The authors use different heuristics to extract at-
tributes and relationships from the infoboxes of the
articles, deduce types based on Wikipedia’s category
network, and extract is-a relations. Also, Word-
Net is used as the source for an organised taxonomy
in which the concepts extracted from Wikipedia are
placed. WordNet synsets are used to derive the meaning of Wikipedia concepts, Wikipedia redirects are used to find alternative names for entities, and heuristics based on the Wikipedia categories are applied to extract additional information.
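A minimal sketch of anchoring a category head in WordNet, in the spirit of these heuristics (not YAGO's code), might use NLTK's WordNet interface; the category name and the choice of the first sense are simplifying assumptions.

```python
# A sketch of anchoring a category head in WordNet, not YAGO's code.
# Requires NLTK with the WordNet corpus downloaded; taking the last word
# as the head and picking the first sense are simplifying assumptions.
from nltk.corpus import wordnet as wn

def anchor_category(category_name):
    head = category_name.split()[-1]         # e.g. "singers"
    synsets = wn.synsets(head, pos=wn.NOUN)  # NLTK lemmatises "singers"
    return synsets[0] if synsets else None   # naive first-sense choice

print(anchor_category("American rock singers"))  # e.g. Synset('singer.n.01')
```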
The resulting ontology was assessed by present-
ing the facts to human judges for evaluation. For each fact, the judges could indicate whether it was correct, incorrect, or unknown to them.
3.5 WikiOnto
The goal of the WikiOnto project is not to create an ontology but to provide an environment that assists in the extraction and modelling of ontologies, using Wikipedia as the main source of information (Silva and Jayaratne, 2009). Although it does not itself produce an ontology, it was included in the comparison because it can be considered a semi-automatic approach to ontology building.
The environment enables users to choose an article of interest as a starting point and, using clustering techniques, it suggests other concepts that might be relevant to include in the ontology.
The ontology builder can add new concepts to the
ontology and then specify the relationships between
them. The authors plan to implement syntactic anal-
ysis in order to suggest relations between concepts
other than is-a.
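A minimal sketch of concept suggestion is shown below; it substitutes simple link-overlap (Jaccard) scoring for the system's clustering techniques, which are not reproduced here, and all link data is hypothetical.

```python
# A sketch of concept suggestion using link overlap (Jaccard similarity)
# as a stand-in for the system's clustering techniques; all data is
# hypothetical and this is not the WikiOnto algorithm.
def suggest(seed, article_links, top_n=2):
    seed_links = article_links[seed]
    scores = {
        other: len(seed_links & links) / len(seed_links | links)
        for other, links in article_links.items() if other != seed
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

article_links = {  # hypothetical outgoing-link sets per article
    "Ontology": {"Semantic Web", "RDF", "Logic"},
    "RDF Schema": {"Semantic Web", "RDF"},
    "Botany": {"Plant"},
}
print(suggest("Ontology", article_links))  # ['RDF Schema', 'Botany']
```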
Table 2: Derived elements and Wikipedia features used in the evaluated systems (D = DBpedia, T = Wikipedia Thesaurus, N = WikiNet, Y = YAGO, O = WikiOnto, X = WikiTaxonomy).

Articles
  Title: concepts and individuals (D, T, N, Y, O)
  Definition sentence: not used
  Overview paragraph: not used
  Sections: concepts and individuals (O); is-a relations (O)
  Text: numeric relations (T); is-a relations (X); instance-of relations (X)
  Templates: attributes (D); other relations (D)
  Infoboxes: other relations (N, Y)
  Internal links: concepts and individuals (O); untyped relations (D); numeric relations (T); is-a relations (O)
  External links: not used
  Cross-language links: concepts and individuals (N)
  Category links: is-a relations (N); instance-of relations (Y)
Categories
  Title: concepts and individuals (N, X); is-a relations (N, Y, X); instance-of relations (X)
  Text: not used
  Parent categories: is-a relations (N, X); instance-of relations (X)
  Cross-language links: concepts and individuals (N)
Redirects: concepts and individuals (Y)
Disambiguation pages: concepts and individuals (N)
Edit histories: not used
Discussion pages: not used
3.6 WikiTaxonomy
WikiTaxonomy is an ontology derived from
Wikipedia’s category structure. It includes con-
cepts, individuals, and simple taxonomic relations
(Ponzetto and Strube, 2008). The ontology is
available in RDFS format.
The authors use a combination of connectivity-based and linguistic methods to derive is-a relations between Wikipedia categories. These
relations are then propagated using inference-based
methods. Finally, concepts and individuals are identi-
fied by using several heuristics.
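A minimal sketch of a linguistic head-matching heuristic of this kind (not the authors' implementation) is shown below; the category names are hypothetical.

```python
# A sketch of a head-matching heuristic for is-a links between categories,
# not the authors' implementation; the category names are hypothetical.
def is_a_candidates(categories):
    relations = []
    for child in categories:
        head = child.split()[-1]  # lexical head of the category name
        for parent in categories:
            if parent != child and parent.lower() == head.lower():
                relations.append((child, "is-a", parent))
    return relations

print(is_a_candidates(["Musicians", "British musicians",
                       "Capitals in Europe"]))
# [('British musicians', 'is-a', 'Musicians')]
```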
4 FUTURE WORK
By analysing the cross-product between the framework's derived elements and Wikipedia features dimensions, it is easy to identify which Wikipedia features have been used to derive certain types of semantic information. Table 2 shows the result of applying this
cross-product to the six systems that were previously
classified with the framework. Future work will involve using this information to identify new sources of semantic information and to derive new mechanisms to exploit them.
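A minimal sketch of this cross-product analysis is shown below; the feature and element lists are a small hypothetical excerpt, not the full contents of Table 2.

```python
# A sketch of the cross-product analysis: every (feature, element) pair not
# covered by some system is a candidate gap. The lists are a hypothetical
# excerpt, not the full contents of Table 2.
features = ["Article title", "Infoboxes", "Edit histories"]
elements = ["Concepts", "Attributes", "Is-a relations"]

used = {("Article title", "Concepts"), ("Infoboxes", "Attributes")}

gaps = [(f, e) for f in features for e in elements if (f, e) not in used]
for feature, element in gaps:
    print("unused combination:", feature, "->", element)
```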
5 CONCLUSIONS
This paper introduces a framework to compare ap-
proaches to deriving ontologies automatically or
semi-automatically from Wikipedia. Six prominent
systems are classified using the framework and their
most relevant characteristics are summarised. Finally,
the cross-product of two of the framework’s dimen-
sions is used to show how new sources of semantic
information in Wikipedia can be identified.
ACKNOWLEDGEMENTS
This research is supported in part by the CRC Smart
Services, established and supported under the Aus-
tralian Government Cooperative Research Centres
Programme, and a Queensland University of Technology scholarship.
REFERENCES
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak,
R., and Ives, Z. (2008). DBpedia: A nucleus for a web of open data. In The Semantic Web, volume
4825/2008, pages 722–735. Springer Berlin / Heidel-
berg.
Brank, J., Grobelnik, M., and Mladenić, D. (2005). A sur-
vey of ontology evaluation techniques. In Proceedings
of the Conference on Data Mining and Data Ware-
houses (SiKDD 2005).
Gruber, T. R. (1993). A translation approach to portable
ontology specifications. Knowledge Acquisition, 5(2):199–220.
Hepp, M. (2007). Possible ontologies: How reality con-
strains the development of relevant ontologies. IEEE
Internet Computing, 11(1):90–96.
Ito, M., Nakayama, K., Hara, T., and Nishio, S. (2008).
Association thesaurus construction methods based on link co-occurrence analysis for Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, Napa Valley, California, USA. ACM.
Medelyan, O., Milne, D., Legg, C., and Witten, I. H. (2009).
Mining meaning from Wikipedia. International Jour-
nal of Human-Computer Studies, 67(9):716–754.
Nastase, V., Strube, M., Boerschinger, B., Zirn, C., and El-
ghafari, A. (2010). WikiNet: A very large scale multi-lingual concept network. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta. European
Language Resources Association (ELRA).
Ponzetto, S. and Strube, M. (2008). WikiTaxonomy: A
large scale knowledge resource. In Proceedings of the 18th European Conference on Artificial Intelligence (ECAI 2008), pages 751–752. IOS Press.
Shamsfard, M. and Abdollahzadeh Barforoush, A. (2003).
The state of the art in ontology learning: a framework
for comparison. Knowl. Eng. Rev., 18(4):293–316.
Silva, L. N. D. and Jayaratne, L. (2009). WikiOnto: A system for semi-automatic extraction and modeling of ontologies using Wikipedia XML corpus. In IEEE International Conference on Semantic Computing, pages 571–576.
Suchanek, F. M., Kasneci, G., and Weikum, G. (2008).
YAGO: A large ontology from Wikipedia and WordNet.
Web Semantics: Science, Services and Agents on the
World Wide Web, 6(3):203–217.
Zhou, L. (2007). Ontology learning: state of the art and
open issues. Information Technology and Manage-
ment, 8(3):241–252.