FactRunner: Fact Extraction over Wikipedia
Rhio Sutoyo 1, Christoph Quix 2,3 and Fisnik Kastrati 2
1 King Mongkut’s University of Technology North Bangkok, Thai-German Graduate School of Engineering, Bangkok, Thailand
2 RWTH Aachen University, Information Systems and Databases, Aachen, Germany
3 Fraunhofer Institute for Applied Information Technology FIT, St. Augustin, Germany
Keywords: Information Extraction, Semantic Search.
Abstract:
The increasing role of Wikipedia as a source of human-readable knowledge is evident, as it contains an enormous amount of high-quality information written in natural language by human authors. However, querying this information using traditional keyword-based approaches often requires a time-consuming, iterative process of exploring the document collection to find the information of interest. Therefore, a structured representation of information and queries would be helpful in order to query directly for the relevant information. An important challenge in this context is the extraction of structured information from unstructured knowledge bases, which is addressed by Information Extraction (IE) systems. However, these systems struggle with the complexity of natural language and frequently produce unsatisfying results.
In addition to the plain natural language text, Wikipedia contains links between documents which directly link a term in one document to another document. In our approach for fact extraction from Wikipedia, we consider these links as an important indicator of the relevance of the linked information. Thus, our proposed system FactRunner focuses on extracting structured information from sentences containing such links. We show that a natural language parser combined with Wikipedia markup can be exploited to extract facts in the form of triple statements with high accuracy.
1 INTRODUCTION
For the past three decades, relational database sys-
tems (RDBMS) have gained great popularity for data
management in various application areas. Neverthe-
less, with the introduction of the World Wide Web (WWW), a huge amount of data repositories in the form of unstructured and semi-structured sources became available (e.g., webpages, newspaper articles, blogs, scientific publication repositories, etc.). Such sources often contain useful information and are designed to be read by humans, not by machines. Wikipedia is a good example of this case; with more than 3.8 million articles, it has become one of the main repositories of human-readable knowledge in the modern information era.
Keyword search is still the dominating way of
searching in such large document collections. Al-
though keyword search is quite efficient from a
system-oriented point-of-view (relevant documents
can be found very quickly using a traditional
keyword-based search engine), finding a certain piece
of information often requires an iterative, time-
consuming process of keyword search. An initial set
of keywords is tried first; if relevant documents are re-
turned, then the user has to read parts of the retrieved
documents in order to find the information of inter-
est. If the answer is not found, the user has to refine
her query, and look for other documents which might
contain the answer. This process of querying, reading,
and refining might have to be repeated several times
until a satisfying answer is found.
Therefore, the search process would be simpli-
fied if it could be supported, at least partially, by
some system that actually understands the semantics
of the searched data. Such systems are called se-
mantic search engines and various approaches have
been proposed (Mangold, 2007; Dong et al., 2008).
They range from approaches which replace the orig-
inal keywords entered by the user with semantically
related terms (e.g., (Burton-Jones et al., 2003)) to
more complex approaches which require a query in
a formal language as well as semantically annotated
data (e.g., SPARQL endpoints for Linked Data (Heath
and Bizer, 2011)).
The latter search paradigm is geared towards data
retrieval, as the system is only able to return concepts or documents which contain semantic annotations. This is fine as long as the documents have been
annotated with semantic information (e.g., RDF state-
ments). However, most information which is avail-
able today is stored in unstructured text documents
which do not contain semantic annotations. Web doc-
uments, in contrast to plain text documents, contain
other useful information such as links, tables, and im-
ages. In particular, links connecting documents can be
exploited to extract more accurate information from
text documents. We apply this idea to the Wikipedia
collection.
The task of converting information contained in
document collections into formal knowledge is ad-
dressed in the research area of Information Extraction
(IE), but it is still an unsolved problem. Approaches
such as Open Information Extraction (Banko et al.,
2007) are able to extract information in the form of triples
(e.g., statements of the form subject – predicate – ob-
ject) from unstructured documents, but the extracted
triples require more consolidation, normalization and
linkage to existing knowledge to become useful for
queries. In order to achieve this at a large scale, the task has to be done in an unsupervised and fully-automatic manner. A particular challenge for
information extraction from Wikipedia articles is the
lack of redundancy of sentences describing a partic-
ular fact. Redundancy has been greatly leveraged by
systems that perform IE over the web (e.g., TextRun-
ner (Banko et al., 2007) and KnowItAll (Etzioni et al.,
2004)) to increase the quality of their extracted facts. Fur-
thermore, redundancy is important for the scoring
model as frequently occurring facts get a higher score,
i.e., these facts are assumed to be correctly extracted
with a high probability. A strong scoring model is
fundamental towards ensuring the accuracy of the ex-
tractor (i.e., for filtering out incorrect triples). This is
not the case with Wikipedia, as articles generally ad-
dress only one topic, and there are no other articles
addressing the same topic. This feature reduces infor-
mation redundancy greatly; if a fact is missed while
extracting information from one article, the probabil-
ity is very low that the same fact will be encountered
in other articles.
The contribution of this paper is a novel sys-
tem called FactRunner for open fact extraction from
Wikipedia. The primary goal of this system is to ex-
tract high quality information present in Wikipedia’s
natural language text. Our approach is complemen-
tary to other approaches which extract structured in-
formation from Wikipedia infoboxes (Weld et al.,
2009) or the category system (Hoffart et al., 2011).
Our method utilizes the existing metadata present in
Wikipedia articles (i.e., links between articles) to ex-
tract facts with high accuracy from Wikipedia’s En-
glish natural language text. The metadata is also used in our scoring model, which assigns higher scores to facts that are based on metadata. Extracted facts are stored in the form of triples (subject, predicate, object), where a triple relates a subject and an object via a predicate (Rusu et al., 2007). These triples can later be queried using structured queries, thereby enabling the integration of structured and unstructured data (Halevy et al., 2003).
The paper is structured as follows. We will first
present in section 2 the main components of our sys-
tem. Section 3 describes the preprocessing step of
our system, which actually does most of the work
for the triple extraction, as it simplifies and normal-
izes the natural language text input. Because of this
simplification, the triple extraction method (section 4)
is quite simple compared to the preprocessing steps.
Section 5 then presents the evaluation results for our
system which we applied to a sample of documents
from Wikipedia. Related work is discussed in section
6 before we conclude our paper.
2 SYSTEM OVERVIEW
Wikipedia contains a vast amount of information
stored in natural text. There is much valuable infor-
mation hidden in such text, but computers cannot di-
rectly reason over the natural text. Wikipedia is de-
signed to be read by humans. To tackle this prob-
lem, we introduce a solution which utilizes Wikipedia
metadata to detect high quality sentences in the
Wikipedia data and extract valuable facts from those
sentences. Metadata surrounds important text fragments in Wikipedia articles: it provides a link pointing to a separate Wikipedia article devoted to the highlighted text, thereby giving more information (e.g., a general definition, biography, history, etc.) and emphasizing the importance of that text. Wikipedia contributors have invested effort in creating such metadata and have inserted it manually, which is a good hint on the importance of these fragments. To this end, we treat metadata as an indicator of the importance of the sentences in which it occurs, and we focus on exploiting such sentences, with the ultimate goal of producing high quality triples.
The architecture of our system is shown in Figure 1. It consists of six components, namely the data source, the preprocessor, the extractor, the triple finalizer, the scoring, and the result producer.
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
424
Figure 1: System Architecture.
The main components are the preprocessor and the triple extractor. The preprocessor
does most of the work; it prepares the natural lan-
guage text in such a way that the extractor can easily
extract the triples from the transformed input. Prepro-
cessor and triple extractor will be described in more
detail in sections 3 and 4.
2.1 Data Source
While YAGO (Suchanek et al., 2007) exploits the
Wikipedia category system and KYLIN (Weld et al.,
2009) uses Wikipedia infoboxes as their data source
for extracting facts, we use the natural language text
of Wikipedia articles because there are more facts em-
bedded in the text. Furthermore, although the idea of providing an article’s general information through an infobox is favorable, the use of infoboxes is optional and many Wikipedia articles do not have one. On the other hand, most Wikipedia articles typically belong to one or more categories. Unfortunately, these categories are
usually defined for quite general classes and are not
as detailed as a classification which can be found in
Wikipedia’s natural language text. Another reason for
choosing Wikipedia is that the articles are rich with
markup around text fragments which can be utilized
to extract facts with a higher precision.
3 PREPROCESSOR
The preprocessor is responsible for generating a set of
sentences to be passed to our triple extractor. There
are six steps, which are explained in the following
subsections.
3.1 Entity Resolution
Entity resolution is a component in our system that
utilizes metadata in order to synchronize multiple oc-
currences of the same entity expressed with different
textual representations, e.g., “Albert Einstein” vs. “A. Einstein”. This component is invaluable for reconciling different variants of an entity representation.
In all cases, we replace entities with their longest textual representation in order to preserve information. In our current approach, we consider only entities corresponding to persons for reconciliation.
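The following is a minimal sketch of this reconciliation step (our own illustration; class and method names are hypothetical, and a real implementation would respect token boundaries). Mentions known to refer to the same person, e.g., because their links point to the same article, are mapped to their longest textual variant:

import java.util.*;

// Sketch only: reconcile person-entity mentions by mapping every variant
// to the longest textual representation within its group.
public class EntityResolutionSketch {

    // each group contains mentions assumed to refer to the same person
    public static Map<String, String> buildCanonicalMap(List<Set<String>> mentionGroups) {
        Map<String, String> canonical = new HashMap<>();
        for (Set<String> group : mentionGroups) {
            // choose the longest variant to preserve as much information as possible
            String longest = Collections.max(group, Comparator.comparingInt(String::length));
            for (String variant : group) {
                canonical.put(variant, longest);
            }
        }
        return canonical;
    }

    // naive textual substitution; a real system would work on token spans
    public static String resolve(String sentence, Map<String, String> canonical) {
        String result = sentence;
        for (Map.Entry<String, String> e : canonical.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> groups =
                List.of(new HashSet<>(List.of("Albert Einstein", "A. Einstein")));
        Map<String, String> canonical = buildCanonicalMap(groups);
        System.out.println(resolve("A. Einstein was born in Ulm.", canonical));
        // prints: Albert Einstein was born in Ulm.
    }
}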
3.2 Sentence Cleaner
This component is responsible for removing brackets and similar text fragments from the sentences of an article.
We realized during the testing of our system that
words inside brackets are often supplementary facts
which give more details about the words occurring
before the brackets. Moreover, it is difficult for the triple extractor to linguistically understand the semantic relationship between the bracketed text and the rest of the sentence, and to extract correct facts from it. Furthermore, we extract the metadata from the articles before we remove the markup completely from the text. For each entity surrounded with metadata, we store its textual representation fragment and its URL, e.g., <harry potter, http://en.wikipedia.org/wiki/Harry_Potter>,
into our metadata collection. This collection will
be later used to remove unwanted sentences in the
filtering process.
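As an illustration of this step, the sketch below (our own simplified code, not the actual component) extracts wiki link markup of the form [[Target|anchor text]] into an (anchor text, URL) collection and returns a cleaned sentence in which links are replaced by their anchor text and bracketed fragments are removed:

import java.util.*;
import java.util.regex.*;

// Sketch only: collect link metadata from wiki markup and clean the sentence.
public class SentenceCleanerSketch {

    private static final Pattern WIKI_LINK =
            Pattern.compile("\\[\\[([^\\]|]+)(?:\\|([^\\]]+))?\\]\\]");

    // collected metadata: anchor text -> article URL
    public static Map<String, String> extractMetadata(String wikiText) {
        Map<String, String> metadata = new LinkedHashMap<>();
        Matcher m = WIKI_LINK.matcher(wikiText);
        while (m.find()) {
            String target = m.group(1).trim();                       // linked article title
            String anchor = m.group(2) != null ? m.group(2).trim() : target;
            metadata.put(anchor, "http://en.wikipedia.org/wiki/" + target.replace(' ', '_'));
        }
        return metadata;
    }

    // replace every link with its visible anchor text, then drop (non-nested) bracketed asides
    public static String clean(String wikiText) {
        String text = WIKI_LINK.matcher(wikiText).replaceAll(
                mr -> Matcher.quoteReplacement(mr.group(2) != null ? mr.group(2) : mr.group(1)));
        return text.replaceAll("\\s*\\([^)]*\\)", "");
    }

    public static void main(String[] args) {
        String s = "The [[Harry Potter]] series was written by [[J. K. Rowling|Rowling]] (a British author).";
        System.out.println(extractMetadata(s));   // {Harry Potter=.../Harry_Potter, Rowling=.../J._K._Rowling}
        System.out.println(clean(s));             // The Harry Potter series was written by Rowling.
    }
}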
FactRunner:FactExtractionoverWikipedia
425
3.3 Sentence Splitter
After the text is filtered and cleaned, we split our in-
put from paragraphs into a set of sentence chunks.
Generally, one sentence may contain one or more
triples. Sentence splitting is a crucial step towards ensuring that only well-formed sentences are fed to the system, which in turn enables triple extraction with higher precision. For this purpose, we rely on the sen-
tence detection library provided by LingPipe (Alias-i,
2008).
3.4 Coreference Matcher
The need for coreferencing surfaces when multiple
entities or pronoun words in a text refer to the same
entity or have the same referent. In the first sen-
tence of Wikipedia articles, an author typically introduces a person or an object by its full name (e.g., Albert Einstein). Then, in the following sen-
tences the writer will begin using a substitute word
(e.g., he, his, him, Einstein) for that person. Build-
ing the connection between these entities, e.g., Al-
bert Einstein and Einstein, is what we refer to as
coreferencing. Triples extracted from sentences with
coreferences are fuzzy, because the context of the
sentence is lost once the triple has been extracted.
Even a human cannot understand the meaning of
a triple such as (he,received,Nobel Prize in
Physics) without a given context. Thus, we need
to map entities that have different representation forms to one single representation, i.e., the full name. In order to do so, we apply LingPipe’s coreference resolution tool (Alias-i, 2008).
Although the purpose of the coreference matcher is similar to that of entity resolution, namely to synchronize different representations of associated entities into one form, the two components work in different ways. In
our approach, entity resolution only affects entities
which are marked as links. Such links have been en-
tered by humans manually and are highly accurate,
therefore they can be used to correctly replace entities with their canonical representation (i.e., the title of the
article describing the entity). On the other hand, the
coreference matcher is able to catch entities which are
not covered with metadata including pronoun words.
However, its accuracy might not be as high as that of entity resolution, because LingPipe’s coreference resolver is an automatic prediction component.
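To illustrate the idea behind this step, the following deliberately naive sketch (our own, not part of the system) replaces personal pronouns with the article title, which in Wikipedia usually names the main entity; the actual system relies on LingPipe’s coreference resolver rather than such a heuristic:

import java.util.*;

// Naive illustration of pronoun substitution; possessive handling is simplistic
// ("her" is always treated as possessive here).
public class NaiveCoreferenceSketch {

    private static final Set<String> PRONOUNS = Set.of("he", "she", "him", "his", "her");

    public static String substitute(String sentence, String articleTitle) {
        StringBuilder out = new StringBuilder();
        for (String token : sentence.split("\\s+")) {
            String bare = token.replaceAll("[^A-Za-z]", "");
            if (PRONOUNS.contains(bare.toLowerCase(Locale.ROOT))) {
                boolean possessive = bare.equalsIgnoreCase("his") || bare.equalsIgnoreCase("her");
                token = token.replace(bare, possessive ? articleTitle + "'s" : articleTitle);
            }
            out.append(token).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(substitute("He received the Nobel Prize in Physics.", "Albert Einstein"));
        // prints: Albert Einstein received the Nobel Prize in Physics.
    }
}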
3.5 Sentence Filter
We remove invalid sentences which are caused either
by the sentence splitting tool or by human errors. For
example, sentences that do not start with a proper text
or end with inappropriate punctuation symbols are ig-
nored. Furthermore, we also exclude sentences which
have a length over a defined threshold. This step is necessary because, with a deep parsing technique, long sentences demand many resources and significantly slow down the runtime of the whole system. Sentences with fewer than three
words are removed as well, because a sentence at least
needs to have a subject, a predicate, and an object to
form a proper triple. After validating the sentences using the methods mentioned above, we use the metadata collected by the sentence cleaner to retain only sentences containing metadata for triple extraction. Based on our obser-
vations, sentences with metadata are more likely to
produce high quality triples.
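The filtering rules described above can be summarized by the following sketch (our own illustration; the 200-character limit is the threshold mentioned later in the evaluation section, and the remaining checks mirror the rules just described):

import java.util.Map;

// Sketch only: decide whether a sentence is passed on to the extractor.
public class SentenceFilterSketch {

    private static final int MAX_LENGTH = 200;   // characters; deep parsing of longer sentences is too expensive
    private static final int MIN_WORDS = 3;      // a proper triple needs subject, predicate, and object

    // metadata: anchor text -> article URL, as collected by the sentence cleaner
    public static boolean accept(String sentence, Map<String, String> metadata) {
        String s = sentence.trim();
        if (s.isEmpty() || s.length() > MAX_LENGTH) return false;
        if (s.split("\\s+").length < MIN_WORDS) return false;
        if (!Character.isUpperCase(s.charAt(0))) return false;                      // does not start with proper text
        if (!s.endsWith(".") && !s.endsWith("!") && !s.endsWith("?")) return false; // inappropriate ending
        // keep only sentences that contain at least one linked (metadata) fragment
        return metadata.keySet().stream().anyMatch(s::contains);
    }
}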
3.6 Sentence Simplifier
An important step to improve the performance of the
overall system is sentence simplification. The pur-
pose of this step is to split complex sentences into
sub-sentences. Wikipedia authors sometimes use a rich writing style (e.g., lists of items, dependent clauses, etc.) which makes some sentences complex and hard to understand. Therefore, complex
sentences need to be simplified in order to produce
simple and clear triples as the final result. We borrow the concept of sentence simplification from (Defazio, 2009).
We utilize the Stanford parser (Klein and Man-
ning, 2003) to get the grammatical structure of sen-
tences. The result of this parsing tool is a parse
tree which will distinguish noun phrases (NP), verb
phrases (VP), and groups of words that belong to-
gether (e.g., dependent clauses (SBAR), prepositional
phrases (PP), etc.) in the sentences. We use the gener-
ated parse tree to understand the correlation between
words in a sentence and to split the sentences cor-
rectly. Nevertheless, parsing of natural language text
is a challenge for any kind of NLP tool. Thus, not all
generated parse trees are correct. Because our simplification process depends heavily on parse trees,
the simplification results will be wrong if the provided
parse trees are wrong. However, the overall results of
this step are still decent and useful for the extraction
process.
The sentence simplification has four steps (a small code sketch illustrating the first step follows the list):
Split Coordinating Conjunctions. Coordinating
conjunctions are words like “and”, “or”, and
“but” which are used in enumerations or to
connect sub-sentences. An example from Albert
Einstein’s Wikipedia article is: his travels
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
426
included Singapore, Ceylon, and Japan. As we
aim at supporting structured queries, we should
extract three triples from this sentence and not
only one. This would enable us to answer a
question like ’Who travelled to Ceylon?’ and not
only the question ’Who travelled to Singapore,
Ceylon, and Japan?’. Such simple enumerations
could still be handled in a post-processing step after
the triples have been extracted, but we also want
to handle more complex sentences like Welker
and his department paved the way for microwave
semiconductors and laser diodes (from the
Wikipedia article about Heinrich Welker). The
coreferencing step described above transforms
‘his’ into ‘Welker’s’. To extract the complete
factual information from this sentence, we have
to extract four triples from the sentence which
correspond to the following statements:
Welker paved the way for microwave semicon-
ductors.
Welker paved the way for laser diodes.
Welker’s department paved the way for mi-
crowave semiconductors.
Welker’s department paved the way for laser
diodes.
Furthermore, we also make use of metadata in this step: if a text fragment is a link to another article, it is not processed by this method.
Extract Dependent Clauses. In this simplification
step, we will separate a dependent clause (SBAR
in the terminology of the Stanford parser) from
its main clause and create a new sentence which
contains the SBAR clause and the subject from
its main clause. There are two types of SBAR
clauses: subordinate conjunctions (for adver-
bial clauses) and relative pronouns (for relative
clauses) (Defazio, 2009). Our approach handles
each type differently. The first type will be ex-
tracted into sub-sentence modifiers, while the sec-
ond type will create a new sentence. For example,
the sentence Albert Einstein also investigated the
thermal properties of light which laid the founda-
tion of the photon theory of light. contains a rel-
ative clause which is translated into Albert Ein-
stein also investigated the thermal properties of
light. and The thermal properties of light laid
the foundation of the photon theory of light.
Extract Adjective Phrases. In this step, we extract adjective phrases from their main phrase. Adjective phrases generally appear between two phrases, i.e., a noun phrase and a verb phrase, from which they are separated by commas. An adjective phrase can be a noun phrase (e.g., ‘Planck’s oldest son, Karl Weierstrass, was killed in action in 1916.’) or a verb phrase (e.g., ‘Jacques Vergès, born 5 March 1925 in Siam, is a French lawyer...’). We handle these two cases differently. Noun phrases are extracted from their main sentence and used as the subject of a new sentence (e.g., we would get ‘Planck’s oldest son was killed...’ and ‘Karl Weierstrass was killed...’). Verb phrases are used as the predicate (including the object) of a new sentence (e.g., ‘Jacques Vergès born 5 March 1925 in Siam.’ and ‘Jacques Vergès is a French lawyer...’).
Extract Secondary Verbs. Lastly, we extract all secondary verbs from the given sentences. Such sentences contain two verb phrases with the same noun phrase, where the second verb phrase is nested in the first one. For example, ‘Amy Lee Grant is an American singer-songwriter, best known for Amy Lee Grant’s Christian music.’ would be translated into ‘Amy Lee Grant is an American singer-songwriter.’ and ‘Amy Lee Grant best known for her Christian music.’
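The sketch below illustrates the first of these steps in a simplified, string-based form: coordinated subjects and objects are split and recombined into the cross product of simple sentences (our own illustration; the actual system performs this splitting on the Stanford parse tree and skips fragments that are links):

import java.util.*;

// Sketch only: splitting coordinating conjunctions, as in the Welker example above.
public class ConjunctionSplitSketch {

    // split an enumeration like "X, Y, and Z" or "X and Y" into its parts
    static List<String> splitEnumeration(String phrase) {
        List<String> parts = new ArrayList<>();
        for (String p : phrase.split(",| and ")) {
            String t = p.trim();
            if (!t.isEmpty()) parts.add(t);
        }
        return parts;
    }

    public static void main(String[] args) {
        String subjects = "Welker and Welker's department";
        String predicate = "paved the way for";
        String objects = "microwave semiconductors and laser diodes";

        for (String subject : splitEnumeration(subjects)) {
            for (String object : splitEnumeration(objects)) {
                System.out.println(subject + " " + predicate + " " + object + ".");
            }
        }
    }
}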
4 EXTRACTOR
The triple extractor is responsible for extracting facts and
representing them in the form of triples. Triples
are stored in the form of (subject, predicate,
object). We extract the triples by using the parse
tree approach introduced in (Rusu et al., 2007). We
use the Stanford parser (Klein and Manning, 2003)
to generate the parse tree. The main idea of this ap-
proach is to utilize the parse tree as a helping tool
to extract subjects, predicates, and objects from sen-
tences, then combine them to produce a triple. In or-
der to find the subject of a sentence, we have to search
for it in the noun phrase (NP) tree. Furthermore, the
predicate and the object of a sentence can be found
in the verb phrase (VP) tree. By applying the simpli-
fication method of extracting dependent clauses and
adjective phrases described above, we get the fol-
lowing results for the sentence ‘Albert Einstein was a
German-born theoretical physicist who developed the
theory of general relativity, effecting a revolution in
physics.’:
(Albert Einstein, was, German-born
theoretical physicist)
(Albert Einstein, developed, theory of
general relativity)
(Albert Einstein, effecting,
revolution in physics)
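The following sketch illustrates the parse-tree-based extraction idea for the second of these triples. It is our own simplified illustration, not the actual extractor: it assumes the Stanford parser library on the classpath and starts from an already bracketed parse instead of invoking the parser itself:

import edu.stanford.nlp.trees.Tree;

// Sketch only: the subject is taken from the first NP, the predicate (verb)
// and the object from inside the VP (cf. Rusu et al., 2007).
public class TripleExtractorSketch {

    // concatenate the leaf words of a subtree
    static String text(Tree t) {
        StringBuilder sb = new StringBuilder();
        for (Tree leaf : t.getLeaves()) sb.append(leaf.value()).append(' ');
        return sb.toString().trim();
    }

    // depth-first search for the first subtree whose label starts with the given prefix
    static Tree find(Tree t, String labelPrefix) {
        for (Tree child : t.children()) {
            if (child.label().value().startsWith(labelPrefix)) return child;
            Tree hit = find(child, labelPrefix);
            if (hit != null) return hit;
        }
        return null;
    }

    public static void main(String[] args) {
        Tree s = Tree.valueOf(
            "(S (NP (NNP Albert) (NNP Einstein)) (VP (VBD developed) " +
            "(NP (NP (DT the) (NN theory)) (PP (IN of) (NP (JJ general) (NN relativity))))))");

        Tree subject = find(s, "NP");   // first noun phrase under S
        Tree vp = find(s, "VP");        // verb phrase
        Tree verb = find(vp, "VB");     // VBD, VBZ, ... inside the VP
        Tree object = find(vp, "NP");   // first noun phrase inside the VP

        System.out.println("(" + text(subject) + ", " + text(verb) + ", " + text(object) + ")");
        // prints: (Albert Einstein, developed, the theory of general relativity)
    }
}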
FactRunner:FactExtractionoverWikipedia
427
4.1 Triple Finalizer
The triple finalizer converts the predicates of the extracted triples into their lemmatized form. The lemmatized form is the base form of a verb; for example, the words receiving or received will be changed to the word receive. Lemmatization prevents ambiguity and is useful for queries. All triples produced
by this step are the final triples of our system. For this
component, our system uses a lemmatizer provided
by the Stanford Natural Language Processing Group
(Toutanova et al., 2003).
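One possible way to implement such a lemmatizer is sketched below using the Stanford CoreNLP pipeline (it requires the CoreNLP models on the classpath); the exact tool used in FactRunner may differ, so this is only an illustration of the idea:

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;

// Sketch only: lemmatize a predicate string, e.g., "received" -> "receive".
public class PredicateLemmatizerSketch {

    private final StanfordCoreNLP pipeline;

    public PredicateLemmatizerSketch() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        this.pipeline = new StanfordCoreNLP(props);
    }

    public String lemmatize(String predicate) {
        Annotation doc = new Annotation(predicate);
        pipeline.annotate(doc);
        StringBuilder sb = new StringBuilder();
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                sb.append(token.get(CoreAnnotations.LemmaAnnotation.class)).append(' ');
            }
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        PredicateLemmatizerSketch lemmatizer = new PredicateLemmatizerSketch();
        System.out.println(lemmatizer.lemmatize("received"));   // expected: receive
        System.out.println(lemmatizer.lemmatize("receiving"));  // expected: receive
    }
}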
4.2 Scoring
We assign lower scores to triples which do not contain metadata. Triples with metadata deserve a higher score because the metadata indicates the quality of the sentence and of the information it surrounds. Furthermore, the completeness of a triple (i.e.,
subject, predicate, and object) and the entity type of
the subject (i.e., PERSON, LOCATION, and ORGA-
NIZATION) also determine the triple scores. The en-
tity types are assigned during the coreferencing step
by LingPipe’s named entity recognizer (Alias-i, 2008). The following formula is used to compute
the score of the triples:
score = (metadata + completeness + entity) / 3
The score is a hint on the quality of the extracted
triple; it could be used in further processing steps
(e.g., in ranking results of a structured query). How-
ever, this was not the focus of our work, and we are
aware that a more sophisticated scoring mechanism
might be required which includes more factors that
indicate the quality of the extracted triple. This is an
issue which we will address in future work.
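A minimal sketch of this scoring is given below; the assumption that each of the three factors is a binary 0/1 indicator is ours, since the text does not spell out how the individual factors are quantified:

// Sketch only: combine the three quality indicators into a single score.
public class TripleScorerSketch {

    public static double score(boolean hasMetadata, boolean isComplete, boolean subjectIsNamedEntity) {
        int metadata = hasMetadata ? 1 : 0;              // the sentence contained a link
        int completeness = isComplete ? 1 : 0;           // subject, predicate, and object present
        int entity = subjectIsNamedEntity ? 1 : 0;       // subject typed as PERSON/LOCATION/ORGANIZATION
        return (metadata + completeness + entity) / 3.0;
    }

    public static void main(String[] args) {
        System.out.println(score(true, true, true));     // 1.0 ("full score")
        System.out.println(score(false, true, false));   // 0.333...
    }
}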
5 EVALUATION
The evaluation is divided into two parts: a user-based
evaluation (sections 5.1 and 5.2) and a system-based
evaluation which compares our system with ReVerb
(Etzioni et al., 2011) (section 5.3). ReVerb is a recent
system for open information extraction and has shown
good performance in its evaluation.
5.1 User-based Evaluation: Setup and
Dataset
The goal of the evaluation is to verify the correct-
ness of the extracted triples as we aim at extracting
high quality triples. As there is no gold standard
for the task of triple extraction over Wikipedia, we
had to take a sample of documents from Wikipedia,
apply the FactRunner method, and rate the correct-
ness of the extracted triples manually. In order to
avoid a very subjective rating of correctness of the
extracted triples, several users were involved in the
evaluation. A triple was considered as correct when
all users agreed on the correctness of that triple.
The evaluation was done using a web-based inter-
face. For each sentence, the list of extracted triples
was presented. For each triple, the user could state
whether the triple is correct or not. As stated before,
we consider only correctness and not completeness;
therefore, we did not ask the user for other triples that
could have been extracted from the sentence.
We conducted our evaluation using two categories of articles in Wikipedia: American Actors
and American Physicists. Table 1 summarizes some
statistics about the dataset and the extracted triples.
5.2 User-based Evaluation: Results
For the user-based evaluation, we picked 15 articles
from each category as the predefined documents that
we presented to the users. The overview of this eval-
uation is shown in Figure 2. The y-axis of the figure
shows the number of triples extracted (total, correct,
and incorrect). The average precision in the user-based evaluation is 77%. The result is promising: 6 of the 30 predefined articles have a precision of 90% or more. We believe the thorough sentence preprocessing, especially the sentence filter and sentence simplifier components, is the main reason for this high value. Nevertheless, three articles from our datasets obtained a precision below 60%. After inspecting the results, we found that the main reason for this low precision is that the subjects of the triples are not clear (e.g., he, she, her, these, their, etc.), because the coreference matcher failed to replace pronouns with the corresponding entities in the subjects of the triples. Thus, the users were confused and evaluated those triples as incorrect.
5.3 System-based Evaluation
Table 2 summarizes the results of the comparison of
our system with ReVerb. The dataset is the same as
for the user-based evaluation. The numbers show how many triples we are able to extract from the dataset compared with ReVerb. Thus, they are an indicator of the recall of the systems; however, as it is very time consuming
to define a reference result for such a large dataset, an exact computation of the recall value is not possible.
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
428
Table 1: Statistics of the dataset and the extracted triples.
American Actors American Physicists
# Articles 2,483 1,298
Total #Triples 51,088 26,967
Triples with Full Score 39,313 19,975
Contains Entity 46,735 23,394
Contains Metadata 42,140 22,524
Triple Completeness 48,510 25,587
Figure 2: Overview of user-based evaluation.
Although we used the same datasets for both
systems, our system considered only sentences which
have metadata and are shorter than 200 characters.
Thus, the number of sentences considered by our sys-
tem is lower than for ReVerb. The evaluation shows that both systems are able to extract multiple facts from one sentence. On average, FactRunner extracts 1.81 triples per sentence and ReVerb extracts 1.3 triples per sentence. Thus, our system is able to extract more triples from fewer sentences. The reason
behind this result is most likely that FactRunner uses
a deep-parsing technique in order to extract its triples.
However, the cost of this technique is a slow process-
ing speed (about 7-8 times slower than ReVerb on the
same machine).
5.4 Performance
We tested our system on a PC with Windows XP
64bit, Intel Core2 Quad CPU Q9550 (2.83GHz), and
8 GB RAM. The system is implemented in Java and
uses the aforementioned libraries (LingPipe, Stanford
Parser, etc.). Due to the NLP techniques which ana-
lyze the given texts in detail, the system needs about
3 seconds to process one document. The runtimes
for the American actors dataset were about 128 min-
utes, and about 64 minutes for the American physi-
cists dataset. This results in an average time of about
20 documents per minute, or 3 seconds per document.
In contrast, ReVerb achieved a value of about 160
documents per minute. However, we must emphasize
that performance was not yet the focus of our system.
For example, we do not make use of the multicore
CPU by using parallel threads. Parallelization, or distribution of the application across a cluster, would easily be possible, because each document can be processed independently.
Figure 3 shows the distribution of the processing
time for each step of the FactRunner system. Most of
the time is spent in sentence simplification because it
requires the analysis of the natural language text us-
ing an NLP tool. Nevertheless, the additional effort
pays off as the extraction is simplified to a great extent.
FactRunner:FactExtractionoverWikipedia
429
Table 2: Comparison of FactRunner and ReVerb.
FactRunner ReVerb
American Actor American Physicist American Actor American Physicist
# Articles 2,483 1,298 2,483 1,298
# Considered Sentences 27,489 15,371 40,760 25,625
# Extracted Triples 51,088 26,967 52,817 33,620
Triples/Sentence 1.86 1.75 1.30 1.31
Figure 3: Distribution of processing time.
In a previous version of the system, the extractor
required much more resources because it had to apply
the NLP techniques on more complex sentences.
6 RELATED WORK
There are two major bodies of work in the area of in-
formation extraction (IE). The first body of work re-
lies on pattern-based techniques, whereas the second
relies on natural language processing methods.
6.1 Pattern-based Techniques
Pattern-based techniques, as the term implies, identify patterns in order to extract information from the target corpus. Patterns are skeletal frames of text fragments which can be used to extract relations from a given text corpus. Such pattern-based extraction systems are able to extract a relation if a given pattern matches an existing relation in the document collection (Blohm, 2010). Before the extraction process, pattern-based techniques generally need to prepare a limited number of patterns that are taken into account during extraction.
Pattern-based approaches perform well when consid-
ering precision, but they cannot capture information
that does not use the predefined patterns. KnowItAll
(Etzioni et al., 2004), Snowball (Agichtein and Gra-
vano, 2000), DIPRE (Brin, 1999), WOE (Wu and
Weld, 2010), and YAGO (Suchanek et al., 2007) are
all successful existing systems which use this extrac-
tion paradigm. The YAGO ontology uses Wikipedia
as its source, as we do in our approach. YAGO combines the Wikipedia category system with the WordNet lexical ontology in order to extract information, which in turn creates a larger ontology.
6.2 Methods based on Natural
Language Processing
Our approach follows the other direction of extraction techniques, i.e., natural language processing (NLP). This technique relies on NLP tools which fo-
cus on analyzing natural language text. The NLP ap-
proach is able to handle an unbounded number of rela-
tions because it does pattern recognition and not pat-
tern matching. An NLP-based approach can capture
a large amount of information from many sources, for
instance the World Wide Web. Nevertheless, natural
language processing is a complex and ambiguous pro-
cess. To the best of our knowledge, there are no NLP
tools that can perfectly understand natural language
text, and thus the resulting errors cannot be avoided. Our approach combines NLP techniques with the exploitation of the metadata already existing in the text, in order to perform precise relation extraction from English natural language text.
TextRunner (Banko et al., 2007) is a good exam-
ple showing the benefits of NLP-based techniques. It
is built on the Open Information Extraction (OIE) paradigm. OIE is an approach to information extraction that provides relation independence and high scalability, and it can handle a large corpus such as the Web.
Similar to our approach, TextRunner is able to extract
information from a vast variety of relations located
in a given corpus with only a single pass. However,
our approach mainly targets Wikipedia articles and their metadata, not the World Wide Web. In short,
our goal is to further enhance the extraction process
by utilizing the already existing metadata available in
the Wikipedia collections.
Another good representative of the NLP-based ap-
proach is KYLIN (Weld et al., 2009). KYLIN uses
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
430
Wikipedia infoboxes as a training dataset in order to
extract information from Wikipedia and the Web. The
result of this approach is a system that automatically
creates new infoboxes for articles which do not have
one, as well as completing infoboxes with missing
attributes. Unfortunately, not all Wikipedia articles
have infoboxes. Infoboxes also suffer from several
problems: incompleteness, inconsistency, schema drift, a type-free system, irregular lists, and flattened categories (Wu and Weld, 2007). In contrast
to KYLIN, our approach uses the existing metadata
available in almost every Wikipedia article. Our ap-
proach treats each sentence with metadata individually; thus, we do not rely on any specific, limited resource as a training dataset. Furthermore, although KYLIN can learn to extract values for arbitrary attributes, its attribute set is limited to those attributes occurring in Wikipedia infoboxes.
7 CONCLUSIONS
AND OUTLOOK
In this paper, we have presented a novel approach for
extracting facts from the Wikipedia collection. Our
approach utilizes metadata as a resource to indicate
important sentences in Wikipedia documents. We
also have implemented techniques to resolve ambigu-
ous entities (i.e., persons) in those sentences. Further-
more, we simplify complex sentences in order to produce better triples (i.e., each triple contains only one fact of a sentence) as the final result of our system. The simplification also improves
the runtime of our system. We also apply lemmatiza-
tion of the triples’ predicates to have a uniform repre-
sentation, which simplifies querying and browsing of
the extracted information. The evaluation has shown
that we can achieve a high precision of about 75%.
For future work, we have several ideas to improve
the current version of the system. The first idea is
to improve our scoring component. Triples extracted
from the first sentences of a document could get
higher scores as these sentences usually contain very
clear facts. Furthermore, the transformations which we have applied during the preprocessing step should also be taken into account in the score of a triple (e.g.,
a triple with a subject derived from coreferencing is
less certain). Another important issue is the consol-
idation and normalization of the triples. We already
apply lemmatization and entity resolution, but further
consolidation according to the semantics of predicates
would be helpful for querying. Nevertheless, we think
that our proposed system provides a good basis for ex-
tracting high quality triples from Wikipedia.
ACKNOWLEDGEMENTS
This work has been supported by the German
Academic Exchange Service (DAAD, http://
www.daad.org) and by the DFG Research Cluster on
Ultra High-Speed Mobile Information and Commu-
nication (UMIC, http://www.umic.rwth-aachen.de).
REFERENCES
Agichtein, E. and Gravano, L. (2000). Snowball: Extracting
relations from large plain-text collections. In Proc. 5th
ACM Intl. Conf. on Digital Libraries, pages 85–94.
Alias-i (2008). LingPipe 4.1.0.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M.,
and Etzioni, O. (2007). Open information extraction
from the web. In Veloso, M. M., editor, Proc. 20th Intl.
Joint Conf. on Artificial Intelligence (IJCAI), pages
2670–2676, Hyderabad, India.
Blohm, S. (2010). Large-scale pattern-based information
extraction from the world wide web. PhD thesis, Karl-
sruhe Institute for Technology (KIT).
Brin, S. (1999). Extracting patterns and relations from the
world wide web. In Atzeni, P., Mendelzon, A. O., and
Mecca, G., editors, Proc. Intl. Workshop on The World
Wide Web and Databases (WebDB), volume 1590 of
Lecture Notes in Computer Science, pages 172–183.
Springer.
Burton-Jones, A., Storey, V. C., Sugumaran, V., and Purao,
S. (2003). A heuristic-based methodology for seman-
tic augmentation of user queries on the web. In Proc.
22nd Intl. Conf. on Conceptual Modeling (ER), vol-
ume 2813 of LNCS, pages 476–489.
Defazio, A. (2009). Natural language question answering
over triple knowledge bases. Master’s thesis, Aus-
tralian National University.
Dong, H., Hussain, F., and Chang, E. (2008). A sur-
vey in semantic search technologies. In Proc. 2nd
Intl. Conf. on Digital Ecosystems and Technologies
(DEST), pages 403–408. IEEE.
Etzioni, O., Cafarella, M. J., Downey, D., Kok, S., Popescu,
A.-M., Shaked, T., Soderland, S., Weld, D. S., and
Yates, A. (2004). Web-scale information extraction
in knowitall: (preliminary results). In Proc. WWW,
pages 100–110.
Etzioni, O., Fader, A., Christensen, J., Soderland, S., and
Mausam (2011). Open information extraction: The
second generation. In Proc. IJCAI, pages 3–10,
Barcelona, Spain.
Halevy, A. Y., Etzioni, O., Doan, A., Ives, Z. G., Madhavan,
J., McDowell, L., and Tatarinov, I. (2003). Crossing
the structure chasm. In Proc. 1st Biennal Conference
on Innovative Data Systems Research (CIDR), Asilo-
mar, CA, USA.
Heath, T. and Bizer, C. (2011). Linked Data: Evolving the
Web into a Global Data Space. Synthesis Lectures on
the Semantic Web. Morgan & Claypool Publishers.
FactRunner:FactExtractionoverWikipedia
431
Hoffart, J., Suchanek, F. M., Berberich, K., Lewis-Kelham,
E., de Melo, G., and Weikum, G. (2011). Yago2: Ex-
ploring and querying world knowledge in time, space,
context, and many languages. In Proc. WWW (Com-
panion Volume), pages 229–232.
Klein, D. and Manning, C. D. (2003). Accurate unlexical-
ized parsing. In Hinrichs, E. W. and Roth, D., ed-
itors, Proc. 41st Annual Meeting of the Association
for Computational Linguistics (ACL), pages 423–430,
Sapporo, Japan.
Mangold, C. (2007). A survey and classification of seman-
tic search approaches. International Journal of Meta-
data, Semantics and Ontologies, 2(1):23–34.
Rusu, D., Dali, L., Fortuna, B., Grobelnik, M., and
Mladenic, D. (2007). Triplet extraction from sen-
tences. In Proc. 10th Intl. Multiconference on Infor-
mation Society, volume A, pages 218–222.
Suchanek, F. M., Kasneci, G., and Weikum, G. (2007).
Yago: A Core of Semantic Knowledge. In Proc.
WWW.
Toutanova, K., Klein, D., Manning, C. D., and Singer, Y.
(2003). Feature-rich part-of-speech tagging with a
cyclic dependency network. In Proc. Intl. Conf. of
the North American Chapter of the Association for
Computational Linguistics on Human Language Tech-
nology - Volume 1, pages 173–180, Stroudsburg, PA,
USA.
Weld, D. S., Hoffmann, R., and Wu, F. (2009). Using
Wikipedia to bootstrap open information extraction.
SIGMOD Record, 37(4):62–68.
Wu, F. and Weld, D. S. (2007). Autonomously semantifying
wikipedia. In Silva, M. J., Laender, A. H. F., Baeza-
Yates, R. A., McGuinness, D. L., Olstad, B., Olsen,
Ø. H., and Falcão, A. O., editors, Proc. 16th Conf. on
Information and Knowledge Management (CIKM),
pages 41–50, Lisbon, Portugal. ACM.
Wu, F. and Weld, D. S. (2010). Open information extraction
using wikipedia. In Hajic, J., Carberry, S., and Clark,
S., editors, Proc. 48th Annual Meeting of the Associa-
tion for Computational Linguistics (ACL), pages 118–
127, Uppsala, Sweden.
WEBIST2013-9thInternationalConferenceonWebInformationSystemsandTechnologies
432