Extraction of Biographical Data from Wikipedia
Robert Viseur
Centre of Excellence in Information and Communication Technologies, Rue des Frères Wright,
29/3, B-6041 Charleroi Belgium
Université de Mons, Faculté Polytechnique, Rue de Houdain, 9, 7000 Mons, Belgium
Keywords: Wikipedia, DBpedia, Biography, Text Mining, Open Data.
Abstract: Using the content of Wikipedia articles is common in academic research. However, the practicalities are
rarely analysed. Our research focuses on extracting biographical information about personalities from
Belgium. Our research is divided into three sections. The first section describes the state of the art for data
extraction from Wikipedia. A second section presents the case study about data extraction for biographies of
Belgian personalities. Different solutions are discussed and the solution adopted is implemented. In the third
section, the quality of the extraction is discussed. Practical recommendations for researchers wishing to use
Wikipedia are also proposed on the basis of our case study.
1 INTRODUCTION
Wikipedia (wikipedia.org) is a collaborative
multilingual encyclopedia launched in 2001. The
project has been supported financially since 2003 by
the Wikimedia Foundation
(wikimediafoundation.org). The volume of the
encyclopedia has grown steadily since its inception.
In January 2013, the largest editions of Wikipedia were the English edition (more than four million articles), the German edition (more than one and a half million articles), the French edition (more than one million three hundred thousand articles) and the Dutch edition (over one million one hundred thousand articles).
In recent years, academic research and practical
examples of using Wikipedia content have
increased. Hu et al. (2009) used it to improve the
performance of a system for clustering documents.
Kazama and Torisawa (2007) and Charton et al. (2010) used it to improve named entity recognition systems.
Buscaldi and Rosso (2006) improved the
performance of a Question Answering technology.
The BBC used it to interconnect information in its internal databases and to enrich them with external data sources (Kobilarov et al., 2009). The “Exploiting
Wikipedia” query on the scientific search engine
Google Scholar (scholar.google.fr) returns more than
22,000 results!
Our research relates to the extraction of
biographical data about people from Belgium. Using
Wikipedia to supply a biographical database seems appropriate, given the breakdown of the encyclopedia's content by type. Indeed, articles related to biographies represented 15% of the total content in January 2008, behind articles about culture and the arts (Kittur et al., 2009).
However, several questions arise.
a) The French, German and Dutch editions of
Wikipedia are useful, because these languages
are the three national languages of Belgium.
However, it is difficult, on this basis, to identify the volume of content about Belgium rather than about France, Germany or the Netherlands.
b) Many papers exploit Wikipedia content.
However, few give guidance concerning the
practical difficulties associated with the
extraction of data from Wikipedia. Successful extraction involves knowing how to identify relevant articles, but also how to extract the desired data from their content.
Our research is organized into three sections. The
first section will provide a state of the art about data
extraction in Wikipedia. A second section will
present the case study of the extraction of
biographical data about Belgians. Different solutions
will be discussed and the chosen solution will be
implemented. In the third section, the quality of the
extraction will be discussed. Practical
recommendations for researchers wishing to use
Wikipedia content will also be offered on the basis
of our practical example.
2 STATE OF THE ART
The extraction of biographies has already been undertaken by Biadsy et al. (2008). However, the approach adopted by these authors was different from ours. They developed a multi-document summarization system, based on a classifier of biographical sentences and on an ordering component for the sentences deemed of interest. They relied on the articles using the Wikipedia template for biographies and were able to extract nearly 17,000 articles. The processing was done on the Wikipedia XML copy available online.
In practice, the use of XML copies is not the
only way to manipulate the contents of the
encyclopedia. On the one hand, information
extraction is possible using reverse engineering tools
directly on the pages published online. On the other
hand, a structured version of Wikipedia has been
available since 2007, called Dbpedia.
DBpedia (dbpedia.org) is a community effort
that started in 2007 (Auer et al., 2007). It aims to
extract structured information from Wikipedia and
to make this information available on the Web. The
extraction process is based on copies of the
Wikipedia database (“database dump”). The data is kept up to date through a live stream referencing the updates of the encyclopedia (Hellmann et al., 2009). The
extractor is based on the content of articles, and
especially on the associated Infobox. The Infoboxes
appear in tabular form in the upper right-hand corner
of numerous articles and present factual information.
The content extracted from the encyclopedia is
converted into RDF format. Several mechanisms are offered to access and explore DBpedia: access to the RDF data by URI (Uniform Resource Identifier), use of Web agents (e.g. browsers for the Semantic Web) and SPARQL endpoints to query DBpedia with a language reminiscent of the SQL used for relational databases.
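As an illustration of URI-based access, the following minimal Python sketch dereferences a resource through the JSON serialization that DBpedia exposes under dbpedia.org/data/. The requests library and the endpoint layout are assumptions of this sketch, not tools used in the original experiment.

import requests

def fetch_resource(name):
    # DBpedia serves a JSON serialization of each resource's RDF data
    # at http://dbpedia.org/data/<Resource>.json (assumed layout).
    url = "http://dbpedia.org/data/%s.json" % name
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

data = fetch_resource("Belgium")
# Top-level keys are subject URIs; each maps property URIs to value lists.
subject = "http://dbpedia.org/resource/Belgium"
for prop, values in data.get(subject, {}).items():
    print(prop, values[0].get("value"))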
DBpedia appears to be a partial solution for the extraction of data from Wikipedia content. The ease of querying offered by the SPARQL language for identifying relevant articles makes it an attractive tool. However,
DBpedia has several limitations.
Firstly, the language coverage of DBpedia is currently limited to 13 languages (see “International
DBpedia chapters”, dbpedia.org). At its inception in
2007, DBpedia was only available in English. A
project for the French language was launched in late
2012. Called Sémanticpédia (www.semanticpedia.org), it combines the efforts of the French Ministry
of Culture and Communication, Wikimedia France
and INRIA to produce a French version of DBpedia
(fr.dbpedia.org).
Secondly, the extraction process is based primarily on the content of Infoboxes (Auer et al., 2007; Hellmann et al., 2009). However, a quick review of Wikipedia articles shows that not all the pages of the encyclopedia offer an Infobox, and that Infoboxes are not always complete. Part of the information contained in the articles thus escapes the extractors. Nevertheless, DBpedia already claimed nearly 2 million references at its inception (Auer et al., 2007).
3 CASE STUDY: EXTRACTING
BIOGRAPHICAL DATA ABOUT
BELGIANS
3.1 Identification of Relevant Articles
We first compared two approaches: firstly, the
querying of DBpedia from English and French
access points and, secondly, the identification of
relevant articles using crawling techniques on the website of the encyclopedia.
The querying of English and French DBpedia was performed with the SPARQL query language, filtering on the “birthplace” property (with the value “Belgique” for the French language and “Belgium” for the English language).
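As an indication, such a query can be issued programmatically; the sketch below uses the SPARQLWrapper Python library. The property and class names (dbpedia-owl:birthPlace, foaf:Person) are assumptions based on the DBpedia ontology of the time; our experiment only specifies that the “birthplace” property was filtered on.

from SPARQLWrapper import SPARQLWrapper, JSON

# List the people whose birth place is Belgium (assumed vocabulary).
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?person WHERE {
      ?person a foaf:Person ;
              dbpedia-owl:birthPlace <http://dbpedia.org/resource/Belgium> .
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["person"]["value"])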
The identification of Belgian personalities'
biographies was performed in two stages. The first
step takes as its starting point the Wikipedia page
about Belgians (http://fr.wikipedia.org/wiki/Cat%C3%A9gorie:Personnalit%C3%A9_belge),
starting from the Belgian Wikipedia portal
(http://fr.wikipedia.org/wiki/Portal:Belgium). A
recursive crawl was performed on this page and on the subsequent category pages in order to identify the category pages containing information about
Belgians. This mechanism allowed us to find more
than 700 relevant categories. The URLs of these
categories were stored. The second step then
explored the category pages and identified
Wikipedia articles devoted to Belgians. The URLs of these articles were saved in a file. More than 10,000
items were collected through this method (see Table
1).
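A simplified Python sketch of such a two-stage crawl is given below (using the requests library). The link-filtering heuristics, the keyword used to retain relevant categories and the absence of throttling are simplifications of this sketch; the actual crawler was more careful.

import re
import requests

START = ("http://fr.wikipedia.org/wiki/"
         "Cat%C3%A9gorie:Personnalit%C3%A9_belge")

def crawl(start_url, keyword="belge"):
    # Stages 1 and 2 combined: follow category pages whose title
    # contains the keyword, and collect the article links they list.
    seen, categories, articles = set(), set(), set()
    stack = [start_url]
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=30).text
        for href in re.findall(r'href="(/wiki/[^"#]+)"', html):
            full = "http://fr.wikipedia.org" + href
            if "Cat%C3%A9gorie:" in href:          # sub-category page
                if keyword in href.lower():
                    categories.add(full)
                    stack.append(full)
            elif ":" not in href[len("/wiki/"):]:  # plain article page
                articles.add(full)
    return categories, articles

categories, articles = crawl(START)
print(len(categories), "categories,", len(articles), "articles")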
The classical method of crawling Wikipedia thus proves far more fruitful in volume than querying DBpedia.
ExtractionofBiographicalDatafromWikipedia
249
Table 1: Number of items per method.

Method            Number of results
DBpedia (en)                    899
DBpedia (fr)                    200
Wikipedia (fr)               10,884
3.2 Data Extraction from the Text
3.2.1 Extraction Process
A copy of the articles was saved locally. In practice, we worked on the text of the articles in the specific Wikipedia format. This version is accessible from URLs following the template http://fr.wikipedia.org/w/index.php?action=raw&title=xxxxx, and provides plain text (text + Mediawiki syntax) without HTML tags, starting with an Infobox when there is one.
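The raw version of an article can be retrieved with a few lines of Python (again assuming the requests library):

import requests

def fetch_wikitext(title):
    # action=raw returns the article source (Mediawiki syntax, no HTML).
    url = "http://fr.wikipedia.org/w/index.php"
    return requests.get(url, params={"action": "raw", "title": title},
                        timeout=30).text

text = fetch_wikitext("Paul Deschanel")
print(text[:300])  # starts with the Infobox when there is one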
The plain text is analysed through two
operations. The first one is to extract the Infobox
when it exists. The second one is to identify
sentences in the biography that may contain
important biographical information such as date of
birth, date of death and professional activity. In
practice, the first sentence of the article is always used, because by convention it often contains the most important information about the person. It may be supplemented by a second sentence, if the latter matches a set of triggering words. This treatment results
in a condensed biography, which is saved for each
article. These condensed biographies then pass
through a set of regular expressions to extract the
date of birth, the date of death (if the person is dead)
and his/her profession. This structured data is stored
in a CSV file.
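The sketch below gives a simplified Python rendition of this pipeline. The trigger words, the sentence splitter and the single date pattern are illustrative choices of the sketch; the actual extractor relied on a richer set of regular expressions and would also need the {{date|...}} templates and piped [[links|labels]] handled.

import csv
import re

TRIGGERS = ("né", "née", "mort", "morte", "décédé", "décédée")
MONTHS = ("janvier|février|mars|avril|mai|juin|juillet"
          "|août|septembre|octobre|novembre|décembre")
DATE_RE = re.compile(r"\d{1,2}(?:er)?\s+(?:%s)\s+\d{3,4}" % MONTHS)

def condense(text):
    """Keep the first sentence, plus the second if it contains a trigger."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    summary = sentences[0] if sentences else ""
    if len(sentences) > 1 and any(t in sentences[1] for t in TRIGGERS):
        summary += " " + sentences[1]
    return summary

def extract_row(name, wikitext):
    plain = re.sub(r"\[\[|\]\]|'''", "", wikitext)  # strip basic markup
    dates = DATE_RE.findall(condense(plain))
    # Naive convention for this sketch: first date = birth, second = death.
    birth = dates[0] if dates else ""
    death = dates[1] if len(dates) > 1 else ""
    return [name, birth, death]

sample = ("'''Robert Gruslin''' né à [[Rochefort]] le [[18 mars]] [[1901]], "
          "décédé à [[Profondeville]] le 1er juin [[1985]].")
with open("biographies.csv", "w", newline="", encoding="utf-8") as out:
    csv.writer(out).writerow(extract_row("Robert Gruslin", sample))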
This file contains 10,610 entries, with the following fields: name, date of birth, date of death, professional activity, URL of the category and URL of the article in HTML format (see Table 2). From an initial total of 10,884 items, 57.6% allow extraction of the date of birth, 26.9% of the date of death and 56.3% of the profession (the category generally provides alternative information if the extraction fails, since categories often indicate a profession or a social function). Only 27.4% of the articles have an Infobox.

Table 2: Volumetrics (extraction process).

Step                               Articles    Share
Number of articles                   10,884   100.0%
Number of Infoboxes                   2,980    27.4%
Number of condensed biographies      10,610    97.5%
Successful extractions:
  Date of birth                       6,269    57.6%
  Date of death                       2,936    26.9%
  Profession                          6,129    56.3%
3.2.2 Main Difficulties
We encountered four main difficulties.
Firstly, the articles are accompanied by an Infobox in less than one case out of three. This makes it necessary to use text analysis techniques to extract the dates (birth, death) and professions. The extraction of dates is particularly difficult because the articles often include other dates (dates related to important events in the people's lives). The extraction uses a set of regular expressions, which are difficult to write for the non-specialist.
Secondly, even when an Infobox is present, its field names are not homogeneous. The date of birth may be introduced by date_naissance, date naissance, date de naissance, date_de_naissance or naissance. A preliminary grouping is necessary. This presents no major technical difficulty.
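A sketch of such a grouping is shown below, assuming the Infobox has already been isolated as raw wikitext with one “| field = value” pair per line:

import re

# Variant field names observed for the date of birth, grouped under a
# single canonical key before further processing.
BIRTH_KEYS = {"date_naissance", "date naissance", "date de naissance",
              "date_de_naissance", "naissance"}

def infobox_fields(infobox_text):
    """Map canonicalised Infobox field names to their raw values."""
    fields = {}
    for line in infobox_text.splitlines():
        match = re.match(r"\s*\|\s*([^=]+?)\s*=\s*(.*)", line)
        if match:
            key, value = match.group(1).lower(), match.group(2)
            if key in BIRTH_KEYS:
                key = "date_de_naissance"
            fields[key] = value
    return fields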
Table 3: Heterogeneity of date formats (examples).
([[Bree]], [[12 avril]] [[1876]] – [[Ixelles]], [[14
septembre]] [[1953]])
([[Pétange]], {{Date de
naissance|12|juillet|1817}} - Pétange, {{Date de
décès|14|mai|1898}}])
né le [[12 janvier]] [[1597]] à [[Bruxelles]]
([[Belgique]]) et mort le [[12 juillet]] [[1643]] à
[[Livourne]] ([[Italie]])
'''Ellen Petri''' (née le 25 mai [[1982]],
[[Merksem]] ([[Anvers]]))
'''Paul Deschanel''', né le {{date|13|février|1855}}
à [[Schaerbeek]] ([[Bruxelles]]) et décédé le
{{date|28|avril|1922}} à [[Paris]]
'''Robert Gruslin''' né à [[Rochefort
(Belgique)|Rochefort]] le [[18 mars]] [[1901]],
décédé à [[Profondeville]] le {{1er juin}}
[[1985]]
Thirdly, the date formats are not homogeneous,
either in the text or in the Infobox (see Table 3).
Dates can be written with numbers only, with the
month in letters or be supplemented by other
information such as place of birth or the type of
activity for which the person is famous.
Fourthly, the screening of sentences useful for data extraction requires a more advanced implementation than the technique used here. A classifier such as the one implemented by Biadsy et al. (2008) would be a worthwhile investment to improve the overall performance of the extraction.
DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications
250
3.2.3 Error Rates
The evaluation was conducted on a set of 2,980 entries (i.e. entries including an Infobox). The dates of birth extracted from the text of the articles were compared with those provided in the Infobox. As the content of the Infobox is structured, its extraction is significantly simplified, and the data extracted from it can be considered free of extraction errors.
Table 4: Extraction error rate (date of birth).

                              Items    Share   Share (comparable)
Total number of items         2,980     100%
No possible comparison        1,336    44.8%
  Infoboxes without date        743    24.9%
Possible comparison           1,644    55.2%    100%
  Identical dates             1,486             90.4%
  Different dates               158              9.6%
    Partial information         126              7.7%
    Extraction error             32              1.9%
A comparison was made between the data extracted
from the text of Wikipedia articles and data
extracted from the Infobox (see Table 4). The test
was carried out for 2,980 birthdates (100%). The
comparison was performed on 1,644 dates for which
the data was present in the Infobox and in the result
of the extraction from the text of the article.
Different dates are found in 9.6% of cases. However, 7.7% of the dates were correct but carried incomplete information: typically, the year of birth was extracted, but not the full date (e.g. mai 1988 vs. 1988). The information extracted from the text could in fact be more complete than that extracted from the Infobox, since information may be found in the text but not in the Infobox.
The presence of information in the Infobox but not in the extraction result is due to extraction errors: in practice, the information given in the Infobox always seems to be present in the text as well. Since 2,237 Infoboxes contain a date of birth but only 1,644 comparisons were possible, at least 593 extractions from the text failed, which sets a lower limit of 19.9% on the rate of failed extractions from the text.
Almost two thirds of the people were born after 1900 (63.1% of the dates of birth given in the Infoboxes). The low number of dates of death is therefore likely due to the average age of the registered persons rather than to extraction errors.
This method presents two difficulties. On the one hand, date formats may differ between data extracted from the text and data extracted from the Infobox (example: 8 mars 1965 vs. 8 03 1965). A method for converting dates is therefore necessary to standardize the format. Mediawiki tags and
additional information can also accompany the date
of birth (e.g. date_de_naissance = [[28 juillet en
sport|28 juillet]] [[1982 en football|1982]]). On the
other hand, the structure of the Infobox is not
standardized and field names may vary from one
item to another.
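A sketch of such a conversion for French-language dates is given below; the token-based heuristic (any number above 31 is a year, the first small number is the day) is an assumption of the sketch and would need refinement for edge cases such as years below 32.

import re

# Convert heterogeneous date spellings to a single yyyy-mm-dd form, so
# that values extracted from the text can be compared with values
# extracted from the Infobox.
FRENCH_MONTHS = {"janvier": 1, "février": 2, "mars": 3, "avril": 4,
                 "mai": 5, "juin": 6, "juillet": 7, "août": 8,
                 "septembre": 9, "octobre": 10, "novembre": 11,
                 "décembre": 12}

def normalise(date_string):
    """e.g. '8 mars 1965', '8 03 1965', '{{date|8|mars|1965}}' -> '1965-03-08'."""
    tokens = re.findall(r"[a-zéû]+|\d+", date_string.lower())
    day = month = year = None
    for token in tokens:
        if token.isdigit():
            value = int(token)
            if value > 31:           # heuristic: large number = year
                year = value
            elif day is None:
                day = value
            else:
                month = value
        elif token in FRENCH_MONTHS:
            month = FRENCH_MONTHS[token]
    if None in (day, month, year):
        return None                  # incomplete date, cannot compare
    return "%04d-%02d-%02d" % (year, month, day)

print(normalise("8 mars 1965"), normalise("8 03 1965"))  # both '1965-03-08'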
4 DISCUSSIONS
AND PERSPECTIVES
This extraction work was initiated with the idea that DBpedia would easily allow us to obtain the biographical data we wanted through the SPARQL query language. A first test showed that the volume available with DBpedia was significantly smaller than what could be obtained with conventional techniques of crawling the Wikipedia website. The
DBpedia project is essential for researchers
participating in projects related to linked data or
wishing to have a controlled vocabulary. However, it
shows its limits in terms of completeness on specific
topics.
The existence of the DBpedia project and the visibility of structured Infoboxes may give the impression that Wikipedia lends itself to easy data retrieval. However, it is clear from our experiments that, firstly, the Infoboxes are far from systematic (less than 30% of the articles considered possess one) and, secondly, the structure of the Infobox is not completely homogeneous. Nevertheless, the existence of a set of conventions, in the form of markup or sentence patterns for dates and professions, makes it feasible to extract content from the articles without requiring sophisticated techniques.
This research offers several perspectives. Firstly, the influence of the formulation of the SPARQL queries on the results should be studied further. Secondly, the consistency of the information extracted in different languages should be checked. Thirdly, a comparison with more general extraction methods and tools (e.g. OpenNLP, ReVerb or TextRunner) should be carried out. Fourthly, the reliability of the data in the encyclopedia should be checked. This work is ongoing and is based on a comparison with reference data. Disambiguation is one of the challenges to be addressed in order to automate this comparison.
ExtractionofBiographicalDatafromWikipedia
251
REFERENCES
Auer S., Bizer C., Kobilarov G., Lehmann J., Cyganiak R.,
Ives Z., 2007. DBpedia: A Nucleus for a Web of Open
Data, Lecture Notes in Computer Science, Vol. 4825,
pp 722-735.
Bekavac B., Tadić M., 2008. A Generic Method for Multi
Word Extraction from Wikipedia, Proceedings of the
Int. Conf. on Information Technology Interfaces, June
23-26, 2008.
Biadsy F., Hirschberg J., Filatova E., 2008. An
Unsupervised Approach to Biography Production
using Wikipedia, Proceedings of ACL-08: HLT, pp.
807–815.
Buscaldi D., Rosso P., 2006. Mining Knowledge from
Wikipedia for the Question Answering task,
Proceedings of the 5th International Conference on
Language Resources and Evaluation (LREC 2006).
Charton E., Gagnon M., Ozell B., 2010. Extension d’un
système d’étiquetage d’entités nommées en étiqueteur
sémantique, TALN 2010, 19–23 juillet 2010.
Hellmann S., Stadler C., Lehmann J., Auer S., 2009.
DBpedia Live Extraction, Lecture Notes in Computer
Science, Vol. 5871, pp 1209-1223.
Hu X., Zhang X., Lu C., Park, E. K., Zhou, X., 2009.
Exploiting Wikipedia as external knowledge for
document clustering, KDD '09 Proceedings of the 15th
international conference on Knowledge discovery and
data mining.
Kazama J., Torisawa K., 2007. Exploiting Wikipedia as
External Knowledge for Named Entity Recognition,
Proceedings of the 2007 Joint Conference on
Empirical Methods in Natural Language Processing
and Computational Natural Language Learning, June
2007, pp. 698–707.
Kittur A., Chi E.H., Suh B., 2009. What's in Wikipedia?:
Mapping Topics and Conflict using Socially
Annotated Category Structure, Proceedings of the 27th
international Conference on Human Factors in
Computing Systems, April 04-09, 2009.
Kobilarov G., Scott T., Raimond Y., Oliver S., Sizemore
C., Smethurst M., Bizer C., Lee R., 2009. Media meets
Semantic Web - How the BBC uses DBpedia and Linked Data to make Connections, ESWC 2009, pp.
723-737.
DATA2013-2ndInternationalConferenceonDataManagementTechnologiesandApplications
252