Wikipedia content will also be offered on the basis
of our practical example.
2 STATE OF THE ART
The extraction of biographies from Wikipedia has already been
undertaken by Biadsy et al. (2008). However, the approach adopted
by those authors differs from ours. They developed a
multi-document summarization system based on a classifier of
biographical sentences and on an ordering component for the
sentences deemed of interest. They relied on the articles using
the Wikipedia biography template and were able to extract nearly
17,000 articles. The processing was carried out on the Wikipedia
XML dump available online.
In practice, the use of XML copies is not the
only way to manipulate the contents of the
encyclopedia. On the one hand, information can be extracted by
applying reverse-engineering tools directly to the pages published
online. On the other hand, a structured version of Wikipedia,
called DBpedia, has been available since 2007.
DBpedia (dbpedia.org) is a community effort
that started in 2007 (Auer et al., 2007). It aims to
extract structured information from Wikipedia and
to make this information available on the Web. The
extraction process is based on copies of the
Wikipedia database (“database dumps”). The data is kept up to date
by means of a feed referencing the updates made to the
encyclopedia (Hellmann et al., 2009). The
extractor is based on the content of articles, and
especially on the associated Infobox. The Infoboxes
appear in tabular form in the upper right-hand corner
of numerous articles and present factual information.
The content extracted from the encyclopedia is
converted into RDF format. Several mechanisms are provided to
access and explore DBpedia: access to the RDF data by URI (Uniform
Resource Identifier), the use of Web agents (e.g. browsers for the
Semantic Web), and SPARQL endpoints allowing DBpedia to be queried
with a language reminiscent of the SQL used for relational
databases.
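As a simple illustration of URI-based access, the following Python sketch dereferences a DBpedia resource URI and retrieves its RDF description; the /data/<Resource>.json serialization URL used here is an assumption based on the current public DBpedia instance.

```python
import requests

# Minimal sketch: dereference a DBpedia resource URI and fetch its RDF
# description in the JSON serialization exposed by the public instance
# under /data/ (an assumed convention for this example).
url = "https://dbpedia.org/data/Belgium.json"
response = requests.get(url, timeout=30)
response.raise_for_status()
graph = response.json()

# Top-level keys are subject URIs; each maps predicate URIs to lists of values.
resource = graph.get("http://dbpedia.org/resource/Belgium", {})
print(f"{len(resource)} predicates describe dbr:Belgium")
```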
DBpedia thus appears to be a partial solution for extracting data
from Wikipedia content. The ease of querying afforded by the
SPARQL language when identifying relevant articles makes it an
attractive tool. However, DBpedia has several limitations.
Firstly, the language coverage of DBpedia is currently limited to
13 languages (see “International DBpedia chapters”, dbpedia.org).
At its inception in 2007, DBpedia was only available in English. A
project for the French language was launched in late 2012. Called
Sémanticpédia (www.semanticpedia.org), it combines the efforts of
the French Ministry
of Culture and Communication, Wikimedia France
and INRIA to produce a French version of DBpedia
(fr.dbpedia.org).
Secondly, the extraction process is based primarily on the content
of Infoboxes (Auer et al., 2007; Hellmann et al., 2009). However,
a quick review of Wikipedia articles shows that not all pages of
the encyclopedia offer an Infobox, and that Infoboxes are not
always complete. Part of the information contained in the articles
thus escapes the extractors. Nevertheless, DBpedia already claimed
nearly 2 million references at its inception (Auer et al., 2007).
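This limitation is easy to observe programmatically. The short Python sketch below tests whether a given article exposes an Infobox at all, assuming the usual rendering of Infoboxes as HTML tables carrying the CSS class “infobox” (an assumption about the current page layout, not a detail taken from the extractors themselves).

```python
import requests
from bs4 import BeautifulSoup

# Sketch: detect whether a Wikipedia article carries an Infobox, assuming
# Infoboxes are rendered as <table> elements with the CSS class "infobox".
def has_infobox(article_url: str) -> bool:
    html = requests.get(article_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", class_="infobox") is not None

print(has_infobox("https://en.wikipedia.org/wiki/Jacques_Brel"))  # expected: True
```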
3 CASE STUDY: EXTRACTING
BIOGRAPHICAL DATA ABOUT
BELGIANS
3.1 Identification of Relevant Articles
We first compared two approaches: firstly, the querying of DBpedia
through its English and French endpoints and, secondly, the
identification of relevant articles by crawling the website of the
encyclopedia.
The querying of English and French DBpedia
was performed with the SPARQL query language,
by using the “birthplace” property, with the value “Belgique” for
the French version and “Belgium” for the English one.
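The sketch below, written with the Python SPARQLWrapper library, shows what such a query could look like against the public English endpoint; the endpoint URL and the dbo:birthPlace property name are assumptions based on current DBpedia conventions, and the French endpoint (fr.dbpedia.org/sparql) can be queried analogously.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch of the "birthplace" query against the English DBpedia endpoint.
# Endpoint URL and property names follow current DBpedia conventions.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?name WHERE {
        ?person dbo:birthPlace dbr:Belgium ;
                rdfs:label ?name .
        FILTER (lang(?name) = "en")
    }
    LIMIT 100
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["person"]["value"], "-", row["name"]["value"])
```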
The identification of Belgian personalities'
biographies was performed in two stages. The first
step takes as its starting point the Wikipedia category page
devoted to Belgian personalities (http://fr.wikipedia.org/wiki/
Cat%C3%A9gorie:Personnalit%C3%A9_belge), reached from the Belgian
Wikipedia portal (http://fr.wikipedia.org/wiki/Portail:Belgique).
A recursive crawl was performed on this page and on its
subcategory pages in order to identify the category pages
containing information about Belgians. This mechanism allowed us
to find more than 700 relevant categories, whose URLs were stored.
The second step then explored these category pages and identified
the Wikipedia articles devoted to Belgians. The URLs of these
articles were saved in a file. More than 10,000 articles were
collected through this method (see Table 1).
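The two-stage procedure can be outlined in Python as follows. The element ids used to locate subcategories and articles (mw-subcategories, mw-pages) reflect the standard MediaWiki category-page layout, and the stopping criterion is a simplified assumption, so this is a sketch of the approach rather than the exact crawler we used.

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "https://fr.wikipedia.org"
START = "/wiki/Cat%C3%A9gorie:Personnalit%C3%A9_belge"

def crawl_categories(start_path, max_pages=1000):
    """Stage 1: recursively collect the category pages about Belgians."""
    seen, queue = set(), [start_path]
    while queue and len(seen) < max_pages:
        path = queue.pop()
        if path in seen:
            continue
        seen.add(path)
        soup = BeautifulSoup(
            requests.get(urljoin(BASE, path), timeout=30).text, "html.parser")
        # Subcategory links sit in the div with id "mw-subcategories".
        sub = soup.find(id="mw-subcategories")
        if sub:
            queue.extend(a["href"] for a in sub.find_all("a", href=True))
        time.sleep(1)  # be polite to the servers
    return seen

def collect_articles(category_paths):
    """Stage 2: extract the article URLs listed on each category page."""
    articles = set()
    for path in category_paths:
        soup = BeautifulSoup(
            requests.get(urljoin(BASE, path), timeout=30).text, "html.parser")
        # Article links sit in the div with id "mw-pages".
        pages = soup.find(id="mw-pages")
        if pages:
            articles.update(urljoin(BASE, a["href"])
                            for a in pages.find_all("a", href=True))
        time.sleep(1)
    return articles

categories = crawl_categories(START)      # > 700 categories in our experiment
articles = collect_articles(categories)   # > 10,000 articles
```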
In terms of volume, the classical method of crawling Wikipedia
thus proves much more fruitful than querying DBpedia.