Wikipedia infoboxes as a training dataset in order to extract information from Wikipedia and the Web. The result of this approach is a system that automatically creates new infoboxes for articles that do not have one and completes infoboxes with missing attributes. Unfortunately, not all Wikipedia articles have infoboxes, and the existing infoboxes suffer from several problems: incompleteness, inconsistency, schema drift, a type-free system, irregular lists, and flattened categories (Wu and Weld, 2007). In contrast to KYLIN, our approach uses the metadata available in almost every Wikipedia article. It treats each sentence with metadata individually and thus does not rely on any specific, limited resource as a training dataset. Furthermore, although KYLIN can learn to extract values for arbitrary attributes, its attribute set is limited to the attributes occurring in Wikipedia infoboxes.
7 CONCLUSIONS AND OUTLOOK
In this paper, we have presented a novel approach for extracting facts from the Wikipedia collection. Our approach utilizes metadata as a resource to identify important sentences in Wikipedia documents. We have also implemented techniques to resolve ambiguous entities (i.e., persons) in those sentences. Furthermore, we simplify complex sentences in order to produce better triples (i.e., each triple contains only one fact, corresponding to one sentence) as the final result of our system. The simplification also improves the runtime of our system. Finally, we lemmatize the triples' predicates to obtain a uniform representation, which simplifies querying and browsing of the extracted information. The evaluation has shown that we can achieve a high precision of about 75%.
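To illustrate the lemmatization step, the following minimal Python sketch normalizes the predicates of extracted triples. It assumes NLTK's WordNetLemmatizer as the lemmatizer; the paper does not prescribe a particular tool, and the example triples are illustrative.

    # Minimal sketch of predicate lemmatization for extracted triples.
    # Assumes NLTK with the WordNet data installed (nltk.download('wordnet'));
    # the choice of lemmatizer is an assumption, not the paper's tool.
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    def normalize_predicate(predicate):
        # Lemmatize each token of the predicate phrase as a verb, so that
        # surface variants such as 'was born in' and 'is born in' collapse
        # to the same uniform form.
        return " ".join(lemmatizer.lemmatize(tok, pos="v")
                        for tok in predicate.lower().split())

    triples = [
        ("Albert Einstein", "was born in", "Ulm"),
        ("Marie Curie", "is born in", "Warsaw"),
    ]
    normalized = [(s, normalize_predicate(p), o) for (s, p, o) in triples]
    # Both predicates now share one lemmatized form, so a query for the
    # birthplace relation needs to match only a single predicate string.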
For future work, we have several ideas to improve the current version of the system. The first is to improve our scoring component: triples extracted from the first sentences of a document could receive higher scores, as these sentences usually contain very clear facts. Furthermore, the transformations applied during the preprocessing step should also be taken into account in the score of a triple (e.g., a triple whose subject was derived by coreference resolution is less certain); a sketch of such a scoring function follows this paragraph. Another important issue is the consolidation and normalization of the triples. We already apply lemmatization and entity resolution, but further consolidation according to the semantics of the predicates would be helpful for querying. Nevertheless, we think that our proposed system provides a good basis for extracting high-quality triples from Wikipedia.
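The proposed scoring refinement could take a form like the Python sketch below. The Triple fields, the position threshold, and the weights are purely illustrative assumptions, not the system's actual implementation.

    # Hypothetical sketch of the proposed scoring refinement; the weights
    # and thresholds below are illustrative assumptions only.
    from dataclasses import dataclass

    @dataclass
    class Triple:
        subject: str
        predicate: str
        obj: str
        sentence_index: int     # position of the source sentence in the article
        from_coreference: bool  # True if the subject came from coreference resolution
        base_score: float       # confidence from the existing scoring component

    def adjusted_score(t):
        score = t.base_score
        # The first sentences of a Wikipedia article usually state very
        # clear facts, so triples extracted from them receive a boost.
        if t.sentence_index < 3:
            score *= 1.2
        # A subject derived by coreference resolution is less certain,
        # so the corresponding triple is penalized.
        if t.from_coreference:
            score *= 0.8
        return score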
ACKNOWLEDGEMENTS
This work has been supported by the German Academic Exchange Service (DAAD, http://www.daad.org) and by the DFG Research Cluster on Ultra High-Speed Mobile Information and Communication (UMIC, http://www.umic.rwth-aachen.de).
REFERENCES
Agichtein, E. and Gravano, L. (2000). Snowball: Extracting relations from large plain-text collections. In Proc. 5th ACM Intl. Conf. on Digital Libraries, pages 85–94.
Alias-i (2008). LingPipe 4.1.0.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open information extraction from the web. In Veloso, M. M., editor, Proc. 20th Intl. Joint Conf. on Artificial Intelligence (IJCAI), pages 2670–2676, Hyderabad, India.
Blohm, S. (2010). Large-scale pattern-based information extraction from the world wide web. PhD thesis, Karlsruhe Institute of Technology (KIT).
Brin, S. (1999). Extracting patterns and relations from the world wide web. In Atzeni, P., Mendelzon, A. O., and Mecca, G., editors, Proc. Intl. Workshop on The World Wide Web and Databases (WebDB), volume 1590 of Lecture Notes in Computer Science, pages 172–183. Springer.
Burton-Jones, A., Storey, V. C., Sugumaran, V., and Purao, S. (2003). A heuristic-based methodology for semantic augmentation of user queries on the web. In Proc. 22nd Intl. Conf. on Conceptual Modeling (ER), volume 2813 of Lecture Notes in Computer Science, pages 476–489.
Defazio, A. (2009). Natural language question answering over triple knowledge bases. Master's thesis, Australian National University.
Dong, H., Hussain, F., and Chang, E. (2008). A survey in semantic search technologies. In Proc. 2nd Intl. Conf. on Digital Ecosystems and Technologies (DEST), pages 403–408. IEEE.
Etzioni, O., Cafarella, M. J., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. (2004). Web-scale information extraction in KnowItAll: (preliminary results). In Proc. WWW, pages 100–110.
Etzioni, O., Fader, A., Christensen, J., Soderland, S., and Mausam (2011). Open information extraction: The second generation. In Proc. IJCAI, pages 3–10, Barcelona, Spain.
Halevy, A. Y., Etzioni, O., Doan, A., Ives, Z. G., Madhavan, J., McDowell, L., and Tatarinov, I. (2003). Crossing the structure chasm. In Proc. 1st Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA.
Heath, T. and Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web. Morgan & Claypool Publishers.