FactRunner: Fact Extraction over Wikipedia

Rhio Sutoyo; Christoph Quix; Fisnik Kastrati

doi:10.5220/0004375604230432

2013

Abstract

The increasing role of Wikipedia as a source of human-readable knowledge is evident, as it contains an enormous amount of high-quality information written in natural language by human authors. However, querying this information with traditional keyword-based approaches often requires a time-consuming, iterative process of exploring the document collection to find the information of interest. A structured representation of information and queries would therefore help users query directly for the relevant information. An important challenge in this context is the extraction of structured information from unstructured knowledge bases, which is addressed by Information Extraction (IE) systems. However, these systems struggle with the complexity of natural language and frequently produce unsatisfactory results. In addition to plain natural language text, Wikipedia contains links that connect a term in one document directly to another document. In our approach to fact extraction from Wikipedia, we consider these links an important indicator of the relevance of the linked information. Thus, our proposed system, FactRunner, focuses on extracting structured information from sentences containing such links. We show that a natural language parser combined with Wikipedia markup can be exploited to extract facts in the form of triple statements with high accuracy.

Paper Citation

in Harvard Style

Sutoyo R., Quix C. and Kastrati F. (2013). FactRunner: Fact Extraction over Wikipedia. In Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-54-9, pages 423-432. DOI: 10.5220/0004375604230432

in Bibtex Style

@conference{webist13,
author={Rhio Sutoyo and Christoph Quix and Fisnik Kastrati},
title={FactRunner: Fact Extraction over Wikipedia},
booktitle={Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2013},
pages={423-432},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004375604230432},
isbn={978-989-8565-54-9},
}

in EndNote Style

TY - CONF
JO - Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - FactRunner: Fact Extraction over Wikipedia
SN - 978-989-8565-54-9
AU - Sutoyo R.
AU - Quix C.
AU - Kastrati F.
PY - 2013
SP - 423
EP - 432
DO - 10.5220/0004375604230432