DESIGNING A SYSTEM FOR SEMI-AUTOMATIC POPULATION OF KNOWLEDGE BASES FROM UNSTRUCTURED TEXT

Jade Goldstein-Stewart, Ransom K. Winder

Abstract

Important information from unstructured text is typically entered manually into knowledge bases, which limits the quantity of data that can be captured. Automated information extraction from text could assist with this process, but its accuracy is still unacceptably low. The task therefore requires a suitable user interface that allows the user to correct the frequent extraction errors and to validate proposed assertions before they are entered into a knowledge base. In this paper, we discuss our system for semi-automatic database population and how it handles the issues that arise in content extraction and knowledge base population. The main contributions of this work are identifying the challenges in building such a semi-automated tool, categorizing extraction errors, addressing the gaps in current extraction technology required for databasing, and designing and developing a usable interface and system, FEEDE, that supports correcting content extraction output and speeds up data entry into knowledge bases. To our knowledge, this is the first effort to populate knowledge bases using content extraction from unstructured text.



Paper Citation


in Harvard Style

Goldstein-Stewart J. and Winder R. (2009). DESIGNING A SYSTEM FOR SEMI-AUTOMATIC POPULATION OF KNOWLEDGE BASES FROM UNSTRUCTURED TEXT. In Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2009) ISBN 978-989-674-012-2, pages 88-99. DOI: 10.5220/0002307500880099


in Bibtex Style

@conference{keod09,
author={Jade Goldstein-Stewart and Ransom K. Winder},
title={DESIGNING A SYSTEM FOR SEMI-AUTOMATIC POPULATION OF KNOWLEDGE BASES FROM UNSTRUCTURED TEXT},
booktitle={Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2009)},
year={2009},
pages={88-99},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002307500880099},
isbn={978-989-674-012-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Engineering and Ontology Development - Volume 1: KEOD, (IC3K 2009)
TI - DESIGNING A SYSTEM FOR SEMI-AUTOMATIC POPULATION OF KNOWLEDGE BASES FROM UNSTRUCTURED TEXT
SN - 978-989-674-012-2
AU - Goldstein-Stewart J.
AU - Winder R.
PY - 2009
SP - 88
EP - 99
DO - 10.5220/0002307500880099