COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM

Dean Williams, Alexandra Poulovassilis

Abstract

We describe an approach that builds on techniques from Data Integration and Information Extraction to make better use of the unstructured data found in application domains, such as the Semantic Web, that require the integration of information from structured data sources, ontologies and text. We describe the design and implementation of the ESTEST system, which integrates the available structured and semi-structured data sources into a virtual global schema that is then used to partially configure an information extraction process. The information extracted from the text is merged with this virtual global database and becomes available for query processing over the entire integrated resource. As a result of this semantic integration, queries can be answered that could not be answered from the structured and semi-structured data alone. We give some experimental results from the ESTEST system in use.


Paper Citation


in Harvard Style

Williams D. and Poulovassilis A. (2006). COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM. In Proceedings of the First International Conference on Software and Data Technologies - Volume 2: ICSOFT, ISBN 978-972-8865-69-6, pages 13-21. DOI: 10.5220/0001315500130021


in Bibtex Style

@conference{icsoft06,
author={Dean Williams and Alexandra Poulovassilis},
title={COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM},
booktitle={Proceedings of the First International Conference on Software and Data Technologies - Volume 2: ICSOFT},
year={2006},
pages={13-21},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001315500130021},
isbn={978-972-8865-69-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the First International Conference on Software and Data Technologies - Volume 2: ICSOFT
TI - COMBINING INFORMATION EXTRACTION AND DATA INTEGRATION IN THE ESTEST SYSTEM
SN - 978-972-8865-69-6
AU - Williams D.
AU - Poulovassilis A.
PY - 2006
SP - 13
EP - 21
DO - 10.5220/0001315500130021