ONTOLOGY BASED EXTRACTION AND INTEGRATION OF INFORMATION FROM UNSTRUCTURED DOCUMENTS

Naychi Lai Lai Thein, Khin Haymar Saw Hla, Ni Lar Thein

2005

Abstract

The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. One of the basic problems in the development of the Semantic Web is information integration. The Web is composed of a variety of information sources, and integrating information from them requires their semantic integration and reconciliation. In addition, web pages are formatted in HTML, which is intended for human readers, so software agents cannot understand their meaning. In this paper, we present an approach that extracts information from unstructured documents (e.g. HTML) and converts it to a standard format (XML) by using a source ontology. We then translate the XML output to a local ontology. The paper also describes a key technology for mapping between ontologies, which computes similarity measures to express complex relationships among concepts. To address this mapping problem, we apply a machine learning approach to semantic interoperability in real-world commercial and governmental settings.
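
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the extraction rules or the similarity measure, so the regex-based `SOURCE_ONTOLOGY`, the concept names `Title`, `Price`, and `Cost`, and the choice of Jaccard similarity over instance sets are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical "source ontology": each concept name maps to a regex that
# recognizes its value in raw HTML. Real extraction rules would be richer.
SOURCE_ONTOLOGY = {
    "Title": re.compile(r"<h1>(.*?)</h1>", re.S),
    "Price": re.compile(r"\$\s*(\d+(?:\.\d{2})?)"),
}


def extract_to_xml(html: str) -> ET.Element:
    """Build an XML record with one child element per concept found."""
    record = ET.Element("Record")
    for concept, pattern in SOURCE_ONTOLOGY.items():
        match = pattern.search(html)
        if match:
            ET.SubElement(record, concept).text = match.group(1).strip()
    return record


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two instance sets (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


html = "<html><body><h1>Used Car</h1><p>Only $ 4500.00 today</p></body></html>"
record = extract_to_xml(html)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Compare instances of a source concept ("Price") with instances of a
# hypothetical local-ontology concept ("Cost") to score a candidate mapping.
price_instances = {"4500.00", "1200.50", "99.99"}
cost_instances = {"4500.00", "99.99", "15.00"}
score = jaccard(price_instances, cost_instances)
print(score)
```

In this sketch a high score between two concepts would suggest mapping one onto the other; a learned mapper would estimate such scores from training data rather than from exact instance overlap.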


Paper Citation


in Harvard Style

Lai Lai Thein N., Haymar Saw Hla K. and Lar Thein N. (2005). ONTOLOGY BASED EXTRACTION AND INTEGRATION OF INFORMATION FROM UNSTRUCTURED DOCUMENTS. In Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 972-8865-19-8, pages 457-460. DOI: 10.5220/0002555504570460


in Bibtex Style

@conference{iceis05,
author={Naychi Lai Lai Thein and Khin Haymar Saw Hla and Ni Lar Thein},
title={ONTOLOGY BASED EXTRACTION AND INTEGRATION OF INFORMATION FROM UNSTRUCTURED DOCUMENTS},
booktitle={Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2005},
pages={457-460},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002555504570460},
isbn={972-8865-19-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - ONTOLOGY BASED EXTRACTION AND INTEGRATION OF INFORMATION FROM UNSTRUCTURED DOCUMENTS
SN - 972-8865-19-8
AU - Lai Lai Thein N.
AU - Haymar Saw Hla K.
AU - Lar Thein N.
PY - 2005
SP - 457
EP - 460
DO - 10.5220/0002555504570460