Discovering the Deep Web through XML Schema Extraction

Yasser Saissi, Ahmed Zellou, Ali Idri

2016

Abstract

The part of the web indexed by search engines contains a vast amount of information. However, another part of the web, called the deep web, is accessible only through its associated HTML forms and contains much more information. Integrating deep web content presents many challenges that current deep web access approaches do not fully address. Integrating deep web data requires knowing the schema that describes each deep web source. This paper presents our approach to extracting the XML schema that describes a selected deep web source. The extracted XML schema will be used to integrate the associated deep web source into a mediation system. The principle of our approach is to apply a static and a dynamic analysis to the HTML forms that give access to the selected deep web source. We describe the algorithms of our approach and compare it with other existing approaches.
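To make the idea of a static form analysis concrete, the following is a minimal illustrative sketch and not the algorithm described in the paper. It assumes that each named form field maps to one XML element, that the options of a select list become an enumeration, and that all values are strings. The names `FormFieldParser`, `form_to_xsd`, and the root element name `record` are hypothetical. The dynamic analysis (submitting queries and inspecting the results) is not covered here.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


class FormFieldParser(HTMLParser):
    """Collect the named fields of an HTML form (static analysis only)."""

    def __init__(self):
        super().__init__()
        self.fields = []      # (name, options) pairs in document order
        self._select = None   # option list of the <select> currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("name") and a.get("type", "text") != "submit":
            self.fields.append((a["name"], []))
        elif tag == "select" and a.get("name"):
            self._select = []
            self.fields.append((a["name"], self._select))
        elif tag == "option" and self._select is not None and a.get("value"):
            # Options with an empty value (e.g. an "Any" placeholder) are skipped.
            self._select.append(a["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def form_to_xsd(html, root_name="record"):
    """Build a small XML Schema with one element per form field (assumed mapping)."""
    parser = FormFieldParser()
    parser.feed(html)

    ET.register_namespace("xs", XS)
    schema = ET.Element(f"{{{XS}}}schema")
    root = ET.SubElement(schema, f"{{{XS}}}element", name=root_name)
    complex_type = ET.SubElement(root, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")

    for name, options in parser.fields:
        element = ET.SubElement(sequence, f"{{{XS}}}element", name=name)
        if options:
            # A select list becomes a string type restricted to its option values.
            simple = ET.SubElement(element, f"{{{XS}}}simpleType")
            restriction = ET.SubElement(simple, f"{{{XS}}}restriction", base="xs:string")
            for value in options:
                ET.SubElement(restriction, f"{{{XS}}}enumeration", value=value)
        else:
            element.set("type", "xs:string")

    return ET.tostring(schema, encoding="unicode")


if __name__ == "__main__":
    sample_form = (
        '<form><input type="text" name="title">'
        '<select name="genre"><option value="">Any</option>'
        '<option value="fiction">Fiction</option></select>'
        '<input type="submit" name="go"></form>'
    )
    print(form_to_xsd(sample_form))
```

A static pass of this kind only sees what the form exposes. Attributes that appear solely in result pages would need a dynamic step such as the one the abstract mentions.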



Paper Citation


in Harvard Style

Saissi Y., Zellou A. and Idri A. (2016). Discovering the Deep Web through XML Schema Extraction. In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016) ISBN 978-989-758-203-5, pages 141-149. DOI: 10.5220/0006013901410149


in Bibtex Style

@conference{kdir16,
author={Yasser Saissi and Ahmed Zellou and Ali Idri},
title={Discovering the Deep Web through XML Schema Extraction},
booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},
year={2016},
pages={141-149},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006013901410149},
isbn={978-989-758-203-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)
TI - Discovering the Deep Web through XML Schema Extraction
SN - 978-989-758-203-5
AU - Saissi Y.
AU - Zellou A.
AU - Idri A.
PY - 2016
SP - 141
EP - 149
DO - 10.5220/0006013901410149