CRAWLING DEEP WEB CONTENT THROUGH QUERY FORMS

Jun Liu, Zhaohui Wu, Lu Jiang, Qinghua Zheng, Xiao Liu

Abstract

This paper proposes the concept of the Minimum Executable Pattern (MEP) and presents an MEP generation method and an MEP-based adaptive query method for the Deep Web. The query method extends the query interface from a single text box to a set of MEPs, and generates a locally optimal query by choosing an MEP and a keyword vector for that MEP. The method partly overcomes the “data islands” problem that results from the deficiencies of current methods. Experimental results on six real-world Deep Web sites show that the method outperforms existing methods in query capability and applicability.


Paper Citation


in Harvard Style

Liu J., Wu Z., Jiang L., Zheng Q. and Liu X. (2009). CRAWLING DEEP WEB CONTENT THROUGH QUERY FORMS. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 629-637. DOI: 10.5220/0001830806290637


in Bibtex Style

@conference{webist09,
author={Jun Liu and Zhaohui Wu and Lu Jiang and Qinghua Zheng and Xiao Liu},
title={CRAWLING DEEP WEB CONTENT THROUGH QUERY FORMS},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={629-637},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001830806290637},
isbn={978-989-8111-81-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - CRAWLING DEEP WEB CONTENT THROUGH QUERY FORMS
SN - 978-989-8111-81-4
AU - Liu J.
AU - Wu Z.
AU - Jiang L.
AU - Zheng Q.
AU - Liu X.
PY - 2009
SP - 629
EP - 637
DO - 10.5220/0001830806290637