EXTRACTING OBJECT-RELEVANT DATA FROM WEBSITES

Jianqiang Li, Yu Zhao

2009

Abstract

This paper proposes a method to identify object-relevant information that is distributed across multiple web pages in a website. Much research has been reported on page-level web data extraction, which assumes that the input web pages contain the data records of the objects of interest. However, in many cases of data mining from a website, the group of web pages describing an object is sparsely distributed across the website, which makes page-level solutions inapplicable. This paper exploits the hierarchy model employed by the website builder for web page organization to solve the problem of website-level data extraction. A new resource, the Hierarchical Navigation Path (HNP), which can be discovered from the website structure, is introduced for filtering object-relevant web pages. The pages found are clustered based on URL and semantic hyperlink analysis, and then the entry page and the detailed profile pages of each object are identified. Empirical experiments show the effectiveness of the proposed approach.
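The abstract does not spell out the clustering step, so the following is only a rough sketch of what URL-based grouping and entry-page selection could look like. Everything in it is an assumption made for illustration: the function names, the `depth` parameter, the example URLs, and the heuristic that the shallowest URL in a cluster is the entry page. It leaves out HNP discovery and semantic hyperlink analysis entirely and should not be read as the authors' algorithm.

```python
from collections import defaultdict
from urllib.parse import urlparse


def path_segments(url):
    """Return the non-empty path segments of a URL."""
    return [s for s in urlparse(url).path.split("/") if s]


def cluster_by_url_prefix(urls, depth=2):
    """Group URLs that share their first `depth` path segments.

    Assumption: pages describing the same object live under a common
    directory prefix. Real sites may not follow this convention.
    """
    clusters = defaultdict(list)
    for url in urls:
        key = "/".join(path_segments(url)[:depth])
        clusters[key].append(url)
    return dict(clusters)


def pick_entry_page(cluster):
    """Pick the URL with the fewest path segments (ties broken alphabetically).

    Assumption: the entry page sits highest in the directory hierarchy.
    """
    return min(cluster, key=lambda u: (len(path_segments(u)), u))


# Hypothetical object-relevant pages that survived filtering.
pages = [
    "http://example.com/products/widget-a/",
    "http://example.com/products/widget-a/specs.html",
    "http://example.com/products/widget-a/reviews.html",
    "http://example.com/products/widget-b/",
    "http://example.com/products/widget-b/specs.html",
]

clusters = cluster_by_url_prefix(pages)
entries = {key: pick_entry_page(group) for key, group in clusters.items()}
```

Under these assumptions, each cluster corresponds to one object, the selected URL plays the role of its entry page, and the remaining URLs in the cluster are its detailed profile pages.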



Paper Citation


in Harvard Style

Li J. and Zhao Y. (2009). EXTRACTING OBJECT-RELEVANT DATA FROM WEBSITES. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 597-604. DOI: 10.5220/0001823705970604


in Bibtex Style

@conference{webist09,
author={Jianqiang Li and Yu Zhao},
title={EXTRACTING OBJECT-RELEVANT DATA FROM WEBSITES},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={597-604},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001823705970604},
isbn={978-989-8111-81-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - EXTRACTING OBJECT-RELEVANT DATA FROM WEBSITES
SN - 978-989-8111-81-4
AU - Li J.
AU - Zhao Y.
PY - 2009
SP - 597
EP - 604
DO - 10.5220/0001823705970604