Wrapper Induction by XPath Alignment

Joachim Nielandt, Robin de Mol, Antoon Bronselaer, Guy de Tré

Abstract

Extracting information from large quantities of semi-structured documents is an important topic that is receiving a lot of attention. Methods that accurately define where the data can be found are pivotal in constructing a robust solution that tolerates imperfections and structural changes in the source material. In this paper we investigate a wrapper induction method that revolves around aligning XPath elements (steps), allowing users to generalise from the training examples they give to the data extraction system. The alignment is based on a modification of the well-known Levenshtein edit distance. Once the training example XPaths have been aligned with each other, they are merged into a path that generalises the examples as precisely as possible, so that it can be used to accurately fetch the required data from the given source material.
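
The align-then-merge idea can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify how the Levenshtein distance is modified, so the cost values, the merge rules, and all function names below (split_steps, parse_step, sub_cost, align, merge, generalise) are assumptions chosen for illustration. The sketch only handles absolute paths whose steps are a tag name with an optional positional predicate such as tr[2].

```python
import re

GAP = 1  # assumed cost of inserting or deleting one step


def split_steps(xpath):
    """Split an absolute XPath such as /html/body/div[2] into its steps."""
    return [s for s in xpath.strip().split("/") if s]


def parse_step(step):
    """Return (tag, index); index is None when there is no [n] predicate."""
    m = re.fullmatch(r"([^\[\]]+)(?:\[(\d+)\])?", step)
    if m is None:
        raise ValueError(f"unsupported step: {step!r}")
    return m.group(1), m.group(2)


def sub_cost(a, b):
    """Assumed substitution cost: 0 identical, 1 same tag, 2 different tag."""
    (tag_a, idx_a), (tag_b, idx_b) = parse_step(a), parse_step(b)
    if tag_a != tag_b:
        return 2
    return 0 if idx_a == idx_b else 1


def align(a, b):
    """Levenshtein-style alignment of two step lists; None marks a gap."""
    n, m = len(a), len(b)
    # dist[i][j] is the cheapest cost of aligning a[:i] with b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * GAP
    for j in range(1, m + 1):
        dist[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
                dist[i - 1][j] + GAP,
                dist[i][j - 1] + GAP,
            )
    # Trace back from the bottom-right cell, preferring substitutions on ties.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dist[i][j] == dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1])):
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + GAP:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def merge(pairs):
    """Merge aligned steps into one generalised XPath (assumed rules)."""
    out, gap = "", False
    for x, y in pairs:
        if x is None or y is None:
            gap = True  # a skipped step becomes a descendant axis
            continue
        if x == y:
            step = x                  # identical steps are kept as-is
        elif parse_step(x)[0] == parse_step(y)[0]:
            step = parse_step(x)[0]   # same tag: drop the differing index
        else:
            step = "*"                # different tags: wildcard
        out += ("//" if gap else "/") + step
        gap = False
    if gap:
        out += "/descendant-or-self::*"  # trailing gap
    return out


def generalise(xpath_a, xpath_b):
    return merge(align(split_steps(xpath_a), split_steps(xpath_b)))


print(generalise("/html/body/div/table/tr[1]/td[2]",
                 "/html/body/table/tr[2]/td[2]"))
# prints /html/body//table/tr/td[2]
```

In this sketch a step that appears in only one example is replaced by the descendant axis, and two steps that share a tag but differ in position lose their index, so the merged path matches both training examples. A real system would also need to handle attribute predicates and more than two examples.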


Paper Citation


in Harvard Style

Nielandt J., de Mol R., Bronselaer A. and de Tré G. (2014). Wrapper Induction by XPath Alignment. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 492-500. DOI: 10.5220/0005124504920500


in Bibtex Style

@conference{kdir14,
author={Joachim Nielandt and Robin de Mol and Antoon Bronselaer and Guy de Tré},
title={Wrapper Induction by XPath Alignment},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={492-500},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005124504920500},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - Wrapper Induction by XPath Alignment
SN - 978-989-758-048-2
AU - Nielandt J.
AU - de Mol R.
AU - Bronselaer A.
AU - de Tré G.
PY - 2014
SP - 492
EP - 500
DO - 10.5220/0005124504920500