Automatically Extracting Complex Data Structures from the Web

Laura Fontán, Rafael López-García, Manuel Álvarez, Alberto Pan

Abstract

This paper presents a new technique for detecting and extracting lists of structured records from Web pages. Unlike most state-of-the-art systems, our approach can detect nested data structures (sublists), and it incorporates heuristics to remove unwanted content such as banners and navigation menus from the data region. The article also describes the experiments we performed to validate the system. The precision and recall obtained in our tests both exceed 90%.


Paper Citation


in Harvard Style

Fontán L., López-García R., Álvarez M. and Pan A. (2012). Automatically Extracting Complex Data Structures from the Web. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012) ISBN 978-989-8565-29-7, pages 246-251. DOI: 10.5220/0004140802460251


in Bibtex Style

@conference{kdir12,
author={Laura Fontán and Rafael López-García and Manuel Álvarez and Alberto Pan},
title={Automatically Extracting Complex Data Structures from the Web},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={246-251},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004140802460251},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Automatically Extracting Complex Data Structures from the Web
SN - 978-989-8565-29-7
AU - Fontán L.
AU - López-García R.
AU - Álvarez M.
AU - Pan A.
PY - 2012
SP - 246
EP - 251
DO - 10.5220/0004140802460251