SELF-SUPERVISED PRODUCT FEATURE EXTRACTION USING A KNOWLEDGE BASE AND VISUAL CLUES

Rémi Ferrez, Clément de Groc, Javier Couto

2012

Abstract

This paper presents a novel approach to extracting product features from large e-commerce web sites. Starting from a small set of rendered product web pages (typically 5 to 10) and a sample of their corresponding features, the proposed method automatically produces labeled examples. These examples are then used to induce extraction rules, which are in turn applied to extract product features from unseen web pages. We evaluated the method on 10 major French e-commerce web sites (roughly 1,000 web pages) and obtained promising results. Moreover, the experiments show that the method can handle web site template changes without human intervention.
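
The pipeline outlined in the abstract (known feature values, automatically labeled examples, induced rules, extraction from unseen pages) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the rule language or how visual clues from rendered pages are used, so the sketch assumes a rule is simply the most frequent tag path at which a known value occurs, and it ignores rendering entirely. All names (TextPathCollector, label_examples, induce_rules, extract, TEMPLATE) and the sample pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TextPathCollector(HTMLParser):
    """Collects a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, e.g. ["html", "body", "td.weight"]
        self.nodes = []  # (path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent open tag with the same name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def label_examples(pages, known_features):
    """Self-labeling step: a text node equal to a known value becomes an example."""
    examples = []  # (feature name, tag path)
    for html, features in zip(pages, known_features):
        for path, text in text_nodes(html):
            for name, value in features.items():
                if text.lower() == value.lower():
                    examples.append((name, path))
    return examples


def induce_rules(examples):
    """Assumed rule form: one tag path per feature, chosen by majority vote."""
    counts = {}
    for name, path in examples:
        counts.setdefault(name, Counter())[path] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}


def extract(html, rules):
    """Applies the induced rules to a page and returns the first match per feature."""
    found = {}
    for path, text in text_nodes(html):
        for name, rule_path in rules.items():
            if path == rule_path and name not in found:
                found[name] = text
    return found


# Hypothetical site template; real pages would be far noisier.
TEMPLATE = (
    "<html><body><h1>{title}</h1><table>"
    "<tr><td class='label'>Weight</td><td class='weight'>{weight}</td></tr>"
    "<tr><td class='label'>Color</td><td class='color'>{color}</td></tr>"
    "</table></body></html>"
)

pages = [
    TEMPLATE.format(title="Phone A", weight="1.2 kg", color="black"),
    TEMPLATE.format(title="Phone B", weight="0.9 kg", color="white"),
]
known = [
    {"weight": "1.2 kg", "color": "black"},
    {"weight": "0.9 kg", "color": "white"},
]

rules = induce_rules(label_examples(pages, known))
unseen = TEMPLATE.format(title="Phone C", weight="2.0 kg", color="red")
result = extract(unseen, rules)
print(result)  # {'weight': '2.0 kg', 'color': 'red'}
```

Under this assumed rule form, re-running label_examples and induce_rules on pages fetched after a template change would yield new paths without manual annotation, which is one plausible reading of how such a method can adapt to template changes.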


Paper Citation


in Harvard Style

Ferrez R., de Groc C. and Couto J. (2012). SELF-SUPERVISED PRODUCT FEATURE EXTRACTION USING A KNOWLEDGE BASE AND VISUAL CLUES. In Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-08-2, pages 643-652. DOI: 10.5220/0003936706430652


in Bibtex Style

@conference{webist12,
author={Rémi Ferrez and Clément de Groc and Javier Couto},
title={SELF-SUPERVISED PRODUCT FEATURE EXTRACTION USING A KNOWLEDGE BASE AND VISUAL CLUES},
booktitle={Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2012},
pages={643-652},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003936706430652},
isbn={978-989-8565-08-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 8th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - SELF-SUPERVISED PRODUCT FEATURE EXTRACTION USING A KNOWLEDGE BASE AND VISUAL CLUES
SN - 978-989-8565-08-2
AU - Ferrez R.
AU - de Groc C.
AU - Couto J.
PY - 2012
SP - 643
EP - 652
DO - 10.5220/0003936706430652