LOCATING AND EXTRACTING PRODUCT SPECIFICATIONS FROM PRODUCER WEBSITES

Maximilian Walther, Ludwig Hähne, Daniel Schuster, Alexander Schill

Abstract

Gathering product specifications from the Web is labor-intensive and still requires much manual work to retrieve the information and integrate it into enterprise information systems or online shops. This work aims to ease this task significantly by introducing algorithms that automatically locate and extract product information from producers’ websites, given only the product’s and the producer’s name. Compared to previous work in the field, it is the first approach to automate the whole process of locating the product page and extracting the specifications while supporting different page templates per producer. An evaluation within a federated consumer information system demonstrates the suitability of the developed algorithms. They may also be applied readily to comparable product information systems to minimize the effort of finding up-to-date product specifications.


Paper Citation


in Harvard Style

Walther M., Hähne L., Schuster D. and Schill A. (2010). LOCATING AND EXTRACTING PRODUCT SPECIFICATIONS FROM PRODUCER WEBSITES. In Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 4: ICEIS, ISBN 978-989-8425-07-2, pages 13-22. DOI: 10.5220/0002874300130022


in Bibtex Style

@conference{iceis10,
author={Maximilian Walther and Ludwig Hähne and Daniel Schuster and Alexander Schill},
title={LOCATING AND EXTRACTING PRODUCT SPECIFICATIONS FROM PRODUCER WEBSITES},
booktitle={Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 4: ICEIS},
year={2010},
pages={13-22},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002874300130022},
isbn={978-989-8425-07-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 12th International Conference on Enterprise Information Systems - Volume 4: ICEIS
TI - LOCATING AND EXTRACTING PRODUCT SPECIFICATIONS FROM PRODUCER WEBSITES
SN - 978-989-8425-07-2
AU - Walther M.
AU - Hähne L.
AU - Schuster D.
AU - Schill A.
PY - 2010
SP - 13
EP - 22
DO - 10.5220/0002874300130022