MODELING WEB DOCUMENTS AS OBJECTS FOR AUTOMATIC WEB CONTENT EXTRACTION - Object-oriented Web Data Model

Estella Annoni, C. I. Ezeife

2009

Abstract

Traditionally, mining web page contents involves modeling those contents to discover the underlying knowledge. Data extraction proposals represent web data in a formal structure, such as database structures specific to an application domain. These models fail to capture the full diversity of web data structures, which can combine different types of content and can also be unstructured. With these proposals, it is not possible to focus on a given type of content, to work on data of different structures, or to mine data from different application domains, all of which are required to mine a given content type or web documents from different domains efficiently. Moreover, since web pages are designed to be understood by users, this paper considers the modeling of web document presentation, expressed through HTML tag attributes, to be useful for efficient web content mining. Hence, this paper provides a general framework composed of an object-oriented web data model based on HTML tags, together with algorithms for extracting web content objects and web presentation objects from any given web document. From the HTML code of a web document, web objects are extracted for mining, regardless of the domain.
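The abstract does not give the model's classes or the extraction algorithms, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a "content object" can be approximated by an HTML element with its own text and non-presentational attributes, and a "presentation object" by the presentational attributes of that element. The names `ContentObject`, `PresentationObject`, `WebObjectExtractor`, `extract_web_objects`, and the `PRESENTATION_ATTRS` set are hypothetical and chosen for illustration; only Python's standard `html.parser` module is used.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumption: which attributes count as "presentation" is a guess made for
# this sketch; the paper's actual classification is not given in the abstract.
PRESENTATION_ATTRS = {"style", "class", "align", "bgcolor", "color",
                      "width", "height", "border"}
# Elements that never have a closing tag, so they must not stay "open".
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


@dataclass
class ContentObject:
    """Hypothetical content object: one HTML element and its own text."""
    tag: str
    text: str = ""
    attrs: dict = field(default_factory=dict)


@dataclass
class PresentationObject:
    """Hypothetical presentation object: presentational attributes of a tag."""
    tag: str
    attrs: dict


class WebObjectExtractor(HTMLParser):
    """Walks the HTML once and collects both kinds of objects."""

    def __init__(self):
        super().__init__()
        self.content_objects = []
        self.presentation_objects = []
        self._open = []  # stack of currently open ContentObjects

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        presentation = {k: v for k, v in attr_map.items()
                        if k in PRESENTATION_ATTRS}
        other = {k: v for k, v in attr_map.items()
                 if k not in PRESENTATION_ATTRS}
        obj = ContentObject(tag=tag, attrs=other)
        self.content_objects.append(obj)
        if presentation:
            self.presentation_objects.append(
                PresentationObject(tag=tag, attrs=presentation))
        if tag not in VOID_TAGS:
            self._open.append(obj)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; unmatched end tags
        # are ignored so that sloppy real-world HTML does not break the walk.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i].tag == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            innermost = self._open[-1]
            innermost.text = (innermost.text + " " + text).strip()


def extract_web_objects(html):
    """Return (content_objects, presentation_objects) for an HTML string."""
    parser = WebObjectExtractor()
    parser.feed(html)
    parser.close()
    return parser.content_objects, parser.presentation_objects
```

Calling `extract_web_objects(page_source)` on the HTML of any page returns the two lists without relying on a domain-specific schema, which is the property the abstract emphasizes. A real implementation would need a richer object hierarchy than this flat sketch provides.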



Paper Citation


in Harvard Style

Annoni, E. and Ezeife, C. I. (2009). MODELING WEB DOCUMENTS AS OBJECTS FOR AUTOMATIC WEB CONTENT EXTRACTION - Object-oriented Web Data Model. In Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8111-84-5, pages 91-100. DOI: 10.5220/0001967400910100


in Bibtex Style

@conference{iceis09,
author={Estella Annoni and C. I. Ezeife},
title={MODELING WEB DOCUMENTS AS OBJECTS FOR AUTOMATIC WEB CONTENT EXTRACTION - Object-oriented Web Data Model},
booktitle={Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2009},
pages={91-100},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001967400910100},
isbn={978-989-8111-84-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - MODELING WEB DOCUMENTS AS OBJECTS FOR AUTOMATIC WEB CONTENT EXTRACTION - Object-oriented Web Data Model
SN - 978-989-8111-84-5
AU - Annoni E.
AU - Ezeife C. I.
PY - 2009
SP - 91
EP - 100
DO - 10.5220/0001967400910100