Authors:
Estella Annoni 1 and C. I. Ezeife 2
Affiliations:
1 University of Toulouse, France; 2 University of Windsor, Canada
Keyword(s):
Web data model, Object-oriented mining, Automatic web data extraction.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Biomedical Engineering; Business Analytics; Data Engineering; Data Mining; Databases and Information Systems Integration; Datamining; Enterprise Information Systems; Health Information Systems; Information Systems Analysis and Specification; Modeling Formalisms, Languages and Notations; Object-Oriented Database Systems; Sensor Networks; Signal Processing; Soft Computing; Web Databases
Abstract:
Traditionally, mining web page contents involves modeling those contents to discover the underlying knowledge. Data extraction proposals represent web data in a formal structure, such as database structures specific to an application domain. Such models fail to capture the full diversity of web data structures, which can combine different types of content and can also be unstructured. With these proposals it is not possible to focus on a given type of content, to work on data of different structures, or to mine data from different application domains, all of which are required to mine a given content type, or web documents from different domains, efficiently. Moreover, since web pages are designed to be understood by users, this paper considers the modeling of web document presentation, expressed through HTML tag attributes, to be useful for efficient web content mining. Hence, this paper provides a general framework composed of an object-oriented web data model based on HTML tags and algorithms for extracting web content and web presentation objects from any given web document. From the HTML code of a web document, web objects are extracted for mining, regardless of the domain.
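To illustrate the distinction the abstract draws between content objects and presentation objects, the following is a minimal sketch, not the authors' model or algorithms. It assumes that a content object is simply a piece of text paired with its enclosing tag, and that a presentation object is a tag paired with its HTML attributes. The class names `ContentObject`, `PresentationObject`, and `WebObjectExtractor`, and the function `extract_web_objects`, are hypothetical and introduced only for this example; the parsing relies on Python's standard `html.parser` module.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass
class ContentObject:
    """Hypothetical content object: text plus the tag that encloses it."""
    tag: str
    text: str


@dataclass
class PresentationObject:
    """Hypothetical presentation object: a tag plus its HTML attributes."""
    tag: str
    attributes: dict


class WebObjectExtractor(HTMLParser):
    """Collects both kinds of objects in a single pass over the HTML."""

    def __init__(self):
        super().__init__()
        self.content_objects = []
        self.presentation_objects = []
        self._open_tags = []  # stack of currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open_tags.append(tag)
        if attrs:
            # Attributes are treated as presentation information (assumption).
            self.presentation_objects.append(
                PresentationObject(tag, dict(attrs)))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._open_tags:
            while self._open_tags and self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            parent = self._open_tags[-1] if self._open_tags else ""
            self.content_objects.append(ContentObject(parent, text))


def extract_web_objects(html):
    """Return (content_objects, presentation_objects) for an HTML string."""
    parser = WebObjectExtractor()
    parser.feed(html)
    parser.close()
    return parser.content_objects, parser.presentation_objects
```

Because the sketch depends only on tags and attributes, it makes no assumption about the application domain of the page, which is the property the abstract emphasizes. A real model would presumably distinguish more object types than these two.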