MODELING WEB DOCUMENTS AS OBJECTS FOR AUTOMATIC
WEB CONTENT EXTRACTION
Object-oriented Web Data Model
Estella Annoni
IRIT, University of Toulouse, 118 Route de Narbonne, F-31062 Toulouse CEDEX 9, France
C. I. Ezeife
School of Computer Science, University of Windsor, Windsor, Ontario N9B 3P4, Canada
Keywords:
Web data model, Object-oriented mining, Automatic web data extraction.
Abstract:
Traditionally, mining web page contents involves modeling their contents to discover the underlying knowl-
edge. Data extraction proposals represent web data in a formal structure, such as a database structure specific to an application domain. Those models fail to capture the full diversity of web data, which can be composed of different types of contents and can also be unstructured. With these proposals, it is not possible to focus on a given content type, to work on data of different structures, or to mine data from different application domains, as is required to mine a given content type or web documents from different domains efficiently. On top of that, since web pages are designed to be understood by users, this paper considers the modeling of web document presentation, expressed through HTML tag attributes, as useful for efficient web content mining. Hence, this paper provides a general framework composed of an object-oriented web
data model based on HTML tags and algorithms for web content and web presentation object extraction from
any given web document. From the HTML code of a web document, web objects are extracted for mining,
regardless of the domain.
1 INTRODUCTION
Since web document volume and diversity are increasing tremendously (Kosala and Blockeel, 2000), and more businesses use full-fledged web applications, web data analysis and web content mining have become important research areas. In web content mining, presenting web data in a formal structure and extracting similar records in web documents are two
important tasks. Web documents generated by back-
end database systems have underlying data structures
where data presented on a web document has the same
set of attributes, such as the Chapters web page in fig-
ure 1, which has three nested tables. Manually generated web documents are loosely structured or even unstructured because there is no similar underlying data structure in them, such as the United Nations index web page (UN, 2007).
(This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada under an operating grant (OGP-0194134) and a University of Windsor grant.)
The upper part of the Chapters web document (figure 1), composed of unstructured data (the Chapters logo) and loosely-structured complex data (the Chapters menu), is not handled by previous approaches because they either view it as noise or do not manage it. Existing work on web data extraction mostly handles only complex and structured web data. However, many web documents are unstructured, as argued by (Kosala and Blockeel, 2000). We aim at mining any web document in a unified way, so our data extraction process needs to handle simple, complex, loosely structured and unstructured web data. For example, answering the query "What is Chapters' fake website profile?" entails mining structured data related to items on sale and also unstructured data about the Chapters company. So, preventing fake Chapters business websites requires flagging those presenting records of items and composed of the Chapters logo and a menu with a user account, in order to extract common points and define their profile.
Figure 1: Web document with records in body part and unstructured data in its upper part.
1.1 Related Work
The two main structures used to model structured con-
tent are Object Exchange Model (OEM) (Abiteboul,
1997) and the nested table (Liu et al., 2003). OEM represents data as a graph of objects, such as books or book authors: the ids of these objects are the vertices and the object string labels are the edges. Each object consists of a set of reference pairs (label, object id). Although web data are related, as argued
by (Liu and Chen-Chuan-Chang, 2004), neither of the two main structures covers all relationships. In fact,
both nested tables and OEM represent aggregation re-
lationships, but only OEM represents reference re-
lationships. Neither of them represents hierarchical relationships between objects, which are also useful for reusing and extending mining techniques through objects. Nested tables do not support loose structures, and neither handles unstructured data because they look only for similarities across data. Hence, they are not adequate for capturing
the full diversity of structures within web documents
(Crescenzi et al., 2001).
With both of these approaches, mining a web document corresponds to either traversing the whole OEM graph or mining all the nested tables within it. Mining web documents of different domains is not practical because the associated nested tables or OEM graphs would have to be compared to each other, which is not efficient.
To mine data from structured web documents, their HTML files are mainly represented as documents pre-parsed into a DOM tree by (Liu et al., 2003; Gottlob and Koch, 2004; Zhao et al., 2005), in contrast to (Abiteboul, 1997; Crescenzi et al., 2001; Arasu et al., 2003; W3C, 2007), which treat them as sequences of character strings. In a DOM tree, tags and tag attributes are represented as nodes, and the nodes respect the hierarchical relationships between tags. Although users mainly rely on web data presentation to select information on web documents, as argued in web segmentation work (Levering and Cutler, 2006; Li and Ezeife, 2006), none of the pre-parsing-based research considers presentation cues conveyed through HTML tags, such as the "display" attribute, as our approach does. Hence, by considering only content types, search spaces are only partially narrowed. The authors
of (Zhao et al., 2005) use visual analysis of content shapes on web documents and compare it with the web document DOM tree, but their work does not take advantage of presentation aspects conveyed directly through the DOM tree. They consider only strict
structures and assume stable data presentations, since the web documents handled are generated by search engines.
1.2 Contributions and Outline
Basically, approaches for mining data from web documents face three main problems:
(1) Inability to focus web search on either web docu-
ment presentation or content, or both.
(2) Lack of a unified framework to mine each web object regardless of its structure type (unstructured, loosely or strictly structured).
(3) Time consumption and domain dependence: for requirements that span several application domains, the mining process must be run as many times as there are application domains.
This paper provides an object-oriented web data
model for representing web data as web content and
web presentation objects to address these problems
and mine complex and structured data as well as sim-
ple and unstructured data in a unified way. Hence, our
object-oriented web data model distinguishes content
from presentation aspects of data (title, label, image,
...). We also define algorithms to extract these objects
from any given web document of any web application
in different domains.
This paper is organized in four sections. Section 2
presents the proposed object-oriented web data model
which represents the first general and complete web
content and presentation type classification. In sec-
tion 3, algorithms for extracting web content and pre-
sentation objects are presented. An example applica-
tion of these algorithms on Chapters web document
(figure 1), on its structured and unstructured data is
also presented in this section. Conclusions and future
work are given in section 4.
2 OBJECT-ORIENTED WEB
DATA MODEL
Web document segmentation work (Yu et al., 2003;
Song et al., 2004) uses DOM tree, data location fea-
tures, and data presentation features to distinguish in-
formation blocks from noisy ones. We need to go further than the concept of a web document block to address data of simple and complex types and to define web document objects. Like the web segmentation algorithms of (Yu et al., 2003; Song et al., 2004), we do not evaluate all HTML tags because that is time consuming and not all HTML tags are meaningful. For example, <h1> or <a> tags are more
meaningful than <pre> used for pre-formatted text.
Thus, we propose an object-oriented web data meta
model, which captures both the content and presen-
tation views of web documents, to mine a web docu-
ment as a set of objects. Our web object classes are
defined based on the four concepts:
(1) Main HTML tags: non-empty tags which impact
web document contents and presentations such as ti-
tle, table, link, form or list tags.
(2) Location features: web users directly distinguish
the meaning and the interest of data in web documents
with respect to where information is located. There-
fore, we pay attention to location features such as web
document zone, width, height, and center.
(3) Presentation features: information which is instinctively used by human beings to distinguish contents, such as style, fonts, and spaces. An important visual cue in the DOM tree is conveyed by the tag attributes called "style" or "type". Moreover, the value of the attribute named "display" must not be "none" or "hidden" for the content to be considered as rendered.
(4) Relationships between objects: web document ob-
jects are related to each other and they share at least
the same space of presentation.
This metamodel represents a web document with an
object of the same name as a composition of Web-
Zone objects because a WebZone object can appear
only on one web document at a time. A WebZone
object is a coherent zone in a web document. In a
web zone, one can notice contents and presentation
of these contents. A WebElement object is a content
such as text, picture, forms, plug-in, separator, and
structure related to the content view. An example of
WebElement object is a weather plug-ins on personal
home pages which come from the same website. A
WebRender object is the rendering of contents on
a web document. An example of WebRender is the
menu shared by different web pages of the same web-
site or the legal information announced in the lowest
part of every web document of a website. From the
content view, a WebZone object is a composition of
WebElement objects because when a WebZone is destroyed, the WebElement objects associated with it are also destroyed. For the same reason, from the presentation
view, a WebZone object is a composition of WebRen-
der objects.
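As an illustration, the composition relationships above can be sketched as a small class model. The following Python sketch is only illustrative: the attribute names (tag, kind, elements, renders, zones) are our assumptions and not part of the paper's formal UML model.

from dataclasses import dataclass, field
from typing import List

@dataclass
class WebElement:
    """A content item: text, image, form, plug-in, separator, or structure."""
    tag: str           # HTML tag enclosing the content
    value: str = ""    # textual value or resource reference

@dataclass
class WebRender:
    """The rendering of contents: banner, menu, interaction, legal information, record, or bulk."""
    kind: str                                       # e.g. "Menu", "Banner"
    tags: List[str] = field(default_factory=list)

@dataclass
class WebZone:
    """A coherent zone of a web document; it owns its elements and renders."""
    name: str                                                    # "HeaderZone", "BodyZone" or "FootZone"
    elements: List[WebElement] = field(default_factory=list)    # content view (composition)
    renders: List[WebRender] = field(default_factory=list)      # presentation view (composition)

@dataclass
class WebDocument:
    """A web document is a composition of WebZone objects."""
    url: str
    zones: List[WebZone] = field(default_factory=list)

Deleting a WebZone from a WebDocument discards its WebElement and WebRender objects with it, which mirrors the composition semantics stated above.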
In the literature, three main zones at most are vi-
sually considered on a web document regardless of
the number of main subtrees of its DOM tree. These
zones are called header part, body part, and foot part
(Song et al., 2004). We define these parts as instances
of specialized WebZone classes: HeaderZone, Body-
Zone, and FootZone. HeaderZone and BodyZone of
our snapshot are sketched on the left part of figure 1.
Now that we have presented the zones, we are going to
present the objects which may exist in these zones in
terms of content and presentation views.
2.1 Web Content Objects
Examples of web content types cited in existing work
include text, image, metadata, video, but a complete
classification of these contents does not exist. Our
classification is a tree-based structure defined from
sets of tags at the same level in the DOM tree, at-
tributes of these tags, and their enclosed tags. (Lever-
ing and Cutler, 2006) classify web content according
to four web content types (Image, Text, Form, Plug-
in). We consider them as the basis of our classifica-
tion.
Web documents show blanks and, more generally, separators between the four aforementioned types.
So, we add a fifth content type called SeparatorEle-
ment, which is an element separating content types
from each other, so that any web application such
as segmentation could be supported. Moreover, web
documents generated by back-end database systems
respect an underlying data structure, which is an ag-
gregation of simple data types. A sixth content type, which is an aggregation of the five previous ones, is
also defined to model structured data. Thus, we define
sub-content types until we find the most specialized,
simple content type existing on the web documents.
The anchoring feature of text and image contents is also used to take advantage of navigation across web pages. Our complete classification of web content types defines six main generic types, four of which have sub-content types:
1. Text element: textual data, with or without page setup, which can be of two kinds:
(a) Raw text: textual data without predefined page setup, enclosed in the HTML tag <span>; it is of three kinds:
i. Title: textual data serving as a header in the page setup and enclosed within HTML tags <h1> to <h6>.
ii. Label: textual data with page setup, enclosed in the HTML tag <label>.
iii. Paragraph: textual data with page setup defined
by the enclosing HTML tag <p>.
(b) List text: textual data organized into lists and en-
closed within HTML tag <li>. It has two kinds:
i. Ordered list: enclosed within HTML tag <ol>.
ii. Definition list: enclosed within an HTML tag
<dl>.
2. Image element: a picture which can be of two
kinds:
(a) Image: a simple picture enclosed within an HTML
tag <img> which does not contain a <map> at-
tribute.
(b) Map: a picture associated with a mapping, defined by an HTML <img> tag with a map attribute set (in the case of a client-side mapping) or an ismap attribute (in the case of a server-side mapping).
3. Form element: box of textual data that web users
send to web servers. It could be enclosed within
HTML tags <form> or <fieldset>. The <fieldset>
tag allows authors to semantically group form entries.
It has three kinds:
(a) Form select: box of textual data submitted by web users to web servers from a single or multiple choice.
It could be enclosed within HTML tags <form> and
<select>.
(b) Form input: box of textual data that web users
send to web servers. It could be enclosed within
HTML tags <form> and <input>.
(c) Form text area: box of textual data delivered to
web users. It could be enclosed within HTML tags
<form> and <textarea>.
4. Plug-in element: data launched by specific appli-
cations in order to provide web users with a richer
interface. It has two kinds:
(a) Plug-in server: data launched on web servers. It could be enclosed within the tags <!--#command exec="scriptName"--> for CGI code, <% program %> for VB code, and <?php program ?> for PHP code, etc.
(b) Plug-in client: data launched onto the local client.
It could be enclosed within HTML tags <script> and
<object>.
5. Separator element: spaces between contents which
emphasize them and make them instinctively mean-
ingful for human beings, such as lines, blanks, and empty spaces. They could be enclosed within the HTML tags <hr> and <br>.
6. Structure element: an aggregation of the previous simple-type elements used to represent web data of complex type, such as structures composed of different content types. It is defined as a sub-tree rooted at a DOM node which has three or more child sub-trees of height 3. The string values of these tags have to be the same.
Texts and images which are hyperlinks, i.e., references to web document parts, are represented as implementing an interface called "Link". The structured data type is modeled by an aggregation of WebElement objects, i.e., a self-aggregation relationship.
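To make the classification concrete, the sketch below maps a tag name to one of the generic content types. It is a simplified, assumed lookup: it ignores the anchorage (Link) interface and the height-3 sub-tree test for structure elements, and the function name is ours.

# Hypothetical lookup from HTML tag name to a generic content type.
CONTENT_TYPE_BY_TAG = {
    "span": "RawText", "h1": "Title", "h2": "Title", "h3": "Title",
    "h4": "Title", "h5": "Title", "h6": "Title",
    "label": "Label", "p": "Paragraph",
    "li": "ListText", "ol": "OrderedList", "dl": "DefinitionList",
    "img": "Image",
    "form": "FormElement", "fieldset": "FormElement",
    "select": "FormSelect", "input": "FormInput", "textarea": "FormTextArea",
    "script": "PlugInClient", "object": "PlugInClient",
    "hr": "SeparatorElement", "br": "SeparatorElement",
}

def content_type(tag_name, has_map_attribute=False):
    """Return the generic content type of a tag; default to raw text."""
    if tag_name == "img" and has_map_attribute:
        return "Map"   # picture associated with a client- or server-side mapping
    return CONTENT_TYPE_BY_TAG.get(tag_name, "RawText")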
2.2 Web Presentation Objects
The authors of (Yu et al., 2003) mention that web doc-
uments are composed of four noisy materials, called:
1. navigation (e.g., hyperlink), 2. decoration (e.g.,
pictures), 3. interaction (e.g., forms), 4. special para-
graphs (e.g., copyright) and one information material,
called 5. topic (e.g., main data). These materials are
defined mainly according to properties which graphi-
cally grab users’ attention. These presentation prop-
erties have only one strict meaning and are specifi-
cally defined (for example images, hyperlinks, copy-
rights), contrary to textual contents. In this section, we make this classification explicit in order to automatically recognize these objects, and we complete it to represent structured data. So, we indicate HTML tags
according to the presentation object data type and we
add a data type called “Record” to model structured
data. The UML class diagram of presentation objects
is composed of six specialized classes which are:
1. Banner: dynamic WebRender object that contains information plug-ins and images linked to external websites, usually situated around the edges, in the upper part. These objects could be the presentation view of text, image, and plug-in contents. Thus, the HTML tags associated are <span>, <h1> to <h6>, <label>, <p>, <li>, <img>, <form>, <fieldset>, <!--#command exec="scriptName"-->, <% program %>, <?php program ?>, <script>, and <object>.
2. Menu: WebRender object that organizes naviga-
tion in web documents of a website or through a web
document. These objects could be the presentation of
text and image content. These objects do not contain
keywords, such as “copyright”, “about our company”.
Thus, the HTML tags associated are <span>, <h1> to <h6>, <label>, <p>, <li>, and <img>.
3. Interaction: WebRender object that collects user
information for service. These objects could be the
presentation view of forms. Thus, the HTML tags as-
sociated are <form> and <fieldset>.
4. LegalInformation: non-dynamic WebRender ob-
ject that contains text or image related to authors,
website or references. It is defined through text and
a series of <a> tags referencing web documents of
the same website. It might be defined by simple text
containing keywords “copyright”, “private”, “policy”,
“about our company”, or it can be defined through
a series of <img> tags and <a> enclosing <img>
tags.
5. Record: WebRender object that represents struc-
tured data related to web document interests. These
objects could be structured element rendering. Thus,
the HTML tags associated are a subset of the aforementioned tags, organized as a sub-tree rooted at a DOM tree node which has three or more child sub-trees of height 3. The string values of these tags have to be the same.
6. Bulk: WebRender object that contains unstructured and loosely-structured data which is not a banner, a menu, an interaction, a legal information, or a record.
Regarding the content view and the presentation view,
two hierarchies of web objects have been defined
based on HTML tags.
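The presentation classification can likewise be sketched as a rule over a set of sibling tags. The heuristics below (the keyword list and the five-link menu threshold) paraphrase the definitions above; the function and parameter names are ours, and record detection (three or more similar child sub-trees of height 3) is omitted.

LEGAL_KEYWORDS = ("copyright", "private", "policy", "about our company")

def presentation_type(sibling_tags, text="", is_dynamic=False):
    """Classify a set of sibling tags into one of the presentation types.

    sibling_tags: tag names of the sibling set being evaluated.
    text: concatenated, lower-cased text content of the siblings.
    is_dynamic: True if the set contains server- or client-side plug-in code.
    """
    links = sum(1 for t in sibling_tags if t in ("a", "area"))
    if any(keyword in text for keyword in LEGAL_KEYWORDS):
        return "LegalInformation"
    if sibling_tags and all(t in ("form", "fieldset") for t in sibling_tags):
        return "Interaction"
    if links >= 5 or (sibling_tags and all(t == "li" for t in sibling_tags)):
        return "Menu"
    if is_dynamic:
        return "Banner"
    return "Bulk"   # unstructured or loosely-structured leftover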
3 AUTOMATIC WEB OBJECT
EXTRACTION
Web documents are of two main types: unstructured and structured. They are evaluated by users in terms of their contents and the rendering of these contents. Using the web document DOM tree and the aforementioned hierarchies, web objects can be automatically extracted. In this way, whatever the application domain, any web document can be modeled as a set of objects, and pattern discovery tasks within and across web documents can be processed to detect similarities and trends. The main algorithm, called OWebMiner (Object-oriented Web Miner of Content and Presentation Objects), is defined as follows:
Algorithm OWebMiner()
Input: a set of HTML files (WDHTMLFile) of web documents.
Output: a set of patterns of objects.
begin
for each WDHTMLFile
(A) Sequentially extract web presentation objects and web content objects with respect to their hierarchical dependencies.
(B) Store the web object hierarchies into a database table.
endFor
Mine patterns lying within objects
end.
In this paper, we develop the sub-algorithm (A).
Given a web document HTML file, this sub-algorithm
works in three steps. In the first step, a DOM tree of the HTML file is generated from the HTML source code using a DOM parser (W3C, 2007). The DOM tree obtained from the HTML file of the Chapters web document (Chapters.ca, 2007) is shown in figure 2. In the second step, web zones on the web document are identified from the DOM tree. Then, during the third step, for each zone, web presentation objects and web content objects are sequentially extracted with respect to their hierarchical dependencies.
Figure 2: DOM tree of the web document in figure 1.
We use block levels to extract objects from the DOM tree and consider both block-level tag nodes and non block-level tag nodes. A block-level tag is a tag that can only be a child of the "body" sub-tree or of another block-level tag, and should be a parent of other
tags, like table, division, heading, list, form, block
quotation, paragraph, and address. A non block-level tag is an inline or text-level tag which is a child of a block-level tag and is mainly a DOM tree leaf.
For example, non block-level tags are anchor, cita-
tion, image, object, span, script. Since block-level
tags mainly structure web documents and non block-level tags enclose web document content or presentation, two search approaches are used to explore the DOM tree. The extraction process follows a depth-first search through block-level tags until a non block-level tag is found. It then switches to a breadth-first search: when a non block-level tag is parsed, its sibling set is evaluated to associate web content objects and web presentation objects with this set of tags. Throughout the extraction process, only tags whose "display" or "type" attributes are not equal to "hidden" or "none" are processed. This condition ensures that only tags meaningful to users are used.
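A minimal sketch of this two-mode traversal, assuming an lxml-style node API (node.tag, node.get(), iteration over children); the block-level tag set follows the list above, the visibility test is simplified to direct "display"/"type" attributes as described in the text, and the emit callback is a placeholder for object association.

BLOCK_LEVEL = {"table", "div", "h1", "h2", "h3", "h4", "h5", "h6",
               "ul", "ol", "dl", "form", "blockquote", "p", "address"}

def is_visible(node):
    """Keep only tags whose display/type attributes do not hide them."""
    return (node.get("display") not in ("none", "hidden")
            and node.get("type") not in ("none", "hidden"))

def flush(run, emit):
    """Associate one web object with the collected run of inline siblings."""
    if run:
        emit(list(run))
        run.clear()

def extract_objects(node, emit):
    """Depth-first through block-level tags, breadth-first over inline siblings."""
    run = []
    for child in node:
        if not is_visible(child):
            continue
        if child.tag in BLOCK_LEVEL:
            flush(run, emit)                # close the current inline run
            extract_objects(child, emit)    # depth-first into the block
        else:
            run.append(child)               # breadth-first: collect the inline run
    flush(run, emit)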
3.1 Web Zone Object Extraction
Algorithm
Procedure. From a web document DOM tree, extracting web zone objects requires extracting the navigation menus which divide the web document into useful and meaningful zones. A web document is composed of at least one web zone object, e.g., a BodyZone, and up to three zone objects, e.g., a HeaderZone, a BodyZone, and a FootZone, because the interest of data differs depending on where it is located. Thus, we define an algorithm which receives as input a web document DOM tree and returns an array containing web zone objects. It uses string comparison for parsing tag sets. From the web document DOM trees
analyzed during our preliminary work, we notice, as (Liu et al., 2003) did, that some nodes have a specific role.
In (Liu et al., 2003), some of these specialized nodes
are used to only identify data records, e.g., general-
ized nodes, but we also use those which help us with
identifying web zones. We call them region nodes. A region node is the root node of a tree which: (1) has more than half of its children with similar tag strings, (2) has a height of at least 3, and (3) has at least three children.
Contrary to (Liu et al., 2003), which requires that the generalized nodes of a data region have the same length, region nodes do not require the same length because
the goal is not identifying data records. Parts of web
documents are separated by navigation blocks, e.g.,
menus. We notice that main menus existing on web
documents are usually composed of at least five links
(<a> HTML tag) with or without special keywords in
their contents. We call these sets of tags Series. A Series is a set of five or more <a> or <area> (hyperlink or zone map) sibling nodes. A web document is composed of three zones at most, so the number of series used to separate these zones is at most two. These two specific series are called "Series1" and "Series2". They are identified according to the region nodes associated with the first and the last series in the DOM tree, respectively.
Series2 can also be identified as the series en-
closing one of these keywords “copyright”, “private”,
“policy”, “about our company” because these menus
usually contain information about the owner com-
pany. In our algorithm for web zone extraction, region nodes are first identified and labeled, then Series1 and Series2 are searched for to set up web zone objects. Only the subtree of "body" is meaningful for web object extraction because it contains information rendered to users, contrary to the subtree associated with "head". An application of the process is
defined below. From our work on web zone extrac-
tion, we notice that content and presentation objects
before the first menu and included in less than half of
the DOM tree belong to the web document Header-
Zone. However, content and presentation data after
the last menu and included in the second half of the
DOM tree belongs to FootZone. So, between these
two zones, content and presentation data belong to
BodyZone. Hence, our hypotheses for Series searching are: 1) if the search for Series1 goes over half of the DOM tree size, the web document does not have a HeaderZone, Series1 is empty, and the BodyZone's first tag is the first region node child of the closest region node in the subtree rooted at "body"; 2) if the search for Series2, from the half of the DOM tree to its end, does not find it, the web document does not have a FootZone and the BodyZone's last tag is the last region node child of the closest region node in the subtree rooted at "body". Series1 and Series2 are used to set up web zone objects according to rules called "ZR", which stands for Zone Rule (cf. the ZR rules in the Application paragraph below).
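The region-node and Series tests can be sketched as follows, under the same assumed node API as before; the thresholds follow the definitions above and the helper names are ours.

from collections import Counter

def height(node):
    """Height of the sub-tree rooted at node (a leaf has height 1)."""
    children = list(node)
    return 1 + (max(height(c) for c in children) if children else 0)

def is_region_node(node):
    """At least three children, height of at least 3, and more than half of the children with similar tag strings."""
    children = list(node)
    if len(children) < 3 or height(node) < 3:
        return False
    most_common_count = Counter(c.tag for c in children).most_common(1)[0][1]
    return most_common_count > len(children) / 2

def is_series(sibling_nodes):
    """A Series is a set of five or more <a> or <area> sibling nodes."""
    return sum(1 for n in sibling_nodes if n.tag in ("a", "area")) >= 5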
Application. The web content and web presentation object extraction algorithm applied to the web document DOM tree presented in figure 2 uses an array of web zone objects called WebZoneObjectArray composed of three cells. We use the node labels of figure 2 to cite nodes; for example, "div2" stands for the <div> labeled 2 and "ul1-B1" stands for the <ul> of subtree 1 labeled B1. During the first task, some nodes are labeled as region nodes, like "form", "div1", "div4-A", and "div9", sketched in figure 2. Then, the search for Series1 and Series2 through the DOM tree effectively begins from the node "body".
Subtrees "ul1-B1" and "ul1-B2" are parsed; they include four and three <a> tags respectively, so they are not special series because they do not have at least five <a> or <area> tags. Parsing subtree 2-A2, a series of 13 <area> tags is detected. The closest region node of the Body node is the "form" node, and its first region node child is <div2>. So, Series1 is found and set to <div2>, i.e., "html/body/form/div2". Through the DOM tree, subtrees 3, 4, 5, 6, 7, 8, and 9 are parsed and several series of <a> tags are found; the last one is in the subtree labeled 9-E (five <a> tags). The last region node child of "form" is <div9>, and Series2 is set to <div9>.
Hence, with respect to the ZR rules, the web zone objects can be initialized.
ZR1: HeaderZone.FirstTag is Series1's first sibling. It is "div1" because it is the first sibling of <div2>.
ZR2: HeaderZone.LastTag is Series1's previous sibling. It is "div1" because it is also the previous sibling of <div2>.
ZR3: BodyZone.FirstTag is Series1. It is "div2".
ZR4: BodyZone.LastTag is Series2's previous sibling. It is "div8" because it is the previous sibling of <div9>.
ZR5: FootZone.FirstTag is Series2. It is "div9".
ZR6: FootZone.LastTag is Series2's last sibling. It is "div9" because <div9> is the last sibling at this level in the DOM tree.
Then, HeaderZone.NbTag is set to 1 because the HeaderZone is defined through only one subtree. BodyZone.NbTag is set to 7 because there are 7 nodes at the same level between "div2" and "div8". FootZone.NbTag is set to 1 because the FootZone is defined through only one subtree.
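A sketch of how the ZR rules turn Series1 and Series2 into zone boundaries, assuming the two series have already been resolved to positions among the children of the closest region node of "body" (div2 and div9 in the example); the function and field names are ours.

def build_zones(zone_level_tags, series1_index=None, series2_index=None):
    """Apply the ZR rules to the ordered tags at the zone level.

    zone_level_tags: ordered list of tag labels, e.g. ["div1", ..., "div9"].
    series1_index / series2_index: positions of Series1 and Series2, or None
    when the corresponding zone does not exist.
    """
    zones = {}
    if series1_index is not None and series1_index > 0:
        zones["HeaderZone"] = zone_level_tags[:series1_index]       # ZR1, ZR2
    start = series1_index if series1_index is not None else 0
    end = series2_index if series2_index is not None else len(zone_level_tags)
    zones["BodyZone"] = zone_level_tags[start:end]                  # ZR3, ZR4
    if series2_index is not None:
        zones["FootZone"] = zone_level_tags[series2_index:]         # ZR5, ZR6
    return zones

# Chapters example: children div1..div9, Series1 at div2, Series2 at div9.
children = ["div%d" % i for i in range(1, 10)]
zones = build_zones(children, series1_index=1, series2_index=8)
# zones["HeaderZone"] == ["div1"], len(zones["BodyZone"]) == 7, zones["FootZone"] == ["div9"]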
3.2 Web Content and Presentation
Object Extraction Algorithm
Procedure. In a web zone, our web object extrac-
tion process begins with web presentation objects and
finishes with web content objects. We provide two
algorithms called PresWebObjectScan and ContWe-
bObjectScan. The PresWebObjectScan algorithm extracts presentation objects with respect to our web presentation class hierarchy, yielding objects such as
"LegalInformation" and "Menu", whereas the ContWebObjectScan algorithm extracts web content objects with respect to our web content class hierarchy, yielding objects such as "TextElement", "DefinitionList", and "PlugInServer". Web presentation objects and web content objects are extracted for each web zone object from the DOM tree. We assume that it is unlikely that an object is spread over several zone objects, because web zone objects are defined as sub-trees at the same level in the DOM tree and the enclosed tags are supposed to be coherent and well closed. It is worth mentioning that several web presentation objects can render a web content object, and vice versa.
These algorithms are similar in many steps but dif-
fer in the way of detecting siblings through breadth-
first search. More precisely, PresWebObjectScan scans siblings until a block-level node is reached or no more siblings are left, whereas ContWebObjectScan scans siblings until a node with a dissimilar tag string or a block-level node is reached, or no more siblings are left. ContWebObjectScan is slightly different from PresWebObjectScan but simpler, and due to space limitations we present only PresWebObjectScan. The latter receives as input the DOM tree
and a web zone object called WDZoneObject. Ob-
jects are extracted and flagged with a counter called
“indTag” per web zone. Another counter called
“nbTag” is used to keep track of the number of tags
parsed at the same level as the web zone object’s first
tag. At the beginning of this algorithm, an array for
web presentation object storage is created. Then, it
goes through the DOM tree from its root to WDZoneObject's first tag and parses without going over WDZoneObject's number of tags "nbTag". Then, it recursively scans the sub-trees of each block-level node, calling the ProcessPresentationSibling algorithm presented
in figure 3. Through this search, it looks for non
block-level nodes, and as soon as one is found, its siblings are explored by a breadth-first search until a block-level node is reached or no more siblings are left. Then, a web presentation object is associated with these sibling tags using the web presentation objects defined in section 2.2. The web presentation object counter is incremented after each web presentation object extraction. In short, the web presentation object extraction process uses a depth-first search when a block-level tag is found and a breadth-first search when a non block-level tag is found; the appropriate web presentation object is then associated with the set of similar or dissimilar siblings.
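The difference between the two scans reduces to their sibling stop conditions. A minimal sketch, reusing the assumed BLOCK_LEVEL set and node API from the earlier traversal sketch:

def pres_sibling_run(first, following_siblings):
    """PresWebObjectScan: collect siblings until a block-level node or none is left."""
    run = [first]
    for sibling in following_siblings:
        if sibling.tag in BLOCK_LEVEL:
            break
        run.append(sibling)
    return run

def cont_sibling_run(first, following_siblings):
    """ContWebObjectScan: additionally stop at the first sibling with a dissimilar tag string."""
    run = [first]
    for sibling in following_siblings:
        if sibling.tag in BLOCK_LEVEL or sibling.tag != first.tag:
            break
        run.append(sibling)
    return run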
Application. The PresWebObjectScan algorithm is applied to the HeaderZone of our web document example, so the inputs are the DOM tree presented in figure 2 and the HeaderZone object. The process begins by creating PresentationObjectArray and, from the DOM tree root "html", reaches <div1>, which is HeaderZone.FirstTag. Then, it reaches "a1-A" and calls the ProcessPresentationSibling algorithm (call 1 of
the algorithm) with “html/body/form/div1/a1-A”, the
DOM tree, PresentationObjectArray and indTag=1 as
inputs.
ProcessPresentationSibling stores "html/body/form/div1/a1-A" into tagArray. Its unique sibling "html/body/form/div1/div" is a block-level node (<div>), and the web presentation object associated with tagArray[0], which is an <a> tag with an inner <img> tag, is a LegalInformation object.
It is also the first element of PresentationObjectAr-
ray. The counter “indTag” is incremented to 2 and
ProcessPresentationSibling (call 2 of the algorithm)
is called with “html/body/form/div1/div1-B”, the
DOM tree, PresentationObjectArray and indTag=2
as inputs. “html/body/form/div1/div1-B/ul1-B1”,
is reached and since it is a block-level node, Pro-
cessPresentationSibling (call 3 of the algorithm) is
called with "html/body/form/div1/div1-B/ul1-B1/li1-B1A", the DOM tree, PresentationObjectArray and indTag=2 as inputs. <li1-B1A> is a non block-level node,
so itself and all siblings which are non block-level
are stored in tagArray. Then, the web presentation
object extracted is a Menu object because tagArray is
composed of four <li> tags and each of them has an
inner <a> tag. This Menu is the second element of
PresentationObjectArray.
The counter “indTag” is incremented to 3 and
as there is no more sibling of <li1-B1A> left
ProcessPresentationSibling (algorithm call 3) is over.
In ProcessPresentationSibling (algorithm call 2),
<ul1-B1> has a sibling which is <ul1-B2> which
is a block-level node. By a depth-first search the process reaches <li1-B2A> and ProcessPresentationSibling (call 4 of the algorithm) is called with "html/body/form/div1/div1-B/ul1-B2/li1-B2A", the DOM tree, PresentationObjectArray and indTag=3 as inputs. <li1-B2A> is a non block-level node, so
itself and all siblings which are non block-level
are stored in tagArray. Then, the web presentation
object extracted is a Menu object because tagArray
is composed of four <li> tags and each of them
has an inner <a> tag. This Menu is the third
element of PresentationObjectArray. The counter
"indTag" is incremented to 4 and, as there is no more <li1-B2A> sibling left, ProcessPresentationSibling (algorithm call 4) is over. In ProcessPresentationSibling (algorithm call 2), there is no more <ul1-B2> sibling, so ProcessPresentationSibling (algorithm call 2) is over. ProcessPresentationSibling (algorithm call 1) is also over because there is no more <a1-A> sibling left.
Back in PresWebObjectScan algorithm, the number
Algorithm ProcessPresentationSibling (TTag, DOMTree, PresentationObjectArray, indTag)
Input: TTag is the HTML tag value whose siblings will be processed
Other variables: tagArray is an array of similar or dissimilar tag siblings of TTag with distinct names.
Comment: PresentationObjectArray is the global array storing web presentation objects. indTag is a global
index for labeling presentation objects per zone.
begin
if TTag is not a block-level tag then
repeat
-Store TTag in tagArray
-Store TTag siblings found in tagArray
until there is no more sibling left
for each TTagSibling in tagArray
begin
If TTagSibling is a block-level tag then
-Associate an object to tagArray[TTagSibling index-1] with respect to our web presentation object
model
-Store this object in the PresentationObjectArray cell indexed indTag
-indTag:= indTag+1
-TTag is set up with TTagSibling
-Call recursively ProcessPresentationSibling (TTag, DOMTree, PresentationObjectArray, indTag);
endif;
end;
-Associate an object to tagArray with respect to our presentation web class hierarchy
-Store this object in the PresentationObjectArray cell indexed indTag
-indTag:= indTag+1
else
-TTag is set up with the next node of the DOMtree by a depth-first search
-Call recursively ProcessPresentationSibling (TTag, DOMTree, PresentationObjectArray, indTag)
-TTag is initialized to the next node of the DOMtree by a breadth-first search
endif; /*All TTag siblings have been parsed*/
end;
Figure 3: Algorithm for sibling search in web presentation object extraction process.
of tags parsed is incremented to one, which means the end of the loop because HeaderZone.NbTag=1, and PresWebObjectScan ends. Hence, the result of applying PresWebObjectScan on the Chapters web page HeaderZone is an array of three web presentation objects: LegalInformation1, Menu2 and Menu3.
These web presentation objects of the Chapters web document HeaderZone are sketched in figure 2, and the web presentation object hierarchy obtained is sketched in figure 4. The web content objects of this zone, obtained by applying ContWebObjectScan, are sketched with a dashed arrow in figure 1. At this point, an advantage of our work is the ability to detect fake Chapters web pages based only on the presentation
view. For that, we have to look for web pages which match the pattern of LegalInformation1 (the Chapters logo) and Menu2 (the menu including the user account) in their HeaderZone, as it is representative and relevant for every Chapters web page.
Figure 4: Hierarchy of web objects of figure 1 according to presentation view.
In the same way, according to the content view, ImageLink objects and
ListTextlink objects can be mined to find the common features of Chapters' fake websites. The interest of the two views lies in crossing the patterns found from web content object extraction with those from web presentation object extraction to refine and adjust the profile. Existing approaches allow us to extract objects like images or data records. On top of extracting these objects, with our approach the search space can be limited to either zones of web documents or a view (content or presentation) of web documents.
4 CONCLUSIONS
The proposed object-oriented meta model for representing web documents as a UML diagram of objects consists of two UML class hierarchies. This
model is composed of six web content classes and six
web presentation classes because we assume that pre-
sentation of web contents impacts user understanding.
It represents fine-grained levels of content (such as title), high levels of content (such as text), unstructured presentation data (such as banner), loosely-structured presentation data (such as menu), and strictly-structured data
(such as record). The data extraction algorithm pro-
cesses any given web document. It pays attention to
content and presentation aspects of data on web documents mentioned in related work, and it generalizes previous work because it goes beyond data blocks and records on web documents. This new object-oriented web data model accesses web documents either by objects with fine granularity or at the record level. Our model is suitable for handling web documents of any application domain, whether they are generated by database systems, which contain mainly structured data, or generated by human beings, which contain unstructured data. Our model also
allows us to narrow the search space in terms of ei-
ther content or presentation, or both, and also in terms
of more precise zones of web documents. Represent-
ing a web document as a list of objects provides a framework for other web applications because even separators between web content objects are modeled (which is useful for web segmentation work), as are all data types of web documents. Manual applications of the proposed technique on a number of web pages generated detailed web object hierarchies and their accompanying databases for mining. We are working on a complete automation of the proposed algorithms to instantly generate objects for any given set of web pages, and on mining various object-level association rule patterns and sequential patterns with further experimentation. These rules aim at identifying similarities and trends between sets of objects within and across web documents, so that they can be discovered easily.
REFERENCES
Abiteboul, S. (1997). Querying semi-structured data. In
Afrati, F. N. and Kolaitis, P. G., editors, Database The-
ory - ICDT ’97, Greece, January 8-10, 1997,, volume
1186 of LNCS, pages 1–18. Springer.
Arasu, A. and Garcia-Molina, H. (2003).
Extracting structured data from web pages. In SIG-
MOD ’03 international conference on Management of
data, pages 337–348, New York, NY, USA. ACM.
Chapters.ca (2007). Chapters Canada website (November 2007).
Crescenzi, V., Mecca, G., and Merialdo, P. (2001). Road-
runner: Towards automatic data extraction from large
web sites. In 27th International Conference on Very
Large DataBases, pages 109–118.
Gottlob, G. and Koch, C. (2004). Logic-based web infor-
mation extraction. SIGMOD Rec., 33(2):87–94.
Kosala, R. and Blockeel, H. (2000). Web mining research:
a survey. SIGKDD Explor. Newsl., 2(1):1–15.
Levering, R. and Cutler, M. (2006). The portrait of a com-
mon html web page. In DocEng ’06 ACM symposium
on Document engineering, pages 198–204, New York,
NY, USA. ACM.
Li, J. and Ezeife, C. I. (2006). Cleaning web pages for ef-
fective web content mining. In DEXA, pages 560–571.
Liu, B. and Chen-Chuan-Chang, K. (2004). Editorial: spe-
cial issue on web content mining. SIGKDD Explor.
Newsl., 6(2):1–4.
Liu, B., Grossman, R., and Zhai, Y. (2003). Mining data
records in web pages. In KDD ’03, pages 601–606,
New York, NY, USA. ACM.
Song, R., Liu, H., Wen, J.-R., and Ma, W.-Y. (2004).
Learning block importance models for web pages.
In WWW’04, pages 203–211, New York, NY, USA.
ACM.
UN (2007). United Nations English index web page (November 2007).
W3C, W. W. W. C. (2007). Document object model stan-
dard. http://www.w3.org/DOM/.
Yu, S., Cai, D., Wen, J.-R., and Ma, W.-Y. (2003). Improv-
ing pseudo-relevance feedback in web information re-
trieval using web page segmentation. In WWW’03,
pages 11–18, New York, NY, USA. ACM.
Zhao, H., Meng, W., Wu, Z., Raghavan, V., and Yu, C. (2005). Fully automatic wrapper generation for
search engines. In WWW ’05, pages 66–75, New York,
NY, USA. ACM.