their HeaderZone as it is representative and relevant
for every Chapters web page. In the same way, according
to the content view, ImageLink objects and ListTextlink
objects can be mined to find the common features of
fake Chapters websites. The value of the two views lies
in crossing the patterns found by web content object
extraction with those found by web presentation object
extraction in order to refine and adjust the profile.
Existing approaches allow us to extract objects such as
images or data records. In addition to extracting these
objects, our approach allows the search space to be
limited to particular zones of web documents or to one
view (content or presentation) of web documents.
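The restriction of the search space by zone or by view can be illustrated with a minimal sketch. It assumes that the extracted objects are available as a flat list; the `WebObject` record, its field names, the zone labels and the sample pages are hypothetical and serve only as illustration, not as the implementation of the proposed algorithms.

```python
from dataclasses import dataclass


# Hypothetical flat record for an extracted web object. The field names
# (view, kind, zone, value) are illustrative assumptions.
@dataclass(frozen=True)
class WebObject:
    view: str   # "content" or "presentation"
    kind: str   # e.g. "ImageLink", "ListTextlink", "HeaderZone"
    zone: str   # zone of the page that holds the object
    value: str  # extracted text, URL, or label


def restrict(objects, view=None, zone=None):
    """Keep only the objects that match the requested view and/or zone."""
    return [
        o for o in objects
        if (view is None or o.view == view)
        and (zone is None or o.zone == zone)
    ]


def common_kinds(pages, view=None, zone=None):
    """Object kinds that occur, within the restricted space, on every page."""
    kind_sets = [
        {o.kind for o in restrict(page, view, zone)} for page in pages
    ]
    return set.intersection(*kind_sets) if kind_sets else set()


# Two made-up pages, each given as a list of extracted objects.
page_a = [
    WebObject("presentation", "HeaderZone", "header", "top banner"),
    WebObject("content", "ImageLink", "header", "logo.png"),
    WebObject("content", "ListTextlink", "body", "category links"),
]
page_b = [
    WebObject("presentation", "HeaderZone", "header", "top banner"),
    WebObject("content", "ImageLink", "header", "logo.png"),
    WebObject("content", "Text", "body", "promotional text"),
]

# Content objects of the header zone that both pages share.
header_content = common_kinds([page_a, page_b], view="content", zone="header")
```

Only the content objects located in the header zone are compared here, so any later mining step would run over a much smaller candidate set than the full object list of each page.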
4 CONCLUSIONS
The proposed object-oriented meta-model represents
web documents as a UML diagram of objects and
consists of two UML class hierarchies. The model is
composed of six web content classes and six web
presentation classes, because we assume that the
presentation of web content affects user understanding.
It represents fine-grained content (such as a title),
high-level content (such as text), unstructured
presentation data (such as a banner), loosely
structured presentation data (such as a menu) and
strictly structured data (such as a record). The data
extraction algorithm processes any given web document.
It addresses both the content and the presentation
aspects of web data discussed in related work, and it
generalizes previous work because it goes further than
data blocks and records in web documents. This new object-oriented
web data model gives access to web documents either at
the level of fine-grained objects or at the record level.
Our model is suitable for web documents of any
application domain, whether they are generated by a
database system and contain mainly structured data or
written by humans and contain unstructured data. Our
model also allows us to narrow the search space in terms
of content, presentation, or both, and also in terms of
more precise zones of web documents. Representing a web
document as a list of objects provides a framework for
other web applications, because the model covers any
data type found in web documents and even the separators
between web content objects (which is useful for web
segmentation work). Manual applications of the
proposed technique to a number of web pages generated
detailed web object hierarchies and their accompanying
databases for mining. We are working on a complete
automation of the proposed algorithms to instantly
generate objects from any given set of web pages and to
mine various object-level association rules and
sequential patterns, with further experiments. These
rules aim at identifying similarities and trends between
sets of objects within and across web documents, so that
they can be discovered easily.
ICEIS 2009 - International Conference on Enterprise Information Systems