co-citation principle, first described in (Small, 1973),
which claims that the relationship between two documents
can be rated by the frequency at which they
appear together in citation lists. In the context of
the Web this principle has been exploited by several
authors (e.g. (Spertus, 1997; Dean and Henzinger,
1999)): the overall idea is that two pages are likely
to be related if they are linked by the same page. In
the context of large web sites, we claim that, as sites
are built automatically, this principle can be restated
by saying that pages pointed to by the same link collection
are likely to be structurally homogeneous.
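As a minimal illustration of the classic co-citation idea (not of the link-collection variant used in this paper), the following Python sketch counts how many pages link to both members of each pair of pages. The toy graph, the page names, and the function name cocitation_counts are hypothetical and serve only as an example.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(outlinks):
    """For every unordered pair of pages, count the pages that link to both."""
    counts = Counter()
    for source, targets in outlinks.items():
        # Deduplicate and sort so that each pair is counted once per
        # citing page and always appears in the same (a, b) order.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical toy site graph: page -> pages it links to.
outlinks = {
    "index.html": ["team1.html", "team2.html", "about.html"],
    "league.html": ["team1.html", "team2.html"],
    "news.html": ["team1.html", "about.html"],
}

counts = cocitation_counts(outlinks)
print(counts[("team1.html", "team2.html")])  # 2
```

Under the co-citation principle, pairs with a higher count are more likely to be related. The restated principle applies the same intuition at a finer granularity, grouping pages by the link collection that points to them rather than by the whole citing page.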
6 CONCLUSIONS AND FUTURE WORK
In this paper, we have developed an automatic technique
for retrieving web pages similar in structure to
a single sample page. The output set of pages, being
structurally homogeneous, can be given as input to a
wrapper generator system. Our approach relies on
the observation that in Web sites made of automatically
generated pages, a degree of regularity also occurs
in the overall topological structure of the site
graph. Though large and complex, the site graph is
organized according to intricate yet regular networks of
links. Our techniques aim at exploiting such regularities.
One could think that a different approach for
collecting pages is to download all the pages of the
target web site, and then to cluster them according to
their structure (the cluster containing the sample page
would be the expected output). We observe that such
an approach would not be efficient, due to the number
of pages of the Web site, which can be several orders of
magnitude higher than the number of the searched pages.
We have tested our method on several real-life
web sites, obtaining interesting results. We are
currently integrating the presented techniques with
ROADRUNNER, our web wrapper generation system.
We believe that this step can lead to significant im-
provements in the scalability of data extraction from
the web.
Our approach is suitable for crawling dynamic
pages. On the other hand, it is not designed to crawl
pages behind forms. To extract data from these
important sources as well, our system could cooperate
with specific techniques, such as those proposed in
(Raghavan and Garcia-Molina, 2001; Palmieri et al., 2002).
REFERENCES
Arasu, A. and Garcia-Molina, H. (2003). Extracting struc-
tured data from web pages. In Proc. of ACM SIGMOD
2003.
Chakrabarti, S., Dom, B., Kumar, S., Raghavan, P.,
Rajagopalan, S., Tomkins, A., Gibson, D., and
Kleinberg, J. (1999a). Mining the web’s link structure.
Computer, 32(8):60–67.
Chakrabarti, S., van den Berg, M., and Dom, B. (1999b).
Focused crawling: a new approach to topic-specific
Web resource discovery. Computer Networks (Ams-
terdam, Netherlands), 31(11–16):1623–1640.
Crescenzi, V. and Mecca, G. (2004). Automatic information
extraction from large web sites. Journal of the ACM,
51(5).
Crescenzi, V., Mecca, G., and Merialdo, P. (2001). ROAD-
RUNNER: Towards automatic data extraction from
large Web sites. In VLDB 2001.
Crescenzi, V., Merialdo, P., and Missier, P. (2003). Fine-
grain web site structure discovery. In Proc. of ACM
CIKM WIDM 2003.
Dean, J. and Henzinger, M. R. (1999). Finding related pages
in the world wide web. Computer Networks, 31:1467–
1479.
Kao, H., Lin, S., Ho, J., and Chen, M.-S. (2004). Mining web
informative structures and contents based on entropy
analysis. IEEE Trans. on Knowledge and Data
Engineering, 16(1):41–44.
Laender, A., Ribeiro-Neto, B., Da Silva, A., and Teixeira, J.
(2002). A brief survey of web data extraction tools.
ACM SIGMOD Record, 31(2).
Lerman, K., Getoor, L., Minton, S., and Knoblock, C.
(2004). Using the structure of web sites for automatic
segmentation of tables. In Proc. of ACM SIGMOD 2004.
Liu, Z., Ng, W. K., and Lim, E.-P. (2004). An automated
algorithm for extracting website skeleton. In Proc. of
DASFAA 2004.
Palmieri, J., da Silva, A., Golgher, P., and Laender, A.
(2002). Collecting hidden web pages for data extrac-
tion. In Proc. of ACM CIKM WIDM 2002.
Raghavan, S. and Garcia-Molina, H. (2001). Crawling the
hidden web. In Proc. of VLDB 2001.
Small, H. (1973). Co-citation in the scientific literature: a
new measure of the relationship between two documents.
Journal of the American Society for Information
Science, 24(4):28–31.
Spertus, E. (1997). Mining structural information on the
web. In Proc. of WWW Conf. 1997.
Van Rijsbergen, C. (1979). Information Retrieval, 2nd edi-
tion. University of Glasgow.
Wang, J. and Lochovsky, F. (2002). Data-rich section ex-
traction from html pages. In WISE 2002.
Bar-Yossef, Z. and Rajagopalan, S. (2002). Template
detection via data mining and its applications. In Proc.
of WWW Conf. 2002.
WEBIST 2005 - WEB INTERFACES AND APPLICATIONS