However, because the web page change process is random,
basing the revisit policy solely on an estimate of web page
change frequency might not be sufficient. Other factors
could be taken into account, such as the day of the week or
the structure of a forum. Combining information about the
frequency of changes with information about the structure
of a web site, represented as a graph, could indicate the
places in the graph where new information is actually
published. At the same time, it would make it possible to
optimize the revisit policy so that the most relevant URLs
are visited at intervals of less than an hour.
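As an illustration of how a day-of-week factor could enter a revisit policy, the following minimal sketch estimates a per-weekday change rate from past revisits and maps it to a revisit interval. It is not the method evaluated in this paper: the function names, the naive "fraction of visits that found a change" estimator, the linear mapping, and the 15-minute and 24-hour bounds are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime


def weekday_change_rates(observations):
    """Estimate, per weekday, the fraction of revisits that found a change.

    observations: iterable of (datetime, changed) pairs, one per revisit.
    Returns a dict {weekday: rate}, with Monday == 0 and rate in [0, 1].
    This naive estimator is an assumption made for illustration only.
    """
    visits = defaultdict(int)
    changes = defaultdict(int)
    for when, changed in observations:
        day = when.weekday()
        visits[day] += 1
        if changed:
            changes[day] += 1
    return {day: changes[day] / visits[day] for day in visits}


def revisit_interval_minutes(rate, min_minutes=15, max_minutes=24 * 60):
    """Map a change rate in [0, 1] to a revisit interval in minutes.

    Linear interpolation: rate 1.0 gives min_minutes, rate 0.0 gives
    max_minutes. Both bounds are hypothetical defaults.
    """
    rate = min(max(rate, 0.0), 1.0)  # clamp out-of-range estimates
    return max_minutes - rate * (max_minutes - min_minutes)


if __name__ == "__main__":
    # Hypothetical history: a busy Monday and a quiet Saturday.
    history = [
        (datetime(2013, 1, 7, 10), True),
        (datetime(2013, 1, 7, 11), True),
        (datetime(2013, 1, 7, 12), False),
        (datetime(2013, 1, 7, 13), True),
        (datetime(2013, 1, 12, 10), False),
        (datetime(2013, 1, 12, 11), False),
    ]
    for day, rate in sorted(weekday_change_rates(history).items()):
        print(day, rate, revisit_interval_minutes(rate))
```

Under these assumptions, a page that changed on most Monday visits is scheduled far more often on Mondays than on a weekday with no observed changes. Reaching intervals of less than an hour for the most relevant URLs would then depend on the chosen lower bound and on the graph-based signal discussed above, which this sketch does not model.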
ACKNOWLEDGEMENTS
The work published in this article was supported by the
project “Ego – Virtual Identity”, financed by the Polish
National Centre for Research and Development (NCBiR),
contract no. NR11-0037-10.
WEBIST 2013 - 9th International Conference on Web Information Systems and Technologies