Web Forums Change Analysis
Tomasz Kaczmarek, Dawid Grzegorz Węckowski
2013
Abstract
In this paper we present results from an experiment conducted on over 27,900 web pages gathered every 2 hours over 22 days from 16 forums (4256 independent crawls) to investigate how these web pages evolve over time. The results of the experiment became the basis for design choices for a focused incremental crawler that will be specialized for efficiently gathering documents from web forums while maintaining high freshness of the local collection of obtained pages. The data analysis shows that forums differ from generic web portals, and that identifying places in the source navigational structure where new documents occur more often would make it possible to improve the crawler's performance and the collection freshness.
Paper Citation
in Harvard Style
Kaczmarek T. and Węckowski D. (2013). Web Forums Change Analysis. In Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8565-54-9, pages 105-110. DOI: 10.5220/0004373201050110
in Bibtex Style
@conference{webist13,
author={Tomasz Kaczmarek and Dawid Grzegorz Węckowski},
title={Web Forums Change Analysis},
booktitle={Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2013},
pages={105-110},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004373201050110},
isbn={978-989-8565-54-9},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 9th International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - Web Forums Change Analysis
SN - 978-989-8565-54-9
AU - Kaczmarek T.
AU - Węckowski D.
PY - 2013
SP - 105
EP - 110
DO - 10.5220/0004373201050110