A General Evaluation Framework for Adaptive Focused Crawlers

Fabio Gasparetti, Alessandro Micarelli, Giuseppe Sansonetti

Abstract

Focused crawling is increasingly seen as a way to improve the freshness and coverage of local repositories of documents related to specific topics by selectively traversing paths on the web. Adaptation is a distinctive feature that makes it possible to modify the search strategies during the search, according to the particular environment, its changes, and its relationships with the given input parameters. This paper introduces a general evaluation framework for adaptive focused crawlers.

Paper Citation


in Harvard Style

Gasparetti F., Micarelli A. and Sansonetti G. (2014). A General Evaluation Framework for Adaptive Focused Crawlers. In Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-758-024-6, pages 350-358. DOI: 10.5220/0004965903500358


in Bibtex Style

@conference{webist14,
author={Fabio Gasparetti and Alessandro Micarelli and Giuseppe Sansonetti},
title={A General Evaluation Framework for Adaptive Focused Crawlers},
booktitle={Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2014},
pages={350-358},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004965903500358},
isbn={978-989-758-024-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 10th International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - A General Evaluation Framework for Adaptive Focused Crawlers
SN - 978-989-758-024-6
AU - Gasparetti F.
AU - Micarelli A.
AU - Sansonetti G.
PY - 2014
SP - 350
EP - 358
DO - 10.5220/0004965903500358