EFFECTS OF CRAWLING STRATEGIES ON THE PERFORMANCE OF FOCUSED WEB CRAWLING

Ari Pirkola, Tuomas Talvensaari

Abstract

Focused crawlers are programs that selectively download Web documents (pages), restricting the scope of crawling to a specific domain or topic. We investigate different focused crawling strategies, including the use of data fusion in focused crawling. Documents in the domains of genomics and genetics were fetched by the Nalanda iVia Focused Crawler using three crawling strategies. In the first strategy, a text classifier was trained to identify relevant documents. In the other two strategies, the identification of relevant documents was based on query-document matching. In the experiments, the crawling results of the individual strategies were combined to yield fused crawling results. The experiments showed, first, that the individual strategies overlap only to a small extent, identifying mainly different relevant documents. Second, a query-based strategy in which the words of the link context were weighted gave the best coverage (i.e., number of relevant documents) after 10 000 and 40 000 documents had been downloaded. The combination of the two query-based strategies was the best fused strategy, but it did not perform better than the best single strategy.
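
The evaluation notions used in the abstract (overlap between strategies, coverage after a fixed number of downloads, and fused results) can be illustrated with a minimal Python sketch. The abstract does not say how the crawling results were fused, so the round-robin interleaving below is an assumption chosen for simplicity, not the authors' method. The function names and the toy document identifiers are likewise hypothetical.

```python
from itertools import zip_longest


def overlap(a, b):
    """Jaccard overlap between two collections of relevant documents."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coverage(downloaded, relevant):
    """Number of relevant documents among those downloaded."""
    return len(set(downloaded) & set(relevant))


def fuse_round_robin(rankings, budget):
    """Interleave several download orders, skipping duplicates,
    until `budget` documents have been collected.
    Assumption: round-robin is only one of many possible fusion schemes."""
    if budget <= 0:
        return []
    fused, seen = [], set()
    for tier in zip_longest(*rankings):
        for doc in tier:
            if doc is None or doc in seen:
                continue
            seen.add(doc)
            fused.append(doc)
            if len(fused) == budget:
                return fused
    return fused


# Hypothetical toy data: download orders of two strategies and the
# set of documents judged relevant.
classifier_order = ["a", "b", "c", "d"]
query_order = ["c", "e", "f", "g"]
relevant = {"a", "c", "e", "f"}

budget = 4
fused_order = fuse_round_robin([classifier_order, query_order], budget)

classifier_hits = set(classifier_order[:budget]) & relevant
query_hits = set(query_order[:budget]) & relevant

print("overlap of relevant documents:", overlap(classifier_hits, query_hits))
print("coverage (classifier):", coverage(classifier_order[:budget], relevant))
print("coverage (query):", coverage(query_order[:budget], relevant))
print("coverage (fused):", coverage(fused_order, relevant))
```

In this toy run the fused order reaches the same coverage as the better single strategy, not a higher one. Because the download budget is fixed, a fused strategy spends part of it on documents from the weaker strategy, so it is not guaranteed to beat the best single strategy.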


Paper Citation


in Harvard Style

Pirkola A. and Talvensaari T. (2009). EFFECTS OF CRAWLING STRATEGIES ON THE PERFORMANCE OF FOCUSED WEB CRAWLING. In Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST, ISBN 978-989-8111-81-4, pages 376-381. DOI: 10.5220/0002037603760381


in Bibtex Style

@conference{webist09,
author={Ari Pirkola and Tuomas Talvensaari},
title={EFFECTS OF CRAWLING STRATEGIES ON THE PERFORMANCE OF FOCUSED WEB CRAWLING},
booktitle={Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST},
year={2009},
pages={376-381},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002037603760381},
isbn={978-989-8111-81-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fifth International Conference on Web Information Systems and Technologies - Volume 1: WEBIST
TI - EFFECTS OF CRAWLING STRATEGIES ON THE PERFORMANCE OF FOCUSED WEB CRAWLING
SN - 978-989-8111-81-4
AU - Pirkola A.
AU - Talvensaari T.
PY - 2009
SP - 376
EP - 381
DO - 10.5220/0002037603760381