EVALUATING GLOBAL LINK STRUCTURE OF THE WEB FOR FOCUSED CRAWLING IN THE GENOMICS AND GENETICS DOMAINS

Ari Pirkola, Tuomas Talvensaari

Abstract

A focused crawler is a program that fetches Web pages that are relevant to a pre-defined domain. In this paper we consider focused crawling in the domains of genomics and genetics. Crawling is often started with seed URLs that point to central universities, research institutions, and other organizations in North America and Europe. We investigate how strongly this central region of the Web is connected to other large geographical regions of the Web: Australia (top-level domain .au), China (.cn), and five South-American countries (.ar, .br, .cl, .mx, and .uy). We consider what implications the observed global link structure has for the selection of seed URLs for focused crawling. The results showed that the proportion of out-links from the North-American and European region to the other regions is low, whereas pages in the other regions often point to the central region. We also found that two focused crawling processes, one started from the central region and the other from another large region, overlap only to a small extent. Overall, the results suggest that the effectiveness of focused crawling can be improved considerably if crawling is started with a geographically heterogeneous seed URL set.
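
The two measurements mentioned in the abstract, the proportion of out-links that point to each region and the overlap between two crawls, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the region labels, the "other" bucket, and the use of Jaccard overlap are assumptions made for illustration, and the grouping of top-level domains simply follows the list given in the abstract.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed mapping from country-code top-level domain to region label,
# following the grouping listed in the abstract.
TLD_REGION = {
    "au": "Australia",
    "cn": "China",
    "ar": "South America",
    "br": "South America",
    "cl": "South America",
    "mx": "South America",
    "uy": "South America",
}


def region_of(url: str) -> str:
    """Return the region label for a URL based on its last host label."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    # Anything outside the listed domains falls into a catch-all bucket.
    return TLD_REGION.get(tld, "other")


def outlink_proportions(outlinks):
    """Share of out-links that point to each region."""
    counts = Counter(region_of(url) for url in outlinks)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: n / total for region, n in counts.items()}


def crawl_overlap(crawl_a, crawl_b):
    """Jaccard overlap of two sets of fetched URLs (an assumed measure)."""
    a, b = set(crawl_a), set(crawl_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Applied to the out-links collected from a crawl seeded in one region, `outlink_proportions` would show how many of them lead to each of the other regions, and `crawl_overlap` would compare the page sets of two crawls started from different seed sets.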


Paper Citation


in Harvard Style

Pirkola A. and Talvensaari T. (2009). EVALUATING GLOBAL LINK STRUCTURE OF THE WEB FOR FOCUSED CRAWLING IN THE GENOMICS AND GENETICS DOMAINS. In Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2009) ISBN 978-989-8111-63-0, pages 499-502. DOI: 10.5220/0001777004990502


in Bibtex Style

@conference{healthinf09,
author={Ari Pirkola and Tuomas Talvensaari},
title={EVALUATING GLOBAL LINK STRUCTURE OF THE WEB FOR FOCUSED CRAWLING IN THE GENOMICS AND GENETICS DOMAINS},
booktitle={Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2009)},
year={2009},
pages={499-502},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001777004990502},
isbn={978-989-8111-63-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Health Informatics - Volume 1: HEALTHINF, (BIOSTEC 2009)
TI - EVALUATING GLOBAL LINK STRUCTURE OF THE WEB FOR FOCUSED CRAWLING IN THE GENOMICS AND GENETICS DOMAINS
SN - 978-989-8111-63-0
AU - Pirkola A.
AU - Talvensaari T.
PY - 2009
SP - 499
EP - 502
DO - 10.5220/0001777004990502