SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER

Joo Yong Lee, Sang Ho Lee, Yanggon Kim

Abstract

As the Web grows, it becomes increasingly important to parallelize the crawling process so that pages can be downloaded in a reasonable amount of time. This paper presents the design and implementation of an effective parallel web crawler. We first present various design choices and strategies for a parallel web crawler, and then describe our crawler’s architecture and implementation techniques. In particular, we investigate the URL distributor used for URL balancing and the scalability of our crawler.


Paper Citation


in Harvard Style

Lee, J. Y., Lee, S. H. and Kim, Y. (2007). SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER. In Proceedings of the Second International Conference on e-Business - Volume 1: ICE-B, (ICETE 2007) ISBN 978-989-8111-11-1, pages 151-156. DOI: 10.5220/0002108701510156


in Bibtex Style

@conference{ice-b07,
author={Joo Yong Lee and Sang Ho Lee and Yanggon Kim},
title={SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER},
booktitle={Proceedings of the Second International Conference on e-Business - Volume 1: ICE-B, (ICETE 2007)},
year={2007},
pages={151--156},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002108701510156},
isbn={978-989-8111-11-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Second International Conference on e-Business - Volume 1: ICE-B, (ICETE 2007)
TI - SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER
SN - 978-989-8111-11-1
AU - Lee, Joo Yong
AU - Lee, Sang Ho
AU - Kim, Yanggon
PY - 2007
SP - 151
EP - 156
DO - 10.5220/0002108701510156