SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER
Joo Yong Lee, Sang Ho Lee, Yanggon Kim
2007
Abstract
As the Web grows, it becomes increasingly important to parallelize the crawling process so that pages can be downloaded in a reasonable amount of time. This paper presents the design and implementation of an effective parallel web crawler. We first present various design choices and strategies for a parallel web crawler, then describe our crawler’s architecture and implementation techniques. In particular, we investigate the URL distributor used for URL balancing and the scalability of our crawler.
Paper Citation
in Harvard Style
Lee, J. Y., Lee, S. H. and Kim, Y. (2007). SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER. In Proceedings of the Second International Conference on e-Business - Volume 1: ICE-B, (ICETE 2007), ISBN 978-989-8111-11-1, pages 151-156. DOI: 10.5220/0002108701510156
in BibTeX Style
@conference{ice-b07,
author={Joo Yong Lee and Sang Ho Lee and Yanggon Kim},
title={SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER},
booktitle={Proceedings of the Second International Conference on e-Business - Volume 1: ICE-B, (ICETE 2007)},
year={2007},
pages={151-156},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002108701510156},
isbn={978-989-8111-11-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the Second International Conference on e-Business - Volume 1: ICE-B, (ICETE 2007)
TI - SCRAWLER: A SEED-BY-SEED PARALLEL WEB CRAWLER
SN - 978-989-8111-11-1
AU - Lee, Joo Yong
AU - Lee, Sang Ho
AU - Kim, Yanggon
PY - 2007
SP - 151
EP - 156
DO - 10.5220/0002108701510156