A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET

Shervin Daneshpajouh, Mojtaba Mohammadi Nasiri, Mohammad Ghodsi

Abstract

In this paper, we present a new, fast algorithm for generating the seed set for web crawlers. A typical crawler starts from a fixed set of URLs, such as DMOZ links, and then continues crawling from the URLs found in those pages. A crawler should download more good pages in fewer iterations, where crawled pages are considered good if they have high PageRank and come from different communities. Our algorithm is based on the HITS algorithm and generates a crawler's seed set in O(n) running time. Starting from the seed set it generates, a crawler can download high-quality web pages from different communities in fewer iterations.
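The abstract does not describe the seed-generation procedure itself, so the following is only a minimal sketch of the standard HITS iteration that the algorithm is said to build on, not the authors' method. The function names `normalize`, `hits`, and `top_hubs`, the adjacency-dictionary input, and the idea of taking the highest-scoring hubs as seed candidates are all assumptions made for illustration.

```python
from math import sqrt


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(graph, iterations=20):
    """Standard HITS on a dict {page: [pages it links to]}.

    Returns (hub, authority) score dictionaries. This is plain HITS,
    not the paper's seed-generation algorithm.
    """
    nodes = set(graph) | {d for targets in graph.values() for d in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum the new authority scores of pages linked to.
        new_hub = {n: sum(new_auth[d] for d in graph.get(n, ())) for n in nodes}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


def top_hubs(graph, k):
    """Hypothetical seed picker: the k pages with the highest hub scores."""
    hub, _ = hits(graph)
    return sorted(hub, key=hub.get, reverse=True)[:k]


# Toy link graph (invented for illustration only).
toy_graph = {"a": ["c", "d"], "b": ["c", "d"], "c": [], "d": ["c"]}
hub, auth = hits(toy_graph)
seeds = top_hubs(toy_graph, 2)
```

In the toy graph, pages `a` and `b` link to the most-referenced pages and therefore receive the highest hub scores, while `c` receives the highest authority score. Each iteration touches every link once, so a fixed number of iterations costs time proportional to the size of the graph.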


Paper Citation


in Harvard Style

Daneshpajouh S., Mohammadi Nasiri M. and Ghodsi M. (2008). A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET. In Proceedings of the Fourth International Conference on Web Information Systems and Technologies - Volume 2: WEBIST, ISBN 978-989-8111-27-2, pages 98-105. DOI: 10.5220/0001527400980105


in Bibtex Style

@conference{webist08,
author={Shervin Daneshpajouh and Mojtaba Mohammadi Nasiri and Mohammad Ghodsi},
title={A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET},
booktitle={Proceedings of the Fourth International Conference on Web Information Systems and Technologies - Volume 2: WEBIST},
year={2008},
pages={98-105},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001527400980105},
isbn={978-989-8111-27-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the Fourth International Conference on Web Information Systems and Technologies - Volume 2: WEBIST
TI - A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET
SN - 978-989-8111-27-2
AU - Daneshpajouh S.
AU - Mohammadi Nasiri M.
AU - Ghodsi M.
PY - 2008
SP - 98
EP - 105
DO - 10.5220/0001527400980105