We experimentally measured how the average number of pages downloaded per second per thread changes with the number of crawling machines. For this experiment, we used 250,000 randomly selected Korean sites as seeds. We increased the number of crawling machines from one to five in increments of one, set the number of crawling threads on each machine to 10, 15, and 20, and had each thread crawl 10 seeds simultaneously.
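As a rough illustration of how this metric can be collected, the sketch below runs a fixed number of crawling threads and reports the average number of pages downloaded per second per thread. It is not the actual SCrawler code; the names and parameters (fetch_page, SEEDS_PER_THREAD, the measurement window) are illustrative assumptions.

    # Illustrative sketch only (not SCrawler itself): measure the average
    # number of pages downloaded per second per thread.
    import threading
    import time

    SEEDS_PER_THREAD = 10              # each thread works on 10 seeds at once
    pages_downloaded = 0
    counter_lock = threading.Lock()

    def fetch_page(url):
        """Placeholder for the real HTTP download performed by the crawler."""
        ...

    def crawl_worker(seed_urls):
        """Download pages starting from the given seeds, counting each page."""
        global pages_downloaded
        frontier = list(seed_urls)
        while frontier:
            url = frontier.pop(0)
            fetch_page(url)
            with counter_lock:
                pages_downloaded += 1
            # newly extracted links would be appended to the frontier here

    def run_experiment(seeds, num_threads, duration_sec=60):
        """Return the average pages/second/thread over a fixed time window."""
        start = time.time()
        for i in range(num_threads):
            chunk = seeds[i * SEEDS_PER_THREAD:(i + 1) * SEEDS_PER_THREAD]
            threading.Thread(target=crawl_worker, args=(chunk,), daemon=True).start()
        time.sleep(duration_sec)
        elapsed = time.time() - start
        return pages_downloaded / elapsed / num_threads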
Figure 9 shows how the average number of pages downloaded per second per thread changes as the number of crawling machines increases. The solid line, long dashes, and short dashes represent crawling with 20, 15, and 10 threads, respectively. The more scalable the system, the closer these lines are to horizontal. From the results, we believe that SCrawler scales almost linearly with the number of crawling machines. One might notice that the lines are not completely horizontal; we attribute this to the limitations of our network resources, since we ran the experiment on a campus network whose conditions vary over time.
Figure 9: Average number of pages crawled per second
per thread.
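For concreteness, the following back-of-the-envelope check (with hypothetical per-thread rates, not our measured data) shows what this metric implies: if the per-thread rate stays roughly constant as machines are added, total throughput grows roughly linearly with the number of machines.

    # Hypothetical numbers for illustration only; not the measured data.
    def total_throughput(machines, threads_per_machine, pages_per_sec_per_thread):
        return machines * threads_per_machine * pages_per_sec_per_thread

    # nearly constant per-thread rates for 1..5 machines, 20 threads each
    per_thread_rates = [3.0, 2.9, 2.9, 2.8, 2.8]
    for m, rate in enumerate(per_thread_rates, start=1):
        print(m, total_throughput(m, 20, rate))
    # Prints roughly 60, 116, 174, 224, 280 pages/sec: close to linear in m.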
4 CLOSING REMARKS
The development of SCrawler is ongoing. Dynamically generated content is constantly being created, and the Web continues to grow tremendously. Our next extension of SCrawler will be to selectively crawl web pages that are relevant to a pre-defined set of topics.
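As one possible illustration of such selective crawling (the keyword set, relevance score, and threshold below are assumptions for the sketch, not part of SCrawler), the crawler could follow outlinks only from pages judged relevant to the topic set:

    # Illustrative topic filter; keywords and threshold are assumed values.
    TOPIC_KEYWORDS = {"electronic", "commerce", "payment", "auction"}

    def is_relevant(page_text, threshold=0.1):
        """Score a page by the fraction of its words that are topic keywords."""
        words = page_text.lower().split()
        if not words:
            return False
        hits = sum(1 for w in words if w in TOPIC_KEYWORDS)
        return hits / len(words) >= threshold

    def maybe_enqueue(page_text, outlinks, frontier):
        """Add a page's outlinks to the frontier only if the page is on-topic."""
        if is_relevant(page_text):
            frontier.extend(outlinks)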
ACKNOWLEDGEMENTS
This work was supported by Seoul R&BD Program
(10581cooperateOrg93112).