Authors:
Joo Yong Lee 1; Sang Ho Lee 1 and Yanggon Kim 2
Affiliations:
1 School of Computing, Soongsil University, Republic of Korea; 2 Computer and Information Sciences, Towson University, United States
Keyword(s):
Web crawler, Parallel crawler, Scalability, Web database.
Related Ontology Subjects/Areas/Topics:
Cloud Computing; Collaboration and e-Services; Data Engineering; e-Business; Enterprise Information Systems; Mobile Software and Services; Ontologies and the Semantic Web; Services Science; Software Agents and Internet Computing; Software Engineering; Software Engineering Methods and Techniques; Telecommunications; Web Services; Wireless Information Networks and Systems
Abstract:
As the Web grows, it becomes increasingly important to parallelize the crawling process so that pages can be downloaded in a reasonable amount of time. This paper presents the design and implementation of an effective parallel web crawler. We first present various design choices and strategies for a parallel web crawler, then describe our crawler’s architecture and implementation techniques. In particular, we investigate the URL distributor used for URL balancing and the scalability of our crawler.