web pages online grew exponentially, and thus a search engine to index these pages became indispensable. Search engines used web spiders to traverse the web, and these spiders consequently became popular. Since then, web spiders have been crawling the web on a regular basis.
A popular web crawler is UbiCrawler (Boldi et al., 2004). It is made up of several web agents that scan their own share of the web by autonomously coordinating their behaviour. Each agent performs its task by running several threads, each dedicated to scanning a single host using a breadth-first visit, so politeness is maintained because different threads visit different hosts at the same time.
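The per-host breadth-first visit can be sketched as follows. This is only an illustrative Python sketch of the idea, not UbiCrawler's code; fetch_links is a hypothetical callback that downloads a page and returns the links extracted from it.

```python
from collections import deque
from urllib.parse import urlsplit

def bfs_crawl_host(seed, fetch_links):
    """Breadth-first visit of a single host, as one crawler thread would do.

    `fetch_links(url)` is a hypothetical callback that downloads `url` and
    returns the links extracted from it.
    """
    host = urlsplit(seed).netloc
    queue, visited = deque([seed]), {seed}
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            # Stay on this host; other hosts belong to other threads.
            if urlsplit(link).netloc == host and link not in visited:
                visited.add(link)
                queue.append(link)
    return visited
```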
However, breadth-first navigation is a top-down approach, so the crawler misses new URLs that could be derived from already discovered ones. For example, given a URL such as www.example.com/pics/page1.html, URLs like www.example.com/pics/ and www.example.com/pics/page2.html may also exist. Discovering such URLs from the existing ones increases the coverage of the web crawler.
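A minimal sketch of this kind of URL mining is shown below. It illustrates the idea rather than Tarantula's actual heuristics; the ancestor-directory and sibling-numbering rules are assumptions, and real crawlers would validate candidates before enqueuing them.

```python
from urllib.parse import urlsplit, urlunsplit
import re

def candidate_urls(url):
    """Derive plausible ancestor and sibling URLs from a discovered URL."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    candidates = set()

    # Ancestor directories: /pics/page1.html -> /pics/ -> /
    for i in range(len(segments)):
        path = "/" + "/".join(segments[:i]) + ("/" if i else "")
        candidates.add(urlunsplit((parts.scheme, parts.netloc, path, "", "")))

    # Numbered siblings: page1.html -> page2.html (hypothetical heuristic)
    m = re.match(r"(.*?)(\d+)(\.\w+)?$", segments[-1]) if segments else None
    if m:
        stem, num, ext = m.group(1), int(m.group(2)), m.group(3) or ""
        sibling = "/".join(segments[:-1] + [f"{stem}{num + 1}{ext}"])
        candidates.add(urlunsplit((parts.scheme, parts.netloc, "/" + sibling, "", "")))

    candidates.discard(url)
    return candidates

print(candidate_urls("http://www.example.com/pics/page1.html"))
```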
Heritrix is the open-source web crawler developed by the Internet Archive. Heritrix provides a number of storage and scheduling strategies for crawling the seed list. Each of its crawler processes can be assigned up to 64 sites to crawl, and it is ensured that no site is assigned to more than one crawler. A crawler process reads a list of seed URLs for its assigned sites from disk into per-site queues, and then uses asynchronous I/O to fetch pages from these queues in parallel. After a page is downloaded, the crawler extracts its links; if a link refers to the site of the page it was found on, it is added to the appropriate site queue, otherwise it is logged to disk. Periodically, a batch process merges these logged “cross-site” URLs into the site-specific seed sets, filtering out duplicates in the process, and the cycle repeats until the URL list is exhausted.
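The queueing discipline described above can be sketched schematically as follows. The class and method names are illustrative and do not correspond to Heritrix's actual code, and an in-memory list stands in for the on-disk cross-site log.

```python
from collections import defaultdict
from urllib.parse import urlsplit

class SiteScheduler:
    """Schematic per-site frontier: same-site links go straight to the
    site's queue, cross-site links are logged for a later batch merge."""

    def __init__(self, seeds):
        self.queues = defaultdict(list)      # site -> pending URLs
        self.seen = set(seeds)               # crude duplicate filter
        self.cross_site_log = []             # stands in for the on-disk log
        for url in seeds:
            self.queues[urlsplit(url).netloc].append(url)

    def add_links(self, source_url, links):
        source_site = urlsplit(source_url).netloc
        for link in links:
            if link in self.seen:
                continue
            if urlsplit(link).netloc == source_site:
                self.seen.add(link)
                self.queues[source_site].append(link)
            else:
                self.cross_site_log.append(link)

    def merge_cross_site(self):
        """Batch step: fold logged cross-site URLs into their site queues,
        dropping duplicates along the way."""
        for link in self.cross_site_log:
            if link not in self.seen:
                self.seen.add(link)
                self.queues[urlsplit(link).netloc].append(link)
        self.cross_site_log.clear()
```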
KSpider (Koht-arsa and Sanguanpong, 2002) is a scalable, cluster-based web spider. It uses a URL compression scheme that stores the URLs in a balanced AVL tree. The compressed URLs are kept in memory rather than on disk, since storing the URLs in memory improves the performance of the crawler. Common prefixes among URLs are used to reduce storage: each common prefix is stored once and reused across many URLs. However, the AVL structure limits each node to two children, which increases the height of the tree even though it is balanced.
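As an illustration of prefix-based URL compression, the following front-coding sketch stores each sorted URL as the length of the prefix it shares with its predecessor plus the remaining suffix. This is a simplification for exposition only, not KSpider's AVL-tree scheme.

```python
def front_code(urls):
    """Front coding: (shared-prefix length, suffix) per URL in sorted order."""
    encoded, prev = [], ""
    for url in sorted(urls):
        shared = 0
        while shared < min(len(prev), len(url)) and prev[shared] == url[shared]:
            shared += 1
        encoded.append((shared, url[shared:]))
        prev = url
    return encoded

def front_decode(encoded):
    """Reverse of front_code: rebuild each URL from its predecessor."""
    urls, prev = [], ""
    for shared, suffix in encoded:
        url = prev[:shared] + suffix
        urls.append(url)
        prev = url
    return urls

urls = ["http://www.example.com/pics/",
        "http://www.example.com/pics/page1.html",
        "http://www.example.com/pics/page2.html"]
assert front_decode(front_code(urls)) == sorted(urls)
```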
Tarantula improves on KSpider by making more effective use of the common prefixes among URLs through a slightly modified version of compressed tries. These data structures are broad, allowing more than two children per node and thereby decreasing the height of the tree. Also, unlike KSpider, Tarantula stores all the URLs belonging to the same host in the same compressed trie. This is useful for restricting the depth to which a hostname is crawled and also provides an easy mechanism for ensuring the politeness of the crawler system. Applying general-purpose compression algorithms to URLs and then expanding them again costs considerable CPU time, so it is advantageous to compress the URLs based on common prefixes instead. Although the compression ratio is not as high, the speed of the crawler is greatly enhanced.
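A minimal compressed-trie (radix-tree) sketch of the per-host structure is given below. It only illustrates the edge-splitting idea behind such tries; Tarantula's actual implementation may differ.

```python
class RadixNode:
    """Node of a compressed (radix) trie: each edge carries a string,
    not a single character, and a node can have many children."""
    def __init__(self):
        self.children = {}        # edge label -> RadixNode
        self.is_url = False       # True if a stored URL ends here

def _common_prefix_len(a, b):
    i, n = 0, min(len(a), len(b))
    while i < n and a[i] == b[i]:
        i += 1
    return i

def insert(root, key):
    """Insert a URL path into the trie, splitting edges as needed."""
    node = root
    while key:
        # Find the outgoing edge (if any) sharing a prefix with the key.
        for label, child in list(node.children.items()):
            k = _common_prefix_len(label, key)
            if k == 0:
                continue
            if k < len(label):
                # Split the edge: label -> label[:k] followed by label[k:]
                mid = RadixNode()
                mid.children[label[k:]] = child
                del node.children[label]
                node.children[label[:k]] = mid
                child = mid
            node, key = child, key[k:]
            break
        else:
            # No shared edge: add a fresh one for the rest of the key.
            leaf = RadixNode()
            leaf.is_url = True
            node.children[key] = leaf
            return
    node.is_url = True

root = RadixNode()
for path in ["pics/page1.html", "pics/page2.html", "pics/"]:
    insert(root, path)
# The common prefix "pics/" is now stored once in this host's trie.
```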
3 MOTIVATION
One of the initial motivations for this work was to develop a crawling system able to cover a greater portion of the web. The crawling system should be able to process a large number of URLs from far and wide, thus trying to cover the entire breadth of the Internet. This prompted us to come up with unique crawling strategies which, when combined with the page ranking and refreshing crawling schemes, give excellent results.
We also wanted the design to use data structures that reduce the amount of I/O and CPU processing needed to pursue the newly extracted URLs for downloading the web pages.
By looking at some URLs, it is possible to detect newer URLs that might be valid but have not yet been discovered by the crawler, possibly because of broken web links. Therefore, mining the discovered URLs was yet another motivation behind the development of the Tarantula web crawler.