Hub and Authority pages. Since we are interested in
having Authority pages in our crawl, we would need
to start crawling from Hub pages. Hubs are durable
pages, so we can rely upon them for crawling.
The main idea in our method is to use HITS-
Ranking on the whole graph for extracting the most
important bipartite cores. We offer two bipartite core
extraction algorithms.
We have compared the results of crawls
starting from our extracted seeds set with crawls
starting from random nodes. Our experiments show
that the crawl starting from our seeds finds the most
suitable pages of the web much faster.
To the best of our knowledge, this is the first
such seeds extraction algorithm. The running time of
the proposed algorithm is O(n). Its low running time,
combined with its community-based properties, makes
this algorithm unique in comparison with previous
algorithms.
2 DISCOVERING SEEDS SET IN
LARGE WEB GRAPH
A crawler usually retrieves a limited number of
pages, and crawlers are expected to collect the "most
suitable" pages of the web rapidly. We define the
"most suitable" pages of the web as those pages with
high PageRank. In terms of the HITS algorithm, these
are called Authority pages. The difference is that
HITS finds the authority pages related to given
keywords, whereas PageRank measures the importance
of a page in the whole web. We know that good hubs
link to good authorities. If we can extract good hubs
from the different communities of a web graph, we
will be able to download good authorities with high
PageRank from those communities.
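For reference, the following is a minimal sketch of the HITS rank computation that the rest of the method builds on. It is the standard power iteration on the link matrix; the dense-matrix representation, the fixed iteration count, and the function name are our illustrative choices, not the paper's.

    import numpy as np

    def hits_ranks(adj, iterations=50):
        """Compute hub (h) and authority (a) ranks by HITS power
        iteration. adj[i, j] == 1 means page i links to page j."""
        n = adj.shape[0]
        h = np.ones(n)                      # hub scores
        a = np.ones(n)                      # authority scores
        for _ in range(iterations):
            a = adj.T @ h                   # good authorities are linked to by good hubs
            h = adj @ a                     # good hubs link to good authorities
            a /= np.linalg.norm(a) or 1.0   # normalize to keep scores bounded
            h /= np.linalg.norm(h) or 1.0
        return h, a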
2.1 Iterative HITS-Ranking & Pruning
We assume that we have a crawled web graph.
The goal is to extract a seeds set from this graph so
that a crawler can collect the most important pages
of the web in fewer iterations. To do this, we run the
HITS-Ranking algorithm on this graph. This is the
second step of the HITS algorithm; in the first step,
HITS searches for the keywords in an index-based
search engine. For our purpose, we ignore this step
and only run the ranking step on the whole graph. In
this way, bipartite cores with high Hub and Authority
ranks become visible in the graph. Then, we select the
most highly ranked bipartite core using one of the two
algorithms we propose: extracting seeds with fixed
size, and extracting seeds with fixed density. We
remove this sub-graph from the graph and repeat the
ranking, seed extraction, and sub-graph removal steps
until we have a large enough seeds set; a sketch of
this loop is given below.
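The following is a minimal sketch of this iterative rank-extract-prune loop, reusing hits_ranks from the earlier sketch. The removal-by-zeroing strategy and the extract_core callback are illustrative assumptions rather than the paper's exact pseudocode.

    def discover_seeds(adj, needed_seeds, extract_core):
        """Iterative HITS-Ranking & pruning, as described above.
        extract_core is one of the two extraction procedures (fixed
        size or fixed density) and returns a set of node indices."""
        seeds = set()
        while len(seeds) < needed_seeds:
            h, a = hits_ranks(adj)          # re-rank the remaining graph
            core = extract_core(adj, h, a)  # top-ranked bipartite core
            if not core:                    # nothing left to extract
                break
            seeds |= core
            idx = list(core)
            adj[idx, :] = 0                 # remove the core: drop its out-links...
            adj[:, idx] = 0                 # ...and its in-links from the graph
        return seeds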
Why do we run HITS-Ranking repeatedly? The
answer is that removing a bipartite core in each step
modifies the web-graph structure, so re-ranking
changes the hub and authority ranks of the remaining
bipartite cores. Removing the highest-ranked bipartite
core and re-ranking the web graph causes the bipartite
cores that appear next to come from different
communities. Thus, a crawler starting from these seeds
will be able to download pages from different
communities. Our experiments show that the extracted
bipartite cores keep a reasonable distance from each
other.
We expect to crawl the most suitable pages
because, in each iteration of the algorithm, we select
and extract high-ranked bipartite cores, that is, cores
with high hub or authority ranks. Such pages are
expected to link to pages with high PageRank. Our
experiments confirm this hypothesis.
2.2 Extracting Seeds with Fixed Size
The procedure in Figure 1 extracts one bipartite
sub-graph with the highest hub and authority ranks
and a predetermined size given as an input. The
procedure takes a directed graph G, BipartiteCoreSize,
NewMemberCount, and the h and a vectors.
BipartiteCoreSize specifies the desired size of the
bipartite core to be extracted. NewMemberCount
indicates how many hub or authority nodes should be
added to the hub or authority sets in each iteration of
the algorithm. The h and a vectors hold the hub and
authority ranks of the nodes in the input graph G.
In the initial steps, the algorithm sets HubSet to
empty and adds the node with the highest authority
rank to AuthoritySet. While the sum of the
AuthoritySet and HubSet sizes is less than
BipartiteCoreSize, it continues to find new hubs and
authorities according to NewMemberCount and adds
them to the related set. We use this procedure when
we want to extract a bipartite sub-graph of fixed size.
Figure 1 shows the details of this procedure, and a
sketch is given below. An interesting result we have
found in our experiments is that in the very first steps
all the hubs link to all the authorities, which forms a
complete bipartite sub-graph. This led us to suggest a
density-based extraction algorithm.
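The following is a minimal sketch of the fixed-size extraction just described, continuing the earlier sketches. The rule for choosing new members, the best-ranked unused neighbors of the current sets, is our reading of the text rather than the exact pseudocode of Figure 1.

    import numpy as np

    def extract_fixed_size_core(adj, h, a, core_size=20, new_member_count=2):
        """Grow a bipartite core of roughly core_size nodes around the
        top-ranked authority, adding up to new_member_count hubs and
        authorities per iteration (cf. Figure 1)."""
        n = adj.shape[0]
        hub_set = set()
        auth_set = {int(np.argmax(a))}          # seed with the top authority
        while len(hub_set) + len(auth_set) < core_size:
            before = len(hub_set) + len(auth_set)
            # best-ranked new hubs that link into the current authority set
            hubs = sorted((i for i in range(n) if i not in hub_set
                           and any(adj[i, j] for j in auth_set)),
                          key=lambda i: h[i], reverse=True)
            hub_set |= set(hubs[:new_member_count])
            # best-ranked new authorities linked from the current hub set
            auths = sorted((j for j in range(n) if j not in auth_set
                            and any(adj[i, j] for i in hub_set)),
                           key=lambda j: a[j], reverse=True)
            auth_set |= set(auths[:new_member_count])
            if len(hub_set) + len(auth_set) == before:
                break                           # core cannot grow any further
        return hub_set | auth_set

This procedure can be passed as the extract_core argument of the discover_seeds loop sketched in Section 2.1.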
2.3 Extracting Seeds with Fixed
Cover-Density
The procedure in Figure 2 extracts one bipartite sub-
graph with the highest hub and authority ranks in a way