(South-America), and O (other). The category of
other includes regions not used as seed URLs in this
study as well as indeterminate TLDs. A combination
of the symbols is marked as, for example,
A,C,O,SA(A). This refers to the case where crawling
is started with seeds in the region of A, and pages in
the Minor regions A, C, O, and SA are downloaded.
We investigate how strongly the Major region of
the Web is connected to the other geographical
regions of the Web. We are interested in both
directions: from the region of M to the regions of A,
C, O, SA, and from A, C, SA to M. If, as expected,
the former direction is weak and the latter one
strong, FC starting with seed URLs only from the
Major region may lose a substantial amount of
relevant information. First, it loses a considerable
number of pages inside the Minor regions. Second,
if FC is started from a Minor region, it is likely that
this would find a significant number of Major region
pages that are not within the crawling scope of FC
starting from the Major region. This is because,
rather than being a true web, the Web is a community
of communities that are isolated or only loosely
connected to each other (Toyoda and Kitsuregawa,
2001). It is therefore likely that FC starting from two
remote areas finds pages from different
communities.
To explore how strongly the Major region is
connected to the Minor regions we calculated, first,
the proportion of pages downloaded from a target
region, $T_j(S_i)$, among all downloaded pages,
$T_{all}(S_i)$: $T_j(S_i) / T_{all}(S_i)$. Naturally, in most cases the seed
URLs point to the target regions indirectly through
URLs extracted from the pages downloaded during
crawling. This measure is called seed-to-target (ST)
rate. It was calculated for the following test cases:
• Major → Major; includes the case M(M)
• Major → Minor; includes the case
A,C,O,SA(M)
• Minor → Major; includes the cases M(A),
M(C), M(SA)
• Minor → Minor; includes the cases
A,C,O,SA(A), A,C,O,SA(C), A,C,O,SA(SA)
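The ST rate defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `st_rates`, the assumption that every downloaded page already carries a region label, and the toy crawl are all hypothetical.

```python
from collections import Counter


def st_rates(target_regions):
    """Return {region: T_j(S_i) / T_all(S_i)} for one seed region S_i.

    target_regions holds one region label per downloaded page,
    e.g. "M", "A", "C", "O" or "SA" (labels assumed to be precomputed).
    """
    counts = Counter(target_regions)
    total = sum(counts.values())  # T_all(S_i)
    if total == 0:
        return {}
    return {region: n / total for region, n in counts.items()}


# Made-up toy crawl started from seeds in region A (not data from the study).
pages_from_a_seeds = ["M", "M", "M", "A", "A", "C", "O", "M"]
rates = st_rates(pages_from_a_seeds)

# Minor -> Major corresponds to the case M(A);
# Minor -> Minor corresponds to the case A,C,O,SA(A).
minor_to_major = rates.get("M", 0.0)
minor_to_minor = sum(rates.get(r, 0.0) for r in ("A", "C", "O", "SA"))
```

Because every downloaded page falls into exactly one region, the Minor → Major and Minor → Minor rates for a given seed region sum to one under this sketch.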
Second, we calculated the overlap rate for each of
the three Minor seed regions, i.e., the percentage of
identical URLs downloaded for the seed regions Major
and Minor. Generally, high overlap indicates that two
focused crawling processes with different starting
points (Major and Minor) operate mainly in the
same Web communities while low overlap shows
that they operate mainly in different communities.
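The overlap rate can be sketched as follows. This is an illustration under stated assumptions: the function name `overlap_rate` and the example URLs are hypothetical, and the denominator is assumed to be the combined size of the two result lists after duplicates within each list are removed. The text does not state exactly how the total is counted.

```python
def overlap_rate(major_urls, minor_urls):
    """Return (shared, total, percentage) for two crawl result lists.

    Assumption: the total is the sum of the two list sizes after
    removing duplicate URLs within each list.
    """
    major = set(major_urls)  # remove duplicates inside the Major list
    minor = set(minor_urls)  # remove duplicates inside the Minor list
    shared = major & minor   # URLs downloaded by both crawls
    total = len(major) + len(minor)
    if total == 0:
        return 0, 0, 0.0
    return len(shared), total, 100.0 * len(shared) / total


# Made-up URLs for illustration only.
major_list = [
    "http://a.example/1",
    "http://a.example/2",
    "http://a.example/2",  # duplicate inside one list, removed first
    "http://b.example/1",
]
minor_list = ["http://b.example/1", "http://c.example/1"]

shared, total, pct = overlap_rate(major_list, minor_list)
```

The same ratio applied to the figures reported for FC / Major and FC / Australia (267 identical URLs among 21575 pages) gives about 1.2%.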
Both ST and overlap rates were measured at two
relevance probability values assigned by Terrier to
the downloaded pages. The thresholds were TR1 >
0.0 and TR2 > 5.0. The higher TR is, the more
relevant pages there are in a page set.
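Applying the two thresholds amounts to filtering the downloaded pages by their relevance value before computing the ST and overlap rates. The sketch below assumes the values are already available as (URL, score) pairs. The function name and the scores are hypothetical and are not Terrier output.

```python
def filter_by_threshold(scored_pages, threshold):
    """Keep URLs whose relevance value is strictly greater than threshold."""
    return [url for url, score in scored_pages if score > threshold]


TR1, TR2 = 0.0, 5.0  # thresholds named in the text

# Made-up (URL, relevance value) pairs for illustration only.
scored = [
    ("http://a.example/1", 7.2),
    ("http://a.example/2", 0.4),
    ("http://a.example/3", 0.0),
]

pages_tr1 = filter_by_threshold(scored, TR1)
pages_tr2 = filter_by_threshold(scored, TR2)
```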
4 FINDINGS
The seed-to-target (ST) rates are presented in Table
1. As expected, the highest ST rates were obtained
for the Major → Major case. The figures are very high:
94.7% or more. This means that the majority of
North-American and European pages dealing with
genomics and genetics are connected to other North-
American and European pages. It was also expected
that the direction Major → Minor is very weak:
ST ranges from 2.9% to 5.3%. The opposite
direction, Minor → Major, is much stronger, with ST
ranging from 40.1% to 85.1%. For Minor → Minor,
ST ranges from 14.9% to 59.9%. We did not
consider the Minor → Minor cases where S is the
same as T. However, it is obvious that in most cases
Minor region target pages are of the same type as
Minor region seeds. For example, Australian seeds
find Australian pages rather than Chinese and South-
American pages. Table 1 also shows that in the case
of Minor region seeds, the seeds for general topics
point more often to the Minor region than the seeds
for specific topics.
The results of the overlap calculations are
reported in Table 2. In each case the first column
shows the absolute numbers of downloaded pages
for a region pair (identical URLs in single result lists
were first removed), and the second column shows
the overlap percentages. For example, FC / Major
and FC / Australia with TR=0.0 gave 21575 pages in
total. Of these pages 1.2% (N=267) shared the same
URL. In all cases the overlap rates are very low,
1.4% or less. Overall, the overlap results show that
the crawling results of two FC processes, one started
from the Major region and the other from a Minor
region, overlap only to a small extent.
5 CONCLUSIONS
The results revealed a biased link structure in that
North-American and European seeds point primarily
to other North-American and European pages
whereas Minor region seeds point both to the Major
and Minor regions. The results also showed that
pages downloaded using Major and Minor region
seeds overlap only to a small extent. Overall, the
results suggest that the effectiveness of FC can be
improved considerably if crawling is started from
different geographical regions. The domains of
genomics and genetics are typical scientific
EVALUATING GLOBAL LINK STRUCTURE OF THE WEB FOR FOCUSED CRAWLING IN THE GENOMICS AND
GENETICS DOMAINS