EVALUATING GLOBAL LINK STRUCTURE OF THE WEB FOR
FOCUSED CRAWLING IN THE GENOMICS AND GENETICS
DOMAINS
Ari Pirkola and Tuomas Talvensaari
Department of Information Studies, University of Tampere, Finland
Keywords: Focused web crawling, Genomics, Genetics.
Abstract: A focused crawler is a program that fetches Web pages that are relevant to a pre-defined domain. In this
paper we consider focused crawling in the domains of genomics and genetics. Crawling is often started with
seed URLs that point to the pages of central universities, research institutions, and other organizations in
North-America and Europe. We investigate how strongly this central region of the Web is
connected to other large geographical regions of the Web: Australia (top level domain .au), China (.cn), and
five South-American countries (.ar, .br, .cl, .mx, and .uy). We consider what implications the observed
global link structure has for the selection of seed URLs for focused crawling. The results showed that the
proportion of out-links from the North-American and European region to the other regions is low whereas
pages in the other regions often point to the central region. We also found that two focused crawling
processes, one started from the central region and the other from another large region, overlap only to a
small extent. Overall, the results suggest that the effectiveness of focused crawling can be improved
considerably if crawling is started with a geographically heterogeneous seed URL set.
1 INTRODUCTION
Web crawling refers to the process of gathering data
from the World Wide Web. Focused crawlers are
programs that selectively download Web pages,
restricting the scope of crawling to a pre-defined
domain or topic (Bergmark et al., 2002; Castillo,
2004; Chakrabarti et al., 1999; Talvensaari et al.,
2008; Tang et al., 2005; Zhuang et al., 2005).
Depending on the purpose of focused crawling (FC),
different methods are applied to process the
downloaded pages, e.g. they can be indexed for a
domain specific search engine or a digital library.
Focused crawlers can even be used as personal
search agents. The benefits of focused crawling are
that it can find a large proportion of the relevant
pages in a particular domain and that it can keep up
well with changes in the Web.
Crawling starts with a set of seed URLs. The
crawler connects to servers and downloads pages
from the servers. Crawling starting from a given
URL continues until it comes to a dead end or until
some restriction defined in the crawling policy is
met. URLs are extracted from the pages and are
added to the URL queue which determines the order
in which new pages are downloaded. A focused
crawler differs from a general crawler in two main
respects. First, it judges whether the visited pages and
the pages pointed to by the URLs are relevant to the
pre-defined domain. Domain identification is based,
for example, on domain vocabularies or topical
hierarchies. Second, focused crawlers reorder the
URL queue based on the probabilities that the
downloaded pages deal with the defined domain or
topic. The pages assessed to be very relevant are
downloaded first.
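The queue reordering described above can be sketched as a best-first crawl over a priority queue. This is a minimal illustration, not the implementation of any particular crawler: the functions `fetch_links` and `relevance` are hypothetical stand-ins for page download plus link extraction and for the domain classifier, and letting out-links inherit the score of the citing page is one simple assumption among several possible scoring policies.

```python
import heapq
import itertools


def focused_crawl(seeds, fetch_links, relevance, max_pages=20000):
    """Best-first crawl: always expand the queued URL with the highest score.

    fetch_links(url) -> list of out-link URLs (hypothetical stand-in for
    downloading and parsing a page).
    relevance(url) -> float in [0, 1], the estimated probability that the
    page deals with the target domain (hypothetical stand-in for a classifier).
    """
    counter = itertools.count()  # tie-breaker keeps insertion order stable
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, next(counter), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    downloaded = []

    while frontier and len(downloaded) < max_pages:
        _, _, url = heapq.heappop(frontier)
        downloaded.append(url)
        score = relevance(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                # Assumption: out-links inherit the score of the citing page.
                heapq.heappush(frontier, (-score, next(counter), link))
    return downloaded
```

With a toy link graph, links found on a highly scored page are visited before links found on a poorly scored one, which is the behaviour that distinguishes a focused crawler from a breadth-first general crawler.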
In a current project we are examining the
connections between different geographical regions
of the Web and the effects of geographically
categorized seed URL sets on the effectiveness of
FC. So far, we have completed one set of
experiments, the results of which are reported in this
position paper.
In FC, a common practice is to retrieve the seed
URLs by a Web search engine. The returned URLs
typically point to pages of central information
providers in the field, such as universities, journals,
and research institutions. These are mainly North-
American and European Web pages. We investigate
in this paper how strongly this central region is
connected to other large geographical regions of the
Web, and what implications the observed global link
structure has for the selection of seed URLs for FC.
Pirkola A. and Talvensaari T. (2009).
EVALUATING GLOBAL LINK STRUCTURE OF THE WEB FOR FOCUSED CRAWLING IN THE GENOMICS AND GENETICS DOMAINS.
In Proceedings of the International Conference on Health Informatics, pages 499-502
DOI: 10.5220/0001777004990502
Copyright © SciTePress
2 METHODS AND DATA
2.1 Web Regions
We refer to Web pages with generic and sponsored top
level domains (gTLDs and sTLDs, see
http://en.wikipedia.org/wiki/Generic_top-level_domain),
e.g. .com, .edu, .gov, and .org, as well as
North-American and European country code top
level domains (ccTLDs), e.g. .ca, .de, .es, .fr, .it, .pt,
and .uk, collectively as the Major region of the Web.
The Major region is contrasted to Australian
(ccTLD: .au), Chinese (ccTLD: .cn), and South-
American (five ccTLDs: .ar, .br, .cl, .mx, and .uy)
regions. These are here called Minor regions of the
Web. It should be noted that a small portion of TLDs
defined here as Major region TLDs are registered
outside North-America and Europe. Therefore,
Major region does not exactly correspond to North-
America and Europe. It should also be noted that the
terms major and minor refer to the relative sizes of
these regions of the Web. For simplicity, Mexico
(.mx) is here called a South-American country.
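The region definitions above amount to a lookup on the top level domain of a URL. The sketch below is illustrative only: the function name `region_of` is hypothetical, the Major set contains just the TLDs named as examples in this section (the complete list is not given), and the one-letter region labels are the ones the paper introduces in Section 3.

```python
from urllib.parse import urlparse

# Only the TLDs named as examples in Section 2.1; the full Major list is
# not given in the paper, so this set is an incomplete assumption.
MAJOR_TLDS = {"com", "edu", "gov", "org", "ca", "de", "es", "fr", "it", "pt", "uk"}
MINOR_REGIONS = {
    "au": "A",                      # Australia
    "cn": "C",                      # China
    "ar": "SA", "br": "SA", "cl": "SA", "mx": "SA", "uy": "SA",
}


def region_of(url):
    """Map a URL to M, A, C, SA, or O (other) using its top level domain."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in MINOR_REGIONS:
        return MINOR_REGIONS[tld]
    if tld in MAJOR_TLDS:
        return "M"
    return "O"
```

For example, `region_of("http://www.google.cl")` yields `"SA"`, while a URL without a recognizable host falls into the category other.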
2.2 Test Topics, Seed URLs, and
Crawling
We experimented with two kinds of test topics in the
domains of genomics and genetics: specific topics
(an example: regulatory targets of nkx genes), and
general topics (an example: hereditary diseases). As
specific topics we used five TREC
(http://trec.nist.gov) Genomics Track 2004 topics
(the topic numbers 1, 10, 20, 30, and 40). For the
Genomics Track, see Hersh et al. (2005). There were
also five general topics. They were created by one of
the authors who has expertise in health informatics.
For seed URL retrieval, queries containing
synonyms and morphological variants of the topic
words were constructed based on the topics.
Topic specificity can be determined with statistical
measures such as term and document frequencies or
the total number of hits for a query in the Medline
database (http://www.ncbi.nlm.nih.gov/pubmed/).
We used the latter measure.
Four crawls were performed for each topic with
the seed URLs from the Major region, Australia,
China, and five South-American countries
(Argentina, Brazil, Chile, Mexico, and Uruguay).
Each seed URL set contained 50 URLs. The seeds of
the Major region were retrieved by means of the
basic Google (http://www.google.com) whereas the
seeds of the other regions were retrieved by
Google’s local versions (e.g. http://www.google.cl).
The South-American URL sets were formed by
taking the top ten URLs from each of the five local
Google versions.
Most of the Chinese seed URL pages were bilingual
Chinese-English pages.
The majority of the Major region seed URLs
were of the type .com, .edu, .gov, .org, .de, and .uk.
The original Major region seed URL sets contained
(only) a few Australian, Chinese, and South-American
URLs which were removed from the final
sets to allow us to investigate the defined research
questions (presented in Section 3).
The Nalanda iVia Focused Crawler
(http://ivia.ucr.edu) was used in the experiments. It
is based on the work of Chakrabarti et al. (1999). At
the start of a crawl, the Nalanda crawler initializes a
priority queue of URLs with a set of seed URLs.
One by one, each page u pointed to by a URL in
the queue is fetched. The probability Pr(t|u), i.e.,
the probability of u being about the wanted topic t, is
calculated with a text classification algorithm. The
probabilities Pr(t|u) were estimated with a logistic
regression classifier (Zhang et al., 2003) that, for
every topic, was trained with positive and negative
instances of the topic in question.
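For orientation, a textbook logistic regression classifier models this probability as shown below. The feature vector $\mathbf{x}_u$ of page u, the topic-specific weight vector $\mathbf{w}_t$, and the bias $b_t$ are illustrative symbols that do not appear in the paper, and the modified logistic regression of Zhang et al. (2003) may differ in detail from this standard form.

```latex
\Pr(t \mid u) = \frac{1}{1 + \exp\bigl(-(\mathbf{w}_t^{\top}\mathbf{x}_u + b_t)\bigr)}
```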
There were 40 seed URL sets in total: 10 topics
and for each topic 4 URL sets representing different
regions. Accordingly, we performed 40 crawls. In
each case crawling was stopped after 20 000 pages
had been downloaded. Thus, each result list
contained 20 000 pages.
For evaluation, the fetched pages were indexed
with the Terrier search engine
(http://ir.dcs.gla.ac.uk/terrier/) that ranked the pages
based on their probability of relevance to the entered
query. The same queries that were used in searching
for seed URLs were used to represent the topics;
however, they were modified to fit Terrier’s query
language. Of course, the probabilities calculated by
the classifier could have been used to rank the pages,
but Terrier was used to provide stronger evidence.
3 RESEARCH QUESTIONS AND
EVALUATION
We denote by T(S) the situation where pages in the
region of T(arget) are downloaded (or are
considered in calculations) in a crawling process that
starts with seed URLs in the region of S(eed). The
regions considered in this study are denoted as
follows: M (Major), A (Australia), C (China), SA
(South-America), and O (other). The category of
other includes regions not used as seed URLs in this
study as well as indeterminate TLDs. A combination
of the symbols is marked as, for example,
A,C,O,SA(A). This refers to the case where crawling
is started with seeds in the region of A, and pages in
the Minor regions A, C, O, and SA are downloaded.
We investigate how strongly the Major region of
the Web is connected to the other geographical
regions of the Web. We are interested in both
directions: from the region of M to the regions of A,
C, O, SA, and from A, C, SA to M. If, as expected,
the former direction is weak and the latter one
strong, FC starting with seed URLs only from the
Major region may lose a substantial amount of
relevant information. First, it loses a considerable
number of pages inside the Minor regions. Second,
if FC is started from a Minor region, it is likely that
this would find a significant number of Major region
pages that are not within the crawling scope of FC
starting from the Major region. This is because,
rather than being a true web, the Web is a community
of communities that are isolated or only loosely
connected to each other (Toyoda and Kitsuregawa,
2001). It is therefore likely that FC starting from two
remote areas finds pages from different
communities.
To explore how strongly the Major region is
connected to the Minor regions we calculated, first,
the proportion of pages downloaded from a target
region $T_j(S_i)$ among all downloaded pages
$T_{\mathrm{all}}(S_i)$: $T_j(S_i) / T_{\mathrm{all}}(S_i)$.
Naturally, in most cases the seed
URLs point to the target regions indirectly through
URLs extracted from the pages downloaded during
crawling. This measure is called seed-to-target (ST)
rate. It was calculated for the following test cases:
Major → Major; includes the case M(M)
Major → Minor; includes the case A,C,O,SA(M)
Minor → Major; includes the cases M(A), M(C), M(SA)
Minor → Minor; includes the cases A,C,O,SA(A), A,C,O,SA(C), A,C,O,SA(SA)
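The ST rate defined above is a simple proportion and can be sketched as follows. The function names `st_rates` and `minor_rate` are hypothetical, and the input is assumed to be one region label per page downloaded in a crawl started from a single seed region.

```python
from collections import Counter


def st_rates(page_regions):
    """Return the share of downloaded pages per target region.

    page_regions: iterable of region labels (e.g. "M", "A", "C", "SA", "O"),
    one per page downloaded in a crawl started from a single seed region.
    """
    counts = Counter(page_regions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: n / total for region, n in counts.items()}


def minor_rate(rates):
    """Combined rate for the Minor targets A, C, O, and SA."""
    return sum(rates.get(r, 0.0) for r in ("A", "C", "O", "SA"))
```

As an invented illustration (not data from the experiments), a Major-seeded crawl that returns 19 pages labelled "M" and one labelled "A" has a Major → Major rate of 0.95 and a Major → Minor rate of 0.05.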
Second, we calculated the overlap rate for each of the
three Minor seed regions, i.e., the percentage of identical
URLs downloaded for the seed regions Major and
Minor. Generally, high overlap indicates that two
focused crawling processes with different starting
points (Major and Minor) operate mainly in the
same Web communities while low overlap shows
that they operate mainly in different communities.
Both ST and overlap rates were measured at two
threshold levels (TR) of the relevance probability
values assigned by Terrier to the downloaded pages.
The thresholds were TR1 > 0.0 and TR2 > 5.0. The
higher the threshold, the more relevant the pages in
a page set are.
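The overlap rate can be sketched as a set intersection over the two result lists. The function and parameter names are hypothetical, and the choice of denominator is an assumption: the sketch divides by the combined number of pages in the two lists after duplicates within each list are removed, whereas the size of the union would be another plausible reading of the definition.

```python
def overlap_rate(urls_major, urls_minor):
    """Percentage of URLs shared by two crawl result lists.

    Duplicates inside each list are dropped first. Assumption: the
    denominator is the combined size of the two deduplicated lists.
    """
    major, minor = set(urls_major), set(urls_minor)
    total = len(major) + len(minor)
    if total == 0:
        return 0.0
    shared = len(major & minor)
    return 100.0 * shared / total
```

To apply a relevance threshold, the two URL lists would be filtered to the pages whose Terrier value exceeds the threshold before calling the function.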
4 FINDINGS
The seed-to-target (ST) rates are presented in Table
1. As expected, the highest ST rates were obtained
for Major → Major region. The figures are very high:
94.7% or more. This means that the majority of
North-American and European pages dealing with
genomics and genetics are connected to other North-
American and European pages. It was also expected
that the direction of Major → Minor is very weak:
ST ranges from 2.9% to 5.3%. The opposite
direction, Minor → Major is much stronger, with ST
ranging from 40.1% to 85.1%. For Minor → Minor,
ST is in the range (14.9%, 59.9%). We did not
consider the Minor → Minor cases where S is the
same as T. However, it is obvious that in most cases
Minor region target pages are of the same type as
Minor region seeds. For example, Australian seeds
find Australian pages rather than Chinese and South-
American pages. Table 1 also shows that in the case
of Minor region seeds, the seeds for general topics
point more often to the Minor region than the seeds
for specific topics.
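Assuming the region categories M, A, C, SA, and O are disjoint and cover every downloaded page, the two ST rates for a given seed region are complementary, which is why the paired columns of Table 1 sum to 100%. For the row of general topics at TR=0.0:

```latex
\mathrm{ST}_{\text{Major} \to \text{Major}} + \mathrm{ST}_{\text{Major} \to \text{Minor}}
  = 94.7\% + 5.3\% = 100\%,
\qquad
\mathrm{ST}_{\text{Minor} \to \text{Major}} + \mathrm{ST}_{\text{Minor} \to \text{Minor}}
  = 76.8\% + 23.2\% = 100\%
```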
The results of the overlap calculations are
reported in Table 2. In each case the first column
shows the absolute numbers of downloaded pages
for a region pair (identical URLs in single result lists
were first removed), and the second column shows
the overlap percentages. For example, FC / Major
and FC / Australia with TR=0.0 gave 21575 pages in
total. Of these pages 1.2% (N=267) shared the same
URL. In all cases the overlap rates are very low,
1.4% or less. Overall, the overlap results show that
the crawling results of two FC processes, one started
from the Major region and the other from a Minor
region, overlap only to a small extent.
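The overlap percentage in the example above follows directly from the two reported counts:

```latex
\frac{267}{21575} \approx 0.0124 \approx 1.2\%
```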
5 CONCLUSIONS
The results revealed a biased link structure in that
North-American and European seeds point primarily
to other North-American and European pages
whereas Minor region seeds point both to the Major
and Minor regions. The results also showed that
pages downloaded using Major and Minor region
seeds overlap only to a small extent. Overall, the
results suggest that the effectiveness of FC can be
improved considerably if crawling is started from
different geographical regions. The domains of
genomics and genetics are typical scientific
domains. We therefore assume that the obtained
results are generalizable to other scientific
domains.

Table 1: Seed-to-Target rates (%).

Topic type and        Major →   Major →   Minor →   Minor →
relevance threshold   Major     Minor     Major     Minor
General, TR=0.0       94.7      5.3       76.8      23.2
General, TR=5.0       96.0      4.0       40.1      59.9
Specific, TR=0.0      97.1      2.9       85.1      14.9
Specific, TR=5.0      96.6      3.4       65.7      34.3

Table 2: Overlap rates (%) for the Major and Minor regions.

Topic type and        Major+Australia      Major+China          Major+South-America
relevance threshold   N        Overlap%    N        Overlap%    N        Overlap%
General, TR=0.0       21575    1.2         23342    1.4         19326    0.3
General, TR=5.0       3430     0.7         3349     0.9         2857     0.2
Specific, TR=0.0      18845    0.2         21227    0.6         18926    0.0
Specific, TR=5.0      2349     0.0         2837     0.4         1914     0.0
REFERENCES
Bergmark, D., Lagoze, C. and Sbityakov, A., 2002.
Focused crawls, tunneling, and digital libraries.
Proceedings of the 6th European Conference on
Research and Advanced Technology for Digital
Libraries, Rome, Italy, September 16-18, pp. 91-106.
Castillo, C., 2004. Effective Web crawling. Ph.D.
Thesis. University of Chile, Department of Computer
Science, 180 pages.
http://www.chato.cl/534/article-63160.html
Chakrabarti, S., van den Berg, M. and Dom, B., 1999.
Focused crawling: a new approach to topic-specific
Web resource discovery. Proceedings of the Eighth
International World Wide Web Conference, Toronto,
May 11 - 14.
Hersh, W. R., Bhuptiraju, R. T., Ross, L., Johnson, P.,
Cohen, A. M. and Kraemer, D. F., 2005. TREC 2004
genomics track overview. Proceedings of the
Thirteenth TExt REtrieval conference (TREC-13)
(Gaithersburg, MD).
http://trec.nist.gov/pubs/trec13/t13_proceedings.html
Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M. and
Laurikkala, J., 2008. Focused Web crawling in the
acquisition of comparable corpora. Information
Retrieval, 11(5), 427-445.
Tang, T., Hawking, D., Craswell, N. and Griffiths, K.,
2005. Focused crawling for both topical relevance
and quality of medical information. Proceedings of
the 14th ACM International Conference on
Information and Knowledge Management CIKM '05.
Toyoda, M. and Kitsuregawa, M., 2001. Creating a Web
community chart for navigating related communities.
Proceedings of the 12th ACM Conference on
Hypertext and Hypermedia, Århus, Denmark,
August 14 - 18.
Zhang, J., Jin, R., Yang, Y. and Hauptmann, A., 2003.
Modified logistic regression: An approximation to
svm and its applications in large-scale text
categorization. Proceedings of the 20th International
Conference on Machine Learning (ICML),
Washington, DC.
Zhuang, Z., Wagle, R. and Giles, C.L., 2005. What's
there and what's not?: focused crawling for missing
documents in digital libraries. Proceedings of the 5th
ACM/IEEE-CS Joint Conference on Digital
Libraries, Denver, CO, pp. 301-310.
HEALTHINF 2009 - International Conference on Health Informatics