Finally, we compared our crawler’s accuracy in
identifying geographically relevant web pages to the
accuracy of a classification-aware crawler. For this
experiment, we built a Bayesian classifier and we
used it to score every URL in the crawler’s seed list
with respect to its corresponding geographic category
in the Dmoz directory. To score geographically
relevant URLs, we relied on the WordNet semantic
relations between the Dmoz category names and
the keywords extracted from the anchor text of the
respective URLs. We then used the
above set of scored URLs as the classifier’s training
data. Having trained the classifier, we integrated it into
a crawling module, which we ran against the seed
URLs of our previous experiment for one week. At
the end of crawling, we computed the accuracy of
the classification-aware crawler and we compared it
to the accuracy of our geo-focused crawler. In Table
3, we report the comparison results.
Table 3: Comparison results.
Crawler                          Accuracy
Geo-focused crawler              89.28%
Classification-aware crawler     71.25%
The results indicate that our geo-focused crawler outperforms
the classification-aware crawler by roughly 18 percentage
points (89.28% versus 71.25%). This, coupled with
the fact that our geo-focused crawler does not need
to undergo a training phase, suggests the potential of
our geo-focused crawler for retrieving geographically
specific web data. Currently, we are
running a large-scale focused crawling experiment
in order to evaluate the effectiveness of our ranking
algorithm in ordering URLs in the crawler’s frontier.
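
To make the classification-aware baseline described above more concrete, the following is a minimal sketch of a Bayesian classifier over anchor-text keywords, together with a simple accuracy measure. It is not the implementation used in our experiments: the class `NaiveBayes`, the function `accuracy`, the add-one smoothing, and the toy training pairs are all assumptions made for illustration, the WordNet-based scoring of the training URLs is omitted, and accuracy is assumed here to be the fraction of items assigned the correct category.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def __init__(self):
        self.class_counts = Counter()            # training examples per category
        self.word_counts = defaultdict(Counter)  # keyword counts per category
        self.vocab = set()                       # all keywords seen in training

    def train(self, examples):
        # examples: iterable of (keywords, category) pairs
        for keywords, category in examples:
            self.class_counts[category] += 1
            self.word_counts[category].update(keywords)
            self.vocab.update(keywords)

    def log_scores(self, keywords):
        # log P(category) + sum over keywords of log P(keyword | category)
        total = sum(self.class_counts.values())
        scores = {}
        for category, n_examples in self.class_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_examples / total)
            for word in keywords:
                score += math.log((counts[word] + 1) / denom)
            scores[category] = score
        return scores

    def predict(self, keywords):
        scores = self.log_scores(keywords)
        return max(scores, key=scores.get)


def accuracy(predicted, actual):
    """Fraction of items whose predicted category equals the reference one."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical training data: anchor-text keywords labelled with a
# geographic category (stand-ins for the scored seed URLs).
TRAINING = [
    (["athens", "hotels", "acropolis"], "Greece"),
    (["patras", "port", "ferry"], "Greece"),
    (["rome", "colosseum", "hotels"], "Italy"),
    (["milan", "fashion", "duomo"], "Italy"),
]

clf = NaiveBayes()
clf.train(TRAINING)
```

In a crawler, `clf.predict` would be applied to the keywords of each candidate URL's anchor text, and `accuracy` would be computed over the downloaded pages against a reference labelling; both usages are hypothetical here.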
Lefteris Kozanidis is funded by the PENED 03ED_413
research project, co-financed 25% by the Greek Ministry
of Development-General Secretariat of Research and
Technology and 75% by the E.U.-European Social Fund.
