Finally, we compared our crawler’s accuracy in identifying geographically
relevant web pages to the accuracy of a classification-aware crawler. For
this experiment, we built a Bayesian classifier and used it to score every
URL in the crawler’s seed list with respect to its corresponding geographic
category in the Dmoz directory. To score geographically relevant URLs, we
relied on the semantic relations, as given by WordNet, between the Dmoz
category names and the keywords extracted from the anchor text of the
respective URLs. We then used this set of scored URLs as the classifier’s
training data. Having trained the classifier, we integrated it into a
crawling module, which we ran against the seed URLs of our previous
experiment for one week. At the end of the crawl, we computed the accuracy
of the classification-aware crawler and compared it to the accuracy of our
geo-focused crawler. Table 3 reports the comparison results.
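The classifier-based baseline described above can be illustrated with a minimal sketch of a multinomial naive Bayes classifier over anchor-text keywords. This is not the implementation used in the experiment: the class name `NaiveBayes`, the Laplace smoothing, the category labels, and the tiny training set are all assumptions made for illustration, and the WordNet-based scoring step is replaced here by hand-labeled examples.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over anchor-text keywords (Laplace smoothing)."""

    def __init__(self):
        self.class_counts = Counter()            # training examples per category
        self.word_counts = defaultdict(Counter)  # keyword counts per category
        self.vocab = set()                       # all keywords seen in training

    def train(self, examples):
        # examples: iterable of (keywords, category) pairs
        for keywords, category in examples:
            self.class_counts[category] += 1
            self.word_counts[category].update(keywords)
            self.vocab.update(keywords)

    def log_scores(self, keywords):
        # Log prior plus sum of smoothed log likelihoods, per category.
        total = sum(self.class_counts.values())
        scores = {}
        for category, n in self.class_counts.items():
            denom = sum(self.word_counts[category].values()) + len(self.vocab)
            score = math.log(n / total)
            for word in keywords:
                score += math.log((self.word_counts[category][word] + 1) / denom)
            scores[category] = score
        return scores

    def classify(self, keywords):
        scores = self.log_scores(keywords)
        return max(scores, key=scores.get)


# Hypothetical training data: anchor-text keywords paired with a
# Dmoz-style geographic category (labels are illustrative only).
TRAINING = [
    (["athens", "hotels", "acropolis"], "Regional/Europe/Greece"),
    (["patras", "university", "port"], "Regional/Europe/Greece"),
    (["paris", "louvre", "hotels"], "Regional/Europe/France"),
    (["lyon", "restaurants", "rhone"], "Regional/Europe/France"),
]

clf = NaiveBayes()
clf.train(TRAINING)
predicted = clf.classify(["acropolis", "museum", "athens"])
```

In a crawling module, a prediction of this kind could be used to decide whether a URL is worth following. How the trained classifier was actually wired into the crawler is not detailed in this section.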
Table 3: Comparison results.
Crawler                          Accuracy
Geo-focused crawling             89.28%
Classification-aware crawling    71.25%
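To make the comparison in Table 3 concrete, the sketch below computes a crawl accuracy and the gap between the two reported figures. It assumes that accuracy is the share of crawled pages judged geographically relevant; this section does not restate the definition, so that reading is an assumption. The function name `crawl_accuracy` and the sample judgments are hypothetical.

```python
def crawl_accuracy(judgments):
    """Percentage of crawled pages judged geographically relevant."""
    if not judgments:
        raise ValueError("no crawled pages to evaluate")
    return 100.0 * sum(judgments) / len(judgments)


# Hypothetical relevance judgments for four crawled pages (True = relevant).
sample_accuracy = crawl_accuracy([True, True, True, False])

# Figures reported in Table 3.
geo_focused = 89.28
classification_aware = 71.25

# Absolute gap in percentage points and relative improvement over the baseline.
gap_points = round(geo_focused - classification_aware, 2)
relative_gain = round(
    100.0 * (geo_focused - classification_aware) / classification_aware, 1
)
```

With the Table 3 figures, the gap is 18.03 percentage points, a relative improvement of about 25.3% over the classifier-based baseline.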
The results indicate that our geo-focused crawler outperforms the
classifier-based crawler by roughly 18 percentage points (89.28% vs.
71.25%). This, coupled with the fact that our geo-focused crawler does not
need to undergo a training phase, suggests its potential for retrieving
geographically specific web data. Currently, we are running a large-scale
focused crawling experiment in order to evaluate the effectiveness of our
ranking algorithm in ordering URLs in the crawler’s frontier.
ACKNOWLEDGEMENTS
Lefteris Kozanidis is funded by the PENED 03ED_413 research project,
co-financed 25% by the Greek Ministry of Development-General Secretariat of
Research and Technology and 75% by the E.U.-European Social Fund.
FOCUSING WEB CRAWLS ON LOCATION-SPECIFIC CONTENT