Authors:
Ari Pirkola and Tuomas Talvensaari
Affiliation:
University of Tampere, Finland
Keyword(s):
Focused crawling, Web crawling.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Data Engineering; Digital Libraries; Knowledge Management and Information Sharing; Knowledge-Based Systems; Ontologies and the Semantic Web; Searching and Browsing; Symbolic Systems; Web Information Systems and Technologies; Web Interfaces and Applications
Abstract:
Focused crawlers are programs that selectively download Web documents (pages), restricting the scope of crawling to a specific domain or topic. We investigate different focused crawling strategies, including the use of data fusion in focused crawling. Documents in the domains of genomics and genetics were fetched by the Nalanda iVia Focused Crawler using three crawling strategies. In the first one, a text classifier was trained to identify relevant documents. In the latter two strategies, the identification of relevant documents was based on query-document matching. In the experiments, the crawling results of the single strategies were combined to yield fused crawling results. The experiments showed, first, that different single strategies overlap only to a small extent, identifying mainly different relevant documents. Second, a query-based strategy where the words of the link context were weighted gave the best coverage (i.e., number of relevant documents) after 10 000 and 40 000 documents had been downloaded. The combination of the two query-based strategies was the best fused strategy, but it did not perform better than the best single strategy.
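As a rough illustration of the quantities mentioned in the abstract, the following Python sketch shows one way overlap between two crawls, coverage after a download budget, and a simple fusion of crawl results could be computed. The abstract does not specify how the authors fused results, so the interleaving scheme here is an assumption chosen for simplicity, and the function names (overlap, coverage, fuse_interleave) are hypothetical.

```python
from itertools import zip_longest


def overlap(crawl_a, crawl_b):
    """Jaccard overlap between the document sets of two crawls (0.0 to 1.0)."""
    a, b = set(crawl_a), set(crawl_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def coverage(crawl, relevant, budget):
    """Number of relevant documents among the first `budget` downloads."""
    return len(set(crawl[:budget]) & set(relevant))


def fuse_interleave(*crawls):
    """Assumed fusion scheme: take documents from each crawl in turn,
    keeping the first occurrence of every document."""
    fused, seen = [], set()
    for group in zip_longest(*crawls):
        for doc in group:
            if doc is not None and doc not in seen:
                seen.add(doc)
                fused.append(doc)
    return fused
```

Given two crawls as lists of document identifiers in download order, coverage(fuse_interleave(crawl_a, crawl_b), relevant, budget) can be compared with the coverage of each single crawl at the same budget. Results on toy data say nothing about the outcome reported in the paper.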