the Internet. With this evolution, the quantity and quality of online content have increased exponentially, and so has the difficulty of gathering the most suitable information on a given subject. Web crawling engines provide the basis for indexing information so that it can be accessed through a search engine. However, this approach is generic: the goal is to find content across the entire web. As a result, finding valid, quality-assured information on a specific topic can be a complex task.
To overcome this problem, a new wave of web crawling engines is emerging, directed at specific scientific domains and able to reach content hidden in the deep web. Arabella is a directed web crawler that focuses on scalability, extensibility and flexibility. Because the crawling engine is driven by an XML-defined navigational map, it scales naturally: the map can contain any number of stops and rules. Dependencies between rules and stops can be added, as can multiple rules for the same address, and Arabella's multithreaded environment is prepared to deal with these situations. Extensibility is another key feature of the application: its modular design and small code base allow new features to be added quickly and existing ones to be reconfigured.
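As a rough illustration of the navigational map just described, the following Python sketch loads a hypothetical map and runs its stops in dependency order across a small thread pool. The <stop> and <rule> elements, the depends-on attribute, the addresses and the regular-expression patterns are invented for this example rather than taken from Arabella's actual schema.

import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical navigational map: two stops, the second depending on the first.
MAP_XML = r"""
<map>
  <stop id="gene-index" url="http://example.org/genes">
    <rule id="list-genes" pattern='href="(/gene/[^"]+)"'/>
  </stop>
  <stop id="gene-page" url="http://example.org/gene/{symbol}" depends-on="gene-index">
    <rule id="variants" pattern='variant=([A-Za-z0-9._]+)'/>
    <rule id="references" pattern='pmid=([0-9]+)'/>
  </stop>
</map>
"""

def load_map(xml_text):
    # Turn the XML map into a plain list of stop descriptions.
    stops = []
    for stop in ET.fromstring(xml_text).findall("stop"):
        stops.append({
            "id": stop.get("id"),
            "url": stop.get("url"),
            "depends_on": stop.get("depends-on"),
            "rules": [(r.get("id"), r.get("pattern")) for r in stop.findall("rule")],
        })
    return stops

def crawl_stop(stop):
    # Placeholder worker: a real one would fetch stop["url"] and apply
    # every rule's pattern to the response.
    return stop["id"], [rule_id for rule_id, _ in stop["rules"]]

stops = load_map(MAP_XML)
done = set()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Crude dependency handling: run each wave of stops whose
    # dependencies are already satisfied, then repeat.
    while len(done) < len(stops):
        ready = [s for s in stops if s["id"] not in done
                 and (s["depends_on"] is None or s["depends_on"] in done)]
        for stop_id, rule_ids in pool.map(crawl_stop, ready):
            print(stop_id, rule_ids)
            done.add(stop_id)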
The most important feature, however, is flexibility. Interleaving web requests with text processing enables a set of important functionalities: Arabella can generate new URL addresses dynamically at run time and store gathered information for later use in the execution cycle. With proper configuration, Arabella can also crawl any type of online accessible document, whether it is HTML, CSV, XML, a REST web service, tab-delimited or plain text.
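To make this interleaving of requests and text processing concrete, the sketch below uses invented addresses and an in-memory table of canned responses in place of real HTTP fetches. It shows how values extracted from one document can be stored and then used to generate the next round of URLs, with the parser chosen by content type; none of these names or rules come from Arabella itself.

import csv
import io
import re

# Canned responses standing in for network fetches: (content type, body).
RESPONSES = {
    "http://example.org/genes.csv": ("text/csv", "BRCA1\nTP53\n"),
    "http://example.org/gene/BRCA1": ("text/html", '<a href="/variant/c.68_69delAG">v1</a>'),
    "http://example.org/gene/TP53": ("text/html", '<a href="/variant/c.743G>A">v2</a>'),
}

def fetch(url):
    return RESPONSES[url]

def extract(content_type, body):
    # Pick a parser by content type; each returns a list of extracted values.
    if content_type == "text/csv":
        return [row[0] for row in csv.reader(io.StringIO(body)) if row]
    if content_type == "text/html":
        return re.findall(r'href="/variant/([^"]+)"', body)
    return body.split()

store = {"genes": [], "variants": []}   # gathered data kept for later stops
queue = ["http://example.org/genes.csv"]
while queue:
    url = queue.pop(0)
    content_type, body = fetch(url)
    for value in extract(content_type, body):
        if content_type == "text/csv":
            store["genes"].append(value)
            # New URL addresses generated dynamically from gathered data.
            queue.append("http://example.org/gene/" + value)
        else:
            store["variants"].append(value)

print(store)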
ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the European Community's Seventh
Framework Programme (FP7/2007-2013) under
grant agreement nº 200754 - the GEN2PHEN
project.