the Internet. With this evolution, the quantity and quality of online content have increased exponentially, and so has the difficulty of gathering the most suitable information on a given subject. Web crawling engines provide the basis for indexing information so that it can be accessed through a search engine. However, this approach is generic: the goal is to find content across the entire web. As a result, finding valid, quality-assured information on a specific topic can be a complex task.
To overcome this problem, a new wave of web crawling engines is emerging, directed at specific scientific domains and able to reach content hidden in the deep web. Arabella is a directed web crawler that focuses on scalability, extensibility and flexibility. Because the crawling engine is driven by an XML-defined navigational map, it scales naturally: the map can contain any number of stops and rules. Dependencies between rules and stops can be added, as can multiple rules for the same address, and Arabella's multithreaded environment is prepared to deal with these situations. Extensibility is another key feature of the application: its modular design and small code base allow new features to be added quickly and existing ones to be reconfigured.
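As a rough illustration of the navigational map just described, the following Python sketch loads a hypothetical map and runs its stops in dependency order across a small thread pool. The <stop> and <rule> elements, the depends-on attribute, the addresses and the regular-expression patterns are invented for this example rather than taken from Arabella's actual schema.

import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical navigational map: two stops, the second depending on the first.
MAP_XML = r"""
<map>
  <stop id="gene-index" url="http://example.org/genes">
    <rule id="list-genes" pattern='href="(/gene/[^"]+)"'/>
  </stop>
  <stop id="gene-page" url="http://example.org/gene/{symbol}" depends-on="gene-index">
    <rule id="variants" pattern='variant=([A-Za-z0-9._]+)'/>
    <rule id="references" pattern='pmid=([0-9]+)'/>
  </stop>
</map>
"""

def load_map(xml_text):
    # Turn the XML map into a plain list of stop descriptions.
    stops = []
    for stop in ET.fromstring(xml_text).findall("stop"):
        stops.append({
            "id": stop.get("id"),
            "url": stop.get("url"),
            "depends_on": stop.get("depends-on"),
            "rules": [(r.get("id"), r.get("pattern")) for r in stop.findall("rule")],
        })
    return stops

def crawl_stop(stop):
    # Placeholder worker: a real one would fetch stop["url"] and apply
    # every rule's pattern to the response.
    return stop["id"], [rule_id for rule_id, _ in stop["rules"]]

stops = load_map(MAP_XML)
done = set()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Crude dependency handling: run each wave of stops whose
    # dependencies are already satisfied, then repeat.
    while len(done) < len(stops):
        ready = [s for s in stops if s["id"] not in done
                 and (s["depends_on"] is None or s["depends_on"] in done)]
        for stop_id, rule_ids in pool.map(crawl_stop, ready):
            print(stop_id, rule_ids)
            done.add(stop_id)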
The most important feature, however, is flexibility. Interleaving web requests with text processing enables a set of important functionalities: Arabella can generate new URL addresses dynamically at run time and store gathered information for later use in the execution cycle. With proper configuration, Arabella can also crawl any type of online accessible document, whether it is HTML, CSV, XML, a REST web service, tab-delimited or plain text.
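To make this interleaving of requests and text processing concrete, the sketch below uses invented addresses and an in-memory table of canned responses in place of real HTTP fetches. It shows how values extracted from one document can be stored and then used to generate the next round of URLs, with the parser chosen by content type; none of these names or rules come from Arabella itself.

import csv
import io
import re

# Canned responses standing in for network fetches: (content type, body).
RESPONSES = {
    "http://example.org/genes.csv": ("text/csv", "BRCA1\nTP53\n"),
    "http://example.org/gene/BRCA1": ("text/html", '<a href="/variant/c.68_69delAG">v1</a>'),
    "http://example.org/gene/TP53": ("text/html", '<a href="/variant/c.743G>A">v2</a>'),
}

def fetch(url):
    return RESPONSES[url]

def extract(content_type, body):
    # Pick a parser by content type; each returns a list of extracted values.
    if content_type == "text/csv":
        return [row[0] for row in csv.reader(io.StringIO(body)) if row]
    if content_type == "text/html":
        return re.findall(r'href="/variant/([^"]+)"', body)
    return body.split()

store = {"genes": [], "variants": []}   # gathered data kept for later stops
queue = ["http://example.org/genes.csv"]
while queue:
    url = queue.pop(0)
    content_type, body = fetch(url)
    for value in extract(content_type, body):
        if content_type == "text/csv":
            store["genes"].append(value)
            # New URL addresses generated dynamically from gathered data.
            queue.append("http://example.org/gene/" + value)
        else:
            store["variants"].append(value)

print(store)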
ACKNOWLEDGEMENTS
The research leading to these results has received
funding from the European Community's Seventh
Framework Programme (FP7/2007-2013) under
grant agreement nº 200754 - the GEN2PHEN
project.