A NEW APPROACH TOWARDS VERTICAL SEARCH ENGINES
Intelligent Focused Crawling and Multilingual Semantic Techniques
Sybille Peters, Claus-Peter Rückemann and Wolfgang Sander-Beuermann
Regionales Rechenzentrum für Niedersachsen (RRZN), Leibniz Universität Hannover (LUH), Hannover, Germany
Keywords:
Focused crawling, Search engine, Vertical search engine, Metadata, Educational research, Link analysis.
Abstract:
Search engines typically consist of a crawler which traverses the web retrieving documents and a search front-
end which provides the user interface to the acquired information. Focused crawlers refine the crawler by
intelligently directing it to predefined topic areas. Today, the evolution of search engines is driven by additional
search capabilities, such as searching metadata as well as the content text.
Semantic web standards have supplied methods for augmenting webpages with metadata. Machine learning
techniques are used where necessary to gather more metadata from unstructured webpages. This paper analyzes
the effectiveness of techniques for vertical search engines with respect to focused crawling and metadata
integration, using the field of “educational research” as an example. A search engine for these purposes implemented
within the EERQI project is described and tested. The enhancement of focused crawling with the use of link
analysis and anchor text classification is implemented and verified. A new heuristic score calculation formula
has been developed for focusing the crawler. Full-texts and metadata from various multilingual sources are
collected and combined into a common format.
1 INTRODUCTION
This investigation is part of an ambitious scheme
funded by the European Commission under the 7th
Framework Programme: The European Educational
Research Quality Indicators (EERQI) project was
launched in 2008 for a duration of three years with the
purpose of finding new indicators and methodologies
for determining research quality of scientific publica-
tions in the field of “educational research” (EERQI-
Annex1, 2008). A key task within this project is the
development of an exemplary vertical search engine
for “educational research” documents. For this pur-
pose, mechanisms must be found for locating “educa-
tional research” publications in the WWW as well as
for distinguishing scientific research documents from
non-scientific documents.
The goal is to provide extensive search capabili-
ties for the user of the search engine. It should be pos-
sible to search within the full-text and metadata (lan-
guage, publisher, publication date, peer-review sta-
tus etc.) of the document. The methods exemplified in this case study might then be applied to any vertical search engine.
1.1 Existing Search Engines
A number of search engines focusing on scien-
tific research were analyzed. These included OAIs-
ter (OAIster, 2009), Scirus (Scirus, 2009), Google
scholar (Google Scholar, 2009), and the Education
Resources Information Center (ERIC, 2009). None of these search engines provided all of the required features: a granular topical selection, search within both content and metadata (such as title, subject, author, and/or language), and the inclusion of documents in languages other than English in the corpus.
1.2 Focused Crawling
The goal of a focused crawler is to limit crawling
to a specific field of interest. Frequency measures
for the keywords within URLs (Zheng et al., 2008),
link anchor texts (Zhuang et al., 2005), title, and full-
text (Bergmark et al., 2002) as well as occurrences
of links to and from other pages (Chakrabarti et al.,
1998) are some of the parameters that have been eval-
uated. Machine learning methods have been applied
to steer the crawler to pages with a higher probability
of compliance with the requested topic and determine
whether the documents meet the criteria. The crawl
must also be refined by selecting a list of high qual-
ity start URLs (“seeds”). Starting the crawler on open access archives or on academic author or institutional websites is one of the strategies that have been used (Zhuang et al., 2005). It must also be established how
far the crawler may deviate from topically relevant
pages (“tunneling”) to find more clusters of relevant
pages (Bergmark et al., 2002).
Classification methods are then applied to deter-
mine whether the retrieved page belongs to a target
topic. For this purpose, a set of training documents is used. Common methods are a vector space model
with tf-idf term weights and a Naive Bayes, k–nearest
neighbor, or support vector machine classifier (Pant
et al., 2004; Manning et al., 2008). In addition to lim-
iting the crawl to a certain topic, mechanisms must
be applied to assure that the retrieved documents are
actually scientific research documents. It needs to be
determined how well existing classifiers are capable
of doing this. In order to retrieve only academic doc-
uments, the documents themselves may be analyzed
for structure or content as well as the sites on which
they are located.
1.3 Metadata
Metadata in this paper means any information further
describing a digital “educational research” document
(referred to as ERD in this paper). This may be information such as the title, author, publisher, abstract, or ISSN, which is assigned at the time of publication, as well as additional keywords or quality criteria which may be attributed to the document automatically or manually. It may also be the language, file format, or number of pages.
Metadata is useful for searching and browsing
within a set of documents (Witten et al., 2004). Com-
bining full-text search with metadata search greatly
enhances the capabilities of the user to refine queries.
Displaying the metadata in the search results provides
additional valuable information. Sources for gather-
ing metadata may be combined. Metadata may also
be extracted from the full-text documents themselves.
Extensive research has been done on various meth-
ods to achieve this, for example using a support vec-
tor machine based classification method (Han et al.,
2003).
The Web pages themselves also contain metadata
that is not marked as metadata but may be identified
with machine learning methods (e.g. search result pages, table-of-contents pages). Some work
has also been done on supervised and unsupervised
learning approaches in this area (Liu, 2008).
2 IMPLEMENTATION
The EERQI crawler is based on Nutch (Nutch, 2009), an open source web crawler that is highly configurable and extensible via plugins. It scales across CPU clusters by incorporating the Apache Hadoop (Hadoop, 2009) framework.
The following sections discuss the implementa-
tion of the search engine regarding the significant
goals mentioned in the introduction.
2.1 Combining Techniques for Best-first
Focused Crawling
2.1.1 Crawl Cycle
The Nutch crawler used within this investigation has been substantially extended. The Nutch software itself does not implement focused crawling but is extensible in this respect. The crawl is initialized with
a seed list: a set of start URLs. Most of these
start URLs have been selected from lists of electronic
journals in “educational research”. These URLs are
injected into the Nutch crawl database (“crawldb”),
which includes some information about each URL,
such as the current status (e.g. fetched or unfetched)
and time of last fetch.
Each crawl cycle generates a list of top scoring
unfetched URLs or URLs which need to be refetched.
These URLs are then retrieved from the WWW and
the resulting files are parsed. The URLs and corre-
sponding anchor texts are also extracted and inserted
into the link database (“linkdb”). This contains a list
of inlink URLs and anchor texts for each URL. The
parsed text is indexed if the document meets the Ed-
ucational Research Document Detection (ERDD) cri-
teria. A partial index is created for each crawl cy-
cle. Duplicate documents are deleted from the in-
dexes (“dedup”). Finally, the indexes from each crawl cycle are merged into the final index. The modi-
fied status information for each URL is rewritten to
the “crawldb”. The score for each URL is adapted
for EERQI focused crawling (“rescore”). Nutch uses
the OPIC (On-line Page Importance Computation)
(Abiteboul et al., 2003) algorithm to assign scores to
each URL.
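For illustration, the general best-first strategy underlying this cycle can be sketched as follows. This is a minimal Python sketch, not the Nutch-based implementation; the fetch, score, and is_relevant callbacks are placeholders for the fetcher, the scoring described in the following sections, and the ERDD criteria.

import heapq

def best_first_crawl(seeds, fetch, score, is_relevant, max_pages=100):
    # Illustrative best-first crawler skeleton; fetch/score/is_relevant are
    # caller-supplied placeholders, not EERQI components.
    frontier = [(-score(url), url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen, relevant, fetched = set(seeds), [], 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)              # best-scoring URL first
        text, outlinks = fetch(url)                   # retrieve and parse the page
        fetched += 1
        if is_relevant(text):                         # topical classifier (ERDD here)
            relevant.append(url)
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), link))
    return relevant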
2.1.2 Focused Crawling based on Link Analysis
A basic premise in OPIC and PageRank (Abiteboul et al., 2003) is: a page is important if important pages point to it, and important pages should be fetched first and more often. Within the EERQI
crawler, we know which pages are important, i.e. relevant, as soon as we have fetched and analyzed them. These are the pages that have been indexed after being detected as Educational Research Documents (ERD). We must learn to predict which pages will be important before they are fetched, and follow the most promising paths.
Samples from the WWW have shown that ERDs most often do not link to other important ERDs, if they link to anything at all. However, the pages linking to ERDs can be regarded as important pages, because they are often table-of-contents pages for an entire journal volume or year. They will not be indexed, but they are important for finding links to other relevant pages. It therefore makes sense to use backpropagation for boosting the relevance score of pages which link to ERDs. These pages are comparable to the
hubs in Kleinberg’s HITS (Hyperlink-Induced Topic
Search) algorithm (Kleinberg, 1999). The HITS al-
gorithm assumes that a good hub is a document that links to many good authorities (authorities are important pages, comparable to ERDs). Simply using the
above mentioned link importance algorithms (such as
OPIC, HITS or PageRank) is not feasible because we
will not crawl a significant portion of the WWW and
these algorithms do not take into account whether a
document is an ERD.
The web may be modeled as a directed graph.
Intuitively, an ideal crawl path would retrieve a very
high number of ERD and a small number of non-ERD
pages. The ratio of ERD pages to the total num-
ber of fetched pages should be as high as possible.
When considering specific URLs, pages are important if they link to a high number of pages classified as ERD. Indirect outlinks (outlinks of outlinks) will be considered up to a certain distance. Effectively, the high score of an ERD will be backpropagated to pages linking to it. The resulting score must then be passed on to the outlinks of these pages until it reaches a significant number of unfetched pages.
We calculate the score based on the ratio of ERD classified outlinks to all outlinks. The total number of ERD classified outlinks was also included in the equation at one point, but experimental results showed that this did not improve the results significantly. Because the contribution of each link level should decrease with growing distance from the start level, it is divided by a constant, here named g, which should fulfill g > 1 (experimentally, a value of g = 2 has proven to yield the best results), exponentiated with the link level k. The score thus becomes weaker the farther it is propagated. Experiments were used to refine the equation based on the results.
Equation 1 is a heuristic approach newly developed within this project that sums up the score calculations based on backpropagation. It is applicable to vertical search engines in other fields of interest as well. It has proven to yield the best results, as will be shown later in Figure 1.
h_i = \sum_{k=0}^{l} \frac{c_k}{(d_k + 1) \, g^k}    (1)

where h_i is the score for page i, l is the number of link levels, c_k is the number of links of i in link level k that have been classified as ERD, and d_k is the total number of links of i in link level k.
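A direct transcription of Equation 1 reads as follows. This is a minimal sketch; it assumes that the per-level link counts of page i have already been collected from the crawl and link databases.

def heuristic_score(levels, g=2.0):
    """Score h_i of page i according to Equation 1.
    levels[k] = (c_k, d_k): the ERD-classified links and the total links of
    page i at link level k; g > 1 damps the contribution of distant levels
    (g = 2 worked best in the experiments reported above)."""
    return sum(c / ((d + 1) * g ** k) for k, (c, d) in enumerate(levels))

# Example: 3 of 5 direct outlinks are ERD, 1 of 10 at level 1, 0 of 7 at level 2:
# heuristic_score([(3, 5), (1, 10), (0, 7)]) = 0.5 + 0.0455 + 0.0 = 0.545 (approx.)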
2.1.3 Anchor Text Analysis
So far, 60,000 anchor texts have been analyzed. It may be assumed that words such as “pdf”, “full”, “article”, and “paper” are good indicators of research documents, but they do not contain any information about whether the referenced document is about “educational research”. The word “abstract” is a good hint that the referenced document contains only an abstract, which is currently not considered an ERD by the search engine.
SVMLight (SVMLight, 2009), a Support Vector Machine based classifier, was used to train a classifier on the anchor texts. Single-word anchor texts that are a good indicator of a direct link to research texts (“pdf”) obtained almost the same result as single words that would most likely not point to research documents (“sitemap” and “abstract”). It is assumed that this is due to the large number of non-ERD documents (for example research from other fields) that were also linked with potentially promising anchor text words. However, the classifier works well on anchor texts containing typical “educational research” terms; for example, “Teacher” received a score of 4.28 and “Learning” a score of 4.84.
When training the classifier, not only the anchor
texts with direct links to ERD were used, but also an-
chor texts of indirect links up to a level of three.
An SVMLight score above 0 may be interpreted as a positive hit; the higher the score, the higher the probability of belonging to the trained class. The maximum score obtained in a list of 30,000 samples was 4.89, while the minimum was -4.99. While using this score may optimize the focused crawler, it may also bias the search engine towards documents with “typical” mainstream titles.
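The following sketch illustrates this kind of anchor text classification. It uses scikit-learn's linear SVM merely as a stand-in for SVMLight, and the training anchor texts are invented examples rather than the project's training data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented example anchor texts; +1 marks anchors leading (directly or
# indirectly) to an ERD, -1 marks anchors that do not.
anchor_texts = ["teacher education full text pdf",
                "learning and instruction article",
                "sitemap",
                "contact imprint"]
labels = [1, 1, -1, -1]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(anchor_texts)
classifier = LinearSVC().fit(features, labels)

# The decision function plays the role of the SVMLight score:
# values above 0 are interpreted as a positive hit.
scores = classifier.decision_function(
    vectorizer.transform(["teacher learning", "abstract"]))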
2.2 Educational Research Document
Detection
Before analyzing how an ERD may be detected, we
must first define the term ERD more precisely: An
ERD is a digital scientific research document which
may be classified within the topic “educational re-
search”. It may be for example a journal article, a
conference paper, a thesis, or a book. An ERD may itself consist of one or more ERDs, as in conference proceedings or entire journals. Abstracts are part of an ERD but are not considered fully qualified ERDs on their own.
Educational Research Document Detection may
be regarded as a combination of identifying scien-
tific research documents and topical (“educational re-
search”) classification.
A large number of publications have analyzed
the use of Vector Space Model based algorithms
for document classification. Sebastiani (Sebastiani,
2002) provided an overview. These methods may be
used for matching new documents with existing cate-
gories, such as specific topics (e.g. physics, biology),
spam / no-spam etc. The document is represented as a vector: each dimension represents a term, and its value reflects the frequency with which the term occurs in the document (e.g. tf-idf weights may be used). When classifying a document, the term vector of the document is matched with the term vectors
of the classes. ERDD may be regarded as a binary
classification problem, because there is only one class
(ERD), or a ranking problem where the documents are
sorted by their ERD ranking score.
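As a simple illustration of the ranking view, the following sketch ranks candidate documents by their cosine similarity to the centroid of a small ERD training collection. The data are toy examples and scikit-learn is used as a stand-in; this is not the classifier deployed in the prototype.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy training and candidate documents (illustration only).
erd_training = ["curriculum research in teacher education",
                "empirical study of classroom learning outcomes"]
candidates = ["a study of teacher education and learning",
              "quarterly financial report of a company"]

vectorizer = TfidfVectorizer()
training_vectors = vectorizer.fit_transform(erd_training)
class_centroid = np.asarray(training_vectors.mean(axis=0))  # term vector of the ERD class
candidate_vectors = vectorizer.transform(candidates)

# Documents may be ranked, or thresholded for a binary ERD decision,
# by their similarity to the class centroid.
similarities = cosine_similarity(candidate_vectors, class_centroid).ravel()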
For supervised learning text classification, a col-
lection of documents is required, which may be used
as a training base. This collection should cover all ar-
eas of “educational research”. A negative collection
should be provided as well, which covers documents
that should not be considered as ERD, such as re-
search documents from other fields and non-research
documents.
The detection mechanism is implemented using
the following criteria:
1. A rule based content analysis is used in order to ensure a high probability that the document is a research document. The document must have a minimum text length, it must contain defined terms (such as “references” or “abstract”), and it must contain references, which may exist in various formats.
2. A number of significant “educational research” keywords must exist in the document. A sketch of these rules is given below. Further work needs to be done to replace or augment this heuristic with a vector space model based classifier.
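The following sketch outlines the rule based part of this detection. The thresholds, term lists, and the reference pattern are placeholder assumptions and not the values used in the EERQI crawler.

import re

# Placeholder thresholds and term lists (assumed values, not the EERQI settings).
MIN_TEXT_LENGTH = 10000
REQUIRED_TERMS = ("references", "abstract")
TOPIC_KEYWORDS = ("education", "learning", "teacher", "curriculum", "didactic")
MIN_KEYWORD_HITS = 3

def looks_like_erd(text):
    lower = text.lower()
    if len(text) < MIN_TEXT_LENGTH:
        return False
    if not all(term in lower for term in REQUIRED_TERMS):
        return False
    # Very rough reference-format check: "[12]" style or "(1999)" style entries.
    if not re.search(r"\[\d+\]|\(\d{4}\)", text):
        return False
    keyword_hits = sum(1 for keyword in TOPIC_KEYWORDS if keyword in lower)
    return keyword_hits >= MIN_KEYWORD_HITS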
2.3 Metadata
A common Dublin Core based XML format was de-
fined for metadata. The local content base consists of a number of full-text documents and metadata that have been supplied by various publishers. The full-text content and the available metadata were indexed using Lucene (Lucene, 2009).
The Nutch (Nutch, 2009) crawler also uses a
Lucene (Lucene, 2009) index. The EERQI index-
ing plugin was modified to write the full-text content,
file format (Internet media type), number of pages (in
case of PDF), and the language (as detected by the
Nutch LanguageIdentifier) to the index. A list of journal information was generated by combining educational research journal lists from various sources with information supplied by EERQI partners. Metadata information in the index was ex-
panded with journal information such as peer-review
status, publisher, and ISSN. This may later be en-
hanced with information about the title, authors, and
date of publication.
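The following sketch shows the general idea of emitting one such record with Python's standard XML tooling. Only the Dublin Core element names are standard; the container element and the field values are assumptions, as the exact EERQI schema is project specific.

import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

record = ET.Element("record")                         # container element name assumed
for element, value in [("title", "An Example Study"),  # example values only
                       ("creator", "Doe, Jane"),
                       ("publisher", "Example Press"),
                       ("language", "de"),
                       ("identifier", "ISSN 1234-5678"),
                       ("date", "2009")]:
    ET.SubElement(record, "{%s}%s" % (DC_NS, element)).text = value

xml_record = ET.tostring(record, encoding="unicode")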
2.4 Multilingualism
Multilingualism is of special importance for Euro-
pean search engines due to Europe’s diversity of lan-
guages. The documents and metadata indexed by
our search engine are in fact supplied in several lan-
guages. Furthermore the documents themselves may
contain more than one language: often the abstract
is in English and the rest of the document is in the
native language of the authors. In order to provide
full multilingualism, it is necessary to use language
independent algorithms wherever possible and supply
translations for the supported languages.
Our focused crawling scheme based on link analy-
sis is language independent. The ERDD must supply
mechanisms for all supported languages.
When submitting a query from the web user in-
terface, the search engine may return results match-
ing the query and related terms in all supported lan-
guages. This will be implemented in the near future using a multilingual thesaurus. To the best of our knowledge, it will be the first search engine implementing this kind of extended multilingualism.
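The following sketch illustrates thesaurus based query expansion along these lines. The tiny dictionary is a made-up stand-in for the multilingual thesaurus to be integrated.

# Made-up stand-in for the multilingual thesaurus (illustration only).
THESAURUS = {
    "teacher": {"en": ["teacher"], "de": ["Lehrer", "Lehrerin"],
                "fr": ["enseignant"], "sv": ["lärare"]},
}

def expand_query(term):
    """Return the original term plus its translations in all supported languages."""
    expansions = {term}
    for translations in THESAURUS.get(term.lower(), {}).values():
        expansions.update(translations)
    return sorted(expansions)

# expand_query("teacher") -> ['Lehrer', 'Lehrerin', 'enseignant', 'lärare', 'teacher']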
3 RESULTS
At this stage, a prototype search engine has been de-
signed and tested. The primary targets as described in
this case study have been implemented.
Figure 1: Crawl results with and without optimizations (x-axis: crawl cycle, 1 to 20; y-axis: number of ERD found, 0 to 1400; series: without optimization, anchor text optimization, link analysis optimization by Equation 1).
In order to test the crawler, 100 URLs were randomly selected from the seed list. A disjoint list of 100 further URLs was used to train the anchor text classifier.
When crawling, 1000 URLs were generated for each
crawl cycle. The crawling alternates between select-
ing the 1000 best-scoring URLs and selecting the 10
top-scoring URLs for each of the sites from the seed
list. This was done to prevent an excessive downgrad-
ing of individual sites.
The total number of ERDs found for each crawl
cycle is shown in Figure 1. Crawling was done with a
total depth of 100 crawl cycles. The three lines show
the execution of the runs:
1. without optimization,
2. with anchor text optimization: A preliminary training crawl of depth 50 was done with the aforementioned training URLs. During the test crawl, anchor texts were rated by SVMLight based on this training set,
3. with link analysis optimization by Equation 1: Pages linking to ERDs were boosted in score up to three outlink levels and one inlink level. Equation 1 has thus proven to be of significant value for intelligent focused crawling and might therefore be used for any vertical search engine crawler.
Using various sources, such as input from EERQI
partners, and the ERIC database (ERIC, 2009), a list
of URLs was compiled for measuring ERDD. This
list included “educational research” documents, research documents from other fields, non-research documents, and other documents (such as bibliographies, book reviews etc.). Applying the ERDD mech-
anism to these documents produced the following re-
sults (Table 1):
Table 1: Precision, recall, and accuracy ratios in ERDD re-
sults.
Precision Recall Accuracy
0.73 0.89 0.86
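For reference, the ratios in Table 1 follow the usual definitions, with tp denoting ERDs correctly detected, fp non-ERDs detected as ERD, fn ERDs missed, and tn non-ERDs correctly rejected:

precision = tp / (tp + fp),  recall = tp / (tp + fn),  accuracy = (tp + tn) / (tp + fp + fn + tn).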
For the four project languages (English, French, German, and Swedish), a sufficient number of documents has been gathered from the WWW; these documents will further be used for testing quality indicator detection within the EERQI project. Results of this detec-
tion may be used to enrich document metadata within
the search engine.
4 SUMMARY AND
CONCLUSIONS
It has been shown that an advanced vertical search
engine collecting full-text and metadata from inho-
mogeneously structured information sources can be
successfully implemented by integrating an intelli-
gent focused crawler. Content classification (ERDD)
and metadata extraction have been shown to be valuable methods for enhancing search results. Link analysis
optimization achieved considerably better results than
anchor text optimization or no optimization. Using link analysis, the number of necessary crawl cycles is reduced by at least 50 %, leading to faster results and less use of resources. The EERQI search engine,
accessible on the EERQI project website (EERQI,
2009), provides extensive search capabilities within
metadata and full-texts. It is the goal of the search
engine to gather information about a large number of
relevant “educational research” documents and pro-
vide access to information about these documents.
The first steps have been taken to achieve this goal.
A new formula (equation 1) has been developed for
focused crawling.
5 FUTURE WORK
Based on the current implementation, the next stage
of the EERQI search engine development will con-
centrate on optimized content classification (ERDD)
and metadata extraction. Further effort needs to be
put into metadata extraction from anchor texts and
full text. Preliminary tests revealed that a significant
number of anchor texts include title, author, and / or
journal names. This may be combined with metadata
extraction from full-texts. The search engine user in-
terface will be enhanced to facilitate ergonomic us-
ability for a number of features, such as clustering and
sorting of results as well as complex search queries.
ACKNOWLEDGEMENTS
We kindly thank the partners within the EERQI
project and the colleagues at RRZN for their valuable
input and support. EERQI is funded by the European
Commission under the 7th Framework Programme,
grant 217549.
REFERENCES
Abiteboul, S., Preda, M., and Cobena, G. (2003). Adaptive
On-Line Page Importance Computation. In Proceed-
ings of the 12th international conference on World
Wide Web, pages 280–290. ACM.
Bergmark, D., Lagoze, C., and Sbityakov, A. (2002). Fo-
cused Crawls, Tunneling, and Digital Libraries. In
Proceedings of the 6th European Conference on Digi-
tal Libraries.
Chakrabarti, S., Dom, B., Raghavan, P., Rajagopalan, S.,
Gibson, D., and Kleinberg, J. (1998). Automatic Re-
source Compilation by Analyzing Hyperlink Structure
and Associated Text. In Proceedings of the Seventh
International World Wide Web Conference.
EERQI (2009). EERQI project website.
http://www.eerqi.eu.
EERQI-Annex1 (2008). EERQI Annex I - Description of
Work. http://www.eerqi.eu/sites/default/files/11-06-
2008 EERQI Annex I-1.PDF (PDF).
ERIC (2009). Education Resources Information Center
(ERIC). http://www.eric.ed.gov.
Google Scholar (2009). Google Scholar.
http://scholar.google.com.
Hadoop (2009). Apache Hadoop.
http://hadoop.apache.org/.
Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z.,
and Fox, E. (2003). Automatic Document Metadata
Extraction using Support Vector Machines. In Pro-
ceedings of the 2003 Joint Conference on Digital Li-
braries (JCDL 2003).
Kleinberg, J. (1999). Authoritative Sources in a Hyper-
linked Environment. Journal of the ACM, pages 604–
632.
Liu, B. (2008). Web Data Mining. Springer.
Lucene (2009). Apache Lucene. http://lucene.apache.org/.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
Nutch (2009). Apache Nutch. http://lucene.apache.org/
nutch/.
OAIster (2009). OAIster. http://oaister.org.
Pant, G., Tsioutsiouliklis, K., Johnson, J., and Giles, C. L. (2004).
Panorama: Extending Digital Libraries with Topi-
cal Crawlers. In Proceedings of the 2004 Joint
ACM/IEEE Conference on Digital Libraries.
Scirus (2009). Scirus. http://www.scirus.com.
Sebastiani, F. (2002). Machine Learning in Automated Text
Categorization. ACM Computing Surveys, 34:1–47.
SVMLight (2009). SVMlight.
http://svmlight.joachims.org/.
Witten, I., Don, K. J., Dewsnip, M., and Tablan, V. (2004).
Text mining in a digital library. International Journal
on Digital Libraries.
Zheng, X., Zhou, T., Yu, Z., and Chen, D. (2008). URL Rule
Based Focused Crawler. In Proceedings of 2008 IEEE
International Conference on e-Business Engineering.
Zhuang, Z., Wagle, R., and Giles, C. L. (2005). What’s there
and what’s not? Focused crawling for missing docu-
ments in digital libraries. In Proceedings of the 5th
ACM/IEEE-CS Joint Conference on Digital Libraries.