TOPIC-SPECIFIC WEB SEARCHING BASED ON A REAL-TEXT

DICTIONARY

Ari Pirkola

University of Tampere, School of Information Sciences, Tampere, Finland

Keywords: Dictionaries, Focused Crawling, Query Performance Prediction, Searching, Vertical Search Engines, Web

Search Engines.

Abstract: The contributions of this paper are twofold. First, we present a new type of dictionary that is intended as a

search assistance in topic-specific Web searching. The method to construct the dictionary is a general

method that can be applied to any reasonable topic. The first implementation deals with climate change. The

dictionary has the following new features compared to standard dictionaries and thesauri: (A) It contains

real-text phrases (e.g. rising sea levels) in addition to the standard dictionary forms (sea-level rise). The

phrases were extracted automatically from the pages dealing with climate change, and are thus known to

appear in the pages discussing climate change issues when used as search terms. (B) Synonyms, i.e., differ-

ent spelling, syntactic, and short form variants of the phrase are grouped together into the same entry (syno-

nym set) using approximate string matching. (C) Each phrase is assigned an importance score (IS) which is

calculated based on the frequencies of the phrase in relevant pages (i.e., pages on climate change) and non-

relevant pages. Second, we investigate how effective the IS is for indicating the best phrase among synony-

mous phrases and for indicating effective phrases in general from the viewpoint of search results. The ex-

perimental results showed that the best phrases have higher ISs than the other phrases of a synonym set, and

that the higher the IS is the better the search results are. This paper also describes the crawler used to fetch

the source data for the climate change dictionary and discusses the benefits of using the dictionary in Web

searching.

1 INTRODUCTION

In a current research project, we developed a vertical

search engine (a topic-specific search engine)

(Pirkola, 2011b) and a topic-specific dictionary of

real-text phrases, that is, a dictionary of phrases that

are extracted from the Web pages discussing the

topic in question. Both are dedicated to climate

change and their focus is on scientific pages, but the

methods to implement the search engine and the

dictionary are general methods that can be applied to

other topics. The search engine and the dictionary

are at: www.searchclimatechange.com.

The proposed real-text dictionary, in short RT-

dictionary, has new features compared to the stan-

dard dictionaries and thesauri. We now discuss the

existing climate change RT-dictionary, but a similar

dictionary can be constructed for any reasonable

topic using the method presented in this paper. The

dictionary contains some 5 500 unique real-text

phrases related to climate change (keyphrases). They

were extracted from scientific Web pages discussing

climate change. Each phrase is assigned a fre-

quency-based importance score, which reflects the

significance of the phrase in the context of climate

change research. Different variant forms of the same

phrase, such as sea-level rise, sea level rising, and

rising sea level, are grouped together into the same

entry (synonym set) using approximate string match-

ing. The dictionary was developed for use as a

search assistance in our search system to support

query formulation, but it can be used as well to-

gether with the general Web search engines. The

phrases represent different aspects of the climate

change research. When used in searching the user

can browse through the dictionary to clarify his or

her information need, or select more directly appro-

priate search phrases.

The purpose of the importance score (IS) is to

provide the user with information on important and

less important phrases. Within a synonym set all

phrases are not equal from the viewpoint of the

289

Pirkola A..

TOPIC-SPECIFIC WEB SEARCHING BASED ON A REAL-TEXT DICTIONARY.

DOI: 10.5220/0003895602890298

In Proceedings of the 8th International Conference on Web Information Systems and Technologies (WEBIST-2012), pages 289-298

ISBN: 978-989-8565-08-2

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

search results, but some yield better results than

others. It may be that one is significantly better than

the other phrases. In the experimental part of this

paper we investigate, first, how effective the IS to

indicate the best phrase within a synonym set. We

expect the phrase with the highest IS to yield the

best search results. Our second research question is:

Does high IS mean high retrieval performance in

general (i.e., when synonymy is not considered). The

expectation is that the higher the IS, the better the

search results. The results of this study guide our

further work: depending on the results we may

elaborate the calculation of the IS. The results may

also suggest a sensible lower limit of the IS to be

applied in the dictionary. The results also guide

more generally our work in the further development

of the RT-dictionary.

In summary, this paper has two contributions.

First, we describe how the RT-dictionary of climate

change was constructed (Section 4). The presented

methods and tools can be used to a construct a simi-

lar RT-dictionary for any reasonable topic. Second,

in the experimental part of this paper we investigate

how effective the proposed importance score is to

indicate the best phrase among synonymous phrases

and to indicate effective phrases in general from the

viewpoint of the search results (Sections 5 and 6).

We also review related research (Section 2) and

describe the crawler used to fetch the source data for

the climate change RT-dictionary (Section 3). Sec-

tion 7 presents the discussion, and Section 8 con-

cludes this paper.

2 RELATED WORK

This study is related to research on query perform-

ance prediction, which is an important research

problem in the field of information retrieval re-

search. A retrieval system that can predict the diffi-

culty of a query can adapt its parameters to the

query, or it can give feedback to the user to reformu-

late the query. Query performance prediction has

been the focus of many studies. He and Ounis

(2006) studied six methods to predict query per-

formance, e.g. average inverse collection term fre-

quency and the standard deviation of inverse docu-

ment frequency. Cronen-Townsend et al. (2002)

proposed computing divergence of the query model

from the collection model to predict the difficulty of

queries. Perez-Iglesias and Araujo (2010) proposed

the use of the maximum standard deviation in the

ranked-list as a retrieval predictor. Different from

these studies the IS ranks phrases with respect to

each other rather than tries to predict retrieval per-

formance as such.

This work is also related to research on key-

phrase extraction. Conventionally, keyphrase extrac-

tion refers to a process where phrases that describe

the contents of a document are extracted and are

assigned to the same document to facilitate e.g.

information retrieval. Our approach differs from the

conventional approach in that we do not handle

individual documents but a set of documents dis-

cussing a particular topic. Most conventional ap-

proaches are based on machine learning techniques.

KEA (Witten et al., 1999), GenEx (Turney, 2003),

and KP-Miner (El-Beltagy and Rafea, 2009) are

three keyphrase extraction systems presented in the

literature. In these systems, keyphrases are identified

and scored based on their length and their positions

in documents, and using the TF-IDF weight.

Muresan and Harper (2004) also developed a

terminological support for searchers’ query con-

struction in Web searching. However, unlike our

study they did not focus on keyphrases but proposed

an interaction model based on system-based media-

tion through structured specialized collections. The

system assists the user in investigating the terminol-

ogy and the structure of the topic of interest by al-

lowing the user to explore a specialized source col-

lection representing the problem domain. The user

may indicate relevant documents and clusters on the

basis of which the system automatically constructs a

query representing the user’s information need. The

starting point of the approach is the ASK (Anoma-

lous State of Knowledge) model where the user has

a problem to solve but does not know what informa-

tion is needed (Belkin et al., 1982). Lee (2008)

showed that the mediated system proposed by Mure-

san and Harper (2004) was better than a direct re-

trieval system not including a source collection in

terms of effectiveness, efficiency and usability. The

more search tasks the users conducted, the better

were the results of the mediated system.

The source data for the climate change RT-

dictionary were crawled from the Web sites of uni-

versities and other research organizations investigat-

ing climate change by the focused crawler described

in Section 3. Focused crawlers are programs that

fetch Web documents (pages) that are relevant to a

pre-defined domain or topic. Only documents as-

sessed to be relevant by the system are downloaded

and made accessible to the users e.g. through a digi-

tal library or a vertical search engine. During crawl-

ing link URLs are extracted from the pages and are

added into a URL queue. The queue is ordered based

on the probability of URLs (i.e., pages pointed to by

WEBIST2012-8thInternationalConferenceonWebInformationSystemsandTechnologies

290

the URLs) being relevant to the topic in question.

Pages are assigned probability scores e.g. using a

topic-specific terminology, and high-score pages are

downloaded first.

Focused crawling research has focused on im-

proving crawling techniques and crawling effective-

ness (Bergmark et al., 2002; Chakrabarti et al., 1999;

Diligenti et al., 2000; Tang et al., 2005). Below we

consider these papers. We are not aware of any

study investigating the use of focused crawling for

keyphrase extraction. Perhaps the closest work to

our research is that of Talvensaari et al. (2008) who

also constructed word lists using focused crawling.

However, they used focused crawling as a means to

acquire German-English and Spanish-English com-

parable corpora in biology for statistical translation

in cross-language information retrieval.

Bergmark et al. (2002) use the term tunneling to

refer to the fact that sometimes relevant pages can be

fetched only by traversing through irrelevant pages.

A focused crawling process using the tunneling

technique does not end immediately after an irrele-

vant page is encountered, but the process continues

until a relevant page is encountered or the number of

visited irrelevant pages reaches a pre-set limit. It was

shown that a focused crawler using tunneling is able

to fetch more relevant pages than a focused crawler

that only counts relevant pages. However, efficiency

is lowered due to the downloaded irrelevant pages.

Chakrabarti et al. (1999) utilized the Document

Object Model (DOM) of Web pages in focused

crawling. The DOM is a convention for representing

and interacting with objects in HTML, XHTML and

XML documents (http://en.wikipedia.org/wiki/ Doc-

ument_Object_Model). A DOM tree represents the

hierarchical structure of a page: the root of the tree

is usually the HTML element which typically has

two children, the HEAD and BODY elements,

which further are divided into sub-elements. The

leaf nodes of the DOM tree are typically text para-

graphs, link anchor texts, or images. In addition to

the usual bag-of-words representation of Web pages,

the approach proposed by Chakrabarti et al. (1999)

represented a hyper-link as a set of features <t, d>

where t is a word appearing near the link, and d its

distance from the link. The distance is measured as

the number of DOM tree nodes that separates t from

the link node. These features were used to train a

classifier to recognize links leading to relevant

pages. The links with low distance to relevant text

are considered to be more important than links that

are far from the relevant text.

In the Context Graph approach (Diligenti et al.,

2000) a graph of several layers depth is constructed

for each page and the distance of the page to the

target pages is computed. In the beginning of crawl-

ing a set of seed URLs is entered in the focused

crawler. Pages that point to the seed URLs, i.e.,

parent pages, and their parent pages (etc.) form a

context graph. The context graphs are used to train a

classifier with features of the paths that lead to rele-

vant pages.

Tang et al. (2005) built a focused crawler for

mental health and depression. The aim was to iden-

tify high-quality information on these topics. They

found that link anchor text was a good indicator of

the page’s relevance but not of quality. They were

able to predict the quality of links using a relevance

feedback technique.

3 CRAWLER

We developed a focused crawler that is used to fetch

pages both for the climate change search system and

for use as source data for the RT-dictionary of cli-

mate change. The crawler can be easily tuned to

fetch pages on other topics. In this section we de-

scribe the crawler.

The crawler determines the relevance of the

pages during crawling by matching a topic-defining

query against the retrieved pages using a search

engine. It uses the Lemur search engine

(http://www.lemurproject.org/) for this purpose. The

pages on climate change were fetched using the

following search terms in the topic-defining query:

climate change, global warming, climatic change,

research. We used the core journals in the field to

find relevant start URLs. When pages on some other

topic are fetched only the search terms and the start

URL set need to be changed. So, applying the crawl-

er to a new topic is easy.

To ensure that the crawler fetches mainly schol-

arly documents its crawling scope is limited, so that

it is only allowed to visit the pages on the start URL

sites and their subdomains (for example, re-

search.university.edu is a subdomain of

www.university.edu), as well as sites that are one

link apart from the start domain. If needed, this re-

striction can be relaxed so that the crawling scope is

not limited to these sites.

A focused crawler does not follow all links on a

page but it will assess which links to follow to find

relevant pages. Our crawler assigns the probability

of relevance to an unseen page v using the following

formula that was developed in our previous study.

We call it N-formula:

TOPIC-SPECIFICWEBSEARCHINGBASEDONAREAL-TEXTDICTIONARY

291

Pr(T|v)= (α * rel(u) * (1/log(Nu))) + ((1 – α) *

rel(<u,v>)), α= 0.3

where Pr(T|v) is the probability of relevance of

the unseen page v to the topic T, α is a weighting

parameter (0 < α < 1), rel(u) is the relevance of the

seen page u, calculated by Lemur, Nu the number of

links on page u, and rel(<u,v>) the relevance of the

link between u and the unseen page v. The relevance

of the link is calculated by matching the context of

the link against the topic query. The context is the

anchor text, and the text immediately surrounding

the anchor. The context is defined with the help of

the Document Object Model: all text that is within

five DOM tree nodes of the link node is considered

belonging to the context.

As can be seen, Pr(T|v) is a sum that consists of

two terms: one that depends on the relevance of the

page, and one that depends on the relevance of the

link. The relative importance of the two terms is

determined by the parameter α. Also, the number of

links on page u inversely influences the probability.

If rel(u) is high, we can think that the page “recom-

mends” page v. However, if the page also recom-

mends lots of other pages (i.e., Nu is high), we can

rely less on the recommendation.

The N-formula gave the best results in an ex-

periment where we compared it to two reasonable

baseline methods. In the first baseline, link <u,v>

was assigned the same relevance score as page u. In

the second one, a link was assigned a relevance

score based on its context. Crawling results were

measured as the number of obtained relevant pages

when the same number of documents was

downloaded.

We also conducted an experiment where the

weighting parameter α was set to (A) α=0.3 and (B)

α=0.7. In (A) the crawler fetched about two times

more relevant documents than in (B). Based on this

crawling experiment we apply for the α parameter

the value of α = 0.3.

4 REAL-TEXT DICTIONARY OF

CLIMATE CHANGE

Constructing the real-text dictionary of climate

change involved two main stages: (1) extraction of

keyphrases and the calculation of importance scores

and (2) identification of synonyms. Section 4.1.

describes the first stage and Section 4.2 the second

one. Section 4.2 also describes the synonym types

contained in the dictionary.

4.1 Keyphrase Extraction

Here keyphrase of the topic means a phrase that is

often used in texts dealing with the topic and which

refers to one of its aspects. Keyphrases typically

represent different research areas of the topic (i.e.,

climate change in this study), some are more techni-

cal in nature. The keyphrase extraction method and a

climate change keyphrase list, a predecessor of the

climate change dictionary. are described in (Pirkola,

2011a). Here we describe the main features of the

extraction method, and present the importance score.

The source data for the dictionary, i.e., pages re-

lated to climate change, were crawled from the Web

sites of universities and other research organizations

investigating climate change by the crawler de-

scribed in Section 3. When the crawled data were

processed the first challenge was to determine which

sequences of words are phrases. Here we applied the

phrase identification method by Jaene and Seelbach

(1975). The main point of the technique is that a

sequence of two or more consecutive words consti-

tutes a phrase if it is surrounded by small words

(such as the, on, if) but do not include a small word

(except for of). The phrases were extracted from the

pages assigned a high relevance score by the

crawler.

When constructing RT-dictionary the importance

score is needed to prune out out-of-topic phrases

from the dictionary. The most obvious out-of-topic

phrases receive a low score and are not accepted in

the dictionary. The remaining phrases are regarded

as keyphrases and are included in the dictionary.

This is the first application of the IS. The second

one, which is investigated in this paper, is to use the

IS as an indicator of the effectiveness of different

keyphrases in searching.

The IS is calculated on the basis of the frequen-

cies of the phrases in the corpora of various densities

of relevant text, and in a non-relevant corpus. We

determined the IS using four different corpora. The

relevant corpora are built on the basis of the occur-

rences of the topic title phrase (i.e., climate change)

and a few known keyphrases related to climate

change in the original corpus crawled from the Web.

Assumedly, a phrase which has a high frequency in

the relevant corpora and a low frequency in the non-

relevant corpus deserves a high score. Therefore, the

importance score is calculated as follows:

IS(P

) = ln(F

DC(1)

( P

) * F

DC(2)

) * F

DC(3)

)

/ F

DC(4)

)); (F

> 0)

DC(1)

)... F

DC(4)

) = the frequencies of the phrase

in the four corpora.

WEBIST2012-8thInternationalConferenceonWebInformationSystemsandTechnologies

292

DC(1) = Highly dense corpus

DC(2) = Very dense corpus

DC(3) = Dense corpus

DC(4) = non-relevant corpus

This method allows us to indicate the importance

of the phrase in the Web texts discussing climate

change (or any topic for which RT-dictionary is

constructed), and separate between keyphrases and

out-of-topic phrases based on the fact that the rela-

tive frequencies of the keyphrases decrease as the

density decreases.

In the dictionary, the importance scores range

from 4.7-35.8. The choice of the lower limit of 4.7

has no experimental basis. As mentioned, one pur-

pose of this study is to consider this issue.

Table 1 shows the 20 highest ranked phrases in

the dictionary and their importance scores. Many of

the phrases are well-known in the context of climate

change research. The dictionary also contains short

form phrases, i.e., phrases where one component is

omitted. Authors often use such forms. For example,

after introducing the full phrase climate change

impacts the author may refer to it by the phrase

change impacts. In searching such short forms may

strengthen the query and affect document ranking

positively when used together with their full forms.

However, the short forms are often ambiguous, and

often it does not make sense to use a short form

without at the same time using its full phrase.

4.2 Synonyms

Synonyms were identified using the digram ap-

proximate matching technique. Phrases were first

decomposed into digrams, i.e., substrings of two

adjacent characters in the phrase (for n-gram match-

ing, see Robertson and Willett, 1998). The digrams

of the phrase were matched against the digrams of

the other phrases in the phrase list generated in the

keyphrase extraction phase (Section 4.1). Similarity

between phrases was computed using the Dice for-

mula, and the phrase pairs that had the similarity

value higher than the threshold (SIM=0.75) were

regarded as synonyms. The output of the digram

matching process contained some 10% of wrong

matches which were removed manually. In the ma-

jority of cases the identification of the wrong

matches was a trivial task and it does not require any

specific expertise. Currently, we are working to

develop a more effective synonym identification

method where only a minimal manual intervention is

needed. We are testing several approximate string

matching methods in combination with stemming

and as such, as well as the use of the textual contexts

of the phrases.

Table 1: Highest ranked keyphrases in the climate change

RT-dictionary.

Keyphrases IS

climate change 35.8

climate change research 28.4

global warming 26.7

impacts of climate change 26.3

sea level 26.2

global climate 25.7

greenhouse gas 25.3

sea level rise 25.2

climate change impacts 25.0

sustainable communities 25.0

environmental policy 24.9

environmental change 24.7

global climate change 24.5

sustainable energy 24.5

climate impacts 24.3

integrated assessment 24.2

water resources 24.1

gas emissions 23.9

greenhouse gases 23.9

climate changes 23.8

The dictionary contains the following types of

synonyms:

A. Spelling Variants. Phrases that contain the

same component pairs but are written slightly differ-

ently, e.g. climate change prediction - climate

change predictions. These kinds of variants are

mostly morphological variants.

B. Syntactic Variants. Phrases that contain the

same component pairs, but the order of the compo-

nents is different, e.g. predicted climate change -

climate change predictions. These kinds of variants

may also include type A variation, as in the example

above.

C. Short Form Variants. A phrase that is a

subphrase of a longer phrase, e.g. change impacts -

climate change impacts.

The dictionary contains 5 507 unique phrases. Of

these, 4 769 phrases have at least one synonym.

Table 2 presents three example entries. The diction-

ary is organized alphabetically, and each phrase acts

as a head phrase in its turn.

TOPIC-SPECIFICWEBSEARCHINGBASEDONAREAL-TEXTDICTIONARY

293

Table 2: Three example entries in the climate change RT-

dictionary.

Head phrase IS Synonyms IS

hydrologic cycle 11.3 hydrological cycle 15.1

ice melt 11.5 ice melting 5.7

melting ice 11.8

melting of ice 10.3

sea level rising 5.5 rising sea level 13.0

rising sea levels 18.1

sea level 26.2

sea level rise 25.2

sea level rises 10.8

5 EXPERIMENTS

In the experiments, we selected test phrases, formu-

lated test queries based on the selected phrases, and

run the queries in two search systems. This section

describes these tasks. Section 6 presents the evalua-

tion measures and reports the findings.

We investigated spelling and syntactic variants.

As discussed above, short forms variants are often

ambiguous expressions, and often it does not make

sense to use them alone in queries. For this reason,

they were not considered in these experiments. We

may investigate them in the further research. They

may be useful in document ranking, which is one

possible application area of the RT-dictionary.

When the test phrases were selected from the

dictionary it was ordered in the decreasing order of

the ISs, so that high IS phrases were in the beginning

and low IS phrases at the end of the dictionary. For

both variant types, 25 entries were selected from the

beginning and 25 from the middle of the dictionary.

An entry contains a head phrase and its synonym(s)

(Table 2). In the selection, a candidate entry was

discarded if it did not have synonyms, then the next

one was tried until there were 50 entries both for

spelling and syntactic variants. The synonyms of the

selected head phrases varied from high IS to low IS

phrases. In the spelling variant test set, 38 head

phrases had one synonym, 10 had two, one had

three, and one had four synonyms. In the syntactic

variant set, these numbers were: 36 head phrases had

one synonym, 10 had two synonyms, and four had

three synonyms.

Basically, we formulated 50 test topics both for

spelling and syntactic variants. The test topics were

of the type climate change AND C, where C is the

concept represented by the entry selected for the

experiment. Unlike in traditional information re-

trieval experiments (see http://trec.nist.gov), the

relevance of the documents was not assessed by

human assessors. Instead, we performed (1) high

precision queries, that is, title queries, and high

recall queries, that is, full text queries. In (1) the

query phrase is required to appear in the title of the

document and in (2) the query phrase may appear

anywhere in the document. These two types repre-

sent a broad spectrum of search results regarding the

precision and recall of the results. It should be noted

that we are not interested in the precision and recall

as such, but the effectiveness of the IS to indicate

good query phrases in different situations.

If the phrase occurs in the title of the document,

we can be highly confident that the document deals

with the issue represented by the phrase. The title

queries do not retrieve all documents that are rele-

vant to a particular test topic. However, they retrieve

highly relevant documents which usually are the

most important for the user. The title queries are

important also in that they return a focused set of

relevant documents. Common experience suggests

that Web searchers usually only look at the top

search results. This has also been verified experi-

mentally (Jansen et al., 2000). Thus, a relatively

small set of relevant documents is more important

than high recall. The full text queries return highly

relevant, marginally relevant and irrelevant docu-

ments. They thus yield higher recall but lower preci-

sion than title queries.

We run the queries in our climate change search

system and in the Google search engine. Google is

the biggest search engine on the Web, covering

billions of pages, and it is characterized by the het-

erogeneous information. Google‘s advanced search

mode supports title queries. The climate change

search system has indexed 95 819 (public version 73

194) Web pages related to climate change. The

pages crawled for the system were indexed using the

Apache Lucene programming library

(http://lucene.apache.org/). The system supports

several types of queries based on Lucene’s query

language, e.g. queries targeted at the title of docu-

ments, and it reports the number of retrieved pages.

In the experiments, each test phrase was run as a

query in the climate change search system (in short

CCSS in tables) and in Google. All documents in-

dexed by our search system deal with climate

change and it was not explicitly expressed in que-

ries, whereas Google queries included the phrase

climate change in addition to the test phrases.

WEBIST2012-8thInternationalConferenceonWebInformationSystemsandTechnologies

294

Google reports the number of retrieved pages which,

however, may vary in that the same query returns

different number of pages even during a short period

of time. However, generally the retrieval results are

consistent and meet the expectations.

Below we present some example title queries:

Climate change system - the highest IS phrase

title: "biodiversity loss" [IS=12.6]

Climate change system - other phrase

title: "loss of biodiversity" [IS=11.2]

Google - the highest IS phrase

allintitle: "biodiversity loss" "climate change"

Google - other phrase

allintitle: "loss of biodiversity " "climate change"

6 FINDINGS

In the first experiment, we compared the number of

documents retrieved by the highest IS phrases to that

retrieved by the other phrases. If the IS is a good

indicator of query performance we expect the high-

est IS phrases to retrieve much larger sets of docu-

ments than the other phrases. Tables 3 and 4 report

the total number of documents retrieved by the high-

est IS phrases and the other phrases across the syno-

nym sets for title queries (Table 3) and for full text

queries (Table 4). As can be seen, in both cases the

highest IS phrases perform far better. The statistical

significance was analyzed by the paired t-test. By

conventional criteria, the results are statistically

significant except for Google / syntactic / title. In

Tables 3-4 the significance levels are indicated as

follows: *** p < 0.001; ** p < 0.005; * p < 0.05.

The second experiment considered effective phrases

in general. The test phrases were put in six catego-

ries of equal size, i.e., the same number of phrases,

based on their ISs. For spelling variants, each cate-

gory contained 19 phrases, and for syntactic variants

20 phrases. Tables 5 (title queries) and 6 (full

text

queries) report the average number of re trieved

documents in each category, and the average IS for

the categories. As can be seen, the trends are not

strictly linear but exhibit downward curvatures in a

few cases. However, the trends are still clear: The

higher the IS, the more documents are retrieved.

Linear regression analysis yielded the correlation

coefficients from 0.87 (Google / spelling / title, p <

0.05) to 0.92 (CCSS / spelling / title, p < 0.01).

These indicate a strong relationship between the IS

and the search results.

Table 3: Number of retrieved documents by the highest IS

phrases and other phrases. Title queries.

Variant type

Search system

Highest IS

# docs

Other

# docs

Spelling, CCSS 3545 *** 793

Syntactic. CCSS 1163 *** 295

Spelling, Google 686 854 * 331 556

Syntactic. Google 1 374 087 748 841

Table 4: Number of retrieved documents by the highest IS

phrases and other phrases. Full text queries.

Variant type

Search system

Highest IS

# docs

Other

# docs

Spelling, CCSS 53 839 *** 16 403

Syntactic. CCSS 49 146 *** 14 205

Spelling, Google 408 524 000 *** 133 057 300

Syntactic. Google 223 705 000 ** 94 023 730

Table 5: Average number of retrieved documents in the

six IS categories. Title queries.

Variant type.

Average IS for the

category

CCSS

# documents

Google

# documents

Spelling

IS=24.3 1317,8 12 071 526.3

IS=

20.1 1612,1 11 427 842.1

IS=

15.1 452,5 1 460 737.8

IS=13.4 121,5 1 482 000.0

IS=11.9 128,6 1 457 136.8

IS=7.8 64,5 605 036.8

Syntactic

IS=21.8 1983,4 7 620 500.0

IS=17.6 472,4 3 057 610.5

IS=14.4 347,4 1 434 650.0

IS=12.8 291,3 1 372 726.3

IS=9.9 73,4 1 265 878.9

IS=6.9 38,3 362 090.6

• The higher the IS is, the more documents

are retrieved.

• The IS is effective in many search situa-

tions.

7 DISCUSSION

Web search engines are query-based search systems.

Querying is an effective method when the informa-

tion need can be expressed using a relatively simple

query and the search term is clear, e.g. when the

searcher looks for information on the disease whose

name he or she knows. However, it is common that

the searchers do not know or remember the appro-

priate search terms. Complex information needs are

also difficult to formulate as a query. In situations

such as complex work tasks the information need

itself may be vague and ill defined (Ingwersen and

Järvelin, 2005). Obviously, a terminology tool con-

taining the most important phrases related to a par-

ticular topic would be a helpful tool for searchers

searching for information related to that topic, help-

ing to clarify the information need and find good

query terms. We developed such a tool: a new type

of dictionary that is intended as a search assistance

in topic-specific Web searching. The dictionary has

the following new features compared to standard

dictionaries and thesauri: It contains real-text

phrases and synonym groups, and each phrase is

assigned an importance score.

Real-text Phrases

. The phrases included in the real-

text dictionary of climate change were extracted

from the pages dealing with climate change, and are

thus known to appear in the pages discussing climate

change issues when used as search terms. Hence, the

proposed approach implicitly involves the idea of

reciprocity: the phrases are extracted from relevant

Web pages (i.e., pages related to some aspect of

climate change), and they in turn can be used in

queries to find relevant pages.

Synonymy

. It seems that Web searchers are not

aware of the effects of synonymy on the search re-

sults, because not many search systems provide the

searchers with terminology tools that support query

formulation. The proposed RT-dictionary is such a

tool that groups synonymous phrases together. The

main limitation of our approach is that it does not

find synonyms that are written completely differ-

ently. We intend to address this issue in the ongoing

project. Wei et al. (2009) applied co-click analysis

for synonym discovery for Web queries. The idea is

to identify synonymous query terms from queries

that share similar document click distribution. We

cannot use the co-click analysis because we do not

have sufficient amount of click data, but we can try a

similar kind of idea: analyzing documents that share

similar in- and out-link distribution.

Importance Score

. The importance score was de-

vised to be a measure that indicates how important

the phrase is from the viewpoint of the search re-

sults. The results showed that it works as it was

intended to work. It indicates effectively the best

phrase among synonymous phrases and effective

phrases in general. The IS is effective even in the

Google search engine that has indexed billions of

pages. Thus, the RT-dictionary provides the user

with information on which of the alternative phrases

is likely to yield the best search results. In cases

where two (or more) query terms have high IS, the

results of this study suggest that the user should use

both (all) of them (provided that the search system

supports disjunction (OR-operator) type queries).

RT-dictionary was developed for use as a search

assistance to support query formulation in Web

searching, but it could be used in document indexing

as well to improve the ranking of documents. In

Web search engines, pages are scored with respect to

the input query and ranked on the basis of the as-

signed scores. Ranking schemes typically include a

term frequency (tf) component, where term fre-

quency refers to the number of occurrences of the

WEBIST2012-8thInternationalConferenceonWebInformationSystemsandTechnologies

296

query term in a document. The rationale behind the

use of tf is that the more occurrences the query term

has in a given document the more likely it is that the

document is relevant to the input query. If synonymy

is taken into account by summing up the term fre-

quencies of synonyms in a document, more accurate

relevance scores are achieved in comparison to a

conventional approach where synonymy is not taken

into account. This can be illustrated using a simple

example. Consider a page containing the following

phrases with each having one occurrence on the

page: sea level rise, rising sea level, and rising sea

levels. In the conventional situation where the user

only uses base form query term (i.e., sea level rise)

and the term frequencies of synonyms are not

summed up tf=1, whereas when the term frequencies

are summed up tf=3, which is more realistic be-

cause the concept denoted by the three phrases in-

deed appears three times on the page. Authors (of

Web pages) tend to use alternative phrases and do

not only use base form terms but also different syn-

tactic and morphological forms. This means that

many important documents are ranked lower than

they actually deserve if synonymy is not taken into

account.

8 CONCLUSIONS

We described a method to construct a topic-specific

dictionary of real-text phrases to support query for-

mulation in Web searching, and presented the exist-

ing climate change RT-dictionary. The proposed

method is a general method and can be applied to

any reasonable topic. We argued that there is need

for such search assistances due to the difficulty to

formulate queries in particular for complex informa-

tion needs. In the experimental part of this paper we

showed that the proposed importance score is a good

indicator of search success.

The further development of RT-dictionary was

discussed in the previous sections. Our plan is also

to construct RT-dictionaries for new topics and to

add multilingual features to the RT-dictionary.

ACKNOWLEDGEMENTS

This study was funded by the Academy of Finland

(research projects 130760, 218289).

REFERENCES

Belkin, N. J., Oddy, R. N., Brooks, H. M., 1982. ASK for

information retrieval: Part I. Background and history.

Journal of Documentation, 38 (2), 61-71.

Bergmark, D., Lagoze, C. and Sbityakov, A., 2002. Fo-

cused crawls, tunneling, and digital libraries. Proc. of

the 6th European Conference on Research and Ad-

vanced Technology for Digital Libraries, Rome, Italy,

September 16-18, pp. 91-106.

Chakrabarti, S., van den Berg, M. and Dom, B., 1999.

Focused crawling: a new approach to topic-specific

Web resource discovery. Proc. of the Eighth Interna-

tional World Wide Web Conference, Toronto, Canada,

May 11-14, pp. 1623-1640.

Cronen-Townsend, S., Zhou, Y. and Croft, B., 2002.

Predicting query performance. Proc. of the 28th ACM

SIGIR Conference on Research and Development in

Information Retrieval, Tampere, Finland, August 11-

15, pp. 299-306.

Diligenti, M., Coetzee, F. M., Lawrence, S., Giles, C.L.

and Gori, M., 2000. Focused crawling using context

graphs. Proc. of the 26th International Conference on

Very Large Databases (VLDB), Cairo, Egypt, Septem-

ber 10-14, pp. 527-534.

El-Beltagy, S. and Rafea, A., 2009. KP-Miner: A key-

phrase extraction system for English and Arabic

documents. Information Systems, 34(1), 132-144.

He, B. and Ounis, I., 2006. Query performance prediction.

Information Systems, 31(7), 585-594.

Ingwersen, P. and Järvelin, K., 2005. The Turn: Integra-

tion of Information Seeking and Retrieval in Context.

Heidelberg, Springer.

Jaene, H. and Seelbach, D., 1975. Maschinelle Extraktion

von zusammengesetzten Ausdrücken aus englischen

Fachtexten. Report ZMD-A-29. Beuth Verlag, Berlin.

Jansen, B. J., Spink, A. and Saracevic, T., 2000. Real life,

real users, and real needs: A study and analysis of user

queries on the Web. Information Processing & Man-

agement, 36(2), 207-227.

Lee, H. J., 2008. Mediated information retrieval in Web

searching. Proc. of the American Society for Informa-

tion Science and Technology, 45(1), pages 1-10.

Muresan, G. and Harper, D. J. 2004., Topic modeling for

mediated access to very large document collections.

Journal of the American Society for Information Sci-

ence and Technology, 55 (10), 892-910.

Perez-Iglesias, J. and Araujo. L., 2010. Standard deviation

as a query hardness estimator. The 17th International

Symposium on String Processing and Information Re-

trieval (SPIRE 2010), Los Cabos, Mexico, October

11-13, pp. 207-212.

Pirkola, A., 2011a. Constructing topic-specific search

keyphrase suggestion tools for Web information re-

trieval. Proc. of the 12th International Symposium on

Information Science (ISI 2011), Hildesheim, Germany,

March 9-11, pp. 172-183.

Pirkola, A., 2011b. A Web search system focused on

climate change. Digital Proceedings, Earth Observa-

TOPIC-SPECIFICWEBSEARCHINGBASEDONAREAL-TEXTDICTIONARY

297

tion of Global Changes (EOGC), Munich, Germany,

13-15 April.

Robertson and Willett., 1998. Applications of n-grams in

textual information systems. Journal of

Documentation, 54(1), 48-69.

Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M. and

Laurikkala, J., 2008. Focused Web crawling in the ac-

quisition of comparable corpora. Information Re-

trieval, 11(5), 427-445.

Tang, T., Hawking, D., Craswell, N. and Griffiths, K.,

2005. Focused crawling for both topical relevance and

quality of medical information. Proc. of the 14th ACM

International Conference on Information and Knowl-

edge Management (CIKM '05), Bremen, Germany,

October 31-November 5, pp. 147-154.

Turney, P.D., 2003. Coherent keyphrase extraction via

Web mining. Proc. of the Eighteenth International

Joint Conference on Artificial Intelligence (IJCAI-03),

Acapulco, Mexico, pp. 434-439.

Wei, X., Peng, F., Tseng. H., Lu, Y. and Dumoulin, B.,

2009. Context sensitive synonym discovery for web

search queries. Proc. of the 18th ACM Conference on

Information and Knowledge Management (CIKM '09),

Hong Kong, pp. 1585-1588.

Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and

Nevill-Manning, C.G., 1999. KEA: Practical auto-

matic keyphrase extraction. Proc. of the 4th ACM con-

ference on Digital Libraries, Berkeley, California, pp.

254-255.

WEBIST2012-8thInternationalConferenceonWebInformationSystemsandTechnologies

298