A MULTILEVEL UNL CONCEPT BASED SEARCHING AND

RANKING

E. Umamaheswari, T. V. Geetha, Ranjani Parthasarathi and Madhan Karky

College of Engineering, Anna University, Chennai, India

eywords:

CoReS, UNL, Concept relation search and ranking, Query Expansion, Query Translation.

Abstract:

The recent advances in search engines have resulted in a huge explosion of available web documents. Under-

standing the content on the web and providing meaningful search results to the user have become essential

for any search engine. This paper proposes CoReS, a multilevel Concept based Searching and Ranking Algo-

rithm which retrieves and ranks the documents based on the concepts and relationships between the concepts.

The search and rank methodology is based on Universal Networking Language (UNL) representation of the

documents. The UNL Index based query expansion technique is used to provide more meaningful results to

the user. The algorithm has been evaluated on a corpus of tourism documents, and its performance compared

with keyword based search. The mean average precision of the concept based search is found to be 0.75 while

the keyword based search has a MAP score of 0.45.

1 INTRODUCTION

The content on the web is growing rapidly every frac-

tion of a second. Search engines such as Google, Ya-

hoo and MSN have become the most heavily-used on-

line services, with millions of searches performed ev-

ery day. All the above search engines basically use

keyword based search strategy. The ranking algo-

rithms such as PageRank algorithm(Brin and Page,

1998) and HITS Algorithm(Brin and Page, 1998)

score the documents according to the incoming and

outgoing links of the documents. However due to

the huge number of documents available on the web,

the number of results produced by keyword based

search engines is too many. The ultimate challenge

for search engines is to provide effective systems that

retrieve the most relevant information from the web

that exactly caters to the users information need.

Concept based search attempts to improve search

effectiveness by incorporating conceptual informa-

tion that convey meaning rather than using the pres-

ence or absence of keywords as the basis for the re-

trieval process. Concept based search can be classi-

ﬁed as those that use a background knowledge source

to provide conceptual information and those that use

semantically analyzed components of the document.

Concept based search can also be classiﬁed based on

how semantics is used to represent the documents.

Documents can be represented by considering con-

cepts associated with the frequently occurring key-

words or by converting important components of the

document into a semantic structure. In addition, con-

cept based search can also be classiﬁed based on

where the semantics is introduced in the components

of the search engine. Semantics can be introduced

in query expansion, building the index, searching and

also in ranking the search results.

This paper deals with a concept based search

which uses a semantic representation of documents,

and incorporates semantics in all the components

of the search engine. Universal Networking Lan-

guage(M and H, 1998) (UNL), an inter-lingual lan-

guage independent semantic representation is used for

document representation. UNL based concepts and

relations are used to build the index structure; and this

UNL based index is used for query expansion, search

and ranking.

The focus of this paper is on the algorithm,

CoReS, used for UNL based Concept relation search

and ranking. This algorithm aims at improving the

search and ranking by performing matching at three

levels, namely

1. Partial or Complete match between the index and

expanded query;

2. Concept Association level which distinguishes

between actual query terms,query concepts and

expanded concept; and

282

Umamaheswari E., V. Geetha T., Parthasarathi R. and Karky M..

A MULTILEVEL UNL CONCEPT BASED SEARCHING AND RANKING.

DOI: 10.5220/0003334402820289

In Proceedings of the 7th International Conference on Web Information Systems and Technologies (WEBIST-2011), pages 282-289

ISBN: 978-989-8425-51-5

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

3. Document based features such as frequencyof oc-

currence and position of terms and concepts in the

document.

This paper is organized as follows. Section 2 of

this paper gives an overview of related work in con-

cept and semantic based search. Section 3 describes

overall architecture of the concept based search sys-

tem and describes the UNL structure. It also outlines

the various features available in the index that are

used for matching and ranking; as well as the features

included in the UNL based query expansionand trans-

lation. Section 4 describes CoReS, a Multilevel UNL

Concept relation based Searching and ranking. Sec-

tion 5 discusses the results and the evaluation of the

UNL Searching and Ranking, and Section 6 presents

the conclusion and future work.

2 RELATED WORK

This section explores literature related to concept

based searching and ranking. Only a few meaning

based search engines have been developed.

2.1 Semantic Search Engines

While some meaning based search engines use sen-

tence level semantics, others use ontology as the

background knowledge source for providing seman-

tics. Hakia(HAKIA, 2009) is a semantic search en-

gine that uses knowledge of Ontology and Fuzzy

logic for semantic ranking. In order to retrieve con-

ceptual results, it uses QDEX (Query Detection and

Extraction) Indexing Architecture which enables se-

mantic analysis of web pages and provides meaning

based search results. In Hakia besides the keywords,

phrases are used for meaning based searches.

The limitation of Hakia is that it accepts queries

as questions in a speciﬁc format. Also, the QDEX al-

gorithm extracts all possible queries that can be asked

on the content of web pages of various lengths and

forms. This is an ofﬂine process before any user query

is entered. The major difﬁculty in QDEX system is

the reduction of the huge number of generated query

sequences into a few dozens that make sense. Hakia

allows only these predeﬁned query sequences gener-

ated from the content to be used as queries.

On the other hand SenseBot(Sensebot, 2009) is a

semantic search engine that runs over search engines

like google and yahoo to generate multi document

summary based on text mining and limited semantics.

Though all the above search engines provide

meaning based results, some search engines require

sophisticated query analysis techniques to provide

meaningful search results. Other search engines con-

sider concepts rather than relations between concepts

as the basis of match. However, in the search engine

described in this paper, the context of the query is re-

trieved by traversing the already created UNL based

indexer. The frequently occurring UNL relations ob-

tained from the UNL index, in effect provide informa-

tion about the possible connections between concepts

in the speciﬁc domain under consideration. These

connections provide the context of the query concept,

and the query expansion based on this context yields

meaningful search results.

2.2 Ontology based Semantic Search

Engines

Concept based search can also be based on the

use of knowledge structures. One such search en-

gine is Engineering or Environmental Knowledge

Ontology-based Semantic Search(EKOSS)(Kraines

et al., 2006). It is an ontology based semantic search

engine which uses a fully functional ontology for rep-

resenting the knowledge base. It provides a collabora-

tive knowledge sharing environmentand helps knowl-

edge experts to share their knowledge such as re-

search papers, database, computer simulated model

and even curriculum vitae. The EKOSS system is

used to construct computer-interpretable semantically

rich statements of the knowledge resource. When a

user request is posted, this system converts the user

request into a computer readable knowledge descrip-

tion based on description logic and associated rules.

Ontology-based information retrieval(Gao et al.,

2005) intended for e-Government has been developed

for securing the legal documents of the government.

The disadvantage of using ontology based search en-

gines is that they are susceptible to changes in the in-

formation resources. This will affect the conceptual-

ization of the domain representation. More over, the

effort required to build an ontology is huge. This task

is domain dependent and the use of common vocab-

ulary ontology for different domains remains a chal-

lenging task.

2.3 UNL based Search Engines

A meaning based multilingual search engine that uses

UNL (Universal Networking Language) is AgroEx-

plorer(Surve et al., 2004). This search engine is sim-

ilar to the search engine described in this work, since

AgroExplorer also uses Universal Networking Lan-

guage (UNL) expressions for representing sentences

as graphs that capture the meaning of the sentences.

The System has been developed for agriculture do-

A MULTILEVEL UNL CONCEPT BASED SEARCHING AND RANKING

283

main and also provides multilingual feature. It uses a

simple search and rank process based on the degree of

match of the query UNL and the frequency of occur-

rence of the Concepts with other concepts in the UNL

expression.

The algorithm for searching and ranking described

in this paper, is a part of UNL search system that

differs from the existing AgroExplorer(Surve et al.,

2004) in that it incorporates semantics in every com-

ponent of the search engine. Also, it uses a so-

phisticated three-level search and rank process and a

context-based query expansion to enhance the results

obtained for a search.

3 BACKGROUND

This section gives a brief introduction to UNL (Uni-

versal Networking Language) and describes the UNL

index structure and the UNL based Query expansion

for conceptual searching and ranking.

3.1 The Universal Networking

Language

The Universal Networking Language has been in-

troduced as a digital meta-language for describing,

summarizing, reﬁning, storing and disseminating in-

formation in a machine-independent and human-

language-neutral form (UNDL, 2009).

Words are expressed as concepts called as Uni-

versal Words or UWs. The UWs can be linked to-

gether with a relation. Relations specify the role of

words in the sentences. There are a standard set of

46 UNL relations. The subjective meaning intended

by the speaker can be expressed through attributes.

Normally, natural language sentences are converted

to UNL graphs or expressions using linguistic analy-

sis. In the UNL expression, nodes represent concepts,

and arcs represent relations between concepts.

The Knowledge Base (UNLKB) is provided to de-

ﬁne the semantics of UWs. The UNLKB deﬁnes hi-

erarchical relations and inference based relations be-

tween concepts.

3.2 UNL Index for Conceptual Search

In the UNL based search system discussed here, UNL

graphs that represent fragments of sentences in a doc-

ument are used to build the conceptual index. The

UNL enconverter of the system uses a rule based ap-

proach to convert the sentence constituents to UNL

graphs where concepts are represented as nodes and

relations as edges. The use of this approach allows

terms to be represented as concepts, extracts out a

standard set of semantic relations between concepts

in a sentence, and at the same time, associates a hi-

erarchy for the concepts linked through the UNL se-

mantic relations. This essentially means that seman-

tically analyzed information from the sentences of

the documents is used for building the index. In ad-

dition, the constraints associated with the concepts

available in the UNL KB also incorporate information

from a backgroundknowledge resource into the index

structure.For example the UW word Chennai of the

tamil sentence will be translated into Chennai(icl >

place).Here Chennai denotes the head word and the

icl > place denotes the contraints associated with the

concept. Figure 1 shows the UNL enconversion pro-

cess.

Figure 1: UNL Enconversion of a Tamil Sentence.

In Figure1 the concept build(icl > action) is con-

nected to Rajarajachozhan(icl > person) and also

with ThanjaiTemple(iof > temple) using agt and

obj respectively.

The set of UNL graphs obtained from the encon-

version component of the search system are repre-

sented as a multi-list structure. This multi-list struc-

ture is converted into three separate indices CRC

(Concept-Relation-Concept), CR (Concept-Relation)

and C (Concept) indices in order to aid searching

and ranking. In addition to building the UNL graph

represented as multi-list structure, the UNL encon-

verter also provides additional information to aid the

retrieval process.

The CRC Indices for the UNL tamil sentences are

1. build(icl > action)-agt-Rajarajachozhan(icl >

person)

2. build(icl > action) − obj −

ThanjaiTemple(iof > temple)

The CR Indices for the UNL tamil sentences are

1. build(icl > action)-agt

2. build(icl > action)-obj

The C Indices for the UNL tamil sentences are

1. build(icl > action)

2. Ra jarajachozhan(icl > person)

WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

284

3. ThanjaiTemple(iof > temple)

Sentence based information includes sentence

identiﬁer, Part Of Speech tags, Entity tags, Multiword

tags, the actual terms or words associated with the

UNL concepts and a bit pattern vector that indicates

sentence-wise position of the concepts in the docu-

ment. Document based information includes docu-

ment identiﬁer, term frequency, concept frequency,

and the position of the concepts in the document.

These features are used in weight determination dur-

ing searching and ranking of documents. Features

such as frequency of concepts present in the docu-

ment in addition to term frequency, allow ranking to

be both term and concept based which becomes im-

portant when term frequency is not signiﬁcant. The

bit pattern vector indicating distance between con-

cepts helps to identify relations that are not necessar-

ily proximity dependent.The UNL index with all the

above sentence level and document level informations

are stored in the Binary Search Tree(BST).

3.3 Context based Query Expansion

An important contribution of this paper is the use of

semantics in the query expansion component of the

search engine. In this work, context of a query con-

cept is deﬁned as the association of this concept with

other concepts in a CRC relation, across documents

in the domain of interest. By analyzing the index, the

concept associated with a query is matched with the

CRCs of the index and the most common CRCs as-

sociated with the query concept are extracted. The

expanded concepts obtained, are ranked based on fre-

quency of CRC and on its being an entity. Query

expansion is an on-line activity and the index anal-

ysis results in efﬁcient query expansion. The most

frequently occurring CRC in the index indicates the

frequent association of concepts in the domain across

documents and hence gives the domain context of the

query concept. This expansion of the query concepts

to CRC allowscontext dictated query sub graphs to be

constructed for the query. The expanded query graph

is now associated with actual query terms, query con-

cepts and expanded concepts associated with the con-

text of the query concept. This in turn means that

differentiation between these is required during both

searching and ranking.

The index based query expansion inﬂuences the

searching and ranking of documents in many ways.

The association of expanded concepts with the query,

helps to build CRC query graphs that can be matched

with the UNL index. Without this expansion, sin-

gle word queries would have resulted in isolated con-

cept (C) only match while with the expansion we are

matching with a context dictated CRC. As already ex-

plained, the association of expanded concepts allows

domain oriented, corpus based context of the query

word to play a role in semantic matching and in addi-

tion helps to bring in documents which have concepts

in the context of the query, which would have been

missed by other search mechanisms.

4 CONCEPTUAL SEARCHING

AND RANKING

The basic searching procedure is based on complete

CRC Match or partial CR or C matches between

query sub graphs and the corresponding index as in

AgroExplorer(Surve et al., 2004). However, in this

paper, the design of the ranking procedure depends

on whether the match of the index is with the ac-

tual query terms, actual query concepts or expanded

concepts. In addition, all the sentence and document

based features associated with the conceptual indices

also affect the ranking procedure.

The overall algorithm for searching and ranking

actually performs three level ranking. The ﬁrst level

ranking is obtained based on whether there is com-

plete match (CRC match), partial match of Concept

Relation (CR) or match of only concepts (C Only).

This level of ranking is provided by the Degree of

Match Categorization tag Ta. The set of documents

obtained in level 1 category is further prioritized using

Concept Association Categorization Tag Tb. Con-

cept Association categorization depends on whether

the index match is between query terms, query con-

cepts or expanded concepts. Once the documents

have been ranked by Ta and Tb, the documents at the

same Ta.Tb level are ranked based on weights cal-

culated based on the index based features associated

with the concept.

A Tag represented as Ta.Tb helps in determining

the two level list of prioritized documents. Tag Ta

computed in level 1 indicates degree of match while

Tb computed in level 2 indicates the type of concept

association. For determining the tags the following

terminology is deﬁned.

A given query with n terms may be represented

as a set Q,Let Q ={q

, ...., q

} ,where q

represents

a query term. Each element i of the power-set of Q

is expanded and enconverted to a set EQ

of UNL

graphs g

,where m represents the expanded concepts

from the UNL index and m > 0.Here the power set of

Q represents that each query term is associated with

not only a single expanded terms and it’s concepts,it

also represents more than one expaned terms and con-

cepts.

A MULTILEVEL UNL CONCEPT BASED SEARCHING AND RANKING

285

That is,EQ

= {g

, g

, ....g

},where each g

a tuple of {Cx

, R

,Cy

} representing a relation R

between the two associated concepts X

and Y

. The

presence of all three elements of the tuple corresponds

to aCRC graph, the presence of aC and R corresponds

to aCR graph, and the presence of a C alone indicates

a C graph.Now each g

is matched with the CRC,CR

and C indices represented in the index graphs in the

indices I

CRC

, I

and I

to obtain a set of documents

CRC

, D

, and D

. The matching set of doc-

uments D

for the expanded query graph g

is the

union of these three sets. i.e.

= D

CRC

U D

Now by using these sets, the degree of match is

determined by the tag Ta.

4.1 Tag Determination for Degree of

Match

The tag determination for the degree of match de-

pends on the extent of match between the CRC rep-

resenting the query sub graph and the conceptual in-

dex. It essentially differentiates between CRC, CR

and C matches. Ta helps in differentiating between

the different degrees of match. The UNL sub graph is

a directional graph and hence partial match also con-

siders whether the concept in CR (Concept Relation),

matches with the source concept, C

, or destination

concept, C

, of the UNL subgraph.

Ta =











1 if q

, R

} ∈ I

CRC

2 if q

, R

} ∈ I

and q

, R

} ∈ I

3 if q

, R

} ∈ I

and q

} ∈ I

4 if q

, R

} ∈ I

and q

} ∈ I

5 if q

} ∈ I

and q

} ∈ I

6 if q

} ∈ I

7 if q

} ∈ I











As shown above, the tag Ta has seven values

bringing out the degree of match. Let D

.Ta be the

set of matched documents associated with each Ta.

4.2 Tag Determination for Concept

Association

The next level of tag determination is based on

whether the Ci value in CRC,CR and C matches cor-

responds to the actual query term ,the concept of the

query term or the concept obtained after query expan-

sion. Accordingly the concept association is said to

be of three types.

1. Query Term TWi association - This means that the

concept Ci is query term itself

2. Concept Word CWi association - This means that

the conceptCi matches the corresponding concept

of the query,but the actual query term is different.

3. Expanded Word EWi association - This means

that the concept Ci is associated with a concept

that is not actually in the query but has been ob-

tained as a result of query expansion.

Based on the above 3 values the eight different

tags are obtained as given below

Tb =











1 if C

= C

= TW

2 if C

= TWandC

= CW

3 if C

= CWandC

= TW

4 if C

= C

= CWs

5 if C

= TWandC

= EW

6 if C

= EWandC

= TW

7 if C

= CWandC

= EW

8 if C

= EWandC

= CW











It can be seen that the Tag Tb,differentiating be-

tween the three types explained above, also differen-

tiates between whether the concept is the source node

Cx or destination node Cy of the directed UNL sub-

graph. The eight values of Tb bring out these differ-

ences.With in each DTa the documents are ordered as

per Tb.

Each of the set of D

.Ta documents are now

tagged with the Tb tag. In other words, the searched

documents are prioritized and ranked according to

Ta.Tb value. Let D

.TaTb represent the set of doc-

uments with a tag Ta.Tb corresponding to the encon-

verted query graph g

. The next section describes

how index based features are used to further rank each

set of D

.TaTb documents.

4.3 Use of Index based Features

Index based features are used to calculate a weight

factor to prioritize the documents within each set

.TaTb. The features used are position, frequency

count, Named Entity(NE) tag and Multi-word(MW)

tag of the term/concept. The feature weight is calcu-

lated as follows.

Index based feature Weight

Weight

Count

+NE

Weight

+MW

Weight

∑

j=1..n



Weight

Count

+NE

Weight

+MW

Weight



WEBIST 2011 - 7th International Conference on Web Information Systems and Technologies

286

Here i represents the single document weight and j

represents the weight across the documents.

Weight

represents the position weight of the concept.

Position weight is computed based on where in the

document the concept or term occurs

Count

represents the frequency of occurrence of con-

cepts in the document.

Weight

represents the Named Entity weight associ-

ated with the concepts in Q.

Weight

represents the Multi Word weight associ-

ated with the concepts in Q.

4.4 Computing Overall Ranking

The ﬁrst step in computing the overall ranking is to

merge all the D

.TaTb documents corresponding to

each gij of the Query Q. Those documents which oc-

cur in the maximum number of sets are ranked higher.

The merged set of documents are then ranked based

on TaTb value, and each set DTa.Tb is in turn ranked

using normalized index based weight factor. δ is the

normalized weight factor to differentiate between the

complete CRC match,partial CR or C match. Here

δ =











0.5 ifTa = 1

0.3 ifTa = 2, 3, 4

0.2 ifTa = 5, 6, 7











Thus, in this algorithm, the ﬁrst level of ranking

is obtained by Ta, in turn the set of documents corre-

sponding to each Ta at the second level the documents

are ranked according to Tb and then within the set, at

level three these documents are ranked according to a

normalized index based weight δ×W

Ta.Tb

Thus the conceptual searching and ranking algorithm

considers degree of match,context of query, concept

association and index based term,concept and posi-

tion factors corresponding to sentences as well as doc-

uments for effective ranking.

5 PERFORMANCE EVALUATION

This system has been implemented for the Tourism

domain and has been tested with a corpus of 33000

documents. For comparison purposes, a key word

based search built by the CLIA (Cross Lingual In-

formation Access) consortium

has been used with

A project funded by Ministry of Information Technol-

ogy, New Delhi

the same corpus. We have used MAP (Mean Aver-

age Precision) score for measuring the relevance of

the documents retrieved, and Discounted cumulative

gain for evaluating the ranking. MAP (Mean Aver-

age Precision) (Thom J et.al.,2007) is an arithmetic

mean of average precisions over a set of queries used

for evaluation. It calculates the average precision of

every single query and then takes the mean value of

all queries. The relevance judgment of each retrieved

page is done based on the human judgment of the doc-

uments.

• If the resulting search documents not only contain

the keyword but also other related information rel-

evant to the user query, then the score will be 1.

• If the resulting search documents contain the user

entered query terms but not describing the rele-

vant information of the user entered query, then

the score will be 0.

• A non responding URL is also given a score of 0.

A query set of 139 queries has been used for the

evaluation. The MAP Score for 139 queries is given

in the Table1 for the UNL systems. It can be seen

that the MAP score of the UNL system is found to

be around 0.72 even at the top 20 documents.The

Figure.1 shows the MAP(Mean Average Precision)

score at various top level results of UNL search sys-

tem. The comparison of UNL search system with the

keyword based search engine is given in Table.2 and

the comparison chart is shown in Figure.2.

Table 1: MAP Score at various top level results for 139

queries of CLIA Advanced UNL Search System.

Relevance Judgement for MAP Score