ARAGOG SEMANTIC SEARCH ENGINE

Beyond the Limits of Keyword Search

Madhur Garg, Jaspreet Singh, Jitesh Sachdeva, Harsh Mittal

Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, Delhi, India

Sanjay K. Dhurandher

Department of Information Technology, Netaji Subhas Institute of Technology, University of Delhi, Delhi, India

Keywords: Semantic Web, Search, Semantic Search, Semantic Page Ranking, Ontology Ranking.

Abstract: The world is heading towards a phase of pure automation and artificial intelligence. In this context the

science of exploring the possibility of computers interpreting the meanings of sentences is a topic of great

interest. The search engines are in no way left behind from its impact. The prospects of having a semantic

search engine that could explore the proper context of an input query and produce relevant results is being

constantly looked for. In this backdrop we present our prototype – Aragog, which is even a step ahead than

the conventional idea of a semantic search engine. This not only makes the user free from the hassle of

browsing through hundreds of irrelevant results, but also generates results in an order that would match its

intended context, with a high probability. The engine has been designed and tested in its nascent stage and

the results have been found to be exemplary. Additionally, we have incorporated many other features such

as synonym handling and explicit result display that make it all the more tempting to emerge as the next

generation’s search engine.

1 INTRODUCTION

Since its very inception the notion of a search engine

has been to provide the web users with an interface

that could look for appropriate content on the web.

The AOL, Google and Yahoo! search engines have

all restricted this idea to keyword searching which in

the present scenario seems outdated and incomplete.

The present generation search engines present a

huge amount of search results to the user in response

to the query, most of which are at times highly

irrelevant, and the user has an ordeal in sifting

through these result sets to arrive at some page of his

interest.

The limitation of contemporary search engines

has forced researchers to look for new alternatives.

Semantic engines- which propose to derive

meanings out of sentences seem to fit in place to

alleviate out of this hitch. This can be better

understood in the light of some examples. For

instance, consider a query “Winner of maiden T20

cricket world cup”. Intuitively, the user is interested

in knowing the direct answer of the query which in

this case turns out to be India. However, our dry run

over some of the conventional search engines

yielded us results which cater to the official T20

world cup site, site links for watching T20 world

cup, web pages which are flooded with information

not worth the user requirement. This clearly

highlights the indispensability to look for better

alternatives.

A semantic search engine uses ontologies which

are a set of concepts mapped together and can be

referenced as such to derive semantic associations

among different words and concepts. The resources

on the web, i.e., the web pages, are crawled and

looked for annotations done on them, if any. These

annotations are then used to set the words to that

ontology with which they have been tagged. These

tags are used later on for page searching. The

matching and searching algorithms of Aragog have

been explained in later sections.

The remaining paper is organized as follows. In

Section 2, we discuss about the previous work done

in this area. Next we present our motivation towards

this work that has been taken up in Section 3.

Garg M., Singh J., Sachdeva J., Mittal H. and K. Dhurandher S. (2009).

ARAGOG SEMANTIC SEARCH ENGINE - Beyond the Limits of Keyword Search.

In Proceedings of the International Conference on Knowledge Engineering and Ontology Development, pages 21-27

DOI: 10.5220/0002290600210027

 SciTePress

Section 4 talks about our proposed Semantic Search

Engine Aragog’s architecture. Section 5 explains the

algorithms and the heuristics used in the various

modules. Section 6 discusses the results of our

implementation of Aragog. Finally we conclude our

work and follow that up with the future work that

can be done on this engine to enhance its

capabilities.

2 PREVIOUS WORKS

The architecture for a semantic search engine was

proposed in (Qazi Mudassar Ilyas, et al., 2004).

However, this architecture proposes a two level

interaction with the user whereby the user needs to

separately mention both the search query and the

domain in which the searching shall be performed.

Another important work in this field is (Li Ding,

et al., 2004). It is a working implementation but it

does not incorporate searching according to

semantics. It restricts itself to keyword searching in

the semantic web documents.

(Lee & Tsai, 2001) define the Lee’s Model

which uses a matching algorithm to reflect the

semantic similarity of web page content but it is

unable to do the same completely. Thus, it is not

possible to satisfy user queries’ to an appreciable

extent.

This all reveals that semantic search engines are

still far away from reality and a better solution needs

to be found out.

3 MOTIVATION

Aragog has been conceived and developed with the

following motivations:

1) A Semantic Search Engine should minimize on

the number of user interaction levels for better

usability. For a particular query, the domain in

which the search is to be performed should be

deduced automatically. This will bring the

Semantic Search Engine at par with the existing

keyword based search engine.

2) Synonyms of the keywords should be considered

while deriving the semantics of query. Synonyms

should be handled in the sense that for a given

query, the results of all synonymous queries

should also be displayed.

3) Apart from web resources results, a semantic

search engine should also provide the user with

the exact answer of the query. This would make

it a search engine cum answering agent.

4) The query result display should enhance the user

experience. This should include ranking amongst

the domains and also, ranking within an

individual domain. Along with this, relevant text

from the web documents should also be

displayed to the user.

4 PROPOSED ARCHITECTURE

FOR ARAGOG

As discussed in the previous section, the motivation

for building Aragog has come from various

shortcomings and limitations with other existing

semantic search engines. The proposed architecture

of Aragog has been shown in Figure 1.

This architecture supports various features such

as ontology ranking, synonym handling, semantic

answer finder etc. The various modules are

described below:

Query Preprocessor: This is the module which

interacts directly with the user. The query

preprocessor is responsible for accepting a query

from the user. An acceptance list for this is

maintained in a relational database and contains

entries for tokens for which corresponding concepts

exist in the ontology collections. The user’s query

tokens are verified with the acceptance list and only

the tokens which are found in the acceptance list are

accepted. Remaining are rejected by the

preprocessor. This helps in rejecting helping verbs

such as ‘is’, ‘am’, ‘are’ etc. The acceptance list is

created/maintained/updated by the Ontology

Crawler Module discussed later.

This module also takes care of the scenario when a

user query contains similar meaning tokens. For

example if the query posed by a user is “maximum

highest score of Sachin Tendulkar in Test”. Here, the

concepts ‘maximum’ and ‘highest’ both correspond

to the same meaning. Hence, redundancy is there in

the tokens. The query preprocessor also removes the

similar meaning tokens to avoid complexity and

inefficiency that may occur at a later stage. This

module also provides features such as support for

double quotes in queries.

Ontology Ranker: This module is a major

improvement over the previously proposed versions.

An ontology ranker understands the user’s query’s

context and finds the ontologies which contain the

desired concepts. Apart from finding the relevant

ontologies, this module also ranks the ontologies for

KEOD 2009 - International Conference on Knowledge Engineering and Ontology Development

Figure 1: Architecture of Semantic Search Engine.

a given query’s context.

As explained before, the previous versions

expect the user to specify the ontology in which the

concepts are to be searched. This module removes

this second level interaction with user and hence,

increases the responsiveness.

For example, if the user enters the query as

“Tiger Woods”. This query has two contexts. One

refers to the golf ontology where Tiger Woods is a

concept referring to a golf player. The second

context refers to forest ontology where Tiger and

Woods are two separate concepts.

The Ontology Ranker module uses various

ontology ranking algorithms and heuristics to rank

the ontologies. These have been discussed in later

sections.

Inference Engine/Reasoner: The Reasoner’s task

is to infer new information from the given

information to help understand the concepts in a

better way. This is done by backward chaining to

improve on the efficiency front.

For example, if in an ontology for the food

Domain, we have a class hierarchy as Food --> Non-

Vegetarian Food --> Sea Food. According to the

transitive property of reasoning, it can be inferred

that Sea Food is a type of Food.

Semantic Web Crawler: The Semantic Web

Crawler crawls the web and searches for annotated

pages on the web. Annotated pages are the web

pages having concepts annotated on them. These

annotations are done by the web developer while

developing the pages using various tools available.

The web crawler then retrieves such pages and

passes these pages to the Annotation Metadata

Extractor Module.

Ontology Crawler: This component is

responsible for crawling web to find out Ontologies

to build Ontology Collection. Apart from finding

new Ontologies, this module performs the task of

updating existing ontologies on finding new

concepts.

Annotation Metadata Extractor Module: This

module retrieves annotated web pages from the

ARAGOG SEMANTIC SEARCH ENGINE - Beyond the Limits of Keyword Search

semantic web crawler. It extracts the annotated

metadata and concepts from the web page and

updates the Metadata Store with the concepts

extracted.

Annotated Web Page Metadata Store: This store

is built by the Web Crawler Module and Annotation

Metadata Extractor Module. This store contains the

metadata in a relational database and consists of an

index for all the annotated concepts in the webpage

along with the URLs. The metadata also contains the

information to be used by the Page Ranker Module.

Web Page Searcher: This module is responsible

for finding the web resources suitable for the

preprocessed query in a particular ontology context.

The searcher refers to the ontology and finds out

the relevant concepts to be searched in the metadata

store using the various Searcher Algorithms that

have been discussed later. These relevant concepts

are then searched in the Annotated Web Page

Metadata Store and the corresponding URLs are

retrieved.

Page Ranker: This module receives the list of

URLs for a given query and ranks these URLs using

various page rank algorithms discussed later. The

point to be noted here is that the Page Ranker ranks

only those URLs which correspond to a single

ontology domain. Thus, the sorting done here is

Intra-Ontology Sorting whereas the Ontology

Ranker Module does Inter-Ontology Sorting.

Answer Finder: In certain cases, apart from the

query result URLs, it is also helpful to get an answer

of the query posed. For example if a user enters a

query ‘Director of Movie Black’, then along with

the related web pages, it is really appreciated if the

exact answer i.e. Sanjay L. Bhansali is given to the

user. Answer Finder Module aims to provide this

functionality.

Semantic Document Loader: This module is a

normal document loader but with extra capability of

loading relevant text of a URL depending on the

query posed by the user. A local cache copy of each

web resource is maintained in the Annotated Web

page metadata store that is used by the semantic

document loader. For example if a user places the

query “Movies of Shahrukh Khan”, the conventional

search engine’s document loader will display those

section of the retrieved pages where either the

keyword ‘Shahrukh Khan’ or ‘movies’ appears, but

the Semantic Document Loader will display those

sections in the result which contain the name of the

movies of Shahrukh Khan.

In the next section, the various algorithms and

heuristics adopted for Aragog shall be discussed.

5 PROPOSED ALGORITHMS

AND TECHNIQUES FOR

VARIOUS MODULES

5.1 Ontology Ranker

As stated in previous section, this module ranks the

ontologies according to the query based on:

1) The Presence of Keywords in an Ontology:

Each word present in any of the ontologies has a

set of numbers associated with it, where each

number corresponds to an ontology. When the user

enters a query we compute the intersection of the

sets corresponding to each keyword of the query so

as to determine the ontology containing all the

words of the pre-processed query, which would

result in narrowing down the list of ontologies which

are required to be searched.

2) The Position of Keywords in an Ontology:

i) If the query consists of a single word, we

calculate the depth of the keyword in various

ontologies and the one having the keyword at

the minimum depth is given the highest

priority.

ii) If the query consists of multiple words, we

calculate the Lowest Common Ancestor of

the keywords (nodes represented by the

keywords). The lowest common ancestor thus

found must also be one of the keywords

entered. Then the ontology having those

keywords separated by minimum distance is

ranked first.

5.2 Synonym Handler

The synonyms for a word (class name, instance

name and property name) are inserted in the same

node itself with the help of aliasing. This greatly

reduces the time that would have been required in

referring to a thesaurus for each word entered in the

query.

5.3 Semantic Answer Finder

Here the ontology graph is traversed from the least

common ancestor of the various nodes to the

required instance and the property value of the

required property is returned.

5.4 Page Ranker

1) The ontology (graph) is traversed from the

property value of the required property (of the

KEOD 2009 - International Conference on Knowledge Engineering and Ontology Development

required instance) followed by the property

name to the least common ancestor of the

keywords. A separate set S

is constructed for

the i

node of the path, where i=0 corresponds

to the property value node.

2) For each set S

, a set of links L

is searched, such

that any of the words of S

, appear in the

annotation of those web pages.

3) If there were n nodes in the path, then two sets

of sets are computed. Set V contains n sets,

where the set V

is computed from the

intersection of the all the sets L

, where i ε [0,n]

and j ε [0,i], and set R contains n-1 sets, where

is computed from the intersection of sets L

where k ε [1,n] and m ε [1,k].

4) Since, the set V also incorporates the property

values, so the links in the sets of set V will be

preferred over the links in the sets of set R.

5) Each set V

contains the set of links which

contains the answer of the query i.e. the

property value along with words from the 1

node to the i

node. Thus, V

will give the set of

links which are best suited for the query as it

would contain the property value and all the

keywords of the query (or their synonyms)

which were present in the ontology.

6) Therefore, priority will be given to the links of

the set V

over the links of the set V

, where i>j.

7) Similarly, the links in the set R

will be preferred

over the links in the set R

, where i>j.

6 IMPLEMENTATION OF

ARAGOG

A working implementation of Aragog has been

developed. In this work we have covered three

domains: Bollywood, Cricket and Food. The

Ontologies were created for these domains. As

OWL-DL is rapidly emerging as a standard to build

Ontologies, we have used it to follow the worldwide

standards. Protégé tool was used to design

Ontologies.

For handling Ontologies and performing

operations on them, we used Jena Ontology API .

Jena provides us with the Reasoner based on the DL

transitive rules. This Reasoner was used in our

Inference Engine module.

We chose C#.NET as our development language

and ASP.NET was chosen to build the web interface

of the Aragog. As Jena API was available in Java,

IKVM was used to convert the java source code into

a .NET DLL. This development work has been

carried on a Intel Dual Core 1.83GHz system having

3GB of DDR2 RAM and 320GB of SATA Hard

Disk.

The Aragog search results were compared with

Google for a set of queries. The results as presented

below clearly highlight how Aragog outperforms the

conventional keyword search engine.

Case 1: Imprecise Queries

Query 1: “Maiden T20 World Cup winner”

Ideal results: All web resources about India

preferably in cricket domain.

Aragog Results:

Domain: Cricket

Answer: India