3. ThanjaiTemple(iof > temple)
Sentence based information includes sentence
identifier, Part Of Speech tags, Entity tags, Multiword
tags, the actual terms or words associated with the
UNL concepts and a bit pattern vector that indicates
sentence-wise position of the concepts in the docu-
ment. Document based information includes docu-
ment identifier, term frequency, concept frequency,
and the position of the concepts in the document.
These features are used in weight determination during
searching and ranking of documents. Features such as
the frequency of concepts in the document, in addition
to term frequency, allow ranking to be both term and
concept based, which becomes important when term
frequency is not significant. The
bit pattern vector indicating distance between concepts
helps to identify relations that are not necessarily
proximity dependent. The UNL index, with all of the
above sentence-level and document-level information,
is stored in a Binary Search Tree (BST).
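The index contents described above can be pictured with a minimal sketch. It assumes that the bit pattern vector sets bit i when the concept occurs in sentence i, and that distance between two concepts is the smallest gap between their sentence positions; the paper does not spell out either detail. The names `ConceptPosting`, `sentence_positions` and `min_sentence_distance` are hypothetical, and the BST that holds the entries is not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptPosting:
    """Hypothetical per-document record for one UNL concept."""
    doc_id: str
    term_frequency: int = 0      # occurrences of the literal term
    concept_frequency: int = 0   # occurrences of the concept, whatever the term
    sentence_bits: int = 0       # assumption: bit i set if concept is in sentence i
    # sentence_id -> (term, POS tag, entity tag, multiword tag)
    tags: dict = field(default_factory=dict)

    def add(self, sentence_id, term, pos, entity=None, multiword=None,
            literal=True):
        """Record one occurrence of the concept in the given sentence."""
        self.concept_frequency += 1
        if literal:
            self.term_frequency += 1
        self.sentence_bits |= 1 << sentence_id
        self.tags[sentence_id] = (term, pos, entity, multiword)


def sentence_positions(bits):
    """Return the sentence numbers whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def min_sentence_distance(a, b):
    """Smallest sentence gap between two concepts in the same document."""
    pa = sentence_positions(a.sentence_bits)
    pb = sentence_positions(b.sentence_bits)
    if not pa or not pb:
        return None
    return min(abs(i - j) for i in pa for j in pb)
```

With two postings for the same document, `min_sentence_distance` gives a cheap signal that two concepts are related even when they are not adjacent in the text.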
3.3 Context based Query Expansion
An important contribution of this paper is the use of
semantics in the query expansion component of the
search engine. In this work, context of a query con-
cept is defined as the association of this concept with
other concepts in a CRC relation, across documents
in the domain of interest. By analyzing the index, the
concept associated with a query is matched with the
CRCs of the index, and the most common CRCs associated
with the query concept are extracted. The expanded
concepts obtained are ranked based on the frequency of
the CRC and on whether the concept is an entity. Query
expansion is an on-line activity and the index anal-
ysis results in efficient query expansion. The most
frequently occurring CRC in the index indicates the
frequent association of concepts in the domain across
documents and hence gives the domain context of the
query concept. This expansion of the query concepts
to CRC allows context-dictated query sub graphs to be
constructed for the query. The expanded query graph
is now associated with actual query terms, query con-
cepts and expanded concepts associated with the con-
text of the query concept. This in turn means that
differentiation between these is required during both
searching and ranking.
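The expansion step can be sketched as follows, under stated assumptions: the index is reduced to a mapping from document identifiers to (concept, relation, concept) triples, and ties in CRC frequency are broken in favour of CRCs whose associated concept is an entity. The function name `expand_concept`, the `top_k` cut-off and the toy data in the usage note are illustrative and not taken from the paper.

```python
from collections import Counter


def expand_concept(query_concept, crc_index, entities, top_k=3):
    """Return the most frequent CRCs that involve the query concept.

    crc_index: dict mapping doc_id -> list of (c1, relation, c2) triples
    entities:  set of concepts tagged as entities in the index
    """
    counts = Counter()
    for crcs in crc_index.values():
        for c1, rel, c2 in crcs:
            if query_concept in (c1, c2):
                counts[(c1, rel, c2)] += 1

    def sort_key(item):
        (c1, rel, c2), freq = item
        other = c2 if c1 == query_concept else c1
        # Higher frequency first; on ties, entity-bearing CRCs first
        # (assumed tie-break); the triple itself keeps the order stable.
        return (-freq, other not in entities, (c1, rel, c2))

    ranked = sorted(counts.items(), key=sort_key)
    return [crc for crc, _ in ranked[:top_k]]
```

For example, with a toy index in which `("temple", "plc", "Thanjavur")` and `("temple", "mod", "ancient")` each occur in two documents and `Thanjavur` is an entity, the first CRC is returned ahead of the second.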
The index based query expansion influences the
searching and ranking of documents in many ways.
The association of expanded concepts with the query
helps to build CRC query graphs that can be matched
with the UNL index. Without this expansion, single-word
queries would result in isolated concept-only (C)
matches, whereas with the expansion the query is
matched against a context-dictated CRC. As already
explained, the association of expanded concepts allows
the domain-oriented, corpus-based context of the query
word to play a role in semantic matching. In addition,
it helps to retrieve documents that contain concepts
in the context of the query and that would have been
missed by other search mechanisms.
4 CONCEPTUAL SEARCHING AND RANKING
The basic searching procedure is based on complete
CRC Match or partial CR or C matches between
query sub graphs and the corresponding index as in
AgroExplorer (Surve et al., 2004). However, in this
paper, the design of the ranking procedure depends
on whether the match of the index is with the ac-
tual query terms, actual query concepts or expanded
concepts. In addition, all the sentence and document
based features associated with the conceptual indices
also affect the ranking procedure.
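The three kinds of match named above can be illustrated with a small sketch. It assumes that a CR match means the relation and one of its two concepts agree in position, and that a C match means at least one concept is shared; the paper and AgroExplorer may define partial matches differently. The function name `degree_of_match` is hypothetical.

```python
def degree_of_match(query_crc, index_crcs):
    """Return "CRC", "CR", "C" or None for one query triple.

    query_crc:  (c1, relation, c2) triple from a query sub graph
    index_crcs: iterable of (c1, relation, c2) triples from one document
    """
    qc1, qrel, qc2 = query_crc
    best = None
    for c1, rel, c2 in index_crcs:
        if (c1, rel, c2) == query_crc:
            return "CRC"                      # complete match, nothing better
        if rel == qrel and (c1 == qc1 or c2 == qc2):
            best = "CR"                       # relation plus one concept
        elif best is None and {c1, c2} & {qc1, qc2}:
            best = "C"                        # a concept only
    return best
```

Returning the strongest match found for each document is what allows the documents to be grouped by degree of match before any finer ranking is applied.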
The overall algorithm for searching and ranking
actually performs three level ranking. The first level
ranking is obtained based on whether there is com-
plete match (CRC match), partial match of Concept
Relation (CR) or match of only concepts (C Only).
This level of ranking is provided by the Degree of
Match Categorization tag Ta. The set of documents
obtained in level 1 category is further prioritized using
Concept Association Categorization Tag Tb. Con-
cept Association categorization depends on whether
the index match is between query terms, query con-
cepts or expanded concepts. Once the documents
have been ranked by Ta and Tb, the documents at the
same Ta.Tb level are ranked based on weights cal-
culated based on the index based features associated
with the concept.
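The three-level ordering can be sketched as a single sort on a composite key. The numeric codes below are assumptions: the priority CRC before CR before C follows the text, while the order query term, then query concept, then expanded concept for Tb is inferred from the order in which the text lists them. The hit dictionaries and the names `TA`, `TB`, `tag` and `rank` are illustrative only.

```python
# Assumed numeric codes; smaller means higher priority.
TA = {"CRC": 1, "CR": 2, "C": 3}                  # degree of match
TB = {"term": 1, "concept": 2, "expanded": 3}     # concept association


def tag(hit):
    """Render the Ta.Tb tag of a hit, e.g. "1.3"."""
    return f"{TA[hit['match']]}.{TB[hit['assoc']]}"


def rank(hits):
    """Order hits by Ta, then Tb, then by descending feature weight."""
    return sorted(
        hits,
        key=lambda h: (TA[h["match"]], TB[h["assoc"]], -h["weight"]),
    )
```

A document with a high weight but only a C match therefore never outranks a document with a complete CRC match; the weight only decides the order inside one Ta.Tb group.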
A Tag represented as Ta.Tb helps in determining
the two level list of prioritized documents. Tag Ta
computed in level 1 indicates degree of match while
Tb computed in level 2 indicates the type of concept
association. For determining the tags the following
terminology is defined.
A given query with $n$ terms may be represented as a
set $Q = \{q_1, \ldots, q_n\}$, where $q_i$ represents
a query term. Each element $i$ of the power set of $Q$
is expanded and enconverted to a set $EQ_i$ of UNL
graphs $g_{im}$, where $m$ indexes the expanded
concepts obtained from the UNL index and $m > 0$.
Using the power set of $Q$ means that a query term may
be associated not only with a single expanded term and
its concepts, but also with more than one expanded
term and concept.
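The construction of the sets of query graphs can be sketched as below. The sketch assumes that the empty subset is skipped, since it contains nothing to expand, and it leaves expansion and enconversion to a caller-supplied function; `nonempty_subsets`, `build_eq` and the `expand` parameter are hypothetical names.

```python
from itertools import chain, combinations


def nonempty_subsets(terms):
    """All non-empty subsets of the query terms, smallest first."""
    terms = list(terms)
    return list(chain.from_iterable(
        combinations(terms, r) for r in range(1, len(terms) + 1)
    ))


def build_eq(terms, expand):
    """Map each subset i of Q to its list EQ_i of expanded UNL graphs.

    expand: caller-supplied function taking a tuple of terms and
            returning an iterable of graphs (stands in for expansion
            and enconversion against the UNL index).
    """
    eq = {}
    for subset in nonempty_subsets(terms):
        graphs = list(expand(subset))
        if graphs:                 # keep only subsets with m > 0 graphs
            eq[subset] = graphs
    return eq
```

A query of n terms yields 2^n - 1 non-empty subsets, so for long queries an implementation would probably need to prune subsets rather than enumerate all of them.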