LSA-derived representations of, and shared vocabulary
from, the contexts surrounding the instances of an
ambiguous keyword in a corpus; the senses of the word in
question are then derived using unsupervised learning
techniques. (Schutze 98) presents a corpus-based approach
to word-sense disambiguation. It is based on the idea that
two instances of an ambiguous word have the same sense if
they have second-order similarity: that is, if the words
they co-occur with themselves co-occur with substantially
overlapping sets of words.
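As an illustration of second-order similarity, the following is a minimal sketch, not the implementation from (Schutze 98). It assumes sentence-level co-occurrence counts and cosine similarity; the function names `cooccurrence`, `second_order_vector` and `cosine` are hypothetical and chosen only for this example.

```python
from collections import Counter
from math import sqrt


def cooccurrence(sentences):
    """Map each word to a Counter of the words it shares a sentence with."""
    table = {}
    for sentence in sentences:
        words = set(sentence.lower().split())
        for w in words:
            table.setdefault(w, Counter()).update(x for x in words if x != w)
    return table


def second_order_vector(context, table):
    """Sum the co-occurrence counts of every word in the context.

    The result describes the context by what its words co-occur with,
    not by the context words themselves.
    """
    vec = Counter()
    for w in context.lower().split():
        vec.update(table.get(w, {}))
    return vec


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy corpus (assumed data): "loan" and "interest" never appear together,
# but both co-occur with "money" and "credit".
corpus = ["loan money credit", "interest money credit", "river water fish"]
table = cooccurrence(corpus)
financial_a = second_order_vector("loan", table)
financial_b = second_order_vector("interest", table)
riverside = second_order_vector("river", table)
```

Here the contexts "loan" and "interest" share no words directly, yet their second-order vectors coincide, while the "river" context has no overlap with either.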
Most of the related work described in this section
provides methods that guide the user, with more or less
automation, to the information they want. This work
differs in that it gives the user a powerful but intuitive
language for expressing what they want.
4 CONCLUSIONS AND FUTURE WORK
The method described in this paper is simple but
effective. This technique for non-lexical, semantic
search works because of the existence of a very large,
multi-topical collection of corpora, in the form of the
Internet, and a fast, efficient method for searching over
it lexically (in this case Google, though any search
engine would do). The key observation is that simple
characterizations of the search-result pages for a query
provide a reasonable characterization of that query’s
meaning, one that can be used to compute inter-document
distances.
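The key observation can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the characterization is assumed to be a bag-of-words relative-frequency profile, the distance is assumed to be cosine distance, and the result-page texts are supplied by the caller rather than fetched from a search engine. The names `characterize` and `cosine_distance` are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def characterize(texts):
    """Bag-of-words profile: relative frequency of each word across the texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine_distance(p, q):
    """1 - cosine similarity of two sparse profiles (1.0 if either is empty)."""
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return 1.0 - dot / norm if norm else 1.0


# Assumed stand-ins for the text of search-result pages returned for a query.
result_pages = ["jaguar cars dealer engine", "luxury cars engine speed"]
query_profile = characterize(result_pages)

# Two candidate documents, characterized the same way.
doc_cars = characterize(["the engine of these cars"])
doc_cats = characterize(["big cats hunt in the jungle"])

d_cars = cosine_distance(query_profile, doc_cars)
d_cats = cosine_distance(query_profile, doc_cats)
```

With these toy inputs the document about cars lies closer to the query profile than the document about cats, even though neither document contains the query word itself.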
This paper used supervised learning techniques over
queries and documents, but these distance metrics could
also be used with unsupervised clustering algorithms.
There have been many papers about the shape of the
Internet, with topologies based on connectivity (e.g.,
(Faloutsos et al 99)). It would be interesting to use the
technique described herein to derive the semantic topology
of the Internet, though the bandwidth and processing power
required to do such a project justice would be vast.
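To make the unsupervised alternative concrete, here is a minimal sketch of single-linkage agglomerative clustering driven by an arbitrary distance function. The paper does not name a clustering algorithm; this choice, the `cluster` function and its `threshold` parameter are assumptions made only for illustration.

```python
def cluster(items, distance, threshold):
    """Single-linkage agglomerative clustering.

    Starts with one cluster per item and repeatedly merges two clusters
    whenever any pair of their members is closer than `threshold`.
    """
    clusters = [[item] for item in items]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(distance(a, b) < threshold
                       for a in clusters[i] for b in clusters[j]):
                    # j > i, so removing cluster j leaves index i valid.
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


# Toy example: numbers stand in for documents, absolute difference for
# the inter-document distance.
groups = cluster([1, 2, 10, 11, 20], lambda a, b: abs(a - b), threshold=3)
```

Any distance function over documents or queries could be passed in place of the absolute difference used here. The threshold controls how coarse the resulting grouping is.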
REFERENCES
James Allan and Hema Raghavan (2002). Using Part-of-
Speech Patterns to Reduce Query Ambiguity. SIGIR
’02, Tampere, Finland.
P.D. Bruza and S. Dennis. (1997) Query-reformulation on
the internet: empirical data and the hyperindex search
engine. In Proceedings of the RIAO Conference:
Intelligent Text and Image Handling, pages 488-499,
Montreal, Canada.
Andrew Burton-Jones, Veda C. Storey, Vijayan Sugumaran
and Sandeep Purao. (2003) A Heuristic-Based Methodology
for Semantic Augmentation of User Queries on the Web.
International Conference on Conceptual Modeling, ER’03,
pp. 476-489.
Michalis Faloutsos, Petros Faloutsos, Christos Faloutsos
(1999) On power-law relationships of the Internet
topology. Proceedings of the conference on Applica-
tions, technologies, architectures, and protocols for
computer communication.
Google, Inc. www.google.com
Cheng Niu, Wei Li, Rohini K. Srihari, Huifeng Li, Lau-
rie Crist. (2004). Context Clustering for Word Sense
Disambiguation Based on Modeling Pairwise Con-
text Similarities. SENSEVAL-3: Third International
Workshop on the Evaluation of Systems for the Se-
mantic Analysis of Text, Barcelona.
Geoffrey Leech, Paul Rayson, Andrew Wilson (2001).
Word Frequencies in Written and Spoken English:
based on the British National Corpus. Longman, Lon-
don.
Overture, Inc. www.overture.com
M. Sanderson and D. Lawrie. (2000) Building, testing, and
applying concept hierarchies. In W. Bruce Croft, editor,
Advances in Information Retrieval: Recent Research from
the CIIR, chapter 9, pages 235-266. Kluwer Academic
Publishers.
Schutze, Hinrich. (1998) Automatic Word Sense Discrimi-
nation. Computational Linguistics. 24:1, 97-123.
Dou Shen, Rong Pan, Jian-Tao Sun, Jeffrey Junfeng Pan,
Kangheng Wu, Jie Yin, Qiang Yang. (To appear.) Query
Enrichment for Web-query Classification. ACM Transactions
on Information Systems.
Veda C. Storey, Andrew Burton-Jones, Vijayan Sugumaran,
Sandeep Purao. (Preprint, submitted to Information
Systems Review.) Making the Web More Semantic: A
Methodology for Context-Aware Query Processing.
Yahoo, Inc. www.yahoo.com