Incorporating Ad Hoc Phrases in LSI Queries

Roger Bradford


Latent semantic indexing (LSI) is a well-established technique for information retrieval and data mining, and it has been incorporated into a wide variety of practical applications. In these applications, LSI provides valuable capabilities for information search, categorization, clustering, and discovery. The technique does have limitations, however. One is that the classical implementation of LSI provides no flexible mechanism for dealing with phrases, even though phrases can have significant value in specifying user information needs in both information retrieval and data mining applications. In the classical implementation, a phrase can be used in a query only if it was identified a priori and treated as a unit when the LSI index was created. This requirement has greatly hindered the use of phrases in LSI applications. This paper presents a method for dealing with phrases in LSI-based information systems on an ad hoc basis: at query time, without requiring any prior knowledge of the phrases of interest. The approach is fast enough to be used during real-time query execution. This new capability can enhance the use of LSI in both information retrieval and knowledge discovery applications.
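The phrase limitation of classical LSI can be illustrated with a minimal sketch. It is not the method presented in this paper. It shows only the standard bag-of-words query fold-in, under which word order is lost. The toy corpus, the rank k = 2, and the helper names bag_of_words and fold_in are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy corpus, for illustration only.
docs = [
    "new york city subway",
    "york england new castle",
    "city subway map",
]

# Vocabulary of single terms: no phrase was identified at index time.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(text):
    """Raw term-count vector; word order is discarded."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Term-document matrix (terms x documents) and its truncated SVD.
A = np.column_stack([bag_of_words(d) for d in docs])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of retained dimensions
Uk, sk = U[:, :k], s[:k]

def fold_in(text):
    """Standard LSI query fold-in: q_hat = inv(Sigma_k) @ Uk.T @ q."""
    return (Uk.T @ bag_of_words(text)) / sk

q1 = fold_in("new york")
q2 = fold_in("york new")

# Both orderings give the same query vector, so the index cannot
# distinguish the phrase from its individual words.
same = np.allclose(q1, q2)
print(same)
```

Because the query is reduced to term counts before projection, the phrase "new york" and the reversed word pair are indistinguishable unless the phrase was added to the vocabulary as a single unit when the index was built.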


