LATENT SEMANTIC INDEXING
USING MULTIRESOLUTION ANALYSIS
Tareq Jaber¹, Abbes Amira² and Peter Milligan³
¹ Faculty of Computing and Information Technology, King Abdulaziz University - North Jeddah Branch, K.S.A.
² Nanotechnology and Integrated Bio-Engineering Centre (NIBEC), Faculty of Computing and Engineering, University of Ulster, Jordanstown campus, Antrim, BT37 0QB, U.K.
³ School of Electronics, Electrical Engineering and Computer Science, Queen’s University Belfast, Belfast, BT7 1NN, U.K.
Keywords:
Latent semantic indexing, Information retrieval, Haar wavelet transform, Singular value decomposition.
Abstract:
Latent semantic indexing (LSI) is commonly used to match queries to documents in information retrieval
(IR) applications. It has been shown to improve the retrieval performance, as it can deal with synonymy and
polysemy problems. This paper proposes a hybrid approach which can improve result accuracy significantly.
The approach uses the Haar wavelet transform (HWT), with Donoho’s thresholding, as a preprocessing step for the
singular value decomposition (SVD) in the LSI system, and an evaluation of this approach is presented.
Furthermore, the effect of different levels of decomposition in the HWT process is investigated.
The experimental results presented in the paper confirm a significant improvement in performance
when the HWT is applied as a preprocessing step using Donoho’s thresholding.
1 INTRODUCTION
As the amount of information stored electronically increases,
so does the difficulty of searching it. The
field of information retrieval (IR) examines the pro-
cess of extracting relevant information from a dataset
based on a user’s query (Berry et al., 1995). Latent
semantic indexing (LSI) is a technique used for intel-
ligent IR. It can be used as an alternative to the tra-
ditional keyword matching IR and is attractive in this
respect due to its ability to overcome problems with
synonymy and polysemy (Berry et al., 1995). Traditionally,
LSI is implemented in several stages (Bell and Degani, 2002).
The first stage is to preprocess the database of documents
by removing all punctuation and “stop words” (such as
“the”, “as” and “and”), which carry no distinctive semantic
meaning. A term document matrix (TDM) is then generated,
which represents the relationship between the
documents in the database and the words that appear
in them. The TDM is then decomposed. The original
decomposition algorithm, proposed by Berry et al. (1995)
and by far the most widely used, is
the singular value decomposition (SVD) (Fox, 1992).
The decomposition is used to remove noise (sparse-
ness) from the matrix and reduce the dimensionality
of the TDM, in order to ascertain the semantic relati-
onship among terms and documents in an attempt to
overcome the problems of polysemy and synonymy.
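The stages of a conventional LSI system, including the final query-matching stage, can be illustrated with a minimal sketch. It is not the implementation evaluated in this paper: the toy corpus, the stop-word list, the use of raw term counts, the query fold-in formula and the function names (`build_tdm`, `lsi_fit`, `lsi_rank`) are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import re
import numpy as np

# Hypothetical stop-word list and toy corpus, for illustration only.
STOP_WORDS = {"the", "as", "and", "a", "an", "of", "in", "is", "to", "on"}
DOCS = [
    "The car is driven on the road.",
    "The automobile and the car engine.",
    "Wavelets compress the image.",
    "The image is a matrix of pixels.",
]

def tokenize(text):
    """Lowercase the text, strip punctuation and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def build_tdm(docs):
    """Return (vocabulary, term-document matrix of raw counts)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    tdm = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(tokenized):
        for w in doc:
            tdm[index[w], j] += 1.0
    return vocab, tdm

def lsi_fit(tdm, k):
    """Rank-k truncated SVD of the TDM: A ~ U_k S_k V_k^T."""
    u, s, vt = np.linalg.svd(tdm, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

def lsi_rank(query, vocab, u_k, s_k, vt_k):
    """Rank documents by cosine similarity to the query in the k-dim space.

    Assumes the k retained singular values are nonzero.
    """
    index = {w: i for i, w in enumerate(vocab)}
    q = np.zeros(len(vocab))
    for w in tokenize(query):
        if w in index:
            q[index[w]] += 1.0
    q_k = (q @ u_k) / s_k            # fold the query in: q^T U_k S_k^{-1}
    docs_k = vt_k.T                  # one row of coordinates per document
    denom = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    sims = (docs_k @ q_k) / np.where(denom == 0, 1.0, denom)
    return np.argsort(-sims), sims

vocab, tdm = build_tdm(DOCS)
u_k, s_k, vt_k = lsi_fit(tdm, k=2)
order, sims = lsi_rank("automobile", vocab, u_k, s_k, vt_k)
print(order, np.round(sims, 3))
```

In this toy corpus the query “automobile” also scores the first document highly, although that document contains only the word “car”; this is the synonymy effect that the dimensionality reduction is intended to capture.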
Finally, the document set is compared with the query
and the documents which are closest to the user’s
query are returned. In Unitary Operators on the
Document Space (Hoenkamp, 2003), Hoenkamp asserts
that the fundamental property of the SVD is its unitary
nature. The Haar wavelet transform (HWT) has been suggested
as an alternative that shares this unitary property
at much reduced computational cost, and that research
presents some promising initial results. Further, the idea of
treating the TDM as a gray scale image has also been postulated,
and the equivalence of using the HWT to remove lexical noise
and using the HWT to remove noise from an image
has been discussed (Hoenkamp, 2003). The aim of
the research presented in this paper, which continues
the work in (Jaber et al., 2008), is to develop
a new approach to the LSI process based on the
possibility of using image processing techniques in
text document retrieval. In particular, the effect of using
the HWT as a preprocessing step to the SVD is
studied. Moreover, attention is paid to the effect of
different levels of decomposition and thresholding techniques
used in the HWT. A range of parameters and
performance metrics, including accuracy or precision
(number of relevant documents returned), computa-
Jaber T., Amira A. and Milligan P.
LATENT SEMANTIC INDEXING USING MULTIRESOLUTION ANALYSIS.
DOI: 10.5220/0003313203270332
In Proceedings of the 1st International Conference on Pervasive and Embedded Computing and Communication Systems (PECCS-2011), pages 327-332
ISBN: 978-989-8425-48-5
Copyright © 2011 SCITEPRESS (Science and Technology Publications, Lda.)