LATENT SEMANTIC INDEXING USING MULTIRESOLUTION ANALYSIS

Tareq Jaber (1), Abbes Amira (2) and Peter Milligan (3)

(1) Faculty of Computing and Information Technology, King Abdulaziz University - North Jeddah Branch, K.S.A.
(2) Nanotechnology and Integrated Bio-Engineering Centre (NIBEC), Faculty of Computing and Engineering, University of Ulster, Jordanstown campus, Antrim, BT37 0QB, U.K.
(3) School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast, BT7 1NN, U.K.
Keywords: Latent semantic indexing, Information retrieval, Haar wavelet transform, Singular value decomposition.
Abstract: Latent semantic indexing (LSI) is commonly used to match queries to documents in information retrieval (IR) applications. It has been shown to improve retrieval performance, as it can deal with synonymy and polysemy problems. This paper proposes a hybrid approach which can improve result accuracy significantly. An evaluation of the approach, based on using the Haar wavelet transform (HWT) as a preprocessing step for the singular value decomposition (SVD) in the LSI system, is presented, using Donoho's thresholding with the HWT. Furthermore, the effect of different levels of decomposition in the HWT process is investigated. The experimental results presented in the paper confirm a significant improvement in performance by applying the HWT as a preprocessing step using Donoho's thresholding.
1 INTRODUCTION
As the amount of information stored electronically in-
creases, so does the difficulty in searching it. The
field of information retrieval (IR) examines the pro-
cess of extracting relevant information from a dataset
based on a user’s query (Berry et al., 1995). Latent
semantic indexing (LSI) is a technique used for intel-
ligent IR. It can be used as an alternative to the tra-
ditional keyword matching IR and is attractive in this
respect due to its ability to overcome problems with
synonymy and polysemy (Berry et al., 1995). Tra-
ditionally LSI is implemented in several stages (Bell
and Degani, 2002). The first stage is to preprocess
the database of documents by removing all punctuation and "stop words", such as "the", "as" and "and", which carry no distinctive semantic meaning.
ment. A term document matrix (TDM) is then gen-
erated which represents the relationship between the
documents in the database and the words that appear
in them. Then the TDM is decomposed. The original
decomposition algorithm proposed by Berry (Berry
et al., 1995), and by far the most widely used, is
the singular value decomposition (SVD) (Fox, 1992).
The decomposition is used to remove noise (sparse-
ness) from the matrix and reduce the dimensionality
of the TDM, in order to ascertain the semantic relati-
onship among terms and documents in an attempt to
overcome the problems of polysemy and synonymy.
Finally, the document set is compared with the query
and the documents which are closest to the user’s
query are returned. In Unitary Operators on the
Document Space (Hoenkamp, 2003), Hoenkamp as-
serts that the fundamental property of the SVD is its unitary nature. The use of the Haar wavelet transform (HWT), an alternative that shares this unitary property at a much reduced computational cost, has been suggested, and that work presents some promising initial results. Furthermore, the idea of treating the TDM as
a gray scale image has also been postulated, and the
equivalence of using the HWT to remove lexical noise
and using the HWT to remove noise from an image
has been discussed (Hoenkamp, 2003). The aim of
the research presented in this paper, and continuing
on the research work in (Jaber et al., 2008), is to de-
velop a new approach to the LSI process based on the
possibility of using image processing techniques in
text document retrieval. In particular, the effect of us-
ing the HWT as a pre-processing step to the SVD is
studied. Moreover, attention is paid to the effect of
different levels of decomposition and threshold tech-
niques used in the HWT. A range of parameters and
performance metrics, including accuracy or precision
(number of relevant documents returned), computa-
tion time, and threshold value or dimensions retained,
are used to evaluate the proposed LSI system. The
paper is organized as follows. The proposed hybrid
approach is presented in section 2. This gives an
overview of our proposed LSI system and the pro-
cesses involved. Section 3 presents the results of the
new method as a series of hypotheses which are evaluated by comparison with standard baseline systems. Con-
cluding remarks are given in section 4.
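To make the standard LSI-SVD retrieval stage outlined above concrete, a minimal sketch is given below (in Python with NumPy); the query-folding convention, the similarity cut-off min_sim and all variable names are illustrative assumptions rather than the exact implementation used in this work.

import numpy as np

def lsi_search(tdm, query_vec, k, min_sim=0.5):
    # Rank-k SVD of the terms-by-documents matrix (rows = terms, columns = documents).
    U, s, Vt = np.linalg.svd(tdm, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T          # rows of Vk are document coordinates
    # Fold the query (a 0/1 term vector) into the reduced space: q_hat = q^T U_k S_k^{-1}.
    q_hat = (query_vec @ Uk) / sk
    # Cosine similarity between the query and every document; keep the closest ones.
    sims = (Vk @ q_hat) / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat) + 1e-12)
    order = np.argsort(-sims)
    return [(int(d), float(sims[d])) for d in order if sims[d] >= min_sim]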
2 THE PROPOSED HYBRID METHOD
In this section the proposed framework and its different components are described. To enable evaluation
of the modified approach, the method is applied to
four sample databases: Memos database (Berry et al.,
1995)(Bell and Degani, 2002), Cochrane database
(Cochrane, 2005), eBooks database (eBooks, 2005),
and Reuters database (Fox, 1992).
2.1 The HWT
The HWT is a series analogous to the Fourier expan-
sion that is often used in image processing (Amira and
Farrell, 2005). HWT decomposition works on an av-
eraging and differencing process (Amira and Farrell,
2005)(Berry et al., 1999). In image processing, the
transform can be used to remove noise from an image
(Delakis et al., 2007). An image is transformed using
HWT, and then a thresholding function, at a certain
threshold value, is applied to remove the noise from
the image; typically a cleaner image results when the
image is reconstructed after thresholding. Donoho's thresholding algorithm is used in this paper (Donoho, 1995). This algorithm generates the threshold value according to the following equation:
λ = γσ √(2 log(n) / n)    (1)
where λ is the threshold value, n is the number of data samples, σ is the noise standard deviation and γ is a constant. It can be seen that this threshold depends only on the sample data. Then the hard-thresholding function, or the soft-thresholding function (Yoon and Vaidyanathan, 2004), is applied to
threshold the image. The paper will present the re-
sults of applying each thresholding model in the new
approach. For our present purposes, the TDM can be considered as a gray scale image, usually a binary image (a sparse matrix of 0s and 1s). By applying the HWT and a threshold to the TDM, we can also remove "noise" from this image; in this case, we argue that this represents the removal of lexical noise.
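A minimal sketch of this thresholding stage, following Eq. (1), is given below; the value of gamma and the estimation of sigma directly from the coefficient standard deviation are illustrative assumptions (Donoho's work typically estimates sigma from the finest-level detail coefficients).

import numpy as np

def donoho_lambda(coeffs, gamma=1.0):
    # Eq. (1): lambda = gamma * sigma * sqrt(2 * log(n) / n).
    n = coeffs.size
    sigma = np.std(coeffs)                  # illustrative noise estimate
    return gamma * sigma * np.sqrt(2.0 * np.log(n) / n)

def hard_threshold(coeffs, lam):
    # Zero every coefficient whose magnitude falls below lambda.
    return np.where(np.abs(coeffs) >= lam, coeffs, 0.0)

def soft_threshold(coeffs, lam):
    # Additionally shrink the surviving coefficients towards zero by lambda.
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - lam, 0.0)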
2.2 HWT/SVD based Hybrid Approach
A commonly used approach in image processing is
to combine different techniques in order to improve
noise reduction. The comparison of the TDM to a
gray scale image invites a similar technique. The sys-
tem allows SVD and HWT techniques to be com-
bined, as shown in Fig. 1, to investigate their com-
bined effect on the TDM and the quality of the results,
in terms of precision and recall.
[Figure 1 block diagram: Term Document Matrix (TDM) → HWT → Thresholding (Hard / Soft / Donoho) → Inverse HWT → Reconstructed TDM → SVD Decomposition → K Approximation → Visualization Engine]
Figure 1: The hybrid method.
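The pipeline in Fig. 1 can be sketched as follows, using PyWavelets for the 2-D Haar transform; the decomposition level, gamma, k and the way sigma is estimated are illustrative choices, not the exact settings used in the experiments.

import numpy as np
import pywt

def hwt_svd_tdm(tdm, level=2, gamma=1.0, k=40, mode='hard'):
    # 1. Forward 2-D Haar transform of the term-document matrix.
    coeffs = pywt.wavedec2(tdm, 'haar', level=level)
    # 2. Donoho threshold (Eq. 1) computed over the detail coefficients, then applied.
    details = np.concatenate([d.ravel() for lvl in coeffs[1:] for d in lvl])
    lam = gamma * np.std(details) * np.sqrt(2.0 * np.log(details.size) / details.size)
    coeffs = [coeffs[0]] + [tuple(pywt.threshold(d, lam, mode=mode) for d in lvl)
                            for lvl in coeffs[1:]]
    # 3. Inverse transform gives the reconstructed (denoised) TDM.
    denoised = pywt.waverec2(coeffs, 'haar')[:tdm.shape[0], :tdm.shape[1]]
    # 4. Rank-k SVD approximation of the reconstructed TDM.
    U, s, Vt = np.linalg.svd(denoised, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]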
2.3 Lexical Noise Analysis
In the LSI process, once preprocessing is complete,
the TDM is constructed. Most values in the matrix
are zero, as only a subset of keywords appears in any
given document. The interest lies in the relationship of terms across documents; words that appear in only one document therefore add no information to
this relationship. Similarly, words that appear in all
documents are considered meaningless due to their
ubiquity. Fig. 2 shows images generated by visual-
izing the TDMs after applying the HWT/SVD hybrid
method. It is worth noting that visualizing the TDM
as an image enables large datasets to be examined and
analyzed more easily (Jaber et al., 2006). If the im-
ages are examined, the white dots represent the data
or non-zero values. When dots are close to each other,
forming a cluster, it is possible, by looking at appro-
priate columns and rows, to say that there is a relation-
ship between these documents because they refer to the same concepts.
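As a minimal sketch, one straightforward way to produce such images is to render the TDM (or its reconstruction) with a gray colour map; the function below is an illustrative helper, not the visualisation engine used in this work.

import matplotlib.pyplot as plt

def show_tdm(tdm, title='Term-document matrix'):
    # Non-zero entries appear as bright dots; zeros appear as dark pixels.
    plt.imshow(tdm, cmap='gray', aspect='auto')
    plt.xlabel('documents')
    plt.ylabel('terms')
    plt.title(title)
    plt.show()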
Figure 2: Left: SVD Decomposition of Cochrane database
with K= 40 Right: HWT/SVD of Cochrane Database with
threshold = 0.04 and K=40. The SVD image is slightly
brighter, and has more dark pixels which represent zeros.
For the HWT image, much of the sparseness (dark pixels) has been reduced and the values are redistributed more evenly.
3 RESULTS AND ANALYSIS
This section compares the proposed hybrid tech-
niques with the standard LSI approach. The search
is applied to the different databases. There are sev-
eral different measures for evaluating the performance
of information retrieval systems. The most common
properties that are widely accepted by the research
community are recall and precision (Singhal, 2001).
Precision is the fraction of the documents retrieved
that are relevant to the user’s query, and recall is
the fraction of the documents that are relevant to the
query and successfully retrieved (Jaber et al., 2008).
Both recall and precision are needed to evaluate an IR system. A recall of 100% can always be achieved simply by returning every document in the database in response to any query; therefore recall alone is not enough. One needs to measure the number of irrelevant documents returned in order to determine the precision, or accuracy, of the results. On the other hand, a precision of 100% can also be achieved in many cases by returning only relevant results, but again one needs to count all the relevant documents in the database to measure the recall.
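As a minimal sketch, both measures follow directly from the sets of retrieved and relevant document identifiers; the sets and numbers used below are illustrative.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                    # relevant documents actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 documents returned, 2 of them relevant, 4 relevant in the database
# gives precision = 2/3 (about 67%) and recall = 2/4 (50%).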
3.1 HWT-SVD LSI
In this section a number of searches are performed on
the sample databases to compare the basic LSI-SVD
approach with the proposed hybrid technique. ("The standard SVD", when mentioned in this work, refers to the standard LSI-SVD system.)
Cochrane Database. Searching for ”rheumatoid
arthritis” and ”smoking and heart disease”.
For the first query, as shown in Fig. 3, the stan-
dard SVD returns one more result than the hybrid ap-
proach. For the second query, the standard method has a low precision value, returning three results, two
[Bar chart: number of documents returned (total and relevant) for Queries 1 and 2 by SVD, Donoho(hard)/HWT-SVD and Donoho(soft)/HWT-SVD; k = 20-50, threshold = 0.02.]
Figure 3: LSI search results for Cochrane database.
of which are irrelevant to the query. The HWT-SVD
method returns only related documents, demonstrating greater accuracy, or precision, in matching queries, and
hence outperforms the standard method.
eBooks Database. Searching for "plastics engineering", "xml transformations", "health and safety" and "advanced java programming".
[Bar chart: number of documents returned (total and relevant) for Queries 1-4 by SVD, Donoho(hard)/HWT-SVD and Donoho(soft)/HWT-SVD; k = 10-50, threshold = 0.004.]
Figure 4: LSI search results for eBooks database.
In Fig. 4, for the first query, the standard SVD returns all the existing relevant results and outperforms the HWT-SVD approach in terms of recall: 100% compared with 79% for the new approach with both thresholding schemes. However, as shown in the figure, the standard SVD also produces seven irrelevant documents, while both thresholding schemes with the new approach return 22 documents, all of which are relevant, giving a precision of 100% compared with 80% for the standard SVD. In the second query, with a precision of 100%, the Donoho(hard)/HWT-SVD approach returns one fewer relevant result, and the Donoho(soft)/HWT-SVD three fewer relevant results, than the standard SVD, while the standard SVD produces two extra unrelated documents.
For the third query, the standard method returns four unrelated extra documents, resulting in a precision of 71%, while the hybrid method with the hard function clearly improves the accuracy level, with a precision of 100%. A recall of 100% is achieved for both methods, as they retrieve all the related results in the database. The new method with the soft thresholding
performs less well, returning three fewer related documents, with a recall of 70%. For the last query, Donoho(hard)/HWT-SVD again performs very well by eliminating six additional unrelated documents returned by the standard method, resulting in a considerable improvement in precision, with recall and precision of 100%. The Donoho(soft)/HWT-SVD returns four fewer relevant documents than the hard function.
Reuters Database. Searching for "japan", "bank", "money market" and "sales tax".
[Bar chart: number of documents returned (total and relevant) for Queries 1-4 by SVD, Donoho(hard)/HWT-SVD and Donoho(soft)/HWT-SVD; k = 10-50, threshold = 0.004.]
Figure 5: LSI search results for Reuters Database.
For the first query in Fig. 5, the standard method performs poorly in terms of precision, with a large number of irrelevant results returned, giving a low precision of 67%. Reasonable results are returned by the new hybrid approach with the hard function: it returns only two fewer relevant results and keeps only the relevant documents, with a recall of 91% and a precision of 100%. A less efficient performance can be noticed in the new approach with the soft function. Although a precision of 100% is achieved, this method misses seven related documents, which decreases the recall to 68%. Excellent results are obtained by the Donoho(hard)/HWT-SVD for the second query: the method returns all the existing relevant documents for the query without any irrelevant results, achieving recall and precision of 100%. In contrast, a considerable number of irrelevant documents is returned by the standard SVD, with a precision of 53%, showing a poor accuracy level. As was the case in the previous query, the Donoho(soft)/HWT-SVD again performs less well than the hard one: all the documents returned are relevant, but it produces a smaller number of relevant results, with a recall of 83% and a precision of 100%. For the third and fourth queries the standard SVD continues to show a lower level of accuracy for the results returned compared with the novel hybrid approach with both thresholding methods. A negligibly smaller number of relevant results is returned by Donoho(hard)/HWT-SVD, resulting in a slightly lower recall value. As in most cases, the Donoho(soft)/HWT-SVD misses more relevant results and, as a consequence, the recall value decreases.
Table 1: Decomposition algorithms: computation time (seconds) and accuracy.

Database    SVD       Donoho's/HWT-SVD
Memos        0.11      0.245
Cochrane     0.562     0.977
eBooks      16.562    19.765
Reuters     27.125    33.314
Accuracy       66%      100%
Table 1 provides an overview of the computation
time in seconds and the accuracy for the dif-
ferent decomposition algorithms. As the size of the
matrix grows, the amount of extra processing time re-
quired to implement HWT preprocessing into the LSI
system becomes negligible in comparison to the over-
all time required, especially with the improvement in
accuracy of the returned results.
3.2 Multilevel Decomposition Analysis
This section investigates the influence of the decom-
position level on the search results, by applying the
search at the best k (the dimension reduction in the
SVD algorithm for which the best results are ob-
tained) and threshold value at different levels of de-
composition. The results in the previous section show
that the Donoho(hard)/HWT-SVD in many cases per-
forms better than the Donoho(soft)/HWT-SVD. Con-
sequently the threshold function used in this investi-
gation is the hard function.
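As an illustrative sketch of this sweep, the full decomposition level follows from the smaller matrix dimension and the Haar filter length; hwt_svd_tdm refers to the hypothetical helper sketched in Section 2.2.

import pywt

def full_haar_level(tdm):
    # Maximum decomposition level permitted by the smaller dimension of the TDM.
    haar_len = pywt.Wavelet('haar').dec_len
    return min(pywt.dwt_max_level(n, haar_len) for n in tdm.shape)

# for level in range(1, full_haar_level(tdm) + 1):
#     approx = hwt_svd_tdm(tdm, level=level, k=30)
#     ...run the query against approx and record the documents returned...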
[Bar chart: number of documents returned (total and relevant) by Donoho(hard)/HWT-SVD against decomposition level 1-8; Cochrane database, query "rheumatoid arthritis", k = 50, threshold = 0.02.]
Figure 6: Multiresolution analysis for Cochrane database.
Fig. 6 shows that over the range of decomposition levels 4-8 the method consistently returns one relevant result, while no results are returned at levels below four.
[Bar chart: number of documents returned (total and relevant) by Donoho(hard)/HWT-SVD against decomposition level 1-10; eBooks database, query "advanced java programming", k = 30, threshold = 0.004.]
Figure 7: Multiresolution analysis for eBooks database.
In Fig. 7 the results show that at the decomposi-
tion level ten, which is the full decomposition level,
the maximum number of results is obtained. As the
level of decomposition decreases, the number of re-
sults returned decreases as well.
[Bar chart: number of documents returned (total and relevant) by Donoho(hard)/HWT-SVD against decomposition level 1-10; Reuters database, query "japan", k = 30, threshold = 0.004.]
Figure 8: Multiresolution analysis for Reuters database.
Fig. 8 presents similar results: no documents are produced at levels below five, only one related document is returned at level five, and a far larger number of relevant results is obtained at the higher levels (6-10).
3.3 Analysis
In this research we argue that low frequencies (0 val-
ues in a typical TDM) represent lexical noise (or un-
related documents) and, consequently, the deletion
of these values will not affect the structure of the
TDM. The elimination of this noise is effected by
the application of the wavelet transform which sep-
arates low and high frequencies and the subsequent
use of the threshold function which removes low val-
ues from the TDM. Finally, the use of the inverse
HWT generates a new TDM. For the hybrid HWT-
SVD, the results show that by adding the HWT as a
pre-processing step the precision is improved by ap-
proximately 34%. The HWT-SVD, in most cases,
does not produce any irrelevant documents, and re-
turns the same number of relevant documents that
are returned by the standard SVD approach. The
HWT-SVD uses Donoho's thresholding to generate the
threshold value, and then thresholds the TDM using
two thresholding functions. The results show that
the Donoho(hard)/HWT-SVD performs clearly bet-
ter than the Donoho(soft)/HWT-SVD in terms of the
number of relevant results returned. In the investiga-
tion of different levels of decomposition, the results
show that at the level of full decomposition the best
results are obtained. It is worth noting that the precise action of the pre-processing step depends on the value of k used for the SVD and the threshold value used in the HWT (these are chosen by testing the methods over a range of k and threshold values), but for the best cases the results show that adding HWT
pre-processing can improve the precision of the doc-
uments returned.
4 CONCLUSIONS
A new hybrid approach to LSI for effective
use in IR has been presented in this paper. Investi-
gation of different approaches for LSI has confirmed
that the SVD is the most powerful decomposition al-
gorithm in the LSI process in terms of the number
of documents returned. The investigation of the HWT as a preprocessing step, prior to the SVD in the LSI process, shows good results: the preprocessing step tends to remove irrelevant documents from those returned, enhancing the accuracy of the results. The multiresolution analysis shows that the HWT performs best at full decomposition. It also offers the possibility of
other combined approaches.
REFERENCES

Amira, A. and Farrell, P. (2005). An automatic face recognition system based on wavelet transforms. In IEEE International Conference on Circuits and Systems, pages 6252-6255.

Bell, M. and Degani, N. (2002). Latent semantic indexing, parallel SVD and its applications. In ALGORITMY 2002, pages 113-120.

Berry, M., Dumais, S., and O'Brien, G. (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37:573-595.

Berry, M. W., Drmač, Z., and Jessup, E. R. (1999). Matrices, vector spaces, and information retrieval. SIAM Review, 41:335-362.
Cochrane (2005). The Cochrane Collaboration. http://www.cochrane.org.

Delakis, I., Hammad, O., and Kitney, R. I. (2007). Wavelet-based denoising algorithm for images acquired with parallel magnetic resonance imaging (MRI). Physics in Medicine and Biology, 52:3741-3751.

Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41:613-627.

eBooks (2005). Queen's University hosted library catalogues. http://library.qub.ac.uk.

Fox, C. (1992). Lexical analysis and stoplists. In Information Retrieval: Data Structures & Algorithms. Prentice-Hall.

Hoenkamp, E. (2003). Unitary operators on the document space. Journal of the American Society for Information Science and Technology, 54:314-320.

Jaber, T., Amira, A., and Milligan, P. (2006). A novel approach for lexical noise analysis and measurement in intelligent information retrieval. In IEEE International Conference on Pattern Recognition (ICPR), volume 3, pages 370-373, Hong Kong.

Jaber, T., Amira, A., and Milligan, P. (2008). Performance evaluation of DCT and wavelet transform for LSI. In IEEE International Symposium on Circuits and Systems (ISCAS).

Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, 24:35-43.

Yoon, B. and Vaidyanathan, P. P. (2004). Wavelet-based denoising by customized thresholding. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 925-928.