AUTOMATIC ESTIMATION OF THE LSA DIMENSION

Jorge Fernandes, Andreia Art

ıﬁce and Manuel J. Fonseca

Department of Computer Science and Engineering, INESC-ID/ IST/ Technical University of Lisbon

R. Alves Redol, 9, 1000-029 Lisboa, Portugal

Keywords:

LSA, LSA dimension, Unsupervised text classiﬁcation, Bootstrapping.

Abstract:

Nowadays the size of collections of information achieved considerable sizes, making the ﬁnding and explo-

ration of a particular subject hard to achieve. One way to solve this problem is through text classiﬁcation,

where a theme or category is assigned to a text based on the analysis of its content. However, existing ap-

proaches to text classiﬁcation require some effort and a high level of knowledge on this subject by the users,

making them inaccessible to the common user. Another problem of current approaches is that they are op-

timized for a speciﬁc problem and can not easily be adapted to another context. In particular, unsupervised

methods based on the LSA algorithm require users to deﬁne the dimension to use in the algorithm. In this

paper we describe an approach to make the use of text classiﬁcation more accessible to common users, by

providing a formula to estimate the dimension of the LSA based on the number of texts used during the boot-

strapping process. Experimental results show that our formula for estimation of the LSA dimension allows us

to create unsupervised solutions able to achieve results similar to supervised approaches.

1 INTRODUCTION

Text classiﬁcation consists in assigning a category

(from a set of predeﬁned categories) to a text, based

on its content. Although this problem dates from the

60s, it still is relevant since the amount of (uncata-

logued) information increases everyday. Information

such as, RSS feeds, news, scientiﬁc papers, e-books,

etc., need to be organized and cataloged to make life

easier for users.

Nowadays when we want to classify a set of texts

we can either resort to supervised, unsupervised or

even hybrid approaches. While supervised solutions

require the manual classiﬁcation of a set of texts, un-

supervised approaches avoid that by using a boot-

strapping method to perform a ﬁrst classiﬁcation of

unclassiﬁed texts. The classiﬁed texts (in both ap-

proaches) are then used to train a classiﬁer, which will

be later used to classify other texts.

Various unsupervised approaches use the Latent

Semantic Analysis (LSA) (Landauer et al., 1998) al-

gorithm to perform feature extraction from the texts.

Since one of the main characteristics of the LSA algo-

rithm is its dimension, the selection of this value is of

crucial importance because it affects the ﬁnal results

of the classiﬁcation. However, from the literature we

did not ﬁnd any founded choice for its value, being its

selection made by skilled people, according to the ac-

tual context of the problem and after several iterations

to optimize the ﬁnal results.

In this paper we propose an empirical formula to

automatically estimate the best value for the LSA di-

mension, according to the current context of the prob-

lem, namely the number of documents used during

the bootstrapping step. We applied this formula to es-

timate the LSA dimension of an unsupervised system

and compare the classiﬁcation results with a super-

vised solution. Experimental evaluation show that the

results are similar, with the advantage of not requiring

any input from the users beside the set of texts to be

used by the bootstrapping algorithm.

The remainder of this paper is organized as fol-

lows. Section 2 provides a summary of the supervised

and unsupervised approaches for text classiﬁcation.

In Section 3 we present our solution for the automatic

estimation of the LSA dimension and in Section 4 we

present the results of the experimental evaluation. Fi-

nally, in Section 5 we conclude the paper.

2 RELATED WORK

Most of the existing text classiﬁcation techniques can

be grouped into two groups: supervised learning and

unsupervised learning.

309

Fernandes J., Artíﬁce A. and J. Fonseca M..

AUTOMATIC ESTIMATION OF THE LSA DIMENSION.

DOI: 10.5220/0003666103010305

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2011), pages 301-305

ISBN: 978-989-8425-79-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Supervised Learning requires a manual classiﬁca-

tion of a group of texts into a predeﬁned set of cate-

gories. This result is then used to train and build an

automatic classiﬁer able to categorize any text into the

predeﬁned set of categories.

According to (Huang, 2001), there are two key

factors for a successful supervised learning. One is

the feature extraction, which should accurately rep-

resent the contents of text in a compact and efﬁcient

manner, and the other is the classiﬁer design, which

should take the maximum advantage of the proper-

ties inherent to the texts, to achieve the best possi-

ble results. Huang studied several algorithms for both

factors, and concluded that the LSA algorithm is the

most appropriated for the feature extraction, and the

SVM for the classiﬁer.

(Debole and Sebastiani, 2003) and (Ishii et al.,

2006) both agree with Huang in using the LSA for

feature extraction, but they introduced some changes

to the feature extraction process. While the ﬁrst au-

thors included a number of “supervised variants” of

TFIDF weighting, the second authors complemented

the LSA by introducing the concept of data grouping.

Although supervised learning can obtain good re-

sults, they require a large number of texts (literature

values vary between 500 and 1400) and a manual clas-

siﬁcation to train the ﬁnal classiﬁer.

Unsupervised Learning tries to overcome the dis-

advantages of the supervised approaches by replacing

the manual classiﬁcation of a high number of texts

with an automatic classiﬁcation (often called boot-

strapping). By doing so we are able to greatly reduce

the costs and the need for human intervention.

Unfortunately the automatic classiﬁcation of texts

used by the unsupervised learning can cause vari-

ous misclassiﬁcation, introducing noise in the training

of the classiﬁer and affecting its ﬁnal performance,

which traditionally is worst than in the supervised

learning.

Since most unsupervised approaches requires a

list of representative keywords for each category,

some authors tried to improve the bootstrapping qual-

ity by developing algorithms to help in the selection

of the best keywords for each category. (Liu et al.,

2004) used a clustering algorithm to identify the most

important words for each cluster of texts. Then the

user could inspect the ranked list and select a small

set of representative keywords for each category.

(Barak et al., 2009) went a step forward by com-

pletely automating the process. Their approach at-

tempts to automatically extract possible keywords us-

ing only the category name as a starting point. The

authors introduced also a novel scheme that mod-

els both lexical references, based on certain relations

present in WordNet and Wikipedia, and contextual

references, using the model of the LSA. From the re-

sulting model they extract the necessary keywords.

(Gliozzo et al., 2005) tried to minimize the num-

ber of misclassiﬁcations of the bootstrapping by ﬁrst

preprocessing the text to remove all the words that are

not nouns, verbs, adjectives and adverbs. The result-

ing set of words is then represented using LSA. An

algorithm based on unsupervised estimation of Gaus-

sian Mixtures is then applied to differentiate between

relevant and non-relevant information using statistics

from unclassiﬁed texts. According to the authors a

SVM classiﬁer trained with the results from this boot-

strapping algorithm achieved results comparable to a

supervised solution.

In summary, although supervised learning ap-

proaches present the best results, they require some-

one (an expert person) to manually classify a large

number of texts, which is an arduous and monotonous

task, with an enormous cost associated. On the other

hand, the unsupervised learning avoids the manual

classiﬁcation by including a bootstrapping technique,

but requires speciﬁc knowledge about the algorithms

in use (e.g. LSA) and of the domain problem. In-

deed, when we use an approach that reduce the di-

mension of the features extracted from the text, like

for instance the LSA algorithm the selection of the

dimension is very important, since its value affects

the ﬁnal results. From the analysis of the several ex-

isting proposals based on the LSA algorithm, we did

not ﬁnd a clear explanation on how to choose the best

dimension for the LSA. In most cases its value is cho-

sen after several iterations and taking into account the

speciﬁc context of the current problem.

To overcome this, to minimize the human inter-

vention, and to offer good results, we propose in this

paper a solution to automatically estimate the “opti-

mum” dimension for the LSA algorithm, taking into

account only the number of texts.

3 LSA DIMENSION ESTIMATION

The solution that we developed for the bootstrapping

step starts by reducing the size of the vocabulary by

removing useless words that only introduce noise in

the categorization. We remove words contained into

a stopwords list and the least frequent words (words

that appeared less than three times)(Joachims, 1997).

By removing the least frequent words we are able to

eliminate typos and reduce the noise of the vocabu-

lary. Additionally, by reducing the size of the vocab-

ulary we will reduce the complexity of the problem

and the computational cost of all the following algo-

KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval

310

rithms.

To represent the content of the texts we used the

LSA. We built a word-document matrix by grouping

the individual document representations and applied

a TFIDF (Salton and Buckley, 1988) to the result-

ing frequency matrix, followed by SVD (Berry, 1992)

to obtain the new matrices of reduced dimensionality.

The reduced dimension of the resulting space need to

be carefully selected (as we will see below) to ﬁt well

the problem to be solved. Then, the resulting latent

semantic space is used to classify the documents and

this classiﬁcation is used to train the classiﬁer.

To get satisfactory results the dimension of the

LSA algorithm must be selected appropriately. Typi-

cally this value is selected empirically, based on val-

ues used on other similar problems, or through vari-

ous tests to identify the interval where the optimum

value of the dimension is. This represents a problem,

because an ordinary person does not have the knowl-

edge to make this selection, and also because we can

not use a ﬁxed value. Here, we propose a solution that

will allow ordinary people to use the LSA algorithm

in various problem contexts, by deﬁning a formula for

the estimation of the most appropriated LSA dimen-

sion based on the context of the problem. To that end,

we analyzed the behavior of the LSA and performed

several experimental tests.

From the literature, we found that the LSA’s per-

formance increases as the number of dimensions also

increases, until reaching a maximum. After that

value, any further increase in the number of dimen-

sions will only decrease the performance. Moreover,

if we look carefully into the LSA algorithm we can

see that the number of dimensions is affected by the

size of the vocabulary and by the number of texts used

to extract the features. Since the size of the vocab-

ulary is somehow directly related to the number of

texts, we can assume that the number of dimensions

depends exclusively from the number of texts.

Based on this we performed a set of experimen-

tal studies to identify the range of dimension values

where the performance has a maximum, and tried to

ﬁgure out a formula to estimate a value for the dimen-

sion within that interval.

To perform the experimental tests we considered

various context problems, where we had ten cate-

gories (Science and Technology, Cinema and TV,

Sport, Economy and Management, Informatics and

Internet, Games, Music, Politics, Health, and Motor

Vehicles), with distinct characteristics among them,

and three number of texts per category (32, 64 and

128). Texts were collected from news sites, and their

sizes vary from a few paragraphs to one to two pages.

For each problem context we performed ﬁve in-

dividual tests using ﬁve different sets of texts (in the

same conditions) and measured the F1-measure. The

ﬁnal result for a speciﬁc context is the average of the

F1-measure from the ﬁve individual tests.

The values of the dimension used to compute the

F1-measure were selected to ﬁgure out if the opti-

mum value for the dimension varies linearly or non-

linearly. To that end we considered the following val-

ues for the dimension:

√

, 2

√

, 3

√

and

, where n

represents the number of texts.

We ﬁrst studied the performance for 32 texts per

category (320 texts in total), and we achieved the val-

ues depicted in Figure 1. As we can see we have a

maximum for 106 dimensions. It is important to high-

light that this does not mean that the optimum value

for the dimension is 106, but that the F1-measure in-

creases and then decreases, with the optimum value

inside the interval ]64, 160[. In this particular case,

we also computed the F1-measure for a dimension of

because the F1-measure was still increasing at

!"#

$%#

%$#

&'#

!(&#

!&(#

&()#

"()#

*()#

+()#

!(()#

!(# $(# %(# "(# +(# !!(# !$(# !%(# !"(#

!"#$%&'()%*

+($,%)*-.*/0$%1'0-1'*

Figure 1: LSA performance for a set of 320 texts.

For the next test we used 64 texts per category

(640 texts in total), and we achieved the values de-

picted in Figure 2. As we can see we achieved a max-

imum for 128, and the interval for the optimum value

is ]75, 213[.

!"#

"$#

%"#

&!'#

!&(#

%")#

'$)#

'")#

*$)#

*")#

&$$)#

!$# %$# &!$# &%$# !!$#

!"#$%&'()%*

+($,%)*-.*/0$%1'0-1'*

Figure 2: LSA performance for a set of 640 texts.

AUTOMATIC ESTIMATION OF THE LSA DIMENSION

311

Finally, we used 128 texts per category (1280 texts

in total), producing the results depicted in Figure 3. In

this case we have a maximum for 107 dimensions and

an interval for the optimum value between ]71, 256[.

!"#

$%#

%&$#

'"(#

)'(#

*&+#

*"+#

,&+#

,"+#

%&&+#

'&# $&# %'&# %$&# ''&# '$&# !'&# !$&# )'&#

!"#$%&'()%*

+($,%)*-.*/0$%1'0-1'*

Figure 3: LSA performance for a set of 1280 texts.

By analyzing the obtained data, the ﬁrst conclu-

sion that we can take is that the number of dimensions

does not grow linearly with the number of texts. In-

deed, while the number of texts increase the number

of dimensions corresponds to a smaller percentage of

the total number of texts. In the ﬁrst case (320 texts)

the optimum value is between 20% and 50% of the

number of texts, in the second case (640 texts) is be-

tween 12% and 33%, and in the last case (1280 texts)

is between 6% and 20%.

After looking at various mathematical functions

we identiﬁed the square root as the one with the most

similar behavior. Based on that we studied some pos-

sibilities and achieved the following formula for the

estimation of the LSA dimension:

k = n

log(n

)

(1)

where n

is the number of texts.

If we now apply this formula to the previous three

cases we obtain the following values:

Table 1: Intervals for the optimum values of dimension, and

the values automatically estimated using equation 1.

# Texts Expected interval Estimated Value

320 ]64, 160[ 100

640 ]75, 213[ 155

1280 ]71, 256[ 235

As we can see all the estimated values belong to

the interval where the optimum value is. Moreover,

the literature mentioned that for problems where we

have more than 20000 texts the typical value recom-

mended for the dimension is between 200 and 2000.

Though, we can conclude that our formula estimates

values inside that interval.

4 EXPERIMENTAL EVALUATION

To evaluate our formula for the estimation of the ap-

propriate dimension for the LSA, we compared the re-

sults achieved by a classiﬁer trained with the classiﬁ-

cation produced by our bootstrapping algorithm and a

classiﬁer trained using texts classiﬁed manually. Our

goal was to check if our solution, where the over-

all bootstrapping step was automatized with the in-

clusion of the estimation of the LSA dimension, can

compete with a supervised solution, where exist a lot

of human intervention.

To perform the tests, we used the same set of texts

in both solutions, supervised and unsupervised. Then,

we compared the two classiﬁers using two distinct sit-

uations. One where we used 500 texts to train the

classiﬁers and 1000 for evaluation, and another where

we used 1000 texts to train and 500 for evaluation. In

both situations the two sets were disjoints.

!"#$

%!#$

!"#

$!"#

%!"#

&!"#

'!"#

(!"#

)!"#

*!"#

+!"#

,!"#

$!!"#

&'()*+,-).$ /0-'()*+,-).$

1234)5-'*)$

Figure 4: Performance results for supervised and unsuper-

vised solutions, using 500 texts to train and 1000 texts for

recognition.

Figure 4 shows the results achieved by both solu-

tions for the ﬁrst situation, while Figure 5 presents

the results for the second case. As we can see in

both cases our approach presents results similar to

the supervised solution. In the ﬁrst case we have a

F1-measure of 67% against 70% an in the second we

have 84% against 85%. Although in both cases the

F1-measure is smaller for the unsupervised solution,

the standard deviations intersect, meaning that our ap-

proach can achieve results comparable to supervised

solutions without their costs.

KDIR 2011 - International Conference on Knowledge Discovery and Information Retrieval

312

!"#$

!%#$

!"#

$!"#

%!"#

&!"#

'!"#

(!"#

)!"#

*!"#

+!"#

,!"#

$!!"#

&'()*+,-).$ /0-'()*+,-).$

1234)5-'*)$

Figure 5: Performance results for supervised and unsuper-

vised solutions, using 1000 texts to train and 500 texts for

recognition.

5 CONCLUSIONS

As we have seen unsupervised solutions have a boot-

strapping step where, in the majority of the cases, a

LSA algorithm is used. However, to properly take ad-

vantage of the LSA a good selection of its dimension

is crucial. In this paper we presented a solution, based

on a set of empirical studies, to automatically estimate

the most appropriated dimension for the LSA. By pro-

viding this estimation mechanism, we will allow peo-

ple without speciﬁc knowledge about the LSA algo-

rithm to use it parameterized with the correct values

to assure a good performance.

Indeed, from the experimental evaluation we can

conclude that our formula allows unsupervised solu-

tions based on the LSA to achieve results similar to

supervised methods.

ACKNOWLEDGEMENTS

This work was supported by FCT through the

PIDDAC Program funds (INESC-ID multiannual

funding) and the Crush project, PTDC/EIA-

EIA/108077/2008.

REFERENCES

Barak, L., Dagan, I., and Shnarch, E. (2009). Text catego-

rization from category name via lexical reference. In

NAACL ’09: Proceedings of Human Language Tech-

nologies: The 2009 Annual Conference of the North

American Chapter of the Association for Computa-

tional Linguistics, Companion Volume: Short Papers,

pages 33–36, Morristown, NJ, USA. Association for

Computational Linguistics.

Berry, M. (1992). Large scale sparse singular value compu-

tations. International Journal of Supercomputer Ap-

plications, 6:13–49.

Debole, F. and Sebastiani, F. (2003). Supervised term

weighting for automated text categorization. In Pro-

ceedings of SAC-03, 18th ACM Symposium on Applied

Computing, pages 784–788. ACM Press.

Gliozzo, A., Strapparava, C., and Dagan, I. (2005). Inves-

tigating unsupervised learning for text categorization

bootstrapping. In HLT ’05: Proceedings of the confer-

ence on Human Language Technology and Empirical

Methods in Natural Language Processing, pages 129–

136, Morristown, NJ, USA. Association for Computa-

tional Linguistics.

Huang, Y. (2001). Support vector machines for text catego-

rization based on latent semantic indexing. Technical

report, Electrical and Computer Engineering Depart-

ment, The Johns Hopkins University.

Ishii, N., Murai, T., Yamada, T., and Bao, Y. (2006).

Text classiﬁcation by combining grouping, lsa and

knn. In ICIS-COMSAR ’06: Proceedings of the 5th

IEEE/ACIS International Conference on Computer

and Information Science and 1st IEEE/ACIS Interna-

tional Workshop on Component-Based Software Engi-

neering,Software Architecture and Reuse, pages 148–

154, Washington, DC, USA. IEEE Computer Society.

Joachims, T. (1997). A probabilistic analysis of the rocchio

algorithm with tﬁdf for text categorization. In ICML

’97: Proceedings of the Fourteenth International Con-

ference on Machine Learning, pages 143–151, San

Francisco, CA, USA. Morgan Kaufmann Publishers

Inc.

Landauer, T., Foltz, P., and Laham, D. (1998). An intro-

duction to latent semantic analysis. In Discourse Pro-

cesses, pages 259–284.

Liu, B., Li, X., Lee, W., and Yu, P. (2004). Text classiﬁca-

tion by labeling words. In AAAI’04: Proceedings of

the 19th national conference on Artiﬁcal intelligence,

pages 425–430. AAAI Press.

Salton, G. and Buckley, C. (1988). Term-weighting ap-

proaches in automatic text retrieval. Information Pro-

cessing and Management, 24(5):513–523.

AUTOMATIC ESTIMATION OF THE LSA DIMENSION

313