A SEMANTIC CLUSTERING APPROACH FOR INDEXING
DOCUMENTS
Daniel Osuna-Ontiveros, Ivan Lopez-Arevalo and Victor Sosa-Sosa
Information Technology Laboratory, CINVESTAV - IPN, Av. Victoria-Soto La Marina Km. 5.5
Ciudad Victoria, Tamaulipas, Mexico
Keywords:
Indexing models, Information retrieval, Semantic clustering, Semantic search.
Abstract:
Information retrieval (IR) models process documents to prepare them for search by humans or computers.
In early models, the general idea was to perform lexico-syntactic processing of documents, where the
importance of a document retrieved by a query is based on the frequency of the query terms in the document.
Another approach is to return predefined documents based on the type of query the user makes. Recently, some
researchers have combined text mining techniques to enhance document retrieval. This paper proposes
a semantic clustering approach to improve traditional information retrieval models by representing the topics
associated with documents. The proposal combines text mining algorithms and natural language processing.
The approach does not use a priori queries; instead, it clusters terms, where each cluster is a set of related words
according to the content of the documents. As a result, a document-topic matrix representation is obtained, denoting
the importance of topics inside documents. For query processing, each query is represented as a set of clusters
considering its terms. Thus, a similarity measure (e.g., cosine similarity) can be applied over this vector and the
document-topic matrix to retrieve the most relevant documents.
1 INTRODUCTION
The current growth of electronic information has made it difficult for people
to find information. Representation models such as the Boolean representation
or the vector space model (VSM) (Salton et al., 1975) have been proposed to
represent documents in order to make information readable by computers. The
disadvantage of these models is that their search considers only the terms of
a query. Other models, such as the probabilistic model (Robertson and Jones,
1976) and latent semantic indexing (LSI) (Deerwester et al., 1990), use a
mathematical approach to find hidden relations between the terms in documents.
Some modifications to these models have been suggested in order to improve
their performance.
With the aim of enhancing the above approaches, some text mining algorithms
have been applied to extract knowledge from documents.
Other approaches, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003)
and Clustering by Committee (CBC) (Pantel, 2003), were proposed to recover
topics from a set of documents. These algorithms process documents to cluster
terms, where clusters can be seen as topics, i.e., sets of related terms.
Griffiths and Steyvers (Griffiths and Steyvers, 2004) presented a statistical
inference algorithm for LDA, using scientific papers to test their proposal.
In Bioinformatics, Konietzny et al. (Konietzny et al., 2011) applied LDA to
identify functional modules of protein families. Their method explores the
co-occurrence patterns of protein families across a collection of sequence
samples to infer a probabilistic model of arbitrarily-sized functional modules.
Other approaches have been used to improve probabilistic information retrieval
models. Lafferty and Zhai (Lafferty and Zhai, 2001) propose a framework for
information retrieval that combines document models and query models using a
function based on Bayesian decision theory. In their proposal, a language
model is estimated for each document, as well as for each query, and the
retrieval problem is cast in terms of risk minimization. Ponte and Croft
(Ponte and Croft, 1998) present a language modeling approach to improve the
weighting proposed by Salton et al. (Salton et al., 1975). They use the mean
probability of terms as an estimator to model the relevance of a term in a
document.
This paper proposes a semantic information re-
trieval model based on topics for document retrieval.
The remainder of the paper is structured as follows. Section 2 presents the
background on information retrieval models. Section 3 describes the process
to build the model representation. Section 4 shows the results of an
experiment to demonstrate the performance of the model. Finally, conclusions
and remarks are given in Section 5.
2 BACKGROUND
This section shows the theoretical basis of this work
and some proposals reported in the literature.
2.1 Information Retrieval
Information retrieval is defined as the modeling, design, and implementation
of systems that provide quick and effective access to the contents of
documents (Manning et al., 2008). The purpose of information retrieval systems
is to represent documents in order to estimate their relevance to a user's
search.
Some of the most widely used approaches for document representation are:
Boolean Representation (BR) is a classical model for information retrieval in
which documents and terms are represented by an incidence matrix. The order
of terms is not relevant in this model; what matters is only whether a term
occurs in a document or not. Using the Boolean representation, it is easy to
know whether the terms of a query appear in a document. The disadvantage of
this model is its inability to determine which document is more relevant with
respect to the query.
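To make the contrast concrete, the following is a minimal sketch of Boolean retrieval over an incidence matrix; the documents, terms, and AND-only query semantics are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of Boolean retrieval over a term-document incidence
# matrix; the documents and terms below are hypothetical.
docs = {
    "doc1": {"steel", "iron", "trade"},
    "doc2": {"coffee", "trade"},
    "doc3": {"steel", "price"},
}
terms = sorted({t for words in docs.values() for t in words})
# incidence[d][i] is 1 if term i occurs in document d, 0 otherwise.
incidence = {d: [1 if t in words else 0 for t in terms]
             for d, words in docs.items()}

def boolean_and(query_terms):
    """Return documents that contain every query term (AND semantics)."""
    idx = [terms.index(t) for t in query_terms if t in terms]
    return [d for d, row in incidence.items() if all(row[i] for i in idx)]

print(boolean_and(["steel", "trade"]))  # ['doc1'] -- matched, but not ranked
```

Every matching document is returned with equal status, which is precisely the ranking limitation described above.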
The Vector Space Model (VSM) (Salton et al.,
1975) is based on the boolean representation. The
vector space model uses a document-term matrix
to represent the importance of each term in each
document. This importance is based on the fre-
quency of terms in documents. The TF·IDF weight is often used to normalize
this matrix. TF (Term Frequency) and IDF (Inverse Document Frequency) are
shown in Equations 1 and 2, respectively:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \qquad (1)$$

$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{\, j : t_i \in d_j \,\}|} \qquad (2)$$

where $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in
document $d_j$, $\sum_{k} n_{k,j}$ is the sum of the occurrences of all terms
in document $d_j$, and $|\{\, j : t_i \in d_j \,\}|$ is the number of
documents in which term $t_i$ appears.
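As a minimal sketch of Equations 1 and 2 (the corpus and counts are toy assumptions):

```python
import math

# docs[j] maps each term t to n_{t,j}, its raw count in document j.
docs = [
    {"steel": 3, "trade": 1},
    {"coffee": 2, "trade": 2},
    {"steel": 1, "price": 4},
]

def tf(term, doc):
    # Equation 1: term count divided by the total count of all terms.
    return doc.get(term, 0) / sum(doc.values())

def idf(term, corpus):
    # Equation 2: log of |D| over the number of documents containing the term.
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / df) if df else 0.0

# TF-IDF weight of "steel" in the first document.
print(tf("steel", docs[0]) * idf("steel", docs))
```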
The Probabilistic Model (Robertson and Jones, 1976) uses statistical
techniques to assign weights to documents. This model separates relevant from
irrelevant documents and, based on the obtained documents, sorts them taking
their importance into account.
Latent Semantic Indexing (LSI) (Deerwester et al., 1990) is an information
retrieval method that attempts to capture the hidden structure among terms by
using techniques from linear algebra. Vectors representing the documents are
projected into a new low-dimensional space obtained by singular value
decomposition of a term-document matrix $A$. This low-dimensional space is
spanned by the eigenvectors of $A^{T}A$ that correspond to the few largest
eigenvalues, and thus to the hidden correlations between terms. Queries are
also projected and processed in this low-dimensional space.
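A minimal sketch of the LSI projection, assuming a toy term-document matrix and a rank-2 truncation (all values are illustrative):

```python
import numpy as np

# Toy term-document matrix A: rows are terms, columns are documents.
A = np.array([[3., 0., 1.],
              [1., 2., 0.],
              [0., 2., 4.]])
k = 2  # dimensionality of the latent space

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]  # rank-k truncation

# Each column of Vtk is a document in the latent space; a query is
# "folded in" with q_k = Sigma_k^{-1} U_k^T q and compared there.
q = np.array([1., 0., 1.])
q_k = np.diag(1.0 / sk) @ Uk.T @ q
print(q_k, Vtk)
```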
2.2 Text Mining
Text mining is a branch that emerges from data mining and seeks knowledge in
text documents. Text mining is used in information retrieval, document
summarization, categorization of documents, clustering of terms/documents,
etc. The following text mining algorithms are the most relevant to this work.
Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is an algorithm to
obtain the most representative terms within a corpus. LDA is a probabilistic
approach based on a Bayesian model for associating words to topics. This
method is based on the idea that each document refers to a number of topics,
and it uses probability models to define the membership of each term with
respect to topics. LDA can be used for term reduction and disambiguation:
terms that are not associated to any topic are considered irrelevant to the
corpus. LDA generates topics as follows:
1. Choose $\theta_i \sim \mathrm{Dir}(\alpha)$, where $i \in \{1, \ldots, M\}$.
2. For each of the words $w_{ij}$, where $j \in \{1, \ldots, N_i\}$:
   (a) Choose a topic $z_{ij} \sim \mathrm{Multinomial}(\theta_i)$.
   (b) Choose a word $w_{ij} \sim \mathrm{Multinomial}(\beta_{z_{ij}})$.
where:
$\alpha$ is the parameter of the uniform Dirichlet prior on the per-document
topic distributions,
$\beta$ is the parameter of the uniform Dirichlet prior on the per-topic word
distribution,
$\theta_i$ is the topic distribution for document $i$,
$z_{ij}$ is the topic for the $j$-th word in document $i$, and
$w_{ij}$ is the specific word.
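The generative process above can be simulated directly. The following is a minimal sketch with toy hyperparameters; all sizes and values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M, N = 3, 8, 2, 10        # topics, vocabulary size, documents, words/doc
alpha, beta = 0.5, 0.1          # symmetric Dirichlet hyperparameters

# Per-topic word distributions: one Dirichlet draw per topic.
phi = rng.dirichlet([beta] * V, size=K)

for i in range(M):
    theta_i = rng.dirichlet([alpha] * K)   # step 1: theta_i ~ Dir(alpha)
    for j in range(N):
        z_ij = rng.choice(K, p=theta_i)    # step 2(a): topic z_ij
        w_ij = rng.choice(V, p=phi[z_ij])  # step 2(b): word w_ij
        print(f"doc {i}, position {j}: topic {z_ij}, word id {w_ij}")
```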
Clustering by Committee (CBC) (Pantel, 2003) is an algorithm used to cluster
terms. CBC organizes documents by topics to discover concepts and meanings of
words. The difference with LDA is that CBC analyzes relations between terms
(e.g., verb-noun), while LDA only processes terms. It has been used to obtain
related terms from a set of documents (Lin, 1998). CBC works as follows (a
simplified sketch of the committee selection appears after the list):
1. For each element, the most similar elements are obtained to create n
clusters.
2. A list of committees is obtained:
(a) For each element, cluster the top similar elements from the similarity
database using average-link clustering.
(b) For each cluster, add its centroid to the committee list if the
centroid's similarity to the centroid of each previously added committee is
below a threshold $\theta_1$.
(c) Add an element to a list of residues if its similarity to every committee
is below a threshold $\theta_2$.
(d) When the list of residues is empty, return the committee list.
3. Groups are created, where the committees created in the previous phase are
the centroids of these groups.
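Below is a much-simplified sketch of the committee-selection idea in step 2(b): a candidate centroid is accepted only if it is sufficiently dissimilar from every committee accepted so far. The centroids and threshold are illustrative assumptions; this is not Pantel's full algorithm.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_committees(centroids, theta1=0.35):
    """Keep a centroid as a committee only if its similarity to every
    previously accepted committee is below the threshold theta1."""
    committees = []
    for c in centroids:
        if all(cosine(c, k) < theta1 for k in committees):
            committees.append(c)
    return committees

centroids = [np.array([1.0, 0.0, 0.0]),
             np.array([0.9, 0.1, 0.0]),   # near-duplicate, gets rejected
             np.array([0.0, 1.0, 0.0])]
print(len(select_committees(centroids)))  # 2
```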
3 METHODOLOGY
This section presents the steps to build the proposed document-topic matrix.
The first part of the methodology addresses the representation of documents;
the second part deals with processing a user's query. The steps are
illustrated in Figure 1.
3.1 Document Representation
First, it is necessary to obtain the content of the documents to be
processed. A good practice is to discard stopwords, i.e., words that
contribute little to the meaning of documents. The proposed representation
model only considers verb-noun relations for processing, since nouns that
have no associated verb provide little semantic importance to the document.
For this task, the Stanford tagger (Klein and Manning, 2003) is used in the
implementation. Verb-noun relations are obtained for each input file, and all
relations are integrated into one file. This file is the input to a modified
version of the CBC algorithm. With CBC, the terms can be grouped based on
verb-noun relations. The proposed approach assumes that the groups generated
by CBC are topics and that the similarity of a term with respect to its
centroid corresponds to the relevance of the term to the topic. This creates
a topic-term matrix that represents the importance of terms for each topic,
as shown in Table 1.
Table 1: Topic-term matrix.
Topic \ Term   Term-1  Term-2  Term-3  Term-4  Term-5
Topic-1 0.102 0.016 0.000 0.123 0.000
Topic-2 0.000 0.012 0.130 0.100 0.030
Topic-3 0.063 0.002 0.023 0.102 0.002
Topic-4 0.030 0.123 0.020 0.021 0.015
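As a rough illustration of the verb-noun extraction step described above, the sketch below uses NLTK's POS tagger as a stand-in for the Stanford tagger used in the paper, and naively pairs adjacent verbs and nouns; a real dependency analysis would be more precise.

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

def verb_noun_pairs(text):
    """Naive verb-noun extraction: pair a verb with an immediately
    following noun (a simplification of real verb-noun relations)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [(w1, w2) for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
            if t1.startswith("VB") and t2.startswith("NN")]

print(verb_noun_pairs("The company exports steel and imports coffee."))
# e.g. [('exports', 'steel'), ('imports', 'coffee')]
```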
Subsequently, LDA is applied to cluster terms. Terms that were not added to
any cluster (outliers) and stopwords are removed from the documents. In this
approach, every document is seen as a bag of words. From this set of bags of
words, a frequency matrix is created, which denotes the frequency with which
terms appear in documents, as shown in Table 2.
Table 2: Term-document matrix.
Term \ Document   Doc-1  Doc-2  Doc-3
Term-1 1 4 0
Term-2 2 2 4
Term-3 3 3 2
Term-4 0 0 2
Term-5 4 2 3
Then, a topic-document matrix ($\gamma$) is obtained as the product of the
topic-term matrix ($\alpha$) and the term-document matrix ($\beta$), as shown
in Equation 3:

$$\gamma = \alpha \cdot \beta \qquad (3)$$

The topic-document matrix is transposed to obtain a document-topic matrix, as
shown in Table 3. The matrix is transposed to facilitate the processing of
queries. This matrix is stored as an index.
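A minimal sketch of Equation 3 using matrices shaped like Tables 1 and 2 (the values are the illustrative ones from those tables):

```python
import numpy as np

# Topic-term matrix (Table 1) and term-document matrix (Table 2).
alpha = np.array([[0.102, 0.016, 0.000, 0.123, 0.000],
                  [0.000, 0.012, 0.130, 0.100, 0.030],
                  [0.063, 0.002, 0.023, 0.102, 0.002],
                  [0.030, 0.123, 0.020, 0.021, 0.015]])
beta = np.array([[1, 4, 0],
                 [2, 2, 4],
                 [3, 3, 2],
                 [0, 0, 2],
                 [4, 2, 3]])

gamma = alpha @ beta   # Equation 3: topic-document matrix (topics x documents)
index = gamma.T        # document-topic matrix stored as the index
print(index.shape)     # (3 documents, 4 topics)
```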
[Figure 1: Proposed information retrieval approach. Document representation:
obtain documents, get verb-noun relations, apply CBC to cluster terms into
topics, and store the index. Query processing: obtain the topics related to
the query, apply a similarity measure against the index, and return ranked
documents. WordNet appears as a supporting resource.]
Table 3: Document-topic matrix.
Document \ Topic   Topic-1  Topic-2  Topic-3  Topic-4
Doc-1 0.343 0.232 0.238 0.264
Doc-2 0.440 0.474 0.329 0.456
Doc-3 0.340 0.598 0.264 0.619
3.2 Query Processing
In a similar way to how the document-topic matrix was created, the terms of a
query are processed to obtain their relevance to each topic. The goal is to
create a vector that represents the topics of the query. For example, the
relation of a query containing the terms term-1 and term-2 to each topic,
alongside the documents, can be seen in Table 4.
Table 4: Document-topic matrix with the query represented
as a topic vector.
Document \ Topic   Topic-1  Topic-2  Topic-3  Topic-4
Doc-1 0.343 0.232 0.238 0.264
Doc-2 0.440 0.474 0.329 0.456
Doc-3 0.340 0.598 0.264 0.619
Query 0.160 0.142 0.025 0.143
In this way, the cosine similarity measure can be applied between the
query-topic vector and each document-topic vector to retrieve the most
representative documents.
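A minimal sketch of this ranking step, using the illustrative values of Table 4:

```python
import numpy as np

# Document-topic matrix and query-topic vector (values from Table 4).
index = np.array([[0.343, 0.232, 0.238, 0.264],   # Doc-1
                  [0.440, 0.474, 0.329, 0.456],   # Doc-2
                  [0.340, 0.598, 0.264, 0.619]])  # Doc-3
query = np.array([0.160, 0.142, 0.025, 0.143])

# Cosine similarity between the query vector and every document vector.
sims = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query))
ranking = np.argsort(-sims)  # document indices, most similar first
print(sims, ranking)
```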
4 PRELIMINARY RESULTS
We conducted an experiment to evaluate the approach by issuing 31 queries (a
sample size chosen according to the Central Limit Theorem (Fischer, 2011))
over the Reuters corpus. This corpus has 12,902 documents about 116 topics.
The results were compared against the BR and VSM models using precision,
recall, and F-measure. For precision, the 25 most relevant documents were
retrieved for each query (following the study of Sánchez (Sánchez, 2009)) in
order to measure the ranking precision of the models. For recall, a threshold
of 0.2 was used in VSM and in the proposal in order to measure the ranking
recall of the models.
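For reference, a minimal sketch of how the three evaluation metrics are computed from a retrieved set and a relevant set (the sets are hypothetical):

```python
def precision_recall_f(retrieved, relevant):
    """Standard set-based precision, recall, and F-measure."""
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

print(precision_recall_f({"d1", "d2", "d3"}, {"d2", "d3", "d4", "d5"}))
# (0.666..., 0.5, 0.571...)
```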
It is important to remark that the queries are based on the idea of
topic-based search: a resulting document is considered relevant even if it
does not contain a specific query term. For example, a query like "steel"
could return documents with information related to steel, containing terms
such as "iron" or "metal", which are closely related to "steel".
As can be seen from the results in Figures 2, 3, and 4, in general this
approach performs with higher effectiveness than BR and VSM. Table 5 shows
the average precision, recall, and F-measure obtained by the models.
[Figure 2: Precision of 31 queries using BR, VSM, and the proposal. The
queries are: aluminium, cocoa, coffee, cotton, heat, interesting, housing,
money, platinum, potato, propane, silver, trade, zinc, meal, oil, rubber,
fuel, rapeseed, ship, barley, wheat, sorghum, crude, copper, wool, soybean,
corn, income, dollar, and earn; a final average is also plotted.]
[Figure 3: Recall of 31 queries using BR, VSM, and the proposal.]
[Figure 4: F-Measure of 31 queries using BR, VSM, and the proposal.]
Table 5: Average precision, recall, and F-measure for BR, VSM, and the
proposal.
Metric BR VSM Proposed Approach
Precision 0.828 0.917 0.951
Recall 0.748 0.491 0.660
F-Measure 0.733 0.563 0.743
Moreover, for specific queries, the proposed model showed an uneven behavior.
Regarding precision, it was observed that while BR and VSM returned few
documents (10 or fewer) for a query, the proposed approach returned more
relevant results (more than 20). This is because the proposed IR model does
not require a document to contain the terms of the query. Regarding recall,
the Boolean representation obtained the best result, because BR returned a
large number of documents (more than 300) for each query.
5 CONCLUSIONS
An unsupervised approach for indexing documents is proposed in this paper.
The proposal combines text mining and natural language processing to obtain a
document-topic matrix representation for a set of documents. First, verb-noun
relations are obtained using a POS tagger. Then the Clustering by Committee
algorithm is used to group terms according to these verb-noun relations.
After that, Latent Dirichlet Allocation is applied to obtain the most
relevant terms. The parameters for LDA are obtained without human
intervention. According to the experiments, in general, the proposal performs
better in topic-based semantic searches than traditional models (Boolean and
vector space model). Future work includes applying this semantic processing
to web search or to the analysis of tweets/posts.
ACKNOWLEDGEMENTS
This research was partially funded by project number 165474 from the "Fondo
Mixto Conacyt-Gobierno del Estado de Tamaulipas".
REFERENCES
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
Dirichlet Allocation. Journal of Machine Learning
Research, 3:993–1022.
Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas,
G. W., and Harshman, R. A. (1990). Indexing by latent
semantic analysis. Journal of the American Society of
Information Science, 41:391–407.
Fischer, H. (2011). Conclusion: The central limit theorem
as a link between classical and modern probability
theory. In A History of the Central Limit Theorem,
Sources and Studies in the History of Mathematics
and Physical Sciences, pages 353–362. Springer New
York.
Griffiths, T. L. and Steyvers, M. (2004). Finding scientific
topics. Proceedings of the National Academy of Sci-
ence, 101:5228–5235.
Klein, D. and Manning, C. (2003). Accurate unlexicalized
parsing. In Proceedings of the 41st Meeting of the
Association for Computational Linguistics.
Konietzny, S. G. A., Dietz, L., and McHardy, A. C. (2011).
Inferring functional modules of protein families with
probabilistic topic models. BMC Bioinformatics,
12:141.
Lafferty, J. D. and Zhai, C. (2001). Document language
models, query models, and risk minimization for in-
formation retrieval. In Croft, W. B., Harper, D. J.,
Kraft, D. H., and Zobel, J., editors, SIGIR, pages 111–
119. ACM.
Lin, D. (1998). Automatic retrieval and clustering of similar
words. In Proceedings of the 17th international con-
ference on Computational linguistics, pages 768–774,
Morristown, NJ, USA. Association for Computational
Linguistics.
Manning, C. D., Raghavan, P., and Schütze, H. (2008). Introduction to
Information Retrieval. Cambridge University Press.
Pantel, P. A. (2003). Clustering by committee. PhD the-
sis, University of Alberta Edmonton. Adviser-Dekang
Lin.
Ponte, J. and Croft, B. (1998). A language modeling ap-
proach to information retrieval. In Proceedings of the
21st International Conference on Research and De-
velopment in Information Retrieval.
Robertson, S. E. and Jones, K. S. (1976). Relevance weight-
ing of search terms. Journal of the American Society
for Information Science, 27:129–146.
Salton, G., Wong, A., and Yang, C.-S. (1975). A vector space model for
automatic indexing. Communications of the ACM, 18(11):613–620.
Sánchez, D. (2009). Domain ontology learning from the web. The Knowledge
Engineering Review, 24(4):413–413.