A TEXT CLASSIFICATION METHOD BASED ON LATENT TOPICS
Yanshan Wang and In-Chan Choi
Department of Industrial Management Engineering, Korea University, Anamdong, Seoul, Republic of Korea
Keywords: Text Classification, Latent Topic, Indexing by LDA.
Abstract: Latent Dirichlet Allocation (LDA) is a generative model that outperforms other topic modelling algorithms in extracting the latent topics of text data. Indexing by LDA is a recent method, built on LDA, that provides a new definition of document probability vectors which can serve as feature vectors. In this paper, we propose a joint text classification process that combines DBSCAN, indexing by LDA, and Support Vector Machines (SVM). The DBSCAN algorithm is applied as a pre-processing step for LDA to determine the number of topics, and the resulting LDA document indexing features are then fed to an SVM text classifier.
1 INTRODUCTION
For decades, the amount of information on the Internet has been growing explosively. Among the many forms of information representation, text data holds a crucial position: it helps people obtain valuable information with reduced network data flow. The task of categorizing natural language texts into topical categories has therefore become a key factor in organizing online information.
Many machine learning and statistical language processing methods have been applied to text classification. Relevant approaches include decision trees, k-nearest neighbors (kNN) (Cover and Hart, 1967), neural networks (Wiener, Pedersen, and Weigend, 1995), Naïve Bayes, and support vector machines (SVM) (Cortes and Vapnik, 1995). Joachims (1998) was the first to apply SVM to text classification; he used the inverse document frequency as a feature vector and reported the computational performance of SVM.
However, a major problem in text classification is that the captured text data often lie in a high-dimensional feature space. Efficient dimension reduction thus remains a significant challenge for classification methods. One efficient dimension reduction technique is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), which was originally designed for topic and document modelling. It is a generative model that aims to represent text documents in terms of latent topics.
2 OVERVIEW OF OUR METHOD
In this paper, indexing by LDA (Choi and Lee, 2010) is used for document modelling. Because the document probability proposed by Choi and Lee (2010) is easy to obtain, it is used to reduce any document to a fixed set of real-valued features.
The number of latent topics k plays a crucial role when LDA generates latent topics for a corpus, so an effective method for determining k is needed. In this paper, the DBSCAN clustering idea is adopted for that purpose.
Our proposed method for text classification consists of the following three steps (a sketch of the pipeline follows the list):
1. Topic number determination step. DBSCAN is used to acquire the number of clusters k of the feature words.
2. Document modelling step. The indexing by LDA model is used with the number of topics k to obtain the feature vector of each document.
3. Classification step. Using the document probability as the feature vector of each document, an SVM classifier is trained to generate a model for the corpus.
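A minimal sketch of this pipeline is given below. The use of scikit-learn is an illustrative assumption, and lda_document_probabilities is a hypothetical helper standing in for the indexing-by-LDA step of Section 3.2, not code from the paper.

    # Sketch of the three-step pipeline; library choices are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN
    from sklearn.svm import SVC

    def classify(train_docs, train_labels, test_docs):
        # Step 1: cluster the feature words with DBSCAN to obtain k.
        tfidf = TfidfVectorizer()
        word_vectors = tfidf.fit_transform(train_docs).T  # words as rows, Eq. (1)
        labels = DBSCAN(metric="cosine", eps=0.4, min_samples=5).fit_predict(word_vectors)
        k = len(set(labels) - {-1})                       # -1 marks noise points

        # Step 2: k-dimensional document probability vectors, Eq. (3).
        train_feats = lda_document_probabilities(train_docs, k)  # hypothetical helper
        test_feats = lda_document_probabilities(test_docs, k)    # hypothetical helper

        # Step 3: train an RBF-kernel SVM on the document features.
        clf = SVC(kernel="rbf").fit(train_feats, train_labels)
        return clf.predict(test_feats)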
This paper is organized as follows: the components of the proposed model, including the DBSCAN algorithm, indexing by LDA, and the SVM algorithm, are presented in the next section. Section 4 reports preliminary computational results on a benchmark data set. The last section presents concluding remarks with some future work.
3 MODEL COMPONENTS
3.1 DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Ester et al., 1996) is a density-based clustering algorithm designed to discover clusters of arbitrary shape. As in many document clustering algorithms, in which documents and words are represented by the vector space model (Van Rijsbergen, 1979), each document is considered a vector in the term space, represented by the term frequency (TF) vector $d_j = (tf_{1j}, tf_{2j}, \ldots, tf_{Vj})$, where $tf_{ij}$ is the frequency of term $i$ in document $j$.
Just as each document can be represented by a vector of tf-idf weights, each word is denoted inversely as a vector over the M documents:

$w_i = (tf_{i1} \times idf_i,\; tf_{i2} \times idf_i,\; \ldots,\; tf_{iM} \times idf_i)$   (1)

where $idf_i = \log(M / df_i)$ and $df_i$ is the number of documents that contain word $i$.
The distance between two words is defined through the cosine similarity measure, i.e.

$dis(w_i, w_j) = \cos(w_i, w_j) = \dfrac{w_i \cdot w_j}{\|w_i\|\,\|w_j\|}.$   (2)

The cosine value is 1 when two words are identical, and 0 if there is nothing in common between them.
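As an illustration of Eqs. (1)-(2) and the word clustering step, the following sketch builds tf-idf word vectors and clusters them with DBSCAN under the cosine metric. The use of scikit-learn and the eps/min_samples values are assumptions, not parameters from the paper.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN

    docs = ["the cat sat", "the dog sat", "graphics card driver", "video card"]

    # Rows of the tf-idf matrix are documents; columns are terms.
    # Transposing gives one tf-idf vector per *word*, as in Eq. (1).
    W = TfidfVectorizer().fit_transform(docs).T.toarray()

    # DBSCAN's cosine metric uses the distance 1 - cos(w_i, w_j),
    # consistent with the similarity of Eq. (2).
    labels = DBSCAN(metric="cosine", eps=0.5, min_samples=2).fit_predict(W)
    k = len(set(labels) - {-1})   # cluster count, excluding noise (-1)
    print(f"{k} word clusters -> number of latent topics")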
3.2 Indexing by LDA
The Latent Dirichlet Allocation (LDA) method is based on the "bag-of-words" assumption that the order of words in a document can be neglected and that the documents in a corpus share some latent topics (Blei et al., 2003).
In LDA, a document of size N is defined as a sequence of N words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, where $w_n$ is the n-th word in the sequence, drawn from a vocabulary of V words. A corpus is defined as a collection of M documents denoted by $C = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$. LDA generates a topic model from the given documents by inferring the document-level topic mixture $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, a K-dimensional vector. A set of N word-level topic assignments is defined as $z = (z_1, z_2, \ldots, z_N)$, where each $z_n$ takes one of the K topics.
The document generating mechanism assumed by LDA consists of three steps:
1) Choose the number of words $N \sim \mathrm{Poisson}(\xi)$.
2) Choose $\theta \sim \mathrm{Dirichlet}(\alpha)$.
3) For each of the N words $w_n$:
   a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   b) Choose a word $w_n \sim p(w_n \mid z_n, \beta)$, a multinomial distribution conditioned on the topic $z_n$.
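The following is a direct transcription of this generative mechanism; the values of K, V, the Poisson rate, and the Dirichlet parameters are illustrative, not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    K, V, xi = 3, 10, 8                       # topics, vocabulary size, Poisson rate
    alpha = np.full(K, 0.1)                   # Dirichlet prior on topic mixtures
    beta = rng.dirichlet(np.full(V, 0.1), K)  # K x V word distribution per topic

    def generate_document():
        N = rng.poisson(xi)               # 1) document length N ~ Poisson(xi)
        theta = rng.dirichlet(alpha)      # 2) topic mixture theta ~ Dirichlet(alpha)
        words = []
        for _ in range(N):                # 3) for each of the N words:
            z = rng.choice(K, p=theta)    #    a) topic z_n ~ Multinomial(theta)
            w = rng.choice(V, p=beta[z])  #    b) word w_n ~ p(w | z_n, beta)
            words.append(w)
        return words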
Choi and Lee (2010) make direct use of the beta matrix in LDA to obtain the document probability:

$D_{ik} \cong \widehat{D}_{ik} = \sum_{j=1}^{V} p(z_k \mid w_j, d_i)\, p(w_j \mid d_i) = \sum_{j=1}^{V} W_{kj}\, \dfrac{n_{ij}}{N_{d_i}},$   (3)

where $W_{kj}$ represents the probability of the word $w_j$ embodying the k-th concept $z_k$, $n_{ij}$ denotes the number of occurrences of word $w_j$ in document $d_i$, and $N_{d_i}$ denotes the number of words in document $d_i$, i.e. $N_{d_i} = \sum_{j=1}^{V} n_{ij}$.
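Under this reading, Eq. (3) amounts to a matrix product between the topic-word matrix W (LDA's beta) and the row-normalized term counts; a minimal sketch:

    import numpy as np

    def document_probabilities(beta, counts):
        """beta: K x V matrix with beta[k, j] = W_kj, the probability that
        word w_j embodies concept z_k. counts: M x V matrix with
        counts[i, j] = n_ij. Returns the M x K matrix of document
        probability vectors D_ik of Eq. (3). Assumes no empty documents."""
        N_d = counts.sum(axis=1, keepdims=True)  # N_{d_i} = sum_j n_ij
        p_w_given_d = counts / N_d               # p(w_j | d_i) = n_ij / N_{d_i}
        return p_w_given_d @ beta.T              # D_ik = sum_j W_kj n_ij / N_{d_i}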
3.3 Support Vector Machines
Support Vector Machines (SVM) are a type of learning system developed by Vapnik (1995) based on the structural risk minimization principle from statistical learning theory. The document probability matrix obtained from the indexing by LDA model is used as its input.
Here, the RBF kernel function (Aizerman et al.,
1964) is utilized. The classification decision is
obtained by the following equation:
$f(x) = \mathrm{sgn}\left( \sum_i y_i \alpha_i \exp\left( -\dfrac{\|x - x_i\|^2}{2\sigma^2} \right) + b \right).$   (4)
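With the document probability matrix of Eq. (3) as input, the classification step is a standard RBF-kernel SVM. A minimal sketch with scikit-learn follows, on stand-in random data; the gamma parameter plays the role of $1/(2\sigma^2)$ in Eq. (4).

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report

    # Stand-in inputs: document probability matrices (Eq. 3) and labels.
    rng = np.random.default_rng(0)
    D_train, y_train = rng.dirichlet(np.ones(15), 200), rng.integers(0, 4, 200)
    D_test, y_test = rng.dirichlet(np.ones(15), 50), rng.integers(0, 4, 50)

    # RBF-kernel SVM; gamma corresponds to 1/(2*sigma^2) in Eq. (4).
    clf = SVC(kernel="rbf", gamma="scale").fit(D_train, y_train)
    print(classification_report(y_test, clf.predict(D_test)))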
4 PRELIMINARY RESULTS
In this section, the results of a preliminary experiment on real data are provided. The proposed method was tested on a subset consisting of the first four categories of the 20news-bydate-matlab corpus (available at: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate-matlab.tgz). Table 1 shows the numbers of training and testing documents in the four categories. This subset contains 3710 documents, 56222 words, and 4 categories. The first 2220 documents were used to train the classifier, and the categories of the remaining 1490 documents were predicted to test the proposed model.
Table 1: Number of Training and Testing Items.

Category Name              Num Train  Num Test
alt.atheism                480        318
comp.graphics              581        391
comp.os.ms-windows.misc    572        391
comp.sys.ibm.pc.hardware   587        390
All                        2220       1490
The DBSCAN clustering algorithm resulted in 15 clusters. In order to assess the effectiveness of the algorithm, the proposed method without DBSCAN was also tested by varying the number of latent topics; the result is shown in Figure 1.
Figure 1: Accuracy according to different numbers of latent topics.
The DBSCAN algorithm obtained a satisfactory result, even though the highest accuracy was achieved when the number of topics was 14. The precision and recall of the proposed method are shown in Table 2.
Table 2: Precision and Recall.

Category Name              Precision  Recall
alt.atheism                93.7%      99.3%
comp.graphics              97.1%      94.3%
comp.os.ms-windows.misc    92.5%      92.1%
comp.sys.ibm.pc.hardware   92.2%      88.2%
Accuracy                   93.2%
Computational comparison with other classification algorithms remains for future study.
5 CONCLUSIONS
This paper presented a joint text classification process based on the generative model LDA and Support Vector Machines. The DBSCAN clustering algorithm is used to obtain the number of latent topics. A preliminary experiment was carried out on a small-scale corpus. Further research on applications to other, larger-scale data is necessary. Moreover, other classification algorithms should also be tested on this corpus in comparison with the proposed method. Furthermore, the potential use of the proposed method in the area of information systems is a subject for further study.
ACKNOWLEDGEMENTS
This work was supported by the National Research
Foundation of Korea (NRF) grant funded by the
Korea government (MEST) (2009-0083893).
REFERENCES
Aizerman, M. A., Braverman, E. M., and Rozono’er, L. I.,
1964. Theoretical foundations of the potential function
method in pattern recognition learning. Automat. Rem.
Control, 25, pp.824-837.
Blei, D. M., Ng, A. Y., and Jordan, M. I., 2003. Latent
dirichlet allocation. Journal of Machine Learning
Research, 3, pp.993-1022.
Choi, I.C., and Lee, J. S., 2010. Document indexing by
latent dirichlet allocation. Proceedings of The 2010
International Conference on Data Mining, pp.409-
414.
Cortes, C. and Vapnik, V., 1995. Support-vector networks. Machine Learning, 20(3), pp.273-297.
Ester, M., Kriegel, H. P., Sander, J., and Xu, X., 1996. A
density based algorithm for discovering clusters in
large spatial databases. Proceedings of 2nd
International Conference on Knowledge Discovery
and Data Mining, pp.226-231.
Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the 10th European Conference on Machine Learning, pp.137-142.
Cover, T. M. and Hart, P. E., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), pp.21-27.
Van Rijsbergen, C. J., 1979. Information Retrieval, Butterworths. London, 2nd edition.
Vapnik, V., 1995. The nature of statistical learning theory,
Springer. New York.
Wiener, E., Pedersen, J. O., Weigend, A. S., 1995. A
Neural Network Approach to Topic Spotting. SDAIR,
pp.317-332.