
 
 
Table 1: Number of training and test items.

Category Name              Num Train   Num Test
alt.atheism                      480        318
comp.graphics                    581        391
comp.os.ms-windows.misc          572        391
comp.sys.ibm.pc.hardware         587        390
All                             2220       1490
The DBSCAN clustering algorithm produced 15 clusters. To assess the effectiveness of the algorithm, the proposed method without DBSCAN was tested while varying the number of latent topics; the test result is shown in Figure 1.
 
Figure 1: Accuracy according to different numbers of 
latent topics. 
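The DBSCAN step above serves only to count clusters, and that count is then used as the number of latent topics. A minimal sketch of this idea, using scikit-learn's DBSCAN on synthetic stand-ins for document feature vectors (the corpus, eps, and min_samples values are illustrative assumptions, not the paper's settings):

```python
# Sketch: estimate the number of latent topics by counting DBSCAN
# clusters over document feature vectors (e.g. TF-IDF rows).
# The synthetic data and the eps/min_samples values are illustrative
# assumptions, not the paper's actual corpus or parameters.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Three tight synthetic "document" groups in a 5-dimensional space.
docs = np.vstack([rng.normal(loc=c, scale=0.02, size=(20, 5))
                  for c in (0.0, 1.0, 2.0)])

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(docs)
# Cluster labels run 0..k-1; -1 marks noise points and is excluded.
num_topics = len(set(labels) - {-1})
print(num_topics)
```

Here the cluster count would feed directly into the LDA topic parameter; with real text, the vectors would come from a term-weighting step rather than a random generator.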
The DBSCAN algorithm obtained a satisfactory result, even though the highest accuracy was achieved when the number of topics was 14. The precision and recall of the proposed method are shown in Table 2.
Table 2: Precision and recall.

Category Name              Precision   Recall
alt.atheism                    93.7%    99.3%
comp.graphics                  97.1%    94.3%
comp.os.ms-windows.misc        92.5%    92.1%
comp.sys.ibm.pc.hardware       92.2%    88.2%
Accuracy                       93.2%
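The per-class figures in Table 2 follow from a standard confusion matrix over the four categories. A short sketch of how such numbers are computed (the counts below are illustrative and only the row totals match the test-set sizes in Table 1; they are not the paper's actual predictions):

```python
# Sketch: per-class precision/recall and overall accuracy from a
# confusion matrix. The cell counts are illustrative assumptions;
# only the row sums (318, 391, 391, 390) mirror Table 1.
import numpy as np

# Rows: true class, columns: predicted class (4 categories).
cm = np.array([[316,   1,   1,   0],
               [  5, 369,  10,   7],
               [  6,   8, 360,  17],
               [ 10,   2,  18, 360]])

precision = cm.diagonal() / cm.sum(axis=0)   # TP / (TP + FP), per column
recall    = cm.diagonal() / cm.sum(axis=1)   # TP / (TP + FN), per row
accuracy  = cm.diagonal().sum() / cm.sum()   # correct / total
print(precision.round(3), recall.round(3), round(accuracy, 3))
```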
A computational comparison with other classification algorithms remains a topic for future study.
5 CONCLUSIONS 
This paper presented a joint approach to text categorization based on the generative model LDA and a support vector machine. The DBSCAN clustering algorithm is used to determine the number of latent topics. A preliminary experiment was carried out on a small-scale corpus. Further research on applications to larger-scale data is necessary. Moreover, other classification algorithms should be tested on this corpus for comparison with the proposed method. Furthermore, the potential use of the proposed method in the area of information systems is a subject for further study.
ACKNOWLEDGEMENTS 
This work was supported by the National Research 
Foundation of Korea (NRF) grant funded by the 
Korea government (MEST) (2009-0083893). 
REFERENCES 
Aizerman, M. A., Braverman, E. M., and Rozono’er, L. I., 
1964. Theoretical foundations of the potential function 
method in pattern recognition learning. Automat. Rem. 
Control, 25, pp.824-837. 
Blei, D. M., Ng, A. Y., and Jordan, M. I., 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3, pp.993-1022.
Choi, I. C., and Lee, J. S., 2010. Document indexing by latent Dirichlet allocation. Proceedings of The 2010 International Conference on Data Mining, pp.409-414.
Cortes, C., and Vapnik, V., 1995. Support-vector networks. Machine Learning, 20(3), pp.273-297.
Ester, M., Kriegel, H. P., Sander, J., and Xu, X., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, pp.226-231.
Joachims, T., 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the 10th European Conference on Machine Learning, pp.137-142.
Cover, T. M., and Hart, P. E., 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), pp.21-27.
Van Rijsbergen, C. J., 1979. Information Retrieval, Butterworths, London, 2nd edition.
Vapnik, V., 1995. The nature of statistical learning theory, 
Springer. New York. 
Wiener, E., Pedersen, J. O., and Weigend, A. S., 1995. A neural network approach to topic spotting. SDAIR, pp.317-332.
ICORES 2012 - 1st International Conference on Operations Research and Enterprise Systems