Table 1: Number of Training and Testing Items.

Category Name             Num Train  Num Test
alt.atheism               480        318
comp.graphics             581        391
comp.os.ms-windows.misc   572        391
comp.sys.ibm.pc.hardware  587        390
All                       2220       1490
The DBSCAN clustering algorithm produced 15 clusters. To assess the algorithm's effectiveness, the proposed method was also run without DBSCAN, varying the number of latent topics; the results are shown in Figure 1.
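To make the clustering step concrete, the following is a minimal, self-contained sketch of DBSCAN in Python. It is a hypothetical re-implementation for illustration only (the paper's parameter settings are not given); the example points, `eps`, and `min_pts` values are assumptions. The point is how the number of discovered clusters can serve as the number of latent topics.

```python
# Minimal DBSCAN sketch (hypothetical re-implementation; the paper uses
# DBSCAN only to obtain the cluster count, which then sets the number
# of latent topics for LDA).
import math

def dbscan(points, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    labels = [None] * len(points)          # None = not yet visited
    cluster_id = -1

    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: mark as noise
            labels[i] = -1
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:                       # expand the cluster
            j = queue.pop()
            if labels[j] == -1:            # former noise becomes a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:    # j is also a core point: keep growing
                queue.extend(j_seeds)
    return labels

# Two dense groups plus one far-away noise point (toy data, not the corpus).
pts = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (20, 20)]
labels = dbscan(pts, eps=0.5, min_pts=2)
num_topics = len(set(l for l in labels if l != -1))
print(num_topics)  # 2 clusters -> 2 latent topics
```

In this sketch, as in the paper's pipeline, the topic count is not hand-tuned but read off from the density structure of the data.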
Figure 1: Accuracy according to different numbers of
latent topics.
The DBSCAN algorithm obtained a satisfactory result, even though the highest accuracy was achieved with 14 topics. The precision and recall of the proposed method are shown in Table 2.
Table 2: Precision and Recall.

Category Name             Precision  Recall
alt.atheism               93.7%      99.3%
comp.graphics             97.1%      94.3%
comp.os.ms-windows.misc   92.5%      92.1%
comp.sys.ibm.pc.hardware  92.2%      88.2%
Accuracy                  93.2%
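The per-category figures above follow the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and overall accuracy is the fraction of correctly classified test items. A minimal sketch of these computations, using toy labels rather than the actual 20-Newsgroups predictions:

```python
# Per-category precision/recall and overall accuracy from predicted vs.
# true labels (toy labels for illustration, not the paper's test set).

def precision_recall(y_true, y_pred, category):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == category and p == category)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != category and p == category)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == category and p != category)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = ["atheism", "graphics", "graphics", "hardware", "atheism", "hardware"]
y_pred = ["atheism", "graphics", "hardware", "hardware", "atheism", "graphics"]
print(precision_recall(y_true, y_pred, "graphics"))  # (0.5, 0.5)
print(round(accuracy(y_true, y_pred), 3))            # 0.667
```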
A computational comparison with other classification algorithms remains future work.
5 CONCLUSIONS
This paper presented a joint approach, based on the generative model LDA and a support vector machine, to categorizing text. The DBSCAN clustering algorithm is used to determine the number of latent topics. A preliminary experiment was carried out on a small-scale data corpus. Further research on applications to larger-scale data is necessary. Moreover, other classification algorithms should also be tested on this corpus for comparison with the proposed method. Furthermore, the potential use of the proposed method in the area of information systems is a subject for further study.
ACKNOWLEDGEMENTS
This work was supported by the National Research
Foundation of Korea (NRF) grant funded by the
Korea government (MEST) (2009-0083893).
REFERENCES
Aizerman, M. A., Braverman, E. M., and Rozono’er, L. I.,
1964. Theoretical foundations of the potential function
method in pattern recognition learning. Automat. Rem.
Control, 25, pp.824-837.
Blei, D. M., Ng, A. Y., and Jordan, M. I., 2003. Latent
dirichlet allocation. Journal of Machine Learning
Research, 3, pp.993-1022.
Choi, I.C., and Lee, J. S., 2010. Document indexing by
latent dirichlet allocation. Proceedings of The 2010
International Conference on Data Mining, pp.409-
414.
Cortes, C., and Vapnik, V., 1995. Support vector networks.
Machine Learning, 20(3), pp.273-297.
Ester, M., Kriegel, H. P., Sander, J., and Xu, X., 1996. A
density based algorithm for discovering clusters in
large spatial databases. Proceedings of 2nd
International Conference on Knowledge Discovery
and Data Mining, pp.226-231.
Joachims, T., 1998. Text categorization with support
vector machines: Learning with many relevant
features. Proceedings of the 10th European Conference
on Machine Learning, pp.137-142.
Hart, P., 1967. Nearest neighbor pattern classification.
IEEE Transaction on Information Theory, 13(1),
pp.21-27.
Van Rijsbergen, C. J., 1979. Information Retrieval,
Butterworths. London, 2nd edition.
Vapnik, V., 1995. The nature of statistical learning theory,
Springer. New York.
Wiener, E., Pedersen, J. O., and Weigend, A. S., 1995. A
Neural Network Approach to Topic Spotting. SDAIR,
pp.317-332.
ICORES 2012 - 1st International Conference on Operations Research and Enterprise Systems