“uncategorized” documents (about 139) were clustered among other categories and
were therefore counted as incorrect according to our rules.
5 Conclusions
In this paper, we presented and compared several clustering algorithms (stars, cliques,
best-star and full-stars) that can be applied to text collections to group texts on
the same subject. We found the best-star algorithm to be the most suitable of them
for this task, since it requires little information or parameter adjustment from the
user. Our experiments showed that clustering achieves similar or better micro-average
recall than the CONSTRUE system, but lower precision. Nevertheless, it can be used
to suggest appropriate categories for documents whose category is unknown.
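To make the micro-averaged measures mentioned above concrete, the following is a minimal sketch that assumes the standard definition: per-category counts are pooled before the ratio is taken. The function name, the category labels and the counts are hypothetical and are not taken from our experiments.

```python
from typing import Dict, Tuple


def micro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (micro_recall, micro_precision).

    `counts` maps a category name to a tuple (tp, fp, fn):
    documents correctly assigned, wrongly assigned, and missed.
    """
    # Pool the counts over all categories before computing the ratios.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())

    # Guard against empty input to avoid division by zero.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical counts for three categories (illustration only).
example = {
    "cat_a": (80, 30, 10),
    "cat_b": (40, 25, 10),
    "cat_c": (5, 5, 5),
}

recall, precision = micro_average(example)
print(f"micro-average recall:    {recall:.3f}")
print(f"micro-average precision: {precision:.3f}")
```

Because the counts are pooled, large categories dominate both values. In this invented example the pooled recall is higher than the pooled precision, which is the same pattern described above: many relevant documents are recovered, at the cost of more wrong assignments.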