Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods
Elisabete Cunha, Álvaro Figueira, Óscar Mealha
2013
Abstract
In this paper we analyze and discuss two document clustering methods that are based on traditional k-means and that integrate social tags into the process. The first integrates tags directly into a Vector Space Model, and the second uses tags to select the initial seeds. We created a predictive model for the impact of tag integration in both models, and compared the two methods using the traditional k-means++ and the novel k-C algorithm. To compare the results, we propose a new internal measure that allows the computation of cluster compactness. The experimental results indicate that careful seed selection in the k-C algorithm yields better results than those obtained with k-means++, both with and without tag integration.
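As a rough illustration of the two ideas the abstract combines, the sketch below appends tag weights to a term vector and then picks initial seeds with the standard k-means++ rule (each new seed is drawn with probability proportional to its squared distance from the nearest seed already chosen). It is not the paper's k-C algorithm or its tag integration scheme, neither of which is specified here; the function names, the concatenation approach, the `tag_weight` parameter, and the toy data are all assumptions made for illustration.

```python
import random


def augment_with_tags(term_vector, tag_vector, tag_weight=1.0):
    """Append scaled tag weights to a term vector.

    Assumed scheme for illustration only; the paper's actual
    integration of tags into the Vector Space Model may differ.
    """
    return list(term_vector) + [tag_weight * t for t in tag_vector]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """Standard k-means++ seeding.

    The first seed is chosen uniformly at random; every later seed is
    drawn with probability proportional to the squared distance from
    the point to its nearest already-chosen seed.
    """
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        d2 = [min(squared_distance(p, s) for s in seeds) for p in points]
        if sum(d2) == 0:
            # All points coincide with existing seeds: fall back to uniform.
            seeds.append(rng.choice(points))
        else:
            seeds.append(rng.choices(points, weights=d2, k=1)[0])
    return seeds


# Hypothetical toy data: (term weights, binary tag indicators) per document.
docs = [
    ([1.0, 0.0, 0.2], [1, 0]),
    ([0.9, 0.1, 0.0], [1, 0]),
    ([0.0, 1.0, 0.8], [0, 1]),
    ([0.1, 0.9, 1.0], [0, 1]),
]
points = [augment_with_tags(terms, tags, tag_weight=0.5) for terms, tags in docs]
seeds = kmeans_pp_seeds(points, k=2, rng=random.Random(0))
```

The resulting `seeds` would then be passed to an ordinary k-means loop as the initial centroids. Raising `tag_weight` makes documents that share tags sit closer together relative to documents that only share terms, which is one simple way to let tags influence both the seeding and the final clusters.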
Paper Citation
in Harvard Style
Cunha E., Figueira Á. and Mealha Ó. (2013). Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013) ISBN 978-989-8565-75-4, pages 160-168. DOI: 10.5220/0004545201600168
in Bibtex Style
@conference{kdir13,
author={Elisabete Cunha and Álvaro Figueira and Óscar Mealha},
title={Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013)},
year={2013},
pages={160-168},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004545201600168},
isbn={978-989-8565-75-4},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing - Volume 1: KDIR, (IC3K 2013)
TI - Clustering and Classifying Text Documents - A Revisit to Tagging Integration Methods
SN - 978-989-8565-75-4
AU - Cunha E.
AU - Figueira Á.
AU - Mealha Ó.
PY - 2013
SP - 160
EP - 168
DO - 10.5220/0004545201600168