Authors: Elisabete Cunha 1; Álvaro Figueira 2 and Óscar Mealha 3
Affiliations: 1 Universidade do Porto, Universidade de Aveiro, ESE and IPVC, Portugal; 2 Universidade do Porto, Portugal; 3 CETAC.MEDIA, Portugal
Keyword(s): Semantic Document Classification, Clustering, Tagging, Seed Selection, k-means, k-C, Cosine Similarity.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Clustering and Classification Methods; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Symbolic Systems
Abstract:
In this paper we analyze and discuss two methods that build on the traditional k-means for document clustering and integrate social tags into the process. The first integrates tags directly into the Vector Space Model; the second uses tags to select the initial seeds. We created a predictive model for the impact of tag integration in both models, and compared the two methods using the traditional k-means++ and the novel k-C algorithm. To compare the results, we propose a new internal measure that quantifies cluster compactness. The experimental results indicate that careful seed selection in the k-C algorithm yields better results than those obtained with k-means++, both with and without tag integration.
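As an illustration of the first method's general idea only, the sketch below adds tags to a term-frequency vector and compares two documents by cosine similarity. It is not the authors' implementation: the additive weighting scheme, the `tag_weight` parameter, and the function names `build_vector` and `cosine_similarity` are assumptions made for this example, and the abstract does not specify how tags are actually weighted.

```python
import math
from collections import Counter


def build_vector(tokens, tags, tag_weight=2.0):
    """Term-frequency vector with tags added on top.

    Assumption for illustration: each tag adds `tag_weight` to the
    count of the matching term. The paper may weight tags differently.
    """
    vec = Counter(tokens)
    for tag in tags:
        vec[tag] += tag_weight
    return vec


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two toy documents that share one term and one tag.
doc1_tokens = ["k", "means", "clustering"]
doc2_tokens = ["seed", "selection", "clustering"]
shared_tags = ["clustering"]

without_tags = cosine_similarity(
    build_vector(doc1_tokens, []), build_vector(doc2_tokens, [])
)
with_tags = cosine_similarity(
    build_vector(doc1_tokens, shared_tags), build_vector(doc2_tokens, shared_tags)
)
```

In this toy case the shared tag raises the similarity between the two documents, which is the kind of effect tag integration is meant to have on the clustering input. Whether it helps on real collections depends on the quality of the tags and on the weighting chosen.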