Authors:
André Lourenço (1); Liliana Medina (2); Ana Fred (3) and Joaquim Filipe (4)
Affiliations:
(1) Instituto Superior de Engenharia de Lisboa and a, Portugal
(2) Institute for Systems and Technologies of Information, Control and Communication, Portugal
(3) Instituto Superior Técnico, Portugal
(4) Institute for Systems and Technologies of Information, Control and Communication and Polytechnic Institute of Setúbal, Portugal
Keyword(s):
Unsupervised learning, Clustering, Clustering combination, Clustering ensembles, Text mining, Feature selection, Concept induction, Metaterm.
Abstract:
Unsupervised organization of documents, and of research papers in particular, into meaningful groups is a difficult problem. With the typical vector-space-model representation (the bag-of-words paradigm), difficulties arise from its intrinsic high dimensionality, the high redundancy of features, and the lack of semantic information. In this work we propose a document representation that relies on a statistical feature reduction step and an enrichment phase based on the introduction of higher-abstraction terms, designated metaterms, which are derived from the text using paper topics and keywords as prior knowledge. The proposed representation, combined with a clustering ensemble approach, leads to a novel document organization strategy. We evaluate the proposed approach on conference papers as the application domain, with topic information extracted from conference topics or areas. Performance evaluation on data sets from the NIPS and INSTICC conferences shows that the proposed approach leads to interesting and encouraging results.
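The two ideas named in the abstract, metaterm enrichment of a bag-of-words vector and the combination of several clusterings, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the `META_CLUSTERING` feature, and the rule that a metaterm's count is the sum of its keywords' counts are all assumptions made for illustration. The co-association matrix is one common way to combine partitions; the abstract does not say which combination scheme is used.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercased term counts for one document (vector-space model)."""
    return Counter(text.lower().split())


def add_metaterms(bow, metaterms):
    """Return a copy of `bow` with one extra feature per metaterm.

    `metaterms` maps a hypothetical metaterm name to the set of keywords
    it abstracts. Assumption: the metaterm count is the sum of the counts
    of its keywords in the document.
    """
    enriched = Counter(bow)
    for name, keywords in metaterms.items():
        total = sum(bow[k] for k in keywords)
        if total:
            enriched[name] = total
    return enriched


def co_association(partitions):
    """Fraction of partitions in which each pair of items shares a cluster.

    `partitions` is a list of label lists, one per base clustering, all
    over the same items in the same order.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


bow = bag_of_words("Clustering ensembles combine clustering results")
enriched = add_metaterms(bow, {"META_CLUSTERING": {"clustering", "ensembles"}})

# Two base clusterings of the same three documents.
matrix = co_association([[0, 0, 1], [0, 1, 1]])
```

A final partition could then be obtained by running any similarity-based clustering algorithm on `matrix`; that step is omitted here because the abstract does not specify it.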