Figure 9: Textual Coherence over different experiments.
from the text. To evaluate the proposed methodology,
we used two real-world data sets from the NIPS and
ICEIS conferences. We also compared this methodology
against results obtained by applying LSI to the
original feature space.
To evaluate the results, we followed an unsupervised
approach, based on inspection of the obtained
co-association matrices and on the within-cluster
textual coherence. Based on both, we conclude that
feature reduction by aggregating features into
metaterms produces better results than both the
original TF-IDF feature space and the reduced feature
space obtained by LSI.
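As an illustration of the co-association matrices mentioned above, the following is a minimal sketch, not the implementation used in the experiments. It assumes the usual evidence-accumulation definition, in which entry (i, j) is the fraction of partitions that place documents i and j in the same cluster. The function name co_association and the example partitions are hypothetical.

```python
from itertools import combinations


def co_association(partitions):
    """Return the n x n co-association matrix for a list of partitions.

    Each partition is a list of cluster labels, one per document.
    Entry [i][j] is the fraction of partitions in which documents
    i and j receive the same label (assumed definition).
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("all partitions must label the same documents")
        # A document always co-occurs with itself.
        for i in range(n):
            counts[i][i] += 1
        # Count each unordered pair once and mirror it.
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[i][j] += 1
                counts[j][i] += 1
    total = len(partitions)
    return [[c / total for c in row] for row in counts]


# Three hypothetical partitions of four documents.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
matrix = co_association(partitions)
```

Under this definition, a matrix with a clear block structure (values near 1 inside blocks, near 0 elsewhere) suggests that the individual partitions agree, which is the kind of visual evidence the evaluation relies on.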
As future work, we want to improve the criteria for
feature aggregation by including a supervised step of
user annotation and by combining different criteria
(statistical and string comparison). Additionally, we
will use the EAC clustering combination algorithm to
combine the information already in use (titles and
abstracts) with citation information. Another possible
approach is to use other ontologies (besides WordNet)
to discover semantic relationships between features
and documents, enabling better aggregation of features.
ACKNOWLEDGEMENTS
This work was partially developed under the grants
SFRH/PROTEC/49512/2009 and PTDC/EIACCO/103230/2008
(project EvaClue) from Fundação para a Ciência e
Tecnologia (FCT), and project RETE, reference
3-CP-IPS-3-2009, from IPS/INSTICC, whose support the
authors gratefully acknowledge.
REFERENCES
(1998). ACM computing classification system.
http://www.acm.org/about/class/1998.
Ahlgren, P. and Jarneving, B. (2008). Bibliographic cou-
pling, common abstract stems and clustering: A
comparison of two document-document similarity ap-
proaches in the context of science mapping. Sciento-
metrics, 76:273–290. 10.1007/s11192-007-1935-1.
Aljaber, B., Stokes, N., Bailey, J., and Pei, J. (2010). Doc-
ument clustering of scientific texts using citation con-
texts. Inf. Retr., 13:101–131.
Banerjee, S. and Pedersen, T. (2003). Extended gloss
overlaps as a measure of semantic relatedness. In
Proceedings of the Eighteenth International Joint
Conference on Artificial Intelligence, pages 805–810.
Boyack, K. W. and Klavans, R. (2010). Co-citation analy-
sis, bibliographic coupling, and direct citation: Which
citation approach represents the research front most
accurately? Journal of the American Society for Infor-
mation Science and Technology, 61(12):2389–2404.
Boyack, K. W., Newman, D., Duhon, R. J., Klavans, R.,
Patek, M., Biberstine, J. R., Schijvenaars, B., Skupin,
A., Ma, N., and Börner, K. (2011). Clustering more
than two million biomedical publications: Compar-
ing the accuracies of nine text-based similarity ap-
proaches. PLoS ONE, 6(3):e18029.
Dao, T. N. and Simpson, T. (2005). Measuring similarity
between sentences. http://opensvn.csie.org/WordNet
DotNet/trunk/Projects/Thanh/Paper/WordNetDotNet
Semantic Similarity.pdf.
Fellbaum, C. (1998). WordNet: An Electronic Lexical
Database. The MIT Press, Cambridge, MA.
Fred, A. (2001). Finding consistent clusters in data parti-
tions. In Kittler, J. and Roli, F., editors, Multiple Clas-
sifier Systems, volume 2096, pages 309–318. Springer.
Fred, A. and Jain, A. K. (2005). Combining multiple clus-
tering using evidence accumulation. IEEE Trans Pat-
tern Analysis and Machine Intelligence, 27(6):835–
850.
Globerson, A., Chechik, G., Pereira, F., and Tishby, N.
(2007). Euclidean Embedding of Co-occurrence Data.
The Journal of Machine Learning Research, 8:2265–
2295.
Hanan, G. A. and Mohamed, S. K. (2008). Cumulative
voting consensus method for partitions with variable
number of clusters. IEEE Trans. Pattern Anal. Mach.
Intell., 30(1):160–173.
Hotho, A., Staab, S., and Stumme, G. (2003). WordNet
improves text document clustering. In Proc. of the
SIGIR 2003 Semantic Web Workshop, pages 541–544.
Janssens, F., Leta, J., Glänzel, W., and De Moor, B. (2006).
Towards mapping library and information science. Inf.
Process. Manage., 42:1614–1642.
Karypis, G. and Kumar, V. (1998). Multilevel k-way
partitioning scheme for irregular graphs. Journal
of Parallel and Distributed Computing, 48:96–129.
Lawrence, S., Giles, C. L., and Bollacker, K. (1999). Digi-
tal libraries and autonomous citation indexing. Com-
puter, 32:67–71.
Lesk, M. (1986). Automatic sense disambiguation using
machine readable dictionaries: how to tell a pine cone
from an ice cream cone. In Proceedings of the 5th
annual international conference on Systems documen-
UNSUPERVISED ORGANISATION OF SCIENTIFIC DOCUMENTS