We analysed several clustering algorithms with different
numbers of clusters, terms and internal properties, and
the results show that the most appropriate number
of clusters is between two and four. However, this
number of clusters (areas) is very small compared
with the ACM or Microsoft classifications. Analysing the
titles in these clusters, the papers of the collection can
be grouped into Artificial Intelligence and Modeling
Systems, but this conclusion requires some human analysis,
so we need to apply some contextual information to produce
clusters that are meaningful in computer science.
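As an illustration of how a number of clusters can be selected with an internal validation measure, the following minimal Python sketch scores candidate values of k on a small set of titles. It rests on assumptions that are not taken from the experiments reported here: the measure is the average silhouette width, the clustering is average-linkage hierarchical clustering over cosine distances between bag-of-words title vectors, and the six titles in TITLES are hypothetical. The function names are likewise illustrative only.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy collection (not data from the study).
TITLES = [
    "neural network learning",
    "deep neural network",
    "neural learning agents",
    "stochastic system modeling",
    "system modeling simulation",
    "stochastic simulation modeling",
]


def vectorize(title):
    """Bag-of-words term counts for one title (lowercased, whitespace split)."""
    return Counter(title.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def distance_matrix(titles):
    """Full pairwise cosine-distance matrix between titles."""
    vecs = [vectorize(t) for t in titles]
    n = len(vecs)
    return [[cosine_distance(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]


def average_linkage(dist, k):
    """Agglomerative clustering: merge the two closest clusters until k remain."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def silhouette(dist, clusters):
    """Average silhouette width; points in singleton clusters contribute 0."""
    total = 0.0
    for ci, members in enumerate(clusters):
        if len(members) == 1:
            continue
        for i in members:
            # a: mean distance to the other members of the same cluster
            a = sum(dist[i][j] for j in members if j != i) / (len(members) - 1)
            # b: mean distance to the nearest other cluster
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(dist)


def choose_k(titles, candidates):
    """Return the candidate k with the highest average silhouette, plus all scores."""
    dist = distance_matrix(titles)
    scores = {k: silhouette(dist, average_linkage(dist, k)) for k in candidates}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    best_k, scores = choose_k(TITLES, range(2, 5))
    print(best_k, scores)
```

On this toy collection the two-cluster partition obtains the highest average silhouette among k = 2, 3, 4. A real collection would also need stop-word removal, stemming and term weighting before such a comparison is informative.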
More experiments are needed in order to
select one or a few clustering algorithms that
automatically classify the titles of scientific papers.
As future work, we are going to apply these
algorithms to the keywords and abstracts of the
papers in order to compare the results obtained.
We are also planning to consider membership
in several areas, two or three. This is natural because
computer science is an interdisciplinary science and
much of its work tries to solve problems of other areas.
We also think that it would be interesting to work with a
time window of two or five years in order to analyse
the evolution of the research in one country as well as
in several.
Our approach is also applicable to any scientific
field.
REFERENCES
ACM (2012). Retrieved January 8, 2013, from
dl.acm.org/ccs.cfm.
Vaidas Balys, Rimantas Rudzkis (2010) Statistical
Classification of Scientific Publications, Informatica,
Volume 21, Issue 4, pp 471 – 486.
Guy Brock, Vasyl Pihur, Susmita Datta and Somnath
Datta (2008). clValid: An R Package for Cluster
Validation, Journal of Statistical Software. March
2008, Volume 25, Issue 4. http://www.jstatsoft.org.
Guy Brock, Vasyl Pihur, Susmita Datta and Somnath
Datta (2011). clValid: Validation of Clustering
Results. R package version 0.6-4.
http://CRAN.R-project.org/package=clValid.
G. Csardi, T. Nepusz (2006). The igraph software package
for complex network research, InterJournal, Complex
Systems 1695. http://igraph.sf.net.
Ingo Feinerer, Kurt Hornik, and David Meyer (2008). Text
Mining Infrastructure in R. Journal of Statistical
Software 25(5): 1-54.
URL: http://www.jstatsoft.org/v25/i05/.
Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining
Package. R package version 0.5-8.3.
http://CRAN.R-project.org/package=tm.
Ian Fellows (2012). wordcloud: Word Clouds. R package
version 2.2. http://CRAN.R-project.org/package=wordcloud.
C. Fraley, A. E. Raftery (2007). Bayesian regularization
for normal mixture estimation and model-based
clustering. Journal of Classification, Vol. 24, Issue 2,
pp. 155-181.
C. Fraley, A. E. Raftery (2002). Model-based clustering,
discriminant analysis and density estimation. Journal
of the American Statistical Association, Vol. 97, pages
611-631.
Cristal-Karina Galindo Duran, Mihaela Juganaru-Mathieu,
Carlos Aviles Cruz, Héctor Javier Vazquez (2010).
Desarrollo de una aplicación destinada a la
clasificación de información textual y su evaluación
por simulación, Administración y Organizaciones
25:13, pages 119-131.
Tarek Gharib, Mohammed Fouad, Mostafa Aref (2010)
Fuzzy Document Clustering Approach using WordNet
Lexical Categories. In Advanced Techniques in
Computing Sciences and Software Engineering,
Khaled Elleithy (editor), Springer, pp 181-186.
Michiel Hazewinkel (2005) Dynamic Stochastic Models
for Indexes and Thesauri, Identification Clouds, and
Information Retrieval and Storage, In Recent
Advances in Applied Probability, Ricardo Baeza-Yates
et al (editors), Springer US, 2005, pp 181-204.
Mikael Laakso, Bo-Christer Björk (2012). Anatomy of
open access publishing: a study of longitudinal
development and internal structure, BMC Medicine,
10:124, pp 1-9.
Jian Ma, Wei Xu, Yong-hong Sun, E. Turban, Shouyang
Wang, Ou Liu (2012). An Ontology-Based Text-Mining
Method to Cluster Proposals for Research Project
Selection, IEEE Transactions on Systems, Man and
Cybernetics, Part A: Systems and Humans, vol. 42,
no. 3, pp 784-790.
M. Maechler, P. Rousseeuw, A. Struyf, M. Hubert, K.
Hornik (2012). cluster: Cluster Analysis Basics and
Extensions. R package version 1.14.3.
Microsoft Academic Search (2012). Retrieved January 22,
2013, from academic.research.microsoft.com.
R Core Team (2013). R: A language and environment for
statistical computing. R Foundation for Statistical
Computing, Vienna, Austria. ISBN 3-900051-07-0,
URL http://www.R-project.org/.
G. Schwarz (1978). Estimating the dimension of a model.
The Annals of Statistics, 6:461-464, 1978.
Fabrizio Sebastiani (2002) Machine learning in automated
text categorization, ACM Computing Surveys,
Volume 34, Issue 1, pages 1 – 47.
Mohsen Taheriyan (2011) Subject classification of
research papers based on interrelationships analysis. In
Proceedings of the 2011 Workshop on Knowledge
Discovery, Modeling and Simulation (San Diego,
California, USA, August 2011). KDMS '11. ACM,
New York, NY, pages 39-44.
H. Wickham (2009) ggplot2: elegant graphics for data
analysis. Springer New York.
Ying Zhao, George Karypis, Usama Fayyad (2005),
Hierarchical Clustering Algorithms for Document
Datasets, Data Mining and Knowledge Discovery,
Volume 10, Issue 2, March 2005, pages 141-168.
KDIR 2013 - International Conference on Knowledge Discovery and Information Retrieval