UNSUPERVISED ORGANISATION OF SCIENTIFIC DOCUMENTS

André Lourenço; Liliana Medina; Ana Fred; Joaquim Filipe

doi:10.5220/0003722905570568

2011

Abstract

Unsupervised organisation of documents, and in particular research papers, into meaningful groups is a difficult problem. With the typical vector-space-model representation (bag-of-words paradigm), difficulties arise from its intrinsic high dimensionality, high redundancy of features, and lack of semantic information. In this work we propose a document representation that relies on a statistical feature reduction step and an enrichment phase based on the introduction of higher-abstraction terms, designated as metaterms, which are derived from the text using paper topics and keywords as prior knowledge. The proposed representation, combined with a clustering ensemble approach, leads to a novel document organisation strategy. We evaluate the proposed approach on conference papers, with topic information extracted from conference topics or areas. Performance evaluation on data sets from NIPS and INSTICC conferences shows that the proposed approach leads to interesting and encouraging results.
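
As a rough illustration of the kind of pipeline the abstract outlines, the Python sketch below chains a bag-of-words representation, a feature reduction step, a metaterm enrichment step, and a clustering ensemble. It is not the authors' implementation. The document-frequency filter, the keyword-sum definition of a metaterm, the co-association voting used to combine partitions, and every function and parameter name are assumptions chosen to keep the example short.

```python
from collections import Counter
from itertools import combinations


def bag_of_words(docs):
    """Map each document to a term-frequency Counter."""
    return [Counter(doc.lower().split()) for doc in docs]


def reduce_features(bags, min_df=2):
    """Assumed reduction step: keep terms found in at least min_df documents."""
    # Iterating a Counter yields each term once, so this counts documents.
    df = Counter(term for bag in bags for term in bag)
    return [Counter({t: c for t, c in bag.items() if df[t] >= min_df})
            for bag in bags]


def add_metaterms(bags, topics):
    """Assumed enrichment: one metaterm per topic, summing its keyword counts."""
    enriched = []
    for bag in bags:
        new_bag = Counter(bag)
        for topic, keywords in topics.items():
            hits = sum(bag[k] for k in keywords)
            if hits:
                new_bag["META:" + topic] = hits
        enriched.append(new_bag)
    return enriched


def co_association(partitions, n):
    """Fraction of partitions in which each pair of documents shares a cluster."""
    votes = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                votes[i][j] += 1
                votes[j][i] += 1
    return [[v / len(partitions) for v in row] for row in votes]


def consensus(co, threshold=0.5):
    """Merge documents whose co-association exceeds the threshold."""
    n = len(co)
    labels = list(range(n))
    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            old, new = labels[j], labels[i]
            labels = [new if lab == old else lab for lab in labels]
    return labels


# Toy data: four "papers" and two hypothetical topics with keywords.
docs = ["kernel svm margin", "svm kernel classifier",
        "neuron spike cortex", "cortex neuron model"]
topics = {"learning": ["kernel", "svm"], "neuroscience": ["neuron", "cortex"]}
enriched = add_metaterms(reduce_features(bag_of_words(docs)), topics)

# Three hand-written partitions stand in for the output of base clusterers.
partitions = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
final_labels = consensus(co_association(partitions, len(docs)))
```

In a real setting the partitions would come from running one or more clustering algorithms on the enriched vectors; here they are written by hand so that the ensemble step can be followed in isolation.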

Paper Citation

in Harvard Style

Lourenço A., Medina L., Fred A. and Filipe J. (2011). UNSUPERVISED ORGANISATION OF SCIENTIFIC DOCUMENTS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011) ISBN 978-989-8425-79-9, pages 549-560. DOI: 10.5220/0003722905570568

in Bibtex Style

@conference{sstm11,
author={André Lourenço and Liliana Medina and Ana Fred and Joaquim Filipe},
title={UNSUPERVISED ORGANISATION OF SCIENTIFIC DOCUMENTS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011)},
year={2011},
pages={549-560},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003722905570568},
isbn={978-989-8425-79-9},
}

in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: SSTM, (IC3K 2011)
TI - UNSUPERVISED ORGANISATION OF SCIENTIFIC DOCUMENTS
SN - 978-989-8425-79-9
AU - Lourenço A.
AU - Medina L.
AU - Fred A.
AU - Filipe J.
PY - 2011
SP - 549
EP - 560
DO - 10.5220/0003722905570568