LINGUISTICALLY ENHANCED CLUSTERING OF TECHNICAL PUBLICATIONS

Mahmoud Gindiyeh, Gintare Grigonyte, Johann Haller, Algirdas Avižienis

Abstract

Organizing and searching documents is a common but non-trivial task in information systems. As the number of documents grows, automating these processes becomes crucial. Clustering is one way to organize large collections of documents. In this article we propose a method for improving document retrieval that was implemented in the RKB Knowledge Base. Our method relies heavily on linguistic analysis, which aims to identify document-specific noun phrases. We apply an adjusted hierarchical clustering algorithm to learn clusters of documents.
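
The abstract does not spell out the adjustments made to the clustering algorithm, so the following is only a minimal sketch of generic single-link agglomerative clustering over noun-phrase sets, not the authors' method. The Jaccard similarity, the `threshold` stopping rule, the function names, and the sample documents are all assumptions made for illustration; noun-phrase extraction is assumed to have been done beforehand.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets of noun phrases."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.2):
    """Single-link agglomerative clustering (illustrative only).

    docs: dict mapping a document id to its set of noun phrases
    (hypothetical input; phrase extraction happens elsewhere).
    Merging stops once no two clusters reach `threshold` similarity.
    """
    clusters = [{doc_id} for doc_id in docs]

    def link(c1, c2):
        # Single link: similarity of the closest pair of documents.
        return max(jaccard(docs[x], docs[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        # Find the most similar pair of clusters (i < j always holds).
        (i, j), best = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best < threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical sample data, not taken from the paper.
docs = {
    "d1": {"document clustering", "noun phrase", "linguistic analysis"},
    "d2": {"document clustering", "noun phrase", "information retrieval"},
    "d3": {"image database", "query by example"},
    "d4": {"image database", "computer vision"},
}
result = cluster(docs)
```

With the sample data, documents that share noun phrases end up in the same cluster, while documents with no phrases in common stay apart because their similarity never reaches the threshold.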



Paper Citation


in Harvard Style

Gindiyeh M., Grigonyte G., Haller J. and Avižienis A. (2009). LINGUISTICALLY ENHANCED CLUSTERING OF TECHNICAL PUBLICATIONS. In Proceedings of the International Conference on Knowledge Management and Information Sharing - Volume 1: KMIS, (IC3K 2009) ISBN 978-989-674-013-9, pages 324-327. DOI: 10.5220/0002308703240327


in Bibtex Style

@conference{kmis09,
author={Mahmoud Gindiyeh and Gintare Grigonyte and Johann Haller and Algirdas Avižienis},
title={LINGUISTICALLY ENHANCED CLUSTERING OF TECHNICAL PUBLICATIONS},
booktitle={Proceedings of the International Conference on Knowledge Management and Information Sharing - Volume 1: KMIS, (IC3K 2009)},
year={2009},
pages={324-327},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002308703240327},
isbn={978-989-674-013-9},
}

