A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS

Daniel Osuna-Ontiveros, Ivan Lopez-Arevalo, Victor Sosa-Sosa

Abstract

Information retrieval (IR) models process documents to prepare them for search by humans or computers. Early models relied on lexico-syntactic processing of documents, in which the relevance of a document retrieved for a query is based on the frequency of the query terms in that document. Another approach is to return predefined documents based on the type of query the user makes. Recently, some researchers have combined text mining techniques to enhance document retrieval. This paper proposes a semantic clustering approach that improves traditional information retrieval models by representing the topics associated with documents. The proposal combines text mining algorithms and natural language processing. The approach does not use a priori queries; instead, it clusters terms, where each cluster is a set of related words according to the content of the documents. As a result, a document-topic matrix representation is obtained that denotes the importance of topics inside documents. For query processing, each query is represented as a set of clusters derived from its terms. Thus, a similarity measure (e.g. cosine similarity) can be applied to this query representation and the document-topic matrix to retrieve the most relevant documents.
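
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cluster names, their member terms, the toy documents, and the helper functions `to_topic_vector`, `cosine`, and `rank` are all hypothetical, and topic importance is assumed here to be a plain count of matching terms, whereas the paper derives its clusters and weights from the document collection.

```python
import math

# Hypothetical term clusters (topics). In the paper these are obtained by
# clustering terms from the documents; here they are fixed by hand.
clusters = {
    "retrieval": {"query", "search", "index", "ranking"},
    "clustering": {"cluster", "centroid", "partition"},
    "language": {"parser", "syntax", "grammar"},
}
topics = list(clusters)


def to_topic_vector(terms):
    """Map a list of terms to per-cluster counts (assumed weighting)."""
    return [sum(1 for t in terms if t in clusters[name]) for name in topics]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy documents, already tokenized (illustrative data only).
documents = {
    "doc_ir": ["query", "search", "ranking", "index", "cluster"],
    "doc_nlp": ["parser", "grammar", "syntax", "search"],
    "doc_cl": ["cluster", "centroid", "partition", "cluster"],
}

# Document-topic matrix: one row of topic weights per document.
doc_topic = {name: to_topic_vector(terms) for name, terms in documents.items()}


def rank(query_terms):
    """Represent the query over the same clusters and rank documents."""
    q = to_topic_vector(query_terms)
    scores = {name: cosine(q, vec) for name, vec in doc_topic.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank(["search", "index"]):
        print(f"{name}: {score:.3f}")
```

Because the query and the documents are expressed over the same clusters, a query can match a document that shares none of its exact terms, as long as both fall into the same topics.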


Paper Citation


in Harvard Style

Osuna-Ontiveros D., Lopez-Arevalo I. and Sosa-Sosa V. (2011). A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011) ISBN 978-989-8425-79-9, pages 280-285. DOI: 10.5220/0003663802880293


in Bibtex Style

@conference{kdir11,
author={Daniel Osuna-Ontiveros and Ivan Lopez-Arevalo and Victor Sosa-Sosa},
title={A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)},
year={2011},
pages={280-285},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003663802880293},
isbn={978-989-8425-79-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2011)
TI - A SEMANTIC CLUSTERING APPROACH FOR INDEXING DOCUMENTS
SN - 978-989-8425-79-9
AU - Osuna-Ontiveros D.
AU - Lopez-Arevalo I.
AU - Sosa-Sosa V.
PY - 2011
SP - 280
EP - 285
DO - 10.5220/0003663802880293