GOTA - Using the Google Similarity Distance for OLAP Textual Aggregation

Mustapha Bouakkaz, Sabine Loudcher, Youcef Ouinten

Abstract

With the tremendous growth of unstructured data in Business Intelligence, there is a need to incorporate textual data into data warehouses, to provide appropriate multidimensional analysis (OLAP), and to develop new approaches that take the textual content of data into account. Such approaches provide textual measures to users who wish to analyze documents online. In this paper, we propose a new aggregation function for textual data in an OLAP context. To aggregate keywords, our contribution is to use a data mining technique, such as k-means, with a distance based on the Google similarity distance. Our approach thus takes the semantic similarity of keywords into account when aggregating them. The performance of our approach is analyzed and compared with that of another method, which uses the k-bisecting clustering algorithm and is based on the Jensen-Shannon divergence between probability distributions. The experimental study shows that our approach achieves better performance in terms of recall, precision, F-measure, complexity, and runtime.
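To make the idea of a keyword distance concrete, here is a minimal sketch of the Normalized Google Distance of Cilibrasi and Vitanyi (2007), computed from hit counts, followed by a small grouping step. It is an illustration under stated assumptions, not the authors' implementation: the hit counts, the index size `N`, and the function names `ngd`, `keyword_distance`, and `k_medoids` are all hypothetical. Because keywords have no natural mean without a vector representation, the sketch swaps in k-medoids in place of k-means.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from hit counts.

    fx, fy: number of pages containing each term; fxy: pages containing both;
    n: total number of indexed pages.
    """
    if fxy == 0:
        return math.inf  # the two terms never co-occur
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Hypothetical hit counts, for illustration only (not real search results).
N = 1_000_000
SINGLE = {"olap": 1000, "cube": 2000, "kmeans": 1500, "clustering": 3000}
PAIR = {
    frozenset(("olap", "cube")): 800,
    frozenset(("kmeans", "clustering")): 1200,
}
DEFAULT_PAIR_COUNT = 10  # assumed co-occurrence count for unlisted pairs

def keyword_distance(a, b):
    """Distance between two keywords; zero for identical keywords."""
    if a == b:
        return 0.0
    fxy = PAIR.get(frozenset((a, b)), DEFAULT_PAIR_COUNT)
    return ngd(SINGLE[a], SINGLE[b], fxy, N)

def k_medoids(items, dist, k, max_iter=20):
    """Tiny k-medoids: a stand-in for k-means when only pairwise distances exist."""
    medoids = list(items[:k])  # deterministic initialisation: first k items
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every item to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for item in items:
            nearest = min(medoids, key=lambda m: dist(item, m))
            clusters[nearest].append(item)
        # Update step: pick the member with the smallest total distance to the others.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, o) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters

keywords = ["olap", "kmeans", "cube", "clustering"]
clusters = k_medoids(keywords, keyword_distance, k=2)
groups = sorted(sorted(members) for members in clusters.values())
print(groups)
```

With these toy counts, keywords that co-occur often end up close together, so each group could then be represented by its medoid as an aggregated keyword. In practice the counts would come from a search engine or a corpus index.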



Paper Citation


in Harvard Style

Bouakkaz M., Loudcher S. and Ouinten Y. (2015). GOTA - Using the Google Similarity Distance for OLAP Textual Aggregation. In Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-758-096-3, pages 121-127. DOI: 10.5220/0005357201210127


in Bibtex Style

@conference{iceis15,
author={Mustapha Bouakkaz and Sabine Loudcher and Youcef Ouinten},
title={GOTA - Using the Google Similarity Distance for OLAP Textual Aggregation},
booktitle={Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2015},
pages={121-127},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005357201210127},
isbn={978-989-758-096-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 17th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - GOTA - Using the Google Similarity Distance for OLAP Textual Aggregation
SN - 978-989-758-096-3
AU - Bouakkaz M.
AU - Loudcher S.
AU - Ouinten Y.
PY - 2015
SP - 121
EP - 127
DO - 10.5220/0005357201210127