A Study on the Role of Similarity Measures in Visual Text Analytics

F. San Roman S., R. D. de Pinho, R. Minghim, M. C. F. de Oliveira

2013

Abstract

Text analytics is essential for a large number of applications, and good approaches to obtaining visual mappings of text are paramount. Many visualization techniques, such as similarity-based point placement layouts, have proved useful in supporting visual analysis of documents. However, they are sensitive to data quality, which in turn relies on a critical preprocessing step that involves text cleaning and, in some cases, term detection and weighting, as well as the definition of a similarity function. Little has been discussed about the effect of these similarity calculations on the quality of visual representations. This paper presents a study on the role of different text similarity measurements in the generation of visual text mappings. We focus mainly on two types of distance functions, those based on the well-known vector representation of text and those based on direct string comparison, and compare their effect on visual mappings obtained with point placement techniques. We find that both have their value but that, in many circumstances, the vector space model (VSM) is the best solution when discrimination is important. However, the VSM is not incremental: adding documents to a collection forces a recalculation of the whole feature space and of the similarities. In this work we also propose a new incremental model based on the VSM, which is shown to produce the best visualization results in many of the configurations tested. We report the evaluation results and offer recommendations on the application of different text similarity measurements to Visual Text Analytics tasks.
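The two families of distance functions contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed TF-IDF weighting, and the choice of Levenshtein distance as the string measure are assumptions made for illustration, and the function names are hypothetical. The sketch also shows why a plain VSM is not incremental: the weight of a term in an existing document changes once a new document enters the collection.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumed weighting: tf * (log(N / df) + 1), with whitespace tokenization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in tf}
        )
    return vectors


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic program)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1,          # deletion
                    current[j - 1] + 1,       # insertion
                    previous[j - 1] + cost)   # substitution
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    docs = ["text mining methods", "visual text analytics", "graph drawing"]
    before = tfidf_vectors(docs)
    # Adding one document changes document frequencies, hence old weights.
    after = tfidf_vectors(docs + ["text similarity measures"])
    print("weight of 'text' in doc 0 before:", before[0]["text"])
    print("weight of 'text' in doc 0 after: ", after[0]["text"])
    print("VSM distance doc0-doc1:", cosine_distance(before[0], before[1]))
    print("string distance doc0-doc1:",
          normalized_edit_distance(docs[0], docs[1]))
```

The string measure needs no collection-wide statistics, so it can be computed for a new document without touching the existing ones, whereas the VSM weights depend on the whole collection.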



Paper Citation


in Harvard Style

San Roman S., F., de Pinho, R. D., Minghim, R. and de Oliveira, M. C. F. (2013). A Study on the Role of Similarity Measures in Visual Text Analytics. In Proceedings of the International Conference on Computer Graphics Theory and Applications and International Conference on Information Visualization Theory and Applications - Volume 1: IVAPP, (VISIGRAPP 2013), ISBN 978-989-8565-46-4, pages 429-438. DOI: 10.5220/0004214004290438


in Bibtex Style

@conference{ivapp13,
author={F. San Roman S. and R. D. de Pinho and R. Minghim and M. C. F. de Oliveira},
title={A Study on the Role of Similarity Measures in Visual Text Analytics},
booktitle={Proceedings of the International Conference on Computer Graphics Theory and Applications and International Conference on Information Visualization Theory and Applications - Volume 1: IVAPP, (VISIGRAPP 2013)},
year={2013},
pages={429-438},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004214004290438},
isbn={978-989-8565-46-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Computer Graphics Theory and Applications and International Conference on Information Visualization Theory and Applications - Volume 1: IVAPP, (VISIGRAPP 2013)
TI - A Study on the Role of Similarity Measures in Visual Text Analytics
SN - 978-989-8565-46-4
AU - San Roman S., F.
AU - de Pinho, R. D.
AU - Minghim, R.
AU - de Oliveira, M. C. F.
PY - 2013
SP - 429
EP - 438
DO - 10.5220/0004214004290438