Dismantling Composite Visualizations in the Scientific Literature

Po-Shen Lee, Bill Howe


We are analyzing the visualizations in the scientific literature to enhance search services, detect plagiarism, and study bibliometrics. An immediate problem is the ubiquitous use of multi-part figures: single images with multiple embedded sub-visualizations. Such figures account for approximately 35% of the figures in the scientific literature. Conventional image segmentation techniques and other existing approaches have been shown to be ineffective for parsing visualizations. We propose an algorithm that automatically segments multi-chart visualizations into a set of single-chart visualizations, thereby enabling downstream analysis. Our approach first splits an image into fragments based on background color and layout patterns. An SVM-based binary classifier then distinguishes complete charts from auxiliary fragments such as labels, ticks, and legends, achieving 98.1% accuracy on average. Next, we recursively merge fragments to reconstruct complete visualizations, choosing between alternative merge trees using a novel scoring function. To evaluate our approach, we used 261 scientific multi-chart figures randomly selected from the PubMed database. Our algorithm achieves 80% recall and 85% precision of perfect extractions for the common case of eight or fewer sub-figures per figure. Further, even imperfect extractions are shown to be sufficient for most chart classification and reasoning tasks associated with bibliometrics and academic search applications.
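
The first stage of the approach, splitting an image into fragments along background-colored gaps, can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes a plain recursive XY-cut (a standard whitespace-projection technique) and assumes a grayscale image stored as a list of rows with a uniform background value of 255. The names `split_fragments`, `background`, and `min_gap` are hypothetical, and the layout-pattern rules, the SVM classifier, and the merge-tree scoring function are not covered.

```python
from typing import List, Tuple

# A box is (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def split_fragments(img: List[List[int]], background: int = 255,
                    min_gap: int = 2) -> List[Box]:
    """Recursive XY-cut: split on background-only row/column gaps.

    Assumption (not from the paper): a gap is a run of at least
    `min_gap` consecutive rows or columns containing only `background`.
    """

    def segments(blank: List[bool]) -> List[Tuple[int, int]]:
        # Return content runs separated by blank runs of length >= min_gap.
        segs, start, gap = [], None, 0
        for i, is_blank in enumerate(blank):
            if is_blank:
                gap += 1
                if start is not None and gap >= min_gap:
                    segs.append((start, i - gap + 1))
                    start = None
            else:
                if start is None:
                    start = i
                gap = 0
        if start is not None:
            # Drop trailing blank lines shorter than min_gap.
            segs.append((start, len(blank) - gap))
        return segs

    def recurse(top: int, left: int, bottom: int, right: int) -> List[Box]:
        rows = segments([all(img[r][c] == background
                             for c in range(left, right))
                         for r in range(top, bottom)])
        if len(rows) > 1:  # horizontal cuts found
            return [box for a, b in rows
                    for box in recurse(top + a, left, top + b, right)]
        cols = segments([all(img[r][c] == background
                             for r in range(top, bottom))
                         for c in range(left, right)])
        if len(cols) > 1:  # vertical cuts found
            return [box for a, b in cols
                    for box in recurse(top, left + a, bottom, left + b)]
        if not rows or not cols:  # region is entirely background
            return []
        (r0, r1), (c0, c1) = rows[0], cols[0]
        return [(top + r0, left + c0, top + r1, left + c1)]

    if not img or not img[0]:
        return []
    return recurse(0, 0, len(img), len(img[0]))


# Toy figure: two dark 4x2 blocks separated by three background columns.
demo = [[255] * 9 for _ in range(6)]
for r in range(1, 5):
    for c in (1, 2, 6, 7):
        demo[r][c] = 0

boxes = split_fragments(demo)
print(boxes)
```

On the toy figure the call returns one tight bounding box per block. In a pipeline like the one described above, each such box would then be passed to the classifier that separates complete charts from auxiliary fragments.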



Paper Citation

in Harvard Style

Lee P. and Howe B. (2015). Dismantling Composite Visualizations in the Scientific Literature. In Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM, ISBN 978-989-758-077-2, pages 79-91. DOI: 10.5220/0005213100790091

in Bibtex Style

author={Po-Shen Lee and Bill Howe},
title={Dismantling Composite Visualizations in the Scientific Literature},
booktitle={Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM},

in EndNote Style

JO - Proceedings of the International Conference on Pattern Recognition Applications and Methods - Volume 2: ICPRAM
TI - Dismantling Composite Visualizations in the Scientific Literature
SN - 978-989-758-077-2
AU - Lee P.
AU - Howe B.
PY - 2015
SP - 79
EP - 91
DO - 10.5220/0005213100790091