
task-specific training for CNNs, while also highlighting
the potential of VLMs in handling diverse and unseen
chart types.
Our future work will focus on developing a more
comprehensive dataset that better aligns with the
standards of professional data visualization software,
which typically supports around 50 different chart types.
While VLMs demonstrate promising zero-shot capabilities,
their context length limitations when dealing with
numerous chart classes make fine-tuning a more practical
approach for real-world applications. Therefore, we plan
to fine-tune VLMs on this future dataset to bridge the
current gap between academic research and industry
requirements in chart classification tasks. Additionally,
we aim to explore chart description generation,
leveraging the multimodal capabilities of VLMs.
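As a rough illustration of the context-length concern, the sketch below builds a zero-shot prompt that enumerates every candidate class and compares its size for 5 and 50 classes. It is a minimal sketch under stated assumptions, not the prompt used in this study: the class names and descriptions are hypothetical placeholders, and a whitespace word count stands in for a real tokenizer.

```python
def build_zero_shot_prompt(class_descriptions):
    """Build a classification prompt that lists every candidate class."""
    lines = ["Classify the chart image into exactly one of these types:"]
    for name, description in class_descriptions.items():
        lines.append(f"- {name}: {description}")
    lines.append("Answer with the type name only.")
    return "\n".join(lines)


def rough_length(prompt):
    """Whitespace word count, used only as a crude proxy for token count."""
    return len(prompt.split())


def make_classes(n):
    # Hypothetical placeholder classes; a real label set would use real names.
    return {
        f"chart_type_{i}": "short description of how this chart looks"
        for i in range(n)
    }


small = rough_length(build_zero_shot_prompt(make_classes(5)))
large = rough_length(build_zero_shot_prompt(make_classes(50)))
print(f"5 classes: {small} words, 50 classes: {large} words")
```

The prompt grows linearly with the number of classes, and this cost is paid on every query, whereas a fine-tuned model carries the label set in its weights.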
ACKNOWLEDGEMENTS
We would like to especially thank the companies Dat-
analysis and Duke, which made this research possible
through their financial support and provided access to
their computing resources on Azure.
A Comparative Study of CNNs and Vision-Language Models for Chart Image Classification