Furthermore, the dataset includes exercises with figures and tables, which pose an additional challenge but were not evaluated in this paper.
When splitting the exams into individual exercises, it became apparent that some exercises spanned multiple columns, which led to incorrect outputs when OCR capabilities were applied directly. Exploring different layout options or supplying additional layout information could therefore also prove impactful.
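One possible pre-processing step, shown below as a minimal sketch (an assumed approach, not the pipeline used in this paper), is to detect wide vertical whitespace gutters via a projection profile and crop each column separately before running OCR:

```python
# Minimal sketch (assumed pre-processing, not this paper's pipeline): split a
# scanned exam page into text columns by locating near-empty vertical strips.
import numpy as np
from PIL import Image

def split_columns(path: str, gutter_frac: float = 0.02):
    """Return one cropped image per detected text column of the page at `path`."""
    img = Image.open(path).convert("L")          # grayscale page scan
    ink = np.asarray(img) < 200                  # True where a pixel is dark enough to count as ink
    ink_per_x = ink.sum(axis=0)                  # vertical projection profile (ink pixels per x position)
    is_gutter = ink_per_x <= gutter_frac * ink.shape[0]   # near-empty vertical strips

    columns, start = [], None
    for x, gutter in enumerate(is_gutter):
        if not gutter and start is None:
            start = x                            # a text column begins
        elif gutter and start is not None:
            columns.append(img.crop((start, 0, x, img.height)))
            start = None
    if start is not None:                        # page ends inside a column
        columns.append(img.crop((start, 0, img.width, img.height)))
    return columns
```

Such a projection-based split assumes reasonably clean scans; skewed or noisy pages would require a more robust layout-analysis step.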
Relatedly, this paper only peripherally addressed the added challenge of multiple tasks appearing on a single page.
Most math problems also named their variables directly. To make inference harder for LVLMs, replacing these with other terms or synonyms that remain obvious to humans could be explored.
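As a rough illustration of such a rewrite, the following sketch (with a hypothetical variable-to-phrase mapping, not taken from the dataset) replaces directly named variables in a task statement with descriptive phrases that a human reader still resolves immediately:

```python
# Minimal sketch (hypothetical mapping, not from the dataset): replace directly
# named variables with descriptive phrases that remain obvious to humans but
# force the model to infer the underlying quantities itself.
import re

RENAMES = {
    r"\bx\b": "the distance travelled",
    r"\bv\b": "the speed of the car",
    r"\bt\b": "the driving time",
}

def obscure_variables(task: str) -> str:
    for pattern, phrase in RENAMES.items():
        task = re.sub(pattern, phrase, task)
    return task

print(obscure_variables("Calculate x for v = 80 km/h and t = 1.5 h."))
# -> "Calculate the distance travelled for the speed of the car = 80 km/h ..."
```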
Another aspect is the contamination of the training data with questions that have been used for benchmarking. This can only be evaluated directly with access to the training data; however, it could be explored whether, for example, merely rephrasing the tasks has an impact, as is done when evaluating decontamination efforts.
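One lightweight probe, sketched below under the assumption that model answers can be scored automatically against gold answers (the `solve` callable stands in for a hypothetical model interface), is to compare accuracy on the original tasks with accuracy on semantically equivalent rephrasings; a pronounced drop on the rephrased variants would hint at memorization rather than genuine problem solving:

```python
# Minimal sketch (assumed setup): estimate a contamination signal as the gap
# between accuracy on original and on rephrased exam questions.
from typing import Callable, Sequence

def contamination_gap(
    solve: Callable[[str], str],     # hypothetical model call, returns an answer string
    tasks: Sequence[str],            # original exam questions
    rephrased: Sequence[str],        # manually or LLM-rephrased variants
    answers: Sequence[str],          # gold answers, shared by both variants
) -> float:
    """Return accuracy(original) - accuracy(rephrased)."""
    def accuracy(items: Sequence[str]) -> float:
        correct = sum(solve(q).strip() == a.strip() for q, a in zip(items, answers))
        return correct / len(answers)

    return accuracy(tasks) - accuracy(rephrased)
```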
While the primary goal of exploring these concepts is to increase the robustness of exams, their evaluation could also give further insight into the limits of current models and inform goals for future models.