
the Chrome driver, resulting in more efficient learning and higher accuracy. Next, we set up an environment that infers clickability from HTML and added the inference results to the state, which yielded equally efficient learning and improved accuracy compared with not using the LLM-inferred values. This shows that incorporating LLM inference results into the state is effective in DRL. As future work, it will be necessary to validate the approach on more complex web applications and to examine other types of information beyond clickability.
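As a minimal sketch of what "adding the inference results to the state" could look like, the following Python snippet appends one clickability value per candidate element to a base observation vector. All names here (`ClickabilityFn`, `tag_heuristic`, `augment_state`) are hypothetical and do not come from our implementation; in particular, `tag_heuristic` is only a stand-in for the LLM call so that the example runs without any external service.

```python
from typing import Callable, List, Sequence

# Hypothetical type: maps an HTML snippet to a clickability estimate in [0, 1].
# In the paper this role is played by an LLM; here it is an injected callable.
ClickabilityFn = Callable[[str], float]


def tag_heuristic(html: str) -> float:
    """Stand-in for the LLM: treats <button> and <a> elements as clickable."""
    snippet = html.strip().lower()
    return 1.0 if snippet.startswith(("<button", "<a ")) else 0.0


def augment_state(
    base_state: Sequence[float],
    element_html: Sequence[str],
    infer: ClickabilityFn = tag_heuristic,
) -> List[float]:
    """Append one clickability value per candidate element to the base state."""
    return list(base_state) + [infer(html) for html in element_html]


elements = [
    '<button id="add">Add</button>',
    "<span>Total</span>",
    '<a href="/cart">Cart</a>',
]
state = augment_state([0.0, 1.0], elements)
print(state)  # [0.0, 1.0, 1.0, 0.0, 1.0]
```

Passing the inference function as a parameter keeps the state construction independent of how clickability is obtained, so the heuristic can be swapped for an actual LLM query without changing the rest of the environment.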
ACKNOWLEDGEMENTS
ChatGPT (o1-preview and 4o) was used to translate this paper from Japanese to English. This work was supported by JSPS KAKENHI Grant Numbers JP22K12157, JP23K28377, and JP24H00714.
ICAART 2025 - 17th International Conference on Agents and Artificial Intelligence