
Hosseini, M., Snijders, R., Dalpiaz, F., Brinkkemper, S., Ali, R., and Ozum, A. (2015). REfine: A gamified platform for participatory requirements engineering.
Hou, X., Zhao, Y., Liu, Y., Yang, Z., Wang, K., Li, L., Luo, X., Lo, D., Grundy, J., and Wang, H. (2023). Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825.
Krejcie, R. V. and Morgan, D. W. (1970). Determining sample size for research activities. Educational and Psychological Measurement, 30(3):607–610.
Landis, J. R. and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, pages 159–174.
Li, H., Zhang, L., Zhang, L., and Shen, J. (2010). A user satisfaction analysis approach for software evolution. In 2010 IEEE International Conference on Progress in Informatics and Computing, volume 2, pages 1093–1097.
Maalej, W., Kurtanović, Z., Nabil, H., and Stanik, C. (2016). On the automatic classification of app reviews. Requirements Engineering, 21.
Maalej, W. and Nabil, H. (2015). Bug report, feature request, or simply praise? On automatically classifying app reviews. In 2015 IEEE 23rd International Requirements Engineering Conference (RE), pages 116–125. IEEE.
Maalej, W., Nayebi, M., Johann, T., and Ruhe, G. (2015). Toward data-driven requirements engineering. IEEE Software, 33:48–56.
Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., and Bossan, B. (2022). PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
Mekala, R. R., Razeghi, Y., and Singh, S. (2024). EchoPrompt: Instructing the model to rephrase queries for improved in-context learning.
Møller, A. G., Dalsgaard, J. A., Pera, A., and Aiello, L. M. (2023). The parrot dilemma: Human-labeled vs. LLM-augmented data in classification tasks. arXiv preprint arXiv:2304.13861.
Pagano, D. and Maalej, W. (2013). User feedback in the AppStore: An empirical study.
Pérez, A., Fernández-Pichel, M., Parapar, J., and Losada, D. E. (2023). DepreSym: A depression symptom annotated corpus and the role of LLMs as assessors of psychological markers. arXiv preprint arXiv:2308.10758.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Reiter, L. (2013). Zephyr. Journal of Business & Finance Librarianship, 18(3):259–263.
Restrepo, P., Fischbach, J., Spies, D., Frattini, J., and Vogelsang, A. (2021). Transfer learning for mining feature requests and bug reports from tweets and app store reviews.
Schulhoff, S., Ilie, M., Balepur, N., Kahadze, K., Liu, A., Si, C., Li, Y., Gupta, A., Han, H., Schulhoff, S., Dulepet, P. S., Vidyadhara, S., Ki, D., Agrawal, S., Pham, C., Kroiz, G., Li, F., Tao, H., Srivastava, A., Costa, H. D., Gupta, S., Rogers, M. L., Goncearenco, I., Sarli, G., Galynker, I., Peskoff, D., Carpuat, M., White, J., Anadkat, S., Hoyle, A., and Resnik, P. (2024). The prompt report: A systematic survey of prompting techniques.
Stanik, C., Haering, M., and Maalej, W. (2019). Classifying multilingual user feedback using traditional machine learning and deep learning. pages 220–226.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023a). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023b). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Vu, P. M., Nguyen, T. T., Pham, H. V., and Nguyen, T. T. (2015). Mining user opinions in mobile app reviews: A keyword-based approach. arXiv preprint arXiv:1505.04657.
Wang, S., Liu, Y., Xu, Y., Zhu, C., and Zeng, M. (2021). Want to reduce labeling cost? GPT-3 can help. arXiv preprint arXiv:2108.13487.
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. (2022). Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
Yu, D., Li, L., Su, H., and Fuoli, M. (2024). Assessing the potential of LLM-assisted annotation for corpus-based pragmatics and discourse analysis: The case of apology. International Journal of Corpus Linguistics.
Zhang, R., Li, Y., Ma, Y., Zhou, M., and Zou, L. (2023a). LLMaAA: Making large language models as active annotators. arXiv preprint arXiv:2310.19596.
Zhang, T., Irsan, I. C., Thung, F., and Lo, D. (2023b). Revisiting sentiment analysis for software engineering in the era of large language models. arXiv preprint arXiv:2310.11113.