
suggests promising indicators of controllability for narrative elements and explicitness. However, our error analysis revealed instances where control did not take effect, underscoring the need for further investigation when employing few-shot prompting with a state-of-the-art model such as GPT-3.5. Additionally, our findings demonstrate that the few-shot strategy can outperform the reference model in certain scenarios; however, these improvements are not consistently statistically significant.
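To make the few-shot strategy concrete, the sketch below (in Python) shows one way a control-annotated prompt for GPT-3.5 could be assembled. The field names, exemplar contents, and the build_prompt helper are illustrative assumptions, not our exact prompt format.

# Minimal sketch of few-shot prompt assembly for controllable question
# generation (CQG). The attribute labels ("character", "explicit", ...)
# follow the narrative-element/explicitness scheme discussed in the paper;
# the concrete formatting is a simplification.

EXEMPLARS = [
    {
        "context": "Once upon a time, a fox lived at the edge of the forest...",
        "narrative_element": "character",
        "explicitness": "explicit",
        "question": "Who lived at the edge of the forest?",
        "answer": "A fox.",
    },
    # ... further exemplars covering the other attribute combinations
]

def build_prompt(exemplars, target_context, narrative_element, explicitness):
    """Concatenate control-annotated exemplars, then the target instance."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Context: {ex['context']}\n"
            f"Narrative element: {ex['narrative_element']}\n"
            f"Explicitness: {ex['explicitness']}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # End the target instance at "Question:" so the model completes the
    # question (and answer) for the requested control attributes.
    parts.append(
        f"Context: {target_context}\n"
        f"Narrative element: {narrative_element}\n"
        f"Explicitness: {explicitness}\n"
        "Question:"
    )
    return "\n".join(parts)

The resulting string is what would be sent to the model; the attribute lines in each exemplar are what signal the desired narrative element and explicitness of the completion.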
Considering our results, which align with those of a smaller yet well-established reference model for QG and CQG tasks, we find it worthwhile to employ the few-shot strategy for CQG, especially when (1) data availability is limited or (2) one favors a “plug-and-play” AI-assisted approach. For future work, we consider it important to apply a post-model solution that checks whether generated QA pairs align with the attributes to be controlled, thereby excluding misaligned pairs; a sketch of such a filter is given below.
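The following sketch (in Python) illustrates the kind of post-model filter we have in mind. The predict_attributes heuristic is a hypothetical placeholder, included only so the example runs; in practice it would be replaced by a proper attribute classifier (e.g., a fine-tuned model or an LLM judge).

# Sketch of post-hoc filtering of generated QA pairs against the requested
# control attributes. Everything here is illustrative, not an implemented
# component of this work.

def predict_attributes(question, answer):
    """Placeholder heuristic standing in for a real attribute classifier."""
    q = question.lower()
    if q.startswith("who"):
        element = "character"
    elif q.startswith("where"):
        element = "setting"
    else:
        element = "action"
    # Crude proxy: very short answers are treated as explicit (likely lifted
    # verbatim from the text), longer ones as implicit.
    explicitness = "explicit" if len(answer.split()) <= 5 else "implicit"
    return element, explicitness

def filter_qa_pairs(qa_pairs, target_element, target_explicitness):
    """Keep only QA pairs whose predicted attributes match the targets."""
    kept = []
    for question, answer in qa_pairs:
        element, explicitness = predict_attributes(question, answer)
        if element == target_element and explicitness == target_explicitness:
            kept.append((question, answer))
    return kept

# Example: retain only explicit character questions from a candidate batch.
# filtered = filter_qa_pairs(candidates, "character", "explicit")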
LIMITATIONS
The effectiveness of controlling narrative elements and explicitness may vary across different datasets and tasks due to the unique characteristics of each context. Our study focuses on a specific domain and dataset, and we recognize this limitation. The lack of human evaluation is another limitation of this work. Although we believe the current evaluation process is solid for assessing the method’s performance in CQG, an assessment with domain experts may help to better understand the potential of CQG for educational purposes.
ACKNOWLEDGEMENTS
This work was financially supported by Base Funding - UIDB/00027/2020 of the Artificial Intelligence and Computer Science Laboratory (LIACC) funded by national funds through FCT/MCTES (PIDDAC). Bernardo Leite is supported by a PhD studentship (reference 2021.05432.BD), funded by Fundação para a Ciência e a Tecnologia (FCT).