to explain grading results is a must, and further work should be directed towards combining modern language models with explainability techniques.
The current work has a number of limitations. One of the most important is that our experiment was carried out on only one dataset. Other datasets, such as BEETLE (Dzikovska et al., 2010) or DT-Grade (Banjade et al., 2016), could be used to confirm the promising performance of BERT and XLNet for ASAG. Another limitation is that we did not use the largest BERT model, owing to limited computing power. Finally, we could not train beyond 10 epochs, and as a result tuned our early-stopping criterion based on observations from this 10-epoch experiment.
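To make that procedure concrete, the following is a minimal sketch of a patience-based early-stopping loop bounded by a 10-epoch budget, written in PyTorch. The patience value, the use of validation loss as the stopping metric, and the stand-in model and data are illustrative assumptions rather than our exact configuration.

```python
# Minimal sketch: early stopping within a hard 10-epoch budget.
# Patience value and validation metric are assumptions, not our exact setup.
import torch
import torch.nn as nn

MAX_EPOCHS = 10   # hard budget imposed by limited computing power
PATIENCE = 2      # assumed: stop after 2 epochs without improvement

model = nn.Linear(10, 3)  # stand-in for a fine-tuned BERT classifier head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# Dummy train/validation batches so the sketch runs end to end.
x_train, y_train = torch.randn(64, 10), torch.randint(0, 3, (64,))
x_val, y_val = torch.randn(32, 10), torch.randint(0, 3, (32,))

best_val_loss, epochs_without_improvement = float("inf"), 0
for epoch in range(MAX_EPOCHS):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    if val_loss < best_val_loss:
        best_val_loss, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= PATIENCE:
            print(f"early stop at epoch {epoch + 1}")
            break
```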
We plan to address the above-mentioned limitations in future work. We also intend to explore ensembling BERT with other classifiers to boost grading performance, especially by incorporating features that have proven successful in state-of-the-art systems; a minimal illustration of this idea is sketched below.
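The sketch below shows one way such an ensemble could be realized as stacking: BERT's predicted class probabilities are concatenated with handcrafted features (e.g., overlap or similarity scores of the kind used by state-of-the-art systems) and fed to a meta-classifier. The dummy arrays, the number of handcrafted features, and the choice of logistic regression as the meta-classifier are all illustrative assumptions, not a description of our implementation.

```python
# Illustrative stacking ensemble: BERT probabilities + handcrafted features
# feed a simple meta-classifier. All arrays are dummy stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_answers, n_classes, n_handcrafted = 200, 3, 4

bert_probs = rng.dirichlet(np.ones(n_classes), size=n_answers)  # BERT softmax outputs
handcrafted = rng.random((n_answers, n_handcrafted))            # e.g., overlap/similarity features
labels = rng.integers(0, n_classes, size=n_answers)             # gold grades

features = np.hstack([bert_probs, handcrafted])                 # stacked feature vectors
meta_clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(meta_clf.predict(features[:5]))                           # grades for the first 5 answers
```

In practice the meta-classifier would be trained on held-out predictions (e.g., via cross-validation) rather than on the same data used to fine-tune BERT, to avoid leaking training labels into the ensemble.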
ACKNOWLEDGMENTS
The current research is supported by an Insight grant from the Social Sciences and Humanities Research Council of Canada (SSHRC).
REFERENCES
Badger, E. and Thomas, B. (1992). Open-ended questions in reading. Practical Assessment, Research & Evaluation, 3(4).
Banjade, R., Maharjan, N., Niraula, N. B., Gautam, D., Samei, B., and Rus, V. (2016). Evaluation dataset (DT-Grade) and word weighting approach towards constructed short answers assessment in tutorial dialogue context. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 182–187.
Burrows, S., Gurevych, I., and Stein, B. (2015). The eras and trends of automatic short answer grading. International Journal of Artificial Intelligence in Education, 25(1):60–117.
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
Dagan, I., Glickman, O., and Magnini, B. (2005). The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer.
Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le, Q. V., and Salakhutdinov, R. (2019). Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
Daumé III, H. (2009). Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.
Devlin, J., Chang, M. W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dzikovska, M. O., Moore, J. D., Steinhauser, N., Campbell, G., Farrow, E., and Callaway, C. B. (2010). Beetle II: A system for tutoring and computational linguistics experimentation. In Proceedings of the ACL 2010 System Demonstrations, pages 13–18. Association for Computational Linguistics.
Dzikovska, M. O., Nielsen, R. D., Brew, C., Leacock, C., Giampiccolo, D., Bentivogli, L., Clark, P., Dagan, I., and Dang, H. T. (2013). SemEval-2013 Task 7: The joint student response analysis and 8th recognizing textual entailment challenge. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 263–274.
Goldberg, Y. (2017). Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
Heilman, M. and Madnani, N. (2013). ETS: Domain adaptation and stacking for short answer scoring. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 275–279.
Jimenez, S., Becerra, C., and Gelbukh, A. (2013). SOFTCARDINALITY: Hierarchical text overlap for student response analysis. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 280–284.
Jimenez, S., Gonzalez, F., and Gelbukh, A. (2010). Text comparison using soft cardinality. In International Symposium on String Processing and Information Retrieval, pages 297–302. Springer.
Mantecon, J. G. A., Ghavidel, H. A., Zouaq, A., Jovanovic, J., and McDonald, J. (2018). A comparison of features for the automatic labeling of student answers to open-ended questions. In Proceedings of the 11th International Conference on Educational Data Mining (EDM 2018).
Marvaniya, S., Saha, S., Dhamecha, T. I., Foltz, P., Sindhgatta, R., and Sengupta, B. (2018). Creating scoring rubric from representative student answers for improved short answer grading. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 993–1002. ACM.
Nielsen, R. D., Ward, W. H., Martin, J. H., and Palmer, M. (2008). Annotating students' understanding of science concepts. In LREC.