automation reliance. International Journal of Human-
Computer Studies, 58(6):697–718.
Frison, A.-K., Wintersberger, P., Riener, A., Schartmüller,
C., Boyle, L. N., Miller, E., and Weigl, K. (2019).
In UX We Trust: Investigation of Aesthetics and Us-
ability of Driver-Vehicle Interfaces and Their Impact
on the Perception of Automated Driving. In Proceed-
ings of the 2019 CHI Conference on Human Factors
in Computing Systems, pages 1–13. Association for
Computing Machinery, New York, NY, USA.
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Gian-
notti, F., and Pedreschi, D. (2018). A Survey of Meth-
ods for Explaining Black Box Models. ACM Comput.
Surv., 51(5):93:1–93:42.
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-
Term Memory. Neural Computation, 9(8):1735–1780.
Hoff, K. A. and Bashir, M. (2015). Trust in automation: In-
tegrating empirical evidence on factors that influence
trust. Human Factors, 57(3):407–434.
Jesus, S., Belém, C., Balayan, V., Bento, J., Saleiro, P.,
Bizarro, P., and Gama, J. (2021). How can I choose
an explainer? an Application-grounded Evaluation of
Post-hoc Explanations. In Proceedings of the 2021
ACM Conference on Fairness, Accountability, and
Transparency, FAccT ’21, pages 805–815, New York,
NY, USA. Association for Computing Machinery.
Kaur, H., Nori, H., Jenkins, S., Caruana, R., Wallach, H.,
and Wortman Vaughan, J. (2020). Interpreting In-
terpretability: Understanding Data Scientists’ Use of
Interpretability Tools for Machine Learning. In Pro-
ceedings of the 2020 CHI Conference on Human Fac-
tors in Computing Systems, CHI ’20, pages 1–14,
Honolulu, HI, USA. Association for Computing Ma-
chinery.
Keras (2021). Keras documentation: About Keras.
Kontogiannis, T. (1999). User strategies in recovering
from errors in man–machine systems. Safety Science,
32(1):49–68.
Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu,
S., Barnes, L., and Brown, D. (2019). Text Classifica-
tion Algorithms: A Survey. Information, 10(4):150.
Lai, S., Xu, L., Liu, K., and Zhao, J. (2015). Recurrent con-
volutional neural networks for text classification. In
Proceedings of the Twenty-Ninth AAAI Conference on
Artificial Intelligence, AAAI’15, pages 2267–2273,
Austin, Texas. AAAI Press.
Landis, J. R. and Koch, G. G. (1977). The measurement of
observer agreement for categorical data. Biometrics,
pages 159–174.
Lee, J. D. and See, K. A. (2004). Trust in Automation:
Designing for Appropriate Reliance. Human Factors,
46(1):50–80.
Lundberg, S. M. and Lee, S.-I. (2017). A Unified Ap-
proach to Interpreting Model Predictions. In Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R.,
Vishwanathan, S., and Garnett, R., editors, Advances
in Neural Information Processing Systems 30, pages
4765–4774. Curran Associates, Inc.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y.,
and Potts, C. (2011). Learning Word Vectors for Sen-
timent Analysis. In Proceedings of the 49th Annual
Meeting of the Association for Computational Lin-
guistics: Human Language Technologies, pages 142–
150. Association for Computational Linguistics.
Mittelstadt, B., Russell, C., and Wachter, S. (2019). Ex-
plaining Explanations in AI. In Proceedings of the
Conference on Fairness, Accountability, and Trans-
parency, FAT* ’19, pages 279–288, New York, NY,
USA. Association for Computing Machinery.
Nickerson, R. S. (1998). Confirmation Bias: A Ubiquitous
Phenomenon in Many Guises. Review of General Psy-
chology, 2(2):175–220.
Nourani, M., King, J., and Ragan, E. (2020). The Role
of Domain Expertise in User Trust and the Impact of
First Impressions with Intelligent Systems. Proceed-
ings of the AAAI Conference on Human Computation
and Crowdsourcing, 8:112–121.
Raybaud, S., Langlois, D., and Smaïli, K. (2011). “This
sentence is wrong.” Detecting errors in machine-
translated sentences. Machine Translation, 25(1):1.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why
Should I Trust You?”: Explaining the Predictions
of Any Classifier. In Proceedings of the 22nd
ACM SIGKDD International Conference on Knowl-
edge Discovery and Data Mining, KDD ’16, pages
1135–1144, New York, NY, USA. ACM.
Sanchez, J., Rogers, W. A., Fisk, A. D., and Rovira, E.
(2014). Understanding reliance on automation: Effects
of error type, error distribution, age and experience.
Theoretical Issues in Ergonomics Science,
15(2):134–160.
Sanderson, P. M. and Murtagh, J. M. (1990). Predicting
fault diagnosis performance: Why are some bugs hard
to find? IEEE Transactions on Systems, Man, and
Cybernetics, 20(1):274–283.
Sauer, J., Chavaillaz, A., and Wastell, D. (2016). Expe-
rience of automation failures in training: Effects on
trust, automation bias, complacency and performance.
Ergonomics, 59(6):767–780.
Tenney, I., Wexler, J., Bastings, J., Bolukbasi, T., Co-
enen, A., Gehrmann, S., Jiang, E., Pushkarna, M.,
Radebaugh, C., Reif, E., and Yuan, A. (2020). The
Language Interpretability Tool: Extensible, Interac-
tive Visualizations and Analysis for NLP Models.
arXiv:2008.05122 [cs].
Wobbrock, J. O., Findlater, L., Gergle, D., and Higgins,
J. J. (2011). The aligned rank transform for nonparametric
factorial analyses using only ANOVA procedures.
In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, CHI ’11, pages 143–
146, New York, NY, USA. Association for Computing
Machinery.
Xiong, D., Zhang, M., and Li, H. (2010). Error detection
for statistical machine translation using linguistic fea-
tures. In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, ACL
’10, pages 604–611, USA. Association for Computa-
tional Linguistics.
Zhang, P., Wang, J., Farhadi, A., Hebert, M., and Parikh, D.
(2014). Predicting Failures of Vision Systems. In Pro-
ceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 3566–3573.