
cally LIME and SHAP. Our experiments reveal that models with similarly high accuracy can rely on different features, including potentially spurious and irrelevant ones, in their decision-making, emphasizing that high accuracy alone does not guarantee model reliability. Among the components tested, varying the feature extractor introduced the highest variability in feature reliance, identifying it as the primary factor in underspecification; however, optimizers and initial weights can also contribute. While this study focuses primarily on these three components, the proposed framework can be extended to investigate additional elements within ML pipelines, following the component-swap protocol sketched below.
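To make that protocol concrete, the sketch below shows the core idea: hold every pipeline choice fixed except one, retrain, and collect an attribution vector per image for later comparison. This is a minimal illustration rather than the paper's experimental code; `build_pipeline` and `occlusion_attribution` are hypothetical helpers, and a simple occlusion map stands in for LIME/SHAP so the sketch stays self-contained.

```python
import numpy as np
import tensorflow as tf

def build_pipeline(extractor_name, optimizer_name="adam", seed=0):
    # One pipeline variant: frozen pretrained extractor + small head.
    # Only `extractor_name` varies; optimizer and seed are held fixed.
    tf.keras.utils.set_random_seed(seed)
    base = getattr(tf.keras.applications, extractor_name)(
        include_top=False, pooling="avg",
        input_shape=(160, 160, 3), weights="imagenet")
    base.trainable = False
    model = tf.keras.Sequential(
        [base, tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer=optimizer_name,
                  loss="sparse_categorical_crossentropy")
    return model

def occlusion_attribution(model, image, label, patch=32):
    # Crude stand-in for LIME/SHAP: a region's attribution is the drop
    # in the true-class probability when that region is grayed out.
    base_prob = model.predict(image[None], verbose=0)[0, label]
    h, w, _ = image.shape
    heat = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = 0.5
            prob = model.predict(masked[None], verbose=0)[0, label]
            heat.append(base_prob - prob)
    return np.asarray(heat)

# Vary only the feature extractor; after fine-tuning each variant on the
# task data (omitted here), compare the attribution vectors pairwise.
extractors = ["ResNet50V2", "DenseNet121", "MobileNet", "Xception"]
models = {name: build_pipeline(name) for name in extractors}
```

Models with matching test accuracy but divergent attribution vectors are precisely the underspecified ensembles the framework is designed to expose.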
While this study effectively highlights the prevalence of underspecification across ML pipeline components and identifies where it is likely to occur, it does not directly address how to reduce it. We also observe slight inconsistencies in LIME explanations due to its inherent sampling randomness, which suggests that relying solely on LIME may limit the robustness of an underspecification analysis. Moreover, our cosine distance-based ClassLevelScore metric, despite its effectiveness, is sensitive on simpler datasets such as MNIST, potentially amplifying the measured variability, and hence the apparent underspecification, in these cases; a schematic rendering of the metric appears after this paragraph. Furthermore, while dataset quality and representation are known to significantly affect underspecification, these aspects are not directly explored in this work, as they are extensively studied in the literature. Lastly, although post-hoc explanation tools such as LIME and SHAP provide valuable insights, their computational cost may limit their applicability to datasets with many classes or instances, posing a scalability challenge in more complex settings.
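The following is a schematic reading of the cosine distance-based class-level comparison, under the assumption that each pipeline variant yields one attribution vector per class and that the score averages pairwise distances; the metric's exact formulation is defined earlier in the paper and may differ in detail. It also shows how LIME's sampling randomness can be pinned via the `random_state` argument of `LimeImageExplainer`.

```python
import numpy as np
from itertools import combinations
from lime import lime_image

def cosine_distance(u, v, eps=1e-12):
    # 1 - cosine similarity; 0 means two variants rely on identical features.
    # `eps` guards near-zero vectors, e.g. sparse MNIST attributions, which
    # is one source of the sensitivity noted above.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)

def class_level_score(attribution_vectors):
    # attribution_vectors: one attribution vector per pipeline variant,
    # all for the same class. A higher mean pairwise distance indicates
    # stronger underspecification for that class.
    return float(np.mean([cosine_distance(u, v)
                          for u, v in combinations(attribution_vectors, 2)]))

# Fixing LIME's sampling seed makes repeated explanations reproducible,
# separating LIME's own randomness from genuine pipeline underspecification.
explainer = lime_image.LimeImageExplainer(random_state=42)
```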
Future work will focus on addressing these limitations by exploring strategies to mitigate underspecification. This may involve identifying which feature extractors or optimizers contribute most consistently to stable feature reliance, testing alternative initialization methods (illustrated briefly below), and developing a framework to guide pipeline configurations toward reduced variability.
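As one example of the initialization axis, the snippet below enumerates a few standard Keras initializers that could be swapped into a pipeline variant; these particular choices are illustrative, not the set used in this study.

```python
import tensorflow as tf

# Illustrative alternatives for the classifier head's initialization;
# seeding each keeps the comparison reproducible across retraining runs.
candidate_inits = {
    "glorot": tf.keras.initializers.GlorotUniform(seed=0),
    "he": tf.keras.initializers.HeNormal(seed=0),
    "orthogonal": tf.keras.initializers.Orthogonal(seed=0),
}
heads = {name: tf.keras.layers.Dense(10, activation="softmax",
                                     kernel_initializer=init)
         for name, init in candidate_inits.items()}
```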