Authors:
Artur Ferreira 1,2,3 and Mário Figueiredo 3,2,1
Affiliations:
1 ISEL, Instituto Superior de Engenharia de Lisboa, Instituto Politécnico de Lisboa, Portugal
2 Instituto de Telecomunicações, Lisboa, Portugal
3 IST, Instituto Superior Técnico, Universidade de Lisboa, Portugal
Keyword(s):
Cancer Detection, Classification, k-Fold Data Split, Explainability, Feature Selection, Leave-One-Out.
Abstract:
Learning with high-dimensional (HD) data poses many challenges, since the large number of features often yields redundancy and irrelevance issues, which may decrease the performance of machine learning (ML) methods. When learning with HD data, one often resorts to feature selection (FS) approaches to avoid the curse of dimensionality. FS may improve the results, but by itself it does not provide explainability, in the sense of identifying the small subset of core features that most influence the prediction of the ML model, which can still be seen as a black box. In this paper, we propose k-fold feature selection (KFFS), an FS approach that sheds some light on that black box by resorting to the k-fold data partition procedure and a generic unsupervised or supervised FS filter. KFFS finds small and decisive subsets of features for a classification task, at the expense of increased computation time. On HD data, KFFS finds subsets of features with dimensionality small enough to be analyzed by human experts (e.g., a medical doctor in a cancer detection problem). It also yields classification models with lower error rates and fewer features than those obtained by using the individual supervised FS filter alone.
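The abstract only outlines KFFS at a high level (a k-fold data partition combined with a generic FS filter), so the following Python sketch is an illustrative assumption, not the paper's exact procedure: it runs a filter (here SelectKBest with f_classif, chosen for illustration) on each training fold, counts how often each feature is selected, and keeps the features chosen in at least a given fraction of folds; the filter, the top_m and min_votes parameters, and the voting rule are all hypothetical.

# Hypothetical sketch of k-fold feature selection in the spirit of KFFS:
# select features with a filter on each training fold, then keep the
# features that receive enough "votes" across folds.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold

def kfold_feature_selection(X, y, k=10, top_m=20, min_votes=0.8):
    """Return indices of features selected in >= min_votes fraction of folds."""
    votes = np.zeros(X.shape[1], dtype=int)
    for train_idx, _ in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        # Run the FS filter on the training part of this fold only.
        selector = SelectKBest(f_classif, k=top_m).fit(X[train_idx], y[train_idx])
        votes[selector.get_support(indices=True)] += 1
    return np.flatnonzero(votes >= min_votes * k)

X, y = load_breast_cancer(return_X_y=True)  # example HD-ish cancer dataset
selected = kfold_feature_selection(X, y)
print(f"{len(selected)} of {X.shape[1]} features kept:", selected)

In this sketch, requiring agreement across folds is what shrinks the subset to a size a human expert could inspect; the actual aggregation rule and filters evaluated by KFFS are described in the paper itself.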