categorized into two groups. The first group is
univariate techniques, also known as rankers, which
score and rank the features individually and retain a
given number of top-ranked attributes (the threshold). The second group is
multivariate techniques, which utilize a particular
search strategy and a variety of performance metrics
to identify the best subset of features.
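As a minimal illustration of the ranker-plus-threshold idea (not drawn from any of the cited studies), a univariate filter can be applied as sketched below; the synthetic data, the ANOVA F-value scorer, and the 10% cut-off are assumptions made only for the example.

```python
# Minimal sketch of a univariate (ranker) filter with a percentage threshold.
# The synthetic data and the 10% cut-off are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

X, y = make_classification(n_samples=500, n_features=100, n_informative=10,
                           random_state=42)

# Score every feature independently (ANOVA F-value) and keep the top 10%.
selector = SelectPercentile(score_func=f_classif, percentile=10)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)       # (500, 10): only the highest-ranked features remain
print(selector.scores_[:5])  # per-feature scores used for the ranking
```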
In the study (Effrosynidis & Arampatzis, 2021),
the authors assessed the efficacy of feature selection
methods for classification tasks on eight
environmental datasets, using RF and LGBM. The
study employed six filter methods, four wrapper
methods, two embedded methods, and six ensemble
methods, i.e., twelve individual methods and six
ensembles in total. The findings revealed that the
most effective individual methods across the eight
datasets were Shapley Additive Explanations and
Permutation Importance, with Reciprocal Ranking
performing best among the six ensemble methods.
LGBM was also found to outperform Random Forest.
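As a brief aside, permutation importance, one of the best-performing individual methods reported in that study, can be computed with scikit-learn; the random-forest model and synthetic data below are assumptions made purely for illustration and do not reflect the cited experiments.

```python
# Minimal sketch of permutation importance, assuming a random-forest classifier
# and synthetic data; neither comes from the cited study.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure the drop in score.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:5])  # indices of the five most important features
```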
In the paper (Nemani et al., 2022), the authors evaluated the
effectiveness of filter and wrapper feature reduction
techniques on a high-dimensional dataset covering
multiple scales, aiming to predict the distribution of
species assemblages. The study used underwater
video sampling as ground truth to identify five
species assemblages. The features that predicted the
presence of these assemblages were evaluated using
both filter and wrapper methods, and the selected
features were modeled using SVM, RF, and extreme
gradient boosting (XGB). The highest accuracy
(61.67%) and a kappa value of 0.49 were achieved by
the XGB model that employed features selected by
the scale-factor from the Boruta wrapper algorithm.
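For context, a Boruta run can be set up in Python with the third-party boruta package (BorutaPy); the random-forest estimator and synthetic data below are illustrative assumptions, not the configuration of the cited study.

```python
# Minimal sketch of Boruta feature selection, assuming the `boruta` package
# (BorutaPy) and synthetic data; this is not the cited study's setup.
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=1)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=1)
boruta = BorutaPy(rf, n_estimators='auto', random_state=1)
boruta.fit(X, y)  # BorutaPy expects NumPy arrays

print(np.where(boruta.support_)[0])  # indices of features confirmed as relevant
```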
In the study (Wieland, Kerkow, Früh, Kampen, &
Walther, 2017), the authors used a data science
approach to select a set of features for an SVM model
that relates the distribution of a specific invasive
mosquito species to climate data. For the feature
selection, they used a genetic algorithm. The outcome
of this data-driven selection was contrasted with the
feature sets chosen by two biologists on the basis of
their domain expertise. The paper
then considers how data science might be used to
produce new knowledge and identifies its
shortcomings. Results show that the distribution
model built with the features selected by the proposed
approach outperforms the model built with the
features chosen by the two biologists. To the best of
our knowledge, the present study represents the first
attempt to identify the best threshold choice for filter
methods, regardless of the univariate filters and classification
techniques used. Moreover, this paper uses the Scott-Knott
(SK) statistical test since it performs well
compared to other statistical tests such as those of
Calinski and Corsten (Calinski & Corsten, 1985) and
Cox and Spjotvoll. In addition, we used the Borda Count
voting method to rank the classifiers that belong to
the best SK clusters. Within this context, this paper
conducts several experiments to evaluate and
compare the impact of different thresholds on the
performance of different classifiers. For that, five
feature ranking techniques are used: ReliefF, Linear
Correlation, Mutual Information, Fisher Score and
ANOVA F-value. Furthermore, RF, LGBM, DT and
SVM classification techniques are used to assess the
performance of the selected subsets provided by five
thresholds (5%, 10%, 20%, 40% and 50%). The
rationale for selecting these four classifiers is their
wide usage in several studies related to environmental
datasets. The classifiers are evaluated using k-fold
cross-validation together with the accuracy, kappa
and F1-score metrics. In total, this study evaluates 312 variants of
classifiers: 4 classifiers × 26 feature selection methods
(5 univariate filters × 5 selection thresholds + the
entire feature set) × 3 datasets, and aims at addressing
the following research questions:
• (RQ1): What is the best threshold choice
regardless of the feature ranking and
classification techniques used?
• (RQ2): Is there any classifier that distinctly
outperforms the others?
• (RQ3): Are there any combinations of feature
selection and classifiers that outperform the
others?
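To make the experimental protocol concrete, the following sketch shows one cell of this evaluation grid (the ANOVA F-value ranker combined with RF) across the five thresholds. The synthetic data, the 10-fold setting, and the scikit-learn-based implementation are assumptions made only for illustration; they do not reproduce the species datasets or the exact tooling of this study.

```python
# Sketch of one cell of the experimental grid (ANOVA F-value ranker + RF),
# assuming synthetic data in place of the species datasets and cv=10.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=600, n_features=80, n_informative=12,
                           random_state=7)

scoring = {"accuracy": "accuracy",
           "f1": "f1",
           "kappa": make_scorer(cohen_kappa_score)}

for percentile in (5, 10, 20, 40, 50):          # the five thresholds
    pipe = Pipeline([
        ("rank", SelectPercentile(f_classif, percentile=percentile)),
        ("clf", RandomForestClassifier(random_state=7)),
    ])
    scores = cross_validate(pipe, X, y, cv=10, scoring=scoring)  # k-fold CV
    print(percentile,
          scores["test_accuracy"].mean(),
          scores["test_kappa"].mean(),
          scores["test_f1"].mean())
```

Repeating this loop for each ranker, classifier, and dataset yields the 312 variants evaluated in this study.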
The main contributions of this paper can be
summarized as follows:
1. Assessing the impact of the five thresholds (5%,
10%, 20%, 40% and 50%) on the four classifiers
(RF, LGBM, DT and SVM) using the five
univariate filters (ReliefF, Linear Correlation,
Mutual Information, Fisher Score and ANOVA F-
value).
2. Comparing the performances of the different
classifiers using the best-selected thresholds.
3. Evaluating the best combination (Classifier +
feature ranking method + threshold value) for
each classifier over the three species datasets
(P.Moussieri, P.Ochruros and P.Phoenicurus)
using the SK test and Borda Count, as illustrated in the sketch after this list.
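To illustrate the aggregation step, a plain Borda count over hypothetical per-dataset rankings might look as follows; the classifier orderings below are invented for the example and are not results from this study. The Scott-Knott clustering itself is typically run with a dedicated statistical package, so only the voting step is sketched here.

```python
# Minimal sketch of Borda Count voting over hypothetical per-dataset rankings
# of the classifiers retained in the best Scott-Knott cluster.
from collections import defaultdict

# Hypothetical rankings (best first), one list per dataset; not real results.
rankings = [
    ["RF", "LGBM", "SVM", "DT"],
    ["LGBM", "RF", "SVM", "DT"],
    ["RF", "SVM", "LGBM", "DT"],
]

scores = defaultdict(int)
for ranking in rankings:
    n = len(ranking)
    for position, clf in enumerate(ranking):
        scores[clf] += n - 1 - position   # Borda points: n-1 for first, ... 0 for last

final_order = sorted(scores, key=scores.get, reverse=True)
print(final_order)   # aggregated ranking, e.g. ['RF', 'LGBM', 'SVM', 'DT']
```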
The remaining sections of this paper are organized
as follows: Section 2 provides details regarding the
study area, including the species occurrence datasets,
environmental data, and the practical steps taken to