Impact of Thresholds of Univariate Filters for Predicting Species

Distribution

Yousra Cherif

1

, Ali Idri

1,2

and Omar El Alaoui

1

Software Project Management Research Team, ENSIAS, Mohammed V University in Rabat, Morocco

2

Mohammed VI Polytechnic University Benguerir, Morocco

Keywords: Species Distribution Models, Redstart Bird, Feature Selection, Univariate Filters, Environmental Data,

Machine Learning, Classification.

Abstract: Researchers rely on species distribution models (SDMs) to establish a correlation between species occurrence

records and environmental data. These models offer insights into the ecological and evolutionary aspects of

the subject. Feature selection (FS) aims to choose useful interlinked features or remove those that are

unnecessary and redundant, reduce model costs, storage needs, and make the induced model easier to

understand. Therefore, to predict the distribution of three bird species, this study compares five filter-based

univariate feature selection methods to select relevant features for classification tasks using five thresholds,

as well as four classifiers; Support Vector Machine (SVM), Light gradient-boosting machine (LGBM),

Decision Tree (DT), and Random Forest (RF). The empirical evaluations involve several techniques, such as

the 5-fold cross-validation method, the Scott Knott (SK) test, and Borda Count. In addition, we used three

performance criteria (accuracy, kappa and F1-score). Experiments showed that 40% and 50% thresholds were

the best choice for classifiers, with RF outperforming LGBM, DT and SVM. Finally, the best combination

for each classifier is as follows: RF and LGBM classifiers using Mutual information with 40% threshold, DT

using ReliefF with 50% thresholds, and SVM using Anova F-value with 40% thresholds.

1 INTRODUCTION

Maintaining biodiversity relies heavily on

safeguarding and conserving diverse species, their

habitats, and ecosystems (IUCN, 2002). The

preservation of species’ habitats can contribute to

mitigating carbon dioxide emissions and promoting a

healthy, unpolluted environment (IUCN, 2002;

Mawdsley, O’Malley, & Ojima, 2009). Furthermore,

as all species are interdependent in some manner, the

disappearance of one species can threaten the survival

of others. To achieve ecological stability and ensure

the protection of habitats, ecologists have developed

different approaches, and one of these is species

distribution modeling. This method is designed to

preserve species and their habitats. Species

distribution models (SDMs) are known by,

bioclimatic envelope models, habitat suitability

models, and ecological niche models. They

investigate how species occurrences are related to

environmental variables by analyzing the geographic

distribution of species (Guisan & Zimmermann,

2000). There are several SDM methods, and they vary

in how well they can condense the connections

between response and predictor variables. The

objective of FS is to generate a set of features that

effectively characterizes a specific problem. This is

accomplished by recognizing significant features

while excluding redundant or irrelevant ones (Guyon,

2006). FS has additional benefits beyond its capacity

to enhance data mining (DM) model performance

(Bolón-Canedo, Sánchez-Maroño, & Alonso-

Betanzos, 2015), including the ability to reduce the

number of measurements, decrease execution time

(Jaganathan & Kuppuchamy, 2013; Liu & Yu, 2005).

Filter, wrapper, and embedded models are the three

main classifications of FS algorithms. Filter methods

evaluate the relevance of a feature set by how well it

correlates with the dependent variable, whereas

wrapper methods assess the value of a feature set by

actually using it to train a model. On the other hand,

Embedded methods incorporate the FS process into

the machine learning algorithm’s training phase.

Researchers added new hybrid approaches to these

three FS technique types to combine their benefits

and discard their shortcomings. FS methods can be

86

Cherif, Y., Idri, A. and El Alaoui, O.

Impact of Thresholds of Univariate Filters for Predicting Species Distribution.

DOI: 10.5220/0012203000003598

In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2023) - Volume 1: KDIR, pages 86-97

ISBN: 978-989-758-671-2; ISSN: 2184-3228

categorized into two groups. The first group is

univariate techniques, also known as rankers, which

rank the features by selecting a particular number of

attributes to retain (threshold). The second group is

multivariate techniques, which utilize a particular

search strategy and a variety of performance metrics

to identify the best subset of features.

In the study (Effrosynidis & Arampatzis, 2021),

authors assessed the efficacy of feature selection

methods for classification tasks on eight

environmental datasets, using RF and LGBM. The

study employed six filter methods, four wrapper

methods, two embedded methods, and six ensemble

methods. Twelve individuals and six ensembles were

used to evaluate the performance of the feature

selection methods. The findings revealed that the

most effective individual methods were Shapley

Additive Explanations and Permutation Importance

across the eight datasets with Reciprocal Ranking

performing the best among the six ensemble methods.

LGBM was found to outperform Random Forest. In

the paper (Nemani et al., 2022), authors evaluated the

effectiveness of filter and wrapper feature reduction

techniques on a high-dimensional dataset covering

multiple scales, aiming to predict the distribution of

species assemblages. The study used underwater

video sampling as ground truth to identify five

species assemblages. The features that predicted the

presence of these assemblages were evaluated using

both filter and wrapper methods, and the selected

features were modeled using SVM, RF, and extreme

gradient boosting (XGB). The highest accuracy

(61.67%) and a kappa value of 0.49 was achieved by

the XGB model that employed features selected by

the scale-factor from the Boruta wrapper algorithm.

In the study (Wieland, Kerkow, Früh, Kampen, &

Walther, 2017), authors used a data science technique

to choose a set of features using SVM, which is

employed to establish a relationship between the

distribution of a specific invasive mosquito species

and climate data. For the feature selection they used

genetic algorithm. The simulation’s outcome based

on data science was contrasted with the results of two

biologists based on their domain expertise. The paper

then considers how data science might be used to

produce new knowledge and identifies its

shortcomings. Results show that the distribution

model with the features selected using the proposed

approach gives better performance than the

distribution model with the features selected by the

two biologists. To the best of our knowledge, the

present study represents the first attempt to focus on

filter methods with the best threshold choice

regardless of the univariate filters and classification

techniques used. Moreover, this paper uses the Scott

Knott (SK) statistical test since it shows high

performance compared to other statistical tests

Calinski and Corsten (Calinski & Corsten, 1985), and

Cox and Spjotvoll . Besides, we used the Borda Count

voting method to rank the classifiers that belong to

the best SK clusters. Within this context, this paper

conducts several experiments to evaluate and

compare the impact of different thresholds on the

performance of different classifiers. For that, five

feature ranking techniques are used: ReliefF, Linear

Correlation, Mutual Information, Fisher Score and

Anova F-value. Furthermore, RF, LGBM, DT and

SVM classification techniques are used to assess the

performance of the selected subsets provided by five

thresholds (5%, 10%, 20%, 40% and 50%). The

reasoning for selecting these four classifiers is their

wide usage in several studies related to environmental

datasets. The classifiers are evaluated using the k-fold

cross validation method and the accuracy, kappa and

F1-score. In total, this study evaluates 312 variants of

classifiers: 4 classifiers *26 feature selection methods

(5 univariate-filters *5 selection-thresholds + the

entire feature set) *3 datasets and aims at addressing

the following research questions:

• (RQ1): What is the best threshold choice

regardless of the feature ranking and

classification techniques used?

• (RQ2): Is there any classifier which distinctly

outperformed the others?

• (RQ3): Are there any combinations of feature

selection and classifiers that outperform the

others?

The main significant contributions of this paper

can be condensed into:

1. Assessing the impact of the five thresholds (5%,

10%, 20%, 40% and 50%) on the four classifiers

(RF, LGBM, DT and SVM) using the five

univariate-filters (ReliefF, Linear Correlation,

Mutual Information, Fisher Score and Anova F-

value).

2. Comparing the performances of the different

classifiers using the best-selected thresholds.

3. Evaluating the best combination (Classifier +

feature ranking method + threshold value) for

each classifier over the three species datasets

(P.Moussieri, P.Ochruros and P.Phoenicurus)

using SK test and Borda Count.

The remaining sections of this paper are organized

as follows: Section 2 provides details regarding the

study area, including the species occurrence datasets,

environmental data, andthe practical steps taken to

Impact of Thresholds of Univariate Filters for Predicting Species Distribution

87

perform all empirical evaluations. Section 3

summarizes and analyzes the obtained results.

Threats to the validity of the study are discussed in

Section 4. Finally, Section 5 presents a summary of

the conclusions drawn from the study and suggests

avenues for future research.

2 MATERIAL AND METHODS

2.1 Study Area

The study was conducted in Morocco, which is

situated in the northwest of Africa and has a surface

area of 710,850 km². Morocco is situated adjacent to

Algeria, Mauritania, and Spain, and is bounded by the

Mediterranean Sea in the north and the Atlantic

Ocean in the west. The geography of Morocco includes

the Atlantic Ocean, mountains, and the Sahara Desert.

The latitude and longitude of Morocco fall between

21° and 36°N and 1° and 17°W, respectively. Most of

the land is occupied by mountains, including the Rif,

the Middle Atlas, the High Atlas, and the Anti-Atlas,

with Toubkal and Ayachi being the highest peaks. The

climate of Morocco is highly influenced by its

surrounding bodies of water and the Sahara Desert,

resulting in varying temperatures and precipitation

levels throughout the year. There is an average annual

rainfall of 318.8 mm, and the temperature ranges from

9.4°C to 26°C. The precipitation is heavy from October

to April and at its lowest from June to August.

2.2 Species Occurrence Datasets

The dataset utilized in this study is composed of

occurrences from three bird species that fall under the

Phoenicurus genus group in accordance with

taxonomical classification. The genus Phoenicurus

belongs to the Muscicapidae family and is commonly

referred to as the Redstart. Among the eleven species

of passerine birds that belong to this genus, the

species occurrence data utilized in this study

specifically pertains to three: Phoenicurus Moussieri,

Phoenicurus Ochruros, and Phoenicurus Phoenicurus.

The presence-only dataset used in this study

comprises 10,993 observation records, as sourced

from the GBIF global database. For data balancing,

we generated pseudo-absence data by randomly

selecting points within a circle with a 50km radius

centered on each presence location. Table 1 displays

the occurrence data for the bird species. Figure 1

displays the geographic locations of the three species

of Redstart found in Morocco. The Moussier’s

Redstart is predominantly present in the mountainous

regions and can often be spotted on rocky hills

covered with shrubs, and arid slopes that feature open

forests and sparsely planted trees. The Black Redstart

species is strongly associated with rocky

environments, both natural (such as cliffs, rocky

scree, rocky slopes, and ravines) and man-made (such

as various human constructions), as they use rocks for

nesting. Conversely, the Common Redstart species is

typically found in forest areas, favoring deciduous

forests but also inhabiting mixed forests with

dominant conifers in the northern and eastern parts of

its range. This species avoids excessively dense facies

and prefers to occupy old open woodlands, edges and

clearings, riparian forests, as well as secondary

human-made wooded environments like parks and

gardens.

2.3 Environmental Data

Many believe that a species’ distribution is closely

tied to geographical and climatic changes. Since the

survival of the three redstart species studied here is

highly dependent on their environment, we chose

climate conditions as a predictor variable for our

distribution models. These models provide a de-

tailed representation of the state of the land, allowing

for a more accurate prediction of where these birds

are likely to be found. To construct these models, we

used 19 bioclimatic predictors obtained from the

global database called Worldclim (Fick & Hijmans,

2017). These indicators signify the present climatic

conditions and were estimated from information

gathered between 1970 and 2000. The models were

trained using a 2.5 arc-minute grid with

approximately 5km resolution across Morocco and

incorporated elevation data obtained from the SRTM

Digital Elevation Database. A set of 20

environmental variables were employed as predictors

for developing models that forecast the distribution of

the three redstart species, as outlined in Figure 2.

Figure 1: Location of the three redstart birds in the study

area.

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval

88

Figure 2: Environmental predictors used to model the

distribution of the three redstarts species in Morocco.

Table 1: Description of the three redstart birds in Morocco.

Scientific Name Common Name Total

observations

Phoenicurus

Moussieri

Moussier’s

Redstart

5223

Phoenicurus

Ochruros

Black Redstart 3364

Phoenicurus

Common Redstart 2406

2.4 Experimental Design

This section outlines the methodology used for ex-

perimentation, which includes: (1) the statistical tests

Scott Knott and Borda Count used to group the

classifiers based on their accuracy values and to rank

the best-performing classifiers within the SK cluster

according to three performance metrics, and (2) the

experimental procedure used to conduct all empirical

evaluations.

2.4.1 Statistical Test and Borda Count

Scott Knott is a clustering algorithm commonly used

in the analysis of variance (ANOVA). The method

was introduced by Scott and Knott in 1974 and

involves the use of multiple comparisons of treatment

means to identify overlapping groups.

Borda Count is a voting procedure for elections with

two winners. According to this method, candidates

are awarded points based on where they are ranked:

last choice receives one point, second-to-last choice

receives two points, and so on up to the top. The

candidate with the highest score after adding the point

values for each rank is declared the winner (García-

Lapresta & Martínez-Panero, 2002).

2.4.2 Experimental Process

In this study, we used ReliefF, Linear Correlation,

Mutual Information, Fisher Score and Anova F-value

to rank features with five thresholds: 5%, 10%, 20%,

40%, and 50%. Moreover, we evaluate and compare

the four well-known classifiers (RF, LGBM, DT and

SVM) over the datasets obtained using the five filters

and the dataset with the entire feature set. The

classifiers’ performance was evaluated using the 5-

fold cross-validation technique, the three criteria

accuracy, kappa and F1-score, the Scott Knott

statistical test, and Borda Count voting technique. It

consists of five steps:

Step 1: Return a feature ranking list for each feature

ranking method (ReliefF, Linear

Correlation, Mutual Information, Fisher

Score and Anova F-value).

Step 2: For each feature ranking list, select the top

ranked features according to the 5

thresholds(5%, 10%, 20%, 40% and 50%).

Step 3: Construct four classifiers (DT, SVM, LGBM

and RF) for each feature subset and the entire

feature set using 5-fold cross validation to

obtain accuracy, kappa and F1-score.

Step 4: Evaluate and compare the performances of

the classifiers using the SK test based on

accuracy.

Step 5: Rank the classifiers belonging to the best SK

cluster using Borda Count voting system

based on accuracy, kappa and F1-score.

To enhance clarity, the following abbreviations

were employed: ReliefF was referred to as R, Mutual

Information as M, Linear Correlation as C, Anova F-

value as A, and Fisher score as F. Furthermore, to

describe a feature subset that was chosen using a

ranker and threshold, we included the ranker

abbreviation and the threshold number together. The

entire feature set was referred to as ORG. Therefore,

M5 denotes the feature subset obtained using Mutual

Information and 5% threshold, while RFF50 denotes

the RF classifier that was trained using Fisher score

and 50% threshold.

3 RESULTS AND DISCUSSION

This section is devoted to the results and discussions

of the empirical evaluations of the filter-based

univariate feature selection over the three species

datasets (P. Moussieri, P. Ochruros, and P.

Phoenicurus). First, we assess the impact of the five

selected thresholds on the models’ performance using

four classifiers and the three metrics (accuracy, kappa

and F1-score). In addition, we clustered the best-

performing models using the SK test based on

accuracy and we ranked those belonging to the best

SK cluster using Borda Count voting system based on

the three metrics.

Impact of Thresholds of Univariate Filters for Predicting Species Distribution

89

Furthermore, we used the Borda Count voting ap-

proach, considering the three metrics and the SK test,

to cluster the best-performing models as part of our

first research question (RQ1). Then, we compared the

performances of the four classifiers using the best-

selected thresholds to each other (RQ2). Finally, we

identified the best combination for each classifier

based on the best feature ranking method and the best

threshold choice (RQ3).

3.1 (RQ1): What Is the Best Threshold

Choice Regardless of the Feature

Ranking and Classification

Techniques Used?

In this subsection, we selected 5%, 10%, 20%, 40%

and 50% thresholds using five feature ranking

methods. Then, we trained every subset in addition to

the entire feature set using four classifiers (RF,

LGBM, SVM and DT). Moreover, we employed the

SK test to compare the performance of the four

classifiers across various threshold values based on

accuracy. Lastly, we used Borda Count based on

accuracy, kappa and F1-score to rank the thresholds

that belong to the best SK cluster. This research

question explores if there is an optimal threshold

value for univariate filter FS techniques. Tables 2 - 4

summarize the mean accuracy values of the different

classifiers over the three datasets. We observe that:

• For the P.Moussieri dataset, the best accuracy

values reached 92.11% with the entire feature

set and 92% using Anova F-Value with 40%

threshold for RF, 91.31% with the entire feature

set and 91.15% using Mutual Information with

40% threshold for LGBM, 89.94% using Linear

corre- lation with 50% threshold for DT,

78.68% with the entire feature set and 78.16%

using ReliefF with 50% threshold for the SVM

classifier. The worst accuracy values reached

60.18%, 60.14%, 60.18%, 59.7% using ReliefF

with 5% threshold for RF, LGBM, DT and SVM

respectively.

• For the P.Ochruros dataset, the best accuracy

values reached 91.52% using Fisher score with

50% threshold for RF, 90.6% with the entire

feature set and 90.48% using Fisher score with

50% threshold for LGBM, 88.76% using

ReliefF with 50% threshold for DT and 69.98%

using Fisher score with 10% threshold for SVM.

The worst accuracy values reached 63.85%,

63.8% and 63.85% using ReliefF with 5%

threshold for RF, LGBM and DT respectively,

and 55.25% using Mutual Informa- tion with

10% threshold for SVM.

• For the P.Phoenicurus dataset, the best accuracy

values reached 90.54% with the entire feature

set and 90.43% using Mutual Information with

40% threshold for RF, 89.82% with the entire

Table 2: Accuracy values of the different classifiers over the Phoenicurus Moussieri dataset.

Classifier Method 5 10 20 40 50 ORG

RF

R

C

M

F

A

60.18%

85.41%

86.60%

85.61%

85.17%

77.89%

90.08%

89.97%

90.24%

90.07%

89.26%

91.53%

91.83%

91.57%

91.53%

91.46%

91.94%

91.96%

91.82%

92%

91.55%

91.7%

91.94%

91.85%

91.91%

92.11%

LGBM

R

C

M

F

A

60.14%

79.90%

77.62%

79.97%

79.90%

77.58%

86.68%

86.55%

86.31%

86.68%

87.26%

89.09%

90.04%

89.30%

89.09%

90.33%

90.57%

91.15%

90.31%

90.57%

90.35%

90.67%

91.14%

90.23%

90.67%

91.31%

DT

R

C

M

F

A

60.18%

85.5%

86.47%

85.61%

85.5%

77.77%

87.42%

87.69%

87.44%

87.5%

88.17%

88.05%

88.71%

88.85%

88.06%

89.38%

88.88%

89.31%

88.94%

88.85%

89.15%

89.94%

88.98%

88.81%

89.07%

89.54%

SVM

R

C

M

F

A

59.7%

76.4%

69.81%

71.96%

76.4%

67.81%

77.04%

77.02%

75.39%

77.04%

75.68%

76.11%

76.46%

77.51%

76.11%

78.12%

70.4%

70.43%

72.15%

70.4%

78.16%

70.96%

71.76%

76.16%

70.96%

78.68%

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval

90

Table 3: Accuracy values of the different classifiers over the Phoenicurus Ochruros dataset.

Classifier Method 5 10 20 40 50 ORG

RF

R

C

M

F

A

63.85%

86.34%

86.24%

85.96%

86.37%

75.39%

88.43%

89.98%

89.47%

88.41%

88.49%

90.47%

91.3%

91.04%

90.44%

90.69%

91.07%

91.25%

91.31%

91.19%

91.24%

91.01%

91.28%

91.52%

90.98%

91.45%

LGBM

R

C

M

F

A

63.8%

79.6%

81.57%

78.55%

79.6%

75.15%

83.14%

86.69%

87.43%

83.14%

87.01%

89.14%

89.24%

89.85%

89.14%

89.5%

89.68%

90%

90.13%

89.68%

89.89%

89.76%

90.38%

90.48%

89.76%

90.6%

DT

R

C

M

F

A

63.85 %

86.3 %

86.25 %

86.02 %

86.3 %

75.43 %

87.26 %

87.66 %

87.58 %

87.38 %

86.75 %

87.67 %

88.06 %

88.23 %

88.75 %

88.35 %

88.65 %

87.88 %

88.73 %

88.52 %

88.76 %

87.78 %

87.99 %

88.64 %

87.64 %

88.73 %

SVM

R

C

M

F

A

63.6%

62.56%

59.54%

61.35%

62.56%

63.8%

64.01%

55.25%

69.98%

64.01%

69.33%

69.6%

57.23%

66.2%

69.6%

64.48%

68.95%

57.46%

68.11%

68.95%

64.57%

69.01%

59.86%

68.06%

69.01%

68.09%

Table 4: Accuracy values of the different classifiers over the Phoenicurus Phoenicurus dataset.

Classifier Method 5 10 20 40 50 ORG

RF

R

C

M

F

A

67.87%

84.77%

85.01%

83.25%

84.62%

77.05%

86.66%

88.8%

88.43%

86.9%

87.64%

89.49%

90.1%

89.01%

89.38%

90.34%

89.73%

90.43%

90.32%

89.75%

90.38%

90.1%

90.04%

90.25%

90.1%

90.54%

LGBM

R

C

M

F

A

67.87%

79.83%

81.14%

79.27%

79.83%

76.72%

83.84%

86.43%

85.99%

83.84%

85.99%

87.25%

88.54%

87.58%

87.25%

87.97%

88.75%

89.36%

89.17%

88.75%

89.6%

88.97%

89.08%

89.21%

88.97%

89.82%

DT

R

C

M

F

A

67.87%

84.55%

85.01%

82.97%

84.55%

77.01%

84.86%

86.01%

85.32%

85.12%

86.21%

86.03%

87.49%

86.08%

86.14%

86.47%

86.58%

86.82%

87.12%

86.51%

87.36%

87.47%

87.1%

87.43%

87.17%

87.34%

SVM

R

C

M

F

A

67.87%

71.44%

74.07%

73.05%

71.44%

68.61%

71.74%

74.27%

74.98%

71.44%

68.7%

74.22%

70.46%

74.9%

74.22%

71.68%

74.77%

70.18%

72.85%

74.77%

71.5%

68%

73.98%

72.79%

68%

74.77%

feature set and 89.6% using ReliefF with 50%

threshold for LGBM, 87.49% using Mutual

Information with 20% threshold for DT and

74.98% using Fisher score with 10% threshold for

SVM. The worst accuracy values reached 67.87%

using ReliefF with 5% threshold for RF, LGBM

and DT and SVM respectively.

Impact of Thresholds of Univariate Filters for Predicting Species Distribution

91

Afterwards, we used the SK test to select the best

thresholds in terms of accuracy for all the classifiers

and feature ranking methods over each dataset. We

observe that:

• Using the P.Moussieri dataset, we found fifteen

clusters, where the best SK cluster contains 40%

threshold, 50% threshold with all the five

methods and (20% threshold only four times

using RF) and the entire feature set using both

RF and LGBM classifiers. Therefore, we can

conclude that it is preferable to use for this

dataset, 40% and 50% thresholds.

• Using the P.Ochruros dataset, we found

nineteen clusters, where the best SK cluster

contains 40% threshold, 50% threshold and

(20% threshold only four times) and the entire

feature set using both RF and LGBM classifiers.

Therefore, we can con- clude that it preferable

to use for this dataset, 40% and 50% as

thresholds.

• Using the P.Phoenicurus dataset, we found

thirteen clusters, where the best SK cluster

contains 40% threshold, 50% threshold and

(20% threshold only one time) and the entire

feature set using both RF and LGBM classifiers.

Therefore, we can conclude that it preferable to

use for this dataset, 40% and 50% as thresholds.

• Furthermore, we count the number of occur-

rences of the five thresholds (5%, 10%, 20%,

40% and 50%) in the first three SK clusters (C1,

C2, C3) over each dataset (P.Moussieri,

P.Ochruros, and P.Phoenicurus). Table 5

presents the number of oc- currences of each

threshold in the first three SK clusters regardless

of the classifiers-filters over the three datasets.

We can observe that:

• For P.Moussieri, the 5% threshold has never

appeared in the three clusters. In addition, the

10% threshold appeared 4 times in the third

cluster only. As for the 20% threshold, it

appeared 4 times, and one time in the best and

the third clusters respectively. Moreover, the

40% and 50% thresholds appeared 5 times, 3

times and 2 times in the best, second and third

clusters respectively.

• For P.Ochruros, the 5% threshold has never ap-

peared in the three clusters. In addition, the 10%

threshold appeared two times in the second and

third clusters. As for the 20% threshold, it

appeared 4 times in the best and second clusters,

and two times in the third cluster. Moreover, the

40% threshold appeared 5 times in the best and

second clusters, and 4 times in the third cluster.

The 50% threshold appeared 7 times, 3 times,

and 2 times in the best, second and third clusters

respectively.

• For P.Phoenicurus, the 5% threshold has never

appeared in the three clusters. In addition, the

10% threshold appeared two times and one time

in the second and third clusters respectively. As

for the 20% threshold, it appeared one time, 4

times, and 5 times in the best, second, and third

clusters re- spectively. Moreover, the 40%

threshold appeared 5 times in the best and

second clusters, and 2 times in the third cluster.

The 50% threshold appeared 6 times, 4 times,

and 5 times in the best, second and third clusters

respectively.

Thus, we can conclude that the 50% and 40%

thresholds are the best ones since they appeared 37

times and 36 times over the first three clusters

respectively, whereas the 20% and 10% thresholds

appeared 25 times and 11 times respectively. Besides,

the 50% and 40% appeared 18 times and 15 times in

the best SK cluster respectively where the 20%

appeared only 9 times. As for the 10%, it has never

appeared in the best SK cluster. The worst threshold

was 5%, since it never appeared in the first three

clusters.

Finally, to determine the rankings of the

thresholds that belong to the best SK cluster, we used

Borda Count method based on accuracy, kappa and

F1-score measures. Table 6 presents the top ten ranks.

Table 5: Number of occurrences of each threshold in the first three SK clusters regardless of the classifiers-filters over the

three datasets.

Threshold 5% 10% 20% 40% 50%

Dataset C1 C2 C3 C1 C2 C3 C1 C2 C3 C1 C2 C3 C1 C2 C3

Phoenicurus Moussieri 0 0 0 0 0 4 4 0 1 5 3 2 5 3 2

Phoenicurus Ochruros 0 0 0 0 2 2 4 4 2 5 5 4 7 3 2

Phoenicurus Phoenicurus 0 0 0 0 2 1 1 4 5 5 5 2 6 4 5

Total 0 0 0 0 4 7 9 8 8 15 13 8 18 10 9

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval

92

Table 6: The top ten ranks of the classifiers using Borda

Count over each dataset.

Rank Phoenicurus

Moussieri

Phoenicurus

Ochruros

Phoenicurus

1 RFORG RFF50 RFORG

2 RFA40 RFORG RFM40

3 RFM40 RFF40 RFR50

4 RFC40 RFM20 RFR40

5 RFM50 RFM50 RFF40

6 RFA50 RFM40 RFF50

7 RFF50 RFR50 RFA50

8 RFM20 RFA40 RFC50

9 RFF40 RFC40 RFM20

10 RFC50 RFF20 RFM50

of the best classifiers for each of the three datasets. As

it can be seen:

• For P.Moussieri dataset, the top ten techniques

contain four classifiers using 40% and 50%

thresholds and one trained on subsets using 20%

threshold. These classifiers cover RFA40,

RFM40, RFC40 and RFM50 of which the

accuracies achieved 92%, 91.96%, 91.94% and

91.94% respectively

• For P.Ochruros dataset, the top ten techniques

contain four classifiers based on 40% threshold,

three classifiers based on 50% threshold and two

classifiers trained on subsets using 20%

threshold. These classifiers contain RFF50,

RFF40, RFM20 and RFM50 of which the

accuracies achieved 91.52%, 91.31%, 91.3%

and 91.28% respectively.

• For P.Phoenicurus dataset, the top ten

techniques contain five classifiers using 50%

thresholds, three classifiers based on 40%, and

one classifier trained on subsets using 20%

threshold. These classifiers cover RFM40,

RFR50, RFR40 and RFF40 of which the

accuracies achieved 90.43%, 90.38%, 90.34%

and 90.32% respectively.

We found that 40% and 50% thresholds are the

best thresholds since they appeared in the top ten

ranks twenty-three times, whereas the 20% threshold

appeared four times: one time fourth, one time eighth,

one time ninth and one time tenth. Furthermore, these

thresholds provide a classifier with the same

performance as with the entire feature set, especially

for RF, since RFORG belongs to the same cluster.

3.2 (RQ2): Is There any Classifier

Which Distinctly Outperformed the

Others?

This subsection aims to compare the four classifiers

in predicting the distribution of the three bird species.

For this intent, we used the SK test to cluster the

classifiers using 40% and 50% thresholds, as it can

been seen in Figure 3, Figure 4 and Figure 5, and the

Borda Count to rank the classifiers of the best SK

cluster based on the three-performance metrics. We

observed that:

• Using the P.Moussieri and P.Ochruros datasets,

we obtained seven clusters where the best one

contains RF classifier, the second one contains

LGBM, the third one contains DT, and the

remaining clusters contain SVM. Note that we

did not use the Borda Count method since the

best clus- ters across the two datasets contain

only RF.

• Using the P.Phoenicuros dataset, we obtained

eight clusters where the best one contains RF

and LGBM classifiers with ReliefF using 50%

thresh- old. The second one contains LGBM

with the other remaining filters except for

ReliefF using 40% threshold which belongs to

the third one with the DT classifier. The

remaining clusters contain SVM classifier.

When using Borda Count, we found that RF

appears in all the ten first ranks, whereas the

LGBM appears in the last rank.

Figure 3: SK results of the four classifiers with the thresholds 40% and 50% over Phoenicurus Moussieri.

Impact of Thresholds of Univariate Filters for Predicting Species Distribution

93

Figure 4: SK results of the four classifiers with the thresholds 40% and 50% over Phoenicurus Ochruros.

Figure 5: SK results of the four classifiers with the thresholds 40% and 50% over Phoenicurus Phoenicurus.

3.3 (RQ3): Are There any

Combinations of Feature Selection

and Classifiers that Outperform the

Others?

This subsection aims to find the best combination for

each classifier regardless of the feature ranking

method with 40% and 50% thresholds over the three

datasets: P.Moussieri, P.Ochruros and P.Phoenicurus.

To this end, we used SK test to cluster the different

combinations to identify those with the same

predictive capabilities in terms of accuracy. Finally,

we used Borda Count to rank the combinations that

belong to the best SK cluster. Figure 6, Figure 7 and

Figure 8 present the results of the SK test based on

accuracy was used to compare the performances of

the different combinations of each classifier over the

three datasets. Our findings indicate that:

• Over the P.Moussieri, P.Ochruros and

P.Phoenicurus datasets, we obtained one cluster

using RF, LGBM and DT with all the five

methods, which implies that all the possible

combinations for each classifier have equivalent

predictive accuracy capabilities.

As for SVM:

• We obtained four clusters using P.Moussieri

where the best cluster contains ReliefF with

40% and 50% thresholds. Which implies that

these two combinations have equivalent

predictive accuracy capabilities.

• We obtained four clusters using P.Ochruros

where the best cluster contains Anova F-value,

Linear Correlation and Fisher score using 40%

and 50% thresholds.

• We obtained five clusters using P.Phoenicurus

where the best SK cluster contains Anova F-

value and Linear Correlation with 40%

threshold and Mutual Information with 50%

threshold.

Lastly, in order to rank the combinations of each

classifier we used Borda Count. We found that:

• For the RF classier, RFM40 was the best since

it was ranked first with P.Phoenicurus, second

with P.Moussieri and fourth with P.Ochruros.

Moreover, RFF50 was ranked first with

P.Ochruros, fifth with P.Phoenicurus and sixth

with P.Moussieri. RFA40 was ranked first with

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval

94

P.Moussieri, sixth with P.Ochruros and ninth

with P.Phoenicurus.

• For the LGBM classier, LGBMM40 was the

best since it was ranked first with P.Moussieri,

second with P.Phoenicurus and fourth with

P.Ochruros. Besides, LGBMF50 was ranked

first with P.Ochruros, third with P.Phoenicurus

and tenth with P.Moussieri. For LGBMR50, it

was ranked first with P.Phoenicurus, fifth with

P.Ochruros and seventh with P.Moussieri.

• For the DT classier, DTR50 was the best since

it was ranked first with P.Ochruros, third with

P.Phoenicurus and P.Moussieri. As for DTC50,

it was ranked first with P.Phoenicurus, sixth

with P.Moussieri and ninth with P.Ochruros.

DTR40, was ranked first with P.Moussieri, sixth

with P.Ochruros and tenth with P.Phoenicurus.

• For the SVM classier, SVMA40 was the

best since it was ranked first with P.Phoenicurus

and third with P.Ochruros. As for SVMA50 and

SVMR50 they were ranked first with

P.Ochruros and P.Moussieri respectively.

Figure 6: SK results to identify the best combination classifier-filter over Phoenicurus Moussieri dataset.

Figure 7: SK results to identify the best combination classifier-filter over Phoenicurus Ochruros dataset.

Impact of Thresholds of Univariate Filters for Predicting Species Distribution

95

Figure 8: SK results to identify the best combination classifier-filter over Phoenicurus Phoenicurus dataset.

4 THREATS OF VALIDITY

In this section we will introduce and describe the

study’s three primary threats to validity: internal

validity, external validity, and construct validity.

Internal Validity: We used in the study 5-fold

cross validation to improve the reliability of the

average accuracy of various classifiers and to prevent

overfitting. Furthermore, despite thorough double-

checking of the implementation, errors may occur

during the execution of the planned experiment.

External Validity: The study focused on a single

dataset that included three bird species from the same

taxonomic group: P.Moussieri, P.Ochruros, and

P.Phoenicurus. Moreover, we generated pseudo-

absence data to balance our data, since we have only

presence data. As a result, the study’s findings cannot

be applied to all species in that taxonomic class. To

overcome this limitation, additional research should

be carried out using different datasets or alternative

classifiers and feature ranking techniques to validate

or contradict the findings of this study.

Construct Validity: This study focused on three

commonly used evaluation metrics to ensure the

reliability of the classifier results: accuracy, kappa

and F1-score. Furthermore, the researchers used the

SK test and Borda Count to draw better conclusions,

giving equal weight to the three-performance metrics:

accuracy, kappa and F1-score. This approach was

employed to prevent any bias towards a specific

performance criterion, ensuring that equal

consideration was given to all criteria.

5 CONCLUSION AND FUTURE

WORKS

This study is an empirical evaluation of 312 variants

of classifiers; 5 thresholds (5%, 10%, 20%, 40% and

50%) and the original feature set as well as five

feature ranking methods (ReliefF, Linear Correlation,

Mutual Information, Fisher Score and Anova F-

value) when training four classifiers (RF, LGBM, DT

and SVM) over the three datasets. The empirical

evaluations were conducted across all the three

datasets using three performance criteria, including

accuracy, kappa and F1-score. In addition to this, the

evaluations made use of the SK test and the Borda

Count to evaluate and rank the performance of the

models.

• (RQ1): What is the best threshold choice

regardless of the feature ranking and

classification techniques used?

40% and 50% thresholds were conducting to

better results as they were always appearing in

the best SK clusters and the best ranks.

Moreover, these thresholds provide a classifier

with the same performance as with the entire

feature set, especially for RF, since RFORG

belongs to the same cluster.

• (RQ2): Is there any classifier which

distinctly outperformed the others?

The Random Forest classifier outperformed

all the three classifiers over the three datasets.

Followed by: LGBM, DT and SVM.

KDIR 2023 - 15th International Conference on Knowledge Discovery and Information Retrieval

96

• (RQ3): Are there any combinations of

feature selection and classifiers that

outperform the others?

For RF and LGBM classifiers the best

combination was using Mutual information

with 40% threshold. As for DT, the best

combination was using ReliefF with 50%

thresholds compared to the others. Finally, for

SVM, the best combination goes for Anova F-

value with a 40% threshold.

Our ongoing work intends to build ensemble

feature ranking methods by using additional feature

ranking techniques for promising results.

REFERENCES

Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-

Betanzos, A. (2015). Recent advances and emerging

challenges of feature selection in the context of big data.

Knowledge-Based Systems, 86, 33–45. https://doi.org/

10.1016/J.KNOSYS.2015.05.014

Calinski, T., & Corsten, L. C. A. (1985). Clustering Means

in ANOVA by Simultaneous Testing. Biometrics,

41(1), 39. https://doi.org/10.2307/2530641

Effrosynidis, D., & Arampatzis, A. (2021). An evaluation

of feature selection methods for environmental data.

Ecological Informatics, 61, 101224. https://doi.org/

10.1016/J.ECOINF.2021.101224

Fick, S. E., & Hijmans, R. J. (2017). WorldClim 2: new 1-

km spatial resolution climate surfaces for global land

areas. International Journal of Climatology, 37(12),

4302–4315. https://doi.org/10.1002/JOC.5086

García-Lapresta, J. L., & Martínez-Panero, M. (2002).

Borda count versus approval voting: A fuzzy approach.

Public Choice, 112(1), 167–184. https://doi.org/

10.1023/A:1015609200117/METRICS

Guisan, A., & Zimmermann, N. E. (2000). Predictive

habitat distribution models in ecology. Ecological

Modelling, 135(2–3), 147–186. https://doi.org/10.1016/

S0304-3800(00)00354-9

Guyon, I. (2006). Feature Extraction Foundations and

Applications. October, 207(10), 740. Retrieved from

http://www.springerlink.com/content/j847w74269401

u31/

IUCN. (2002). Strategic planning for species conservation:

a handbook. In International Union for Conservation of

Nature and Natural Resources Council, Gland,

Switzerland.

Jaganathan, P., & Kuppuchamy, R. (2013). A threshold

fuzzy entropy based feature selection for medical

database classification. Computers in Biology and

Medicine, 43(12), 2222–2229. https://doi.org/10.1016/

J.COMPBIOMED.2013.10.016

Liu, H., & Yu, L. (2005). Toward integrating feature

selection algorithms for classification and clustering.

IEEE Transactions on Knowledge and Data

Engineering, 17(4), 491–502. https://doi.org/10.1109/

TKDE.2005.66

Mawdsley, J. R., O’Malley, R., & Ojima, D. S. (2009). A

review of climate-change adaptation strategies for

wildlife management and biodiversity conservation.

Conservation Biology : The Journal of the Society for

Conservation Biology, 23(5), 1080–1089.

https://doi.org/10.1111/J.1523-1739.2009.01264.X

Nemani, S., Cote, D., Misiuk, B., Edinger, E., Mackin-

McLaughlin, J., Templeton, A., … Robert, K. (2022).

A multi-scale feature selection approach for predicting

benthic assemblages.

Estuarine, Coastal and Shelf

Science, 277, 108053. https://doi.org/10.1016/

J.ECSS.2022.108053

Wieland, R., Kerkow, A., Früh, L., Kampen, H., & Walther,

D. (2017). Automated feature selection for a machine

learning approach toward modeling a mosquito

distribution. Ecological Modelling, 352, 108–112.

https://doi.org/10.1016/J.ECOLMODEL.2017.02.029

Impact of Thresholds of Univariate Filters for Predicting Species Distribution

97