more reasonable results with respect to random guessing. Moreover, none of the 15 ensembles was ranked first across all datasets. The E0F ensembles were more reasonable than the other ensembles in four datasets, and the ENF ensembles were the best in two datasets (COCOMO81 and Miyazaki). The E1F ensembles, however, were the least reasonable in all datasets.
Nevertheless, the accuracy results in terms of eight performance measures suggest that the ENF ensembles, in particular with the combiners IR or ME, outperformed the E1F and E0F ensembles in five out of six datasets. This implies that letting ensemble members use different feature subsets can lead to more accurate estimations than having all members use the same feature subset or all the available features. Indeed, the success of the ENF ensembles is mainly due to their members being more diverse than those of E1F or E0F, generating different estimations for the same data point. Moreover, the E0F ensembles generated slightly better estimates than the E1F ones. We therefore conclude that ensembles without feature selection were better and easier to construct than ensembles with one filter.
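The diversity mechanism behind the ENF ensembles can be illustrated with a minimal sketch: each member is a simple analogy-based (k-NN) estimator restricted to its own feature subset, and the member outputs are combined by the median, one of the classical combiner rules. This is an illustration only, not the experimental setup of this study; the data, function names, and choice of k are hypothetical.

```python
from statistics import median

def knn_estimate(train, target, query, features, k=2):
    """Analogy-based estimate: mean effort of the k training
    projects nearest to the query, using only `features`."""
    def dist(project):
        return sum((project[f] - query[f]) ** 2 for f in features) ** 0.5
    nearest = sorted(train, key=dist)[:k]
    return sum(p[target] for p in nearest) / k

def ensemble_estimate(train, target, query, subsets, k=2):
    """Each member uses a different feature subset (the source of
    diversity); the median combines the member estimates."""
    return median(knn_estimate(train, target, query, fs, k) for fs in subsets)

# Hypothetical toy data: projects described by size and team features.
train = [
    {"loc": 10, "team": 2, "effort": 100},
    {"loc": 20, "team": 3, "effort": 200},
    {"loc": 30, "team": 5, "effort": 350},
]
subsets = [("loc",), ("team",), ("loc", "team")]
estimate = ensemble_estimate(train, "effort", {"loc": 28, "team": 2}, subsets)
print(estimate)  # → 275.0
```

Because each member sees a different projection of the data, their individual estimates disagree (here 275.0, 150.0, and 275.0), and the combiner aggregates those disagreements into a single prediction.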
Ongoing work will focus on investigating the im-
pact of other feature selection techniques, including
filters or wrappers, on the accuracy of homogeneous and heterogeneous ensembles.
ICSOFT 2018 - 13th International Conference on Software Technologies