Understanding the Interplay of Simultaneous Model Selection and
Representation Optimization for Classification Tasks
Fabian Bürger and Josef Pauli
Intelligent Systems Group, University of Duisburg-Essen, Bismarckstraße 90, 47057 Duisburg, Germany
Keywords:
Model Selection, Representation Learning, Classification, Evolutionary Optimization.
Abstract:
The development of classification systems that meet the desired accuracy levels for real-world tasks requires a lot of expertise. Numerous challenges, like noisy feature data, suboptimal algorithms and
hyperparameters, degrade the generalization performance. On the other hand, almost countless solutions have
been developed, e.g. feature selection, feature preprocessing, automatic algorithm and hyperparameter selec-
tion. Furthermore, representation learning is emerging to automatically learn better features. The challenge
of finding a suitable and tuned algorithm combination for each learning task can be solved by automatic opti-
mization frameworks. However, the more components are optimized simultaneously, the more complex their
interplay becomes with respect to the generalization performance and optimization run time. This paper an-
alyzes the interplay of the components in a holistic framework which optimizes the feature subset, feature
preprocessing, representation learning, classifiers and all hyperparameters. The evaluation on a real-world
dataset that suffers from the curse of dimensionality shows the potential benefits and risks of such holistic
optimization frameworks.
1 INTRODUCTION
Classifier systems learn the connection between fea-
ture data and class labels and are potentially useful in
many applications such as image-based object recog-
nition, automated quality inspection and medical di-
agnosis systems. The development of classifier sys-
tems for real-world tasks is still challenging, even
though many sophisticated classifiers, such as Sup-
port Vector Machines (SVM) or random forests, ex-
ist. The reasons for this are manifold: Firstly, the
input feature data can be noisy, especially when raw
sensor data is used, like pixel data from camera sen-
sors. The development of task-specific features which
are invariant to certain variations is time-consuming
and requires a lot of expertise. Secondly, there is no
best performing general-purpose machine learning al-
gorithm and so a suitable one has to be chosen. Fur-
thermore, most algorithms have hyperparameters that
need to be adapted to each learning task.
Many well known, established solutions to almost
all of these challenges exist, like feature selection al-
gorithms, feature preprocessing methods or automatic
algorithm and hyperparameter selection methods.
The field of representation learning aims at im-
proving the feature data and has shown great performance boosts, e.g. in deep learning (Bengio et al., 2013). The goal is to automatically generate more suitable features out of low-level data. One way of
learning a better representation is manifold learning
which results in simpler and lower-dimensional features
(Ma and Fu, 2011).
The large number of potentially useful approaches
makes it difficult to select the optimal algorithm com-
bination for each learning task. This challenge has
motivated the development of automatic optimization
frameworks that can handle a large number of solutions. Currently, there are two promising optimization approaches to tackle the highly combinatorial, so-called algorithm configuration problem. The first
approach is Evolutionary Optimization (Bäck, 1996)
which simulates the natural evolution process. It has
been successfully used to select features and classifier
hyperparameters (Huang and Chang, 2007; Ansótegui
et al., 2009), also in combination with manifold learn-
ing (Bürger and Pauli, 2015). The second approach
is Bayesian optimization (Snoek et al., 2012; Hutter
et al., 2011) that collects all information about the
optimization trajectory in a probabilistic model and
evaluates the next most promising areas of the search
space. The Auto-WEKA framework (Thornton et al.,
2013) uses this approach to optimize features, algo-
rithms and hyperparameters.
These holistic frameworks are successful, but the
large degree of adaptability of the optimized classi-
fier systems can cause overfitting that degrades the
generalization performance. Therefore, a deeper un-
derstanding of the interplay of the involved optimiza-
tion components is necessary. This paper quantifies
the impact of feature selection, feature preprocessing,
manifold and representation learning, classifier selec-
tion and hyperparameter tuning on generalization per-
formance and optimization time using an Evolution-
ary Algorithm. Furthermore, three aspects of the optimization algorithm itself are analyzed.
This paper is organized as follows. Section 2
lists a selection of approaches to improve machine
learning. Section 3 presents a holistic classification
framework that incorporates these aspects. Section
4 discusses the Evolutionary Optimization strategy
to adapt the algorithm configuration to each learning
task. Section 5 presents the results of the evaluation
of the framework. Finally, section 6 contains the con-
clusions.
2 MACHINE LEARNING SOLUTIONS
This section discusses approaches to improve ma-
chine learning, subdivided into established methods
and representation learning.
2.1 Established Approaches
There are some very popular “standard” approaches
to improve the performance of machine learning sys-
tems, which are described in e.g. (Jain et al., 2000) or
(Bishop and Nasrabadi, 2006).
Feature selection algorithms aim to choose a useful subset of features in order to remove irrelevant
and noisy dimensions. Feature selection is a remedy
to the curse of dimensionality. Feature preprocess-
ing methods are relatively simple algorithms that e.g.
normalize the value ranges of the features to a defined
range or remove correlations between features, which
is known as pre-whitening (Juszczak et al., 2002).
Model selection algorithms aim to select an optimal classification model for a specific task. This can comprise the selection of a specific algorithm out of a portfolio of alternatives and the tuning of hyperparameters. The best performing model
is usually determined by generalization estimation
based on model validation techniques such as cross-
validation.
2.2 Representation Learning
The development of task-specific features can be one
of the most time-consuming parts of the development
process. The field of automatic feature construction
methods is one part of representation learning and
aims at automatically learning better feature representations from the provided data. The family of manifold learning methods provides such functionality; examples are Principal Component Analysis (PCA), Isomap, Locally Linear Embedding (LLE) or Autoencoders. References to these methods can be found
e.g. in (Van der Maaten et al., 2009) or (Ma and Fu,
2011). Most of these methods are unsupervised and
thus do not require expensive, labeled ground truth
data. However, the success of these methods depends
on the datasets. Furthermore, not all methods provide
a so-called direct out-of-sample extension that em-
beds unseen instances into the learned feature space,
but only an approximation, which can degrade the generalization performance.
3 HOLISTIC CLASSIFICATION FRAMEWORK
The classification pipeline is similar to our previous work (Bürger and Pauli, 2015) and incorporates
all approaches discussed in section 2 into a holistic
framework. The pipeline consists of four pipeline el-
ements and is depicted in figure 1. The originally
proposed “feature scaling” element is replaced by a
generalized preprocessing element. The pipeline has
two modes, namely the training and classification
mode. The training mode is used to adapt and train
the pipeline configuration θ using the training dataset
T. This training dataset $T = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is the input of the framework and contains $m$ labeled feature vectors $x_i \in \mathbb{R}^{D_{\mathrm{in}}}$ and corresponding class labels $y_i \in \{\omega_1, \omega_2, \ldots, \omega_c\}$. The classification mode uses a specific configuration to set up a pipeline which processes the incoming feature vectors and returns class labels. The functionality of each pipeline element is described in the following.
3.1 Feature Selection Element
The first element contains the functionality of feature
selection to remove irrelevant features as soon as pos-
sible. A feature subset $S_{\mathrm{FeatSet}} \in \mathcal{P}(\{1, 2, \ldots, D_{\mathrm{in}}\}) \setminus \{\emptyset\}$ is selected during the training mode. In the classification mode, the determined subset of dimensions is selected from the incoming feature vectors and the others are removed. The feature dimensionality is decreased according to the number of selected features $|S_{\mathrm{FeatSet}}|$.

Figure 1: Classification pipeline structure and data processing in the training and classification mode. [The figure shows the four pipeline elements (feature selection, feature preprocessing, feature transform and classifier) mapping an input vector to a class label; the training dataset feeds the training mode.]
3.2 Feature Preprocessing Element
The second pipeline element contains the functionality of feature preprocessing to provide better usable features for the learning algorithms in the next two pipeline elements. This pipeline element has a portfolio of preprocessing algorithms $S_{\mathrm{PreProc}}$ which contains four methods, namely feature rescaling of each dimension to the range $[0, 1]$, statistical standardization, $L_2$ normalization and pre-whitening, and additionally the identity. In the training mode, one method $f_{\mathrm{PreProc}} \in S_{\mathrm{PreProc}}$ is chosen. In the classification mode, the selected preprocessing is applied to the incoming feature vectors.
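The following sketch shows how such a portfolio could be realized in plain NumPy; the guards against zero-range dimensions and the eigendecomposition-based pre-whitening are our assumptions about details the paper does not specify:

```python
import numpy as np

def fit_preproc(method: str, X: np.ndarray):
    """Fit the chosen preprocessing f_PreProc on training data X (n_samples x D).

    Returns a function that applies the learned transformation to new vectors."""
    if method == "rescale":        # rescale each dimension to the range [0, 1]
        lo, hi = X.min(axis=0), X.max(axis=0)
        return lambda Z: (Z - lo) / np.where(hi > lo, hi - lo, 1.0)
    if method == "standardize":    # statistical standardization per dimension
        mu, sd = X.mean(axis=0), X.std(axis=0)
        return lambda Z: (Z - mu) / np.where(sd > 0.0, sd, 1.0)
    if method == "l2":             # L2 normalization of each feature vector
        return lambda Z: Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1e-12)
    if method == "whiten":         # pre-whitening: decorrelate the features
        mu = X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(X - mu, rowvar=False))
        return lambda Z: (Z - mu) @ (vecs / np.sqrt(np.maximum(vals, 1e-12)))
    return lambda Z: Z             # identity
```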
3.3 Feature Transform Element
The third element introduces the aspect of represen-
tation learning into the classification pipeline. This is
realized in form of feature transforms that are learned
using manifold learning techniques. The pipeline element contains a portfolio $S_{\mathrm{FeatTrans}}$ with 32 manifold learning techniques provided by the Matlab Toolbox for Dimensionality Reduction (Van der Maaten, 2014). Additionally, similar to the feature preprocessing method, the identity is included, too. In the training mode, a method $f_{\mathrm{FeatTrans}} \in S_{\mathrm{FeatTrans}}$ as well as values for its specific set of hyperparameters $S_{\mathrm{Hyp}}(f_{\mathrm{FeatTrans}})$ are chosen. Additionally, the target dimensionality $D_{\mathrm{Target}}$ of the resulting feature space needs to be chosen, which is a common hyperparameter of all manifold learners. Then the manifold learner is trained with the feature vectors arriving from the feature preprocessing element. Most manifold learning algorithms are unsupervised, but the supervised ones additionally use the corresponding labels from the training dataset T. In the classification mode, the out-of-sample extension or approximation of the selected and trained manifold learner is applied to process the feature vectors.
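To illustrate the train/apply split and the role of the out-of-sample extension, here is a minimal sketch with PCA standing in for the 33-entry portfolio; the class and its interface are our illustration, not the toolbox API:

```python
import numpy as np

class PCATransformElement:
    """Minimal manifold-learning element with an exact out-of-sample extension."""

    def train(self, X: np.ndarray, target_dim: int) -> None:
        # Learn the linear embedding from the preprocessed training vectors.
        self.mean = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = vt[:target_dim]        # the D_Target principal directions

    def apply(self, Z: np.ndarray) -> np.ndarray:
        # Out-of-sample extension: embed unseen instances into the learned space.
        return (Z - self.mean) @ self.components.T
```

For PCA this extension is exact; for methods like Isomap or LLE only an approximation exists, which is exactly the risk discussed in section 2.2.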
3.4 Classifier Element
The fourth pipeline element is finally responsible for performing the actual classification. The pipeline element contains a portfolio $S_{\mathrm{Classifiers}}$ of eight popular classifier concepts and variants, e.g. kernel Support Vector Machines (SVM), random forests and the multilayer perceptron. In the training mode, a classifier concept $f_{\mathrm{Classifier}} \in S_{\mathrm{Classifiers}}$ is selected as well as its specific set of hyperparameters $S_{\mathrm{Hyp}}(f_{\mathrm{Classifier}})$. The classifier is trained using the processed feature vectors and the corresponding labels in the training dataset T.
In the classification mode, the trained classifier is used
to classify instances.
3.5 Pipeline Configuration
The pipeline configuration contains all necessary in-
formation about the selected features, algorithms and
hyperparameters and is defined as
$$\theta = \big(S_{\mathrm{FeatSet}},\ f_{\mathrm{PreProc}},\ f_{\mathrm{FeatTrans}},\ S_{\mathrm{Hyp}}(f_{\mathrm{FeatTrans}}),\ D_{\mathrm{Target}},\ f_{\mathrm{Classifier}},\ S_{\mathrm{Hyp}}(f_{\mathrm{Classifier}})\big). \quad (1)$$

The number of possible configurations is huge and also depends on the training dataset, because there are $2^{D_{\mathrm{in}}} - 1$ possible feature subsets to choose from.
4 OPTIMIZATION FRAMEWORK
The pipeline configuration θ needs to be adapted to
every learning task defined by the training dataset T.
An optimization algorithm is required that can han-
dle the large search space which will likely also con-
tain a large number of local optima. Evolutionary Al-
gorithms have the potential to cope with these chal-
lenges and it is known that they can be easily run in
parallel, which is not the case for a Bayesian optimization approach.
4.1 Extended Evolution Strategies
Evolution Strategies (ES) (Beyer and Schwefel, 2002)
are one variant of Evolutionary Algorithms which are
suitable to solve high-dimensional optimization prob-
lems. A solution is coded into the genetic information
of an individual and a population contains several in-
dividuals. Starting from an initial population which
is usually generated randomly, the best individuals
are selected according to their fitness. The fitness is
usually determined using the objective function of the
optimization problem (see section 4.3). The selected
individuals are recombined and randomly mutated to
generate a new generation of hopefully improved in-
dividuals. This procedure is repeated until the termi-
nation criteria are fulfilled, e.g. too small fitness improvements or a time limit.
The standard ES is only defined for numeric vari-
ables, but all necessary evolutionary operators can be
extended to further data types, as described in (Bürger and Pauli, 2015). The configuration adaptation problem requires a larger variety of
variable types, because of the feature selection and
hyperparameter tuning problem. Therefore, the fol-
lowing variable types are used:

- the real-valued variable type $V_{\mathbb{R}}$,
- the integer variable type $V_{\mathbb{Z}}$,
- the categorical variable type $V_{S}$ and
- the Boolean variable type $V_{B}$.

The numeric variable types also contain information about their minimum and maximum values, and the categorical variables contain the base set of all possible categories.
The parametrization of an ES-based optimization algorithm is summarized in the $(\mu, \kappa, \lambda, \rho)$ notation. In each generation, $\lambda = 200$ individuals are generated from $\rho = 3$ parents while the best $\mu = 20$ individuals survive. The maximum lifespan of individuals is limited to $\kappa = 4$ generations. The initial population contains $\mu_{\mathrm{init}} = 400$ random individuals. The number of individuals should be as high as possible to increase the chances of finding better solutions. However, there must be a trade-off between computational complexity and the risk of getting stuck in local optima.
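The generation loop of such an ES can be sketched as follows; recombination, mutation and the fitness-based termination criterion of section 4.3 are abstracted away (a fixed generation cap stands in for the latter), so this is a schematic, not the paper's implementation:

```python
import random

def evolution_strategy(init_pop, fitness, recombine, mutate,
                       mu=20, lam=200, rho=3, kappa=4, max_gens=50):
    """Schematic (mu, kappa, lambda, rho) Evolution Strategy."""
    population = [(ind, fitness(ind), 0) for ind in init_pop]  # (genotype, fitness, age)
    for _ in range(max_gens):
        offspring = []
        for _ in range(lam):           # lambda children, each from rho parents
            parents = random.sample([g for g, _, _ in population], rho)
            child = mutate(recombine(parents))
            offspring.append((child, fitness(child), 0))
        # Individuals older than kappa generations die out; the best mu survive.
        aged = [(g, f, a + 1) for g, f, a in population if a + 1 < kappa]
        population = sorted(aged + offspring, key=lambda t: -t[1])[:mu]
    return max(population, key=lambda t: t[1])[0]
```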
4.2 Evolutionary Configuration Adaptation
The pipeline configurations have to be coded into the genotype of the ES, which is a sequence of N variables

$$G = [V_{*,1}, V_{*,2}, \ldots, V_{*,N}], \qquad V_{*} \in \{V_{\mathbb{R}}, V_{\mathbb{Z}}, V_{S}, V_{B}\}. \quad (2)$$
The Evolutionary Configuration Adaptation (ECA) schema is used to code a configuration into the genotype as depicted in figure 2. First, the feature selection problem is coded as $D_{\mathrm{in}}$ binary variables that indicate whether a feature is selected or not. The feature preprocessing method is coded as a single categorical variable. The feature transform is also selected with a categorical variable. All hyperparameters of all feature transforms are appended to the genome with their corresponding variable types. When an individual is transformed into a configuration, only those hyperparameters are used that belong to the selected feature transform. The target dimensionality depends on the number of selected features and is therefore coded as a real-valued ratio $d_{\mathrm{Ratio}} \in [0, 1]$. The actual value for the target dimensionality is then calculated as $D_{\mathrm{Target}} = \lfloor d_{\mathrm{Ratio}} \cdot |S_{\mathrm{FeatSet}}| \rfloor$. The classifier is coded similarly to the feature transform as a categorical variable, and the corresponding hyperparameters are handled in the same way. Each individual I can be mapped to a valid pipeline configuration θ.

Figure 2: ES genotype coding schema of a pipeline configuration. [The genotype concatenates the feature subset bits, the feature preprocessing choice, the feature transform choice and its hyperparameters, and the classifier choice and its hyperparameters.]
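As one concrete decoding step, the target dimensionality is derived from the genotype as follows (illustrative Python; the floor of one dimension is our guard, since the paper does not state how a result of zero is handled):

```python
import math

def decode_target_dim(d_ratio: float, feature_subset: set[int]) -> int:
    """D_Target = floor(d_Ratio * |S_FeatSet|), clamped to at least one dimension."""
    return max(1, math.floor(d_ratio * len(feature_subset)))
```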
4.3 Fitness Function
A key mechanism of ES is the selection of individuals I by their fitness fit(I). In case of a classification
pipeline, the fitness should be directly connected to
the expected generalization performance of the whole
pipeline. Therefore, a holistic k-fold cross-validation
(HCV) is used that considers the generalization of all
components. This is necessary, because especially the
feature transform is expected to contribute to the gen-
eralization.
The HCV method uses the training dataset T to generate $k = 5$ disjoint training and validation tuples $\{(T_{\mathrm{train},l}, T_{\mathrm{valid},l})\}$ with $1 \le l \le k$. In each cross-validation round, the classification pipeline is first trained using $T_{\mathrm{train},l}$. Subsequently, the instances of the validation dataset $T_{\mathrm{valid},l}$ are classified with this pipeline and the overall accuracy value $q_{\mathrm{acc},l}$ is calculated. This makes sure that the validation data is never used for training. During the cross-validation process, the average overall accuracy so far
$$\overline{q}_{\mathrm{acc},l} = \frac{1}{l} \sum_{j=1}^{l} q_{\mathrm{acc},j} \quad (3)$$

is calculated. This value is used for an early discarding system to save computation time. The cross-validation process is stopped prematurely for badly performing configurations that do not improve the previously observed best accuracy value $q^{\ast}_{\mathrm{acc}}$. The early discarding system is relatively aggressive and the process is stopped if $\overline{q}_{\mathrm{acc},l} < q^{\ast}_{\mathrm{acc}}$. The last average is used as the fitness value of the individual, $\mathrm{fit}(I) = \overline{q}_{\mathrm{acc},l}$.
This fitness metric is also used for the termination criterion. The optimization is stopped when the best fitness of the current generation increases by less than $\varepsilon = 10^{-3}$ (equal to 0.1 percent of accuracy) over the last three generations.
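The HCV loop with early discarding translates into a few lines; train_and_score is a hypothetical helper that trains the whole pipeline on one fold and returns that fold's validation accuracy:

```python
def hcv_fitness(config, folds, best_so_far, train_and_score):
    """Holistic k-fold cross-validation with aggressive early discarding.

    folds: list of (T_train_l, T_valid_l) tuples; best_so_far: best accuracy
    observed during the optimization so far. Returns the fitness fit(I)."""
    total = 0.0
    for l, (t_train, t_valid) in enumerate(folds, start=1):
        total += train_and_score(config, t_train, t_valid)  # fold accuracy q_acc,l
        running_mean = total / l                            # average so far, eq. (3)
        if running_mean < best_so_far:                      # discard bad configurations
            break
    return running_mean
```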
4.4 Initial Population Improvement
The feature selection problem is expected to have the largest impact on the problem complexity; therefore, it is promising to speed up and improve the ES optimization with an improved initial population. Random forests (Breiman, 2001) provide an integrated feature importance metric proposed by (Genuer et al., 2010). Before the actual ES optimization begins, this metric $\beta_i$ is calculated for each feature with $1 \le i \le D_{\mathrm{in}}$. The initial selection probability for each feature is determined by the importance metric using

$$p_i = p_{\min} + (1 - p_{\min})\,\frac{\beta_i - \beta_{\min}}{\beta_{\max} - \beta_{\min}} \in [p_{\min}, 1] \quad (4)$$

in which $\beta_{\min}$ and $\beta_{\max}$ are the minimum and maximum values of $\beta_i$ across all features. The value $p_{\min} = 0.25$ defines a minimum probability for the least important feature. This makes sure that the selection of this feature does not become impossible.
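Equation (4) translates directly into code; a small sketch assuming the random-forest importances are already computed as an array beta:

```python
import numpy as np

def initial_selection_probabilities(beta: np.ndarray, p_min: float = 0.25) -> np.ndarray:
    """Map feature importances beta_i to initial selection probabilities p_i, eq. (4)."""
    b_min, b_max = float(beta.min()), float(beta.max())
    if b_max == b_min:
        # Degenerate case (all features equally important) -- our assumption,
        # not covered by the paper: select every feature with probability 1.
        return np.ones_like(beta, dtype=float)
    return p_min + (1.0 - p_min) * (beta - b_min) / (b_max - b_min)
```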
4.5 Optimization Algorithm Variants
One of the main goals of this paper is to quantify and
understand the impact of the components of the clas-
sification pipeline. This is achieved by the introduc-
tion of restricted variants of the ECA optimization al-
gorithm that work with the leave-one-out principle for
each component. The following variants are consid-
ered:
- The ECA-full variant uses all components,
- the ECA-noFeatSel variant does not consider feature selection and always uses the full feature set,
- the ECA-noPreProc variant does not use any feature preprocessing method,
- the ECA-noTrans variant does not use manifold learning or feature transform methods,
- the ECA-simpleClassifier variant only contains a simple classifier, namely the naive Bayes classifier, and
- the ECA-defaultHyper variant does not consider hyperparameter tuning and leaves all hyperparameters at their standard values.

Figure 3: Three example images from each class of the coins dataset.
5 EVALUATION
An image-based object recognition task is used to
evaluate the components of the framework. The task
is to classify color images of five classes of Euro coins
which are depicted in figure 3. There are 64 color im-
ages from 1-, 2-, 5-, 10- and 20-cent coins leading to
320 images in total. The following feature set is ex-
tracted from each image:

- the object area in pixels,
- statistical features of the gray value and color channel histograms,
- Hu moments of the gray value texture (Hu, 1962),
- Local Binary Patterns (Ojala et al., 2002) of the gray value texture and
- low-level pixel features in form of down-scaled gray value images: 5 × 5, 10 × 10, and 20 × 20 pixels.
The total dimensionality of the feature space is $D_{\mathrm{in}} = 642$. In combination with the small number of samples, negative effects due to the curse of dimensionality can be expected. However, this makes this dataset particularly interesting to analyze. The dataset is separated randomly into a 50 % training and a 50 % test dataset.
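The paper does not give the exact extraction parameters, so the following scikit-image sketch only illustrates how such a feature vector could be assembled; the mask construction, LBP settings and histogram statistics are assumptions:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern
from skimage.measure import moments_central, moments_hu, moments_normalized
from skimage.transform import resize

def extract_features(rgb: np.ndarray) -> np.ndarray:
    """Assemble one feature vector from a coin image."""
    gray = rgb2gray(rgb)
    parts = [
        [float((gray > 0).sum())],                     # object area in pixels (naive mask)
        [gray.mean(), gray.std()],                     # gray value histogram statistics
        [rgb[..., c].mean() for c in range(3)],        # color channel statistics
        moments_hu(moments_normalized(moments_central(gray))),  # Hu moments (Hu, 1962)
        np.histogram(local_binary_pattern(gray, P=8, R=1, method="uniform"),
                     bins=10, range=(0, 10))[0],       # LBP histogram (Ojala et al., 2002)
    ]
    for size in (5, 10, 20):                           # down-scaled low-level pixel features
        parts.append(resize(gray, (size, size)).ravel())
    return np.concatenate([np.asarray(p, dtype=float) for p in parts])
```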
The reported cross-validation accuracy is ob-
tained on the training dataset and is equal to the fit-
ness of the best individual. The generalization per-
formance is measured by using the configuration with
the highest fitness value to set up a classification
pipeline and process and evaluate the test dataset with
it. The optimization time is measured as well.
The performance of the ECA strategy is com-
pared to two baseline methods, namely an SVM clas-
sifier with a Gaussian kernel (denoted as Baseline-
SVM) and a random forest (denoted as Baseline-RF).
The baselines also use a grid-based hyperparameter
tuning with cross-validation. Furthermore, the per-
formance of the Auto-WEKA framework (Thornton
et al., 2013) with a time budget of 24 hours is com-
pared as well. Each experiment with the framework
is repeated ten times to obtain statistically relevant
results. All performance metrics are compared relative to the ECA-full strategy using a Welch test (Howell, 2006), a statistical test for whether the means of two samples differ significantly that makes no assumptions about the sample variances, with a level of significance of α = 0.05. Deviations that are statistically significant are marked with an exclamation mark.
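The Welch comparison is available in SciPy as an independent t-test without the equal-variance assumption; the accuracy values below are illustrative, not the paper's raw data:

```python
from scipy import stats

# Ten repetitions of two strategies (made-up numbers for illustration).
acc_full    = [0.95, 0.93, 0.96, 0.94, 0.95, 0.97, 0.94, 0.95, 0.96, 0.94]
acc_variant = [0.91, 0.90, 0.92, 0.93, 0.89, 0.91, 0.92, 0.90, 0.91, 0.92]

# equal_var=False selects Welch's t-test; a deviation is significant if p < 0.05.
t_stat, p_value = stats.ttest_ind(acc_full, acc_variant, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```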
The framework is implemented in Matlab 2014b using the parallel computing toolbox and runs on a workstation computer with six 2.5 GHz CPU cores and 32 GB of RAM.
5.1 Results using All Components
The ECA-full variant of the framework can use the
full potential of all pipeline elements and is there-
fore the most promising and interesting approach.
Table 1 shows the results regarding the accuracy
values and optimization times. The two baseline
methods, especially the SVM, perform badly on this
dataset. An explanation is that this dataset suffers
from the curse of dimensionality. The ECA-full vari-
ant shows much higher cross-validation and gener-
alization accuracy values which also outperform the
Auto-WEKA framework significantly.
The typical processing time of the ECA-full strat-
egy stays under two hours. The two baseline methods
are much faster, because these are simple classifiers
which just undergo a grid search for the best hyperpa-
rameters. The Auto-WEKA framework uses its full time budget of 24 hours and does not stop prematurely.
5.2 Impact of the Pipeline Components
The ECA-full variant provides the greatest amount
of adaptability and performs well, but it is impor-
tant to understand the contribution of the classifica-
tion pipeline components. Figure 4 shows the impact
of the pipeline components on the accuracy and the
optimization times by comparing the restricted ECA
variants (see section 4.5) to the ECA-full variant.
All restricted variants of the ECA strategy show
worse average accuracy values than the ECA-full vari-
ant (see the left side of figure 4). This observation
affects the cross-validation and the generalization ac-
curacy values. All differences, except for the cross-validation accuracy of the ECA-defaultHyper variant, are also statistically significant. The classifier portfolio has the largest impact here, because the ECA-simpleClassifier variant, which uses only the naive Bayes classifier, achieves around 15 % less generalization accuracy on average. The other components are obviously not able to compensate for a too simple classifier.
The impact of the other components is smaller, but
still significant. It is interesting that the feature pre-
processing has a slightly larger impact than the feature
transforms, because the ECA-noPreProc variant per-
forms worse than the ECA-noTrans variant. This in-
dicates that relatively simple preprocessing methods
can lead to more improvement than 32 highly nonlin-
ear manifold learning methods.
The hyperparameter tuning component has a rela-
tively small impact on the accuracy values. It is likely
the case that – by chance – the standard hyperparam-
eters already work reasonably well for the dataset.
The aspect of feature selection is also less important here, because the accuracy values drop only by about 1-2 % if feature selection is switched off.
The different pipeline components also affect the
average processing times (see the right side of fig-
ure 4) and most of the differences are statistically
significant. The use of feature selection speeds up
the optimization tremendously by more than 50 min-
utes on average, because a lower dimensional fea-
ture space requires less computation time for most
of the involved algorithms. The use of feature trans-
forms slows down the optimization process by almost
50 minutes on average. This can be explained with
the computationally complex models of the manifold
learning algorithms, but also their potentially experi-
mental implementation within Matlab. The use of the
feature preprocessing methods also slows down the
optimization process by more than 20 minutes on av-
erage. The reason, however, is not the computational complexity of the preprocessing methods, which are actually simple, but that more complex feature transforms and classifiers perform better on preprocessed features and therefore are chosen more often in the
ES optimization. Another reason is the larger search
space. A similar explanation can be found to justify
the same effect when hyperparameter tuning is active.
Table 1: Comparison of cross-validation and generalization accuracy results as well as optimization times (mean ±1σ).
Cross-validation acc. Generalization acc. Optimization time [min]
ECA-full 0.935 ± 0.015 0.951 ± 0.013 73.74 ± 20.28
Baseline-SVM 0.119 ± 0.013 (!) 0.281 ± 0.020 (!) 0.02 ± 0.01 (!)
Baseline-RF 0.623 ± 0.028 (!) 0.708 ± 0.045 (!) 0.48 ± 0.01 (!)
Auto-WEKA (not comparable) 0.930 ± 0.014 (!) 1440 = 24 hours (!)
Figure 4: Impact of the pipeline components on accuracy values and optimization times compared to the ECA-full variant. The standard deviations are denoted as lines around the end of the bars. [Bar chart: the left panel shows the change in cross-validation and generalization accuracy, the right panel the change in optimization time in minutes, for the ECA-noFeatSel, ECA-noPreProc, ECA-noTrans, ECA-simpleClassifier and ECA-defaultHyper variants; statistically significant deviations are marked with (!).]
5.3 Impact of the ECA Algorithm
In the following, the impact of three central aspects of the ECA optimization algorithm is investigated by switching off the corresponding component of the
ECA-full algorithm. Table 2 lists the results regarding
the accuracy values and the optimization times.
5.3.1 Impact of Holistic Cross-validation
Instead of HCV (see section 4.3), a classifier-only
cross-validation approach is used for this experi-
ment. The pipeline is trained using the full train-
ing dataset and only the classifier performs cross-
validation. With this procedure, the generalization
of the feature transform is never considered. This
leads to a cross-validation accuracy value of 100 %,
but the resulting generalization accuracy is unusably
bad. However, the average optimization time drops
slightly due to the reduced computational complexity
of the classifier-only cross-validation.
5.3.2 Impact of Early Discarding
The early discarding system is an aggressive mechanism to save computation time (see section 4.3). It can be
seen that the average computation time rises tremen-
dously by a factor of around 3.8 when the early dis-
carding system is not used. The system does not have
any significant positive or negative effect on the accu-
racy values.
5.3.3 Impact of Initial Population Improvement
The initial population improvement affects the feature
selection (see section 4.4). If there is no initial im-
provement and a completely random initialization of
the features is used, the cross-validation and general-
ization accuracy values drop significantly. The com-
putation time actually drops slightly, but not signifi-
cantly, if the initial population is not improved. It is
likely the case that the termination criterion of the ES
algorithm was fulfilled too early.
6 CONCLUSIONS
This paper analyzed the impact of multiple aspects
of a holistic model selection and representation opti-
mization framework on accuracy values and optimiza-
tion times. The first experiment analyzed the impact
of the components of the classification pipeline. It
was shown that all involved components, namely fea-
ture selection, multiple preprocessing methods, multi-
ple feature transforms, multiple classifiers and hyperparameter tuning, potentially contribute to the
generalization performance. However, multiple ob-
servations have been made that indicate that the inter-
play between the involved components is complex, especially regarding the optimization run times. It can
be expected that some effects are highly dependent on
the dataset.
The second experiment analyzed the impact of
Table 2: Impact of the central components of the ECA optimization algorithm on the accuracy values and optimization times
(mean ±1σ).
Cross-validation acc. Generalization acc. Optimization time [min]
ECA-full 0.935 ± 0.015 0.951 ± 0.013 73.74 ± 20.28
No holistic cross-validation 1.000 ± 0.0 (!) 0.287 ± 0.022 (!) 59.95 ± 9.16
No early discarding system 0.938 ± 0.024 0.933 ± 0.027 276.15 ± 59.64 (!)
No initial population improvement 0.912 ± 0.022 (!) 0.925 ± 0.025 (!) 59.30 ± 14.95
three aspects of the optimization algorithm itself. At
first, it was shown that the incorporation of the feature
transform into the cross-validation process is abso-
lutely necessary. Secondly, the early discarding sys-
tem greatly improves the optimization speed while the resulting accuracy values are not affected in a negative
way. And lastly, the incorporation of prior knowledge
about the importance of features into the optimization
algorithm improved the accuracy.
Ultimately, it can be concluded that not only the
amount of optimized components is important, but
also the suitability of the optimization algorithm it-
self. However, further experiments on other datasets
need to be conducted to explore the full variety of ef-
fects regarding the complex interplay of the machine
learning components.
REFERENCES
Ansótegui, C., Sellmann, M., and Tierney, K. (2009). A
gender-based genetic algorithm for the automatic con-
figuration of algorithms. In Gent, I., editor, Principles
and Practice of Constraint Programming - CP 2009,
volume 5732 of Lecture Notes in Computer Science,
pages 142–157. Springer Berlin Heidelberg.
Bäck, T. (1996). Evolutionary Algorithms in Theory and
Practice: Evolution Strategies, Evolutionary Pro-
gramming, Genetic Algorithms. Oxford University
Press, Oxford, UK.
Bengio, Y., Courville, A., and Vincent, P. (2013). Represen-
tation learning: A review and new perspectives. Pat-
tern Analysis and Machine Intelligence, IEEE Trans-
actions on, 35(8):1798–1828.
Beyer, H.-G. and Schwefel, H.-P. (2002). Evolution strate-
gies - a comprehensive introduction. Natural Comput-
ing, 1(1):3–52.
Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern recog-
nition and machine learning, volume 1. Springer New
York.
Breiman, L. (2001). Random forests. Machine Learning,
45(1):5–32.
Bürger, F. and Pauli, J. (2015). Representation optimiza-
tion with feature selection and manifold learning in a
holistic classification framework. In De Marsico, M.
and Fred, A., editors, ICPRAM 2015 - Proceedings of
the International Conference on Pattern Recognition
Applications and Methods, volume 1, pages 35–44,
Lisbon, Portugal. INSTICC, SCITEPRESS.
Genuer, R., Poggi, J.-M., and Tuleau-Malot, C. (2010).
Variable selection using random forests. Pattern
Recognition Letters, 31(14):2225 – 2236.
Howell, D. C. (2006). Statistical Methods for Psychology.
Wadsworth Publishing.
Hu, M.-K. (1962). Visual pattern recognition by moment
invariants. Information Theory, IRE Transactions on,
8(2):179–187.
Huang, H.-L. and Chang, F.-L. (2007). ESVM: Evolution-
ary support vector machine for automatic feature se-
lection and classification of microarray data. Biosys-
tems, 90(2):516 – 528.
Hutter, F., Hoos, H., and Leyton-Brown, K. (2011). Se-
quential model-based optimization for general algo-
rithm configuration. In Coello, C., editor, Learning
and Intelligent Optimization, volume 6683 of Lecture
Notes in Computer Science, pages 507–523. Springer
Berlin Heidelberg.
Jain, A. K., Duin, R. P. W., and Mao, J. (2000). Statistical
pattern recognition: a review. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 22(1):4–
37.
Juszczak, P., Tax, D., and Duin, R. (2002). Feature scal-
ing in support vector data description. In Proc. ASCI,
pages 95–102. Citeseer.
Ma, Y. and Fu, Y. (2011). Manifold Learning Theory and
Applications. CRC Press.
Ojala, T., Pietikainen, M., and Maenpaa, T. (2002). Mul-
tiresolution gray-scale and rotation invariant texture
classification with local binary patterns. Pattern Anal-
ysis and Machine Intelligence, IEEE Transactions on,
24(7):971–987.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Prac-
tical bayesian optimization of machine learning algo-
rithms. In Advances in neural information processing
systems, pages 2951–2959.
Thornton, C., Hutter, F., Hoos, H. H., and Leyton-Brown,
K. (2013). Auto-WEKA: Combined selection and
hyperparameter optimization of classification algo-
rithms. In Proc. of KDD-2013, pages 847–855.
Van der Maaten, L. (2014). Matlab Toolbox for Dimensionality Reduction. http://lvdmaaten.github.io/drtoolbox/.
Van der Maaten, L., Postma, E., and Van Den Herik, H.
(2009). Dimensionality reduction: A comparative re-
view. Journal of Machine Learning Research, 10:1–
41.