Automated Feature Engineering for AutoML Using Genetic Algorithms
Kevin Shi and Sherif Saad
School of Computer Science, University of Windsor, Sunset Ave, Windsor, Canada
Keywords:
Automated Machine Learning, Optimization, Genetic Algorithm, Feature Engineering.
Abstract:
Automated machine learning (AutoML) is an approach to automate the creation of machine learning pipelines
and models. The ability to automatically create a machine learning pipeline would allow users without ma-
chine learning knowledge to create and use machine learning systems. However, many AutoML tools have
no or limited automated feature engineering support. We develop an approach that augments existing AutoML tools with automated feature generation and selection. This method uses feature generators guided by a genetic algorithm to generate and select features as part of the AutoML model selection process.
We show that this approach is able to improve the AutoML model performance in 77% of all tested cases with
up to 78% error reduction. Our approach explores how existing AutoML tools can be augmented with more
automated steps to improve the generated machine learning pipeline’s performance.
1 INTRODUCTION
The use of machine learning plays a key role in today's society, with many models providing data that changes both the physical and digital world. As with many technologies, automation is a logical next step, but the complex and customizable nature of the machine learning pipeline makes this difficult. In recent years, tools such as automated machine learning have automated the process of selecting and tuning machine learning models. During research on automated machine learning tools, four main weaknesses were identified: feature engineering, model drift, interpretability and explainability, and data quality.
This research studies one of the four weaknesses, feature engineering, by creating a method that automates feature engineering using genetic algorithms to optimize the transformation of features from their raw state to a transformed state for use in machine learning models. We hope that by using a genetic algorithm for automated feature engineering, we can avoid the black-box nature of many current automated machine learning tools, as the process of feature engineering is visible to the user and the generations are explainable.
This study investigates the use of genetic algo-
rithms for automated feature engineering to augment
AutoML tools. In particular, we are interested in understanding the effects of extending existing AutoML tools and the impact on overall tool performance and time requirements. The contribution of this paper is threefold.
1. First, it implements a genetic automated feature
engineering tool for AutoML.
2. Second, it compares the performance of different
generation approaches for automated feature en-
gineering.
3. Third, it assesses the impact of automated feature
engineering for AutoML and traditional machine
learning models.
The rest of the paper is structured as follows.
Section 2 is a literature review covering existing au-
tomated feature engineering methods and the cur-
rent support for feature engineering in AutoML tools.
Section 3 describes the methodology we follow and
provides a breakdown of the automated feature engi-
neering method. Section 4 presents our experimen-
tal work and discusses our findings. Finally, Section 5 concludes our paper and presents potential future work.
2 LITERATURE REVIEW
The existing research in regard to developing auto-
mated feature engineering for AutoML can be best di-
vided into two categories, research that focuses purely
on automated feature engineering and AutoML tools
with automated feature engineering. The first cat-
egory of research generally focuses on stand-alone
tools that can automate an aspect of feature engineer-
ing, which will then be used to augment a machine
learning classifier. The second category focuses on
AutoML tools with built-in automated feature engi-
neering. These tools have different methods which are
used as part of the AutoML optimization and model
search. In this section, we aim to give an overview of
the existing research in this area.
2.1 Automated Feature Engineering
Feature engineering is a process of selecting and
transforming raw features to use in a machine learn-
ing model. Automated feature engineering tools can be divided based on whether a tool performs generalized feature engineering, able to work with any input data, or is specialized for certain problems or data types.
Targeted or specialized automated feature engineering tools take many forms and cover different types of problems and data types. Examples range from TimeAutoML (Jiao et al., 2020), tailored for time series data with specialized feature generators, to tools that target problems such as bus passenger flow (Liu et al., 2021) and customer segmentation (Lee et al., 2021). These tools use very diverse methodologies, from defined feature generators and neural networks to clustering-based approaches. Generally, these types of automated feature engineering tools are able to generate features that better leverage the data, as they are tailored to the problem.
The generation-selection approach has been used in many automated feature engineering models; these models differ in their methods of generation and selection but share a common structure. They generate transformed features from the data and then apply a selection method to produce the final dataset features.
A tool that follows the generation-selection approach is ExploreKit (Katz et al., 2016), a technique in which features are generated from input data based on defined transformation functions; after all the features are generated, they are ranked, and the highest-ranked features are tested. Features that pass a predefined threshold are selected and added to the data, where they can be used to generate new features.
A second tool that follows the generation-
selection approach is AutoLearn(Kaul et al., 2017).
In this approach, each feature is used to predict the value of other features by applying regression. This process is performed in four steps. First, preprocessing is done by ranking candidate features using information gain. Next, correlated features are mined, with the correlation between features determined using distance correlation. The third step is feature generation, in which new features are created by regressing every correlated feature pair and by taking the difference between one of the correlated features and the generated feature. Lastly, feature selection is applied to all generated features based on feature stability and information gain.
In (Tran et al., 2016), genetic programming is used to create a transformation tree that can create defined feature sets. With the transformed features, the dimensionality of the data can be reduced without reducing classifier performance. Genetic algorithms have also been used to perform automated feature selection as part of a feature engineering model, with manually created features selected and optimized by a genetic algorithm (Khan and Byun, 2020).
Evolutionary algorithms have also been used to augment other automated feature engineering approaches, such as in (Parfenov et al., 2020). In this approach, feature generation is performed with the first stage of ExploreKit, with genetic optimization used in place of the original ranking classifier to select the original and created features to be added to the output data set.
2.2 Automated Machine Learning
(AutoML)
Automated machine learning aims to automate the
process of creating machine learning pipelines while
requiring minimal human intervention. Current AutoML tools are able to automate some or all of the following steps of the machine learning pipeline: data pre-processing, feature engineering, model selection and hyperparameter optimization. Many AutoML tools focus on model selection and hyperparameter optimization, in which a machine learning classifier is picked and optimized for the dataset. Some examples of common AutoML tools include TPOT, auto-sklearn, FLAML, mljar, AutoKeras, and AutoGluon. Out of the six tools, TPOT, mljar, and AutoGluon have built-in support for automated feature engineering, while the three other tools lack support and focus on model optimization. Each of the tools that support automated feature engineering behaves differently.
TPOT works to automate feature engineering as
part of the automated pipeline. The genetic program-
ming approach can choose to include feature engi-
neering tools from sklearn, such as Polynomial Fea-
tures. These feature engineering steps become part of
the total machine learning pipeline that is optimized
by TPOT.
MLJAR performs feature engineering as part of its AutoML optimization process with a module called golden features. With golden features, the tool aims to create new features from the data that have greater predictive power. These features are created by generating all possible feature pairs based on division or subtraction operations and then testing them by training on parts of the dataset.
AutoGluon automates feature engineering using defined feature generators that are automatically applied based on data type, with included generators for numeric, categorical, datetime and text features. These feature generators are applied to features within the dataset before the rest of the machine learning pipeline is optimized.
These three AutoML approaches all support automated feature engineering but with very different implementations, applying feature engineering before, during, or as part of the optimization process.
Other AutoML approaches, such as Pycaret, have support for feature engineering steps built into the tool but do not automatically apply these steps. For these tools, the user must manually enable the built-in functions.
Overall, support for feature engineering in AutoML can be broken down into three categories: automated, manual and without support. For automated tools, the scope and impact of feature engineering can be expanded, while for AutoML tools with manual support or without support, an approach such as ours that automates feature engineering can improve the automation of the whole machine learning pipeline. An external tool that can incorporate different types of AutoML tools allows automated feature engineering to support more AutoML tools than an add-on developed for a single tool. This approach has also been seen in the literature, such as running an automated feature extraction tool before the AutoML system, with a limited degree of success (Eldeeb et al., 2021).
2.3 Data Augmentation
Data augmentation is any technique that changes the input data to increase the size or quality of the data. These approaches are commonly used for image processing, but data augmentation can be applied to any type of data. DeltaPy (Snow, 2020) is a framework for tabular data augmentation that allows for modular feature engineering. DeltaPy provides 50 pre-built augmentation functions from the following categories: transforming, interacting, mapping, extracting, and synthesising. Transforming is the process in which a single input feature produces new features. Interacting is the process in which more than one feature is required to create new features. Mapping functions aim to remove noise from the data or highlight signals, improving the overall quality of the data.
Overall, DeltaPy provides many functions that can as-
sist with the process of feature engineering for tabular
data and allows for a modular approach to feature en-
gineering.
3 METHODOLOGY
Figure 1: Automated Feature Engineering System Design.
The automated feature engineering system is
shown in Figure 1. The system follows an iterative
approach where each iteration is composed of three
main phases: genetic algorithm, feature generators,
and AutoML. The genetic algorithm’s role is to pro-
vide the settings for the feature generators; this in-
cludes the type of feature generator and the target
features and modifiers to be used. With each iteration, the genetic algorithm works to optimize these settings for the best performance. The feature generators, using the settings from the genetic algorithm, apply the chosen transformation to the dataset and create the new candidate feature. The AutoML step takes the candidate feature and the base data and trains an AutoML model that is evaluated based on its accuracy score. The score is used for ranking the generated
feature within the genetic algorithm. A portion of
the data is held out at the start of the process, which
will be used to evaluate overall system performance.
Next, we describe each of the main phases in detail
below. This approach differs from the currently available AutoML feature engineering approaches. Firstly, this approach can augment any AutoML tool, as it is not limited to a single tool. Secondly, unlike the approaches presented by AutoGluon and mljar, this approach is iterative and works to optimize for the best possible generated features, versus only running feature generation at the start of the process or a single time as part of the AutoML process, respectively. This approach is similar to TPOT, as both are evolutionary algorithm-based approaches, but the presented approach allows more flexibility in feature generators.
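To make the loop concrete, the sketch below shows one search run in Python. The helper names (random_chromosome, apply_generator, evaluate, breed) and the constants N_TRANSFORMS and MAX_MODIFIER are our own placeholders for the phases described above, several of which are sketched in the following subsections; this illustrates the control flow, not the authors' implementation.

```python
# A minimal sketch of the iterative search; all helper names are hypothetical.
def search_best_feature(base_data, target, generations=20, population_size=32):
    population = [random_chromosome(base_data.shape[1], N_TRANSFORMS, MAX_MODIFIER)
                  for _ in range(population_size)]
    best_score, best_chromosome = -1.0, None
    for _ in range(generations):
        # Phase 2: feature generators build one candidate feature per chromosome.
        # Phase 3: a time-boxed AutoML model scores base data + candidate.
        scored = [(evaluate(apply_generator(base_data, c), target), c)
                  for c in population]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_chromosome = scored[0]
        # Phase 1: the genetic algorithm keeps the top 50% of the population
        # and refills it via crossover and mutation (see Section 3.2.3).
        survivors = [c for _, c in scored[: population_size // 2]]
        population = survivors + breed(survivors, population_size - len(survivors))
    return best_score, best_chromosome
```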
3.1 Feature Generation
Normal feature generation is performed using feature generation functions. These generation functions can be split into binary and unary operators. The binary operators are either arithmetic (addition, subtraction, multiplication and division) or boolean (less than, is equal and greater than). The unary operators have three categories: scaling, clustering and binning. Each binary operator targets two features chosen by the genetic algorithm, while unary operators target one feature. The modifier value is used to influence the generation function, such as setting the number of bins or clusters, as well as picking the arithmetic operator. Each generation function uses the features as input and outputs a new feature to be appended to the input data.
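As an illustration, the following is a minimal sketch of what such generation functions could look like; the paper does not publish its exact functions, so the implementations below are assumptions that follow the description above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical operator families matching the description in Section 3.1.
ARITHMETIC = [np.add, np.subtract, np.multiply, np.divide]
BOOLEAN = [np.less, np.equal, np.greater]

def binary_op(df: pd.DataFrame, a: int, b: int, modifier: int) -> pd.Series:
    # The modifier picks which arithmetic or boolean operator is applied.
    ops = ARITHMETIC + BOOLEAN
    result = ops[modifier % len(ops)](df.iloc[:, a].to_numpy(),
                                      df.iloc[:, b].to_numpy())
    return pd.Series(result, index=df.index)

def binning_op(df: pd.DataFrame, a: int, modifier: int) -> pd.Series:
    # The modifier sets the number of bins (at least 2).
    return pd.cut(df.iloc[:, a], bins=max(2, modifier), labels=False)

def clustering_op(df: pd.DataFrame, a: int, modifier: int) -> pd.Series:
    # The modifier sets the number of clusters; cluster ids become the feature.
    km = KMeans(n_clusters=max(2, modifier), n_init=10)
    return pd.Series(km.fit_predict(df.iloc[:, [a]]), index=df.index)

def scaling_op(df: pd.DataFrame, a: int) -> pd.Series:
    # Rescales one feature to [0, 1].
    scaled = MinMaxScaler().fit_transform(df.iloc[:, [a]])
    return pd.Series(scaled.ravel(), index=df.index)
```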
3.1.1 DeltaPy
For feature generation using DeltaPy, all transforming, interaction and mapping functions were utilized except the following, due to incompatibility and issues in the functions: transform triple exponential smoothing, transform naive dec, interact autoregression, interact decision tree disc, interact tech, interact genetic feat, mapper cross lag, mapper a chi, and mapper encoder dataset. In total, 19 functions were utilized: 13 transforming, 2 interaction and 4 mapping. The genetic algorithm was used to select which functions were chosen, as well as the target feature or features and the modifier if needed.
3.2 Genetic Algorithm
A genetic algorithm is utilized to search the possible features, as there can be multiple best features depending on the models used. We also aim to avoid a brute-force approach due to the total possible number of generated features. In the sections below, the configuration of each component of the genetic algorithm is described.
3.2.1 Population
Each chromosome in the genetic algorithm represents
a generated feature in the normal case or the selected
features for the genetic feature selection case. For the
normal case, the chromosome (see Table 1) has length four, with the first two values representing the chosen features of the raw data, the third value representing the selected transformation function, and the last value representing the modifier for the transformation function. For the genetic feature selection case, the chromosome is a binary array with a size equal to the total number of raw and transformed features, with a "1" representing a chosen feature and a "0" an unselected feature.
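A minimal sketch of the two encodings, assuming the value ranges given in Table 1 (the helper names are ours):

```python
import random

def random_chromosome(n_features: int, n_transforms: int, max_modifier: int):
    # Normal case (Table 1): [feature A, feature B, transformation, modifier].
    return [random.randrange(n_features),
            random.randrange(n_features),
            random.randrange(n_transforms),
            random.randrange(max_modifier)]

def random_selection_chromosome(n_total: int):
    # Feature selection case: one bit per raw or transformed feature.
    return [random.randint(0, 1) for _ in range(n_total)]
```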
3.2.2 Evaluation and Selection
To evaluate the performance of members of the population, each member is chosen and an AutoML model is fitted to that member; the resulting accuracy (when optimizing for accuracy) or AUC (when optimizing for error reduction) is assigned as the score of the chosen member. If any member of the population cannot be successfully trained and tested due to an invalid chromosome, that member is assigned a score of 0. After all members of the population have been trained and evaluated, all members are sorted based on their scores. The top 50% of the population is selected to create the population for the next generation, and the bottom 50% is discarded.
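A hedged sketch of this evaluation and truncation selection, assuming FLAML as the AutoML tool (see Section 3.4) and accuracy as the optimization target:

```python
from flaml import AutoML
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate(X, y, time_budget=120, max_iter=10):
    # Each candidate feature set is scored by a short, time-boxed FLAML run
    # (budgets taken from Section 4.1).
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25)
    try:
        automl = AutoML()
        automl.fit(X_tr, y_tr, task="classification",
                   time_budget=time_budget, max_iter=max_iter)
        return accuracy_score(y_va, automl.predict(X_va))
    except Exception:
        return 0.0  # invalid chromosomes are scored 0, as described above

def select_top_half(scored):
    # scored: list of (score, chromosome) pairs; keep the best 50%.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chromosome for _, chromosome in ranked[: len(ranked) // 2]]
```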
3.2.3 Crossover and Mutation
Crossover and mutation are applied to selected members of the population in order to generate the population for the next generation. Crossover requires a pair of chromosomes called parents, which are used to generate two new chromosomes called children; this is done by using data from the parents and switching which parent's data is used after each crossover point. This process generates two new chromosomes that are mirrors of each other based on the data of the parents. After the new members are generated by the crossover operation, mutation is applied. Mutation allows for changes to data not inherited from the parents, maintaining genetic diversity. For each value in a chromosome, if a mutation is triggered, that value is replaced with a random value from the allowable range of values.
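The following is an illustrative single-point version of these operators; the paper does not state the number of crossover points, so a single point is an assumption.

```python
import random

def crossover(parent_a, parent_b):
    point = random.randrange(1, len(parent_a))  # crossover point
    # The two children mirror each other around the crossover point.
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, rate, upper_bounds):
    # Each gene that triggers a mutation is replaced with a random value
    # from its allowable range (upper_bounds holds one bound per gene).
    return [random.randrange(bound) if random.random() < rate else gene
            for gene, bound in zip(chromosome, upper_bounds)]
```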
3.2.4 Stopping
The genetic algorithm stops when one of two conditions is met: the set number of generations has passed, or early stopping is triggered. On initiation of the genetic algorithm, the maximum number of generations it can search is set. Early stopping is triggered if a set number of generations pass without any improvement to the maximum score.
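A small sketch of this stopping rule, using the generation cap and patience values later chosen in Section 4.1 as defaults:

```python
def should_stop(generation, best_scores, max_generations=20, patience=3):
    # best_scores holds the best score observed in each generation so far.
    if generation >= max_generations:
        return True
    # Early stop: no improvement to the best score for `patience` generations.
    if len(best_scores) > patience:
        return max(best_scores[-patience:]) <= max(best_scores[:-patience])
    return False
```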
Table 1: Feature chromosome. N: number of raw features, M: number of transformation functions, C: maximum allowed modifier.

| Feature A | Feature B | Transformation Function | Modifier |
| 0-N | 0-N | 0-M | 0-C |
3.3 Generation Approaches
We evaluate five approaches of the methodology that
differ in the number of generated features and the
method of their generation.
3.3.1 Singular Generation
Singular generation follows the approach presented
previously and generates a single feature optimized
by the genetic algorithm.
3.3.2 Group Generation
In group generation, the genetic algorithm’s chromo-
somes are changed to represent a group of features
instead of a single feature. This is done by extending
the length of the chromosome and having each fea-
ture sequentially represented by the chromosome. No
other changes are made to the overall methodology.
3.3.3 Pool Generation
In pool generation, before the evaluation stage, all
members of the population are split randomly into
groups that do not persist between generations. The
groups are evaluated and selected based on the over-
all group performance. The best performing group is
recorded and acts in place of the optimized feature.
No other changes are made to the overall methodol-
ogy.
3.3.4 Incremental Generation
In incremental generation, the approach from singular generation is followed, but at the end of each generation, if a feature is found to increase the overall performance of the dataset, that feature is appended to the dataset, and this appended dataset is used for the next generation. The process continues until there is no improvement; then the population is randomized, and the system continues until the set number of generations or early stopping is reached. This approach slowly builds the final feature set, adding at most a single feature per generation that improves the overall feature set.
3.3.5 Selection
In selection, an initial set of features of a predefined size is generated. The generated set of features, as well as the original set of features, is then subject to selection by the genetic algorithm. In this approach, the chromosome is a bit string the size of the combined generated and original feature set. If the bit value corresponding to a feature is "1", that feature is added to the evaluated feature set, while if the bit value is "0", that feature is discarded. The genetic algorithm in this approach aims to optimize which features should be selected from the combined set of features to create a final feature set. No other changes are made to the overall methodology.
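Applying a selection chromosome then reduces to masking columns; a minimal sketch, assuming the combined features are held in a pandas DataFrame:

```python
import pandas as pd

def apply_selection(combined: pd.DataFrame, bits) -> pd.DataFrame:
    # bits: one 0/1 value per column of the combined original and
    # generated feature set; 1 keeps the column, 0 discards it.
    kept = [col for col, bit in zip(combined.columns, bits) if bit == 1]
    return combined[kept]
```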
3.4 AutoML
The AutoML tool chosen to be augmented with automated feature engineering was the Fast Library for Automated Machine Learning (FLAML). This AutoML tool focuses on fast automatic tuning of machine learning models using a cost-effective hyperparameter optimization algorithm. The algorithm used by FLAML is based on the randomized direct search method FLOW2, which starts from a low-cost region of the search space and moves quickly toward the low-loss region while avoiding high-cost regions unless necessary. This approach allows FLAML to quickly find suitable configurations for machine learning models.
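For illustration, the snippet below shows a FLAML run configured with the budgets used inside the feature generation process in Section 4.1; X_train, y_train and X_test are assumed to be prepared elsewhere.

```python
from flaml import AutoML

# A hedged usage example: 120-second budget and 10 iterations, as in
# the feature generation process described in Section 4.1.
automl = AutoML()
automl.fit(X_train, y_train, task="classification",
           time_budget=120, max_iter=10)
predictions = automl.predict(X_test)
print(automl.best_estimator)  # name of the selected model, e.g. "lgbm"
```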
3.5 Datasets
For evaluation of the automated feature engineering tool, five datasets from the domain of cybersecurity were selected. These datasets, listed in Table 2, represent different domains of cybersecurity, from malware detection to spam.
3.5.1 CIC-AndMAL2017
This dataset contains malware samples from four cat-
egories: Adware, Ransomware, Scareware and SMS
Malware. Following the method described by Noor-
behbahani et al. (Noorbehbahani et al., 2019), only the ransomware datasets of 10 ransomware families and benign samples were used from the CICAndMal2017 as an evaluation dataset.

Table 2: Cybersecurity Datasets.

| Reference | Dataset Name | Cybersecurity Problem Domain | Number of Features | Number of Samples |
| (Lashkari et al., 2018) | CICAndMal2017 | Malware Detection | 84 | 10854 |
| (Anderson and Roth, 2018) | EMBER2018 | Malware Detection | 2381 | 50000 |
| (Moustafa and Slay, 2015) | UNSW-NB15 | Intrusion/Anomaly Detection | 47 | 2540044 |
| (?) | Phishing Websites | Phishing Detection | 30 | 11055 |
| (?) | Spambase | Email Spam Detection | 58 | 4601 |
3.5.2 Ember 2018
This dataset contains malware features from one million portable executable files after performing static analysis. For evaluation, as described in (Galen and Steele, 2020), 25000 benign and malicious samples were selected from both the January and February periods for the training dataset. The testing dataset contains 25000 benign and malicious samples from March to December.
3.5.3 UNSW-NB15
This dataset consists of nine types of attacks: Fuzzers, Analysis, Backdoors, DoS, Exploits, Generic, Reconnaissance, Shellcode and Worms. For evaluation, all four data files were combined to create one dataset; no other changes to the data were performed.
3.5.4 Spambase
This dataset consists of spam and non-spam emails.
No additional processing was required for this
dataset.
3.5.5 Phishing Websites
This dataset consists of legitimate and phishing web-
sites. No additional processing was required for this
dataset.
4 EXPERIMENT
4.1 Experiment Setup
In this section, we describe both the hardware and software configuration used to test all of the automated feature engineering approaches with all datasets; experiment data generation is shown in Figure 2. Experiments were run on Ubuntu 20.04 LTS on a workstation with 12 cores at 3.7 GHz and 96 GB of RAM. Python version 3.8 with FLAML version 1.0.7 was used. All data without predefined training and testing datasets were tested with a 75-25 training to testing split.

Figure 2: Experiment Data Generation.

The maximum training times were
set to 600 seconds for Spambase and Phishing Web-
sites, 1800 seconds for CIC-AndMAL2017 and 3600
seconds for Ember and UNSW-NB15. The genetic algorithm was set to a maximum of 20 generations with early stopping set to 3; crossover and mutation
rates were set to 0.3. The population size for Singu-
lar, Group and Incremental was 32, and a group size
of 8 was used for Group and Pool generation. The
population size for Pool was 256, and the number of
generated features for selection was set to 100. For
AutoML, within the feature generation process, the
maximum time was set to 120 seconds and 10 max
iterations.
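For reference, these experiment settings can be collected into one configuration object; the structure below is our own, but every value is taken from this section.

```python
# Hypothetical summary of the experiment settings described above.
SETTINGS = {
    "train_test_split": (0.75, 0.25),
    "max_training_time_s": {"Spambase": 600, "Phishing Websites": 600,
                            "CIC-AndMAL2017": 1800,
                            "Ember": 3600, "UNSW-NB15": 3600},
    "max_generations": 20,
    "early_stopping_patience": 3,
    "crossover_rate": 0.3,
    "mutation_rate": 0.3,
    "population_size": {"singular": 32, "group": 32, "incremental": 32,
                        "pool": 256},
    "group_size": 8,               # for Group and Pool generation
    "selection_generated_features": 100,
    "automl_time_budget_s": 120,   # inside the feature generation process
    "automl_max_iter": 10,
}
```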
For comparison between the AutoML (FLAML) and traditional machine learning models, Catboost and three models from scikit-learn (Random Forest, Naive Bayes and Decision Tree) were used, all with default settings. Three evaluations are run for each model: a baseline without any feature engineering, singular generation, and incremental generation.
4.1.1 Metrics
For the evaluation, three metrics are used to measure
the performance of the automated feature engineering
approaches. The first metric used is accuracy, which summarizes the performance of the given AutoML model on the generated feature set by calculating the number of correct predictions divided by the total number of predictions, as shown in Equation 1. Accuracy can be used to show the performance of any given model but has limitations when the data is highly unbalanced.

\[ \text{accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{1} \]
The second metric used was balanced accuracy, which measures the performance of the ML models created by the tested approaches. Balanced accuracy is a better metric for measuring the performance of ML models on imbalanced data. The balanced accuracy is the mean of sensitivity and specificity, as shown in Equation 2.

\[ \text{balanced accuracy} = \frac{TP}{2(TP + FN)} + \frac{TN}{2(TN + FP)} \tag{2} \]
The third metric used in the experiment was error reduction. This metric shows the relative reduction in error for each of the tested models, calculated as the original error minus the new error, divided by the original error, as shown in Equation 3. For example, raising a model's AUC from 0.90 to 0.95 halves the remaining error, an error reduction of 50%.

\[ \text{error reduction} = \frac{(1 - AUC_{original}) - (1 - AUC_{new})}{1 - AUC_{original}} \tag{3} \]
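A short sketch computing the three metrics with scikit-learn, assuming y_true, y_pred and the model score vectors come from the evaluated models:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             roc_auc_score)

def error_reduction(auc_original: float, auc_new: float) -> float:
    # Equation 3: relative reduction in (1 - AUC).
    return ((1 - auc_original) - (1 - auc_new)) / (1 - auc_original)

acc = accuracy_score(y_true, y_pred)                 # Equation 1
bal_acc = balanced_accuracy_score(y_true, y_pred)    # Equation 2
reduction = error_reduction(roc_auc_score(y_true, scores_baseline),
                            roc_auc_score(y_true, scores_new))
```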
4.2 Experiment Results
4.2.1 Generation Approach Comparison
Overall, from the experiment results seen in Table 3, the automated feature engineering tool was able to improve the accuracy of the trained model in the majority of cases, with 26 cases seeing improvement and 9 seeing a regression in performance. Both the dataset tested and the generation approach of the automated feature engineering tool had major impacts on its performance. Out of the five datasets, two (CICMAL, UNSW) showed improvement with every generation approach, while Phishing Websites improved with all generation approaches except selection. For the Spambase and EMBER datasets, only three generation approaches were able to produce an improvement, with Delta working for Spambase and pool for EMBER. Two generation approaches, Singular and Incremental, were able to improve the accuracy in all tested cases. The balanced accuracy results match the accuracy results, except that Incremental generation was not able to improve in all cases while Singular generation was.
Looking at the baseline AUC and error reduction seen in Table 4, the automated feature engineering was able to reduce the error of the trained model in the majority of cases for four datasets, and reduced the error of the remaining dataset, EMBER, in three out of the seven total cases. In total, the tool was able to reduce error in 26 cases and could not in 9 cases, the same counts as the accuracy results. As with accuracy, performance on EMBER was weak, but in this case Spambase improved in six of the seven total approaches. Delta incremental had the greatest average error reduction at 32% but was only able to provide an error reduction in three of the five datasets. Both singular and group generation were able to reduce error in all five tested datasets, with singular generation having the higher average error reduction, 16% compared to 9%. The dataset with the most error reduction was CICMAL, with an average of 31%. Unlike with the accuracy results, no dataset saw error reduction with all approaches tested, with the best being six out of the seven total approaches. Overall, applying the automated feature engineering tool reduced the error in the large majority of tested cases.
The automated feature engineering times are in Table 5. The fitting time of the feature engineering tool was impacted by the dataset tested. The effects can be broken down into two parts: feature generation time and AutoML training time. There was no notable impact on overall times for the basic feature generation, with only one case using basic feature generation taking longer than one hour. Feature generation using DeltaPy did affect training times for the larger datasets, but this effect depended on which feature generators were chosen, as seen in the difference in time between Delta and Delta incremental generation for EMBER. As well, feature generation using DeltaPy resulted in increased training time for all datasets except Spambase. Unlike feature generation, the AutoML training time has a fixed upper bound of 120 seconds; this limits its maximum impact, but the training time still has a major effect on the feature engineering time, especially for the smaller datasets. The generation approach with the lowest average training time was selection, which was the least consistent generation approach, only providing improvements in 2 out of 5 datasets. This was then followed by Singular and In-
cremental, the approach with the lowest average performance improvements. In this research, the automated feature engineering process aimed to produce new features quickly. It is possible to increase the genetic algorithm's population, the number of iterations, and the maximum training time for the AutoML tool; this would increase feature engineering times, but it would be expected that higher-performing features would be found. The settings from this study should not be seen as optimized settings for this tool but as one possible set of settings to demonstrate this proof of concept. It is not the aim of this research to optimize the approach demonstrated.
4.2.2 AutoML and Traditional Machine
Learning Comparison
The results for accuracy and balanced accuracy can be found in Table 6. Starting with the accuracy metric, before applying feature engineering, AutoML had the highest accuracy in 2 datasets, UNSW and EMBER, and tied Catboost on Spambase; Catboost led Phishing Websites, and Random Forest had the highest accuracy on CICMAL. For balanced accuracy, AutoML led in 3 datasets, Spambase, UNSW and EMBER, while Catboost led Phishing Websites and Random Forest led CICMAL. After applying feature engineering, the changes were that Catboost led Phishing Websites for accuracy and Random Forest led for both metrics on UNSW. The method with the highest average accuracy and balanced accuracy was Catboost with singular feature engineering, followed by the other Catboost approaches and then the Random Forest approaches. The Naive Bayes approaches had the lowest average accuracy and balanced accuracy, but feature engineering resulted in the greatest average improvement: 0.076 for accuracy and 0.055 for balanced accuracy with incremental generation. The approach with the second highest average improvement was AutoML with incremental generation. The AutoML approaches were also the most consistent, achieving an improvement in all ten tested cases for accuracy and 9 out of the 10 cases for balanced accuracy. The Random Forest and Naive Bayes models achieved improvements in 7 out of the 10 cases for both metrics.
Exploring the error reduction results seen in Table 7, the AutoML approach has the highest starting AUC in 3 of the datasets, Spambase, UNSW and Ember, with the Random Forest model leading in 2, Phishing Websites and CICMAL. After applying feature engineering, the AutoML approach with incremental generation leads CICMAL. Looking at the average error reduction, the average performance with AutoML greatly exceeds any other tested approach, with almost 16% average error reduction for both singular and incremental generation. The second best approach, Random Forest with singular generation, was only able to achieve a 2.29% average error reduction. The AutoML approach also tied Random Forest for the most consistent error reduction, reducing the model error in 9 out of the 10 tested cases. The Naive Bayes approaches had poor consistency in error reduction, with singular generation reducing error in 2 out of 5 models and incremental generation in 1 out of 5. As well, the Naive Bayes approaches no longer had the greatest average improvement, only achieving an error reduction of 1.24% for singular and 1.64% for incremental generation.
Overall, for all of the configurations tested and the different metrics used, using an AutoML tool resulted in the most consistent improvement, but as seen from these results, the feature engineering tool can be applied to other machine learning models, with smaller and less consistent improvements possible.
The automated feature engineering generation times are in Table 8. Depending on the dataset used and the machine learning model, the automated feature engineering process can complete in a number of seconds or take over an hour. The model used significantly affected the generation time, with Catboost requiring more time than any other approach: 4812 seconds for singular and 6820 seconds for incremental generation on average. Naive Bayes required the least amount of time on average: only 201 seconds for singular and 608 seconds for incremental generation. The only models that required less training time than the original AutoML approach were Naive Bayes and Random Forest with singular generation. This means that the AutoML approach is able to train ten models per candidate in less time than other approaches needed to train one. It would be possible to optimize Catboost for better training time, as was done with the AutoML approach, but this may impact performance.
5 CONCLUSION AND FUTURE
WORK
In this research, we have shown that by utilizing ge-
netic algorithms and feature generators along with
AutoML tools, new features can be generated that can
boost the performance of classifiers in most tested
cases. We have explored different generation ap-
proaches as well as different feature generators. From
the experiment, we found it is possible to limit the
training time of the feature engineering process so
that it has an acceptable time requirement, although certain feature generators can have a much greater impact on overall feature engineering time.
Table 3: Generation Approach Accuracy and Balanced Accuracy Comparison.
Metric Accuracy Balanced Accuracy
Dataset Spambase PhishingWebsites CICMAL UNSW EMBER Average Spambase PhishingWebsites CICMAL UNSW EMBER Average
Baseline 96.09 96.82 49.92 99.6 93.39 87.164 95.8 96.62 50.4 99.05 93.61 87.096
Singular 96.35 96.89 55.57 99.69 93.5 88.4 95.95 96.71 55.99 99.26 93.7 88.322
Group 95.92 96.89 59.62 99.68 93.15 89.052 95.92 96.64 60.01 99.26 93.38 89.042
Pool 95.74 96.89 59.01 99.7 93.4 88.948 95.34 96.67 59.4 99.3 93.62 88.866
Incremental 96.52 96.85 57.85 99.69 93.4 88.862 96.13 96.62 57.77 99.26 93.62 88.68
Selection 95.57 96.09 59.46 99.65 92.96 88.746 95.19 95.85 59.85 99.19 93.19 88.654
Delta 97.91 97.07 59.01 99.7 93.33 89.404 97.69 96.83 59.4 99.29 93.55 89.352
Delta Incremental 96 98.84 59.73 99.7 93.32 89.518 95.72 98.72 60.11 99.29 93.53 89.474
Average 96.2625 97.0425 57.5213 99.6763 93.3063 95.991429 96.86285714 58.9329 99.2643 93.5129
Table 4: Generation Approach Comparison AUC Error Reduction.
Spambase PhishingWebsites CICMAL UNSW EMBER Average
Baseline 0.9909 0.9964 0.8475 0.9999 0.9867 0.9643
Singular 4.84% 15.60% 41.45% 17.73% 0.11% 15.95%
Group 5.97% 4.35% 25.90% 5.03% 1.62% 8.57%
Pool 2.93% 13.65% -4.70% 7.63% -0.27% 3.85%
Incremental 3.27% 14.17% 66.56% -4.69% 0.33% 15.93%
Selection -24.97% 5.86% 37.09% 16.91% -12.50% 4.48%
Delta 73.12% 1.12% 9.72% -6.04% -7.15% 14.15%
Delta Incremental 78.07% -3.27% 74.35% 8.49% 0.00% 31.53%
Average 17.91% 6.44% 31.30% 5.63% -2.23%
Table 5: Generation Approach Comparison Feature Engineering Times (seconds).
Spambase PhishingWebsites CICMAL UNSW EMBER Average
Singular 142.0343898 54.7885598 863.4260866 641.6705883 1970.88141 734.560207
Group 261.404541 114.6803781 1712.215061 3868.684788 2443.341906 1680.065335
Pool 204.7417284 49.2225681 2143.900624 2250.823053 1802.228511 1290.183297
Incremental 286.8734678 83.3469802 2009.662164 1347.980646 1841.885889 1113.949829
Selection 87.9470995 29.2978291 685.3757697 1479.616277 688.2061595 594.088627
Delta 106.6486528 108.726423 1734.906475 10244.63185 1869.085556 2812.799792
Delta Incremental 143.2637108 353.8791434 2853.830053 34302.44831 22799.88297 12090.66084
Average 176.1305129 113.4202688 1714.759462 7733.693645 4773.644629
Table 6: AutoML and Traditional Machine Learning Accuracy and Balanced Accuracy Comparison.
Metric Accuracy Balanced Accuracy
Model Method Spambase PhishingWebsites CICMAL UNSW EMBER Average Spambase PhishingWebsites CICMAL UNSW EMBER Average
FLAML Baseline 96.09 96.82 49.92 99.6 93.39 87.164 95.8 96.62 50.4 99.05 93.61 87.096
FLAML Singular 96.35 96.89 55.57 99.69 93.5 88.4 95.95 96.71 55.99 99.26 93.7 88.322
FLAML Incremental 96.52 96.85 57.85 99.69 93.4 88.862 96.13 96.62 57.77 99.26 93.62 88.68
Catboost Baseline 96.09 96.85 74.61 99.46 91.88 91.778 95.73 96.64 74.62 98.71 92.08 91.556
Catboost Singular 96.26 97.07 74.79 99.41 91.94 91.894 95.91 96.85 74.79 98.57 92.14 91.652
Catboost Incremental 96.18 96.78 74.56 99.47 91.82 91.762 95.78 96.59 74.57 98.72 92.03 91.538
Random Forest Baseline 95.57 96.67 76.65 99.59 88.1 91.316 95.07 96.42 76.65 99.04 88.58 91.152
Random Forest Singular 95.74 96.71 76.75 99.65 88.02 91.374 95.25 96.49 76.75 99.2 88.51 91.24
Random Forest Incremental 95.48 96.71 76.83 99.7 87.75 91.294 94.96 96.49 76.84 99.34 88.26 91.178
Naive Bayes Baseline 82.71 58.25 50.11 83.27 50.95 65.058 84.5 63.48 50.55 84.48 54.18 67.438
Naive Bayes Singular 83.32 66.32 50.17 82.84 50.95 66.72 85.08 70.54 50.61 84.55 54.18 68.992
Naive Bayes Incremental 82.97 90.74 50.16 88.49 50.95 72.662 84.75 90.86 50.6 84.38 54.18 72.954
Decision Tree Baseline 92.18 95.48 74.04 99.49 82.73 88.784 92 95.3 74.04 98.86 83.09 88.658
Decision Tree Singular 90.18 95.66 73.98 99.49 82.38 88.338 90.01 95.52 73.98 98.86 82.77 88.228
Decision Tree Incremental 91.31 95.48 74.07 99.49 82.48 88.566 91.01 95.36 74.07 98.86 82.87 88.434
Table 7: AutoML and Traditional Machine Learning Comparison AUC Error Reduction.
Model Method Spambase PhishingWebsites CICMAL UNSW EMBER Average
FLAML Baseline 0.9908689 0.996400354 0.847515322 0.999915559 0.9867202 0.964284058
FLAML Singular 4.84% 15.60% 41.45% 17.73% 0.11% 15.95%
FLAML Incremental 3.27% 14.17% 66.56% -4.69% 0.33% 15.93%
Catboost Baseline 0.9885191 0.997650741 0.846164499 0.999827772 0.9788337 0.962199151
Catboost Singular 2.61% -2.39% 0.06% -5.96% 2.73% -0.59%
Catboost Incremental 5.45% 0.09% -0.52% -1.17% 2.77% 1.32%
Random Forest Baseline 0.9859878 0.997714392 0.85519875 0.999854094 0.9654172 0.960834441
Random Forest Singular 1.91% 3.21% 3.07% 0.65% 2.62% 2.29%
Random Forest Incremental 7.22% 2.33% 0.74% -10.05% 0.20% 0.09%
Naive Bayes Baseline 0.9468869 0.970074923 0.52782945 0.887433898 0.6608854 0.798622115
Naive Bayes Singular 2.26% 3.94% 0.00% 0.00% 0.00% 1.24%
Naive Bayes Incremental -0.01% 8.14% 0.00% 0.00% 0.00% 1.62%
Decision Tree Baseline 0.915775 0.977057193 0.739819588 0.988409814 0.8296209 0.890136486
Decision Tree Singular 1.82% -3.00% 0.31% 0.55% 0.51% 0.04%
Decision Tree Incremental -2.44% -6.58% 0.43% 3.48% 0.65% -0.89%
Table 8: AutoML and Traditional Machine Learning Comparison Feature Engineering Times (seconds).
Type Approach Spambase PhishingWebsites CICMAL UNSW EMBER Average
FLAML Singular 142.0343898 54.7885598 863.4260866 641.6705883 1970.88141 734.560207
FLAML Incremental 286.8734678 83.3469802 2009.662164 1347.980646 1841.885889 1113.949829
Catboost Singular 286.7277985 1084.675602 4903.853139 9465.920835 8318.547868 4811.945049
Catboost Incremental 323.6878317 2189.710747 5479.15481 18088.45976 8020.990814 6820.400792
RF Singular 25.6818658 37.241105 514.3558151 2183.534547 278.2760238 607.8178714
RF Incremental 35.674479 39.5919785 771.8287661 9963.641372 513.4612354 2264.839566
NB Singular 1.1899117 4.7808542 81.4093617 754.2586317 160.8956941 200.5068907
NB Incremental 2.7123099 4.4873749 140.2050486 2982.596857 186.0150966 663.2033374
DT Singular 5.6072376 3.8393552 494.7855521 1823.740664 6042.433405 1674.081243
DT Incremental 6.7526271 3.9819257 1820.335759 1653.910428 5636.691553 1824.334459
We show that this approach is also able to support
traditional machine learning models, but the AutoML
tool benefits more from automated feature engineer-
ing. We also show that with the correct optimization,
the feature engineering time for AutoML can be lower on average than that of the traditional machine learning models.
This research presents a proof of concept of using genetic algorithms for feature engineering with AutoML tools. Possible future work areas include custom feature generators tailored to each dataset or problem type. In addition, the inclusion of training time in the genetic search scope would involve allowing the genetic algorithm to set the training time of the AutoML tool while also including the resulting training time as part of the fitness function. Finally, it would be possible to expand this research by utilizing multiple tools in the training process.
REFERENCES
Anderson, H. S. and Roth, P. (2018). EMBER: An Open
Dataset for Training Static PE Malware Machine
Learning Models. ArXiv e-prints.
Eldeeb, H., Amashukeli, S., and El Shawi, R. (2021). An
Empirical Analysis of Integrating Feature Extraction
to Automated Machine Learning Pipeline, pages 336–
344.
Galen, C. and Steele, R. (2020). Evaluating performance
maintenance and deterioration over time of machine
learning-based malware detection models on the em-
ber pe dataset. In 2020 Seventh International Confer-
ence on Social Networks Analysis, Management and
Security (SNAMS), pages 1–7. IEEE.
Jiao, Y., Yang, K., Dou, S., Luo, P., Liu, S., and Song,
D. (2020). Timeautoml: Autonomous representation
learning for multivariate irregularly sampled time se-
ries.
Katz, G., Shin, E. C. R., and Song, D. (2016). Explorekit:
Automatic feature generation and selection. In 2016
IEEE 16th International Conference on Data Mining
(ICDM), pages 979–984.
Kaul, A., Maheshwary, S., and Pudi, V. (2017). Autolearn — automated feature generation and selection. In 2017 IEEE International Conference on Data Mining (ICDM), pages 217–226.
Khan, P. W. and Byun, Y.-C. (2020). Genetic algorithm
based optimized feature engineering and hybrid ma-
chine learning for effective energy consumption pre-
diction. IEEE Access, 8:196274–196286.
Lashkari, A. H., Kadir, A. F. A., Taheri, L., and Ghor-
bani, A. A. (2018). Toward developing a system-
atic approach to generate benchmark android malware
datasets and classification. In 2018 International Car-
nahan Conference on Security Technology (ICCST),
pages 1–7. IEEE.
Lee, Z.-J., Lee, C.-Y., Chang, L.-Y., and Sano, N. (2021).
Clustering and classification based on distributed au-
tomatic feature engineering for customer segmenta-
tion. Symmetry, 13(9).
Liu, Y., Lyu, C., Liu, X., and Liu, Z. (2021). Automatic
feature engineering for bus passenger flow prediction
based on modular convolutional neural network. IEEE
Transactions on Intelligent Transportation Systems,
22(4):2349–2358.
Moustafa, N. and Slay, J. (2015). UNSW-NB15: A comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pages 1–6. IEEE.
Noorbehbahani, F., Rasouli, F., and Saberi, M. (2019).
Analysis of machine learning techniques for ran-
somware detection. In 2019 16th International ISC
(Iranian Society of Cryptology) Conference on Infor-
mation Security and Cryptology (ISCISC), pages 128–
133. IEEE.
Parfenov, D., Bolodurina, I., Shukhman, A., Zhigalov, A.,
and Zabrodina, L. (2020). Development and research
of an evolutionary algorithm for the formation of a
feature space based on automl for solving the prob-
lem of identifying cyber attacks. In 2020 Interna-
tional Conference Engineering and Telecommunica-
tion (En&T), pages 1–5.
Snow, D. (2020). Deltapy: A framework for tabular data
augmentation in python. Available at SSRN 3582219.
Tran, B., Xue, B., and Zhang, M. (2016). Genetic program-
ming for feature construction and selection in classifi-
cation on high-dimensional data. Memetic Computing,
8(1):3–15.