Table 8: Comparison of AutoML and traditional machine learning feature engineering times (seconds).

| Type | Approach | Spambase | PhishingWebsites | CICMAL | UNSW | EMBER | Average |
|---|---|---|---|---|---|---|---|
| FLAML | Singular | 142.0343898 | 54.7885598 | 863.4260866 | 641.6705883 | 1970.88141 | 734.560207 |
| FLAML | Incremental | 286.8734678 | 83.3469802 | 2009.662164 | 1347.980646 | 1841.885889 | 1113.949829 |
| Catboost | Singular | 286.7277985 | 1084.675602 | 4903.853139 | 9465.920835 | 8318.547868 | 4811.945049 |
| Catboost | Incremental | 323.6878317 | 2189.710747 | 5479.15481 | 18088.45976 | 8020.990814 | 6820.400792 |
| RF | Singular | 25.6818658 | 37.241105 | 514.3558151 | 2183.534547 | 278.2760238 | 607.8178714 |
| RF | Incremental | 35.674479 | 39.5919785 | 771.8287661 | 9963.641372 | 513.4612354 | 2264.839566 |
| NB | Singular | 1.1899117 | 4.7808542 | 81.4093617 | 754.2586317 | 160.8956941 | 200.5068907 |
| NB | Incremental | 2.7123099 | 4.4873749 | 140.2050486 | 2982.596857 | 186.0150966 | 663.2033374 |
| DT | Singular | 5.6072376 | 3.8393552 | 494.7855521 | 1823.740664 | 6042.433405 | 1674.081243 |
| DT | Incremental | 6.7526271 | 3.9819257 | 1820.335759 | 1653.910428 | 5636.691553 | 1824.334459 |

greater impact on overall feature engineering time.
We show that this approach can also support traditional machine learning models, although the AutoML tool benefits more from automated feature engineering. We also show that, with the correct optimization, the feature engineering time for AutoML can be lower on average than that of the traditional machine learning models.
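
The average comparison can be checked directly against the Singular rows of Table 8. The following is a minimal sketch of that arithmetic, not part of the original experiments: the variable and function names are our own, and the "pooled" average across the traditional models is only one possible reading of "on average".

```python
from statistics import mean

# Singular-approach feature engineering times (seconds), copied from Table 8.
# Column order: Spambase, PhishingWebsites, CICMAL, UNSW, EMBER.
SINGULAR_TIMES = {
    "FLAML": [142.0343898, 54.7885598, 863.4260866, 641.6705883, 1970.88141],
    "Catboost": [286.7277985, 1084.675602, 4903.853139, 9465.920835, 8318.547868],
    "RF": [25.6818658, 37.241105, 514.3558151, 2183.534547, 278.2760238],
    "NB": [1.1899117, 4.7808542, 81.4093617, 754.2586317, 160.8956941],
    "DT": [5.6072376, 3.8393552, 494.7855521, 1823.740664, 6042.433405],
}


def average_times(times):
    """Return the mean time across datasets for each model."""
    return {model: mean(values) for model, values in times.items()}


averages = average_times(SINGULAR_TIMES)
automl_avg = averages["FLAML"]

# Treat every non-FLAML row as a traditional model, as the table caption does.
traditional = {m: a for m, a in averages.items() if m != "FLAML"}

# Traditional models whose average time exceeds the AutoML average.
slower_than_automl = sorted(m for m, a in traditional.items() if a > automl_avg)

# Assumption: one pooled average over all traditional models.
pooled_traditional_avg = mean(list(traditional.values()))

print(f"FLAML average: {automl_avg:.1f} s")
print(f"Pooled traditional average: {pooled_traditional_avg:.1f} s")
print("Traditional models slower than FLAML:", slower_than_automl)
```

Under this reading, the FLAML average falls below the pooled traditional average, while the per-model comparison shows which individual models remain faster.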
This research presents a proof of concept for using genetic algorithms for feature engineering with AutoML tools. Possible areas of future work include custom feature generators tailored to each dataset or problem type. In addition, training time could be brought into the genetic search scope by allowing the genetic algorithm to set the training time of the AutoML tool while also including the resulting training time as part of the fitness function. Finally, this research could be expanded by utilizing multiple tools in the training process.
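
To make the training-time extension concrete, the sketch below shows one way an individual could carry a time-budget gene alongside its feature mask, with the elapsed training time penalized in the fitness function. This is an illustrative assumption, not the implementation used in this work: `Individual`, `fitness`, `mutate`, the budget bounds, the penalty weight, and `toy_train_and_score` (a stand-in for a real AutoML run) are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

MIN_BUDGET_S = 10.0   # hypothetical lower bound on the training budget
MAX_BUDGET_S = 300.0  # hypothetical upper bound on the training budget


@dataclass
class Individual:
    feature_mask: List[bool]  # which generated features are kept
    time_budget_s: float      # training time granted to the AutoML tool


def fitness(ind: Individual,
            train_and_score: Callable[[List[bool], float], Tuple[float, float]],
            time_weight: float = 0.1) -> float:
    """Model score minus a penalty proportional to the elapsed training time."""
    score, elapsed_s = train_and_score(ind.feature_mask, ind.time_budget_s)
    return score - time_weight * (elapsed_s / MAX_BUDGET_S)


def mutate(ind: Individual, rng: random.Random,
           flip_prob: float = 0.1, budget_sigma: float = 20.0) -> Individual:
    """Flip feature bits at random and perturb the time budget within bounds."""
    mask = [(not bit) if rng.random() < flip_prob else bit
            for bit in ind.feature_mask]
    budget = ind.time_budget_s + rng.gauss(0.0, budget_sigma)
    budget = min(MAX_BUDGET_S, max(MIN_BUDGET_S, budget))
    return Individual(mask, budget)


def toy_train_and_score(mask: List[bool], budget_s: float) -> Tuple[float, float]:
    """Stand-in for an AutoML run: returns (validation score, elapsed seconds)."""
    useful = sum(mask[:3])  # pretend only the first three features help
    score = 0.6 + 0.1 * useful * min(1.0, budget_s / 120.0)
    return score, budget_s  # assume the tool consumes its whole budget


rng = random.Random(0)
population = [
    Individual([rng.random() < 0.5 for _ in range(6)],
               rng.uniform(MIN_BUDGET_S, MAX_BUDGET_S))
    for _ in range(8)
]
initial_best = max(fitness(i, toy_train_and_score) for i in population)

# Simple elitist loop: parents compete with their mutated children.
for _ in range(20):
    children = [mutate(parent, rng) for parent in population]
    population = sorted(population + children,
                        key=lambda i: fitness(i, toy_train_and_score),
                        reverse=True)[:8]

best = population[0]
print(f"best budget: {best.time_budget_s:.1f} s, "
      f"fitness: {fitness(best, toy_train_and_score):.3f}")
```

In a real setting, `toy_train_and_score` would be replaced by a call that trains the AutoML tool under the given budget and reports both the validation score and the measured training time. The penalty weight would need tuning so that the search does not simply collapse to the smallest budget.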
Automated Feature Engineering for AutoML Using Genetic Algorithms