Authors:
Luís Ferreira 1,2; André Pilastri 2; Carlos Martins 3; Pedro Santos 3 and Paulo Cortez 1
Affiliations:
1 ALGORITMI Centre, Dep. Information Systems, University of Minho, Guimarães, Portugal; 2 EPMQ - IT Engineering Maturity and Quality Lab, CCG ZGDV Institute, Guimarães, Portugal; 3 WeDo Technologies, Braga, Portugal
Keyword(s):
Automated Machine Learning, Distributed Machine Learning, Supervised Learning, Risk Management.
Abstract:
Automation and scalability are currently two of the main challenges of Machine Learning (ML). This paper proposes an automated and distributed ML framework that automatically trains a supervised learning model and produces predictions independently of the dataset and with minimal human input. The framework was designed for the domain of telecommunications risk management, which often requires supervised learning models that must be quickly updated by non-ML-experts and trained on vast amounts of data. Thus, the architecture assumes a distributed environment, in order to deal with big data, and uses Automated Machine Learning (AutoML) to select and tune the ML models. The framework includes several modules: task detection (to detect whether the task is classification or regression), data preprocessing, feature selection, model training, and deployment. In this paper, we detail the model training module. To select the computational technologies to be used in this module, we first analyzed the capabilities of an initial set of five modern AutoML tools: Auto-Keras, Auto-Sklearn, Auto-Weka, H2O AutoML, and TransmogrifAI. We then benchmarked the only two tools that address distributed ML (H2O AutoML and TransmogrifAI). Several comparison experiments were conducted using three real-world datasets from the telecommunications domain (churn, event forecasting, and fraud detection), allowing us to measure the computational effort and predictive capability of the AutoML tools.
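The abstract does not say how the task detection module decides between classification and regression. As a purely illustrative sketch, one simple heuristic inspects the target column: non-numeric labels imply classification, fractional values imply regression, and integer-valued targets are treated as class codes only when they take few distinct values. The function name `detect_task` and the `max_classes` threshold are assumptions made for this example, not part of the framework described in the paper.

```python
from typing import Sequence


def detect_task(target: Sequence, max_classes: int = 20) -> str:
    """Guess the supervised task from the target column (hypothetical heuristic)."""
    values = [v for v in target if v is not None]
    if not values:
        raise ValueError("target column has no usable values")

    # Strings, booleans, and other non-numeric labels can only be classes.
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in values):
        return "classification"

    distinct = set(values)

    # Any fractional (or non-finite) value marks the target as continuous.
    if any(not float(v).is_integer() for v in distinct):
        return "regression"

    # Integer-valued target: few distinct values suggests class codes,
    # many distinct values suggests a count or other numeric quantity.
    return "classification" if len(distinct) <= max_classes else "regression"
```

For instance, a churn flag such as `[0, 1, 1, 0]` would be routed to classification, while a column of call durations with fractional values would be routed to regression. The threshold is a judgment call and would likely need tuning per dataset.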