To guide the user in the data mining process, the tasks are available for insertion according to a predefined grammar, presented next in EBNF notation:
start = dataset_input val_procedure
val_procedure = (assessment_method_1
                 {(preprocessing_1 | feature_selection_1)}
                 create_model)
              | (assessment_method_2
                 (preprocessing_2 | feature_selection_2 | create_model))
preprocessing_1 = "preprocessing_method" {val_procedure_1}
feature_selection_1 = "feature_selection_algorithm" {val_procedure_1}
preprocessing_2 = "preprocessing_method" {val_procedure_2}
feature_selection_2 = "feature_selection_algorithm" {val_procedure_2}
create_model = "machine_learning_algorithm" "eval_metrics"
assessment_method_1 = "cross_validation" | "hold_out" | "t_v_t"
assessment_method_2 = "use_entire_data"
dataset_input = "dataset_input"
val_procedure_1 = (preprocessing_1 | feature_selection_1)
val_procedure_2 = (preprocessing_2 | feature_selection_2 | create_model)
In this grammar, the terminals are between double quotes. These are specific tasks to be executed and might have different representations. For example, “preprocessing_method” might be a z-score normalisation or a min-max normalisation task.
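To make the grammar concrete, the following Python sketch checks whether a flattened (sequential) list of task terminals forms a valid workflow. This is only an illustration of the grammar's structure, not the tool's implementation; since workflows are sequential, we read each {val_procedure_1}/{val_procedure_2} as continuing the sequence at most once.

# Illustrative check of a sequential task list against the grammar above.
# Terminal names are taken from the grammar; this is not the tool's code.

STEP = {"preprocessing_method", "feature_selection_algorithm"}
ASSESS_1 = {"cross_validation", "hold_out", "t_v_t"}  # assessment_method_1
ASSESS_2 = {"use_entire_data"}                        # assessment_method_2
MODEL, METRICS = "machine_learning_algorithm", "eval_metrics"

def is_valid_workflow(tasks):
    # start = dataset_input val_procedure
    if not tasks or tasks[0] != "dataset_input" or len(tasks) < 2:
        return False
    head, body = tasks[1], tasks[2:]
    i = 0
    while i < len(body) and body[i] in STEP:
        i += 1  # consume preprocessing / feature selection steps
    if head in ASSESS_1:
        # branch 1 requires a model and its evaluation metrics at the end
        return body[i:] == [MODEL, METRICS]
    if head in ASSESS_2:
        # branch 2 requires at least one task; the model is optional
        return bool(body) and (i == len(body) or body[i:] == [MODEL, METRICS])
    return False

# Example: a valid cross-validation workflow
print(is_valid_workflow(["dataset_input", "cross_validation",
                         "preprocessing_method",
                         "machine_learning_algorithm", "eval_metrics"]))  # True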
In Figures 1 and 2 we show that when the user clicks the plus button to add a new task, depending on the current state of the workflow, s/he only sees the tasks permitted by the previous grammar.
Figure 2: UI showing the cross validation task (a validation procedure task) and the tasks that can be used after it.
In summary, the six types of task that can be used
in the workflow are the following:
• Dataset Input: a unique task where the user specifies the data set to use. S/he can also choose to remove features during this step.
• Validation Procedure: contains tasks that specify a method to be used in the creation of the data mining process. The user can define whether the next tasks should be included in an assessment/selection process (e.g., cross validation) or whether the tasks should be created using all data.
• Preprocessing: contains tasks that apply transformations to attribute values (e.g., z-score normalisation).
• Feature Selection: contains tasks to assess the relevance of features for selection (e.g., ReliefF).
• Model Creation: contains tasks for the creation of models using different algorithms (e.g., Support Vector Machine (SVM)).
• Model Evaluation: contains tasks that specify the metrics to use for performance evaluation (e.g., recall and precision).
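As an example, a workflow that normalises the data, selects features with ReliefF, and trains and evaluates an SVM under cross validation could be serialised as the following sequence. The field names here are hypothetical and only illustrate the six task types; the tool's actual representation may differ.

workflow = [
    {"type": "dataset_input", "dataset": "my_data.csv"},
    {"type": "validation_procedure", "method": "cross_validation"},
    {"type": "preprocessing", "method": "z_score_normalisation"},
    {"type": "feature_selection", "algorithm": "relieff"},
    {"type": "model_creation", "algorithm": "svm"},
    {"type": "model_evaluation", "metrics": ["recall", "precision"]},
]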
3.2 Architecture
The previous UI is part of a microservices architecture that we illustrate in Figure 3.
In this architecture, a user can access the UI through the UI Service, which provides a web application written in ReactJS, from which further requests are made to our API Gateway that redirects the requests to the different services accordingly.
The Tasks Service returns representations of data mining tasks that can be used to compose the sequential data mining workflow.
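For instance, a single task representation returned by the Tasks Service could look like the following (a hypothetical schema given only for illustration):

task_representation = {
    "name": "z_score_normalisation",   # illustrative field names
    "category": "preprocessing",
    "parameters": [
        {"name": "columns", "type": "list_of_strings", "required": False},
    ],
}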
The User Service enables users to log in with a username and a password and holds information related to users.
The Templates Service contains predefined templates of data mining workflows useful for certain data and business domains.
The Datasets Service stores uploaded datasets in a central file system (Network File System (NFS)) and also returns data from the NFS according to users' requests. The MongoDB instance in the Datasets Service is used to store metadata related to uploaded datasets.
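Such a metadata document could, for example, record where the file lives on the NFS along with basic statistics. All field names below are assumptions for illustration:

dataset_metadata = {
    "owner": "some_user",              # hypothetical field names
    "filename": "my_data.csv",
    "nfs_path": "/datasets/some_user/my_data.csv",
    "n_rows": 10000,
    "n_features": 21,
    "uploaded_at": "2020-01-01T00:00:00Z",
}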
Then, we have the Workflows Service that translates sequential workflows sent by users to a representation that is understandable by Netflix Conductor^6. The new representation is sent to the Conductor Service, which employs Netflix Conductor, and becomes available to be processed by the different Data Science services/workers. The Workflows Service is also contacted to return the status of workflows sent by users.
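As a sketch of that translation, a two-step sequential workflow could map to a Conductor workflow definition along the following lines. We follow Conductor's JSON workflow format, expressed here as a Python dict; the task names and input parameters are illustrative assumptions, not the service's actual output.

conductor_workflow = {
    "name": "user_workflow",
    "version": 1,
    "schemaVersion": 2,
    "tasks": [
        {   # each step of the sequential workflow becomes a SIMPLE task
            "name": "preprocessing_z_score",    # illustrative name
            "taskReferenceName": "preprocess_1",
            "type": "SIMPLE",
            "inputParameters": {"dataset_id": "${workflow.input.dataset_id}"},
        },
        {
            "name": "create_model_svm",         # illustrative name
            "taskReferenceName": "model_1",
            "type": "SIMPLE",
            # consume the previous task's output via Conductor's expression syntax
            "inputParameters": {"data": "${preprocess_1.output.data}"},
        },
    ],
}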
By using the Netflix Conductor technology we can organise the tasks in a certain sequence, and the Data Science services can pull the scheduled tasks and work on them in parallel and independently, following a competing consumers pattern (Hohpe and
^6 https://netflix.github.io/conductor/
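A minimal sketch of such a Data Science worker, assuming Conductor's HTTP task-polling endpoints and a hypothetical run_task function standing in for the actual computation:

import requests  # third-party HTTP client

CONDUCTOR_API = "http://conductor-service:8080/api"  # assumed service URL

def run_task(input_data):
    # Placeholder for the actual data science computation.
    return {"status": "done"}

def poll_once(task_type, worker_id):
    # Competing consumers: many workers poll the same task type and
    # Conductor hands each scheduled task to exactly one of them.
    resp = requests.get(f"{CONDUCTOR_API}/tasks/poll/{task_type}",
                        params={"workerid": worker_id})
    if resp.status_code != 200 or not resp.content:
        return  # no task of this type is currently scheduled
    task = resp.json()
    output = run_task(task.get("inputData", {}))
    requests.post(f"{CONDUCTOR_API}/tasks", json={
        "taskId": task["taskId"],
        "workflowInstanceId": task["workflowInstanceId"],
        "status": "COMPLETED",
        "outputData": output,
    })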