A Data Mining Service for Non-Programmers
Artur Pedroso, Bruno Leonel Lopes, Jaime Correia, Filipe Araujo, Jorge Cardoso and Rui Pedro Paiva
CISUC, Dept. of Informatics Engineering, University of Coimbra, Portugal
Keywords:
Data Science, Data Mining, Machine Learning, Microservices.
Abstract:
With the emergence of Big Data, the scarcity of data scientists to analyse all the data being produced in
different domains became evident. To train new data scientists faster, web applications that provide data science
practices without requiring programming skills can be a great help. However, some available web applications
fall short in providing good data mining practices, especially for the assessment and selection of models. Thus, in this
paper we describe a system, currently under development, that will support the construction of data mining
processes while enforcing good data mining practices. The system will be available through a web UI and will follow
a microservices architecture that is still being designed and tested. Preliminary usability tests were conducted
with two groups of users to evaluate the envisioned concept for the creation of data mining processes. In these
tests we observed a generally high level of user satisfaction. To assess the performance of the current system
design, we ran tests in a public cloud, where we observed interesting results that will guide us in new
directions.
1 INTRODUCTION
In a broad view, data mining is the process of discovering interesting patterns and knowledge from large
amounts of data (Han et al., 2011). However, for the correct application of data mining processes, and
also for the evolution of the field, competent data scientists are required, a resource in high demand
these days (Henke et al., 2016; Miller and Hughes, 2017). To fill this demand, more data scientists need
to be trained, which takes time due to the diversity of disciplines to learn (Cao, 2017). Thus, by
abstracting programming languages away from the data scientist's path, we might reduce the time
needed to train them.
With the data mining process in mind, we decided to create a system that allows users to build workflows
representing the data mining process. It will be available through a web UI that follows good usability
heuristics (Nielsen, 1994) and guides the user in the creation of data mining processes without requiring
programming skills.
The user will be able to create experiments based on workflows composed of sequential data mining
tasks. These tasks will allow data insertion, preprocessing, feature selection, model creation and model
evaluation. Some tasks will include parameters that can be used in grid search along with nested cross
validation, enforcing good model assessment and selection practices (Cawley and Talbot, 2010).
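To make the intended practice concrete, the following sketch shows what nested cross validation with grid search looks like in plain Python with scikit-learn. It is only an illustration of the practice our system will automate, not code from the system itself; the pipeline steps and parameter grid are example choices.

# Illustration only: nested cross validation with grid search,
# the model assessment/selection practice the system will enforce.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Preprocessing and feature selection live inside the pipeline,
# so they are re-fitted on every training fold (no information leak).
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest()),
    ("model", SVC()),
])
param_grid = {"select__k": [1, 2, 3, 4], "model__C": [0.1, 1, 10]}

# Inner loop selects the best parameters; outer loop assesses that selection.
inner = GridSearchCV(pipeline, param_grid, cv=5)
outer_scores = cross_val_score(inner, X, y, cv=10)
print("Estimated accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))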
To evaluate the envisioned system, we created a first prototype and conducted usability tests with a
group of users familiar with data mining frameworks and another group of users without experience with
related tools, though with a background in statistics, who can also benefit from our software. We
observed an overall positive user satisfaction in both groups.
To evaluate the impact of the current microservices architecture on the performance of the system, we
deployed it in a public cloud and performed tests using datasets of different sizes. The results are
interesting and an incentive to guide us in new directions.
The remaining document is organised as follows. In Section 2, we analyse related research and applications.
In Section 3, we present an overview of the envisioned user interface and the system architecture. In
Section 4, we present the preliminary experiments done and the respective results. Finally, in Section 5 we
draw the main conclusions of this work and point out future research directions.
2 RELATED WORK
Some applications in production already provide the
creation of data mining processes without requiring
users to have programming skills.
Azure Machine Learning Studio [1] is a publicly available software-as-a-service solution that allows its
users to create data mining workflows by dragging blocks that represent data mining tasks into a working
area.
RapidMiner Studio [2] and Orange [3] provide the same concept as Azure Machine Learning Studio for
the creation of data mining processes. However, these are local solutions.
The three previous tools require users to create complex workflows, including different tasks and
parameters, to assess the performance of models. Cross validation in Azure and Orange is applied only to the
model creation phase and does not include prior operations such as feature selection, which is a bad practice
for estimating the model's performance (Cawley and Talbot, 2010).
H2O Flow [4] offers a fully distributed, in-memory, open-source ML platform that can be deployed in
clusters. The platform can be used from a web UI that makes it possible to apply machine learning (ML)
in a sequence of steps without requiring users to have programming skills. However, the user is limited to
uploading datasets and building models with the provided ML algorithms. Other data mining tasks (e.g.,
feature selection) are not available.
Weka [5] is a local solution that enables the application of data mining tasks to datasets. Building data
mining processes composed of multiple tasks and parameters can become complex.
(Kranjc et al., 2017) and (Medvedev et al., 2017) are both research projects that provide cloud solutions
for the creation of data mining processes through a web UI, employing concepts (drag-and-drop) similar to
those of Azure, RapidMiner and Orange. Neither system solves the problems exposed by the previous
systems.
Besides RapidMiner, none of the above applications provide the insertion of a data mining experiment
in a (nested) cross validation loop. It is also common to see in some of the previous systems that cross
validation is applied only to the final model, without including prior tasks such as feature selection in the
loop, which is a bad practice (Hastie et al., 2001; Cawley and Talbot, 2010).
In addition to the problems mentioned above, none of these systems guide the user through the data mining
process.
[1] https://studio.azureml.net/
[2] https://rapidminer.com/products/studio/
[3] https://orange.biolab.si/
[4] https://www.h2o.ai/
[5] https://www.cs.waikato.ac.nz/ml/weka/
With these limitations in mind, the following requirements will be addressed in our system:
- Provide a web UI with good usability for non-programmers to execute data mining tasks.
- Guide the user in the creation of a data mining process.
- Provide different data preprocessing methods, feature selection and machine learning algorithms.
- Allow the creation of data mining experiments including different tasks, features and parameters for the evaluation and selection of the best model (the one with the "best" features and parameters). Here, good data mining practices will be guaranteed, e.g., nested cross validation.
- Provide an application accessible from the cloud where data mining workflows can be left running and accessed later.
- Provide a scalable system to support a large number of simultaneous users.
3 DESIGN AND IMPLEMENTATION

In this section we present the user interface that was used in the usability tests and the architecture as it
currently stands.
3.1 User Interface
The UI is divided into two key areas, as we can see in Figure 1. The darker area on the left includes
operations for the creation and retrieval of workflows and datasets. It also enables the execution and
interruption of workflows that are built in the area on the right.
Figure 1: User interface, showing a dataset insertion task and the option to insert a validation procedure after clicking the plus button.
The area on the right is where the user builds the workflow, inserting the tasks that compose a data mining
process.
To guide the user in the data mining process, the tasks are available for insertion according to a predefined
grammar, presented next in EBNF notation:
start = dataset_input val_procedure
val_procedure = ((assessment_method_1
  {(preprocessing_1 | feature_selection_1)}
  create_model) | (assessment_method_2
  (preprocessing_2 | feature_selection_2 |
  create_model)))
preprocessing_1 = "preprocessing_method"
  {val_procedure_1}
feature_selection_1 = "feature_selection_algorithm"
  {val_procedure_1}
preprocessing_2 = "preprocessing_method"
  {val_procedure_2}
feature_selection_2 = "feature_selection_algorithm"
  {val_procedure_2}
create_model = "machine_learning_algorithm"
  "eval_metrics"
assessment_method_1 = "cross_validation" |
  "hold_out" | "t_v_t"
assessment_method_2 = "use_entire_data"
dataset_input = "dataset_input"
val_procedure_1 = (preprocessing_1 |
  feature_selection_1)
val_procedure_2 = (preprocessing_2 |
  feature_selection_2 | create_model)
In this grammar, the terminals are between double quotes. These are specific tasks to be executed
and might have different representations. For example, "preprocessing_method" might be a z-score
normalisation or a min-max normalisation task.
In Figures 1 and 2 we show that when the user clicks the plus button to add a new task, depending on
the current state of the workflow, s/he only sees the tasks allowed by the grammar above.
Figure 2: UI, showing a cross validation task (a validation procedure task) and the tasks that can be used after it.
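As an illustration of how such grammar-driven guidance can work, the sketch below computes which task categories may follow the current (partial) workflow. It is a simplified approximation written for this paper, not the actual implementation; the function name and the representation of a workflow as a list of terminal names are our own assumptions.

# Illustrative, simplified sketch of grammar-driven task suggestion.
def allowed_next_tasks(workflow):
    # workflow is a list of terminal task names, in order of insertion
    if not workflow:
        return ["dataset_input"]
    last = workflow[-1]
    if last == "dataset_input":
        # a validation procedure must come next
        return ["cross_validation", "hold_out", "t_v_t", "use_entire_data"]
    if last == "machine_learning_algorithm":
        return ["eval_metrics"]       # model creation is followed by evaluation metrics
    if last == "eval_metrics":
        return []                     # the workflow is complete
    # after an assessment method, preprocessing or feature selection,
    # the user may keep adding steps or create the model
    return ["preprocessing_method", "feature_selection_algorithm",
            "machine_learning_algorithm"]

print(allowed_next_tasks(["dataset_input", "cross_validation"]))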
In summary, the six types of task that can be used in the workflow are the following:
- Dataset Input: a unique task where the user specifies the dataset to use. S/he can also choose to remove features during this step.
- Validation Procedure: contains tasks that specify a method to be used in the creation of the data mining process. The user can define whether the next tasks should be included in an assessment/selection process (e.g., cross validation), or whether the tasks should be executed using all data.
- Preprocessing: contains tasks that apply transformations to attribute values (e.g., z-score normalization).
- Feature Selection: contains tasks to assess the relevance of features for selection (e.g., Relieff).
- Model Creation: contains tasks for the creation of models using different algorithms (e.g., Support Vector Machine (SVM)).
- Model Evaluation: contains tasks that specify the metrics to use for performance evaluation (e.g., recall and precision).
3.2 Architecture
The previous UI is part of a microservices architecture
that we illustrate in Figure 3.
In this architecture, a user accesses the UI through the UI Service, which provides a web application
written in ReactJS, from which further requests are made to our API Gateway, which redirects the requests
to the different services accordingly.
The Tasks Service returns representations of data mining tasks that can be used to compose the sequential
data mining workflow.
The User Service enables users to log in with a username and a password and holds user-related
information.
The Templates Service contains predefined templates of data mining workflows useful for certain
data and business domains.
The Datasets Service stores uploaded datasets in a central file system (Network File System (NFS)) and
also returns data from the NFS according to users' requests. The MongoDB in the Datasets Service is used
to store metadata related to uploaded datasets.
Then, we have the Workflows Service, which translates sequential workflows sent by users to a
representation that is understandable by Netflix Conductor [6]. The new representation is sent to the Conductor
Service, which employs Netflix Conductor, and becomes available to be processed by the different Data Science
services/workers. The Workflows Service is also contacted to return the status of workflows sent by users.
By using the Netflix Conductor technology we can organise the tasks in a certain sequence, and the
Data Science services can pull the scheduled tasks and work on them in parallel and independently,
following a competing consumers pattern (Hohpe and Woolf, 2003). Netflix Conductor ensures that tasks
appearing later in a workflow's path are executed only after the prior tasks have completed.

[6] https://netflix.github.io/conductor/

Figure 3: Current system's architecture.
The Data Science Services are multiple fine-grained services/workers that work on specific data
science tasks pulled from the Conductor Service. These Data Science Services share files (e.g., datasets,
models) between them by writing to and reading from the NFS.
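The competing consumers pattern referred to above can be illustrated in a few lines: several identical workers pull from one shared pool of scheduled tasks, so whichever worker is free takes the next task. The sketch below uses an in-process queue and dummy task names purely for illustration; in the actual system the queueing role is played by the Conductor Service, not by this code.

# Minimal illustration of the competing consumers pattern:
# several workers pull tasks from one shared queue and process them in parallel.
import queue, threading, time

tasks = queue.Queue()
for name in ["split_dataset", "feature_scaling_train", "svm_creation",
             "feature_scaling_test", "svm_prediction", "classification_metrics"]:
    tasks.put(name)

def worker(worker_id):
    while True:
        try:
            task = tasks.get(timeout=1)   # whichever worker is free grabs the next task
        except queue.Empty:
            return
        time.sleep(0.1)                   # placeholder for the real data science work
        print("worker %d finished %s" % (worker_id, task))
        tasks.task_done()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()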
Communication between all the services presented in the architecture is performed using the
HTTP protocol, mainly through REST APIs. All the services can be scaled out independently.
To better understand how individual data science tasks are processed in the system, Figure 4 presents
an example of the translation from a sequential workflow sent by the user (on the left) to its
representation in Netflix Conductor (on the right). This translation shields users from the creation of
complex workflows, which is an advantage over other systems such as Azure ML Studio, as mentioned above.
Figure 4: Example of a data mining workflow translation. The user's sequential workflow (dataset input, train-test validation procedure, feature scaling, SVM, classification performance) is translated into Conductor tasks (split dataset, feature scaling of each split, SVM creation, SVM prediction and classification performance calculation).
The sequential workflow sent by the user contains the location of the dataset to use, the procedure to
evaluate the process (hold-out / train-test method), a feature scaling task that is followed by a model creation
task using the SVM algorithm, and finally a task to show the classification performance of the produced model.
Upon receiving the workflow, the Workflows Service translates it to the Netflix Conductor representation.
In the new representation, the flow starts with a Split Dataset task (splitting the original data into training
and test sets), followed by a feature scaling task (applied to the training set). Then, an SVM creation task
(applied to the processed training set) and a feature scaling task (applied to the test set, using information
from the previous feature scaling task) can be handled in parallel. The SVM prediction task (applied to the
processed test set, using the model created before) comes next, and finally there is a task to compute
the classification performance of the model. It is normal for tasks appearing later in the workflow to use
data produced by preceding tasks.
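To make the translation step more tangible, the following sketch mimics how a sequential workflow like the one above could be expanded into a dependency-annotated task list. The function and field names are hypothetical and do not correspond to the real Workflows Service code or to Netflix Conductor's actual workflow-definition format; the sketch only reproduces the dependency structure of Figure 4.

# Hypothetical sketch of the workflow translation described above.
# Each produced task lists the tasks it depends on; tasks whose
# dependencies are satisfied can be scheduled in parallel.
def translate_train_test_workflow(dataset, model="svm"):
    return [
        {"task": "split_dataset", "input": dataset, "depends_on": []},
        {"task": "feature_scaling_train", "depends_on": ["split_dataset"]},
        {"task": "%s_creation" % model, "depends_on": ["feature_scaling_train"]},
        # scaling of the test set reuses parameters fitted on the training set,
        # so it can run in parallel with model creation
        {"task": "feature_scaling_test", "depends_on": ["feature_scaling_train"]},
        {"task": "%s_prediction" % model,
         "depends_on": ["%s_creation" % model, "feature_scaling_test"]},
        {"task": "classification_performance",
         "depends_on": ["%s_prediction" % model]},
    ]

for task in translate_train_test_workflow("iris.csv"):
    print(task["task"], "<-", task["depends_on"])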
4 EXPERIMENTS
In this section we present the tests done with a first prototype of the system, deployed on a cluster in
Google Kubernetes Engine [7]. We used 4 instances with 2 vCPUs and 7.5 GB of RAM each.

[7] https://cloud.google.com/kubernetes-engine/
4.1 Usability Tests
4.1.1 Setup
The usability tests played a crucial role in evaluating the prototype and validating the paradigm of
constructing data mining processes using sequential tasks. The tests consisted in having the users execute
a few exercises using the interface and collecting their feedback. This feedback was then used to evaluate
the users' experience, the usability of the interface and the value provided to them, hence validating the concept.

Figure 5: Average and standard deviation of the users' responses (Type A vs. Type B) to ten questionnaire statements, covering understanding of the exercises, pleasantness of the experience, relevance of the application, attractiveness, clarity of the design, ease of finding functionalities, met expectations, ease of learning, and willingness to reuse and to recommend the application.
We divided the users into two types:
- Type A: users with no experience with data mining systems and no knowledge of data mining or programming languages (8 users).
- Type B: users with experience in data mining systems (mainly Orange), with knowledge of data mining but without programming skills (11 users).
The usability tests started with a quick overview of the platform and its functionalities, which took less
than 3 minutes. After this introduction and a question-answering period, we gave the users a script with a few
exercises estimated to be solvable in less than 20 minutes. At the end we gave them a questionnaire to fill in
about their experience and their thoughts on the relevance of the system.
To keep the tests simple, we decided to ask the users to perform six exercises using the iris flower
dataset (Anderson, 1936).
The exercises were simple and intertwined, giving the users a feeling of progress during their execution.
Briefly, the exercises that we asked them to perform were the following:
1. To scale the attributes of the dataset between the values 0 and 1.
2. To create an SVM model and use the hold-out procedure to assess the model's performance, and to verify the accuracy and f-measure of the produced model.
3. The same exercise as before, but including a feature scaling operation before model creation. This was conducted to verify whether the user was aware that tasks could be created and removed in the middle of a previously created workflow.
4. To perform feature selection using the Relieff algorithm and different numbers of features, to see which attributes would have the most predictive capability.
5. To build an SVM model preceded by feature scaling, using the two best features discovered in the previous exercise, and to use cross validation to validate the model.
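For reference, the last exercise corresponds roughly to the following scikit-learn snippet, with scaling performed inside the cross validation loop. This is only an external illustration of what the workflow does, not code generated by our system; the choice of the two petal features and of 10 folds are our own example assumptions.

# Rough code equivalent of exercise 5 (illustration only).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_two_best = X[:, [2, 3]]  # e.g., petal length and petal width

# Scaling is part of the pipeline, so it is fitted only on each training fold.
workflow = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])
scores = cross_val_score(workflow, X_two_best, y, cv=10)
print("Cross-validated accuracy: %.3f" % scores.mean())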
4.1.2 Results
After performing the tests we asked the users to fill in a questionnaire, which allowed us to know how much
the users liked the interface, what their experience using the tool was and whether they found it useful. Each
statement could be answered with: totally disagree, disagree, indecisive, agree and totally agree. To analyse
the average response and the standard deviation we converted the answers to numbers, where 1 translates to
"totally disagree" and 5 to "totally agree".
As seen in Figure 5, the values are all above average. The most satisfactory results were that users
found the interface easy to use, would recommend it to colleagues and would use it again to solve related
problems. The attractiveness of the interface, even though rated very positively, scored lower than the other
metrics; this was expected, since this is a prototype and that aspect was not a priority. The results
acquired from Type A users are lower than the ones from Type B. This showed that the users with no
experience (Type A) had more difficulty using the interface, which was expected, but surprisingly they found
it easier to find the required functionalities and considered the design simpler to understand.
Besides answering the questionnaire, the users also had a place to write suggestions, critiques and what
they liked most about the application. This feedback reinforced what was learned from the questionnaire
and was very satisfactory. None of the critiques concerned the concept we aim to prove, and the things the
users liked most were in line with the objectives we tried to achieve when building the application.
4.2 Computational Performance Tests
Basic preliminary computational performance tests were done to assess how the system behaves with
the current architecture. We executed tests using two randomly generated numerical datasets with a binary
response class: Dataset 1, containing 10000 rows and 1001 columns (34.2 MB), and Dataset 2, with 20000
rows and 1001 columns (68.4 MB).
Using each dataset, we created a Naïve Bayes model 10 times and evaluated its classification performance
using 10-fold cross validation.
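As a rough indication of what one run of this job entails, the snippet below generates a dataset of the same shape and evaluates a Gaussian Naïve Bayes model with 10-fold cross validation in scikit-learn. It is meant only to characterise the workload; it is neither our system's code nor the H2O baseline, and the random data and seed are arbitrary.

# Illustrative reproduction of the benchmark job (not the system's own code).
import time
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.random((10000, 1000))            # ~Dataset 1: 10000 rows, 1000 features
y = rng.integers(0, 2, size=10000)       # binary response class

start = time.time()
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print("Mean accuracy: %.3f, elapsed: %.1f s" % (scores.mean(), time.time() - start))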
As a baseline, we performed the same experiments with H2O deployed on an identical cluster.
The results can be seen in Figure 6.
Figure 6: Time (in seconds) to execute the data mining job on Dataset 1 and Dataset 2, for our system and for H2O.
It can be seen that our system is slower in these preliminary tests, but this is not unexpected, as we
store intermediate results on a centralised disk using NFS, while H2O stores them in memory. We will
address this issue in the future.
5 CONCLUSION
We presented a service for non-programmers to perform data mining experiments employing good
machine learning / data mining practices. We prototyped a cloud application following a microservices
architecture, with an interface that aims to achieve high usability.
To evaluate a first prototype and validate the paradigm of visual programming using sequential
tasks, we conducted experiments with experienced and non-experienced users, who provided us with
satisfactory feedback.
Future work will include not only more usability tests with experienced users to improve the user
interface in terms of aesthetics and functionality, but mainly investment in optimising the current
architecture, which might include storing intermediate results in memory and other techniques that can
produce results faster.
ACKNOWLEDGEMENTS
This work was carried out under the project PTDC/EEI-ESS/1189/2014 Data Science for Non-Programmers,
supported by COMPETE 2020, Portugal 2020-POCI, UE-FEDER and FCT.
REFERENCES
Anderson, E. (1936). The species problem in Iris. Annals of the Missouri Botanical Garden, 23:457–509.
Cao, L. (2017). Data science: A comprehensive overview. ACM Comput. Surv., 50(3):43:1–43:42.
Cawley, G. C. and Talbot, N. L. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079–2107.
Han, J., Pei, J., and Kamber, M. (2011). Data Mining: Concepts and Techniques. Elsevier.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.
Henke, N., Bughin, J., Chui, M., Manyika, J., Saleh, T., Wiseman, B., and Sethupathy, G. (2016). The age of analytics: Competing in a data-driven world. McKinsey Global Institute, 4.
Hohpe, G. and Woolf, B. (2003). Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Kranjc, J., Orač, R., Podpečan, V., Lavrač, N., and Robnik-Šikonja, M. (2017). ClowdFlows: Online workflows for distributed big data mining. Future Generation Computer Systems, 68:38–58.
Medvedev, V., Kurasova, O., Bernatavičienė, J., Treigys, P., Marcinkevičius, V., and Dzemyda, G. (2017). A new web-based solution for modelling data mining processes. Simulation Modelling Practice and Theory, 76:34–46. High-Performance Modelling and Simulation for Big Data Applications.
Miller, S. and Hughes, D. (2017). The quant crunch: How the demand for data science skills is disrupting the job market. Burning Glass Technologies.
Nielsen, J. (1994). Enhancing the explanatory power of usability heuristics. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '94, pages 152–158, New York, NY, USA. ACM.