any given data project management should be
somehow measured in order to let organizations know
how to choose the most adequate professionals for a
task.
We propose measuring this ability along two
dimensions: on the one hand, the data management
expertise of the data scientist; and on the other
hand, his/her efficiency when performing specific
data science tasks.
To define the measure of the first dimension,
named personal data management maturity, we
ground our proposal on an existing data management
maturity model, dgmr (Caballero et al., 2013), which
is further described in Section 2. Similarly, the
measure for the second dimension, named
personal data science performance, is grounded on
the Personal Software Process (PSP), described in
(Humphrey, 1997, 2000) as “a set of methods, forms,
and scripts that show software engineers how to plan,
measure, and manage their work”.
With these two dimensions, we propose the
Personal Data Science Process (PdsP) as a structured
set of process descriptions, measurements, and
methods that can help data scientists to improve
their personal performance and their ability to act
and decide at each step of the lifecycle of the data
used in their analyses.
To the best of our knowledge, no one has yet
proposed a principled methodology for data
scientist (self-) appraisal. The main rationale for this
proposal is to develop a universal recognition of the
skills and capabilities of professionals working in
Data Science. In this way, organizations can select
the most valuable professionals for their projects, and
data scientists can assess themselves against a
common reference framework.
The remainder of the paper is structured as
follows: Section 2 introduces the most important
concepts underlying our position paper. Section 3
describes the PdsP. Section 4 introduces an
illustrative example to describe the framework.
Finally, Section 5 provides conclusions and future
work.
2 STATE OF THE ART
In this section, we introduce the most relevant
concepts to better ground the basis of our proposal.
2.1 Required Skills for Data Scientists
Data scientists usually have a strong educational
background in Mathematics, Statistics, Computer
Science or Engineering. They can acutely understand
the business problems and needs of the industry they
are working in and fluently translate their technical
findings for a non-technical team, such as the
Marketing or Sales departments. Along with strong
technical skills in Analytics (mastering R or SAS),
data scientists should have skills in Computer Science
for big data management, experience with the Hadoop
platform, Pig or Hive, and the ability to write
complex SQL queries. Their goal is to arm the
business and its decision makers with quantified
insights for their decision-making process, which
requires the technical skills to tame, clean, and
analyse the data appropriately.
2.2 Dgmr Framework
This section briefly introduces dgmr, which is a
framework containing three main elements:
• A process reference model, describing the
processes related to data management (DM), data
quality management (DQM) and data governance
(DG). These processes are described following the
conventions of ISO 12207. See Table 1.
• A maturity model, in which the processes
previously described have been arranged in five
levels, according to what organizations should
perform in order to maintain the highest levels of
quality and availability for data. See Figure 1.
• An assessment methodology, which enables the
assessment of the level of organizational data
management maturity.
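To make the idea of a staged maturity model more concrete, the following is a minimal sketch of how a level could be derived from a set of achieved processes. It is not the dgmr assessment methodology: the process identifiers, their assignment to levels, the choice of level 1 as a baseline with no required processes, and the rule that a level is reached only when all processes of that level and of every lower level are achieved are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical arrangement of processes into levels 2..5. The identifiers are
# placeholders, NOT the processes of the dgmr process reference model.
LEVEL_PROCESSES: Dict[int, List[str]] = {
    2: ["process_a", "process_b"],
    3: ["process_c"],
    4: ["process_d"],
    5: ["process_e"],
}


def maturity_level(achieved: Set[str]) -> int:
    """Return the highest level whose processes, and those of all lower
    levels, are contained in `achieved` (assumed staged rule)."""
    level = 1  # assumed baseline: level 1 requires no processes
    for candidate in sorted(LEVEL_PROCESSES):
        required = LEVEL_PROCESSES[candidate]
        if all(process in achieved for process in required):
            level = candidate
        else:
            # A gap at this level blocks every higher level.
            break
    return level


if __name__ == "__main__":
    print(maturity_level({"process_a", "process_b", "process_c"}))
```

Under these assumptions, achieving a higher-level process without completing the lower levels does not raise the result, which mirrors the cumulative reading of a five-level arrangement.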
3 PdsP
The PdsP describes the concepts and processes that
any data scientist should learn and follow to do a
better job when analysing data. In this context, “a
better job” means not only getting more reliable
results but also more repeatable results in a more
productive way.
The design of PdsP is based on principles
analogous to those of PSP (Humphrey, 2000):
1. Every data scientist is different; to be most
effective, data scientists should be able to plan
their work and they should base their plan on their
own personal data.
2. In order to improve their performance, data
scientists should follow well-defined and
measured processes.
3. High-quality data analysis must be achieved by
highly motivated and responsible data scientists.