Realization of a Machine Learning Domain Speciﬁc Modeling Language:

A Baseball Analytics Case Study

Kaan Koseler, Kelsea McGraw and Matthew Stephan

Dept. of Computer Science and Software Engineering, Miami University, 510 East High Street, Oxford Ohio, U.S.A.

Keywords:

Machine Learning, Domain Speciﬁc Modeling Language, Baseball Analytics, Binary Classiﬁcation,

Supervised Learning, Model Driven Engineering.

Abstract:

Accompanying the Big Data (BD) paradigm is a resurgence in machine learning (ML). Using ML techniques

to work with BD is a complex task, requiring specialized knowledge of the problem space, domain speciﬁc

concepts, and appropriate ML approaches. However, specialists who possess that knowledge and program-

ming ability are difﬁcult to ﬁnd and expensive to train. Model-Driven Engineering (MDE) allows developers to

implement quality software through modeling using high-level domain speciﬁc concepts. In this research, we

attempt to ﬁll the gap between MDE and the industrial need for development of ML software by demonstrating

the plausibility of applying MDE to BD. Speciﬁcally, we apply MDE to the setting of the thriving industry of

professional baseball analytics. Our case study involves developing an MDE solution for the binary classiﬁca-

tion problem of predicting if a baseball pitch will be a fastball. We employ and reﬁne an existing, but untested,

ML Domain-Speciﬁc Modeling Language (DSML); devise model instances representing prediction features;

create a code generation scheme; and evaluate our solution. We show our MDE solution is comparable to the

one developed through traditional programming, distribute all our artifacts for public use and extension, and

discuss the impact of our work and lessons we learned.

1 INTRODUCTION

Using data to derive patterns is a critical driver of hu-

man knowledge and progress (Mattson, 2014). Due

to the global proliferation and ubiquity of informa-

tion processing devices, we have more data about

the world than ever before. “Big Data” is a popu-

lar term that captures the essence of this era, repre-

senting the idea of there being more data than human

beings can process without the assistance of com-

puters and algorithms. The individuals or organiza-

tions that can leverage this data to derive insights are

richly rewarded (Zimmerman, 2012). However, this

almost always requires the deployment of advanced

data processing and analysis techniques. One partic-

ularly popular technique is machine learning. A gen-

erally accepted deﬁnition of machine learning is ﬁnd-

ing patterns in data that “provide insight or enable fast

and accurate decision making” (Witten et al., 2016).

This usually takes the form of an output of predictions

on new examples.

The explosive growth of machine learning has in-

troduced several new problems and challenges for

industries. Due to the esoteric nature of machine

learning, it is difﬁcult to ﬁnd software engineers that

posses mastery of machine learning techniques and

domain knowledge, or data analysts with software

engineering abilities (DeLine, 2015). One possible

approach to address this industry problem is to uti-

lize Model-Driven Engineering (MDE), which is a

paradigm focused on formal abstractions to build a

“model” of a particular application or software ar-

tifact (Kent, 2002). These models abstract the un-

derlying source code to facilitate better design and

maintenance of software. Importantly, in the MDE

paradigm, this model is used throughout the software

engineering life cycle, from requirements to testing

and deployment, and can include automatic code gen-

eration through interpretation of that model. Ideally,

an engineer may progress through the software life

cycle without ever manipulating source code, which is

very low level and can be cumbersome to manage and

maintain. Additionally, with domain-speciﬁc model-

ing languages (DSML) (Kelly and Tolvanen, 2008),

engineers can create models using a syntax consistent

with terms and abstractions from their domain.

For this paper, we attempt to help address this

industrial need by applying MDE using a machine

Koseler, K., McGraw, K. and Stephan, M.

Realization of a Machine Learning Domain Speciﬁc Modeling Language: A Baseball Analytics Case Study.

DOI: 10.5220/0007245800130024

In Proceedings of the 7th International Conference on Model-Driven Engineering and Software Development (MODELSWARD 2019), pages 13-24

ISBN: 978-989-758-358-2

learning DSML to an industrial setting and data.

Speciﬁcally, we conduct a case study involving the

industry of professional baseball analytics. Baseball

is particularly suited for this type of analysis due to its

rigidly discrete nature and wealth of statistical infor-

mation over its long history. Machine learning tech-

niques are increasingly being incorporated by organi-

zations into the baseball industry (Sawchik, 2015). It

is a multi-billion dollar, competitive industry that has

wholeheartedly embraced data-driven analytics, and

thus presents an excellent case study for us to assess

the feasibility of applying MDE in a popular and prac-

tical industry. Additionally, the scale of the data pro-

vided by professional baseball is considerable. At the

highest levels of North American competition, teams

play 162 games per year that each generate thousands

of data points.

Previous work by Breuker deﬁned a syntax

for a DSML that abstracts machine learning tech-

niques (Breuker, 2014). However, it was never ap-

plied, implemented, nor validated. Our project, which

is based on our thesis research (Koseler, 2018), serves

as the ﬁrst-time realization of their syntax and lan-

guage in a popular and appropriate domain (Koseler

and Stephan, 2017b), including a deﬁnition of a code

generation scheme for a baseball binary classiﬁca-

tion problem. While our model instances are spe-

ciﬁc to our baseball case study, the DSML itself is not

speciﬁc to baseball. We have published our reﬁned

DSML, model instances, code generation scheme,

data cleaning scripts, and more artifacts on our repos-

itory to help enable future use of modeling in machine

learning applications and reproduction of our study

We set out to answer the following questions in our

case study,

1. Is it possible to encapsulate machine learning bi-

nary classiﬁcation concepts in a DSML and ac-

companying code generation scheme?

(a) What lessons did we learn in doing so?

2. Can we realize a complete application of machine

learning MDE in the context of a Big Data base-

ball use case that allows for various model in-

stances and code generation?

We begin in Section 2 by giving a brief overview

of machine learning, baseball analytics, and Breuker’s

work (Breuker, 2014). In Section 3, we describe our

method towards demonstrating the feasibility of MDE

in the machine learning domain. In Section 4, we

present and discuss the results of our method, partic-

ularly our prediction accuracy. We additionally ad-

dress our research questions, explore potential threats

https://sc.lib.miamioh.edu/handle/2374.MIA/6234

to validity, explain the challenges we faced along with

the corresponding lessons we learned, and address the

practical impact of our work and how it may be ex-

panded upon in the future. In Section 5, we present

related work and in Section 6 we present our conclu-

sion.

2 BACKGROUND

In Section 2.1, we introduce general concepts of su-

pervised machine learning and the binary classiﬁca-

tion problem, which are the focus of our case study. In

Section 2.2, we overview the ﬁeld of baseball analyt-

ics and its intersection with machine learning. In Sec-

tion 2.3, we present Breuker’s preliminary work on

devising a machine-learning DSML. We omit back-

ground on MDE and DSMLs, assuming that our read-

ers posses sufﬁcient knowledge in these areas.

2.1 Machine Learning

There is broad agreement that machine learning in-

volves automated pattern extraction from data (Kelle-

her et al., 2015). The patterns extracted from machine

learning techniques are often used by analysts to make

predictions. Thus, the most common type of ma-

chine learning is supervised machine learning, which

is more dominant than unsupervised learning and re-

inforcement learning (Kelleher et al., 2015). Our re-

search focuses on this type of learning, which Bishop

deﬁnes as consisting of problems that take in training

data example input vectors, x

, and their correspond-

ing target vectors, y

(Bishop, 2006). For example,

consider the case of predicting whether a certain stu-

dent will gain admittance into university. A natural

place to begin is an examination of past admission

cycles. We might take in input vectors of student at-

tributes like GPA, test scores, and admission status

from the past year. The crucial marker of a supervised

learning problem is the inclusion of past observations

and their target vectors.

2.1.1 Binary Classiﬁcation Problems

The most commonly encountered problem classes

are binary classiﬁcation, multiclass classiﬁcation, re-

gression, and novelty detection (Smola and Vish-

wanathan, 2008). Our research addresses the binary

classiﬁcation problem, which we describe herein.

Binary classiﬁcation is perhaps the most under-

stood problem in machine learning (Smola and Vish-

wanathan, 2008). Given a set of observations in a

domain X and their target values Y as training data,

MODELSWARD 2019 - 7th International Conference on Model-Driven Engineering and Software Development

determine the values Y on the test data, where Y is

a binary value that classiﬁes the observation. In gen-

eral, the values of Y are referred to as either positive

or negative. This can be modiﬁed to suit the needs

of the user. As an example, let us return to the prob-

lem of university admissions. A student who submits

an application to a university will either be admitted

or rejected. Although there may be other admission

categories, such as “waitlist”, for the purposes of this

example we assume that admission and rejection are

the only classes.

2.2 Baseball Analytics

Due to its wealth of data and discrete nature, baseball

readily lends itself to statistical analysis more than

any other sport. Many books have been written on

the subject, and in recent years baseball teams have

embraced data-driven and statistical analysis promi-

nently and ﬁnancially (Costa et al., 2012). Most of the

machine learning problem classes are applicable to

baseball, and there are a variety of examples of differ-

ent input variables and predicated variables (Koseler

and Stephan, 2018). Some of the predictions are made

in real time, for example, guessing the next pitch

or where to position the defensive ﬁelders (Baumer

and Zimbalist, 2013). Others are made away from

the game for example in personnel decisions (Lewis,

2004).

There are many forms of statistical analysis ap-

plied to baseball that do not relate to machine learn-

ing. Even simple statistics, such as batting aver-

age or a pitcher’s win-loss record, are often useful

in determining success of a player or team. Prior to

Bill James’s popularization of more complex analy-

sis in the 1980s, these simple metrics served as the

statistical foundation of baseball for decades (Lewis,

2004). One example of more complex analysis is Bill

James’s Pythagorean expectation (James, 1987). This

is still a relatively simple formula, but it goes beyond

the basic win-loss ratio to calculate the expected num-

ber of wins for a team given their runs scored and runs

allowed. The formula is as follows,

ExpectedWinRatio =

RunsScored

/(RunsScored

+ RunsAllowed

)

This Pythagorean expectation might be appropriate

for a sports website or for amateur fans and analysts,

as it is both simple to use and reasonably effective

in its predictive power. The machine learning anal-

yses such as the one we focus on in this paper and

the examples we present in the next section are more

appropriate for professional analysts and academics

interested in the ﬁeld, as they are more complex and

more powerful.

2.2.1 Machine Learning Applied to Baseball

Machine learning’s predictive power has led to its

use in baseball for both practical and research ap-

plications (Koseler and Stephan, 2017a). In gen-

eral, analyses that employ machine learning improve

with increasing numbers of observations (Russell and

Norvig, 2003). Due to baseball’s relatively large num-

ber of observations from 162 games per season with

30 teams, machine learning is a very viable candidate

for strong predictive power in baseball.

Our case study focuses on predicting whether a

pitcher’s next pitch will be a fastball or not. This

information is extremely valuable in the context of

baseball as it assists a hitter in devising an approach

and strategy at the plate. This is a binary classiﬁca-

tion problem in that there are two classes: fastball

and non-fastball. Previous researchers have demon-

strated excellent predictive improvements when using

machine learning for this exact problem. Ganeshapil-

lai and Guttag programmed a linear support vector

machine classiﬁer to classify pitches based on data

from the 2008 season and predict the pitches of the

2009 season (Ganeshapillai and Guttag, 2012). Sup-

port vector machines are a type of supervised learn-

ing that attempt to ﬁnd a “separating hyperplane that

leads to the maximum margin” (Kelleher et al., 2015),

which places classes on different sides of a hyper-

plane (line), with a large margin extending on either

side of the hyperplane. This is the type of learning

that has feedback associated with it, such as labeled

examples. In their experiments, the 2008 pitching

data was the feedback, which contained labeled ex-

amples of the type of pitch that was thrown by the

pitcher. The performance improvement witnessed by

their approach over a naive classiﬁer was approxi-

mately 18%. The naive classiﬁer can be thought of as

a simple Bayes classiﬁcation based on probability. In

other words, if a pitcher in 2008 used a fastball greater

than 50% of the time, the naive classiﬁer would pre-

dict that every pitch in 2009 would be a fastball. The

model the two researchers created was able to cor-

rectly predict the next pitch 70% of the time, whereas

the naive classiﬁer was able to correctly predict the

next pitch 52% of the time. This type of analysis us-

ing a support vector machine is one of the most widely

used methods for binary classiﬁcation.

2.3 Big Data DSML

Based on our research, the only DSML designed

to model machine learning was ﬁrst proposed by

Breuker in 2014 (Breuker, 2014). This DSML was

designed to bridge the gap between high demand in

Realization of a Machine Learning Domain Speciﬁc Modeling Language: A Baseball Analytics Case Study

industry for Big Data analysts and their lack of soft-

ware skills. Although designing and implementing

algorithms for Big Data analytics is difﬁcult and in-

volved, there are several tools to facilitate the task.

Breuker’s DSML represents an abstraction of a prob-

abilistic graphical model (PGM). A PGM is a repre-

sentation of a probabilistic model and is comprised

primarily of variables and their relationships. Breuker

deﬁnes requirements for PGMs to be represented as a

DSML to include 1) a “modeling language to express

distributions as graphs” and 2) an inference algorithm

that processes the graphs to “answer questions re-

garding conditional marginal distributions” (Breuker,

2014). Breuker’s DSML was built explicitly to work

with the Infer.NET

C# library. This library builds

a PGM by having users deﬁne variables and con-

necting them with factors. The software compiles

the inference model deﬁned by the user’s code and

runs the given inference algorithm. Infer.NET devel-

opers build the inference model through code rather

than MDE-like, graphical, modeling. This is one sig-

niﬁcant shortcoming of the Infer.NET library and all

other similar libraries.

Breuker’s DSML allows a user to build their in-

ference models graphically rather than through code.

We omit Breuker’s DSML original syntax metamodel

from this paper for the sake of brevity and because

we provide a representation of it when describing our

approach. Breuker provides no accompanying de-

ﬁned code generation scheme, but they do outline a

simple skeleton consisting of three methods within

a single C# class: GenerateModel, In f erPosteriors,

and MakePredictions (Breuker, 2014). We thus adapt

their concepts as is because they are tailored to work

with Infer.NET. By employing their DSML to con-

duct our case study, we are also validating the DSML

by using it in a full end-to-end realization.

3 METHOD

We present an overview of our method in Figure 1.

We ﬁrst form and reﬁne our metamodel in Papyrus.

We then create multiple model instances conforming

to that metamodel to showcase an ability to update

and incrementally create model instances. We then

derive and test our code generation engine. Our last

step involves us performing code execution and val-

idation experiments. We elaborate on these steps in

this section.

https://www.microsoft.com/en-us/research/project/in

feret/

3.1 Metamodel Formation

For our ﬁrst task, we created a representation of the

machine learning DSML in Papyrus (G

erard et al.,

2010). Papyrus is an open-source software tool for

supporting MDE. We chose to use Papyrus because

it was free, open-source, and allows users to deﬁne

and use their own DSMLs. We aimed to have the

Papyrus metamodel match the metamodel speciﬁed

by Breuker as closely as possible. We present our

representation in Figure 2. As Papyrus uses UML

on the back-end, the basic elements of our DSML

are extensions of the UML element Class. This re-

lationship is demonstrated through the ﬁlled triangle

arrow. The Node, Gate, GateOption, and Plate ele-

ments are the only elements that extend from Class.

The Variable and Factor elements are both special-

izations of Node. Observed Variable and Random

Variable are specializations of Variable. This special-

ization/generalization is represented by the triangle-

tipped arrow. Finally, the arrows with an open tip

represent different relationships in the Breuker meta-

model. These relationships are made explicit through

a relationship name which is labeled accordingly. For

instance, the relationship between a Plate and a Factor

indicates that a Plate selects a factor, and we labeled

this “selectsFactor.” Barring the Papyrus-required

and initial extension from the UML element “Class,”

this is in accordance with the metamodel deﬁned by

Breuker.

Upon completion of this step of the MDE pro-

cess, we enable modelers to build their own model in-

stances that conform to, and can be validated against,

the metamodel. Although we eventually create a code

generation scheme that is targeted at a binary clas-

siﬁcation use case only, our Papyrus metamodel will

help future users target different use cases and ensure

that their model instance conforms to the metamodel.

It is available for download on our public university

repository. This is the ﬁrst realization of the Breuker

metamodel. The crucial elements of this DSML are

the Observed Variables, Random Variables, and Fac-

tors. To model a binary classiﬁcation problem and

take advantage of our code generation engine, a mod-

eler implementing an instance model ﬁrst designates

the training data features as Observed Variables. They

then choose a Variable that will be predicted with test

data. This “predict” Variable should be connected to

the features by a Factor, which contains a Random

Variable representing a weight matrix. Any number of

features corresponding to Observed Variables can be

modeled, but only one Factor/Random Variable pair

may be used. In addition, there should be only one

Observed Variable to be predicted. Of course, model-

ers will have to have knowledge of the machine learn-

MODELSWARD 2019 - 7th International Conference on Model-Driven Engineering and Software Development

Figure 1: Overview of our Method.

Figure 2: Metamodel in Papyrus.

ing concepts abstracted in the meta model in order to

create instance models, however they will not need to

develop any code.

3.2 Creation of Different Model

Instances

To help validate our DSML metamodel and answer

our research question regarding various model in-

stances, we decided to create several iterations of

model instances conforming to the metamodel. Each

of these model instances must be capable of predict-

ing if the next pitch will be a fastball and will vary in

their different features for pitch classiﬁcation. That is,

the purpose of each instance model is to represent the

factors considered when deciding whether each ob-

servation (pitch) will be a fastball or not. For exam-

ple, Figure 3 presents a ﬁrst iteration model instance

that uses count (balls and strikes) as the sole predic-

tive factor for the next pitch. A reﬁned version of

this model instance we created, which we present in

Figure 4, uses the count as before but now includes

the pitcher’s Earned Run Average (ERA) as a factor

in predicting the pitch. For these two examples, we

use Papyrus’ built-in model validation tool to validate

these model instances, ﬁnding no errors or warnings,

indicating that they conform to our metamodel.

These basic model instances were designed to

represent a feasible real-world application of base-

ball analytics and showcase iterative model develop-

ment. The use of count and ERA as predictive fea-

tures is common when evaluating a pitcher’s antic-

ipated behavior. While these may be simple exam-

ples, they are appropriate for an amateur/novice an-

alyst. These examples also help us conﬁrm that we

have correctly implemented our metamodel through

Realization of a Machine Learning Domain Speciﬁc Modeling Language: A Baseball Analytics Case Study

Figure 3: First Iteration Model Instance with Count as Sole

Factor (Strikes and Balls).

Figure 4: Second Iteration Model Instance with Count and

Pitchers ERA as Factors.

Papyrus’ built in model-validation functionality. We

were able to validate these models manually, and their

successful validation through Papyrus indicates meta-

model correctness. Through these simple examples,

we demonstrate incrementally developing software

through small updates to the model instance.

The ﬁnal iteration of our model instance that

we built consisted of 18 factors adapted from the

work of Ganeshapillai and Guttag (Ganeshapillai and

Guttag, 2012). Due to our inability to obtain the

original data source, as we will discuss later, we

were limited in our ability to model the original

36 features. The factors we include are Inning,

Handedness, NumberOfPitches, Strikes, Balls, Outs,

BasesLoaded, HomePrior, CountPrior, BattingTeam-

Prior, BatterPrior, HomePriorSupport, CountPrior-

Support, BatTeamPriorSupport, BatterPriorSupport,

PreviousPitchType, PreviousPitchResult, and Pre-

viousPitchVelocity. We further deﬁne these fac-

tors/terms in the glossary of our thesis (Koseler,

2018). We present our full model instance in Fig-

ure 5. In the interest of readability, we do not display

the names of the relationships between the constructs

in the model instance and we condense the model ele-

ments. The relatively large number of predictive fea-

tures makes it difﬁcult to create an aesthetically pleas-

ing and readable model for us to present in this paper.

However, the model navigator in Papyrus allows for

easier interpretation, and, in either case, models are

better suited for readability than source code. Users

can download these models on our public repository

to view in Papyrus. In the next section, we discuss

the code generation engine we designed that automat-

ically generates code from these model instances.

3.3 Devising the Code Generation

Engine

In this step, we devised a code generation engine

that 1) parses models conforming to our DSML and

2) generates a C# ﬁle that can be used out-of-the-

box to make predictions on test data. We completed

this step from scratch as Breuker provided no code

nor examples, instead merely proposing a code lay-

out. Our goal was to have the resulting C# code use

the Infer.NET library to create a coded abstraction of

the user-deﬁned model. We wrote our code template

using Epsilon Coordination Language (EGX), EGL

(Epsilon Generation Language), and the Epsilon Ob-

ject Language (EOL) (Kolovos et al., 2006). Our code

generation template is comprised of eighty four lines

of code. This will automatically generate source code

totaling roughly eighty four lines of code that makes

calls to the Infer.NET library.

Although we represented the entire DSML as a

metamodel in Papyrus that can be used for construct-

ing models, the Factors and Observed & Random

Variables are the only ones that we accounted for in

our code generation engine. We ignored the other

constructs, such as Gate, GateOption, and Plate as

they were unnecessary for our case study to demon-

strate the plausibility of using MDE to create machine

learning software without writing code manually.

We have uploaded our code generation template to

our public university repository along with the other

artifacts we generated for this project. In the Epsilon

family of languages, text in the code template that

is enclosed in “[%” brackets is dynamic and variable

based on the user-deﬁned model instances. All other

MODELSWARD 2019 - 7th International Conference on Model-Driven Engineering and Software Development

Figure 5: Complete Model Instance with 18 Factors.

text in the template is static. Our ﬁrst step in our code

generator parsing is to have the engine collect the Ob-

served Variables and determine which of them is to

be predicted by the generated software. We then have

the engine create an array for each Observed Vari-

able containing its training values and insert it into

the code. This step also inserts the Boolean array of

training values for the Observed Variable that will be

predicted by the software. The generated code pro-

ceeds to initialize a weight matrix with random num-

bers from a Gaussian distribution, with the size of the

matrix corresponding to the number of features.

This training data is passed by the generated code

to a method that updates the weights based on the

training data using the Expectation Propagation algo-

rithm (Minka, 2001) to infer the posterior distribu-

tion of the weight matrix. In machine learning, this

is referred to as “training” the model. The gener-

ated code then creates arrays to hold the test data and

makes predictions on this data. The output consists

of a probability for each observation (pitch), indicat-

ing the likelihood that the given observation belongs

to the “true” class. Recall that this is due to the tar-

get training data consisting of a Boolean “true/false”

array. In our baseball analytics use case, the output

indicates the probability of the next pitch being a fast-

ball.

3.3.1 Quality of Generated Code

The quality of the generated code is important be-

cause the code generated in MDE techniques must be

correct and robust since users will interact with mod-

eling artifacts only, not code. To help assess the qual-

ity of the generated code, we used input space par-

titioning (Ammann and Offutt, 2016) to test the code

by devising several different inputs as training and test

data. For instance, the test case of using valid integers

for the “strikes” data that do not make sense in the

context of baseball, such as a value of 10 where there

can be only a value of 0,1, or 2 strikes. Overall, we

found the quality of the code to be fairly high, with

some important areas for improvement. In particu-

lar, a mismatched number of training observations for

each Observed Variable is allowed by the program.

If a user inputs 3 values for one Observed Variable

and 4 for another, the code will still execute when it

should throw an exception. The generated code can

still make predictions given nonsensical data, but the

value of these observations is limited. We plan on

addressing such concerns in future work by placing

limits on the possible values that can be entered in

Papyrus. Given that Papyrus allows for such limits on

entered values, our MDE method need not be limited

to the realm of baseball analytics. These limits can be

adjusted as appropriate depending upon the domain.

3.4 Code Execution and Validation

The actual data set we used consisted of the play-by-

play pitching statistics from the 2016 and 2017 MLB

seasons of the Cincinnati Reds, New York Yankees,

New York Mets, and Toronto Blue Jays. Speciﬁcally,

we considered all pitchers from these four teams in

2016 and used their 2017 pitch data to test our pre-

diction model. Even if a pitcher switched teams in

2017, we still considered their pitch data on that new

team. Similarly, we also removed pitchers from the

2016 data who retired or were free agents, and thus

did not pitch during the 2017 season. We retrieved

this data from Baseball Savant, which provides of-

ﬁcial Major League Baseball data available publicly

for free

. Through this interface, we were able to en-

ter manual queries that gave us play-by-play data for

these 4 teams/pitchers in 2016 and 2017. Each season

consisted of roughly 85,000 observations. We have

uploaded this training and test data to our repository

to allow for reproduction of our experiments. We used

the 2016 season data as training data and the 2017

season as the test data to evaluate our prediction ac-

curacy.

After building our ﬁnal model instance, we needed

a way to feed it our large data set of approxi-

mately 85,000 observations. Papyrus allows data en-

https://baseballsavant.mlb.com

Realization of a Machine Learning Domain Speciﬁc Modeling Language: A Baseball Analytics Case Study

try through its interface, but no systematic method of

inputting large strings of information. To feed data to

the model, we had to enter the observations for each

Observed Variable as a string in the format {x

, x

... , x

}, where n = NumberO f Observations. Thus,

we used Python and the Pandas library

to read our

data, which is in the form of a comma separated val-

ues (CSV) ﬁle.

To help validate our model and the process as a

whole we fed this formatted data to our complete

model instance in Papyrus, subsequently generating

an executable C# ﬁle. After running this C# ﬁle, we

obtained a list of output probabilities for each test

observation. We then imported these probabilities

into Python as a Pandas series and classiﬁed them as

“true” if the probability was greater than 50%, oth-

erwise they were classiﬁed as false. We then com-

pared it against our actual test data from 2017. We

further compared our predictions against the predic-

tions that would be made by a naive classiﬁer. The

naive classiﬁer classiﬁes the 2017 observations based

on the pitcher’s overall prior probability from 2016.

In other words, if the pitcher in 2016 threw fastballs

more than 50% of the time, the naive classiﬁer would

classify all pitches in 2017 as fastballs.

4 RESULTS AND DISCUSSION

In this section, we present the quantitative results of

using our DSML full-factored model instance and au-

tomatic code generation MDE solution to this base-

ball analytics binary classiﬁcation problem. This

mimics the process that analysts can follow now that

we have created meta models and code generation en-

gine. We followed the steps that an analyst would

take, that is, formatting our data, building the instance

model conforming to the meta model, feeding the data

into the instance model, and running our automati-

cally generated software. In doing so, our software

exhibited a prediction accuracy of 71.36%. That is,

our prediction model was able to successfully deter-

mine if the next pitch was a fastball or not 71.36%

of the time in the 2017 season. In contrast to a tradi-

tional programming based solution, this DSML sys-

tem helped the modeler in that Infer.NET calls were

invoked through an easily generated and modiﬁable

instance model conforming to a meta model, not re-

quiring a user to deal with source code at all.

While our prediction accuracy is higher than

Ganeshapillai and Guttag, we must note that our

model exhibited a smaller increase over the naive

https://pandas.pydata.org/

classiﬁer than their work. When using a naive clas-

siﬁer for our data set, there was a prediction accuracy

of 61.72%. Thus, we achieved about 15.6% improve-

ment against the naive classiﬁer. This is certainly a

positive result as we have demonstrated a signiﬁcant

improvement in accuracy over the naive classiﬁer.

However, Ganeshapillai and Guttag achieved roughly

18% improvement in prediction accuracy, which is

somewhat higher than ours. We consider this further

in our discussion, however, our goal was to validate

the plausibility of the MDE approach to building ma-

chine learning software. The key quantitative take-

away is that our automatically generated C# code ex-

ecuted and gave us an improvement in prediction ac-

curacy over a naive classiﬁer. Figure 6 illustrates the

prediction accuracy of our work versus that of Gane-

shapillai and Guttag.

Figure 6: Prediction Accuracy for our Work and Gane-

shapillai and Guttag’s.

4.1 Research Questions

At the beginning of this project, we set out to an-

swer two main research questions. Regarding our

ﬁrst question, we were able to successfully encapsu-

late machine learning binary classiﬁcation concepts

in a DSML and devise an accompanying code gen-

eration scheme. Although we demonstrated this with

a baseball analytics use case, our DSML metamodel

is general in that it will allow a user to enter what-

ever data they wish, as long as there are training and

test observations. We successfully built a code gen-

eration engine that parses a model to build machine

learning software and can support a number of fea-

tures and observations. For the second part of our ﬁrst

research question, we encountered several challenges

which we document in Section 4.3.

For our second research question, we realized

a complete application of MDE for machine learn-

ing that allows for model updates/reﬁnements and

code generation through a baseball analytics use case.

MODELSWARD 2019 - 7th International Conference on Model-Driven Engineering and Software Development

We created some iterations of model instances to

demonstrate model updates. We essentially built our

model instances in a step-by-step incremental fashion.

We facilitated generated code for a large and fully-

featured predictive model that can be used for mean-

ingful analysis. Through the use of MDE software

models, we automatically generated executable code,

which is the essence of MDE. In our solution, domain

experts need to have knowledge of machine learning

concepts only, rather than machine learning concepts

and source code development skills. We have essen-

tially raised the level of abstraction for analysts to de-

velop solutions.

4.2 Threats to Validity

Most prominently, our training and test data is differ-

ent than the set that Ganeshapillai and Guttag (Gane-

shapillai and Guttag, 2012) used for their research.

We used the 2016 MLB season as training data and

the 2017 season as test data, whereas they used the

2008 MLB season as training data and the 2009 sea-

son as test data. We reached out to both researchers

and the data owners, and they were unwilling to pro-

vide us access to that data. The data was avail-

able for purchase but at a prohibitively expensive

cost as the data owners had changed their business

model since Ganeshapillai and Guttag’s original ex-

periments. Since we were concerned mainly with the

plausibility of using MDE for this binary classiﬁca-

tion problem and exhibiting only comparable results

to the traditional coded approach, we decided hav-

ing their exact data was unnecessary and that real-life

baseball data, albeit from a different year, was accept-

able. However, it is still a threat to validity. It may

be the case that had we used their older data set, our

prediction accuracy increase would be quite different.

The game of baseball has changed signiﬁcantly since

the time of that data. Baseball teams have invested

heavily in statistical analysis departments and conse-

quently changed their approach to pitching and the

game in general.

We used a smaller set of factors than Ganeshapil-

lai and Guttag did in their work. As a result of be-

ing unable to obtain the original curated data set, we

had to resort to the limited data set provided by the

free Baseball Savant website. For example, we did

not have access to play-by-play statistics for each bat-

ter that was facing the pitcher. This inability to build

a deep batter proﬁle likely resulted in a loss in our

prediction accuracy. Although the batter was taken

into account, it was only through simple identiﬁcation

and pitcher-batter priors. Due to the necessary oner-

ous web querying required to gather data on the free

Baseball Savant website, we decided to use 4 MLB

teams as our baseline rather than the entire league,

resulting in 85,000 observations. We used the New

York Mets, New York Yankees, Toronto Blue Jays,

and Cincinnati Reds as these were teams of personal

interest to us. Undoubtedly, this is a limitation and

threat to validity to our prediction accuracy. Despite

this limitation, our work still shows the feasibility of

building machine learning software through the MDE

paradigm, and our prediction accuracy was very simi-

lar to that of Ganeshapillai and Guttag. Achieving this

without having to write any source code manually is

a signiﬁcant result in itself.

Another important threat to validity is our use of

a different algorithm for making predictions. Gane-

shapillai and Guttag used a Support Vector Machine

to classify their pitches. Because we were extend-

ing a DSML that is intended for use with the In-

fer.NET library, we were limited in the algorithms

that were available to us. In particular, Infer.NET

has no constructs that allow for a Support Vector Ma-

chine. Because the Infer.NET library is based on in-

ference learning for probabilistic graphical models,

the algorithms are likewise limited. We had to use a

Bayes Point Machine classiﬁer, which may have im-

pacted potential prediction accuracy. This represents

a limitation of our DSML.

Finally, our metamodel and code generation en-

gine were built to work with Papyrus. We chose Pa-

pyrus as it is free and open-source, can work on all

major operating systems (Linux, macOS, Windows),

and has growing support in both research and indus-

try. This is a threat to validity as we did not consider

other MDE DSML tools. To simplify future exper-

iments in Papyrus and other tools, we’ve included a

ready-to-import Papyrus project in our public reposi-

tory that can be used directly in Papyrus or converted

for other tools.

4.3 Lessons Learned and Challenges

Part of our goal in our industrial case study was to

help identify lessons that we learned and interesting

challenges.

Before beginning the research and modeling pro-

cess, we ﬁrst needed to gather real-world Major

League Baseball data, ideally the same data as that

used by Ganeshapillai and Guttag (Ganeshapillai and

Guttag, 2012). Upon contacting their data source,

STATS Inc., we were told that such data was no longer

publicly available and the quoted purchase price was

prohibitively expensive. As a result, we decided to

use Baseball Savant’s web interface. This website did

not have the 2008 nor 2009 data, thus we decided to

Realization of a Machine Learning Domain Speciﬁc Modeling Language: A Baseball Analytics Case Study

use the more recent 2016 and 2017 data. While this

data was sufﬁcient for our goals and research ques-

tions, this experience calls attention to one of the ma-

jor problems in the machine learning ﬁeld of ﬁnding

relevant and clean data. Future researchers should

keep this in mind as a high priority for their projects.

Another challenge we faced was the learning

curve in using EGX and EGL to write the code gener-

ation engine. In the process of learning how to use

these components of the Epsilon Object Language,

we had to make multiple forum posts on the Epsilon

forum. This is fair and to be expected for an open-

source language. Like our previous lesson, future re-

searchers who wish to use EGX and EGL should be

aware of the time investment required to sufﬁciently

comprehend the API. Searching and contributing fo-

rum posts on the ofﬁcial Epsilon forum may prove

useful, as it did for us.

The open-source nature of both Papyrus and the

Epsilon Object Language was probably the most

signiﬁcant challenge we encountered, albeit a tool-

related one. In addition to having to rely on the

open-source community, we found ourselves wanting

greater customization options for the Papyrus dash-

board. In particular, when a user uses our metamodel

to instantiate their own model instance, they must ﬁrst

create it as a UML class. They must then click on this

class and successively follow the Properties -> Pro-

ﬁle -> Applied Stereotypes menu chain before select-

ing their desired stereotype. These stereotypes cor-

respond to our DSML constructs of Observed Vari-

able, Random Variable, and Factor. Another obsta-

cle we faced in using Papyrus was the method re-

quired to feed data to the model. As we discussed in

Section 3.4, there is no systematic way to feed large

strings of information to the model, so it must be done

manually.

4.4 Potential Impact and Future Work

The target audience for this work is domain experts

who want to build machine learning software without

manually writing source code. This can include orga-

nizations that possess the requisite machine learning

and computer science expertise, but cannot spare the

time nor resources to write code in a traditional man-

ner. Although our work’s scope is limited in that it

applies to binary classiﬁcation problems only, it was

able to support a fair amount of features and training

observations.

An important impact of this work is that we pub-

lished our code generation engine, models, and other

artifacts on our public repository, and they are open-

source and free to use. This makes it a cost-effective

solution, or starting point, for individuals or organiza-

tions that are building binary classiﬁcation systems.

Even if individuals are not interested in our code gen-

eration engine, we believe that our metamodel rep-

resentation in Papyrus can serve as a valuable tool

for future practitioners building probabilistic graph-

ical models.

We anticipate this work will help serve as a step-

ping stone to future research by moving towards prov-

ing the viability of MDE in the machine learning do-

main. The industrial use case of baseball analytics

is an entirely relevant and economically viable one

to demonstrate the potential practical impact of this

work. While one can have doubts about the represen-

tatives of baseball analytics for industrial practition-

ers, future practitioners or academics can build off of

this project to further ﬂesh out the code generation

engine and make it robust for other machine learn-

ing problem classes, and optionally address the Gate,

GateOption, and Plate constructs that were unneces-

sary for our case study. Although we chose to demon-

strate the feasibility of MDE in the machine learning

domain through a baseball analytics case study, future

researchers can apply these same techniques to differ-

ent problems and data sources. We plan on doing so

to measure the generality of this work to other appli-

cations and problems, and to facilitate further under-

standing and use. The code generation we developed

is problem-agnostic; as long as the data is formatted

properly for a binary classiﬁcation problem, the gen-

erated software can derive meaningful predictions.

An interesting area of future work would be to

assess the degree of difﬁculty involved in using our

approach. For example, we could recruit volunteers

to help further support our claim that MDE is vi-

able in the machine learning domain. Additionally,

we could quantify how much effort a domain expert

would require to address a relevant domain data prob-

lem, and how much they enjoyed this MDE-based

method. Lastly, we could also compare it to some of

the approaches we describe in our Related Work sec-

tion using the same dataset to better explicate its ad-

vantages and beneﬁts over other methods. We deemed

this beyond the scope of the project at this time.

5 RELATED WORK

To our knowledge, while there is some work on ma-

chine learning domain speciﬁc languages (DSL) (Por-

tugal et al., 2016), there is little other related work

in creating a machine learning DSML (Zafar et al.,

2017). The paper by Breuker (Breuker, 2014) was

only exploratory in nature, and has not been cited

MODELSWARD 2019 - 7th International Conference on Model-Driven Engineering and Software Development

in anything other than survey papers until this time.

Ours appears to be the ﬁrst original work in realizing

this DSML and demonstrating its viability for build-

ing quality machine learning software.

There are several software packages that allow for

rapid application of machine learning algorithms on

sets of data. One example is the Orange package,

which is part of the Anaconda Python distribution.

Orange can be more thought of, and is advertised

as, a data mining suite (Dem

sar et al., 2013). The

crucial difference between our work and Orange is

that Orange does not adhere to nor support the MDE

approach to formal software engineering. Although

users interact with a visual interface and can connect

certain components like “Data” or “Analysis” to one

another, there is no model validation and no resultant

automatically generated software. Rather, the Orange

package allows one to apply a machine learning al-

gorithm to user data, assess prediction accuracy, and

build visualizations. There are other similar UI-based

editors for machine learning work ﬂows, but they do

not follow a strict model-driven approach.

WEKA is a similar package to Orange, which al-

lows users to perform data mining on their data sets

with a variety of different machine learning algo-

rithms (Hall et al., 2009). WEKA bears even less

of a resemblance to the MDE paradigm than Orange.

There is no visual connecting of components like

there is with Orange. Consequently, WEKA does not

allow a user to deﬁne a model for what their program

should look like.

TensorFlow is a machine learning framework de-

veloped by Google Brain that is used in training neu-

ral networks (Abadi et al., 2016). The idea behind

TensorFlow is very similar to that of Infer.NET, in that

the user deﬁnes models of behavior using code. While

Infer.NET is used for deﬁning probabilistic graphical

models, TensorFlow is used primarily to deﬁne neu-

ral network architectures. Although TensorFlow does

not deﬁne itself as an MDE language in any tradi-

tional sense, one can argue a neural network architec-

ture deﬁned in TensorFlow is analogous to a software

engineering model. A signiﬁcant difference in their

work and ours is TensorFlow does not allow the user

to deﬁne/create neural network architectures in a vi-

sual manner. The Tensorboard feature allows the user

to view existing models that have already been built

through user-deﬁned code. Some potential research

we may consider includes developing some type of

visual modeling layer on top of TensorFlow, very sim-

ilar to what we have accomplished with this work, al-

lowing the user to deﬁne TensorFlow models without

writing any code by hand.

6 CONCLUSION

In this paper, we considered the plausibility of using

Model-Driven Engineering to build machine learning

software. Much like Breuker (Breuker, 2014), we

were motivated by the industrial problem of having

a high demand for machine learning expertise and

a shortage of individuals with the technical knowl-

edge required to build such quality software systems.

In particular, there are many domain experts in vari-

ous ﬁelds who could beneﬁt from such software but

have neither the expertise nor resources required. We

hoped that MDE could help address this shortfall,

and thus looked for other research that had been per-

formed in this domain. Breuker’s incomplete pro-

posal for a machine learning DSML was the only one

we found. Although incomplete, the metamodel they

proposed was a direct mapping of the constructs in the

Infer.NET library, which are themselves constructs

taken from probabilistic graphical models.

We deﬁned and reﬁned Breuker’s metamodel in

Papyrus, an open-source package that allows users

to describe their own modeling languages and create

model instances conforming to that language. Thus,

our work allows a user to build model instances in

Papyrus and have the native model validation tool en-

sure that their model instances are in accordance with

the metamodel. To conﬁrm this, we built several it-

erations of model instances of the baseball analyt-

ics problem of classifying pitches as fastball or non-

fastball based on past data. Our early iteration model

instances allowed us to determine the ease of quickly

and incrementally building and deﬁning our software

through models. Further, we built a code genera-

tion engine using EGX, EGL, and the Epsilon Object

Language that allows users to generate code based

on their model instances. Our case study took 2016

MLB data from pitchers for four teams as training

data and 2017 data from those same pitchers as test

data. Building a model that emulated many of the fea-

tures from work by Ganeshapillai and Guttag (Gane-

shapillai and Guttag, 2012), we successfully achieved

a comparable increase in prediction accuracy over a

naive classiﬁer. Most importantly, we did this through

automatically generated code that we used to make

predictions. Our application of this MDE solution to

an industrial machine learning context provides evi-

dence of a complete application of machine learning

MDE, and helps prove the viability of MDE in the

machine learning domain. We believe the practical

impact of this work includes helping facilitate future

MDE machine learning practice and research through

extension of, and drawing inspiration from, our de-

scribed process and the artifacts we published on our

repository.

Realization of a Machine Learning Domain Speciﬁc Modeling Language: A Baseball Analytics Case Study

ACKNOWLEDGEMENTS

This work is supported by the Miami University Sen-

ate Committee on Faculty Research (CFR) Faculty

Research Grants Program.

REFERENCES

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean,

J., Devin, M., Ghemawat, S., Irving, G., Isard, M.,

Kudlur, M., Levenberg, J., Monga, R., Moore, S.,

Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V.,

Warden, P., Wicke, M., Yu, Y., and Zheng, X. (2016).

Tensorﬂow: A system for large-scale machine learn-

ing. In Operating Systems Design and Implementa-

tion, OSDI’16, pages 265–283, Berkeley, CA, USA.

Ammann, P. and Offutt, J. (2016). Introduction to software

testing. Cambridge University Press.

Baumer, B. and Zimbalist, A. (2013). The sabermetric rev-

olution: Assessing the growth of analytics in baseball.

University of Pennsylvania Press.

Bishop, C. M. (2006). Pattern Recognition and Machine

Learning. Springer.

Breuker, D. (2014). Towards model-driven engineering for

big data analytics–an exploratory analysis of domain-

speciﬁc languages for machine learning. In Hawaii

International Conference on System Sciences, pages

758–767. IEEE.

Costa, G. B., Huber, M. R., and Saccoman, J. T. (2012).

Reasoning with Sabermetrics: Applying Statistical

Science to Baseball’s Tough Questions. McFarland.

DeLine, R. (2015). Research opportunities for the big data

era of software engineering. In International Work-

shop on Big Data Software Engineering, pages 26–29.

Dem

sar, J., Curk, T., Erjavec, A.,

Crt Gorup, Ho

cevar, T.,

Milutinovi

c, M., Mo

zina, M., Polajnar, M., Toplak,

M., Stari

c, A.,

Stajdohar, M., Umek, L.,

Zagar, L.,

Zbontar, J.,

Zitnik, M., and Zupan, B. (2013). Orange:

Data mining toolbox in python. Journal of Machine

Learning Research, 14:2349–2353.

Ganeshapillai, G. and Guttag, J. (2012). Predicting the next

pitch. In Sloan Sports Analytics Conference.

erard, S., Dumoulin, C., Tessier, P., and Selic, B. (2010).

19 Papyrus: A UML2 tool for domain-speciﬁc lan-

guage modeling. In Model-Based Engineering of Em-

bedded Real-Time Systems, pages 361–368. Springer.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reute-

mann, P., and Witten, I. H. (2009). The WEKA data

mining software: an update. SIGKDD Explorations,

11(1):10–18.

James, B. (1987). The Bill James Baseball Abstract 1987.

Ballantine Books.

Kelleher, J. D., Mac Namee, B., and D’Arcy, A. (2015).

Fundamentals of machine learning for predictive data

analytics: algorithms, worked examples, and case

studies. MIT Press.

Kelly, S. and Tolvanen, J.-P. (2008). Domain-speciﬁc mod-

eling: enabling full code generation. John Wiley &

Sons.

Kent, S. (2002). Model driven engineering. In Inter-

national Conference on Integrated Formal Methods,

pages 286–298. Springer.

Kolovos, D. S., Paige, R. F., and Polack, F. A. (2006). The

epsilon object language (eol). In European Confer-

ence on Model Driven Architecture-Foundations and

Applications, pages 128–142. Springer.

Koseler, K. (2018). Realization of Model-Driven En-

gineering for Big Data: A Baseball Analytics Use

Case. Master’s thesis, Miami University, Oxford,

Ohio, USA.

Koseler, K. and Stephan, M. (2017a). Machine learning ap-

plications in baseball: A systematic literature review.

Applied Artiﬁcial Intelligence, 31(9-10):745–763.

Koseler, K. and Stephan, M. (2017b). Towards the real-

ization of a DSML for machine learning: A baseball

analytics use case. In International Summer School

on Domain-Speciﬁc Modeling Theory and Practice.

https://sc.lib.miamioh.edu/handle/2374.MIA/6224.

Koseler, K. and Stephan, M. (2018). A survey of base-

ball machine learning: A technical report. Techni-

cal Report MU-CEC-CSE-2018-001, Department of

Computer Science and Software Engineering, Mi-

ami University. https://sc.lib.miamioh.edu/handle/

2374.MIA/6218.

Lewis, M. (2004). Moneyball: The art of winning an unfair

game. WW Norton & Company.

Mattson, M. P. (2014). Superior pattern processing is the

essence of the evolved human brain. Frontiers in neu-

roscience, 8.

Minka, T. P. (2001). Expectation propagation for approxi-

mate bayesian inference. In Uncertainty in Artiﬁcial

Intelligence, pages 362–369, San Francisco, USA.

Portugal, I., Alencar, P., and Cowan, D. (2016). A pre-

liminary survey on domain-speciﬁc languages for ma-

chine learning in big data. In SWSTE, pages 108–110.

Russell, S. J. and Norvig, P. (2003). Artiﬁcial Intelligence:

A Modern Approach. Pearson Education, 2 edition.

Sawchik, T. (2015). Big Data Baseball: Math, Miracles,

and the End of a 20-Year Losing Streak. Macmillan.

Smola, A. and Vishwanathan, S. (2008). Introduction to

Machine Learning. Cambridge University Press.

Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. (2016).

Data Mining: Practical machine learning tools and

techniques. Morgan Kaufmann.

Zafar, M. N., Azam, F., Rehman, S., and Anwar, M. W.

(2017). A systematic review of big data analytics us-

ing model driven engineering. In International Con-

ference on Cloud and Big Data Computing, pages 1–

Zimmerman, A. (2012). Can retailers halt ’showrooming’.

The Wall Street Journal, 259:B1–B8.

MODELSWARD 2019 - 7th International Conference on Model-Driven Engineering and Software Development