For example, Ganeshapillai and Guttag (Ganeshapil-
lai and Guttag, 2012) show that pitchers are much
more predictable in counts that favor the batter (usu-
ally more balls than strikes). Furthermore, Hopkins
and Magel (Hopkins and Magel, 2008) show a distinct
effect of count on the slugging percentage of the batter
(a weighted measure of the on-base frequency of a batter).
More specifically, they show that average slugging
percentage is significantly lower in counts that favor the
pitcher; however, there is no significant difference in
average slugging percentage between neutral counts and
counts that favor the batter (Hopkins and Magel, 2008).
These results verify that count has
a significant effect on the pitcher-batter relationship,
and is thus an important factor in pitch prediction.
Pitch prediction is itself a topic of interest: it could
have significant real-world applications and potentially
improve batter performance in baseball.
One example of research on this topic is the work
by Ganeshapillai and Guttag (Ganeshapillai and Gut-
tag, 2012), who use a linear support vector machine
(SVM) to perform binary (fastball vs. nonfastball)
classification on pitches of unknown type. The SVM
is trained on PITCH f/x data from pitches in 2008 and
tested on data from 2009. Across all pitchers, an aver-
age prediction accuracy of roughly 70 percent is ob-
tained, though pitcher-specific accuracies vary.
In this paper we provide a machine learning approach
to pitch prediction, using classification methods to
predict pitch types. Our results build upon the work in
(Ganeshapillai and Guttag, 2012); however, we are able
to improve performance by examining different types of
classification methods and by taking a pitcher-adaptive
approach to feature set selection. For more information
about baseball itself, consult the Appendix for a glossary
of baseball terms.
2 METHODS
2.1 PITCH f/x Data
Our classifiers are trained and tested using PITCH f/x
data from all MLB games during the 2008 and 2009
seasons. Raw data is publicly available (Pitchf/x,
2013), though we use scraping methods to transform
the data into a suitable format. The data contains ap-
proximately 50 features (each represents some char-
acteristic of a pitch like speed or position); however,
we only use 18 features from the raw data and create
additional features that are relevant to prediction.
Examples of created features include the percentage of
fastballs thrown in the previous inning, the velocity of
the previous pitch, the strike result percentage of the
previous pitch, and the current game count (score). For a
full list of features used, see the Appendix.
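As a rough illustration of how two of the created features described above could be derived, the following sketch computes the velocity of the previous pitch and the fastball percentage of the previous inning. The record fields (`inning`, `pitch_type`, `speed`), the output field names, and the set of pitch codes treated as fastballs are assumptions made for this example; they are not taken from the feature list in the Appendix.

```python
from collections import defaultdict

# Assumption: which pitch codes count as "fastball" is chosen for illustration.
FASTBALL_TYPES = {"FF", "FT", "FC", "SI"}


def add_created_features(pitches):
    """Return copies of the pitch records with two pre-pitch features added.

    `pitches` is a chronologically ordered list of dicts for one pitcher in
    one game. The field names used here are hypothetical.
    """
    # First pass: count pitches and fastballs thrown in each inning.
    thrown = defaultdict(int)
    fastballs = defaultdict(int)
    for p in pitches:
        thrown[p["inning"]] += 1
        fastballs[p["inning"]] += p["pitch_type"] in FASTBALL_TYPES

    # Second pass: attach features that are known before each pitch is thrown.
    enriched = []
    prev_speed = None  # the first pitch has no previous pitch
    for p in pitches:
        prev_inning = p["inning"] - 1
        if thrown.get(prev_inning):
            prev_fb_pct = fastballs[prev_inning] / thrown[prev_inning]
        else:
            prev_fb_pct = None  # no pitches recorded in the previous inning
        enriched.append({
            **p,
            "prev_pitch_speed": prev_speed,
            "prev_inning_fastball_pct": prev_fb_pct,
        })
        prev_speed = p["speed"]
    return enriched
```

Both features depend only on pitches thrown earlier, so they remain available before the pitch being predicted.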
We apply classification methods to the data to pre-
dict pitches. On that note, it is important to clarify
a subtle distinction between pitch classification and
pitch prediction. The distinction is simply that clas-
sification uses post-pitch information about a pitch to
determine which type it is, whereas prediction uses
pre-pitch information to classify its type. For example,
in classification we may use features like pitch speed and
curve angle to determine whether or not a pitch was a
fastball. These features are not available pre-pitch; for
prediction, we instead use information about prior results
from the same scenario to judge which pitch can be expected.
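The idea of judging the expected pitch from prior results in the same scenario can be sketched with a simple frequency baseline. This is only an illustrative stand-in, not one of the classifiers used in this paper; the function names are hypothetical, and a scenario is assumed here to be a (balls, strikes) tuple.

```python
from collections import Counter, defaultdict


def fit_scenario_baseline(history):
    """Count past pitch classes per scenario.

    `history` is an iterable of (scenario, label) pairs, where the scenario
    contains only information known before the pitch is thrown.
    """
    counts = defaultdict(Counter)
    for scenario, label in history:
        counts[scenario][label] += 1
    return counts


def predict_from_scenario(counts, scenario, default="fastball"):
    """Predict the class thrown most often in this scenario in the past."""
    if scenario not in counts:
        return default  # assumption: fall back to a fixed class when unseen
    return counts[scenario].most_common(1)[0][0]
```

Unlike pitch classification, nothing measured during the pitch itself (speed, curve angle) enters the prediction.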
The prediction process is performed as binary
classification (see section 2.2); all pitch types are
members of one of two classes (fastball and nonfast-
ball). We conduct prediction for all pitchers who had
at least 750 pitches in both 2008 and 2009. This
specification results in a set of 236 pitchers. For each
pitcher, the data is further split by count; with 12
possible counts, this produces 2,832 smaller subsets of
data (one for each pitcher and count combination). After
performing feature selection (see section 2.3) on each data
subset, each classifier (see Appendix) is trained on each
subset of data from 2008
and tested on the corresponding subset of data from 2009. The
average classification accuracy for each classifier is
computed for test points with a type confidence (one
feature in the PITCH f/x data that measures the con-
fidence level that the pitch type is correct) of at least
0.5.
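The data partitioning and evaluation protocol described above can be sketched as follows. The record fields (`pitcher`, `balls`, `strikes`, `type_confidence`, `label`) and the function names are hypothetical; the sketch assumes the 12 counts are the combinations of 0 to 3 balls with 0 to 2 strikes, which gives 236 x 12 = 2,832 pitcher-count subsets.

```python
from collections import Counter, defaultdict
from itertools import product

# The 12 possible counts: 0-3 balls combined with 0-2 strikes.
COUNTS = list(product(range(4), range(3)))


def eligible_pitchers(pitches_2008, pitches_2009, min_pitches=750):
    """Pitchers with at least `min_pitches` pitches in both seasons."""
    n08 = Counter(p["pitcher"] for p in pitches_2008)
    n09 = Counter(p["pitcher"] for p in pitches_2009)
    return {name for name in n08
            if n08[name] >= min_pitches and n09[name] >= min_pitches}


def split_by_pitcher_and_count(pitches):
    """Group pitches into one subset per (pitcher, count) combination."""
    subsets = defaultdict(list)
    for p in pitches:
        subsets[(p["pitcher"], (p["balls"], p["strikes"]))].append(p)
    return subsets


def accuracy(test_pitches, predict, min_confidence=0.5):
    """Accuracy over test pitches with type confidence >= min_confidence."""
    kept = [p for p in test_pitches if p["type_confidence"] >= min_confidence]
    if not kept:
        return None  # no test pitch in this subset is labeled reliably enough
    correct = sum(predict(p) == p["label"] for p in kept)
    return correct / len(kept)
```

In this sketch, a classifier trained on the 2008 subset for a given key would be passed as `predict` and scored on the 2009 subset with the same key.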
2.2 Classification Methods
Classification is the process of taking an unlabeled
data observation and using some rule or decision-
making process to assign a label to it. Within the
scope of this research, classification represents
determining the type of a pitch, i.e., given a pitch x with
characteristics x_i, determine which pitch type (curveball,
fastball, slider, etc.) x is. There are several
classification methodologies one can use to accomplish
this task; here we use Support Vector Machines (SVM) and
k-nearest neighbors (k-NN), which are explained in full
detail in (Theodoridis and Koutroumbas, 2009).
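To make the k-NN rule concrete, a minimal sketch is given below. It assumes numeric feature vectors, Euclidean distance, and an unweighted majority vote; the function name and the choice of k are illustrative and do not reflect the configuration used in our experiments.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    # Sort the training points by Euclidean distance to x, nearest first.
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: math.dist(pair[0], x))
    # Vote among the labels of the k closest points.
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]
```

An SVM instead learns a separating boundary between the two classes during training, so it does not need to store and search the training points at prediction time.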
2.3 Feature Selection Methods
The key difference between our approach and former
research (Ganeshapillai and Guttag, 2012) is the feature
selection methodology. Rather than using a static
set of features, a different optimal set of features is