using a surrogate-based model. Furthermore, an agent
able to predict the winner accurately could adapt its
strategy to change the outcome of the match.
This cannot be achieved without a good set of
features. Training a classifier is easy, but it does not
help if the data are of poor quality. The data and
features presented here could be reused in other works
based on StarCraft data to try to improve their results.
2 STATE OF THE ART
Many approaches have been presented in StarCraft
research. The most common one is to develop
probabilistic graphical models to predict the winner
of a match. Some examples are (Synnaeve and
Bessière, 2011) and (Stanescu et al., 2013), where
important events in the match are used to predict the
outcome: the appearance of a very important building,
an important succession of events for a race, the
creation of the best unit of a race, etc.
Another approach, based on supervised learning, is
presented in (Sánchez-Ruiz, 2015), but the environment
is homogeneous and controlled, so it may not reflect
the diversity of StarCraft matches. A better dataset
is presented in (Robertson and Watson, 2014), which is
very heterogeneous, complete and fine-grained.
Other works search for plans and strategies based
on predictions of the outcome of matches, as we can
see in (Oen, 2012) and in (Alburg et al., 2014).
Another approach is to develop strategies using
Genetic Programming, automatically creating plans
that can win. This kind of algorithm is very
time-consuming, so any time saved would be
appreciated. This approach gives good results, as we
can see in (Fernández-Ares et al., 2016) and
(García-Sánchez et al., 2015).
3 METHODOLOGY
In this paper we carry out a complete KDD process
using SQL and some Apache tools: Spark with its Scala
interface and MLlib. We made this choice because the
Apache ecosystem is suitable for dealing with very
large datasets, offering a framework in which projects
look much the same in centralized and distributed
environments.
The Scala interface was chosen because it is the
most complete one for Spark and MLlib. The only
thing missing is an implementation of KNN,
so saurfang:spark-knn from spark-packages is
used. Furthermore, Scala is a modern, functional
and object-oriented language which is widely used
in companies such as LinkedIn, Twitter or Siemens.
One of its advantages is that Scala compiles to the
Java Virtual Machine (JVM). As a consequence, the
resulting code is multiplatform. This code is
available at GitHub, https://git.io/vdmyj.
3.1 Feature Selection
The data we use are taken from (Robertson and
Watson, 2014), who offer six relational databases of
one-versus-one matches, covering all the possible
combinations of races that the game offers.
In Figure 1 we can see the entity-relationship
diagram of the databases that contain the matches.
Understanding all the features was easy because the
work of (Robertson and Watson, 2014) is totally open,
so we could explore the associated code. Furthermore,
many features have the same name as attributes from
the BWAPI, although some features were calculated by
the researchers, such as the distance to the base at
a given moment of the match.
To obtain a dataset of rows and columns, we propose
the following structure. Each row of the dataset is a
precise instant of the match, determined by a Frame.
Each instant holds the resource information of each
player. This approach differs from the other ones
presented in Section 1. It seems easy, but organising
the data was not trivial.
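
The row structure described above can be sketched as a small SQL join. This is only an illustrative sketch under stated assumptions, not the pipeline used in this work (which runs on Spark with Scala): it uses Python's standard sqlite3 module, the table and column names follow the feature list in this section, and everything else is hypothetical, namely the toy values, the output aliases (MineralsA, MineralsB, and so on), the encoding of Winner as a 0/1 flag, and the simplification that both players have a resource record at the same Frame.

```python
import sqlite3


def make_toy_db():
    """Build an in-memory database with hypothetical toy values."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE replay (ReplayID INTEGER PRIMARY KEY);
        CREATE TABLE playerreplay (
            PlayerReplayID INTEGER PRIMARY KEY,
            ReplayID INTEGER,
            Winner INTEGER          -- assumption: 1 = won, 0 = lost
        );
        CREATE TABLE resourcechange (
            PlayerReplayID INTEGER,
            Frame INTEGER,
            Minerals INTEGER,
            Gas INTEGER,
            Supply INTEGER,
            TotalMinerals INTEGER
        );
    """)
    conn.execute("INSERT INTO replay VALUES (1)")
    conn.executemany(
        "INSERT INTO playerreplay VALUES (?, ?, ?)",
        [(10, 1, 1), (11, 1, 0)],
    )
    conn.executemany(
        "INSERT INTO resourcechange VALUES (?, ?, ?, ?, ?, ?)",
        [
            (10, 100, 50, 0, 8, 50),
            (11, 100, 60, 0, 8, 60),
            (10, 200, 120, 16, 10, 220),
            (11, 200, 90, 8, 9, 150),
        ],
    )
    return conn


def build_rows(conn):
    """Return one row per (ReplayID, Frame) with both players side by side.

    Player A is the one with the smaller PlayerReplayID in each match.
    Frames where only one player has a record are dropped by the inner
    join (a simplification of this sketch).
    """
    query = """
        SELECT pa.ReplayID, ra.Frame,
               ra.Minerals AS MineralsA, ra.Gas AS GasA, ra.Supply AS SupplyA,
               rb.Minerals AS MineralsB, rb.Gas AS GasB, rb.Supply AS SupplyB,
               pa.Winner AS WinnerA
        FROM playerreplay pa
        JOIN playerreplay pb
          ON pb.ReplayID = pa.ReplayID
         AND pa.PlayerReplayID < pb.PlayerReplayID
        JOIN resourcechange ra ON ra.PlayerReplayID = pa.PlayerReplayID
        JOIN resourcechange rb ON rb.PlayerReplayID = pb.PlayerReplayID
                              AND rb.Frame = ra.Frame
        ORDER BY pa.ReplayID, ra.Frame
    """
    return conn.execute(query).fetchall()


conn = make_toy_db()
rows = build_rows(conn)
for row in rows:
    print(row)
```

With real replays the two players rarely change resources at exactly the same Frame, so a practical version would need some way of aligning their records (for instance, carrying the last known value forward); that step is left out here.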
We present here the list of selected features, also
shown in Figure 1. Some of them, namely the
identifiers of replay, player and region, are used
only to organise the data. Features used for
modelling are shown in bold.
• replay: Contains data about each match.
– ReplayID: Match identifier.
• playerreplay: Contains data about a player in a
match.
– PlayerReplayID: Player identifier.
– ReplayID: Match identifier.
– Winner: Winner of the match. This is the target
we want to predict.
• resourcechange: Contains data associated with
changes in a player’s resources.
– PlayerReplayID: Player identifier.
– Frame: Frame when the resource changes.
– Minerals: Amount of minerals of a player.
– Gas: Amount of gas of a player.
– Supply: Carrying capacity of a player.
– TotalMinerals: Total amount of minerals of a
player, without costs.