
3 DATA
The data used in this research span the 2017/2018
to 2023/2024 Premier League seasons. The dataset
covers 2660 matches, of which 1192 ended in a home
team victory, 602 in a draw, and 866 in an away team
victory. The detailed statistics of each match were
collected from two sources:
“https://fbref.com/en/” and “http://clubelo.com/”.
3.1 Features
For each match, the performance of the home and away
teams in their previous matches is compared. Each
team is described by a set of 18 individual features,
selected as those likely to have the strongest
influence on the match outcome. The feature vectors
of the two teams are then concatenated with the
result of their match and the venue, yielding a total
of 38 attributes describing their meeting.
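The concatenation described above can be illustrated with a minimal sketch. The function name `build_match_row`, the placeholder values, and the string encodings of the venue and result are assumptions made for illustration; only the counts (18 features per team, 38 attributes in total) come from the text.

```python
from typing import List, Sequence

N_TEAM_FEATURES = 18  # per-team feature count stated in the text


def build_match_row(home_features: Sequence[float],
                    away_features: Sequence[float],
                    venue: str,
                    result: str) -> List[object]:
    """Concatenate both teams' features with the venue and the result."""
    for name, feats in (("home", home_features), ("away", away_features)):
        if len(feats) != N_TEAM_FEATURES:
            raise ValueError(
                f"{name} team needs {N_TEAM_FEATURES} features, got {len(feats)}"
            )
    return [*home_features, *away_features, venue, result]


# Placeholder values only (hypothetical); real rows would hold match statistics.
row = build_match_row([0.0] * 18, [1.0] * 18, venue="stadium_A", result="H")
print(len(row))  # 18 + 18 + 2 = 38
```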
These features fall into three types. The first type
relates to the overall team form and includes the
team’s results in previous matches (win, loss, or
draw) as well as the location of each match (home or
away). The second type concerns detailed information
about the team’s performance during those matches:
the number of aerials won, clearances, corners,
crosses, fouls, goal kicks, interceptions, long balls,
offsides, passes and pass accuracy, possession, saves,
shooting accuracy, shots on target, tackles, and
throw-ins. The final type relates to the strength of
the team in a given match, for which the Elo rating
was used (Elo, 1961). The Elo rating, well known from
chess, assesses the relative strength of teams based
on their previous performance. Ratings are adjusted
after each match, depending on the match outcome and
the strength of the opponent.
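To make the adjustment mechanism concrete, the following is a minimal sketch of the textbook Elo update. In the study the ratings are taken from clubelo.com rather than computed, and that source may apply further adjustments, so the function names, the K-factor of 20, and the 400-point scale used here are illustrative assumptions and not the exact procedure behind the data.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of team A against team B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 20.0) -> Tuple[float, float]:
    """Return the updated ratings of A and B after one match.

    score_a is 1.0 for a win of A, 0.5 for a draw, 0.0 for a loss.
    k (assumed value) controls how strongly a single result moves the ratings.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated teams; the first one wins.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1510.0 1490.0
```

Beating a stronger opponent yields a larger gain than beating a weaker one, because the expected score is lower in the first case.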
3.2 Dataset
Two datasets were prepared to thoroughly investigate
the proposed approach. In the first dataset, the form
of both teams (home and away) in the five matches
preceding the considered match is taken into account.
The second dataset, however, focuses on the teams’
form at home and away specifically, considering their
last three respective matches.
The first dataset (MatchForm-5) takes into
consideration detailed statistics of both teams in the
previous five matches. For each of those matches, the
statistics of the considered team and its opponent are
recorded, including metrics such as goals scored,
shots, passes, possession, the location of the match,
the match outcome, and the strength of both teams at
the time of the game. This gives comprehensive
information about the team’s form leading up to the
match in question. In total, 38 features are available
for each game. To determine the form of a given team,
only matches played in the Premier League were taken
into account; matches from the Champions League, FA
Cup, or other competitions played by the teams during
that time were ignored. When the form of either team
could not be generated from its last five matches, the
match was excluded. The resulting dataset consists of
2307 matches from the seven seasons considered, with
1046 home team victories, 518 draws, and 743 away team
victories.
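The windowing and exclusion rule described above can be sketched as follows. The record layout (dictionaries with `season`, `home`, and `away` keys) and the function name `build_form_samples` are hypothetical. The text does not state whether form carries over between seasons; this sketch resets the history at each season, which is an assumption.

```python
from collections import defaultdict, deque
from typing import Dict, List


def build_form_samples(matches: List[Dict], window: int = 5) -> List[Dict]:
    """Pair each match with the last `window` league matches of both teams.

    `matches` must be league matches only, sorted chronologically.
    A match is skipped when either team has fewer than `window` earlier matches.
    """
    # One bounded history per (season, team); old matches drop out automatically.
    history = defaultdict(lambda: deque(maxlen=window))
    samples = []
    for match in matches:
        home_hist = history[(match["season"], match["home"])]
        away_hist = history[(match["season"], match["away"])]
        if len(home_hist) == window and len(away_hist) == window:
            samples.append({
                "match": match,
                "home_form": list(home_hist),
                "away_form": list(away_hist),
            })
        # Record the match only after the sample is built, so that a match
        # never contributes to its own form features.
        home_hist.append(match)
        away_hist.append(match)
    return samples


# Toy schedule (hypothetical): two teams meeting six times in one season.
toy = [{"season": "2017/2018", "home": "A", "away": "B", "id": i} for i in range(6)]
form_samples = build_form_samples(toy)
print(len(form_samples))  # 1: only the sixth match has five earlier matches
```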
The second dataset (HomeAwayForm-3) considers only
the team’s form at the relevant venue. It is very
common in football that a team’s playing style differs
between home and away matches. This is clearly visible
in the points earned by teams in home matches compared
with those gathered in away games. The distinction is
also evident in the detailed match statistics,
including the number of passes completed, shots taken,
and other performance indicators. In this dataset, the
form of the host team in its last three home matches
is taken into account, reflecting the potential
advantage of playing at home. Similarly, for the guest
team, the form in its last three away matches is
considered. In contrast to the first dataset, this one
does not include the statistics of opponents in the
historical matches. Matches for which data on the last
three games could not be collected for either team
were excluded. The resulting dataset includes 2228
matches with 20 features, of which 1011 matches ended
in a home team victory, 499 in a draw, and 718 in an
away team victory.
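A venue-specific variant of the form window can be sketched in the same spirit. The keys `home_stats` and `away_stats`, the function name, and the per-season reset are assumptions for illustration. The point shown is that the host draws only on its own earlier home matches, the guest only on its own earlier away matches, and opponent statistics are not stored.

```python
from collections import defaultdict, deque
from typing import Dict, List


def build_venue_form_samples(matches: List[Dict], window: int = 3) -> List[Dict]:
    """Pair each match with the host's last home games and the guest's last away games.

    `matches` must be sorted chronologically. Only a team's own statistics
    are kept in its history; the opponent's statistics are discarded.
    """
    home_history = defaultdict(lambda: deque(maxlen=window))
    away_history = defaultdict(lambda: deque(maxlen=window))
    samples = []
    for match in matches:
        host_hist = home_history[(match["season"], match["home"])]
        guest_hist = away_history[(match["season"], match["away"])]
        if len(host_hist) == window and len(guest_hist) == window:
            samples.append({
                "match_id": match["id"],
                "host_home_form": list(host_hist),
                "guest_away_form": list(guest_hist),
            })
        host_hist.append(match["home_stats"])
        guest_hist.append(match["away_stats"])
    return samples


# Toy schedule (hypothetical): A hosts B four times in one season.
venue_toy = [
    {"season": "2017/2018", "id": i, "home": "A", "away": "B",
     "home_stats": {"shots": 10 + i}, "away_stats": {"shots": 5 + i}}
    for i in range(4)
]
venue_samples = build_venue_form_samples(venue_toy)
print(len(venue_samples))  # 1: only the fourth match has three earlier games
```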
The literature commonly presents two popular
approaches for dividing data into train and test
datasets. In the first approach, there is a simple
division into two disjoint sets, with 80% of the
matches allocated to the train set and 20% to the test
set (80 20, Fig. 1). The second method of splitting
the data takes into account the seasons in which the
matches are played. In this case, the most recent
season is typically treated as the test data, while
the remaining seasons constitute the train set (test
is last). In our study, both approaches were applied.
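Both splitting schemes can be sketched in a few lines. The function names, the `season` key, and the fixed random seed are assumptions. The text does not say whether the 80/20 division is random or chronological, so the shuffle used here is one possible reading and not necessarily the one applied in the study.

```python
import random
from typing import Dict, List, Tuple


def split_80_20(matches: List[Dict], train_fraction: float = 0.8,
                seed: int = 0) -> Tuple[List[Dict], List[Dict]]:
    """Randomly divide the matches into two disjoint sets (assumed random split)."""
    shuffled = list(matches)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def split_last_season(matches: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Use the most recent season as the test set and all earlier ones for training.

    Season labels such as "2017/2018" sort correctly as plain strings.
    """
    last = max(m["season"] for m in matches)
    train = [m for m in matches if m["season"] != last]
    test = [m for m in matches if m["season"] == last]
    return train, test


# Toy data (hypothetical): six matches in one season, four in the next.
demo = ([{"season": "2022/2023", "id": i} for i in range(6)]
        + [{"season": "2023/2024", "id": i} for i in range(6, 10)])
train_a, test_a = split_80_20(demo)
train_b, test_b = split_last_season(demo)
print(len(train_a), len(test_a), len(train_b), len(test_b))  # 8 2 6 4
```

The season-based split keeps every test match later in time than every training match, which avoids training on results that come after the matches being predicted.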
4 PROPOSED APPROACH
First, the selected classifiers were trained on the pre-
viously prepared data to establish a baseline for the
Time Series and Deep Learning Approaches for Predicting English Premier League Match Outcomes