both these technologies together. Reinforcement learning combined with neural networks has been applied successfully to different control problems, such as gas turbine control (Schafer, 2008) and a motor-control task (Coulom, 2002).
This research describes the development of a poker agent using RL with an ANN.
4 STATE OF THE ART
Texas Hold’em is one of the most popular forms of poker. It is also a very complex game. Several factors, such as concealed cards and bluffing, make poker an uncertain environment. These characteristics make poker a partially observable Markov decision process (POMDP) with no ready-made solution. Reinforcement learning, combined with a neural network for value function approximation, provides a way to build an agent for such an uncertain environment. This paper gives a brief description of the background needed to develop a poker agent:
Poker game rules;
Definition of the partially observable Markov decision process;
Neural network theory;
Reinforcement learning theory.
4.1 Poker Game
Poker is a game of imperfect information in which players have only partial knowledge about the current state of the game (Johanson, 2007). Poker involves betting and individual play, and the winner is determined by the rank and combination of cards. Poker has many variations; for the experiments and data analyses the author uses the Texas hold’em version of the game. In Texas hold’em each player is dealt two cards, and five community cards are placed on the table. Texas hold’em is an extremely complicated form of poker, because the exact manner in which a hand should be played is often debatable. It is not uncommon to hear two expert players argue the pros and cons of a certain strategy (Sklansky and Malmuth, 1999).
The poker game consists of four phases: pre-flop, flop, turn and river. In the first phase (pre-flop) two cards are dealt to every player. In the second phase (flop) three community cards are shown. In the next phase (turn) a fourth card is shown, and finally in the last phase (river) the fifth community card is shown and the winner is determined. The winner is the player with the strongest five-card combination. The possible combinations are (starting from the highest rank): Royal flush, Straight flush, Four of a kind, Full house, Flush, Straight, Three of a kind, Two pair, One pair, High card.
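As an illustration only (none of the identifiers below appear in the paper), the phase ordering and hand-rank ordering above can be encoded as simple enumerations, which is convenient when the game state is later presented to a learning agent:

```python
from enum import IntEnum

class Phase(IntEnum):
    """Betting phases of a Texas hold'em hand, in order of play."""
    PRE_FLOP = 0   # two hole cards dealt to every player
    FLOP = 1       # three community cards revealed
    TURN = 2       # fourth community card revealed
    RIVER = 3      # fifth community card revealed, then showdown

class HandRank(IntEnum):
    """Hand categories ordered from weakest to strongest."""
    HIGH_CARD = 1
    ONE_PAIR = 2
    TWO_PAIR = 3
    THREE_OF_A_KIND = 4
    STRAIGHT = 5
    FLUSH = 6
    FULL_HOUSE = 7
    FOUR_OF_A_KIND = 8
    STRAIGHT_FLUSH = 9
    ROYAL_FLUSH = 10

# Comparing two showdown hands then reduces to comparing their categories
# (ties within a category are broken by card values, omitted here).
assert HandRank.FULL_HOUSE > HandRank.FLUSH
```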
4.2 Partially Observable Markov
Decision Process
A Markov decision process can be described as a tuple (S, A, P, R), where
S, a set of states of the world;
A, a set of actions;
P: S × S × A → [0, 1], which specifies the dynamics. This is written P(s'|s, a), where ∀s ∈ S, ∀a ∈ A: ∑s'∈S P(s'|s, a) = 1. In particular, P(s'|s, a) specifies the probability of transitioning to state s' given that the agent is in state s and does action a;
R: S × A × S → ℝ, where R(s, a, s') gives the expected immediate reward from doing action a and transitioning to state s' from state s (Poole and Mackworth, 2010).
Figure 1: Decision network representing a finite part of an
MDP (Poole and Mackworth, 2010).
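As a minimal sketch of the tuple above (the two-state example and all names are hypothetical and only illustrate the shapes of S, A, P and R):

```python
# Hypothetical two-state, two-action MDP matching the tuple (S, A, P, R).
S = ["s0", "s1"]                      # set of states
A = ["a0", "a1"]                      # set of actions

# P[(s, a)] is a distribution over successor states s', i.e. P(s'|s, a)
P = {
    ("s0", "a0"): {"s0": 0.9, "s1": 0.1},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.0, "s1": 1.0},
    ("s1", "a1"): {"s0": 0.5, "s1": 0.5},
}

# R[(s, a, s')] is the expected immediate reward for that transition
R = {(s, a, s2): 0.0 for s in S for a in A for s2 in S}
R[("s0", "a1", "s1")] = 1.0

# The defining constraint: for every (s, a) the successor probabilities sum to 1
for (s, a), dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```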
A partially observable Markov decision process is a formalism for representing decision problems for agents that must act under uncertainty (Sandberg, Lo, Fancourt, Principe, Katagiri, Haykin, 2001).
A POMDP can be formally described as a tuple (S, A, T, R, O, Ω), where
S - finite set of states of the environment;
A - finite set of actions;
T: S × A → ∆(S) - the state-transition function, giving a distribution over states of the environment, given a starting state and an action performed by the agent;
R: S × A → ℝ - the reward function, giving a real-valued expected immediate reward, given a starting state and an action performed by the agent;
Ω - finite set of observations the agent can experience;
O: S × A → ∆(Ω) - the observation function, giving a distribution over possible observations, given a starting state and an action performed by the agent.
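The sketch below illustrates this tuple together with the standard Bayesian belief update an agent can use to track the hidden state. The belief update is not described in this section; it is the usual Bayes filter, and the concrete numbers and names are hypothetical. Following the common POMDP convention, the observation is assumed to depend on the resulting state and the action.

```python
# Hypothetical two-state POMDP illustrating the tuple (S, A, T, R, O, Omega).
S = ["s0", "s1"]
A = ["act"]
Omega = ["o0", "o1"]

T = {("s0", "act"): {"s0": 0.9, "s1": 0.1},      # T(s, a): distribution over next states s'
     ("s1", "act"): {"s0": 0.1, "s1": 0.9}}
R = {("s0", "act"): 0.0, ("s1", "act"): 1.0}     # R(s, a): expected immediate reward
O = {("s0", "act"): {"o0": 0.85, "o1": 0.15},    # O(s', a): distribution over observations
     ("s1", "act"): {"o0": 0.15, "o1": 0.85}}

def update_belief(belief, action, observation):
    """Bayes filter: b'(s') is proportional to O(o|s', a) * sum over s of T(s'|s, a) * b(s)."""
    new_belief = {}
    for s2 in S:
        predicted = sum(T[(s, action)][s2] * belief[s] for s in S)
        new_belief[s2] = O[(s2, action)][observation] * predicted
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()}

b = {"s0": 0.5, "s1": 0.5}
b = update_belief(b, "act", "o1")   # observing "o1" shifts the belief toward s1
```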