Table 5: Win rate of the various agents after a round-robin of 100 rounds (row player vs. column player).

Agent      UCT-L-MG  UCT-L  Greedy  MCTS-L-MG  MC-L-MG  MC-L  MCTS-L  MCTS  MC   Avg. win rate
UCT-L-MG      -       51%    75%      55%       55%     61%    57%    87%   92%     66.625%
UCT-L        26%       -     69%      53%       58%     60%    52%    82%   85%     60.625%
Greedy       10%      14%     -       55%       45%     61%    53%    78%   75%     48.875%
MCTS-L-MG    20%      32%    30%       -        38%     41%    48%    55%   61%     40.625%
MC-L-MG       7%      29%    30%      36%        -      42%    38%    50%   54%     35.750%
MC-L         11%      24%    22%      38%       38%      -     36%    50%   57%     34.500%
MCTS-L       17%      26%    24%      40%       25%     25%     -     48%   63%     33.500%
MCTS          0%       0%     0%      14%       12%     10%    18%     -    42%     12.000%
MC            0%       0%     0%      16%        9%     12%    17%    38%    -      11.500%
The evaluated policies and enhancements included Move Groups, Decisive Moves, Upper Confidence Bounds for Trees (UCT), Limited Simulation Lengths, Max Child Selection, Robust Child Selection, and Secure Child Selection.
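As a reminder of the core selection rule that several of these enhancements build on, the UCT child-selection step can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the exploration constant `c` are illustrative choices.

```python
import math

def uct_select(children, total_visits, c=1.414):
    """Pick the child maximizing the UCB1 score used by UCT.

    children: list of (wins, visits) pairs; every child is assumed
    to have been visited at least once.
    total_visits: visit count of the parent node.
    """
    def score(child):
        wins, visits = child
        exploitation = wins / visits  # average reward so far
        exploration = c * math.sqrt(math.log(total_visits) / visits)
        return exploitation + exploration

    return max(children, key=score)
```

The exploration term grows for rarely visited children, so a child with no wins but only one visit can still be selected ahead of a well-explored, moderately successful one.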
Although the described approach was able to turn a losing MCTS agent into the best-performing one, it still depends heavily on enhancements that aid the agent in the opening moments of the game, when the number of iterations performed per turn is lower and the branching factor keeps increasing. This suggests that allocating a game-wide computational budget, rather than a fixed per-turn one, would pose an interesting challenge. Under such rules, the player would face not only a game-theory problem but also a resource-allocation task when deciding which moves should be prioritized (i.e., given more thinking time) over the course of the game.
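One hypothetical way to realize such a game-wide budget is to split the remaining thinking time over the expected remaining turns, weighting each turn by its estimated branching factor so that wider positions receive a larger share. This is a sketch of the idea under stated assumptions, not the paper's method; the function and parameter names are invented for illustration.

```python
def allocate_turn_time(total_budget, spent, branching_factors):
    """Return the thinking time (seconds) for the current turn.

    total_budget: thinking time allotted for the whole game.
    spent: time already consumed on previous turns.
    branching_factors: estimated number of legal moves for the
        current turn (index 0) and each anticipated future turn.

    The remaining budget is divided proportionally to the
    branching-factor weights, so turns with more candidate moves
    are granted more iterations.
    """
    remaining = total_budget - spent
    if remaining <= 0 or not branching_factors:
        return 0.0
    total_weight = sum(branching_factors)
    return remaining * branching_factors[0] / total_weight
```

For example, with 60 seconds left and estimated branching factors of 30, 20, and 10 for the remaining turns, the current (widest) turn would receive half of the remaining budget.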
Monte Carlo Tree Search in The Octagon Theory