success rate is far above that of simple voting schemes, suggesting that communication between independent randomized agents is important and that communicating only at the very end is not enough.
- Our two parallelizations (multi-core and cluster) are orthogonal, in the sense that: (i) the multi-core parallelization is based on a faster breadth-first exploration (the different cores analyze the same tree and follow almost the same path in it); in spite of many trials, we obtained no improvement by introducing deterministic or random diversification in the different threads; (ii) the cluster parallelization is based on sharing the statistics that guide only the first levels of the tree, leading to a natural form of load balancing, while the deep exploration of nodes remains completely independent across machines (see the sketch after this list). Moreover, the results are cumulative: we observe the same speed-up from the cluster parallelization whether the code is multi-threaded or single-threaded.
- In 9x9 Go, we observe a roughly linear speed-up up to 4 cores or 9 nodes. Beyond this limit the speed-up is still significant, but no longer linear. In 19x19 Go, the speed-up remains linear up to at least 4 cores and 9 machines.
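To make the statistics-sharing idea of point (ii) concrete, the following is a minimal sketch, purely illustrative and not MoGo's actual code: each machine keeps its own tree, and only the per-move statistics of the first levels are periodically pooled across machines. The Stats structure, the depth_limit parameter and merge_shallow_stats are assumptions introduced for the example.

from dataclasses import dataclass

@dataclass
class Stats:
    visits: int = 0
    wins: int = 0

def merge_shallow_stats(local_trees, depth_limit=2):
    """Pool per-move statistics over all machines, but only for nodes
    whose depth is <= depth_limit (the 'first levels' of the tree)."""
    merged = {}
    for tree in local_trees:
        for move_sequence, stats in tree.items():
            if len(move_sequence) > depth_limit:
                continue  # deep nodes stay local: their exploration is independent
            acc = merged.setdefault(move_sequence, Stats())
            acc.visits += stats.visits
            acc.wins += stats.wins
    return merged

# Toy usage with two machines; keys are sequences of moves from the root.
# After merging, each machine would continue its search from the pooled
# shallow statistics while keeping its deep nodes private.
machine_a = {("e5",): Stats(120, 70), ("e5", "d3"): Stats(40, 22)}
machine_b = {("e5",): Stats(100, 52), ("c3",): Stats(80, 35)}
pooled = merge_shallow_stats([machine_a, machine_b])
print(pooled[("e5",)])  # Stats(visits=220, wins=122)

Because only the shallow levels are exchanged, the communication volume stays small and independent of how deeply each machine explores, which is what makes the load balancing natural.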
Extending these results to larger numbers of machines is the natural next step. Increasing the number of cores is more difficult, as gaining access to a 16-core machine is not easy. Monte-Carlo planning is a strongly innovative tool with more and more applications, in particular in cases where variants of backward dynamic programming do not work. Extrapolating the results to the human scale of performance is difficult. It is commonly estimated that doubling the computational power adds roughly one stone to the playing level, and our experiments confirm this. Starting from the 2nd or 3rd Kyu of the sequential MoGo in 19x19, reaching the best human level requires a gain of 10 to 12 stones, and therefore a speed-up of a few thousand. This is far from impossible if the speed-up remains close to linear with more nodes.
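As a back-of-the-envelope check, assuming exactly one stone per doubling of computational power, a gain of 10 to 12 stones corresponds to a speed-up of
\[ 2^{10} = 1024 \quad \text{to} \quad 2^{12} = 4096, \]
i.e. a few thousand, consistent with the figure above.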