Monte Carlo Sampling for Regret Minimization in
Extensive Games
Marc Lanctot
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada T6G 2E8

Kevin Waugh
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3891

Martin Zinkevich
Yahoo! Research
Santa Clara, CA, USA 95054

Michael Bowling
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada T6G 2E8
Abstract
Sequential decision-making with multiple agents and imperfect information is
commonly modeled as an extensive game. One efficient method for computing
Nash equilibria in large, zero-sum, imperfect information games is counterfactual
regret minimization (CFR). In the domain of poker, CFR has proven effective,
particularly when using a domain-specific augmentation involving chance outcome
sampling. In this paper, we describe a general family of domain-independent CFR
sample-based algorithms called Monte Carlo counterfactual regret minimization
(MCCFR), of which the original and poker-specific versions are special cases. We
start by showing that MCCFR performs the same regret updates as CFR in expectation.
Then, we introduce two sampling schemes: outcome sampling and external
sampling, showing that both have bounded overall regret with high probability.
Thus, they can compute an approximate equilibrium using self-play. Finally, we
prove a new tighter bound on the regret for the original CFR algorithm and relate
this new bound to MCCFR's bounds. We show empirically that, although the
sample-based algorithms require more iterations, their lower cost per iteration can
lead to dramatically faster convergence in various games.
1 Introduction
Extensive games are a powerful model of sequential decision-making with imperfect information,
subsuming finite-horizon MDPs, finite-horizon POMDPs, and perfect information games. The past
few years have seen dramatic algorithmic improvements in solving, i.e., finding an approximate
Nash equilibrium of, two-player, zero-sum extensive games. Multiple techniques [1, 2] now exist
for solving games with up to $10^{12}$ game states, which is about four orders of magnitude larger than
the previous state of the art using sequence-form linear programs [3].
Counterfactual regret minimization (CFR) [1] is one such recent technique that exploits the fact that
the time-averaged strategy profile of regret-minimizing algorithms converges to a Nash equilibrium.
The key insight is the fact that minimizing per-information-set counterfactual regret results in
minimizing overall regret. However, the vanilla form presented by Zinkevich and colleagues requires
the entire game tree to be traversed on each iteration. It is possible to avoid a full game-tree traversal.
In their accompanying technical report, Zinkevich and colleagues discuss a poker-specific CFR
variant that samples chance outcomes on each iteration [4]. They claim that the per-iteration cost
reduction far exceeds the additional number of iterations required, and all of their empirical studies
focus on this variant. The sampling variant and its derived bound are limited to poker-like games
where chance plays a prominent role in the size of the games. This limits the practicality of CFR
minimization outside of its initial application of poker or moderately sized games. An additional
disadvantage of CFR is that it requires the opponent's policy to be known, which makes it unsuitable
for online regret minimization in an extensive game. Online regret minimization in extensive
games is possible using online convex programming techniques, such as Lagrangian Hedging [5],
but these techniques can require costly optimization routines at every time step.
In this paper, we present a general framework for sampling in counterfactual regret minimization.
We define a family of Monte Carlo CFR minimizing algorithms (MCCFR) that differ in how they
sample the game tree on each iteration. Zinkevich's vanilla CFR and a generalization of their chance-sampled
CFR are both members of this family. We then introduce two additional members of this
family: outcome-sampling, where only a single playing of the game is sampled on each iteration; and
external-sampling, which samples chance nodes and the opponent's actions. We show that under a
reasonable sampling strategy, any member of this family minimizes overall regret, and so can be used
for equilibrium computation. Additionally, external-sampling is proven to require only a constant-factor
increase in iterations yet achieves an order reduction in the cost per iteration, thus resulting in an
asymptotic improvement in equilibrium computation time. Furthermore, since outcome-sampling
does not need knowledge of the opponent's strategy beyond samples of play from the strategy, we
describe how it can be used for online regret minimization. We then evaluate these algorithms
empirically by using them to compute approximate equilibria in a variety of games.
2 Background
An extensive game is a general model of sequential decision-making with imperfect information. As
with perfect information games (such as Chess or Checkers), extensive games consist primarily of a
game tree: each non-terminal node has an associated player (possibly chance) that makes the deci-
sion at that node, and each terminal node has associated utilities for the players. Additionally, game
states are partitioned into information sets where a player cannot distinguish between two states in
the same information set. The players, therefore, must choose actions with the same distribution at
each state in the same information set. We now define an extensive game formally, introducing the
notation we use throughout the paper.
Definition 1 [6, p. 200] A finite extensive game with imperfect information has the following
components:

- A finite set $N$ of players. A finite set $H$ of sequences, the possible histories of actions, such
that the empty sequence is in $H$ and every prefix of a sequence in $H$ is also in $H$. Define
$h \sqsubseteq h'$ to mean $h$ is a prefix of $h'$. $Z \subseteq H$ are the terminal histories (those which are not
a prefix of any other sequences). $A(h) = \{a : ha \in H\}$ are the actions available after a
non-terminal history, $h \in H \setminus Z$.
- A function $P$ that assigns to each non-terminal history a member of $N \cup \{c\}$. $P$ is the
player function. $P(h)$ is the player who takes an action after the history $h$. If $P(h) = c$,
then chance determines the action taken after history $h$.
- For each player $i \in N \cup \{c\}$, a partition $\mathcal{I}_i$ of $\{h \in H : P(h) = i\}$ with the property that
$A(h) = A(h')$ whenever $h$ and $h'$ are in the same member of the partition. For $I_i \in \mathcal{I}_i$,
we denote by $A(I_i)$ the set $A(h)$ and by $P(I_i)$ the player $P(h)$ for any $h \in I_i$. $\mathcal{I}_i$ is the
information partition of player $i$; a set $I_i \in \mathcal{I}_i$ is an information set of player $i$.
- A function $f_c$ that associates with every information set $I$ where $P(I) = c$ a probability
measure $f_c(\cdot\,|\,I)$ on $A(h)$ ($f_c(a\,|\,I)$ is the probability that $a$ occurs given some $h \in I$), where
each such probability measure is independent of every other such measure.¹
¹Traditionally, an information partition is not specified for chance. In fact, as long as the same chance
information set cannot be revisited, it has no strategic effect on the game itself. However, this extension allows
us to consider using the same sampled chance outcome for an entire set of histories, which is an important part
of Zinkevich and colleagues' chance-sampling CFR variant.
- For each player $i \in N$, a utility function $u_i$ from the terminal states $Z$ to the reals $\mathbb{R}$. If
$N = \{1, 2\}$ and $u_1 = -u_2$, it is a zero-sum extensive game. Define $\Delta_{u,i} = \max_z u_i(z) - \min_z u_i(z)$ to be the range of utilities to player $i$.

In this paper, we will only concern ourselves with two-player, zero-sum extensive games. Furthermore,
we will assume perfect recall, a restriction on the information partitions such that a player
can always distinguish between game states where they previously took a different action or were
previously in a different information set.
2.1 Strategies and Equilibria
A strategy of player $i$, $\sigma_i$, in an extensive game is a function that assigns a distribution over $A(I_i)$
to each $I_i \in \mathcal{I}_i$. We denote $\Sigma_i$ as the set of all strategies for player $i$. A strategy profile, $\sigma$, consists
of a strategy for each player, $\sigma_1, \ldots, \sigma_n$. We let $\sigma_{-i}$ refer to the strategies in $\sigma$ excluding $\sigma_i$.

Let $\pi^\sigma(h)$ be the probability of history $h$ occurring if all players choose actions according to $\sigma$. We
can decompose $\pi^\sigma(h) = \prod_{i \in N \cup \{c\}} \pi_i^\sigma(h)$ into each player's contribution to this probability. Here,
$\pi_i^\sigma(h)$ is the contribution to this probability from player $i$ when playing according to $\sigma$. Let $\pi_{-i}^\sigma(h)$
be the product of all players' contributions (including chance) except that of player $i$. For $I \subseteq H$,
define $\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$ as the probability of reaching a particular information set given that all
players play according to $\sigma$, with $\pi_i^\sigma(I)$ and $\pi_{-i}^\sigma(I)$ defined similarly. Finally, let $\pi^\sigma(h, z) = \pi^\sigma(z)/\pi^\sigma(h)$ if $h \sqsubseteq z$, and zero otherwise. Let $\pi_i^\sigma(h, z)$ and $\pi_{-i}^\sigma(h, z)$ be defined similarly. Using
this notation, we can define the expected payoff for player $i$ as $u_i(\sigma) = \sum_{h \in Z} u_i(h)\, \pi^\sigma(h)$.
Given a strategy profile, $\sigma$, we define a player's best response as a strategy that maximizes their
expected payoff assuming all other players play according to $\sigma$. The best-response value for player
$i$ is the value of that strategy, $b_i(\sigma_{-i}) = \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i})$. An $\epsilon$-Nash equilibrium is an
approximation of a Nash equilibrium; it is a strategy profile $\sigma$ that satisfies
\[ \forall i \in N \quad u_i(\sigma) + \epsilon \ge \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i}) \quad (1) \]
If $\epsilon = 0$, then $\sigma$ is a Nash equilibrium: no player has any incentive to deviate, as they are all playing
best responses. If a game is two-player and zero-sum, we can use exploitability as a metric for
determining how close $\sigma$ is to an equilibrium, $\epsilon_\sigma = b_1(\sigma_2) + b_2(\sigma_1)$.
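As a concrete, simplified illustration of this metric, the following sketch computes exploitability for a two-player zero-sum normal-form game rather than an extensive game; the payoff matrix and function names are ours, not the paper's:

```python
# Row player's payoffs in a zero-sum normal-form game (matching pennies);
# the column player receives the negation.
PAYOFF = [[1.0, -1.0],
          [-1.0, 1.0]]

def best_response_value(payoff, opp):
    """b_i(sigma_{-i}): value of the best pure response to a fixed opponent mix."""
    return max(sum(opp[b] * payoff[a][b] for b in range(len(opp)))
               for a in range(len(payoff)))

def exploitability(row, col, payoff):
    """epsilon_sigma = b_1(sigma_2) + b_2(sigma_1); zero exactly at an equilibrium."""
    b1 = best_response_value(payoff, col)
    # The column player's payoff matrix is the negated transpose.
    neg_t = [[-payoff[a][b] for a in range(len(payoff))]
             for b in range(len(payoff[0]))]
    b2 = best_response_value(neg_t, row)
    return b1 + b2

print(exploitability([0.5, 0.5], [0.5, 0.5], PAYOFF))  # uniform is the equilibrium: 0.0
print(exploitability([1.0, 0.0], [0.5, 0.5], PAYOFF))  # a pure row strategy is exploitable
```

In an extensive game the best responses are computed by a recursive tree walk instead of a max over rows, but the metric is combined the same way.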
2.2 Counterfactual Regret Minimization
Regret is an online learning concept that has triggered a family of powerful learning algorithms. To
define this concept, first consider repeatedly playing an extensive game. Let $\sigma_i^t$ be the strategy used
by player $i$ on round $t$. The average overall regret of player $i$ at time $T$ is:
\[ R_i^T = \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \right) \quad (2) \]
Moreover, define $\bar\sigma_i^T$ to be the average strategy for player $i$ from time 1 to $T$. In particular, for each
information set $I \in \mathcal{I}_i$, for each $a \in A(I)$, define:
\[ \bar\sigma_i^T(a\,|\,I) = \frac{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma^t(a\,|\,I)}{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)}. \quad (3) \]
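The reach-weighted average of Equation 3 can be sketched at a single information set as follows (a minimal illustration; the argument names are ours):

```python
def average_strategy(reach_probs, strategies):
    """Equation 3 at one information set I: the average of the per-round
    strategies sigma^t(.|I), each weighted by the player's own reach
    probability pi_i^{sigma^t}(I).

    reach_probs[t] -- pi_i^{sigma^t}(I) for round t
    strategies[t]  -- sigma^t(.|I) as a list of action probabilities
    """
    total = sum(reach_probs)
    n_actions = len(strategies[0])
    return [sum(r * s[a] for r, s in zip(reach_probs, strategies)) / total
            for a in range(n_actions)]

# Two rounds: equal reach averages the strategies evenly; skewed reach
# biases toward the round in which the player actually reached I more often.
print(average_strategy([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]))  # [0.5, 0.5]
print(average_strategy([3.0, 1.0], [[1.0, 0.0], [0.0, 1.0]]))  # [0.75, 0.25]
```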
There is a well-known connection between regret, average strategies, and Nash equilibria.

Theorem 1 In a zero-sum game, if $R_i^T \le \epsilon$ for both $i \in \{1, 2\}$, then $\bar\sigma^T$ is a $2\epsilon$-equilibrium.

An algorithm for selecting $\sigma_i^t$ for player $i$ is regret minimizing if player $i$'s average overall regret
(regardless of the sequence $\sigma_{-i}^t$) goes to zero as $t$ goes to infinity. Regret-minimizing algorithms in
self-play can be used as a technique for computing an approximate Nash equilibrium. Moreover, an
algorithm's bound on the average overall regret bounds the convergence rate of the approximation.
Zinkevich and colleagues [1] used the above approach in their counterfactual regret algorithm (CFR).
The basic idea of CFR is that overall regret can be bounded by the sum of positive per-information-set
immediate counterfactual regrets. Let $I$ be an information set of player $i$. Define $\sigma_{(I \to a)}$ to be
a strategy profile identical to $\sigma$ except that player $i$ always chooses action $a$ from information set
$I$. Let $Z_I$ be the subset of all terminal histories where a prefix of the history is in the set $I$; for
$z \in Z_I$, let $z[I]$ be that prefix. Since we are restricting ourselves to perfect recall games, $z[I]$ is
unique. Define the counterfactual value $v_i(\sigma, I)$ as
\[ v_i(\sigma, I) = \sum_{z \in Z_I} \pi_{-i}^\sigma(z[I])\, \pi^\sigma(z[I], z)\, u_i(z). \quad (4) \]
The immediate counterfactual regret is then $R_{i,\mathrm{imm}}^T(I) = \max_{a \in A(I)} R_{i,\mathrm{imm}}^T(I, a)$, where
\[ R_{i,\mathrm{imm}}^T(I, a) = \frac{1}{T} \sum_{t=1}^{T} \left( v_i(\sigma^t_{(I \to a)}, I) - v_i(\sigma^t, I) \right) \quad (5) \]
Let $x^+ = \max(x, 0)$. The key insight of CFR is the following result.

Theorem 2 [1, Theorem 3] $R_i^T \le \sum_{I \in \mathcal{I}_i} R_{i,\mathrm{imm}}^{T,+}(I)$
Using regret-matching² the positive per-information-set immediate counterfactual regrets can be
driven to zero, thus driving the average overall regret to zero. This results in an average overall regret
bound [1, Theorem 4]: $R_i^T \le \Delta_{u,i} |\mathcal{I}_i| \sqrt{|A_i|} / \sqrt{T}$, where $|A_i| = \max_{h : P(h) = i} |A(h)|$. We return to
this bound, tightening it further, in Section 4.
This result suggests an algorithm for computing equilibria via self-play, which we will refer to as
vanilla CFR. The idea is to traverse the game tree computing counterfactual values using Equation 4.
Given a strategy, these values define regret terms for each player for each of their information sets
using Equation 5. These regret values accumulate and determine the strategies at the next iteration
using the regret-matching formula. Since both players are regret minimizing, Theorem 1 applies,
and so computing the average strategy profile $\bar\sigma^t$ gives us an approximate Nash equilibrium. Since CFR
only needs to store values at each information set, its space requirement is $O(|\mathcal{I}|)$. However, as
previously mentioned, vanilla CFR requires a complete traversal of the game tree on each iteration,
which prohibits its use in many large games. Zinkevich and colleagues [4] made steps to alleviate
this concern with a chance-sampled variant of CFR for poker-like games.
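The regret-matching update used at each information set can be sketched as follows (our own minimal implementation, using the common fallback-to-uniform convention when no regret is positive):

```python
def regret_matching(cum_regrets):
    """Next-iteration strategy at one information set: probabilities
    proportional to positive cumulative counterfactual regrets R^{T,+}(I, a).
    When no action has positive regret, play uniformly at random.
    """
    positives = [max(r, 0.0) for r in cum_regrets]
    total = sum(positives)
    if total <= 0.0:
        return [1.0 / len(cum_regrets)] * len(cum_regrets)
    return [p / total for p in positives]

print(regret_matching([3.0, -1.0, 1.0]))  # [0.75, 0.0, 0.25]
print(regret_matching([-2.0, -0.5]))      # [0.5, 0.5]
```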
3 Monte Carlo CFR
The key to our approach is to avoid traversing the entire game tree on each iteration while still having
the immediate counterfactual regrets be unchanged in expectation. In general, we want to restrict
the terminal histories we consider on each iteration. Let $\mathcal{Q} = \{Q_1, \ldots, Q_r\}$ be a set of subsets of
$Z$ such that their union spans the set $Z$. We will call one of these subsets a block. On each iteration
we will sample one of these blocks and only consider the terminal histories in that block. Let $q_j > 0$
be the probability of considering block $Q_j$ on the current iteration (where $\sum_{j=1}^{r} q_j = 1$).
Let $q(z) = \sum_{j : z \in Q_j} q_j$, i.e., $q(z)$ is the probability of considering terminal history $z$ on the current
iteration. The sampled counterfactual value when updating block $j$ is:
\[ \tilde v_i(\sigma, I \,|\, j) = \sum_{z \in Q_j \cap Z_I} \frac{1}{q(z)}\, u_i(z)\, \pi_{-i}^\sigma(z[I])\, \pi^\sigma(z[I], z) \quad (6) \]
Selecting a set $\mathcal{Q}$ along with the sampling probabilities defines a complete sample-based CFR algorithm.
Rather than doing full game-tree traversals, the algorithm samples one of these blocks and
then examines only the terminal histories in that block.
Suppose we choose $\mathcal{Q} = \{Z\}$, i.e., one block containing all terminal histories, and $q_1 = 1$. In
this case, the sampled counterfactual value is equal to the counterfactual value, and we have vanilla CFR.
Suppose instead we choose each block to include all terminal histories with the same sequence of
chance outcomes (where the probability of a chance outcome is independent of players' actions, as
in poker-like games). Hence $q_j$ is the product of the probabilities in the sampled sequence of chance
outcomes (which cancels with these same probabilities in the definition of counterfactual value), and
we have Zinkevich and colleagues' chance-sampled CFR.

²Regret-matching selects actions with probability proportional to their positive regret, i.e., $\sigma_i^{t+1}(a\,|\,I) = R_{i,\mathrm{imm}}^{T,+}(I, a) / \sum_{a' \in A(I)} R_{i,\mathrm{imm}}^{T,+}(I, a')$. Regret-matching satisfies Blackwell's approachability criteria. [7, 8]
Sampled counterfactual value was designed to match counterfactual value in expectation. We show
this here, and then use this fact to prove a probabilistic bound on the algorithm's average overall
regret in the next section.

Lemma 1 $\mathbb{E}_{j \sim q_j}[\tilde v_i(\sigma, I \,|\, j)] = v_i(\sigma, I)$

Proof:
\[ \mathbb{E}_{j \sim q_j}[\tilde v_i(\sigma, I \,|\, j)] = \sum_j q_j\, \tilde v_i(\sigma, I \,|\, j) = \sum_j \sum_{z \in Q_j \cap Z_I} \frac{q_j}{q(z)}\, \pi_{-i}^\sigma(z[I])\, \pi^\sigma(z[I], z)\, u_i(z) \quad (7) \]
\[ = \sum_{z \in Z_I} \frac{\sum_{j : z \in Q_j} q_j}{q(z)}\, \pi_{-i}^\sigma(z[I])\, \pi^\sigma(z[I], z)\, u_i(z) \quad (8) \]
\[ = \sum_{z \in Z_I} \pi_{-i}^\sigma(z[I])\, \pi^\sigma(z[I], z)\, u_i(z) = v_i(\sigma, I) \quad (9) \]
Equation 8 follows from the fact that $\mathcal{Q}$ spans $Z$. Equation 9 follows from the definition of $q(z)$.
This results in the following MCCFR algorithm. We sample a block and, for each information
set that contains a prefix of a terminal history in the block, we compute the sampled immediate
counterfactual regrets of each action, $\tilde r(I, a) = \tilde v_i(\sigma^t_{(I \to a)}, I) - \tilde v_i(\sigma^t, I)$. We accumulate these
regrets, and the player's strategy on the next iteration applies the regret-matching algorithm to the
accumulated regrets. We now present two specific members of this family, giving details on how the
regrets can be updated efficiently.
Outcome-Sampling MCCFR. In outcome-sampling MCCFR we choose $\mathcal{Q}$ so that each block
contains a single terminal history, i.e., $\forall Q \in \mathcal{Q}, |Q| = 1$. On each iteration we sample one terminal
history and only update each information set along that history. The sampling probabilities $q_j$ must
specify a distribution over terminal histories. We will specify this distribution using a sampling
profile, $\sigma'$, so that $q(z) = \pi^{\sigma'}(z)$. Note that any choice of sampling policy will induce a particular
distribution over the block probabilities $q(z)$. As long as $\sigma'(a\,|\,I) > \epsilon$ for all $I$ and $a$, then there exists a $\delta > 0$ such
that $q(z) > \delta$, thus ensuring Equation 6 is well-defined.

The algorithm works by sampling $z$ using policy $\sigma'$, storing $\pi^{\sigma'}(z)$. The single history is then
traversed forward (to compute each player's probability of playing to reach each prefix of the history,
$\pi_i^\sigma(h)$) and backward (to compute each player's probability of playing the remaining actions of the
history, $\pi_i^\sigma(h, z)$). During the backward traversal, the sampled counterfactual regrets at each visited
information set are computed (and added to the total regret).
\[ \tilde r(I, a) = \begin{cases} w_I \cdot (1 - \sigma(a \,|\, z[I])) & \text{if } z[I]a \sqsubseteq z \\ -w_I \cdot \sigma(a \,|\, z[I]) & \text{otherwise} \end{cases} \quad \text{where } w_I = \frac{u_i(z)\, \pi_{-i}^\sigma(z)\, \pi_i^\sigma(z[I]a, z)}{\pi^{\sigma'}(z)} \quad (10) \]
One advantage of outcome-sampling MCCFR is that if our terminal history is sampled according to
the opponent's policy, so $\sigma'_{-i} = \sigma_{-i}$, then the update no longer requires explicit knowledge of $\sigma_{-i}$, as
it cancels with the $\pi_{-i}^{\sigma'}$. So $w_I$ becomes $u_i(z)\, \pi_i^\sigma(z[I]a, z) / \pi_i^{\sigma'}(z)$. Therefore, we can use outcome-sampling
MCCFR for online regret minimization. We would have to choose our own actions so that
$\sigma'_i \approx \sigma_i^t$, but with some exploration to guarantee $q_j > 0$. By balancing the regret caused by
exploration with the regret caused by a small $\delta$ (see Section 4 for how MCCFR's bound depends
upon $\delta$), we can bound the average overall regret as long as the number of playings $T$ is known in
advance. This effectively mimics the approach taken by Exp3 for regret minimization in normal-form
games [9]. An alternative form for Equation 10 is recommended for implementation. This and
other implementation details can be found in the paper's supplemental material or the appendix of
the associated technical report [10].
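A sketch of an exploratory sampling profile $\sigma'$ of the kind discussed above (an epsilon-greedy mix; the code and its names are illustrative, not the authors' implementation):

```python
import random

def sample_action(strategy, epsilon=0.6):
    """Sample one action from sigma'(.|I) = (1-epsilon)*sigma(.|I) + epsilon*uniform.

    Returns the action index and its sampling probability; multiplying these
    probabilities along the sampled history z yields pi^{sigma'}(z), the
    denominator of the importance weight 1/q(z) in Equation 6. Mixing in the
    uniform distribution keeps every sampling probability at least epsilon/n,
    so q(z) is bounded away from zero.
    """
    n = len(strategy)
    probs = [(1.0 - epsilon) * p + epsilon / n for p in strategy]
    r, cum = random.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if r < cum:
            return a, p
    return n - 1, probs[-1]  # guard against floating-point round-off

a, p = sample_action([1.0, 0.0])
# With epsilon = 0.6 and two actions, each action keeps probability >= 0.3.
```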
External-Sampling MCCFR. In external-sampling MCCFR we sample only the actions of the
opponent and chance (those choices external to the player). We have a block $Q_\tau \in \mathcal{Q}$ for each
pure strategy of the opponent and chance, i.e., for each deterministic mapping $\tau$ from $I \in \mathcal{I}_c \cup \mathcal{I}_{N \setminus \{i\}}$ to $A(I)$. The block probabilities are assigned based on the distributions $f_c$ and $\sigma_{-i}$, so
$q_\tau = \prod_{I \in \mathcal{I}_c} f_c(\tau(I) \,|\, I) \prod_{I \in \mathcal{I}_{N \setminus \{i\}}} \sigma_{-i}(\tau(I) \,|\, I)$. The block $Q_\tau$ then contains all terminal histories
$z$ consistent with $\tau$; that is, if $ha$ is a prefix of $z$ with $h \in I$ for some $I \in \mathcal{I}_c \cup \mathcal{I}_{N \setminus \{i\}}$, then $\tau(I) = a$. In
practice, we will not actually sample $\tau$ but rather sample the individual actions that make up $\tau$ only
as needed. The key insight is that these block probabilities result in $q(z) = \pi_{-i}^\sigma(z)$. The algorithm
iterates over $i \in N$, for each doing a post-order depth-first traversal of the game tree, sampling
actions at each history $h$ where $P(h) \ne i$ (storing these choices so the same actions are sampled at
all $h$ in the same information set). Due to perfect recall, it can never visit more than one history from
the same information set during this traversal. For each such visited information set, the sampled
counterfactual regrets are computed (and added to the total regrets).
\[ \tilde r(I, a) = (1 - \sigma(a \,|\, I)) \sum_{z \in Q_\tau \cap Z_I} u_i(z)\, \pi_i^\sigma(z[I]a, z) \quad (11) \]
Note that the summation can be easily computed during the traversal by always maintaining a
weighted sum of the utilities of all terminal histories rooted at the current history.
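The post-order traversal with a running weighted utility sum can be sketched on a toy dict-based game tree as follows (a hypothetical representation, ours rather than the paper's; regret accumulation is indicated only by a comment):

```python
import random

def external_walk(node, i, strategy):
    """One external-sampling pass for update player i.

    `node` is either {'terminal': u_i} or
    {'player': p, 'infoset': key, 'actions': {a: child}}.
    At chance/opponent nodes we sample a single action from `strategy`;
    at player i's nodes we recurse over all actions, returning the
    weighted sum of terminal utilities rooted at the current history.
    """
    if 'terminal' in node:
        return node['terminal']
    sigma = strategy[node['infoset']]
    actions = list(node['actions'])
    if node['player'] != i:
        # External node: sample one action for chance/the opponent.
        a = random.choices(actions, weights=[sigma[x] for x in actions])[0]
        return external_walk(node['actions'][a], i, strategy)
    # Player i's node: visit every action, weighting children by sigma.
    child_values = {a: external_walk(node['actions'][a], i, strategy)
                    for a in actions}
    value = sum(sigma[a] * child_values[a] for a in actions)
    # The sampled counterfactual regrets would be accumulated here, e.g.
    # regret[(node['infoset'], a)] += child_values[a] - value
    return value

tree = {'player': 0, 'infoset': 'I0',
        'actions': {'l': {'terminal': 1.0}, 'r': {'terminal': 3.0}}}
print(external_walk(tree, 0, {'I0': {'l': 0.5, 'r': 0.5}}))  # 2.0
```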
4 Theoretical Analysis
We now present regret bounds for members of the MCCFR family, starting with an improved bound
for vanilla CFR that depends more explicitly on the exact structure of the extensive game. Let $a_i$ be
a subsequence of a history such that it contains only player $i$'s actions in that history, and let $\mathcal{A}_i$ be
the set of all such player $i$ action subsequences. Let $\mathcal{I}_i(a_i)$ be the set of all information sets where
player $i$'s action sequence up to that information set is $a_i$. Define the $M$-value for player $i$ of the
game to be $M_i = \sum_{a_i \in \mathcal{A}_i} \sqrt{|\mathcal{I}_i(a_i)|}$. Note that $\sqrt{|\mathcal{I}_i|} \le M_i \le |\mathcal{I}_i|$, with both sides of this bound
being realized by some game. We can strengthen vanilla CFR's regret bound using this constant,
which also appears in the bounds for the MCCFR variants.

Theorem 3 When using vanilla CFR for player $i$, $R_i^T \le \Delta_{u,i}\, M_i \sqrt{|A_i|} / \sqrt{T}$.
We now turn our attention to the MCCFR family of algorithms, for which we can provide probabilistic
regret bounds. We begin with the most exciting result: showing that external-sampling requires
only a constant factor more iterations than vanilla CFR (where the constant depends on the desired
confidence in the bound).

Theorem 4 For any $p \in (0, 1]$, when using external-sampling MCCFR, with probability at least
$1 - p$, average overall regret is bounded by $R_i^T \le \left(1 + \frac{\sqrt{2}}{\sqrt{p}}\right) \Delta_{u,i}\, M_i \sqrt{|A_i|} / \sqrt{T}$.

Although requiring the same order of iterations, note that external-sampling need only traverse a
fraction of the tree on each iteration. For balanced games where players make roughly equal numbers
of decisions, the iteration cost of external-sampling is $O(\sqrt{|H|})$, while vanilla CFR's is $O(|H|)$,
meaning external-sampling MCCFR requires asymptotically less time to compute an approximate
equilibrium than vanilla CFR (and consequently chance-sampling CFR, which is identical to vanilla
CFR in the absence of chance nodes).
Theorem 5 For any $p \in (0, 1]$, when using outcome-sampling MCCFR where $\forall z \in Z$ either
$\pi_{-i}^\sigma(z) = 0$ or $q(z) \ge \delta > 0$ at every timestep, with probability at least $1 - p$, average overall regret
is bounded by $R_i^T \le \left(1 + \frac{\sqrt{2}}{\sqrt{p}}\right) \frac{1}{\delta}\, \Delta_{u,i}\, M_i \sqrt{|A_i|} / \sqrt{T}$.

The proofs for the theorems in this section can be found in the paper's supplemental material and
in an appendix of the associated technical report [10]. The supplemental material also presents a
slightly more complicated, but general, result for any member of the MCCFR family, from which the two
theorems presented above are derived.
Game | $|H|$ ($10^6$) | $|\mathcal{I}|$ ($10^3$) | $l$ | $M_1$ | $M_2$ | $t_{vc}$ | $t_{os}$ | $t_{es}$
OCP  | 22.4 | 2 | 5 | 45 | 32 | 28s | 46 µs | 99 µs
Goof | 98.3 | 329*-**-***** | 89884 | 110s | 150 µs | 150ms
LTTT | 70.4 | 160**-**-******* | 1236660 | 38s | 62 µs | 70ms
PAM  | 91.8 | 20-13-954*-**** | | 120s | 85 µs | 28ms

Table 1: Game properties. The value of $|H|$ is in millions and $|\mathcal{I}|$ in thousands, and $l = \max_{h \in H} |h|$.
$t_{vc}$, $t_{os}$, and $t_{es}$ are the average wall-clock times per iteration⁴ for vanilla CFR, outcome-sampling
MCCFR, and external-sampling MCCFR.
5 Experimental Results
We evaluate the performance of MCCFR compared to vanilla CFR on four different games. Goofspiel
[11] is a bidding card game where players have a hand of cards numbered 1 to $N$, and take
turns secretly bidding on the top point-valued card in a point-card stack using cards in their hands.
Our version is less informational: players only find out the result of each bid and not which cards
were used to bid, and the player with the highest total points wins. We use $N = 7$ in our experiments.
One-Card Poker [12] is a generalization of Kuhn Poker [13]; we use a deck of size 500.
Princess and Monster [14, Research Problem 12.4.1] is a pursuit-evasion game on a graph, with neither
player ever knowing the location of the other. In our experiments we use random starting positions,
a 4-connected 3-by-3 grid graph, and a horizon of 13 steps. The payoff to the evader is the number of
steps uncaptured. Latent Tic-Tac-Toe is a twist on the classic game where moves are not disclosed
until after the opponent's next move, and are lost if invalid at the time they are revealed. While all of
these games have imperfect information and are roughly of similar size, they are a diverse set of games,
varying both in the degree (the ratio of the number of information sets to the number of histories)
and nature (whether due to chance or opponent actions) of imperfect information. The left columns
of Table 1 show various constants, including the number of histories, information sets, game length,
and $M$-values, for each of these domains.
We used outcome-sampling MCCFR, external-sampling MCCFR, and vanilla CFR to compute an
approximate equilibrium in each of the four games. For outcome-sampling MCCFR we used an
epsilon-greedy sampling profile $\sigma'$: at each information set, we sample an action uniformly randomly
with probability $\epsilon$ and according to the player's current strategy $\sigma^t$ otherwise. Through experimentation
we found that $\epsilon = 0.6$ worked well across all games; this is interesting because the regret bound
suggests $\epsilon$ should be as large as possible. This implies that putting some bias on the most likely
outcome to occur is helpful. With vanilla CFR we used an implementation trick called pruning
to dramatically reduce the work done per iteration. When updating one player's regrets, if the other
player has no probability of reaching the current history, the entire subtree at that history can be
pruned for the current iteration, with no effect on the resulting computation. We also ran vanilla
CFR without pruning to see the effects of pruning in our domains.
Figure 1 shows the results of all four algorithms on all four domains, plotting approximation quality
as a function of the number of nodes of the game tree the algorithm touched while computing.
Nodes touched is an implementation-independent measure of computation; however, the results are
nearly identical if total wall-clock time is used instead. Since the algorithms take radically different
amounts of time per iteration, this comparison directly answers whether the sampling variants' lower cost
per iteration outweighs the required increase in the number of iterations. Furthermore, for any
fixed game (and degree of confidence that the bound holds), the algorithms' average overall regret
is falling at the same rate, $O(1/\sqrt{T})$, meaning that only their short-term rather than asymptotic
performance will differ.
The graphs show that the MCCFR variants often dramatically outperform vanilla CFR. For example,
in Goofspiel, both MCCFR variants require only a few million nodes to reach the same approximation
quality for which vanilla CFR takes 2.5 billion nodes, three orders of magnitude more. In fact, external-sampling, which has
the tightest theoretical computation-time bound, outperformed CFR by considerable margins
(excepting LTTT) in all of the games. Note that pruning is key to vanilla CFR being at all practical
in these games. For example, in Latent Tic-Tac-Toe the first iteration of CFR touches 142 million
nodes, but later iterations touch as few as 5 million nodes. This is because pruning is not possible

⁴As measured on an 8-core Intel Xeon 2.5 GHz machine running Linux x86_64, kernel 2.6.27.
[Figure 1: four convergence plots (Goofspiel, Latent Tic-Tac-Toe, One-Card Poker, Princess and Monster), each comparing CFR, CFR with pruning, MCCFR-outcome, and MCCFR-external, with exploitability plotted against nodes touched.]

Figure 1: Convergence rates of vanilla CFR, outcome-sampling MCCFR, and external-sampling MCCFR
for various games. The y-axis in each graph represents the exploitability of the strategies for
the two players (see Section 2.1).
in the first iteration. We believe this is due to dominated actions in the game. After one or two
traversals, the players identify and eliminate dominated actions from their policies, allowing these
subtrees to be pruned. Finally, it is interesting to note that external-sampling was not uniformly the best
choice, with outcome-sampling performing better in Goofspiel. With outcome-sampling performing
worse than vanilla CFR in LTTT, this raises the question of what specific game properties might
favor one algorithm over another and whether it might be possible to incorporate additional game-specific
constants into the bounds.
6 Conclusion
In this paper we defined a family of sample-based CFR algorithms for computing approximate equilibria
in extensive games, subsuming all previous CFR variants. We also introduced two sampling
schemes: outcome-sampling, which samples only a single history for each iteration, and external-sampling,
which samples a deterministic strategy for the opponent and chance. In addition to presenting
a tighter bound for vanilla CFR, we presented regret bounds for both sampling variants,
which showed that external sampling with high probability gives an asymptotic computational-time
improvement over vanilla CFR. We then showed empirically in very different domains that the reduction
in iteration time outweighs the increase in required iterations, leading to faster convergence.

There are three interesting directions for future work. First, we would like to examine how the
properties of the game affect the algorithms' convergence. Such an analysis could offer further
algorithmic or theoretical improvements, as well as practical suggestions, such as how to choose
a sampling policy in outcome-sampled MCCFR. Second, using outcome-sampled MCCFR as a
general online regret-minimizing technique in extensive games (when the opponent's strategy is not
known or controlled) appears promising. It would be interesting to compare the approach, in terms
of bounds, computation, and practical convergence, to Gordon's Lagrangian hedging [5]. Lastly,
it seems like this work could be naturally extended to cases where we don't assume perfect recall.
Imperfect recall could be used as a mechanism for abstraction over actions, where information sets
are grouped by important partial sequences rather than their full sequences.
References
[1] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization
in games with incomplete information. In Advances in Neural Information Processing
Systems 20 (NIPS), 2008.
[2] Andrew Gilpin, Samid Hoda, Javier Peña, and Tuomas Sandholm. Gradient-based algorithms
for finding Nash equilibria in extensive form games. In 3rd International Workshop on Internet
and Network Economics (WINE'07), 2007.
[3] D. Koller, N. Megiddo, and B. von Stengel. Fast algorithms for finding randomized strategies
in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing (STOC'94),
pages 750–759, 1994.
[4] Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization
in games with incomplete information. Technical Report TR07-14, University of
Alberta, 2007. http://www.cs.ualberta.ca/research/techreports/2007/
TR07-14.php.
[5] Geoffrey J. Gordon. No-regret algorithms for online convex programs. In Neural Information
Processing Systems 19, 2007.
[6] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. MIT Press, 1994.
[7] Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium.
Econometrica, 68(5):1127–1150, September 2000.
[8] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics,
6:1–8, 1956.
[9] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. Gambling in a rigged
casino: The adversarial multi-arm bandit problem. In 36th Annual Symposium on Foundations
of Computer Science, pages 322–331, 1995.
[10] Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling
for regret minimization in extensive games. Technical Report TR09-15, University of
Alberta, 2009. http://www.cs.ualberta.ca/research/techreports/2009/
TR09-15.php.
[11] S. M. Ross. Goofspiel: the game of pure strategy. Journal of Applied Probability, 8(3):621–625,
1971.
[12] Geoffrey J. Gordon. No-regret algorithms for structured prediction problems. Technical Report
CMU-CALD-05-112, Carnegie Mellon University, 2005.
[13] H. W. Kuhn. Simplified two-person poker. Contributions to the Theory of Games, 1:97–103,
1950.
[14] Rufus Isaacs. Differential Games: A Mathematical Theory with Applications to Warfare and
Pursuit, Control and Optimization. John Wiley & Sons, 1965.