Regret Minimization in Games with Incomplete

Information

Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione

Computing Science Department, University of Alberta
Edmonton, AB, Canada T6G 2E8

Abstract

Extensive games are a powerful model of multiagent decision-making scenarios

with incomplete information. Finding a Nash equilibrium for very large instances

of these games has received a great deal of recent attention. In this paper, we

describe a new technique for solving large games based on regret minimization.

In particular, we introduce the notion of counterfactual regret, which exploits the

degree of incomplete information in an extensive game. We show how minimizing

counterfactual regret minimizes overall regret, and therefore in self-play can be

used to compute a Nash equilibrium. We demonstrate this technique in the domain

of poker, showing we can solve abstractions of limit Texas Hold'em with as many
as $10^{12}$ states, two orders of magnitude larger than previous methods.

1 Introduction

Extensive games are a natural model for sequential decision-making in the presence of other

decision-makers, particularly in situations of imperfect information, where the decision-makers have

differing information about the state of the game. As with other models (e.g., MDPs and POMDPs),

its usefulness depends on the ability of solution techniques to scale well in the size of the model. So-

lution techniques for very large extensive games have received considerable attention recently, with

poker becoming a common measuring stick for performance. Poker games can be modeled very

naturally as an extensive game, with even small variants, such as two-player, limit Texas Hold'em,
being impractically large with just under $10^{18}$ game states.

State of the art in solving extensive games has traditionally made use of linear programming using a

realization plan representation [1]. The representation is linear in the number of game states, rather

than exponential, but considerable additional technology is still needed to handle games the size of

poker. Abstraction, both hand-chosen [2] and automated [3], is commonly employed to reduce the

game from $10^{18}$ to a tractable number of game states (e.g., $10^7$), while still producing strong poker

programs. In addition, dividing the game into multiple subgames each solved independently or in

real-time has also been explored [2, 4]. Solving larger abstractions yields better approximate Nash

equilibria in the original game, making techniques for solving larger games the focus of research

in this area. Recent iterative techniques have been proposed as an alternative to the traditional

linear programming methods. These techniques have been shown capable of finding approximate
solutions to abstractions with as many as $10^{10}$ game states [5, 6, 7], resulting in the first significant

improvement in poker programs in the past four years.


In this paper we describe a new technique for finding approximate solutions to large extensive games.

The technique is based on regret minimization, using a new concept called counterfactual regret. We

show that minimizing counterfactual regret minimizes overall regret, and therefore can be used to

compute a Nash equilibrium. We then present an algorithm for minimizing counterfactual regret

in poker. We use the algorithm to solve poker abstractions with as many as $10^{12}$ game states, two

orders of magnitude larger than previous methods. We also show that this translates directly into

an improvement in the strength of the resulting poker playing programs. We begin with a formal

description of extensive games followed by an overview of regret minimization and its connections

to Nash equilibria.

2 Extensive Games, Nash Equilibria, and Regret

Extensive games provide a general yet compact model of multiagent interaction, which explicitly

represents the often sequential nature of these interactions. Before presenting the formal definition,
we first give some intuitions. The core of an extensive game is a game tree just as in perfect infor-

mation games (e.g., Chess or Go). Each non-terminal game state has an associated player choosing

actions and every terminal state has associated payoffs for each of the players. The key difference

is the additional constraint of information sets, which are sets of game states that the controlling

player cannot distinguish and so must choose actions for all such states with the same distribution.

In poker, for example, the first player to act does not know which cards the other players were dealt,
and so all game states immediately following the deal where the first player holds the same cards

would be in the same information set. We now describe the formal model as well as notation that

will be useful later.

Definition 1 [8, p. 200] A finite extensive game with imperfect information has the following components:

A finite set $N$ of players.

A finite set $H$ of sequences, the possible histories of actions, such that the empty sequence is in $H$ and every prefix of a sequence in $H$ is also in $H$. $Z \subseteq H$ are the terminal histories (those which are not a prefix of any other sequences). $A(h) = \{a : (h, a) \in H\}$ are the actions available after a nonterminal history $h \in H$.

A function $P$ that assigns to each nonterminal history (each member of $H \setminus Z$) a member of $N \cup \{c\}$. $P$ is the player function. $P(h)$ is the player who takes an action after the history $h$. If $P(h) = c$ then chance determines the action taken after history $h$.

A function $f_c$ that associates with every history $h$ for which $P(h) = c$ a probability measure $f_c(\cdot \mid h)$ on $A(h)$ ($f_c(a \mid h)$ is the probability that $a$ occurs given $h$), where each such probability measure is independent of every other such measure.

For each player $i \in N$ a partition $\mathcal{I}_i$ of $\{h \in H : P(h) = i\}$ with the property that $A(h) = A(h')$ whenever $h$ and $h'$ are in the same member of the partition. For $I \in \mathcal{I}_i$ we denote by $A(I)$ the set $A(h)$ and by $P(I)$ the player $P(h)$ for any $h \in I$. $\mathcal{I}_i$ is the information partition of player $i$; a set $I \in \mathcal{I}_i$ is an information set of player $i$.

For each player $i \in N$ a utility function $u_i$ from the terminal states $Z$ to the reals $\mathbb{R}$. If $N = \{1, 2\}$ and $u_1 = -u_2$, it is a zero-sum extensive game. Define $\Delta_{u,i} = \max_z u_i(z) - \min_z u_i(z)$ to be the range of utilities to player $i$.

Note that the partitions of information as described can result in some odd and unrealistic situations

where a player is forced to forget her own past decisions. If all players can recall their previous

actions and the corresponding information sets, the game is said to be one of perfect recall. This

work will focus on finite, zero-sum extensive games with perfect recall.

2.1 Strategies

A strategy of player $i$, $\sigma_i$, in an extensive game is a function that assigns a distribution over $A(I)$ to each $I \in \mathcal{I}_i$, and $\Sigma_i$ is the set of strategies for player $i$. A strategy profile $\sigma$ consists of a strategy for each player, $\sigma_1, \sigma_2, \ldots$, with $\sigma_{-i}$ referring to all the strategies in $\sigma$ except $\sigma_i$.


Let $\pi^\sigma(h)$ be the probability of history $h$ occurring if players choose actions according to $\sigma$. We can decompose $\pi^\sigma(h) = \prod_{i \in N \cup \{c\}} \pi^\sigma_i(h)$ into each player's contribution to this probability. Hence, $\pi^\sigma_i(h)$ is the probability that if player $i$ plays according to $\sigma$ then for all histories $h'$ that are a proper prefix of $h$ with $P(h') = i$, player $i$ takes the corresponding action in $h$. Let $\pi^\sigma_{-i}(h)$ be the product of all players' contributions (including chance) except that of player $i$. For $I \subseteq H$, define $\pi^\sigma(I) = \sum_{h \in I} \pi^\sigma(h)$ as the probability of reaching a particular information set given $\sigma$, with $\pi^\sigma_i(I)$ and $\pi^\sigma_{-i}(I)$ defined similarly.

The overall value to player $i$ of a strategy profile is then the expected payoff of the resulting terminal node, $u_i(\sigma) = \sum_{h \in Z} u_i(h)\, \pi^\sigma(h)$.
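As a concrete illustration of this notation, the following sketch (the history encoding and all function names are invented for illustration, not taken from the paper) computes a reach probability and the resulting expected payoff:

    def reach_probability(history, strategy):
        """pi^sigma(h): product of the acting players' (and chance's) probabilities
        of the actions actually taken along `history`. Each step of `history` is a
        (player, info_set, action) triple; `strategy[(player, info_set)]` is a
        distribution over actions. Restricting the product to one player's steps
        would give that player's contribution pi_i^sigma(h)."""
        prob = 1.0
        for player, info_set, action in history:
            prob *= strategy[(player, info_set)][action]
        return prob

    def expected_utility(terminal_histories, utility, strategy):
        """u_i(sigma) = sum over terminal histories z of u_i(z) * pi^sigma(z).
        `utility(z)` returns player i's payoff at terminal history z."""
        return sum(utility(z) * reach_probability(z, strategy) for z in terminal_histories)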

2.2 Nash Equilibrium

The traditional solution concept of a two-player extensive game is that of a Nash equilibrium. A Nash equilibrium is a strategy profile $\sigma$ where

$$u_1(\sigma) \ge \max_{\sigma'_1 \in \Sigma_1} u_1(\sigma'_1, \sigma_2) \qquad u_2(\sigma) \ge \max_{\sigma'_2 \in \Sigma_2} u_2(\sigma_1, \sigma'_2). \qquad (1)$$

An approximation of a Nash equilibrium or $\epsilon$-Nash equilibrium is a strategy profile $\sigma$ where

$$u_1(\sigma) + \epsilon \ge \max_{\sigma'_1 \in \Sigma_1} u_1(\sigma'_1, \sigma_2) \qquad u_2(\sigma) + \epsilon \ge \max_{\sigma'_2 \in \Sigma_2} u_2(\sigma_1, \sigma'_2). \qquad (2)$$

2.3 Regret Minimization

Regret is an online learning concept that has triggered a family of powerful learning algorithms. To define this concept, first consider repeatedly playing an extensive game. Let $\sigma^t_i$ be the strategy used by player $i$ on round $t$. The average overall regret of player $i$ at time $T$ is:

$$R^T_i = \frac{1}{T} \max_{\sigma^*_i \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma^*_i, \sigma^t_{-i}) - u_i(\sigma^t) \right) \qquad (3)$$

Moreover, define $\bar{\sigma}^T_i$ to be the average strategy for player $i$ from time 1 to $T$. In particular, for each information set $I \in \mathcal{I}_i$, for each $a \in A(I)$, define:

$$\bar{\sigma}^T_i(I)(a) = \frac{\sum_{t=1}^{T} \pi^{\sigma^t}_i(I)\, \sigma^t(I)(a)}{\sum_{t=1}^{T} \pi^{\sigma^t}_i(I)}. \qquad (4)$$
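In an implementation, the numerator and denominator of Equation 4 are typically maintained as running sums. A minimal sketch in Python follows; the container and function names are illustrative only:

    from collections import defaultdict

    # cum_strategy[I][a] accumulates pi_i^{sigma^t}(I) * sigma^t(I)(a) over the
    # iterations t (the shared weights of the numerator and denominator of Eq. 4).
    cum_strategy = defaultdict(lambda: defaultdict(float))

    def accumulate(info_set, strategy, reach_i):
        """Add one iteration's reach-weighted strategy at `info_set`.
        `strategy` maps actions to probabilities; `reach_i` is pi_i^{sigma^t}(I)."""
        for action, prob in strategy.items():
            cum_strategy[info_set][action] += reach_i * prob

    def average_strategy(info_set, actions):
        """Normalize the accumulated sums to obtain sigma_bar_i^T(I)(a)."""
        totals = cum_strategy[info_set]
        norm = sum(totals[a] for a in actions)
        if norm == 0.0:
            return {a: 1.0 / len(actions) for a in actions}   # never reached: uniform
        return {a: totals[a] / norm for a in actions}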

There is a well-known connection between regret and the Nash equilibrium solution concept.

Theorem 2 In a zero-sum game at time $T$, if both players' average overall regret is less than $\epsilon$, then $\bar{\sigma}^T$ is a $2\epsilon$ equilibrium.

An algorithm for selecting $\sigma^t_i$ for player $i$ is regret minimizing if player $i$'s average overall regret (regardless of the sequence $\sigma^t_{-i}$) goes to zero as $t$ goes to infinity. As a result, regret minimizing algorithms in self-play can be used as a technique for computing an approximate Nash equilibrium. Moreover, an algorithm's bounds on the average overall regret bound the rate of convergence of the approximation.

Traditionally, regret minimization has focused on bandit problems more akin to normal-form games. Although it is conceptually possible to convert any finite extensive game to an equivalent normal-form game, the exponential increase in the size of the representation makes the use of regret algorithms on the resulting game impractical. Recently, Gordon has introduced the Lagrangian Hedging (LH) family of algorithms, which can be used to minimize regret in extensive games by working with the realization plan representation [5]. We also propose a regret minimization procedure that exploits the compactness of the extensive game. However, our technique doesn't require the costly quadratic programming optimization needed with LH, allowing it to scale more easily, while achieving even tighter regret bounds.


3 Counterfactual Regret

The fundamental idea of our approach is to decompose overall regret into a set of additive regret terms, which can be minimized independently. In particular, we introduce a new regret concept for extensive games called counterfactual regret, which is defined on an individual information set. We show that overall regret is bounded by the sum of counterfactual regret, and also show how counterfactual regret can be minimized at each information set independently.

We begin by considering one particular information set $I \in \mathcal{I}_i$ and player $i$'s choices made in that information set. Define $u_i(\sigma, h)$ to be the expected utility given that the history $h$ is reached and then all players play using strategy $\sigma$. Define counterfactual utility $u_i(\sigma, I)$ to be the expected utility given that information set $I$ is reached and all players play using strategy $\sigma$ except that player $i$ plays to reach $I$. This can be thought of as a counterfactual as it is the value to player $i$ of reaching information set $I$ if the player had tried to do so. Finally, for all $a \in A(I)$, define $\sigma|_{I \to a}$ to be a strategy profile identical to $\sigma$ except that player $i$ always chooses action $a$ when in information set $I$. The immediate counterfactual regret is:

$$R^T_{i,\mathrm{imm}}(I) = \frac{1}{T} \max_{a \in A(I)} \sum_{t=1}^{T} \pi^{\sigma^t}_{-i}(I) \left( u_i(\sigma^t|_{I \to a}, I) - u_i(\sigma^t, I) \right) \qquad (5)$$

Intuitively, this is the player's regret in its decisions at information set $I$ in terms of counterfactual utility, with an additional weighting term for the counterfactual probability that $I$ would be reached on that round if the player had tried to do so. As we will often be most concerned about regret when it is positive, let $R^{T,+}_{i,\mathrm{imm}}(I) = \max(R^T_{i,\mathrm{imm}}(I), 0)$ be the positive portion of immediate counterfactual regret.

We can now state our first key result.

Theorem 3 $R^T_i \le \sum_{I \in \mathcal{I}_i} R^{T,+}_{i,\mathrm{imm}}(I)$

The proof is in the appendix.¹ Since minimizing immediate counterfactual regret minimizes the overall regret, it enables us to find an approximate Nash equilibrium by minimizing just the immediate counterfactual regret.

The key feature of immediate counterfactual regret is that it can be minimized by controlling only $\sigma_i(I)$. To this end, we can use Blackwell's algorithm for approachability to minimize this regret independently on each information set. In particular, we maintain for all $I \in \mathcal{I}_i$, for all $a \in A(I)$:

$$R^T_i(I, a) = \frac{1}{T} \sum_{t=1}^{T} \pi^{\sigma^t}_{-i}(I) \left( u_i(\sigma^t|_{I \to a}, I) - u_i(\sigma^t, I) \right) \qquad (6)$$

Define $R^{T,+}_i(I, a) = \max(R^T_i(I, a), 0)$; then the strategy for time $T + 1$ is:

$$\sigma^{T+1}_i(I)(a) = \begin{cases} \dfrac{R^{T,+}_i(I, a)}{\sum_{a' \in A(I)} R^{T,+}_i(I, a')} & \text{if } \sum_{a' \in A(I)} R^{T,+}_i(I, a') > 0 \\[2ex] \dfrac{1}{|A(I)|} & \text{otherwise.} \end{cases} \qquad (7)$$

In other words, actions are selected in proportion to the amount of positive counterfactual regret for not playing that action. If no actions have any positive counterfactual regret, then an action is selected uniformly at random (a small implementation sketch of this rule is given at the end of this section). This leads us to our second key result.

Theorem 4 If player $i$ selects actions according to Equation 7 then $R^T_{i,\mathrm{imm}}(I) \le \Delta_{u,i}\sqrt{|A_i|}/\sqrt{T}$ and consequently $R^T_i \le \Delta_{u,i}|\mathcal{I}_i|\sqrt{|A_i|}/\sqrt{T}$, where $|A_i| = \max_{h : P(h) = i} |A(h)|$.

The proof is in the appendix. This result establishes that the strategy in Equation 7 can be used in self-play to compute a Nash equilibrium. In addition, the bound on the average overall regret is linear in the number of information sets. These bounds are similar to what is achievable by Gordon's Lagrangian Hedging algorithms. Meanwhile, minimizing counterfactual regret does not require a costly quadratic program projection on each iteration. In the next section we demonstrate our technique in the domain of poker.
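For concreteness, the action-selection rule of Equation 7 can be sketched as follows. This illustrative snippet stores cumulative rather than time-averaged regrets, which leaves the resulting distribution unchanged since the $1/T$ factors cancel; the names are not from the paper:

    def regret_matching(cum_regret, actions):
        """Current strategy from Equation 7: play each action in proportion to its
        positive (cumulative) counterfactual regret; uniform if none is positive."""
        positive = {a: max(cum_regret.get(a, 0.0), 0.0) for a in actions}
        norm = sum(positive.values())
        if norm > 0.0:
            return {a: r / norm for a, r in positive.items()}
        return {a: 1.0 / len(actions) for a in actions}

In self-play, this rule is applied at every information set of both players on every iteration, and the average strategies of Equation 4 are returned as the approximate equilibrium.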

¹ Material noted as being in the appendix is available in the supporting material accompanying this submission. It will be made available in a companion technical report if this paper is accepted for publication.


4 Application To Poker

We now describe how we use counterfactual regret minimization to compute a near equilibrium solution in the domain of poker. The poker variant we focus on is heads-up limit Texas Hold'em, as it is used in the AAAI Computer Poker Competition [9]. The game consists of two players (zero-sum), four rounds of cards being dealt, and four rounds of betting, and has just under $10^{18}$ game states [2]. As with all previous work on this domain, we will first abstract the game and find an equilibrium of

the abstracted game. In the terminology of extensive games, we will merge information sets; in the

terminology of poker, we will bucket card sequences. The quality of the resulting near equilibrium

solution depends on the coarseness of the abstraction. In general, the less abstraction used, the

higher the quality of the resulting strategy. Hence, the ability to solve a larger game means less

abstraction is required, translating into a stronger poker playing program.

4.1 Abstraction

The goal of abstraction is to reduce the number of information sets for each player to a tractable

size such that the abstract game can be solved. Early poker abstractions [2, 4] involved limiting

the possible sequences of bets, e.g., only allowing three bets per round, or replacing all first-round decisions with a fixed policy. More recently, abstractions involving full four round games with the full four bets per round have proven to be a significant improvement [7, 6]. We also will keep the full game's betting structure and focus abstraction on the dealt cards.

Our abstraction groups together observed card sequences based on a metric called hand strength

squared. Hand strength is the expected probability of winning² given only the cards a player has

seen. This was used a great deal in previous work on abstraction [2, 4]. Hand strength squared

is the expected square of the hand strength after the last card is revealed, given only the cards a

player has seen. Intuitively, hand strength squared is similar to hand strength but gives a bonus to

card sequences whose eventual hand strength has higher variance. Higher variance is preferred as it

means the player eventually will be more certain about their ultimate chances of winning prior to a

showdown. More importantly, we will show in Section 5 that this metric for abstraction results in

stronger poker strategies.
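The following sketch shows one plausible Monte Carlo estimate of the hand strength squared metric; the helpers `complete_board` and `hand_strength` (win probability with ties counting half) are assumed to exist and are not part of the paper:

    def hand_strength_squared(my_cards, board, n_samples=1000):
        """Monte Carlo estimate of E[HS^2] given only the cards a player has seen:
        roll out the remaining community cards, compute hand strength at the river,
        square it, and average over rollouts."""
        total = 0.0
        for _ in range(n_samples):
            river_board = complete_board(my_cards, board)   # assumed: deal remaining board cards
            hs = hand_strength(my_cards, river_board)       # assumed: P(win) + 0.5 * P(tie)
            total += hs * hs
        return total / n_samples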

The final abstraction is generated by partitioning card sequences based on the hand strength squared metric. First, all round-one card sequences (i.e., all private card holdings) are partitioned into ten equally sized buckets based upon the metric. Then, all round-two card sequences that shared a round-one bucket are partitioned into ten equally sized buckets based on the metric now applied at round two. Thus, a partition of card sequences in round two is a pair of numbers: its bucket in the previous round and its bucket in the current round given its bucket in the previous round. This is repeated after each round, continuing to partition card sequences that agreed on the previous rounds' buckets into ten equally sized buckets based on the metric applied in that round. Thus, card sequences are partitioned into bucket sequences: a bucket from $\{1, \ldots, 10\}$ for each round. The resulting abstract game has approximately $1.65 \times 10^{12}$ game states, and $5.73 \times 10^{7}$ information sets. In the full game of poker, there are approximately $9.17 \times 10^{17}$ game states and $3.19 \times 10^{14}$ information sets. So although this represents a significant abstraction of the original game, it is two orders of magnitude larger than previously solved abstractions.
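A simplified sketch of this nested partitioning is given below. It assumes a callback `metric(seq, r)` returning the hand strength squared of a (hashable) card sequence evaluated with the cards seen through round r, and it assigns buckets by rank within each group of sequences agreeing on earlier rounds; both choices are illustrative, not the paper's implementation:

    def bucket_of(value, sorted_values, n_buckets=10):
        """Index of the equal-sized percentile bucket that `value` falls into."""
        rank = sum(v <= value for v in sorted_values)            # 1-based rank among peers
        return min((rank - 1) * n_buckets // len(sorted_values), n_buckets - 1)

    def bucket_sequences(card_sequences, metric, n_rounds=4, n_buckets=10):
        """Assign each card sequence one bucket per round, re-partitioning within
        each group of sequences that agreed on all previous rounds' buckets."""
        assignments = {seq: [] for seq in card_sequences}
        for r in range(n_rounds):
            groups = {}
            for seq in card_sequences:
                groups.setdefault(tuple(assignments[seq]), []).append(seq)
            for group in groups.values():
                values = sorted(metric(seq, r) for seq in group)
                for seq in group:
                    assignments[seq].append(bucket_of(metric(seq, r), values, n_buckets))
        return assignments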

4.2 Minimizing Counterfactual Regret

Now that we have specified an abstraction, we can use counterfactual regret minimization to compute an approximate equilibrium for this game. The basic procedure involves having two players repeatedly play the game using the counterfactual regret minimizing strategy from Equation 7. After $T$ repetitions of the game, or simply iterations, we return $(\bar{\sigma}^T_1, \bar{\sigma}^T_2)$ as the resulting approximate equilibrium. Repeated play requires storing $R^t_i(I, a)$ for every information set $I$ and action $a$, and updating it after each iteration.³

² Where a tie is considered half a win.

³ The bound from Theorem 4 for the basic procedure can actually be made significantly tighter in the specific case of poker. In the appendix, we show that the bound for poker is actually independent of the size of the card abstraction.


For our experiments, we actually use a variation of this basic procedure, which exploits the fact that our abstraction has a small number of information sets relative to the number of game states. Although each information set is crucial, many consist of a hundred or more individual histories. This fact suggests it may be possible to get a good idea of the correct behavior for an information set by only sampling a fraction of the associated game states. In particular, for each iteration, we sample deterministic actions for the chance player. Thus, $\sigma^t_c$ is set to be a deterministic strategy, but chosen according to the distribution specified by $f_c$. For our abstraction this amounts to choosing a joint bucket sequence for the two players. Once the joint bucket sequence is specified, there are only 18,496 reachable states and 6,378 reachable information sets. Since $\pi^{\sigma^t}_{-i}(I)$ is zero for all other information sets, no updates need to be made for these information sets.⁴
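The outer loop of this sampled procedure can be sketched as follows; `sample_joint_bucket_sequence`, `walk_trees` (the paired tree walk of Algorithm 1 in the appendix), and `average_strategies` are assumed helpers rather than functions defined in the paper:

    def solve(root1, root2, n_iterations):
        """Chance-sampled self-play: on each iteration draw one bucket per round
        for each player (a deterministic chance strategy), walk only the
        information sets that sequence makes reachable, and update their regrets
        and average-strategy tables; return the average strategies at the end."""
        for _ in range(n_iterations):
            b = sample_joint_bucket_sequence()        # assumed: sampled according to f_c
            walk_trees(root1, root2, b, p1=1.0, p2=1.0)
        return average_strategies()                   # sigma_bar^T for both players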

This sampling variant allows approximately 750 iterations of the algorithm to be completed in a single second on a single core of a 2.4 GHz Dual Core AMD Opteron 280 processor. In addition, a straightforward parallelization is possible and was used when noted in the experiments. Since betting is public information, the flop-onward information sets for a particular pre-flop betting sequence can be computed independently. With four processors we were able to complete approximately 1700 iterations in one second. The complete algorithmic details with pseudocode can be found in the appendix.

5 Experimental Results

Before discussing the results, it is useful to consider how one evaluates the strength of a near equilibrium poker strategy. One natural method is to measure the strategy's exploitability, or its performance against its worst-case opponent. In a symmetric, zero-sum game like heads-up poker⁵, a perfect equilibrium has zero exploitability, while an $\epsilon$-Nash equilibrium has exploitability $\epsilon$. A convenient measure of exploitability is millibets-per-hand (mb/h), where a millibet is one thousandth of a small bet, the fixed magnitude of bets used in the first two rounds of betting. To provide some intuition for these numbers, a player that always folds will lose 750 mb/h, while a player that is 10 mb/h stronger than another would require over one million hands to be 95% certain to have won overall.
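As a rough sanity check of the one-million-hand figure, a normal-approximation argument suffices; the per-hand standard deviation of about 6 small bets (6,000 mb) assumed below is typical of heads-up limit play but is not a number given in the paper:

\[
1.645\,\frac{\sigma}{\sqrt{n}} \le 10\ \text{mb/h}
\quad\Longrightarrow\quad
n \ge \left(\frac{1.645 \times 6000\ \text{mb}}{10\ \text{mb}}\right)^{2} \approx 9.7 \times 10^{5}\ \text{hands}.
\]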

In general, it is intractable to compute a strategy's exploitability within the full game. For strategies in a reasonably sized abstraction it is possible to compute their exploitability within their own abstract game. Such a measure is a useful evaluation of the equilibrium computation technique that was used to generate the strategy. However, it does not imply that the strategy cannot be exploited by a strategy outside of its abstraction. It is therefore common to compare the performance of the strategy in the full game against a battery of known strong poker playing programs. Although positive expected value against an opponent is not transitive, winning against a large and diverse range of opponents suggests a strong program.

We used the sampled counterfactual regret minimization procedure to find an approximate equilibrium for our abstract game as described in the previous section. The algorithm was run for 2 billion iterations ($T = 2 \times 10^9$), or less than 14 days of computation when parallelized across four CPUs. The resulting strategy's exploitability within its own abstract game is 2.2 mb/h. After only 200 million iterations, or less than 2 days of computation, the strategy was already exploitable by less than 13 mb/h. Notice that the algorithm visits only 18,496 game states per iteration. After 200 million iterations each game state has been visited less than 2.5 times on average, yet the algorithm has already computed a relatively accurate solution.

5.1 Scaling the Abstraction

In addition to finding an approximate equilibrium for our large abstraction, we also found approximate equilibria for a number of smaller abstractions. These abstractions used fewer buckets per round to partition the card sequences. In addition to ten buckets, we also solved eight, six, and five bucket variants.

⁴ A regret analysis of this variant in poker is included in the appendix. We show that the quadratic decrease in the cost per iteration only causes a linear increase in the required number of iterations. The experimental results in the next section coincide with this analysis.

⁵ A single hand of poker is not a symmetric game as the order of betting is strategically significant. However, a pair of hands where the betting order is reversed is symmetric.


Figure 1(a):

Abs   Size (x 10^9)   Iterations (x 10^6)   Time (h)   Exp (mb/h)
 5        6.45              100                33          3.4
 6       27.7               200                75          3.1
 8      276                 ***                ***         2.7
10     1646                2000                326†        2.2

†: parallel implementation with 4 CPUs

[Figure 1(b): plot of exploitability (mb/h, y-axis) against iterations in thousands divided by the number of information sets (x-axis), for CFR5, CFR8, and CFR10.]

Figure 1: (a) Number of game states, number of iterations, computation time, and exploitability (in its own abstract game) of the resulting strategy for different sized abstractions. (b) Convergence rates for three different sized abstractions. The x-axis shows the number of iterations divided by the number of information sets in the abstraction.

As these abstractions are smaller, they require fewer iterations to compute a similarly accurate equilibrium. For example, the program computed with the five bucket approximation (CFR5) is about 250 times smaller with just under $10^{10}$ game states. After 100 million iterations, or 33 hours of computation without any parallelization, the final strategy is exploitable by 3.4 mb/h. This is approximately the same size of game solved by recent state-of-the-art algorithms [6, 7] with many days of computation.

Figure 1b shows a graph of the convergence rates for the five, eight, and ten partition abstractions.

The y-axis is exploitability while the x-axis is the number of iterations normalized by the number

of information sets in the particular abstraction being plotted. The rates of convergence almost

exactly coincide showing that, in practice, the number of iterations needed is growing linearly with

the number of information sets. Due to the use of sampled bucket sequences, the time per iteration

is nearly independent of the size of the abstraction. This suggests that, in practice, the overall

computational complexity is only linear in the size of the chosen card abstraction.

5.2 Performance in Full Texas Hold'em

We have noted that the ability to solve larger games means less abstraction is necessary, resulting in an overall stronger poker playing program. We have played our four near equilibrium bots with various abstraction sizes against each other and two other known strong programs: PsOpti4 and S2298. PsOpti4 is a variant of the equilibrium strategy described in [2]. It was the stronger half of Hyperborean, the AAAI 2006 Computer Poker Competition's winning program. It is available under the name SparBot in the entertainment program Poker Academy, published by BioTools. We have calculated strategies that exploit it at 175 mb/h. S2298 is the equilibrium strategy described in [6]. We have calculated strategies that exploit it at 52.5 mb/h. In terms of the size of the abstract game, PsOpti4 is the smallest, consisting of a small number of merged three round games. S2298 restricts the number of bets per round to 3 and uses a five bucket per round card abstraction based on hand strength, resulting in an abstraction slightly smaller than CFR5.

Table 1 shows a cross table with the results of these matches. Strategies from larger abstractions consistently, and significantly, outperform their smaller counterparts. The larger abstractions also consistently exploit weaker bots by a larger margin (e.g., CFR10 wins 19 mb/h more from S2298 than CFR5).

Finally, we also played CFR8 against the four bots that competed in the bankroll portion of the 2006 AAAI Computer Poker Competition, which are available on the competition's benchmark server [9]. The results are shown in Table 2, along with S2298's previously published performance against the same bots [6].


          PsOpti4   S2298   CFR5   CFR6   CFR8   CFR10   Average
PsOpti4       0      -28     -36    -40    -52     -55       -35
S2298        28        0     -17    -24    -30     -36       -13
CFR5         36       17       0     -5    -13     -20         2
CFR6         40       24       5      0     -9     -14         7
CFR8         52       30      13      9      0      -6        16
CFR10        55       36      20     14      6       0        22
Max          55       36      20     14      6       0

Table 1: Winnings in mb/h for the row player in full Texas Hold'em. Matches with PsOpti4 used 10 duplicate matches of 10,000 hands each and are significant to 20 mb/h. Other matches used 10 duplicate matches of 500,000 hands each and are significant to 2 mb/h.

          Hyperborean   BluffBot   Monash   Teddy   Average
S2298          61          113       ***      ***       ***
CFR8          106          ***       ***      ***       385

Table 2: Winnings in mb/h for the row player in full Texas Hold'em.

The program not only beats all of the bots from the competition but does so by a larger margin than S2298.

6 Conclusion

We introduced a new regret concept for extensive games called counterfactual regret. We showed that minimizing counterfactual regret minimizes overall regret, and presented a general and poker-specific algorithm for efficiently minimizing counterfactual regret. We demonstrated the technique in the domain of poker, showing that the technique can compute an approximate equilibrium for abstractions with as many as $10^{12}$ states, two orders of magnitude larger than previous methods. We also showed that the resulting poker playing program outperforms other strong programs, including all of the competitors from the bankroll portion of the 2006 AAAI Computer Poker Competition.

References

[1] D. Koller and N. Megiddo. The complexity of two-person zero-sum games in extensive form. Games and Economic Behavior, pages 528-552, 1992.

[2] D. Billings, N. Burch, A. Davidson, R. Holte, J. Schaeffer, T. Schauenberg, and D. Szafron. Approximating game-theoretic optimal strategies for full-scale poker. In International Joint Conference on Artificial Intelligence, pages 661-668, 2003.

[3] A. Gilpin and T. Sandholm. Finding equilibria in large sequential games of imperfect information. In ACM Conference on Electronic Commerce, 2006.

[4] A. Gilpin and T. Sandholm. A competitive Texas Hold'em poker player via automated abstraction and real-time equilibrium computation. In National Conference on Artificial Intelligence, 2006.

[5] G. Gordon. No-regret algorithms for online convex programs. In Neural Information Processing Systems 19, 2007.

[6] M. Zinkevich, M. Bowling, and N. Burch. A new algorithm for generating strong strategies in massive zero-sum games. In Proceedings of the Twenty-Second Conference on Artificial Intelligence (AAAI), 2007. To appear.

[7] A. Gilpin, S. Hoda, J. Pena, and T. Sandholm. Gradient-based algorithms for finding Nash equilibria in extensive form games. In Proceedings of the Eighteenth International Conference on Game Theory, 2007.

[8] M. Osborne and A. Rubinstein. A Course in Game Theory. The MIT Press, Cambridge, Massachusetts, 1994.

[9] M. Zinkevich and M. Littman. The AAAI computer poker competition. Journal of the International Computer Games Association, 29, 2006. News item.


A Appendix

A.1 Proof of Theorem 3

Define $D(I)$ to be the information sets of player $i$ reachable from $I$ (including $I$). Define $\sigma|_{D(I) \to \sigma'}$ to be a strategy profile equal to $\sigma$ except in the information sets in $D(I)$ where it is equal to $\sigma'$. The full counterfactual regret is:

$$R^T_{i,\mathrm{full}}(I) = \frac{1}{T} \max_{\sigma' \in \Sigma_i} \sum_{t=1}^{T} \pi^{\sigma^t}_{-i}(I) \left( u_i(\sigma^t|_{D(I) \to \sigma'}, I) - u_i(\sigma^t, I) \right) \qquad (8)$$

Again, we define $R^{T,+}_{i,\mathrm{full}}(I) = \max(R^T_{i,\mathrm{full}}(I), 0)$. Moreover, we define $\mathrm{succ}^\sigma_i(I' \mid I, a)$ to be the probability that $I'$ is the next information set of player $i$ visited given that the action $a$ was just selected in information set $I$, and $\sigma$ is the current strategy. If $\sigma$ implies that $I'$ is unreachable because of an action of player $i$, that action is changed to allow $I'$ to be reachable. Define $\mathrm{Succ}_i(I, a)$ to be the set of all possible next information sets of player $i$ visited given that action $a \in A(I)$ was just selected in information set $I$. Define $\mathrm{Succ}_i(I) = \bigcup_{a \in A(I)} \mathrm{Succ}_i(I, a)$.

The following lemma describes the relationship between full and immediate counterfactual regret.

Lemma 5 $R^T_{i,\mathrm{full}}(I) \le R^T_{i,\mathrm{imm}}(I) + \sum_{I' \in \mathrm{Succ}_i(I)} R^{T,+}_{i,\mathrm{full}}(I')$

Proof:

$$R^T_{i,\mathrm{full}}(I) = \frac{1}{T} \max_{a \in A(I)} \max_{\sigma' \in \Sigma_i} \sum_{t=1}^{T} \pi^{\sigma^t}_{-i}(I) \Big( u_i(\sigma^t|_{I \to a}, I) - u_i(\sigma^t, I) + \sum_{I' \in \mathrm{Succ}_i(I, a)} \mathrm{succ}^{\sigma^t}_i(I' \mid I, a) \big( u_i(\sigma^t|_{D(I') \to \sigma'}, I') - u_i(\sigma^t, I') \big) \Big) \qquad (9)$$

$$R^T_{i,\mathrm{full}}(I) \le \frac{1}{T} \max_{a \in A(I)} \sum_{t=1}^{T} \pi^{\sigma^t}_{-i}(I) \left( u_i(\sigma^t|_{I \to a}, I) - u_i(\sigma^t, I) \right) + \frac{1}{T} \max_{a \in A(I)} \max_{\sigma' \in \Sigma_i} \sum_{t=1}^{T} \pi^{\sigma^t}_{-i}(I) \sum_{I' \in \mathrm{Succ}_i(I, a)} \mathrm{succ}^{\sigma^t}_i(I' \mid I, a) \left( u_i(\sigma^t|_{D(I') \to \sigma'}, I') - u_i(\sigma^t, I') \right) \qquad (10)$$

The first part of the expression on the right hand side is the immediate regret. For the second, we know that $\pi^{\sigma^t}_{-i}(I)\, \mathrm{succ}^{\sigma^t}_i(I' \mid I, a) = \pi^{\sigma^t}_{-i}(I')$, and that $u_i(\sigma^t|_{D(I) \to \sigma'}, I') = u_i(\sigma^t|_{D(I') \to \sigma'}, I')$.

$$R^T_{i,\mathrm{full}}(I) \le R^T_{i,\mathrm{imm}}(I) + \frac{1}{T} \max_{a \in A(I)} \max_{\sigma' \in \Sigma_i} \sum_{t=1}^{T} \sum_{I' \in \mathrm{Succ}_i(I, a)} \pi^{\sigma^t}_{-i}(I') \left( u_i(\sigma^t|_{D(I') \to \sigma'}, I') - u_i(\sigma^t, I') \right) \qquad (11)$$

$$R^T_{i,\mathrm{full}}(I) \le R^T_{i,\mathrm{imm}}(I) + \max_{a \in A(I)} \sum_{I' \in \mathrm{Succ}_i(I, a)} \max_{\sigma' \in \Sigma_i} \frac{1}{T} \sum_{t=1}^{T} \pi^{\sigma^t}_{-i}(I') \left( u_i(\sigma^t|_{D(I') \to \sigma'}, I') - u_i(\sigma^t, I') \right) \qquad (12)$$

$$R^T_{i,\mathrm{full}}(I) \le R^T_{i,\mathrm{imm}}(I) + \max_{a \in A(I)} \sum_{I' \in \mathrm{Succ}_i(I, a)} R^T_{i,\mathrm{full}}(I') \qquad (13)$$

Because the game is perfect recall, given distinct $a, a' \in A(I)$, $\mathrm{Succ}_i(I, a)$ and $\mathrm{Succ}_i(I, a')$ are disjoint. If we define $\mathrm{Succ}_i(I) = \bigcup_{a \in A(I)} \mathrm{Succ}_i(I, a)$, then:

$$R^T_{i,\mathrm{full}}(I) \le R^T_{i,\mathrm{imm}}(I) + \sum_{I' \in \mathrm{Succ}_i(I)} R^{T,+}_{i,\mathrm{full}}(I') \qquad (14)$$

We prove Theorem 3 by using a lemma that can be proven recursively:

Lemma 6 $R^T_{i,\mathrm{full}}(I) \le \sum_{I' \in D(I)} R^{T,+}_{i,\mathrm{imm}}(I')$.

Proof: We prove this for a particular game recursively on the size of $D(I)$. Observe that if an information set has no successors, then Lemma 5 proves the result. We use this as a basis step. Also, observe that $D(I) = \{I\} \cup \bigcup_{I' \in \mathrm{Succ}_i(I)} D(I')$, and that if $I' \in \mathrm{Succ}_i(I)$, then $I \notin D(I')$, implying $D(I') \subsetneq D(I)$. Thus, by induction we can establish that:

$$R^T_{i,\mathrm{full}}(I) \le R^T_{i,\mathrm{imm}}(I) + \sum_{I' \in \mathrm{Succ}_i(I)} \sum_{I'' \in D(I')} R^{T,+}_{i,\mathrm{imm}}(I'') \qquad (15)$$

$$\le R^{T,+}_{i,\mathrm{imm}}(I) + \sum_{I' \in \mathrm{Succ}_i(I)} \sum_{I'' \in D(I')} R^{T,+}_{i,\mathrm{imm}}(I'') \qquad (16)$$

Because the game is perfect recall, for any distinct $I', I'' \in \mathrm{Succ}_i(I)$, $D(I')$ and $D(I'')$ are disjoint. Therefore:

$$R^{T,+}_{i,\mathrm{imm}}(I) + \sum_{I' \in \mathrm{Succ}_i(I)} \sum_{I'' \in D(I')} R^{T,+}_{i,\mathrm{imm}}(I'') = \sum_{I' \in D(I)} R^{T,+}_{i,\mathrm{imm}}(I') \qquad (17)$$

The result immediately follows.

Proof (of Theorem 3): If $P(\emptyset) = i$, then $R^T_{i,\mathrm{full}}(\emptyset) = R^T_i$, and the theorem follows from Lemma 6. If this is not the case, then we can simply add a new information set at the beginning of the game, where player $i$ only has one action.

A.2 Regret Matching

Blackwell's approachability theorem when applied to minimizing regret is known as regret matching. In general, regret matching can be defined in a domain where there is a fixed set of actions $A$, a function $u^t : A \to \mathbb{R}$, and on each round a distribution over the actions $p^t$ is selected.

Define the regret of not playing action $a \in A$ until time $T$ as:

$$R^T(a) = \frac{1}{T} \sum_{t=1}^{T} \Big( u^t(a) - \sum_{a' \in A} p^t(a')\, u^t(a') \Big) \qquad (18)$$

and define $R^{t,+}(a) = \max(R^t(a), 0)$. To apply regret matching, one chooses the distribution:

$$p^t(a) = \begin{cases} \dfrac{R^{t-1,+}(a)}{\sum_{a' \in A} R^{t-1,+}(a')} & \text{if } \sum_{a' \in A} R^{t-1,+}(a') > 0 \\[2ex] \dfrac{1}{|A|} & \text{otherwise} \end{cases} \qquad (19)$$

Theorem 7 If $\Delta_u = \max_{t \in \{1 \ldots T\}} \max_{a, a' \in A} (u^t(a) - u^t(a'))$, the regret of the regret matching algorithm is bounded by:

$$\max_{a \in A} R^T(a) \le \frac{\Delta_u \sqrt{|A|}}{\sqrt{T}} \qquad (20)$$

Blackwell's original result focused on the case where an action (or vector) is chosen at random (instead of a distribution over actions) and gave a probabilistic guarantee. The result above focuses on the distributions selected, and is more applicable to a scenario where a probability is selected instead of an action.

For a proof, see [5].

A.3 Proof of Theorem 4

Observe that Equation 7 is an implementation of regret matching. Moreover, observe that for all $I \in \mathcal{I}_i$ and $a \in A(I)$, $\pi^{\sigma^t}_{-i}(I)\left( u_i(\sigma^t|_{I \to a}, I) - u_i(\sigma^t, I) \right) \le \Delta_{u,i}$. Therefore, Theorem 7 states that the counterfactual regret of that node will be less than $\Delta_{u,i}\sqrt{|A(I)|}/\sqrt{T} \le \Delta_{u,i}\sqrt{|A_i|}/\sqrt{T}$. Summing over all $I \in \mathcal{I}_i$ yields the result.

A.4 Poker-Specific Implementation

We need to iterate over all of the information sets reachable given the joint bucket sequence, and compute probabilities and regrets. In order to do this swiftly, we represent the data in each information set in a player view tree: in other words, we never explicitly represent every state in the abstracted game; instead, we represent the information sets for each player in its own tree, with each node $n$ being one of four types:

Bucket Nodes: nodes representing where information about the cards is observed. Has a child node (an opponent or player node) for each different class that could be observed at that point.

Opponent Nodes: nodes representing where the opponent takes an action. Has a child node for each action.

Player Nodes: nodes representing where the current player takes an action. Contains the average regret with respect to each action, the total probability for each action until this point, and a child node for each action (either an opponent, bucket, or terminal node). There is an implicit information set associated with this node, which we will write as $I(n)$.

Terminal Nodes: nodes where the game ends due to someone folding or a showdown. Given the probability of a win, loss, and tie, has sufficient information to compute an expected utility for the hand given that the node was reached.

Each player observes different pieces of information about the game, and therefore travels to a different part of its tree during the computation. Our algorithm recurses over both trees in a paired fashion. Before we begin, define $u'_i(\sigma, I) = \pi^\sigma_{-i}(I)\, u_i(\sigma, I)$. For each node in the trees, there will be a value $u_i(\sigma, n)$ which we use in order to compute the values $u_i(\sigma, I)$ and $u_i(\sigma, I, a)$, the expected value given that information set $I$ is reached and action $a$ is taken.

Algorithm 1 WalkTrees($r_1$, $r_2$, $b$, $p_1$, $p_2$)

Require: A node $r_1$ of the information set tree for player 1.
Require: A node $r_2$ of the information set tree for player 2.
Require: A joint bucket sequence $b$.
Require: A probability $p_1$ of player 1 playing to reach the node.
Require: A probability $p_2$ of player 2 playing to reach the node.
Ensure: The utility $u_i(\sigma, r_i)$ for player 1 and player 2.

1: if $r_1$ is a player node (meaning $r_2$ is an opponent node) then
2:   Compute $\sigma_1(I(r_1))$ according to Equation 7.
3:   for each action $a \in A(I(r_1))$ do
4:     Find the associated child $c_1$ of $r_1$ and $c_2$ of $r_2$.
5:     Compute $u_1(\sigma, I(r_1), a)$ and $u_2(\sigma, r_2, a)$ from WalkTrees($c_1$, $c_2$, $b$, $p_1 \cdot \sigma_1(I(r_1))(a)$, $p_2$).
6:   end for
7:   Compute $u_1(\sigma, I(r_1)) = \sum_{a \in A(I(r_1))} \sigma_1(I(r_1))(a)\, u_1(\sigma, I(r_1), a)$.
8:   for each action $a \in A(I(r_1))$ do
9:     $R_1(I, a) \leftarrow \frac{1}{T+1}\left( T \cdot R_1(I, a) + p_2\left( u_1(\sigma, I(r_1), a) - u_1(\sigma, I(r_1)) \right) \right)$
10:  end for
11:  Set $u_1(\sigma, r_1) = u_1(\sigma, I(r_1))$.
12:  Compute $u_2(\sigma, r_2) = \sum_{a \in A(I(r_1))} \sigma_1(I(r_1))(a)\, u_2(\sigma, r_2, a)$.
13: else if $r_2$ is a player node (meaning $r_1$ is an opponent node) then
14:   Do the opposite of the above (swap the roles of the two players).
15: else if $r_1$ is a bucket node then
16:   Choose the child $c_1$ of $r_1$ according to the class in $b$ for player 1 on the appropriate round, and the child $c_2$ of $r_2$ similarly.
17:   Find $u_1(\sigma, c_1)$ and $u_2(\sigma, c_2)$ from WalkTrees($c_1$, $c_2$, $b$, $p_1$, $p_2$).
18:   Set $u_1(\sigma, r_1) = u_1(\sigma, c_1)$ and $u_2(\sigma, r_2) = u_2(\sigma, c_2)$.
19: else if $r_1$ is a terminal node then
20:   Find $u_1(\sigma, r_1)$ and $u_2(\sigma, r_2)$, the utility of each player if this node is actually reached.
21: end if

A.5 Poker-Specific Analysis

We first analyze the non-sampling algorithm from Section 3, significantly tightening the presented regret bounds for the specific case of poker games. We then give a regret analysis for the sampling implementation described in Section 4 and used in the experimental results presented in Section 5.

A.5.1 Non-Sampling Algorithm

In Section 3, we discussed Blackwell's Approachability Theorem being applied in every information set. The disadvantage of such an algorithm is that every iteration involves a walk across the entire game tree. The advantage of such an algorithm is that it converges really quickly in terms of iterations. In this analysis, we focus on poker.

If we can bound the difference in any two counterfactual utilities at every information set, we can achieve a bound on the overall regret, because Blackwell's Approachability Theorem gives a guarantee based upon this. In particular, after $T$ time steps, if the bound for the counterfactual utility at an information set is $\Delta_{u,1}(I)$, and there are $|A(I)|$ actions, then the counterfactual regret is bounded by:

$$R^T_1(I) \le \frac{\Delta_{u,1}(I)\sqrt{|A(I)|}}{\sqrt{T}} \qquad (21)$$

By Theorem 3, this means the average overall regret is bounded by:

$$R^T_1 \le \sum_{I \in \mathcal{I}_1} \frac{\Delta_{u,1}(I)\sqrt{|A(I)|}}{\sqrt{T}} \qquad (22)$$

First of all, define $\Delta_{u,1}$ to be the overall range of utilities in limit poker (48 small bets/hand). In particular, given $\pi_0(I)$ (the probability of chance acting to reach a node), $\Delta_{u,1}(I) \le \pi_0(I)\, \Delta_{u,1}$. In limit one could be more precise, because any information set that begins with both players checking on the pre-flop has a tighter limit on the maximum won or lost, but bounding based on chance nodes is more crucial. In the next step, we leverage the structure of poker: in particular, the fact that all actions are observable. Define $B_1$ to be the set of all betting sequences where the first player has to act: in particular, $B_1$ can be considered a partition of the information sets $\mathcal{I}_1$ (such that each $B \in B_1$ is a set of information sets). Note that, for all $B \in B_1$:

$$\sum_{I \in B} \pi_0(I) = 1 \qquad (23)$$

Moreover, observe that we can define $A(B)$ to be the set of actions available at any information set in $B$. Applying these concepts to the equation:

$$R^T_1 \le \sum_{B \in B_1} \frac{\sqrt{|A(B)|}\, \Delta_{u,1}}{\sqrt{T}} \qquad (24)$$

$$R^T_1 \le \frac{\Delta_{u,1}}{\sqrt{T}} \sum_{B \in B_1} \sqrt{|A(B)|} \qquad (25)$$

Thus, increasing the size of the card abstraction does not affect the rate of convergence. This is not as surprising as one might think: if one imagined $n$ independent algorithms minimizing regret, each with a bound on their utility of $\Delta_{u,1}$, then one would expect that the theoretical bound on the average of the algorithms would closely resemble the theoretical bound on the average of one particular algorithm. This is very similar to what was leveraged in this section. However, the number of information sets does have an effect on the cost of an iteration: each game state in the abstraction must be traversed in every iteration. This is the primary motivation for WalkTrees.

A.5.2 Sampling Algorithm

In order to analyze WalkTrees, we focus on two different measures of regret:

1. $\tilde{R}$, the regret measured by the algorithm.
2. $R$, the underlying regret (if all states were visited every iteration).

In this implementation, the range of counterfactual utilities can be $\Delta_{u,1}$ in almost every state. Define $C^T(I)$ to be the number of times an information set $I$ was visited until time $T$: in particular, how many times the bucket sequence that makes $I$ reachable was selected. Blackwell's Approachability Theorem yields us:

$$\tilde{R}^T_1(I) \le \frac{\Delta_{u,1}(I)\sqrt{|A(I)|}\sqrt{C^T(I)}}{T} \qquad (26)$$

Note that this is the average regret bound for $C^T(I)$ iterations averaged over $T$ iterations. Observe that for any $B \in B_1$ (see Section A.5.1), $\sum_{I \in B} C^T(I) = T$. Define $Y = \max_{B \in B_1} |B|$, in other words the number of card partitions on the river. Then $\sum_{I \in B} \sqrt{C^T(I)} \le \sqrt{Y T}$ (this is because, for arbitrary $(a_1, \ldots, a_m) \in \mathbb{R}^m$ where $a_i \ge 0$ and $\sum_{i=1}^m a_i = n$, we have $\sum_{i=1}^m \sqrt{a_i} \le \sqrt{mn}$ by the Cauchy-Schwarz inequality).

$$\sum_{I \in B} \tilde{R}^T_1(I) \le \sum_{I \in B} \frac{\Delta_{u,1}\sqrt{|A(I)|}\sqrt{C^T(I)}}{T} \qquad (27)$$

$$\sum_{I \in B} \tilde{R}^T_1(I) \le \frac{\sqrt{|A(B)|}\, \Delta_{u,1}\sqrt{Y T}}{T} \qquad\qquad \tilde{R}^T_1 \le \frac{\Delta_{u,1}\sqrt{Y}}{\sqrt{T}} \sum_{B \in B_1} \sqrt{|A(B)|} \qquad (28)$$
