Policy training in Alpha Zero, LC0 ..

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Policy training in Alpha Zero, LC0 ..

Post by chrisw »

Is there any source, or pseudo-code, kicking around for policy training, AZ or LC0?
I'm a bit confused by the idea I have that the policy net gets a 1.0 training target kick for every actual move from a training game, e.g. that the policy net is a move-played-frequency black box. That seems intuitively fine if the actual move leads to a rollout win, but if it leads to a rollout loss, isn't the policy network then just being trained to follow down the same losing path again?
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Policy training in Alpha Zero, LC0 ..

Post by AlvaroBegue »

The code for LC0 is publicly available.

The point of the training is that the policy network is learning to guess the result of a search without searching. In order to learn something useful, you don't need the ultimate oracle; it's enough to have access to data of better quality than what you can currently do statically, and the result of a search fits the bill.
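To make that concrete, here is a minimal sketch (plain Python/NumPy, not the actual LC0 training code; the names are just illustrative) of the kind of per-position loss AlphaZero-style training minimises: squared error between the value head and the game result, plus cross-entropy between the policy head and the search's move distribution.

import numpy as np

def training_loss(value_pred, game_result, policy_pred, search_probs):
    """AlphaZero-style per-position loss (L2 regularisation term omitted).

    value_pred:   scalar value-head output in [-1, 1]
    game_result:  z, the final result of the game from this side's point of view
    policy_pred:  vector of move probabilities from the policy head
    search_probs: pi, the move distribution produced by the search
    """
    value_loss = (game_result - value_pred) ** 2
    policy_loss = -np.sum(search_probs * np.log(policy_pred + 1e-9))
    return value_loss + policy_loss

# Toy example: three legal moves, the search strongly preferred the first one.
print(training_loss(0.2, 1.0,
                    np.array([0.5, 0.3, 0.2]),
                    np.array([0.8, 0.15, 0.05])))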
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

AlvaroBegue wrote: Tue Dec 18, 2018 1:32 pm The code for LC0 is publicly available.

The point of the training is that the policy network is learning to guess the result of a search without searching. In order to learn something useful, you don't need the ultimate oracle; it's enough to have access to data of better quality than what you can currently do statically, and the result of a search fits the bill.
Yes, that’s understood. The purpose of the policy network is to provide the move most likely to be played from the current node; I guess that’s also what you mean by the result of the search. It’s the evaluation network that is being trained to guess the evaluation component of the search result.
What I’m specifically hunting for is exactly what training data is presented for each position during the fit process. For the value head, 1, 0.5 or 0, as far as I can tell. For the policy head it’s not so clear, or maybe it is clear: a 1 for the actual move (from, to) or (piece, to), 0 for the remaining pseudo-legal moves, and everything else left to float. But there could be other schemes. If it is the above scheme, then a contradiction opens up.

I hunted for the actual training code and can’t find it. There’s code for LC0, the end-user engine that actually plays. There’s code for the game generator. But, unless I am very stupid, which is always possible, the code for training isn’t obviously there. The pseudo-code for AZ also skips over the topic. Do you have a link?
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by trulses »

The label for the policy head is the visit count frequency from the tree search (potentially with a temperature).
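As a rough sketch (plain Python, just to illustrate the idea; not taken from the LC0 code), converting the root visit counts into a policy target with a temperature could look like this:

import numpy as np

def policy_target_from_visits(visit_counts, temperature=1.0):
    """Convert MCTS root visit counts into a policy training target.

    visit_counts: dict mapping move -> visit count at the root of the search
    temperature:  1.0 reproduces the raw visit frequencies; values below 1.0
                  sharpen the distribution toward the most-visited move
    """
    moves = list(visit_counts.keys())
    counts = np.array([visit_counts[m] for m in moves], dtype=np.float64)
    if temperature != 1.0:
        counts = counts ** (1.0 / temperature)
    probs = counts / counts.sum()
    return dict(zip(moves, probs))

# Example: after an 800-node search, the target is the visit-count frequency.
print(policy_target_from_visits({"e2e4": 500, "d2d4": 250, "g1f3": 50}))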
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Policy training in Alpha Zero, LC0 ..

Post by Henk »

The input of a training example consists of the position value plus a probability for each legal move.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

Henk wrote: Tue Dec 18, 2018 5:01 pm The input of a training example consists of the position value plus a probability for each legal move.
And how do you know the probability for each legal move from a PGN?
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Policy training in Alpha Zero, LC0 ..

Post by AlvaroBegue »

chrisw wrote: Tue Dec 18, 2018 5:26 pm
Henk wrote: Tue Dec 18, 2018 5:01 pm The input of a training example consists of the position value plus a probability for each legal move.
And how do you know the probability for each legal move from a PGN?
You don't. In the normal training of AlphaZero and LC0, your training samples are written at the end of a search, and the visit count of each move is available.

If you want to train from PGN files (I understand people have tried things of this sort), you can use 1 for the move played and 0 for every other move.
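As a rough sketch (plain Python, purely illustrative), with a one-hot target like that the policy cross-entropy collapses to the negative log-probability the network assigns to the move that was actually played:

import numpy as np

def policy_loss_from_pgn_move(policy_pred, played_move):
    """Policy loss when only the played move is known (e.g. from a PGN).

    policy_pred: dict mapping each legal move to the network's probability
    played_move: the move actually played in the game

    The target is 1 for the played move and 0 for every other legal move,
    so the cross-entropy reduces to -log p(played_move).
    """
    return -np.log(policy_pred[played_move] + 1e-9)

# Example: the network liked e2e4 more, but the game continued with d2d4.
print(policy_loss_from_pgn_move({"d2d4": 0.4, "e2e4": 0.5, "g1f3": 0.1}, "d2d4"))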
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Policy training in Alpha Zero, LC0 ..

Post by Henk »

chrisw wrote: Tue Dec 18, 2018 5:26 pm
Henk wrote: Tue Dec 18, 2018 5:01 pm The input of a training example consists of the position value plus a probability for each legal move.
And how do you know the probability for each legal move from a PGN?
Use an extended format that contains probabilities for each legal move.
The probabilities per move are constructed by playing training games.
Assume equal probabilities when none are available.
Last edited by Henk on Tue Dec 18, 2018 6:12 pm, edited 1 time in total.
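Purely for illustration (this is a made-up record layout, not LC0's actual training format), such an extended record, with the equal-probability fallback, could look like this:

import json

def make_training_record(fen, result, move_probs=None, legal_moves=None):
    """Build one training record with per-move probabilities.

    If no search probabilities are available, fall back to a uniform
    distribution over the legal moves, as suggested above.
    """
    if move_probs is None:
        move_probs = {m: 1.0 / len(legal_moves) for m in legal_moves}
    return json.dumps({"fen": fen, "result": result, "policy": move_probs})

# Toy example with an abbreviated move list.
# With search probabilities available:
print(make_training_record("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
                           0.0, {"d2d4": 0.55, "e2e4": 0.30, "g1f3": 0.15}))
# Without: equal probability over the legal moves.
print(make_training_record("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
                           0.0, legal_moves=["d2d4", "e2e4", "g1f3"]))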
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

AlvaroBegue wrote: Tue Dec 18, 2018 5:35 pm
chrisw wrote: Tue Dec 18, 2018 5:26 pm
Henk wrote: Tue Dec 18, 2018 5:01 pm The input of a training example consists of the position value plus a probability for each legal move.
And how do you know the probability for each legal move from a PGN?
You don't. In the normal training of AlphaZero and LC0, your training samples are written at the end of a search, and the visit count of each move is available.

If you want to train from PGN files (I understand people have tried things of this sort), you can use 1 for the move played and 0 for every other move.
Yes, but if you download a training game from lczero.org, say, for example, this game: lczero.org/match_game/3188816
and look at the HTML, you'll find the PGN + score embedded, but no more (as below).
So the LCZero server has stripped away the visit counts of the alternative moves, and it's that data (converted to probabilities) that gets used to train the policy head?

new PgnViewer(
{ boardName: "training",
pgnString: '\n1.d4 Nf6 2.Nf3 c6 3.Bf4 Qb6 4.b3 d5 5.e3 c5 6.c4 cxd4 7.exd4 Nc6 8.c5 Qa5\x2b 9.Qd2 Qxd2\x2b 10.Nbxd2 Nb4 11.Bb5\x2b Bd7 12.Bxd7\x2b Nxd7 13.Ke2 f6 14.a3 Nc6 15.b4 a6 16.Nb3 g5 17.Bd2 Kf7 18.a4 e5 19.b5 Nxd4\x2b 20.Nbxd4 exd4 21.c6 bxc6 22.bxc6 Ne5 23.Nxd4 Bc5 24.Nb3 Bd6 25.Rhc1 Rhc8 26.Na5 Rab8 27.Rab1 Rxb1 28.Rxb1 Nxc6 29.Rb6 Nd4\x2b 30.Kd3 Be5 31.Rxa6 Ne6 32.Nb7 Rc4 33.g3 h5 34.a5 Ra4 35.Rb6 Ra3\x2b 36.Kc2 Nd4\x2b 37.Kb2 Rf3 38.Be1 Ne2\x2b 39.Kc2 Nd4\x2b 40.Kb1 Rd3 41.Bb4 Bc7 42.Nc5 Rd1\x2b 43.Kb2 Bxb6 44.axb6 Nc6 45.b7 Rf1 46.Nd3 Ke6 47.Nc5\x2b Kf5 48.Nd3 Ke4 49.Kc2 d4 50.Bd6 Kd5 51.b8=Q Nxb8 52.Bxb8 Ke4 53.Bc7 Rh1 54.h4 gxh4 55.gxh4 Rxh4 56.Bd8 Rh3 57.Nc5\x2b Kf5 58.Nd3 Rh1 59.Kd2 h4 60.Ke2 h3 61.Bc7 Ra1 62.Kf3 Ra3 63.Ke2 Ke4 64.Nc5\x2b Kd5 65.Nd3 Kc4 66.Nb2\x2b Kc3 67.Nd1\x2b Kc2 68.Bd6 Ra1 69.Ne3\x2b dxe3 70.fxe3 Ra4 71.Kf3 Kd3 72.Kg3 Kxe3 73.Kxh3 Kf3 74.Be7 f5 75.Bg5 f4 76.Kh4 Rc4 77.Kh3 Rc1 78.Bxf4 Kxf4 79.Kg2 Rb1 80.Kf2 Rb2\x2b 81.Kg1 Kg3 82.Kf1 Kf3 83.Kg1 Rd2 84.Kf1 Rd1# 0-1',
pieceSet: 'merida',
pieceSize: 55
}
);
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Policy training in Alpha Zero, LC0 ..

Post by Henk »

chrisw wrote: Tue Dec 18, 2018 6:03 pm
AlvaroBegue wrote: Tue Dec 18, 2018 5:35 pm
chrisw wrote: Tue Dec 18, 2018 5:26 pm
Henk wrote: Tue Dec 18, 2018 5:01 pm The input of a training example consists of the position value plus a probability for each legal move.
And how do you know the probability for each legal move from a PGN?
You don't. In the normal training of AlphaZero and LC0, your training samples are written at the end of a search, and the visit count of each move is available.

If you want to train from PGN files (I understand people have tried things of this sort), you can use 1 for the move played and 0 for every other move.
Yes, but if you download a training game from lczero.org, say, for example, this game: lczero.org/match_game/3188816
and look at the HTML, you'll find the PGN + score embedded, but no more (as below).
So the LCZero server has stripped away the visit counts of the alternative moves, and it's that data (converted to probabilities) that gets used to train the policy head?

So if you play d4 in the initial position you lose. Good to know. So in a real game I would play d4 (or something that looks like it) less frequently in the initial position and lower its probability. All by a tiny bit, since it is only one training game.
And if you play Nf6 after d4 you win. So in a real game I would play Nf6 (or something that looks like it) more frequently in the position after d4 and increase its probability. All by a tiny bit, since it is only one training game.
..
And so on.