Policy training in Alpha Zero, LC0 ..

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Policy training in Alpha Zero, LC0 ..

Post by AlvaroBegue »

Henk, you are adding noise to the conversation.

What you are describing is the sort of reinforcement learning that was described in one stage of the initial AlphaGo paper. We are not talking about that here.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Policy training in Alpha Zero, LC0 ..

Post by AlvaroBegue »

chrisw wrote: Tue Dec 18, 2018 6:03 pm
AlvaroBegue wrote: Tue Dec 18, 2018 5:35 pm
chrisw wrote: Tue Dec 18, 2018 5:26 pm
Henk wrote: Tue Dec 18, 2018 5:01 pm Input of a training example consists of position value plus probability for each legal move
And how do you know the probability for each legal move from a PGN?
You don't. In the normal training of AlphaZero and LC0, your training samples are written at the end of a search, and the visit count of each move is available.

If you want to train from PGN files (I understand people have tried things of this sort), you can use 1 for the move played and 0 for everyone else.
Yes, but if you download a training game from lczero.org, say for example this game: lczero.org/match_game/3188816
and look at the HTML, you'll find the PGN + score embedded, but no more (as below).
So has the LCZero server stripped away the visit counts of the alternative moves, and is it that data (converted to probabilities) that gets used to train the policy head?

new PgnViewer(
{ boardName: "training",
pgnString: '\n1.d4 Nf6 2.Nf3 c6 3.Bf4 Qb6 4.b3 d5 5.e3 c5 6.c4 cxd4 7.exd4 Nc6 8.c5 Qa5\x2b 9.Qd2 Qxd2\x2b 10.Nbxd2 Nb4 11.Bb5\x2b Bd7 12.Bxd7\x2b Nxd7 13.Ke2 f6 14.a3 Nc6 15.b4 a6 16.Nb3 g5 17.Bd2 Kf7 18.a4 e5 19.b5 Nxd4\x2b 20.Nbxd4 exd4 21.c6 bxc6 22.bxc6 Ne5 23.Nxd4 Bc5 24.Nb3 Bd6 25.Rhc1 Rhc8 26.Na5 Rab8 27.Rab1 Rxb1 28.Rxb1 Nxc6 29.Rb6 Nd4\x2b 30.Kd3 Be5 31.Rxa6 Ne6 32.Nb7 Rc4 33.g3 h5 34.a5 Ra4 35.Rb6 Ra3\x2b 36.Kc2 Nd4\x2b 37.Kb2 Rf3 38.Be1 Ne2\x2b 39.Kc2 Nd4\x2b 40.Kb1 Rd3 41.Bb4 Bc7 42.Nc5 Rd1\x2b 43.Kb2 Bxb6 44.axb6 Nc6 45.b7 Rf1 46.Nd3 Ke6 47.Nc5\x2b Kf5 48.Nd3 Ke4 49.Kc2 d4 50.Bd6 Kd5 51.b8=Q Nxb8 52.Bxb8 Ke4 53.Bc7 Rh1 54.h4 gxh4 55.gxh4 Rxh4 56.Bd8 Rh3 57.Nc5\x2b Kf5 58.Nd3 Rh1 59.Kd2 h4 60.Ke2 h3 61.Bc7 Ra1 62.Kf3 Ra3 63.Ke2 Ke4 64.Nc5\x2b Kd5 65.Nd3 Kc4 66.Nb2\x2b Kc3 67.Nd1\x2b Kc2 68.Bd6 Ra1 69.Ne3\x2b dxe3 70.fxe3 Ra4 71.Kf3 Kd3 72.Kg3 Kxe3 73.Kxh3 Kf3 74.Be7 f5 75.Bg5 f4 76.Kh4 Rc4 77.Kh3 Rc1 78.Bxf4 Kxf4 79.Kg2 Rb1 80.Kf2 Rb2\x2b 81.Kg1 Kg3 82.Kf1 Kf3 83.Kg1 Rd2 84.Kf1 Rd1# 0-1',
pieceSet: 'merida',
pieceSize: 55
}
);
They are showing you the PGN file, but that's not all the data they collected from that game.

You can see the Python code describing the data format used for training samples here: https://github.com/glinscott/leela-ches ... er.py#L115
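
To make the distinction above concrete, here is a minimal sketch (in Python, not the actual LC0 pipeline) of the two kinds of policy targets being discussed: root visit counts from a search normalised into probabilities, versus a one-hot target when all you have is the move played in a PGN. The move representation and function names are purely illustrative.

# Hedged sketch, not LC0's real data format: the dict-of-visit-counts input
# and the move naming are assumptions made for illustration only.

def policy_from_visits(visit_counts):
    # visit_counts: {move: visits at the root after the search}
    # returns {move: probability}, i.e. visits normalised to sum to 1
    total = sum(visit_counts.values())
    return {move: n / total for move, n in visit_counts.items()}

def policy_from_pgn_move(legal_moves, played_move):
    # one-hot target when only the game score is available:
    # 1 for the move actually played, 0 for every other legal move
    return {move: 1.0 if move == played_move else 0.0 for move in legal_moves}

# example: a root where the search spent most of its visits on Nf6
visits = {"Nf6": 640, "d5": 210, "e6": 100, "c5": 50}
print(policy_from_visits(visits))                # {'Nf6': 0.64, 'd5': 0.21, ...}
print(policy_from_pgn_move(list(visits), "Nf6")) # {'Nf6': 1.0, 'd5': 0.0, ...}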
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

Henk wrote: Tue Dec 18, 2018 6:25 pm
chrisw wrote: Tue Dec 18, 2018 6:03 pm
AlvaroBegue wrote: Tue Dec 18, 2018 5:35 pm
chrisw wrote: Tue Dec 18, 2018 5:26 pm
Henk wrote: Tue Dec 18, 2018 5:01 pm Input of a training example consists of position value plus probability for each legal move
And how do you know the probability for each legal move from a PGN?
You don't. In the normal training of AlphaZero and LC0, your training samples are written at the end of a search, and the visit count of each move is available.

If you want to train from PGN files (I understand people have tried things of this sort), you can use 1 for the move played and 0 for everyone else.
Yes, but if you download a training game from lczero.org, say for example this game: lczero.org/match_game/3188816
and look at the HTML, you'll find the PGN + score embedded, but no more (as below).
So has the LCZero server stripped away the visit counts of the alternative moves, and is it that data (converted to probabilities) that gets used to train the policy head?

new PgnViewer(
{ boardName: "training",
pgnString: '1.d4 Nf6 2.Nf3 c6 3.Bf4 Qb6 ... 84.Kf1 Rd1# 0-1',
pieceSet: 'merida',
pieceSize: 55
}
);
So if you play d4 in the initial position, you lose. Good to know. So in a real game I would play d4 less frequently in the initial position (or a lookalike) and lower its probability. All by a tiny bit, since it is only one training game.
So if you play Nf6 after d4, you win. So in a real game I would play Nf6 more frequently in the position after d4 (or a lookalike) and increase its probability. All by a tiny bit, since it is only one training game.
..
And so on.
Alvaro's viewpoint, though, says that after 1. d4 there are twenty possible replies, and as a result of searching to find the actual game reply, each reply will have a visit count. It's that visit count (converted to a probability) that is used to train the policy. All illegal and impossible moves are presumably given a visit count of zero.

Just a passing thought, but isn't this breaching the zero-rule? The net is not given a map of pseudolegal moves at the inputs (pseudolegal moves are basically an attack map, and therefore non-zero information), but it is given them at the outputs, and won't these zeroed outputs basically back-propagate down the layers as position attack-map info?
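
For concreteness, a rough sketch of what those zeroed outputs look like as a training target: the policy head emits a fixed-length vector over every encodable move, and the target built from visit counts is zero everywhere except at the slots of moves the search actually visited. The vector size and move_to_index mapping below are placeholders, not the real LC0 encoding.

import numpy as np

NUM_MOVE_SLOTS = 1858   # placeholder size for the fixed move encoding, not LC0's actual constant

def make_policy_target(visit_counts, move_to_index):
    # visit_counts: {move: root visit count}; move_to_index: move -> slot index
    # every slot not visited (including every illegal move) stays at 0
    target = np.zeros(NUM_MOVE_SLOTS, dtype=np.float32)
    total = sum(visit_counts.values())
    for move, n in visit_counts.items():
        target[move_to_index[move]] = n / total
    return target

Note that in this target an illegal move and a legal-but-unvisited move look identical (both zero), which is part of what the question above is getting at.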
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

AlvaroBegue wrote: Tue Dec 18, 2018 6:51 pm
chrisw wrote: Tue Dec 18, 2018 6:03 pm
AlvaroBegue wrote: Tue Dec 18, 2018 5:35 pm
chrisw wrote: Tue Dec 18, 2018 5:26 pm
Henk wrote: Tue Dec 18, 2018 5:01 pm Input of a training example consists of position value plus probability for each legal move
And how do you know the probability for each legal move from a PGN?
You don't. In the normal training of AlphaZero and LC0, your training samples are written at the end of a search, and the visit count of each move is available.

If you want to train from PGN files (I understand people have tried things of this sort), you can use 1 for the move played and 0 for everyone else.
Yes, but if you download a training game from lczero.org, say for example this game: lczero.org/match_game/3188816
and look at the HTML, you'll find the PGN + score embedded, but no more (as below).
So has the LCZero server stripped away the visit counts of the alternative moves, and is it that data (converted to probabilities) that gets used to train the policy head?

new PgnViewer(
{ boardName: "training",
pgnString: '1.d4 Nf6 2.Nf3 c6 3.Bf4 Qb6 ... 84.Kf1 Rd1# 0-1',
pieceSet: 'merida',
pieceSize: 55
}
);
They are showing you the PGN file, but that's not all the data they collected from that game.

You can see the Python code describing the data format used for training samples here: https://github.com/glinscott/leela-ches ... er.py#L115
Ah! Thanks! So the crowd-sourced end users send training games to the LCZero server with all this extra information. A back-of-a-fag-packet calculation suggests each PGN grows from maybe 500 bytes to 100 moves x 30 children each x 5? bytes of visit count = 15,000 bytes, call it a thirty-fold size increase. 100x10^6 training games at 15 KB each gives 1.5x10^12 bytes, 1.5 TB (I didn't check my maths, so maybe out). That's a nice data set. Where are these stored? Is there a link?
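
The same back-of-the-envelope arithmetic, spelled out with the guessed figures from above (all of them rough assumptions, not measured values):

# rough size estimate using the guessed per-game figures from the post above
moves_per_game = 100
children_per_position = 30
bytes_per_visit_count = 5
games = 100 * 10**6

bytes_per_game = moves_per_game * children_per_position * bytes_per_visit_count  # 15,000 bytes, ~30x a 500-byte PGN
total_bytes = games * bytes_per_game                                             # 1.5e12 bytes, about 1.5 TB
print(bytes_per_game, total_bytes)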
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by trulses »

chrisw wrote: Tue Dec 18, 2018 7:00 pm...

Just a passing thought, but isn’t this breaching the zero-rule?
I think knowing which moves are legal falls under "being given perfect knowledge of the game rules".
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

trulses wrote: Tue Dec 18, 2018 7:38 pm
chrisw wrote: Tue Dec 18, 2018 7:00 pm...

Just a passing thought, but isn’t this breaching the zero-rule?
I think knowing which moves are legal falls under "being given perfect knowledge of the game rules".
By that reasoning, it would not be breaching the zero-rule to give a map of all legal moves at the inputs.
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by trulses »

I agree.
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

trulses wrote: Tue Dec 18, 2018 8:09 pm I agree.
The legal-moves list is an attack map and, because of the way it is encoded, a weighted attack map, though only for one side.
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Policy training in Alpha Zero, LC0 ..

Post by Henk »

AlvaroBegue wrote: Tue Dec 18, 2018 6:45 pm Henk, you are adding noise to the conversation.

What you are describing is the sort of reinforcement learning that was described in one stage of the initial AlphaGo paper. We are not talking about that here.
Ok, then I only understand the reinforcement learning of AlphaGo. I thought I was talking about AlphaZero. By the way, I removed the policy from my network early on because it cost too many resources. At this moment I even think implementing the conv network was just a waste of time. It cost me a year.
Same conclusion as in 1994: neural networks are just a waste of time. Maybe you can use them for solving very small problems with a few parameters.
Collecting training examples is far too expensive, not to mention finding the right network configuration.
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by trulses »

chrisw wrote: Tue Dec 18, 2018 8:27 pm
trulses wrote: Tue Dec 18, 2018 8:09 pm I agree.
The legal-moves list is an attack map and, because of the way it is encoded, a weighted attack map, though only for one side.
Unless you're talking about the policy label, you're not discriminating "bad" vs "good" moves by just providing the legal moves, so I'm not sure what you mean by weighted. Shouldn't all legal moves have the same weight in your input encoding? Just so we're clear, I'm not suggesting that anyone actually try this, because it would be expensive in the number of input planes and I doubt it would add much strength.

You're already taking advantage of the legal-move information in your search, both in which nodes you add to the tree and in how you calculate your prior probabilities, so I don't see how it violates any rules.
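
As a hedged illustration of that last point (names and shapes are made up, not taken from the LC0 source): at expansion time the search already restricts the network's raw policy output to the legal moves and renormalises it to get the children's prior probabilities, so move legality is used there whatever the input encoding looks like.

import numpy as np

def priors_for_legal_moves(policy_logits, legal_moves, move_to_index):
    # keep only the logits of the legal moves, then softmax over just those;
    # illegal moves never become children of the node, so they get no prior
    logits = np.array([policy_logits[move_to_index[m]] for m in legal_moves])
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return dict(zip(legal_moves, probs))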