Policy training in Alpha Zero, LC0 ..

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

trulses wrote: Tue Dec 18, 2018 8:58 pm
chrisw wrote: Tue Dec 18, 2018 8:27 pm
trulses wrote: Tue Dec 18, 2018 8:09 pm I agree.
The legal moves list is an attack map, and because of the way it is encoded, a weighted attack map, though only for one side.
Unless you're talking about the policy label, you're not discriminating "bad" vs "good" moves by just providing the legal moves, so I'm not sure what you mean by weighted. Shouldn't all legal moves have the same weight in your input encoding? Just so we're clear, I'm not suggesting that anyone actually try this, because it would be expensive in the number of input planes and I doubt it would add much strength.

You're already taking advantage of the legal move information in your search, both in which nodes you add to the tree and in how you calculate your prior probabilities, so I don't see how it violates any rules.
Weighted = weighted by attacker. Sorry for the ambiguity; I meant the weight of the attacker type on each target square.

If the attack maps/moves were being explicitly given in order to provide second-order information to the network inputs over and above the one-hot piece encodings,
that, to me anyway, would fall under the non-zero-knowledge category. One-hot is simple position data; attacks are second-order for sure, i.e. what movement the one-hot bit can make. Which, I think, is probably why the pure AZ didn't do it, and went for backwards movement knowledge instead (but again via static one-hot position encodings).
The attack maps are not being explicitly given as inputs, but the information has crept in through the back door via the outputs.
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by trulses »

chrisw wrote: Tue Dec 18, 2018 9:10 pm
trulses wrote: Tue Dec 18, 2018 8:58 pm
chrisw wrote: Tue Dec 18, 2018 8:27 pm
trulses wrote: Tue Dec 18, 2018 8:09 pm I agree.
The legal moves list is an attack map, and because of the way it is encoded, a weighted attack map, though only for one side.
Unless you're talking about the policy label, you're not discriminating "bad" vs "good" moves by just providing the legal moves, so I'm not sure what you mean by weighted. Shouldn't all legal moves have the same weight in your input encoding? Just so we're clear, I'm not suggesting that anyone actually try this, because it would be expensive in the number of input planes and I doubt it would add much strength.

You're already taking advantage of the legal move information in your search, both in which nodes you add to the tree and in how you calculate your prior probabilities, so I don't see how it violates any rules.
Weighted = weighted by attacker. Sorry for the ambiguity; I meant the weight of the attacker type on each target square.
My mistake, I imagined a fairly sterile binary encoding of just the from-square and move type, the same shape as the A0 label. Certainly, if you are not careful with the encoding, you could accidentally leak some domain knowledge into the architecture. Still, this seems like a fruitless conversation, so let's prune it.

In my experience using conv-nets for chess move prediction and position evaluation, there are almost always some fairly interpretable filters in the input layer that produce attack maps for both colors. These have separate weights attached to each direction relative to the piece. I haven't had a chance to look at lc0 yet, but I imagine that if you take a trained net and look at the first-layer filters as images, you'll find some very recognizable shapes.
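
A minimal sketch of that kind of filter inspection, assuming a PyTorch checkpoint; the model file and the conv1 layer name are hypothetical, and lc0's actual graph will differ:

import torch
import matplotlib.pyplot as plt

# Render first-layer convolution filters as images to look for attack-map shapes.
model = torch.load("trained_net.pt")        # hypothetical checkpoint file
w = model.conv1.weight.detach()             # shape (out_channels, in_planes, k, k)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, filt in zip(axes.flat, w):
    ax.imshow(filt.sum(dim=0).numpy(), cmap="gray")   # collapse input planes
    ax.axis("off")
plt.show()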

edit:
If the attack maps/moves were being explicitly given in order to provide second-order information to the network inputs over and above the one-hot piece encodings,
that, to me anyway, would fall under the non-zero-knowledge category. One-hot is simple position data; attacks are second-order for sure, i.e. what movement the one-hot bit can make. Which, I think, is probably why the pure AZ didn't do it, and went for backwards movement knowledge instead (but again via static one-hot position encodings).
The attack maps are not being explicitly given as inputs, but the information has crept in through the back door via the outputs.
Sorry, I didn't see this part of the post at first. There are two ways to train your policy network for A0. You can ignore the illegal moves and treat them as 0 probability in your label, or you can calculate your softmax only over the logits corresponding to legal moves (or something equivalent, like masking and re-normalizing post-softmax, or applying a -inf mask to your logits pre-softmax).

The first method produces gradients that discourage the illegal moves in your outputs, that is, they push the logits corresponding to those moves towards -inf (if we ignore machine precision for a moment). The second method doesn't provide any such information and simply avoids updating the weights directly attached to those logits.

In the first method the information is certainly available in the gradients, but it's pure speculation whether it aids your network in producing attack-map information. I'm not familiar with the lc0 training setup, so I don't know which method they use. Personally I've tried both; they both produced attack-map-type filters in the initial layer, and there was no discernible difference in strength. The policy loss is smaller if you do the re-normalization, since you're predicting a slightly simpler function, but this shouldn't have any effect on the search, since you're only calculating the softmax over the 'legal logits' anyway (or one of the previously mentioned equivalents).
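
A minimal sketch of the two schemes in PyTorch (the move indices and visit counts are made up, and for brevity the legal-move mask is just derived from the counts):

import torch
import torch.nn.functional as F

logits = torch.randn(1858)                    # raw policy-head output, lc0-sized
visits = torch.zeros(1858)
visits[torch.tensor([10, 42, 300])] = torch.tensor([700.0, 250.0, 50.0])
legal = visits > 0                            # illustrative legal-move mask
target = visits / visits.sum()                # visit counts -> probabilities

# Method 1: softmax over everything; illegal moves carry a 0-probability label,
# so their logits receive gradients pushing them towards -inf.
loss_full = -(target * F.log_softmax(logits, dim=0)).sum()

# Method 2: -inf mask pre-softmax; the masked-out logits get no gradient at all.
masked = logits.masked_fill(~legal, float("-inf"))
loss_legal = -(target[legal] * F.log_softmax(masked, dim=0)[legal]).sum()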
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Policy training in Alpha Zero, LC0 ..

Post by gladius »

chrisw wrote: Tue Dec 18, 2018 7:24 pm
AlvaroBegue wrote: Tue Dec 18, 2018 6:51 pm
chrisw wrote: Tue Dec 18, 2018 6:03 pm
AlvaroBegue wrote: Tue Dec 18, 2018 5:35 pm
chrisw wrote: Tue Dec 18, 2018 5:26 pm
Henk wrote: Tue Dec 18, 2018 5:01 pm The input of a training example consists of the position value plus a probability for each legal move.
And how do you know the probability for each legal move from a PGN?
You don't. In the normal training of AlphaZero and LC0, your training samples are written at the end of a search, and the visit count of each move is available.

If you want to train from PGN files (I understand people have tried things of this sort), you can use 1 for the move played and 0 for every other move.
Yes, but, if you download a training game from lczero.org, say for example this game: lczero.org/match_game/3188816
and look at the HTML, you'll find the PGN + score embedded, but no more (as below).
So, the LCZero server has stripped away the visit counts of the alternative moves, and it's that data (converted to probabilities) that gets used to train the policy head?

new PgnViewer(
{ boardName: "training",
pgnString: '\n1.d4 Nf6 2.Nf3 c6 3.Bf4 Qb6 4.b3 d5 5.e3 c5 6.c4 cxd4 7.exd4 Nc6 8.c5 Qa5\x2b 9.Qd2 Qxd2\x2b 10.Nbxd2 Nb4 11.Bb5\x2b Bd7 12.Bxd7\x2b Nxd7 13.Ke2 f6 14.a3 Nc6 15.b4 a6 16.Nb3 g5 17.Bd2 Kf7 18.a4 e5 19.b5 Nxd4\x2b 20.Nbxd4 exd4 21.c6 bxc6 22.bxc6 Ne5 23.Nxd4 Bc5 24.Nb3 Bd6 25.Rhc1 Rhc8 26.Na5 Rab8 27.Rab1 Rxb1 28.Rxb1 Nxc6 29.Rb6 Nd4\x2b 30.Kd3 Be5 31.Rxa6 Ne6 32.Nb7 Rc4 33.g3 h5 34.a5 Ra4 35.Rb6 Ra3\x2b 36.Kc2 Nd4\x2b 37.Kb2 Rf3 38.Be1 Ne2\x2b 39.Kc2 Nd4\x2b 40.Kb1 Rd3 41.Bb4 Bc7 42.Nc5 Rd1\x2b 43.Kb2 Bxb6 44.axb6 Nc6 45.b7 Rf1 46.Nd3 Ke6 47.Nc5\x2b Kf5 48.Nd3 Ke4 49.Kc2 d4 50.Bd6 Kd5 51.b8=Q Nxb8 52.Bxb8 Ke4 53.Bc7 Rh1 54.h4 gxh4 55.gxh4 Rxh4 56.Bd8 Rh3 57.Nc5\x2b Kf5 58.Nd3 Rh1 59.Kd2 h4 60.Ke2 h3 61.Bc7 Ra1 62.Kf3 Ra3 63.Ke2 Ke4 64.Nc5\x2b Kd5 65.Nd3 Kc4 66.Nb2\x2b Kc3 67.Nd1\x2b Kc2 68.Bd6 Ra1 69.Ne3\x2b dxe3 70.fxe3 Ra4 71.Kf3 Kd3 72.Kg3 Kxe3 73.Kxh3 Kf3 74.Be7 f5 75.Bg5 f4 76.Kh4 Rc4 77.Kh3 Rc1 78.Bxf4 Kxf4 79.Kg2 Rb1 80.Kf2 Rb2\x2b 81.Kg1 Kg3 82.Kf1 Kf3 83.Kg1 Rd2 84.Kf1 Rd1# 0-1',
pieceSet: 'merida',
pieceSize: 55
}
);
They are showing you the PGN file, but that's not all the data they collected from that game.

You can see the Python code describing the data format used for training samples here: https://github.com/glinscott/leela-ches ... er.py#L115
ah! thanks! so, the end-user crowd-source sends training games to the LCZero server with all this extra information. A back-of-a-fag-packet calculation suggests each PGN increases from maybe 500 bytes to 100 moves x 30 children each x 5? bytes of visit count = 15,000 bytes, call it a thirty-fold size increase. 100x10^6 training games at 15K each gives 1.5x10^12 bytes, 1.5 TB (I didn't check my maths, so maybe I'm out). That's a nice data set. Where is it stored? Is there a link?
It should be available from http://lczero.org/training_data. It appears to be down right now though. But yes, it's a lot of data, so it's very expensive to make available. I was hosting it from S3 originally, and the bill got pretty big, pretty quickly!
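
For concreteness, a minimal sketch of the visit-count-to-probability conversion asked about above (the counts are made up; tau = 1 gives plain visit proportions, and the AZ scheme varies tau during self-play):

import numpy as np

visits = {"e2e4": 620, "d2d4": 300, "g1f3": 80}   # root visit counts after search
tau = 1.0                                          # temperature parameter

counts = np.array(list(visits.values()), dtype=np.float64) ** (1.0 / tau)
policy_target = counts / counts.sum()              # probabilities, sum to 1
# tau = 1 -> e2e4: 0.62, d2d4: 0.30, g1f3: 0.08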
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Policy training in Alpha Zero, LC0 ..

Post by jp »

chrisw wrote: Tue Dec 18, 2018 9:10 pm
trulses wrote: Tue Dec 18, 2018 8:58 pm
chrisw wrote: Tue Dec 18, 2018 8:27 pm The legal moves list is an attack map, and because of the way it is encoded, a weighted attack map, though only for one side.
Unless you're talking about the policy label, you're not discriminating "bad" vs "good" moves by just providing the legal moves, so I'm not sure what you mean by weighted. Shouldn't all legal moves have the same weight in your input encoding? Just so we're clear, I'm not suggesting that anyone actually try this, because it would be expensive in the number of input planes and I doubt it would add much strength.

You're already taking advantage of the legal move information in your search, both in which nodes you add to the tree and in how you calculate your prior probabilities, so I don't see how it violates any rules.
Weighted = weighted by attacker. Sorry for the ambiguity; I meant the weight of the attacker type on each target square.

If the attack maps/moves were being explicitly given in order to provide second-order information to the network inputs over and above the one-hot piece encodings,
that, to me anyway, would fall under the non-zero-knowledge category. One-hot is simple position data; attacks are second-order for sure, i.e. what movement the one-hot bit can make. Which, I think, is probably why the pure AZ didn't do it, and went for backwards movement knowledge instead (but again via static one-hot position encodings).
The attack maps are not being explicitly given as inputs, but the information has crept in through the back door via the outputs.
Yeah, I was going to ask yesterday for clarification about this...
Can you very explicitly explain the non-zeroness?
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

jp wrote: Wed Dec 19, 2018 7:11 am
chrisw wrote: Tue Dec 18, 2018 9:10 pm
trulses wrote: Tue Dec 18, 2018 8:58 pm
chrisw wrote: Tue Dec 18, 2018 8:27 pm The legal moves list is an attack map, and because of the way it is encoded, a weighted attack map, though only for one side.
Unless you're talking about the policy label, you're not discriminating "bad" vs "good" moves by just providing the legal moves, so I'm not sure what you mean by weighted. Shouldn't all legal moves have the same weight in your input encoding? Just so we're clear, I'm not suggesting that anyone actually try this, because it would be expensive in the number of input planes and I doubt it would add much strength.

You're already taking advantage of the legal move information in your search, both in which nodes you add to the tree and in how you calculate your prior probabilities, so I don't see how it violates any rules.
Weighted = weighted by attacker. Sorry for the ambiguity; I meant the weight of the attacker type on each target square.

If the attack maps/moves were being explicitly given in order to provide second-order information to the network inputs over and above the one-hot piece encodings,
that, to me anyway, would fall under the non-zero-knowledge category. One-hot is simple position data; attacks are second-order for sure, i.e. what movement the one-hot bit can make. Which, I think, is probably why the pure AZ didn't do it, and went for backwards movement knowledge instead (but again via static one-hot position encodings).
The attack maps are not being explicitly given as inputs, but the information has crept in through the back door via the outputs.
Yeah, I was going to ask yesterday for clarification about this...
Can you very explicitly explain the non-zeroness?
Well, from what I intuited, the tabula rasa approach says that you present your knowledge engine with a visual look at the chess board, as if it were a complete beginner. It sees the pieces and the squares they are on. There's no information about how they move, nor how valuable each is, nor that the king is special. You then show this engine chess positions, in random order, together with the game output (win/loss), and train it on that output. Eventually, without any knowledge of even how the pieces move, this engine will evaluate chess positions well. Totally zero.

Life is made a little more complex by introducing policy. Here you have the same inputs, but a separate 64x64 output map of all moves, possible or not. At its simplest, you take the move played from the position and light up the corresponding bit in the map, and train the engine on that lit bit (sorry, logit). In practice, from the prior search, you light up all the legal moves in proportion to their visit counts and set the rest of the 64x64 map to zero. This gives, of course, a pattern at the outputs, and during back-propagation this pattern is transmogrified and passed back up through the layered weights, affecting them. Essentially, even though you didn't pass any move/attack/mobility information into the engine inputs, you did pass it in via this pattern at the outputs. Rule N of ML: "watch out that you don't tell it, in the training data, what you want it to find", and there are many curiously weird and wonderful and unexpected mechanisms for breaching that rule.
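To make that labeling concrete, a minimal sketch of the 64x64 scheme as described; note the real A0 policy head uses 8x8x73 move planes and lc0 an 1858-move vector, and the squares and counts here are made up:

import numpy as np

target = np.zeros((64, 64))                    # from-square x to-square map

# Visit counts from the prior search, keyed by (from, to) square indices.
visits = {(12, 28): 700, (11, 27): 250, (6, 21): 50}   # e2e4, d2d4, g1f3
total = sum(visits.values())
for (frm, to), n in visits.items():
    target[frm, to] = n / total                # legal moves lit up in proportion
# every other cell, legal move or not, stays exactly 0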

trulses argues this is fine because it is permitted under "rules of chess" information only, and the search algorithm that generates training games and the search algorithm that plays end-user games have to know how the pieces move.
Yes, but. The NN isn't supposed to know that, else we could input the moves, attacks, mobility and all the other second-order parameters under the disguise of "rules of chess".

Strictly speaking, I would say they, and AZ too, have breached tabula rasa unintentionally and without realising it.
crem
Posts: 177
Joined: Wed May 23, 2018 9:29 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by crem »

chrisw wrote: Tue Dec 18, 2018 7:24 pm
ah! thanks! so, the end-user crowd-source sends training games to the LCZero server with all this extra information. A back-of-a-fag-packet calculation suggests each PGN increases from maybe 500 bytes to 100 moves x 30 children each x 5? bytes of visit count = 15,000 bytes, call it a thirty-fold size increase. 100x10^6 training games at 15K each gives 1.5x10^12 bytes, 1.5 TB (I didn't check my maths, so maybe I'm out). That's a nice data set. Where is it stored? Is there a link?
Sorry that I'm late to the party. That data is stored here: http://data.lczero.org/files/ in the training-*.tar files. There is one .gz file per game there, in a format parseable by that Python script.
The pgn-*.tar.bz2 files contain the corresponding PGNs, but the moves can be recovered from the training files too (with some creativity and effort, implemented here: https://github.com/so-much-meta/lczero_ ... _parser.py).
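
A minimal sketch of walking one of those tars; the filename is hypothetical, and the 8276-byte record size and field offsets are assumptions matching a V3-style layout, to be verified against the linked parser:

import gzip
import struct
import tarfile

RECORD = 8276                                  # assumed record size; verify in parse.py

with tarfile.open("training-run1-0000.tar") as tar:    # hypothetical filename
    for member in tar:
        if not member.isfile():
            continue
        data = gzip.decompress(tar.extractfile(member).read())   # one game
        for off in range(0, len(data), RECORD):
            rec = data[off:off + RECORD]
            version = struct.unpack_from("<i", rec, 0)[0]
            probs = struct.unpack_from("<1858f", rec, 4)    # policy target
            # remaining fields (packed planes, castling, stm, result) per parse.py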
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Policy training in Alpha Zero, LC0 ..

Post by chrisw »

crem wrote: Wed Dec 19, 2018 3:43 pm
chrisw wrote: Tue Dec 18, 2018 7:24 pm
ah! thanks! so, the end-user crowd-source sends training games to the LCZero server with all this extra information. A back-of-a-fag-packet calculation suggests each PGN increases from maybe 500 bytes to 100 moves x 30 children each x 5? bytes of visit count = 15,000 bytes, call it a thirty-fold size increase. 100x10^6 training games at 15K each gives 1.5x10^12 bytes, 1.5 TB (I didn't check my maths, so maybe I'm out). That's a nice data set. Where is it stored? Is there a link?
Sorry that I'm late to the party. That data is stored here: http://data.lczero.org/files/ in the training-*.tar files. There is one .gz file per game there, in a format parseable by that Python script.
The pgn-*.tar.bz2 files contain the corresponding PGNs, but the moves can be recovered from the training files too (with some creativity and effort, implemented here: https://github.com/so-much-meta/lczero_ ... _parser.py).
Thanks. Tempted to start imagining what could be done with all that lot.