A0 subtlety

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Daniel Shawul
Posts: 3749
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: A0 subtlety

Post by Daniel Shawul » Sun Feb 17, 2019 5:31 pm

I am more and more convinced that the problem with supervised learning is not being able to forget weak games. There is a whole lot of literature on “experience replay” in deep reinforcement learning that solves the problem of overfitting to early/latest data. What I need to do is sort the games by Elo, use a replay buffer of 1 million positions, sample batches from it with replacement, and continually train the network.
That should simulate the reinforcement learning process in supervised learning, I think ...
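A minimal sketch of what that replay-buffer scheme could look like (the class and the toy "positions" are illustrative, not from any engine's actual code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer: as new positions arrive, the oldest are
    evicted, so training gradually 'forgets' the earliest/weakest data."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, position):
        self.buffer.append(position)

    def sample(self, batch_size):
        # Sample WITH replacement, as described above.
        return [random.choice(self.buffer) for _ in range(batch_size)]

# Toy usage: feed positions in ascending-Elo order, then draw mini-batches.
buf = ReplayBuffer(capacity=1_000_000)
for position in ["weak_game_pos", "mid_game_pos", "strong_game_pos"]:
    buf.add(position)
batch = buf.sample(32)
```

Because the buffer is bounded, by the time the strongest games are being added the weakest ones have already been evicted, which is the "forgetting" effect RL gets for free.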

brianr
Posts: 356
Joined: Thu Mar 09, 2006 2:01 pm

Re: A0 subtlety

Post by brianr » Sun Feb 17, 2019 5:38 pm

Daniel Shawul wrote:
Sun Feb 17, 2019 5:31 pm
I am more and more convinced that the problem with supervised learning is not being able to forget weak games. There is a whole lot of literature on “experience replay” in deep reinforcement learning that solves the problem of overfitting to early/latest data. What I need to do is sort the games by Elo, use a replay buffer of 1 million positions, sample batches from it with replacement, and continually train the network.
That should simulate the reinforcement learning process in supervised learning, I think ...
I have had some success with SL using that sorting/tiered approach. I used the CCRL games: the first subset had large material differences, then smaller material differences, and then groups of games spanning 400 Elo each (under 1,200, then up to 1,600, and so on). I happen to use SCID to select the groups. Also, a lot of cleanup needs to be done on the games with pgn-extract and the like.
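As a rough sketch of that tiering (the bucket boundaries follow the description above; the function name is mine):

```python
def elo_tier(avg_elo, start=1200, width=400):
    """Return the upper bound of the 400-Elo tier a game falls into:
    under 1,200 -> 1200; 1,200-1,599 -> 1600; 1,600-1,999 -> 2000; etc."""
    if avg_elo < start:
        return start
    return start + width * ((avg_elo - start) // width + 1)

# Bucket a few (white_elo, black_elo) pairs by average rating.
games = [(1100, 1150), (1500, 1450), (2400, 2500)]
tiers = [elo_tier((w + b) // 2) for w, b in games]
```

In practice the bucketing would be done by the database tool (SCID here), but the boundaries are the same idea.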

The smaller groups make it practical to train for more than one epoch (2 or 3 seem about right) and to experiment with CLR (cyclical learning rates), although it is not clear whether that helps much.
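For reference, a triangular cyclical-learning-rate schedule in the style of Smith's CLR might look like this (all parameter values are illustrative, not the ones actually used):

```python
def clr(step, base_lr=1e-4, max_lr=1e-2, half_cycle=1000):
    """Triangular CLR: ramp linearly from base_lr to max_lr over
    half_cycle steps, then back down, and repeat."""
    pos = step % (2 * half_cycle)
    frac = pos / half_cycle if pos < half_cycle else 2 - pos / half_cycle
    return base_lr + (max_lr - base_lr) * frac
```

The learning rate then oscillates between the two bounds instead of decaying monotonically, which is the knob being experimented with above.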

This has not yet produced a super-strong net, but I have only trained with about 4 million games and smaller 10x128-sized nets. I hope more games and larger nets would be stronger. After that, RL could always be applied too, and the entire process might be faster than starting from zero with only RL.

chrisw
Posts: 2008
Joined: Tue Apr 03, 2012 2:28 pm

Re: A0 subtlety

Post by chrisw » Sun Feb 17, 2019 8:31 pm

Daniel Shawul wrote:
Sun Feb 17, 2019 1:16 pm
I think if you take lc0 self-play games in the same order as they are produced (i.e. from poor games that don't know material values to strong ones) and "supervised train" a new network, you should be able to get the same network.
Probably so, but not if you use different inputs (attack planes vs history planes and so on). In that case the networks would diverge, and we’re back to the problem of using RL games that were produced “in tune” with a different network, not ours.

That might not be the case if the order of the games is changed.
The order is important because mini-batch gradient descent is used for optimization. In batch gradient descent, however, the order does not matter because the update is done only after the gradients from all 40 million games are accumulated. I am not sure batch gradient descent would produce strong nets because, as you mentioned, there will be a lot of noise in the data, and the algorithm tries to fit all samples with the same weight. The algorithm lc0 uses has a replay buffer that keeps the last 1 million games for the purpose of randomly sampling mini-batches. Someone even claimed that lc0 nets are trained from only 1 million games because of that, but I think that is a mistake. The replay buffer is just a tool to shuffle positions and minimize the effect of correlation between positions from the same game.
If all games are filtered by some reasonably high Elo, then game ordering ought not to matter; move correlation always matters though, as you said. The problem still remains: how to get training variance without having to rely on weak-entity results.

However, I think there is a "forgetting" effect, where earlier games played with no idea of piece values are forgotten. That is why I think using batch gradient descent over the 40 million games and making multiple passes over them will not give the same net.
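The order-sensitivity point can be seen on a toy one-parameter least-squares problem (purely illustrative data): a full-batch step accumulates every gradient before updating, so reordering the data changes nothing, while a pass of sequential per-sample updates generally ends at a different weight.

```python
def batch_step(w, xs, ys, lr):
    # Full-batch gradient descent: accumulate ALL gradients, update once.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
    return w - lr * grad

def sequential_pass(w, xs, ys, lr):
    # Per-sample updates (mini-batch of size 1): each step sees the
    # weight left by the previous one, so data order matters.
    for x, y in zip(xs, ys):
        w -= lr * 2 * (w * x - y) * x
    return w

xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.3]           # noisy y ~ 2x
a = batch_step(0.0, xs, ys, 0.01)
b = batch_step(0.0, xs[::-1], ys[::-1], 0.01)        # identical to a
c = sequential_pass(0.0, xs, ys, 0.01)
d = sequential_pass(0.0, xs[::-1], ys[::-1], 0.01)   # differs from c
```

Note the data must be noisy for the sequential results to differ; with perfectly consistent samples every per-sample update shares the same fixed point.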

jackd
Posts: 25
Joined: Mon Dec 10, 2018 1:45 pm
Full name: jack d.

Re: A0 subtlety

Post by jackd » Fri Feb 22, 2019 3:01 am

You guys... my policy network isn't that bad! After 100,000 steps with a batch size of 2,000 on a 10x128 net, it can beat me! I'm so excited I had to post something. I'll follow up with more detail eventually.

Daniel Shawul
Posts: 3749
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: A0 subtlety

Post by Daniel Shawul » Fri Feb 22, 2019 1:19 pm

Great! The policy net has a lot of positional knowledge indeed.
Though you can beat it if you constantly look for 2-ply tactics in your games.

chrisw
Posts: 2008
Joined: Tue Apr 03, 2012 2:28 pm

Re: A0 subtlety

Post by chrisw » Fri Feb 22, 2019 7:04 pm

jackd wrote:
Fri Feb 22, 2019 3:01 am
You guys... my policy network isn't that bad! After 100,000 steps with a batch size of 2,000 on a 10x128 net, it can beat me! I'm so excited I had to post something. I'll follow up with more detail eventually.
That sounds pretty good! Can you expand on the second sentence, though ...
How many games/positions? I got that you are batching in 2,000-position batches, but not the total count of positions, nor how many passes over the same position set.

Nor what you are training on. Presumably the move for policy and the game result for value? Categorical cross-entropy for value? And what is the quality of the input data?

Sorry! I'm interested!

crem
Posts: 116
Joined: Wed May 23, 2018 7:29 pm

Re: A0 subtlety

Post by crem » Wed Feb 27, 2019 3:29 pm

chrisw wrote:
Thu Feb 14, 2019 3:28 pm
Daniel Shawul wrote:
Mon Feb 11, 2019 5:08 pm
jackd wrote:
Mon Feb 11, 2019 4:17 pm
Was a set of input planes representing a position at time (t - T + 1) oriented for the side to move at time t or time (t - T + 1)?
All the history input planes are oriented for the current side to move, i.e. at time t.
Would it matter if they weren’t?
(That’s a serious question btw)
In LcZero they were not flipped until ~mid-May due to a bug. It's believed to reduce NN capacity, but it somehow still worked.
The first time LCZero participated in TCEC (was it TCEC 12?..), a network with that bug played.
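For concreteness, re-orienting an 8x8 input plane for the side to move is just a vertical mirror (rank 1 becomes rank 8); this sketch assumes a rank-major list-of-lists layout, which is an illustrative choice, not necessarily LcZero's actual representation:

```python
def flip_plane(plane):
    """Mirror an 8x8 plane vertically (rank 1 <-> rank 8), as when
    re-orienting a history position to the current side to move."""
    return plane[::-1]

# A lone 'piece' on a1 (rank 1, file a) ends up on a8 after the flip.
plane = [[0] * 8 for _ in range(8)]
plane[0][0] = 1
flipped = flip_plane(plane)
```

The bug discussed above amounted to skipping this flip on the history planes, so past positions were fed in from the wrong side's perspective.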

chrisw
Posts: 2008
Joined: Tue Apr 03, 2012 2:28 pm

Re: A0 subtlety

Post by chrisw » Wed Feb 27, 2019 7:59 pm

crem wrote:
Wed Feb 27, 2019 3:29 pm
chrisw wrote:
Thu Feb 14, 2019 3:28 pm
Daniel Shawul wrote:
Mon Feb 11, 2019 5:08 pm
jackd wrote:
Mon Feb 11, 2019 4:17 pm
Was a set of input planes representing a position at time (t - T + 1) oriented for the side to move at time t or time (t - T + 1)?
All the history input planes are oriented for the current side to move, i.e. at time t.
Would it matter if they weren’t?
(That’s a serious question btw)
In LcZero they were not flipped until ~mid-May due to a bug. It's believed to reduce NN capacity, but it somehow still worked.
The first time LCZero participated in TCEC (was it TCEC 12?..), a network with that bug played.
I guess you mean the engine code was flipped in mid-May? What happened to “fix” the bug? If you flip the planes in the code (I assume), then the new engine/training code trains the new networks with the history planes the right way round, so new networks after mid-May will work correctly with executables compiled after mid-May. What about pre-May networks with later-compiled executables: do you flip the history code back when recognising pre-May networks?

Did any network runs just continue training, but with flipped planes?

Did the allegedly very strong 11xxx series networks predate this, btw? I guess they were post May.

crem
Posts: 116
Joined: Wed May 23, 2018 7:29 pm

Re: A0 subtlety

Post by crem » Wed Feb 27, 2019 8:33 pm

chrisw wrote:
Wed Feb 27, 2019 7:59 pm
crem wrote:
Wed Feb 27, 2019 3:29 pm
chrisw wrote:
Thu Feb 14, 2019 3:28 pm
Daniel Shawul wrote:
Mon Feb 11, 2019 5:08 pm
jackd wrote:
Mon Feb 11, 2019 4:17 pm
Was a set of input planes representing a position at time (t - T + 1) oriented for the side to move at time t or time (t - T + 1)?
All the history input planes are oriented for the current side to move, i.e. at time t.
Would it matter if they weren’t?
(That’s a serious question btw)
In LcZero they were not flipped until ~mid-May due to a bug. It's believed to reduce NN capacity, but it somehow still worked.
The first time LCZero participated in TCEC (was it TCEC 12?..), a network with that bug played.
I guess you mean the engine code was flipped in mid-May? What happened to “fix” the bug? If you flip the planes in the code (I assume), then the new engine/training code trains the new networks with the history planes the right way round, so new networks after mid-May will work correctly with executables compiled after mid-May. What about pre-May networks with later-compiled executables: do you flip the history code back when recognising pre-May networks?

Did any network runs just continue training, but with flipped planes?

Did the allegedly very strong 11xxx series networks predate this, btw? I guess they were post May.
Actually, I looked up the chat logs, and that bug was fixed in mid-April, not mid-May.
After fixing the binary, network training had to be restarted too. I don't remember whether that binary could still work with old networks.

(It was all still lczero.exe, not lc0.exe. In fact, all those flip bugs were discovered while I was writing lc0 and debugging inconsistencies.)

jackd
Posts: 25
Joined: Mon Dec 10, 2018 1:45 pm
Full name: jack d.

Re: A0 subtlety

Post by jackd » Thu Feb 28, 2019 3:29 pm

I just realized you can apply a softmax directly after a convolutional layer. Before, I was interpreting the policy head as

(73 @ 3 * 3 * 256) -> batchnorm -> relu -> dense( 64 * 73 ) -> softmax

But now I realize you apply the softmax directly to the 73 planes. I am excited and expecting even better results!
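In other words, the 73 output planes of 8x8 logits (73 * 64 = 4,672 move slots) are flattened and normalised jointly by a single softmax, with no dense layer in between. A minimal numeric sketch with random logits standing in for the convolution's output:

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a flat list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Flatten 73 planes of 8x8 logits and normalise them all at once.
logits = [random.gauss(0.0, 1.0) for _ in range(73 * 8 * 8)]
policy = softmax(logits)
```

The key point is that the softmax runs across all 4,672 entries together, so the planes compete with each other directly and the dense layer's parameters are unnecessary.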
