LCZero: Progress and Scaling. Relation to CCRL Elo

Laskos · Post by **Laskos** » Tue Apr 10, 2018 10:39 am

jkiliani wrote:
Laskos wrote:I use a varied, but solid short opening suite 3moves_GM. They seem to overfit on certain openings, finding local optima, playing most of the games on them, and training less on many other viable openings. I don't know whether they will soon find it problematic what to promote or not. 300,000 games with +0 Elo against a standard engine is not that good.
Lc0 currently has the problem that randomness in match games is provided only by the relatively small perturbations of the root node scores from Dirichlet noise. This can only shift the PV when two moves are similar already in evaluated quality, and leads to sometimes long opening lines played in almost every game within a particular match.

I wrote a solution to this issue in https://github.com/glinscott/leela-chess/pull/267. Once this is used in match games, openings will be much more varied again. It will probably happen in a couple of days, after the next forced version upgrade.

IIRC AlphaZero used Dirichlet noise. Do you know why? The cumulative distribution function looks like this (2-dimensional concentration vector of 0.3, i.e. a 1-dimensional simplex):

[Plot: CDF of the Dirichlet noise]

So it seems to favor very small and very large deviations. I don't see why, say, heavily pruned branches at the root should be taken so seriously into account.
Isn't this simple 1/sqrt(x) distribution more sensible?

[Plot: 1/sqrt(x) distribution]

Anyway, I will have to inform myself better, as I don't really understand the topic. In any case, the noise in the opening should be much larger. There is only one standard opening position, while all other parts of the game have very high diversity, so a very different kind of noise should be used for the first moves of the opening.
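The shape described above can be checked by sampling. This is only a minimal sketch, assuming a symmetric two-component Dirichlet with concentration 0.3 (its first coordinate then follows a Beta(0.3, 0.3) distribution); the function names, bin edges, sample count and seed are arbitrary illustrative choices, not anything taken from Lc0 or AlphaZero.

```python
import random

def sample_dirichlet_2d(alpha, rng):
    """Draw (x, 1 - x) from a symmetric 2-component Dirichlet(alpha, alpha)."""
    while True:
        # A Dirichlet draw is a vector of independent Gamma(alpha, 1) draws, normalized.
        g1 = rng.gammavariate(alpha, 1.0)
        g2 = rng.gammavariate(alpha, 1.0)
        total = g1 + g2
        if total > 0.0:  # guard against a vanishingly rare double underflow
            return g1 / total, g2 / total

def tail_and_middle_mass(alpha=0.3, n_samples=100_000, seed=1):
    """Return the fraction of samples in the outer 20% and in the central 20%."""
    rng = random.Random(seed)
    tails = 0
    middle = 0
    for _ in range(n_samples):
        x, _rest = sample_dirichlet_2d(alpha, rng)
        if x < 0.1 or x > 0.9:
            tails += 1
        elif 0.4 <= x <= 0.6:
            middle += 1
    return tails / n_samples, middle / n_samples

if __name__ == "__main__":
    tails, middle = tail_and_middle_mass()
    print(f"outer 20% of the interval: {tails:.2f}, central 20%: {middle:.2f}")
```

With a concentration below 1 the density is U-shaped, so most draws land close to 0 or 1, which matches the "very small or very large deviations" reading of the CDF.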

And to the progress:

Within error margins, my admittedly flawed testing shows no significant progress from ID101 to ID113, a span of almost 500,000 games, which is worrying. Their self-play ratings show 110 Elo points of progress, while my results over 500 games against a standard engine, Predateur, are:

Engine                        Score        W-L-D
LCZero CPU  ID83 4 threads    153.0/500    134-328-38
LCZero CPU ID101 4 threads    202.0/500    181-277-42
LCZero CPU ID113 4 threads    206.5/500    183-270-47

No improvement from 101 to 113 within error margins. The same holds on test suites, both positional and tactical. I hope that once the overfit on openings is corrected, progress will resume.
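For anyone who wants to reproduce the "within error margins" statement, here is a rough sketch. It assumes the three numbers in each result are wins-losses-draws (which matches the reported scores), uses the standard logistic Elo formula, and takes a plain normal approximation for the 95% interval; the function names are made up for illustration and this is not necessarily how any rating tool computes its margins.

```python
import math

def elo_from_score(score_fraction):
    """Elo difference implied by an expected score under the logistic model."""
    return -400.0 * math.log10(1.0 / score_fraction - 1.0)

def elo_with_margin(wins, losses, draws, z=1.96):
    """Return (elo, low, high) using a normal approximation on the mean score."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where a game scores 1, 0.5 or 0.
    variance = (wins * (1.0 - mean) ** 2
                + draws * (0.5 - mean) ** 2
                + losses * (0.0 - mean) ** 2) / n
    std_error = math.sqrt(variance / n)
    low = elo_from_score(mean - z * std_error)
    high = elo_from_score(mean + z * std_error)
    return elo_from_score(mean), low, high

if __name__ == "__main__":
    results = {
        "ID83": (134, 328, 38),
        "ID101": (181, 277, 42),
        "ID113": (183, 270, 47),
    }
    for net, (w, l, d) in results.items():
        elo, low, high = elo_with_margin(w, l, d)
        print(f"{net}: {elo:+.0f} Elo vs Predateur (95% interval {low:+.0f} to {high:+.0f})")
```

Under these assumptions the ID101 and ID113 point estimates are only a few Elo apart, while each interval is on the order of 30 Elo wide on either side, so the two results cannot be told apart with 500 games each.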

jkiliani · Post by **jkiliani** » Tue Apr 10, 2018 12:02 pm

Laskos wrote:
jkiliani wrote:
Laskos wrote:I use a varied, but solid short opening suite 3moves_GM. They seem to overfit on certain openings, finding local optima, playing most of the games on them, and training less on many other viable openings. I don't know whether they will soon find it problematic what to promote or not. 300,000 games with +0 Elo against a standard engine is not that good.
Lc0 currently has the problem that randomness in match games is provided only by the relatively small perturbations of the root node scores from Dirichlet noise. This can only shift the PV when two moves are similar already in evaluated quality, and leads to sometimes long opening lines played in almost every game within a particular match.

I wrote a solution to this issue in https://github.com/glinscott/leela-chess/pull/267. Once this is used in match games, openings will be much more varied again. It will probably happen in a couple of days, after the next forced version upgrade.
IIRC AlphaZero used Dirichlet noise. Do you know why? The cumulative distribution function looks like that (2 dimensional vector of 0.3, 1 dimensional simplex):

So, it seems to favor very small deviation and very large deviations. I don't know why say very much pruned out branches at the root should be taken so seriously into account.
Isn't this simple, 1/sqrt(x) distribution, more sensible:

Anyway, I will have to inform myself better, I don't understand the topic. And anyway, the noise in opening should be much larger. Standard opening position is just one, all other parts of the games have very high diversity, so for the beginning of the openings, a very different kind of noise should be used.

And to the progress:

Within error margins, no significant progress was made in my flawed testing from ID101 to ID113, almost 500,000 games, which is worrying. Their self-games ratings show 110 Elo points progress, while my results in 500 games against a standard engine Predateur are:
LCZero CPU  ID83 4 threads	153.0/500	134-328-38

LCZero CPU ID101 4 threads	202.0/500	181-277-42
LCZero CPU ID113 4 threads	206.5/500	183-270-47
No improvement from 101 to 113 within error margins. Similarly on test suites, both positional and tactical. I hope that correcting for the overfit on openings, the progress will begin again.

Dirichlet noise and temperature are completely different things, and are designed for completely different purposes as well.

Dirichlet noise will perturb the neural network scores at the root level, to raise some and lower others randomly. Those scores determine how much each root move is explored, and by adding noise we ensure that even moves that the current network thinks are bad will be somewhat explored with a certain probability. A side effect of perturbing the root node scores is that sometimes, this shifts the PV, i.e. the root move getting the most visits (which the engine thinks is best).

(Move selection) temperature is the concept of not automatically choosing the move with the most visits. Training games always select moves in proportion to their visit count. This means that even a move that got a single visit, which was enough for the network to determine it is a horrible blunder, has a non-negligible chance of being chosen. That, not Dirichlet noise, is the reason training games are full of blunders.

Regardless, temperature is designed to add variety in positions, while noise is designed to find blind spots in the evaluation; the variety it produces is just a byproduct. That's the reason for the pull request I referenced: I used temperature instead of noise for variety, but with a tweak so that the most variety is added in the opening rather than later. Really bad blunders are suppressed in this implementation by making their chance of being selected much smaller than with proportional root move selection.
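To make the distinction concrete, here is a small sketch of the two mechanisms side by side. It is not Lc0 code: the function names are invented, and the defaults alpha=0.3 and epsilon=0.25 are the chess values reported for AlphaZero, assumed here only for illustration.

```python
import random

def add_dirichlet_noise(priors, rng, alpha=0.3, epsilon=0.25):
    """Mix Dirichlet noise into the root priors (done before the search)."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    if total == 0.0:  # practically unreachable, kept so the division is safe
        return list(priors)
    noise = [g / total for g in gammas]
    # Every move keeps (1 - epsilon) of its prior and gets a random share of epsilon.
    return [(1.0 - epsilon) * p + epsilon * n for p, n in zip(priors, noise)]

def pick_proportional_to_visits(visits, rng):
    """Temperature-1 move selection (done after the search): sample by visit count."""
    return rng.choices(range(len(visits)), weights=visits, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    print("noisy priors:", add_dirichlet_noise([0.7, 0.2, 0.1], rng))
    # Hypothetical visit counts: the third move is a blunder that got 10 of 1000 visits.
    visits = [790, 200, 10]
    picks = [pick_proportional_to_visits(visits, rng) for _ in range(20_000)]
    print("blunder chosen in", picks.count(2), "of 20000 games")
```

Noise changes what the search looks at, while proportional selection changes which searched move is actually played, so a move with about 1% of the visits still gets played in about 1% of games.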

lantonov · Post by **lantonov** » Tue Apr 10, 2018 12:32 pm

Laskos wrote:
Within error margins, no significant progress was made in my flawed testing from ID101 to ID113, almost 500,000 games, which is worrying.

At https://www.twitch.tv/ccls they do semi-regular testing of the recent stronger networks against a panel of 25 engines rated 1645-2080 Elo (1 min TC, book, gauntlet of 8 rounds, 200 games in all). Results show steady progress of 50-80 Elo each time. The current standing is 1955 Elo.

lantonov · Post by **lantonov** » Tue Apr 10, 2018 12:38 pm

jkiliani wrote: Regardless, the concept of temperature is designed to add variety in positions, while the concept of noise is to find blind spots in the evaluation, the variety here is just a byproduct. That's the reason for the pull request I referenced: I used temperature instead of noise for variety, but with a tweak so that the most variety is added in the opening instead of later. Really bad blunders are suppressed in this implementation by making their chance of being selected much smaller than with proportional root move selection.

+1
My wish from the lczero forum fulfilled.

Laskos · Post by **Laskos** » Wed Apr 11, 2018 1:47 am

jkiliani wrote: Dirichlet noise and temperature are completely different things, and are designed for completely different purposes as well.

Dirichlet noise will perturb the neural network scores at the root level, to raise some and lower others randomly. Those scores determine how much each root move is explored, and by adding noise we ensure that even moves that the current network thinks are bad will be somewhat explored with a certain probability. A side effect of perturbing the root node scores is that sometimes, this shifts the PV, i.e. the root move getting the most visits (which the engine thinks is best).

(Move selection) temperature is the concept of not automatically choosing the move which gets most visits. Training games always select moves proportional to their visit count. This means that even moves that got a single visit which let the network determine they're horrible blunders have a non-negligible chance of being chosen. That is the reason training games are full of blunders, not Dirichlet noise.

Regardless, the concept of temperature is designed to add variety in positions, while the concept of noise is to find blind spots in the evaluation, the variety here is just a byproduct. That's the reason for the pull request I referenced: I used temperature instead of noise for variety, but with a tweak so that the most variety is added in the opening instead of later. Really bad blunders are suppressed in this implementation by making their chance of being selected much smaller than with proportional root move selection.

Thanks for clarifying some issues, I only had very vague general ideas.
Is this "temperature" some sort of step function up to a certain move? And did you add more temperature to the first moves than to later moves?

Laskos · Post by **Laskos** » Wed Apr 11, 2018 2:04 am

lantonov wrote:
Laskos wrote:
Within error margins, no significant progress was made in my flawed testing from ID101 to ID113, almost 500,000 games, which is worrying.
At this site https://www.twitch.tv/ccls they do some semi-regular testing of recent stronger networks against a panel of 25 engines in the range 1645 - 2080 elo. (1 min TC, book, gauntlet of 8 rounds, 200 games all). Results show a steady progress of 50-80 elo a time. The current standing is 1955 elo.

Yes, I know of these results, and his methodology and time control are better, although I have more games. I also observed steady progress up to ID101 (I tested some 4-5 versions since ID59), but as of ID117 there is no significant progress compared to ID101 (600,000 games, which is a lot). It somehow coincides with the onset of overfitting on local optima of some openings. The intermediate result in your link is also not that good.

Another aspect is the scaling. It's pretty useless to talk about a CCRL rating for LC0 without specifying the time control, because LC0 scales much better than conventional engines of similar strength.

At 100ms/move each engine, ID113 against Predateur 2.2.1 scored 206.5/500, at 400ms/move each engine (2 doublings) it scored 319.0/500, more than 150 Elo difference in scaling. So LC0 might perform 200-400 Elo points better at LTC than at STC compared to standard engines.
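The arithmetic behind the "more than 150 Elo" figure can be sketched as follows, assuming the usual logistic Elo model; the helper name is made up for illustration.

```python
import math

def elo_from_score(points, games):
    """Elo difference implied by a match score under the logistic model."""
    fraction = points / games
    return -400.0 * math.log10(1.0 / fraction - 1.0)

if __name__ == "__main__":
    elo_100ms = elo_from_score(206.5, 500)  # ID113 vs Predateur 2.2.1 at 100ms/move
    elo_400ms = elo_from_score(319.0, 500)  # same pairing at 400ms/move
    print(f"100ms/move: {elo_100ms:+.0f} Elo")
    print(f"400ms/move: {elo_400ms:+.0f} Elo")
    print(f"gain over two doublings: {elo_400ms - elo_100ms:.0f} Elo")
```

The gain is relative to Predateur, which also gets four times the time, so it measures how much better LC0 scales than that engine rather than LC0's absolute gain per doubling.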

CMCanavessi · Post by **CMCanavessi** » Wed Apr 11, 2018 3:12 am

Laskos wrote:
lantonov wrote:
Laskos wrote:
Within error margins, no significant progress was made in my flawed testing from ID101 to ID113, almost 500,000 games, which is worrying.
At this site https://www.twitch.tv/ccls they do some semi-regular testing of recent stronger networks against a panel of 25 engines in the range 1645 - 2080 elo. (1 min TC, book, gauntlet of 8 rounds, 200 games all). Results show a steady progress of 50-80 elo a time. The current standing is 1955 elo.
Yes, I know of these results, and his methodology and time control are better, although I have more games. I also observed steady progress up to 101 (I tested some 4-5 versions since ID59), but as of 117 no significant progress compared to 101 (600,000 games, a lot). It somehow coincides with the onset of overfitting on local optima of some openings. The intermediate result in your link is also not that good.

Another aspect is the scaling. It's pretty useless to talk of CCRL rating of LC0 without specifying the time control, LC0 scales much better than usual engines of similar strength.

At 100ms/move each engine, ID113 against Predateur 2.2.1 scored 206.5/500, at 400ms/move each engine (2 doublings) it scored 319.0/500, more than 150 Elo difference in scaling. So LC0 might perform 200-400 Elo points better at LTC than at STC compared to standard engines.

I'm finding a similar thing. I had tested 103, which was a definite improvement over 80. Now I'm testing 116 and it's performing almost exactly the same as 103 (maybe a tiny bit worse, but well within the error bars), even though the playing style is vastly different.

We'll see how the 128x10 net behaves.

CMCanavessi · Post by **CMCanavessi** » Wed Apr 11, 2018 5:22 am

New updated graphics with 116:

[Graph: Gauntlet score]

[Graph: Elo]

jkiliani · Post by **jkiliani** » Wed Apr 11, 2018 5:55 am

Laskos wrote:
jkiliani wrote: Regardless, the concept of temperature is designed to add variety in positions, while the concept of noise is to find blind spots in the evaluation, the variety here is just a byproduct. That's the reason for the pull request I referenced: I used temperature instead of noise for variety, but with a tweak so that the most variety is added in the opening instead of later. Really bad blunders are suppressed in this implementation by making their chance of being selected much smaller than with proportional root move selection.
Thanks for clarifying some issues, I had very vague general ideas.
Is this "temperature" some sort of step function up to certain move? And you added temperature to the first moves compared to latter moves?

I didn't use a step function, but a logarithmic decay schedule:

adjusted_ply = 1+(plycount+1) * decay_constant / 50.
root_temp = 1./(1.+ln(adjusted_ply))

Now, moves are chosen proportional to exponentiated visits:
exp_visits = node_visits^(1/root_temp)

This allows moves with only slightly fewer visits than the most visited node to be chosen regularly, while the selection chances of nodes that the search has determined to be bad are suppressed by the exponentiation. Since this exponent rises throughout the game, the suppression of moves getting few visits increases as well.
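The schedule above can be written out as a short runnable sketch. The two formulas are taken directly from the post; the function names, the value decay_constant=5.0 and the visit counts are hypothetical and chosen only to show the effect, not taken from the pull request.

```python
import math

def root_temperature(plycount, decay_constant):
    """Temperature at a given ply under the logarithmic decay schedule."""
    adjusted_ply = 1.0 + (plycount + 1) * decay_constant / 50.0
    return 1.0 / (1.0 + math.log(adjusted_ply))

def selection_probabilities(node_visits, root_temp):
    """Probability of picking each root move, proportional to visits^(1/temp)."""
    exp_visits = [v ** (1.0 / root_temp) for v in node_visits]
    total = sum(exp_visits)
    return [e / total for e in exp_visits]

if __name__ == "__main__":
    decay_constant = 5.0     # hypothetical value, for illustration only
    visits = [400, 350, 50]  # hypothetical root visit counts
    for ply in (0, 19, 99):
        temp = root_temperature(ply, decay_constant)
        probs = selection_probabilities(visits, temp)
        shown = ", ".join(f"{p:.4f}" for p in probs)
        print(f"ply {ply:3d}: temp {temp:.3f} -> {shown}")
```

With decay_constant set to 0 the temperature stays at 1 and this reduces to plain proportional selection. For any positive value the exponent 1/root_temp is above 1 from the first ply, so low-visit moves are already picked less often than their visit share and are suppressed further as the game goes on.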

Laskos · Post by **Laskos** » Wed Apr 11, 2018 11:20 am

CMCanavessi wrote:New updated graphics with 116

Gauntlet score

Same here up to ID120: no significant progress since ID103, not only on Elo but also on my opening positional test suite of 200 positions.

On the tactical test suite there is also no significant progress. Let's hope for now that the change to move selection in the openings will end this real stalling (that in self-games was artificial). It surely was stuck in some local optima in the openings, and the whole training process was flawed. If that is not the issue, a network with more blocks/filters will be required.
