Why computing K that minimizes the sigmoid func. value?...

Discussion of chess software programming and technical issues.


cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani »

I found another bug :oops: :oops: and now it's

k = 1.325, e = 0.0812987569 (28.5129%)

So that's better, and I suppose it's in line with what can be expected for Andscacs.

Now I have started a local search to find the local minimum. I'm tuning 1049 parameters (counting each array element individually).
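For readers following along, one common way to run this kind of local search is a simple coordinate-wise hill climb over the parameter vector, of the kind described on the Texel tuning method wiki page linked later in this thread. The sketch below is only illustrative and is not Andscacs code; computeE is a hypothetical helper that returns the evaluation error over the training positions for the given parameter vector.

// Sketch of a coordinate-wise local search over integer evaluation parameters.
#include <vector>

double computeE(const std::vector<int>& params);  // assumed, engine-specific

std::vector<int> localSearch(std::vector<int> params) {
    double bestE = computeE(params);
    bool improved = true;
    while (improved) {
        improved = false;
        for (std::size_t i = 0; i < params.size(); ++i) {
            for (int step : {+1, -1}) {      // nudge each parameter up and down
                params[i] += step;
                double e = computeE(params);
                if (e < bestE) {             // keep the change if the error drops
                    bestE = e;
                    improved = true;
                } else {
                    params[i] -= step;       // otherwise revert it
                }
            }
        }
    }
    return params;  // a local minimum: no single +/-1 step improves E
}

Each full pass over the 1049 parameters costs one error evaluation per trial step, so the cost of computeE dominates the run time.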
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Why computing K that minimizes the sigmoid func. value?.

Post by Evert »

mvk wrote:
cdani wrote:I have done it again with a serious number of games (66,000), and now it's
k = 1.495000, e = 0.1408248361 (37.526%)
For determining 'k' that is probably good enough, but I fear that 66,000 games is far too few for the tuner to get anywhere if you use only game outcomes as the target function. The limitation is not the number of positions, but the number of independent signals to get out of it (game outcomes).

For example, the bias from the test set alone is already in the order of 1/sqrt(66000) = 0.4%.
That's actually a good point.
I haven't really played around with this tuning method because it's a bit of a hassle to process the games and extract suitable positions, but it's on my to-do list.

Now: does the quality of the games really matter very much? I ask because I would like to use this method for chess variants, for which it's rather hard (if not impossible) to find a suitably large set of games; I'd have to generate them first, and it would be much easier if I could just use fast games...
nionita
Posts: 175
Joined: Fri Oct 22, 2010 9:47 pm
Location: Austria

Re: Why computing K that minimizes the sigmoid func. value?.

Post by nionita »

mvk wrote:
nionita wrote:
mvk wrote:The 10 resulting residuals are

average: 0.32228706 (32.2%)
3*stdev: 0.00051216 (0.05%)
What is the distance (or spread) in the parameter space of the found solutions? I guess your parameters are integers, but the found solutions are real numbers. There will be a further kind of noise when you round the parameters.
How do you calculate that distance/spread metric and what does it mean? All resulting 10 vectors are somewhat different by manual inspection.
Any distance metric would be fine. But OK, you answered it: the 10 solutions are really different (and probably far enough from each other).

I was curious because it seems to me that when optimizing a noisy function, the chances are high that you find a different optimum every time, when the function is flat and the noise strong enough. At least this happens to me all the time.
cetormenter
Posts: 170
Joined: Sun Oct 28, 2012 9:46 pm

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cetormenter »

cdani wrote:I found another bug :oops: :oops: and now it's

k = 1.325, e = 0.0812987569 (28.5129%)

So that's better, and I suppose it's in line with what can be expected for Andscacs.

Now I have started a local search to find the local minimum. I'm tuning 1049 parameters (counting each array element individually).
1049 parameters is quite a lot. I have been having success lately by using a function with 3 parameters for each array. I use this for adding new evaluation terms to Nirvana. It seems to alleviate some of the overfitting problems that I have seen. For instance, I just added a term that scales the value of bishops based on the number of blocked pawns. With the function I get a smooth range of values, but if I try to tune each value individually I get something like a small penalty when there are 3 blocked pawns, a huge bonus when there are 4 blocked pawns, and then a huge penalty when there are 5 blocked pawns. Obviously this is nonsense; there was simply a small number of positions with 4 blocked pawns that ended up as wins for the side with one or more bishops.

The equation I mentioned is min + (max - min) * pow(x, bias) / pow(arrSize - 1, bias). I iterate min and max as usual, but for the bias I typically start off with 1 (a linear function) and then use a perturbation of bias / 4 to bias / 16.
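To make the shape of that formula concrete, here is a minimal sketch of how such a 3-parameter array could be generated. The function and variable names are mine, not Nirvana's, and the example values at the end are invented for illustration.

#include <cmath>
#include <vector>

// Fill an array of arrSize values from a 3-parameter curve:
//   value(x) = min + (max - min) * pow(x, bias) / pow(arrSize - 1, bias)
// bias = 1 gives a straight line from min to max; bias > 1 concentrates the
// change at the high end of the array, bias < 1 at the low end.
std::vector<double> curveArray(int arrSize, double min, double max, double bias) {
    std::vector<double> values(arrSize);
    for (int x = 0; x < arrSize; ++x)
        values[x] = min + (max - min) * std::pow(x, bias) / std::pow(arrSize - 1, bias);
    return values;
}

// Hypothetical usage: a bishop bonus indexed by the number of blocked pawns
// (0..8), controlled by three tunable parameters instead of nine entries:
//   std::vector<double> bishopBlockedPawnBonus = curveArray(9, -10.0, 25.0, 1.5);

The tuner then only has to search over min, max and bias, so a single noisy bucket (such as the 4-blocked-pawns case above) cannot pull one array entry far away from its neighbours.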
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani »

cetormenter wrote:
cdani wrote:I found another bug :oops: :oops: and now it's

k = 1.325, e = 0.0812987569 (28.5129%)

So that's better, and I suppose it's in line with what can be expected for Andscacs.

Now I have started a local search to find the local minimum. I'm tuning 1049 parameters (counting each array element individually).
1049 parameters is quite a lot. I have been having success lately by using a function with 3 parameters for each array. I use this for adding new evaluation terms to Nirvana. It seems to alleviate some of the overfitting problems that I have seen. For instance, I just added a term that scales the value of bishops based on the number of blocked pawns. With the function I get a smooth range of values, but if I try to tune each value individually I get something like a small penalty when there are 3 blocked pawns, a huge bonus when there are 4 blocked pawns, and then a huge penalty when there are 5 blocked pawns. Obviously this is nonsense; there was simply a small number of positions with 4 blocked pawns that ended up as wins for the side with one or more bishops.

The equation I mentioned is min + (max - min) * pow(x, bias) / pow(arrSize - 1, bias). I iterate min and max as usual, but for the bias I typically start off with 1 (a linear function) and then use a perturbation of bias / 4 to bias / 16.
Thanks!

Yes, I plan to use some type of smoothing too, after some other ideas. I will surely try your formula or a similar one. I cannot invent much here, my maths are absolutely basic :-)

For the moment I plan to keep snapshots of the parameters at regular points during the optimization, and then select manually, for some parameters, which snapshot I want to try. I will also try to smooth the most obvious overfitting by hand. All of this combined with re-optimizing specific parts that I think can gain something after the modification of other parameters, ...
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani »

Everything is working nicely, gaining strength after each iteration over all the parameters. Thanks to all!!

I have used multithreading to do the computations, because it was very slow.
Current e = 0.0796898969 (28.2294%)

Has someone computed the "e" value, for example, for Stockfish? Does it make any sense to compute it? I don't really understand the formulas, nor why this works. If someone tries to explain it, please don't use maths :-)

In particular, what is the error "e"? Where is the "error", the mistake, the thing that is done badly? :-)
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Why computing K that minimizes the sigmoid func. value?.

Post by petero2 »

cdani wrote:Everything is working nicely, gaining strength after each iteration over all the parameters. Thanks to all!!

I have used multithreading to do the computations, because it was very slow.
Current e = 0.0796898969 (28.2294%)

Has someone computed the "e" value, for example, for Stockfish? Does it make any sense to compute it? I don't really understand the formulas, nor why this works. If someone tries to explain it, please don't use maths :-)

In particular, what is the error "e"? Where is the "error", the mistake, the thing that is done badly? :-)
E is basically a measure of how well the evaluation function is able to predict the game outcome. I don't think the actual value of E is very interesting, because there are at least three factors that contribute to E:

1. Mistakes in the game continuation causing the game outcome to not be equal to the game theoretical value of the evaluated position.

2. Tactical aspects of a position that can not reasonably be modeled by the evaluation function.

3. Bad values of evaluation function parameters.

The goal of the algorithm is to improve 3, without being misled by the "noise" introduced by 1 and 2.

If there are lots of positions in the training data, the effects of 1 and 2 will on average not depend much on the parameter values, so by varying the parameters to reduce E, the parameter values get better. However, the individual contributions from 1, 2 and 3 to E are not known, so you can't say how good the evaluation function is based on the E value alone.
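For reference, the error being discussed here is typically computed as a mean squared difference between the game result and a winning probability predicted from the engine's score, as in the Texel tuning method. The sketch below follows that scheme; qScore stands for whatever score the engine assigns to a training position (for example a quiescence-search score) and is an assumption, not code from any particular engine.

#include <cmath>
#include <vector>

struct TrainingPos {
    double result;   // game outcome from White's point of view: 1.0, 0.5 or 0.0
    double qScore;   // engine score of the position in centipawns, White's point of view
};

// Predicted winning probability for a given score, using the scaling constant K
// that this thread is about (the K chosen to minimize E on the training set).
double sigmoid(double score, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * score / 400.0));
}

// E = mean squared difference between predicted and actual outcome.
double computeE(const std::vector<TrainingPos>& data, double K) {
    double sum = 0.0;
    for (const TrainingPos& p : data) {
        double diff = p.result - sigmoid(p.qScore, K);
        sum += diff * diff;
    }
    return sum / data.size();
}

Note that the percentages quoted in parentheses earlier in the thread (for example 0.0796898969 and 28.2294%) appear to be the square root of E, i.e. the RMS prediction error.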
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani »

petero2 wrote: E is basically a measure of how well the evaluation function is able to predict the game outcome. I don't think the actual value of E is very interesting, because there are at least three factors that contribute to E:

1. Mistakes in the game continuation causing the game outcome to not be equal to the game theoretical value of the evaluated position.

2. Tactical aspects of a position that can not reasonably be modeled by the evaluation function.

3. Bad values of evaluation function parameters.

The goal of the algorithm is to improve 3, without being misled by the "noise" introduced by 1 and 2.

If there are lots of positions in the training data, the effects of 1 and 2 will on average not depend much on the parameter values, so by varying the parameters to reduce E, the parameter values get better. However, the individual contributions from 1, 2 and 3 to E are not known, so you can't say how good the evaluation function is based on the E value alone.
Thanks! Now it's clear. I hope this explanation gets included here:
https://chessprogramming.wikispaces.com ... ing+Method

On this page there is another thing I don't understand. You say:

"The 39.4 improvement came when I changed the criteria for which positions to include in the test set. Initially I removed also all positions where the q-search score deviated too much from the search score in the actual game (which conveniently is saved by cutechess-cli in PGN comments). I believed that including those positions would just raise the "noise level" of the data and cause a worse solution to be found. Apparently this is not the case. I now believe that even though including these positions causes noise, the q-search function has to deal with them all the time in real games, so trying to learn how those positions should be evaluated on average is still beneficial."

So if I understand well, you say: "I won 39.4 by changing the criteria, but apparently changing the criteria was not a good idea". So if it was a bad idea, how was it able to win 39.4 Elo?


Another thing. I'm doing standard self-play tests at short and long time controls after each iteration of the optimizing algorithm. Until now the new parameters have always beaten the previous version. After one iteration the version is quite a bit worse, but the algorithm continues to run (on another computer), so it has not finished optimizing "E". Is this expected behavior? Or is it better to wait for the algorithm to finish, until no further improvement of "E" is possible, i.e. so that each new iteration on the parameters is expected to be good?

Thanks!
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani »

Another thing I just thought of. If I have already won 25 Elo in self-play, I suppose it's better to run a new batch of 60,000 games with the latest best version, as the new parameter values will then correlate better with the results of the games when tested, so they will be optimized better.

But if this is true, why not use games from a stronger engine like Stockfish? That would be even better.

If this is not true, I probably misunderstood something.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Why computing K that minimizes the sigmoid func. value?.

Post by petero2 »

cdani wrote:https://chessprogramming.wikispaces.com ... ing+Method

On this page there is another thing I don't understand. You say:

"The 39.4 improvement came when I changed the criteria for which positions to include in the test set. Initially I removed also all positions where the q-search score deviated too much from the search score in the actual game (which conveniently is saved by cutechess-cli in PGN comments). I believed that including those positions would just raise the "noise level" of the data and cause a worse solution to be found. Apparently this is not the case. I now believe that even though including these positions causes noise, the q-search function has to deal with them all the time in real games, so trying to learn how those positions should be evaluated on average is still beneficial."

So if I understand well, you say: "I won 39.4 by changing the criteria, but apparently changing the criteria was not a good idea". So if it was a bad idea, how was it able to win 39.4 Elo?
It was a good idea. What I said was that before testing I thought the idea was bad, but after testing I realized that it was actually good.
cdani wrote:Another thing. I'm doing standard self-play tests at short and long time controls after each iteration of the optimizing algorithm. Until now the new parameters have always beaten the previous version. After one iteration the version is quite a bit worse, but the algorithm continues to run (on another computer), so it has not finished optimizing "E". Is this expected behavior? Or is it better to wait for the algorithm to finish, until no further improvement of "E" is possible, i.e. so that each new iteration on the parameters is expected to be good?
My intention was to let the optimization run until it could not improve E any more, and only then use the computed parameters to test if the playing strength was improved.

If you can afford the CPU time to test playing strength after each iteration, there is however no harm in that, except that you have to spend time performing the tests. It is even possible that some of the intermediate parameter values are stronger than the final values, but I don't know if that actually happens in practice.