Why computing K that minimizes the sigmoid func. value?...

Discussion of chess software programming and technical issues.

cetormenter
Posts: 169
Joined: Sun Oct 28, 2012 8:46 pm

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cetormenter » Fri Nov 27, 2015 7:41 pm

cdani wrote:I found another bug :oops: :oops: and now it's

k = 1.325, e = 0.0812987569 (28.5129%)

So it's better, and I suppose in line with what can be expected for Andscacs.

Now I have started to do a local search to find the local minimum. I'm tuning 1049 parameters (counting each array element separately).
1049 parameters is quite a lot. I have been finding success by using a function with 3 parameters for each array. I use this when adding new evaluation terms to Nirvana. It seems to alleviate some overfitting problems that I have seen. For instance, I just added a term that scales the value of bishops based on the number of blocked pawns. With the function I get a smooth range of values, but if I try to tune each value individually I get something like a small penalty with 3 blocked pawns, a huge bonus with 4 blocked pawns, and then a huge penalty with 5 blocked pawns. Obviously this is nonsense: there were simply a small number of positions with 4 blocked pawns that ended up as wins for the side with one or more bishops.

The equation I mentioned is min + (max - min) * pow(x, bias) / pow(arrSize - 1, bias). I iterate min and max as usual, but for the bias I typically start at 1 (a linear function) and then use a perturbation of bias / 4 to bias / 16.

User avatar
cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani » Fri Nov 27, 2015 8:42 pm

cetormenter wrote:
cdani wrote:I found another bug :oops: :oops: and now it's

k = 1.325, e = 0.0812987569 (28.5129%)

So it's better, and I suppose in line with what can be expected for Andscacs.

Now I have started to do a local search to find the local minimum. I'm tuning 1049 parameters (counting each array element separately).
1049 parameters is quite a lot. I have been finding success by using a function with 3 parameters for each array. I use this when adding new evaluation terms to Nirvana. It seems to alleviate some overfitting problems that I have seen. For instance, I just added a term that scales the value of bishops based on the number of blocked pawns. With the function I get a smooth range of values, but if I try to tune each value individually I get something like a small penalty with 3 blocked pawns, a huge bonus with 4 blocked pawns, and then a huge penalty with 5 blocked pawns. Obviously this is nonsense: there were simply a small number of positions with 4 blocked pawns that ended up as wins for the side with one or more bishops.

The equation I mentioned is min + (max - min) * pow(x, bias) / pow(arrSize - 1, bias). I iterate min and max as usual, but for the bias I typically start at 1 (a linear function) and then use a perturbation of bias / 4 to bias / 16.
Thanks!

Yes, I plan to use some type of smoothing too, after trying some other ideas. I will surely try your formula or a similar one. I cannot invent much here; my maths are absolutely basic :-)

For the moment I plan to keep snapshots of the parameters at regular points during the optimization, and then manually select, for some parameters, which snapshot I want to try. I will also try to smooth the most obvious overfittings by hand. All combined with reoptimizing specific parts that I think can gain something after the modification of other parameters, ...

User avatar
cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani » Sun Nov 29, 2015 11:54 am

Everything is working nicely, gaining strength after each iteration on all the parameters. Thanks to all!!

I have multithreaded the computations, because they were very slow.
Current e = 0.0796898969 (28.2294%)

Has someone computed the "e" value for example for Stockfish? Does it make any sense to compute it? I really don't understand the formulas well, nor why this works. If someone tries to explain it, please don't use maths :-)

Particularly, what is the error "e"? Where is the "error", the mistake, the thing that is done badly? :-)

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

Post by petero2 » Sun Nov 29, 2015 2:01 pm

cdani wrote:Everything is working nicely, gaining strength after each iteration on all the parameters. Thanks to all!!

I have multithreaded the computations, because they were very slow.
Current e = 0.0796898969 (28.2294%)

Has someone computed the "e" value for example for Stockfish? Does it make any sense to compute it? I really don't understand the formulas well, nor why this works. If someone tries to explain it, please don't use maths :-)

Particularly, what is the error "e"? Where is the "error", the mistake, the thing that is done badly? :-)
E is basically a measure of how well the evaluation function is able to predict the game outcome. I don't think the actual value of E is very interesting, because there are at least three factors that contribute to E:

1. Mistakes in the game continuation causing the game outcome to not be equal to the game theoretical value of the evaluated position.

2. Tactical aspects of a position that can not reasonably be modeled by the evaluation function.

3. Bad values of evaluation function parameters.

The goal of the algorithm is to improve 3, without being misled by the "noise" introduced by 1 and 2.

If there are lots of positions in the training data, the effects of 1 and 2 will on average not depend much on the parameter values, so by varying the parameters to reduce E, the parameter values get better. However, the individual contributions from 1, 2 and 3 to E are not known, so you can't say how good the evaluation function is based on the E value alone.

User avatar
cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani » Sun Nov 29, 2015 8:08 pm

petero2 wrote: E is basically a measure of how well the evaluation function is able to predict the game outcome. I don't think the actual value of E is very interesting, because there are at least three factors that contribute to E:

1. Mistakes in the game continuation causing the game outcome to not be equal to the game theoretical value of the evaluated position.

2. Tactical aspects of a position that can not reasonably be modeled by the evaluation function.

3. Bad values of evaluation function parameters.

The goal of the algorithm is to improve 3, without being misled by the "noise" introduced by 1 and 2.

If there are lots of positions in the training data, the effects of 1 and 2 will on average not depend much on the parameter values, so by varying the parameters to reduce E, the parameter values get better. However, the individual contributions from 1, 2 and 3 to E are not known, so you can't say how good the evaluation function is based on the E value alone.
Thanks! Now it is clear. I hope this explanation gets included here:
https://chessprogramming.wikispaces.com ... ing+Method

On this page there is another thing I don't understand. You say:

"The 39.4 improvement came when I changed the criteria for which positions to include in the test set. Initially I removed also all positions where the q-search score deviated too much from the search score in the actual game (which conveniently is saved by cutechess-cli in PGN comments). I believed that including those positions would just raise the "noise level" of the data and cause a worse solution to be found. Apparently this is not the case. I now believe that even though including these positions causes noise, the q-search function has to deal with them all the time in real games, so trying to learn how those positions should be evaluated on average is still beneficial."

So if I understand correctly, you are saying "I gained 39.4 by changing the criteria, but apparently changing the criteria was not a good idea". So if it was a bad idea, how was it able to gain 39.4 Elo?


Another thing. I'm doing standard self-play tests at short and long time controls after each iteration of the optimizing algorithm. Until now the new parameters have always beaten the previous version. After one iteration the version is quite a bit worse, but the algorithm continues to run (on another computer), so it has not finished optimizing "E". Is this expected behavior? Or is it better to wait for the algorithm to finish, until no further improvement of "E" is possible, i.e. must each new iteration of the parameters be good?

Thanks!

User avatar
cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani » Sun Nov 29, 2015 10:11 pm

Another thing I just thought of. If I have already gained 25 Elo in self-play, I suppose it is better to run a new batch of 60,000 games with the latest best version, as the new parameter values will then correlate better with the results of the games, so they will be optimized better.

But if this is true, why not use games from a stronger engine like Stockfish? That would be even better.

If this is not true, I probably did not understand something.

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

Post by petero2 » Mon Nov 30, 2015 7:04 pm

cdani wrote:https://chessprogramming.wikispaces.com ... ing+Method

On this page there is another thing I don't understand. You say:

"The 39.4 improvement came when I changed the criteria for which positions to include in the test set. Initially I removed also all positions where the q-search score deviated too much from the search score in the actual game (which conveniently is saved by cutechess-cli in PGN comments). I believed that including those positions would just raise the "noise level" of the data and cause a worse solution to be found. Apparently this is not the case. I now believe that even though including these positions causes noise, the q-search function has to deal with them all the time in real games, so trying to learn how those positions should be evaluated on average is still beneficial."

So if I understand correctly, you are saying "I gained 39.4 by changing the criteria, but apparently changing the criteria was not a good idea". So if it was a bad idea, how was it able to gain 39.4 Elo?
It was a good idea. What I said was that before testing I thought the idea was bad, but after testing I realized that it was actually good.
cdani wrote:Another thing. I'm doing standard self-play tests at short and long time controls after each iteration of the optimizing algorithm. Until now the new parameters have always beaten the previous version. After one iteration the version is quite a bit worse, but the algorithm continues to run (on another computer), so it has not finished optimizing "E". Is this expected behavior? Or is it better to wait for the algorithm to finish, until no further improvement of "E" is possible, i.e. must each new iteration of the parameters be good?
My intention was to let the optimization run until it could not improve E any more, and only then use the computed parameters to test if the playing strength was improved.

If you can afford the CPU time to test playing strength after each iteration, there is however no harm in that, except that you have to spend time performing the tests. It is even possible that some of the intermediate parameter values are stronger than the final values, but I don't know if that actually happens in practice.

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

Post by petero2 » Mon Nov 30, 2015 7:18 pm

cdani wrote:Another thing I just thought of. If I have already gained 25 Elo in self-play, I suppose it is better to run a new batch of 60,000 games with the latest best version, as the new parameter values will then correlate better with the results of the games, so they will be optimized better.
That might be better. The first time I did this I got an additional +13 elo in hyper-bullet self test. Trying it again later has not helped much though.
cdani wrote:But if this is true, why not use games from a stronger engine like Stockfish? That would be even better.
That could work, but I think the games from the stronger engine should also be hyper-bullet games, so that they contain some mistakes leading to unbalanced positions.

Personally I did not want to improve my engine by using data generated by an engine that I had not created myself, even though I don't think it would have been morally or legally wrong to do so.

mvk
Posts: 589
Joined: Tue Jun 04, 2013 8:15 pm

Re: Why computing K that minimizes the sigmoid func. value?.

Post by mvk » Mon Nov 30, 2015 10:05 pm

petero2 wrote:2. Tactical aspects of a position that can not reasonably be modeled by the evaluation function.
Correct. But there are certain values you can expect from a raw eval or a qsearch, and these don't change much from program to program. Once search gets involved, the error drops, and does so differently for each program. Here you can see the result on sqrt(E) for different searches.

[image: sqrt(E) for different searches]

I think it can be useful to go from qsearch to 1 or 2 ply searches if you have many parameters with little coverage per parameter (such as piece/square tables with a parameter for each individual square).

The problem I still have with this method is that correlation isn't causation, and there will be selection bias in the input games. For example, a white bishop on h7 might be very bad on average in normal play. But if players know that, they will avoid the bad cases, and then the input games only show the cases where it works. The data from such games is then biased. Adding a search might solve that, because you invite a bit of the tradeoff into the optimisation. But in the end, playing games from the new vector is still the best thing, if only it weren't so expensive.

User avatar
cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

Post by cdani » Mon Dec 07, 2015 10:29 pm

So I published the new version of Andscacs with this self tuning done.

The first 4-6 iterations of the tuning gave the best improvements; I was testing after each iteration. The iteration after the last good one was really bad, so I decided to generate a new batch of games.

This time I obtained improvements, small ones, until the 9th iteration.

Here I tuned the piece values alone, not included in previous tunings, until the tuning finished. It was a small win. I have yet to try partial tuning.

Then I added parameters related to passed pawns and tuned everything related to passed pawns alone, obtaining a small win, but only in the first 4 iterations.

Then I tuned some new parameters related to knights, which converged quickly and gave a small win.

A new general tuning attempt was bad from the first iteration, so I decided to stop here.

For some of the parameters, instead of a static value, I use a proportional value that is increased/reduced by powers of two and values derived from them. To tune them I used this function:

Code: Select all

int incrementa_redueix_proporcionalment(int v, int increment) {
	int r, j;
	if (increment == 0)
		return v;
	//64
	j = abs(increment);
	if (j == 1)
		r = FerPun(PunI(v) >> 5, PunF(v) >> 5); //2
	else if (j == 2)
		r = FerPun(PunI(v) >> 4, PunF(v) >> 4); //4
	else if (j == 3)
		r = FerPun(PunI(v) >> 3, PunF(v) >> 3); //8
	else if (j == 4)
		r = FerPun(PunI(v) >> 3, PunF(v) >> 3) + FerPun(PunI(v) >> 4, PunF(v) >> 4); //12
	else if (j == 5)
		r = FerPun(PunI(v) >> 2, PunF(v) >> 2); //16
	else if (j == 6)
		r = FerPun(PunI(v) >> 2, PunF(v) >> 2) + FerPun(PunI(v) >> 4, PunF(v) >> 4); //20
	else if (j == 7)
		r = FerPun(PunI(v) >> 2, PunF(v) >> 2) + FerPun(PunI(v) >> 3, PunF(v) >> 3); //24
	else if (j == 8)
		r = FerPun(PunI(v) >> 1, PunF(v) >> 1); //32
	else if (j == 9)
		r = FerPun(PunI(v) >> 1, PunF(v) >> 1) + FerPun(PunI(v) >> 3, PunF(v) >> 3); //40
	else if (j == 10)
		r = FerPun(PunI(v) >> 1, PunF(v) >> 1) + FerPun(PunI(v) >> 2, PunF(v) >> 2); //48
	else if (j == 11)
		r = FerPun(PunI(v) >> 1, PunF(v) >> 1) + FerPun(PunI(v) >> 2, PunF(v) >> 2) + FerPun(PunI(v) >> 3, PunF(v) >> 3); //56
	else
		r = v; //64 (also covers j > 12, which previously left r uninitialized)
	return increment > 0 ? v + r : v - r;
}
PunI means the middlegame (mg) value, and PunF the endgame (eg) value.
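The if-else ladder above maps |increment| to a fixed fraction of v measured in 64ths: 2, 4, 8, 12, 16, 20, 24, 32, 40, 48, 56, 64. For a plain (unpacked, non-negative) integer the same mapping can be sketched table-driven; this is my reformulation for illustration, not Andscacs code, and the real function applies the shifts to the packed mg/eg halves separately rather than multiplying the whole word:

```c
#include <stdlib.h>

/* Fractions of v, in 64ths, selected by |increment| = 1..12. */
static const int num64[13] = {0, 2, 4, 8, 12, 16, 20, 24, 32, 40, 48, 56, 64};

/* Scale a plain non-negative integer score up or down by the
 * power-of-two-derived fraction that |increment| selects. */
int scale_proportionally(int v, int increment) {
    if (increment == 0)
        return v;
    int j = abs(increment);
    if (j > 12)
        j = 12;                      /* clamp: the ladder stops at 64/64 */
    int r = v * num64[j] / 64;       /* e.g. j == 8 selects 32/64 = half of v */
    return increment > 0 ? v + r : v - r;
}
```

For non-negative v this reproduces the shift arithmetic exactly (e.g. increment 12 doubles the value, -12 zeroes it); for negative v, right shifts round toward minus infinity while division truncates, which is one reason the original works on the two non-negative packed halves.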

I don't know if anyone is using something similar; I have not seen anything like this in other engines.
