
Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Mon Nov 23, 2015 6:56 am
by cdani
I have done it again with a serious number of games (66,000), and now it's
k = 1.495000, e = 0.1408248361 (37.526%)
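
For anyone following along, a minimal sketch of how such a k can be found, assuming the usual Texel-style sigmoid 1/(1 + 10^(-k*q/400)) over centipawn scores and training data stored as (score, result) pairs with result in {0, 0.5, 1}; the exact sigmoid form and the data layout are assumptions, not taken from this thread. The 37.526% in parentheses looks like the square root of e, i.e. the RMS error.

Code: Select all

def sigmoid(score_cp, k):
    # Predicted score (win probability plus half the draw probability) from a centipawn score.
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def residual(data, k):
    # Mean squared error between game results and predicted scores.
    # data: list of (score_cp, result) with result in {0.0, 0.5, 1.0}.
    return sum((r - sigmoid(s, k)) ** 2 for s, r in data) / len(data)

def find_k(data, lo=0.5, hi=3.0, rounds=20):
    # Coarse-to-fine scan: evaluate 11 candidate k values, then zoom in around the best one.
    for _ in range(rounds):
        step = (hi - lo) / 10.0
        best = min((lo + i * step for i in range(11)), key=lambda k: residual(data, k))
        lo, hi = best - step, best + step
    return best, residual(data, best)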

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Mon Nov 23, 2015 8:34 am
by mvk
cdani wrote:I have done it again with a serious number of games (66,000), and now it's
k = 1.495000, e = 0.1408248361 (37.526%)
For determining 'k' that is probably good enough, but I fear that 66,000 games is far too few for the tuner to get anywhere if you use only game outcomes as the target function. The limitation is not the number of positions, but the number of independent signals to get out of them (the game outcomes).

For example, the bias from the test set alone is already on the order of 1/sqrt(66,000) ≈ 0.4%.

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Mon Nov 23, 2015 8:54 am
by cdani
mvk wrote:
cdani wrote:I have done it again with a serious number of games (66,000), and now it's
k = 1.495000, e = 0.1408248361 (37.526%)
For determining 'k' that is probably good enough, but I fear that 66,000 games is far too few for the tuner to get anywhere if you use only game outcomes as the target function. The limitation is not the number of positions, but the number of independent signals to get out of them (the game outcomes).

For example, the bias from the test set alone is already on the order of 1/sqrt(66,000) ≈ 0.4%.
Aha. I followed the advice presented here

https://chessprogramming.wikispaces.com ... ing+Method

It says 64,000 games.

Following the procedure presented there, I tune on the qsearch score of every position, so some millions of positions. This worked for Texel. I will give it a try...
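
A rough sketch of that procedure, reusing sigmoid() from the earlier snippet; qsearch_cp() stands in for the engine's quiescence search and is a hypothetical hook, as is the (fen, result) position format:

Code: Select all

def tuning_error(positions, params, k):
    # positions: list of (fen, result) pairs; params: the evaluation weights being tuned.
    # qsearch_cp() is a placeholder for the engine's quiescence search, returning a
    # centipawn score from White's point of view under the given parameters.
    e = 0.0
    for fen, result in positions:
        q = qsearch_cp(fen, params)
        e += (result - sigmoid(q, k)) ** 2
    return e / len(positions)

The tuner then adjusts params to reduce this error, for example by changing one parameter at a time and keeping any change that lowers it, as described on that page.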

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Mon Nov 23, 2015 9:38 am
by mvk
Selecting the test set size is still a bit of a black art. For my first experiments in 2010 I used 20,000 positions. When I did a retune with the same amount, it lost about 10 elo. When I increased to 150,000 positions I lost 20 elo. But I was using GM games at the time, and the noise from such games is quite high. When I did most of the tuning in 2011 I used 1M positions. I have 3M in Floyd now, sampled from CCRL games only.

I ran a quick test last week using 10 disjoint test sets of 1M positions each. For each test set I let the tuner run for the same number of iterations, starting from the same initial vector of zeroes. The 10 resulting residuals are

average: 0.32228706 (32.2%)
3*stdev: 0.00051216 (0.05%)

This is a much smaller spread than I expected to see; I was expecting ~0.3%.

What I still have to do is run a cutechess tournament with these 10 vectors and see what the spread in elo is. I will start one later this week, as I don't have an idle PC available right now.
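
For completeness, the average and spread quoted above can be computed like this (a sketch only; it is an assumption that "3*stdev" means three times the sample standard deviation, and the residual values below are placeholders, not the real numbers):

Code: Select all

import statistics

# One residual per disjoint 1M-position set; placeholder values for illustration.
residuals = [0.322, 0.323, 0.322, 0.322, 0.323, 0.322, 0.322, 0.323, 0.322, 0.322]

average = statistics.mean(residuals)
spread = 3 * statistics.stdev(residuals)   # sample standard deviation, times three
print(f"average: {average:.8f}   3*stdev: {spread:.8f}")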

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Mon Nov 23, 2015 11:17 am
by Ferdy
mvk wrote:I ran a quick test last week using 10 disjoint test sets of 1M positions each.
What criteria do you use to consider that the 10 sets are disjoint? For example, if there is a position in set 1 with a rook on an open file, does this mean the other 9 sets have no positions with a rook on an open file?

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Mon Nov 23, 2015 12:18 pm
by mvk
Ferdy wrote:
mvk wrote:I ran a quick test last week using 10 disjoint test sets of 1M positions each.
What criteria do you use to consider that the 10 sets are disjoint? For example, if there is a position in set 1 with a rook on an open file, does this mean the other 9 sets have no positions with a rook on an open file?
To be precise: I sampled 10M positions from CCRL games by dumping all positions and selecting every 23rd or so (some odd number) position from the dump, then shuffling all selected positions and keeping the first 10M of the result. Then I split that into 10 equal parts of 1M each.

The only overlap is due to positions occurring more than once. But then, I also take the elo difference into account in my fit, because that is one of the main predictors of game outcome and I want to use it in my contempt model later.

The idea is to find out what the impact of position selection bias is.
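
In code, that selection could look roughly like this (a sketch assuming all positions fit in memory as one list; the stride and counts follow the description above, the seed is arbitrary):

Code: Select all

import random

def make_disjoint_sets(all_positions, stride=23, keep=10_000_000, parts=10, seed=1):
    # Take every stride-th position from the dump, shuffle, keep the first `keep`,
    # then cut the result into `parts` equal slices.
    sampled = all_positions[::stride]
    random.Random(seed).shuffle(sampled)
    sampled = sampled[:keep]
    size = len(sampled) // parts
    return [sampled[i * size:(i + 1) * size] for i in range(parts)]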

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Mon Nov 23, 2015 12:49 pm
by Ferdy
mvk wrote:
Ferdy wrote:
mvk wrote:I ran a quick test last week using 10 disjoint test sets of 1M positions each.
What criteria do you use to consider that the 10 sets are disjoint? For example, if there is a position in set 1 with a rook on an open file, does this mean the other 9 sets have no positions with a rook on an open file?
To be precise: I sampled 10M positions from CCRL games by dumping all positions and selecting every 23rd or so (some odd number) position from the dump, then shuffling all selected positions and keeping the first 10M of the result. Then I split that into 10 equal parts of 1M each.

The only overlap is due to positions occurring more than once. But then, I also take the elo difference into account in my fit, because that is one of the main predictors of game outcome and I want to use it in my contempt model later.

The idea is to find out what the impact of position selection bias is.
Nice idea.
I am thinking of controlling the extreme presence of features in the training sets. For example, positions with a white rook on an open file (result 1-0) should not dominate the sets; there must also be some such positions with result 0-1 or a draw.
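
One possible way to implement that kind of cap, sketched with a hypothetical has_feature() predicate (e.g. "white rook on an open file"); this is only an illustration, not something either engine is known to do:

Code: Select all

import random
from collections import defaultdict

def cap_feature_result(positions, has_feature, max_share=0.25, seed=1):
    # positions: list of (position, result). Group by (feature present?, result) and
    # keep at most max_share of the whole set from any single group.
    buckets = defaultdict(list)
    for pos, result in positions:
        buckets[(has_feature(pos), result)].append((pos, result))
    cap = int(max_share * len(positions))
    rng = random.Random(seed)
    balanced = []
    for group in buckets.values():
        rng.shuffle(group)
        balanced.extend(group[:cap])
    return balanced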

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Wed Nov 25, 2015 6:49 pm
by nionita
mvk wrote:The 10 resulting residuals are

average: 0.32228706 (32.2%)
3*stdev: 0.00051216 (0.05%)
What is the distance (or spread) in parameter space of the found solutions? I guess your parameters are integers, but the found solutions are real numbers. There will be a further kind of noise when you round the parameters.
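
One way to read that question, as a sketch: given the 10 tuned parameter vectors, look at pairwise Euclidean distances or at the per-parameter standard deviation (the choice of metric is an assumption here, not something specified in the post):

Code: Select all

import itertools
import math
import statistics

def pairwise_distances(vectors):
    # Euclidean distance between every pair of tuned parameter vectors.
    return [math.dist(a, b) for a, b in itertools.combinations(vectors, 2)]

def per_parameter_spread(vectors):
    # Standard deviation of each individual parameter across the solutions.
    return [statistics.stdev(column) for column in zip(*vectors)]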

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Wed Nov 25, 2015 7:57 pm
by AlvaroBegue
mvk wrote:To be precise: I sampled 10M positions from CCRL games by dumping all positions and selecting every 23rd or so (some odd number) position from the dump, then shuffling all selected positions and keeping the first 10M of the result. Then I split that into 10 equal parts of 1M each.
You would get much more independent results if you divided the games into 10 bins instead.
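
The suggestion amounts to splitting at the game level rather than the position level, roughly as follows (the games sequence and its .positions attribute are hypothetical):

Code: Select all

import random

def split_by_game(games, parts=10, seed=1):
    # Assign whole games to bins, so positions from one game never end up in two sets.
    games = list(games)
    random.Random(seed).shuffle(games)
    bins = [[] for _ in range(parts)]
    for i, game in enumerate(games):
        bins[i % parts].extend(game.positions)   # hypothetical attribute: the game's positions
    return bins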

Re: Why computing K that minimizes the sigmoid func. value?.

Posted: Thu Nov 26, 2015 10:19 am
by mvk
nionita wrote:
mvk wrote:The 10 resulting residuals are

average: 0.32228706 (32.2%)
3*stdev: 0.00051216 (0.05%)
What is the distance (or spread) in parameter space of the found solutions? I guess your parameters are integers, but the found solutions are real numbers. There will be a further kind of noise when you round the parameters.
How do you calculate that distance/spread metric, and what does it mean? All 10 resulting vectors are somewhat different on manual inspection.

In Floyd, "K" is really effectively 0.1 because I'm aiming at millipawn resolution evaluation and not centipawns. Roughly speaking, a single integer point change on an arbitrary parameter has less than 1e-7 effect on the overall residual. I believe integer rounding plays no role in the results.
AlvaroBegue wrote:You would get much more independent results if you divided the games into 10 bins instead.
Maybe that is the case, but maybe not: to make the sets truly independent one would need as many CCRL games as positions, but there are only about 1.9M available.

Anyway, here are the results of the round-robin tournament:

Version 0.7 is the version I tuned using 3M positions. Versions .0 through .9 are exactly the same code, except that each was compiled with a vector tuned on one of the 10 disjoint 1M-position sets.

Code: Select all

110000 game(s) loaded, 0 game(s) with unknown result ignored.
00:00:00,00
Rank Name       Elo    +    - games score oppo. draws 
   1 floyd0.7     5    4    4 20000   51%    -1   23% <-- 3M (default)
   2 floyd.3      3    4    4 20000   50%     0   24%  <-- 1M 
   3 floyd.4      3    4    4 20000   50%     0   24%  <-- 1M
   4 floyd.7      2    4    4 20000   50%     0   24%  <-- 1M
   5 floyd.1      2    4    4 20000   50%     0   24%  <-- 1M
   6 floyd.6      1    4    4 20000   50%     0   23%  <-- 1M
   7 floyd.8      0    4    4 20000   50%     0   24%  <-- 1M
   8 floyd.9     -1    4    4 20000   50%     0   23%  <-- 1M
   9 floyd.5     -3    4    4 20000   50%     0   24%  <-- 1M
  10 floyd.0     -3    4    4 20000   49%     0   23%  <-- 1M
  11 floyd.2     -7    4    4 20000   49%     1   23%  <-- 1M

LoS matrix:
          fl fl fl fl fl fl fl fl fl fl fl
floyd0.7     82 83 89 91 95 96 98 99 99 99
floyd.3   17    51 62 65 77 80 90 98 98 99
floyd.4   16 48    61 64 77 79 90 98 98 99
floyd.7   10 37 38    53 67 70 84 96 96 99
floyd.1    8 34 35 46    64 67 82 95 95 99
floyd.6    4 22 22 32 35    53 71 90 91 99
floyd.8    3 19 20 29 32 46    67 89 89 99
floyd.9    1  9  9 15 17 28 32    77 78 98
floyd.5    0  1  1  3  4  9 10 22    51 91
floyd.0    0  1  1  3  4  8 10 21 48    90
floyd.2    0  0  0  0  0  0  0  1  8  9
Some observations:
1. None of the versions tuned with 1M positions can beat the version tuned with 3M positions. I would not expect this outcome if there were no real difference between using 1M and 3M.
2. There is probably no real elo difference among the 10 versions tuned with 1M positions.
3. The differences are really small. In practice, it might be OK to use 1M for daily work and save some computing time while tuning, then increase the set size prior to a release.
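
For readers puzzling over the LoS matrix above: the likelihood of superiority between two engines is commonly computed from their head-to-head wins and losses, with draws ignored. A sketch of that common formula follows; the tool that produced the table may compute it differently.

Code: Select all

import math

def los(wins, losses):
    # Likelihood that the first engine is stronger, given a head-to-head result.
    # Draws carry no information in this formula and are ignored.
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))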