Why computing K that minimizes the sigmoid func. value?...

cdani
Posts: 2175
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

I have done it again with a serious number of games (66,000), and now it's
k = 1.495000, e = 0.1408248361 (37.526%)
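For context, the k in question is typically found by a one-dimensional search minimizing the mean squared error between game results and the sigmoid of the engine's qsearch scores. A minimal sketch in Python, assuming the common Texel form of the sigmoid, 1/(1 + 10^(-k*s/400)) with s in centipawns (the function names are illustrative, not anyone's actual code):

```python
import math

def sigmoid(score_cp, k):
    # Winning probability from a centipawn score, Texel form:
    # 1 / (1 + 10^(-k*s/400))
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def mean_sq_error(data, k):
    # data: list of (qsearch_score_cp, result), result in {0, 0.5, 1}
    return sum((r - sigmoid(s, k)) ** 2 for s, r in data) / len(data)

def find_k(data, lo=0.0, hi=5.0, iters=60):
    # Simple ternary search; E(k) is unimodal in practice.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if mean_sq_error(data, m1) < mean_sq_error(data, m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0
```

With real data the search runs over (score, result) pairs extracted from the game set; the ternary search only relies on the error being unimodal in k, which holds in practice.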

mvk
Posts: 589
Joined: Tue Jun 04, 2013 8:15 pm

Re: Why computing K that minimizes the sigmoid func. value?.

cdani wrote:I have done it again with a serious number of games (66.000), and now it's
k = 1.495000, e = 0.1408248361 (37.526%)
For determining 'k' that is probably good enough, but I fear that 66,000 games is much too low for the tuner to go anywhere if you use only game outcomes as target function. The limitation is not the number of positions, but the number of independent signals to get out of it (game outcomes).

For example, the bias from the test set alone is already in the order of 1/sqrt(66000) = 0.4%.
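The 0.4% figure follows from the usual 1/sqrt(n) standard-error scaling for n independent outcomes; a quick check of the arithmetic:

```python
import math

# Rough statistical noise on an average estimated from n independent
# game outcomes: the standard error scales as 1/sqrt(n).
def outcome_noise(n_games):
    return 1.0 / math.sqrt(n_games)

print(f"{outcome_noise(66000):.4f}")  # prints 0.0039, i.e. about 0.4%
```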
[Account deleted]

cdani
Posts: 2175
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra
Contact:

Re: Why computing K that minimizes the sigmoid func. value?.

mvk wrote:
cdani wrote:I have done it again with a serious number of games (66.000), and now it's
k = 1.495000, e = 0.1408248361 (37.526%)
For determining 'k' that is probably good enough, but I fear that 66,000 games is much too low for the tuner to go anywhere if you use only game outcomes as target function. The limitation is not the number of positions, but the number of independent signals to get out of it (game outcomes).

For example, the bias from the test set alone is already in the order of 1/sqrt(66000) = 0.4%.
Aha. I followed the advice presented here

https://chessprogramming.wikispaces.com ... ing+Method

It says 64,000 games.

Following the procedure presented there, I use the qsearch score of every position to tune, so some millions of positions. It worked for Texel. I will give it a try...

mvk
Posts: 589
Joined: Tue Jun 04, 2013 8:15 pm

Re: Why computing K that minimizes the sigmoid func. value?.

Selecting the test set size is still a bit of a black art. For my first experiments in 2010 I used 20,000 positions. When I did a retune with the same amount, it lost about 10 elo. When I increased to 150,000 positions I lost 20 elo. But I was using GM games at the time, and the noise from such games is quite high. When I did most of the tuning in 2011 I used 1M positions. I have 3M in Floyd now, sampled from CCRL games only.

I ran a quick test last week using 10 disjoint test sets of 1M positions each. For each test set I let the tuner run for the same number of iterations, starting from the same initial vector of zeroes. The 10 resulting residuals are

average: 0.32228706 (32.2%)
3*stdev: 0.00051216 (0.05%)

This is a much smaller spread than I expected to see. I was expecting ~0.3%.

What I still have to do is run a cutechess tournament with these 10 vectors and see what the spread in elo is. I will start one later this week, as I don't have an idle PC available right now.
[Account deleted]

Ferdy
Posts: 4292
Joined: Sun Aug 10, 2008 1:15 pm
Location: Philippines

Re: Why computing K that minimizes the sigmoid func. value?.

mvk wrote:I ran a quick test last week using 10 disjoint test sets of 1M positions each.
What criteria do you use to consider the 10 sets disjoint? For example, if there is a position in set 1 with a rook on an open file, does this mean the other 9 sets have no positions with a rook on an open file?

mvk
Posts: 589
Joined: Tue Jun 04, 2013 8:15 pm

Re: Why computing K that minimizes the sigmoid func. value?.

Ferdy wrote:
mvk wrote:I ran a quick test last week using 10 disjoint test sets of 1M positions each.
What criteria do you use to consider that 10 sets are disjoint? For example if there is position in set 1 with rook in open file, does this mean the other 9 sets have no position with rook in open file?
To be precise: I sampled 10M positions from CCRL games by dumping all positions, selecting every 23rd or so (some odd number) position from the dump, then shuffling all selected positions and keeping the first 10M of the result. Then I split that into 10 equal parts of 1M each.
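The sampling procedure described above can be sketched as follows (a sketch only; the stride and set sizes are the ones mentioned in the post):

```python
import random

def make_disjoint_sets(all_positions, stride=23, total=10_000_000, parts=10):
    # Take every `stride`-th position from the full dump, shuffle,
    # keep the first `total`, then split into `parts` equal slices.
    sampled = all_positions[::stride]
    random.shuffle(sampled)
    kept = sampled[:total]
    size = len(kept) // parts
    return [kept[i * size:(i + 1) * size] for i in range(parts)]
```

The only overlap between the resulting parts comes from positions that occur more than once in the original dump, as noted below.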

The only overlap is due to positions occurring more than once. But then: I also take the elo difference into consideration in my fit, because that is one of the main predictors of game outcome and I want to use it in my contempt model later.

The idea is to find out what is the impact of position selection bias.
[Account deleted]

Ferdy
Posts: 4292
Joined: Sun Aug 10, 2008 1:15 pm
Location: Philippines

Re: Why computing K that minimizes the sigmoid func. value?.

mvk wrote:
Ferdy wrote:
mvk wrote:I ran a quick test last week using 10 disjoint test sets of 1M positions each.
What criteria do you use to consider that 10 sets are disjoint? For example if there is position in set 1 with rook in open file, does this mean the other 9 sets have no position with rook in open file?
To be precise: I sampled 10M positions from CCRL games by dumping all positions and select every 23rd or so (some odd number) position from the dump, then shuffle all selected positions and keep the first 10M of the result. Then I split it into 10 equal parts of 1M each.

The only overlap is due to positions occurring more than once. But then: I also take the elo difference into consideration in my fit, because that is one of the main predictors of game outcome and I want to use it in my contempt model later.

The idea is to find out what is the impact of position selection bias.
Nice idea.
I am thinking of controlling the extreme presence of features in the training sets. For example, positions with a white rook on an open file (result 1-0) should not dominate the sets; there must also be some positions with result 0-1 or a draw.
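One way to enforce that kind of balance is to bucket positions by (feature, result) and cap each bucket. A sketch, where `feature_fn` stands in for whatever feature test one cares about (e.g. rook on an open file); the function and its names are hypothetical:

```python
from collections import defaultdict
import random

def balanced_sample(positions, feature_fn, per_bucket, seed=0):
    # positions: list of (fen, result), result in {"1-0", "0-1", "1/2"}.
    # Group by (feature present?, result) and take at most `per_bucket`
    # from each group, so no single combination dominates the set.
    buckets = defaultdict(list)
    for fen, result in positions:
        buckets[(feature_fn(fen), result)].append((fen, result))
    rng = random.Random(seed)
    out = []
    for group in buckets.values():
        rng.shuffle(group)
        out.extend(group[:per_bucket])
    rng.shuffle(out)
    return out
```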

nionita
Posts: 163
Joined: Fri Oct 22, 2010 7:47 pm
Location: Austria

Re: Why computing K that minimizes the sigmoid func. value?.

mvk wrote:The 10 resulting residuals are

average: 0.32228706 (32.2%)
3*stdev: 0.00051216 (0.05%)
What is the distance (or spread) in the parameter space of the found solutions? I guess your parameters are integers, but the found solutions are real numbers. There will be a further kind of noise when you round the parameters.

AlvaroBegue
Posts: 925
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Why computing K that minimizes the sigmoid func. value?.

mvk wrote:To be precise: I sampled 10M positions from CCRL games by dumping all positions and select every 23rd or so (some odd number) position from the dump, then shuffle all selected positions and keep the first 10M of the result. Then I split it into 10 equal parts of 1M each.
You would get much more independent results if you divided the games into 10 bins instead.
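Binning whole games rather than positions could be sketched like this; assigning each game to exactly one bin keeps positions from the same game out of the other bins, which is what makes the bins more independent (names are illustrative):

```python
import random

def split_by_game(games, parts=10, seed=0):
    # games: list of games, each a list of positions. Whole games are
    # assigned round-robin to bins after a shuffle, so no game's
    # positions are ever split across two bins.
    order = list(range(len(games)))
    random.Random(seed).shuffle(order)
    bins = [[] for _ in range(parts)]
    for i, g in enumerate(order):
        bins[i % parts].extend(games[g])
    return bins
```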

mvk
Posts: 589
Joined: Tue Jun 04, 2013 8:15 pm

Re: Why computing K that minimizes the sigmoid func. value?.

nionita wrote:
mvk wrote:The 10 resulting residuals are

average: 0.32228706 (32.2%)
3*stdev: 0.00051216 (0.05%)
What is the distance (or spread) in the parameter space of the found solutions? I guess your parameters are integers, but the found solutions are real numbers. There will be a further kind of noise you get when round the parameters.
How do you calculate that distance/spread metric and what does it mean? All resulting 10 vectors are somewhat different by manual inspection.

In Floyd, "K" is effectively 0.1, because I'm aiming at millipawn-resolution evaluation rather than centipawns. Roughly speaking, a single integer-point change on an arbitrary parameter has less than a 1e-7 effect on the overall residual. I believe integer rounding plays no role in the results.
AlvaroBegue wrote:You would get much more independent results if you divided the games into 10 bins instead.
Maybe that is the case, but maybe not: to make the sets truly independent one would need that many CCRL games, but only about 1.9M are available.

Anyway, here are the results of the round robin tournament:

Version 0.7 is the version I tuned using 3M positions. Versions .[0-9] are exactly the same code, except that each was compiled with a vector tuned on one of the 10 disjoint 1M sets.

Code: Select all

``````110000 game(s) loaded, 0 game(s) with unknown result ignored.
00:00:00,00
Rank Name       Elo    +    - games score oppo. draws
1 floyd0.7     5    4    4 20000   51%    -1   23% <-- 3M (default)
2 floyd.3      3    4    4 20000   50%     0   24%  <-- 1M
3 floyd.4      3    4    4 20000   50%     0   24%  <-- 1M
4 floyd.7      2    4    4 20000   50%     0   24%  <-- 1M
5 floyd.1      2    4    4 20000   50%     0   24%  <-- 1M
6 floyd.6      1    4    4 20000   50%     0   23%  <-- 1M
7 floyd.8      0    4    4 20000   50%     0   24%  <-- 1M
8 floyd.9     -1    4    4 20000   50%     0   23%  <-- 1M
9 floyd.5     -3    4    4 20000   50%     0   24%  <-- 1M
10 floyd.0     -3    4    4 20000   49%     0   23%  <-- 1M
11 floyd.2     -7    4    4 20000   49%     1   23%  <-- 1M

LoS matrix:
fl fl fl fl fl fl fl fl fl fl fl
floyd0.7     82 83 89 91 95 96 98 99 99 99
floyd.3   17    51 62 65 77 80 90 98 98 99
floyd.4   16 48    61 64 77 79 90 98 98 99
floyd.7   10 37 38    53 67 70 84 96 96 99
floyd.1    8 34 35 46    64 67 82 95 95 99
floyd.6    4 22 22 32 35    53 71 90 91 99
floyd.8    3 19 20 29 32 46    67 89 89 99
floyd.9    1  9  9 15 17 28 32    77 78 98
floyd.5    0  1  1  3  4  9 10 22    51 91
floyd.0    0  1  1  3  4  8 10 21 48    90
floyd.2    0  0  0  0  0  0  0  1  8  9``````
Some observations:
1. None of the versions tuned with 1M positions can beat the version tuned with 3M. I wouldn't expect this outcome if there were no difference between using 1M and 3M.
2. There is probably no real elo difference among the 10 versions tuned with 1M positions.
3. The differences are really small. In practice, it might be fine to use 1M for daily work and save some computing time while tuning, and then increase the set size prior to a release.
[Account deleted]