Diminishing returns in fixed depth testing revisited.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Diminishing returns in fixed depth testing revisited.

Post by Ajedrecista »

Hello:

There are several fixed depth tests in computer chess history. Some of these tests can be found here.

I have used Quazar 0.4 w32 for this experiment (depth d vs. depth (d - 1)), which is currently unfinished, although I do not know if I will be able to run more matches. I want to add 2500 or 5000 more games in the next months (once CuteChess GUI 1.0 is out with the feature of pausing and resuming matches), but I probably will not manage it because each new 2500-game match takes an insane amount of time for me.

I use cutechess-cli 0.5.1 for this fixed-depth testing. The engines.json files look like this one:

Code:

[      
   { 
      "command" : "Quazar_0.4_w32.exe", 
      "name" : "Depth_9", 
      "options" : [ 
         { 
            "name" : "Hash", 
            "value" : 16 
         }
      ], 
      "protocol" : "uci", 
      "workingDirectory" : "H:\\d9" 
   }, 
   { 
      "command" : "Quazar_0.4_w32.exe", 
      "name" : "Depth_10", 
      "options" : [ 
         { 
            "name" : "Hash", 
            "value" : 16 
         } 
      ], 
      "protocol" : "uci", 
      "workingDirectory" : "H:\\d10" 
   }
]
The command lines used look like this one:

Code:

cutechess-cli -engine conf="Depth_9" depth=9 -engine conf="Depth_10" depth=10 -each tc=inf -concurrency 2 -draw 100 50 -resign 5 900 -games 2 -rounds 1250 -pgnin klo_250_eco_a00-e97_variations.pgn -repeat -pgnout 09_Vs_10.pgn
I have run nine 2500-game matches so far, that is, 22500 games. In the majority of cases, the error bars are below ±10 Elo at a 95% confidence level:

Code:

version 0057.2, Copyright (C) 1997-2010 Remi Coulom.
compiled Apr  5 2012 17:26:01.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under the terms and conditions of the GNU General Public License.
See http://www.gnu.org/copyleft/gpl.html for details.
ResultSet>readpgn 01_Vs_02.pgn
2500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 02_Vs_03.pgn
5000 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 03_Vs_04.pgn
7500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 04_Vs_05.pgn
10000 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 05_Vs_06.pgn
12500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 06_Vs_07.pgn
15000 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 07_Vs_08.pgn
17500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 08_Vs_09.pgn
20000 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 09_Vs_10.pgn
22500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
Iteration 100: 0.0058971
Iteration 200: 0.0006629
Iteration 300: 7.88414e-005
00:00:00,00
ResultSet-EloRating>confidence 0.95
0.95
ResultSet-EloRating>ratings
Rank Name         Elo      Diff     +      -      Games  Score    Oppo.   Draws    Win            W-L-D
   1 Depth_10   850.50      0.00  10.36  10.36     2500  69.96%   716.04  44.80%  47.56%      1189-191-1120
   2 Depth_9    716.04   -134.46   7.45   7.45     5000  51.62%   701.84  41.52%  30.86%      1543-1381-2076
   3 Depth_8    553.17   -162.86   7.62   7.62     5000  49.78%   552.50  36.48%  31.54%      1577-1599-1824
   4 Depth_7    388.96   -164.21   7.94   7.94     5000  52.88%   361.27  30.36%  37.70%      1885-1597-1518
   5 Depth_6    169.37   -219.59   8.17   8.17     5000  49.47%   174.79  26.70%  36.12%      1806-1859-1335
   6 Depth_5    -39.37   -208.75   8.17   8.17     5000  49.94%   -41.55  25.92%  36.98%      1849-1855-1296
   7 Depth_4   -252.48   -213.11   8.46   8.46     5000  51.16%  -267.93  21.28%  40.52%      2026-1910-1064
   8 Depth_3   -496.48   -244.00   9.44   9.44     5000  54.37%  -547.21  15.06%  46.84%      2342-1905-753
   9 Depth_2   -841.94   -345.46   9.15   9.15     5000  43.80%  -772.13  17.20%  35.20%      1760-2380-860
  10 Depth_1  -1047.77   -205.83  11.67  11.67     2500  24.00%  -841.94  22.40%  12.80%       320-1620-560
ResultSet-EloRating>x
ResultSet>x
Surely I have not used the best settings in BayesElo, but I think the overall ratings would not change much with other commands. I entered these ratings into Excel (ratings on the y axis, ln(depth) on the x axis). Fitting a line by least squares from depth 5 up to depth 10 gives Y(depth_i) ~ 1296.6*ln(depth_i) - 2137.6 (depth_i > 4), with R² ~ 0.9992. The important part of this line is the slope, because the intercept only reflects the average value of the ratings (zero in this case). Elo variation with depth is proportional to the slope and inversely proportional to depth. But why depth_i > 4? Because the line fits my data points very well there... which is maybe a little cheat. ;)
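
For anyone who wants to reproduce the fit, here is a minimal sketch with numpy (I did the original fit in Excel; the numbers below are the BayesElo ratings from the output above):

Code:

import numpy as np

# BayesElo ratings of Quazar 0.4 w32 at depths 5..10, from the output above.
depths = np.array([5, 6, 7, 8, 9, 10], dtype=float)
ratings = np.array([-39.37, 169.37, 388.96, 553.17, 716.04, 850.50])

# Least-squares fit of rating against ln(depth): Y ~ m*ln(d) + n.
x = np.log(depths)
m, n = np.polyfit(x, ratings, 1)

# Coefficient of determination R^2.
pred = m * x + n
r2 = 1.0 - np.sum((ratings - pred) ** 2) / np.sum((ratings - ratings.mean()) ** 2)

print(f"Y(d) ~ {m:.1f}*ln(d) + ({n:.1f}), R^2 ~ {r2:.4f}")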

I chose a line for the following reason: when depth d tends to infinity, ln(d) ~ 1 + 1/2 + 1/3 + ... + 1/d and ln(d - 1) ~ 1 + 1/2 + 1/3 + ... + 1/(d - 1), so delta_x = ln(d) - ln(d - 1) = ln[d/(d - 1)] ~ 1/d. If Y(x) = mx + n, then dY/dx = m, and the estimated Elo gain is delta_Y = m*delta_x ~ m/d ---> 0 as d ---> infinity (diminishing returns exist in this model).

A quadratic function fails the same analysis: Y(x) = ax² + bx + c gives dY/dx = 2ax + b, with delta_x ~ 1/d as before, so the estimated Elo gain is delta_Y = (dY/dx)*delta_x = (2ax + b)/d ~ {2a*[d + (d - 1)]/2 + b}/d ~ 2a = constant: diminishing returns do not exist in this model (the same holds for other polynomials of higher degree). In dY/dx, I take the midpoint x ~ [d + (d - 1)]/2 because it makes sense to me. I am not taking error bars into account.
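
A quick numerical check of both behaviours (just a sketch; m, a and b are made-up constants for illustration, not fitted values):

Code:

import numpy as np

# Per-ply Elo gain under the two models above, using the same expressions.
m, a, b = 1296.6, 10.0, 50.0

for d in (10, 100, 1000, 10000):
    dx = np.log(d / (d - 1))                  # delta_x = ln[d/(d - 1)] ~ 1/d
    line_gain = m * dx                        # tends to 0: diminishing returns
    quad_gain = (2 * a * (d - 0.5) + b) * dx  # tends to 2a: roughly constant
    print(f"d = {d:5d}   line: {line_gain:8.3f}   quad: {quad_gain:8.3f}")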

For depth_i > 4, using Y(depth_i) ~ 1296.6*ln(depth_i) - 2137.6 and computing errors (rounded to 0.1 Elo) as error_i = Y(depth_i) - rating_i:

Code:

Depth:     Rating:     Error:
------     -------     ------

  5          -39.4      -11.4
  6          169.4       16.2
  7          389.0       -3.5
  8          553.2        5.4
  9          716.0       -4.7
 10          850.5       -2.6
Fortunately, the errors are not too large IMHO, especially for higher depths... although there are still too few data points to be sure of anything. For Quazar 0.4 w32, I expect a rating difference between depth 10 and depth 11 of Y(11) - Y(10) ~ 1296.6*ln(11/10) ~ 123.6 Elo (surely it will end up far from this value).

Another known fact is the growth of the draw ratio as depth rises (except at very low depths, where my results are a bit strange): this steady growth is easily seen in the BayesElo output.

I provide all the PGN files (around 8 MB, compressed), the win-loss-draw statistics of each 2500-game match, and the openings used, in a PGN file:

Fixed_depth_testing_of_Quazar_0.4_w32.rar (8.08 MB)

IIRC, this Zippyshare link will die after 30 days of inactivity.

Code:

Finished game 2499 (Depth_2 vs Depth_1): 1/2-1/2 {Draw by 3-fold repetition}
Score of Depth_1 vs Depth_2: 320 - 1620 - 560  [0.24] 2500
ELO difference: -200
Finished match

-----------------------------------------------------------------------------------

Finished game 2500 (Depth_2 vs Depth_3): 0-1 {Black wins by adjudication}
Score of Depth_2 vs Depth_3: 140 - 2060 - 300  [0.12] 2500
ELO difference: -353
Finished match

-----------------------------------------------------------------------------------

Finished game 2500 (Depth_3 vs Depth_4): 1/2-1/2 {Draw by adjudication}
Score of Depth_3 vs Depth_4: 282 - 1765 - 453  [0.20] 2500
ELO difference: -237
Finished match
Warning: QObject::killTimers: timers cannot be stopped from another thread

-----------------------------------------------------------------------------------

Finished game 2499 (Depth_5 vs Depth_4): 1/2-1/2 {Draw by adjudication}
Score of Depth_4 vs Depth_5: 261 - 1628 - 611  [0.23] 2500
ELO difference: -213
Finished match

-----------------------------------------------------------------------------------

Finished game 2500 (Depth_5 vs Depth_6): 0-1 {Black wins by adjudication}
Score of Depth_5 vs Depth_6: 221 - 1594 - 685  [0.23] 2500
ELO difference: -214
Finished match

-----------------------------------------------------------------------------------

Finished game 2500 (Depth_6 vs Depth_7): 0-1 {Black wins by adjudication}
Score of Depth_6 vs Depth_7: 212 - 1638 - 650  [0.21] 2500
ELO difference: -225
Finished match

-----------------------------------------------------------------------------------

Finished game 2500 (Depth_7 vs Depth_8): 0-1 {Black wins by adjudication}
Score of Depth_7 vs Depth_8: 247 - 1385 - 868  [0.27] 2500
ELO difference: -171
Finished match

-----------------------------------------------------------------------------------

Finished game 2500 (Depth_8 vs Depth_9): 1/2-1/2 {Draw by 3-fold repetition}
Score of Depth_8 vs Depth_9: 192 - 1352 - 956  [0.27] 2500
ELO difference: -175
Finished match

-----------------------------------------------------------------------------------

Finished game 2499 (Depth_10 vs Depth_9): 1/2-1/2 {Draw by adjudication}
Score of Depth_9 vs Depth_10: 191 - 1189 - 1120  [0.30] 2500
ELO difference: -147
Finished match
Any suggestions, results, corrections... are welcome, as always.

Regards from Spain.

Ajedrecista.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Diminishing returns in fixed depth testing revisited.

Post by Adam Hair »

Ajedrecista wrote: [...]
Hi Jesús,

I started a large study of diminishing returns a few months ago that is now on hold. I do not know when I will return to it. I did manage to accumulate some data on Houdini 2.0c, from ply 2 to ply 17. In my study, I used a suggestion from Don Dailey and matched ply x against plies x-2 through x+2. Since I am uncertain when I will do anything with my data, I am offering it to you so that you may compare it with your test. This is the link: http://www.mediafire.com/?y64sk28iij0p74q
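
For reference, the pairing scheme is easy to script; a sketch (engine names, depth range, game counts and file names are my assumptions here, not the actual setup) that prints one cutechess-cli command per distinct pairing:

Code:

# "ply x vs. plies x-2 through x+2" scheme: pairing each ply only against
# the two plies below it covers every distinct pair exactly once.
MIN_PLY, MAX_PLY = 2, 17

for x in range(MIN_PLY, MAX_PLY + 1):
    for y in (x - 2, x - 1):
        if y < MIN_PLY:
            continue
        print(f'cutechess-cli -engine conf="Depth_{y}" depth={y} '
              f'-engine conf="Depth_{x}" depth={x} '
              f'-each tc=inf -games 2 -rounds 250 '
              f'-pgnout {y:02d}_Vs_{x:02d}.pgn')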

My focus was to compare the change in Elo as depth increases to the change in move selection as depth increases. I modified the similarity tool by inserting 2000 quiet positions (from a Dann Corbit EPD) and changed the commands so that I could make the engines search to various depths. For Houdini 2.0c, I have data from ply 2 to ply 20:

Code:

sim version 3

  Key:

  1) Houdini 2.0c 64-bit ply02(time: 100 ms  scale: 1.0)
  2) Houdini 2.0c 64-bit ply03(time: 100 ms  scale: 1.0)
  3) Houdini 2.0c 64-bit ply04(time: 100 ms  scale: 1.0)
  4) Houdini 2.0c 64-bit ply05(time: 100 ms  scale: 1.0)
  5) Houdini 2.0c 64-bit ply06(time: 100 ms  scale: 1.0)
  6) Houdini 2.0c 64-bit ply07(time: 100 ms  scale: 1.0)
  7) Houdini 2.0c 64-bit ply08(time: 100 ms  scale: 1.0)
  8) Houdini 2.0c 64-bit ply09(time: 100 ms  scale: 1.0)
  9) Houdini 2.0c 64-bit ply10(time: 100 ms  scale: 1.0)
 10) Houdini 2.0c 64-bit ply11(time: 100 ms  scale: 1.0)
 11) Houdini 2.0c 64-bit ply12(time: 100 ms  scale: 1.0)
 12) Houdini 2.0c 64-bit ply13(time: 100 ms  scale: 1.0)
 13) Houdini 2.0c 64-bit ply14(time: 100 ms  scale: 1.0)
 14) Houdini 2.0c 64-bit ply15(time: 100 ms  scale: 1.0)
 15) Houdini 2.0c 64-bit ply16(time: 100 ms  scale: 1.0)
 16) Houdini 2.0c 64-bit ply17(time: 100 ms  scale: 1.0)
 17) Houdini 2.0c 64-bit ply18(time: 100 ms  scale: 1.0)
 18) Houdini 2.0c 64-bit ply19(time: 100 ms  scale: 1.0)
 19) Houdini 2.0c 64-bit ply20(time: 100 ms  scale: 1.0)

         1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18    19
  1.  ----- 76.85 66.70 58.55 50.20 44.60 40.20 36.25 33.30 31.80 30.45 29.15 28.80 27.35 26.80 26.80 25.95 25.30 25.10
  2.  76.85 ----- 81.25 68.85 58.80 51.55 46.20 41.05 38.40 37.30 35.65 33.70 32.40 31.10 29.80 29.40 28.50 27.45 27.35
  3.  66.70 81.25 ----- 81.25 67.95 58.60 51.75 45.10 41.60 40.05 38.55 36.70 35.50 33.85 32.45 32.25 31.55 30.20 29.95
  4.  58.55 68.85 81.25 ----- 79.75 66.50 57.90 49.50 45.30 43.60 41.30 39.55 37.95 35.75 34.40 34.15 33.30 32.10 31.70
  5.  50.20 58.80 67.95 79.75 ----- 79.20 65.85 55.95 51.10 47.75 45.15 43.00 40.75 39.55 37.90 37.90 36.60 34.75 33.90
  6.  44.60 51.55 58.60 66.50 79.20 ----- 78.85 64.45 56.65 52.90 49.45 47.10 45.35 43.10 41.55 41.60 40.25 38.35 37.80
  7.  40.20 46.20 51.75 57.90 65.85 78.85 ----- 77.35 65.70 59.95 56.10 53.20 51.00 49.15 47.55 46.80 45.40 43.05 42.10
  8.  36.25 41.05 45.10 49.50 55.95 64.45 77.35 ----- 79.85 70.00 64.20 60.65 57.35 55.10 53.05 51.65 49.55 47.80 46.90
  9.  33.30 38.40 41.60 45.30 51.10 56.65 65.70 79.85 ----- 84.15 74.50 68.75 63.40 59.95 56.95 55.25 53.30 51.00 50.30
 10.  31.80 37.30 40.05 43.60 47.75 52.90 59.95 70.00 84.15 ----- 86.35 77.25 70.25 65.60 62.05 60.00 58.00 55.50 53.95
 11.  30.45 35.65 38.55 41.30 45.15 49.45 56.10 64.20 74.50 86.35 ----- 87.40 78.75 72.90 68.05 65.95 63.00 59.90 58.05
 12.  29.15 33.70 36.70 39.55 43.00 47.10 53.20 60.65 68.75 77.25 87.40 ----- 89.15 81.65 75.45 72.60 68.90 65.45 63.15
 13.  28.80 32.40 35.50 37.95 40.75 45.35 51.00 57.35 63.40 70.25 78.75 89.15 ----- 89.05 81.70 77.55 72.85 69.05 66.60
 14.  27.35 31.10 33.85 35.75 39.55 43.10 49.15 55.10 59.95 65.60 72.90 81.65 89.05 ----- 89.90 83.75 77.80 73.15 69.95
 15.  26.80 29.80 32.45 34.40 37.90 41.55 47.55 53.05 56.95 62.05 68.05 75.45 81.70 89.90 ----- 90.65 83.85 78.25 74.40
 16.  26.80 29.40 32.25 34.15 37.90 41.60 46.80 51.65 55.25 60.00 65.95 72.60 77.55 83.75 90.65 ----- 90.95 83.80 78.85
 17.  25.95 28.50 31.55 33.30 36.60 40.25 45.40 49.55 53.30 58.00 63.00 68.90 72.85 77.80 83.85 90.95 ----- 91.45 85.15
 18.  25.30 27.45 30.20 32.10 34.75 38.35 43.05 47.80 51.00 55.50 59.90 65.45 69.05 73.15 78.25 83.80 91.45 ----- 92.10
 19.  25.10 27.35 29.95 31.70 33.90 37.80 42.10 46.90 50.30 53.95 58.05 63.15 66.60 69.95 74.40 78.85 85.15 92.10 -----
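
For readers without the modified sim tool, the same kind of matrix can be approximated with python-chess; a rough sketch (paths are hypothetical, and this is not the actual tool, which is a standalone program):

Code:

import itertools
import chess
import chess.engine

ENGINE_PATH = "Houdini_20c_x64.exe"   # hypothetical path
EPD_FILE = "quiet_positions.epd"      # hypothetical EPD file

# Load the test positions.
with open(EPD_FILE) as f:
    boards = [chess.Board.from_epd(line)[0] for line in f if line.strip()]

# Best move of every position at each fixed depth (hash persists between
# searches here, which the real tool may handle differently).
picks = {}
with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as eng:
    for d in range(2, 21):
        picks[d] = [eng.play(b, chess.engine.Limit(depth=d)).move for b in boards]

# Similarity = percentage of positions where both depths pick the same move.
for d1, d2 in itertools.combinations(range(2, 21), 2):
    same = sum(m1 == m2 for m1, m2 in zip(picks[d1], picks[d2]))
    print(f"ply{d1:02d} vs ply{d2:02d}: {100.0 * same / len(boards):.2f}%")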
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Diminishing returns in fixed depth testing revisited.

Post by Ajedrecista »

Hi Adam:

EXCELLENT WORK! I used BayesElo again to compute ratings, with exactly the same commands as before:

Code:

version 0057.2, Copyright (C) 1997-2010 Remi Coulom.
compiled Apr  5 2012 17:26:01.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under the terms and conditions of the GNU General Public License.
See http://www.gnu.org/copyleft/gpl.html for details.
ResultSet>readpgn Houdini_2.0c_Plytest_.pgn
57792 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
Iteration 100: 0.0102116
Iteration 200: 0.00213777
Iteration 300: 0.000505583
Iteration 400: 0.000122885
Iteration 500: 3.00645e-005
00:00:00,00
ResultSet-EloRating>confidence 0.95
0.95
ResultSet-EloRating>ratings
Rank Name              Elo      Diff     +      -      Games  Score    Oppo.   Draws     Win           W-L-D
   1 Houdini_ply17    895.88     0.00   7.31   7.31     3792  67.64%   775.44  51.21%  42.04%      1594-256-1942
   2 Houdini_ply16    816.87   -79.01   5.91   5.91     5792  58.68%   756.40  52.38%  32.49%      1882-876-3034
   3 Houdini_ply15    738.32   -78.55   5.06   5.06     8000  51.17%   729.33  49.58%  26.39%      2111-1923-3966
   4 Houdini_ply14    649.52   -88.80   5.09   5.09     8000  51.02%   640.73  47.90%  27.07%      2166-2002-3832
   5 Houdini_ply13    555.06   -94.46   5.17   5.17     8000  51.31%   543.69  43.85%  29.39%      2351-2141-3508
   6 Houdini_ply12    452.69  -102.36   5.26   5.26     8000  52.21%   434.14  41.92%  31.25%      2500-2146-3354
   7 Houdini_ply11    334.23  -118.47   5.42   5.42     8000  52.54%   309.04  36.89%  34.10%      2728-2321-2951
   8 Houdini_ply10    197.76  -136.46   5.71   5.71     8000  53.07%   161.26  31.74%  37.20%      2976-2485-2539
   9 Houdini_ply09     30.63  -167.13   5.91   5.91     8000  51.98%     8.32  26.54%  38.71%      3097-2780-2123
  10 Houdini_ply08   -172.51  -203.15   5.97   5.97     8000  47.40%  -145.57  25.68%  34.56%      2765-3181-2054
  11 Houdini_ply07   -326.21  -153.70   5.92   5.92     8000  49.36%  -318.37  26.76%  35.98%      2878-2981-2141
  12 Houdini_ply06   -484.47  -158.25   5.84   5.84     8000  50.36%  -489.54  26.31%  37.20%      2976-2919-2105
  13 Houdini_ply05   -647.13  -162.67   5.94   5.94     8000  50.41%  -653.43  24.80%  38.01%      3041-2975-1984
  14 Houdini_ply04   -812.32  -165.18   6.25   6.25     8000  51.36%  -839.98  20.32%  41.20%      3296-3078-1626
  15 Houdini_ply03   -990.73  -178.42   6.99   6.99     6000  38.80%  -899.01  19.53%  29.03%      1742-3086-1172
  16 Houdini_ply02  -1237.58  -246.85   9.98   9.98     4000  13.09%  -901.53  11.72%   7.22%       289-3242-469
ResultSet-EloRating>x
ResultSet>x
If I plot ratings on the vertical axis and ln(depth) on the horizontal axis, I get a graph with a VERY similar shape, just like in my experiment! More 'chaotic' at low depths and very linear as depth grows. Again, I do not consider the error bars, which are pretty small here! Well done.

I do the same trick as before: fit a least-squares line to some of the points, discarding the lower depths (eliminating the non-linear zone). Your excellent data makes it difficult to decide where to start, because the line fits lots of points very well. I finally chose depth 10 as the starting point (a highly arbitrary decision; it could have been 9, 8 or 7, for example). I obtain Y(depth_i) ~ 1305.3*ln(depth_i) - 2798 (R² ~ 0.9995); computing errors as error_i = Y(depth_i) - rating_i and rounding everything to 0.1 Elo:

Code:

Depth:     Rating:     Error:
------     -------     ------

  10        197.8        9.8
  11        334.2       -2.2
  12        452.7       -7.1
  13        555.1       -5.0
  14        649.5       -2.8
  15        738.3       -1.5
  16        816.9        4.2
  17        895.9        4.3
My estimate for depth 18 is Y(18) - Y(17) ~ 1305.3*ln(18/17) ~ 74.6 Elo; the slope of the fitted line changes with the set of points used:

Code:

Predictions for the difference between depth 17 and depth 18, fitting from depth d up to depth 17 (rounded to 0.1 Elo):

 d     R²     Slope:     Slope*ln(18/17)
 -     --     ------     ---------------

 6   0.9976   1370.8          78.4
 7   0.9977   1396            79.8
 8   0.9966   1391.3          79.5
 9   0.9984   1341.8          76.7
10   0.9995   1305.3          74.6
11   0.9998   1283.5          73.4
12   1        1270.1          72.6
13   0.9999   1267.1          72.4
14   0.9998   1263.6          72.2
You can see that my predictions vary very little: the maximum difference is |79.8 - 72.2| = 7.6. Since the error bars will exceed ±7.6/2 = ±3.8 Elo for the number of games you are playing in this test, I would say that all my predictions fall inside the error bars, which is good (take this assumption with extreme care because I am not fully sure of it).
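
The whole sensitivity table can be regenerated in a few lines; a sketch with numpy, taking the ratings from the BayesElo output above:

Code:

import numpy as np

# Fit rating ~ m*ln(depth) + n from each starting depth up to 17, then
# predict the depth-17 -> depth-18 gain as m*ln(18/17).
ratings = {
     6: -484.47,  7: -326.21,  8: -172.51,  9:   30.63, 10: 197.76,
    11:  334.23, 12:  452.69, 13:  555.06, 14:  649.52, 15: 738.32,
    16:  816.87, 17:  895.88,
}

for start in range(6, 15):
    ds = np.array([d for d in ratings if d >= start], dtype=float)
    ys = np.array([r for d, r in ratings.items() if d >= start])
    m, n = np.polyfit(np.log(ds), ys, 1)
    print(f"fit from depth {start:2d}: slope {m:7.1f}, "
          f"predicted 17->18 gain: {m * np.log(18 / 17):5.1f} Elo")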

I feel that my results are more credible now that I have seen yours. Great! Please keep up the good work.

Regards from Spain.

Ajedrecista.
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Diminishing returns in fixed depth testing revisited.

Post by Rebel »

Excellent, I hope my experiment will produce similar results, the more engines the more reliable. Will take some weeks I fear.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Diminishing returns in fixed depth testing revisited.

Post by Don »

Rebel wrote:Excellent, I hope my experiment will produce similar results, the more engines the more reliable. Will take some weeks I fear.
This data could be used to provide a very crude estimation of "God's ELO" too by trying to find a diminishing return model that fits the data. If you can estimate how strong 100 ply search would be and compare it to a 150 ply search and find little difference, you would know you were close. I think a 200 ply search would come quite close to perfect chess but if you don't think so then probably you would agree that a 500 ply search would come close.

I think I came up with something around 4300 ELO but it's all speculation - no way to know for sure and I think it could vary a lot if you get any assumptions wrong.
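
As an illustration only, one can blindly extend the logarithmic fit from the Quazar data far outside its fitted range (a very strong assumption, and the scale is relative to that engine pool, not absolute Elo):

Code:

import numpy as np

# Quazar fit from earlier in the thread: Y(d) ~ 1296.6*ln(d) - 2137.6,
# fitted only at depths 5-10. Extrapolating it this far is pure speculation.
m, n = 1296.6, -2137.6

for d in (50, 100, 150, 200, 500):
    print(f"depth {d:3d}: Y ~ {m * np.log(d) + n:7.1f} (pool-relative scale)")

# Gap between a 100-ply and a 150-ply search under this model:
print(f"Y(150) - Y(100) ~ {m * np.log(150 / 100):.1f} Elo")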

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Diminishing returns in fixed depth testing revisited.

Post by Don »

Don wrote: [...]
P.S. I might have been going by the increase in the number of draws. Presumably when you get to the point that neither side can win, you are playing perfect chess.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Uri Blass
Posts: 10269
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Diminishing returns in fixed depth testing revisited.

Post by Uri Blass »

Don wrote:
Rebel wrote:Excellent, I hope my experiment will produce similar results, the more engines the more reliable. Will take some weeks I fear.
This data could be used to provide a very crude estimation of "God's ELO" too by trying to find a diminishing return model that fits the data. If you can estimate how strong 100 ply search would be and compare it to a 150 ply search and find little difference, you would know you were close. I think a 200 ply search would come quite close to perfect chess but if you don't think so then probably you would agree that a 500 ply search would come close.

I think I came up with something around 4300 ELO but it's all speculation - no way to know for sure and I think it could vary a lot if you get any assumptions wrong.

Don
I think that there is no constant number for it; it clearly depends on who plays against whom.

Suppose A, B and C are strong enough to be unbeatable with White, but they can lose with Black, so they are not perfect players.

A against B may give a 50 Elo advantage for A.
B against C may give a 50 Elo advantage for B.
A against C may give only a 75 Elo advantage for A. So if you play only A against B and B against C, you get a 100 Elo advantage for A relative to C, but if you play only A against C, you get a 75 Elo advantage for A relative to C.
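
To put numbers on this, a small sketch using the logistic Elo formula, expected score = 1/(1 + 10^(-diff/400)): the direct A-C result is incompatible with the +100 that transitivity would predict.

Code:

# Pairwise expected scores for the A/B/C example above.
def expected_score(elo_diff):
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

print(f"A vs B (+50):             {expected_score(50):.3f}")
print(f"B vs C (+50):             {expected_score(50):.3f}")
print(f"A vs C, transitive +100:  {expected_score(100):.3f}")
print(f"A vs C, observed +75:     {expected_score(75):.3f}")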
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Diminishing returns in fixed depth testing revisited.

Post by Don »

Uri Blass wrote:
I think that there is no constant number for it; it clearly depends on who plays against whom.
Yes, that is clearly true. The reason is very simple: the rating system is based on an assumption that is not true, namely that there is a transitive relationship between the strengths of chess players and that strength can be measured by a single number.

Imagine if we applied the ELO system to "general athletic performance" and gave every human being an ELO rating that represented their athletic prowess. It would be worthless to determine which of us is better at any particular contest. Would Tiger Woods beat Roger Federer at tennis? Would Federer beat Woods at golf? Of course not! Which of these 2 athletes would win at shuffleboard? The more specific the task the better ELO works but the truth of the matter is that skill at anything is multi-dimensional. That's why some players are very comfortable with tactical complications and others are masters of positional play. All top players can do both very well but that doesn't mean they are equally good at both.

I looked at the Wikipedia article on Elo and found this:

The first mathematical concern addressed by the USCF was the use of the normal distribution. They found that this did not accurately represent the actual results achieved, particularly by the lower rated players. Instead they switched to a logistical distribution model, which the USCF found provided a better fit for the actual results achieved. FIDE still uses the normal distribution as the basis for rating calculations as suggested by Elo himself.

So even the mathematical model used can affect the accuracy of the ELO system.
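
The difference between the two models is easy to see numerically; a sketch comparing the expected-score curves (the per-player sigma of 200 is the textbook value, my assumption here):

Code:

import math

# Expected score: logistic curve (USCF) vs. cumulative normal (Elo/FIDE).
def logistic(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def normal(diff, sigma=200.0 * math.sqrt(2.0)):
    # sigma is the standard deviation of the performance *difference*
    # when each player's performance is N(rating, 200^2).
    return 0.5 * (1.0 + math.erf(diff / (sigma * math.sqrt(2.0))))

for d in (0, 100, 200, 400, 600):
    print(f"diff {d:3d}: logistic {logistic(d):.3f}   normal {normal(d):.3f}")

# The curves agree closely in the middle and diverge in the tails, which is
# where the USCF found the normal model fit lower-rated players poorly.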


Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: Diminishing returns in fixed depth testing revisited.

Post by tvrzsky »

What about the influence of hash table size? It seems fair to give both opponents the same value, but then, unless it is big enough, the engine searching deeper has relatively less space in the table than its opponent. Are you sure, for example, that 16 MB is enough for a 10-ply search? Otherwise it could introduce a false diminishing effect, IMHO.
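
A back-of-envelope version of this worry (every number below is an assumption for illustration, not a measurement of Quazar): if a search visits far more unique positions than the table has entries, the deeper side suffers more replacement churn.

Code:

ENTRY_BYTES = 16                 # assumed size of one hash entry
NODES_AT_DEPTH_10 = 5_000_000    # assumed node count of one 10-ply search
EBF = 2.0                        # assumed effective branching factor

entries = 16 * 2**20 // ENTRY_BYTES   # entries that fit in 16 MB (~1.05M)

for d in range(8, 13):
    nodes = NODES_AT_DEPTH_10 * EBF ** (d - 10)
    print(f"depth {d:2d}: ~{nodes / 1e6:5.1f}M nodes vs {entries / 1e6:.2f}M "
          f"entries -> overfilled ~{nodes / entries:.0f}x")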
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Diminishing returns in fixed depth testing revisited.

Post by Adam Hair »

tvrzsky wrote:What about the influence of hash table size? It seems fair to give both opponents the same value, but then, unless it is big enough, the engine searching deeper has relatively less space in the table than its opponent. Are you sure, for example, that 16 MB is enough for a 10-ply search? Otherwise it could introduce a false diminishing effect, IMHO.
That was a concern of mine. For that reason, I allocated more hash for the versions of Houdini that searched more plies. So, any diminishing returns found in that data would not be related to limited hash.

I do believe that diminishing returns can be explained by the lower percentage of unique moves chosen as compared to the previous ply as depth increases. This is an assumption that is supported by my Houdini data, data I have accumulated for several other engines, and previous studies made by others.