Diminishing returns in fixed depth testing revisited.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Diminishing returns in fixed depth testing revisited.

Post by Adam Hair »

Don wrote:
Daniel Shawul wrote:I just read this interesting quote from the wiki page you linked but I am not sure I agree with it.
If two programs play with 5 vs 6 ply search, the second engine has a 20% depth advantage. With 10 vs 11 it's only 10%. So of course the difference in wins is smaller. ...

Diminishing returns are only proven (IMO) if 6 vs 5 wins more games than 12 vs 10 because only then are you comparing something linear and you give a linear advantage.
So the argument is that at higher depths 11/10 is less than 6/5, which is why the return is diminishing. But what if it is nodes searched that give strength? In that case BF^11/BF^10 = BF and BF^6/BF^5 = BF, so nothing can be deduced from this. Yes, experiment tells us there are diminishing returns with additional search, but I am not sure of the explanation given above.
There are diminishing returns, but part of that is due to the fact that there are more draws as programs play stronger.

In fact, many years ago it appeared the gain for adding 1 ply of search was nearly linear. In other words a constant amount for each ply of depth. It turns out it's not quite linear but slightly decreases with depth. The depth ratio has almost nothing to do with it.
For self-tests, the higher draw rate is connected to the lower percentage of move choices that differ between consecutive plies as the depth increases.

I suppose that a source of the higher draw rate among stronger engines is that the number of possible good moves decreases as depth increases.
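To put Daniel's node-count point in concrete terms, here is a minimal sketch (the constant effective branching factor BF is an idealization; real engines do not have one):

Code: Select all

# Node-count sketch: with a constant effective branching factor BF,
# every extra ply multiplies the tree by the same factor, so 6-vs-5
# and 11-vs-10 give the deeper side the same node advantage.
BF = 2.0  # assumed effective branching factor

for shallow, deep in [(5, 6), (10, 11)]:
    ratio = BF**deep / BF**shallow  # nodes(deep) / nodes(shallow)
    print(f"{deep} ply vs {shallow} ply: node ratio = {ratio}")
# Both print 2.0: the "20% vs 10% depth advantage" framing by itself
# says nothing about the per-node (or per-ply) return.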
Ajedrecista
Posts: 1972
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Diminishing returns in fixed depth testing revisited.

Post by Ajedrecista »

Hello:

I have tested EXchess 6.50b w32 this time, from depth 1 up to depth 10, exactly as I did for Quazar. Conditions were the same as in the first post of this topic. Here is the BayesElo 0057.2 output, with error bars for 95% confidence:

Code: Select all

version 0057.2, Copyright (C) 1997-2010 Remi Coulom.
compiled Apr  5 2012 17:26:01.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under the terms and conditions of the GNU General Public License.
See http://www.gnu.org/copyleft/gpl.html for details.
ResultSet>readpgn 01_Vs_02.pgn
2500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 02_Vs_03.pgn
5000 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 03_Vs_04.pgn
7500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 04_Vs_05.pgn
10000 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 05_Vs_06.pgn
12500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 06_Vs_07.pgn
15000 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 07_Vs_08.pgn
17500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 08_Vs_09.pgn
20000 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>readpgn 09_Vs_10.pgn
22500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
Iteration 100: 0.0033999
Iteration 200: 0.000231515
Iteration 300: 1.61843e-005
00:00:00,00
ResultSet-EloRating>confidence 0.95
0.95
ResultSet-EloRating>ratings
Rank Name        Elo     Diff     +      -      Games  Score    Oppo.   Draws     Win           W-L-D
   1 Depth_10  720.75     0.00  11.33  11.33     2500  68.82%   589.21  33.96%  51.84%      1296-355-849
   2 Depth_9   589.21  -131.54   8.01   8.01     5000  50.01%   589.18  34.06%  32.98%      1649-1648-1703
   3 Depth_8   457.61  -131.59   8.11   8.11     5000  50.92%   448.50  32.00%  34.92%      1746-1654-1600
   4 Depth_7   307.79  -149.82   8.27   8.27     5000  50.88%   300.41  28.96%  36.40%      1820-1732-1448
   5 Depth_6   143.20  -164.59   8.34   8.34     5000  49.27%   147.45  27.22%  35.66%      1783-1856-1361
   6 Depth_5   -12.89  -156.09   8.56   8.56     5000  52.56%   -36.32  24.00%  40.56%      2028-1772-1200
   7 Depth_4  -215.84  -202.95   8.94   8.94     5000  50.13%  -221.68  19.06%  40.60%      2030-2017-953
   8 Depth_3  -430.46  -214.62   9.02   9.02     5000  49.74%  -426.48  17.44%  41.02%      2051-2077-872
   9 Depth_2  -637.12  -206.66   9.35   9.35     5000  54.40%  -676.36  18.32%  45.24%      2262-1822-916
  10 Depth_1  -922.25  -285.13  13.87  13.87     2500  15.36%  -637.12  18.24%   6.24%       156-1888-456
ResultSet-EloRating>x
ResultSet>x
The total number of games is 22500. I have not got results as good as with Quazar when fitting the curve, but it is still OK. Representing ratings on the y axis and ln(depth) on the x axis, fitting only for depth > 5 (an arbitrary choice again; with Quazar it was depth > 4, so I included one more point in the Quazar case than in this one), I get Y(depth_i) ~ 1126.3*ln(depth_i) - 1880.4 (R² ~ 0.9993). The slope of the least-squares fitted line is less than in the Quazar case: it corresponds to a smaller Elo gain as depth grows in the EXchess test. The fits seem good because R² > 0.999 in both the Quazar and EXchess cases, so this logarithmic approach does not seem very bad.

At this moment, the expected Elo gain between depth 10 and depth 11 (under these conditions) is more or less Y(11) - Y(10) ~ 1126.3*ln(11/10) ~ 107.3 Elo. Who knows? I have to remark that the shape of the curve is again very similar to the Quazar and Houdini tests.
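For anyone who wants to reproduce the fit, a minimal sketch (assuming NumPy; the ratings are the BayesElo values from the output above):

Code: Select all

import numpy as np

# Ratings for depths 6..10 from the BayesElo output above.
depths = np.array([6, 7, 8, 9, 10])
ratings = np.array([143.2, 307.8, 457.6, 589.2, 720.8])

# Least-squares fit of rating = a*ln(depth) + b.
a, b = np.polyfit(np.log(depths), ratings, 1)
print(f"Y(depth) ~ {a:.1f}*ln(depth) {b:+.1f}")  # ~ 1126.3*ln(depth) - 1880.4

# Extrapolated gain from depth 10 to depth 11: a*ln(11/10).
print(f"Expected gain 10 -> 11: {a * np.log(11/10):.1f} Elo")  # ~ 107.3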

Comparing ratings with estimates, and calculating errors (rounded to 0.1 Elo) as error_i = Y(depth_i) - rating_i:

Code: Select all

Depth:     Rating:     Error:
------     -------     ------
 
  6         143.2       -5.5
  7         307.8        3.5
  8         457.6        4.1
  9         589.2        5.1
 10         720.8       -7.7
The errors are lower than in the Quazar fit: if I calculate the average error for N points as SUM(|error_i|)/N, then:

Code: Select all

Quazar: N = 6, SUM(|error_i|) ~ 43.8; |average error| ~ 43.8/6 = 7.3
EXchess: N = 5, SUM(|error_i|) ~ 25.9; |average error| ~ 25.9/5 = 5.18
The growth of the draw ratio with increasing depth is evident again, and very easy to see in the BayesElo output.

I provide all the PGN files (around 8.36 MB compressed), the win-loss-draw statistics of each 2500-game match, and the openings used in a PGN file:

Fixed_depth_testing_of_EXchess_v6.50b_win32.rar (8.38 MB)

IIRC, this Zippyshare link will die after 30 days of inactivity.

Code: Select all

Finished game 2500 (Depth_1 vs Depth_2): 1/2-1/2 {Draw by 3-fold repetition}
Score of Depth_1 vs Depth_2: 156 - 1888 - 456  [0.15] 2500
ELO difference: -296
Finished match

----------------------------------------------------------------------------

Finished game 2500 (Depth_2 vs Depth_3): 0-1 {Black wins by adjudication}
Score of Depth_2 vs Depth_3: 374 - 1666 - 460  [0.24] 2500
ELO difference: -199
Finished match

----------------------------------------------------------------------------

Finished game 2500 (Depth_3 vs Depth_4): 0-1 {Black wins by adjudication}
Score of Depth_3 vs Depth_4: 385 - 1703 - 412  [0.24] 2500
ELO difference: -204
Finished match

----------------------------------------------------------------------------

Finished game 2500 (Depth_4 vs Depth_5): 1-0 {White wins by adjudication}
Score of Depth_4 vs Depth_5: 327 - 1632 - 541  [0.24] 2500
ELO difference: -201
Finished match

----------------------------------------------------------------------------

Finished game 2500 (Depth_5 vs Depth_6): 1/2-1/2 {Draw by 3-fold repetition}
Score of Depth_5 vs Depth_6: 396 - 1445 - 659  [0.29] 2500
ELO difference: -155
Finished match

----------------------------------------------------------------------------

Finished game 2500 (Depth_6 vs Depth_7): 0-1 {Black wins by adjudication}
Score of Depth_6 vs Depth_7: 338 - 1460 - 702  [0.28] 2500
ELO difference: -168
Finished match

----------------------------------------------------------------------------

Finished game 2500 (Depth_7 vs Depth_8): 0-1 {Black mates}
Score of Depth_7 vs Depth_8: 360 - 1394 - 746  [0.29] 2500
ELO difference: -153
Finished match

----------------------------------------------------------------------------

Finished game 2500 (Depth_8 vs Depth_9): 1/2-1/2 {Draw by adjudication}
Score of Depth_8 vs Depth_9: 352 - 1294 - 854  [0.31] 2500
ELO difference: -138
Finished match

----------------------------------------------------------------------------

Finished game 2500 (Depth_9 vs Depth_10): 0-1 {Black wins by adjudication}
Score of Depth_9 vs Depth_10: 355 - 1296 - 849  [0.31] 2500
ELO difference: -138
Finished match
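As a side note, the ELO difference lines in these logs follow from the standard logistic model; a minimal sketch, checked against the Depth_9 vs Depth_10 match above:

Code: Select all

import math

# Elo difference implied by a match score under the logistic model.
def elo_diff(wins, losses, draws):
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    return 400 * math.log10(score / (1 - score))

print(round(elo_diff(355, 1296, 849)))  # -138, as in the last match above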
I think that I will not run more fixed depth tests. Any suggestions, results, corrections... are welcome. Good luck to the people who are running fixed depth tests now. As Ed already said:
Rebel wrote:Excellent, I hope my experiment will produce similar results, the more engines the more reliable. Will take some weeks I fear.
Regards from Spain.

Ajedrecista.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Diminishing returns in fixed depth testing revisited.

Post by diep »

Rebel wrote:
tvrzsky wrote:What about the influence of hashtable size? It seems fair to give both opponents the same value, but then, unless it is big enough, the engine searching deeper has relatively less space in the table than its opponent. Are you sure, for example, that 16MB is enough for a 10-ply search? Otherwise it could imply some false diminishing effect, IMHO.
Absolutely true. With a full HT the branching factor will go up, and thus the diminishing return (DR) with it. When I am finished with the DR thing I plan to do some separate research in this area.

I am not familiar with MP programming, but I can imagine each doubling in cores needs a 50-75-100% HT size increase to remain effective, that is, at longer time controls. Are there known numbers?
I did an overnight test on the 1024-processor supercomputer and compared the same version on one position.

This test was performed at the end of November 2003.

Now Diep's hashtable update is not so bad of course, if we consider that what I did in the 90s is what most are using nowadays.

It's not like the single probe most did before me on their supercomputers. Back then I did 4-8 probes; I can look up the exact number if you're interested. If memory serves me well it was 8 sequential probes...
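A minimal sketch of that multi-probe idea (the names, entry layout and replacement policy are illustrative, not Diep's actual code):

Code: Select all

NUM_PROBES = 8  # probe a short run of consecutive slots instead of just one

# table: a plain list of slots, e.g. table = [None] * (1 << 20)

def tt_probe(table, key):
    base = key % (len(table) - NUM_PROBES)
    for i in range(NUM_PROBES):  # sequential probes stay in nearby memory
        entry = table[base + i]
        if entry is not None and entry["key"] == key:
            return entry  # hit: reuse the stored depth/score
    return None  # miss: the position has to be searched

def tt_store(table, key, depth, score):
    base = key % (len(table) - NUM_PROBES)
    # Simple replacement policy: overwrite the shallowest (or an empty) slot.
    slot = min(range(NUM_PROBES),
               key=lambda i: -1 if table[base + i] is None
                             else table[base + i]["depth"])
    table[base + slot] = {"key": key, "depth": depth, "score": score}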


For one position I gave it 400MB, so divide that by the 460 or so processors I had searching on it from the 512-processor partition.

And I ran the same position again with a 200GB hashtable.

After over 12 hours of search I was very amazed that the search depth difference was just 1 ply.

I really should mention that even at 4 cores (4 processors back then), it was important to have 150MB+ for Diep to search with.

Especially in a parallel search you are of course nonstop busy aborting, starting and stopping processors, so the hashtable is very important there, even more than in a sequential search, in order to know what you have already searched.

Therefore 'just' winning 1 ply was a very big surprise to me. I really had expected way, way more than that.

So don't overinvest in RAM for computer chess :)

Vincent
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Diminishing returns in fixed depth testing revisited.

Post by Don »

Adam Hair wrote:
tvrzsky wrote:What about the influence of hashtable size? It seems fair to give both opponents the same value, but then, unless it is big enough, the engine searching deeper has relatively less space in the table than its opponent. Are you sure, for example, that 16MB is enough for a 10-ply search? Otherwise it could imply some false diminishing effect, IMHO.
That was a concern of mine. For that reason, I allocated more hash for the versions of Houdini that searched more plies. So, any diminishing returns found in that data would not be related to limited hash.

I do believe that diminishing returns can be explained by the decreasing percentage of move choices that differ from the previous ply as depth increases. This assumption is supported by my Houdini data, data I have accumulated for several other engines, and previous studies made by others.
They used to quote a 7% speedup per doubling of hash table size - IF the hash table was under-provisioned.

So perhaps one way to do one of these studies is to set the hash table size for the highest depth you intend to run and, for each depth below it, cut the hash table in half. The branching factor of most modern programs is very close to 2 to 1, so it would work out quite well.
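A small sketch of that scheme (the 256 MB top size is an assumed example, not a recommendation from the post):

Code: Select all

MAX_DEPTH = 10
MAX_HASH_MB = 256  # assumed hash size for the deepest searcher

# Halve the hash for each ply below the deepest level, matching a
# branching factor of roughly 2 per ply; floor at 1 MB.
hash_mb = {d: max(1, MAX_HASH_MB >> (MAX_DEPTH - d))
           for d in range(1, MAX_DEPTH + 1)}
print(hash_mb)  # {1: 1, 2: 1, 3: 2, ..., 9: 128, 10: 256}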

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Diminishing returns in fixed depth testing revisited.

Post by diep »

Don wrote:
Adam Hair wrote:
tvrzsky wrote:What about the influence of hashtable size? It seems fair to give both opponents the same value, but then, unless it is big enough, the engine searching deeper has relatively less space in the table than its opponent. Are you sure, for example, that 16MB is enough for a 10-ply search? Otherwise it could imply some false diminishing effect, IMHO.
That was a concern of mine. For that reason, I allocated more hash for the versions of Houdini that searched more plies. So, any diminishing returns found in that data would not be related to limited hash.

I do believe that diminishing returns can be explained by the decreasing percentage of move choices that differ from the previous ply as depth increases. This assumption is supported by my Houdini data, data I have accumulated for several other engines, and previous studies made by others.
They used to quote a 7% speedup per doubling of hash table size - IF the hash table was under-provisioned.

So perhaps one way to do one of these studies is to set the hash table size for the highest depth you intend to run and, for each depth below it, cut the hash table in half. The branching factor of most modern programs is very close to 2 to 1, so it would work out quite well.

Don
For such studies one should disable reductions, as they will mess up search stability when modifying the hashtable size: with reductions, selecting a very bad move as the first move will get you more depth (obviously not a higher Elo). These side effects are much bigger than the effect of a larger hashtable on search depth; provided, of course, you start with some realistic size and not something like 1MB.

Otherwise one will just measure noise and not be able to filter it out.