Stockfish Handicap Matches

Post by lkaufman »

Rebel wrote: Wed Jun 24, 2020 7:31 pm Stockfish gauntlet, knight-odds, CCRL Elo pool <2800

tc=40/40 only.

Code: Select all

Rank Name                          Elo     +/-   Games   Score    Draw
   0 Stockfish_11                   52      20    1000   57.4%   16.2%
   1 Benjamin                       26      43     200   53.8%   19.5%
   2 ProDeo                         17      45     200   52.5%   14.0%
   3 Fruit_2.3                     -54      45     200   42.3%   16.5%
   4 Fruit_2.1                    -123      47     200   33.0%   15.0%
   5 Ruffian_2                    -135      47     200   31.5%   16.0%   
Most of these engines don't have an exactly named counterpart on the CCRL 40/15 list (Benjamin and ProDeo carry version numbers there, and two others aren't identical versions), but it looks like roughly 2750 on that list is the break-even point for SF 11 at that TC. I'll be curious to see whether Komodo 14 can score as well against the same opponents. It would score much better with high Contempt, but so would Stockfish, so I guess it's fair enough to compare.
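For reference, the Elo column in these tables is consistent with the standard logistic conversion of the score percentage; a minimal Python sketch (the helper name is my own):

Code: Select all

import math

def elo_from_score(score):
    """Convert a score fraction (0 < score < 1) to an Elo difference
    using the standard logistic model: score = 1 / (1 + 10**(-elo/400))."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Stockfish_11 scored 57.4% over 1000 games in the <2800 pool:
print(round(elo_from_score(0.574)))  # 52, matching the Elo column above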

Post by lkaufman »

chrisw wrote: Wed Jun 24, 2020 7:30 pm
lkaufman wrote: Wed Jun 24, 2020 5:24 pm
Rebel wrote: Wed Jun 24, 2020 12:18 pm Finished the Elo 2900 pool.

Stockfish gauntlet, knight-odds, tc=40/10

Code: Select all

   # ENGINE          : RATING    POINTS  PLAYED    (%)
   1 cheng4_4.39     : 3273.3     128.0     200   64.0%
   2 Bobcat_8        : 3250.9     122.0     200   61.0%
   3 Stockfish_11    : 3172.5     269.5     600   44.9%
   4 Crafty_25.6     : 3103.3      80.5     200   40.3%
tc=40/20

Code: Select all

   # ENGINE          : RATING    POINTS  PLAYED    (%)
   1 cheng4_4.39     : 3315.9     137.5     200   68.8%
   2 Bobcat_8        : 3294.0     132.0     200   66.0%
   3 Stockfish_11    : 3177.8     274.5     600   45.8%
   4 Crafty_25.6     : 3012.3      56.0     200   28.0%
tc=40/40

Code: Select all

   # ENGINE          : RATING    POINTS  PLAYED    (%)
   1 cheng4_4.39     : 3318.8     146.0     200   73.0%
   2 Bobcat_8        : 3295.0     140.5     200   70.3%
   3 Stockfish_11    : 3144.5     242.0     600   40.3%
   4 Crafty_25.6     : 3041.7      71.5     200   35.8%
Next, 2800 pool.
So results improved steadily with more time for cheng and bobcat, as expected, but not for Crafty (which regressed between 40/10 and 40/20); I wonder why? Two questions: How were the positions chosen from the ChrisW set? I'm finding that taking them from the middle (pruning an equal number from each end) gives the fairest and closest simulation of real knight odds.
We can’t go cherry-picking positions according to subjective criteria. And this concept of “real knight odds” is about as subjective as it gets; it isn’t reached by asking an engine to evaluate at the root and using that as the definition. Imagine defining “real chess odds” by asking an engine to search from the root and give the answer. 42?
There are no “real knight odds”; all there is are positions without the knight, and we see how the results work out over *many* tests. We can try to use “natural” positions where neither side has an apparent head start, e.g. by removing the outliers.
Nor are we trying to determine what knight odds is worth in some numerical sense; we are trying to determine how modern engines do against strong oldies with various handicaps, the first handicap being minus a knight.

Also, did Stockfish use default Contempt, or 0, or max (100)? It would do best with 100, I'm sure.
It’s better to just use defaults; too much parameter fiddling just confuses everything.

Anyway, I prepared suites of 25, 100, 250 and 1000 epds. They are each a randomly selected subset of about 1200 epds taken from (I forget; it says in the github readme) roughly 370 to 420, I think. That selection is probably in line with your desires, actually.

Posit from me: the most sensible course would be to use those sets only for a while; we’ll soon see if the 25 suite gives very different results from the 1000 suite, and then we can start worrying about whether small subsets, and the positions in general, are too noisy. For example, we don’t know right now whether Crafty's anomalous(?) results are down to unlucky position selection.
First result using your database. We took your 5000 knight odds set, which you had already pruned to 3870 positions, and removed 1435 positions from each end, producing a list of 1000 positions exactly in the center of your list, and put it in our tester. I hope you will agree that this is fair and unbiased. The score range was -4.30 to -4.11, quite narrow, and just by chance the worst score was the same score I got from the root position at 10 seconds for both positions.

For the first test, I just had Komodo 14 play against itself at the very fast time control of 10 seconds + 0.1" increment. The result was that White won one game, there was one draw, and Black won 998, an 1139 Elo advantage. I'm sure that at a more normal time control the result would have been even more lopsided, probably just 100%. But the tests between unrelated engines aren't showing a knight handicap to be worth a thousand Elo. I suppose it's just a lot harder to give a handicap to someone who knows everything you know than to someone with very different skills.
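For anyone wanting to reproduce that middle slice, a minimal sketch (file names hypothetical; it assumes the EPD list is ordered by evaluation, as the quoted score range suggests):

Code: Select all

# Keep the central `keep` positions of an eval-ordered EPD list, trimming
# an equal number of outliers from each end (3870 -> 1000 trims 1435 per side).

def middle_slice(lines, keep):
    trim = (len(lines) - keep) // 2
    return lines[trim:trim + keep]

with open("knight_odds_3870.epd") as f:          # hypothetical input file
    positions = [ln.strip() for ln in f if ln.strip()]

with open("knight_odds_middle_1000.epd", "w") as f:
    f.write("\n".join(middle_slice(positions, 1000)) + "\n")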

Post by Rebel »

lkaufman wrote: Wed Jun 24, 2020 7:57 pm
[...] it looks like roughly 2750 on that list is the break-even point for SF 11 at that TC. I'll be curious to see whether Komodo 14 can score as well against the same opponents.
Have it running, takes about 2 hours.

Regarding the Elo ratings:

Code: Select all

Rank Name                          Elo     +/-   Games   Score    Draw  CCRL
   0 Stockfish_11                   52      20    1000   57.4%   16.2%  3537
   1 Benjamin                       26      43     200   53.8%   19.5%  2646
   2 ProDeo 2.2                     17      45     200   52.5%   14.0%  2770
   3 Fruit_2.3                     -54      45     200   42.3%   16.5%  2783
   4 Fruit_2.1                    -123      47     200   33.0%   15.0%  2684
   5 Ruffian_2                    -135      47     200   31.5%   16.0%  2608
Interestingly, Benjamin scores better despite being listed 124 Elo below ProDeo. I think much of this kind of testing has to do with the killer instinct of an engine. And Benjamin is the gambit version of ProDeo. Style decides?
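As an aside, the +/- column shows how wide those 200-game estimates are; a rough sketch of where such a margin comes from (a simple binomial approximation, so it ignores the draw correction the tester applies):

Code: Select all

import math

def elo_margin(score, games, z=1.96):
    """Approximate 95% Elo margin for a score fraction over `games` games,
    using a binomial standard error (no draw correction, so a bit wide)."""
    def to_elo(s):
        return -400.0 * math.log10(1.0 / s - 1.0)
    se = math.sqrt(score * (1.0 - score) / games)
    return (to_elo(score + z * se) - to_elo(score - z * se)) / 2.0

# Benjamin: 53.8% over 200 games -> about +/-49 here; the +/-43 in the
# table is tighter because draws reduce the per-game variance.
print(round(elo_margin(0.538, 200)))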

Post by chrisw »

lkaufman wrote: Wed Jun 24, 2020 8:08 pm
First result using your database. We took your 5000 knight odds set, which you had already pruned to 3870 positions, and removed 1435 positions from each end, producing a list of 1000 positions exactly in the center of your list. [...] I just had Komodo 14 play against itself at 10 seconds + 0.1" increment; White won one game, there was one draw, and Black won 998, an 1139 Elo advantage. [...] But the tests between unrelated engines aren't showing a knight handicap to be worth a thousand Elo.
You should be using the 25, 100, 250 or 1000 knight odds databases posted to github last night, depending on how many games are in the gauntlet. Then everybody is using the same base data.

Personally, I am not interested in what an engine that would cost me a hundred euros does, so, again, because of free widespread access, it’s more interesting to stay with Stockfish (or LC0). It is unsurprising that your program trounces itself when given knight odds.

You should, btw, know better than to ascribe 1000 Elo to a 99% result, let alone extrapolate from it. The Elo scale is neither able nor meant to deal with tail results of that nature.
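To put a number on that last point: near 100% the score-to-Elo curve is nearly vertical, so a half point either way swings the estimate enormously. A quick sketch, using the same standard logistic formula as above:

Code: Select all

import math

def elo(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

# 998 wins + 1 draw out of 1000 games = 99.85% for Black:
print(round(elo(0.9985)))  # ~1129
# a single extra draw (998/1000) already shifts the estimate by ~50 Elo:
print(round(elo(0.9980)))  # ~1079
# by contrast, in mid-scale the same half point barely moves it:
print(round(elo(0.574)), round(elo(0.5745)))  # 52 vs 52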

Post by lkaufman »

Rebel wrote: Wed Jun 24, 2020 8:30 pm
Interestingly, Benjamin scores better despite being listed 124 Elo below ProDeo. I think much of this kind of testing has to do with the killer instinct of an engine. And Benjamin is the gambit version of ProDeo. Style decides?
One question you might be able to answer is whether any of these engines have been taught the principle that you should trade pieces when ahead in material, so that, for example, it is better to be up a piece with queens off the board than on. Just knowing this might explain a fair amount of the difference in performance when taking knight odds.
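For what it's worth, that principle is usually baked into the evaluation rather than the search; a toy sketch of one common form of "trade when ahead" term (all constants invented for illustration, not any particular engine's code):

Code: Select all

# Toy "trade down when ahead" evaluation term (all constants invented).
# The idea: a material lead counts for more as total material comes off,
# so being a piece up with queens traded scores higher than with queens on.

TOTAL_NON_PAWN = 62.0  # 2Q + 4R + 4B + 4N in pawn units at the start

def trade_down_bonus(material_lead, non_pawn_left):
    """material_lead: signed lead in pawns; non_pawn_left: piece material left."""
    emptiness = 1.0 - non_pawn_left / TOTAL_NON_PAWN
    return material_lead * 0.15 * emptiness  # up to +15% of the lead

print(trade_down_bonus(3, 59))  # up a knight, all other pieces on: ~0.02
print(trade_down_bonus(3, 41))  # same lead, queens traded:         ~0.15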

Post by lkaufman »

chrisw wrote: Wed Jun 24, 2020 8:43 pm
You should be using the 25, 100, 250 or 1000 knight odds databases posted to github last night, depending on how many games are in the gauntlet. Then everybody is using the same base data. [...] You should, btw, know better than to ascribe 1000 Elo to a 99% result, let alone extrapolate from it. The Elo scale is neither able nor meant to deal with tail results of that nature.
I made the book from your data before I knew you were going to post subsets yourself. Regarding Elo, I know that the Elo estimate for 99.85% is subject to a large margin of error, for multiple reasons, but the point was to show that your set of positions is completely winning for Black, as it should be, and that it is not easy to explain why 2750-rated engines only break even from these positions against SF.

My main goal with this is to find an engine that performs at knight odds just as a strong human player of the same Elo would, so that we could reasonably predict the results of engine vs. GM handicap matches by simulation. Clearly this is not the case for the normal engines being tested, but it may be more or less true for the weakened Stockfish levels. Of course it's difficult, because we have limited data on humans vs. engines at knight odds, and even less on which modern engines are equal to various levels of GM in normal rapid play. But we do have some data. It's also interesting simply to find out which engine scores best when giving knight odds to these roughly 2750-rated engines. By the way, Komodo only costs $60, not 100 euros, but that's not important here; it's interesting to compare top engines regardless of cost.

Post by Rebel »

lkaufman wrote: Wed Jun 24, 2020 9:11 pm
One question you might be able to answer is whether any of these engines have been taught the principle that you should trade pieces when ahead in material, so that, for example, it is better to be up a piece with queens off the board than on. Just knowing this might explain a fair amount of the difference in performance when taking knight odds.
I hardly ever peek into other engines' source code, so I can't say, but I am pretty sure every decent engine has code for that.

Post by Rebel »

lkaufman wrote: Wed Jun 24, 2020 7:57 pm I'll be curious to see if Komodo 14 can score as well against the same opponents.
At your service:

Code: Select all

Rank Name                          Elo     +/-   Games   Score    Draw
   0 Komodo_14                      77      20    1000   60.9%   20.8%
   1 ProDeo 2.2                     -2      44     200   49.8%   17.5%
   2 Benjamin                      -21      42     200   47.0%   23.0%
   3 Fruit_2.3                     -23      42     200   46.8%   23.5%
   4 Fruit_2.1                    -145      45     200   30.3%   23.5%
   5 Ruffian_2                    -222      51     200   21.8%   16.5%   

Post by chrisw »

lkaufman wrote: Wed Jun 24, 2020 9:11 pm
One question you might be able to answer is whether any of these engines have been taught the principle that you should trade pieces when ahead in material, so that, for example, it is better to be up a piece with queens off the board than on. Just knowing this might explain a fair amount of the difference in performance when taking knight odds.
Any engine with MG/EG phase evals will “know” this by default, especially if it has been Texel-tuned. Fruit was, I think, the first to use MG/EG, so the answer is presumably yes. Ed is an evaluation-first programmer, so his engines certainly will.

To turn knight-odds positions around, the handicapped program needs to know to maintain complexity. It could be that SF's desire to maintain complexity overrides a simplistic material/trade heuristic.
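For readers who haven't met the scheme: a tapered (MG/EG) eval keeps two scores per feature and blends them by game phase, which is what lets "trade when ahead" fall out naturally once the endgame scores are tuned. A minimal sketch, using the common 24-point phase convention (an assumption for illustration, not any specific engine's code):

Code: Select all

# Minimal tapered (MG/EG) evaluation sketch. Phase weights follow the
# common convention: minor = 1, rook = 2, queen = 4, so 24 = all pieces on.

PHASE_MAX = 24

def tapered_eval(mg_score, eg_score, phase):
    """Blend middlegame and endgame scores; phase runs 0 (bare) .. 24 (full)."""
    phase = max(0, min(phase, PHASE_MAX))
    return (mg_score * phase + eg_score * (PHASE_MAX - phase)) // PHASE_MAX

# A feature worth +30 in the middlegame but +80 in the endgame:
print(tapered_eval(30, 80, 24))  # 30 with a full board
print(tapered_eval(30, 80, 8))   # 63 once most pieces are traded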

Post by chrisw »

lkaufman wrote: Wed Jun 24, 2020 9:31 pm
My main goal with this is to find an engine that performs at knight odds just as a strong human player of the same Elo would, so that we could reasonably predict the results of engine vs. GM handicap matches by simulation.
Trying hard to decode this. Failed.
