Re: Stockfish Handicap Matches
Posted: Wed Jun 24, 2020 11:18 pm
You probably haven't followed all the Komodo vs. master/GM handicap matches, so to make a long story short: we know that at Rapid time controls (say 15' + 10" to standardize), Komodo (on a 32-core machine) does well giving knight odds to players below about 2300 FIDE, and poorly against players above that. If I say that Komodo performs at about 2300 FIDE giving knight odds at this time control, I won't be far from the truth. So I'd like to find an engine that would be equal to a 2300 FIDE player in standard chess at 15' + 10" and would also be even with Komodo at this TC at knight odds. It's obvious that the normal engines are hundreds of Elo away from this; they need to be something like 2750 CCRL Rapid, and even those ratings are unrealistically low compared to human FIDE ratings at Rapid. I'm looking for some engine below 2300 CCRL that can hold its own with Komodo (or Stockfish, it doesn't matter much) at knight odds. The closest I've come is the weakened Stockfish levels, but it looks like even they need to be much stronger than 2300 CCRL to have a chance at knight odds, although I don't really know what the SF levels would be rated on that list at Rapid. I can determine this, but it will take a lot of time.

chrisw wrote: ↑Wed Jun 24, 2020 11:04 pm
trying hard to decode this.

lkaufman wrote: ↑Wed Jun 24, 2020 9:31 pm
I made the book from your data before I knew you were going to post subsets yourself. Regarding Elo, I know that the Elo estimate for 99.85% is subject to a large margin of error, for multiple reasons, but the point was to show that your set of positions is completely winning for Black, as it should be, and that it is not easy to explain why 2750-rated engines only break even from these positions vs. SF.

chrisw wrote: ↑Wed Jun 24, 2020 8:43 pm
You should be using the 25, 100, 250 or 1000 knight-odds databases, depending on how many games are in the gauntlet, posted to github last night. Then everybody is using the same base data.

lkaufman wrote: ↑Wed Jun 24, 2020 8:08 pm
First result using your database. We took your 5000-position knight-odds set, which you had already pruned to 3870 positions, removed 1435 positions from each end to produce a list of 1000 positions exactly in the center of your list, and put it in our tester. I hope you will agree that this is fair and unbiased. The score range was -4.30 to -4.11, quite narrow, and just by chance the worst score was the same score I got from the root position at 10 seconds for both positions. For the first test, I just had Komodo 14 play against itself at the very fast time control of 10 seconds + 0.1" increment; White won one game, there was one draw, and Black won 998, a 1139 Elo advantage. I'm sure that at a more normal time control the result would have been even more lopsided, probably just 100%. But the tests between unrelated engines aren't showing a knight handicap to be worth a thousand Elo. I suppose it's just a lot harder to give a handicap to someone who knows everything you know than to someone with very different skills.

chrisw wrote: ↑Wed Jun 24, 2020 7:30 pm
We can't go cherry-picking positions according to subjective criteria. And this concept of "real knight odds" is about as subjective as it gets; it isn't reached by asking an engine to evaluate at the root and using that as the definition. Imagine defining "real chess odds" by asking an engine to search from the root and give the answer. 42?

lkaufman wrote: ↑Wed Jun 24, 2020 5:24 pm
So results improved steadily with more time, as expected, for cheng and bobcat, but not for crafty (a regression between 40/10 and 40/20); I wonder why? Two questions: How were the positions chosen from the ChrisW set? I'm finding that taking them from the middle (pruning an equal number from each end) is the fairest and closest simulation to real knight odds.

Rebel wrote: ↑Wed Jun 24, 2020 12:18 pm
Finished the elo 2900 pool.
Stockfish gauntlet, knight-odds, tc=40/10
tc=40/20

# ENGINE       : RATING  POINTS  PLAYED    (%)
1 cheng4_4.39  : 3273.3   128.0     200  64.0%
2 Bobcat_8     : 3250.9   122.0     200  61.0%
3 Stockfish_11 : 3172.5   269.5     600  44.9%
4 Crafty_25.6  : 3103.3    80.5     200  40.3%
tc=40/40

# ENGINE       : RATING  POINTS  PLAYED    (%)
1 cheng4_4.39  : 3315.9   137.5     200  68.8%
2 Bobcat_8     : 3294.0   132.0     200  66.0%
3 Stockfish_11 : 3177.8   274.5     600  45.8%
4 Crafty_25.6  : 3012.3    56.0     200  28.0%
Next, the 2800 pool.

# ENGINE       : RATING  POINTS  PLAYED    (%)
1 cheng4_4.39  : 3318.8   146.0     200  73.0%
2 Bobcat_8     : 3295.0   140.5     200  70.3%
3 Stockfish_11 : 3144.5   242.0     600  40.3%
4 Crafty_25.6  : 3041.7    71.5     200  35.8%
There are no "real knight odds"; all there is are positions without the knight, and we see how the results work out from *many* tests. We can try to use "natural" positions without either side having an apparent head start, e.g. remove the outliers.
Nor are we trying to determine what knight odds are worth in some numerical sense; we are trying to determine how modern engines do against strong oldies with various handicaps, the first handicap being minus a knight. It's better to just use defaults; too much parameter fiddling just confuses everything.
Also, did Stockfish use default Contempt, or 0, or max (100)? It would do best with 100 I'm sure.
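On that contempt question, a test harness can pin the option explicitly rather than relying on defaults: classical Stockfish exposes Contempt as a UCI option with a published range of -100 to 100. A minimal sketch of the UCI handshake lines (the `uci_setup` helper is illustrative; piping the lines to the engine process is left to the harness):

```python
def uci_setup(contempt: int) -> list[str]:
    """UCI commands that set Contempt before play starts.

    Classical Stockfish documents the option's range as -100..100;
    the check below mirrors those bounds.
    """
    if not -100 <= contempt <= 100:
        raise ValueError("Contempt must be in -100..100")
    return [
        "uci",                                        # start UCI handshake
        f"setoption name Contempt value {contempt}",  # pin the option
        "isready",                                    # wait for the engine
    ]
```

For example, `uci_setup(100)` yields the three lines to write to the engine's stdin before sending positions, so every run in a gauntlet uses the same setting.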
Anyway, I prepared suites of 25, 100, 250 and 1000 EPDs. They are each a randomly selected subset of about 1200 EPDs taken from, I forget, it says in the github readme, roughly 370 to 420 I think. That selection is probably in line with your desires.
A posit from me: the most sensible course would be to use only those sets for a while; we'll soon see whether the 25 suite gives very different results from the 1000 suite, and then we can start worrying about whether small subsets, and the positions in general, are too noisy. For example, we don't know right now whether the anomalous(?) results for Crafty are down to unlucky position selection.
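One cheap way to sharpen that 25-vs-1000 comparison would be to draw the small suites nested inside the big one, so that divergent results point at sample size rather than at disjoint position selection. A minimal sketch (my own suggestion, not how the github sets were necessarily built; `make_suites` and the dummy records are illustrative):

```python
import random

def make_suites(epd_lines, sizes=(25, 100, 250, 1000), seed=1):
    """Draw nested random subsets of an EPD list.

    One fixed shuffle is made; the 25-suite is its first 25 positions,
    the 100-suite its first 100, and so on, so the small suites are
    strict subsets of the large ones.
    """
    rng = random.Random(seed)   # fixed seed so everyone gets the same suites
    shuffled = list(epd_lines)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}

# Dummy records standing in for real EPD lines from the github sets:
positions = [f"position-{i}" for i in range(1200)]
suites = make_suites(positions)
```

With nesting, a gauntlet run on the 1000 suite automatically contains a run on the 25 suite, which makes the noise comparison between them direct.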
Personally, I am not interested in what an engine that would cost me a hundred euros to use does, so, again because of free widespread access, it's more interesting to stay with Stockfish (or LC0). It's unsurprising that your program trounces itself when given knight odds.
You should, by the way, know better than to ascribe 1000 Elo to a 99% result, let alone extrapolate from it. The Elo scale is neither able nor meant to deal with tail results of that nature.
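For what it's worth, that tail sensitivity falls straight out of the standard logistic Elo formula; a minimal sketch (the formula is the usual one relating expected score to rating difference, nothing engine-specific):

```python
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by an expected score fraction under the
    standard logistic model: diff = -400 * log10(1/score - 1)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

# The curve is nearly flat through the middle but explodes in the tail:
# moving from 99.8% to 99.9% -- a single extra point in a 1000-game
# match -- shifts the estimate by roughly 120 Elo, which is why
# extrapolating from near-100% results is so fragile.
```

For instance, `elo_diff(0.75)` is about 191, while the gap between `elo_diff(0.998)` and `elo_diff(0.999)` alone is larger than half of that.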
My main goal with this is to find an engine that will perform just as well taking knight odds as would a strong human player of the same Elo, so that we could reasonably predict results of engine vs GM handicap matches by simulation.
Failed.