Ways to avoid "Draw Death" in Computer Chess

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

As can be seen in Graham Banks' match of top engines at LTC with many cores ( http://www.talkchess.com/forum/viewtopic.php?t=64666 ), a draw rate above 90% is not something of the distant future. It happens already, and the problem could be further worsened because we know that the top engines are close in strength, so the prior when calculating LOS is not uniform. It is concentrated towards the center, and that gives even less significance to the rare wins that do occur. We will leave that issue aside and work with uniform priors only.

In this thread:
http://www.talkchess.com/forum/viewtopic.php?t=64683 ,
it was shown, as Michel noted, that with increasing time control and hardware the (top) engines objectively get closer in strength, and it becomes harder to separate them. The measure of this is the Normalized Elo ( http://hardy.uhasselt.be/Toga/normalized_elo.pdf ), which gives the power of separation at a given statistical significance (LOS, p-value or SPRT stop). The Normalized Elo per doubling in time, from Andreas Strangmüller's comprehensive test across a wide range of time controls, looks as follows:
[Image: Normalized Elo per doubling in time control, from Andreas Strangmüller's tests]
It shows that computer chess is heading towards "Draw Death" with stronger engines, more cores and longer time controls. Not only do LTC games take more time, but more games are needed at LTC than at STC to separate the engines in strength. The separation power goes down pretty fast, and the engines get closer in strength not because of artifacts of rating schemes (like simple Elo, which is clearly inadequate here). The pentanomial statistic is the better choice for games played side and reversed on the same opening, because the results of these games are not completely independent. We treat those 2 games as one event with the possible outcomes (0, 1/2, 1, 3/2, 2), hence "pentanomial", with its variance defined as for a multinomial.
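For concreteness, here is a minimal sketch (in Python, my own illustration, not code from the linked PDF) of Normalized Elo as used here: the mean score minus 1/2, divided by the per-game standard deviation of the score, with the pentanomial version built on side-and-reversed pair counts. The trinomial function reproduces, for example, the Balanced result in my next post; the pentanomial pair counts are not listed in these posts, so that function is illustrative.

Code: Select all

import math

def normalized_elo_trinomial(wins, draws, losses):
    """Normalized Elo from independent-game W/D/L counts."""
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n                # mean score per game
    ex2 = (wins + 0.25 * draws) / n              # E[x^2] for x in {0, 1/2, 1}
    sigma = math.sqrt(ex2 - mu * mu)             # per-game standard deviation
    return (mu - 0.5) / sigma, 1.96 / math.sqrt(n)   # estimate, ~95% half-width

def normalized_elo_pentanomial(pair_counts):
    """pair_counts = [n0, n1, n2, n3, n4] for pair sums 0, 1/2, 1, 3/2, 2."""
    npairs = sum(pair_counts)
    scores = [0.0, 0.25, 0.5, 0.75, 1.0]         # pair sum / 2 = per-game score
    mu = sum(c * s for c, s in zip(pair_counts, scores)) / npairs
    var = sum(c * (s - mu) ** 2 for c, s in zip(pair_counts, scores)) / npairs
    # the sqrt(2) puts the per-pair deviation on the same per-game scale
    return (mu - 0.5) / (math.sqrt(2) * math.sqrt(var)), 1.96 / math.sqrt(2 * npairs)

# Balanced match from the next post, 74 - 51 - 2875 in 3000 games:
print(normalized_elo_trinomial(74, 2875, 51))    # ~ (0.038, 0.036)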

As shown in this thread:
http://www.talkchess.com/forum/viewtopic.php?t=61245 ,
in both the Bayeselo and Davidson models, introducing a large eloBias at high draw rates (large eloDraw), and using the pentanomial statistic for side-and-reversed games on the same opening, can improve the sensitivity (or separation power) over the eloBias=0 case up to arbitrarily high draw rates. The plots in that thread are basically rescaled Normalized Elo, before its formal introduction. To achieve the unbalance I see two ways: unbalanced openings and time control odds. The use of pentanomial variances will be crucial.
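To illustrate the mechanism, a small sketch using the Bayeselo outcome model (Rémi Coulom's parametrization as I understand it; the exact form is my assumption, not taken from the thread above). It shows how adding an opening bias lowers the draw rate and raises the per-game separation of a small true strength difference, at the trinomial level:

Code: Select all

import math

def probs(delta, elo_draw, bias):
    """Bayeselo-style win/draw/loss probabilities for the advantaged side."""
    f = lambda x: 1.0 / (1.0 + 10.0 ** (x / 400.0))
    p_win = f(elo_draw - (delta + bias))
    p_loss = f(elo_draw + delta + bias)
    return p_win, 1.0 - p_win - p_loss, p_loss

def stats(delta, elo_draw, bias):
    w, d, l = probs(delta, elo_draw, bias)
    mu = w + 0.5 * d                             # expected score
    sigma = math.sqrt(w + 0.25 * d - mu * mu)    # per-game standard deviation
    return mu, sigma, d

# Separating a true 10 Elo difference under very drawish conditions (eloDraw=400):
for bias in (0, 200, 400):
    mu1, _, _ = stats(10, 400, bias)
    mu0, sigma0, draw0 = stats(0, 400, bias)
    print("bias %3d: draw rate %2.0f%%, separation per game %.4f"
          % (bias, 100 * draw0, (mu1 - mu0) / sigma0))
# bias   0: draw rate 82%, separation per game 0.0223
# bias 200: draw rate 73%, separation per game 0.0259
# bias 400: draw rate 49%, separation per game 0.0288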
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

The drawbacks of the two unbalance schemes, unbalanced openings and time odds, are:
For unbalanced openings, engines might forget how to play regular balanced openings.
For time odds, the unbalance is general and not tied to a particular opening played side and reversed, so the pentanomial statistic might not help as much. The draw rate does not decrease dramatically except for huge time odds (factors of dozens, which is a bit silly).

To replicate large draw rates at a short time control, I used a toy model, but a pretty accurate one: "Endgame Chess". It is closely related to regular chess, except that it is very drawish for balanced openings (which are early endgame positions), because the top engines play them well. Basically, I try to replicate Graham's results at ultra-fast time controls so as to have many games for statistical significance. I use the recent Stockfish dev and Stockfish 8, which have a moderate Elo difference, to see how well they separate in this model. I use two sets of openings: Balanced, early endgame positions with 0-40cp eval, and Unbalanced, early endgame positions with 100-140cp eval. The time odds match gives White a 4 times longer time control, on the Balanced positions. The engines play each opening side and reversed. A small sketch of the pentanomial pairing first, then the results:
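Since the pentanomial statistic operates on side-and-reversed pairs, this is how the per-game scores get collapsed into the pair counts used by the formula in my first post (a minimal sketch; 'results' is a hypothetical list of one engine's scores, ordered so that games 2k and 2k+1 share an opening):

Code: Select all

def pentanomial_counts(results):
    """Collapse paired per-game scores into counts of pair sums 0, 1/2, 1, 3/2, 2."""
    counts = [0, 0, 0, 0, 0]
    for i in range(0, len(results) - 1, 2):
        pair_sum = results[i] + results[i + 1]   # score over the two paired games
        counts[int(round(2 * pair_sum))] += 1    # 2 * sum is an integer index 0..4
    return counts

print(pentanomial_counts([1, 0.5, 0.5, 0.5, 0, 0]))   # -> [1, 0, 1, 1, 0]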

Stockfish dev vs Stockfish 8 in "Endgame Chess"

Code: Select all

Balanced: 20''+ 0.2'' vs 20''+ 0.2''
Score of SF dev vs SF 8: 74 - 51 - 2875  [0.504] 3000
ELO difference: 2.66 +/- 2.53
Draw rate: 95.8%

Normalized ELO (trinomial): 0.038 +/- 0.036
Normalized ELO (pentanomial): 0.047 +/- 0.036

Code: Select all

Balanced 4x Time Odds: White 20''+ 0.2'' vs Black 5''+ 0.05''
Score of SF dev vs SF 8: 103 - 69 - 2828  [0.506] 3000
ELO difference: 3.94 +/- 3.02
Draw rate: 94.3%

Normalized ELO (trinomial): 0.047 +/- 0.036
Normalized ELO (pentanomial): 0.060 +/- 0.036

Code: Select all

Unbalanced: 20''+ 0.2'' vs 20''+ 0.2''
Score of SF dev vs SF 8: 734 - 634 - 1632  [0.517] 3000
ELO difference: 11.59 +/- 8.38
Draw rate: 54.4%

Normalized ELO (trinomial): 0.049 +/- 0.036
Normalized ELO (pentanomial): 0.098 +/- 0.036

Conclusions:

1/ The pentanomial statistic shows a significant (~25%) improvement in sensitivity even for Balanced positions with no time odds, so the side-and-reversed games are fairly correlated in this drawish chess.
2/ The pentanomial statistic does not help the time odds match as much as it helps the Unbalanced openings match.
3/ The best solution to avoid "Draw Death" seems to be using unbalanced opening positions, following the theoretical model eloBias=eloDraw, to get a draw rate of about 50%. Time odds decrease the draw rate only slightly and increase the sensitivity only moderately.
4/ I marked with Red the present state of affairs (for example Graham's match; here, the Balanced trinomial result) and with Green the best solution to it (the Unbalanced pentanomial result). There is a factor of about 2.6 in Normalized Elo between them, meaning about 7 times fewer games to reach the same statistical significance (LOS, p-value etc.); see the sketch below.
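The games-count arithmetic behind conclusion 4, as a one-liner: the number of games needed for a fixed statistical significance scales as 1 over the squared Normalized Elo.

Code: Select all

red, green = 0.038, 0.098            # Balanced trinomial vs Unbalanced pentanomial
ratio = green / red                  # ~2.6x better separation per game
print(ratio, ratio ** 2)             # -> 2.58  6.65, i.e. roughly 7x fewer games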
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Dirt »

Laskos wrote: As can be seen in Graham Banks' match of top engines at LTC with many cores, a draw rate above 90% is not something of the distant future.
That shows that chess is too easy for computers. For people it still works, although FRC is safer in the long term. The obvious answer is to switch to Go, but if we want a chess-like game to stay challenging for computers we need to nerf the pieces, enlarge the board and probably add drops. Getting everyone to agree on exactly how to do that sounds hard, though.
Deasil is the right way to go.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Sven »

One way to avoid "Draw Death" would be the appearance of a single engine that plays significantly stronger than the current top engines, say by 100 or 200 Elo points. As of today this seems unlikely, but can we really exclude it? In the past there were already periods of stagnation, and then suddenly a big improvement came along when nobody thought it was possible.

I think we are still far from knowing what the best openings really are, and there are also some endgame types that are not always evaluated correctly by top engines. Think about fortresses, for instance, or about complex endgames with rooks, minor pieces and pawns, for which we still lack proven theoretical knowledge due to the lack of EGTBs for more than 6 or 7 pieces. For me these are two good reasons not to believe that we are already close to "perfect play" in computer chess. It might be "perfect-looking play" only. All we know, in my opinion, is that we have a couple of very strong engines that are really hard to beat with today's chess programming knowledge and hardware.

Once that new mega engine appears (and as a programmer I hope this will happen one day, although it will probably not be my own engine ...) I expect that parts of these mathematical models may become outdated, or will at least have to be reviewed.
duncan
Posts: 12038
Joined: Mon Jul 07, 2008 10:50 pm

Re: Ways to avoid "Draw Death" in Computer Chess

Post by duncan »

Sven wrote: One way to avoid "Draw Death" would be the appearance of a single engine that plays significantly stronger than the current top engines, say by 100 or 200 Elo points. As of today this seems unlikely, but can we really exclude it?
We cannot exclude it, but keeping that 100-200 Elo lead for a long time, like Rybka did, would be harder still.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Laskos »

Sven wrote: One way to avoid "Draw Death" would be the appearance of a single engine that plays significantly stronger than the current top engines, say by 100 or 200 Elo points. As of today this seems unlikely, but can we really exclude it? In the past there were already periods of stagnation, and then suddenly a big improvement came along when nobody thought it was possible.

I think we are still far from knowing what the best openings really are, and there are also some endgame types that are not always evaluated correctly by top engines. Think about fortresses, for instance, or about complex endgames with rooks, minor pieces and pawns, for which we still lack proven theoretical knowledge due to the lack of EGTBs for more than 6 or 7 pieces. For me these are two good reasons not to believe that we are already close to "perfect play" in computer chess. It might be "perfect-looking play" only. All we know, in my opinion, is that we have a couple of very strong engines that are really hard to beat with today's chess programming knowledge and hardware.

Once that new mega engine appears (and as a programmer I hope this will happen one day, although it will probably not be my own engine ...) I expect that parts of these mathematical models may become outdated, or will at least have to be reviewed.
I am more of Greg's opinion, that chess is too easy for computers. You list deficiencies of chess engines, and they are real deficiencies, but the big picture is that the engines are extremely strong. I took very balanced endgame positions (0-5cp unbalance) from real games and ran the following experiments (no TBs):

Code: Select all

Score of Stockfish dev vs Zurichess_00: 297 - 1 - 702  [0.648] 1000
ELO difference: 106.01 +/- 10.81
Finished match
Stockfish dev is a recent version of Stockfish. Zurichess_00 is the earliest version of Zurichess I have, and at about 1800 Elo it is human level (a pretty strong amateur). 30% of these ultra-balanced (endgame) openings are still playable against strong amateurs. Then I took the same Stockfish dev and Stockfish 7, separated by a not-so-small 120 Elo points (deliberately not small), and got:

Code: Select all

Score of Stockfish dev vs Stockfish 7: 8 - 4 - 988  [0.502] 1000
ELO difference: 1.39 +/- 2.35
Finished match
Now only 1.2% of the endgames are playable. So 96% of the endgames playable against strong amateurs (1 - 1.2/30 = 96%) are dead draws now. And it is not the strength difference (a respectable 120 Elo points) but the strength itself that gives more draws and less sensitivity.

The paradigm of computer chess has been unchanged since Knuth, 40 years ago: the same alpha-beta, which reduces the effective branching factor to 4-5, then ever more pruning and reductions, down to an EBF of about 1.5 today. Within this paradigm, computer chess is capped at 400-800 Elo points above today's top engines. The most comprehensive induction is from Andreas Strangmüller's results here (and the discussion):
http://www.talkchess.com/forum/viewtopic.php?t=61784 ,
which point to the lower values. That a new paradigm will appear and overturn one unchanged for 40 years, making a revolution under these conditions, is almost as unlikely in the near future as a solution to chess. It will happen, but in the pretty far future. I believe that in 10-20 years we will still be talking in the same terms and the same paradigm, but with a 95%+ draw rate among top engines even in fast tests (with balanced positions). Sure, an engine might appear that is significantly superior, but this superiority diminishes _objectively_ (as separation power) with general strength, longer TC and more hardware.
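To put those EBF figures in perspective, a minimal calculation (my own illustration): nodes grow as EBF^depth, so doubling the time adds log(2)/log(EBF) plies of search.

Code: Select all

import math

for ebf in (5.0, 4.0, 1.5):
    print("EBF %.1f -> %.2f extra plies per time doubling"
          % (ebf, math.log(2) / math.log(ebf)))
# EBF 5.0 -> 0.43, EBF 4.0 -> 0.50, EBF 1.5 -> 1.71 extra plies per doubling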



PS Having performed that test, I was curious what unbalanced positions and time odds would bring to that 98.8% draw rate and the lack of significant separation between Stockfish dev and Stockfish 7 on balanced positions:

Balanced:
+8 -4 =988 (1000)
Normalized ELO (trinomial): 0.037 +/- 0.062
Normalized ELO (pentanomial): 0.052 +/- 0.062

Balanced 4x White Time Odds:
+14 -5 =981 (1000)
Normalized ELO (trinomial): 0.065 +/- 0.062
Normalized ELO (pentanomial): 0.079 +/- 0.062

Unbalanced:
+284 -194 =522 (1000)
Normalized ELO (trinomial): 0.131 +/- 0.062
Normalized ELO (pentanomial): 0.254 +/- 0.062

With Red I marked what might happen in the distant future if we continue to use balanced openings (the Balanced trinomial result), and with Green what might be the best solution to it (the Unbalanced pentanomial result). There is a factor of about 7 between their separation powers, meaning 7^2, or about 50 times fewer games needed for the same statistical significance.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Ways to avoid "Draw Death" in Computer Chess

Post by jdart »

Does this situation also obtain with Chess 960?

--Jon
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Evert »

What is there to avoid? It's pretty clear that chess (FIDE rules) is a draw from the initial position with optimal play from both sides.
Unless you rig the opening by having one side play vastly suboptimally, a draw is the expected outcome. And if you do unbalance the opening like that, you're not measuring a result, just confirming your input.

Or am I missing something?
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Sven »

Laskos wrote:
Sven wrote: One way to avoid "Draw Death" would be the appearance of a single engine that plays significantly stronger than the current top engines, say by 100 or 200 Elo points. As of today this seems unlikely, but can we really exclude it? In the past there were already periods of stagnation, and then suddenly a big improvement came along when nobody thought it was possible.

I think we are still far from knowing what the best openings really are, and there are also some endgame types that are not always evaluated correctly by top engines. Think about fortresses, for instance, or about complex endgames with rooks, minor pieces and pawns, for which we still lack proven theoretical knowledge due to the lack of EGTBs for more than 6 or 7 pieces. For me these are two good reasons not to believe that we are already close to "perfect play" in computer chess. It might be "perfect-looking play" only. All we know, in my opinion, is that we have a couple of very strong engines that are really hard to beat with today's chess programming knowledge and hardware.

Once that new mega engine appears (and as a programmer I hope this will happen one day, although it will probably not be my own engine ...) I expect that parts of these mathematical models may become outdated, or will at least have to be reviewed.
I am more of Greg's opinion, that chess is too easy for computers. You list deficiencies of chess engines, and they are real deficiencies, but the big picture is that the engines are extremely strong.
What does "extremely strong" really mean, other than "much stronger than almost all other (human or computer) chess players"? There is no absolute playing strength, only a comparison.
Laskos wrote: I took very balanced endgame positions (0-5cp unbalance) from real games and ran the following experiments (no TBs):

Code: Select all

Score of Stockfish dev vs Zurichess_00: 297 - 1 - 702  [0.648] 1000
ELO difference: 106.01 +/- 10.81
Finished match
Stockfish dev is a recent version of Stockfish. Zurichess_00 is the earliest version of Zurichess I have, and at about 1800 Elo it is human level (a pretty strong amateur). 30% of these ultra-balanced (endgame) openings are still playable against strong amateurs. Then I took the same Stockfish dev and Stockfish 7, separated by a not-so-small 120 Elo points (deliberately not small), and got:

Code: Select all

Score of Stockfish dev vs Stockfish 7: 8 - 4 - 988  [0.502] 1000
ELO difference: 1.39 +/- 2.35
Finished match
Now, only 1.2% of endgames are playable. So, 96% of endgames playable against strong amateurs are dead draws now.
These results are not surprising at all. But my explanation is different from yours: the strength difference between the two Stockfish versions in the "sub-discipline" of playing balanced chess endgame positions is obviously much smaller than their Elo rating difference for whole games, perhaps because their endgame evaluations are more similar than their evaluation functions as a whole. You won't get a 99% draw outcome between these two SFs when playing 1000 normal games, since the expected result is about 660:340, so you should never get significantly more than 68% draws.
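To spell out that 68% bound (a sketch of the arithmetic): each draw contributes 1/2 point, so reaching an expected score s from n games with w wins and d draws means w + d/2 = s*n and w + d <= n, hence d <= 2*(1 - s)*n for s >= 1/2.

Code: Select all

def max_draw_rate(score):
    """Maximum possible draw fraction consistent with a given expected score."""
    return 2 * min(score, 1 - score)

print(max_draw_rate(0.66))   # -> 0.68, i.e. at most 68% draws at a 660:340 result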
Laskos wrote: And it is not the strength difference (a respectable 120 Elo points) but the strength itself that gives more draws and less sensitivity.
Here I disagree. The draw rate between two players A and B must also be a function of the strength difference between A and B. See my explanation regarding endgames above.
Laskos wrote: The paradigm of computer chess has been unchanged since Knuth, 40 years ago: the same alpha-beta, which reduces the effective branching factor to 4-5, then ever more pruning and reductions, down to an EBF of about 1.5 today. Within this paradigm, computer chess is capped at 400-800 Elo points above today's top engines. The most comprehensive induction is from Andreas Strangmüller's results here (and the discussion):
http://www.talkchess.com/forum/viewtopic.php?t=61784 ,
which point to the lower values.
This "capping" theory seems to be based on the assumption that letting an engine search with "infinite speed", or "infinitely long", would (in theory) lead to solving chess by searching deep enough to get a perfect analysis of a given position, and therefore may serve as a justification for stating that "infinite speed" defines an upper bound for Elo ratings. However, I am not convinced of it. Today's heavy pruning and reductions often cause even a 50-ply search to be incomplete and imperfect. We only know that searching 50 plies deep is almost always better than searching only 40 plies and usually leads to a higher rating, but we still don't know whether the resulting analysis is "perfect". Now we could think of switching off all the pruning and reduction stuff, and repeat the test - but then I guess we will get very different results for the same test that was done by Andreas, i.e. there might be a much smaller effect of diminishing returns.
Laskos wrote: That a new paradigm will appear and overturn one unchanged for 40 years, making a revolution under these conditions, is almost as unlikely in the near future as a solution to chess. It will happen, but in the pretty far future. I believe that in 10-20 years we will still be talking in the same terms and the same paradigm, but with a 95%+ draw rate among top engines even in fast tests (with balanced positions). Sure, an engine might appear that is significantly superior, but this superiority diminishes _objectively_ (as separation power) with general strength, longer TC and more hardware.
Here I mostly agree, although nobody really knows what will happen during the next 10-20 years.

As a general remark, what I am really missing is a test where it is not time, speed, number of threads, or any similar resource that is doubled repeatedly, but knowledge ... 8-) I hope someone can do that at some point in the future?!
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Ways to avoid "Draw Death" in Computer Chess

Post by Lyudmil Tsvetkov »

Laskos wrote: As can be seen in Graham Banks' match of top engines at LTC with many cores ( http://www.talkchess.com/forum/viewtopic.php?t=64666 ), a draw rate above 90% is not something of the distant future. It happens already, and the problem could be further worsened because we know that the top engines are close in strength, so the prior when calculating LOS is not uniform. It is concentrated towards the center, and that gives even less significance to the rare wins that do occur. We will leave that issue aside and work with uniform priors only.

In this thread:
http://www.talkchess.com/forum/viewtopic.php?t=64683 ,
it was shown, as Michel noted, that with increasing time control and hardware the (top) engines objectively get closer in strength, and it becomes harder to separate them. The measure of this is the Normalized Elo ( http://hardy.uhasselt.be/Toga/normalized_elo.pdf ), which gives the power of separation at a given statistical significance (LOS, p-value or SPRT stop). The Normalized Elo per doubling in time, from Andreas Strangmüller's comprehensive test across a wide range of time controls, looks as follows:
[Image: Normalized Elo per doubling in time control, from Andreas Strangmüller's tests]
It shows that computer chess is heading towards "Draw Death" with stronger engines, more cores and longer time controls. Not only do LTC games take more time, but more games are needed at LTC than at STC to separate the engines in strength. The separation power goes down pretty fast, and the engines get closer in strength not because of artifacts of rating schemes (like simple Elo, which is clearly inadequate here). The pentanomial statistic is the better choice for games played side and reversed on the same opening, because the results of these games are not completely independent. We treat those 2 games as one event with the possible outcomes (0, 1/2, 1, 3/2, 2), hence "pentanomial", with its variance defined as for a multinomial.

As shown in this thread:
http://www.talkchess.com/forum/viewtopic.php?t=61245 ,
in both the Bayeselo and Davidson models, introducing a large eloBias at high draw rates (large eloDraw), and using the pentanomial statistic for side-and-reversed games on the same opening, can improve the sensitivity (or separation power) over the eloBias=0 case up to arbitrarily high draw rates. The plots in that thread are basically rescaled Normalized Elo, before its formal introduction. To achieve the unbalance I see two ways: unbalanced openings and time control odds. The use of pentanomial variances will be crucial.
no draw death.

the only thing Graham's current match shows is that Komodo and Houdini are neck and neck.

at least a couple thousand Elo more to go until perfect play.

of course, engine X (+1000 Elo over K and H) will score 95% against each of them, even at VLTC.