Playing the endgame like a boss !!

Uri Blass · Post by **Uri Blass** » Sun Mar 17, 2019 9:27 am

I do not know if NNs are going to dominate, but it is clear that Stockfish is going the wrong way and is going to lose first place.
I believe that testing only by playing many games is not the correct way to keep getting better.

I think that the first step, if you have an engine, should be to build a test suite from the engine's games, using positions where the engine does not find the right move.

Testing a new patch should be done first on 1000 positions where the engine failed to find the right move.
If there is no improvement, then it is a waste of resources to test at short or long time control, because an improvement in Elo also means an improvement in the engine's move choice in at least some cases.

For every patch that passes, there should be a list of positions where the patch improves the engine's move choice, in order to help other developers.

This does not happen in the Stockfish framework, and people who look at the test results see only that the version with the new patch passed the SPRT test.
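The gating step described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Engine` callable, the `compare` and `worth_playing_games` helpers, the placeholder position IDs standing in for FEN strings, and the toy moves are not part of the Stockfish framework. The sketch also records positions that the patch breaks, which goes slightly beyond what the post asks for.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# A test position: (identifier or FEN, set of moves accepted as correct).
Position = Tuple[str, FrozenSet[str]]
# Hypothetical engine interface: takes a position, returns its chosen move.
Engine = Callable[[str], str]


def solved(engine: Engine, suite: List[Position]) -> List[bool]:
    """For each position, report whether the engine's move is an accepted one."""
    return [engine(pos) in accepted for pos, accepted in suite]


def compare(base: Engine, patch: Engine, suite: List[Position]) -> Dict[str, List[str]]:
    """List positions the patch fixes and positions it breaks relative to base."""
    report: Dict[str, List[str]] = {"fixed": [], "broken": []}
    for (pos, _), base_ok, patch_ok in zip(suite, solved(base, suite), solved(patch, suite)):
        if patch_ok and not base_ok:
            report["fixed"].append(pos)
        elif base_ok and not patch_ok:
            report["broken"].append(pos)
    return report


def worth_playing_games(report: Dict[str, List[str]]) -> bool:
    """Assumed gate: only spend game-playing resources if more is fixed than broken."""
    return len(report["fixed"]) > len(report["broken"])


# Toy data: placeholder IDs instead of real FENs, invented moves.
suite = [
    ("pos1", frozenset({"a1a8"})),
    ("pos2", frozenset({"e2e4", "d2d4"})),
    ("pos3", frozenset({"g1f3"})),
]
base_moves = {"pos1": "a1a2", "pos2": "e2e4", "pos3": "b1c3"}
patch_moves = {"pos1": "a1a8", "pos2": "c2c4", "pos3": "g1f3"}

report = compare(base_moves.__getitem__, patch_moves.__getitem__, suite)
print(report)
print(worth_playing_games(report))
```

The `fixed` list is the per-patch list of improved positions that could be published alongside the SPRT result.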

hgm · Post by **hgm** » Sun Mar 17, 2019 10:46 am

The problem is that training purely for win probability is not consistent. Even if at some point it knew perfectly how to win KRK, the training would raise the value of the long wins, making it more difficult for the engine to actually find those wins. This makes the inability to find the long wins a self-fulfilling prophecy. If, given the evaluation noise of the NN, an evaluation gradient of 10% is needed from the start of the long win to the mate, the long wins cannot be evaluated much better than 90% without losing the ability to actually convert 90% of them. This is then what the training will converge to (say 92%). As a consequence, it will indeed fail to convert in ~8% of the cases.

Of course you can try to get the evaluation noise down, so that you need a smaller gradient to convert the long wins, but with a NN of given size there will be a limit to that.

If the training objective had not been the pure win probability S, but something like S - DTM*0.5%, the situation where it can convert 100% of the long KRK wins (say 20 moves, allowing it a few sub-optimal moves) would be stable: the NN value head would output the 90% it was trained to output, which provides enough gradient to convert with certainty. Further training would then not destroy this, because the certain conversion still took about 20 moves, so it would just confirm the 90% output. The situation where the conversion is 100% is thus preserved, rather than destroyed by further training.
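The arithmetic behind the two targets can be sketched as follows. This is only an illustration under stated assumptions: DTM is taken as distance to mate in moves, S is 1.0 for a won position, and the function names and the `penalty` parameter are invented here, not taken from any actual training code.

```python
def pure_target(dtm: int) -> float:
    """Pure win probability S: every won position gets the same target."""
    return 1.0


def shaped_target(dtm: int, penalty: float = 0.005) -> float:
    """S - DTM * 0.5%: positions closer to mate get a higher target."""
    return 1.0 - penalty * dtm


# Walk from the start of a 20-move KRK win down to mate.
for dtm in (20, 15, 10, 5, 0):
    print(f"DTM {dtm:2d}: pure = {pure_target(dtm):.3f}, shaped = {shaped_target(dtm):.3f}")

# Difference in target between the mate and the start of the win.
pure_gradient = pure_target(0) - pure_target(20)
shaped_gradient = shaped_target(0) - shaped_target(20)
print(f"pure gradient = {pure_gradient:.3f}, shaped gradient = {shaped_gradient:.3f}")
```

With the pure target the start of the win and the mate are indistinguishable, whereas the shaped target leaves the 10% slope that the post argues is needed to steer the search toward the mate.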

jp · Post by jp » Sun Mar 17, 2019 1:16 pm

Alexander Lim wrote: ↑Sun Mar 17, 2019 4:23 am Apparently Demis Hassabis said AlphaZero does not suffer from these endgame problems (one of the Leela developers mentions this in a YouTube video). Are there any AlphaZero games played to the endgame with mate to confirm this?

I think we'd need to see this to believe it. Right now it looks like DM's choice to end games early may have unintentionally worked out very well for them.

Some people claim that A0's analysis of the Carlsen-Caruana WCh. endgames was bad compared with SF. I haven't seen for myself though. Can anyone here confirm?

abulmo2 · Post by **abulmo2** » Sun Mar 17, 2019 9:18 pm

Uri Blass wrote: ↑Sun Mar 17, 2019 9:27 am I think that the first step, if you have an engine, should be to build a test suite from the engine's games, using positions where the engine does not find the right move.
Testing a new patch should be done first on 1000 positions where the engine failed to find the right move.

Examining positions and games to discover weaknesses of a program is of course good practice; however, it is not sufficient. If you improve your engine on those 1000 positions, you may degrade it on other untested positions, so that the overall strength actually diminishes. For example, with Amoeba version 2.3 I get the following result on STS 15.0 (avoid pointless exchange): 44/100. With version 2.4, the result improved to 65/100. Unfortunately, at the same time, the result on STS 6.0 (recapturing) degraded from 71/100 to 51/100.
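A small sketch of how such per-suite bookkeeping might be automated, using only the four Amoeba scores quoted above. The helper names `deltas` and `regressions` are made up for illustration.

```python
from typing import Dict, List, Tuple

# Scores out of 100: suite -> (Amoeba 2.3, Amoeba 2.4), as quoted in the post.
results: Dict[str, Tuple[int, int]] = {
    "STS 15.0 (avoid pointless exchange)": (44, 65),
    "STS 6.0 (recapturing)": (71, 51),
}


def deltas(scores: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Change in solved positions per suite (new minus old)."""
    return {name: new - old for name, (old, new) in scores.items()}


def regressions(scores: Dict[str, Tuple[int, int]]) -> List[str]:
    """Suites where the newer version solves fewer positions."""
    return [name for name, change in deltas(scores).items() if change < 0]


changes = deltas(results)
net = sum(changes.values())

for name, change in changes.items():
    print(f"{name}: {change:+d}")
print(f"net change over both suites: {net:+d}")
print("regressed:", regressions(results))
```

The gain on one suite is almost exactly cancelled by the loss on the other, which is why a single targeted suite says little about overall strength.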

Uri Blass · Post by **Uri Blass** » Mon Mar 18, 2019 5:04 am

abulmo2 wrote: ↑Sun Mar 17, 2019 9:18 pm
Uri Blass wrote: ↑Sun Mar 17, 2019 9:27 am I think that the first step, if you have an engine, should be to build a test suite from the engine's games, using positions where the engine does not find the right move.
Testing a new patch should be done first on 1000 positions where the engine failed to find the right move.
Examining positions and games to discover weaknesses of a program is of course good practice; however, it is not sufficient. If you improve your engine on those 1000 positions, you may degrade it on other untested positions, so that the overall strength actually diminishes. For example, with Amoeba version 2.3 I get the following result on STS 15.0 (avoid pointless exchange): 44/100. With version 2.4, the result improved to 65/100. Unfortunately, at the same time, the result on STS 6.0 (recapturing) degraded from 71/100 to 51/100.

I am not saying not to play games for testing, but games should be played only after giving positions where the patch is supposed to fix the move choice.

In practice, I see a lot of patches in the Stockfish framework without a list of positions that Stockfish plays better.

jp · Post by jp » Sat Mar 23, 2019 10:02 pm

The LC0 blog talks about a TCEC 7-man endgame, KNPP vs kbp, which Lc0 couldn't win.
SF showed +14 & +153.
TCEC gives the engines 6-man TBs.

Ovyron · Post by **Ovyron** » Sun Mar 24, 2019 11:03 am

hgm wrote: ↑Thu Mar 14, 2019 12:14 pm It is generally believed that Stockfish or Komodo would be much better for analysis than, say, The Baron. But this hasn't really been tested, and might very well be not true.

Oh, I've been testing this privately since 2007. Not only that, whenever a new engine pops up, I don't care about its ELO at all, I use it to analyze my Correspondence games against the strongest opposition I can find, and the same analysis methods that can beat some opponent easily, can easily lose against them if one uses very weak engines in comparison with what they're using.

The correlation between an engine's ELO and the true quality of its move choices is extraordinarily high. Learning can offset this (for instance, Rybka 3 with a correctly used Persistent Hash providing better analysis than Fritz 15 - or Private Stockfish 9 with learning providing better analysis than Stockfish 10), but not by much.

The most extraordinary case was Stockfish 2.2.2, which continued to provide better analysis than Stockfish 2.3.1, 3, 4, 5 or 6. Even at Stockfish 7 the analysis was of comparable quality, so I could keep using both together as if they were different engines. It took until Stockfish 8 to finally make S2.2.2 obsolete, by always finding moves of equal or better quality, and that required some 230 Elo difference.
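For a sense of scale, the expected game score for a 230-point gap can be worked out with the standard logistic Elo formula. That model is an assumption here: the post does not say which rating list or model the 230 figure comes from, and the function name is illustrative.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


score = expected_score(230)
print(f"expected score at +230 Elo: {score:.2f}")
```

Under that model a 230-point gap corresponds to scoring a little under 80% in head-to-head games.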

On the flip side, there were Houdini 2 and 4. Despite their tremendous Elo at the time, they never provided unique move choices that were better. Houdini 3 was better than 4. But then H6 is a top engine for analysis that continues to shatter the analysis of S10, Shash, McCain and anything else in its way.

And then there are the outliers: Komodo 10.4 or 11.3.1 never provided better analysis than 10.1. Deep Shredder 13 never provided better analysis than Fritz 15. Fizbo 2 is a top-8 engine and I loved its playing style, but I was never able to squeeze anything useful from it (while I could from lower-rated engines).

Other than that (and, huh, Critter 1.2 still providing better analysis than Critter 1.6?), generally more elo means more move quality.

Unfortunately, the cost of move quality is time. If you use 2 engines to analyze you have to split the analysis time, it has to be worthwhile that the time used by the second engine isn't given to the first one. Nowadays using a third engine is crazy and that time is better spent going deeper and wider with the other two, or with just one.

I reckon the best way to test this is against a correspondence chess opponent that is "out there to kill you" (not only punishing your mistakes, but also going into positions where it's very hard to find the best moves), in this environment (say, a critical position where 90% of engines suggest a losing move) it is very easy to tell engines of low move quality apart (by just looking at the moves they produce, no elo involved), and it's very clear who are giving the best moves the fastest.

And those have been, by no coincidence, the engines with the highest elo. Namely, different Stockfish flavors (which can be tricky due to the engine's nature, but hey, being tricky is a good thing, it means it is ME who hasn't been obsoleted yet.)

jp · Post by jp » Sun Mar 24, 2019 11:11 am

Ovyron wrote: ↑Sun Mar 24, 2019 11:03 am Oh, I've been testing this privately since 2007. Not only that, whenever a new engine pops up, I don't care about its ELO at all, I use it to analyze my Correspondence games against the strongest opposition I can find, and the same analysis methods that can beat some opponent easily, can easily lose against them if one uses very weak engines in comparison with what they're using.

The correlation between an engine's ELO and the true quality of its move choices is extraordinarily high.

Very interesting. Have you tested Lc like that yet?

zullil · Post by **zullil** » Sun Mar 24, 2019 11:22 am

Ovyron wrote: ↑Sun Mar 24, 2019 11:03 am
I reckon the best way to test this is against a correspondence chess opponent that is "out there to kill you" (not only punishing your mistakes, but also going into positions where it's very hard to find the best moves), in this environment (say, a critical position where 90% of engines suggest a losing move) it is very easy to tell engines of low move quality apart (by just looking at the moves they produce, no elo involved), and it's very clear who are giving the best moves the fastest.

If possible, please post an example "a critical position where 90% of engines suggest a losing move."

It seems that what you are suggesting is that Elo should correlate with performance on a carefully chosen set of test positions. This seems unsurprising to me.

MikeGL · Post by **MikeGL** » Sun Mar 24, 2019 4:20 pm

zullil wrote: ↑Sun Mar 24, 2019 11:22 am
Ovyron wrote: ↑Sun Mar 24, 2019 11:03 am
I reckon the best way to test this is against a correspondence chess opponent that is "out there to kill you" (not only punishing your mistakes, but also going into positions where it's very hard to find the best moves), in this environment (say, a critical position where 90% of engines suggest a losing move) it is very easy to tell engines of low move quality apart (by just looking at the moves they produce, no elo involved), and it's very clear who are giving the best moves the fastest.

If possible, please post an example "a critical position where 90% of engines suggest a losing move."

It seems that what you are suggesting is that Elo should correlate with performance on a carefully chosen set of test positions. This seems unsurprising to me.

I have posted a study before, but the position is funny and almost impossible to reach in OTB play.
