Obligatory scaling

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Obligatory scaling

Post by Lyudmil Tsvetkov »

tpetzke wrote:[D]8/5PP1/4K3/4R2p/6kp/7p/7r/8 w - - 0 1

I think I can win this as white

Thomas...
:D :D
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Obligatory scaling

Post by Lyudmil Tsvetkov »

Harald wrote:Your rule just counts the black and white pawns on one side.
I assume 4 vs. 2 or 1 or 0 pawns are not part of this rule, right?
(The same for 3 vs. 1 or 0 and 2 vs. 0.)
What if there are 5 pawns for one side?
What if there are doubled or tripled pawns?
What if some of the pawns have already passed each other somehow?
What if white pawns are on file e (and f) and black pawns on h (and g)?

In this and other rules I am not sure how thoroughly you have looked
at _all_ possibilities of the distribution of pawns and pieces on the board.
This is interesting from a programmer's point of view. We need every
trick and bitmask and shortcut for an implementation of the rules
including absurd combinations that would only happen on a board
set up directly and not reached through a game.

Harald
Without going into much detail: for realistic situations (i.e. positions that could occur in actual games), I think the rule for single rook endgames with a pawn span equal to or less than 3 files (the pawn span considers both the white and the black pawns) would be the following:

- pawn span equal or less than 3 files
- single rook endgames
- pawns for both sides equal, or one side has just one pawn more (important! otherwise, if you consider just the pawn span, you might go wrong with the number of pawns for each side)

then it would be possible to scale down very much, maybe by some 80-90% of eval. The above rule, very simple as it is, will include central and edge pawn spans (but you could limit it to just edge/wing pawn spans to be even safer, i.e. the a,b,c and f,g,h files): 3 vs 3 pawns, 3 vs 2, 2 vs 2, 2 vs 1, 1 vs 1 and 1 vs 0 pawns, but not 3 vs 1 or 2 vs 0 pawns.

I think such a rule should be effective, but if you are wary, you might scale by less.
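To make the pawn span condition concrete, here is a minimal sketch (in C++, since that is what most engines discussed here are written in) of how one might count the files spanned by the pawns from a pawn bitboard. The helper name and the a-file mask are illustrative assumptions, not code from any particular engine:

Code: Select all

#include <cstdint>

using Bitboard = std::uint64_t;

// Hypothetical helper: number of files spanned by the given pawns,
// counted inclusively from the leftmost to the rightmost occupied file.
// Returns 0 if there are no pawns at all.
int pawn_span(Bitboard pawns) {
    if (!pawns)
        return 0;
    const Bitboard fileA = 0x0101010101010101ULL; // one bit per a-file square
    int first = -1, last = -1;
    for (int f = 0; f < 8; ++f)
        if (pawns & (fileA << f)) {
            if (first < 0)
                first = f;
            last = f;
        }
    return last - first + 1;
}

Under this reading, the first condition above is simply pawn_span(whitePawns | blackPawns) <= 3.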

If you would like to scale down any endgames based on pawn span, not only single rook endgames, the rule for me would be:

- pawn span equal or less than 3 files
- nonpawn material equal or just B vs N (very important! you cannot easily scale endgames when the nonpawn material differs, for example R vs N, even though the pawn span is small)
- 2 or fewer pieces each side
- pawns for both sides equal in number, or one side having just one pawn more (again very important, otherwise you would also scale endgames with 3 vs 1, 3 vs 0 pawns, etc.)

then it would be possible to scale this down by maybe some 50%, but you might experiment.

The above rule would of course include endgames with equal material or B vs N, with 3 vs 3, 3 vs 2, 2 vs 2, 2 vs 1, 1 vs 1 and 1 vs 0 pawns, and with 2 or fewer pieces each side.

For any random single rook endgame, no matter what the conditions are, I would scale down by maybe 15-20%.
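Putting the three suggestions together, a hedged sketch of the proposed logic might look like the following. Everything here is an assumption for illustration: the inputs, the pawn_span helper from the sketch above, and the exact percentages, which the rules themselves leave open to experiment:

Code: Select all

// Sketch only: returns the percentage of the static eval to keep,
// following the three rules proposed above.
int scaling_percent(int span,             // pawn_span(whitePawns | blackPawns)
                    bool singleRookEach,  // exactly one rook (plus pawns) per side
                    bool npmEqualOrBvsN,  // nonpawn material equal, or just B vs N
                    int piecesW, int piecesB, // nonpawn pieces per side
                    int pawnsW, int pawnsB) {
    int pawnDiff = pawnsW > pawnsB ? pawnsW - pawnsB : pawnsB - pawnsW;

    // Rule 1: single rook endgame, span <= 3 files, at most one pawn more
    if (singleRookEach && span <= 3 && pawnDiff <= 1)
        return 15;                        // scale down by some 80-90%

    // Rule 2: any endgame with 2 or fewer pieces per side, equal nonpawn
    // material or just B vs N, span <= 3 files, at most one pawn more
    if (npmEqualOrBvsN && piecesW <= 2 && piecesB <= 2
        && span <= 3 && pawnDiff <= 1)
        return 50;                        // scale down by some 50%

    // Fallback: any other single rook endgame
    if (singleRookEach)
        return 82;                        // scale down by some 15-20%

    return 100;                           // no scaling
}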
arjuntemurnikar
Posts: 204
Joined: Tue Oct 15, 2013 10:22 pm
Location: Singapore

Re: Obligatory scaling

Post by arjuntemurnikar »

I have studied rook endgames extensively in the past so I know what I am talking about.

Not all rook endgames are drawish. Full stop.

If you scale down all single rook endgames, you might as well scale down all single bishop and single knight endgames too. By your definition of "drawish" all single piece endgames are drawish!

Single rook endgames with pawns on the same side of the board, where the opposing player has no more than a 1-pawn advantage, are known to be drawish. This is well-established theory, so I will agree with you only here. (Note: it doesn't mean it's a forced draw! You seem to confuse forced WDL with a high probability of WDL, as is evident in your other post about scaling.)

In the case of Stockfish, it already has quite a bit of rook endgame knowledge, such as the Lucena and Philidor positions. It is not complete knowledge, but there is enough to win most won rook endings. What I would do, instead of scaling everything down as you suggest, is improve this code.

If you scale every rook endgame down, Stockfish will start avoiding those lines in search, even when they are sometimes winning. If you use TBs (which cover many of these trivial endgame wins), you will not even reach some of these won TB positions, because before the search gets there, your scaled-down eval ensures that rook endgame lines are mostly all pruned away.

You just say whatever comes to your mind without much thought, because you don't understand how engines work. If you did, you wouldn't make such stupid suggestions.

Since you don't know how engines work, you should respectfully listen when people explain things to you, instead of ignoring them and continuing to make silly claims.
arjuntemurnikar
Posts: 204
Joined: Tue Oct 15, 2013 10:22 pm
Location: Singapore

Re: Obligatory scaling

Post by arjuntemurnikar »

While searching Google for some statistics about drawn rook endgames, I came across this EPIC post by Tord Romstad on the Stockfish support website, in response to someone who complained that Stockfish mis-evaluated a drawn rook ending in a recent Kramnik-Leko game.

This post was so well written and thorough that I thought it deserves to be reposted here:


Hi Ofer,

Sorry for not replying in depth to this earlier.

This is an interesting and surprisingly complicated subject. I'll try to explain it in great detail below, but for impatient readers, the short version is that you can't consider the evaluation function in isolation from the search, that it is dangerous to add special-case evaluation code for one particular class of endgames without considering the implications for evaluations of related endgames that can transition into this endgame (or vice versa), and that even a completely bug-free implementation of entirely correct chess knowledge will not always make the program better.

I'm the author of the piece of chess knowledge in Stockfish that is most closely related to what you ask for: The evaluation scaling function for endgames with R+2 vs R+1 with no passed pawn. It works roughly like you suggest in your last paragraph: If the defending king is well placed, the evaluation score is scaled down by a factor between 0.16 and 0.62, depending on the rank of the most advanced pawn. We seriously considered removing this KRPP vs KRP function some time ago, because tests prove that it is completely worthless in terms of playing strength, and because we generally prefer to avoid needless complexity when it doesn't improve strength.

That this piece of knowledge doesn't improve the playing strength at all is initially surprising. When you watch Stockfish play against some other chess program which doesn't have this knowledge, you will occasionally observe games where Stockfish is able to save a difficult position because the opponent allows it to transition to this particular type of endgame. How is it possible, then, that this knowledge doesn't improve the strength, even if just by a couple of Elo points?

The explanation, I believe (important qualifier, because this subject is still poorly understood), is that you always have to consider the interaction between the evaluation and the search. The principal variation returned from a deep search is very rarely entirely correct. Especially towards the end of the PV, the moves are likely to be sub-optimal. The concrete position at the end of the PV is almost certainly never going to appear on the board. This is why a theoretically correct evaluation of that particular concrete position is not always the best evaluation from the perspective of trying to guide the program towards promising branches of the game tree. If the PV ends in a drawish rook endgame a pawn up, there's a decent chance that by heading towards that part of the game tree, the program will eventually discover a way to win a pawn under more favorable circumstances. More generally, in an imperfect search where many tree branches are pruned or searched with dramatically reduced depth, a more perfect evaluation function is not always better.

Moreover, in the particular case of rook endgames with an extra pawn and all pawns on the same wing, an increasing number of pawns for both sides does not only increase the winning chances for the stronger side, it also increases the complexity of the endgame, including the number of obscure corner cases where the general heuristic rules do not apply. If we identify all those corner cases and add code for each of them, evaluating R+4 vs R+3 and similar endgames is no longer simple. If we decide just to ignore the corner cases and go with simple and straightforward code, we run into the problem that chess knowledge that is merely usually correct, rather than almost always correct, will often make the program play worse. This is because your practical chess strength is not primarily determined by the quality of your average moves, but by the quality of your worst moves.

I may seem to be contradicting myself here, and in a way, I am: By my own reasoning, couldn't ignoring the obscure corner cases turn out to make the program play better in practice? Assume that the program, when defending an inferior position, returns a PV resulting in a rook endgame a pawn down with a drawish pawn configuration, but where some random concrete aspect of the position gives the stronger side a winning advantage. In that case, there would be decent chances that by heading towards this branch of the game tree, the program would discover a way to force the same drawish pawn configuration under better circumstances.

The point is that you always have to test everything carefully, even "obvious" improvements. New knowledge, even if theoretically correct, may make the program stronger, weaker, or have no measurable effect on strength. Implementing heuristic rules that are usually, but not always correct will sometimes work, and sometimes not. If it doesn't work, adding lots of special cases for when the heuristics should not apply may fix the problem, but it may also make it even worse. Intuition is a poor guide, and you never know before you try.

In the particular case of rook endgames with all pawns on the same wing, I am very skeptical that adding more knowledge would help, for the simple reason that knowledge about the simplest case (R+2 vs R+1) seems to have no value. When this doesn't work, I see no reason to believe that similar knowledge for more complex endgames where the exceptions are more numerous would prove effective. As always, I could be wrong, and it is not impossible that we will experiment with code for more complex rook endgames some time in the future.

In addition to the problems described above, introducing code for various classes of endgames has the annoying consequence of introducing discontinuities in the evaluation function. Writing a heuristic evaluation rule that scales down the scores in rook endgames of the type we discuss may seem simple in isolation, but consider the situation when you have similar types of rules for many other endgames with varying degrees of drawishness. Often one such type of endgame has the possibility of transitioning into another. Unless we get all the parameters for all such classes of endgames precisely right, the program would often be too eager or too hesitant to transition from one drawish (but not trivially drawn) endgame to another. This adds a level of combinatorial complexity whenever special knowledge for a new class of endgames is added.

A frustrating thing about trying to add new knowledge to a chess program is that it so often looks like it works: You can easily identify games where the new knowledge helps the program win games, or save difficult positions. But still, when playing thousands of test games, it turns out that the program performs no better, and often even worse. The games where the new knowledge turns out to hurt the program are usually impossible to identify. "Optically", the program plays better than before, but in reality, it plays worse.

If you are still reading: First of all, thanks. This message took me a lot of time to write, and I appreciate that you take the time to read it. Also, I fundamentally sympathise with your position, and like you, I would in principle like to have this sort of knowledge in the program. Like you, I hate when the program misevaluates a position because it lacks some elementary piece of knowledge that I, as a human, possess. I understand if you have been a little frustrated and disappointed with the lack of response to your request until now. The reason is that this is a much more complex subject than it appears on the surface, and that our patience is sometimes worn thin by the frequency of such requests from people who don't understand the problems involved. When you have tried many times to attack an apparently simple problem and encountered nothing but frustration and failure, it can be annoying when outsiders appear and ask "why don't you implement this simple thing?". I don't mean to criticise you for your perfectly legitimate and politely asked question, of course. I'm just trying to help you understand why it's taken us a long time to write the comprehensive reply you deserve.

Tord


The discussion can be found here: http://support.stockfishchess.org/discu ... n-one-side
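For readers who want to picture the KRPP vs KRP function Tord describes above, here is a rough sketch of such a scaling hook. Stockfish expresses scale factors in 64ths of the normal scale, so factors of 0.16 and 0.62 correspond roughly to 10/64 and 40/64; the rank table and the king-placement test below are simplified stand-ins for illustration, not the actual Stockfish source:

Code: Select all

// Sketch of a KRPP-vs-KRP style scaling hook, in 64ths of the
// normal scale (64 == no scaling, the Stockfish convention).
int krppkrp_scale(int pawnRank,                   // rank of the most advanced pawn, 0..7
                  bool defendingKingWellPlaced) { // simplified stand-in condition
    if (!defendingKingWellPlaced)
        return 64;  // no special handling; normal evaluation applies

    // Roughly 0.16 .. 0.62 of the normal score, by pawn rank (illustrative values).
    static const int byRank[8] = { 10, 10, 10, 15, 20, 40, 64, 64 };
    return byRank[pawnRank];
}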
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Obligatory READING

Post by zullil »

arjuntemurnikar wrote:While searching Google for some statistics about drawn rook endgames, I came across this EPIC post by Tord Romstad
The discussion can be found here: http://support.stockfishchess.org/discu ... n-one-side
Thank you for finding and posting this (and thanks to Tord for taking the time to compose such a thoughtful and informative response).
Last edited by zullil on Fri Jun 27, 2014 5:39 pm, edited 1 time in total.
arjuntemurnikar
Posts: 204
Joined: Tue Oct 15, 2013 10:22 pm
Location: Singapore

Re: Obligatory scaling

Post by arjuntemurnikar »

I did a quick Google search and found this blog post claiming to have done statistical analysis on endgames.

This is by no means scientific proof, because the author of the analysis does not give any detailed account of which database he queried, nor any information about his setup in general, so nobody can really verify his results independently.

But no matter, here is what he says:
"All rook endings are drawn", according to a common piece of chess folklore. We decided to distrust emotion and check the figures, comparing the percentages of draws in different types of endings, using a database of more than three million games. The results were very surprising. Bishop endings turned out to be the most drawish, with 47%. Second place went to queen endings on 43%. Even more surprising was the third place for knight endings, at 40%. And the notorious rook endings came only second-last at 38%, with pawn endings naturally turning out to be the least drawish at 27%.
The blog post is here: http://streathambrixtonchess.blogspot.c ... dgame.html

I also found this on the chessprogramming wiki:
In 2013, John Nunn applied the 7-piece Lomonosov Tablebases to R+2P vs. R+P positions from the famous book Rook Endings, 2nd edition, by Levenfish and Smyslov [7], which was assumed to contain the truth and, owing to Nunn, this is no longer so.
...but since I don't have access to the ICGA Journal, I could not read up on it. If anybody does have access and is willing to share John Nunn's findings, it would be helpful to this discussion. :)
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Obligatory scaling

Post by Lyudmil Tsvetkov »

arjuntemurnikar wrote:While searching Google for some statistics about drawn rook endgames, I came across this EPIC post by Tord Romstad on the Stockfish support website [...]

[Tord's post and the link, quoted in full in the previous message, snipped.]
Read more carefully, Arjun, you simply do not read carefully.

Tord says that SF had code for R+2 pawns vs R+1 pawn on one wing and it did not boost playing strength. He also says that he did not believe adding knowledge for more complex rook endgames would be useful, as the less complex rule did not work. But he did not try, and neither did anyone else on the SF team.

Do you know why scaling down R+2 pawns vs R+1 pawn does not work in SF? The main reason is that there is nothing to gain here: the endgame is simply too simple, and the engine sees it all in the search. That is why there is no gain, very simple. That is why I suggested scaling 3 vs 2 in SF, where you will see the benefit. Such scaling is meaningful, and SF often goes wrong there, but I have not seen a game where it goes wrong with 2 vs 1 pawns.

About the combinatorial problems of having many specific eval functions: fully true, and that is why I say scale what is important. If you are unable to scale most late endgames with equal material and pawn span less than or equal to 3, then scale at least single rook endgames with 3 vs 2 pawns and fewer. One of the two, but not both at the same time.

Who needs a rule for 2 vs 1 pawns when it is not working? So I would say drop it and replace it with the meaningful rule for 3 vs 2 pawns, which already adds strength. Or even better, implement a rule for scaling all 1- and 2-piece endgames with equal material and pawn span less than or equal to 3. Shane was very close; actually, 2 of his scaling patches succeeded quite nicely. So that scaling works; you should not deny the facts. Tord's old minimalistic rules have already been superseded, but the replacement is still not implemented. Is this not very evident proof that more complex scaling is good, works, and also adds nice strength?

Gary, Lucas and Joona might be right that the last Shane patch could be improved, and I think Shane or someone else should make a bit more effort so that the patch is finally implemented. Why not try what I suggest:

50% scaling when

- pawn span less than or equal to 3
- nonpawn material equal or just B vs N
- pawns equal, or just one pawn more for one of the sides
- 1 or 2 pieces each side?

I think this is a meaningful rule that could help the search in many situations and also improve strength.
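If this rule were tried in Stockfish, the 50% would presumably be expressed in the engine's scale-factor units, where 64 means no scaling, so 50% is a factor of 32. A minimal sketch, with the function name, inputs, and wiring all assumed for illustration:

Code: Select all

// Hypothetical predicate mirroring the four conditions above.
// Returns a Stockfish-style scale factor: 64 = full score, 32 = half.
int obligatory_scale(int span, bool npmEqualOrBvsN,
                     int pawnDiff, int piecesW, int piecesB) {
    if (span <= 3 && npmEqualOrBvsN && pawnDiff <= 1
        && piecesW <= 2 && piecesB <= 2)
        return 32;   // halve the endgame score
    return 64;       // leave the score unchanged
}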
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Obligatory scaling

Post by Lyudmil Tsvetkov »

And something else, Arjun.

The last Shane scaling patch, for pawn span less than or equal to 2 and with all pieces involved, adds 2 elo, while the existing SF rule for rook scaling with the same pawn span does not add any elo. Does that ring a bell, Arjun? It is meaningful to scale more complex endgames rather than less complex ones; there is proof of this, you can see it, it is 2 elo on the framework.

Another question is why SF still has the old rook rule that does not add any strength but only uglifies the code, while still not implementing the 2 elo patch. Please tell me why. If the SF developers are rule-abiding, they could refuse to implement the Shane scaling patch until it is further improved, but then they should also immediately throw out the old rook rule. Why not do that?
Uri Blass
Posts: 10297
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Obligatory scaling

Post by Uri Blass »

The point of knowledge of 2 vs 1 is not to draw the endgame, which is simple, but to get into the right endgame.

Tord said that there are games where it worked, but that it is harder to find games where it did not work.

I guess that could be the case if the program is slightly slower because of the new knowledge, but if that is not the case, then Tord's explanation does not seem logical to me.

I can add that Stockfish recently added special knowledge that includes 2 vs 1, if I understand correctly, and it was productive (see the pawn span patch).
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: Obligatory scaling

Post by Lyudmil Tsvetkov »

arjuntemurnikar wrote:I did a quick Google search and found this blog post claiming to have done statistical analysis on endgames. [...]

[The blog statistics and the Nunn reference, quoted in full in the earlier message, snipped.]
This is complete BS.

You do not know what criteria were used for choosing which endgames to count.

In pawn endgames it is very simple: you either see a win very quickly, or it is simply a draw. You do not need stats for this; there is no WDL probability there, and the author should not have included them.
The first thing a primer teaches you is that knight endgames are essentially the same as pawn endgames, so if you are up a pawn in a knight endgame with sufficient pawn span, it is simply a win; but that is far from true for rook endgames, quite the opposite.
One pawn more in queen endings is usually easier to convert than in rook endings, so that is another wrong statement.

Instead of citing unsubstantiated sources, it would be better to just try one idea and then the other.