Testing endgame strength

AlvaroBegue · Post by **AlvaroBegue** » Wed Jun 21, 2017 10:45 am

Hi,

I want to start working on the late-endgame evaluation in RuyDos. I have a lot of ideas of what to try, but no good way to test them. The standard mechanism of playing a gazillion bullet games has the problem that most bullet games (at least at my engine's current skill level) don't make it to the late endgame, so from the point of view of what I am trying to test, the signal to noise is dismal.

Do you guys have a good collection of starting positions to test endgame strength?

Thanks,
Álvaro.

Dann Corbit · Post by **Dann Corbit** » Wed Jun 21, 2017 11:01 am

What exactly are you looking for?
Tablebase positions which just require lookup?
Late middle game?
Early endgame with 10 pieces or more?
Something else?

AlvaroBegue · Post by **AlvaroBegue** » Wed Jun 21, 2017 1:36 pm

Dann Corbit wrote:What exactly are you looking for?
Tablebase positions which just require lookup?
Late middle game?
Early endgame with 10 pieces or more?
Something else?

I want to test how well the engine evaluates positions with 8 pieces or less. So perhaps starting with positions with 10 or 12 pieces would be good, but they should be "interesting", so it shouldn't be completely clear what the result will be, and there should be a representative variety of the endgames that do happen in games.

The only ideas I have for how to collect such a dataset seem very expensive, so I was hoping someone had a ready-made set, or perhaps just better ideas.

Laskos · Post by **Laskos** » Wed Jun 21, 2017 1:53 pm

http://s000.tinyupload.com/?file_id=637 ... 6449648210

These are endgame suites of varying imbalance, built from positions that occurred in games and having (on average) the number of pieces you ask for. The name End_02_05.epd means that the imbalance is 20-50 cp according to Stockfish. I collected the positions from games of strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so the sensitivity of the test decreases. If they are too unbalanced, you will have to use the pentanomial model to assess the variance (sigma) of the result, which is not yet implemented in Cutechess.
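
For anyone wanting to build a similar bucket from their own games, here is a minimal sketch of the selection step. It assumes you already have (FEN, eval in centipawns) pairs from some analysis engine; the input format, the function names and the 12-piece limit are illustrative assumptions, not a description of how the linked suites were actually produced.

```python
def piece_count(fen: str) -> int:
    """Count all men (kings and pawns included) in the board field of a FEN/EPD string."""
    board = fen.split()[0]
    return sum(ch.isalpha() for ch in board)


def select_positions(records, min_cp=20, max_cp=50, max_pieces=12):
    """Keep positions whose absolute eval lies in [min_cp, max_cp]
    and which have at most max_pieces men on the board.

    records: iterable of (fen, eval_cp) pairs (hypothetical input format).
    """
    return [
        fen
        for fen, eval_cp in records
        if min_cp <= abs(eval_cp) <= max_cp and piece_count(fen) <= max_pieces
    ]


# Made-up evals, only to exercise the filter.
records = [
    ("8/5pk1/6p1/8/3R4/6P1/5PK1/r7 w - -", 35),    # rook ending, mild imbalance
    ("8/8/4k3/8/8/4K3/8/8 w - -", 0),              # bare kings, too balanced
    ("8/5pk1/6p1/8/3Q4/6P1/5PK1/r7 w - -", 600),   # queen vs rook, too lopsided
]
print(select_positions(records))
```

Only the first position survives the default 20-50 cp window.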

AlvaroBegue · Post by **AlvaroBegue** » Wed Jun 21, 2017 3:29 pm

Thanks, Kai! I'll try to use that and see if I get meaningful results.

AlvaroBegue · Post by **AlvaroBegue** » Wed Jun 21, 2017 5:31 pm

I am testing the change of using w/d/l statistics from the database of positions I use in RuyTune to assign a value to material configurations with 7 or fewer pieces for which I have at least 31 samples.
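
To make the idea concrete, here is a rough sketch of such a table. It is not the RuyDos/RuyTune code: the material key, the sample format (FEN plus a result from White's point of view) and the choice to store a plain expected score are assumptions made for illustration; only the 7-piece limit and the 31-sample threshold come from the description above.

```python
from collections import defaultdict

MIN_SAMPLES = 31  # minimum number of samples per material configuration
MAX_PIECES = 7    # only configurations with 7 or fewer men


def material_key(fen: str) -> str:
    """Build a key such as 'KPvk' from the board field of a FEN string."""
    pieces = [ch for ch in fen.split()[0] if ch.isalpha()]
    white = "".join(sorted(p for p in pieces if p.isupper()))
    black = "".join(sorted(p for p in pieces if p.islower()))
    return f"{white}v{black}"


def build_table(samples):
    """samples: iterable of (fen, result), result being 1.0 / 0.5 / 0.0 for White.

    Returns {material_key: expected score for White} for well-sampled keys.
    """
    stats = defaultdict(lambda: [0, 0.0])  # key -> [count, summed score]
    for fen, result in samples:
        key = material_key(fen)
        if len(key) - 1 <= MAX_PIECES:     # minus one for the 'v' separator
            stats[key][0] += 1
            stats[key][1] += result
    return {k: total / n for k, (n, total) in stats.items() if n >= MIN_SAMPLES}


kpk = "8/8/4k3/8/8/4K3/4P3/8 w - -"
krk = "8/8/4k3/8/8/4K3/4R3/8 w - -"
samples = [(kpk, 1.0)] * 20 + [(kpk, 0.5)] * 20 + [(krk, 1.0)] * 5
print(build_table(samples))  # the KRvk entry is dropped: only 5 samples
```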

With my usual set of opening positions, the benefit is lost in the noise:
1394-1355-2361, +3 Elo, LOS=0.771512

With Kai's End_02_05.epd so far I have:
270-215-380, +22 Elo, LOS: 0.993745

So indeed it looks like the signal to noise is much improved.

Thanks again!
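
For reference, a small sketch of how numbers like these are usually derived. It assumes the triples are wins-losses-draws and uses the common logistic Elo formula and the normal approximation for LOS; both are assumptions about the tool that printed the results.

```python
import math


def elo_and_los(wins: int, losses: int, draws: int):
    """Return (Elo difference, likelihood of superiority) for a W-L-D record."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    elo = 400.0 * math.log10(score / (1.0 - score))
    # LOS: normal approximation, draws carry no information about who is better.
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return elo, los


for record in [(1394, 1355, 2361), (270, 215, 380)]:
    elo, los = elo_and_los(*record)
    print(record, f"Elo {elo:+.1f}", f"LOS {los:.6f}")
```

Under those assumptions the two records should come out close to the figures quoted above, which is a quick way to check the W-L-D ordering.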

jwes · Post by **jwes** » Thu Jun 22, 2017 3:09 am

One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.
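
A minimal sketch of that filter, under one reading of the rule: a position is discarded if the weaker engine wins it with either color, or draws it with both colors. The result encoding and the function name are hypothetical.

```python
def has_play(weak_as_white: str, weak_as_black: str) -> bool:
    """Results are 'win', 'draw' or 'loss' from the weaker engine's point of view."""
    results = (weak_as_white, weak_as_black)
    if "win" in results:               # the weaker engine could win it: discard
        return False
    if results == ("draw", "draw"):    # it held the draw with both colors: discard
        return False
    return True


# Hypothetical tournament output: position id -> (result as White, result as Black).
tournament = {
    "pos1": ("loss", "loss"),
    "pos2": ("draw", "draw"),
    "pos3": ("win", "loss"),
    "pos4": ("draw", "loss"),
}
kept = [pos for pos, res in tournament.items() if has_play(*res)]
print(kept)
```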

AlvaroBegue · Post by **AlvaroBegue** » Thu Jun 22, 2017 3:10 am

jwes wrote:One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.

Yes, that's the essence of how I wanted to build my own collection of positions. There seem to be quite a few positions like that in Kai's file, so maybe I'll start by filtering those out.

Laskos · Post by **Laskos** » Thu Jun 22, 2017 7:06 am

jwes wrote:One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.

That is a bit related, at least statistically, to the eval of Stockfish on these positions. In the endgame, positions from roughly 0.3 to 1.7 in Stockfish eval have a high proportion of playable positions. Above about 1.0-1.2, the pentanomial variance should be applied, and I can do it on the final result (not game by game; sadly Cutechess doesn't do it, but Richard Delorme started working on a tool).

For regular opening-phase positions this range is lower: positions from 0.0 up to about 0.9 are still playable, and beyond that the efficiency (signal-to-noise ratio or t-value) decreases.

It also depends very much on the strength of the engines and on the time control.
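
To illustrate why the pentanomial variance matters for unbalanced positions, here is a sketch using the usual game-pair formulation. It assumes each starting position is played twice with colors reversed and that the pair totals have been tallied; it is not meant to reproduce any particular tool.

```python
import math


def pentanomial_stderr(pair_counts):
    """pair_counts[i] = number of game pairs whose total score is i/2 (0, 0.5, 1, 1.5, 2).

    Returns (mean score per game, standard error of that mean).
    """
    n_pairs = sum(pair_counts)
    scores = [i / 4 for i in range(5)]  # per-game score for each pair outcome
    mean = sum(c * s for c, s in zip(pair_counts, scores)) / n_pairs
    var = sum(c * (s - mean) ** 2 for c, s in zip(pair_counts, scores)) / n_pairs
    return mean, math.sqrt(var / n_pairs)


def trinomial_stderr(wins, draws, losses):
    """Same quantities, treating every game as independent."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1 - mean) ** 2 + draws * (0.5 - mean) ** 2 + losses * mean ** 2) / n
    return mean, math.sqrt(var / n)


# Extreme case: 100 pairs from lopsided positions, White wins every game,
# so each engine scores exactly 1 point out of 2 in every pair.
print(pentanomial_stderr([0, 0, 100, 0, 0]))
print(trinomial_stderr(100, 0, 100))
```

In this extreme case the pair-based error is zero while the per-game model still reports a sizeable sigma, because it ignores that the two games of a pair are correlated.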

Dann Corbit · Post by **Dann Corbit** » Thu Jun 22, 2017 8:08 pm

There are a few positions in 3moves_gm that are very polar. This appears to be the worst one... score is +210 for the side to move:

rnb1kbnr/pppp1ppp/8/4P3/q7/5N2/PPP1PPPP/RNBQKB1R w KQkq - acd 37; acs 900; bm Nc3; cce 104; ce 210; pm Nc3 {30}; pv Nc3 Bb4; white_wins 18; black_wins 11; draws 1;
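
A record like this can be screened automatically. Below is a minimal sketch of an EPD reader; it assumes the four standard position fields followed by semicolon-terminated operations, and it does not handle semicolons inside quoted string operands. The `white_wins`, `black_wins` and `draws` fields are taken from the record above and are not standard EPD opcodes.

```python
def parse_epd(line: str):
    """Split an EPD record into (position, {opcode: operand string})."""
    fields = line.split(None, 4)
    position = " ".join(fields[:4])      # board, side to move, castling, en passant
    ops = {}
    if len(fields) > 4:
        for op in fields[4].split(";"):
            op = op.strip()
            if op:
                name, _, value = op.partition(" ")
                ops[name] = value.strip()
    return position, ops


line = ("rnb1kbnr/pppp1ppp/8/4P3/q7/5N2/PPP1PPPP/RNBQKB1R w KQkq - "
        "acd 37; acs 900; bm Nc3; cce 104; ce 210; pm Nc3 {30}; pv Nc3 Bb4; "
        "white_wins 18; black_wins 11; draws 1;")
position, ops = parse_epd(line)
games = int(ops["white_wins"]) + int(ops["black_wins"]) + int(ops["draws"])
white_score = (int(ops["white_wins"]) + 0.5 * int(ops["draws"])) / games
print(position, ops["ce"], round(white_score, 3))
```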
