Hi,
I want to start working on the late-endgame evaluation in RuyDos. I have a lot of ideas of what to try, but no good way to test them. The standard mechanism of playing a gazillion bullet games has the problem that most bullet games (at least at my engine's current skill level) don't make it to the late endgame, so from the point of view of what I am trying to test, the signal to noise is dismal.
Do you guys have a good collection of starting positions to test endgame strength?
Thanks,
Álvaro.
Testing endgame strength
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Testing endgame strength
What exactly are you looking for?
Tablebase positions which just require lookup?
Late middle game?
Early endgame with 10 pieces or more?
Something else?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Testing endgame strength
Dann Corbit wrote: What exactly are you looking for?
Tablebase positions which just require lookup?
Late middle game?
Early endgame with 10 pieces or more?
Something else?
I want to test how well the engine evaluates positions with 8 pieces or less. So perhaps starting with positions with 10 or 12 pieces would be good, but they should be "interesting", so it shouldn't be completely clear what the result will be, and there should be a representative variety of the endgames that do happen in games.
The only ideas I have for how to collect such a dataset seem very expensive, so I was hoping someone had a ready-made set, or perhaps just better ideas.
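As a rough illustration of the "10 or 12 pieces" filter, here is a minimal sketch that counts the pieces in the board field of a FEN/EPD line and keeps lines within a range. The function names and the default bounds are hypothetical, and it assumes "pieces" includes kings and pawns (the tablebase convention) and that each line is a plain FEN/EPD record with no forum tags in front.

```python
def piece_count(fen_or_epd):
    """Count all men (kings and pawns included) in the board field."""
    board = fen_or_epd.split()[0]          # first field is the piece placement
    return sum(ch.isalpha() for ch in board)


def select_by_piece_count(lines, lo=10, hi=12):
    """Keep the non-empty lines whose piece count lies in [lo, hi]."""
    return [ln for ln in lines if ln.strip() and lo <= piece_count(ln) <= hi]


if __name__ == "__main__":
    sample = [
        "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -",
        "8/8/4k3/8/8/4K3/4P3/8 w - -",
    ]
    for line in sample:
        print(piece_count(line), line)
```

This only selects by material count; whether a position is "interesting" still has to be decided some other way, for example from engine evaluations or game results.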
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Testing endgame strength
AlvaroBegue wrote: Hi,
I want to start working on the late-endgame evaluation in RuyDos. I have a lot of ideas of what to try, but no good way to test them. The standard mechanism of playing a gazillion bullet games has the problem that most bullet games (at least at my engine's current skill level) don't make it to the late endgame, so from the point of view of what I am trying to test, the signal to noise is dismal.
Do you guys have a good collection of starting positions to test endgame strength?
Thanks,
Álvaro.
http://s000.tinyupload.com/?file_id=637 ... 6449648210
These are endgame suites of varied imbalance, built from positions that occurred in games, with (on average) the number of pieces you ask for. End_02_05.epd means that the imbalance is 20-50cp according to Stockfish. I collected the positions from games of strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so sensitivity in tests decreases. Too unbalanced, and you will have to use pentanomial in assessing the variance (sigma) of the result, which is not yet implemented in Cutechess.
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Testing endgame strength
Laskos wrote: http://s000.tinyupload.com/?file_id=637 ... 6449648210
These are endgame suites of varied imbalance, built from positions that occurred in games, with (on average) the number of pieces you ask for. End_02_05.epd means that the imbalance is 20-50cp according to Stockfish. I collected the positions from games of strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so sensitivity in tests decreases. Too unbalanced, and you will have to use pentanomial in assessing the variance (sigma) of the result, which is not yet implemented in Cutechess.
Thanks, Kai! I'll try to use that and see if I get meaningful results.
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Testing endgame strength
I am testing the change of using w/d/l statistics from the database of positions I use in RuyTune to assign a value to material configurations with 7 or fewer pieces for which I have at least 31 samples.
With my usual set of opening positions, the benefit is lost in the noise:
1394-1355-2361, +3 Elo, LOS=0.771512
With Kai's End_02_05.epd so far I have:
270-215-380, +22 Elo, LOS: 0.993745
So indeed it looks like the signal to noise is much improved.
Thanks again!
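For anyone who wants to check figures like these, here is a minimal sketch of how the Elo difference and LOS can be computed from a result triple. It assumes each triple is wins-losses-draws, uses the usual logistic Elo formula, and uses the common normal approximation for LOS that ignores draws; the function names are made up for illustration. Under those assumptions it reproduces both sets of numbers in this post.

```python
import math


def elo_from_wld(wins, losses, draws):
    """Elo difference implied by the score fraction (logistic model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))


def los_from_wl(wins, losses):
    """Likelihood of superiority, normal approximation; draws do not enter."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


if __name__ == "__main__":
    for w, l, d in [(1394, 1355, 2361), (270, 215, 380)]:
        print(f"{w}-{l}-{d}: {elo_from_wld(w, l, d):+.1f} Elo, LOS={los_from_wl(w, l):.6f}")
```

Note that the second run reaches a much higher LOS with roughly one sixth of the games, which is the signal-to-noise improvement in concrete terms.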
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: Testing endgame strength
AlvaroBegue wrote: Hi,
I want to start working on the late-endgame evaluation in RuyDos. I have a lot of ideas of what to try, but no good way to test them. The standard mechanism of playing a gazillion bullet games has the problem that most bullet games (at least at my engine's current skill level) don't make it to the late endgame, so from the point of view of what I am trying to test, the signal to noise is dismal.
Do you guys have a good collection of starting positions to test endgame strength?
Thanks,
Álvaro.
Laskos wrote: http://s000.tinyupload.com/?file_id=637 ... 6449648210
These are endgame suites of varied imbalance, built from positions that occurred in games, with (on average) the number of pieces you ask for. End_02_05.epd means that the imbalance is 20-50cp according to Stockfish. I collected the positions from games of strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so sensitivity in tests decreases. Too unbalanced, and you will have to use pentanomial in assessing the variance (sigma) of the result, which is not yet implemented in Cutechess.
One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Testing endgame strength
AlvaroBegue wrote: Hi,
I want to start working on the late-endgame evaluation in RuyDos. I have a lot of ideas of what to try, but no good way to test them. The standard mechanism of playing a gazillion bullet games has the problem that most bullet games (at least at my engine's current skill level) don't make it to the late endgame, so from the point of view of what I am trying to test, the signal to noise is dismal.
Do you guys have a good collection of starting positions to test endgame strength?
Thanks,
Álvaro.
Laskos wrote: http://s000.tinyupload.com/?file_id=637 ... 6449648210
These are endgame suites of varied imbalance, built from positions that occurred in games, with (on average) the number of pieces you ask for. End_02_05.epd means that the imbalance is 20-50cp according to Stockfish. I collected the positions from games of strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so sensitivity in tests decreases. Too unbalanced, and you will have to use pentanomial in assessing the variance (sigma) of the result, which is not yet implemented in Cutechess.
jwes wrote: One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.
Yes, that's the essence of how I wanted to build my own collection of positions. There seem to be quite a few positions like that in Kai's file, so maybe I'll start by filtering those out.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Testing endgame strength
AlvaroBegue wrote: Hi,
I want to start working on the late-endgame evaluation in RuyDos. I have a lot of ideas of what to try, but no good way to test them. The standard mechanism of playing a gazillion bullet games has the problem that most bullet games (at least at my engine's current skill level) don't make it to the late endgame, so from the point of view of what I am trying to test, the signal to noise is dismal.
Do you guys have a good collection of starting positions to test endgame strength?
Thanks,
Álvaro.
Laskos wrote: http://s000.tinyupload.com/?file_id=637 ... 6449648210
These are endgame suites of varied imbalance, built from positions that occurred in games, with (on average) the number of pieces you ask for. End_02_05.epd means that the imbalance is 20-50cp according to Stockfish. I collected the positions from games of strong engines and analyzed them for imbalance with Stockfish. Positions that are too balanced give too high a draw rate with strong engines, so sensitivity in tests decreases. Too unbalanced, and you will have to use pentanomial in assessing the variance (sigma) of the result, which is not yet implemented in Cutechess.
jwes wrote: One idea to find useful positions is to run a tournament with engines that vary significantly in strength and discard positions that weaker engines can win or can draw with both colors. This should leave you positions where there is play.
That is a bit related, at least statistically, to the Stockfish eval of these positions. In the endgame, positions from roughly 0.3 to 1.7 in Stockfish eval have a high proportion of playable positions. Above 1.0-1.2, it is useful to apply pentanomial variance, and I can do it on the final result (not game by game; sadly Cutechess doesn't do it, but Richard Delorme has started working on a tool).
For regular opening-phase positions this 0.3-1.7 window is smaller: positions from 0.0 up to about 0.9 are still playable, and beyond that the efficiency (signal-to-noise ratio or t-value) decreases.
It also depends very much on the strength of the engines and the time control.
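To make the pentanomial point concrete, here is a minimal sketch comparing the standard error of the mean score computed game by game (trinomial) with the one computed over game pairs, where each pair plays the same opening with colors reversed. The function names and the example counts are hypothetical, not data from any real test. The five pair categories are ordered by pair score 0, 0.5, 1, 1.5, 2, which is 0, 0.25, 0.5, 0.75, 1 per game.

```python
import math

# Per-game score of a game pair, for pair totals 0, 0.5, 1, 1.5 and 2 points.
PAIR_SCORES = (0.0, 0.25, 0.5, 0.75, 1.0)


def trinomial_stats(wins, losses, draws):
    """Mean score and its standard error, treating games as independent."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    return mean, math.sqrt(var / n)


def pentanomial_stats(pair_counts):
    """Mean score and its standard error, treating game pairs as the unit."""
    n_pairs = sum(pair_counts)
    probs = [c / n_pairs for c in pair_counts]
    mean = sum(p * s for p, s in zip(probs, PAIR_SCORES))
    var = sum(p * (s - mean) ** 2 for p, s in zip(probs, PAIR_SCORES))
    return mean, math.sqrt(var / n_pairs)


if __name__ == "__main__":
    # Hypothetical extreme case: 100 very unbalanced openings, and in every
    # pair the favored color wins both games, so each engine scores 1-1.
    print(trinomial_stats(wins=100, losses=100, draws=0))
    print(pentanomial_stats((0, 0, 100, 0, 0)))
```

In this made-up extreme the game-by-game estimate reports a sizable error bar while the pair-based one reports none, because the results within a pair are strongly correlated. That correlation is why the trinomial sigma becomes misleading as the starting positions get more unbalanced.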
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Testing endgame strength
There are a few positions in 3moves_gm that are very polar. This appears to be the worst one... score is +210 for the side to move:
[d]rnb1kbnr/pppp1ppp/8/4P3/q7/5N2/PPP1PPPP/RNBQKB1R w KQkq - acd 37; acs 900; bm Nc3; cce 104; ce 210; pm Nc3 {30}; pv Nc3 Bb4; white_wins 18; black_wins 11; draws 1;
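A record like this can be screened automatically. Below is a minimal sketch of a parser that splits an EPD line into its four position fields and a dictionary of operations, so that very polar positions can be found by their ce value. The function name is hypothetical, the parser assumes no operand contains a semicolon, and it treats the leading [d] as a forum diagram tag rather than part of the EPD.

```python
def parse_epd(line):
    """Split an EPD line into position fields and an opcode -> operand dict."""
    line = line.strip()
    if line.startswith("[d]"):             # forum diagram tag, not EPD data
        line = line[3:]
    fields = line.split(None, 4)           # board, side, castling, ep, operations
    board, side, castling, ep = fields[:4]
    ops = {}
    if len(fields) == 5:
        for op in fields[4].split(";"):
            op = op.strip()
            if op:
                name, _, operand = op.partition(" ")
                ops[name] = operand.strip()
    return {"board": board, "side": side, "castling": castling, "ep": ep, "ops": ops}


if __name__ == "__main__":
    record = ("[d]rnb1kbnr/pppp1ppp/8/4P3/q7/5N2/PPP1PPPP/RNBQKB1R w KQkq - "
              "acd 37; acs 900; bm Nc3; cce 104; ce 210; pm Nc3 {30}; "
              "pv Nc3 Bb4; white_wins 18; black_wins 11; draws 1;")
    parsed = parse_epd(record)
    ce = int(parsed["ops"]["ce"])
    print(parsed["side"], ce, abs(ce) > 150)   # 150 is an arbitrary example cutoff
```

The same dictionary gives access to the win/loss/draw counts stored in the record, if a filter on actual game results is preferred over one on the evaluation.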
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.