Top engines or top set of engines for solving test suites

fkarger · Post by **fkarger** » Tue Apr 22, 2025 10:52 am

I have two questions.

Precondition:
Assume the test suite would contain all test positions and studies
that were ever composed.

1) What are the top engines for solving such a test suite?

My general impression is that Sting and Crystal are best
when it comes to positions which are (too) hard for Stockfish.
In extreme cases Chest is also interesting.
Otherwise Stockfish is a very good choice.

2) Which set of 3 engines would you choose to solve as many positions as possible?

We assume that all 3 run at the same time on different but equally strong machines
and we want to see the solution as quickly as possible.
In this case, we accept a position as solved if at least one engine shows the solution
(this is known as the 'Maximum Coverage Problem').
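A minimal sketch of how such a team could be picked once you have per-engine results. It assumes you already know which positions each engine solves within your time limit. The engine names and position IDs below are invented placeholders, not measurements, and the sketch ignores how fast each solution appears. With only a handful of candidate engines, trying every 3-engine combination is feasible; the greedy variant is the usual approximation when the pool is large.

```python
from itertools import combinations

def best_team(solved_by, team_size=3):
    """Try every team and keep the one whose union of solved positions is largest."""
    best, best_covered = (), set()
    for team in combinations(sorted(solved_by), team_size):
        covered = set().union(*(solved_by[engine] for engine in team))
        if len(covered) > len(best_covered):
            best, best_covered = team, covered
    return best, best_covered

def greedy_team(solved_by, team_size=3):
    """Repeatedly add the engine that solves the most not-yet-covered positions."""
    team, covered = [], set()
    for _ in range(min(team_size, len(solved_by))):
        candidates = [e for e in sorted(solved_by) if e not in team]
        engine = max(candidates, key=lambda e: len(solved_by[e] - covered))
        team.append(engine)
        covered |= solved_by[engine]
    return team, covered

# Hypothetical results: engine -> set of position IDs it solved (made-up data).
solved_by = {
    "engine_a": {1, 2, 3, 4, 5, 6},
    "engine_b": {1, 2, 3, 4, 5},   # strong, but overlaps almost fully with engine_a
    "engine_c": {6, 7, 8},
    "engine_d": {9, 10},
}

team, covered = best_team(solved_by)
print(team, len(covered))
```

In this made-up data the second-strongest single engine is left out of the team, because it mostly solves what the strongest one already solves.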

Thank you.

AndrewGrant · Post by **AndrewGrant** » Tue Apr 22, 2025 11:27 am

Stockfish is the best engine for solving any general suite.
The impression that it's not -- and that some SF variant is better -- is a result of a subtle bias against Stockfish.
I.e., someone is coming up with a set of test positions. If Stockfish fails to solve one, they think "Oh, that is interesting. Let's add that."

fkarger · Post by **fkarger** » Tue Apr 22, 2025 11:53 am

AndrewGrant wrote: ↑Tue Apr 22, 2025 11:27 am Stockfish is the best engine for solving any general suite.
The impression that it's not -- and that some SF variant is better -- is a result of a subtle bias against Stockfish.
I.e., someone is coming up with a set of test positions. If Stockfish fails to solve one, they think "Oh, that is interesting. Let's add that."

Makes sense.
Although there could also be another subtle difference:
SF is optimized to be the strongest engine in practical play.
But test suites even if they are not designed to be 'anti Stockfish' could
be different to practical play.

Hai · Post by **Hai** » Tue Apr 22, 2025 11:58 am

fkarger wrote: ↑Tue Apr 22, 2025 10:52 am I have two questions.

Precondition:
Assume the test suite would contain all test positions and studies
that were ever composed.

1) What are the top engines for solving such a test suite?

My general impression is that Sting and Crystal are best
when it comes to positions which are (too) hard for Stockfish.
In extreme cases Chest is also interesting.
Otherwise Stockfish is a very good choice.

2) Which set of 3 engines would you choose to solve as many positions as possible?

We assume that all 3 run at the same time on different but equally strong machines
and we want to see the solution as quickly as possible.
In this case, we accept a position as solved if at least one engine shows the solution
(this is known as the 'Maximum Coverage Problem').

Thank you.

1)
Stockfish
LC0
Torch (If you can't get it, try whichever "position solver engine" you like. It doesn't matter if the Elo is much lower.)
All three are super strong and all three are very different if you compare them to each other.

2)
The most difficult: Top Chess Engines Testsuite 2024 v2
https://www.mediafire.com/file/cypaz2t0 ... 2.pgn/file
If this test suite isn't difficult enough, take it and add more studies.

To solve as many positions as possible... for what?
It should be clear that Stockfish and ... will solve 1500 Elo problems.
Take 500,000 studies and delete the positions which are solved instantly and which are solved too often.
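A rough sketch of that filtering step, under stated assumptions: `results` maps each position to the solve time per engine in seconds (`None` means not solved), "instantly" means any engine solved it within `instant_seconds`, and "too often" means more than `max_solver_fraction` of the engines solved it. Both thresholds, the data layout, and the sample data are made up for illustration.

```python
def filter_hard_positions(results, instant_seconds=1.0, max_solver_fraction=0.5):
    """Keep only positions that are neither solved instantly nor solved by too many engines."""
    kept = []
    for position_id, times in results.items():
        solved_times = [t for t in times.values() if t is not None]
        # Drop positions that at least one engine solves (almost) immediately.
        if solved_times and min(solved_times) <= instant_seconds:
            continue
        # Drop positions that too large a share of the engines can solve.
        if times and len(solved_times) / len(times) > max_solver_fraction:
            continue
        kept.append(position_id)
    return kept

# Hypothetical solve times in seconds (None = not solved); invented data.
results = {
    "study_1": {"engine_a": 0.2, "engine_b": 0.5, "engine_c": 3.0},
    "study_2": {"engine_a": 12.0, "engine_b": 30.0, "engine_c": 45.0},
    "study_3": {"engine_a": None, "engine_b": 40.0, "engine_c": None},
    "study_4": {"engine_a": None, "engine_b": None, "engine_c": None},
}

print(filter_hard_positions(results))
```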

fkarger · Post by **fkarger** » Tue Apr 22, 2025 12:05 pm

Hai wrote: ↑Tue Apr 22, 2025 11:58 am
fkarger wrote: ↑Tue Apr 22, 2025 10:52 am
2) Which set of 3 engines would you choose to solve as many positions as possible?

We assume that all 3 run at the same time on different but equally strong machines
and we want to see the solution as quickly as possible.
In this case, we accept a position as solved if at least one engine shows the solution
(this is known as the 'Maximum Coverage Problem').

Thank you.
To solve as many positions as possible... for what?
It should be clear that Stockfish and ... will solve 1500 Elo problems.
Take 500,000 studies and delete the positions which are solved instantly and which are solved too often.

The second question is about the best team of solvers if you had to choose a team of 3 solvers.

peter · Post by **peter** » Tue Apr 22, 2025 12:53 pm

fkarger wrote: ↑Tue Apr 22, 2025 11:53 am SF is optimized to be the strongest engine in practical play.
But test suites even if they are not designed to be 'anti Stockfish' could
be different to practical play.

And they should be different. What sense would there be in positional testing anyway if I just wanted, and got, the same results as in game playing?
The "problem" many testers and programmers have with positional testing, as it is done most of the time, is just this difference between the results of game playing and those of positional tests. To me these are features, not bugs.

What you have to deal with (but I'm sure you know this): there isn't one test suite, just as there isn't one single chess position, that answers all the questions you can have about the "playing strength" of engines (or of humans). Not even the very basic starting position of classical chess means much more than other positions of interest do. You see that best today if you try to get statistically meaningful results out of eng-eng game playing from the starting position only (without books or given opening test positions): not even with very short TC and weak hardware do you get outside the error bar with a reasonable number of games, for more than one or two single engines, their versions and settings. This kind of eng-eng testing has been drawn-dead for quite a while, too.
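To make the error-bar point concrete, here is a small sketch using the usual logistic Elo formula and a normal-approximation confidence interval on the match score. The match result below is invented purely for illustration, and the function names are my own.

```python
import math

def score_to_elo(score):
    """Convert an expected score (strictly between 0 and 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_error(wins, draws, losses, z=1.96):
    """Return (elo, elo_low, elo_high) for a match, using a normal approximation.

    z=1.96 gives roughly a 95% interval. Assumes the interval stays inside (0, 1).
    """
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    std_error = math.sqrt(variance / games)
    low, high = score - z * std_error, score + z * std_error
    return score_to_elo(score), score_to_elo(low), score_to_elo(high)

# Hypothetical, very drawish 1000-game match from the starting position.
elo, elo_low, elo_high = elo_with_error(wins=20, draws=970, losses=10)
print(f"{elo:+.1f} Elo, interval [{elo_low:+.1f}, {elo_high:+.1f}]")
```

With these made-up numbers the interval still includes zero, so even twice as many wins as losses over 1000 games does not separate the two engines.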

So what makes the really big differences is which suite of test positions you use, which test tool, and which hardware-TC for what kind of engine pool.

The suite can come out of the opening (to let engines play out against each other; of course you can use opening positions for positional testing too, MEA is one way to do that and I like to use it; just to mention, I've also got a suite of 1001 UHO positions in MEA syntax) or out of the midgame and endgame, with positions chosen especially as anti-engine puzzles or taken from eng-eng games. NICE is an example of the latter; pity the latest version of Ed's is still buggy

viewtopic.php?p=978298#p978298

and evaluated with too little hardware-time

viewtopic.php?p=975854#p975854

Those 10" are used single-threaded with MultiPV=4 (see the postings below the linked one, and the second, more recent thread about NICE). That's too little hardware-time for me to get halfway reliable evals of positions whose candidate moves are close to each other in their WDL chances. The biggest MEA suite I use has 10124 positions, which I let SF evaluate for 1 minute/pos. with 30 threads of a 16x3.5GHz CPU and MultiPV=4.

As for the test tool, besides MEA I still like EloStatTS from Frank Schubert very much.

These choices may make a bigger difference than the choice between letting all the positions you're interested in be played out eng-eng, head to head, one by one, and just trusting other ways of adjudicating and evaluating certain kinds of positions without thousands of games played out for each and every one of the positions, engines, their versions, nets, parameter settings, patches...

fkarger · Post by **fkarger** » Tue Apr 22, 2025 1:11 pm

peter wrote: ↑Tue Apr 22, 2025 12:53 pm
fkarger wrote: ↑Tue Apr 22, 2025 11:53 am SF is optimized to be the strongest engine in practical play.
But test suites even if they are not designed to be 'anti Stockfish' could
be different to practical play.
And they should be different. What sense would there be in positional testing anyway if I just wanted, and got, the same results as in game playing?
The "problem" many testers and programmers have with positional testing, as it is done most of the time, is just this difference between the results of game playing and those of positional tests. To me it's a feature, not a bug.

What you have to deal with (but I'm sure you know this): there isn't one test suite, just as there isn't one single chess position, that answers all the questions you can have about the "playing strength" of engines (or of humans). Not even the very basic starting position of classical chess means much more than other positions of interest do. You see that best today if you try to get statistically meaningful results out of eng-eng game playing from the starting position only (without books or given opening test positions): not even with very short TC and weak hardware do you get outside the error bar with a reasonable number of games, for more than one or two single engines, their versions and settings. This kind of eng-eng testing has been drawn-dead for quite a while, too.

So what makes the really big differences is which suite of test positions you use, which test tool, and which hardware-TC for what kind of engine pool.

The suite can come out of the opening (to let engines play out against each other; of course you can use opening positions for positional testing too, MEA is one way to do that and I like to use it; just to mention, I've also got a suite of 1001 UHO positions in MEA syntax) or out of the midgame and endgame, with positions chosen especially as anti-engine puzzles or taken from eng-eng games. NICE is an example of the latter; pity the latest version of Ed's is still buggy

viewtopic.php?p=978298#p978298

and evaluated with too little hardware-time

viewtopic.php?p=975854#p975854

Those 10" are used single-threaded with MultiPV=4 (see the postings below the linked one, and the second, more recent thread about NICE). That's too little hardware-time for me to get halfway reliable evals of positions whose candidate moves are close to each other in their WDL chances. The biggest MEA suite I use has 10124 positions, which I let SF evaluate for 1 minute/pos. with 30 threads of a 16x3.5GHz and MultiPV=4.

As for the test tool, besides MEA I still like EloStatTS from Frank Schubert very much.

These choices may make a bigger difference than the choice between letting all the positions you're interested in be played out eng-eng, head to head, one by one, and trusting other ways of adjudicating certain kinds of positions without letting thousands of games be played out from each and every single position.

Thank you for your insights, Peter!
I agree that it is difficult to determine the playing strength of engines by using test positions.
That is probably the reason why they use billions of them in machine learning.

Another interesting question could be: what is the smallest number of test positions
needed to precisely estimate the playing strength of an engine?
This could have practical relevance in machine learning or engine optimization.

At the moment this is not too important to me.
Currently I find it more interesting to see the engines having problems solving
some of the positions and then to understand why.

Jouni · Post by **Jouni** » Tue Apr 22, 2025 2:00 pm

In my test suites, ShashChess with High Tal is the best solver: version 35.1. Later versions are much weaker.

fkarger · Post by **fkarger** » Tue Apr 22, 2025 2:05 pm

Jouni wrote: ↑Tue Apr 22, 2025 2:00 pm In my test suites, ShashChess with High Tal is the best solver: version 35.1. Later versions are much weaker.

Thank you Jouni!
I will try that version.

Is this https://github.com/amchess/ShashChess/releases/tag/35.1
the correct version (I don't see High Tal there)?

Jouni · Post by **Jouni** » Tue Apr 22, 2025 3:28 pm

High Tal is a UCI parameter in the engine.
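For anyone scripting this, a small sketch of the UCI commands that would switch such a parameter on before a test run. The `setoption name ... value ...` syntax is standard UCI; that the option is spelled exactly `High Tal` and is a boolean (check) option is an assumption here, so compare it with what the engine prints in reply to `uci`.

```python
def uci_setoption(name, value):
    """Build a standard UCI 'setoption' command; booleans become true/false."""
    if isinstance(value, bool):
        value = "true" if value else "false"
    return f"setoption name {name} value {value}"

def solver_setup_commands(options):
    """Commands to send to the engine's stdin before starting a search."""
    commands = ["uci"]
    commands += [uci_setoption(name, value) for name, value in options.items()]
    commands.append("isready")
    return commands

# Assumed option name and type; verify against the engine's own option list.
for command in solver_setup_commands({"High Tal": True, "MultiPV": 1}):
    print(command)
```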
