CCET - a new difficult test suite

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

fkarger
Posts: 63
Joined: Sat Aug 15, 2020 8:08 am
Full name: Frank Karger

CCET - a new difficult test suite

Post by fkarger »

Hi everyone,

I’m excited to share my new test suite for chess engines, designed to challenge and evaluate their performance:
Download PGN, EPD, and CBH files here

To make comparing results easy, I’ve set up two pages for submitting your scores:
Standardized Competition: Enter your results here
Open Competition: Enter your results here

You can check the current rankings here:
Standardized Ranking
Open Ranking (I have already put in some data)

For all the details about the test suite, including instructions and background, visit the homepage:
Homepage of the test suite
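For anyone scripting against the EPD file, a minimal stdlib-only parser sketch for the `bm` opcode might look like this (illustrative only, not part of the suite's tooling; the `id` value below is made up):

```python
def parse_epd_line(line):
    """Split an EPD record into its four FEN fields and its opcodes.

    Minimal stdlib-only sketch: handles the 'bm' (best move) opcode as
    used by test suites; other opcodes are kept verbatim as strings.
    """
    fields = line.strip().split(None, 4)
    if len(fields) < 4:
        raise ValueError("not a valid EPD record: " + line)
    position = " ".join(fields[:4])  # placement, side to move, castling, ep
    opcodes = {}
    if len(fields) == 5:
        for op in fields[4].split(";"):
            op = op.strip()
            if not op:
                continue
            name, _, value = op.partition(" ")
            opcodes[name] = value
    return position, opcodes

# First endgame position from this thread (the id tag is hypothetical):
pos, ops = parse_epd_line('8/q7/6P1/4K2Q/8/8/8/k7 w - - bm Qh1+; id "CCET.082";')
print(pos)        # 8/q7/6P1/4K2Q/8/8/8/k7 w - -
print(ops["bm"])  # Qh1+
```

A full test runner would feed `pos` to an engine and compare its choice against `ops["bm"]`.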

This project is still in its early stages, and I’d love for you to give it a try!
I’m very open to constructive feedback, suggestions, or any questions you might have.

Thanks for your interest, and I hope you enjoy testing your engines :)

Best regards,

Frank
fkarger
Posts: 63
Joined: Sat Aug 15, 2020 8:08 am
Full name: Frank Karger

Re: CCET - a new difficult test suite

Post by fkarger »

Intermediate result on AMD Ryzen 7 3800X, 8 Threads:

Stockfish 17.1 has huge problems at shorter time controls but improves markedly at about 10 minutes per position.
See: https://t1p.de/CCET-Open-Ranking
Lunar
Posts: 11
Joined: Wed May 21, 2025 12:32 pm
Full name: Patrick Hilhorst

Re: CCET - a new difficult test suite

Post by Lunar »

Wow, if Stockfish is able to solve so few positions at a relatively long time control, that is one difficult test set indeed!

I do have one gripe with the standardized conditions: you ask for the default hash table size. Isn't that much too small? Many engines default to only a few MB (I've even seen 0, i.e. no hash table at all). Can you explain the reasoning behind this? Standardizing on a larger hash table size, say 8 GB or so, would make more sense for these standardized LTC tests than the cramped and rather arbitrary defaults many engines ship with. I also wouldn't want to encourage an upward creep in default hash table sizes, because for most purposes the default is completely arbitrary.

That being said, I'll see if I can run my engine on this set to see how many (if any :oops: ) a non-top engine can solve!
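To put rough numbers on the gap: assuming around 16 bytes per transposition-table entry (an assumption for illustration; real engines use roughly 10 to 16 bytes, often grouped into clusters), the difference between a few-MB default and 8 GB is a factor of hundreds in the number of positions the engine can remember:

```python
def tt_entries(hash_mb, entry_bytes=16):
    """Rough number of transposition-table entries for a given hash size.

    entry_bytes is an assumption -- real engines use anywhere from about
    10 to 16 bytes per entry, often packed into cache-line clusters.
    """
    return hash_mb * 1024 * 1024 // entry_bytes

for mb in (16, 256, 8192):  # common small default, mid-size, proposed 8 GB
    print(f"{mb:>5} MB -> {tt_entries(mb):>12,} entries")
```

At LTC, where a single position's search can visit tens of billions of nodes (see the node counts quoted later in this thread), a few-MB table is overwritten many times over.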
fkarger
Posts: 63
Joined: Sat Aug 15, 2020 8:08 am
Full name: Frank Karger

Re: CCET - a new difficult test suite

Post by fkarger »

Lunar wrote: Thu Jun 12, 2025 5:01 pm Wow, if Stockfish is able to solve so few positions at a relatively long time control, that is one difficult test set indeed!

I do have one gripe with the standardized conditions: you ask for the default hash table size. Isn't that much too small? Many engines default to only a few MB (I've even seen 0, i.e. no hash table at all). Can you explain the reasoning behind this? Standardizing on a larger hash table size, say 8 GB or so, would make more sense for these standardized LTC tests than the cramped and rather arbitrary defaults many engines ship with. I also wouldn't want to encourage an upward creep in default hash table sizes, because for most purposes the default is completely arbitrary.

That being said, I'll see if I can run my engine on this set to see how many (if any :oops: ) a non-top engine can solve!
I agree that standardizing the hash table size makes sense.
So far I have chosen "default" because there are engines that cannot cope with other values.
I will change the default case to 8 GB.
If an engine can't handle that, an exception has to be made, or it has to be excluded if it insists on something else.
Thanks for the hint :)
fkarger
Posts: 63
Joined: Sat Aug 15, 2020 8:08 am
Full name: Frank Karger

Re: CCET - a new difficult test suite

Post by fkarger »

Lunar wrote: Thu Jun 12, 2025 5:01 pm Wow, if Stockfish is able to solve so few positions at a relatively long time control, that is one difficult test set indeed!

I do have one gripe with the standardized conditions: you ask for the default hash table size. Isn't that much too small? Many engines default to only a few MB (I've even seen 0, i.e. no hash table at all). Can you explain the reasoning behind this? Standardizing on a larger hash table size, say 8 GB or so, would make more sense for these standardized LTC tests than the cramped and rather arbitrary defaults many engines ship with. I also wouldn't want to encourage an upward creep in default hash table sizes, because for most purposes the default is completely arbitrary.

That being said, I'll see if I can run my engine on this set to see how many (if any :oops: ) a non-top engine can solve!
I have updated the description accordingly, and I am pretty sure that your engine will solve some of the positions.
The first 80 gave Stockfish a very hard time, but 81 to 160 also include some easier positions.
peter
Posts: 3386
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: CCET - a new difficult test suite

Post by peter »

Hi Frank, thanks again for that.
Despite what I wrote at CSS, I find the idea of these tablebase positions really interesting for testing engines (without use of tablebases, of course): a single solution that is won right at the edge of the 50-move rule, with the next-weaker move very close to it but winning only cursedly.
But let's have a look at the three example positions I already talked about at CSS, starting with nr. 81:

[d]8/q7/6P1/4K2Q/8/8/8/k7 w - - 0 1
bm Qh1+
Second-best Qd1+ needs just over 50 moves to the next counter reset, so it only draws under this rule; DTM is only 2 moves longer for the one move than for the other. Of course you can hope that modern engines separate the two moves from each other by search without tablebases, but here is SF dev. with 30 threads of a 16×3.5 GHz CPU, 8 GB hash (too little; I started with that only because it's the standard setting at your site) and MultiPV=2:

8/q7/6P1/4K2Q/8/8/8/k7 w - -

Engine: Stockfish250602 (8192 MB)
by the Stockfish developers (see AUTHORS file)

52 8:13 +0.69 1.Dd1+ Ka2 2.De2+ Kb3 3.Df3+ Kc2
4.Df5+ Kc1 5.Df1+ Kc2 6.Db5 Dg7+
7.Kf5 Df8+ 8.Ke4 Da8+ 9.Kf4 Da7
10.De5 Df2+ 11.Kg5 Dd2+ 12.Kf6 Dd8+
13.Kf5 Dd7+ 14.De6 (33.880.629.084) 68646

51 8:13 +0.65 1.Dh1+ Ka2 2.Dd5+ Kb1 3.De4+ Ka2
4.Dg2+ Kb1 5.Df1+ Kc2 6.Db5 Dg7+
7.Kf5 Df8+ 8.Ke4 Da8+ 9.Kf4 Dd8
10.De5 Dd2+ 11.Kg4 Dd1+ 12.Kf5 Dd7+
13.De6 Db5+ 14.Kf6 (33.880.629.084) 68646


For a position so near the end, with so little material on the board, that is practically a drawing eval, isn't it? So which of the two moves the engine chooses would be rather pure accident, wouldn't it?

Next one, nr. 82, same hardware setting, just with 32 GB hash this time, according to the LTC:

[d]8/8/8/1r6/3K1B2/8/7R/k7 w - - 0 1
bm Kc3; this time Kc4, together with the one given solution, is about as close to the cursed/not-cursed boundary as the two moves before were:


8/8/8/1r6/3K1B2/8/7R/k7 w - -

Engine: Stockfish250602 (32768 MB)
by the Stockfish developers (see AUTHORS file)

76 6:57 0.00 1.Kc3 Ta5 2.Ld6 Tf5 3.Lb4 Tf3+ 4.Kc2 Ka2
5.Lc5 Tg3 6.Tf2 Th3 7.Ld6 Te3 8.Tf8 Te2+
9.Kd3 Tb2 10.Ta8+ Kb1 11.Le5 Td2+
12.Kc4 Kc2 13.Ta1 Te2 14.Ta2+ (29.369.207.763) 70312

76 6:57 0.00 1.Kc4 Ta5 2.Kb3 Tb5+ (29.369.207.763) 70312


You see, this time SF evaluates both candidate moves 0.00, so what?

Next one, nr. 84:
[d]8/8/7Q/1r4K1/2p5/8/3k4/8 w - - 0 1
bm Kf6+ wins, having the edge as for DTZ and the 50-move rule; next-best Kg4+ is just over 50 moves and a cursed win again:


[d]8/8/7Q/1r4K1/2p5/8/3k4/8 w - -

Engine: Stockfish250602 (32768 MB)
by the Stockfish developers (see AUTHORS file)

55 9:40 +1.36 1.Kf6+ Kc2 2.Dg6+ Kb2 3.Dg2+ Kb3
4.Dg8 Kb4 5.Dg4 Td5 6.De6 Td2 7.Db6+ Kc3
8.Da5+ Kd3 9.Df5+ Kc3 10.Ke6 Kb4
11.Db1+ Kc5 12.Dg1+ Kb4 13.Db6+ Ka4
14.Dc5 (45.871.022.887) 79083

54 9:40 +1.12 1.Kg4+ Kc2 2.Da6 Tb4 3.Da2+ Kd3
4.Da3+ Tb3 5.Da6 Tb2 6.Kf5 Te2 7.Db5 Te3
8.Dd7+ Kc3 9.Kf4 Te2 10.Da4 Kd3
11.Dd1+ Td2 12.Db1+ Kd4 13.Kf5 Kc3
14.Ke5 (45.871.022.887) 79083

And after some more ponder time:

8/8/7Q/1r4K1/2p5/8/3k4/8 w - -

Engine: Stockfish250602 (32768 MB)
by the Stockfish developers (see AUTHORS file)

57 15:54 +1.40 1.Kf6+ Kc2 2.Dg6+ Kc1 3.Dg1+ Kb2
4.Dg2+ Kb3 5.Dg8 Th5 6.Db8+ Kc2
7.De8 Th4 8.Dg6+ Kb2 9.Dg2+ Kb3
10.Dd5 Th3 11.Db5+ Kc3 12.Ke6 Td3
13.Db1 Kd2 14.Da2+ (78.511.122.412) 82229

56 15:54 +1.33 1.Kg4+ Kc2 (78.511.122.412) 82229


Again we see SF evaluating the two moves very close to each other, neither with a clearly winning eval; again the choice between them without tablebases will be pure chance, won't it? At least here the better one comes a little closer to +-, yet the discrimination between the two evals is not as big as what I'm used to seeing with classical single best moves.

The reason these positions fooled me at first glance at all is that I looked at DTM only, with Nalimovs in the Shredder GUI and Syzygys switched off; seeing the engine evaluate two moves very close to each other each time, I wrongly took the positions for cursed wins anyhow. To say it once more, as I already did at CSS: these are test positions (as are all positions whose best follow-up lines you know well enough) to be judged by looking at the engines' LTC output. Without seeing that output too, and thus knowing the reasons for the evaluations (the lines behind the evals), adjudication by GUI or tool of the one as solved and the other as not is selection bias, so they cannot be used well together with quite different positions in a suite for short to medium TC. For my personal pov, I'd rather use them with Forward-Backward only, to see the points (plies) where the engines' evals start to change, with much shorter hardware time but with standalone pondering only.

As I already said at CSS, if I used such positions for a suite at all, I'd at least make a suite of its own out of them only (80 positions at such hardware time are a lot as for the overall time to be invested anyhow, and the statistical relevance will of course still be very low, with such big error bars). And without tablebases (with them it doesn't make sense) engines would need even much more hardware time than they'd need on average for the first 81. Regards,
Peter.
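The cursed-win boundary discussed above can be sketched as a small check, assuming Syzygy-style conventions (DTZ counted in plies to the next zeroing move, a draw claimable once the halfmove clock reaches 100 plies); the DTZ values below are illustrative, not the real ones for these positions:

```python
def classify_win(dtz_plies, halfmove_clock=0):
    """Classify a tablebase win under the 50-move rule.

    Sketch assuming Syzygy-style conventions: dtz_plies counts plies to
    the next zeroing move (capture or pawn move) on the winning path,
    and a draw becomes claimable once the halfmove clock reaches 100
    plies.  A "cursed" win is won in theory but only drawn with the
    50-move rule in force.
    """
    if dtz_plies + halfmove_clock <= 100:
        return "win"
    return "cursed win"

# Illustrative values only: best and second-best move can sit just on
# either side of the boundary, as in the positions quoted above.
print(classify_win(99))   # a real win under the 50-move rule
print(classify_win(101))  # won in theory, drawn by the rule
```

This is exactly why an engine without tablebases has so little to go on: the two candidate moves differ by only a ply or two of DTZ, yet fall on opposite sides of the boundary.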
peter
Posts: 3386
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: CCET - a new difficult test suite

Post by peter »

Edit time over.
peter wrote: Fri Jun 13, 2025 8:20 am starting with nr. 81:
That was wrong: the examples started with nr. 82, then came 83, which I mistyped as 82, and only nr. 84 was the one I typed correctly in the first place. I had those bugs in the quoted numbers, sorry.
Peter.
fkarger
Posts: 63
Joined: Sat Aug 15, 2020 8:08 am
Full name: Frank Karger

Re: CCET - a new difficult test suite

Post by fkarger »

Hi Peter,

thank you for the input.
At the moment my machines are collecting data, so I will go into the specifics of these positions later.
Let me make two more general remarks.

1) Discrimination of moves
The ability of a chess engine to discriminate between two moves by their value is the essence of playing strength.
A better engine will discriminate better.
If a move is found by pure chance, it is unlikely that a weak engine will succeed across a series of positions.
Weak and strong engines should therefore be distinguishable from each other.
That means these positions are suited to discriminate engines by strength.

2) Current results in the endgame section (81 to 160)
I did a scaling experiment using Stockfish 17.1 (see https://t1p.de/CCET-Open-Ranking).
With increasing time, SF also got increasingly better in that section.
I will specify that in more detail later.
That indicates that 'pure chance' is not very helpful, but playing strength indeed does help.

Solving a position by pure chance is a general problem (also in the traditional studies from 1 to 80).
That is why all positions of the new test suite have exactly one solution, which reduces random solving.

Best regards

Frank
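The 'pure chance' argument can be made quantitative with a binomial tail. As a sketch with assumed numbers (a guesser choosing uniformly among, say, 5 plausible candidate moves per position, so p = 0.2; both figures are illustrative, not measured):

```python
from math import comb

def p_at_least(n, m, p):
    """P(X >= m) for X ~ Binomial(n, p): chance of solving at least
    m of n one-solution positions by guessing with per-position
    success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(m, n + 1))

# Illustrative: solving 20 of 54 positions by luck alone, if each guess
# succeeds with probability 0.2, is already very unlikely.
print(f"{p_at_least(54, 20, 0.2):.4f}")
```

With exactly one solution per position, p stays small, so a high score is hard to reach by chance across a whole series.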
peter
Posts: 3386
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: CCET - a new difficult test suite

Post by peter »

At about the same time as my latest post here, I started letting SF dev. with 30 threads of a 16×3.5 GHz CPU run 54 of the second 80 (just because that would be one of the three chunks I'd divide the 160 into at least, making it possible to run some concurrency; more than 3 instances wouldn't be possible with 8 threads each, as 4 of those would already cost all the CPU's threads, leaving none free for the operating system).

TC is 5 minutes per position, hash is 32 GB, in the Shredder GUI because of its beautiful, clearly readable result tables at the end. This run is now not even halfway through (25 of the 54); SF dev. 250602 has so far solved 8 of the 26 positions it has analysed.

If I can spare the hardware for that long today (I don't know), I'll finish at least these 54 positions and come back here with the Shredder table of results, keeping the protocol file (.dmp) stored too, so as to see where and how often the engine would have changed its mind, and with which corresponding output it would end up at the TC limit.

This much I guess I can already say: if the percentage of solutions doesn't get dramatically higher with such hardware-TC and SF dev., you cannot expect any statistical relevance from so few positions with even far fewer solutions found. And that is just from looking at the numbers of positions and solutions. There remains the question of how many of the solutions were found together with at least some kind of winning eval (most of the times I watched, that was not the case), and, even more doubtful, whether an engine performing about as well or even better would solve the same positions as the one it is compared to, or quite different ones, and whether either of them for correct reasons at all.

Guess it's quite clear by now anyhow why I don't think this second half of the suite practically useful for judging engines' "playing strength" with medium hardware time. Results obtained with such positions give, if anything, at most a very special answer to a very special question (kind of positions); they are not at all comparable to results from other kinds of "normal" suites, nor to the first half of the 160.

That wouldn't matter much to me, I like asking very special questions with very special positions too, but with such really special positions you only ever get really exact answers for the one single position itself. Testing engines on a single position, you can compare time to solution, depth to solution, time to best eval, time to best main line, started with empty hash and/or with Forward-Backward and so on, as many parameters as you like; you can compare single-threaded (for determinism) or statistically with several SMP runs, but the results always stand fully for themselves alone.

There are suites giving results that compare better to each other, and even to game playing, and there are suites and single positions with less comparability and transitivity with respect to results from other suites and positions. The second half of the 160 is of the second kind for sure :)
Peter.
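The error-bar point above can be illustrated with a Wilson score interval for a solved/total fraction; with only a few dozen positions, the 95% interval around a score is strikingly wide:

```python
from math import sqrt

def wilson_interval(solved, total, z=1.96):
    """95% Wilson score interval for a solved/total fraction.

    Illustrates how wide the uncertainty is for test-suite scores
    based on only a few dozen positions.
    """
    if total == 0:
        return (0.0, 1.0)
    p = solved / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)

# The intermediate 8-of-26 result mentioned above:
lo, hi = wilson_interval(8, 26)
print(f"8/26 solved -> {lo:.2f} .. {hi:.2f}")
```

An interval this wide means two engines would need very different raw scores before the difference says anything at all.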
peter
Posts: 3386
Joined: Sat Feb 16, 2008 7:38 am
Full name: Peter Martan

Re: CCET - a new difficult test suite

Post by peter »

peter wrote: Fri Jun 13, 2025 11:36 am At about the same time as my latest post here, I started letting SF dev. with 30 threads of a 16×3.5 GHz CPU run 54 of the second 80 (just because that would be one of the three chunks I'd divide the 160 into at least, making it possible to run some concurrency; more than 3 instances wouldn't be possible with 8 threads each, as 4 of those would already cost all the CPU's threads, leaving none free for the operating system).

TC is 5 minutes per position, hash is 32 GB, in the Shredder GUI because of its beautiful, clearly readable result tables at the end.
So here we are:

Code:

Correct solution! (5:00.000) 52
Solved so far: 20 of 54  ;  217:26m

         1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
 -------------------------------------------------------------------------------------
   0 |   -   - 119  16   - 162 166   -   - 109   -   -   -   -   -   -   -   - 161   -
  20 |   -   - 114   - 229   -   -   -   -   -   - 248   -   -  18   6   - 275 230   -
  40 | 207   -  41  76  70   -   -  42   -   -   -   - 152 300
And now, how many of the 20 solutions did get a winning eval at the end?

None of them at all (if I haven't overlooked any, but I searched the .dmp file for all the solved positions one by one, using their FEN strings).

Again only a few examples, easier to find starting from the end. Note: 106 has to be added to each number in the 54-position chunk to get its number in the full 160-position suite.

Nr. 54 (160)

52/137 5:00 +0.70 1.Db7+ Lb6 2.Kg7 Lbd5 3.Db8 Kc6 4.De8+ Ld7 5.Dg6+ Kc7 6.Dd3 Kc6 7.Dc2+ Kd6 8.Dc3 L7e6 9.Da3+ Kc6 10.Da6 Kc5 11.Kf8 Lc7 12.Da3+ Kb5 13.Db2+ Kc6 14.Dc1+ (15.050.914.543) 50169
Best move: De4-b7 Time: 5:02.766 min N/s: 50.169.715 Nodes: 15.050.914.543
Correct solution! (5:00.000) 52
Solved so far: 20 of 54 ; 217:26m

Nr.53 (159)

[d]2N5/4Q3/1b6/3b4/3k4/4b3/6K1/8 w - - 0 1

57/143 5:00 +0.71 1.Ka4 Lh4 2.Kb5 Lfg5 3.Dh2 Lf6 4.Kc5 Le7+ 5.Kb6 Ld8+ 6.Ka7 Ldg5 7.Dh3 Ke5 8.Kb8 Le4 9.Dc3+ Ke6 10.Db3+ Kf6 11.Db2+ Kf5 12.Db5+ Ke6 13.De2 Kd5 14.Dg4 (12.690.236.688) 42300
Best move: Kb3-a4 Time: 5:02.781 min N/s: 42.300.788 Nodes: 12.690.236.688
Correct solution! (2:32.458) 56
Solved so far: 19 of 53 ; 212:26m

Nr.48 (154):

[d]2N5/8/8/2b2b2/3b4/KQ6/5k2/8 w - - 0 1

57/129 5:00 +0.68 1.Kh3 Lc5 2.Df6+ Kd3 3.Kg4 Le4 4.Db2 Kc4 5.De2+ Ld3 6.Da2+ Kc3 7.Da5+ Kb3 8.De1 Kc2 9.Kf3 Led4 10.Dg3 Kc3 11.Dc7 Kb3 12.Df4 Kc3 13.Dc1+ Kb3 14.Dd2 (13.065.331.248) 43551
Best move: Kg2-h3 Time: 5:02.828 min N/s: 43.551.104 Nodes: 13.065.331.248
Correct solution! (0:42.189) 44
Solved so far: 18 of 48 ; 189:42m

"Best" one (and only one near to +- eval was nr. 3 (109):

[d]8/4Q3/5K2/8/2k5/8/bb6/8 w - - 0 1

54/107 5:00 +1.58 1.Kg5 Lb3 2.De2+ Kc3 3.Kf4 Lc4 4.Dd1 Ld3 5.Ke3 Lc4 6.Db1 Lb5 7.De1+ Kb3 8.Dd1+ Kb4 9.Dd6+ Kb3 10.Db8 Kc4 11.Db6 Lc3 12.Ke4 La4 13.De6+ Kb4 14.Dd6+ (16.225.803.216) 54086
Best move: Kf6-g5 Time: 5:02.797 min N/s: 54.086.010 Nodes: 16.225.803.216
Correct solution! (1:59.002) 49

Then there were only 3 more with a numeric eval above 1 pawn; second best was nr. 35 (141):

[d]8/1Q1Nr3/8/K6b/4n3/8/8/6k1 w - - 0 1

41/85 5:00 +1.14 1.Db1+ Kf2 2.Db2+ Le2 3.Se5 Sg3 4.Sd3+ Kg2 5.Df6 Te4 6.Df2+ Kh3 7.Sf4+ Kg4 8.Sd5 Kh3 9.Se3 Te5+ 10.Kb4 Lg4 11.Dg2+ Kh4 12.Dh2+ Lh3 13.Df2 Tg5 14.Ka3 (7.106.968.942) 23689
Best move: Db7-b1 Time: 5:02.750 min N/s: 23.689.896 Nodes: 7.106.968.942
Correct solution! (0:18.802) 30
Solved so far: 10 of 35 ; 148:36m

The 2 other ones above 1.00 were nr. 6 (112) with 1.07 and nr. 7 (113) with 1.03.

So what does this mean? All the found solutions lack an eval showing the only winning move (as for the 50-move boundary) to be decisively better than all the other, merely drawing moves. There is not a single found solution that is correct not only as for the move chosen, but also as for identifying it as the only winning move.

Not to speculate again about more or less luck and accident in the numeric results, one could well say that all solutions found in this run were found "for the wrong reasons", as the saying goes. Of course, if there are only a few moves between the best and second-best move as for DTZ, the better one winning and the lesser one not, one cannot speak of "much" discrimination between such two moves either; so it is somewhat excusable that the engines don't show more difference in eval 50 moves before the decisive boundary. But if it's not a weakness of the engines, it's one of the positions, which are not really distinguishable as for best and second-best move except by a DTZ of a few moves more or less out of 50. Just my two cents.

I'm outa here now (I hope), regards
Peter.