jp wrote: ↑Sun Mar 22, 2020 11:32 am
noobpwnftw wrote: ↑Sun Mar 22, 2020 10:46 am Even on the other side, people are using more and more comprehensive testing procedures
What do you mean by "on the other side"?
It means the case where a significant portion of the engine is not written in comprehensible code, where one could accept or reject individual parts of a change. This is the opposite of how programs are usually made, where you change one thing at a time and prove it works better.
Poor mans testing process please
-
- Posts: 560
- Joined: Sun Nov 08, 2015 11:10 pm
Re: Poor mans testing process please
-
- Posts: 491
- Joined: Sat Mar 02, 2013 11:31 pm
-
- Posts: 4556
- Joined: Tue Jul 03, 2007 4:30 am
Re: Poor mans testing process please
You can tune your engine so that it solves all positions in a test suite, then make changes so that it finds those best moves faster and faster, AND lose Elo in the process. A big part of the slow process of improving an engine without playing games is making changes that seem to improve it but actually hurt it.
-
- Posts: 491
- Joined: Sat Mar 02, 2013 11:31 pm
Re: Poor mans testing process please
Some WAC testing
Stockfish 11 64-bit
Code: Select all
1s : score=283/300 [averages on correct positions: depth=8.0 time=0.07 nodes=79990]
2s : score=285/300 [averages on correct positions: depth=8.1 time=0.09 nodes=108839]
10s : score=294/300 [averages on correct positions: depth=8.6 time=0.23 nodes=265523]
60s : score=299/300 [averages on correct positions: depth=9.0 time=0.86 nodes=999890]
SF misses only WAC.287 at 60s.
Sapeli 1.79 64-bit
Code: Select all
1s : score=253/300 [averages on correct positions: depth=4.4 time=0.10 nodes=239732]
2s : score=263/300 [averages on correct positions: depth=4.6 time=0.18 nodes=423953]
10s : score=271/300 [averages on correct positions: depth=4.7 time=0.35 nodes=861802]
60s : score=286/300 [averages on correct positions: depth=5.0 time=1.74 nodes=4407171]
Sapeli gets 286/300 with 60s per move. That's not bad.
-
- Posts: 10297
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Poor mans testing process please
Ovyron wrote: ↑Mon Mar 23, 2020 1:34 am You can tune your engine so that it solves all positions in a test suite and make changes so that it finds those best moves faster and faster AND lose ELO in the process. The big part of a slow process of improving the engine without playing games is making changes that seem to improve it but actually hurt it.
I believe that tuning an engine to solve test positions, without caring about playing strength, can also be a way to get a strong engine.
I doubt there is a single engine below 3000 Elo that is stronger than Stockfish at solving test suites.
I remember that in the past the weak engine knightdreamer was tuned to solve the gcp test suites, but I cannot find the engine to compare it with Stockfish, and my guess is that Stockfish is stronger.
-
- Posts: 288
- Joined: Sat Jun 30, 2018 10:58 pm
- Location: Ukraine
- Full name: Volodymyr Shcherbyna
Re: Poor mans testing process please
I've come to the conclusion with Igel that test-suite testing is a dead end, as in the past I lost Elo (a lot) while improving suite results.
For those simple mortals who do not have access to OpenBench, the only way is to test locally at a time control that is not too short, so my current approach is to run two types of test:
1. Self-test to see incremental improvements
2. Regression test against some top 50 engines
Both tests run at the same time. I do not fully trust self-test results, but I do fully trust the result when the self-test and the regression test both show the same or a similar Elo increase on average.
Here is some hard data to back this up:
1. Regression run of Igel 2.3.0 versus top 50 engines in 60+0.6 time control:
2. Regression run of Igel 2.3.1 versus top 50 engines in 60+0.6 time control:
The tests are run in a very stable environment: the machine has 6 cores and 12 threads, but only 6 concurrent games are allowed for the final regression check. The Elo difference between the two runs is ~72; I then double-check with a self-test, and if it shows the same gain, the release is done.
Since I adopted this testing model I have started to gain some solid Elo that I can verify in CCRL as well. I am afraid there is no other way [for simple mortals, that is].
1. Regression run of Igel 2.3.0 versus top 50 engines in 60+0.6 time control:
Code: Select all
Rank Name                          Elo  +/-  Games   Score   Draws
   0 Igel 2.3.0 64 POPCNT         -178   14   2000   26.5%   24.9%
   1 Stockfish 10 64 POPCNT        531  132    100   95.5%    9.0%
   2 Xiphos 0.6 BMI2               494  115    100   94.5%   11.0%
   3 Ethereal 11.75 (PEXT)         478  121    100   94.0%   10.0%
   4 Fire 7.1 x64 popcnt           436  103    100   92.5%   13.0%
   5 Laser 1.7                     330   68    100   87.0%   26.0%
   6 rofChade 2.202 BMI            323   74    100   86.5%   23.0%
   7 Defenchess 2.2 x64            295   73    100   84.5%   23.0%
   8 Andscacs 0.95                 263   72    100   82.0%   22.0%
   9 Pedone 2.0                    230   65    100   79.0%   28.0%
  10 RubiChess 1.6                 230   61    100   79.0%   32.0%
  11 Strelka 5.5 x64               220   71    100   78.0%   20.0%
  12 Arasan 22.0                   186   61    100   74.5%   31.0%
  13 Texel 1.07                    168   63    100   72.5%   27.0%
  14 Deep iCE 4.0.853 x64/popcnt   143   54    100   69.5%   41.0%
  15 Nemorino                      143   61    100   69.5%   29.0%
  16 Vajolet2 2.8.0                119   58    100   66.5%   33.0%
  17 Protector 1.9.0               100   56    100   64.0%   36.0%
  18 Winter 0.7 BMI2               -31   58    100   45.5%   29.0%
  19 zurichess neuchatel           -63   53    100   41.0%   40.0%
  20 GreKo 2018.08                -295   83    100   15.5%   15.0%
2. Regression run of Igel 2.3.1 versus top 50 engines in 60+0.6 time control:
Code: Select all
Rank Name                          Elo  +/-  Games   Score   Draws
   0 Igel 2.3.1 64 POPCNT         -106    6  10000   35.2%   31.0%
   1 Stockfish 10 64 POPCNT        527   55    500   95.4%    8.0%
   2 Fire 7.1 x64 popcnt           346   35    500   88.0%   19.6%
   3 Ethereal 11.75 (PEXT)         340   31    500   87.6%   23.6%
   4 Xiphos 0.6 BMI2               318   31    500   86.2%   24.4%
   5 Laser 1.7                     259   28    500   81.6%   29.6%
   6 rofChade 2.202 BMI            220   27    500   78.0%   32.8%
   7 Defenchess 2.2 x64            206   26    500   76.6%   34.0%
   8 Andscacs 0.95                 172   27    500   72.9%   30.2%
   9 RubiChess 1.6                 161   24    500   71.7%   41.0%
  10 Pedone 2.0                    143   25    500   69.5%   34.6%
  11 Strelka 5.5 x64               119   25    500   66.5%   34.6%
  12 Deep iCE 4.0.853 x64/popcnt    98   24    500   63.7%   39.0%
  13 Arasan 22.0                    95   25    500   63.4%   35.6%
  14 Texel 1.07                     89   24    500   62.6%   38.8%
  15 Vajolet2 2.8.0                 65   23    500   59.2%   44.4%
  16 Nemorino                       44   25    500   56.3%   34.2%
  17 Protector 1.9.0                 2   23    500   50.3%   40.6%
  18 Winter 0.7 BMI2              -111   26    500   34.5%   29.8%
  19 zurichess neuchatel          -215   26    500   22.5%   33.8%
  20 GreKo 2018.08                -411   44    500    8.6%   11.2%
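For readers who want to check how the Elo column relates to the Score column in the two tables above, here is a minimal Python sketch. It assumes the usual logistic Elo model; the function name `elo_from_score` is made up for illustration, and the inputs are the rounded overall scores of Igel 2.3.0 (26.5%) and Igel 2.3.1 (35.2%), so the results should only be expected to land close to the tabulated values.

```python
import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

# Overall scores copied from the two regression tables (rounded percentages).
elo_230 = elo_from_score(0.265)  # Igel 2.3.0 against the gauntlet
elo_231 = elo_from_score(0.352)  # Igel 2.3.1 against the same gauntlet
gain = elo_231 - elo_230         # improvement between the two versions

print(f"2.3.0: {elo_230:.0f}  2.3.1: {elo_231:.0f}  gain: {gain:.0f}")
```

The gain computed this way comes out within a point or two of the ~72 Elo quoted in the post; the small gap is what you would expect from feeding in percentages rounded to one decimal.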
-
- Posts: 558
- Joined: Sat Mar 25, 2006 8:27 pm
Re: Poor mans testing process please
Even with an old machine, you can still easily run 1000-2000 fast-ish games (e.g. 5s + 0.2s) overnight. That won't prove out <3 elo improvements, but it will validate many good patches and flag bad ones.
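To get a feel for why a couple of thousand games cannot resolve differences of a few Elo, here is a rough Python sketch of the error margin on a match result. It assumes independent games and a normal approximation for the 95% interval; the function names and the win/draw/loss counts are invented for illustration and are not taken from any post in this thread.

```python
import math

def score_to_elo(score):
    """Logistic Elo difference for a score fraction, clamped away from 0 and 1."""
    score = min(max(score, 1e-9), 1.0 - 1e-9)
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high): point estimate and approximate 95% interval."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("need at least one game")
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result, where a game is worth 1, 0.5 or 0.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    stderr = math.sqrt(variance / games)
    return (score_to_elo(score),
            score_to_elo(score - z * stderr),
            score_to_elo(score + z * stderr))

# Hypothetical overnight run of 2000 games.
elo, low, high = elo_with_margin(wins=700, draws=640, losses=660)
print(f"Elo {elo:+.1f}, 95% interval [{low:+.1f}, {high:+.1f}]")
```

With these made-up counts the interval is more than 20 Elo wide, so a clearly good or clearly bad patch shows up, while a 3 Elo change disappears in the noise.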
-
- Posts: 199
- Joined: Sun Nov 03, 2013 9:32 am
Re: Poor mans testing process please
I deleted my qsort code in MoveGen and simply used a PickNextMove function, on the basis that if one of the first few moves causes a cutoff, sorting all the moves was a waste. I believe this is pretty standard.
My time to depth was reduced quite a bit.
So surely this type of improvement doesn't need 1000 games, and time to depth is a sufficient indicator. Right?
Laurie
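For anyone unfamiliar with the idea, a pick-next-move routine can be sketched as one step of selection sort that is called lazily, once per move actually searched. The Python below is only an illustration under assumed names (`pick_next_move`, parallel `moves` and `scores` lists, invented move strings and scores); it is not the poster's code, and a real engine would do this on its own move-list structure.

```python
def pick_next_move(moves, scores, start):
    """Swap the best-scored remaining move into index `start` and return it.

    Only the moves from `start` onward are scanned, so if the search cuts off
    after the first move or two, the rest of the list is never ordered.
    """
    best = start
    for i in range(start + 1, len(moves)):
        if scores[i] > scores[best]:
            best = i
    moves[start], moves[best] = moves[best], moves[start]
    scores[start], scores[best] = scores[best], scores[start]
    return moves[start]

# Hypothetical move list with ordering scores (higher is searched first).
moves = ["a2a3", "e4d5", "g1f3", "d1h5"]
scores = [0, 900, 50, 300]

# Pulling every move reproduces a full sort; a real search would stop early.
order = [pick_next_move(moves, scores, i) for i in range(len(moves))]
print(order)  # ['e4d5', 'd1h5', 'g1f3', 'a2a3']
```

Each call costs one linear scan, so the saving over a full sort comes entirely from nodes where a cutoff arrives after the first few moves.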
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Poor mans testing process please
Right. This is not a functional change, just a speed improvement. It should still search an identical tree (which you could check via the node count), and time-to-depth would be a good way to measure the effect. (Averaged over a dozen or so representative positions, though.)
Changes in move order should in general also not lead to a different move or score, so time-to-depth is a good indication there as well.
When you add or change eval terms, or your pruning and reduction scheme, games are the only thing that can tell you whether you improved or deteriorated.
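The check described above (same node count, then compare time-to-depth over several positions) could be scripted roughly as follows. This is a sketch under assumptions: the function name `compare_builds` is hypothetical, each build is represented by a list of `(nodes, seconds)` pairs measured at the same fixed depth on the same positions, and the geometric mean is one reasonable choice of average rather than anything prescribed in the thread. The measurements shown are invented.

```python
import math

def compare_builds(old, new):
    """Return the geometric-mean speedup of `new` over `old`.

    `old` and `new` are lists of (nodes, seconds), one entry per position,
    both measured at the same fixed depth. A differing node count means the
    change was functional after all, so time-to-depth is no longer enough.
    """
    if len(old) != len(new) or not old:
        raise ValueError("need the same non-empty set of positions for both builds")
    log_sum = 0.0
    for idx, ((n_old, t_old), (n_new, t_new)) in enumerate(zip(old, new)):
        if n_old != n_new:
            raise ValueError(
                f"position {idx}: node count changed ({n_old} -> {n_new}); test with games")
        log_sum += math.log(t_old / t_new)
    return math.exp(log_sum / len(old))

# Invented measurements for two positions: identical trees, new build is faster.
old_build = [(1000, 2.0), (5000, 8.0)]
new_build = [(1000, 1.0), (5000, 4.0)]
speedup = compare_builds(old_build, new_build)
print(f"speedup: {speedup:.2f}x")
```

If the node counts differ, the function refuses to report a speedup, which mirrors the advice that functional changes need games rather than timing.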
-
- Posts: 199
- Joined: Sun Nov 03, 2013 9:32 am
Re: Poor mans testing process please
And what about move ordering ?