Poor mans testing process please

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: Poor mans testing process please

Post by noobpwnftw »

jp wrote: Sun Mar 22, 2020 11:32 am
noobpwnftw wrote: Sun Mar 22, 2020 10:46 am Even on the other side, people are using more and more comprehensive testing procedures
What do you mean by "on the other side"?
It means engines where a significant portion is not written in comprehensible code, so one cannot accept or reject individual changes on their own. This is the opposite of how programs are usually made, where you change one thing at a time and prove it works better.
JohnWoe
Posts: 491
Joined: Sat Mar 02, 2013 11:31 pm

Re: Poor mans testing process please

Post by JohnWoe »

Without games?
1. Test suites like https://www.chessprogramming.org/Win_at_Chess
2. Benchmarking
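For example, a suite run like that can be scripted without any GUI: feed each EPD position to the engine and count how many of the "bm" (best move) entries it finds. Below is a minimal sketch assuming the python-chess library; the engine path, the EPD file name and the one-second-per-move limit are placeholders.

Code: Select all

# Minimal EPD test-suite runner (sketch; assumes python-chess is installed).
# Counts how many "bm" (best move) positions the engine solves at a fixed time per move.
import chess
import chess.engine

ENGINE_PATH = "./my_engine"   # placeholder: path to any UCI engine
EPD_FILE = "wac.epd"          # placeholder: e.g. the Win at Chess suite
TIME_PER_MOVE = 1.0           # seconds per position

engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
solved = total = 0

with open(EPD_FILE) as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        board, ops = chess.Board.from_epd(line)
        if "bm" not in ops:
            continue
        total += 1
        result = engine.play(board, chess.engine.Limit(time=TIME_PER_MOVE))
        if result.move in ops["bm"]:   # "bm" is parsed as a list of moves
            solved += 1

engine.quit()
print(f"score={solved}/{total}")

Run at 1 s, 2 s, 10 s and so on per move, this prints score lines comparable to the ones later in this thread.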
Ovyron
Posts: 4556
Joined: Tue Jul 03, 2007 4:30 am

Re: Poor mans testing process please

Post by Ovyron »

You can tune your engine so that it solves all positions in a test suite, keep making changes so that it finds those best moves faster and faster, and still lose Elo in the process. The big risk in the slow process of improving an engine without playing games is making changes that seem to improve it but actually hurt it.
JohnWoe
Posts: 491
Joined: Sat Mar 02, 2013 11:31 pm

Re: Poor mans testing process please

Post by JohnWoe »

Some WAC testing

Stockfish 11 64-bit

Code: Select all

1s  : score=283/300 [averages on correct positions: depth=8.0 time=0.07 nodes=79990]
2s  : score=285/300 [averages on correct positions: depth=8.1 time=0.09 nodes=108839]
10s : score=294/300 [averages on correct positions: depth=8.6 time=0.23 nodes=265523]
60s : score=299/300 [averages on correct positions: depth=9.0 time=0.86 nodes=999890]
SF misses only WAC.287 on 60s

Sapeli 1.79 64-bit

Code: Select all

1s  : score=253/300 [averages on correct positions: depth=4.4 time=0.10 nodes=239732]
2s  : score=263/300 [averages on correct positions: depth=4.6 time=0.18 nodes=423953]
10s : score=271/300 [averages on correct positions: depth=4.7 time=0.35 nodes=861802]
60s : score=286/300 [averages on correct positions: depth=5.0 time=1.74 nodes=4407171]
Sapeli getting 286/300 with 60s per move. That's not bad :lol:
Uri Blass
Posts: 10280
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Poor mans testing process please

Post by Uri Blass »

Ovyron wrote: Mon Mar 23, 2020 1:34 am You can tune your engine so that it solves all positions in a test suite, keep making changes so that it finds those best moves faster and faster, and still lose Elo in the process. The big risk in the slow process of improving an engine without playing games is making changes that seem to improve it but actually hurt it.
I believe that tuning an engine to solve test positions, without caring about playing strength, can also be an option for getting a strong engine.
I doubt there is a single engine with Elo < 3000 that is stronger than Stockfish at solving test suites.

I remember that in the past the weak KnightDreamer was tuned to solve the GCP test suites, but I cannot find the engine to compare it with Stockfish, and my guess is that Stockfish is stronger.
voffka
Posts: 288
Joined: Sat Jun 30, 2018 10:58 pm
Location: Ukraine
Full name: Volodymyr Shcherbyna

Re: Poor mans testing process please

Post by voffka »

I've come to the conclusion in Igel that test suite testing is a dead end, as I lost some Elo in the past while improving suite results (a lot).

For those simple mortals who do not have access to OpenBench :( , the only way is to test locally at a time control that is not too short, so my current approach is to do two types of testing:

1. Self-test to see incremental improvements
2. Regression test against some top 50 engines

Both tests run at the same time. I do not fully trust the self-test results on their own, but I do trust the result when both the self-test and the regression test show the same or a similar Elo increase on average.

Here is some hard data to illustrate:

1. Regression run of Igel 2.3.0 versus top 50 engines in 60+0.6 time control:

Code: Select all

Rank Name                          Elo     +/-   Games   Score   Draws
   0 Igel 2.3.0 64 POPCNT         -178      14    2000   26.5%   24.9%
   1 Stockfish 10 64 POPCNT        531     132     100   95.5%    9.0%
   2 Xiphos 0.6 BMI2               494     115     100   94.5%   11.0%
   3 Ethereal 11.75 (PEXT)         478     121     100   94.0%   10.0%
   4 Fire 7.1 x64 popcnt           436     103     100   92.5%   13.0%
   5 Laser 1.7                     330      68     100   87.0%   26.0%
   6 rofChade 2.202 BMI            323      74     100   86.5%   23.0%
   7 Defenchess 2.2 x64            295      73     100   84.5%   23.0%
   8 Andscacs 0.95                 263      72     100   82.0%   22.0%
   9 Pedone 2.0                    230      65     100   79.0%   28.0%
  10 RubiChess 1.6                 230      61     100   79.0%   32.0%
  11 Strelka 5.5 x64               220      71     100   78.0%   20.0%
  12 Arasan 22.0                   186      61     100   74.5%   31.0%
  13 Texel 1.07                    168      63     100   72.5%   27.0%
  14 Deep iCE 4.0.853 x64/popcnt   143      54     100   69.5%   41.0%
  15 Nemorino                      143      61     100   69.5%   29.0%
  16 Vajolet2 2.8.0                119      58     100   66.5%   33.0%
  17 Protector 1.9.0               100      56     100   64.0%   36.0%
  18 Winter 0.7 BMI2               -31      58     100   45.5%   29.0%
  19 zurichess neuchatel           -63      53     100   41.0%   40.0%
  20 GreKo 2018.08                -295      83     100   15.5%   15.0%
2. Regression run of Igel 2.3.1 versus top 50 engines in 60+0.6 time control:

Code: Select all

Rank Name                          Elo     +/-   Games   Score   Draws
   0 Igel 2.3.1 64 POPCNT         -106       6   10000   35.2%   31.0%
   1 Stockfish 10 64 POPCNT        527      55     500   95.4%    8.0%
   2 Fire 7.1 x64 popcnt           346      35     500   88.0%   19.6%
   3 Ethereal 11.75 (PEXT)         340      31     500   87.6%   23.6%
   4 Xiphos 0.6 BMI2               318      31     500   86.2%   24.4%
   5 Laser 1.7                     259      28     500   81.6%   29.6%
   6 rofChade 2.202 BMI            220      27     500   78.0%   32.8%
   7 Defenchess 2.2 x64            206      26     500   76.6%   34.0%
   8 Andscacs 0.95                 172      27     500   72.9%   30.2%
   9 RubiChess 1.6                 161      24     500   71.7%   41.0%
  10 Pedone 2.0                    143      25     500   69.5%   34.6%
  11 Strelka 5.5 x64               119      25     500   66.5%   34.6%
  12 Deep iCE 4.0.853 x64/popcnt    98      24     500   63.7%   39.0%
  13 Arasan 22.0                    95      25     500   63.4%   35.6%
  14 Texel 1.07                     89      24     500   62.6%   38.8%
  15 Vajolet2 2.8.0                 65      23     500   59.2%   44.4%
  16 Nemorino                       44      25     500   56.3%   34.2%
  17 Protector 1.9.0                 2      23     500   50.3%   40.6%
  18 Winter 0.7 BMI2              -111      26     500   34.5%   29.8%
  19 zurichess neuchatel          -215      26     500   22.5%   33.8%
  20 GreKo 2018.08                -411      44     500    8.6%   11.2%
The tests are run in a very stable environment: the machine has 6 cores and 12 threads, but only 6 concurrent games are allowed for the final regression check. So the Elo difference here is about 72; I then double-check it in a self-test, and if it shows the same, the release is done.
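For what it is worth, tables in that format are what a cutechess-cli gauntlet prints, so a command along these lines could drive such a regression run. This is only a sketch: the engine names, binary paths, opening book and round count are placeholders, and the flags should be checked against your cutechess-cli version.

Code: Select all

cutechess-cli -tournament gauntlet -concurrency 6 \
  -engine cmd=./igel name="Igel dev" \
  -engine cmd=./stockfish name="Stockfish 10" \
  -engine cmd=./xiphos name="Xiphos 0.6" \
  -each proto=uci tc=60+0.6 \
  -openings file=openings.pgn format=pgn order=random \
  -games 2 -rounds 250 -repeat \
  -pgnout gauntlet.pgn

In a gauntlet the first engine plays all the others; 250 rounds of 2 games (each opening with both colours) gives 500 games per opponent, as in the second table above.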

Since I adopted this testing model I have started to obtain solid Elo gains that I can verify in CCRL as well. I am afraid there is no other way around it [for simple mortals, that is].
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: Poor mans testing process please

Post by Robert Pope »

Even with an old machine, you can still easily run 1000-2000 fast-ish games (e.g. 5s + 0.2s) overnight. That won't prove out <3 elo improvements, but it will validate many good patches and flag bad ones.
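As a rule of thumb, the resolution of such an overnight run can be estimated from the win/draw/loss counts with the usual logistic Elo model. A rough sketch (the example counts are invented):

Code: Select all

# Rough Elo difference and ~95% error bar from a match result (sketch).
import math

def elo_from_score(score):
    # Logistic model: expected score s corresponds to -400*log10(1/s - 1) Elo.
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_and_margin(wins, draws, losses):
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the score, then ~95% confidence in score space.
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    margin = 1.96 * math.sqrt(var / n)
    return elo_from_score(s), elo_from_score(min(s + margin, 0.9999)) - elo_from_score(s)

# Hypothetical overnight run of 2000 games:
elo, err = elo_and_margin(560, 920, 520)
print(f"elo = {elo:+.1f} +/- {err:.1f}")   # roughly +7 +/- 11 with these numbers

With error bars of around 10 Elo at 2000 games, patches worth only a couple of Elo indeed cannot be proven this way, while clearly good or clearly bad ones still show up.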
lauriet
Posts: 199
Joined: Sun Nov 03, 2013 9:32 am

Re: Poor mans testing process please

Post by lauriet »

I deleted my Qsort code in MoveGen and simply used a PickNextMove function, on the basis that if the first few moves produce cutoffs then sorting all the moves was a waste. I believe this is pretty standard.
My time to depth was reduced quite a bit.
So surely this type of improvement doesn't need 1000 games, and time to depth is a sufficient indicator. Right?

Laurie
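For readers who have not seen this trick: instead of sorting the whole move list up front, the search requests one move at a time and a partial selection sort brings the best-scored remaining move to the front on each call. A language-neutral sketch in Python (the move list and score array are stand-ins for whatever representation the engine uses):

Code: Select all

# Staged move picking (sketch): a partial selection sort that returns one move
# at a time, so nodes with an early cutoff never pay for a full sort.
def pick_next_move(moves, scores, index):
    """Swap the best-scored move from moves[index:] into moves[index] and return it."""
    best = index
    for i in range(index + 1, len(moves)):
        if scores[i] > scores[best]:
            best = i
    moves[index], moves[best] = moves[best], moves[index]
    scores[index], scores[best] = scores[best], scores[index]
    return moves[index]

# Shape of the search loop (pseudocode):
# for index in range(len(moves)):
#     move = pick_next_move(moves, scores, index)
#     ... make move, search, unmake ...
#     if value >= beta:
#         break   # cutoff: the remaining moves were never sorted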
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Poor mans testing process please

Post by hgm »

Right. This is not a functional change, just a speed improvement. It should still search an identical tree (which you can check via the node count), and time-to-depth would be a good way to measure the effect. (Averaged over a dozen representative positions, though.)

Changes in move ordering should in general also not lead to a different move or score, so time-to-depth is a good indicator there as well.

When you add or change eval terms, or your pruning and reduction scheme, games are the only thing that can tell you whether you improved or deteriorated.
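One way to automate such a time-to-depth comparison, again assuming the python-chess library; the engine path, the fixed depth and the position list are placeholders:

Code: Select all

# Time-to-depth over a small set of positions (sketch; assumes python-chess).
# Run once per engine build and compare the average time and the total node count.
import chess
import chess.engine

ENGINE_PATH = "./my_engine"   # placeholder: the build being measured
FIXED_DEPTH = 12              # placeholder depth
FENS = [                      # placeholder: a dozen representative positions
    chess.STARTING_FEN,
    "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3",
]

engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
total_time = total_nodes = 0
for fen in FENS:
    info = engine.analyse(chess.Board(fen), chess.engine.Limit(depth=FIXED_DEPTH))
    total_time += info.get("time", 0.0)
    total_nodes += info.get("nodes", 0)
engine.quit()

print(f"avg time to depth {FIXED_DEPTH}: {total_time / len(FENS):.2f}s, "
      f"total nodes: {total_nodes}")

Identical node counts between two builds indicate the tree really is unchanged; the average time then tells you how much the speed-up is worth.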
lauriet
Posts: 199
Joined: Sun Nov 03, 2013 9:32 am

Re: Poor mans testing process please

Post by lauriet »

And what about move ordering?