Tuning: A success story

Discussion of chess software programming and technical issues.

Moderator: Ras

JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Tuning: A success story

Post by JoAnnP38 »

Whiskers wrote: Sun Feb 19, 2023 11:02 pm As for the ELO difference?
440 game test: +188 -124 =128 - +51 elo
:mrgreen:
I am also tuning my evaluation, but I am using a genetic algorithm at the moment to minimize the errors. How did you choose 440 as the number of games to play to verify? I used 100 in cutechess, but that doesn't appear to give a very solid result (35 +/- 37?!) Is 440 a magic number that I am unaware of to minimize the potential deviation? Also, what time controls are you using?
Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Re: Tuning: A success story

Post by Whiskers »

JoAnnP38 wrote: Sun Feb 26, 2023 2:06 pm
Whiskers wrote: Sun Feb 19, 2023 11:02 pm As for the ELO difference?
440 game test: +188 -124 =128 - +51 elo
:mrgreen:
I am also tuning my evaluation, but I am using a genetic algorithm at the moment to minimize the errors. How did you choose 440 as the number of games to play to verify? I used 100 in cutechess, but that doesn't appear to give a very solid result (35 +/- 37?!) Is 440 a magic number that I am unaware of to minimize the potential deviation? Also, what time controls are you using?
440 games is just because the list of openings I downloaded to get Willow to start from different positions had 220 openings lol. As for time control, it’s really just a simple “1 sec per move” - I’m not running fishtest or anything!
jdart
Posts: 4398
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Tuning: A success story

Post by jdart »

In general, you should really run tests until you get a positive or negative result from SPRT, indicating you have a statistically significant result.

For changes with a very large ELO impact, 440 games may be enough, but usually that is not nearly enough to get significance.
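jdart's point can be made concrete with numbers from earlier in the thread. Below is a minimal sketch (not fishtest's actual implementation) of an SPRT decision using the common normal approximation of the log-likelihood ratio; the hypothesis bounds {elo0, elo1} and the alpha/beta values are illustrative assumptions:

```python
import math

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate SPRT log-likelihood ratio (normal/GSPRT-style approximation)."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n                      # observed score per game
    s0 = 1 / (1 + 10 ** (-elo0 / 400))                # expected score under H0
    s1 = 1 / (1 + 10 ** (-elo1 / 400))                # expected score under H1
    # empirical per-game score variance (a draw counts as half a point)
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    return n * (s1 - s0) * (2 * s - s0 - s1) / (2 * var)

def sprt_decision(llr, alpha=0.05, beta=0.05):
    if llr >= math.log((1 - beta) / alpha):           # ~ +2.94: accept H1 (gain)
        return "H1"
    if llr <= math.log(beta / (1 - alpha)):           # ~ -2.94: accept H0
        return "H0"
    return "continue"

# The +51 Elo result from the 440-game test quoted above (+188 -124 =128):
print(sprt_decision(sprt_llr(188, 128, 124, elo0=0, elo1=5)))    # prints "continue"
print(sprt_decision(sprt_llr(188, 128, 124, elo0=0, elo1=20)))   # prints "H1"
```

Notably, even that +51 Elo result only passes with wide hypothesis bounds like {0, 20}; with tighter bounds such as {0, 5} the test still says to keep playing, which is exactly the point about 440 games usually not being enough.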
Whiskers
Posts: 243
Joined: Tue Jan 31, 2023 4:34 pm
Full name: Adam Kulju

Re: Tuning: A success story

Post by Whiskers »

jdart wrote: Sun Feb 26, 2023 4:39 pm In general, you should really run tests until you get a positive or negative result from SPRT, indicating you have a statistically significant result.

For changes with a very large ELO impact, 440 games may be enough, but usually that is not nearly enough to get significance.
That is correct. So far my engine is still at the point where most improvements I make are either pretty big or supposed to be pretty big; for example, I think you can see that PVS is an improvement within 440 games. But I will definitely get to a better testing scheme. Eventually 😉
Uri Blass
Posts: 10790
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Tuning: A success story

Post by Uri Blass »

jdart wrote: Sun Feb 26, 2023 4:39 pm In general, you should really run tests until you get a positive or negative result from SPRT, indicating you have a statistically significant result.

For changes with a very large ELO impact, 440 games may be enough, but usually that is not nearly enough to get significance.
SPRT is good for making faster progress in a relatively short time, but it is not good for understanding.

If you want a good estimate of a patch's value in Elo points, then you need a fixed number of games.

Better understanding may be productive for better patches in the future: if you know how many Elo points a patch gained, you can use that knowledge to make better guesses about which patches to try next.
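As a rough sketch of how large that fixed number of games needs to be: under a normal approximation the error bar shrinks with the square root of the game count, so one can back out the games required for a target margin. The 50% draw ratio and 95% confidence level below are illustrative assumptions, not anyone's exact setup:

```python
import math

def games_for_margin(target_elo, draw_ratio=0.5, z=1.96):
    """Rough game count for a +/- target_elo error bar in a near-equal match.
    Assumes a score near 50% and the given draw ratio; z=1.96 is a ~95% interval."""
    sd = math.sqrt((1 - draw_ratio) * 0.25)      # per-game score std-dev at 50%
    elo_per_score = 400 / math.log(10) / 0.25    # slope of the Elo curve at 50%
    return math.ceil((z * elo_per_score * sd / target_elo) ** 2)

print(games_for_margin(5))    # prints 9275: ~9300 games for +/- 5 Elo at 50% draws
```

This is why a few hundred games only pin a patch down to within tens of Elo, while roughly 10,000 games are needed to get the margin down to about ±5.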
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Tuning: A success story

Post by lithander »

I haven't used SPRT so far in developing Leorik. The reason is that I'm not merely interested in knowing whether a version is better; I want to know by how much. With SPRT that would involve two test runs: one where SPRT tells you that the new version is better, and then another one to give you a good Elo estimate.

Instead I just schedule the Elo-estimating test and choose to abort it manually when it's not progressing as hoped. Cutechess gives you the error range (e.g. +/- 5.0) after you have aborted, so you can make sure that you didn't abort too early. (It's easy to get an intuition for that after a while.)

I typically use 10k games which brings the error range down to +/- 5 Elo:

Code:

Score of Leorik-2.3.7b vs Leorik-2.3.7a: 3083 - 2250 - 4667  [0.542] 10000
...      Leorik-2.3.7b playing White: 1896 - 883 - 2221  [0.601] 5000
...      Leorik-2.3.7b playing Black: 1187 - 1367 - 2446  [0.482] 5000
...      White vs Black: 3263 - 2070 - 4667  [0.560] 10000
Elo difference: 29.0 +/- 5.0, LOS: 100.0 %, DrawRatio: 46.7 %
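The Elo difference and error bar that cutechess reports can be reproduced from the W/D/L counts with the standard logistic model and a normal approximation. This is a sketch of the math, not cutechess's actual code:

```python
import math

def elo_estimate(wins, draws, losses, z=1.96):
    """Estimate the Elo difference and a ~95% error margin from W/D/L counts."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n                 # mean score per game
    elo = -400 * math.log10(1 / score - 1)           # logistic Elo model
    # per-game variance of the score (a draw counts as half a point)
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    se = math.sqrt(var / n)                          # standard error of the mean
    # propagate the score error through the Elo curve at the observed score
    elo_per_score = 400 / math.log(10) / (score * (1 - score))
    return elo, z * se * elo_per_score

# Figures from the Leorik-2.3.7b selfplay run above:
elo, margin = elo_estimate(3083, 4667, 2250)
print(f"{elo:.1f} +/- {margin:.1f}")   # prints "29.0 +/- 5.0"
```

The result matches the "Elo difference: 29.0 +/- 5.0" line that cutechess printed for this match.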
The time control I use is 5s + 200ms increment, and I use 20 of the 24 available threads on an AMD CPU so that the computer can still be used for light work like web browsing without risking losses on time from an overloaded CPU. A run takes ~4 hours.

The next thing I do with a promising version like the above is to let it run against a gauntlet of opponents in my current strength range. E.g.:

Code:

./cutechess-cli.exe -engine conf="Leorik-2.3.7b" -engine conf="Cheese-3.1.1" -engine conf="Fridolin-4.0" -engine conf="MadChess-3.1" -engine conf="zurichess-neuchatel" -engine conf="MinkoChess-1.3" -engine conf="OliThink-5.10.1" -engine conf="Shallow-5" -engine conf="dumb-1.9" -engine conf="zahak-5.0" -engine conf="odonata-0.6.2" -engine conf="Inanis-1.1.1" -engine conf="Supernova-2.4" -each tc=40/30 book=varied.bin option.Hash=32 -pgnout leorik237b_gauntlet_30per40.pgn -rounds 1000 -games 2 -repeat -concurrency 7 -tournament gauntlet
Here Leorik is playing both sides of each book-opening against one of the other engines. The other opponents never play each other.
This is a different machine, using an Intel CPU instead of AMD. The time control here is 30s per 40 moves: more time per move on average, but no per-move increment. This is how many tournaments are run, for example the ones that go into the CCRL list, though those give even more time per move.

While this gauntlet is running (for many hours if not days, because it's only an 8-core machine) and more games accumulate in the PGN, I already use Ordo every few hours to get Elo estimates for the version.

Code:

.\ordo-win64.exe -p leorik237_gauntlet8_30per40.pgn -m anchors_2.3.txt

Code:

   # PLAYER                 :  RATING  POINTS  PLAYED   (%)
   1 Cheese-3.1.1           :  2915.0   319.0     450    71
   2 MinkoChess-1.3         :  2849.0   285.0     447    64
   3 OliThink-5.10.1        :  2844.0   271.5     448    61
   4 zurichess-neuchatel    :  2824.0   268.5     448    60
   5 Shallow-5              :  2811.0   243.5     448    54
   6 Leorik-2.3.7           :  2792.2  2781.5    5364    52
   7 Inanis-1.1.1           :  2767.0   186.0     444    42
   8 Fridolin-4.0           :  2759.0   192.0     449    43
   9 odonata-0.6.2          :  2744.0   171.0     446    38
  10 zahak-5.0              :  2730.0   175.0     445    39
  11 MadChess-3.1           :  2713.0   125.5     448    28
  12 dumb-1.9               :  2703.0   195.5     448    44
  13 Supernova-2.4          :  2687.0   150.0     443    34

White advantage = 0.00
Draw rate (equal opponents) = 50.00 %
The anchors_2.3.txt file is just a text file assigning a fixed Elo number to each engine that is not Leorik. I take these Elo numbers from the CCRL list. The win percentage should correlate with the Elo difference, and you can see that some engines may perform unusually strongly against you (dumb-1.9) or unusually weakly (MadChess-3.1), and this is exactly why you need a gauntlet. ;)
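The expected correlation between score percentage and Elo difference is the standard logistic formula. A quick sketch, using the ratings from the Ordo table above (which are themselves estimates, so this is only illustrative):

```python
def expected_score(elo_diff):
    """Expected score for the higher-rated side under the logistic Elo model."""
    return 1 / (1 + 10 ** (-elo_diff / 400))

# Leorik (~2792) vs dumb-1.9 (2703): the model expects Leorik to score ~62.5%,
# but dumb scored 44% (i.e. Leorik got only ~56%), so dumb overperforms here.
print(f"{expected_score(2792 - 2703):.3f}")   # prints "0.625"
# Conversely, MadChess-3.1 (2713) scored only 28% where ~39% would be expected.
```

Comparing these expected percentages against the PLAYED/(%) columns in the Ordo output is a quick way to spot which opponents are unusually hard or easy for your engine in particular.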

Also, compared to selfplay, the Elo gains in the gauntlet are typically much more modest: something that was 30 Elo in selfplay will end up being 10 Elo in the gauntlet, for example. I've accepted that as normal by now and stopped being disappointed. ;)

When a local code change improves my engine both in selfplay and yields a higher Elo against the gauntlet, I make a commit. I also add the match results to the commit comments and use well-measured versions like that as checkpoints for future selfplay tests.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess