What to do when relative strength between engine versions is inconsistent

jmcd · Post by **jmcd** » Mon Jan 30, 2023 8:47 pm

I've been having a problem lately where I make changes to my engine and it beats the previous version by a significant margin, but then when I test it against an even older release, it loses or only draws level. Basically a rock-paper-scissors situation. What is the best way to handle this sort of problem? Is the only solution to build a big testing pool of different engines and see which version does best against the field? I'm used to doing SPRT testing in cutechess, and I'm not sure how you could run a larger test like this with multiple engines in any reasonable amount of time. Has anybody encountered the same problem? How did you deal with it?

lithander · Post by **lithander** » Mon Jan 30, 2023 10:07 pm

Instead of verifying your new version in self-play, you can pit it against a gauntlet of different opponents, ideally other engines of similar strength.

jmcd · Post by **jmcd** » Mon Jan 30, 2023 10:43 pm

Is there a way to do that concurrently like in SPRT testing? It would take a long time if not.

j.t. · Post by **j.t.** » Tue Jan 31, 2023 12:07 am

jmcd wrote: ↑Mon Jan 30, 2023 10:43 pm Is there a way to do that concurrently like in SPRT testing? It would take a long time if not

The script I use is

Code:

cutechess-cli \
-recover \
-concurrency 30 \
-ratinginterval 50 \
-games 2 -rounds 400 \
-tournament gauntlet \
-pgnout loggedGames.pgn min \
-openings file=./Blitz_Testing_4moves.pgn order=random -repeat 2 \
-each tc=10+0.1 option.Hash=4 proto=uci dir=./ \
-engine cmd=./myNewVersion \
-engine cmd=./chessengines/frozenight-5.1.0-3097-uci \
-engine cmd=./chessengines/cheng4_0.41-3050-uci \
... (more engines)

This plays games with "myNewVersion" against all the other listed engines. The maximum number of games can be adjusted by increasing or decreasing "-rounds". The number of games played concurrently is set by "-concurrency".
(This specific script only works with UCI engines, but it can be modified slightly to also allow for winboard engines. Enter "cutechess-cli --help" to get more information on how to do that, or ask if you want to know.)

I usually look at the current ratings that are printed out every 50 games and decide manually if it's worth it to continue.
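For anyone who wants to sanity-check the printed ratings by hand, here is a minimal sketch of how a win/draw/loss record maps to an Elo difference under the standard logistic model. The function name elo_diff is made up for illustration and is not part of cutechess-cli. Keep in mind that after only 50 games the error margin is still very wide.

```shell
#!/bin/sh
# elo_diff WINS DRAWS LOSSES   (hypothetical helper, not a cutechess-cli feature)
# Prints the Elo difference implied by a match score, using the standard
# logistic model: elo = 400 * log10(s / (1 - s)), where s = (W + D/2) / N.
elo_diff() {
  awk -v w="$1" -v d="$2" -v l="$3" 'BEGIN {
    n = w + d + l
    if (n == 0) { print "n/a"; exit }
    s = (w + d / 2) / n
    # The formula is undefined for a 0% or 100% score.
    if (s <= 0 || s >= 1) { print "n/a"; exit }
    printf "%.1f\n", 400 * log(s / (1 - s)) / log(10)
  }'
}

# Example: 60 wins, 20 draws, 20 losses is a 70% score.
elo_diff 60 20 20   # prints 147.2
```

This only gives the point estimate from the new engine's side; it says nothing about the uncertainty, which is why a large number of rounds is still needed before trusting a small difference.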

jmcd · Post by **jmcd** » Tue Jan 31, 2023 1:06 am

Thanks a ton!

JVMerlino · Post by **JVMerlino** » Tue Jan 31, 2023 5:07 am

I use Cutechess (NOT Cutechess-cli) to run my gauntlets against 12 opponents. It updates the results and relative elo after every game, and I just let it run overnight. Works great.

algerbrex · Post by **algerbrex** » Tue Jan 31, 2023 5:12 am

JVMerlino wrote: ↑Tue Jan 31, 2023 5:07 am I use Cutechess (NOT Cutechess-cli) to run my gauntlets against 12 opponents. It updates the results and relative elo after every game, and I just let it run overnight. Works great.

I'm curious, why not cutechess-cli? Would that not be much faster for testing purposes? Especially since you're not observing the games.

JVMerlino · Post by **JVMerlino** » Tue Jan 31, 2023 5:44 pm

algerbrex wrote: ↑Tue Jan 31, 2023 5:12 am
JVMerlino wrote: ↑Tue Jan 31, 2023 5:07 am I use Cutechess (NOT Cutechess-cli) to run my gauntlets against 12 opponents. It updates the results and relative elo after every game, and I just let it run overnight. Works great.
I'm curious, why not cutechess-cli? Would that not be much faster for testing purposes? Especially since you're not observing the games.

Heh, because cutechess worked fine, so I didn't bother checking out cli.

I play four simultaneous one-minute games and there are almost no losses on time (less than 1%), so it probably doesn't matter. Also, sometimes I observe the games for a while, just for the fun/irritation of it.

algerbrex · Post by **algerbrex** » Wed Feb 01, 2023 3:07 am

JVMerlino wrote: ↑Tue Jan 31, 2023 5:44 pm
algerbrex wrote: ↑Tue Jan 31, 2023 5:12 am
JVMerlino wrote: ↑Tue Jan 31, 2023 5:07 am I use Cutechess (NOT Cutechess-cli) to run my gauntlets against 12 opponents. It updates the results and relative elo after every game, and I just let it run overnight. Works great.
I'm curious, why not cutechess-cli? Would that not be much faster for testing purposes? Especially since you're not observing the games.
Heh, because cutechess worked fine, so I didn't bother checking out cli. I play four simultaneous one-minute games and there are almost no losses on time (less than 1%), so it probably doesn't matter. Also, sometimes I observe the games for a while, just for the fun/irritation of it.

Ah fair points

KhepriChess · Post by **KhepriChess** » Wed Feb 01, 2023 4:05 am

j.t. wrote: ↑Tue Jan 31, 2023 12:07 am
jmcd wrote: ↑Mon Jan 30, 2023 10:43 pm Is there a way to do that concurrently like in SPRT testing? It would take a long time if not
The script I use is
Code:
cutechess-cli \
-recover \
-concurrency 30 \
-ratinginterval 50 \
-games 2 -rounds 400 \
-tournament gauntlet \
-pgnout loggedGames.pgn min \
-openings file=./Blitz_Testing_4moves.pgn order=random -repeat 2 \
-each tc=10+0.1 option.Hash=4 proto=uci dir=./ \
-engine cmd=./myNewVersion \
-engine cmd=./chessengines/frozenight-5.1.0-3097-uci \
-engine cmd=./chessengines/cheng4_0.41-3050-uci \
... (more engines)
This plays games with "myNewVersion" against all the other listed engines. How many games should be played at most can be adjusted by in/decreasing "-rounds". The number of concurrent threads is "-concurrency".
(This specific script only works with UCI engines, but it can be modified slightly to also allow for winboard engines. Enter "cutechess-cli --help" to get more information on how to do that, or ask if you want to know.)

I usually look at the current ratings that are printed out every 50 games and decide manually if it's worth it to continue.

"Concurrency 30" - Aren't the processes going to trip over each other fighting for compute time with so many running at once? I've always seen people say to do max concurrency at either the number of threads or cores on the CPU. Which I guess is possible if you have some crazy CPU?
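If you would rather tie the value to the machine than hard-code it, a minimal sketch could look like the following. It assumes each engine runs single-threaded and that "getconf _NPROCESSORS_ONLN" is available (it is on most Linux and macOS systems). Leaving one core free is just my assumption for keeping the tournament manager and the OS responsive, not a cutechess-cli requirement.

```shell
#!/bin/sh
# Number of logical CPUs currently online; fall back to 1 if unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Guard against empty or non-numeric output such as "undefined".
case "$cores" in
  ''|*[!0-9]*) cores=1 ;;
esac

# Assumption: leave one core free, but never go below one concurrent game.
if [ "$cores" -gt 1 ]; then
  concurrency=$((cores - 1))
else
  concurrency=1
fi

# Print the option so it can be pasted into the cutechess-cli command line.
echo "-concurrency $concurrency"
```

Note that this counts logical CPUs, so on a machine with hyperthreading it may be higher than the number of physical cores.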
