What to do when relative strength between engine versions is inconsistent

jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Post by jmcd »

I've been having a problem lately where I make changes to my engine and it beats the previous version by a significant margin, but when I test it against an even older release, it loses or only breaks even. Basically a rock-paper-scissors situation. What is the best way to handle this sort of problem? Is the only solution to build a big testing pool of different engines and see which version does best against the field? I'm used to doing SPRT testing in cutechess, and I'm not sure how you could run a larger test like this with multiple engines in any reasonable amount of time. Has anybody encountered the same problem? How did you deal with it?
Clovis GitHub
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Post by lithander »

Instead of verifying your new version in self-play, you can pit it against a gauntlet of different opponents, ideally other engines of similar strength.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Post by jmcd »

Is there a way to do that concurrently, like in SPRT testing? It would take a long time if not.
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Post by j.t. »

jmcd wrote: Mon Jan 30, 2023 10:43 pm Is there a way to do that concurrently like in SPRT testing? It would take a long time if not
The script I use is

Code:

# Gauntlet mode: the first -engine listed plays against every other engine.
cutechess-cli \
    -recover \
    -concurrency 30 \
    -ratinginterval 50 \
    -games 2 -rounds 400 \
    -tournament gauntlet \
    -pgnout loggedGames.pgn min \
    -openings file=./Blitz_Testing_4moves.pgn order=random -repeat 2 \
    -each tc=10+0.1 option.Hash=4 proto=uci dir=./ \
    -engine cmd=./myNewVersion \
    -engine cmd=./chessengines/frozenight-5.1.0-3097-uci \
    -engine cmd=./chessengines/cheng4_0.41-3050-uci \
... (more engines)
This plays "myNewVersion" against every other engine listed. Raise or lower "-rounds" to change the maximum number of games played, and use "-concurrency" to set how many games run at the same time.
(This specific script only works with UCI engines, but it can be modified slightly to also support WinBoard engines. Run "cutechess-cli --help" for more information on how to do that, or ask if you want to know.)

I usually look at the current ratings that are printed out every 50 games and decide manually if it's worth it to continue.
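
When judging those intermediate ratings by eye, it helps to know how wide the noise is. Below is a minimal sketch, not part of the script above, that turns a win/loss/draw count into an Elo difference with an approximate 95% margin. The function name elo_estimate is hypothetical, the Elo formula is the standard logistic one, and the margin relies on a normal approximation, so treat the numbers as a rough guide only.

```shell
#!/bin/sh
# elo_estimate WINS LOSSES DRAWS
# Hypothetical helper (not part of cutechess-cli): prints the Elo difference
# implied by a match score, plus an approximate 95% margin.
elo_estimate() {
    awk -v w="$1" -v l="$2" -v d="$3" 'BEGIN {
        n = w + l + d
        if (n == 0) { print "no games played"; exit 1 }
        s = (w + d / 2) / n                       # score fraction in [0, 1]
        if (s <= 0 || s >= 1) { print "score is 0% or 100%: Elo is unbounded"; exit 1 }
        # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
        var = (w * (1 - s) * (1 - s) + d * (0.5 - s) * (0.5 - s) + l * s * s) / n
        se = sqrt(var / n)                        # standard error of the score
        ln10 = log(10)
        elo = 400 * log(s / (1 - s)) / ln10       # logistic Elo model
        # Delta method: scale the score error by the slope of the Elo curve.
        margin = 1.96 * se * 400 / (ln10 * s * (1 - s))
        printf "%+.1f +/- %.1f\n", elo, margin
    }'
}
```

For example, elo_estimate 50 50 0 prints "+0.0 +/- 68.1": after 100 decisive games, an even score still leaves roughly 68 Elo of uncertainty in either direction, which is one way short matches between versions can appear to contradict each other.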
jmcd
Posts: 58
Joined: Wed Mar 18, 2020 10:00 pm
Full name: Jonathan McDermid

Post by jmcd »

Thanks a ton!
JVMerlino
Posts: 1396
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Post by JVMerlino »

I use Cutechess (NOT Cutechess-cli) to run my gauntlets against 12 opponents. It updates the results and relative Elo after every game, and I just let it run overnight. Works great.
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Post by algerbrex »

JVMerlino wrote: Tue Jan 31, 2023 5:07 am I use Cutechess (NOT Cutechess-cli) to run my gauntlets against 12 opponents. It updates the results and relative Elo after every game, and I just let it run overnight. Works great.
I'm curious, why not cutechess-cli? Would that not be much faster for testing purposes? Especially since you're not observing the games.
JVMerlino
Posts: 1396
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Post by JVMerlino »

algerbrex wrote: Tue Jan 31, 2023 5:12 am
I'm curious, why not cutechess-cli? Would that not be much faster for testing purposes? Especially since you're not observing the games.
Heh, because cutechess worked fine, so I didn't bother checking out the CLI. ;) I play four simultaneous one-minute games and there are almost no losses on time (less than 1%), so it probably doesn't matter. Also, sometimes I observe the games for a while, just for the fun/irritation of it. :)
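
A time-loss rate like that can be checked from the saved games. The following is a minimal sketch under stated assumptions: the PGN file is assumed to carry a [Termination "time forfeit"] tag for games lost on time (a value defined by the PGN standard; whether a given tool or output mode writes it should be verified), and time_loss_rate is a hypothetical helper name. A different marker can be passed as the second argument.

```shell
#!/bin/sh
# time_loss_rate PGN_FILE [MARKER]
# Hypothetical helper: reports how many games in a PGN file ended on time.
# MARKER is matched as a fixed string; the default is an assumption about
# how the PGN records a time forfeit.
time_loss_rate() {
    pgn=$1
    marker=${2:-'[Termination "time forfeit"]'}
    # One [Result ...] tag per game gives the total number of games.
    games=$(grep -c '^\[Result ' "$pgn" || true)
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    timeouts=$(grep -cF "$marker" "$pgn" || true)
    awk -v g="$games" -v t="$timeouts" 'BEGIN {
        if (g == 0) { print "no games found"; exit 1 }
        printf "%d of %d games (%.1f%%) ended on time\n", t, g, 100 * t / g
    }'
}
```

If the reported rate grows as the number of simultaneous games goes up, that would suggest the machine is oversubscribed.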
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Post by algerbrex »

JVMerlino wrote: Tue Jan 31, 2023 5:44 pm
Heh, because cutechess worked fine, so I didn't bother checking out the CLI. ;) I play four simultaneous one-minute games and there are almost no losses on time (less than 1%), so it probably doesn't matter. Also, sometimes I observe the games for a while, just for the fun/irritation of it. :)
Ah fair points :-)
KhepriChess
Posts: 93
Joined: Sun Aug 08, 2021 9:14 pm
Full name: Kurt Peters

Post by KhepriChess »

"-concurrency 30" - Aren't the processes going to trip over each other fighting for compute time with so many running at once? I've always seen people recommend capping concurrency at the number of cores or threads on the CPU. Which I guess is possible if you have some crazy CPU?
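
That rule of thumb can be scripted rather than hard-coded. A minimal sketch, assuming nproc (GNU coreutils) or getconf is available on the machine; pick_concurrency is a hypothetical helper, and keeping one thread free for the system is an arbitrary choice, not something cutechess-cli requires:

```shell
#!/bin/sh
# pick_concurrency TOTAL_THREADS [RESERVED]
# Hypothetical helper: suggests a -concurrency value that leaves RESERVED
# hardware threads (default 1, an arbitrary assumption) free for the system.
pick_concurrency() {
    total=$1
    reserved=${2:-1}
    n=$((total - reserved))
    # Never suggest fewer than one concurrent game.
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Number of hardware threads visible to the OS; nproc is GNU-specific,
# so fall back to getconf where it is missing.
threads=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "-concurrency $(pick_concurrency "$threads")"
```

On a machine reporting 32 threads, pick_concurrency 32 2 gives 30, so a value like the one in the script is plausible on a large workstation or server CPU.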
Puffin: Github
KhepriChess: Github