What to do when relative strength between engine versions is inconsistent
Moderator: Ras
-
- Posts: 58
- Joined: Wed Mar 18, 2020 10:00 pm
- Full name: Jonathan McDermid
What to do when relative strength between engine versions is inconsistent
I've been having a problem lately where I make changes to my engine and it beats the previous version by a significant margin, but when I test it against an even older release, it loses or only breaks even. Basically a rock-paper-scissors situation. What is the best way to handle this sort of problem? Is the only solution to build a big testing pool of different engines and see which version does best against the field? I'm used to doing SPRT testing in cutechess, and I am not sure how you could run a larger test like this with multiple engines in any reasonable amount of time. Has anybody encountered the same problem? How did you deal with it?
Clovis GitHub
-
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: What to do when relative strength between engine versions is inconsistent
Instead of verifying your new version in self-play, you can pit it against a gauntlet of different opponents, ideally other engines of similar strength.
-
- Posts: 58
- Joined: Wed Mar 18, 2020 10:00 pm
- Full name: Jonathan McDermid
Re: What to do when relative strength between engine versions is inconsistent
Is there a way to do that concurrently, like in SPRT testing? It would take a long time if not.
-
- Posts: 263
- Joined: Wed Jun 16, 2021 2:08 am
- Location: Berlin
- Full name: Jost Triller
Re: What to do when relative strength between engine versions is inconsistent
The script I use is
Code: Select all
cutechess-cli \
-recover \
-concurrency 30 \
-ratinginterval 50 \
-games 2 -rounds 400 \
-tournament gauntlet \
-pgnout loggedGames.pgn min \
-openings file=./Blitz_Testing_4moves.pgn order=random -repeat 2 \
-each tc=10+0.1 option.Hash=4 proto=uci dir=./ \
-engine cmd=./myNewVersion \
-engine cmd=./chessengines/frozenight-5.1.0-3097-uci \
-engine cmd=./chessengines/cheng4_0.41-3050-uci \
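... (more engines)
This plays games with "myNewVersion" against all the other listed engines. The maximum number of games can be adjusted by increasing or decreasing "-rounds". The number of concurrent games is set by "-concurrency".
(This specific script only works with UCI engines, but it can be modified slightly to also allow for WinBoard engines. Enter "cutechess-cli --help" to get more information on how to do that, or ask if you want to know.)
I usually look at the current ratings that are printed out every 50 games and decide manually whether it's worth continuing.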
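For a quick sanity check on the numbers when deciding whether to continue, the Elo difference implied by a win/draw/loss record can be estimated by hand. The sketch below assumes the standard logistic Elo model; the function name elo_diff is made up for illustration and is not part of cutechess-cli, which prints its own ratings at every rating interval.

```shell
#!/bin/sh
# Hypothetical helper (not part of cutechess-cli): estimate the Elo
# difference implied by a win/draw/loss record, assuming the standard
# logistic Elo model: elo = 400 * log10(score / (1 - score)).
elo_diff() {
  # usage: elo_diff WINS DRAWS LOSSES
  awk -v w="$1" -v d="$2" -v l="$3" 'BEGIN {
    n = w + d + l
    if (n == 0) { print "undefined"; exit }
    s = (w + 0.5 * d) / n              # score fraction, draws count half
    if (s <= 0 || s >= 1) { print "undefined"; exit }
    printf "%.1f\n", 400 * log(s / (1 - s)) / log(10)
  }'
}

elo_diff 75 0 25   # prints 190.8
```

This is only a point estimate. With a few hundred games the error margin is still large, so small differences between versions should not be over-interpreted.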
-
- Posts: 58
- Joined: Wed Mar 18, 2020 10:00 pm
- Full name: Jonathan McDermid
Re: What to do when relative strength between engine versions is inconsistent
Thanks a ton!
-
- Posts: 1396
- Joined: Wed Mar 08, 2006 10:15 pm
- Location: San Francisco, California
Re: What to do when relative strength between engine versions is inconsistent
I use Cutechess (NOT Cutechess-cli) to run my gauntlets against 12 opponents. It updates the results and relative elo after every game, and I just let it run overnight. Works great.
-
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: What to do when relative strength between engine versions is inconsistent
I'm curious, why not cutechess-cli? Would that not be much faster for testing purposes? Especially since you're not observing the games.
-
- Posts: 1396
- Joined: Wed Mar 08, 2006 10:15 pm
- Location: San Francisco, California
Re: What to do when relative strength between engine versions is inconsistent
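Heh, because cutechess worked fine, so I didn't bother checking out cli. I play four simultaneous one-minute games and there are almost no losses on time (less than 1%), so it probably doesn't matter. Also, sometimes I observe the games for a while, just for the fun/irritation of it.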
-
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: What to do when relative strength between engine versions is inconsistent
JVMerlino wrote: ↑Tue Jan 31, 2023 5:44 pm
Heh, because cutechess worked fine, so I didn't bother checking out cli. I play four simultaneous one-minute games and there are almost no losses on time (less than 1%), so it probably doesn't matter. Also, sometimes I observe the games for a while, just for the fun/irritation of it.

Ah, fair points.
-
- Posts: 93
- Joined: Sun Aug 08, 2021 9:14 pm
- Full name: Kurt Peters
Re: What to do when relative strength between engine versions is inconsistent
j.t. wrote: ↑Tue Jan 31, 2023 12:07 am
The script I use is
Code: Select all
cutechess-cli \
-recover \
-concurrency 30 \
-ratinginterval 50 \
-games 2 -rounds 400 \
-tournament gauntlet \
-pgnout loggedGames.pgn min \
-openings file=./Blitz_Testing_4moves.pgn order=random -repeat 2 \
-each tc=10+0.1 option.Hash=4 proto=uci dir=./ \
-engine cmd=./myNewVersion \
-engine cmd=./chessengines/frozenight-5.1.0-3097-uci \
-engine cmd=./chessengines/cheng4_0.41-3050-uci \
... (more engines)
This plays games with "myNewVersion" against all the other listed engines. The maximum number of games can be adjusted by increasing or decreasing "-rounds". The number of concurrent games is set by "-concurrency".
(This specific script only works with UCI engines, but it can be modified slightly to also allow for WinBoard engines. Enter "cutechess-cli --help" to get more information on how to do that, or ask if you want to know.)
I usually look at the current ratings that are printed out every 50 games and decide manually whether it's worth continuing.

"Concurrency 30" - Aren't the processes going to trip over each other fighting for compute time with so many running at once? I've always seen people say to set max concurrency to either the number of threads or the number of cores on the CPU. Which I guess is possible if you have some crazy CPU?
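For anyone who wants to follow the cores/threads rule of thumb rather than hard-coding a number, here is a minimal sketch that derives a "-concurrency" value from the machine's logical CPU count. The helper name pick_concurrency and the choice to leave one CPU free are assumptions for illustration, not a cutechess-cli recommendation.

```shell
#!/bin/sh
# Hypothetical helper: suggest a -concurrency value no larger than the
# number of logical CPUs. Leaving one CPU free for the OS and the
# tournament manager is an assumption of this sketch, not a fixed rule.
pick_concurrency() {
  # Optional argument overrides the detected CPU count.
  cpus=${1:-$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
  if [ "$cpus" -gt 1 ]; then
    echo $((cpus - 1))
  else
    echo 1
  fi
}

CONCURRENCY=$(pick_concurrency)
echo "suggested: -concurrency $CONCURRENCY"
```

The result can then be passed in place of the fixed value, e.g. "-concurrency $CONCURRENCY". If time losses start showing up in the results, that is a sign the value is still too high for the time control.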