Basic Questions About CuteChess

Discussion of chess software programming and technical issues.

Moderator: Ras

JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Basic Questions About CuteChess

Post by JoAnnP38 »

Another question about cutechess-cli. Do any of you use this tool to test whether a new version of your engine is stronger than the prior version? I see the command-line switch "-sprt" which seems like something that would be used for this. Do you use this and if so, what parameters do you use?

Re: Basic Questions About CuteChess

Post by JoAnnP38 »

Here is another question I have on cutechess... I just finished running a 100-game round-robin between two versions of my engine to verify whether my latest eval tuning has had some benefit, and here are the results:

Score of paragon vs test: 10 - 20 - 70  [0.450] 100
...      paragon playing White: 5 - 8 - 37  [0.470] 50
...      paragon playing Black: 5 - 12 - 33  [0.430] 50
...      White vs Black: 17 - 13 - 70  [0.520] 100
Elo difference: -34.9 +/- 37.1, LOS: 3.4 %, DrawRatio: 70.0 %
Finished match

Am I reading this correctly -- that cutechess-cli believes that the "test" version of Pedantic (i.e. the tuned version) is roughly 35 Elo stronger (+/- 37.1)? The +/- 37.1 doesn't give me a lot of confidence. Should I have run this match for more games to get a more accurate estimate? Also, what does "LOS" stand for?
Ajedrecista
Posts: 2103
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Basic Questions About CuteChess

Post by Ajedrecista »

Hello:
Sure, more games mean narrower error bars. There are lots of topics about that at TalkChess, but the rough idea is that their width is proportional to 1/sqrt(games). Play 400 games instead of 100 and CuteChess should report about half the current margin, around ±18 or ±19.
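
As a rough illustration of that 1/sqrt(games) rule, here is a minimal Python sketch. The function names are made up for this example, and it assumes the draw ratio and the strength gap stay about the same as the match grows, so treat the numbers as ballpark estimates only.

```python
import math


def scaled_margin(margin, games_before, games_after):
    """Rescale an error margin, assuming its width is proportional to 1/sqrt(games)."""
    return margin * math.sqrt(games_before / games_after)


def games_needed(margin, games, target_margin):
    """Estimate how many games would shrink `margin` down to `target_margin`."""
    return math.ceil(games * (margin / target_margin) ** 2)


# The 100-game match above reported +/- 37.1 Elo.
print(scaled_margin(37.1, 100, 400))  # four times the games, half the margin
print(games_needed(37.1, 100, 10.0))  # games for roughly +/- 10 Elo
```

With the figures from this match, the second call suggests well over a thousand games for a margin near ±10.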

LOS means 'likelihood of superiority' and there are tons of topics about that at TalkChess. The rough idea is that it is a one-tailed test in a normal distribution using z-score. More games mean narrower error bars and usually more extreme LOS values, either near 0% or 100%.
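
A small Python sketch of that one-tailed test, using the formula commonly quoted for LOS, which depends only on wins and losses and ignores draws. The function name is an assumption for this example, and whether CuteChess computes it exactly this way is also an assumption, although the result agrees with the 3.4 % reported above.

```python
import math


def los(wins, losses):
    """Likelihood of superiority from a normal approximation (draws are ignored)."""
    if wins + losses == 0:
        return 0.5  # no decisive games, so no evidence either way
    z = (wins - losses) / math.sqrt(wins + losses)
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# paragon vs test: 10 wins, 20 losses (the 70 draws do not enter the formula)
print(f"{100.0 * los(10, 20):.1f} %")
```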

CPW can enlighten you more:

https://www.chessprogramming.org/Match_Statistics

The basic idea is: play more games.

Regarding 'who is better': I think that CuteChess reports the result from the point of view of the first engine, so for 'paragon vs test' a value of -35 means that paragon is about 35 Elo weaker than test, or equivalently that test is about 35 Elo stronger than paragon, as you said.
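
To make the sign convention concrete, here is a hedged Python sketch that turns a win/loss/draw record into an Elo difference from the first engine's point of view using the usual logistic formula. The function names are invented for this example, and the error margin is a plain normal approximation on the score with a 1.96 multiplier, so it may differ slightly from what CuteChess prints.

```python
import math


def elo_from_score(score):
    """Convert an expected score (strictly between 0 and 1) to an Elo difference."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins, losses, draws):
    """Return (Elo difference, approximate 95% margin) for the first engine."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0 or 0.5) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    half_width = 1.96 * math.sqrt(var / n)
    low = elo_from_score(score - half_width)
    high = elo_from_score(score + half_width)
    return elo_from_score(score), (high - low) / 2.0


# paragon (first engine) scored 10 wins, 20 losses, 70 draws against test
diff, margin = match_elo(10, 20, 70)
print(f"{diff:+.1f} +/- {margin:.1f}")
```

Swapping wins and losses flips the sign, which is the same match seen from test's side.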

Regards from Spain.

Ajedrecista.

Re: Basic Questions About CuteChess

Post by JoAnnP38 »

Thank you very much. Other than Elo, I had no clue that there was any standardization on chess statistics. Now I know.