Basic Questions About CuteChess

JoAnnP38 · Post by **JoAnnP38** » Mon Feb 13, 2023 4:50 am

Another question about cutechess-cli. Do any of you use this tool to test whether a new version of your engine is stronger than the prior version? I see the command-line switch "-sprt" which seems like something that would be used for this. Do you use this and if so, what parameters do you use?

JoAnnP38 · Post by **JoAnnP38** » Sun Feb 26, 2023 1:49 pm

Here is another question I have on cutechess... I just finished running a 100-game round-robin between two versions of my engine to verify whether my latest eval tuning has had some benefit, and here are the results:

Score of paragon vs test: 10 - 20 - 70  [0.450] 100
...      paragon playing White: 5 - 8 - 37  [0.470] 50
...      paragon playing Black: 5 - 12 - 33  [0.430] 50
...      White vs Black: 17 - 13 - 70  [0.520] 100
Elo difference: -34.9 +/- 37.1, LOS: 3.4 %, DrawRatio: 70.0 %
Finished match

Am I reading this correctly -- that cutechess-cli believes that the "test" version of Pedantic (i.e. the tuned version) is essentially 35 Elo stronger (+/- 37.1)? The +/- 37.1 doesn't give me a lot of confidence. Should I have run this match for more games to get a more accurate estimate? Also, what does "LOS" stand for?

Ajedrecista · Post by **Ajedrecista** » Sun Feb 26, 2023 2:14 pm

Hello:

Sure, more games mean narrower error bars. There are lots of topics about that at TalkChess, but the rough idea is that their width is proportional to 1/sqrt(games). Play 400 games instead of 100 (four times as many, so half the width) and CuteChess should report roughly ±18 to ±19.
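To make the 1/sqrt(games) scaling concrete, here is a minimal Python sketch. It assumes the error bar is a 95% normal-approximation interval on the score fraction, converted to Elo with the logistic formula; whether cutechess-cli computes it in exactly this way is an assumption, and the function names are made up for illustration. The 400-game result is hypothetical: it simply keeps the same win/loss/draw proportions as the 100-game match.

```python
import math
from statistics import NormalDist


def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_error_margin(wins, losses, draws, confidence=0.95):
    """Half-width of the Elo interval under a normal approximation (assumed method)."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + d / 2.0  # mean score per game
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = w * (1.0 - mu) ** 2 + l * (0.0 - mu) ** 2 + d * (0.5 - mu) ** 2
    se = math.sqrt(var / n)  # standard error shrinks like 1/sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return (elo_from_score(mu + z * se) - elo_from_score(mu - z * se)) / 2.0


margin_100 = elo_error_margin(10, 20, 70)    # the match posted above
margin_400 = elo_error_margin(40, 80, 280)   # hypothetical: same ratios, 4x games

print(f"100 games: +/- {margin_100:.1f} Elo")
print(f"400 games: +/- {margin_400:.1f} Elo")
```

Under these assumptions the 100-game margin lands close to the reported ±37.1, and quadrupling the games roughly halves it.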

LOS means 'likelihood of superiority' and there are tons of topics about that at TalkChess. The rough idea is that it is a one-tailed test on a normal distribution using the z-score. More games mean narrower error bars and usually more extreme LOS values, either near 0% or 100%.
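As a rough illustration of that one-tailed test, the sketch below uses the formula given on the CPW page, which depends only on wins and losses (draws drop out). That cutechess-cli uses this exact formula is an assumption, though it does reproduce the 3.4 % from the match above. The function name is hypothetical.

```python
import math


def likelihood_of_superiority(wins, losses):
    """One-tailed normal test: probability that the first engine is the stronger one."""
    # z-score of the win/loss difference; draws carry no information here.
    z = (wins - losses) / math.sqrt(wins + losses)
    # Normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


los_paragon = likelihood_of_superiority(10, 20)  # paragon's point of view
los_test = likelihood_of_superiority(20, 10)     # same match from test's point of view

print(f"LOS (paragon): {los_paragon:.1%}")
print(f"LOS (test):    {los_test:.1%}")
```

The two values sum to 100%, so a LOS of 3.4 % for paragon is the same statement as about 96.6 % for test.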

CPW can enlighten you more:

https://www.chessprogramming.org/Match_Statistics

The basic idea is: play more games.

Regarding 'who is better': I think that CuteChess reports the result with the POV of the first engine, so 'paragon vs test' and -35 should be understood as paragon is weaker (35 Elo weaker) than test, or logically test is 35 Elo stronger than paragon, as you said.
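The sign convention can be checked with a small Python sketch, assuming the usual logistic Elo formula applied to the score fraction (names here are illustrative, not taken from CuteChess). Computed from paragon's point of view the result is negative; swapping wins and losses gives the same magnitude with the opposite sign.

```python
import math


def elo_difference(wins, losses, draws):
    """Elo difference from the point of view of the engine whose wins are counted first."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # 0.450 for paragon in the match above
    return -400.0 * math.log10(1.0 / score - 1.0)


elo_paragon = elo_difference(10, 20, 70)  # paragon vs test
elo_test = elo_difference(20, 10, 70)     # test vs paragon

print(f"paragon vs test: {elo_paragon:+.1f}")
print(f"test vs paragon: {elo_test:+.1f}")
```

With the posted result this gives about -34.9 for paragon and +34.9 for test, matching the reading that test is the stronger version.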

Regards from Spain.

Ajedrecista.

JoAnnP38 · Post by **JoAnnP38** » Sun Feb 26, 2023 2:58 pm

Thank you very much. Other than Elo, I had no clue that there was any standardization on chess statistics. Now I know.
