Basic Questions About CuteChess
Moderator: Ras
-
- Posts: 253
- Joined: Mon Aug 26, 2019 4:34 pm
- Location: Clearwater, Florida USA
- Full name: JoAnn Peeler
Re: Basic Questions About CuteChess
Another question about cutechess-cli. Do any of you use this tool to test whether a new version of your engine is stronger than the prior version? I see the command-line switch "-sprt" which seems like something that would be used for this. Do you use this and if so, what parameters do you use?
-
- Posts: 253
- Joined: Mon Aug 26, 2019 4:34 pm
- Location: Clearwater, Florida USA
- Full name: JoAnn Peeler
Re: Basic Questions About CuteChess
Here is another question I have on cutechess... I just finished running a 100-game round-robin between two versions of my engine to verify whether my latest eval tuning has had some benefit and here are the results:
Score of paragon vs test: 10 - 20 - 70 [0.450] 100
... paragon playing White: 5 - 8 - 37 [0.470] 50
... paragon playing Black: 5 - 12 - 33 [0.430] 50
... White vs Black: 17 - 13 - 70 [0.520] 100
Elo difference: -34.9 +/- 37.1, LOS: 3.4 %, DrawRatio: 70.0 %
Finished match
Am I reading this correctly -- that cutechess-cli believes that the "test" version of Pedantic (i.e. the tuned version) is essentially 35 Elo stronger (+/- 37.1)? The +/- 37.1 doesn't give me a lot of confidence. Should I have run this match for more games to get a more accurate estimate? Also, what does "LOS" stand for?
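As a rough cross-check of the output above, the Elo difference and draw ratio can be recomputed from the win/loss/draw counts alone. This is only a sketch using the standard logistic Elo model; the function names are made up for illustration, and it is an assumption (not something confirmed in this thread) that cutechess-cli uses exactly this formula, although the numbers agree.

```python
import math


def elo_difference(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by a match score, from the first engine's point of view.

    Assumes the logistic model: score = 1 / (1 + 10 ** (-diff / 400)).
    Not defined for a score of exactly 0 or 1.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # 0.450 for 10 - 20 - 70
    return -400.0 * math.log10(1.0 / score - 1.0)


def draw_ratio(wins: int, losses: int, draws: int) -> float:
    """Fraction of games that ended in a draw."""
    return draws / (wins + losses + draws)


# "Score of paragon vs test: 10 - 20 - 70" (wins - losses - draws for paragon)
wins, losses, draws = 10, 20, 70
print(f"Elo difference: {elo_difference(wins, losses, draws):+.1f}")  # -34.9
print(f"DrawRatio: {draw_ratio(wins, losses, draws):.1%}")            # 70.0%
```

The negative sign is from the point of view of the first engine listed (paragon), so the same result read from the other side is test at roughly +35.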
-
- Posts: 2103
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Basic Questions About CuteChess
Hello:
Sure, more games mean narrower error bars. There are lots of topics about that at TalkChess, but the rough idea is that their width is proportional to 1/sqrt(games). Play 400 games instead of 100 and CuteChess should report around ±18 or ±19.
LOS means 'likelihood of superiority' and there are tons of topics about that at TalkChess. The rough idea is that it is a one-tailed test in a normal distribution using z-score. More games mean narrower error bars and usually more extreme LOS values, either near 0% or 100%.
CPW can enlighten you more:
https://www.chessprogramming.org/Match_Statistics
The basic idea is: play more games.
Regarding 'who is better': I think that CuteChess reports the result with the POV of the first engine, so 'paragon vs test' and -35 should be understood as paragon is weaker (35 Elo weaker) than test, or logically test is 35 Elo stronger than paragon, as you said.
Regards from Spain.
Ajedrecista.
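To make the two rules of thumb from this post concrete, here is a small sketch of the LOS calculation and of the 1/sqrt(games) scaling. The LOS formula is the usual normal approximation based only on wins and losses (draws drop out); it is an assumption that this is what cutechess-cli computes internally, though it does give the reported 3.4 % for this match. The function names are hypothetical.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority of the first engine (normal approximation).

    One-tailed test on z = (wins - losses) / sqrt(wins + losses).
    """
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def scaled_error(error: float, games: int, new_games: int) -> float:
    """Project an error bar to a different match length.

    Assumes the width scales as 1/sqrt(games) and that the score and
    draw ratio stay about the same in the longer match.
    """
    return error * math.sqrt(games / new_games)


# paragon vs test: 10 wins, 20 losses (the 70 draws do not enter the LOS)
print(f"LOS: {los(10, 20):.1%}")  # 3.4%

# +/- 37.1 after 100 games, projected to 400 games
print(f"projected error at 400 games: +/- {scaled_error(37.1, 100, 400):.2f}")
```

Quadrupling the number of games only halves the error bar, which is why small Elo differences need thousands of games to resolve.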
-
- Posts: 253
- Joined: Mon Aug 26, 2019 4:34 pm
- Location: Clearwater, Florida USA
- Full name: JoAnn Peeler
Re: Basic Questions About CuteChess
Thank you very much. Other than Elo, I had no clue that there was any standardization on chess statistics. Now I know.