Time passes, scores change. Even now.
TCEC Question
Moderators: hgm, Rebel, chrisw
-
- Posts: 1631
- Joined: Tue Aug 21, 2018 7:52 pm
- Full name: Dietrich Kappe
Re: TCEC Question
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: TCEC Question
The purple versus gold battle ran long enough for me to run out of patience after only 13,945 games.
So here we have it, purple is stronger than gold, with almost absolute certainty:
losses: 2928 wins: 3085 ties: 7932 LOS: 0.978549 Elo diff: 2.49297
Here are the games for anyone who would like to perform their own calculation:
So now we know, without a shadow of a doubt (due to the unflappable and unfailing accuracy of math, and knowing that 1.01 is bigger than one and all that) that Cfish is stronger than Cfish.
Sure glad that is settled.
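For anyone who wants to redo the calculation from the win/loss/draw counts above, here is a minimal sketch. It assumes the usual normal-approximation formula for LOS (draws ignored) and the standard logistic Elo conversion; the function names `los` and `elo_diff` are made up for this example and are not taken from any particular tool.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only (normal approximation)."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


def elo_diff(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by the score fraction under the logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    w, l, d = 3085, 2928, 7932  # counts quoted in the post
    print(f"LOS: {los(w, l):.4f}")
    print(f"Elo diff: {elo_diff(w, l, d):.2f}")
```

With the counts from the post this gives an LOS of about 0.9785, matching the figure above. The logistic conversion gives roughly 3.9 Elo, in line with the Elostat and Ordo numbers further down the thread; the 2.49 figure presumably comes from a different formula that is not shown here.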
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: TCEC Question
Elostat output:
The main thing that is interesting about this output is that 3335 - 4 = 3331. Kind of an interesting symmetry.
Code: Select all
   Program    Elo    +   -   Games   Score   Av.Op.  Draws
1  purple  :  3335   4   4   13945   50.6 %  3331    56.9 %
2  gold    :  3331   4   4   13945   49.4 %  3335    56.9 %
-
- Posts: 536
- Joined: Thu Mar 09, 2006 3:01 pm
Re: TCEC Question
I have seen surprising fluctuations and reversals with engines swapping places in relative strength.
The rating difference is still within the error range, so it is perhaps not conclusive, FWIW.
Here are the Ordo results of your match:
Code: Select all
D:\Cutechess-cli>ordo-win64.exe -Q -N 0 -D -a 0 -A "gold" -W -n4 -s500 -U "0,1,2,3,4,5,7,8,9,10,6" -p results.pgn
0   10   20   30   40   50   60   70   80   90   100 (%)
|----|----|----|----|----|----|----|----|----|----|
***************************************************
   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)     W     D     L  D(%)  CFS(%)
   1 purple    :       4      4  7051.0   13945  50.6  3085  7932  2928  56.9      98
   2 gold      :       0   ----  6894.0   13945  49.4  2928  7932  3085  56.9     ---
White advantage = 109.26 +/- 1.91
Draw rate (equal opponents) = 64.90 % +/- 0.46
I am more comfortable with at least 99% CFS and often run to 100%.
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: TCEC Question
Yes, obviously, I was joking about purple being stronger than gold, since it was the same binary.
And the Elo calculations do show that they have the same strength (both yours and mine)
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: TCEC Question
Dann Corbit wrote: ↑Thu Jul 02, 2020 7:00 pm
The purple versus gold battle ran long enough for me to run out of patience after only 13,945 games.
So here we have it, purple is stronger than gold, with almost absolute certainty:
losses: 2928 wins: 3085 ties: 7932 LOS: 0.978549 Elo diff: 2.49297
Here are the games for anyone who would like to perform their own calculation:
So now we know, without a shadow of a doubt (due to the unflappable and unfailing accuracy of math, and knowing that 1.01 is bigger than one and all that) that Cfish is stronger than Cfish.
Sure glad that is settled.
Considering that the 1-sigma error margin is 1.95 Elo and the Elo difference is 3.95 Elo, which is slightly over 2 sigma, it is pretty reasonable to assume that you have some systematic bias. For example, scheduling on your machine may not give the same amount of CPU time to both copies of the engine, or the memory available to copy 2 may have been much more fragmented than the memory available to copy 1.
Instead of stupidly blaming math, better debug your machine.
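The "slightly over 2 sigma" estimate can be reproduced roughly as follows. This is only a sketch: it assumes the games are independent, takes the per-game variance straight from the observed win/draw/loss frequencies, and converts the standard error of the score to Elo with a first-order (delta method) approximation. The function name `elo_with_sigma` is hypothetical.

```python
import math


def elo_with_sigma(wins: int, losses: int, draws: int) -> tuple[float, float]:
    """Return (Elo difference, 1-sigma error in Elo) for a single match."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games

    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * score ** 2
    ) / games
    se_score = math.sqrt(variance / games)

    # Logistic Elo model and its slope d(Elo)/d(score) at the observed score.
    elo = 400.0 * math.log10(score / (1.0 - score))
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return elo, slope * se_score


if __name__ == "__main__":
    elo, sigma = elo_with_sigma(3085, 2928, 7932)
    print(f"Elo diff: {elo:.2f}, 1 sigma: {sigma:.2f}, ratio: {elo / sigma:.2f}")
```

For the match above this gives a difference of about 3.9 Elo with a 1-sigma error of about 1.9 Elo, so the ratio is a little over 2. The calculation only says how large the deviation is; it cannot tell a systematic bias apart from an unlucky run.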
-
- Posts: 536
- Joined: Thu Mar 09, 2006 3:01 pm
Re: TCEC Question
Dann Corbit wrote: ↑Thu Jul 02, 2020 8:08 pm
Yes, obviously, I was joking about purple being stronger than gold, since it was the same binary.
And the Elo calculations do show that they have the same strength (both yours and mine)
I understand they are the same.
I also run "sanity check" matches (hopefully identical engines) from time to time to validate my methodology.
Here is another sort of example with a very small number of games (instead of a very large number), of course with different engines. Notice how the CFS of even 100% is ... I'm not sure what it is with the math, but obviously it is an extremely small sample size.
Code: Select all
   # PLAYER                :  RATING  ERROR  POINTS  PLAYED   (%)  W  D  L  D(%)  CFS(%)
   1 BPR-128x10-0006000    :     345    141     2.0       3  66.7  1  2  0  66.7     100
   2 T1060                 :       0   ----     1.0       3  33.3  0  2  1  66.7     ---
   # PLAYER                :  RATING  ERROR  POINTS  PLAYED   (%)  W  D  L  D(%)  CFS(%)
   1 BPR-128x10-0006000    :       0    780     2.0       4  50.0  1  2  1  50.0      50
   2 T1060                 :       0   ----     2.0       4  50.0  1  2  1  50.0     ---
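On the "sanity check" idea with identical engines, a quick simulation gives a feel for how often pure chance produces a gap like the one in the 13,945-game match. This is a rough sketch under simplifying assumptions: the number of decisive games is fixed at the observed 6013 (3085 + 2928), each decisive game is a fair coin flip, and colour effects are ignored. The function name `chance_of_gap` and its parameters are invented for this example.

```python
import random


def chance_of_gap(decisive_games: int, observed_gap: int,
                  trials: int = 20000, seed: int = 1) -> float:
    """Fraction of simulated equal-strength matches whose |wins - losses|
    is at least as large as the observed gap."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Each random bit is one decisive game: 1 = win for the first engine.
        wins = bin(rng.getrandbits(decisive_games)).count("1")
        losses = decisive_games - wins
        if abs(wins - losses) >= observed_gap:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    p = chance_of_gap(decisive_games=3085 + 2928, observed_gap=3085 - 2928)
    print(f"Share of equal-strength matches with a gap this large: {p:.3f}")
```

Under these assumptions a few percent of matches between truly identical engines end with a gap of 157 or more decisive games in one direction or the other, so a single result like this is unusual but not impossible.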