TCEC Question


dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: TCEC Question

Post by dkappe »

Milos wrote: Thu Jul 02, 2020 9:07 am
dkappe wrote: Wed Jul 01, 2020 12:52 am Just to throw some more fuel on the fire, the GPU server was rebooted after 26 games because admins thought there might be something amiss. Before reboot, SF +4. After reboot: SF +1. Who knows.
Before reboot, SF +4. After reboot: SF +4. Fanboys of NN usually don't know. :lol:
Time passes, scores change. Even now. :-)
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: TCEC Question

Post by Dann Corbit »

The purple versus gold battle ran long enough for me to run out of patience after only 13,945 games.
So here we have it, purple is stronger than gold, with almost absolute certainty:
losses: 2928 wins: 3085 ties: 7932 LOS: 0.978549 Elo diff: 2.49297
Here are the games for anyone who would like to perform their own calculation:


So now we know, without a shadow of a doubt (due to the unflappable and unfailing accuracy of math, and knowing that 1.01 is bigger than one and all that) that Cfish is stronger than Cfish.

Sure glad that is settled.
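For anyone redoing the calculation by hand, here is a minimal sketch of how the two figures can be derived from the win/loss/draw counts. It assumes the usual normal-approximation formula for LOS (draws drop out) and the logistic Elo model for the difference; the function names are made up for illustration. Under those assumptions the LOS agrees with the 0.978549 quoted above, while the Elo difference comes out near 3.9, close to the Elostat and Ordo ratings rather than the 2.49297 figure, which evidently comes from a different model that this sketch does not try to reproduce.

```python
import math

def elo_diff(wins, losses, draws):
    """Elo difference implied by the score fraction (logistic model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))

def los(wins, losses):
    """Likelihood of superiority, normal approximation; draws are ignored."""
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

wins, losses, draws = 3085, 2928, 7932
print(f"LOS: {los(wins, losses):.6f}")
print(f"Elo diff: {elo_diff(wins, losses, draws):.2f}")
```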
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: TCEC Question

Post by Dann Corbit »

Elostat output:

Code:

  Program   Elo    +    -   Games   Score   Av.Op.   Draws
1 purple : 3335    4    4   13945   50.6 %    3331   56.9 %
2 gold   : 3331    4    4   13945   49.4 %    3335   56.9 %
The main thing that is interesting about this output is that 3335 - 4 = 3331. Kind of an interesting symmetry.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: TCEC Question

Post by brianr »

I have seen surprising fluctuations and reversals with engines swapping places in relative strength.
Here are the Ordo results of your match:

Code:

D:\Cutechess-cli>ordo-win64.exe -Q -N 0 -D -a 0 -A "gold" -W -n4 -s500 -U "0,1,2,3,4,5,7,8,9,10,6" -p results.pgn
0   10   20   30   40   50   60   70   80   90   100 (%)
|----|----|----|----|----|----|----|----|----|----|
***************************************************

   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)     W     D     L  D(%)  CFS(%)
   1 purple    :       4      4  7051.0   13945  50.6  3085  7932  2928  56.9      98
   2 gold      :       0   ----  6894.0   13945  49.4  2928  7932  3085  56.9     ---

White advantage = 109.26 +/- 1.91
Draw rate (equal opponents) = 64.90 % +/- 0.46
The rating difference is still within the error range, so it is perhaps not conclusive, FWIW.
I am more comfortable with at least 99% CFS and often run to 100%.
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: TCEC Question

Post by Dann Corbit »

Yes, obviously, I was joking about purple being stronger than gold, since it was the same binary.
And the Elo calculations do show that they have the same strength (both yours and mine)
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: TCEC Question

Post by Milos »

Dann Corbit wrote: Thu Jul 02, 2020 7:00 pm The purple versus gold battle ran long enough for me to run out of patience after only 13,945 games.
So here we have it, purple is stronger than gold, with almost absolute certainty:
losses: 2928 wins: 3085 ties: 7932 LOS: 0.978549 Elo diff: 2.49297
Here are the games for anyone who would like to perform their own calculation:


So now we know, without a shadow of a doubt (due to the unflappable and unfailing accuracy of math, and knowing that 1.01 is bigger than one and all that) that Cfish is stronger than Cfish.

Sure glad that is settled.
Considering that the 1-sigma error margin is 1.95 Elo and the Elo difference is 3.95 Elo, which is slightly over the 2-sigma error margin, it is pretty reasonable to assume that you have some systematic bias. For example, the scheduler on your machine may not give the same amount of CPU time to both copies of the engine, or the memory available to copy 2 may have been much more fragmented than the memory available to copy 1.
Instead of stupidly blaming math, better debug your machine.
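The sigma figure can be checked with a short sketch. It assumes each game is an independent draw from a fixed win/draw/loss distribution and converts the standard error of the score into Elo using the slope of the logistic curve (a delta-method approximation); the function names are illustrative only. With the match counts from this thread it lands close to the quoted numbers, and the ratio comes out just above 2.

```python
import math

def elo_diff(wins, losses, draws):
    """Elo difference implied by the score fraction (logistic model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))

def elo_standard_error(wins, losses, draws):
    """Approximate 1-sigma error of the Elo difference."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (wins + 0.25 * draws) / games - score ** 2
    score_se = math.sqrt(variance / games)
    # Slope of the logistic Elo curve at the observed score.
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return slope * score_se

wins, losses, draws = 3085, 2928, 7932
diff = elo_diff(wins, losses, draws)
sigma = elo_standard_error(wins, losses, draws)
print(f"Elo diff: {diff:.2f}, 1-sigma: {sigma:.2f}, ratio: {diff / sigma:.2f}")
```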
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: TCEC Question

Post by brianr »

Dann Corbit wrote: Thu Jul 02, 2020 8:08 pm Yes, obviously, I was joking about purple being stronger than gold, since it was the same binary.
And the Elo calculations do show that they have the same strength (both yours and mine)
I understand they are the same.
I also run "sanity check" matches (hopefully identical engines) from time to time to validate my methodology.

Here is another sort of example with a very small number of games (instead of a very large number), of course with different engines. Notice how the CFS of even 100% is ... I'm not sure what it is with the math, but obviously it is an extremely small sample size.

Code:

   # PLAYER              :  RATING  ERROR  POINTS  PLAYED   (%)    W    D    L  D(%)  CFS(%)
   1 BPR-128x10-0006000  :     345    141     2.0       3  66.7    1    2    0  66.7     100
   2 T1060               :       0   ----     1.0       3  33.3    0    2    1  66.7     ---

   # PLAYER              :  RATING  ERROR  POINTS  PLAYED   (%)    W    D    L  D(%)  CFS(%)
   1 BPR-128x10-0006000  :       0    780     2.0       4  50.0    1    2    1  50.0      50
   2 T1060               :       0   ----     2.0       4  50.0    1    2    1  50.0     ---
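One way to see how little three or four games can say is to compute the superiority probability exactly instead of relying on a large-sample approximation. The sketch below is not how Ordo computes CFS; it is an alternative that assumes a uniform prior on the win probability over decisive games and ignores draws, and `los_exact` is a hypothetical helper name. For integer counts the resulting Beta tail probability reduces to a binomial sum, so only the standard library is needed.

```python
from math import comb

def los_exact(wins, losses):
    """P(p > 0.5) for the decisive-game win probability p, uniform prior.

    The posterior is Beta(wins + 1, losses + 1); its upper tail at 0.5
    equals P(Binomial(wins + losses + 1, 0.5) <= wins).
    """
    n = wins + losses + 1
    return sum(comb(n, k) for k in range(wins + 1)) / 2 ** n

# The two tiny matches above, then the 13,945-game match for comparison.
for w, l in [(1, 0), (1, 1), (3085, 2928)]:
    print(f"wins={w} losses={l} LOS={los_exact(w, l):.4f}")
```

Under these assumptions a single decisive win gives 0.75, which is a long way from certainty.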