TCEC Question

dkappe · Post by **dkappe** » Thu Jul 02, 2020 2:45 pm

Milos wrote: ↑Thu Jul 02, 2020 9:07 am
dkappe wrote: ↑Wed Jul 01, 2020 12:52 am Just to throw some more fuel on the fire, the GPU server was rebooted after 26 games because admins thought there might be something amiss. Before reboot, SF +4. After reboot: SF +1. Who knows.
Before reboot, SF +4. After reboot: SF +4. Fanboys of NN usually don't know.

Time passes, scores change. Even now.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 7:00 pm

The purple versus gold battle ran long enough for me to run out of patience after only 13,945 games.
So here we have it, purple is stronger than gold, with almost absolute certainty:
losses: 2928 wins: 3085 ties: 7932 LOS: 0.978549 Elo diff: 2.49297
Here are the games for anyone who would like to perform their own calculation:

So now we know, without a shadow of a doubt (due to the unflappable and unfailing accuracy of math, and knowing that 1.01 is bigger than one and all that) that Cfish is stronger than Cfish.

Sure glad that is settled.
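For anyone who wants to check the numbers, here is a minimal Python sketch that recomputes the LOS from the win/loss counts above using the usual normal approximation (draws ignored). The function names are just for illustration. It also prints the Elo difference implied by the overall score under a logistic model, which comes out at roughly 3.9 rather than the 2.49297 quoted above; the post does not say which formula produced 2.49297, so that figure is not reproduced here.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority, normal approximation; draws carry no information."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


def elo_from_score(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by the match score under the logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))


wins, losses, draws = 3085, 2928, 7932
print(f"LOS: {los(wins, losses):.6f}")
print(f"Elo diff (logistic): {elo_from_score(wins, losses, draws):.2f}")
```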

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 7:03 pm

Elostat output:

Code: Select all

  Program Elo    +   -   Games   Score   Av.Op.  Draws
1 purple : 3335    4   4 13945    50.6 %   3331   56.9 %
2 gold   : 3331    4   4 13945    49.4 %   3335   56.9 %

The main thing that is interesting about this output is that 3335 - 4 = 3331. Kind of an interesting symmetry.

brianr · Post by **brianr** » Thu Jul 02, 2020 7:11 pm

I have seen surprising fluctuations and reversals with engines swapping places in relative strength.
Here are the Ordo results of your match:

Code: Select all

D:\Cutechess-cli>ordo-win64.exe -Q -N 0 -D -a 0 -A "gold" -W -n4 -s500 -U "0,1,2,3,4,5,7,8,9,10,6" -p results.pgn
0   10   20   30   40   50   60   70   80   90   100 (%)
|----|----|----|----|----|----|----|----|----|----|
***************************************************

   # PLAYER    :  RATING  ERROR  POINTS  PLAYED   (%)     W     D     L  D(%)  CFS(%)
   1 purple    :       4      4  7051.0   13945  50.6  3085  7932  2928  56.9      98
   2 gold      :       0   ----  6894.0   13945  49.4  2928  7932  3085  56.9     ---

White advantage = 109.26 +/- 1.91
Draw rate (equal opponents) = 64.90 % +/- 0.46

The rating difference is still within the error range, so it is perhaps not conclusive, FWIW.
I am more comfortable with at least 99% CFS and often run to 100%.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 8:08 pm

Yes, obviously, I was joking about purple being stronger than gold, since it was the same binary.
And the Elo calculations do show that they have the same strength (both yours and mine)

Milos · Post by **Milos** » Thu Jul 02, 2020 10:24 pm

Dann Corbit wrote: ↑Thu Jul 02, 2020 7:00 pm The purple versus gold battle ran long enough for me to run out of patience after only 13,945 games.
So here we have it, purple is stronger than gold, with almost absolute certainty:
losses: 2928 wins: 3085 ties: 7932 LOS: 0.978549 Elo diff: 2.49297
Here are the games for anyone who would like to perform their own calculation:

So now we know, without a shadow of a doubt (due to the unflappable and unfailing accuracy of math, and knowing that 1.01 is bigger than one and all that) that Cfish is stronger than Cfish.

Sure glad that is settled.

Considering that the 1-sigma error margin is 1.95 Elo and the Elo difference is 3.95 Elo, which is slightly over the 2-sigma error margin, it is pretty reasonable to assume that you have some systematic bias. For example, scheduling on your machine might not give the same amount of CPU time to both copies of the engine, or the memory available to copy 2 might have been much more fragmented than the memory available to copy 1.
Instead of stupidly blaming math, better debug your machine.
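The sigma arithmetic can be sketched as follows. This is a rough check, assuming the games are independent and using the delta method to convert the standard error of the score into Elo; the function name is hypothetical. With the posted counts it gives a 1-sigma margin a little under 2 Elo and a difference of about two sigma, in line with the figures above.

```python
import math


def elo_diff_and_sigma(wins: int, losses: int, draws: int) -> tuple[float, float]:
    """Return (Elo difference, 1-sigma error in Elo) for a two-player match."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (wins + 0.25 * draws) / games - score ** 2
    score_se = math.sqrt(variance / games)
    elo = 400.0 * math.log10(score / (1.0 - score))
    # Delta method: d(Elo)/d(score) = 400 / (ln(10) * score * (1 - score)).
    elo_se = score_se * 400.0 / (math.log(10.0) * score * (1.0 - score))
    return elo, elo_se


elo, sigma = elo_diff_and_sigma(3085, 2928, 7932)
print(f"Elo diff: {elo:.2f}  1-sigma: {sigma:.2f}  ratio: {elo / sigma:.2f}")
```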

brianr · Post by **brianr** » Thu Jul 02, 2020 10:49 pm

Dann Corbit wrote: ↑Thu Jul 02, 2020 8:08 pm Yes, obviously, I was joking about purple being stronger than gold, since it was the same binary.
And the Elo calculations do show that they have the same strength (both yours and mine)

I understand they are the same.
I also run "sanity check" matches (hopefully identical engines) from time to time to validate my methodology.

Here is another sort of example with a very small number of games (instead of a very large number), of course with different engines. Notice how the CFS of even 100% is ... I'm not sure what it is with the math, but obviously it is an extremely small sample size.

Code: Select all

   # PLAYER              :  RATING  ERROR  POINTS  PLAYED   (%)    W    D    L  D(%)  CFS(%)
   1 BPR-128x10-0006000  :     345    141     2.0       3  66.7    1    2    0  66.7     100
   2 T1060               :       0   ----     1.0       3  33.3    0    2    1  66.7     ---

   # PLAYER              :  RATING  ERROR  POINTS  PLAYED   (%)    W    D    L  D(%)  CFS(%)
   1 BPR-128x10-0006000  :       0    780     2.0       4  50.0    1    2    1  50.0      50
   2 T1060               :       0   ----     2.0       4  50.0    1    2    1  50.0     ---
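To see how little a handful of games can say, here is a small sketch of an exact alternative to the normal-approximation LOS: the posterior probability that one side wins more than half of the decisive games, assuming a uniform prior on that win probability (an assumption made for this illustration). It is not meant to reproduce Ordo's CFS column, whose computation the thread does not describe.

```python
from math import comb


def exact_los(wins: int, losses: int) -> float:
    """P(win probability > 0.5) given decisive games only, uniform prior.

    Uses the identity that the Beta(wins + 1, losses + 1) tail above 0.5 equals
    P(X <= wins) for X ~ Binomial(wins + losses + 1, 0.5).
    """
    n = wins + losses + 1
    favourable = sum(comb(n, k) for k in range(wins + 1))
    return favourable / 2 ** n


print(exact_los(1, 0))  # first table: 1 win, 0 losses -> 0.75
print(exact_los(1, 1))  # second table: 1 win, 1 loss -> 0.5
```

By this measure, one decisive game only moves the probability from 0.5 to 0.75, far short of the 100% shown in the first table.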
