Stockfish seems definitely the strongest engine

Uri Blass · Post by **Uri Blass** » Thu Jan 23, 2014 10:53 am

Laskos wrote:
ouachita wrote:There are a lot of one core tests posted here, so I wanted to post these results to again highlight the point that one core results are not related to multi-core results, in this case, 12 cores:
Code: Select all
1-22-14

SF0901014IP-12 core v SF 080114-1 core

1+1
50 positions, alternating colors
defaults
# of cores is sole setting difference.
                                        
1   Stockfish 090114IP 64 SSE4.2  +205  +53/=47/-0 76.50%   76.5/100
2   Stockfish 080114 64 SSE4.2    -205  +0/=47/-53 23.50%   23.5/100
Also, I misspoke by saying 12 cores win >90%. Here, 12 cores scored 76.5, but had 100% of wins.

Food for thought.
150-200 Elo points from 1 to 12 cores at 1'+1'' are to be expected, but I don't agree that 1-core results are unrelated to 12-core results. The MP scaling of top engines is comparable. In your place I would play 10'+10'' games on one core or on several cores SF against Houdini Contempt=0 until a SPRT stop, to dispel some myths (that SF does not scale better, for example). There were many 1-core results, but none of them had LOS of 98% SF against Houdini 4 Contempt=0, and that happens at somewhat larger TC than blitz. I will now wait for a SPRT stop in Cutechess-Cli to show that SF overtook Houdini (if that is the case).

Your assumption that the MP scaling of top engines is comparable has not been proved, so I don't think you can conclude that 12 cores at 1+1 is the same as 1 core at 10+10.

Laskos · Post by **Laskos** » Thu Jan 23, 2014 12:51 pm

Uri Blass wrote: Your assumption that the MP scaling of top engines is comparable has not been proved, so I don't think you can conclude that 12 cores at 1+1 is the same as 1 core at 10+10.

It is a reasonable assumption. The MP scaling of Stockfish and Houdini (not Komodo, though) can be measured with time-to-depth tests on, say, 100 positions at a given approximate time control (the result is time dependent). For 1->4 cores, for example, this gave a speedup of 3.15 for Houdini and 3.05 for Stockfish: only a 3% difference, or ~3 Elo points of MP scaling difference. I doubt that the difference will exceed 10 Elo points at 12 cores. By contrast, the scaling of Stockfish relative to Houdini with time (45%->56% from ultra-fast to longer-than-blitz time controls) is worth about 70-75 Elo points. So:

1. MP scaling differences are relatively minor.
2. Testing engines on many cores at long TC is almost impossible for the desired number of games (the number needed for a conclusive LOS or an SPRT stop between closely matched engines).
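
For readers who want to check the percentages against Elo figures, here is a minimal sketch of the usual logistic score-to-Elo conversion. The function name `elo_from_score` is only illustrative, and the logistic model is an assumption about how the numbers in this thread were obtained. Under that model the 45%->56% shift comes out at roughly 77 Elo, close to the 70-75 quoted, and the 76.5% score in the 12-core test corresponds to about +205.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by an expected score in (0, 1), logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


# Stockfish's score against Houdini as quoted in the post:
# 45% at ultra-fast time controls, 56% beyond blitz.
gain = elo_from_score(0.56) - elo_from_score(0.45)

# Time-to-depth speedups quoted for 1 -> 4 cores (Houdini vs Stockfish).
speedup_ratio = 3.15 / 3.05

print(f"Elo gained by Stockfish with longer time: {gain:.1f}")
print(f"Relative MP speedup difference: {100.0 * (speedup_ratio - 1.0):.1f} %")
print(f"Elo for a 76.5% score: {elo_from_score(0.765):.1f}")
```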

Ajedrecista · Post by **Ajedrecista** » Thu Jan 23, 2014 9:28 pm

Hello:

I finally did the changes to find the median of the distribution. I also think that the results are more accurate now. With bayeselo = 60.75, drawelo = 270, alpha = beta = 0.05, bayeselo_0 = 0 and bayeselo_1 = 30, I ran 500000 simulations:

Code: Select all

Shortest simulation:      39 games (simulation 204448).
Longest simulation:     1664 games (simulation 337409).

Average number of games per simulation:     275
Median of the distribution:                 248

Type I errors  (false positives):   0.00 %
Type II errors (false negatives):   0.01 %

There is 1 simulation with score > 50% that failed SPRT.
There are      0 simulations with score = 50% that failed SPRT.

Code: Select all

From       0 to     999 games: 499712 simulations ( 99.94 %); accumulated:  99.94 %.
From    1000 to    1999 games:    288 simulations (  0.06 %); accumulated: 100.00 %.

Number of finished simulations: 500000.

The stronger engine failed 57 times out of 500,000 simulations. I manually found the only simulation that failed with more wins than losses:

Code: Select all

349585) FAIL after     886 games (+    155 -    154 =    577).
        Passes: 349547   Fails:     38

If you wanted a number for the median under your assumptions, here you have it: 248.
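
A simulation of this kind can be sketched in a few lines. The following is not the program used for the numbers above; it is a minimal reconstruction that assumes the BayesElo win/draw/loss model with a fixed, known drawelo and a plain Wald SPRT on the log-likelihood ratio. The names `bayeselo_probs` and `simulate_sprt` are illustrative, and the run is kept to 400 simulations so it finishes quickly.

```python
import math
import random
import statistics


def bayeselo_probs(bayeselo, drawelo):
    """Return (win, draw, loss) probabilities under the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - bayeselo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + bayeselo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss


def simulate_sprt(rng, true_elo=60.75, drawelo=270.0,
                  elo0=0.0, elo1=30.0, alpha=0.05, beta=0.05):
    """Play simulated games until the SPRT stops; return (games, h1_accepted)."""
    truth = bayeselo_probs(true_elo, drawelo)
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    # Log-likelihood ratio contributed by one win, one draw, one loss.
    steps = [math.log(b / a) for a, b in zip(p0, p1)]
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr, games = 0.0, 0
    while lower < llr < upper:
        outcome = rng.choices((0, 1, 2), weights=truth)[0]
        llr += steps[outcome]
        games += 1
    return games, llr >= upper


rng = random.Random(2014)  # fixed seed so the run is repeatable
runs = [simulate_sprt(rng) for _ in range(400)]
lengths = [games for games, _ in runs]
passes = sum(1 for _, accepted in runs if accepted)

print(f"Average number of games: {statistics.mean(lengths):.0f}")
print(f"Median number of games:  {statistics.median(lengths):.0f}")
print(f"H1 accepted in {passes} of {len(runs)} simulations")
```

With so few simulations the averages will only land near the figures reported above, not on them, and rare events such as false negatives will usually not show up at all.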

Regards from Spain.

Ajedrecista.

Laskos · Post by **Laskos** » Mon Jan 27, 2014 8:51 am

Ajedrecista wrote:Hello:

I finally did the changes to find the median of the distribution. I also think that the results are more accurate now. With bayeselo = 60.75, drawelo = 270, alpha = beta = 0.05, bayeselo_0 = 0 and bayeselo_1 = 30, I ran 500000 simulations:
Code: Select all
Shortest simulation:      39 games (simulation 204448).
Longest simulation:     1664 games (simulation 337409).

Average number of games per simulation:     275
Median of the distribution:                 248

Type I errors  (false positives):   0.00 %
Type II errors (false negatives):   0.01 %

There is 1 simulation with score > 50% that failed SPRT.
There are      0 simulations with score = 50% that failed SPRT.
Code: Select all
From       0 to     999 games: 499712 simulations ( 99.94 %); accumulated:  99.94 %.
From    1000 to    1999 games:    288 simulations (  0.06 %); accumulated: 100.00 %.

Number of finished simulations: 500000.
The stronger engine failed 57 times out of 500,000 simulations. I manually found the only simulation that failed with more wins than losses:
Code: Select all
349585) FAIL after     886 games (+    155 -    154 =    577).
        Passes: 349547   Fails:     38
If you wanted a number for the median under your assumptions, here you have it: 248.

Regards from Spain.

Ajedrecista.

Thanks, Jesus.
Meanwhile, I got an SPRT stop in Cutechess-Cli with H1 accepted after 388 games for:
elo0=0
elo1=30
alpha=0.05
beta=0.05

Time control: 10m+10s
Each engine on 1 i7 core
Openings 8moves_v2:

Code: Select all

Score of SF 19.01 vs H4 Contempt 0: 107 - 74 - 207  [0.543] 388
ELO difference: 30
SPRT: H1 was accepted
Finished match

So, now I can claim on solid grounds that SF is the strongest engine beyond blitz time controls.

Laskos · Post by **Laskos** » Mon Jan 27, 2014 9:28 pm

Summarizing, I got a SPRT stop for Stockfish 19.01.2014 being stronger than Houdini 4 Contempt 0 under Cutechess-Cli in these conditions:

TC: 10m + 10s
i7 2600k at 3.6 GHz 4 cores
4 parallel matches, each engine on one physical core
Hash: 512MB
Ponder: Off
EGTB: No
Openings: 8moves_v2

Code: Select all

    Program                            Score      %     Elo    +   -    Draws

  1 SF 19.01                        : 210.5/388  54.3   3015   24  24   53.4 %
  2 H4 Contempt 0                   : 177.5/388  45.7   2985   24  24   53.4 %

30 +/- 24 Elo points 2SD advantage for Stockfish 19.01 at 10m + 10s TC. LOS=99.3%
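
The three summary figures can be reproduced approximately from the raw result of 107 wins, 74 losses and 207 draws. The sketch below is an assumption about the method, not the rating tool actually used: it takes the logistic Elo formula, a delta-method error margin based on the per-game variance, and the common normal approximation for LOS. The function name `match_summary` is illustrative.

```python
import math


def match_summary(wins, losses, draws):
    """Return (elo, two_sd_margin, los) for a single match result."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    elo = 400.0 * math.log10(score / (1.0 - score))

    # Per-game variance of the result, which takes values 1, 0.5 and 0.
    variance = (wins + 0.25 * draws) / games - score ** 2
    std_error = math.sqrt(variance / games)
    # Delta method: slope of the Elo curve at the observed score.
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    two_sd_margin = 2.0 * std_error * slope

    # Likelihood of superiority, normal approximation (draws do not enter).
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return elo, two_sd_margin, los


elo, margin, los = match_summary(wins=107, losses=74, draws=207)
print(f"{elo:.1f} +/- {margin:.1f} Elo (2SD), LOS = {100.0 * los:.1f} %")
```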

SPRT:
elo0=0
elo1=30
alpha, beta = 0.05

H1 accepted after 388 games:

Code: Select all

Score of SF 19.01 vs H4 Contempt 0: 107 - 74 - 207  [0.543] 388
ELO difference: 30
SPRT: H1 was accepted
Finished match

"H1 accepted" means that the test accepted H1 (a 30 Elo point advantage for Stockfish) over H0 (equal strength).
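
To see why the test stops at this result, the log-likelihood ratio (LLR) can be recomputed from the final score. This is a sketch under stated assumptions: a BayesElo-style win/draw/loss model, drawelo estimated from the observed results, and elo0/elo1 rescaled from logistic Elo to the BayesElo scale. Whether Cutechess-Cli performs exactly this calculation is an assumption here, and the function names are illustrative. With these assumptions the LLR for 107 - 74 - 207 lies just above the upper bound log((1 - beta) / alpha), which is the condition for accepting H1.

```python
import math


def bayeselo_probs(bayeselo, drawelo):
    """Return (win, draw, loss) probabilities under the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - bayeselo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + bayeselo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss


def sprt_llr(wins, losses, draws, elo0, elo1):
    """LLR of H1 (elo1) against H0 (elo0) for an observed result."""
    games = wins + losses + draws
    p_win, p_loss = wins / games, losses / games
    # Invert the model to estimate drawelo from the observed frequencies.
    drawelo = 200.0 * math.log10((1.0 - p_loss) / p_loss * (1.0 - p_win) / p_win)
    # Conversion factor between logistic Elo and BayesElo near equality.
    x = 10.0 ** (-drawelo / 400.0)
    scale = 4.0 * x / (1.0 + x) ** 2
    p0 = bayeselo_probs(elo0 / scale, drawelo)
    p1 = bayeselo_probs(elo1 / scale, drawelo)
    counts = (wins, draws, losses)
    return sum(n * math.log(b / a) for n, a, b in zip(counts, p0, p1))


alpha = beta = 0.05
lower = math.log(beta / (1.0 - alpha))
upper = math.log((1.0 - beta) / alpha)
llr = sprt_llr(wins=107, losses=74, draws=207, elo0=0.0, elo1=30.0)

print(f"LLR = {llr:.3f}, bounds = [{lower:.3f}, {upper:.3f}]")
print("H1 accepted" if llr >= upper else "H0 accepted" if llr <= lower else "continue")
```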

The games are here:
http://speedy.sh/aw3tG/SPRT-RR.pgn

mwyoung · Post by **mwyoung** » Mon Jan 27, 2014 9:46 pm

Laskos wrote:Summarizing, I got a SPRT stop for Stockfish 19.01.2014 being stronger than Houdini 4 Contempt 0 under Cutechess-Cli in these conditions:

TC: 10m + 10s
i7 2600k at 3.6 GHz 4 cores
4 parallel matches, each engine on one physical core
Hash: 512MB
Ponder: Off
EGTB: No
Openings: 8moves_v2
Code: Select all
    Program                            Score      %     Elo    +   -    Draws

  1 SF 19.01                        : 210.5/388  54.3   3015   24  24   53.4 %
  2 H4 Contempt 0                   : 177.5/388  45.7   2985   24  24   53.4 %
30 +/- 24 Elo points 2SD advantage for Stockfish 19.01 at 10m + 10s TC. LOS=99.3%

SPRT:
elo0=0
elo1=30
alpha, beta = 0.05

H1 accepted after 388 games:
Code: Select all
Score of SF 19.01 vs H4 Contempt 0: 107 - 74 - 207  [0.543] 388
ELO difference: 30
SPRT: H1 was accepted
Finished match
"H1 accepted" means that the test accepted H1 (a 30 Elo point advantage for Stockfish) over H0 (equal strength).

The games are here:
http://speedy.sh/aw3tG/SPRT-RR.pgn

Yes, Stockfish is the strongest engine. I have no doubt. Thank you for your results.

ernest · Post by **ernest** » Mon Jan 27, 2014 10:06 pm

Laskos wrote:Summarizing, I got a SPRT stop for Stockfish 19.01.2014 ...

...and Stockfish is even better, since it seems that Stockfish 19.01.2014 is an unfortunate 10-Elo regression !

(see http://ls-ratinglist.beepworld.de )

syzygy · Post by **syzygy** » Mon Jan 27, 2014 10:27 pm

ernest wrote:
Laskos wrote:Summarizing, I got a SPRT stop for Stockfish 19.01.2014 ...
...and Stockfish is even better, since it seems that Stockfish 19.01.2014 is an unfortunate 10-Elo regression !
(see http://ls-ratinglist.beepworld.de )

Apparently because of a bug in the Makefile that resulted in the compilation of a 32-bit binary instead of 64-bit. It shouldn't be too difficult for Kai to tell whether the binary he tested is 32-bit or 64-bit.

Laskos · Post by **Laskos** » Mon Jan 27, 2014 10:58 pm

syzygy wrote:
ernest wrote:
Laskos wrote:Summarizing, I got a SPRT stop for Stockfish 19.01.2014 ...
...and Stockfish is even better, since it seems that Stockfish 19.01.2014 is an unfortunate 10-Elo regression !
(see http://ls-ratinglist.beepworld.de )
Apparently because of a bug in the Makefile that resulted in the compilation of a 32-bit binary instead of 64-bit. It shouldn't be too difficult for Kai to tell whether the binary he tested is 32-bit or 64-bit.

Thanks for the info. There were 2 binaries released on the 19th, and mine is 64-bit. This can easily be seen from the NPS in my initial post, since a 32-bit build is ~30% slower. So, I used an uncorrupted SF.

syzygy · Post by **syzygy** » Mon Jan 27, 2014 11:30 pm

Laskos wrote:
syzygy wrote:
ernest wrote:
Laskos wrote:Summarizing, I got a SPRT stop for Stockfish 19.01.2014 ...
...and Stockfish is even better, since it seems that Stockfish 19.01.2014 is an unfortunate 10-Elo regression !
(see http://ls-ratinglist.beepworld.de )
Apparently because of a bug in the Makefile that resulted in the compilation of a 32-bit binary instead of 64-bit. It shouldn't be too difficult for Kai to tell whether the binary he tested is 32-bit or 64-bit.
Thanks for the info. There were 2 binaries released on the 19th, and mine is 64-bit. This can easily be seen from the NPS in my initial post, since a 32-bit build is ~30% slower. So, I used an uncorrupted SF.

Maybe I was too quick in concluding that the Makefile bug was behind the regression. Whether there was a regression or not seems to be a topic of discussion now on the FishCooking list. I can only assume the people there are aware of:

Author: Joona Kiiski
Date: Sat Jan 25 11:29:32 2014 +0100
Timestamp: 1390645772

Do not set default value for architeture in Makefile

Fixes a regression that ARCH parameter was not properly validated.
Invalid value would default to generic 32-bit build.

No functional change.

I guess if the ARCH parameter was set by hand (or script), the bug was not triggered. So the question is what was the source of the binary used by beepworld.de.
