Stockfish seems definitely the strongest engine

Uri Blass
Posts: 10297
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish seems definitely the strongest engine

Post by Uri Blass »

Laskos wrote:
ouachita wrote:There are a lot of one core tests posted here, so I wanted to post these results to again highlight the point that one core results are not related to multi-core results, in this case, 12 cores:

1-22-14

SF0901014IP-12 core v SF 080114-1 core

1+1
50 positions, alternating colors
defaults
# of cores is sole setting difference.
                                        
1   Stockfish 090114IP 64 SSE4.2  +205  +53/=47/-0 76.50%   76.5/100
2   Stockfish 080114 64 SSE4.2    -205  +0/=47/-53 23.50%   23.5/100

Also, I misspoke by saying 12 cores win >90%. Here, 12 cores scored 76.5%, but had 100% of the wins.

Food for thought.
150-200 Elo points from 1 to 12 cores at 1'+1'' are to be expected, but I don't agree that 1-core results are unrelated to 12-core results. The MP scaling of top engines is comparable. In your place, I would play 10'+10'' games, on one core or on several cores, between SF and Houdini Contempt=0 until a SPRT stop, to dispel some myths (for example, that SF does not scale better). There have been many 1-core results, but none of them had a LOS of 98% for SF against Houdini 4 Contempt=0, and that happens at a somewhat longer TC than blitz. I will now wait for a SPRT stop in Cutechess-Cli to show that SF has overtaken Houdini (if that is the case).
Your assumption that the MP scaling of top engines is comparable has not been proven, so I don't think you can conclude that 12 cores at 1+1 is the same as 1 core at 10+10.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish seems definitely the strongest engine

Post by Laskos »

Uri Blass wrote: Your assumption that the MP scaling of top engines is comparable has not been proven, so I don't think you can conclude that 12 cores at 1+1 is the same as 1 core at 10+10.
It is a reasonable assumption. The MP scaling of Stockfish and Houdini (not Komodo, though) can be measured with time-to-depth tests on, say, 100 positions at some approximate time control (the result is time dependent). For example, going from 1 to 4 cores gave 3.15 for Houdini and 3.05 for Stockfish: only a 3% difference, or ~3 Elo points of MP scaling difference. I doubt that at 12 cores the difference will be more than 10 Elo points. Meanwhile, the scaling with time of Stockfish relative to Houdini (45%->56%) from ultra-fast to longer-than-blitz time controls is about 70-75 Elo points. So:

1. MP scaling differences are relatively minor.
2. Testing engines on many cores at long TC is almost impossible for the desired number of games (the number needed for a conclusive LOS or a SPRT stop between closely matched engines).
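
As a rough check of the time-scaling figure, the score percentages can be converted to Elo differences. This is only a sketch: it assumes the usual logistic Elo formula, and the function name `elo_from_score` is made up for illustration. It is not necessarily how the 70-75 figure was obtained.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by an expected score in (0, 1), logistic model (assumed)."""
    return 400.0 * math.log10(score / (1.0 - score))


# Stockfish's score against Houdini, as quoted in the post:
# about 45% at ultra-fast TC and about 56% at longer-than-blitz TC.
ultra_fast = elo_from_score(0.45)
longer_tc = elo_from_score(0.56)
gain = longer_tc - ultra_fast

print(f"ultra-fast: {ultra_fast:+.1f} Elo, longer TC: {longer_tc:+.1f} Elo, gain: {gain:.1f} Elo")
```

Under that assumption the gain comes out a little above 75 Elo, in the same ballpark as the 70-75 quoted, and far larger than the few Elo points attributed to MP scaling differences.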
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Stockfish seems definitely the strongest engine.

Post by Ajedrecista »

Hello:

I finally made the changes needed to find the median of the distribution. I also think that the results are more accurate now. With bayeselo = 60.75, drawelo = 270, alpha = beta = 0.05, bayeselo_0 = 0 and bayeselo_1 = 30, I ran 500000 simulations:

Shortest simulation:      39 games (simulation 204448).
Longest simulation:     1664 games (simulation 337409).

Average number of games per simulation:     275
Median of the distribution:                 248

Type I errors  (false positives):   0.00 %
Type II errors (false negatives):   0.01 %

There is 1 simulation with score > 50% that failed SPRT.
There are      0 simulations with score = 50% that failed SPRT.

From       0 to     999 games: 499712 simulations ( 99.94 %); accumulated:  99.94 %.
From    1000 to    1999 games:    288 simulations (  0.06 %); accumulated: 100.00 %.
 
Number of finished simulations: 500000.
The stronger engine failed 57 times out of 500,000 simulations. I manually found the only simulation that failed with more wins than losses:

349585) FAIL after     886 games (+    155 -    154 =    577).
        Passes: 349547   Fails:     38
If you wanted a number for the median under your assumptions, here you have it: 248.
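
For readers who want to reproduce this kind of experiment, here is a minimal sketch of one way to simulate it. It is not the program used for the numbers above. It assumes the standard BayesElo win/draw/loss model and a plain Wald SPRT on the log-likelihood ratio; the function names are invented for illustration, and it runs far fewer simulations.

```python
import math
import random
import statistics


def bayeselo_probs(elo, drawelo):
    """(win, draw, loss) probabilities under the BayesElo model (assumed here)."""
    win = 1.0 / (1.0 + 10.0 ** ((drawelo - elo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((drawelo + elo) / 400.0))
    return win, 1.0 - win - loss, loss


def simulate_sprt(true_elo, drawelo, elo0, elo1, alpha, beta, rng, max_games=100_000):
    """Play random games until the SPRT stops. Returns (h1_accepted, games)."""
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this LLR
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this LLR
    p_true = bayeselo_probs(true_elo, drawelo)
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    # LLR increment contributed by a single win, draw, or loss.
    step = [math.log(b / a) for a, b in zip(p0, p1)]
    llr = 0.0
    for games in range(1, max_games + 1):
        outcome = rng.choices((0, 1, 2), weights=p_true)[0]
        llr += step[outcome]
        if llr >= upper:
            return True, games
        if llr <= lower:
            return False, games
    return None, max_games  # undecided (not expected with these settings)


if __name__ == "__main__":
    rng = random.Random(2014)  # arbitrary seed, for repeatability only
    runs = [simulate_sprt(60.75, 270.0, 0.0, 30.0, 0.05, 0.05, rng) for _ in range(1000)]
    lengths = [games for _, games in runs]
    fails = sum(1 for accepted, _ in runs if accepted is False)
    print("average games:", round(statistics.mean(lengths)))
    print("median games: ", statistics.median(lengths))
    print("H1 rejected:  ", fails, "of", len(runs))
```

With only 1000 runs the average and median will fluctuate, but if the assumed model is close to the one used above they should land in the same neighbourhood as the figures reported.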

Regards from Spain.

Ajedrecista.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish seems definitely the strongest engine.

Post by Laskos »

Thanks, Jesus.
Meanwhile, I got a SPRT stop in Cutechess-Cli, with H1 accepted after 388 games, for:
elo0=0
elo1=30
alpha=0.05
beta=0.05

Time control: 10m+10s
Each engine on 1 i7 core
Openings 8moves_v2:

Score of SF 19.01 vs H4 Contempt 0: 107 - 74 - 207  [0.543] 388
ELO difference: 30
SPRT: H1 was accepted
Finished match
So, now I can claim on solid grounds that SF is the strongest engine beyond blitz time controls.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish seems definitely the strongest engine.

Post by Laskos »

Summarizing, I got a SPRT stop for Stockfish 19.01.2014 being stronger than Houdini 4 Contempt 0 under Cutechess-Cli in these conditions:

TC: 10m + 10s
i7 2600k at 3.6 GHz 4 cores
4 parallel matches, each engine on one physical core
Hash: 512MB
Ponder: Off
EGTB: No
Openings: 8moves_v2

    Program                            Score      %     Elo    +   -    Draws

  1 SF 19.01                        : 210.5/388  54.3   3015   24  24   53.4 %
  2 H4 Contempt 0                   : 177.5/388  45.7   2985   24  24   53.4 %
30 +/- 24 Elo points (2 SD) advantage for Stockfish 19.01 at 10m + 10s TC. LOS = 99.3%
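
These summary figures can be cross-checked from the raw result of 107 wins, 74 losses and 207 draws. The sketch below assumes the logistic Elo formula, a trinomial per-game variance for the 2 SD margin, and the common normal approximation for LOS. The rating tool that produced the table may compute them differently, and `match_stats` is a made-up helper name.

```python
import math


def match_stats(wins, losses, draws):
    """Elo difference, approximate 2 SD Elo margin, and LOS for a match result."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games

    def elo(s):
        return 400.0 * math.log10(s / (1.0 - s))

    # Per-game variance of the score, treating each game as a trinomial outcome.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    two_sd = 2.0 * math.sqrt(variance / games)

    # Half-width of the Elo interval spanned by score +/- 2 SD.
    margin = (elo(score + two_sd) - elo(score - two_sd)) / 2.0

    # Likelihood of superiority, normal approximation (draws do not enter).
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return {"score": score, "elo": elo(score), "margin": margin, "los": los}


stats = match_stats(wins=107, losses=74, draws=207)
print(f"score {stats['score']:.3f}, Elo {stats['elo']:+.1f} "
      f"+/- {stats['margin']:.1f}, LOS {100 * stats['los']:.1f}%")
```

Under these assumptions the result agrees with the figures quoted: roughly +30 Elo, a margin of about 24, and a LOS of about 99.3%.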

SPRT:
elo0=0
elo1=30
alpha, beta = 0.05

H1 accepted after 388 games:

Score of SF 19.01 vs H4 Contempt 0: 107 - 74 - 207  [0.543] 388
ELO difference: 30
SPRT: H1 was accepted
Finished match
"H1 accepted" means that the hypothesis of a 30-point advantage for Stockfish was accepted over H0 (equal strength).

The games are here:
http://speedy.sh/aw3tG/SPRT-RR.pgn
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: Stockfish seems definitely the strongest engine.

Post by mwyoung »

Yes, Stockfish is the strongest engine. I have no doubt. Thank you for your results.
ernest
Posts: 2041
Joined: Wed Mar 08, 2006 8:30 pm

Re: Stockfish seems definitely the strongest engine.

Post by ernest »

Laskos wrote:Summarizing, I got a SPRT stop for Stockfish 19.01.2014 ...
...and Stockfish is even better, since it seems that Stockfish 19.01.2014 is an unfortunate 10-Elo regression ! 8-)
(see http://ls-ratinglist.beepworld.de )
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish seems definitely the strongest engine.

Post by syzygy »

ernest wrote:
Laskos wrote:Summarizing, I got a SPRT stop for Stockfish 19.01.2014 ...
...and Stockfish is even better, since it seems that Stockfish 19.01.2014 is an unfortunate 10-Elo regression ! 8-)
(see http://ls-ratinglist.beepworld.de )
Apparently because of a bug in the Makefile that resulted in the compilation of a 32-bit binary instead of 64-bit. It shouldn't be too difficult for Kai to tell whether the binary he tested is 32-bit or 64-bit.
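
One way to check this without running the engine is to read the executable's header. The sketch below is an illustration only: the helper names are made up, it relies on the documented ELF (Linux) and PE (Windows) header layouts, and it recognises only x86 and x86-64 machine types in PE files.

```python
import struct


def binary_bitness(header: bytes):
    """Return 32 or 64 for an ELF or PE header, or None if it is not recognised."""
    # ELF: magic 0x7f 'E' 'L' 'F', then EI_CLASS (1 = 32-bit, 2 = 64-bit).
    if header[:4] == b"\x7fELF" and len(header) > 4:
        return {1: 32, 2: 64}.get(header[4])
    # PE: 'MZ' stub, offset of the PE signature at 0x3C, then the Machine field.
    if header[:2] == b"MZ" and len(header) >= 0x40:
        (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
        if header[pe_offset:pe_offset + 4] == b"PE\0\0" and len(header) >= pe_offset + 6:
            (machine,) = struct.unpack_from("<H", header, pe_offset + 4)
            return {0x014C: 32, 0x8664: 64}.get(machine)  # i386 / x86-64 only
    return None


def file_bitness(path):
    """Read the first few KiB of a file and classify it."""
    with open(path, "rb") as handle:
        return binary_bitness(handle.read(4096))


# Example call (the file name is hypothetical):
# print(file_bitness("stockfish.exe"))
```

A much lower NPS than expected on the same hardware, as mentioned elsewhere in the thread, is an indirect sign of the same thing.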
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish seems definitely the strongest engine.

Post by Laskos »

Thanks for the info. Two binaries were released on the 19th; mine is 64-bit. That is easy to see from the NPS in my initial post, since the 32-bit build is ~30% slower. So I used an uncorrupted SF.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish seems definitely the strongest engine.

Post by syzygy »

Maybe I was too quick in concluding that the Makefile bug was behind the regression. Whether there was a regression or not seems to be a topic of discussion now on the FishCooking list. I can only assume the people there are aware of:
Author: Joona Kiiski
Date: Sat Jan 25 11:29:32 2014 +0100
Timestamp: 1390645772

Do not set default value for architeture in Makefile

Fixes a regression that ARCH parameter was not properly validated.
Invalid value would default to generic 32-bit build.

No functional change.
I guess if the ARCH parameter was set by hand (or script), the bug was not triggered. So the question is what was the source of the binary used by beepworld.de.