Hello Kai:
Laskos wrote: I tested the latest SF (19.01.2014) against Houdini 4 with Contempt 0 at an intermediate time control of 10m + 10s.
i7 2600k, 3.6GHz, 4 physical cores
GUI: LittleBlitzer
1 real core each engine, 4 simultaneous matches
Hash: 512MB
Ponder: Off
Openings: 8 moves EPD (32,000 randomized positions)
Reversed Colors: Yes
EGTB: No
Adjudication: No
Code:
Games Completed = 100 of 100 (Avg game length = 2650.582 sec)
Settings = Gauntlet/512MB/600000ms+10000ms/M 700000cp for 1000 moves, D 150000 moves/EPD:C:\LittleBlitzer\8moves_v2.epd(32000)
Time = 66710 sec elapsed, 0 sec remaining
1. SF 19.01.14 56.0/100 23-11-66 (L: m=11 t=0 i=0 a=0) (D: r=40 i=16 f=10 s=0 a=0) (tpm=15184.2 d=34.90 nps=2007547)
2. H4 Contempt=0 44.0/100 11-23-66 (L: m=23 t=0 i=0 a=0) (D: r=40 i=16 f=10 s=0 a=0) (tpm=14413.0 d=23.69 nps=2349506)
LOS = 98.0%, highly significant.
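For reference, the LOS figure follows from the win/loss counts alone (draws cancel out) via the usual normal approximation, LOS = Phi((W - L) / sqrt(W + L)). A minimal sketch in Python, plugging in the 23-11 score above:

```python
import math

def los(wins, losses):
    """Likelihood of superiority, normal approximation:
    LOS = Phi((W - L) / sqrt(W + L)); draws do not enter."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(round(100 * los(23, 11), 1))  # 23 wins, 11 losses -> 98.0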
At the 15s + 0.05s TC, Houdini scored 55% in 2,000 games. So:
1. Stockfish scales better to longer TC.
2. At time controls longer than blitz, on modern hardware, Stockfish seems clearly the strongest engine.
These are more or less known facts, but some folks still express doubts. It would have been useful to have a SPRT stop with alpha = beta = 0.05, but that would probably need 2-3 times more games.
Thanks for the test!
I am not an expert, but I think that SPRT is intended more for small Elo differences. Writing from memory, I believe the latest cutechess-cli supports SPRT.
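For what it is worth, a hypothetical cutechess-cli invocation using its -sprt option might look like this (engine names, paths and the opening file are placeholders, not the exact setup of the test above):

```shell
cutechess-cli \
  -engine cmd=stockfish name=SF \
  -engine cmd=houdini4 name=H4 \
  -each proto=uci tc=600+10 option.Hash=512 \
  -sprt elo0=-3 elo1=3 alpha=0.05 beta=0.05 \
  -rounds 50000 -repeat \
  -openings file=8moves_v2.epd format=epd order=random
```

The match then stops on its own as soon as the log-likelihood ratio crosses either SPRT bound, rather than after a fixed number of games.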
I do not know the real difference between Houdini 4 and the current SF development version. To see which of two engines is stronger, I think that elo0 = -elo1 is the right choice (in BayesElo units), for example SPRT(-3, 3). Suppose that both Houdini 4 and current SF have the same rating at a certain TC, and also suppose that drawelo = 270 (that is, circa 65.1% of expected draws):
Code:
Shortest simulation: 1611 games (+235 -339 =1037).
Longest simulation: 266008 games (+46587 -46691 =172730).
Average number of games per simulation: 30792
Type I errors (false positives): 50.06 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations. In theory, with the input parameters I chose, false positives should be exactly 50%, so I guess my SPRT works fine (alpha = beta = 0.05 = 5%, for your information). The median is between 23000 and 24000 games, most probably between 23000 and 23100 games.
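As a rough illustration of the kind of simulation described above (my own sketch, not the actual program used; the function names are made up), the BayesElo trinomial model and the SPRT stopping rule can be coded like this:

```python
import math
import random

def bayeselo_probs(belo, drawelo):
    """Trinomial (win, draw, loss) probabilities in the BayesElo model."""
    p_win  = 1.0 / (1.0 + 10.0 ** ((-belo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** (( belo + drawelo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt(true_belo, elo0=-3.0, elo1=3.0, drawelo=270.0,
         alpha=0.05, beta=0.05, rng=random):
    """Run one SPRT match; return (H1_accepted, games_played)."""
    lower = math.log(beta / (1.0 - alpha))         # reject H1 below this
    upper = math.log((1.0 - beta) / alpha)         # accept H1 above this
    p_true = bayeselo_probs(true_belo, drawelo)    # true game-result probs
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    step = [math.log(a / b) for a, b in zip(p1, p0)]  # LLR increment per result
    llr, games = 0.0, 0
    while lower < llr < upper:
        r = rng.random()                           # draw one game result
        idx = 0 if r < p_true[0] else (1 if r < p_true[0] + p_true[1] else 2)
        llr += step[idx]
        games += 1
    return llr >= upper, games
```

With true_belo = 0 this accepts H1 about half the time, matching the ~50% "false positives" above; with true_belo around 17.36 BayesElo it essentially always accepts H1, matching the second simulation below.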
If I maintain drawelo = 270 but set the gap to ~17.36 BayesElo (around 10 Elo, with an expected draw ratio of more or less 65.01% ~ 65%):
Code:
Shortest simulation: 997 games (+220 -118 =659).
Longest simulation: 11157 games (+1980 -1876 =7301).
Average number of games per simulation: 3613
Type I errors (false positives): 0.00 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations again. The stronger engine always prevailed (elo0 = -elo1 implies that there are no SPRT failures with wins >= losses, and vice versa). The median is between 3000 and 4000 games, most probably between 3400 and 3500 games.
------------
@Jouni: probably the first RobboLito was number one at the end of 2009, and the same goes for IvanHoe in early 2010. Of course, I am not completely sure.
Regards from Spain.
Ajedrecista.