Hello Kai:
Laskos wrote: I tested the latest SF (19.01.2014) against Houdini 4 with Contempt 0 at an intermediate time control of 10m + 10s.
i7 2600k, 3.6GHz, 4 physical cores
GUI: LittleBlitzer
1 real core each engine, 4 simultaneous matches
Hash: 512MB
Ponder: Off
Openings: 8 moves EPD (32,000 randomized positions)
Reversed Colors: Yes
EGTB: No
Adjudication: No
Code:
Games Completed = 100 of 100 (Avg game length = 2650.582 sec)
Settings = Gauntlet/512MB/600000ms+10000ms/M 700000cp for 1000 moves, D 150000 moves/EPD:C:\LittleBlitzer\8moves_v2.epd(32000)
Time = 66710 sec elapsed, 0 sec remaining
1. SF 19.01.14 56.0/100 23-11-66 (L: m=11 t=0 i=0 a=0) (D: r=40 i=16 f=10 s=0 a=0) (tpm=15184.2 d=34.90 nps=2007547)
2. H4 Contempt=0 44.0/100 11-23-66 (L: m=23 t=0 i=0 a=0) (D: r=40 i=16 f=10 s=0 a=0) (tpm=14413.0 d=23.69 nps=2349506)
LOS = 98.0%, highly significant.
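For reference, the LOS figure follows from the win/loss counts alone (draws cancel out) via the usual normal approximation, LOS = Phi((W - L) / sqrt(W + L)). A minimal sketch in Python, plugging in the 23-11 score above:

```python
import math

def los(wins, losses):
    """Likelihood of superiority, normal approximation:
    LOS = Phi((W - L) / sqrt(W + L)); draws do not enter."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(round(100 * los(23, 11), 1))  # 23 wins, 11 losses -> 98.0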
At the 15s + 0.05s TC, Houdini scored 55% in 2,000 games. So:
1. Stockfish scales better to longer TC.
2. At time controls longer than blitz, on modern hardware, Stockfish seems clearly the strongest engine.
These are more or less known facts, but some folks still express doubts. It would have been useful to have a SPRT stop with alpha = beta = 0.05, but that would probably need 2-3 times more games.
Thanks for the test!
I am not an expert, but I think that SPRT is intended more for small Elo differences. Writing from memory, I believe the latest cutechess-cli supports SPRT.
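For what it is worth, a hypothetical cutechess-cli invocation using its -sprt option might look like this (engine names, paths and the opening file are placeholders, not the exact setup of the test above):

```shell
cutechess-cli \
  -engine cmd=stockfish name=SF \
  -engine cmd=houdini4 name=H4 \
  -each proto=uci tc=600+10 option.Hash=512 \
  -sprt elo0=-3 elo1=3 alpha=0.05 beta=0.05 \
  -rounds 50000 -repeat \
  -openings file=8moves_v2.epd format=epd order=random
```

The match then stops on its own as soon as the log-likelihood ratio crosses either SPRT bound, rather than after a fixed number of games.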
I do not know the real difference between Houdini 4 and the current SF development version. To see which of two engines is stronger, I think that elo0 = -elo1 is the right choice (in BayesElo units), for example SPRT(-3, 3). Suppose that both Houdini 4 and current SF have the same rating at a certain TC, and also suppose that drawelo = 270 (that is, circa 65.1% of expected draws):
Code:
Shortest simulation: 1611 games (+235 -339 =1037).
Longest simulation: 266008 games (+46587 -46691 =172730).
Average number of games per simulation: 30792
Type I errors (false positives): 50.06 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations. In theory, with the input parameters I chose, false positives should be exactly 50%, so I guess my SPRT works fine (alpha = beta = 0.05 = 5%, for your information). The median is between 23000 and 24000 games, most probably between 23000 and 23100 games.
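As a rough illustration of the kind of simulation described above (my own sketch, not the actual program used; the function names are made up), the BayesElo trinomial model and the SPRT stopping rule can be coded like this:

```python
import math
import random

def bayeselo_probs(belo, drawelo):
    """Trinomial (win, draw, loss) probabilities in the BayesElo model."""
    p_win  = 1.0 / (1.0 + 10.0 ** ((-belo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** (( belo + drawelo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt(true_belo, elo0=-3.0, elo1=3.0, drawelo=270.0,
         alpha=0.05, beta=0.05, rng=random):
    """Run one SPRT match; return (H1_accepted, games_played)."""
    lower = math.log(beta / (1.0 - alpha))         # reject H1 below this
    upper = math.log((1.0 - beta) / alpha)         # accept H1 above this
    p_true = bayeselo_probs(true_belo, drawelo)    # true game-result probs
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    step = [math.log(a / b) for a, b in zip(p1, p0)]  # LLR increment per result
    llr, games = 0.0, 0
    while lower < llr < upper:
        r = rng.random()                           # draw one game result
        idx = 0 if r < p_true[0] else (1 if r < p_true[0] + p_true[1] else 2)
        llr += step[idx]
        games += 1
    return llr >= upper, games
```

With true_belo = 0 this accepts H1 about half the time, matching the ~50% "false positives" above; with true_belo around 17.36 BayesElo it essentially always accepts H1, matching the second simulation below.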
If I maintain drawelo = 270 but set the gap to ~17.36 BayesElo (around 10 Elo, with an expected draw ratio of more or less 65.01% ~ 65%):
Code:
Shortest simulation: 997 games (+220 -118 =659).
Longest simulation: 11157 games (+1980 -1876 =7301).
Average number of games per simulation: 3613
Type I errors (false positives): 0.00 %
Type II errors (false negatives): 0.00 %
I ran 10000 simulations again. The stronger engine always prevailed (elo0 = -elo1 implies that there are no SPRT failures with wins >= losses, and vice versa). The median is between 3000 and 4000 games, most probably between 3400 and 3500 games.
------------
@Jouni: probably the first RobboLito was number one at the end of 2009, and the same goes for IvanHoe in early 2010. Of course, I am not completely sure.
Regards from Spain.
Ajedrecista.