CCRL 40/40 Rating List - Custom engine selection
816025 games played by 2140 programs, run by 21 testers
Ponder off, General books (up to 12 moves), 3-4-5 piece EGTB
Time control: Equivalent to 40 moves in 40 minutes on Athlon 64 X2 4600+ (2.4 GHz),
about 15 minutes on a modern Intel CPU.
Computed on April 5, 2018 with Bayeselo based on 816'025 games
Tested by CCRL team, 2005-2018, http://computerchess.org.uk/ccrl/4040/
Engine                           Elo   +    -    Score  AvOp   Games
Dorpsgek Dillinger 64-bit        2202  +21  -21  49.2%  +4.7   790
Dorpsgek Eves-Temptation 64-bit  2200  +26  -26  51.5%  -10.8  525
ZirconiumX wrote: But the expected improvement is outside the noise, which is what I'm worried about.
I've just started a gauntlet of Dillinger vs. 16 other engines of very similar strength (8 rounds, 128 games in total). Once it's done I'll run the same gauntlet with Eve's Temptation and report back the results.
ZirconiumX wrote: But the expected improvement is outside the noise, which is what I'm worried about.
The expected improvement (+45) is within the error margins (47) in this case. The older version's margins must also be taken into account, not just those of the new version.
The CCRL table above doesn't show the LOS (likelihood of superiority; the corresponding CCRL page does), which is a mere 53.9%. With the current number of games for each version it's therefore impossible to tell which one is stronger.
I intend to rectify this situation by the next update.
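For anyone who wants a feel for why a 2 Elo gap with margins of 21 and 26 gives an LOS barely above 50%, here is a rough sketch in Python. It is not how Bayeselo computes LOS. It assumes the two ratings are independent and normally distributed, and that the listed margins are 95% intervals (z = 1.96), so it will land near the CCRL figure but not reproduce it exactly. The function name `approx_los` is made up for this example.

```python
import math

def approx_los(elo_a, margin_a, elo_b, margin_b, z=1.96):
    """Rough probability that engine A is stronger than engine B.

    Assumptions (not taken from Bayeselo): independent, normally
    distributed ratings; margins are z-sigma half-widths.
    """
    sigma_a = margin_a / z
    sigma_b = margin_b / z
    # Standard deviation of the rating difference (add in quadrature).
    sigma_diff = math.hypot(sigma_a, sigma_b)
    # Normal CDF of the standardized rating difference.
    return 0.5 * (1.0 + math.erf((elo_a - elo_b) / (sigma_diff * math.sqrt(2))))

# Dillinger (2202, +/-21) vs. Eves-Temptation (2200, +/-26)
los = approx_los(2202, 21, 2200, 26)
print(f"approximate LOS: {los:.1%}")
```

The result is only slightly above 50%, which matches the point above: with this few games the two versions cannot be told apart.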
ZirconiumX wrote: But the expected improvement is outside the noise, which is what I'm worried about.
The expected improvement (+45) is within the error margins (47) in this case. The older version's margins must also be taken into account, not just those of the new version.
This is not quite correct. Given error margins E1 and E2 for two independent ratings R1 and R2, the error margin of the rating difference R2 - R1 is sqrt(E1^2 + E2^2), not E1 + E2. With margins of 21 and 26 that comes to about 33, not 47. For the special case E1 = E2 you get E(diff) = sqrt(2) * E1. So the statement that the expected improvement (+45) was outside the error margin was in fact correct.
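The arithmetic is easy to check. A minimal sketch in Python, assuming the two margins belong to independent estimates (the helper name `diff_margin` is just for illustration):

```python
import math

def diff_margin(e1, e2):
    """Error margin of a rating difference, margins added in quadrature."""
    return math.hypot(e1, e2)  # sqrt(e1**2 + e2**2)

dillinger_margin = 21
eves_margin = 26

quadrature = diff_margin(dillinger_margin, eves_margin)
linear_sum = dillinger_margin + eves_margin

print(f"quadrature: {quadrature:.1f}")  # 33.4
print(f"plain sum:  {linear_sum}")      # 47

# Special case E1 == E2: the margin grows by a factor of sqrt(2).
print(math.isclose(diff_margin(21, 21), math.sqrt(2) * 21))  # True
```

An expected gain of +45 sits below the plain sum of 47 but well above the quadrature value of about 33, which is why the two readings disagree.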