I'm testing my engine by running two gauntlets, one for the current version and one for the previous version, against a pool of engines that I previously rated with a round robin at the same time control. How much could the error margins of the round robin affect those of the gauntlets, and is it better to improve the accuracy of the pool-strength estimates before improving that of the gauntlets?
One more thing: so far, the modifications I made produced Elo differences that were huge compared to the error margin, but that is no longer the case. So, to verify that engine A is stronger than engine B, do I need to ensure this:
A - (error margin) > B + (error margin), with A's Elo > B's?
Engine testing & error margin ?

 Posts: 916
 Joined: Tue Mar 09, 2010 2:46 pm
 Location: New York
 Full name: Álvaro Begué (RuyDos)
Re: Engine testing & error margin ?
I know this has been the subject of some controversy in the past, so take this as just my opinion: Just test your new version against your previous version directly. This is much better than using a variety of opponents because:
A) The size of the improvements is exaggerated when matching two similar versions of the same program.
B) The arithmetic of the error bars is such that you need 4 times fewer games to reach the same level of confidence.
It might seem at first that A) is a disadvantage, but you are primarily interested in whether a change is an improvement at all, regardless of its size. If you want to measure progress over long periods of time, run an occasional gauntlet to measure the cumulative effect of your improvements.
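One way to see where the factor of 4 in B) comes from, as a rough sketch under simplifying assumptions: every game has the same per-game standard deviation sigma in score, games are independent, and scores stay near 50% so that a score difference maps to an Elo difference the same way in both setups. The function names and the value sigma = 0.4 are illustrative only, not taken from any testing tool.

```python
from math import sqrt

def se_direct(total_games, sigma=0.4):
    """Standard error of the head-to-head score from one direct match."""
    return sigma / sqrt(total_games)

def se_gauntlets(total_games, sigma=0.4):
    """Standard error of (new score - old score) when the same number of
    games is split evenly between two gauntlets against a common pool."""
    per_version = total_games / 2
    se_one = sigma / sqrt(per_version)
    # The two gauntlets are independent, so their errors add in quadrature.
    return sqrt(se_one ** 2 + se_one ** 2)

def games_ratio(total_games=10_000, sigma=0.4):
    """Gauntlet games needed per direct-match game for equal error."""
    # Standard error scales as 1/sqrt(games), so the ratio of games is the
    # squared ratio of the standard errors at equal game counts.
    return (se_gauntlets(total_games, sigma) / se_direct(total_games, sigma)) ** 2

print(round(games_ratio(), 6))  # prints 4.0
```

The ratio does not depend on sigma or on the number of games. It also ignores the exaggeration effect in A) and any uncertainty in the pool ratings, which this sketch makes no attempt to model.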

 Posts: 439
 Joined: Tue Apr 19, 2016 4:08 am
 Location: U.S.A
 Full name: Andrew Grant
Re: Engine testing & error margin ?
Here's what I do. I play against a pool of engines. After the test and base versions of my engine have played 250+ games against each opponent, I start computing the T statistic given by the code below.
I then compare this to a T value with the correct degrees of freedom (numopponents - 1) for both type I and type II errors, and play games until the computed score exceeds one.
http://faculty.washington.edu/heagerty/ ... /tTables/
I had been trying your described method for a while, but found the time to verify and reject a patch to be too long.
Code:
from math import log10, sqrt

def ELO(score):
    # Assumed helper, not shown in the post: standard logistic conversion
    # of a score fraction in (0, 1) to an Elo difference.
    return -400 * log10(1 / score - 1)

def tstatistic(testmatchups, basematchups, numopponents):
    # Function name and signature are illustrative; the post shows only the body.
    differences = []
    for i in range(numopponents):
        # Compute the score for the test matchup
        n = testmatchups[i].games()
        w = testmatchups[i].wins / n
        d = testmatchups[i].draws / n
        stest = w + d / 2
        # Compute the score for the base matchup
        n = basematchups[i].games()
        w = basematchups[i].wins / n
        d = basematchups[i].draws / n
        sbase = w + d / 2
        # Add this difference to our master list
        differences.append(ELO(stest) - ELO(sbase))
    # Compute the average difference between pairings
    mean = sum(differences) / numopponents
    # Compute the sample variance of the differences
    v = sum((x - mean) ** 2 for x in differences) / (numopponents - 1)
    # Compute the T statistic
    return mean / sqrt(v / numopponents)
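As a small worked example of the comparison against a t-table, here is a sketch that applies the same paired T statistic directly to a list of per-opponent Elo differences. The five differences are made-up numbers purely for illustration, and the function name is hypothetical. The threshold 2.132 is the one-sided 5% critical value of Student's t with 4 degrees of freedom, as found in standard t-tables.

```python
from math import sqrt

def paired_t(differences):
    """T statistic for the mean of per-opponent Elo differences."""
    k = len(differences)
    mean = sum(differences) / k
    # Sample variance with k - 1 in the denominator.
    var = sum((x - mean) ** 2 for x in differences) / (k - 1)
    return mean / sqrt(var / k)

# Hypothetical Elo(test) - Elo(base) against five opponents.
diffs = [12.0, 5.0, 9.0, -2.0, 6.0]
t = paired_t(diffs)

# One-sided 5% critical value for 5 - 1 = 4 degrees of freedom.
T_CRIT = 2.132
accept = t > T_CRIT
print(f"t = {t:.3f}, accept = {accept}")
```

With these numbers the mean difference is 6 Elo and t is about 2.56, so the patch would pass at this threshold. With only a handful of opponents the degrees of freedom are small, so the critical value is noticeably larger than the familiar normal-distribution figure.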