Engine testing & error margin ?

Posted: Wed Jul 05, 2017 12:34 am
by MahmoudUthman
I'm testing my engine by running two gauntlets, one for the current version and one for the previous version, against a pool of engines that I previously rated with a round robin at the same time control. How much can the error margins of the round robin affect those of the gauntlets, and is it better to improve the accuracy of the pool's strength estimates before improving that of the gauntlets?
One more thing: so far, the modifications I made resulted in Elo differences that were huge compared to the error margin, but that is no longer the case. So, to verify that engine A is stronger than engine B, do I need to ensure the following:
Elo(A) - error margin(A) > Elo(B) + error margin(B), where A's Elo is higher than B's?
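
To make that check concrete, here is a minimal Python sketch of the comparison. It assumes the usual logistic Elo formula and a normal approximation for the score (z = 1.96, roughly 95%); the function names and the win/draw/loss counts are made up for illustration and do not come from any particular rating tool.

```python
from math import log10, sqrt

def elo_from_score(score):
    # Logistic Elo difference implied by a score fraction in (0, 1)
    return -400.0 * log10(1.0 / score - 1.0)

def elo_with_margin(wins, draws, losses, z=1.96):
    # Returns (elo, lower bound, upper bound) for one gauntlet result.
    # Assumes the score interval stays strictly inside (0, 1).
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1 = win, 0.5 = draw, 0 = loss)
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    half_width = z * sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - half_width),
            elo_from_score(score + half_width))

def clearly_stronger(a, b):
    # The non-overlap check: lower bound of A above upper bound of B
    _, a_low, _ = a
    _, _, b_high = b
    return a_low > b_high

# Hypothetical results for two versions against the same pool
version_a = elo_with_margin(600, 300, 100)
version_b = elo_with_margin(400, 300, 300)
print(clearly_stronger(version_a, version_b))
```

With results that are close together (for example 420/300/280 against 400/300/300) the two intervals overlap and the check fails, which is exactly the situation described above.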

Re: Engine testing & error margin ?

Posted: Wed Jul 05, 2017 1:51 am
by AlvaroBegue
I know this has been the subject of some controversy in the past, so take this as just my opinion: Just test your new version against your previous version directly. This is much better than using a variety of opponents because:
A) The size of the improvements is exaggerated when matching two similar versions of the same program.
B) The arithmetic of the error bars is such that you need 4 times fewer games to reach the same level of confidence.

It might seem at first that A) is a disadvantage, but you are primarily interested in whether a change is an improvement or not, regardless of its size. If you want to measure progress over long periods of time, run an occasional gauntlet to measure the cumulative effect of your improvements.
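
The post does not spell out the arithmetic behind B), so the following Python sketch shows one common way to arrive at the factor of four. It assumes the same total number of games in both setups, the same per-game standard deviation (the value 0.5 is only a placeholder), independent errors for the two gauntlets, and it ignores any uncertainty in the pool's own ratings. The function names are illustrative.

```python
from math import sqrt

def stderr_direct(total_games, sigma=0.5):
    # All games are new version vs. old version, so the score
    # measures the difference directly.
    return sigma / sqrt(total_games)

def stderr_gauntlet_difference(total_games, sigma=0.5):
    # Each version only gets half of the games against the pool...
    per_version = sigma / sqrt(total_games / 2)
    # ...and the difference of two independent measurements has
    # sqrt(2) times the error of each one.
    return sqrt(2) * per_version

games = 1000
ratio = stderr_gauntlet_difference(games) / stderr_direct(games)
print(ratio)       # error bar is about twice as wide
print(ratio ** 2)  # so about four times as many games are needed
```

Because the error shrinks with the square root of the number of games, an error bar that is twice as wide takes four times as many games to bring back to the same size.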

Re: Engine testing & error margin ?

Posted: Wed Jul 05, 2017 2:19 am
by AndrewGrant
Here's what I do. I play against a pool of engines. After the test and base versions of my engine have each played 250+ games against each opponent, I start computing the following value:

from math import log10, sqrt

def ELO(score):
    # ELO() was not shown in the post; the standard logistic
    # formula is assumed here. score must be strictly in (0, 1).
    return -400.0 * log10(1.0 / score - 1.0)

def tstatistic(testmatchups, basematchups):
    # Wrapper name is illustrative; the post only showed the body.
    numopponents = len(testmatchups)
    differences = []

    for i in range(numopponents):

        # Compute the score for the test matchup
        n = testmatchups[i].games()
        w = testmatchups[i].wins / n
        d = testmatchups[i].draws / n
        stest = w + d / 2

        # Compute the score for the base matchup
        n = basematchups[i].games()
        w = basematchups[i].wins / n
        d = basematchups[i].draws / n
        sbase = w + d / 2

        # Add this difference to our master list
        differences.append(ELO(stest) - ELO(sbase))

    # Compute the average difference between pairings
    d = sum(differences) / numopponents

    # Compute the sample variance using our difference samples
    variances = [(diff - d) ** 2 for diff in differences]
    v = sum(variances) / (numopponents - 1)

    # Compute the T statistic
    return d / sqrt(v / numopponents)
I then compare this to a T value with the correct degrees of freedom (numopponents - 1) for both type I and type II errors, and play games until the computed statistic exceeds one of them.

http://faculty.washington.edu/heagerty/ ... /t-Tables/
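
As a rough illustration of the comparison step, here is a small self-contained Python sketch. The per-opponent Elo differences are made-up numbers, and the critical value 2.132 is the standard one-sided 5% entry for 4 degrees of freedom (five opponents) from a t-table. It only shows the type I side of the check, not the full stopping rule.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(differences):
    # Mean per-opponent Elo difference divided by its standard error
    # (stdev uses the n - 1 denominator, matching the code above)
    k = len(differences)
    return mean(differences) / (stdev(differences) / sqrt(k))

# Hypothetical Elo(test) - Elo(base) against each of five opponents
differences = [12.0, 8.0, 15.0, 5.0, 10.0]

# One-sided 5% critical value for 5 - 1 = 4 degrees of freedom
T_CRITICAL = 2.132

t = t_statistic(differences)
patch_looks_better = t > T_CRITICAL
print(round(t, 3), patch_looks_better)
```

The same comparison with a negative threshold would cover rejecting a patch; how strict to make either threshold is a choice about how many false positives and false negatives you are willing to accept.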

I had been trying your described method for a while, but found the time to verify and reject a patch to be too long.