Originally, this was roughly my process:

I have two engines, one called test and one called base. Both play a number of games against engines A, B, C, ....

I decide that I want 95% confidence that there will not be a false positive or a false negative. From this I get some bounds, [-X, X]. If the Z value I calculate falls outside those bounds, I terminate the test. Otherwise I play more games.
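As a sketch of where X could come from: assuming Z is approximately standard normal and the 95% is meant as a two-sided interval (both are assumptions, and the helper names `z_bound` and `should_stop` are hypothetical), X is the 97.5% quantile of the normal distribution, roughly 1.96.

```python
from statistics import NormalDist


def z_bound(confidence: float = 0.95) -> float:
    """Return X such that a standard normal variable lies in [-X, X]
    with probability `confidence` (two-sided interval, an assumption)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def should_stop(z: float, confidence: float = 0.95) -> bool:
    """Stop the test once Z falls outside [-X, X]; otherwise keep playing."""
    return abs(z) > z_bound(confidence)
```

With these helpers, `should_stop(z)` would be called after each new batch of games.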

To calculate Z, I do the following:


```
testmean = 0; teststd = 0;
for matchup in test matchups:
    w, d, l = matchup.results
    n = w + d + l
    s = w + d/2
    p = s / n
    diff = -400 * log10(1/p - 1)
    testmean += matchup.opponentsELO + diff
    std = sqrt(p * (1-p) / n)
    upperstd = -400 * log10(1/(p+std) - 1)
    lowerstd = -400 * log10(1/(p-std) - 1)
    teststd += (upperstd + lowerstd) / 2
Repeat again for basemean and basestd
Z = ((testmean - basemean) / numOpponents) / sqrt((testvar + basevar) / numOpponents)
```
```
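For anyone who wants to experiment with this, here is a minimal runnable sketch of one possible reading of the pseudocode above. It is not a transcription: the pseudocode accumulates `teststd` but uses `testvar` in the last line, so the sketch makes three assumptions of its own. The per-matchup Elo standard error is taken as the half-width `(upper - lower) / 2`, the variances (squared standard errors) are summed rather than the standard deviations, and the variance of the averaged rating is taken as that sum divided by the number of opponents squared. The function names and the tuple layout are hypothetical.

```python
from math import log10, sqrt


def elo_from_score(p):
    """Logistic Elo difference implied by a score fraction p in (0, 1)."""
    return -400 * log10(1 / p - 1)


def rating_estimate(matchups):
    """matchups: list of (wins, draws, losses, opponent_elo) tuples.

    Returns (mean rating estimate, variance of that mean).
    Assumes 0 < p - std and p + std < 1 for every matchup,
    otherwise log10 raises a ValueError.
    """
    rating_sum = 0.0
    var_sum = 0.0
    for w, d, l, opponent_elo in matchups:
        n = w + d + l
        p = (w + d / 2) / n
        rating_sum += opponent_elo + elo_from_score(p)
        std = sqrt(p * (1 - p) / n)
        # Assumption: Elo standard error = half the distance between
        # the Elo values at p + std and p - std.
        half_width = (elo_from_score(p + std) - elo_from_score(p - std)) / 2
        var_sum += half_width ** 2
    k = len(matchups)
    # Assumption: matchups are independent, so Var(mean) = sum(var) / k^2.
    return rating_sum / k, var_sum / k ** 2


def z_score(test_matchups, base_matchups):
    """Z statistic for the difference between the two averaged ratings."""
    test_mean, test_var = rating_estimate(test_matchups)
    base_mean, base_var = rating_estimate(base_matchups)
    return (test_mean - base_mean) / sqrt(test_var + base_var)
```

For example, `z_score([(60, 20, 20, 2500)], [(40, 20, 40, 2500)])` compares a test engine scoring 70% against a base engine scoring 50% versus the same 2500-rated opponent.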

Any help, or a pointer towards some helpful reading material, would be greatly appreciated.

Thanks,

Andrew Grant