I guess it is a bit complicated. To match the table's 5% column I translate as follows: First, since that page discusses 1-sided tests and my program wants to do 2-sided tests, I tell my program alpha=0.1. (5% probability of mistakenly saying program A is stronger, 5% probability of mistakenly saying program B is stronger.)Edmund wrote:Maybe I am doing something wrong here, but I am still not getting quite the same values. eg n=100; alpha=5% --> should be 57.5%

Second, I set cutoff = 0 each round and look at the largest cutoff allowed each round. (The table assumes you don't terminate early, so that's the same as terminating if score<0.)

Finally, I translate the reported max-cutoff values: if the program says "h" half-points, then the winning percentage must be >= h/(2g) to continue, hence > (2g-h)/(2g) to conclude superiority, hence >=(2g-h+0.5)/(2g) to conclude superiority. This last number should match those in Mr. Ciarrochi's table.

It just means, insert a chance of early termination whenever it is possible (i.e., doesn't push the entire test's probability of false positive over the requested alpha). These insertions are likely to pay off if one program is far superior, and to hurt otherwise. It's not a practical way to design the superiority test, but it and the opposite approach (constant # of games with no early termination rules) are the simplest to code.Edmund wrote:Furthermore, would you elaborate on the "Optimize for very different strength." feature, I don't quite understand that.

Sure, rounding errors could happen. I wouldn't worry much because none of the operations cancel significant digits. Furthermore you do the same number of ops regardless of the order of recursion (since you increment w, d, and l the same # of times). So I'd make an arbitrary choice (and also memoize results if my goal was to output a table), then compare answers after changing doubles to floats in the program to gauge the roundoff errors.Edmund wrote:Great idea! I didn't think of that. The problem I see though is that you may worsen rounding errors by that. Do you have any ideas on how to travers the permutations most efficiently given that win is changed at last (to generate the output) and to minimize the rounding errors (so maybe go from biggest to smallest)

Well, now we have (assuming my code doesn't have more nasty bugs) a huge family of statistically sound tests, some more aggressive than others about stopping early. To choose between them you'd need to guess how probable each ELO difference is, since that determines the average # of games these tests require. I haven't done this yet, but I'd start with a very crude guess and see whether you can even get a significant improvement in average testing time vs the standard method of running a fixed # of games.Edmund wrote:Back to the original problem; any ideas on how we could use these new findings to answer the question on when to stop without the need of a simulation.