It is a great resource to me, but what surprises me is one of the sidenotes by the author:
# It is critical that you choose your sample size ahead of time, and do not make any conclusions until you have run the full tournament. It is incorrect, statistically, to watch the running of the tournament, wait until an engine reaches a cut-off, and then stop the tournament.
Does this make sense? Why would it make a difference if I stopped the tournament in the middle (e.g. to save time) when one player is obviously stronger? After all, I only want to know whether A > B, not an exact rating.
Stopping early because you're out of time should be the same as a shorter tournament.
Stopping early because of some sort of rating or result cutoff would screw up the confidence margins, because those margins assume the number of games did not depend on the results.
Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
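To make the coin example concrete, here is a small simulation sketch. The function names, the 2000 trials and the fixed seed are illustrative choices of mine, not anything from the quoted guide. It flips a fair coin and counts how often "20 more heads than tails" shows up at some point during the run, compared with how often it is true at the end.

```python
import random

def run_once(n_flips, lead, rng):
    """Flip a fair coin n_flips times.

    Returns (hit_early, hit_at_end): whether heads minus tails ever reached
    `lead` during the run, and whether it is at least `lead` after the
    final flip.
    """
    diff = 0
    hit_early = False
    for _ in range(n_flips):
        diff += 1 if rng.random() < 0.5 else -1
        if diff >= lead:
            hit_early = True
    return hit_early, diff >= lead

def compare(n_flips=1000, lead=20, trials=2000, seed=0):
    """Fraction of fair-coin runs flagged as 'biased' under each rule."""
    rng = random.Random(seed)
    early = end = 0
    for _ in range(trials):
        hit_early, hit_at_end = run_once(n_flips, lead, rng)
        early += hit_early
        end += hit_at_end
    return early / trials, end / trials

if __name__ == "__main__":
    peeking, fixed = compare()
    print(f"cutoff reached at some point: {peeking:.2f}")
    print(f"cutoff met after all flips:   {fixed:.2f}")
```

By a normal approximation, the watch-and-stop rule should fire on roughly half of the fair-coin runs, against roughly a quarter when you only look once at the end. The coin is the same; only the stopping rule changed.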
Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.
It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.
Is there a way to determine the cutoff win-% after n played games if we want an error bar of, say, 1%?
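For the fixed-size case, the usual normal-approximation formula gives a rough idea of what a 1% error bar costs. This is only a sketch: `games_for_margin` is a name I made up, the default `z=1.96` assumes a two-sided 95% interval, and `p=0.5` is the worst case for the variance of a per-game score (draws only make the variance smaller).

```python
import math

def games_for_margin(margin, z=1.96, p=0.5):
    """Games needed so that z * sqrt(p * (1 - p) / n) <= margin.

    margin is the half-width of the error bar on the score fraction,
    e.g. 0.01 for +/- 1%. Uses the normal approximation, so it is a
    planning estimate rather than an exact answer.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be between 0 and 1")
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

if __name__ == "__main__":
    for m in (0.05, 0.02, 0.01):
        print(f"+/- {m:.0%}: about {games_for_margin(m)} games")
```

For a 1% error bar this comes out at about 9,600 games, which is why stopping early in the obvious cases is so tempting.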
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
I would imagine your N also depends on the number of games played so far.
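One way to put a number on N is Wald's sequential probability ratio test, the classic form of the sequential analysis mentioned earlier in the thread. The sketch below makes several assumptions that are mine, not anyone's in this thread: draws are ignored, and the two hypotheses are symmetric ("A wins a decisive game with probability p1" versus "B does"), with the same error rate alpha for both. The function name `lead_needed` is hypothetical. Under those assumptions the log-likelihood ratio is just (wins minus losses) times log(p1 / (1 - p1)), so the test reduces to "play until one side leads by N".

```python
import math

def lead_needed(p1, alpha=0.05):
    """Lead (wins minus losses) at which to stop and declare a winner.

    Tests 'A wins a decisive game with probability p1' against
    'A wins with probability 1 - p1', accepting a wrong verdict with
    probability at most alpha under either hypothesis.
    """
    if not 0.5 < p1 < 1.0:
        raise ValueError("p1 must be strictly between 0.5 and 1")
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must be strictly between 0 and 0.5")
    # Wald's stopping bound on the log-likelihood ratio, divided by the
    # contribution of a single net win.
    return math.ceil(math.log((1 - alpha) / alpha) / math.log(p1 / (1 - p1)))

if __name__ == "__main__":
    for p1 in (0.55, 0.60, 0.75):
        print(f"p1 = {p1}: stop at a lead of {lead_needed(p1)}")
```

In this particular setup N does not depend on the number of games played so far; it depends only on how small a difference you want to detect and how much risk you accept. Other sequential designs, such as repeating an ordinary significance test after every game, do need a cutoff that changes with the game count.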