I fully understand that 25,000 games produces a smaller SD than 800.
Apparently not, as you continue complaining that the 800 games show a larger deviation than would make it a 'useful' replacement for your 25,000-game run.
But, unlike yourself, I also fully understand that running 800 games takes a lot less time,
Where did you get that silly idea from?
and if you recognize that the SD goes up and the size goes down, the discussion can _still_ reach some sort of conclusion that could be verified on the bigger run if necessary.
Because _everybody_ is using some sort of a vs b testing to either measure rating differences to see who is better, or whether a change was good. No idea how you do your testing and draw your conclusions, and I don't really care. But I am addressing what _most_ are doing. And things are progressing in spite of your many non-contributions, thank you.
You're welcome.
OK, we are playing with number of positions. I am up to almost 4,000 and am testing this. How many opponents? 4,000 needed there too? If so, I can actually play 16,000,000 games. But can anyone else? Didn't think so, so we need something that is both (a) useful and (b) doable. So far you are providing _neither_ while at least others are making suggestions that can be tested.
Yes, life is difficult, isn't it, for those relying on blind guesses in the face of infinities. The scientific approach would of course be to calculate how many you need. You have tried 40 different positions, and did a good deal more than 40 games on each of those. So you are in a position to calculate the standard deviation of the result over variation of the number of games. Calculate by which factor that falls short of the accuracy you want, calculate the square, and, magical trick, you have the number of needed positions.
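For anyone who wants to try that calculation, here is a minimal sketch. It rests on assumptions not stated above: that the relevant spread is the standard deviation of the per-position scores, that each position has enough games for game-to-game noise to be small next to that spread, and that the squared shortfall factor is applied to the current number of positions. The function name `positions_needed` and its parameters are made up for illustration.

```python
import math
import statistics

def positions_needed(per_position_scores, target_se):
    """Estimate how many positions give a standard error of target_se.

    per_position_scores: one average score (0..1) per tested position.
    target_se: the accuracy wanted for the overall average score.
    Assumption: positions are independent samples, so the standard
    error of the mean shrinks as 1/sqrt(number of positions).
    """
    n = len(per_position_scores)
    sd = statistics.stdev(per_position_scores)  # spread between positions
    current_se = sd / math.sqrt(n)              # accuracy reached so far
    factor = current_se / target_se             # shortfall factor
    return math.ceil(n * factor ** 2)           # square it, scale current n

# Hypothetical data: 40 positions whose scores alternate 40% and 60%.
scores = [0.4, 0.6] * 20
print(positions_needed(scores, target_se=0.008))
```

If the result comes out smaller than the number of positions already in use, the current set is already accurate enough by this measure.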
Oh, sorry, too difficult. Another useless non-contribution. And you of course no longer have the results of the 25,000-game matches specified by position...
BTW your "most frequent suggestion, by far" has not been to use more positions or more opponents. 99% of your posts are "stampee feet, testing flawed, stampee feet, cluster is broken, stampee feet, there are dependencies between the games, stampee feet, stampee feet." None of which is useful. I had already pointed out that the _only_ dependencies present were tied to same opponents and same positions.
In your dreams, yes.
But as you used the same positions and opponents in both runs, these dependencies (if any) should decrease the variability of the runs (and hence the typical difference between two runs), and thus cannot explain your result, which is too large a difference, not too small. That these effects drive up the difference between what your runs produce and what you really wanted to know is irrelevant for explaining the 6-sigma deviation, as you remarked yourself in the very beginning of the previous thread.
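The point about shared positions can be checked with a toy simulation. This is only a sketch under loud assumptions: games are win/loss only (no draws), each position shifts the win probability by a random bias, and all the numbers (40 positions, 20 games each, a position spread of 0.1) are invented for illustration rather than taken from the actual runs. It compares the spread of the difference between two runs when both use the same positions against the spread when each run draws fresh positions.

```python
import random
import statistics

def run_score(biases, games_per_position, rng):
    """Play a simulated run and return the fraction of games won."""
    wins = 0
    games = 0
    for bias in biases:
        p_win = min(max(0.5 + bias, 0.0), 1.0)  # position shifts win chance
        for _ in range(games_per_position):
            wins += 1 if rng.random() < p_win else 0
            games += 1
    return wins / games

def sd_of_run_difference(shared_positions, n_positions=40,
                         games_per_position=20, position_sd=0.1,
                         trials=300, seed=1):
    """SD of (run A - run B) over many simulated pairs of runs."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        biases_a = [rng.gauss(0, position_sd) for _ in range(n_positions)]
        if shared_positions:
            biases_b = biases_a  # both runs use the same positions
        else:
            biases_b = [rng.gauss(0, position_sd) for _ in range(n_positions)]
        score_a = run_score(biases_a, games_per_position, rng)
        score_b = run_score(biases_b, games_per_position, rng)
        diffs.append(score_a - score_b)
    return statistics.stdev(diffs)

sd_shared = sd_of_run_difference(shared_positions=True)
sd_fresh = sd_of_run_difference(shared_positions=False)
print(f"same positions:  {sd_shared:.4f}")
print(f"fresh positions: {sd_fresh:.4f}")
```

In this toy model the position effect cancels out of the difference when both runs share positions, so only game-to-game noise remains. A run-to-run difference that is too large has to come from something else.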
But not where one game influences another in any possible way. But we don't seem to be able to get away from that.
I guess this is because of the unfortunate coincidence that I continue to overlook your explanation of how you excluded slow time-dependence of the involved engines as an artifact spoiling your experiment. Can you give me the link back to the post where you did that? :roll