Not only that: even answered it about 30 pages of posts ago.
bob wrote: Believe I got that 55 years ago or so. The point being addressed was how does something with a SD of 12 (or less) still produce two runs where the results are well outside that range.
Get that?
Reducing the number of games is not compatible with insisting on being able to measure the same tiny Elo differences you wanted to measure with 25,000 games. No use sulking about it. If you want to create a model problem that you can test quickly, you will have to scale the requirement for accuracy accordingly.
I just reduced the number of positions so that I could run the various tests being suggested, and not make the discussion drag out over months.
Apparently not, as you continue complaining that the 800 games show a larger deviation than would make it a 'useful' replacement for your 25,000-game run.
I fully understand that 25,000 games produces a smaller SD than 800.
Where did you get that silly idea from?
But, unlike yourself, I also fully understand that running 800 games takes a lot less time,
You're welcome.
and if you recognize that the SD goes up and the size goes down, the discussion can _still_ reach some sort of conclusion that could be verified on the bigger run if necessary.
Because _everybody_ is using some sort of a vs b testing to either measure rating differences to see who is better, or whether a change was good. No idea how you do your testing and draw your conclusions, and I don't really care. But I am addressing what _most_ are doing. And things are progressing in spite of your many non-contributions, thank you.
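For anyone following the 800-versus-25,000 argument, here is a minimal sketch of the scaling both sides are referring to. It assumes the games are independent, so that the SD of a run's result shrinks as 1/sqrt(number of games); the function names are made up for illustration and are not taken from anyone's test setup.

```python
import math

def scaled_sd(sd_ref, n_ref, n_new):
    """SD expected for a run of n_new games, given the SD seen with n_ref games.

    Assumes independent games, so SD is proportional to 1/sqrt(games).
    """
    return sd_ref * math.sqrt(n_ref / n_new)

def sd_of_difference(sd_run):
    """SD of the difference between two independent runs of equal size."""
    return math.sqrt(2.0) * sd_run

# Factor by which the SD grows when going from 25,000 games down to 800.
factor = scaled_sd(1.0, 25000, 800)
print(f"SD grows by a factor of about {factor:.2f}")
```

Under that assumption the shorter run can only resolve differences several times larger than the long run can, which is the sense in which the accuracy requirement has to be scaled along with the number of games.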

Yes, life is difficult isn't it, for those relying on blind guesses in the face of infinities. The scientific approach would of course be to calculate how many you need. You have tried 40 different positions, and did a good deal more than 40 games on each of those. So you are in a position to calculate the standard deviation of the result over variation of the number of games. Calculate by which factor that falls short of the accuracy you want, calculate the square, and, magical trick, you have the number of needed positions.
OK, we are playing with number of positions. I am up to almost 4,000 and am testing this. How many opponents? 4,000 needed there too? If so, I can actually play 16,000,000 games. But can anyone else? Didn't think so, so we need something that is both (a) useful and (b) doable. So far you are providing _neither_ while at least others are making suggestions that can be tested.
Oh, sorry, too difficult. Another useless non-contribution. And you of course no longer have the results of the 25,000-game matches specified by position...
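The calculation described above (take the spread of the results, see by what factor it misses the wanted accuracy, square that factor) can be sketched roughly as follows. This is one reading of it, not anyone's actual script: it assumes each position has been played often enough that its score is dominated by the position itself rather than by game-to-game noise, and the function name and the sample numbers are invented for illustration.

```python
import math
import statistics

def positions_needed(per_position_scores, target_sd):
    """Estimate how many positions are needed to reach target_sd.

    per_position_scores: average score obtained from each tested position.
    target_sd: the SD you want for the result averaged over all positions.
    Averaging N independent positions gives SD sd_pos / sqrt(N), so we need
    N >= (sd_pos / target_sd) ** 2.
    """
    sd_pos = statistics.stdev(per_position_scores)  # spread between positions
    shortfall = sd_pos / target_sd                  # factor we fall short by
    return math.ceil(shortfall ** 2)                # square it: positions needed

# Hypothetical per-position scores (fraction of points), purely illustrative.
scores = [0.4, 0.6, 0.5, 0.7, 0.3]
print(positions_needed(scores, target_sd=0.02))
```

With real data the list would hold one entry per tested position, 40 in the case discussed here.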
In your dreams, yes.
BTW your "most frequent suggestion, by far" has not been to use more positions or more opponents. 99% of your posts are "stampee feet, testing flawed, stampee feet, cluster is broken, stampee feet, there are dependencies between the games, stampee feet, stampee feet." None of which is useful. I had already pointed out that the _only_ dependencies present were tied to same opponents and same positions.
But as you used the same positions and opponents in both runs, these dependencies (if any) should decrease the variability of the runs (and hence the typical difference between two runs), and thus cannot explain your result, which is too large a difference, not too small. That these effects drive up the difference between what your runs produce and what you really wanted to know is irrelevant for explaining the 6-sigma deviation, as you remarked yourself in the very beginning of the previous thread.
I guess this is because of the unfortunate coincidence that I continue to overlook your explanation of how you excluded slow time-dependence of the involved engines as an artifact spoiling your experiment. Can you give me the link back to the post where you did that?
But not where one game influences another in any possible way. But we don't seem to be able to get away from that.

Well, so at least you are following my earlier advice, then:
Meanwhile, in spite of all the noise, there is a small and steadily helpful signal buried in here that others are contributing. And I am willing to test 'em all without dismissing _anything_ outright. Unlike yourself.
Muddle on!


