The Rybka testing using 80K games to "measure Elo changes as small as 1" (Larry K's words, not mine) is, as I said when it was posted, incorrect. I don't believe you can actually measure Elo changes of +/- 1 with any degree of confidence at all, without playing hundreds of thousands of games at a minimum.
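To put a number behind that, here is a rough back-of-the-envelope sketch (mine, not anything from BayesElo) of the 95% margin of error on a measured Elo difference as a function of game count, assuming independent games and a draw rate of roughly 35%; the exact figures shift with the real draw rate and opponent pool.

# Rough 95% margin of error (in Elo) on a measured rating difference,
# assuming N independent games and a fixed draw rate.  Illustrative only.
import math

def elo_margin_95(n_games, draw_rate=0.35, score=0.5):
    # Per-game variance of the score (win=1, draw=0.5, loss=0); with an even
    # match this is 0.25 - draw_rate/4.
    per_game_var = score * (1 - score) - draw_rate / 4
    se_score = math.sqrt(per_game_var / n_games)
    # Slope of the Elo curve near a 50% score:
    # dElo/dscore = 400 / (ln(10) * score * (1 - score)) ~ 695 at score = 0.5
    elo_per_score = 400 / (math.log(10) * score * (1 - score))
    return 1.96 * elo_per_score * se_score

for n in (800, 20000, 80000, 300000):
    print(f"{n:>7} games: +/- {elo_margin_95(n):.1f} Elo")

Under those assumptions, 80K games still leaves a window of roughly +/- 2 Elo, and it does not shrink to +/- 1 until you are somewhere around 300K games, which is the point being made above.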
I think they can, as they undoubtedly conduct the tests properly, keeping an eye on the statistics, and confirming that the games were indeed independent.
The BayesElo confidence interval is 95%, which, as I have said previously, doesn't mean a lot when programs exhibit so much natural non-determinism.
I love that "superior attitude" you have. Nobody but you knows how to run a "proper test". I have already spent months analyzing what is going on. And I tracked it down to "timing jitter" and absolutely nothing else. Several have looked at the data. We've checked NPS. We restart engines after each game. The list goes on and on. The games are as independent as games can be when they use the same group of opponents and starting positions are used. There is no left over hash. No learning of any kind. No random loads on the nodes. Nothing that changes from one game to another. Carefully confirmed by hundreds of NPS checks. I even run with code in Crafty to watch this from time to time, on occasions when I run on the cluster in "open mode" (where other users can use nodes I am not using but not the nodes I am using). But these results were on a "closed system" where I was "it".
You keep wanting to say the experimental setup is wrong. I say you are full of it and need to change the channel to something worthwhile. Perhaps you could explain to me how even random performance noise would affect the match anyway. Isn't random timing taken care of by statistical sampling? In my book it is.
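To make that last point concrete, here is a toy Monte Carlo sketch (my illustration, not the cluster code): each game of an evenly matched pair gets an independent random "timing jitter" added to its effective strength, and the resulting match-score distribution is statistically indistinguishable from the no-jitter case, which is exactly what "taken care of by statistical sampling" means. Draws are ignored for simplicity and the jitter sizes are made up.

# Independent per-game timing jitter behaves like ordinary sampling noise:
# it does not bias the mean score and does not make one game's result
# depend on another's.
import random, statistics

def play_game(true_elo_edge, jitter_elo):
    # Effective edge for this one game = true edge + independent jitter.
    edge = true_elo_edge + random.gauss(0, jitter_elo)
    p_win = 1 / (1 + 10 ** (-edge / 400))   # logistic Elo expectation
    return 1 if random.random() < p_win else 0

def match_score(n_games, jitter_elo):
    return sum(play_game(0, jitter_elo) for _ in range(n_games)) / n_games

random.seed(1)
for jitter in (0, 20, 50):
    scores = [match_score(800, jitter) for _ in range(200)]
    print(f"jitter {jitter:>2} Elo: mean {statistics.mean(scores):.3f}, "
          f"stdev {statistics.stdev(scores):.4f}")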
I am sorry I have to say this for the umpteenth time, but this is pure nonsense. Non-determinism cannot explain these results. Only dependence among the game results can.
And I will say it for the umpteenth-plus-one time: there is _zero_ dependence. Why don't you tell me a methodology that would make the games somehow dependent on each other in the first place, using my program plus 5 completely unmodified open-source programs (one of which is, of course, an old version of my program). Just tell me how to make the games "dependent". When each game is played, two more instances of the two programs are started, sides are switched, and the game is played again.
I'm waiting on that intellectual jewel so that I might understand how it is even possible. My testing scheme (at a lower level this time around) goes like this: I create a potload of simple shell scripts to run my referee program, which fires up one instance of each opponent and connects them together much as xboard/winboard does. Once that game ends, everything terminates, and once the script ends, another one is sent to that node, where it is started and plays another game. No shared data. No shared files. No interaction between nodes. No endgame tables. No nothing.
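For reference, here is a minimal sketch of the kind of per-node harness being described, written in Python rather than the actual shell scripts; the "referee" command, engine names, and file names are placeholders, not the real cluster configuration.

# Each game is a fresh, self-contained process tree: a new referee starts new
# engine instances, connects them (much as xboard/winboard would), and exits
# when the game ends, so no state survives into the next game.
import subprocess

ENGINE_UNDER_TEST = "crafty-dev"          # placeholder binary names
OPPONENTS = ["crafty-old", "engine-a", "engine-b", "engine-c", "engine-d"]
POSITIONS = [f"pos{i:03d}.epd" for i in range(40)]  # fixed starting positions

def play_one_game(white, black, position):
    """Run one game under the referee and return its result line."""
    out = subprocess.run(
        ["referee", "--white", white, "--black", black, "--position", position],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

with open("results.txt", "a") as log:
    for opponent in OPPONENTS:
        for position in POSITIONS:
            # Play each position twice with colors reversed, as described above.
            log.write(play_one_game(ENGINE_UNDER_TEST, opponent, position) + "\n")
            log.write(play_one_game(opponent, ENGINE_UNDER_TEST, position) + "\n")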
So I am waiting for you to tell me how, testing with that methodology, I might contrive an experimental setup such that the games are dependent on each other.
Once more, the runs were consecutive runs. I did not cherry-pick the wildest 4 out of hundreds of matches. I just ran 4 matches and cut/pasted the results. These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
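As a sanity check on that last claim, here is a small simulation sketch (mine, with made-up match sizes and draw rate) of two literally identical engines playing a few consecutive matches: the run-to-run spread from sampling noise alone is big enough to flip a go/no-go decision.

# Repeated short matches between two *identical* engines, to show how far the
# results swing on sampling noise alone.  800 games and a 35% draw rate are
# illustrative choices, not anyone's actual test settings.
import random

def run_match(n_games, draw_rate=0.35):
    score = 0.0
    for _ in range(n_games):
        r = random.random()
        if r < draw_rate:
            score += 0.5                              # draw
        elif r < draw_rate + (1 - draw_rate) / 2:
            score += 1.0                              # win (engines are equal)
    return score

random.seed(2)
n = 800
results = [run_match(n) for _ in range(4)]            # four "consecutive runs"
print("scores out of", n, ":", [round(s, 1) for s in results])
print("spread (max - min):", round(max(results) - min(results), 1), "points")

With no change at all in either engine, scores out of 800 routinely differ by a couple of dozen points between runs, which is on the order of 15-20 Elo of pure noise.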
As I said before, there was nothing suspicious about these 4 results. You have no case.