Dirt wrote: hgm wrote: mathmoi wrote: If Dr Hyatt is right, the Crafty Elo of each sample should still vary "a lot".
I don't understand that conclusion. I would say that if two randomly picked 25,000-game samples from the 50,000 games differed too much, it shows you did not pick randomly.
That and only that.
That and nothing more. Quoth the raven, "nevermore"...
Or that there is something wrong with your, and BayesElo's, statistics.
Would such a test which failed to show the large swings he's been seeing convince Bob that something is wrong with his cluster testing, even if he can't find the cause?
No. First a side-track here.
I am certain that my test platform is far more consistent than anything a normal user can test on. The cluster is "behind" a front-end box that users have to log in to in order to run things. Users then submit "shell scripts" to the scheduling manager and it kicks the scripts off on nodes that are unused. The cluster nodes do not run any typical daemon processes that user systems would use. No http daemon, no email, no crontab, etc... All they do is run whatever is sent to them. They all have identical processors, identical memory sizes, identical disk drives, identical configurations (they are all cloned from a standard kernel setup whenever the cluster is cold-started; the only thing that is different for each node is its name and IP address). We run QAC monthly to pick up on network cards that are going bad, or whatever (we currently have 130 compute nodes, two are down waiting on parts from the last such test). Etc. The platform is as controlled as a machine can be, unless each were in single-user mode with no network connection to the head node, which would be next-to-useless.
Now, if any of those things are an issue, then they are even more of an issue for normal users with machines that are running other things. Windows is a good example. Boot your box, let it get all the way up, then run a network sniffer to watch the network traffic it produces, even when not running any user programs at all and with no one logged in.
So if I have a problem as described above, then by George, everyone else has an even bigger problem, and I don't see why it would be an issue, since we can't eliminate it anywhere.
Whether or not BayesElo is right is something I can't/won't address, since I didn't write it and am not going to go through it with a fine-tooth comb. But since I have seen this variability for a couple of years, I suspect it has been there all along. It's just that nobody has ever had the resources, or taken the time, to make two such runs with the same version. What would be the point, and how long would it take the average person to play that many games?
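For reference, the split-half test hgm describes above is easy to simulate. Here is a rough Python sketch (the 50,000-game pool and its win/draw/loss mix are made up purely for illustration, not taken from any cluster run):

# Split one fixed pool of game results in half repeatedly and compare
# the Elo implied by each half.  Counts below are made-up placeholders.
import math
import random

def elo_from_score(p):
    # Standard logistic Elo formula; clamp to avoid log(0).
    p = min(max(p, 1e-6), 1 - 1e-6)
    return -400.0 * math.log10(1.0 / p - 1.0)

# Pretend pool of 50,000 results scoring 54% overall (illustrative only).
pool = [1.0] * 25000 + [0.5] * 4000 + [0.0] * 21000

for trial in range(10):
    random.shuffle(pool)
    half_a, half_b = pool[:25000], pool[25000:]
    elo_a = elo_from_score(sum(half_a) / len(half_a))
    elo_b = elo_from_score(sum(half_b) / len(half_b))
    print(f"trial {trial}: half A {elo_a:+.1f}, half B {elo_b:+.1f}, "
          f"difference {abs(elo_a - elo_b):.1f} Elo")

Since both halves come from the same fixed set of games, they should normally land within a few Elo of each other; the more interesting question is whether two independent runs of the same size agree that closely, which is what the comparison below is about.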
I'm simply reporting what I found: 4 short tests, and two runs that I would consider "long" by anyone's standards. We normally run 1+1 time controls to make the tests run faster, but for the first test of the Elo output, I stretched the time out a bit to try to reduce the effect of timing randomness, even though it can't be eliminated completely.
What would be more interesting would be for someone else to play a bunch of very fast games (1+1 is OK) and see what their 800 games against 5 opponents look like. Like mine? Or something much more consistent? But nobody is willing to do that, and a couple are only willing to say "no way" and offer nothing concrete in rebuttal.
But running 160 games against each of 5 opponents at 1+1 ought to be doable. A single game would probably take around 3 minutes, so about 20 per hour, or under two days for all 800. Then repeat for another 2 days to have two runs to compare. I could run the same test several times at 1+1 here, and we could compare results. That would answer a lot of questions, because I don't believe I will be the _only_ one with this variability. I could even provide the source for the 5 versions plus current Crafty so that the programs would be the same on both platforms...
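For whoever tries it, here is a rough sketch of how the two 800-game runs could be compared once they finish. The win/draw/loss totals below are placeholders, and the error bar is just the usual normal approximation to the score variance, not the interval BayesElo itself would report:

# Compare two 800-game runs (5 opponents x 160 games each).
# The win/draw/loss counts are placeholders; plug in the real totals.
import math

def summarize(wins, draws, losses):
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Elo difference implied by the score (logistic model).
    p = min(max(score, 1e-6), 1 - 1e-6)
    elo = -400.0 * math.log10(1.0 / p - 1.0)
    # Crude 1-sigma error bar from the per-game result variance.
    var = (wins * (1.0 - score) ** 2 + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / games
    score_err = math.sqrt(var / games)
    elo_err = 400.0 / math.log(10.0) * score_err / (p * (1.0 - p))
    return score, elo, elo_err

run1 = summarize(wins=340, draws=180, losses=280)   # placeholder totals
run2 = summarize(wins=310, draws=190, losses=300)   # placeholder totals

for name, (score, elo, err) in (("run 1", run1), ("run 2", run2)):
    print(f"{name}: score {score:.3f}, roughly {elo:+.0f} Elo +/- {err:.0f}")

With the placeholder numbers above, the 1-sigma error bar comes out around 10 Elo per run, or roughly 15 Elo on the difference between the two runs, which gives a feel for how far apart two "identical" runs could legitimately land.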