OK... for that I ran the following tests:

Tony wrote:
OK, for me the big difference is strange (between fixed nodes and normal mode). So either the result is wrong, or my expectation is.

bob wrote:
I don't know if that could happen when using learning or not. But in this case, it can't possibly happen, for several reasons:

Tony wrote:
Would that mean that an engine that has learning enabled, and (accidentally) gets a fast win, will have a new game faster than the other machines, and with the extra knowledge (learnfile) will have a bigger chance of a new fast win, which will go into the learning file, etc.?

bob wrote:
First, the highest deviation so far was not in the first 4. The next two were worse (14 and 15).

Uri Blass wrote:
bob wrote:
The wild cases don't always come first. They don't usually come last. They just "come", which makes using a small sample unreliable. Nothing more, nothing less.

hgm wrote:
Grand average 0.7, observed SD of the mini-match results 8.1. Prediction according to the 0.8*sqrt(M) rule of thumb: 7.2. Note, however, that the draw percentage is low (18%), so the prefactor really should be sqrt(1-18%) = sqrt(0.82) = 0.9. So the observed SD is smack on what we would expect for independent games. Nothing special in the 13-match result.
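The rule-of-thumb arithmetic above is easy to check. If the mini-match result is measured as wins minus losses over M independent games, its SD is sqrt(M*(1-d)) for draw fraction d; the 0.8 prefactor presumably corresponds to a typical draw rate near 36%, which is an assumption of this sketch, not something stated in the thread.

```python
import math

def match_sd(games, draw_fraction):
    """Expected SD of the score difference (wins minus losses) over a
    match of independent games, with wins and losses equally likely
    among decisive games. Per game the difference is +1, 0, or -1, so
    its variance is the decisive fraction (1 - draw_fraction)."""
    return math.sqrt(games * (1.0 - draw_fraction))

# The 0.8*sqrt(M) rule of thumb matches a ~36% draw rate:
print(round(match_sd(80, 0.36), 1))  # ~7.2 for an 80-game mini-match
# With the observed 18% draw rate the prediction rises to ~8.1,
# right on the observed SD:
print(round(match_sd(80, 0.18), 1))
```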
None of the individual deviations is exceptional: the largest, -17.7, corresponds to 2.18*sigma, which (two-sided) should have a frequency of 2.8%. In 13 samples it should thus occur on average 13*2.8% = 0.364 times. The probability that it occurs exactly once is thus 0.364/1! * exp(-0.364) = 25.3%. This is about as high as it gets (one seldom sees things that have 100% probability, as that would mean there could be no variation at all in what you see). So not very remarkable.
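That Poisson bookkeeping can be redone from the raw figures in a few lines. The sketch below (all input numbers are from the post) lands at roughly 2.9% and 26% rather than the quoted 2.8% and 25.3%; the small differences are just rounding along the way.

```python
import math

sd = 8.1                 # observed SD of the mini-match results
deviation = 17.7         # largest absolute deviation seen
z = deviation / sd       # ~2.18 sigma
# Two-sided tail probability of a normal deviate beyond z:
p = math.erfc(z / math.sqrt(2))   # ~0.029
expected = 13 * p                 # expected count in 13 samples
# Poisson probability of seeing exactly one such deviation:
p_one = expected * math.exp(-expected)   # ~0.26
print(round(z, 2), round(p * 100, 1), round(p_one * 100, 1))
```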
If one wanted to see anything unlikely in this data, it would be that the 4 largest deviations, although not very exceptional in themselves, all occur in the first 4 samples. A single observation of this is not significant enough to draw any conclusions from, though.
Now if you had several such data sets, and they all showed significantly larger deviations in the first ~4 mini-matches, or at least passed some objective test showing that these early variances are significantly larger on average than the others, it would be very interesting. Because the first mini-matches can only be different from the later ones if they somehow _know_ that they belong to the first four. And there should be no way they could know that, as that would violate the independence of the games.
So if such above-average variances were to occur systematically in the first few mini-matches, it would point to some defect in the setup subverting the independence. My guess would still be that this is merely a coincidence, and that other such data sets would show the largest deviations randomly scattered over the data set, with no preference for particular mini-matches.
Or, in summary, nothing remarkable in this data; variances exactly as expected. But watch those early variances.
The fact that the 4 largest deviations all occur in samples X, X+1, X+2, X+3 causes me to suspect that something is wrong in your testing, even setting aside that X=1 in your case.
I do not know what is wrong.
Maybe there is something you did not think about that causes correlation between the results of the matches.
One possibility I thought of, besides one program being slowed down by a significant factor, is that for some reason you do not use the same exe in different samples, or you use the same exe but with different parameters, so that the exe shows different behaviour in different matches.
Uri
Second, you can think something is wrong all you want. The worst runs are not always first. The big set I published last week had the biggest variances farther down.
There is absolutely no difference in how things are executed. Each program is started exactly the same way each time, exactly the same options, exactly the same hash, nobody else running on that processor. And I monitor the NPS for each and every game inside Crafty.
Unfortunately the "other programs" can't produce logs, because they just use one log file, as opposed to Crafty, which can log each separate game that is played.
But forget about the idea of things changing from game to game. I have one giant command file, produced automatically, one line per game, and I fire these off to individual computers one at a time until all machines are busy, and then as one completes I send that node the next game after it re-initializes...
You have to take this data for what it is, significant random variance in some matches, apparent stability in others. For the matches that produce more stable results, that is only because I am using the total score for the 80 games. If you were to look at the individual game strings, you would find lots of variance even though the final scores are nearly the same. So the variance is still there, it is just not expressed in the final result...
Whereas if there is no accidental fast win, this won't happen?
Tony
(1) learning is explicitly disabled (position learning is all that could be used, since there are no opening books used at all).
(2) no files are carried over between individual games. Each game is played in a completely "sterile" environment.
(3) to further rule that out, I have played these same tests using a fixed number of nodes for both opponents, and there the games are identical run after run. Even with fixed node counts, position learning would still change scores and moves, particularly for the side that lost. But with this kind of test, 64 matches produce 64 sets of 80 games that match perfectly, move for move. The only variance is very slight time differences (one move takes 5.21 seconds, in the next match it might take 5.23 or 5.19), but since time is not used to end searches in that test, it made no difference at all.
OTOH
Flipping a coin should give 50%, with expected deviations etc.
Assuming the result of the coin flip depends on the force and angle with which I flip it, fixing that force and angle would make the experiment repeatable.
But would that improve the measurement? There would be a big difference from the non-fixed flips. But how do I know my fixed variables aren't biased?
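The coin-flip worry can be made concrete with a toy simulation: random flips give an unbiased estimate with sampling noise, while a "fixed force and angle" flip is perfectly repeatable but simply returns whatever bias the fixed conditions happen to have. The seed and the always-heads outcome below are illustrative assumptions, not anything from the thread.

```python
import random

random.seed(42)  # assumed seed, only so the illustration is reproducible

def random_flips(n, p_heads=0.5):
    """Heads fraction from n independent flips: unbiased, but with
    sampling noise of about 0.5/sqrt(n)."""
    return sum(random.random() < p_heads for _ in range(n)) / n

def fixed_flip(n, outcome=1):
    """Flipping with fixed force and angle: the same outcome every
    time. Zero variance across repeats, but the answer is whatever
    bias the fixed conditions have -- here, always heads."""
    return sum(outcome for _ in range(n)) / n

print(random_flips(1000))  # close to 0.5, with sampling noise
print(fixed_flip(1000))    # exactly 1.0: repeatable, but biased
```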
Did you find a noticeable difference between fixed-node matches and non-fixed ones? (results or std)
If the results are "equal" but the std is lower, fixed nodes could be an alternative (to the lengthy matches)
Tony
1. Crafty version A, to a fixed number of nodes, against a set of opponents searching the same number of nodes. Results were 100% reproducible, as expected, since there are no random evaluation terms or random search extensions in any of the programs used. This gave me a result B.
2. Crafty version A', to fixed number of nodes also, against the same opponents. A' had a couple of minor eval changes that over a very large sample showed no significant difference from version A. But for the 80 game match, the results were _way_ different. I looked at the games and in each case, Crafty varied in several places, primarily due to the different shape of the tree with the slightly different evaluation.
3. Crafty version A vs the opponents to a fixed number of nodes, then Crafty version A vs the same opponents to the same fixed number of nodes, except Crafty searched 10K more nodes than in the previous test (I was doing 10M-node tests, so 10K is a tiny change). The results were way different.
What that taught me was that there is so much variability, it is necessary to (a) run a large number of games and (b) if I want to test with fixed nodes, use a bunch of different node limits to make sure my sample cross-section is broad enough to get a representative collection of games.
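Point (b) can be sketched as a simple job schedule: instead of one fixed node count, run the same reproducible fixed-node mini-match at several nearby limits so the sampled game trees differ. Every name and number here (the base count, the offsets, the opponent labels) is a hypothetical placeholder, not the actual setup described in the thread.

```python
# Hypothetical scheduling sketch for fixed-node testing at several limits.
BASE_NODES = 10_000_000
OFFSETS = [-20_000, -10_000, 0, 10_000, 20_000]  # assumed spread

def node_limits():
    """Node limits clustered around the base count; each limit yields a
    different (but individually reproducible) set of games."""
    return [BASE_NODES + off for off in OFFSETS]

def schedule(opponents):
    """One (engine_nodes, opponent) job per combination; each job is an
    independent fixed-node mini-match."""
    return [(n, opp) for n in node_limits() for opp in opponents]

jobs = schedule(["opp_a", "opp_b"])
print(len(jobs))  # 5 node limits x 2 opponents = 10 mini-matches
```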
After a lot of fiddling around, it became pretty apparent that the best solution was to just use time, let the results vary, and run enough samples to smooth the variance out... which is what I am doing.