But then Karl showed up, and he was only interested in the difference from the truth, and did not want to offer anything on the original problem. So then playing more positions suddenly became the hype of the day...
Not IMHO. He was interested, specifically, in first explaining the "six sigma event" that happened on back-to-back runs, and then suggesting a solution that would prevent this from happening in the future.
Well, then he apparently failed, as he offered no explanation for the 6-sigma event whatsoever.
Semantics. He clearly explained _how_ it could happen, and how a small number of positions made it more likely that it would happen. Nobody has attempted to explain _why_ it happened. Obviously something is going on that is unexpected. I can think of several potential "whys", but verifying them would be difficult.

For example, take the case of time jitter, and for the sake of argument assume that once you start the program, the jitter is constant. If you sample the clock just before the time flips over to the next unit, your next sample comes after the flip, and you think a complete unit elapsed when it didn't. Suppose that this repeats for the course of the game, so that all searches get bitten by that same "short time" measurement. And once this is set up, it continues for several games, so that the games are almost identical. Until "wham", something happens, and now we start sampling right after the time jump and can search longer before the next unit elapses. On these searches we win more than we lose, where on the last set we lost more than we won. Now there are two distinct runs with correlated results, but the results are opposite, and the error bars are far from overlapping.

What happened between the two runs? Who knows. Perhaps nothing other than the delay in starting the second run. Did this happen? No idea. The times I gather for the PGN files won't show this, so it will be a pain to "pin the tail on the donkey" should that be it.
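To make that mechanism concrete, here is a toy sketch of the sampling-phase effect. This is purely my own illustration; the coarse_clock function and the numbers are hypothetical, not anything from an actual engine:

import math

# Toy model of a clock that only reports whole elapsed "units"
# (think of a timer with coarse granularity).
def coarse_clock(true_time):
    return math.floor(true_time)

# Units the program *thinks* elapsed for a search of true_duration,
# when its first sample lands at start_phase within a clock unit.
def measured_elapsed(start_phase, true_duration):
    return coarse_clock(start_phase + true_duration) - coarse_clock(start_phase)

# A search that really takes half a unit, sampled just before the flip:
# the program believes a whole unit elapsed and cuts the search short.
print(measured_elapsed(0.95, 0.5))   # prints 1

# The same search sampled just after the flip appears to take no time,
# so the program searches longer before the next unit elapses.
print(measured_elapsed(0.05, 0.5))   # prints 0

If the jitter pins the sampling phase near one side of the flip for a whole run, every search in that run gets the same bias, which is exactly the correlated behavior described above.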
And I don't know that that is what is happening, only that the small number of positions exhibits correlation across multiple games, and one batch can say "good" while the next batch says "bad". Are there other scenarios? Possibly. But no matter what the issue is, _everybody_ has it. There is a risk with the 4K positions that many of them could be quite similar and exhibit correlation, and they could be sensitive to time jitter as well. It would just be less probable with a small number of games over a large number of positions than with the inverse.
But, elaborating on (3), the fact remains that:
3a) Playing from more positions can only help to get the results closer to the truth; it does nothing for their variability.
Sorry, but Karl did _not_ say that.
Then he was wrong, as it is a mathematical fact. But it is only your claim that he was wrong; in fact he did admit to this, and thus was right.
Let's be precise: N games played over 40 positions, vs N games played over 4,000 positions. I believe Karl said the latter will have _less_ variability than the former, because the correlation effect is effectively removed, so that whatever is causing the strange results with respect to timing jitter will not be a factor.
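For what it's worth, the textbook cluster-sampling identity (my addition here, not something either of us posted) quantifies both claims at once. If the N games consist of blocks of k playouts with within-block correlation \(\rho\), then

\[
\operatorname{Var}(\bar{X}) \;=\; \frac{\sigma^2}{N}\,\bigl(1 + (k-1)\rho\bigr)
\]

With 64 repetitions per position-opponent-color combination and \(\rho\) near 1, the standard deviation is inflated by nearly \(\sqrt{64} = 8\) over the naive i.i.d. value; with 4,000 unique positions, k drops toward 1 and the inflation disappears. So more positions leaves the naive formula alone, but it removes the correlation term that is actually driving the variability.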
Let me quote the two parts of the complete text I had posted previously, to show what _I_ am reading. You can then respond where you think that is wrong:
=========================================================
Let's continue our trials:
Trial E: Same as Trial C, but instead of limiting by nodes, we limit by
time.
Trial F: Same as Trial E, including the same time control, except that the
room temperature is two degrees cooler, and because of that or some other
factor, the engines are able to search at 1.001 times the speed they were
searching before.
In these last two trials, the 64 repetitions of each position-opponent-color
combination will not necessarily be identical. Minuscule time variations
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.
When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather affected
each block of 64 playouts in coordinated fashion.
==========================================================
OK, that is quote number one. It addresses the correlation issue and suggests that some sort of glitch in the time jitter might be a cause of the problem (but note he does not claim that is what happened, because we could not verify it). Any argument with my interpretation of the above???
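To see the size of the effect he is describing, here is a small simulation. It is entirely my own illustration, with a made-up correlation value, using his 64 repetitions over 400 position-opponent-color combinations (25,600 games):

import random, statistics

# One complete 25,600-game run: 400 combos x 64 playouts, where the
# 64 playouts of a combo share a common bias (high correlation).
def run(n_combos=400, reps=64, rho=0.95):
    scores = []
    for _ in range(n_combos):
        combo_bias = random.gauss(0.0, 1.0)    # shared across the block
        for _ in range(reps):
            noise = random.gauss(0.0, 1.0)     # independent per playout
            scores.append((rho ** 0.5) * combo_bias +
                          ((1 - rho) ** 0.5) * noise)
    return statistics.fmean(scores)

random.seed(1)
run_means = [run() for _ in range(200)]
observed_sd = statistics.stdev(run_means)
naive_sd = 1.0 / (400 * 64) ** 0.5    # what the i.i.d. formula predicts
print(f"observed SD of run means: {observed_sd:.5f}")
print(f"naive i.i.d. prediction:  {naive_sd:.5f}")
print(f"understated by a factor of {observed_sd / naive_sd:.1f}")

With rho = 0.95 the understatement comes out near 7.8, which matches his "not a full factor of 8, but almost that much".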
=========================================================
Now let me pass from trying to give a plausible explanation of your posted
results to trying to solve the practical problem of detecting whether a code
change makes an engine stronger or weaker. I am entirely persuaded of your
opening thesis, namely that our testing appears to find significance where
there is none. We think we see a trend or pattern when it is only random
fluctuation. We need to re-examine our methodology and assumptions so that
we don't jump to conclusions too quickly.
The bugbear is correlation. We are wasting our time if we run sets of trials
that _tend_ to have the same result, even if they don't always have the same
result. Yes, we want code A and code A' to run against exactly the same test
suite, but we don't want code A to run against the same test position more
than once.
The bedrock of the test suite is a good selection of positions. If the
positions are representative of actual game situations, then they will give
us information about how the engine will perform in the wild. They can't be
too heavy on any particular strategic theme that would bias the test results
and induce us to over-fit the engine to do well on that one strategy.
Assuming you have a good way to choose test positions, I think it is a
mistake to re-use them in any way, because that creates correlations. If A'
as white can outplay A as white from a certain position, then probably A' as
black can outplay A as black from the same position. The same strategic
understanding will apply. Re-running the same test with different colors is
not giving us independent information, it is giving us information correlated
to what we already know. Similarly it is a mistake to re-use a position
against different opponents. If A' can play the position better than A
against Fruit, then A' can probably play the position better than A against
Glaurung. The correlation won't be perfect, but neither will the tests be
independent.
In other words, I am saying that if you want to run 25,600 playouts, then you
should have a set of 25,600 unique starting positions that are representative
of the positions you want Crafty to do well on. If you want to remove color
bias, good, have Crafty play white in the even-numbered positions and black
in the odd-numbered positions, but don't re-use positions. If you want to
avoid tuning for a specific opponent, good, have Crafty play against Fruit in
positions numbered 1 mod 5, against Glaurung in positions numbered 2 mod 5,
etc., but don't re-use positions. Come to think of it, re-using opponents
creates a different source of correlation that also minimizes the usefulness
of your results. One hundred opponents will be better than five, and ideally
you wouldn't re-use anything at all. If nothing else, vary the opponents by
making them marginally stronger or weaker via the time control to kill some
of the correlation.
=========================================================
I interpret that to say that the correlation is what is causing the variability, and that by using enough positions that none of them is repeated, we can reduce this effect or make it go away entirely. The suggestion about playing them with different time handicaps was another thing I had not thought about, and it would be easy enough to do as well, to make it appear as if there are more opponents than just 5.
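His schedule is simple enough to sketch. The following is my illustration only; the opponent names beyond Fruit and Glaurung, and the handicap values, are placeholders:

import itertools

OPPONENTS = ["Fruit", "Glaurung", "Opponent3", "Opponent4", "Opponent5"]

def schedule(n_positions=25600):
    """Yield (position, Crafty's color, opponent, time factor):
    every position used exactly once, colors alternating by parity,
    opponents rotating by position number mod 5, and a small time
    handicap so each pairing acts like a slightly different opponent."""
    for i in range(n_positions):
        color = "white" if i % 2 == 0 else "black"
        opponent = OPPONENTS[i % len(OPPONENTS)]
        # cycle through five small handicap levels: 0.98 .. 1.02
        time_factor = 1.0 + 0.01 * ((i // len(OPPONENTS)) % 5 - 2)
        yield i, color, opponent, time_factor

for game in itertools.islice(schedule(), 5):
    print(game)

No position is ever reused across colors or opponents, which is the whole point of his suggestion.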
But in any case, please explain where I am "missing his point" or "putting words into his mouth", because if I am doing so, it is not intentional.