bob wrote: Useful -> something that produces information that is of some use. In this case, a random set of 80 games chosen from a huge population of potential games simply doesn't provide any useful information. See the 5 matches I posted. Pick any one of the 5 and you have a 3/5 chance of being wrong (I happen to know which is actually better based on 10,000 games in this test, as an example). So with 2/5 being opposite of reality, and 1/5 indicating equality, which is wrong, what possible use is that data?
Well, the number 80 is entirely your fabrication. I never mentioned any number, and everything I said so far applies just as well to 80 million games. The information produced in 80 games is useful, as you first have to play 80 games before you can have played 80 million. If one threw away the results of each batch of 80 games because they were 'useless', one would never get a result on a larger number of games. Each game is exactly equally useful, and contains as much information as any other game. How large a batch it belonged to has no effect on this.
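The accumulation argument can be made quantitative: the confidence interval on a match score shrinks as 1/sqrt(n), so every batch of 80 games tightens the estimate by the same mechanism. A minimal sketch (the win/draw/loss counts below are invented purely for illustration):

```python
import math

def score_ci(wins, draws, losses, z=1.96):
    # Normal-approximation 95% confidence interval for the match score,
    # scoring win=1, draw=0.5, loss=0.
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the score around the mean s.
    var = (wins * (1 - s)**2 + draws * (0.5 - s)**2 + losses * (0 - s)**2) / n
    half = z * math.sqrt(var / n)
    return s - half, s + half

print(score_ci(30, 30, 20))        # 80 games: a wide interval
print(score_ci(3000, 3000, 2000))  # 8000 games, same proportions: 10x narrower
```

The interval from 80 games is wide, but it is the same kind of information as from 8000 games, just less of it.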
Without serving as a "spoiler" I suggest you check that yourself. I suspect you will be hugely surprised. The same two programs won't play the same moves every time given the same starting position, much less two different versions of the same program. And the variance is absolutely huge...
I experience exactly the opposite problem. When testing uMax I could not play more than 2 games against most opponents (one with white and one with black), or the same 2 games would be exactly repeated over and over. And of the opponents that randomized their opening play, most were not suitable for automated testing, as they crashed and hung the system. I finally solved it by playing Nunn matches, forcing the dozen or so opponents that would not hang the system to play 20 different games each.
I would not bet on that. Just turn on 4 piece endgame tables and look at how many pieces are on the board when you get your first tablebase hit. I have seen a hit with only 8 pieces removed here and there. And I see hits in significant numbers when 1/2 of the initial 32 pieces are gone. And many games reach that point.
That you probe it is not enough to affect the tree. The coarse evaluation of any KBNK position is +6, and before any of those comes within the search window, the game must long since have been decided. The subtle positional differences between having the bare King in a white or black corner will never change the nature of the fail until the root score gets very close to +6 or -6. And if that happens in the opening, a revision of your book is in order...
Perhaps 0.1% of the games would end that way, so 99.9% of the games would be identical for both versions. The only games that differed would be the 0.1% ending in KBNK, where the improved version would now win them all, while the old version would bungle about half of them.
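For scale, the rating impact implied by those numbers can be worked out. This is a sketch: the 0.1% figure is taken from the paragraph above, and the standard logistic Elo model is assumed:

```python
import math

def elo_from_score(s):
    # Invert the logistic Elo model: expected score (0..1) -> rating difference.
    return -400 * math.log10(1 / s - 1)

p_kbnk = 0.001       # fraction of games ending in KBNK (from the post)
gain = p_kbnk * 0.5  # old version scored ~50% on those; new version wins them all
print(elo_from_score(0.5 + gain))  # roughly a third of an Elo point
```

So the fix is worth on the order of 0.35 Elo, far below what a short match can resolve.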
I disagree with that so strongly that "strongly" is not in the right ballpark. I'd challenge you to take the same two programs, same starting point, and play games until you get at least two that match move for move. Plan on taking a while to do that.
Like I said before, I already had to go to substantial trouble NOT to get that. If uMax (and many of the opponents I tried for it) did not repeat the game move for move, I should probably send my computer back for repair under guarantee, for there must be a hardware error. It is not normal for a computer program with a precisely specified algorithm to give a different result every time you run it; computer languages are defined such that the result of any statement is uniquely determined, leaving no room for nondeterministic outcomes.
That is simply statistically unsound. You have a huge population of games that you can potentially play. You artificially capture a sub-set of that population by fixing the nodes (or depth, or anything else that produces identical games each time given the same starting position). But what says that the random subset of games you have limited yourself to is a reasonable representation of the total game population? I've been testing this very hypothesis during the past couple of months, and believe me, it is just as error-prone as using elapsed time, because you are just picking a random subset that is a tiny slice of the overall population.
Are you now saying that testing is never any good, because even if you would do billions times billions of independent games, the game tree of chess is so large that whatever you tried would always be just a tiny slice of all possible games the program could be made to play by varying its number of nodes or the number of seconds it could search? Well, in standard statistical theory it is only the size of the sample that determines the reliability of the result, not the size of the population the sample was drawn from.
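That last claim is easy to check numerically: the standard error of a sample proportion depends on the sample size n, and the finite-population correction becomes negligible as soon as the population dwarfs the sample. A sketch (the function name is mine):

```python
import math

def std_error(p, n, N=None):
    # Standard error of a sample proportion; optional finite-population
    # correction for a population of size N.
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Same sample size, populations differing by a factor of a million:
print(std_error(0.5, 1000, N=10**6))   # ~0.0158
print(std_error(0.5, 1000, N=10**12))  # ~0.0158, essentially identical
```

Whether the sample of 1000 is drawn from a million games or a trillion, the uncertainty is the same to four decimal places.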
Anyway, we seem to be making progress, as now at least fixing the number of nodes is 'just as error-prone' as fixing the number of seconds, instead of 'totally useless'.
This is absolutely standard statistical procedure.
Not quite. No one takes a tiny sample of a huge population and draws any conclusion from that tiny sample, when it is important. Larger sample sizes, or more sets of samples are required...
Elections are the only counterexample I know of. Everyone uses sampling when confronted with a huge or infinite data set. How big the sample has to be to draw conclusions with a certain reliability from it can be calculated, and can be strongly dependent on the method of sampling (e.g. stratified sampling).
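For a simple random sample, that calculation follows directly from the normal approximation. A sketch (p = 0.5 is the conservative worst case for the per-game variance):

```python
import math

def required_games(margin, p=0.5, z=1.96):
    # Games needed so that the 95% confidence half-width on the
    # measured score is at most `margin`.
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(required_games(0.05))  # ~385 games for +/- 5% on the score
print(required_games(0.01))  # ~9604 games for +/- 1%
```

Halving the margin of error quadruples the required number of games, which is exactly why both sides in this thread end up arguing about tens of thousands of games.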