bob wrote:
In any case, 100 games is not enough if you play the same 100-game match twice and get a different indication each time. For example, you play 2 matches before a change and get 50-50 and 45-55. You play 2 matches after the change and get 30-70 and 70-30.
Well, for that particular set of data (intuitively, without having measured it), you wouldn't be able to draw any conclusion.
Actually, to me this particular set would indicate that I haven't run enough tests before the change. If you are seeing that kind of behaviour, you would be well-advised to run more tests. I never claimed that your observations are wrong for your particular situation. I only questioned whether your situation can necessarily be applied to mine, or perhaps even to others (although that part I'm not all that concerned about), which you seem to be claiming (again, correct me if I'm wrong).
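Just to make "that kind of behaviour" concrete, here is a small simulation sketch of how far the "same" 100-game match can wander on pure chance alone. It assumes each game is an independent win or loss at a fixed, purely hypothetical true score and ignores draws, which would only shrink the spread.

[code]
# Rough sketch: repeat the "same" 100-game match several times with an
# assumed, fixed true score and watch how far the results wander.
# Draws are ignored for simplicity; they would only shrink the spread.
import random

TRUE_SCORE = 0.50   # assumed underlying strength -- purely hypothetical
GAMES = 100
REPEATS = 10

random.seed(1)
for i in range(REPEATS):
    wins = sum(random.random() < TRUE_SCORE for _ in range(GAMES))
    print(f"match {i + 1}: {wins}-{GAMES - wins}")

# One standard deviation of the score is sqrt(p*(1-p)/N), i.e. about
# 5 points per 100 games at p = 0.5, so swings into the low 40s or
# high 50s mean nothing by themselves.
[/code]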
[Incidentally, because the debate is getting a little heated, please let me assure you that if I ever slip and get personal or anything, it is not intentional. With some of my posts I'm not so sure whether they could be misread in that way].
And 90-10 would also be enough.
I keep saying this over and over, but you seem to simply ignore it.
I seem to be having that same problem. I have not seen you quote a single set of 100-game matches and give the results of each.
This seems a little unfair, as I don't have your resources. I don't have a cluster or anything. My laptop is running a gauntlet against >50 other engines day and night (work on the engine itself is on hold for now). I am currently in round 4 of the gauntlet, at game 193 out of 530. I will post the intermediate results of two matches once I have them.
Also, please don't compare your 100 games to my 100 games. You could choose a meaningful starting position out of your 40, and then the results would be more comparable. Or you could just run the games from the normal starting position, with own books enabled. I know that is not the way you normally test, but given that you have much more horsepower available, perhaps you could spare a few cycles on this particular analysis, just as I am sparing 100% of my cycles to do something I normally don't.
I did that. My results showed that just running 100 games, which could produce any one of those results I posted (or many other possible results as well), could lead you to the wrong conclusion.
Where is your data to show that 100 games is enough? It is easy enough to run my test and post the results to prove they are stable enough that 100 is enough.
There is an accepted way of dealing with limited sample sizes, but you're just saying that 100 games can never be enough.
Yes I am. I posted samples where A beat B and lost to B, yet A is provably stronger than B. I posted sample games where A beat B and lost to B where they are equal. And I posted games where A beat B and lost to B and A was far stronger than B.
Please provide some data to contradict mine. And don't just rely on your program if it is pretty deterministic. Try others as I did, to convince yourself things are not as stable as you think, which means the number of games needed (N) is much larger than you think.
But I have never made any claims about other programs. I have only ever been talking about my program, and asking for advice on how to handle its particular situation. I am not looking for the underlying theoretical variance.
And I am saying that 100 games can be enough, and statistics provides methods to find out exactly when that is the case. And please don't give me that "95% confidence means 1 out of 20 is wrong". Okay, so what confidence level do you require? 99%? Wow, you'll be wrong 1 time out of 100.
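For what it's worth, here is a minimal sketch of the sort of textbook check I have in mind: a normal approximation to the binomial, asking whether an observed score is distinguishable from 50% at a chosen confidence level. The function name, the z threshold and the example results are only illustrative.

[code]
# Minimal sketch: is an observed match score distinguishable from 50%
# at a chosen confidence level?  Normal approximation to the binomial;
# draws are counted as half a point, which is only approximate.
from math import sqrt

def significantly_unequal(wins, draws, losses, z=1.96):  # z=1.96 ~ 95% two-sided
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n        # observed score fraction
    stderr = sqrt(score * (1 - score) / n)  # rough standard error of the score
    return abs(score - 0.5) > z * stderr

print(significantly_unequal(60, 0, 40))  # 60-40 over 100 games: True (barely)
print(significantly_unequal(55, 0, 45))  # 55-45 over 100 games: False
[/code]

(By this yardstick a 90-10 result is overwhelming, while something like 55-45 over 100 games settles nothing.)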
With 95% you will be wrong on one of every 20 changes. Too high. Being wrong once can negate 5 previous rights...
To me, this whole discussion seems to be: you were surprised that the variability was much higher than you expected, and now you're on a crusade to tell everybody "100 games is not enough" without qualification (by qualification I mean the situation under which it applies; I am not questioning your credentials). And it is this missing qualification that concerns me.
[Well, let me tell you: 20000 games is not enough. I changed one line in the code and so far I haven't noticed any significant results. What do you have to say about that?]
20000 might not be enough. However, I have no idea what you mean by "without qualification". I believe, if you re-read my post, I qualified what I found very precisely. I tried games from a pool of 6 opponents: two significantly better than the rest, two pretty even in the middle of the pack, and two that were much worse. No matter who played who, variance was unbearable over 80-160-320 games. Yes, you can get a quick idea of whether A is better than B or not with fewer games. But I am not doing that. I want to know if A' is better than A, where the difference is not that large. Somehow the discussion keeps drifting away from that point.
Well, the heading of _this_ thread is "...for the rest of us", which means that I am explicitly taking Crafty out of the picture, because from my point of view Crafty's situation is special (just as, probably from your point of view, Eden's situation is special). Now, I never claimed that any of the conclusions you made about Crafty is wrong. What I did claim was that your results do not necessarily apply to "the rest of us", which it seemed to me you have been implying (though I may of course have been wrong) when you say things like "100 games is not enough".
Without qualification means "100 games is not enough".
With qualification would mean "100 games is not enough for me to prove that Crafty version x.1 is stronger than x.2".
If you want to discuss something else, fine. But _my_ discussion was about this specific point alone. Nothing more, nothing less. Somehow the subject keeps getting shifted around however.
Yes, and I am happy that you're still listening and that we are starting to clear up some of our misunderstandings.
Back to the main idea, once again. I have 6 programs that I play against each other, one of which is mine. Generally I only play mine against 4 of that set of 5 (I weed out one that is significantly worse, as one of those is enough). I want to run Crafty against the set of 4 opponents, then run Crafty' against the same set, and then be able to conclude with a high degree of accuracy whether Crafty or Crafty' is better. Nothing more, nothing less. 80 or 160 or 320 games is absolutely worthless to make that determination. I have the data to prove this convincingly.
I have said elsewhere that, intuitively (based on just 4 samples) and without having done the maths, your result is most likely right. I never said it wasn't.
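Doing that maths roughly, for the record: the sketch below uses the standard logistic score-to-Elo conversion and a worst-case score variance of 0.25 (no draws), so real numbers would look somewhat better. It is only meant to show how the resolvable difference shrinks with the square root of the number of games.

[code]
# Back-of-the-envelope: roughly how large an Elo difference a match of N
# games can resolve at ~95% confidence.  Worst-case score variance 0.25
# (i.e. no draws); the usual logistic score-to-Elo conversion.
from math import sqrt, log10

def resolvable_elo(games, z=1.96):
    half_width = z * sqrt(0.25 / games)    # ~95% half-width of the score
    return -400.0 * log10(1.0 / (0.5 + half_width) - 1.0)

for n in (80, 160, 320, 20000):
    print(f"{n:6d} games: differences below ~{resolvable_elo(n):.0f} Elo are invisible")
[/code]

If those rough figures are anywhere close, a few hundred games indeed cannot separate two versions that differ by only a little, while 20K games resolves down to a handful of Elo.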
Any other discussion is not about my premise at all, and I don't care about it. I don't care which of my pool of programs is stronger, unless you count Crafty and Crafty'. I don't care about really large N, as it becomes intractable. I don't care about really small N (N <= 100), as the data is worthless. So I am somewhere in the middle, currently using 20K games, 5K against each of 4 opponents, and I still see a small bit of variance there, as expected. But not a lot.
So can we discuss _that_ issue alone? Comparing an old version to a new version to decide whether the new version is stronger. And get off all the tangential issues that keep creeping in?
But for that issue, in the context of Crafty, there is no discussion. I am saying that you are probably right for that context. I have been trying to talk about something else.
And while I would prefer if you could help with my particular issue, I will accept it if it doesn't interest you.
I have stated my case concisely above. I have offered data to support it. Feel free to offer data that refutes it, and to have that scrutinized as I did my own when I started this testing last year...
My claim so far regarding Crafty is that my engine is seeing lower variance in its results than Crafty is. I am working on the data to test this hypothesis. Unfortunately it is currently infeasible for me to do the exact same test that you have done, although I can find another computer and leave it running for a week or so.
So, for now I am only doing that gauntlet against >50 opponents. I will do that 40-position test at some stage; I just can't do it right now. Perhaps someone else could chip in? In addition, perhaps you would be willing to do just one small test where the conditions are closer to what I'm doing right now.
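In case it helps, this is the shape of the comparison I have in mind for that variance claim. Every score in the sketch is a placeholder, not a measurement.

[code]
# Sketch of the planned comparison: repeat the same fixed-size match a
# few times per engine, then compare the spread of the scores.
# ALL score values below are placeholders for illustration, not data.
from statistics import mean, stdev

repeated_match_scores = {
    "Eden":   [0.48, 0.51, 0.50, 0.49, 0.52],   # hypothetical
    "Crafty": [0.42, 0.55, 0.47, 0.58, 0.44],   # hypothetical
}

for engine, scores in repeated_match_scores.items():
    print(f"{engine:7s} mean={mean(scores):.3f}  stdev={stdev(scores):.3f}")

# With only a handful of repeats the stdev estimates are themselves very
# noisy, so a proper comparison (e.g. an F-test on the two sample
# variances) would need a fair number of repeated matches per engine.
[/code]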