To be a reasonable test, I want to play enough games to give the new change a chance to be used in many types of positions: tactical attacks, positional games, simple endgames, complex endgames. Otherwise I can't be sure whether a simple "outside passed pawn" term is good or bad. I have seen such a term work very well in some positions, and cause the program to trade B for 3 pawns to create a distant passed pawn, only to lose because the opponent has an extra piece. If you test a change only in positions where the change is obviously important, you are overlooking a critical part of the testing methodology you need. Hence my decision to use Albert's 40 positions, even though I know that Crafty will likely not play into some of those positions. But the positions that do occur are still important with respect to understanding how your evaluation performs over a large cross-section of potential positions you might see in tournaments.

nczempin wrote:
I feel like I am stuck in some endless loop in an episode of either "The Twilight Zone" or "Groundhog Day."

bob wrote:
In any case, 100 games is not enough if you play the same 100-game match 2 times and get a different indication each time. For example, you play 2 matches before a change and get 50-50 and 45-65. You play 2 matches after the change and get 30-70 and 70-30.
Well, for that particular set of data (intuitively, without having measured it), you wouldn't be able to draw any conclusion.
Actually, to me this particular set would indicate that I haven't run enough tests before the change.
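For a rough sense of how far two identical engines can drift apart over 100 games by chance alone, here is a minimal simulation sketch; the per-game win and draw probabilities are assumptions, not measurements from any of the programs discussed here:

    # Minimal sketch: repeat a 100-game match between two equally strong
    # engines and see how far the score drifts by chance alone.
    # p_win/p_draw are assumed per-game probabilities, not measured values.
    import random

    def match_score(games=100, p_win=0.4, p_draw=0.2):
        score = 0.0
        for _ in range(games):
            r = random.random()
            if r < p_win:
                score += 1.0          # win
            elif r < p_win + p_draw:
                score += 0.5          # draw
        return score

    print(sorted(match_score() for _ in range(20)))
    # typically spans something like 43-57 out of 100, with no change at all

Under those assumptions, a single 100-game score of 45 or 55 says very little about a real strength difference.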
If you don't get those kinds of results, something is wrong with your testing. Or else Fruit, Glaurung 1/2, Crafty, Arasan, GnuchessX (and even a few others I won't mention) have serious problems, because _they_ can't produce deterministic results for the 40-position / 80-game matches I am running. So the point is, those _are_ the kinds of results at least most of us are seeing. Whether you or HGM do or not, I have no idea, and do not care. I am reporting what I see with a broad group of programs, not just my own.
Again, just run the test and report back, rather than continuing the discussion forever. Let's see what kind of variability you get over an 80 game match played at least 2-3 times...
This seems a little unfair, as I don't have your resources. I don't have a cluster or anything. My laptop is running a gauntlet against >50 other engines day and night (working on the engine is on hold for now). I am currently in round 4 of the gauntlet, at game 193 out of 530. I will post the intermediate results of two matches once I have them.

If you are seeing that kind of behaviour, you would be well-advised to run more tests. I never claimed that your observations are wrong for your particular situation. I only questioned whether your situation can necessarily be applied to mine, or perhaps even to others (although that part I'm not all that concerned about), which you seem to be claiming; again, correct me if I'm wrong.
[Incidentally, because the debate is getting a little heated, please let me assure you that if I ever slip and get personal or anything, it is not intentional. With some of my posts I'm not so sure whether they might be misunderstood in that way.]
I seem to be having that same problem. I have not seen you quote a single set of 100 game matches and give the results of each.
And 90-10 would also be enough.
I keep saying this all over, but you seem to simply ignore it.
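A quick back-of-the-envelope check, using the most pessimistic per-game variance and a purely hypothetical 90-10 score, shows why a result that lopsided is decisive even at only 100 games:

    # Hypothetical 90-10 result: how many standard errors above an even score?
    import math
    p, n = 0.90, 100                 # assumed 90-10 score over 100 games
    se = math.sqrt(0.25 / n)         # worst-case standard error of the mean score
    print((p - 0.5) / se)            # about 8 standard errors above an even score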
Also, please don't compare your 100 games to my 100 games. You could choose a meaningful starting position out of your 40, and then the results would be more comparable. Or you could just run them from the starting position, with own books enabled. I know that that is not the way you normally test, but given the fact that you have much more horsepower available, perhaps you could spare a few cycles on this particular analysis, just like I am sparing 100% of my cycles to do something I normally don't.
Then why are we having this discussion? I _know_ the concept applies to typical commercial and amateur engines that play at a very high level. Do you ever intend to reach that level? If you don't have the non-determinism I see in these other programs, I suspect you will as you get better. Maybe then you will see the need for a better testing regime, or else you are going to do as I have done _many_ times in the past and make some serious false steps that are going to cause some embarrassing losses. I need point no further than the first two games of the 1986 WCCC event to show how an insufficient number of test games can lead to a horrible decision and result.

But I have never made any claims about other programs. I have only ever been talking about my program, and asking for advice on how to handle its particular situation. I am not looking for the underlying theoretical variance.

I did that. My results showed that just running 100 games, which could produce any one of those results I posted (or many other possible results as well), could lead you to the wrong conclusion.
Where is your data to show that 100 games is enough? It is easy enough to run my test and post the results to prove they are stable enough that 100 is enough.
Yes I am. I posted samples where A beat B and lost to B, yet A is provably stronger than B. I posted sample games where A beat B and lost to B where they are equal. And I posted games where A beat B and lost to B and A was far stronger than B.

There is an accepted way of dealing with limited sample sizes, but you're just saying that 100 games can never be enough.
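The standard tool here would be a confidence interval on the match score; a minimal sketch under a normal approximation (the 45/20/35 result used below is purely hypothetical):

    # Normal-approximation confidence interval for a match score out of n games.
    import math

    def score_interval(wins, draws, losses, z=1.96):      # z = 1.96 ~ 95%
        n = wins + draws + losses
        points = [1.0] * wins + [0.5] * draws + [0.0] * losses
        mean = sum(points) / n
        var = sum((x - mean) ** 2 for x in points) / (n - 1)
        half = z * math.sqrt(var / n)
        return mean - half, mean + half

    print(score_interval(45, 20, 35))
    # roughly (0.46, 0.64): a 55% score over 100 games still includes 50%

Whether that interval is tight enough depends entirely on how big a difference you are trying to detect.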
Please provide some data to contradict mine. And don't just rely on your program if it is pretty deterministic. Try others as I did, to convince yourself things are not as stable as you think, which means the number of games needed (N) is much larger than you think.
20000 might not be enough. However, I have no idea what you mean by "without qualification". I believe, if you re-read my post, I qualified what I found very precisely. I tried games from a pool of 6 opponents: two significantly better than the rest, two pretty even in the middle of the pack, and two that were much worse. No matter who played who, variance was unbearable over 80-160-320 games. Yes, you can get a quick idea of whether A is better than B or not with fewer games. But I am not doing that. I want to know if A' is better than A, where the difference is not that large. Somehow the discussion keeps drifting away from that point.

With 95% you will be wrong on one of every 20 changes. Too high. Being wrong once can negate 5 previous rights...
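To put a number on "the difference is not that large": under a normal approximation, and assuming a per-game standard deviation of about 0.45 (draws included), the games needed to resolve a given score edge grow with the square of the confidence requirement divided by that edge:

    # Rough estimate of games needed so that a true score edge of `delta`
    # (e.g. 0.52 vs 0.50) sticks out by z standard errors.
    # sigma = 0.45 is an assumed per-game standard deviation (draws included).
    def games_needed(delta, z=1.96, sigma=0.45):
        return int((z * sigma / delta) ** 2) + 1

    for edge in (0.10, 0.05, 0.02, 0.01):
        print(edge, games_needed(edge))
    # roughly 78 games for a 10% edge, 312 for 5%, 1945 for 2%, 7780 for 1%

A 1-2% score edge, i.e. a handful of Elo, is exactly the "not that large" case, and it lands in the thousands of games.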
And I am saying that 100 games can be enough, and statistics provides some methods to find out exactly when this is the case. And please don't give me that "95% confidence means 1 out of 20 is wrong". Okay, so you require what confidence level? 99%? Wow, you'll be wrong 1 time out of 100.
To me, this whole discussion seems to be: you were surprised that the variability was much higher than you expected, and now you're on a crusade to tell everybody "100 games is not enough" without qualification (as in, the situation under which it applies; I am not questioning your credentials). And it is this missing qualification that concerns me.
[Well, let me tell you: 20000 games is not enough. I changed one line in the code and so far I haven't noticed any significant results. What do you have to say about that?]
Well, the heading of _this_ thread is "...for the rest of us", which means that I am explicitly taking Crafty out of the picture, because from my point of view, Crafty's situation is special (just as probably from your point of view, Eden's situation is special). Now, I never claimed that any of the conclusions you made about Crafty is wrong. What I did claim was that your results do not necessarily apply to "the rest of us", which it seemed to me (but I may of course have been wrong) you have been implying, saying things like "100 games is not enough".
Without qualification means "100 games is not enough".
With qualification would mean "100 games is not enough for me to prove that Crafty version x.1 is stronger than x.2".
I certainly can't claim my observations apply to "all of the rest of you". But based on the opponents I _have_ tested, I would claim that they apply to _most_ of the rest of you. And I will also bet that as time goes by, they will apply to you more and more as well...
Yes, and I am happy that you're still listening and that we are starting to clear up some of our misunderstandings.
If you want to discuss something else, fine. But _my_ discussion was about this specific point alone. Nothing more, nothing less. Somehow the subject keeps getting shifted around however.
I have said elsewhere that intuitively (basically, just 4 samples), and without having done the maths, your result is most likely right. I never said it wasn't.

Back to the main idea, once again. I have 6 programs that I play against each other, one of which is mine. Generally I only play mine against 4 of that set of 5 (I weed out one that is significantly worse, as one of those is enough). I want to run Crafty against the set of 4 opponents, then run Crafty' against the same set, and then be able to conclude with a high degree of accuracy whether Crafty or Crafty' is better. Nothing more, nothing less. 80 or 160 or 320 games is absolutely worthless to make that determination. I have the data to prove this convincingly.
But for that issue, in the context of Crafty, there is no discussion. I am saying that you are probably right for that context. I have been trying to talk about something else.
Any other discussion is not about my premise at all, and I don't care about them. I don't care which of my pool of programs is stronger, unless you count Crafty and Crafty'. I don't care about really large N as it becomes intractable. I don't care about really small N (N <=100) as the data is worthless. So I am somewhere in the middle, currently using 20K games, 5K against each of 4 opponents, and still see a small bit of variance there as expected. But not a lot.
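For scale, a rough sketch of the error bar that remains at that sample size, again assuming a per-game standard deviation of about 0.45 and the usual logistic score-to-Elo slope of roughly 695 Elo per unit of score near 50%:

    # Approximate 95% error bar on the score fraction, and in Elo, after n games.
    # sigma = 0.45 per game is an assumption; 695 is the slope of the logistic
    # Elo curve at an even score.
    import math

    def error_bar(n, sigma=0.45, z=1.96):
        half = z * sigma / math.sqrt(n)      # half-width on the score fraction
        return half, half * 695              # rough Elo equivalent near 50%

    for n in (100, 1000, 20000):
        print(n, error_bar(n))
    # about +/-0.088 score (+/-61 Elo) at 100 games, +/-0.028 (+/-19 Elo) at 1000,
    # and +/-0.006 (+/-4 Elo) at 20000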
So can we discuss _that_ issue alone? Comparing an old to a new version to decide whether the new version is stronger. And get off of all the tangential issues that keep creeping in?
And while I would prefer if you could help with my particular issue, I will accept it if it doesn't interest you.
My claim so far regarding Crafty is that my engine is seeing lower variance in its results than Crafty is. I am working on the data to test this hypothesis. Unfortunately it is currently infeasible for me to do the exact same test that you have done, although I can find another computer and leave it running for a week or so.
I have stated my case concisely above. I have offered data to support it. Feel free to offer data that refutes it, and to have that scrutinized as I did my own when I started this testing last year...
So, for now I am only doing that gauntlet against >50 opponents. I will do that 40-position test at some stage, I just can't do it right now. Perhaps someone else could chip in? In addition, perhaps you would be willing to do just one small test where the conditions are closer to what I'm doing right now.
You can do this test pretty easily. You don't need to run long games. I have run 1+0, 1+1, 2+1, 1+2, 5+5, 10+10, 30+30 and 60+60 and the variability doesn't change very much. A time control of 1+0 lets you play about 30 games per hour, actually a bit more. You could produce a 100 game match in 3 hours. Two in 6 hours and that's all you need to test this hypothesis...
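A rough throughput estimate backs that up, assuming the time controls are "base minutes + increment seconds", around 80 moves per game, and that the clocks are nearly used up (real games often finish sooner, so this is a lower bound):

    # Rough lower bound on games per hour for a "base minutes + increment
    # seconds" control, assuming ~80 moves per game and no engine overhead.
    def games_per_hour(base_min, inc_sec, moves_per_game=80):
        per_side = base_min * 60 + inc_sec * (moves_per_game / 2)
        return 3600 / (2 * per_side)         # both clocks nearly exhausted

    print(games_per_hour(1, 0))   # 30.0 -> a 100-game match in a bit over 3 hours
    print(games_per_hour(5, 5))   # 3.6  -> the same 100 games take over a day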