New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Correlated data discussion

Post by Zach Wegner »

Fritzlein wrote:
Zach Wegner wrote:Isn't this pretty much exactly what I've been saying?
Yes it is. I apologize for not quoting you directly; before my account was activated it was difficult to respond to everyone at once, but I agreed with the distinction you were making, and just wanted to try to say it clearly myself.
I think you misunderstand. My comment was more directed towards Bob, who seemed to contend that using time rather than node counts gives enough variation that it will give reliable results. It looked to me as though the person he was quoting was disagreeing with him! Anyway, I'm glad you came to the forum, so that somebody can agree with me. :lol:

Cheers,
Zach
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Correlated data discussion

Post by Dirt »

bob wrote:
Dirt wrote:
bob wrote:To continue the discussion above, the idea of using a different position for each opponent seems reasonable. Not playing black and white from each, however, becomes more problematic because then we have to be sure that the positions are all relatively balanced, which is not exactly an easy task when we start talking about 10K or more positions.
I don't think there is any need to be sure the positions are balanced. That is, you get more information from balanced positions whether you use them for both black and white or not. What you lose by testing only one way is accuracy about how well Crafty plays against the opponents. Theoretically you don't care about that, only about how the different versions of Crafty compare.

In practice I don't think it makes much difference, and the convenience of having half the number of positions probably outweighs any loss of testing efficiency.
Take an extreme point... all positions are such that white is winning. The test would not show a thing, since either A or A' will win every game with white or lose every game if it plays black. Could that happen? Hard to say, but with so many gambit lines available, and so many lines that are _very_ deeply analyzed (the Sicilian, for example), it would be well within the realm of probability to choose a large number of unbalanced positions. If you choose half of those, then half of the testing effort won't show anything. The other issue is that if you don't get a good representative sample of the openings you typically play, and you do get a lot of games from openings you avoid because you don't handle them very well, then the results get biased.
My point is partly that such an unbalanced position won't show anything even if you test it again with opposite colors, so you need roughly balanced positions regardless. On the other hand, not replaying the games with colors reversed will get you a wider sample which should result in a more accurate A vs A' test, but I don't think the small gain would be worth the effort.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Nope, sorry. "stampee foot" was _not_ directed at you. You are looking at the data as if it were produced exactly as I said (which it was) and trying to conjecture why the results were so different and how that could happen. Others are just saying "your testing is flawed and games are dependent on each other somehow" and letting it go at that. There is absolutely no way there is any sort of causal relationship, since the individual matches are played on 128 nodes that are completely separate and independent: different processors, different memory, different disk drives, etc. I have even run on two different clusters (three, actually, as one run was done on an AMD cluster). And the results have not changed in any appreciable way. That is why I am so certain the testing platform is irrelevant here. And why I welcomed someone who was willing to look at it with an eye toward trying to explain the results, rather than trying to explain why they are not possible, or why they are selective sets of data, or fabricated, or whatever...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

hgm wrote:
Fritzlein wrote:I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.
Yes, I noticed that. But the fuss is all about the fact that two of Bob's samples differed by more than 6 sigma from each other. About how much they differ from this true quantity, which you and I agree is so relevant, no one has any clue. Except, of course, that if two samples differ by a certain amount, at least one of them must differ from the true winning percentage by at least half that. But it could be arbitrarily much more.
I was not talking about the standard deviation between a test run and an
Aha, a new item. "Stampee foot, revise history, stampee foot." Exactly when have you been claiming that _anyone_ using either the Nunn positions, the Silver positions, the Noomen positions, or any other small set of positions has this problem? That would be interesting to see. Your primary contribution has been that my testing is somehow flawed in how the cluster works and that it is introducing dependencies that are causing the variability...

identical or nearly-identical test run.
Clear. But Bob was. So that makes your remarks not very relevant to his problem. Which is what I pointed out when Bob first quoted them.
We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision?
Well, it seems like Bob wants that very much. You might say it is his principal drive in life. Otherwise he would not be using only 40 positions and 5 opponents some 8 months after I pointed out this problem to him. This was one of the first useless non-contributions to this discussion, you see... :lol: :lol: :lol:
(Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision. I'll let the computer scientists duke that out; to me it is frankly less interesting because the mathematical mystery is gone. What I have explained (to my own satisfaction at least) is why each of the two 25600-game test runs can be expected to have a large error relative to Crafty's true strength, and thus I only need any old random change between the two big runs to complete the less interesting (to me) explanation as well.
Well, that is something we definitely agree on.

Tell me now if we also agree on this:

The magnitude of the correlations you point out can be measured empirically by taking the marginal scores of the big run, projected onto the different positions, different opponents, and different playing times, and calculating the score variance over those marginal distributions. (Provided that the number of games played from each position (or against each opponent) is so large that the sampling noise in the result for that position (opponent) is small compared to the typical difference in score between the different positions.) That will tell you how much the result for a single position will typically deviate from the result of the average over all positions, and thus how many positions you minimally need to bring this source of sampling noise below the required accuracy bound.
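[For concreteness, a minimal sketch of the measurement hgm describes above. It assumes the big run's game results are available as (position_id, opponent, score) records with scores of 0, 0.5 or 1 from the tested engine's point of view, and that every position has enough games for its marginal score to be well determined; the record layout and the 0.005 target accuracy are illustrative, not from the thread.

from collections import defaultdict
from statistics import mean, pvariance

def positions_needed(results, target_sd=0.005):
    """Estimate how many starting positions are required so that the
    position-to-position spread contributes no more than target_sd to
    the error of the overall mean score.

    results: iterable of (position_id, opponent, score) records,
    score in {0.0, 0.5, 1.0} from the tested engine's point of view."""
    by_position = defaultdict(list)
    for position_id, _opponent, score in results:
        by_position[position_id].append(score)

    # Marginal (per-position) mean scores, i.e. the big run projected
    # onto the positions.  Assumes many games per position, so the
    # within-position sampling noise is small.
    marginals = [mean(scores) for scores in by_position.values()]

    # Spread of the marginals = how much a single position's result
    # typically deviates from the average over all positions.
    var_pos = pvariance(marginals)

    # Averaging n independent positions gives SD = sqrt(var_pos / n),
    # so n must be at least var_pos / target_sd**2.
    n_needed = int(var_pos / target_sd ** 2) + 1
    return var_pos, n_needed

The same projection onto opponents (or playing times) would give the corresponding estimate for the number of opponents needed.]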
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

OK, I'll buy that. But there is something to be learned that you probably would not think about. What about positions where Crafty loses from both black and white? I call those "problematic positions" and look at them very carefully, because something interesting has to be happening if it is unable to win from either side, or at least draw from one...
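[Flagging such positions from the match output is straightforward; a small sketch, assuming per-game records of (position_id, color, score) are available somewhere — the record format here is hypothetical.

from collections import defaultdict

def problematic_positions(games):
    """Return the position ids that were lost with BOTH colors, i.e. the
    tested engine never scored better than 0 as White or as Black.

    games: iterable of (position_id, color, score) records, with
    color in {"white", "black"} and score from the tested engine's
    point of view."""
    best_by_color = defaultdict(lambda: {"white": 0.0, "black": 0.0})
    for position_id, color, score in games:
        best_by_color[position_id][color] = max(
            best_by_color[position_id][color], score)
    return [pos for pos, best in best_by_color.items()
            if best["white"] == 0.0 and best["black"] == 0.0]
]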
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Zach Wegner wrote:
Karl wrote:If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison between engines).
Isn't this pretty much exactly what I've been saying?
Yes, but there is a fatal flaw. You already saw where 10,001,000 nodes produces different results from 10,000,000 nodes, correct? Now make one tiny change to your evaluation or search and guess what happens to the tree? It will change by _more_ than 1,000 nodes, and you get a different random result.

There is no way to search the same exact tree with a different evaluation or search, so you are right back to where we started. There will be a random offset between the two sample sets just as there is with random times at present...
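[For reference, a sketch of the record-and-replay scheme Karl proposes: draw one random node budget per (position, opponent) game, store it, and reuse exactly the same budgets when A' replays the match. The play_game callback, file name, base node count and jitter are all placeholders; and, as noted above, pinning the opponent's node count still cannot make A and A' search identical trees once the evaluation changes.

import json
import random

BASE_NODES = 10_000_000   # nominal opponent node budget (illustrative)
JITTER = 0.10             # +/-10% randomization per game (illustrative)

def make_schedule(positions, opponents, seed=1, path="node_schedule.json"):
    """Draw one random opponent node count per (position, opponent) pair
    and save the schedule so the identical budgets can be replayed
    against the modified engine A'."""
    rng = random.Random(seed)
    schedule = {
        f"{pos}|{opp}": int(BASE_NODES * rng.uniform(1 - JITTER, 1 + JITTER))
        for pos in positions
        for opp in opponents
    }
    with open(path, "w") as f:
        json.dump(schedule, f, indent=2)
    return schedule

def run_match(engine, schedule, play_game):
    """Play every scheduled game.  play_game(engine, position, opponent,
    opponent_nodes) -> score is a placeholder for whatever harness
    actually runs one game with the opponent limited to that node count."""
    scores = {}
    for key, nodes in schedule.items():
        position, opponent = key.split("|")
        scores[key] = play_game(engine, position, opponent, nodes)
    return scores
]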
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion

Post by xsadar »

bob wrote:
xsadar wrote: This point makes sense, but I don't like it. That means we would need 10 times as many positions as I originally thought for an ideal test. Also, if we don't play the positions as both white and black, it seems (to me at least) to make it even more important that the positions be about equal for white and black. I hope you're kind enough, Bob, to make the positions you finally settle on (however many that may be) available to the rest of us.
Of course. However, once I run the first run (cluster is a bit busy but I just started a test so results will start to trickle in) I would like to start a discussion about how to select the positions.

To continue the discussion above, the idea of using a different position for each opponent seems reasonable. Not playing black and white from each, however, becomes more problematic because then we have to be sure that the positions are all relatively balanced, which is not exactly an easy task when we start talking about 10K or more positions.
That's the main thing I was worried about.
What I have done is to take the PGN collection we use for our normal wide book (good-quality games) and then just modify the book-create procedure so that on wtm move 11 (both sides have played 10 moves) I write out the FEN. The good side is that probably most of these positions are decent. The down side is that this does not cover unusual openings very well, and might not cover some at all. So you might find out your program does well in normal positions, but you might not know it handles off-the-wall positions poorly...

So there is a ton of room for further discussion. But let me get some hard Elo data from these positions. I want to run them 4 times so that I get 4 sets of Elo data, which will hopefully be very close to the same each time...
I hope it gives some good results. This is with Crafty playing both black and white against each engine for each position, right?
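[For illustration only, here is roughly what the FEN-extraction step bob describes above looks like using the python-chess library instead of Crafty's book-build code (file names are placeholders): write out the position with White to move after both sides have played 10 moves, skipping duplicates.

import chess
import chess.pgn

def positions_at_move_11(pgn_path="games.pgn", out_path="positions.epd"):
    """Write the FEN of each game after Black's 10th move (White to
    move at move 11), one per line, with duplicates removed."""
    seen = set()
    with open(pgn_path) as pgn, open(out_path, "w") as out:
        while True:
            game = chess.pgn.read_game(pgn)
            if game is None:
                break
            board = game.board()
            for move in game.mainline_moves():
                board.push(move)
                if board.fullmove_number == 11 and board.turn == chess.WHITE:
                    fen = board.fen()
                    if fen not in seen:
                        seen.add(fen)
                        out.write(fen + "\n")
                    break
]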
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

hgm wrote:
bob wrote:Believe I got that 55 years ago or so. The point being addressed was how something with an SD of 12 (or less) can still produce two runs whose results are well outside that range.
Get that?
Not only that: I even answered it about 30 pages of posts ago.
I just reduced the number of positions so that I could run the various tests being suggested, and not make the discussion drag out over months.
Reducing the number of games is not compatible with insisting on being able to measure the same tiny Elo differences you wanted to measure with 25,000 games. No use sulking about it. If you want to create a model problem that you can test quickly, you will have to scale the requirement for accuracy accordingly.
No, but if you'd just follow the discussion once in a while, you would see that reducing the number of games is good enough to test the hypothesis that a round-robin will stabilize the ratings more. One does have to have an attention span measured in something longer than milliseconds to be able to carry on discussions here of course. I've never proposed using 800 games for the good/bad testing. But I did suggest using it to test the hypothesis that playing a round robin would help, otherwise the test would take over a week for one run, not a day.

Get that now? Nobody is saying _reduce_ the number of games for real testing. At least not me...

I fully understand that 25,000 games produces a smaller SD than 800.
Apparently not, as you continue complaining that the 800 games show a larger deviation than would make it a 'useful' replacement for your 25,000-game run.
But, unlike yourself, I also fully understand that running 800 games takes a lot less time,
Where did you get that silly idea from?
and if you recognize that the SD goes up as the sample size goes down, the discussion can _still_ reach some sort of conclusion that could be verified on the bigger run if necessary.

Because _everybody_ is using some sort of A vs. B testing, either to measure rating differences to see who is better, or to see whether a change was good. No idea how you do your testing and draw your conclusions, and I don't really care. But I am addressing what _most_ are doing. And things are progressing in spite of your many non-contributions, thank you.
You're welcome. :lol:
OK, we are playing with the number of positions. I am up to almost 4,000 and am testing this. How many opponents? 4,000 needed there too? If so, I can actually play 16,000,000 games. But can anyone else? Didn't think so, so we need something that is both (a) useful and (b) doable. So far you are providing _neither_, while others are at least making suggestions that can be tested.
Yes, life is difficult, isn't it, for those relying on blind guesses in the face of infinities. The scientific approach would of course be to calculate how many you need. You have tried 40 different positions, and played a good deal more than 40 games on each of those. So you are in a position to calculate the standard deviation of the result over the different positions. Calculate by which factor that falls short of the accuracy you want, square it, and, magical trick, you have the number of positions needed.

Oh, sorry, too difficult. Another useless non-contribution. And you of course no longer have the results of the 25,000-game matches specified by position...
BTW your "most frequent suggestion, by far" has not been to use more positions or more opponents. 99% of your posts are "stampee feet, testing flawed, stampee feet, cluster is broken, stampee feet, there are dependencies between the games, stampee feet, stampee feet." None of which is useful. I had already pointed out that the _only_ dependencies present were tied to same opponents and same positions.
In your dreams, yes.

But as you used the same positions and opponents in both runs, these dependencies (if any) should decrease the variability of the runs (and hence the typical difference between two runs), and thus cannot explain your result, which is too large a difference, not too small. That these effects drive up the difference between what your runs produce and what you really wanted to know is irrelevant for explaining the 6-sigma deviation, as you remarked yourself in the very beginning of the previous thread.
But not where one game influences another in any possible way. But we don't seem to be able to get away from that.
I guess this is because of the unfortunate coincidence that I continue to overlook your explanation of how you excluded slow time-dependence of the involved engines as an artifact spoiling your experiment. Can you give me the link back to the post where you did that? :roll:

Easy enough. Because the engines run on "virgin" machines each time. The machines do not cyclically ramp their clock up and then back down, which would not introduce a dependency anyway. So somehow the act of winning or losing a game on machine X would have to alter the clock on machine X so that the next time a game is played there, the altered clock has to somehow affect (in a consistent way) the outcome of that game.

It is all absolutely poppycock, and you (as well as everyone else) know that it is simply not possible. I also don't want to have to prove that sunspots, cosmic rays, close-proximity black holes, gravity waves, etc. are introducing dependencies as well. So back to the real world. If my cluster produces dependencies, so does everyone else's testing, particularly those using just a paltry single machine, or 2 or 4.


Meanwhile, in spite of all the noise, there is a small but steady and helpful signal buried in here that others are contributing. And I am willing to test 'em all without dismissing _anything_ outright. Unlike yourself.
Well, so at least you are following my earlier advice, then:
Muddle on! :lol: :lol: :lol:
Your signal/noise ratio is so bad, I am not aware of your offering any useful signal anyway. So I believe I will just "muddle on" and before long you will actually know how to test engines as a result, since you obviously do not know how at present... And neither do I, at present. But at least _I_ am working on changing that situation.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: More readable

Post by bob »

Dirt wrote: No insults? How boring! Plus, when you concentrate on how to improve the testing rather than on identifying why it went wrong, you don't leave much to argue about.
Actually, even trying to identify why it went wrong would be a refreshing change from what I have been seeing. If I could explain what was going wrong, I could fix it. Alternatively, if I can fix it without knowing precisely why it wasn't working as expected before, that is _also_ OK.

but not "stampee foot, etc..."
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Correlated data discussion

Post by Sven »

Dirt wrote:
bob wrote:To continue the discussion above, the idea of using a different position for each opponent seems reasonable. Not playing black and white from each, however, becomes more problematic because then we have to be sure that the positions are all relatively balanced, which is not exactly an easy task when we start talking about 10K or more positions.
I don't think there is any need to be sure the positions are balanced. That is, you get more information from balanced positions whether you use them for both black and white or not. What you lose by testing only one way is accuracy about how well Crafty plays against the opponents. Theoretically you don't care about that, only about how the different versions of Crafty compare.

In practice I don't think it makes much difference, and the convenience of having half the number of positions probably outweighs any loss of testing efficiency.
I think that including too many unbalanced positions without playing them with both colors may suffer from one problem. You want to compare estimates of the playing strength of versions A and A'. But you would actually deal too often with comparing only the ability to handle a won position, or a lost position, which is only one aspect of playing strength. So while the ability of A' to win better positions may have increased compared to A, its ability to handle balanced positions may have decreased.

Therefore I propose to include both balanced and unbalanced positions in the test set (perhaps more of the balanced kind, since this might be closer to "reality"), but to play all positions with both colors. The latter seems important to me since it may be quite difficult to decide whether a given position belongs to one group or the other (you may ask Rybka, but you still have to define some artificial margin to say where "unbalanced" begins).

Sven
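[A sketch of the classification step Sven alludes to, assuming a UCI engine is available for the evaluation; the engine path, the fixed depth, and the 50-centipawn cutoff are exactly the kind of arbitrary margin he warns about.

import chess
import chess.engine

MARGIN_CP = 50   # arbitrary cutoff between "balanced" and "unbalanced"

def split_by_balance(fens, engine_path="./stockfish", depth=18):
    """Label each FEN as balanced or unbalanced according to a
    fixed-depth evaluation from a UCI engine."""
    balanced, unbalanced = [], []
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        for fen in fens:
            info = engine.analyse(chess.Board(fen),
                                  chess.engine.Limit(depth=depth))
            cp = info["score"].white().score(mate_score=100000)
            (balanced if abs(cp) <= MARGIN_CP else unbalanced).append(fen)
    return balanced, unbalanced
]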