HYATT: Here's Yet Another Testing Thread

Discussion of chess software programming and technical issues.

Moderator: Ras

Uri Blass
Posts: 10820
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: HYATT: Here's Yet Another Testing Thread

Post by Uri Blass »

hgm wrote:
Uri Blass wrote:The point is that even if there are positions where CraftyX and CraftyX+1 score 100%, it does not mean that they are not balanced; it is possible that Crafty simply understands them better than the opponents.
But I would like to remove such positions from my test set, as they give only very little information on the relative strength of X and X+1. Even if there is the occasional rare win, and one of them wins a little bit more often than the other, it contributes more to the noise than to the expectation value. Playing these positions is a waste of time, and perhaps even worse.

This is why I would be suspicious: my aim would be to subject the Engine Under Test to a gauntlet where each game has a winning chance between 75% and 25%. If the total variance of the gauntlet score betrays that I am not achieving that goal, I want to correct that. So I would analyze the results of individual positions and games in such a case of low variability, until I had traced the cause of the problem, to determine whether it is harmful and should be corrected.

Any other approach would be tantamount to a philosophy that says: "I know these results are very likely to be in error somehow, but the error is in the direction I would have liked the truth to be, so I will ignore it".
My point is that you cannot be sure the 100% is going to remain the same after future changes, which may be counterproductive for part of the positions.

If the reason for the 100% or 0% is that the position is not balanced, then it may be a good idea to drop the position; but if the reason is different, then I think it is not a good idea to drop the position from the test.

The problem is that some future change may be productive in the positions that you did not drop but counter-productive in the positions that you dropped, and you will have no way to know it if you drop part of the positions.

If you know that the 100% will remain after future changes then it is a good idea to drop the position from the test, but you do not know that in the first place.

The same idea of course applies to percentages bigger than 75% or smaller than 25%, but I posted only about the case of 100% to keep things simple.
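For concreteness, here is a minimal sketch (mine, not from the thread) of the 25%-75% filter hgm describes; the position names and the result layout are invented for illustration, and the objection above still applies: a position dropped today might become informative again after a future change.

Code: Select all

def filter_positions(results, lo=0.25, hi=0.75):
    """results: dict mapping position id -> list of game scores (0, 0.5, 1)."""
    kept, dropped = {}, {}
    for pos, scores in results.items():
        frac = sum(scores) / len(scores)
        (kept if lo <= frac <= hi else dropped)[pos] = frac
    return kept, dropped

if __name__ == "__main__":
    demo = {
        "pos01": [1, 1, 1, 1, 1, 1],        # 100%: gives almost no information
        "pos02": [1, 0.5, 0, 1, 0, 0.5],    # 50%: maximally informative
        "pos03": [0, 0, 0, 0.5, 0, 0],      # ~8%: near-hopeless from this side
    }
    kept, dropped = filter_positions(demo)
    print("kept:", kept)
    print("dropped:", dropped)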

Uri
frankp
Posts: 233
Joined: Sun Mar 12, 2006 3:11 pm

Re: HYATT: Here's Yet Another Testing Thread

Post by frankp »

bob wrote:
It is commonly called "a prejudiced idea". If you haven't seen it before, it doesn't exist. In Japan, Jabbar would be considered impossible. But the characteristics of chess engines are unexplored territory for the most part. Trees of billions of nodes have different characteristics from trees of tens of nodes.

But just because you haven't seen it is hardly evidence that it doesn't exist. Unless you are a statistician. There's a lot I have not seen before starting this latest research. Non-repeatability is as remarkable as it is unexpected.
Yes, that is the part I am alluding to. You would need prior knowledge that the condition X (= a 30-foot-tall human) was impossible to be sure it was not a statistical freak. And that may be the case here, but...
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: HYATT: Here's Yet Another Testing Thread

Post by hgm »

bob wrote:I'm hardly what one would call "mathematically shy" but I have discovered over the years, that one has to learn to avoid the "to the man who has a hammer, everything looks like a nail." It is tough to apply statistical analysis to something that has underlying properties that are not well understood, if those properties might violate certain key assumptions (independence of samples for just one).
But the mathematical properties of the system you are studying are totally understood. They should be independent repeats of the same experiment (position and opponent combination), with fixed (in time) probabilities of ending in a win, draw or loss. That is all that is necessary to do 100% accurate statistical analysis. And if the results in practice do not conform to that analysis in any credible way, it means that the assumptions are apparently not valid. And there were only three: that the score of an individual game was 0, 1/2 or 1 (and it stretches even my imagination that you would bungle that), that the probabilities were fixed in time, and that the games were independent.

That games starting from a different position might have a different WDL probability distribution, as Karl remarked, does not have the slightest impact on the analysis for the variability.
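As an illustration of the model described here (and nothing more than that), the short sketch below simulates N independent games with fixed win/draw/loss probabilities and checks that the run-to-run variance of the total score matches the textbook prediction; the probabilities and game counts are arbitrary example values.

Code: Select all

import random

def run(n_games, pw, pd):
    """One test run of n_games independent games with fixed W/D/L probabilities."""
    score = 0.0
    for _ in range(n_games):
        r = random.random()
        score += 1.0 if r < pw else (0.5 if r < pw + pd else 0.0)
    return score

def main():
    pw, pd, n = 0.40, 0.30, 800            # arbitrary example values
    mu = pw + 0.5 * pd                     # expected score per game
    var_game = pw * 1.0 + pd * 0.25 - mu * mu
    runs = [run(n, pw, pd) for _ in range(2000)]
    mean = sum(runs) / len(runs)
    var = sum((x - mean) ** 2 for x in runs) / (len(runs) - 1)
    print("predicted run variance:", n * var_game)
    print("observed  run variance:", var)

if __name__ == "__main__":
    main()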

There have been so many new characteristics discovered during these tests, and there are most likely more things yet to be discovered.
Well, so far I have not seen anything being discovered...


But your story has _one_ fatal flaw. What happens if you bump into someone over 8' tall every few days? Do you still hide your head in the sand and say "can't be"?
Of course not. I would say: "obviously the assumption that male adults are 6' +/- 8" tall is not fulfilled". Like I say to your 'frequent' observations: "obviously the assumption of independent games or time-constant WDL probability is not fulfilled". It is you that hides your head in the sand, insisting: "Can't be! My cluster is independent!" <bang head> "Can't be!" <bang head>. (Painful! Although computers are made out of silicon, they are not quite as soft as sand! :lol: )
Because this is the kind of results I have gotten on _multiple_ occasions (and multiple does not mean just "more than 1")... that is, if I used normal Elo numbers +/- error, the min for A is larger than the max for B. You seem to want to imply that I only post such numbers here when I encounter them.
Well, that would be a natural thing to do. Why post data that is absolutely unremarkable? But sorry, "more than 1" won't cut it. That is no science. To make such a claim without actually knowing exactly how often each deviation occurs is simply not credible. Because, like you remarked, no result is absolutely impossible. Just astronomically improbable. You brought this skepticism on our part upon yourself by very often (multiple times, where in this case multiple actually does not mean <= 1) posting data here complaining about the hypervariability, while in fact the variability was lower than expected.
The only reason I posted the last set was that it was the _first_ trial that produced BayesElo numbers for each run. It was the _first_ time I had posted results from Elo computations as opposed to just raw win/lose/draw results.
Be that as it may, that does not mean this is a non-issue. Would you also have posted it here if this first time the BayesElo ratings of Crafty had been exactly the same? If you switch on an internal filter to only post something that is remarkable, it would still be "cherry picking" if it happened the first time. But let us not quarrel over that, as it is counter-productive: only you can know if you would have posted a non-remarkable result.
I posted several results when we started. I increased the number of games and posted more. And have not posted anything since because nothing changed. We were still getting significant variability,
But the problem is that you (based on all these posts) seem to call totally normal, unavoidable sampling noise "significant variability". And if that is your criterion, it does not mean a thing that you were still getting "significant" variability. You have cried "8' giant" too often when a perfectly normal 6' individual left your premises. So now people doubt your eyesight...
So now your remarks about variability would only impress us if we knew your standards, which can only be achieved by telling us exactly how many percent of your data showed a 3-sigma deviation, how much a 4-sigma deviation, etc. But you fail to do it, with a stubbornness that creates a very strong impression that you don't know this yourself...
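The kind of bookkeeping being asked for here is easy to automate. A sketch follows; the run scores, game count and per-game variance are all hypothetical numbers chosen only to show the shape of the calculation.

Code: Select all

import math

def sigma_table(run_scores, games_per_run, var_per_game):
    """Count how many runs deviate from the pooled mean by more than 2, 3, 4 sigma."""
    mean = sum(run_scores) / len(run_scores)
    sd_run = math.sqrt(games_per_run * var_per_game)
    devs = [abs(s - mean) / sd_run for s in run_scores]
    return {k: sum(d > k for d in devs) for k in (2, 3, 4)}

if __name__ == "__main__":
    # Hypothetical example: six runs of 25,000 games each, total scores out of
    # 25,000, and a per-game score variance of 0.2 (roughly what a 50% score
    # with a modest draw ratio gives).
    runs = [13110, 13050, 12980, 13205, 13140, 12890]
    print(sigma_table(runs, games_per_run=25000, var_per_game=0.2))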
I had proven that it was a direct result of time jitter, or at least that removing timing jitter eliminated the problem completely. And there was nothing new to say. We were still having the same problems within our "group", where the same big run said "A is better" and a repeat said "B is better". And nothing new was happening. I then ran four "reasonable" runs and two _big_ runs, ran the results thru BayesElo, and thought that the results were again "interesting" and posted here, primarily to see what Remi thought and whether he thought I might be overlooking something or doing something wrong.

But I run into these 8' people way too regularly to accept "your ruler is wrong, or you are cherry-picking, or whatever". I wrote when I started this that there is something unusual going on. It appears to be related to the small number of positions and the inherent correlation when using the same positions and same opponents over and over.
Well, here we obviously continue to disagree. On both counts. It was not related to that, and nothing 'appears' so far.
OK, that way of testing was wrong. It wasn't obviously wrong, as nobody suggested what Karl suggested previously.
Well, this becomes a bit of a sticky subject now. Because we can no longer know if you mean that no one suggested what you think Karl suggested (which is of course true, as what you ascribe to Karl is simply untrue and nonsensical), or if you mean that no one said what Karl actually said (which would be false, as I said the same thing 11 months ago).
Theron mentioned that he has _also_ been using that method of testing and then discovered (whether it was from Karl's post or his own evaluation was not clear, however) that this approach led to unreliable results.
Of course it led to unreliable results. But that is not the same as hyper-variability. As I wrote 11 months ago, the results would even be unreliable if you had zero variability (by playing a billion games).
So, to recap, these odd results have been _far_ more common than once a decade,
depends on your standards of 'odd', which apparently differ from the usual...
... I, apparently like CT, had thought that "OK, N positions is not enough, but since the results are so random, playing more games with the same positions will address the randomness." But it doesn't. And until Karl's discussion, the problem was not obvious to me.
And the only thing that is obvious to us, is that what you consider obvious is obviously wrong...
I have been trying to confirm that his "fix" actually works. A waste of time, you say. But _only_ if it confirms his hypothesis was correct. There was no guarantee of that initially; the cause could have been something else, from the simple (random PGN errors) to the complex (programs varying their strength based on day or time). I still want to test with one game per position rather than 2, to see if the "noise" mentioned by some will be a factor or not. I still want to test with fewer positions to see if the "correlation effect" returns in some unexpected way.
I would applaud the latter. I think this is very important, because if it does not return, and the 38,000-game runs with 80 positions (40, black and white) would show similar variability to the 'many-positions' runs you did recently, the problem might recur just as easily in any scheme.
But note that it does not _always_ work that way. My 40 position testing is a case in point. The math did _not_ predict outcomes anywhere near correctly.
Well, math is absolute truth. If you get a result that deviates from a mathematical prediction, it can only mean that the assumptions on which that prediction was based cannot have been satisfied. And the only assumptions going into the prediction were that the results for games from the same position were drawn independently from the same probability distribution in both runs.
The positions were not even chosen by me. _Many_ are using those positions to test their programs. Hopefully no longer. But then, they are going to be stuck because using my positions requires a ton of computing power. So what is left? If they add search extensions, small test sets might detect the improvement since it will be significant. Null-move? Less significant. Reductions? Even less significant. Eval changes? Very difficult to measure. Yet everyone wants to and needs to do so. Unless someone had run these tests, and posted the results for large numbers, this would _still_ be an unknown issue and most would be making decisions that are almost random. Unless someone verifies that the new positions are usable, the same issue could be waiting in hiding again. Far too often, theory and reality are different, due to an unknown feature that is present.
Well, in general this way of testing is doomed to failure, even if you can eliminate the hypervariability. The normal statistical fluctuation even on uncorrupted data is simply too large to be useful unless you play billions of games.

This is why I designed the tree-game comparison method, which eliminates the sampling noise.
And I believe that I had concluded from day 1 that the 40 positions were providing unusual data. But if you do not understand why, you can continue to pick positions from now on and still end up with the same problem.
Well, if I needed to do accurate measurements I would use tree-games, and I would not have this problem, as the initial position makes up only a negligible fraction of the test positions there.
Karl gave a simple explanation about why the SD is wrong here.
So you seem to think. Wrongly so.
It should shrink as number of games increases, assuming the games are independent. But if the 40 positions produce the same results each time, the statistics will say the SD is smaller if you play more games, when in fact, it does not change at all.
It does not change because it is already zero. Would you have expected it to become negative with more games, or what?
I think it has been an interesting bit of research. I've learned, and posted some heretofore _unknown_ knowledge. Examples:

1. I thought I needed a wide book, and then "position learning", to avoid playing the same losing game over and over. Not true. It is almost impossible to play the same game twice, even if everything is set up the same way for each attempt.
Not generally valid, as we discussed before. Most people have exactly the opposite problem...
2. I thought programs were _highly_ deterministic in how they play, except when you throw in the issue of parallel search. Wrong.
Not wrong for most programs. To be sure of sufficiently random behavior, it is better not to rely on any uncontrollable jitter, but to randomize explicitly. As the Rybka team does.
3. If someone had told me that allowing a program to search _one_ more node per position would change the outcome of a game, I would have laughed. It is true. In fact, one poster searched N and N+2 nodes in 40 positions and got (IIRC) 11 different games out of 80 or 100. Nobody knew that prior to my starting this discussion. He played the same two programs from the same position for 100 games and got just one duplicate game. Unexpected? Yes, until I started this testing and saw just how pronounced this effect was.


4. I (and many others) thought that hash collisions are bad, and that they would wreck a program. Cozzie and I tested this and published the results. Even with an error rate of one bad hash collision for every 1000 nodes, it has no measurable effect on the search quality.

5. Beal even found that random evaluation was enough to let a full-width search play reasonably. Nobody had thought that prior to his experiment.

I have said previously that the computer chess programs of today have some highly unusual (and unexpected, and possibly even undiscovered) properties that manifest themselves in ways nobody knows about or understands. I had also said previously that standard statistical analysis (which includes Elo) might not apply equally well to computer chess programs because of unknown things that are happening but which have not been observed specifically. Some of the above fall right into that category.

It has been interesting, and it isn't over.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: HYATT: Here's Yet Another Testing Thread

Post by bob »

hgm wrote:
bob wrote:I'm hardly what one would call "mathematically shy" but I have discovered over the years, that one has to learn to avoid the "to the man who has a hammer, everything looks like a nail." It is tough to apply statistical analysis to something that has underlying properties that are not well understood, if those properties might violate certain key assumptions (independence of samples for just one).
Again, you are making assumptions that have not been proven. Are the games actually independent? Is there some bizarre circumstance that makes use of the timing jitter so that one run is biased in one way, but the next major run is biased in the opposite way, and the third run is not biased at all? First, I don't know what data to gather as games are played to have any way of drawing such a conclusion. I can't measure when interrupts occur, when they don't, when context switches occur, when they don't. Etc. So there is no way I could collect enough data so that one could start thinking T-test to try to determine what might be the factor that is influencing things. The computer environment is chaotic. But nothing says it is a random chaos, or one that is orderly along some axis or another.

Something interesting is going on with respect to non-deterministic behavior in situations that most of us thought were totally deterministic. And it would be about as hard to prove as the old butterfly-wings-on-the-opposite-side-of-the-world type of folklore. Since the environment is unstable, in ways that are unpredictable and even unmeasurable, just maybe Elo calculations are the wrong way to try to measure things in a chess engine.

If you assume uniform distribution, or normal distribution, but the actual behavior is a variable probabilistic distribution that varies wildly over time, then the numbers are going to be wrong. Or it will take so many samples from such a wild distribution to make sure that your samples are representative, that the process is intractable.

I am not sure what is causing the behavior I am seeing. Random I understand. Uniform I understand. But not something that seems uniform on one run, random on another, and biased on a third.
But the mathematical properties of the system you are studying are totally understood. They should be independent repeats of the same experiment (position and opponent combination), with fixed (in time) probabilities of ending in a win, draw or loss. That is all that is necessary to do 100% accurate statistical analysis. And if the results in practice do not conform to that analysis in any credible way, it means that the assumptions are apparently not valid. And there were only three: that the score of an individual game was 0, 1/2 or 1 (and it stretches even my imagination that you would bungle that), that the probabilities were fixed in time, and that the games were independent.

That games starting from a different position might have a different WDL probability distribution, as Karl remarked, does not have the slightest impact on the analysis for the variability.
There have been so many new characteristics discovered during these tests, and there are most likely more things yet to be discovered.
Well, so far I have not seen anything being discovered...


But your story has _one_ fatal flaw. What happens if you bump into someone over 8' tall every few days? Do you still hide your head in the sand and say "can't be"?
Of course not. I would say: "obviously the assumption that male adults are 6' +/- 8" tall is not fulfilled". Like I say to your 'frequent' observations: "obviously the assumption of independent games or time-constant WDL probability is not fulfilled". It is you that hides your head in the sand, insisting: "Can't be! My cluster is independent!" <bang head> "Can't be!" <bang head>. (Painful! Although computers are made out of silicon, they are not quite as soft as sand! :lol: )

That is not what I have said. I said there is no _causality_ from A to B. I have not said there are no dependencies. In fact, as far back as the original discussion, my comment was that the only sort of dependency I can see is that the same programs are playing the same positions over and over. The issue I see today is not a cluster issue. It is not any sort of "testing issue". It seems to be related to how computers function in general, in ways that are not deterministic at all, but in ways that might lead to a repeated bias for a while, before the bias is reversed. How? I don't know. But I do know that no game influences any other game in any way we could name, since nothing is shared, files are initialized, etc. But the same computers are being used repeatedly, and they may well have unexpected properties that would apply to anyone using a computer for testing, whether it is one node or a thousand.

I have said _all along_ that it appears to me that some assumptions being made are unsound. And I agree that the independent trial assumption is the only viable candidate. I am certain that the method used to play games is not biased or suspect. Nor are the programs suspect for playing weaker or stronger depending on the date/time. So computers _in general_ have some characteristic that is causing that assumption to be invalid to an extent. How remains to be discovered.
Because this is the kind of results I have gotten on _multiple_ occasions (and multiple does not mean just "more than 1")... that is, if I used normal Elo numbers +/- error, the min for A is larger than the max for B. You seem to want to imply that I only post such numbers here when I encounter them.
Well, that would be a natural thing to do. Why post data that is absolutely unremarkable? But sorry, "more than 1" won't cut it. That is no science. To make such a claim without actually knowing exactly how often each deviation occurs is simply not credible. Because, like you remarked, no result is absolutely impossible. Just astronomically improbable. You brought this skepticism on our part upon yourself by very often (multiple times, where in this case multiple actually does not mean <= 1) posting data here complaining about the hypervariability, while in fact the variability was lower than expected.
How can it be both "lower than expected" and also "a six sigma event?" If you go back to the original thread from last year, I posted _several_ sets of data that had strange variability. Not just one. Several. In fact, I never had any results that offered any stability at all, for measuring smallish changes.

The only reason I posted the last set was that it was the _first_ trial that produced BayesElo numbers for each run. It was the _first_ time I had posted results from Elo computations as opposed to just raw win/lose/draw results.
Be that as it may, that does not mean this is a non-issue. Would you also have posted it here if this first time the BayesElo ratings of Crafty had been exactly the same?
Actually, I did that. But in posting, when I looked at those results, it caught my eye. I copied them as soon as they were finished, as my intent was to show that the Elo calculation was not very stable, which it wasn't. I just didn't realize how bad those last two runs were until I pasted them here and started to make some comments about lower supposed error with larger runs.

And I have done so after trying the new test position approach as well, if you noticed.

If you switch on an internal filter to only post something that is remarkable, it would still be "cherry picking" if it happened the first time. But let us not quarrel over that, as it is counter-productive: only you can know if you would have posted a non-remarkable result.
You have seen six non-remarkable results posted recently. So perhaps I am missing your point...
I posted several results when we started. I increased the number of games and posted more. And have not posted anything since because nothing changed. We were still getting significant variability,
But the problem is that you (based on all these posts) seem to call totally normal, unavoidable sampling noise "significant variability". And if that is your criterion, it does not mean a thing that you were still getting "significant" variability. You have cried "8' giant" too often when a perfectly normal 6' individual left your premises. So now people doubt your eyesight...
I think you are just hung up on the semantics of "significant". Not as in a T-test, but as in "this is important with respect to the initial goal of measuring programming improvements." When I get runs where the Elos are so far apart that the error bars won't bridge the gap, then which should I trust? That is what I call "significant". When I started doing this, many assumed 20-40 games were enough to decide if something was better or not. My testing suggested that even 800 games were not enough. Current testing suggests that 40K with old, 40K with new might pick up most changes unless they are _very_ small. Initially there was an outcry about needing 800 games. That seems to have passed as everyone realizes it is nowhere near enough for this purpose.
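For what it is worth, the "error bars won't bridge the gap" check can be made concrete with the usual logistic Elo formula and a normal approximation to the score. This is a much cruder sketch than what BayesElo does, and the run scores and draw ratio below are invented numbers used only to show the calculation.

Code: Select all

import math

def elo_interval(score, games, draw_ratio=0.3, z=1.96):
    """Elo difference and an approximate 95% interval from a run's total score."""
    p = score / games
    elo = -400.0 * math.log10(1.0 / p - 1.0)
    # Per-game variance of a W/D/L score: p(1-p) - draw_ratio/4 (draw_ratio assumed).
    var = p * (1 - p) - 0.25 * draw_ratio
    se_p = math.sqrt(max(var, 1e-9) / games)
    lo, hi = p - z * se_p, p + z * se_p
    return (elo,
            -400.0 * math.log10(1.0 / lo - 1.0),
            -400.0 * math.log10(1.0 / hi - 1.0))

if __name__ == "__main__":
    # Hypothetical runs: version A scores 13100/25000, version B scores 12700/25000.
    for name, score in (("A", 13100), ("B", 12700)):
        elo, lo, hi = elo_interval(score, 25000)
        print(f"{name}: {elo:+.1f} Elo  [{lo:+.1f}, {hi:+.1f}]")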

So now your remarks about variability would only impress us if we knew your standards, which can only be achieved by telling us exactly how many percent of your data showed a 3-sigma deviation, how much a 4-sigma deviation, etc. But you fail to do it, with a stubbornness that creates a very strong impression that you don't know this yourself...
I posted raw data here every time, and for the longest time that was all I used. I still have much of the w/l/d results, and a program that will take those and produce fake PGN that BayesElo will accept. But if things continue as they have with the new testing, that will be a back-burner project since it will be moot for the moment.

I had proven that it was a direct result of time jitter, or at least that removing timing jitter eliminated the problem completely. And there was nothing new to say. We were still having the same problems within our "group", where the same big run said "A is better" and a repeat said "B is better". And nothing new was happening. I then ran four "reasonable" runs and two _big_ runs, ran the results thru BayesElo, and thought that the results were again "interesting" and posted here, primarily to see what Remi thought and whether he thought I might be overlooking something or doing something wrong.

But I run into these 8' people way too regularly to accept "your ruler is wrong, or you are cherry-picking, or whatever". I wrote when I started this that there is something unusual going on. It appears to be related to the small number of positions and the inherent correlation when using the same positions and same opponents over and over.
Well, here we obviously continue to disagree. On both counts. It was not related to that, and nothing 'appears' so far.
So this is not about the small number of positions and the correlation Karl had mentioned? And moving to a large number of positions, which has at least so far eliminated the odd results without requiring more runs, doesn't suggest that the original test positions were an issue? Hmmm, I thought that was what we have been discussing for a while now. Didn't you claim to have said I needed more positions 11 months ago? (For the wrong reason, but you still made the claim.) So that comment came from where, and why, if this is now not an issue?
OK, that way of testing was wrong. It wasn't obviously wrong, as nobody suggested what Karl suggested previously.
Well, this becomes a bit of a sticky subject now. Because we can no longer know if you mean that no one suggested what you think Karl suggested (which is of course true, as what you ascribe to Karl is simply untrue and nonsensical), or if you mean that no one said what Karl actually said (which would be false, as I said the same thing 11 months ago).

It means precisely this: No one suggested that using a small number of positions played multiple times, would have more correlation built-in than using a larger set of positions played only one time. And that the small number of positions would therefore violate the independent sample assumption enough that the odd results were actually not that odd.
Theron mentioned that he has _also_ been using that method of testing and then discovered (whether it was from Karl's post or his own evaluation was not clear, however) that this approach led to unreliable results.
Of course it led to unreliable results. But that is not the same as hyper-variability. As I wrote 11 months ago, the results would even be unreliable if you had zero variability (by playing a billion games).
So, to recap, these odd results have been _far_ more common than once a decade,
depends on your standards of 'odd', which apparently differ from the usual...
... I, apparently like CT, had thought that "OK, N positions is not enough, but since the results are so random, playing more games with the same positions will address the randomness." But it doesn't. And until Karl's discussion, the problem was not obvious to me.
And the only thing that is obvious to us, is that what you consider obvious is obviously wrong...
Which "us" are you speaking for? If Karl is wrong, so be it. But so far, his predictions are spot on.
I have been trying to confirm that his "fix" actually works. A waste of time, you say. But _only_ if it confirms his hypothesis was correct. There was no guarantee of that initially; the cause could have been something else, from the simple (random PGN errors) to the complex (programs varying their strength based on day or time). I still want to test with one game per position rather than 2, to see if the "noise" mentioned by some will be a factor or not. I still want to test with fewer positions to see if the "correlation effect" returns in some unexpected way.
I would applaud the latter. I think this is very important, because if it does not return, and the 38,000-game runs with 80 positions (40, black and white) would show similar variability to the 'many-positions' runs you did recently, the problem might recur just as easily in any scheme.
But note that it does not _always_ work that way. My 40 position testing is a case in point. The math did _not_ predict outcomes anywhere near correctly.
Well, math is absolute truth. If you get a result that deviates from a mathematical prediction, it can only mean that the assumptions on which that prediction was based cannot have been satisfied. And the only assumptions going into the prediction were that the results for games from the same position were drawn independently from the same probability distribution in both runs.
The positions were not even chosen by me. _Many_ are using those positions to test their programs. Hopefully no longer. But then, they are going to be stuck because using my positions requires a ton of computing power. So what is left? If they add search extensions, small test sets might detect the improvement since it will be significant. Null-move? Less significant. Reductions? Even less significant. Eval changes? Very difficult to measure. Yet everyone wants to and needs to do so. Unless someone had run these tests, and posted the results for large numbers, this would _still_ be an unknown issue and most would be making decisions that are almost random. Unless someone verifies that the new positions are usable, the same issue could be waiting in hiding again. Far too often, theory and reality are different, due to an unknown feature that is present.
Well, in general this way of testing is doomed to failure, even if you can eliminate the hypervariability. The normal statistical fluctuation even on uncorrupted data is simply too large to be useful unless you play billions of games.

This is why I designed the tree-game comparison method, which eliminates the sampling noise.
I am not so sure it eliminates anything. But that is a different issue. When I test, I want to test the _program_. Not the individual parts. The old "the whole is greater than the sum of the parts" applies and many parts of the program are inter-related and need to be measured as a whole. One could play fixed-depth games and eval improvements should result in more wins. Proving that the new eval is better. But what about the _cost_ of those new eval terms that actually slow you down so that you would normally lose a little depth in a real game, but not here? So your test says "better" when in reality you are "worse" because you are slowing down too much. This is not easy to deal with, when the _real_ game is based on time, not nodes. Artificially limiting things can lead to incorrect assumptions, even though the results look statistically significant. Because what you are testing statistically is not what you are going to be testing when you play real games.
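The trade-off described here can at least be put into rough numbers. The sketch below converts a nodes-per-second slowdown into an approximate Elo cost and weighs it against a hypothetical fixed-depth gain; the effective-branching-factor-of-2 assumption and the 60 Elo per doubling of speed are assumed ballpark figures, not measurements from the thread.

Code: Select all

import math

def net_effect(fixed_depth_gain_elo, nps_old, nps_new, elo_per_doubling=60.0):
    """Rough net Elo at fixed time: fixed-depth gain minus the cost of the slowdown."""
    depth_loss = math.log2(nps_old / nps_new)   # plies lost, assuming branching factor ~2
    speed_cost = depth_loss * elo_per_doubling  # assumed Elo value of one doubling of speed
    return fixed_depth_gain_elo - speed_cost

if __name__ == "__main__":
    # Hypothetical numbers: +8 Elo at fixed depth, but the new eval terms slow
    # the engine from 2.0M to 1.8M nodes per second.
    print(f"net effect at fixed time: {net_effect(8.0, 2.0e6, 1.8e6):+.1f} Elo")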

And I believe that I had concluded from day 1 that the 40 positions were providing unusual data. But if you do not understand why, you can continue to pick positions from now on and still end up with the same problem.
Well, if I needed to do accurate measurements I would use tree-games, and I would not have this problem, as the initial position makes up only a negligible fraction of the test positions there.
And the conclusions from such testing would be flawed, as I explained earlier. You have to "dance with the one what brung ya". You can't change partners and then draw conclusions about how the other partner will do.

Karl gave a simple explanation about why the SD is wrong here.
So you seem to think. Wrongly so.
It should shrink as number of games increases, assuming the games are independent. But if the 40 positions produce the same results each time, the statistics will say the SD is smaller if you play more games, when in fact, it does not change at all.
It does not change because it is already zero. Would you have expected it to become negative with more games, or what?


Unfortunately, the games are not all identical. So it would not be zero.
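To illustrate why the spread ends up somewhere between the naive prediction and zero, here is a toy simulation (my own construction, not Karl's analysis): repeats of the same position reproduce a "typical" outcome with probability rho, and the run-to-run standard deviation of the total score is compared with the independent-games prediction. All numbers, including rho, are invented knobs for illustration.

Code: Select all

import math
import random

def run_score(n_positions, repeats, p_win=0.5, rho=0.8):
    """One test run: n_positions starting positions, each played 'repeats' times."""
    score = 0
    for _ in range(n_positions):
        block = 1 if random.random() < p_win else 0   # the 'typical' outcome here
        for _ in range(repeats):
            # With probability rho the repeat just reproduces the block outcome.
            score += block if random.random() < rho else (1 if random.random() < p_win else 0)
    return score

def spread(n_positions, repeats, trials=400):
    """Run-to-run standard deviation of the total score."""
    scores = [run_score(n_positions, repeats) for _ in range(trials)]
    mean = sum(scores) / trials
    return math.sqrt(sum((s - mean) ** 2 for s in scores) / (trials - 1))

if __name__ == "__main__":
    n_games = 8000
    print("naive independent-games SD:", math.sqrt(n_games * 0.25))
    print("40 positions x 200 games  :", spread(40, 200))
    print("8000 positions x 1 game   :", spread(8000, 1))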
I think it has been an interesting bit of research. I've learned, and posted some heretofore _unknown_ knowledge. Examples:

1. I thought I needed a wide book, and then "position learning", to avoid playing the same losing game over and over. Not true. It is almost impossible to play the same game twice, even if everything is set up the same way for each attempt.
Not generally valid, as we discussed before. Most people have exactly the opposite problem...

No. They _think_ they have the opposite problem. I have tried these A vs B games among all 6 programs I am testing with and each exhibits the _same_ level of non-deterministic behavior. So they have it, they just might not know they have it.

2. I thought programs were _highly_ deterministic in how they play, except when you throw in the issue of parallel search. Wrong.
Not wrong for most programs. To be sure of sufficiently random behavior, it is better not to rely on any uncontrollable jitter, but to randomize explicitly. As the Rybka team does.

Sorry but wrong for _most_ programs. Only a very few use primitive timing approaches such as complete full iterations before timing out. And even if you do that, if you ponder, you get the same issue again, because there is no way to predict how long the opponent will think and you have to stop and re-start whenever he moves. Leaving lots of old information around to influence the new search. It is amusing to see you speak for most. How many have you tried? We have tested with _all_ major commercial programs manually, and know exactly how they are subject to this issue. I have not seen a single example of a program that syncs at the end of iterations only, until you said yours did. Clearly "most" is wrong. "Many" would be just as wrong. Possibly a "few" might still be overstating.
3. If someone had told me that allowing a program to search _one_ more node per position would change the outcome of a game, I would have laughed. It is true. In fact, one poster searched N and N+2 nodes in 40 positions and got (IIRC) 11 different games out of 80 or 100. Nobody knew that prior to my starting this discussion. He played the same two programs from the same position for 100 games and got just one duplicate game. Unexpected? Yes, until I started this testing and saw just how pronounced this effect was.


4. I (and many others) thought that hash collisions are bad, and that they would wreck a program. Cozzie and I tested this and published the results. Even with an error rate of one bad hash collision for every 1000 nodes, it has no measurable effect on the search quality.

5. Beal even found that random evaluation was enough to let a full-width search play reasonably. Nobody had thought that prior to his experiment.

I have said previously that the computer chess programs of today have some highly unusual (and unexpected, and possibly even undiscovered) properties that manifest themselves in ways nobody knows about or understands. I had also said previously that standard statistical analysis (which includes Elo) might not apply equally well to computer chess programs because of unknown things that are happening but which have not been observed specifically. Some of the above fall right into that category.

It has been interesting, and it isn't over.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: HYATT: Here's Yet Another Testing Thread

Post by bob »

Uri Blass wrote:
hgm wrote:
Uri Blass wrote:The point is that even if there are positions where CraftyX and CraftyX+1 score 100%, it does not mean that they are not balanced; it is possible that Crafty simply understands them better than the opponents.
But I would like to remove such positions from my test set, as they give only very little information on the relative strength of X and X+1. Even if there is the occasional rare win, and one of them wins a little bit more often than the other, it contributes more to the noise than to the expectation value. Playing these positions is a waste of time, and perhaps even worse.

This is why I would be suspicious: my aim would be to subject the Engine Under Test to a gauntlet where each game has a winning chance between 75% and 25%. If the total variance of the gauntlet score betrays that I am not achieving that goal, I want to correct that. So I would analyze the results of individual positions and games in such a case of low variability, until I had traced the cause of the problem, to determine whether it is harmful and should be corrected.

Any other approach would be tantamount to a philosophy that says: "I know these results are very likely to be in error somehow, but the error is in the direction I would have liked the truth to be, so I will ignore it".
My point is that you cannot be sure the 100% is going to remain the same after future changes, which may be counterproductive for part of the positions.

If the reason for the 100% or 0% is that the position is not balanced, then it may be a good idea to drop the position; but if the reason is different, then I think it is not a good idea to drop the position from the test.

The problem is that some future change may be productive in the positions that you did not drop but counter-productive in the positions that you dropped, and you will have no way to know it if you drop part of the positions.

If you know that the 100% will remain after future changes then it is a good idea to drop the position from the test, but you do not know that in the first place.

The same idea of course applies to percentages bigger than 75% or smaller than 25%, but I posted only about the case of 100% to keep things simple.

Uri
Actually, that is a good point. If a new program plays against some older programs, he could possibly conclude that all positions are bad for him because he loses all games, or he could conclude that all positions are unbalanced because white always wins. But what you would like to see is for some of those "unbalanced" positions to become balanced when you include something in your program that your opponent doesn't know about.

I have found positions where Crafty wins all games against all opponents, and positions where it loses all games against all opponents, and everything in between. Both programs might play the black side of a Sicilian badly so that it "looks" unbalanced, but once you get the evaluation refined a bit, it suddenly might begin to play the black side much better and draw or even win with black, while still winning with white because the opponent has no clue.

One such well-known position is position 1 from the original Bratko-Kopec test, where it was a trivial mate in 3 and nobody missed it, making it useless.
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: HYATT: Here's Yet Another Testing Thread

Post by hgm »

bob wrote:Actually, that is a good point. If a new program plays against some older programs, he could possibly conclude that all positions are bad for him because he loses all games, or he could conclude that all positions are unbalanced because white always wins. But what you would like to see is for some of those "unbalanced" positions to become balanced when you include something in your program that your opponent doesn't know about.
The reason I don't like to have such positions in my test set is that the results are so consistent because the program makes an error very early on, and it is usually very easy to mistune the evaluation to avoid that error by upgrading a nearly equivalent move for a reason totally unrelated to the source of the error.

If these positions are always lost by the engine under test because it lets the situation slowly deteriorate through an accumulation of independent mistakes, it would be better to test that position against weaker opponents, so that fixing one of the mistakes would express itself in the result much earlier.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: HYATT: Here's Yet Another Testing Thread

Post by bob »

hgm wrote:
bob wrote:Actually, that is a good point. If a new program plays against some older programs, he could possibly conclude that all positions are bad for him because he loses all games, or he could conclude that all positions are unbalanced because white always wins. But what you would like to see is for some of those "unbalanced" positions to become balanced when you include something in your program that your opponent doesn't know about.
The reason I don't like to have such positions in my test set is that the results are so consistent because the program makes an error very early on, and it is usually very easy to mistune the evaluation to avoid that error by upgrading a nearly equivalent move for a reason totally unrelated to the source of the error.

If these positions are always lost by the engine under test because it lets the situation slowly deteriorate through an accumulation of independent mistakes, it would be better to test that position against weaker opponents, so that fixing one of the mistakes would express itself in the result much earlier.
This is all hypothetical, because I am not going to go thru the games from 4,000 positions any time soon. But when I was looking at smaller numbers of positions (and I did, carefully, in the case of the original 40 Silver positions I have been using), I asked the question "is this tactically lost or positionally lost, or did Crafty just lose it from both sides due to poor move choices?" It is actually fairly hard to find unbalanced positions prior to move 15 if they come from a recent PGN file with good opponents... This is also how I have been evaluating changes in the program for years: by watching games to see how it plays in positions similar to ones where it has played badly in the past. I finally decided that was too inaccurate and got into the current testing experimentation.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: HYATT: Here's Yet Another Testing Thread

Post by Michael Sherwin »

swami wrote:

Code: Select all

HYATT: Here's Yet Another Testing Thread
That's creative :) How did you come up with that?!

Like, just thinking about what possible letters/words (HY) you could substitute into the already existing one, YATT, to match your last name?

Sorry for OT, but thumbs up for that! :wink:
Cute, but, it did take Bob 1.5 years to set it up! :lol: