hgm wrote:Perhaps it is better then to cast the message into an anecdotal form, so that the mathematically shy have a better chance of following the discussion.
I'm hardly what one would call "mathematically shy," but I have learned over the years to avoid the trap of "to the man who has a hammer, everything looks like a nail." It is tough to apply statistical analysis to something whose underlying properties are not well understood, if those properties might violate certain key assumptions (independence of samples, for just one). So many new characteristics have been discovered during these tests, and there are most likely more yet to be discovered.
hgm wrote:The average healthy adult male is 6 feet tall, give or take 8 inches. Yet if I take a walk through a crowded shopping mall, occasionally I encounter people who measure 6'9". Once I even encountered a giant of 6'10", and I thought: "Wow, this must be the tallest guy in the county. But since I shop here every day, I was bound to bump into him some time!"
But when I meet someone who is 30 feet tall, I would exclude the idea that he actually consisted of a single part. Much more likely it would be someone riding a giraffe, despite the fact that giraffes are not very common in shopping malls. If I were to think: "Well, if that tall guy I met the other day can be 6'10", his older brother can just as well be 30' tall, and this must be him!"... well, I guess even the mathematically challenged would not have much difficulty finishing that sentence.
So indeed it can be both: some people of above-average height are merely very tall, while other heights simply do not occur even in a population of 6 billion. And even the most mathematically illiterate can usually see the difference between an excess of 2 inches and an excess of 24 feet.
But your story has _one_ fatal flaw. What happens if you bump into someone over 8' tall every few days? Do you still bury your head in the sand and say "can't be"? Because this is the kind of result I have gotten on _multiple_ occasions (and by multiple I mean more than just one or two); that is, using normal Elo numbers with +/- error bars, the minimum for A is larger than the maximum for B. You seem to want to imply that I only post such numbers here when I encounter them. The only reason I posted the last set was that it was the _first_ trial that produced BayesElo numbers for each run. It was the _first_ time I had posted results from Elo computations as opposed to just raw win/lose/draw results.
I posted several results when we started. I increased the number of games and posted more. I have not posted anything since, because nothing changed. We were still getting significant variability, and I had proven that it was a direct result of timing jitter, or at least that removing timing jitter eliminated the problem completely. There was nothing new to say; we were still having the same problems within our "group," where the same big run said "A is better" and a repeat said "B is better." I then ran four "reasonable" runs and two _big_ runs, ran the results through BayesElo, and thought the results were again "interesting," so I posted them here, primarily to see what Remi thought and whether he thought I might be overlooking something or doing something wrong.
But I run into these 8' people far too regularly to accept "your ruler is wrong," or "you are cherry-picking," or whatever. I wrote when I started this that there is something unusual going on. It appears to be related to the small number of positions and the inherent correlation from using the same positions and the same opponents over and over. OK, that way of testing was wrong. It wasn't obviously wrong, as nobody had suggested what Karl suggested previously. Theron mentioned that he had _also_ been using that method of testing and then discovered (whether from Karl's post or from his own evaluation was not clear) that this approach led to unreliable results.
So, to recap, these odd results have been _far_ more common than once a decade. I, apparently like CT, had thought: "OK, N positions is not enough, but since the results are so random, playing more games with the same positions will address the randomness." But it doesn't. And until Karl's discussion, the problem was not obvious to me.
I have been trying to confirm that his "fix" actually works. A waste of time, you say. But only if it merely confirms that his hypothesis was correct. There was no guarantee of that initially; the cause could have been something else, from the simple (random PGN errors) to the complex (programs varying their strength by day or time). I still want to test with one game per position rather than two, to see whether the "noise" mentioned by some will be a factor or not. I still want to test with fewer positions, to see whether the "correlation effect" returns in some unexpected way.
hgm wrote:So the mystery is all hidden in the numbers. Non-overlapping 95% confidence intervals occur once every 178 pairs. (And remember the birthday paradox: if you have 10 people, you already have 45 pairs!) But to have them disjoint by so much that another 95% confidence interval could be fit in between... that is much rarer. Just as being 10" taller than average is rare, but appreciably more common than being 24' taller than average. And the beauty of math is that it enables you to calculate exactly how often a deviation of any given magnitude will occur. Sometimes that is once every 178 tries. And sometimes that is once in a million.
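(For concreteness, here is where figures of that order come from: a minimal Python sketch, assuming the two runs estimate the same true strength with equal, normally distributed errors, so the only question is how often the two estimates land far apart. The 1.96 multiplier for a 95% interval is the only input.)

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z95 = 1.96  # half-width of a 95% interval, in units of the standard error s

# The difference of two independent estimates of the same value has
# standard deviation s*sqrt(2).  The two intervals are disjoint when the
# estimates differ by more than 2*1.96*s.
p_disjoint = 2 * upper_tail(2 * z95 / math.sqrt(2))
print(f"disjoint 95% intervals: p = {p_disjoint:.5f} (about 1 in {1/p_disjoint:.0f})")

# Disjoint with room for a third 95% interval in between: the estimates
# must differ by more than 4*1.96*s.
p_wide = 2 * upper_tail(4 * z95 / math.sqrt(2))
print(f"gap fits another interval: p = {p_wide:.1e} (about 1 in {1/p_wide:.0f})")
```

The first figure comes out at roughly 1 in 180, matching the "once every 178 pairs" above; the second lands in the one-in-tens-of-millions range, the kind of deviation that should essentially never be seen.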
But note that it does not _always_ work that way. My 40-position testing is a case in point: the math did _not_ predict the outcomes anywhere near correctly. The positions were not even chosen by me; _many_ are using those positions to test their programs. Hopefully no longer. But then they are going to be stuck, because using my positions requires a ton of computing power. So what is left? If they add search extensions, small test sets might detect the improvement, since it will be significant. Null move? Less significant. Reductions? Even less significant. Eval changes? Very difficult to measure. Yet everyone wants to, and needs to, do so. Unless someone had run these tests and posted the results for large numbers of games, this would _still_ be an unknown issue, and most would be making decisions that are almost random. And unless someone verifies that the new positions are usable, the same issue could be lying in wait again. Far too often, theory and reality differ because of an unknown feature that is present.
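To put rough numbers on "very difficult to measure," here is a back-of-the-envelope sketch (Python; the per-game score standard deviation of 0.4 is an assumed typical value, and this deliberately ignores the position-correlation problem discussed in this thread, so the figures are best-case counts of fully independent games):

```python
import math

def games_needed(elo_diff, per_game_sd=0.4, z=1.96):
    """Rough count of independent games needed so that an Elo edge of
    elo_diff shows up as a ~2-sigma deviation from a 50% score."""
    expected_score = 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))  # Elo logistic model
    edge = expected_score - 0.5
    return math.ceil((z * per_game_sd / edge) ** 2)

for diff in (50, 20, 10, 5):  # big extension change ... small eval tweak
    print(f"{diff:3d} Elo -> about {games_needed(diff):6d} games")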
hgm wrote:This is why serious scientists do quantitative statistical analysis on their data, recording the frequency of every deviation, to see whether the data can be trusted or not, and to what extent.
And I believe that I had concluded from day 1 that the 40 positions were providing unusual data. But if you do not understand why, you can continue to pick positions from now on and still end up with the same problem. Karl gave a simple explanation of why the SD is wrong here: it should shrink as the number of games increases, assuming the games are independent. But if the 40 positions produce the same results each time, the statistics will say the SD gets smaller as you play more games, when in fact it does not change at all (a small simulation of this effect follows the list below). I think it has been an interesting bit of research. I've learned, and posted, some heretofore _unknown_ knowledge. Examples:
1. I thought I needed a wide book, and then "position learning," to avoid playing the same losing game over and over. Not true. It is almost impossible to play the same game twice, even if everything is set up identically for each attempt.
2. I thought programs were _highly_ deterministic in how they play, except when you throw in the issue of parallel search. Wrong.
3. If someone had told me that allowing a program to search _one_ more node per position would change the outcome of a game, I would have laughed. It is true. In fact, one poster searched N and N+2 nodes in 40 positions and got (IIRC) 11 different games out of 80 or 100. Nobody knew that prior to my starting this discussion. He played the same two programs from the same position for 100 games and got just one duplicate game. Unexpected? Yes, until I started this testing and saw just how pronounced this effect was.
4. I (and many others) thought that hash collisions are bad, and that they would wreck a program. Cozzie and I tested this and published the results. Even with an error rate of one bad hash collision for every 1000 nodes, it has no measurable effect on search quality.
5. Beal even found that random evaluation was enough to let a full-width search play reasonably. Nobody had thought that prior to his experiment.
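Here is the small simulation promised above, a minimal sketch (Python; the 55% scoring rate is an arbitrary assumption, and the "each position always replays the same game" extreme is deliberately exaggerated to make the correlation point visible). It shows the reported standard error shrinking as the same 40 positions are replayed, while the real spread of the match score does not move:

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_P = 0.55   # assumed true scoring rate of A vs. B (illustrative only)
N_POS = 40      # number of starting positions
TRIALS = 20000  # Monte Carlo repetitions of the whole experiment

for repeats in (1, 5, 25):  # how often each position is replayed
    scores = []
    for _ in range(TRIALS):
        # One Bernoulli outcome per position; replaying the position just
        # repeats that outcome (the fully correlated, deterministic case).
        per_pos = rng.random(N_POS) < TRUE_P
        games = np.repeat(per_pos, repeats)
        scores.append(games.mean())
    n = N_POS * repeats
    nominal_se = np.sqrt(TRUE_P * (1 - TRUE_P) / n)  # what naive statistics reports
    actual_sd = float(np.std(scores))                # what actually happens
    print(f"{n:4d} games: nominal SE {nominal_se:.4f}  actual SD {actual_sd:.4f}")
```

The reported error bar falls like 1/sqrt(games), but the real uncertainty stays pinned at the 40-independent-samples level, which is exactly the mismatch described above.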
As I have said previously, the computer chess programs of today have some highly unusual (and unexpected, and possibly even undiscovered) properties that manifest themselves in ways nobody knows about or understands. I had also said previously that standard statistical analysis (which includes Elo) might not apply as cleanly to computer chess programs, because of unknown things that are happening but that have not been observed specifically. Some of the above fall right into that category.
It has been interesting, and it isn't over.