YATT.... (Yet Another Testing Thread)

Discussion of chess software programming and technical issues.

Moderator: Ras

Uri Blass
Posts: 10820
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Karl, input please...

Post by Uri Blass »

<snipped>
hgm wrote: At the very beginning of this discussion I already brought up the fact that the results were far from the truth, due to the small number of games and opponents
Again you make the same mistake.
I guess you mean small number of positions and opponents.
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Karl, input please...

Post by hgm »

Uri Blass wrote:<snipped>
hgm wrote: At the very beginning of this discussion I already brought up the fact that the results were far from the truth, due to the small number of games and opponents
Again you make the same mistake.
I guess you mean small number of positions and opponents.
You are so right... :oops: Well, at least I am consistent...

Problem is that I am only doing this with half an eye, while running tablebases for Shatranj end-games. This is fundamentally different from tablebases for normal Chess, as in Shatranj a bare King loses. So all 3-men end-games are 100% won with wtm, even KNK and K*K for pieces much weaker than a knight. I added this rule to my generator, and it seemed to work OK, but there still must be something horribly wrong: my generator starts by calculating all the subset endgames, and the subset of KNEKF is much longer than both KNKF and KEKF. This should not be possible, but it is hard to get a handle on figuring out what is wrong...
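Not hgm's actual generator, of course, but a minimal sketch of where a "bare King loses" rule would enter a retrograde solver: it simply adds extra terminal losses that seed the backward induction, alongside checkmate. The game here is an abstract successor graph (toy state names, no real Shatranj move generation), purely to show the mechanism.

```python
from collections import defaultdict, deque

def solve(successors, is_bare_loss, is_mate):
    """Backward induction over a game graph. A position is 'W' (won for
    the side to move) if some successor is 'L' for the opponent, and 'L'
    if every successor is 'W'. Terminal losses come from mate *and*,
    in Shatranj, from the bare-King rule. Unlabelled states are draws."""
    preds = defaultdict(list)
    for s, succs in successors.items():
        for t in succs:
            preds[t].append(s)
    value = {}
    remaining = {s: len(succs) for s, succs in successors.items()}
    queue = deque()
    for s in successors:
        if is_mate(s) or is_bare_loss(s):   # bared side has lost on the spot
            value[s] = 'L'
            queue.append(s)
    while queue:
        s = queue.popleft()
        for p in preds[s]:
            if p in value:
                continue
            if value[s] == 'L':             # mover at p can reach a lost position
                value[p] = 'W'
                queue.append(p)
            else:
                remaining[p] -= 1
                if remaining[p] == 0:       # all moves from p lead to opponent wins
                    value[p] = 'L'
                    queue.append(p)
    return value

# Toy graph: 'start' -> 'mid' -> 'mate' (side to move at 'mate' is mated).
succ = {'start': ['mid'], 'mid': ['mate'], 'mate': []}
v = solve(succ, is_bare_loss=lambda s: False, is_mate=lambda s: s == 'mate')
# v: {'mate': 'L', 'mid': 'W', 'start': 'L'}

# Declaring 'mid' a bare-King loss flips the evaluation of 'start':
v2 = solve(succ, is_bare_loss=lambda s: s == 'mid', is_mate=lambda s: s == 'mate')
# v2: {'mid': 'L', 'mate': 'L', 'start': 'W'}
```

The point of the sketch is that the rule changes only the seeding pass; the propagation loop is untouched, which is also why a bug in it tends to show up downstream, in the values of larger endgames built on the subsets.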
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Karl, input please...

Post by Dirt »

hgm wrote:3a) Playing from more positions could only help to get the results closer to the truth, but does nothing for their variability.
I think it does a great deal to reduce the variability. The small number of positions and lack of random influences made the testing hypersensitive to the conditions, so whatever was causing the problem before is likely to have no discernible effect now.
User avatar
tiger
Posts: 819
Joined: Sat Mar 11, 2006 3:15 am
Location: Guadeloupe (french caribbean island)

Re: Karl, input please...

Post by tiger »

hgm wrote:
Uri Blass wrote:<snipped>
hgm wrote: At the very beginning of this discussion I already brought up the fact that the results were far from the truth, due to the small number of games and opponents
Again you make the same mistake.
I guess you mean small number of positions and opponents.
You are so right... :oops: Well, at least I am consistent...

Problem is that I am only doing this with half an eye, while running tablebases for Shatranj end-games. This is fundamentally different from tablebases for normal Chess, as in Shatranj a bare King loses. So all 3-men end-games are 100% won with wtm, even KNK and K*K for pieces much weaker than a knight. I added this rule to my generator, and it seemed to work OK, but there still must be something horribly wrong: my generator starts by calculating all the subset endgames, and the subset of KNEKF is much longer than both KNKF and KEKF. This should not be possible, but it is hard to get a handle on figuring out what is wrong...


OK, it's your problem. You do not show us the data and come saying it is broken. Yes it's broken and it's up to you to fix it...

;-)

It's just humour, I hope you get the reference.

Your comments are appreciated, especially when they are about statistics and scientific facts and not about someone else's supposed deficiencies!

:)



// Christophe
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Karl, input please...

Post by bob »

hgm wrote:Oops! You got me there! :oops:

What I intended to write was 'positions', not 'games'. That a small number of games gives results that are typically far from the truth is of course a no-brainer.
Actually, it isn't. If you play the same position 10 times, and notice that the outcome varies from win to loss to draw, and that the games are not duplicated, then it is a pretty reasonable conclusion that repeating the same position multiple times will be as good as playing multiple positions just once each. CT fell into this. _many_ others have been using the Nunn, Noomen, and two versions of the Silver positions in this same way. Once the issue is exposed, it seems pretty obvious. But not initially. Or at least not to most of us doing this.


The point I intended to summarize in (1) was that even with an infinite number of games, the results would be far from the truth if these games were only played from a small number of positions.
I agree with that, after Karl's explanation and after playing multiple runs with about the same number of games but with far more positions. With this number of positions, the statistical results from BayesElo seem perfectly reasonable each and every run, which is a change from the previous approach.

The point you deny is (3), btw, not (1).

The point we have been discussing the past month seems a moving target. At the very beginning of this discussion I already brought up the fact that the results were far from the truth, due to the small number of games and opponents. But at the time I accepted your dismissal of that, when you said we were not discussing the difference of your results with the truth, but from each other.
OK. So far so good. That was _the_ issue. I never cared, and stated so, whether the ratings were accurate or not. Just that they were repeatable so that I could test and draw conclusions...


But then Karl showed up, and he was only interested in the difference with the truth, and did not want to offer anything on the original problem. So then playing more positions suddenly became the hype of the day...
Not IMHO. He was interested, specifically, in first explaining the "six sigma event" that happened on back-to-back runs, and then suggesting a solution that would prevent this from happening in the future.

But, elaborating on (3), the fact remains that:

3a) Playing from more positions could only help to get the results closer to the truth, but does nothing for their variability.
Sorry, but Karl did _not_ say that. He was specifically addressing the "position played N times" correlation that was wrecking the statistical assumptions. And he suggested a way to eliminate that. I can again quote from his original email to clearly show that he first addressed the variable results that were outside normal statistical bounds, and then offered a solution. The "truth" was a different issue.

3b) I said this

3c) Karl said this

3d) I said it again

3e) You keep denying it.
Nope, I just keep quoting what Karl said, namely that playing the same position multiple times using the same opponents violates the independent trial requirement and while it appeared to reduce the standard deviation with more games, it did not. I understood that. Do you not?
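The effect being argued over can be illustrated with a toy Monte Carlo (my own construction, not Karl's analysis): give each opening position a "typical" outcome, make replays of that position correlated with it, and compare the run-to-run scatter of a few-positions-many-replays design against many distinct positions with the same total game count.

```python
import random
import statistics

def run_match(n_positions, replays, rho=0.9, rng=random):
    """Score of one simulated test run. Each position has a latent
    'typical' outcome; with probability rho a replay just repeats it
    (the correlation), otherwise the game is a fresh 50/50 coin flip.
    The position pool, rho, and probabilities are all made up."""
    score = 0.0
    for _ in range(n_positions):
        typical = rng.random() < 0.5          # this position's usual result
        for _ in range(replays):
            if rng.random() < rho:
                score += typical              # correlated repeat
            else:
                score += rng.random() < 0.5   # genuinely independent game
    return score / (n_positions * replays)

rng = random.Random(1)
few_pos  = [run_match(40, 8, rng=rng) for _ in range(300)]   # 40 positions x 8 replays
many_pos = [run_match(320, 1, rng=rng) for _ in range(300)]  # 320 distinct positions
# Same 320 games per run, but the replayed design scatters far more run to run:
print(statistics.stdev(few_pos), statistics.stdev(many_pos))
```

A naive binomial formula would report the same standard error for both designs (320 "games" each); the simulation shows the replayed runs wandering much more, which is exactly the independent-trial violation under discussion.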
Uri Blass
Posts: 10820
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Karl, input please...

Post by Uri Blass »

bob wrote:
hgm wrote:Oops! You got me there! :oops:

What I intended to write was 'positions', not 'games'. That a small number of games gives results that are typically far from the truth is of course a no-brainer.
Actually, it isn't. If you play the same position 10 times, and notice that the outcome varies from win to loss to draw, and that the games are not duplicated, then it is a pretty reasonable conclusion that repeating the same position multiple times will be as good as playing multiple positions just once each. CT fell into this. _many_ others have been using the Nunn, Noomen, and two versions of the Silver positions in this same way. Once the issue is exposed, it seems pretty obvious. But not initially. Or at least not to most of us doing this.


The point I intended to summarize in (1) was that even with an infinite number of games, the results would be far from the truth if these games were only played from a small number of positions.
I agree with that, after Karl's explanation and after playing multiple runs with about the same number of games but with far more positions. With this number of positions, the statistical results from BayesElo seem perfectly reasonable each and every run, which is a change from the previous approach.

The point you deny is (3), btw, not (1).

The point we have been discussing the past month seems a moving target. At the very beginning of this discussion I already brought up the fact that the results were far from the truth, due to the small number of games and opponents. But at the time I accepted your dismissal of that, when you said we were not discussing the difference of your results with the truth, but from each other.
OK. So far so good. That was _the_ issue. I never cared, and stated so, whether the ratings were accurate or not. Just that they were repeatable so that I could test and draw conclusions...


But then Karl showed up, and he was only interested in the difference with the truth, and did not want to offer anything on the original problem. So then playing more positions suddenly became the hype of the day...
Not IMHO. He was interested, specifically, in first explaining the "six sigma event" that happened on back-to-back runs, and then suggesting a solution that would prevent this from happening in the future.

But, elaborating on (3), the fact remains that:

3a) Playing from more positions could only help to get the results closer to the truth, but does nothing for their variability.
Sorry, but Karl did _not_ say that. He was specifically addressing the "position played N times" correlation that was wrecking the statistical assumptions. And he suggested a way to eliminate that. I can again quote from his original email to clearly show that he first addressed the variable results that were outside normal statistical bounds, and then offered a solution. The "truth" was a different issue.

3b) I said this

3c) Karl said this

3d) I said it again

3e) You keep denying it.
Nope, I just keep quoting what Karl said, namely that playing the same position multiple times using the same opponents violates the independent trial requirement and while it appeared to reduce the standard deviation with more games, it did not. I understood that. Do you not?
I think that most programmers do not use thousands of games to test small changes.
I can say that I never did it.

I used the Noomen positions, but I did not repeat the same positions against the same opponents many times.

I simply knew that I could not get an accurate estimate when testing small changes in that way, and I did the testing only to make sure that I had no serious bugs.

Edit: About your last comment I can only say that I agree with Karl's opinion and not with yours.

I guess that Karl prefers not to make comments against you, and when he has to answer yes or no to a question where he knows his opinion contradicts what you say, he prefers to leave the discussion and say nothing.

Uri
Uri Blass
Posts: 10820
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Karl, input please...

Post by Uri Blass »

Note that Karl tries to say things that are correct and will not make you angry, but he does not say things that he does not believe.

It is important to pay attention to the things that Karl did not say.

Karl said the following based on your post:
"playing the same position multiple times using the same opponents violates the independent trial requirement and while it appeared to reduce the standard deviation with more games, it did not."

Karl did not explain why more games did not reduce the standard deviation (around the wrong result it is expected to give), and I guess that you and Karl disagree about the reason.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Karl, input please...

Post by bob »

Uri Blass wrote:
bob wrote:
hgm wrote:Oops! You got me there! :oops:

What I intended to write was 'positions', not 'games'. That a small number of games gives results that are typically far from the truth is of course a no-brainer.
Actually, it isn't. If you play the same position 10 times, and notice that the outcome varies from win to loss to draw, and that the games are not duplicated, then it is a pretty reasonable conclusion that repeating the same position multiple times will be as good as playing multiple positions just once each. CT fell into this. _many_ others have been using the Nunn, Noomen, and two versions of the Silver positions in this same way. Once the issue is exposed, it seems pretty obvious. But not initially. Or at least not to most of us doing this.


The point I intended to summarize in (1) was that even with an infinite number of games, the results would be far from the truth if these games were only played from a small number of positions.
I agree with that, after Karl's explanation and after playing multiple runs with about the same number of games but with far more positions. With this number of positions, the statistical results from BayesElo seem perfectly reasonable each and every run, which is a change from the previous approach.

The point you deny is (3), btw, not (1).

The point we have been discussing the past month seems a moving target. At the very beginning of this discussion I already brought up the fact that the results were far from the truth, due to the small number of games and opponents. But at the time I accepted your dismissal of that, when you said we were not discussing the difference of your results with the truth, but from each other.
OK. So far so good. That was _the_ issue. I never cared, and stated so, whether the ratings were accurate or not. Just that they were repeatable so that I could test and draw conclusions...


But then Karl showed up, and he was only interested in the difference with the truth, and did not want to offer anything on the original problem. So then playing more positions suddenly became the hype of the day...
Not IMHO. He was interested, specifically, in first explaining the "six sigma event" that happened on back-to-back runs, and then suggesting a solution that would prevent this from happening in the future.

But, elaborating on (3), the fact remains that:

3a) Playing from more positions could only help to get the results closer to the truth, but does nothing for their variability.
Sorry, but Karl did _not_ say that. He was specifically addressing the "position played N times" correlation that was wrecking the statistical assumptions. And he suggested a way to eliminate that. I can again quote from his original email to clearly show that he first addressed the variable results that were outside normal statistical bounds, and then offered a solution. The "truth" was a different issue.

3b) I said this

3c) Karl said this

3d) I said it again

3e) You keep denying it.
Nope, I just keep quoting what Karl said, namely that playing the same position multiple times using the same opponents violates the independent trial requirement and while it appeared to reduce the standard deviation with more games, it did not. I understood that. Do you not?
I think that most programmers do not use thousands of games to test small changes.
I can say that I never did it.
If you read my post carefully, you won't find me suggesting that you did. I said "most use the Noomen, or the Nunn, or the Silver positions in this same way", but I didn't say anyone was able to play thousands of games and repeat it dozens of times with the same opponents, so that they would notice this randomization. Most don't repeat at all. CT said he noticed the randomness just as I did and played more games to address the issue.

I used the Noomen positions, but I did not repeat the same positions against the same opponents many times.
And by doing so, you missed something important, namely that each time you run just N positions, where N is as small as in the sets named above, you get incredible randomness that makes evaluating changes a coin-toss or worse.

I simply knew that I could not get an accurate estimate when testing small changes in that way, and I did the testing only to make sure that I had no serious bugs.
For that kind of debugging, you would have _far_ more luck on ICC where you play very fast games, long games, games with and without increment, games against computers, games against weak and strong humans. If you can run that gauntlet for a day, you can feel pretty good about the absence of major bugs.


Edit: About your last comment I can only say that I agree with Karl's opinion and not with yours.

I guess that Karl prefers not to make comments against you, and when he has to answer yes or no to a question where he knows his opinion contradicts what you say, he prefers to leave the discussion and say nothing.

Uri
What don't you agree with? That using a small set of positions introduces a statistical problem? I gave the example he gave me. You play 80 games, and then 80 more. Suppose both matches have the same result. But we are not looking at matches, we are looking at each game as an individual trial, correct? That is how BayesElo works. Twice as many games gives a smaller standard deviation, correct? And no, it isn't already zero, because the 80 games are not all wins or losses, and we are looking at each independently. And the SD for the 80 games is larger than the SD for 160 games, correct? If you buy that, then you see the flaw, because two matches that are duplicates are not going to provide any greater degree of accuracy; we only have 80 actual games and 80 duplicates. I had not considered this effect, because I never get the same 80 games. But many of them are identical, many of them are different with the same result, and many are different with different results. So the correlation between games from the same position was not something I considered. What in the above appears to be wrong? If you want, I can quote Karl's original explanation word-for-word and you can point out where I am going wrong...
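The 80-vs-160 arithmetic can be checked with standard binomial formulas (a back-of-envelope sketch, not BayesElo's actual computation; the 0.55 score and the logistic Elo link are illustrative assumptions): under independence, doubling the game count shrinks the standard error by sqrt(2), which is exactly the reduction 80 duplicated games would falsely claim.

```python
import math

def score_se(p, n):
    """Standard error of the mean score of n independent games,
    treating each game as a Bernoulli(p) trial (draws ignored)."""
    return math.sqrt(p * (1 - p) / n)

def elo_halfwidth(p, n, z=1.96):
    """Approximate 95% half-width in Elo around a score fraction p,
    using the usual logistic link elo = -400*log10(1/p - 1)."""
    lo, hi = p - z * score_se(p, n), p + z * score_se(p, n)
    to_elo = lambda s: -400 * math.log10(1 / s - 1)
    return (to_elo(hi) - to_elo(lo)) / 2

p = 0.55
print(score_se(p, 80))    # what 80 genuinely independent games support
print(score_se(p, 160))   # what would be *reported* for 80 games + 80 duplicates
print(elo_halfwidth(p, 80), elo_halfwidth(p, 160))
```

The honest error bar for 80 games plus 80 duplicates is still the 80-game one; the 160-game figure is only valid if every game is an independent trial.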
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Karl, input please...

Post by bob »

Uri Blass wrote:Note that Karl tries to say things that are correct and will not make you angry, but he does not say things that he does not believe.

It is important to pay attention to the things that Karl did not say.

Karl said the following based on your post:
"playing the same position multiple times using the same opponents violates the independent trial requirement and while it appeared to reduce the standard deviation with more games, it did not."

Karl did not explain why more games did not reduce the standard deviation (around the wrong result it is expected to give), and I guess that you and Karl disagree about the reason.

Uri
Then I am not sure what we are disagreeing on. I agree that he said the SD was not reduced because of the correlation between games from the same position with the same opponents. OK so far? He even explained how that could cause a 6-SD variation. No, neither he nor I could explain _why_ the small number of games behaves as it does. And I haven't said that he _did_ explain that particular "why". Only "how".

Now, are we on the same page perhaps? _NOBODY_ has explained _WHY_ the games seem to have swings that appear to be dependent and yet the way they are dependent seems to swing from favoring A for a while, to favoring B. I haven't tried to explain that either as I have absolutely no idea at the moment, and it will take a _ton_ of testing to begin to understand what is going on, because there are so many variables to consider that are intrinsic in how computers operate.

I do not see where the confusion is coming from. I provided quotes from him that clearly explained how the duplicate positions _could_ cause a problem, and he suggested an experiment to test his hypothesis. I am not sure anyone will ever be able to answer the other question about inherent randomness that is not quite as "random" as it should be. And if the problem has now disappeared, I am less interested in determining the "why" part of this because it will be a -huge- project.

I am interested in seeing what happens as the number of positions is reduced. Will the error ranges continue to overlap, or will I begin to see more and more runs where the error ranges do not overlap, which would suggest that the problem is beginning to come back? I may well run the original test a few times using 40 new positions, just to see if the numbers are as unstable as before, using different positions with everything else the same...

By the way, from private communications, I can guarantee you we don't disagree about the _reason_ because we have no real idea, other than that it is obviously related to timing, since that is easy enough to prove (done already, in fact). He just explained how he thought this could happen with such a small set of positions. Too many are reading _way_ too much into "between the lines" when there really is nothing there.
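The overlap check described above is simple enough to sketch (my own helper, not bob's harness; the Elo values and the +/-12 error bars below are illustrative numbers, except the -20 from the runs mentioned in this thread):

```python
def intervals_overlap(elo_a, half_a, elo_b, half_b):
    """True if the ranges [elo_a +/- half_a] and [elo_b +/- half_b]
    intersect, i.e. the two runs are statistically compatible."""
    return abs(elo_a - elo_b) <= half_a + half_b

# Two runs at -20 and -5 Elo with +/-12 error bars overlap:
print(intervals_overlap(-20, 12, -5, 12))   # True
# Runs at -20 and +10 with the same bars do not:
print(intervals_overlap(-20, 12, 10, 12))   # False
```

As positions are removed, a growing fraction of non-overlapping pairs among repeated runs would be the warning sign that the old instability is returning.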
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data... six runs done

Post by bob »

I will update later, but the final run finished, with an elo of -20, which is consistent with the other runs and duplicates at least a couple of them for some pretty solid comparisons...