YATT.... (Yet Another Testing Thread)

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Karl, input please...

Post by hgm »

bob wrote:
hgm wrote:Oops! You got me there! :oops:

What I intended to write was 'positions', not 'games'. That a small number of games gives results that are typically far from the truth is of course a no-brainer.
Actually, it isn't. If you play the same position 10 times, and notice that the outcome varies from win to loss to draw, and that the games are not duplicated, then it is a pretty reasonable conclusion to believe that repeating the same position multiple times will be as reasonable as playing multiple positions just once each. CT fell into this. _many_ others have been using the Nunn, Noomen, and two versions of Silver positions in this same way. Once the issue is exposed, it seems pretty obvious. But not initially. Or at least not to most of us doing this.
It seems to me that you are not addressing the quoted remark. The no-brainer is that small number of games = large error. No matter if you play them from different positions or not. The statement about the positions was what I intended to write as (1), and I never said that was a no-brainer.
OK. So far so good. That was _the_ issue. I never cared, and stated so, whether the ratings were accurate or not.
When I am talking about "far from the truth" I am not referring to ratings. What is far from the truth is the rating difference between the two engines you test. So that A might perform better in the gauntlet than A', while A' is in fact the stronger engine. And I would think you do care about that.
Just that they were repeatable so that I could test and draw conclusions...
I don't think that is a sufficient condition. What good would it do you if they were repeatable, but would make you draw the wrong conclusions?
But then Karl showed up, and he was only interested in the difference with the truth, and did not want to offer anything on the original problem. So then playing more positions suddenly became the hype of the day...
Not IMHO. He was interested, specifically, in first explaining the "six sigma event" that happened on back-to-back runs, and then suggesting a solution that would prevent this from happening in the future.
Well, then he apparently failed, as he offered no explanation for the 6-sigma event whatsoever.
But, elaborating on (3), the fact remains that:

3a) Playing from more positions could only help to get the results closer to the truth, but does nothing for their variability.
Sorry, but Karl did _not_ say that.
Then he would have been wrong, as it is a mathematical fact. But it is only by your claim that he was wrong; in fact he did admit to this, and thus was right.
He was specifically addressing the "position played N times correlation" that was wrecking the statistical assumptions.
I already went over this with Karl. The example he gave for the N-times repeated games did not wreck any statistical assumption. What assumption would that be? That if you do two exactly repeatable measurements of two different quantities, the results must be the same? That is an obviously wrong assumption to start with.
And he suggested a way to eliminate that. I can again quote from his original email to clearly show that he first addressed the variable results that were outside normal statistical bounds,
Yes, and I exposed it as wrong. There are no statistical bounds on the difference of different quantities. If humans measure 6' +/- 8", and giraffes 30' +/- 3', no statistical bounds are violated, as there are no bounds on how different humans and giraffes are.
and then offered a solution. The "truth" was a different issue.
Well, we can all read what he responded when I questioned him about this. So if you want to repeat this as a mantra, have fun.
Nope, I just keep quoting what Karl said, namely that playing the same position multiple times using the same opponents violates the independent trial requirement and while it appeared to reduce the standard deviation with more games, it did not.
Well, repeat it as much as you like. Perhaps you think that repeating things often enough will make them true. Mathematical fact is that it is not true. What you say doesn't matter. What Karl says does not matter. Only the math counts.
I understood that. Do you not?
Most certainly not. Do you really think I would take pride in 'understanding' things that are completely wrong?
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Karl, input please...

Post by hgm »

bob wrote:What don't you agree with? That using a small set of positions introduces a statistical problem? I gave the example he gave me. You play 80 games, and then 80 more. Suppose both matches have the same result. But we are not looking at matches, we are looking at each game as an individual trial, correct? That is how BayesElo works. Twice as many games gives a smaller standard deviation? Correct? And no it isn't already zero, because the 80 games are not all wins or losses, and we are looking at each independently. And the SD for the 80 games is larger than the SD for 160 games, correct? If you buy that, then you see the flaw. Because the two matches that are duplicates are not going to provide any greater degree of accuracy, because we only have 80 actual games and 80 duplicates. I had not considered this effect, because I never get the same 80 games. But many of them are identical, many of them are different with same result, and many are different with different results. So the correlation between games from the same position was not something I considered. What in the above appears to be wrong? If you want, I can quote Karl's original explanation word-for-word and you can point out where I am going wrong...
What is wrong is that if repeating the same 80 games within the first run would produce duplicates, or have a chance to produce duplicates, those same 80 games would also be duplicates (with the same high chance) in the second run. So the results of the two runs would be closer together than BayesElo would indicate, not further apart.

If this effect were important, you would get the exact opposite of what you see: variability that is consistently an order of magnitude smaller than the BayesElo error bars.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Karl, input please...

Post by bob »

hgm wrote:
bob wrote:
hgm wrote:Oops! You got me there! :oops:

What I intended to write was 'positions', not 'games'. That a small number of games gives results that are typically far from the truth is of course a no-brainer.
Actually, it isn't. If you play the same position 10 times, and notice that the outcome varies from win to loss to draw, and that the games are not duplicated, then it is a pretty reasonable conclusion to believe that repeating the same position multiple times will be as reasonable as playing multiple positions just once each. CT fell into this. _many_ others have been using the Nunn, Noomen, and two versions of Silver positions in this same way. Once the issue is exposed, it seems pretty obvious. But not initially. Or at least not to most of us doing this.
It seems to me that you are not addressing the quoted remark. The no-brainer is that small number of games = large error. No matter if you play them from different positions or not. The statement about the positions was what I intended to write as (1), and I never said that was a no-brainer.
OK. So far so good. That was _the_ issue. I never cared, and stated so, whether the ratings were accurate or not.
OK, in that case, I misread the statement... And I agree with it: fewer games = greater error. Although it does appear that not many realize just how big this issue is. Kaufman (Rybka) says they play 80K games a night to detect improvements as small as 1 Elo. Not quite.
When I am talking about "far from the truth" I am not referring to ratings. What is far from the truth is the rating difference between the two engines you test. So that A might perform better in the gauntlet than A', while A' is in fact the stronger engine. And I would think you do care about that.
Absolutely.
Just that they were repeatable so that I could test and draw conclusions...
I don't think that is a sufficient condition. What good would it do you if they were repeatable, but would make you draw the wrong conclusions?
It appears, based on the millions of games I have produced and tallied the results for, that this is a pretty likely outcome no matter what kind of testing is done, until you reach the 1M game plateau or whatever. I'm still concerned about past variability with few positions. It seems to be better to have a larger number of positions, but it doesn't take much thought to realize that this is just "hiding" the original issue, not eliminating it. And I wonder if it is really gone, or just "on vacation".
But then Karl showed up, and he was only interested in the difference with the truth, and did not want to offer anything on the original problem. So then playing more positions suddenly became the hype of the day...
Not IMHO. He was interested, specifically, in first explaining the "six sigma event" that happened on back-to-back runs, and then suggesting a solution that would prevent this from happening in the future.
Well, then he apparently failed, as he offered no explanation for the 6-sigma event whatsoever.
Semantics. He clearly explained _how_ it could happen. And how a small number of positions made it more likely that it would happen. Nobody has attempted to explain "why" it happened. Obviously something is going on that is unexpected. I can think of several potential "whys" but verifying them would be difficult. For example, take a case of time jitter. And for the sake of argument, assume that once you start the program, the jitter is constant. Which means that you might sample just before the time "flips over to the next unit", so that your next sample comes after the flip and you think a complete "unit" elapsed when it didn't. Suppose that this repeats for the course of the game, so that all searches get bit by that same "short time" measurement. And once this is set up, it continues for several games so that the games are almost identical. Until "wham" something happens, and now we start sampling right after the time jump and we can search longer before the next unit elapses. And on these searches, we win more than we lose, whereas on the last set we lost more than we won. And there are two distinct runs with correlated results, but the results are opposite, and the error bars are far from overlapping. What happened between the two runs? Who knows. Perhaps nothing other than the delay in starting the second run. Did this happen? No idea. The times I gather for the PGN files won't show this. So it will be a pain to "pin the tail on the donkey" should that be it.

And I don't know that that is what is happening. Only that the small number of positions exhibit correlation across multiple games, and one batch can say "good" and the next batch can say bad. Are there other scenarios? Possibly. But no matter what the issue is, _everybody_ has it. There is a risk in the 4K positions that many of them could be quite similar and exhibit correlation, and they could be sensitive to time jitter as well. It would just be less probable with a small number of games per large number of positions, as opposed to the inverse.

But, elaborating on (3), the fact remains that:

3a) Playing from more positions could only help to get the results closer to the truth, but does nothing for their variability.
Sorry, but Karl did _not_ say that.
Then he would have been wrong, as it is a mathematical fact. But it is only by your claim that he was wrong; in fact he did admit to this, and thus was right.
Let's be precise: N games played over 40 positions, vs N games played over 4,000 positions. I believe Karl said the latter will have _less_ variability than the former, because the correlation effect is effectively removed, so that whatever is causing the strange results with respect to timing jitter will not be a factor.

Let me quote the two parts of the complete text I had posted previously, to show what _I_ am reading. You can then respond where you think that is wrong:

=========================================================
Let's continue our trials:

Trial E: Same as Trial C, but instead of limiting by nodes, we limit by
time.

Trial F: Same as Trial E, including the same time control, except that the
room temperature is two degrees cooler, and because of that or some other
factor, the engines are able to search at 1.001 times the speed they were
searching before.

In these last two trials, the 64 repetitions of each position-opponent-color
combination will not necessarily be identical. Miniscule time variations
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather affected
each block of 64 playouts in coordinated fashion.
==========================================================

OK, that is quote number one, which addresses the correlation issue and suggests that some sort of glitch in the time jitter might be a cause of the problem (but note he does not claim that is what happens, because we could not verify this). Any argument with my interpretation of the above???
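
To put a number on the understatement Karl describes (the factor of nearly 8 for blocks of 64 repetitions), here is a minimal simulation sketch. The per-game scoring rate and the correlation level are hypothetical, not taken from any actual cluster run; it only illustrates how intra-block correlation inflates the run-to-run spread relative to the naive binomial error.

Code:

import random
import statistics

COMBOS, REPS = 400, 64      # 400 position/opponent/color blocks of 64 playouts = 25,600 games
P_WIN = 0.5                 # assumed per-game scoring rate (hypothetical)
RHO = 0.9                   # assumed chance that a repetition just copies its block's "template" result

def one_run():
    score = 0
    for _ in range(COMBOS):
        template = 1 if random.random() < P_WIN else 0       # the result this block tends to repeat
        for _ in range(REPS):
            if random.random() < RHO:
                score += template                             # highly correlated repetition
            else:
                score += 1 if random.random() < P_WIN else 0  # occasional independent playout
    return score / (COMBOS * REPS)

runs = [one_run() for _ in range(100)]
naive_sd = (P_WIN * (1 - P_WIN) / (COMBOS * REPS)) ** 0.5     # SD if all 25,600 games were independent
print("naive SD assuming independence:", round(naive_sd, 4))
print("observed SD across simulated runs:", round(statistics.stdev(runs), 4))
# With RHO close to 1 the observed SD approaches sqrt(64) = 8 times the naive value.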

=========================================================
Now let me pass from trying to give a plausible explanation of your posted
results to trying to solve the practical problem of detecting whether a code
change makes an engine stronger or weaker. I am entirely persuaded of your
opening thesis, namely that our testing appears to find significance where
there is none. We think we see a trend or pattern when it is only random
fluctuation. We need to re-examine our methodology and assumptions so that
we don't jump to conclusions too quickly.


The bugbear is correlation. We are wasting our time if we run sets of trials
that _tend_ to have the same result, even if they don't always have the same
result. Yes, we want code A and code A' to run against exactly the same test
suite, but we don't want code A to run against the same test position more
than once.

The bedrock of the test suite is a good selection of positions. If the
positions are representative of actual game situations, then they will give
us information about how the engine will perform in the wild. They can't be
too heavy on any particular strategic theme that would bias the test results
and induce us to over-fit the engine to do well on that one strategy.

Assuming you have a good way to choose test positions, I think it is a
mistake to re-use them in any way, because that creates correlations. If A'
as white can outplay A as white from a certain position, then probably A' as
black can outplay A as black from the same position. The same strategic
understanding will apply. Re-running the same test with different colors is
not giving us independent information, it is giving us information correlated
to what we already know. Similarly it is a mistake to re-use a position
against different opponents. If A' can play the position better than A
against Fruit, then A' can probably play the position better than A against
Glaurung. The correlation won't be perfect, but neither will the tests be
independent.

In other words, I am saying that if you want to run 25,600 playouts, then you
should have a set of 25,600 unique starting positions that are representative
of the positions you want Crafty to do well on. If you want to remove color
bias, good, have Crafty play white in the even-numbered positions and black
in the odd-numbered positions, but don't re-use positions. If you want to
avoid tuning for a specific opponent, good, have Crafty play against Fruit in
positions numbered 1 mod 5, against Glaurung in positions numbered 2 mod 5,
etc., but don't re-use positions. Come to think of it, re-using opponents
creates a different source of correlation that also minimizes the usefulness
of your results. One hundred opponents will be better than five, and ideally
you wouldn't re-use anything at all. If nothing else, vary the opponents by
making them marginally stronger or weaker via the time control to kill some
of the correlation.
=========================================================

I interpret that to say that the correlation is what is causing the variability, and that by using enough positions to not repeat any of them, we can make this go away and reduce this effect. The suggestion about playing them with different time handicaps was another thing I had not thought about, and that would be easy enough to do as well to make it appear as if there are more opponents than just 5.
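
As a concrete reading of that suggestion, here is a small scheduling sketch. The opponent list, the position loader and the game-playing call are placeholders (load_fens and play_one_game are hypothetical names, not part of any actual test harness); the point is only that each position is used exactly once, with color and opponent chosen by position index.

Code:

OPPONENTS = ["Fruit", "Glaurung", "OpponentC", "OpponentD", "OpponentE"]  # hypothetical 5-engine pool

def schedule(positions):
    """Yield (position, crafty_color, opponent) with no position re-used."""
    for i, pos in enumerate(positions):
        color = "white" if i % 2 == 0 else "black"    # even-numbered positions: Crafty plays white
        opponent = OPPONENTS[i % len(OPPONENTS)]      # position index mod 5 picks the opponent
        yield pos, color, opponent

# Hypothetical usage: 25,600 unique starting positions, one game each.
# for fen, color, opp in schedule(load_fens("positions.epd")):
#     play_one_game(fen, crafty_color=color, opponent=opp)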

But in any case, please explain where I am "missing his point" or "putting words into his mouth" because if I am doing so, it is not intentional.
He was specifically addressing the "position played N times correlation" that was wrecking the statistical assumptions.
I already went over this with Karl. The example he gave for the N-times repeated games did not wreck any statistical assumption. What assumption would that be? That if you do two exactly repeatable measurements of two different quantities, the results must be the same? That is an obviously wrong assumption to start with.
The assumption of independence. Again, we play 80 games and get a result of 25-35-20. You can compute the SD from that. Then we play 160 games, but get 50-70-40. SD goes down as the number of observations goes up. But not in this case, as he explained above. If something (unknown) makes successive sets of 80 games repeat the same results, and then later, when the test is re-run, something unknown again makes the games reproduce but with a different W/D/L total, the SD for each set would be way too small because of the lack of independence. And using the normal SD would again say "odd result, an N-sigma variation should be impossibly rare." But if the original SD for 80 games had been used, then these would not be 6 sigma apart. I thought it made perfect sense. Perhaps not???
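
For what it is worth, a minimal sketch of that arithmetic, using the hypothetical 25-35-20 result: treating each game as an independent trial, counting the 80 games twice makes the computed standard error shrink by sqrt(2), even though no new information was added and the effective sample is still 80 games.

Code:

def score_se(wins, draws, losses):
    """Standard error of the mean score when every game is treated as an independent trial."""
    n = wins + draws + losses
    results = [1.0] * wins + [0.5] * draws + [0.0] * losses
    mean = sum(results) / n
    var = sum((r - mean) ** 2 for r in results) / (n - 1)
    return (var / n) ** 0.5

print(score_se(25, 35, 20))   # the 80 actual games
print(score_se(50, 70, 40))   # the same games counted twice: SE looks ~1/sqrt(2) smaller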

And he suggested a way to eliminate that. I can again quote from his original email to clearly show that he first addressed the variable results that were outside normal statistical bounds,
Yes, and I exposed it as wrong. There are no statistical bounds on the difference of different quantities. If humans measure 6' +/- 8", and giraffes 30' +/- 3', no statistical bounds are violated, as there are no bounds on how different humans and giraffes are.
However, that is a very poor analogy. We are measuring game results from the very same programs, opponents, and positions. But the above example happened, and once the correlation gets factored in, the SD increases, and the anomaly goes away.

and then offered a solution. The "truth" was a different issue.
Well, we can all read what he responded when I questioned him about this. So if you want to repeat this as a mantra, have fun.
Nope, I just keep quoting what Karl said, namely that playing the same position multiple times using the same opponents violates the independent trial requirement and while it appeared to reduce the standard deviation with more games, it did not.
Well, repeat it as much as you like. Perhaps you think that repeating things often enough will make them true. Mathematical fact is that it is not true. What you say doesn't matter. What Karl says does not matter. Only the math counts.
I understood that. Do you not?
Most certainly not. Do you really think I would take pride in 'understanding' things that are completely wrong?
I can't answer what you take pride in. I am simply trying to understand an observed scenario and try to make it work better. You have apparently not observed this since you haven't played that many games. You refuse to believe it really happens and claim cherry-picking and the like. And yet you can never give me any reason why I would want to cherry-pick, or make the data up, or anything else. How, exactly, does it help me? Do you think I get paid by the number of cluster cycles I can burn? Or by the number of jobs I can run? Etc? I'm trying to address a very precisely defined problem, and I simply asked for suggestions. Not endless accusations.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Karl, input please...

Post by bob »

hgm wrote:
bob wrote:What don't you agree with? That using a small set of positions introduces a statistical problem? I gave the example he gave me. You play 80 games, and then 80 more. Suppose both matches have the same result. But we are not looking at matches, we are looking at each game as an individual trial, correct? That is how BayesElo works. Twice as many games gives a smaller standard deviation? Correct? And no it isn't already zero, because the 80 games are not all wins or losses, and we are looking at each independently. And the SD for the 80 games is larger than the SD for 160 games, correct? If you buy that, then you see the flaw. Because the two matches that are duplicates are not going to provide any greater degree of accuracy, because we only have 80 actual games and 80 duplicates. I had not considered this effect, because I never get the same 80 games. But many of them are identical, many of them are different with same result, and many are different with different results. So the correlation between games from the same position was not something I considered. What in the above appears to be wrong? If you want, I can quote Karl's original explanation word-for-word and you can point out where I am going wrong...
What is wrong is that if repeating the same 80 games within the first run would produce duplicates, or have a chance to produce duplicates, those same 80 games would also be duplicates (with the same high chance) in the second run. So the results of the two runs would be closer together than BayesElo would indicate, not further apart.

If this effect were important, you would get the exact opposite of what you see: variability that is consistently an order of magnitude smaller than the BayesElo error bars.
And then on the next run, where the time jitter goes the other way, the results change by the same amount. And _that_ BayesElo output will again have an elo/error-bar which fails to overlap with the first one.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Karl, input please...

Post by hgm »

bob wrote:And then on the next run, where the time jitter goes the other way, the results change by the same amount. And _that_ BayesElo output will again have an elo/error-bar which fails to overlap with the first one.
That is exactly what makes this explanation wrong:

You need a systematically different time jitter in the second run than in the first, but you need exactly the same time jitter (at least to the extent where they produce the same games) for each repetition within the same run.

So the explanation for the large difference between runs has nothing to do with the number of positions. It ascribes it all to a change in the behavior of the time jitter. When I repeat the block of 80 games in the same run, they all produce one result. Then when I repeat them in the next run, they suddenly all give a different result.

So in fact this explanation does ascribe the deviation between the runs to alteration of the hardware behavior from one run to another (but constancy during a run). Like some of us pointed out from the beginning: "conditions cannot have been the same".

And indeed, if during one run you had a very large time jitter, randomizing the games very well despite the fact that many of them were played from the same position, while in the other run the time jitter was extremely small, so that there was little variation of the games, and games started to repeat (wholly or partially) when you repeated the position, the run where the latter happened could be off much more.

This is why you should have analyzed the results of either run in detail, rather than deleting the PGN. It would have been immediately obvious that the games from the same position correlated with each other more in one run than in another. (And this would also make it immediately obvious which was the run you could trust, and which was the corrupted one.) This is why you never should rely on an unpredictable factor like time jitter to randomize the games: you might not be able to control it as much as you would like / think. By randomizing explicitly (as the Rybka team does, and as I always do) to the extent that repeat positions give completely independent games, you become completely insensitive to time jitter, as there would be no way for the jitter to make the game results more random than they already are.

But you cannot get hypervariability when the time jitter obeys the same statistics in both runs. If an 8-game block could produce 2 appreciably different results, A and B, it should have produced a mixture of As and Bs within both the first and the second run. E.g. if you repeat each position 100 times, for 8,000 games, in run 1 you might have 30 As and 70 Bs, and then in run 2 you would also have approximately 30 As and 70 Bs (give or take a handful). Not 100 A + 0 B in run 1 and 0 A + 100 B in run 2. And in the absence of the latter, the results of the runs would obey absolutely normal statistics.
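
A tiny simulation of that last point, with hypothetical probabilities: if each block independently comes out as A with probability 0.3 and B otherwise, two runs of 100 blocks both land near 30 As, differing only by ordinary binomial noise (about sqrt(100*0.3*0.7) ≈ 4.6 blocks), not by 100 versus 0.

Code:

import random

def count_a_blocks(blocks=100, p_a=0.3):
    """Number of blocks that come out as result A in one run."""
    return sum(1 for _ in range(blocks) if random.random() < p_a)

for _ in range(5):
    print("run 1:", count_a_blocks(), "As   run 2:", count_a_blocks(), "As")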
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Karl, input please...

Post by bob »

hgm wrote:
bob wrote:And then on the next run, where the time jitter goes the other way, the results change by the same amount. And _that_ BayesElo output will again have an elo/error-bar which fails to overlap with the first one.
That is exactly what makes this explanation wrong:

You need a systematically different time jitter in the second run than in the first, but you need exactly the same time jitter (at least to the extent where they produce the same games) for each repetition within the same run.


Sorry, but here you are _wrong_. What I need to create is two different sets of N games, one biased in one direction, one biased in the other. Don't have any idea what you are trying to create, but I am trying to create a situation where run one has Elo N +/-5 and run two has Elo N+X +/-5, where X is way larger than 5.

My scenario does that _exactly_. And I explained exactly how it would do so. I can't explain how this could physically happen, but assuming it could, this would provide two non-overlapping Elo ranges that are 6 sigma apart on two 40,000 game runs.

So the explanation for the large difference between runs has nothing to do with the number of positions. It ascribes it all to a change in the behavior of the time jitter. When I repeat the block of 80 games in the same run, they all produce one result. Then when I repeat them in the next run, they suddenly all give a different result.

So in fact this explanation does ascribe the deviation between the runs to alteration of the hardware behavior from one run to another (but constancy during a run). Like some of us pointed out from the beginning: "conditions cannot have been the same".
Nope. Read what I wrote. It depends on precisely when a job is started, with an accuracy greater than the system clock can provide.

Hardware is not changing at all, and in fact, can not change.

And note that I would not begin to claim that this is the issue. I would only say it is possible, but not probable at all.


And indeed, if during one run you had a very large time jitter, randomizing the games very well despite the fact that many of them were played from the same position, while in the other run the time jitter was extremely small, so that there was little variation of the games, and games started to repeat (wholly or partially) when you repeated the position, the run where the latter happened could be off much more.
You are not reading what I wrote. I did _not_ say "jitter varied". In fact, the jitter is constant. It is caused by the fact that the clock is artificially divided up into "periods". The clock increases by "1" when one "period" elapses. But you can sample anywhere during that period. If you sample near the end, and then check the time right after the period ends, you get 1 unit of time, but the actual elapsed time was much smaller than that. If you sample near the beginning of the period, and then sample until it changes, you get 1 unit of time, and the actual elapsed time was approximately one period long. The jitter is not something where the clock moves in a "jerky" way. It is an artifact of the sampling algorithm not knowing that 1.8 units vs 1.1 units have elapsed, because it only measures units of 1.0.
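
A toy illustration of that sampling artifact, assuming a clock that only advances in whole 1.0-unit ticks (the numbers are made up): the same reading of "1 tick" can correspond to very different true elapsed times, depending on where within the tick period the two samples fall.

Code:

import math

def measured_ticks(start_time, true_elapsed, tick=1.0):
    """Elapsed time as seen by a clock that only advances in whole ticks."""
    return math.floor((start_time + true_elapsed) / tick) - math.floor(start_time / tick)

print(measured_ticks(0.95, 0.10))   # sampled just before a tick boundary: reads 1 tick for 0.10 of real time
print(measured_ticks(0.05, 0.90))   # sampled just after a boundary: reads 0 ticks for 0.90 of real time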



This is why you should have analyzed the results of either run in detail, rather than deleting the PGN. It would have been immediately obvious that the games from the same position correlated with each other more in one run than in another. (And this would also make it immediately obvious which was the run you could trust, and which was the corrupted one.) This is why you never should rely on an unpredictable factor like time jitter to randomize the games:
You still don't even come close to "getting it". The first run above I described was _wrong_. And the second run was _also_ wrong. So one was not good, and one was not bad, they both had correlation, but on opposite ends of the Elo scale.

you might not be able to control it as much as you would like / think. By randomizing explicitly (as the Rybka team does, and as I always do) to the extent that repeat positions give completely independent games, you become completely insensitive to time jitter, as there would be no way for the jitter to make the game results more random than they already are.
"as you always do?" How many games do you play and guarantee that you use a different starting position for each? 800 is worthless still. And how are you choosing the starting positions since you have not mentioned that. I explained how I did mine, just as Uri suggested, and which seems to be OK.


But you cannot get hypervariability when the time jitter obeys the same statistics in both runs. If an 8-game block could produce 2 appreciably different results, A and B, it should have produced a mixture of As and Bs within both the first and the second run. E.g. if you repeat each position 100 times, for 8,000 games, in run 1 you might have 30 As and 70 Bs, and then in run 2 you would also have approximately 30 As and 70 Bs (give or take a handful). Not 100 A + 0 B in run 1 and 0 A + 100 B in run 2. And in the absence of the latter, the results of the runs would obey absolutely normal statistics.
Again, re-read what I wrote and think. If it is possible to synchronize each job so that one time they mostly see jitter N, then the next time they see jitter M, then the next time they see jitter O, they could all be biased but in three different directions. And again, I do not believe this is what is happening, and if it is, there is nothing _anyone_ can do about it as it will affect every computer on the planet.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Karl, input please...

Post by hgm »

bob wrote:Sorry, but here you are _wrong_. What I need to create is two different sets of N games, one biased in one direction, one biased in the other. Don't have any idea what you are trying to create, but I am trying to create a situation where run one has Elo N +/-5 and run two has Elo N+X +/-5, where X is way larger than 5.
It is not necessary that they are biased in opposite direction. In fact they don't have to be biased in the usual sense at all. The only thing that is needed to create a 6-sigma event is that the variance of one of the runs is much larger than it ought to be. (e.g. because the games are not sufficiently random, and the results thus correlated). The other run can be very accurate, but the corrupted one will be very far off frequently enough then.
My scenario does that _exactly_.
What does "your scenario" refer to, now? Karls hypothetical repetition of the same small set of positions? That does not do that at all. It could only work if every game in the run from a given position would start in the same clock phase. And then, as by magic, in the next run they would all start in a different clock phase. So they starting time of the games should be phase locked to the prcision of the clock tick. And that despite the fact that you assume the intervening games were different in the second run than in the first. There is no way this scenario would work.
You are not reading what I wrote. I did _not_ say "jitter varied".
Who cares what you said? Do you think the posts of others serve the purpose of repeating what you said? This is what I said, and it could explain the observed effect, whereas what you said cannot.
In fact, the jitter is constant. It is caused by the fact that the clock is artificially divided up into "periods".
You don't know what causes the jitter. It could be due to scheduling delays, disk rotation, and many other things we cannot even imagine.
The clock increases by "1" when one "period" elapses. But you can sample anywhere during that period. If you sample near the end, and then check the time right after the period ends, you get 1 unit of time, but the actual elapsed time was much smaller than that. If you sample near the beginning of the period, and then sample until it changes, you get 1 unit of time, and the actual elapsed time was approximately one period long. The jitter is not something where the clock moves in a "jerky" way. It is an artifact of the sampling algorithm not knowing that 1.8 units vs 1.1 units have elapsed, because it only measures units of 1.0.
This is why you should have analyzed the results of either run in detail, rather than deleting the PGN. It would have been immediately obvious that the games from the same position correlated with each other more in one run than in another. (And this would also make it immediately obvious which was the run you could trust, and which was the corrupted one.) This is why you never should rely on an unpredictable factor like time jitter to randomize the games:
You still don't even come close to "getting it". The first run above I described was _wrong_. And the second run was _also_ wrong. So one was not good, and one was not bad, they both had correlation, but on opposite ends of the Elo scale.
What I do get is that your statements are based on thin air. The only thing we know is that the runs were different. You cannot know which one was wrong, or if they were both wrong. This is just blind guessing on your part.

"as you always do?" How many games do you play and guarantee that you use a different starting position for each? 800 is worthless still. And how are you choosing the starting positions since you have not mentioned that. I explained how I did mine, just as Uri suggested, and which seems to be OK.
Usually I play from 216 starting positions (FRC-type shufflings of the opening), each with white and black to move, and repeat them as often as I need to get the error low enough. Never saw any suspicious correlation. But of course my engine randomizes well, so I even have normal statistics when I play from a single position. That I take different positions is more to reduce my sensitivity to a hidden imbalance in the position than to create diversity.

Again, re-read what I wrote and think. If it is possible to synchronize each job so that one time they mostly see jitter N, then the next time they see jitter M, then the next time they see jitter O, they could all be biased but in three different directions.
I am not sure what you mean by "job" in this context. If a job contains repeated games, there is no way you could make all the repeats experience the same jitter if the games in between the repeats are different. So part of the games within a job would be 'biased' in one direction, another part in the other, and in total the effect would cancel.
And again, I do not believe this is what is happening, and if it is, there is nothing _anyone_ can do about it as it will affect every computer on the planet.
Well, making the engines randomize their moves better seems a good cure...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Karl, input please...

Post by bob »

hgm wrote:
bob wrote:Sorry, but here you are _wrong_. What I need to create is two different sets of N games, one biased in one direction, one biased in the other. Don't have any idea what you are trying to create, but I am trying to create a situation where run one has Elo N +/-5 and run two has Elo N+X +/-5, where X is way larger than 5.
It is not necessary that they are biased in opposite direction. In fact they don't have to be biased in the usual sense at all. The only thing that is needed to create a 6-sigma event is that the variance of one of the runs is much larger than it ought to be. (e.g. because the games are not sufficiently random, and the results thus correlated). The other run can be very accurate, but the corrupted one will be very far off frequently enough then.
The problem is still based on "what is causing this?" And how well does using 4K positions (as opposed to 40) "hide" the problem? Because more positions is not eliminating the problem, if this is what is going on. It is just making it less likely to see one of these.

My scenario does that _exactly_.
What does "your scenario" refer to, now? Karls hypothetical repetition of the same small set of positions? That does not do that at all. It could only work if every game in the run from a given position would start in the same clock phase. And then, as by magic, in the next run they would all start in a different clock phase. So they starting time of the games should be phase locked to the prcision of the clock tick. And that despite the fact that you assume the intervening games were different in the second run than in the first. There is no way this scenario would work.
"no way" = "never". I believe that any sentence with the word "never" in it is inherently false. I can't see any way for this to happen, period, in a real world. But I certainly have enough of the "final results" to know that it does, if not why it does. I do know that the hardware is constant. The programs and positions are constant. The games are played 100% independently with respect to the conditions that can be controlled. But whatever underlying thing is going on is simply not obvious to me, and I am certain it is not related just to our cluster because I have produced this on both machines. And I have seen the same problem running on my office machine two games at a time, although I have never run that many games. But I have seen the error bounds be non-overlapping in the past.


You are not reading what I wrote. I did _not_ say "jitter varied".
Who cares what you said? Do you think the posts of others serve the purpose of repeating what you said? This is what I said, and it could explain the observed effect, whereas what you said cannot.
In fact, the jitter is constant. It is caused by the fact that the clock is artificially divided up into "periods".
You don't know what causes the jitter. It could be due to scheduling delays, disk rotation, and many other things we cannot even imagine.
The clock increases by "1" when one "period" elapses. But you can sample anywhere during that period. If you sample near the end, and then check
Back up. I know _exactly_ what causes the jitter. No question. Absolutely authoritative answer: Jitter is caused by a program sampling _inside_ the accuracy of the clock. If I make the clock run fast enough so that a program can only sample once per tick, the jitter is gone. But then the machine spends all of its cycles keeping time and never runs a program anyway. So the jitter is not a mystery. At least not to me. It has always been there. Jitter is an artifact that affects results when you can sample 0, 1 or more times between clock updates, because now you can only measure a period of time that is a multiple of clock ticks, which might be more than enough variability to change things, particularly when searching at 20M nodes per second and a single extra node can produce a different game.

What I don't understand is how we could contrive an experiment, without intentionally breaking everything, to make this happen. Today we start new processes only right after a clock tick happens. Tomorrow we start new processes only right before a clock tick happens. I could certainly make that happen, but what would be the reason in a normal system? There is none. So the program is lucking into something apparently...



the time right after the period ends, you get 1 unit of time, but the actual elapsed time was much smaller than that. If you sample near the beginning of the period, and then sample until it changes, you get 1 unit of time, and the actual elapsed time was approximately one period long. The jitter is not something where the clock moves in a "jerky" way. It is an artifact of the sampling algorithm not knowing that 1.8 units vs 1.1 units have elapsed, because it only measures units of 1.0.
This is why you should have analyzed the results of either run in detail, rather than deleting the PGN. It would have been immediately obvious that the games from the same position correlated with each other more in one run than in another. (And this would also make it immediately obvious which was the run you could trust, and which was the corrupted one.) This is why you never should rely on an unpredictable factor like time jitter to randomize the games:
You still don't even come close to "getting it". The first run above I described was _wrong_. And the second run was _also_ wrong. So one was not good, and one was not bad, they both had correlation, but on opposite ends of the Elo scale.
What I do get is that your statements are based on thin air. The only thing we know is that the runs were different. You cannot know which one was wrong, or if they were both wrong. This is just blind guessing on your part.

Aha. You finally get the point. You said "one was good, one was bad." Also "blind guessing". I figured I would have to lead you around the barn before you saw the barn. So we can't assume _anything_ about either run, just that they were different. And most likely, statistically, the real value lies somewhere in between the two.


"as you always do?" How many games do you play and guarantee that you use a different starting position for each? 800 is worthless still. And how are you choosing the starting positions since you have not mentioned that. I explained how I did mine, just as Uri suggested, and which seems to be OK.
Usually I play from 216 starting positions (FRC-type shufflings of the opening), each with white and black to move, and repeat them as often as I need to get the error low enough. Never saw any suspicious correlation. But of course my engine randomizes well, so I even have normal statistics when I play from a single position. That I take different positions is more to reduce my sensitivity to a hidden imbalance in the position than to create diversity.
OK, totally different animal, as the evaluation and everything else is different. Although 216 positions probably has the same flaw. And why "repeat them" when that is _exactly_ the same mistake I had in my earlier runs. 40 was bad. 4,000 is good (so far). Nothing convinces me yet that 216 is enough. And when you start repeating you introduce the same "correlation effect". You just don't repeat the runs to detect it I'd bet.


Again, re-read what I wrote and think. If it is possible to synchronize each job so that one time they mostly see jitter N, then the next time they see jitter M, then the next time they see jitter O, they could all be biased but in three different directions.
I am not sure what you mean by "job" in this context. If a job contains repeated games, there is no way you could make all the repeats experience the same jitter if the games in between the repeats are different. So part of the games within a job would be 'biased' in one direction, another part in the other, and in total the effect would cancel.
I have already explained. Job = game. I play 'em one at a time to give small chunks of work that can be balanced pretty well across the cluster to keep all nodes busy.

And again, I do not believe this is what is happening, and if it is, there is nothing _anyone_ can do about it as it will affect every computer on the planet.
Well, making the engines randomize their moves better seems a good cure...
Seems to me that they _already_ randomize their moves pretty well, just due to time variation.