An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:The problem is that you are so uninterested in doing statistical analysis on your data, that you don't seem to be able to distinguish a one-in-a-million event from a one-in-a-hundred event.

What "once in a million event have you seen?" Let's back up just a minute. I reported a while back that when playing 80 game matches, I discovered that it was impossible to use them to detect whether a change to my program produced an improvement or not. In fact, I posted that it was impossible to use these 80 game matches to predict whether A was better than B or vice-versa, within a pool of players that contained my program plus several well-known program that are all quite strong.

You seem to be worrying about the fact that four consecutive matches have a wild variance when the next 10 do not. I don't care about that at the moment. I care about the wild variance _period_ because when I run a test, I can get any one of those random match samples as the result, and that doesn't help me in any way.

Statistically, I don't care how much these samples do or do not vary. I will analyze this at some point. But for now, I want a test that says "better" or "worse", and the variability I am seeing in 80-game matches makes that fail. I then went to 4-match sets and discovered that was not enough. I explained this in the past during these discussions as well. I took a pretty simple change that I knew made a difference (I changed the null-move reduction depth, I eliminated an important search extension, etc.), and running 1-match samples randomly told me "better" or "worse" for any of those cases. 320-game matches did exactly the same thing. Even though I was not talking about very small changes, I still was unable to determine whether the change was good or bad.

That was what I reported on initially, that the results were so random it was surprising. "How" random was (and still is) not so important, because I just wanted to know: "OK, the programs all have a strong non-deterministic aspect to their play, so how many games should I play to be able to accurately measure this effect?"

That is a hard question to answer. The more significant the change, the fewer games I should need, in theory. But I made some significant changes and found that the number of games needed to determine the effect with high accuracy was much larger than I had ever heard anyone mention in the past.
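For a rough sense of scale, here is a minimal back-of-the-envelope sketch in Python (this is not Bob's methodology; the per-game score standard deviation of 0.4 and the two-sigma criterion are assumptions): it asks how many independent games are needed before the error bar is smaller than the score shift a given Elo difference produces.

# Sketch: games needed so that a 2-sigma error bar resolves a given Elo edge.
# Assumptions (not from the thread): per-game score sd ~0.4, standard Elo curve.
import math

def expected_score(elo_diff):
    # standard logistic Elo model
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, per_game_sd=0.4, sigmas=2.0):
    edge = expected_score(elo_diff) - 0.5          # score shift per game
    return math.ceil((sigmas * per_game_sd / edge) ** 2)

for d in (5, 10, 20, 50):
    print(f"{d:3d} Elo -> roughly {games_needed(d):5d} games")

Under those assumptions, even a 20-Elo change needs on the order of 800 games, and a 5-Elo change over ten thousand, which is consistent with the scale being described here.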

I first decided to try to play enough games to drive this non-deterministic result influence down to zero, if possible. It wasn't possible, but I drove it down to a very low level by playing 64 matches against an opponent. And I discovered that I could determine whether a change had any significant effect on performance.

Somehow we get neck deep in "is that atypical?" and such, which at the time I was not interested in analyzing. I could _see_ the randomness, and I realized that I could not use that kind of random observation to make any sort of meaningful decision.

So, to assist us in _our_ problem of trying to evaluate changes that I make, or that Tracy or Mike make, I started playing longer and longer tests and tried longer time controls to see if that would help (it didn't). And I arrived at what we do today, which is reasonably accurate.

I do plan, when I have time, to let my stat friend loose to determine how many games we need to be playing to get reasonable results, and see if that number is smaller than what I am using (which I doubt, based on simple experimental tests we have run, and we have run a _bunch_ of them already, as I have mentioned).

"once in a million" is not something I see often since I have only run a paltry 100,000 80 game matches so far. And nothing I have given here represents any such rare event, taken in context. In a 32 match test, whether the first 4 are wildly random, or whether the wildly random matches are distributed uniformly makes no difference _to me_. The fact that they exist at all is enough to influence how I test. You want to latch onto a tiny sample and say "that is way too rare to possibly happen." You are wrong. It _did_ happen as given. And I continue to get significant randomness. I don't care whether I should not get 4 wild results in a row. The fact that I get 4 wild results at all is enough to show that I can't reliably depend on one single match to predict anything. And the more such wild results I get, the more forceful that becomes.

So, in summary, can we stop trying to take individual runs apart and focus on the big picture? I will post a long statistical analysis when we get the time to do this at some point in the future. Right now, all I care about is trying to determine if a change is good or not. Nothing more, nothing less.

If you believe the samples I posted are a one-in-a-million event, then I just flipped heads 20 times in a row. It happens. Again, I could not tell you the circumstances where that original 4-match sample was produced. I did explain exactly how that last 10-12 match sample was produced. And I have posted others with less (but still highly significant) variance, as those were what I had at the time. There's nothing to be gained by trying to make this data up. I'm just showing a "work in progress" result here and there as the experimental testing progresses.
A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event. You seem to think "Oh well, it is not even twice as large as 2.5*sigma, so if 2.5*sigma occurs frequently, 4.5*sigma cannot be considered unusual." Well, that is as wrong and as naive as thinking that a billion is only 50% more than a million because it has 9 zeros instead of 6. Logic at the level of "I have seen two birds, so anything flies".
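For readers who want to check the orders of magnitude, the one-sided tail probabilities are easy to compute with the Python standard library (a sketch; whether one counts one tail or both changes the figures by a factor of two):

# Sketch: one-sided tail probability of a k-sigma deviation for a normal distribution.
from statistics import NormalDist

for k in (1.0, 2.0, 2.5, 3.0, 4.0, 4.5):
    p = 1.0 - NormalDist().cdf(k)       # P(X > k*sigma)
    print(f"{k:.1f} sigma: p = {p:.2e}  (about 1 in {round(1/p):,})")

The jump from 2.5 to 4.5 sigma is a factor of well over a thousand in rarity, not a factor of two, which is the point being made.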



I don't believe any such thing. But I do believe this:

Whether it is a 4.5*sigma or a 1.0*sigma event, if sigma is big enough (and it is here) it makes the results unusable. I've not said anything else since starting this discussion.
No one contests your right to remain in ignorant bliss of statistics, and to concentrate on things that interest you more. But then don't mingle in discussions about statistics, as your uninformed and irrelevant comments only serve to confuse people.

Funny. It was a topic _I_ started. :) So who is "mingling"???


The matter of randomness in move choice was already discussed ad nauseam in another thread, and not really of interest to anyone, as it is fully understood, and most of us do not seem to suffer from this as much as you do, if at all.
Wow. Talk about inaccurate statements. What about the set of programs I have used? On my cluster, I use fruit, glaurung 1/2, arasan 9/10, gnuchess 4 and 5. And a couple of others I won't mention since the authors asked me to play games and send them back. I use shredder and junior manually and they do the same.

So somehow "most of us" represents a couple of samples, and you complain that I "jump to conclusions"???

Searching only complete iterations is a primitive idea that almost everyone stopped using in the late 70's and early 80's. And the minute that goes away, randomness pops up, and there is nothing that can be done about it if you use any sort of time limit to control the search, because time intrinsically varies on computers.
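As a toy illustration of that last point (purely a sketch; the branching factor, base cost, and jitter figures below are invented, and no real engine is structured this simply): with even a little timing jitter, a fixed wall-clock budget cuts iterative deepening off at different depths on different runs.

# Toy model: iterative deepening under a wall-clock budget with timing jitter.
# Purely illustrative; the cost and jitter figures are made up.
import random

def completed_depth(budget=10.0, base=0.01, branching=3.0, jitter=0.05):
    elapsed, depth, cost = 0.0, 0, base
    while True:
        cost *= branching * random.uniform(1.0 - jitter, 1.0 + jitter)
        if elapsed + cost > budget:          # next iteration would not finish in time
            return depth
        elapsed += cost
        depth += 1

random.seed(1)
depths = [completed_depth() for _ in range(1000)]
print({d: depths.count(d) for d in sorted(set(depths))})

A real engine that stops mid-iteration has the same property: the node at which the clock trips varies from run to run, and occasionally that changes the move played.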


The fact that you cannot tell apart a one-in-15,000 fluke from a 1-in-4 run-of-the-mill data set really says it all: this discussion is completely over your head. Note that I did not say that your gang-of-four was a "fake" (if you insist on calling a hypothetical case such), but that I only _hoped_ it was a fake, and not a cheat or a goof. OK, so you argue that it was a goof, and that you are not to blame for it because you are not intellectually equipped to know what you are doing ("statistics doesn't interest me"). Well, if that suits you better, fine. But if you want to masquerade as a scientist, it isn't really good advertisement...
Where did I say it was a "goof"? I reported _exactly_ what it was: 4 consecutive results obtained in a much larger test. I claimed it was the first _four_ test results in that series of matches. I didn't claim anything else. The question about "how random" came up, and I took what I had. Again, if I flipped heads 20 times in a row to get that sample, so be it; I flipped heads 20 times in a row. Whether you take that first group of 4, or the first group of 4 from the last data I posted, you get the same result. There is enough variability that it takes a _lot_ of games to smooth it out. All the other arguments, tangents, etc. don't change that one iota.

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

mhull wrote:
hgm wrote: A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event.
Could the apparent correlation in the data be related to favorable/unfavorable memory and cache issues mentioned by Bob at program startup? Could this introduce a perturbed periodicity to the quality of play by one engine or another that undermines consistency of play?
No. I specifically eliminate that at present. But I have run tests with and without the node re-initialization, and the results for each way are consistent. The cache/memory mapping is not much of an issue, as I am running a version of linux that addresses this issue directly anyway. (It has a "page coloring" algorithm in the memory allocation code.)

In testing this as a possible cause, I ran a _bunch_ of positions using Crafty and one of the other programs (I don't recall which; just one that produced a very short log with NPS information). Crafty's NPS never varied by even 1%. I hit about 2M nps on the cluster nodes. Here are a few samples I just ran for illustration:

log.001: time=26.14 mat=0 n=49252375 fh=93% nps=1.9M
log.002: time=26.04 mat=0 n=49252375 fh=93% nps=1.9M
log.003: time=26.05 mat=0 n=49252375 fh=93% nps=1.9M
log.004: time=26.10 mat=0 n=49252375 fh=93% nps=1.9M

If you look carefully, the total time varied from 26.04 to 26.14 seconds, which is about a 0.4% spread. Here is data from a node where I turned off the page coloring algorithm so that things might vary more:

log.001: time=26.04 mat=0 n=49252375 fh=93% nps=1.9M
log.002: time=26.21 mat=0 n=49252375 fh=93% nps=1.9M
log.003: time=26.12 mat=0 n=49252375 fh=93% nps=1.9M
log.004: time=26.16 mat=0 n=49252375 fh=93% nps=1.9M


So slightly more difference, but if you look, not much. This variability is caused by operating-system timing jitter, since these represent complete 16-ply searches that do not depend on time to terminate.

That tiny variance in the time is not enough to cause any rating increase or decrease, but it does clearly show that if I search until a specific time interval has elapsed, I will see a slightly different number of nodes each time...
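For reference, the spread figures are easy to reproduce from the eight times quoted above (a trivial sketch; the lists are just those values):

# Relative spread of the search times quoted above.
with_coloring    = [26.14, 26.04, 26.05, 26.10]
without_coloring = [26.04, 26.21, 26.12, 26.16]

for label, times in (("with page coloring", with_coloring),
                     ("without page coloring", without_coloring)):
    spread = (max(times) - min(times)) / min(times)
    print(f"{label}: {spread:.2%}")

Roughly 0.4% with page coloring and 0.7% without, i.e. slightly more but still small.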

hope that clears that up...
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: An objective test process for the rest of us?

Post by Gerd Isenberg »

hgm wrote:
Gerd Isenberg wrote:Let's say we randomly vary the thinking time of one engine between 10 and 20 seconds per move while the second one always gets 15 seconds (or a fixed number of projected nodes). How does that impact the expected variance (assuming balanced starting positions)?

Thanks,
Gerd
Do you mean on a per-move basis? Or on a per-game or per-match basis? Usually engines gain 70 Elo points for doubling the time. That means that at 20 sec per move they are 29 Elo stronger than at 15, and at 10 sec/move they are 40 Elo weaker. What exactly the effect will be if you alternate (or randomly mix) +29 Elo moves with -40 Elo moves is not completely clear. If the result of a game is decided by adding a small probability for a fatal mistake in all the moves, that probability would be approximately linear over such a small Elo range, and you could simply take the average Elo. So the engine being modulated would lose about 5 Elo.

If an engine has to play the entire game with 15 sec instead of 20, it simply is 40 Elo weaker. But the Elo describes the effect on the average score. On the variance such things would hardly have any effect. Unless you would have one engine use a different time during the entire mini-match. Then the variance could go up (and the game results within one mini-match get correlated, through all using the same time control, which could be different in other mini-matches).
I mean that in each game of the match the time fluctuates randomly from move to move, as mentioned. How does that affect the variance, or the occurrence of the "gang of x" events you mentioned?
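As an aside, hgm's Elo arithmetic above is easy to reproduce from the rule of thumb he cites, roughly 70 Elo per doubling of thinking time (a sketch of that rule only, not a measurement of any particular engine):

# hgm's rule of thumb: ~70 Elo per doubling of thinking time.
import math

def elo_vs_reference(t, t_ref=15.0, elo_per_doubling=70.0):
    return elo_per_doubling * math.log2(t / t_ref)

plus  = elo_vs_reference(20.0)   # ~ +29 Elo
minus = elo_vs_reference(10.0)   # ~ -41 Elo
print(f"20s vs 15s: {plus:+.0f} Elo, 10s vs 15s: {minus:+.0f} Elo, "
      f"naive average: {(plus + minus) / 2:+.1f} Elo")

The naive average lands near the 5-Elo loss hgm estimates; whether averaging Elo over the 10-20 second range is legitimate is exactly the caveat he raises.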
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

mhull wrote:
mhull wrote:
hgm wrote: A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event.
Could the apparent correlation in the data be related to favorable/unfavorable memory and cache issues mentioned by Bob at program startup? Could this introduce a perturbed periodicity to the quality of play by one engine or another that undermines consistency of play?
Or maybe the cluster is not entirely "uniform" with some nodes containing inconsistently performing hardware.
That I know is not happening. We have a sanity test we run from time to time to test everything: each processor's speed, network speed (the cluster has both gigabit ethernet and infiniband cards in each node). The sanity check tests everything, and is run whenever we have to repair anything...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:Well, this is what I proposed as well, but Bob denies that it can happen.

Problem is that it is not really apparent anymore how the data looks at all, as we have only Bob's vague statements that "large" deviations occur "often". And as he obviously is not able to see or appreciate the difference between a 2-sigma and a 4-sigma deviation, and seems to consider something as "typical" when he has seen it before while sifting through 8 million samples, it becomes kind of hard to know what such statements mean.

So I now tend to dismiss all these claims as completely meaningless, and just go on the data that has actually been posted here. And that data does not contain any hint that the distribution and variance of the mini-match results is different from what one would expect for 80 independent games (i.e. normally distributed, with a standard deviation of 7 to 8). All except the infamous gang of four, of course, the origin of which is unclear, but which conveniently and magically happened to "be around".

So I don't think it would be worth speculating on anything before we have seen a histogram indicating the observed occurrence frequency of the result of a few thousands of mini-matches. Otherwise we likely will only be chasing ghosts.

The best bet currently seems that this whole business of "large" variance is just a red herring, caused by lack of understanding of statistical matters on Bob's part.
Entirely correct. I made up the results completely, because I had nothing better to do with my time. Crafty is already far stronger than any other program around, so there is no way I can meaningfully improve its play until the rest of the world catches up in 10-20 years.

And that concludes this week's episode of "The Twilight Zone"...
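Sarcasm aside, the histogram hgm asks for is straightforward to produce by simulation. A minimal sketch, assuming two exactly equal engines, independent games, and a draw rate of about 35% (both assumptions, not measurements from the cluster):

# Sketch: distribution of the net score (wins minus losses) of an 80-game
# mini-match between equal opponents, assuming independent games and ~35% draws.
import random
from collections import Counter

def mini_match(games=80, draw_rate=0.35):
    net = 0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            continue                      # draw: no change to net score
        net += 1 if r < draw_rate + (1 - draw_rate) / 2 else -1
    return net

random.seed(7)
results = [mini_match() for _ in range(5000)]
mean = sum(results) / len(results)
sd = (sum((x - mean) ** 2 for x in results) / len(results)) ** 0.5
print(f"mean = {mean:+.2f}, sd = {sd:.2f}")
print(sorted(Counter(results).items()))

With those assumptions the standard deviation of the net score comes out near 7, in line with the 7-to-8 figure quoted above, and net scores beyond roughly +/-18 are then one-in-a-hundred territory.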
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Gerd Isenberg wrote:
hgm wrote:
Gerd Isenberg wrote:Let say we randomly change the thinking time of one nominee from 10-20 seconds per move while the second one has always 15 seconds (or number of projected nodes)? How does that impact the expected variance (assuming balanced starting positions)?

Thanks,
Gerd
Do you mean on a per-move basis? Or on a per-game or per-match basis? Usually engines gain 70 Elo points for doubling the time. That means that at 20 sec per move they are 29 Elo stronger than at 15, and at 10 sec/move they are 40 Elo weaker. What exactly the effect will be if you alternate (or randomly mix) +29 Elo moves with -40 Elo moves, is not completely clear. If the result of a game is decided by adding a small probability for a fatal mistake in all the moves, that probability would approximately be linear over such a small Elo range, and you could simply take the average Elo. So the engine being modulated would lose about 5 Elo.

If an engine has to play the entire game with 15 sec in stead of 20, it simply is 40 Elo weaker. But the Elo describes the effect on the average score. On the variance such things would hardly have any effect. Unless you would have one engine use a different time during the entire mini-match. Then the variance could go up (and the game results within one mini-match get correlated, through all using the same time control, which could be different in other mini-matches).
I mean in each game of the match the time from move to move fluctuates randomly as mentioned. How does it affect the variance or the occurrence of your mentioned "gang x" events?
The timing granularity on the PC is roughly 18.667 ms. That is the interval at which the real-time clock ticks, which means that measuring a quantity of time smaller than that is impossible. Other platforms running unix tick at 10 ms (1/100th of a second). I don't think anyone would want to run the timer at 1 ms, as that would mean 1000 interrupts per second on top of everything else going on.
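To make the tick argument concrete, here is a toy sketch of how a coarse clock quantizes measured times (the 18.667 ms figure is simply taken from the paragraph above; the "true" times are invented):

# Toy demo: a clock that only advances in fixed ticks quantizes every measurement.
TICK = 0.018667                      # seconds per tick, figure quoted above

def measured(true_seconds, tick=TICK):
    # the clock can only report whole ticks that have elapsed
    return int(true_seconds / tick) * tick

for t in (0.005, 0.020, 0.037, 0.100):
    m = measured(t)
    print(f"true {t*1000:6.1f} ms -> measured {m*1000:6.1f} ms (error {abs(t-m)*1000:.1f} ms)")

Any individual measurement can be off by up to a full tick, which is where the 18+ ms uncertainty comes from.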
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: No. Probability is probability here. A royal flush requires a 10-J-Q-K-A of the same suit. If you get one, you will see your 7 cards (I assume you don't fold if you have one after the flop, for example, and after you have one you are at least going to play all-in rather than folding).
Why do you assume away the only thing I am talking about? Could you please try to also credit me with not just talking garbage all the time?

If I have 10-J I could easily fold before the flop. Even if the ace plus a rainbow then flops, I have many reasons to fold, and if I then get the K and Q after I folded, it would still have been the correct decision against e.g. AA.

Note that I also didn't mention whether it was limit or no-limit.
The probability of your _getting_ a royal flush is different from the probability that you actually _see_ the royal flush. You originally said "get". The math there is fixed. However, even with a 10-J there's a decent chance you might stick around for the flop: two high cards, same suit. It would depend on multiple factors. But at that instant you do have a fixed probability of getting a royal, and it is far higher than usual because you already have two of the needed cards. You are 2/5 of the way there already... None of that changes the probability of getting the thing in the first place, however, computed from the start of the hand.
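For reference, the raw combinatorics are easy to check (a sketch only; it deliberately ignores every folding decision, which is exactly the point nczempin is making):

# Raw odds of making a royal flush with 7 cards (hold'em: 2 hole + 5 board),
# ignoring any folding decisions along the way.
from math import comb

p_any = 4 * comb(47, 2) / comb(52, 7)          # any of the 4 royal flushes
# given two suited royal cards (e.g. Ts Js) already in hand, the other three
# specific cards must appear among the 5 board cards drawn from 50 unknowns
p_given_two = comb(47, 2) / comb(50, 5)

print(f"cold:           1 in {1/p_any:,.0f}")
print(f"holding 2 of 5: 1 in {1/p_given_two:,.0f}")

Holding two suited royal cards improves the raw odds by a factor of about fifteen; whether the hand is ever played to the river is the separate question.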
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Unfortunately, that is where theory and reality sometimes don't agree. I'll be happy to run several 100-game matches against a program that I finish almost dead even with after 5000 games, and let you examine each match against the others to see how many times you get R +/- E (rating +/- error) ranges that do not "overlap". So you take match one and think your rating is 2400 +/- 25. The next match you get 2550 +/- 25. And that happens. Repeatedly. Which just shows that the "standard error" is not telling us that much; it is only about that one sample. I want as accurate an estimate of my rating as possible, 2543 +/- 5 for example. And I want to get that estimate on multiple matches to convince me it really does mean what it says.

But when I see two distinct ratings where the standard error isn't enough to make them overlap, then I can't reliably use one of those samples to keep/reject a change. Yet that is exactly what I see posted here over and over in various results. Very small standard error suggests that the actual rating produced is very accurate. It isn't. It is only very accurate for that specific sample. Unfortunately I know that the next sample can be significantly different. So either I produce enough samples to be able to take them and combine into one rating with a standard error, or I can't use any of them to determine anything at all.

I have re-started my test match to test the current version. Here are the first 5 80-game matches:

1: +----+-=-+-=--=+--=+-+==---+++=++-++=+++-==+==++-+-+-+=--++==-+-=+++-+=-+=+++--- (4)
2: +-=+=+-==+--+-++=-=+--++---=+-=---++=----=+==-=+++=+=++---+==-=+++++=+-=++=+=-+- (4)
3: ++=--+-+--=---====+--=-=--=++-+=--++==-=-++-+-++---+-+=---=+==+-+++---=--++++-=+ (-7)
4: +----+--+=-+--==-=+++--+---++-+=+-=+==+--=+-=+=+-+-+-++=+---+-+--++==++---++=-=- (-4)
5: ++==--==--=---====++=-++--=+=++=+-++=-+=-=++++=+-=-+-+--=+-++++=+++-=+++--=++--+ (10)

Do your statistical wizardry on each and tell me if, after any one of those, I could tell whether my new version is better or worse, knowing that a 5K match almost always ends up at exactly 0, with a +/- 1 variability between any two 5K matches (5,160-game matches, actually).
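For anyone who wants to run those numbers, here is a minimal sketch that scores one of the result strings above and attaches a plain two-sigma error bar (nothing specific to elostats; just the sample standard deviation of the per-game scores, and the usual logistic Elo conversion for the point estimate):

# Score a result string of '+', '-', '=' characters and attach a 2-sigma error bar.
import math

def summarize(result_string):
    scores = {'+': 1.0, '=': 0.5, '-': 0.0}
    games = [scores[c] for c in result_string if c in scores]
    n = len(games)
    mean = sum(games) / n
    sd = math.sqrt(sum((g - mean) ** 2 for g in games) / (n - 1))
    se = sd / math.sqrt(n)
    elo = -400.0 * math.log10(1.0 / mean - 1.0)   # point estimate only
    return n, mean, 2 * se, elo

# usage: paste any of the five result lines above as the argument, e.g.
# n, score, err, elo = summarize("+----+-=-+...")

Applied to the five matches listed, the point estimates already scatter from well below to well above an even score, which is the problem with deciding anything from a single 80-game sample.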

That's why elostats is not very useful. It is just about _that_ one sample. Two different samples, produced by the same identical two opponents, can produce two completely different rating "ranges" (and these two ranges can be _significantly_ disjoint).

The bottom line is that it requires far more than this tiny number of games to approach "the truth", which is what I have been saying all along, but it keeps getting lost in the discussion.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: An objective test process for the rest of us?

Post by Gerd Isenberg »

bob wrote:
Gerd Isenberg wrote:Let's say we randomly vary the thinking time of one engine between 10 and 20 seconds per move while the second one always gets 15 seconds (or a fixed number of projected nodes). How does that impact the expected variance (assuming balanced starting positions)?
The timing granularity on the PC is roughly 18.667 ms. That is the interval at which the real-time clock ticks, which means that measuring a quantity of time smaller than that is impossible. Other platforms running unix tick at 10 ms (1/100th of a second). I don't think anyone would want to run the timer at 1 ms, as that would mean 1000 interrupts per second on top of everything else going on.
I did not mean the implicit timing fluctuations inherent in the architecture, but an explicit random timing fluctuation by one (or even both) opponent(s), say target time = 10 + rnd(10). How would that influence the match-result variance or sigma, and the likelihood of such one-in-a-million events?

I mean, considering the whole search space of the game, one in a million isn't that huge. And when the additional randomness randomly widens the search space over all games to that extreme, chaotic and non-deterministic things may happen.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Gerd Isenberg wrote:
bob wrote:
Gerd Isenberg wrote:Let's say we randomly vary the thinking time of one engine between 10 and 20 seconds per move while the second one always gets 15 seconds (or a fixed number of projected nodes). How does that impact the expected variance (assuming balanced starting positions)?
The timing granularity on the PC is roughly 18.667 ms. That is the interval at which the real-time clock ticks, which means that measuring a quantity of time smaller than that is impossible. Other platforms running unix tick at 10 ms (1/100th of a second). I don't think anyone would want to run the timer at 1 ms, as that would mean 1000 interrupts per second on top of everything else going on.
I did not mean the implicit timing fluctuations inherent in the architecture, but an explicit random timing fluctuation by one (or even both) opponent(s), say target time = 10 + rnd(10). How would that influence the match-result variance or sigma, and the likelihood of such one-in-a-million events?

I mean, considering the whole search space of the game, one in a million isn't that huge. And when the additional randomness randomly widens the search space over all games to that extreme, chaotic and non-deterministic things may happen.
I understood that. I was pointing out that we can expect to be off by 18+ ms over any interval, solely due to the PC's real-time clock granularity.

Larger variances could occur if a random process fires up for a few seconds here and there (not on our cluster, however, at least not from system daemons, since we run bare-bones kernels on the individual nodes). However, users can run processes on any node, although it is _very_ rare that they do, because they know our scheduling policy is to go through qsub rather than logging in and running directly. But I even watch for this during every single game: Crafty always measures its own NPS, and in cluster runs it watches for variance over 20% between any two consecutive moves (the NPS can vary by 2x-3x over the course of the game as pieces come off, and I don't want false errors from those). I do, occasionally, get an error indication and abort the run. We talk to the user and fix the problem.
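The check described here is simple to express; a sketch of the idea only (this is not Crafty's actual code, and the 20% threshold is just the figure quoted above):

# Sketch: flag a run when NPS shifts by more than 20% between consecutive moves,
# which would suggest something else was competing for the node's CPU.
def nps_looks_sane(nps_per_move, max_change=0.20):
    for prev, cur in zip(nps_per_move, nps_per_move[1:]):
        if abs(cur - prev) > max_change * prev:
            return False                 # sudden shift: abort and investigate
    return True

print(nps_looks_sane([1.5e6, 1.6e6, 1.7e6, 1.6e6]))   # True: normal drift
print(nps_looks_sane([1.7e6, 1.7e6, 1.1e6]))          # False: >20% drop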

Otherwise, I am not seeing any timing variance to speak of. Some examples from a current game on the cluster:
time=2.22 mat=1 n=3268843 fh=92% nps=1.5M
time=2.24 mat=1 n=3418773 fh=92% nps=1.5M
time=5.05 mat=1 n=8026048 fh=90% nps=1.6M
time=2.41 mat=1 n=4054462 fh=92% nps=1.7M
time=4.27 mat=1 n=6824153 fh=90% nps=1.6M
time=1.95 mat=1 n=3153268 fh=91% nps=1.6M
time=4.09 mat=0 n=6885048 fh=92% nps=1.7M
time=1.85 mat=0 n=3122538 fh=90% nps=1.7M
time=1.83 mat=1 n=3055358 fh=91% nps=1.7M
time=2.64 mat=1 n=4252789 fh=90% nps=1.6M
time=2.96 mat=1 n=4782047 fh=90% nps=1.6M
time=2.89 mat=1 n=4699446 fh=91% nps=1.6M
time=2.63 mat=1 n=4452369 fh=91% nps=1.7M
time=7.76 mat=1 n=14672882 fh=92% nps=1.9M
time=1.39 mat=1 n=2825622 fh=97% nps=2.0M
time=3.16 mat=1 n=6495702 fh=93% nps=2.1M
time=1.84 mat=1 n=3704907 fh=92% nps=2.0M
time=2.59 mat=1 n=5230985 fh=92% nps=2.0M
time=6.51 mat=1 n=14725619 fh=95% nps=2.3M
time=1.48 mat=2 n=3374470 fh=95% nps=2.3M
time=1.09 mat=2 n=2373068 fh=89% nps=2.2M
time=4.69 mat=5 n=10529418 fh=92% nps=2.2M


This is a 2+1 time-control match that is in progress. NPS climbs as pieces (particularly bishops) come off (bishops, because we do some sliding-mobility evaluation that is somewhat expensive). But we don't see wildly varying NPS values, or it would be worthless to test...

Note that the time per move varies a lot (no pondering, no book, no endgame tables, no SMP search) because of Crafty's time-allocation algorithm, easy-move logic, and extending on fail-low results, but that the NPS remains pretty constant, as expected.