An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

H.G. Muller initially saw data from only 4 matches, and that is the data we are talking about.

That data was a one-in-a-million event, assuming his calculation is correct (I did not calculate the exact number, but I noticed that it is something rare).

Considering that this data was not a worst case that you deliberately picked, it is logical to suspect that something you are not aware of is causing dependency in the results. In that case the next step should be a statistical analysis of all the results, to see whether results that are supposed to be independent really are independent.

I suggest that you simply calculate the variance of the result for every one of the 80 games, and also calculate the variance of the match results.

If you see that the variance of the match results is significantly bigger than the sum of the per-game variances, that suggests there is a problem in the conditions under which you test.

I think it is better to do that statistical analysis every time you play many matches. It may be that one set of matches shows no problem while a later one does, because something in the test conditions has changed, and it is better to know as soon as possible whether there is reason to suspect a problem.
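Something along these lines would do the comparison (a minimal C sketch; the number of mini-matches and the step that fills the results from your game logs are left as assumptions):

Code: Select all

/* If games were truly independent, the variance of whole-match totals
   should be close to the sum of the per-game variances; a much larger
   match variance hints at correlated test conditions. */
#include <stdio.h>

#define MATCHES 64   /* number of repeated 80-game mini-matches (assumed) */
#define GAMES   80   /* games per mini-match                              */

int main(void) {
    /* result[m][g] = score of game g in match m: 0.0, 0.5 or 1.0 */
    static double result[MATCHES][GAMES];
    /* ... fill result[][] from your game logs here ... */

    double sum_game_var = 0.0;
    for (int g = 0; g < GAMES; g++) {
        double mean = 0.0, var = 0.0;
        for (int m = 0; m < MATCHES; m++) mean += result[m][g];
        mean /= MATCHES;
        for (int m = 0; m < MATCHES; m++) {
            double d = result[m][g] - mean;
            var += d * d;
        }
        sum_game_var += var / (MATCHES - 1);   /* sample variance of game g */
    }

    double total[MATCHES], match_mean = 0.0, match_var = 0.0;
    for (int m = 0; m < MATCHES; m++) {
        total[m] = 0.0;
        for (int g = 0; g < GAMES; g++) total[m] += result[m][g];
        match_mean += total[m];
    }
    match_mean /= MATCHES;
    for (int m = 0; m < MATCHES; m++)
        match_var += (total[m] - match_mean) * (total[m] - match_mean);
    match_var /= (MATCHES - 1);

    printf("sum of per-game variances: %.2f\n", sum_game_var);
    printf("variance of match totals : %.2f\n", match_var);
    return 0;
}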

Uri
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote: What "once in a million" event have you seen?
A deviation from average of 29 plus a deviation of 16 from average, in 4 samples supposedly randomly chosen from your total data set. These are deviations of 4*sigma and 2.2*sigma, corresponding to probabilities of 3e-5 and 1e-2, respectively. Together those probabilities bring you down to the once-in-a-million level for both occurring in 4 traces that are supposed to be independent:

Code: Select all

When 4 sequential 80 game runs produce these kinds of results:

1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-=-+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)

It is hard to draw any conclusions about which represents the real world.
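For reference, a rough sketch of that arithmetic (a back-of-the-envelope only: it assumes independent matches, a one-sided normal approximation, and the 4 and 2.2 sigma figures quoted above):

Code: Select all

#include <stdio.h>
#include <math.h>

/* one-sided normal tail probability P(Z > z) */
static double tail(double z) { return 0.5 * erfc(z / sqrt(2.0)); }

int main(void) {
    double p_big   = tail(4.0);  /* ~3e-5: the deviation of 29 (about 4 sigma)   */
    double p_small = tail(2.2);  /* ~1e-2: the deviation of 16 (about 2.2 sigma) */
    /* rough upper bound: any of the 4 traces could carry the big outlier,
       and any of the remaining 3 the smaller one */
    double joint = 4.0 * p_big * 3.0 * p_small;
    printf("P(4 sigma) = %.1e, P(2.2 sigma) = %.1e, joint = %.1e\n",
           p_big, p_small, joint);
    return 0;
}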
Let's back up just a minute. I reported a while back that when playing 80-game matches, I discovered that it was impossible to use them to detect whether a change to my program produced an improvement or not. In fact, I posted that it was impossible to use these 80-game matches to predict whether A was better than B or vice-versa, within a pool of players that contained my program plus several well-known programs that are all quite strong.
This is because you require discriminating very small differences in scoring fraction with near 100% certainty. That requires a large number N of games, to bring the SD of your sampling noise down to a small fraction of that difference, and sampling noise only decreases as 1/sqrt(N).
You seem to be worrying about the fact that four consecutive matches have a wild variance when the next 10 do not. I don't care about that at the moment. I care about the wild variance _period_ because when I run a test, I can get any one of those random match samples as the result, and that doesn't help me in any way.
Not at all. I am worrying about a large fraction of the matches being shown having extreme deviations from the average. The order in which these deviations occur is irrelevant, as the traces are supposed to be independent; thus any deviation should be equally likely to occur in any position. (Of course, if you observe that they are not, it points to the traces not being as independent as you think. But to observe that would need orders of magnitude more data than has been shown here.)

The only "worry" is the magnitude of some of the deviations shown.
Statistically I don't care about how much these samples do or do not vary. I will analyze this at some point. But for now, I want a test that says "better" or "worse". And the variability I am seeing in 80-game matches makes that fail. I then went to 4-match sets and discovered that was not enough. I explained this in the past during these discussions as well. I took a pretty simple change that I knew made a difference (I changed the null-move reduction depth, I eliminated an important search extension, etc.) and running 1-match samples randomly told me "better" or "worse" for any of those cases. 320-game matches did exactly the same thing. Even though I was not talking about very small changes, I still was unable to determine whether the change was good or bad.
A 100-Elo difference should cause on average a +23 score in an 80-game match. The SD of the difference between two such matches is sqrt(2)*sqrt(80)*sqrt(1-DrawFraction) = ~10. So you are talking about a 2.3*sigma threshold here. That will be crossed in about 1% of the cases.
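For anyone who wants to check those numbers, here is a minimal sketch (the draw fraction is an assumed value, not taken from Bob's data):

Code: Select all

#include <stdio.h>
#include <math.h>

int main(void) {
    double elo = 100.0, games = 80.0;
    double draw_fraction = 0.36;   /* assumed value */

    /* expected score fraction for a 100-Elo advantage */
    double p = 1.0 / (1.0 + pow(10.0, -elo / 400.0));
    /* expected (wins - losses) surplus over one 80-game match */
    double surplus = games * (2.0 * p - 1.0);                        /* ~22 */
    /* SD of the (wins - losses) difference between two such matches */
    double sd = sqrt(2.0) * sqrt(games) * sqrt(1.0 - draw_fraction); /* ~10 */
    double z  = surplus / sd;                                        /* ~2.2 */
    double p_wrong = 0.5 * erfc(z / sqrt(2.0));                      /* ~1%  */

    printf("surplus %.1f, sd %.1f, z %.2f, P(wrong verdict) %.3f\n",
           surplus, sd, z, p_wrong);
    return 0;
}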

If, with a single 80-game match, you would "randomly" designate one or the other as better, I presume you mean that the bad one is designated better in close to 50% of the cases, rather than 1%. If that were so, there would be something horribly wrong with your measurement setup, as even random coin flips could not do that. So I suppose you mean something else. But what? Be more exact.
That was what I reported on initially, that the results were so random it was surprising. "How" random was (and still is) not so important, because I just wanted to know: "OK, the programs all have a strong non-deterministic aspect to their play, so how many games should I play to be able to accurately measure this effect?"
Like I said before: this would not be a sign of randomness. On the contrary, randomness would make the wrong designation as rare as it would be for flips of (appropriately biased) coins, i.e. only 1% of the cases. If it happens more often, you have the opposite of randomness: correlation.
That is a hard question to answer. The more significant the change, in theory I would need fewer games. But I made some significant changes and found that the number of games needed to determine the effect with high accuracy was much larger than I had ever heard anyone mention in the past.
Never mind what others mentioned in the past. Calculate how many games you need. Everyone could have been wrong, but math never lies.
I first decided to try to play enough games to drive this non-deterministic result influence down to zero, if possible. It wasn't possible, but I drove it down to a very low level by playing 64 matches against an opponent. And I discovered that I could determine whether a change had any significant effect on performance.
Well, playing multiple matches isn't a miracle cure. You run the risk that, with the engines under test, the starting positions are biased, and they could be biased differently after reversing colors. In that case the result of the matches would be determined more by the choice of initial positions than by the variability of the engines, and the averaging would not weed out that effect. It is safer to take more starting positions. This problem gets worse the more deterministic the engines are, so with engines as variable as yours you might get away with it.
Somehow we get neck deep in "is that atypical?" and such, which at the time I was not interested in analyzing. I could _see_ the randomness, and I realized that I could not use that kind of random observation to make any sort of meaningful decision.

So, to assist us in _our_ problem of trying to evaluate changes that I make, or that Tracy or Mike make, I started running longer and longer tests, and I tried longer time controls to see if that would help (it didn't). I arrived at what we do today, which is reasonably accurate.

I do plan, when I have time, to let my stat friend loose to determine how many games we need to be playing to get reasonable results, and see if that number is smaller than what I am using (which I doubt, based on simple experimental tests we have run, and we have run a _bunch_ of them already, as I have mentioned).
Yes, you should certainly do that. And you should also let him have a look at the overall distribution of mini-match results.
"once in a million" is not something I see often since I have only run a paltry 100,000 80 game matches so far. And nothing I have given here represents any such rare event, taken in context. In a 32 match test, whether the first 4 are wildly random, or whether the wildly random matches are distributed uniformly makes no difference _to me_. The fact that they exist at all is enough to influence how I test. You want to latch onto a tiny sample and say "that is way too rare to possibly happen." You are wrong. It _did_ happen as given. And I continue to get significant randomness. I don't care whether I should not get 4 wild results in a row. The fact that I get 4 wild results at all is enough to show that I can't reliably depend on one single match to predict anything. And the more such wild results I get, the more forceful that becomes.
All good and well, but you did not post millions of match results here. So the probability that we get to see a once-in-a-million event should be quite low, no matter how much you get to see. Unless you would pick them out for us, of course...

Nothing is "too rare to ever happen". The question is, "how often does it happen"?. We cannot take for granted that the things you post have the same frequencies as which they occur in your data. Of course you did some selection before posting that data. Do ou really expect us to believe that you felt like posting something in CCC, made a totally random selection out of your data, and then started thinking what point you could make with that? Of course not! You wanted to write that sentence: "When 4 sequential 80 game runs produce these kinds of results it is hard to draw any conclusions about which represents the real world", and then you had to be careful not to take 4 mini-matches that had scores of 2, 3, 1, 2, because then that sentence would sound rather silly. So you had to select 4 traces that indeed showed a large deviation, as your other data showed that there are also plenty (and if fact far more) sequences of 4 mini-matches that score nearly identical, within 2 or 3 points. So you _selected" the ganng of four. There is nothing wrong with that, as it is understood. But for us, to understand what that data means, it is f the utmost relevance to know from _how many_ traces you selected those.
So, in summary, can we stop trying to take individual runs apart, and focus on the big picture? I will post a long statistical analysis when we get the time to do this at some point in the future. Right now, all I care about is trying to determine whether a change is good or not. Nothing more, nothing less.

If you believe the samples I posted are a one-in-a-million event, then I just flipped heads 20 times in a row. It happens.
Yes, it happens. And then uncritically publishing that data is _bad science_. Have you ever heard of the procedure of throwing out points that deviate by more than 4*sigma, to prevent them from corrupting your average?
Again, I could not tell you the circumstances where that original 4-game sample was produced. I did explain exactly how that last 10-12 game sample was produced. And I have posted others with less (but still highly significant) variance, as those were what I had at the time. There's nothing to be gained by trying to make this data up. I'm just showing a "work in progress" result here and there as the experimental testing progresses.
Again a (statistically) completely unsound statement. _None_ of that other data contained significant variance (let alone highly significant), as I clearly demonstrated by calculation.
A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event. You seem to think "oh well, it is not even twice as large as 2.5*sigma, so if 2.5*sigma occurs frequently, 4.5*sigma cannot be considered unusual". Well, that is as wrong and as naive as thinking that a billion is only 50% more than a million because it has 9 zeros instead of 6. Logic at the level of "I have seen two birds, so anything flies".
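A quick check of how steeply the normal tail falls between those two thresholds (one-sided normal tails, so the absolute numbers differ a bit from the rounded figures above; the point is the ratio):

Code: Select all

#include <stdio.h>
#include <math.h>

static double tail(double z) { return 0.5 * erfc(z / sqrt(2.0)); }

int main(void) {
    double p25 = tail(2.5), p45 = tail(4.5);
    printf("2.5 sigma: %.1e   4.5 sigma: %.1e   ratio: %.0f\n",
           p25, p45, p25 / p45);   /* the ratio is ~1800, not ~1.8 */
    return 0;
}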
I don't believe any such thing. But I do believe this:

Whether it is a 4.5*sigma or a 1.0*sigma event, if sigma is big enough
(and it is here) it makes the results unusable. I've not said anything
else since starting this discussion.
But sigma _cannot_ be big enough for independent games. This is what fails to sink in with you, time after time. 4.5*sigma deviations are of no importance, as they occur only once in a million times, and you are not going to test that many changes. 1.0*sigma events are not important because, although they occur often, sigma is limited.

How large do you think that sigma can be, for a single mini-match?
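For reference, a hard upper bound on that sigma (a small sketch, using the same wins-minus-losses scoring as in the traces above):

Code: Select all

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Worst case: every game decisive and a pure coin flip, so the
       per-game variance of (win - loss) is 1; draws only lower it. */
    double games = 80.0;
    printf("upper bound on sigma for one 80-game mini-match: %.1f\n",
           sqrt(games));   /* ~8.9 points of (wins - losses) */
    return 0;
}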
Funny. It was a topic _I_ started. :) So who is "mingling"???
Yes, very funny. But guess who is the clown? This thread was started by Nicolai. What do you think he means by "rest of us"?
The matter of randomness in move choice was already discussed ad nauseam in another thread, and not really of interest to anyone, as it is fully understood, and most of us do not seem to suffer from this as much as you do, if at all.
Wow. Talk about inaccurate statements. What about the set of programs I have used? On my cluster, I use fruit, glaurung 1/2, arasan 9/10, gnuchess 4 and 5. And a couple of others I won't mention since the authors asked me to play games and send them back. I use shredder and junior manually and they do the same.

So somehow "most of us" represents a couple of samples, and you complain that I "jump to conclusions"???
All engines with 2400+ ratings. A small minority. There are _hundreds_ of engines with ratings below 2000. Advanced engines tend to have advanced time management (which is very susceptible to this effect), but weak engines tend to have simplistic time management, which hardly is. I have tested scores of engines in the 1600-1800 range, and the large majority were very deterministic. (Which is why I needed so many to play a meaningful number of games.) Only if they came with a book (which is also not so common in that range) was there any variation at all, through the choice of book line.
complete iteration searches is a primitive idea that almost everyone stopped using in the late 70's and early 80's. And the minute that goes away, randomness pops up, and there is nothing that can be done about it if you use any sort of time limit to control the search, because time intrinsically varies on computers.
Like I said, most engines are rather primitive. There are even some that play at fixed depth. But that is not worth arguing about. What is more shocking is that you admit yourself that you don't know how big a hit one would take in terms of Elo by using such "primitive" and "inferior" time management. What good is all your precision-measurement Elo microscope if you don't even know fundamental things like that?
Where did I say it was a "goof"? I reported _exactly_ what it was: 4 consecutive results obtained in a much larger test. I claimed it was the first _four_ test results in that series of matches. I didn't claim anything else. The question about "how random" came up, and I took what I had. Again, if I flipped heads 20 times in a row to get that sample, so be it: I flipped heads 20 times in a row. Whether you take that first group of 4, or the first group of 4 from the last data I posted, you get the same result. There is enough variability that it takes a _lot_ of games to smooth it out. All the other arguments, tangents, etc. don't change that one iota.
I consider it a goof to claim "look, when I flip a fair coin, this happens", and then show a row of 20 heads that happened to be lying around, without knowing exactly why it was lying there. As you can be expected to know, this is exactly what almost _never_ happens when you do 20 coin flips, although, no doubt, it has happened some time to someone. And you know that publishing such a fluke will be very misleading as to the properties of a fair coin. Serious scientists delete 4-sigma deviations from their data, or at least take the trouble of confirming the measurement, especially if they cannot recall the circumstances.

And if you don't, you should expect criticism.
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Gerd Isenberg wrote:I mean in each game of the match the time from move to move fluctuates randomly as mentioned. How does it affect the variance or the occurrence of your mentioned "gang x" events?
That depends on the engines in question. In engines that play practically deterministically, so that such events would not occur at all with constant move timing, varying the move timing, and varying it differently and independently in each game, could induce enough variability that in the end the events could occur once in a million times.

If the engines already had enough intrinsic variability to cause such events to occur once in a million times, they would still occur once in a million times. Randomly changing elements of a sequence that is already random is basically pointless; it remains a random sequence.
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

Note that Movei also completes the iteration in almost all cases in ponder-off games (in ponder-on games it may move immediately in some cases, but otherwise it will also try to finish the iteration).

It seems that Bob is basing his opinion on the past:

He says:
"complete iteration searches is a primitive idea that almost everyone stopped using in the late 70's and early 80's."

I can say that most programmers never programmed chess engines in the late 70's or early 80's and many do not care about the small improvement that they may get by not finishing the iteration.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
Gerd Isenberg wrote:I mean in each game of the match the time from move to move fluctuates randomly as mentioned. How does it affect the variance or the occurrence of your mentioned "gang x" events?
That depends on the engines in question. In engines that play practically deterministically, so that such events would not occur at all with constant move timing, varying the move timing, and varying it differently and independently in each game, could induce enough variability that in the end the events could occur once in a million times.

If the engines already had enough intrinsic variability to cause such events to occur once in a million times, they would still occur once in a million times. Randomly changing elements of a sequence that is already random is basically pointless; it remains a random sequence.
I don't agree. These are not independent events. If you use more time on move N, you have less time for the next N moves, and that will induce more variability. Time-usage variability can cause other side effects as well. You can change history values or hash entries when you suddenly get to search one ply deeper than in the last run, because you used less time on some of the preceding moves. And that affects the size/shape of the tree being searched now, which can further influence your time usage. The result is a cascading sequence of events that leads to a bubble that bursts at some point and changes the game completely when you choose a different move...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Uri Blass wrote:Note that Movei also completes the iteration in almost all cases in ponder-off games (in ponder-on games it may move immediately in some cases, but otherwise it will also try to finish the iteration).

It seems that Bob is basing his opinion on the past:

He says:
"complete iteration searches is a primitive idea that almost everyone stopped using in the late 70's and early 80's."

I can say that most programmers never programmed chess engines in the late 70's or early 80's and many do not care about the small improvement that they may get by not finishing the iteration.

Uri
If you do an instant reply termination, you have the _same_ problem in ponder=on matches. So what exactly is your point??? It doesn't matter where the variability comes from. The fact that it is there at all is the main point here. And doing what you do will cause two types of variability. The first is time usage changes since you won't move until your opponent moves and how he uses time is beyond your control and probably varies as well. The second is whatever data your search alters while it runs (at least the transposition table) so that if your opponent takes longer on this move than the last time we played this game, your hash table has different stuff in it, history counters are changed. Killer moves are changed. Etc. And all of those add up to the potential for a different move or score from this run even if you do use the same amount of time or search to the same depth.

Again, we drift far afield where things are still as I said they are...
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

I can add that another reason for programs not being deterministic is that the search is affected by the results of previous searches, and Movei does not suffer from that problem.

I think that at the level of a 2700-2750 CCRL rating it is a mistake to spend time on things like that, because it will make debugging errors later harder, and it is probably better to try to find bigger improvements by other means.

Uri
Last edited by Uri Blass on Fri Sep 21, 2007 11:59 pm, edited 1 time in total.
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

bob wrote:
Uri Blass wrote:Note that Movei also completes the iteration in almost all cases in ponder-off games (in ponder-on games it may move immediately in some cases, but otherwise it will also try to finish the iteration).

It seems that Bob is basing his opinion on the past:

He says:
"complete iteration searches is a primitive idea that almost everyone stopped using in the late 70's and early 80's."

I can say that most programmers never programmed chess engines in the late 70's or early 80's and many do not care about the small improvement that they may get by not finishing the iteration.

Uri
If you do an instant reply termination, you have the _same_ problem in ponder=on matches. So what exactly is your point??? It doesn't matter where the variability comes from. The fact that it is there at all is the main point here. And doing what you do will cause two types of variability. The first is time usage changes since you won't move until your opponent moves and how he uses time is beyond your control and probably varies as well. The second is whatever data your search alters while it runs (at least the transposition table) so that if your opponent takes longer on this move than the last time we played this game, your hash table has different stuff in it, history counters are changed. Killer moves are changed. Etc. And all of those add up to the potential for a different move or score from this run even if you do use the same amount of time or search to the same depth.

Again, we drift far afield where things are still as I said they are...
I practically clear the hash after every search (or, to be more correct, I increase a generation number so the program does not use positions with a different generation number).

I also reset the killers at the beginning of every search. Today I use only knowledge and the number of the move in the move ordering for pruning, but if I used history I would reset the history table after every search.
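For illustration, a minimal sketch of such a generation-number scheme (the field and function names here are hypothetical, not Movei's actual code):

Code: Select all

/* Hypothetical names throughout; this only illustrates the idea of
   tagging hash entries with a search generation instead of clearing. */
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t key;         /* zobrist key of the position      */
    int16_t  score;
    uint8_t  depth;
    uint8_t  generation;  /* which search stored this entry   */
} HashEntry;

#define HASH_SIZE (1 << 20)
static HashEntry hash_table[HASH_SIZE];
static uint8_t   current_generation;
static int       killers[64][2];          /* [ply][slot] */

void new_search(void) {
    current_generation++;                 /* old entries become unusable */
    memset(killers, 0, sizeof killers);   /* killers reset every search  */
}

HashEntry *probe(uint64_t key) {
    HashEntry *e = &hash_table[key & (HASH_SIZE - 1)];
    if (e->key == key && e->generation == current_generation)
        return e;                         /* only trust this search's data */
    return NULL;
}

void store(uint64_t key, int score, int depth) {
    HashEntry *e = &hash_table[key & (HASH_SIZE - 1)];
    e->key        = key;
    e->score      = (int16_t)score;
    e->depth      = (uint8_t)depth;
    e->generation = current_generation;
}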

Different searches are clearly independent, regardless of whether the games are ponder-off or ponder-on.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:Note that Movei also completes the iteration in almost all cases in ponder-off games (in ponder-on games it may move immediately in some cases, but otherwise it will also try to finish the iteration).

It seems that Bob is basing his opinion on the past:

He says:
"complete iteration searches is a primitive idea that almost everyone stopped using in the late 70's and early 80's."

I can say that most programmers never programmed chess engines in the late 70's or early 80's and many do not care about the small improvement that they may get by not finishing the iteration.

Uri
If you do an instant reply termination, you have the _same_ problem in ponder=on matches. So what exactly is your point??? It doesn't matter where the variability comes from. The fact that it is there at all is the main point here. And doing what you do will cause two types of variability. The first is time usage changes since you won't move until your opponent moves and how he uses time is beyond your control and probably varies as well. The second is whatever data your search alters while it runs (at least the transposition table) so that if your opponent takes longer on this move than the last time we played this game, your hash table has different stuff in it, history counters are changed. Killer moves are changed. Etc. And all of those add up to the potential for a different move or score from this run even if you do use the same amount of time or search to the same depth.

Again, we drift far afield where things are still as I said they are...
I practically clear the hash after every search (or, to be more correct, I increase a generation number so the program does not use positions with a different generation number).

I also reset the killers at the beginning of every search. Today I use only knowledge and the number of the move in the move ordering for pruning, but if I used history I would reset the history table after every search.

Different searches are clearly independent, regardless of whether the games are ponder-off or ponder-on.

Uri
First, not using old hash entries is a horrible idea. If you don't trust the values, get rid of the table. I _never_ clear the entries, because I trust them for as long as they stay around, just as much as I trust my memory of positions in the Fried Liver Attack and would not do a "brain purge" and have to re-learn all of that stuff again.

Why would you eliminate _any_ data that can help your move ordering and save you time or give you better results in the same time frame?

Your last statement is 100% wrong. Two successive positions in a real game of chess are _not_ independent in any shape, form, or fashion. They are part of a "stream" of positions that are related by piece/pawn placement and the previous moves played in the game. If this were true as you stated it, we would not even worry about the 50-move rule or repetitions.

Further, if I complete a 16-ply search and start pondering, I don't start over; I start at depth=15, since I assume you will play the second move, leaving me 14 plies of stuff already completed, and I see no reason to re-search those trees again when I already have the result...

That kind of information loss I don't understand...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Uri Blass wrote:I can add that another reason for programs not being deterministic is that the search is affected by the results of previous searches, and Movei does not suffer from that problem.

I think that at the level of a 2700-2750 CCRL rating it is a mistake to spend time on things like that, because it will make debugging errors later harder, and it is probably better to try to find bigger improvements by other means.

Uri
So you want to use what we call the "goose effect"??? That is, every time your program "wakes up" it is a brand new day, no day before, no tomorrow, just search for now and give no thought to what will happen tomorrow?

I don't play chess like that. I don't think it is even remotely reasonable that a computer should play chess like that...