Observator bias or...

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

MartinBryant

Re: Observator bias or...

Post by MartinBryant » Sun Jun 03, 2007 9:07 am

For what it's worth, Alessandro, I too get the same issue when testing, and I believe it just demonstrates the huge number of games you need to detect small differences.

Consider the following graph showing the Elo progression over a 1000-game match between Colossus 2007c and 2007b...

[Image: Elo progression graph, not recovered]

Early on I too had the 'woo-hoo!' factor, only to be disappointed later when the advantage gradually collapsed to nothing.
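Martin's point about sample size can be made concrete. The sketch below is my own illustration, not code from this thread: it converts a match score to an Elo difference and puts approximate 95% error bars on it, assuming independent games and using the p(1-p) variance bound (which ignores draws and therefore slightly overstates the uncertainty).

```python
import math

def score_to_elo(p):
    """Convert a match score fraction (0 < p < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_error_bars(score, n_games, z=1.96):
    """Approximate 95% confidence interval (in Elo) for a match score.

    Assumes independent games; per-game variance is bounded by p*(1-p),
    which ignores draws, so the true interval is a bit narrower.
    """
    sigma = math.sqrt(score * (1.0 - score) / n_games)
    return score_to_elo(score - z * sigma), score_to_elo(score + z * sigma)

# A 55% score over 1000 games looks like roughly +35 Elo...
print(round(score_to_elo(0.55), 1))
# ...yet the 95% interval still spans some 40 Elo.
lo, hi = elo_error_bars(0.55, 1000)
print(round(lo, 1), round(hi, 1))
```

Even a full 1000-game match cannot distinguish a +15 Elo change from a +55 Elo change at 95% confidence, which is why early-match "woo-hoo" leads so often evaporate.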

Still, if you enjoy the programming, then keep plugging away at it; who knows when the next Rybka will appear! :)

Good luck,
Martin

User avatar
Eelco de Groot
Posts: 4150
Joined: Sun Mar 12, 2006 1:40 am
Location: Groningen

Re: Observator bias or...

Post by Eelco de Groot » Sun Jun 03, 2007 9:07 am

Uri Blass wrote:
I disagree with you.

It is easy to change program's style by only changing the evaluation without changing the playing strength significantly.

I know it from movei and I believe that the same is for rybka and Vas can
also release some versions of rybka who play different with no big difference in playing strength(less than 40 elo difference).

Uri
Hello Uri, I think that by "playing differently" Michael here meant playing less strongly than Hamsters is capable of, not playing with a different style. If a program plays uncharacteristically weakly at times, this limits its maximum Elo performance, no matter how strongly it plays at other times. Not only that, but the non-deterministic play, or at least what appears to be non-deterministic behaviour as long as its cause is not known, makes testing very difficult: too much noise in the results.

I had been thinking about testing with a fixed number of nodes. As I understand it, Robert Hyatt's objection is that playing with a fixed number of nodes does not have a real linear relation to the time used, because the number of nodes per time unit is never constant, not even after many games. Nodes per second (or per move, or per game) is even more variable than any shortcomings in the OS that prevent each program getting equal allotted time, or than the multitasking the OS does while a chess program is thinking. That defeats the purpose of the method: if you play with fixed nodes instead of fixed time per game (or per move) in order to avoid the timing problems, i.e. the problem of the two chess programs not getting equal time, you will never achieve it.

It would be different if, later on, all tournament and rating-list games for Movei would also use fixed nodes instead of fixed time; then nodes per second would no longer be a factor. But that is probably not going to be the case... It would become a totally different form of computer chess: the program with the largest hash would probably get more hash hits, which are not counted as nodes, and gain an advantage that way, while processor speed would become totally irrelevant... :wink:

Regards, Eelco

Uri Blass
Posts: 8558
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

Re: Observator bias or...

Post by Uri Blass » Sun Jun 03, 2007 9:39 am

Eelco de Groot wrote:
Uri Blass wrote:
I disagree with you.

It is easy to change program's style by only changing the evaluation without changing the playing strength significantly.

I know it from movei and I believe that the same is for rybka and Vas can
also release some versions of rybka who play different with no big difference in playing strength(less than 40 elo difference).

Uri
Hello Uri, I think that by "playing differently" Michael here meant playing less strongly than Hamsters is capable of, not playing with a different style. If a program plays uncharacteristically weakly at times, this limits its maximum Elo performance, no matter how strongly it plays at other times. Not only that, but the non-deterministic play, or at least what appears to be non-deterministic behaviour as long as its cause is not known, makes testing very difficult: too much noise in the results.

I had been thinking about testing with a fixed number of nodes. As I understand it, Robert Hyatt's objection is that playing with a fixed number of nodes does not have a real linear relation to the time used, because the number of nodes per time unit is never constant, not even after many games. Nodes per second (or per move, or per game) is even more variable than any shortcomings in the OS that prevent each program getting equal allotted time, or than the multitasking the OS does while a chess program is thinking. That defeats the purpose of the method: if you play with fixed nodes instead of fixed time per game (or per move) in order to avoid the timing problems, i.e. the problem of the two chess programs not getting equal time, you will never achieve it.

It would be different if, later on, all tournament and rating-list games for Movei would also use fixed nodes instead of fixed time; then nodes per second would no longer be a factor. But that is probably not going to be the case... It would become a totally different form of computer chess: the program with the largest hash would probably get more hash hits, which are not counted as nodes, and gain an advantage that way, while processor speed would become totally irrelevant... :wink:

Regards, Eelco
The testing with a fixed number of nodes is not done to find out whether the program is better than other programs, but only whether it is better than the previous version.

If the change that I make has no significant effect on the number of nodes per second, then hopefully the better program has better chances to win.

I never thought to compare Movei with other programs based on games with a fixed number of nodes; I only compared Movei with a previous version based on this test.

Uri

User avatar
hgm
Posts: 23631
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Observator bias or...

Post by hgm » Sun Jun 03, 2007 9:47 am

If the nr of nodes per second is not constant enough, you could simply count something else. E.g. number of moves generated.

You could include counters that count the number of times each section of code that contributes significantly to the execution time in an independent and varying way is executed (e.g. one for the evaluation, one for the move generator, one for each move scored and searched). The total time spent in the program would then be the sum of the (fixed and known) durations of those code sections, each multiplied by the number of times the section was executed.
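HGM's weighted-counter idea could be sketched roughly as follows. This is my own illustration with hypothetical section names and costs; in practice the fixed per-section costs would be measured once on the target machine.

```python
# Sketch of a deterministic "virtual clock": instead of wall-clock time,
# accumulate counts of the expensive code sections and weight each count
# by a fixed, pre-measured cost. Names and costs are hypothetical.

SECTION_COST_NS = {
    "evaluate": 800,      # cost of one static evaluation
    "gen_moves": 1200,    # cost of one move-generation call
    "score_move": 50,     # cost of scoring one move
    "search_move": 150,   # cost of making/searching one move
}

class VirtualClock:
    def __init__(self):
        self.counts = {name: 0 for name in SECTION_COST_NS}

    def tick(self, section):
        """Call at the top of each instrumented code section."""
        self.counts[section] += 1

    def elapsed_ns(self):
        """Deterministic 'time': sum over sections of count * fixed cost."""
        return sum(self.counts[s] * SECTION_COST_NS[s] for s in self.counts)

clock = VirtualClock()
for _ in range(1000):        # pretend the search evaluated 1000 positions
    clock.tick("evaluate")
for _ in range(3000):        # ...and scored 3000 moves
    clock.tick("score_move")
print(clock.elapsed_ns())    # identical on every run, unlike real time
```

Unlike raw node counts, this virtual time tracks the actual cost mix of the search, yet it is exactly reproducible because it never reads the real clock.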

Ron Murawski
Posts: 397
Joined: Sun Oct 29, 2006 3:38 am
Location: Schenectady, NY
Contact:

Re: Observator bias or...

Post by Ron Murawski » Sun Jun 03, 2007 5:50 pm

Michael Sherwin wrote:
Ron has also noticed this about Hamsters. I am surprised that he did not mention it to you. We have both labelled Hamsters as too volatile for reliable testing!

Mike
Yes, Michael is right. I tested Hamsters against my own engine and got wildly fluctuating results. In 100 games the success ratio could vary from 25% to 75%.

Hamsters is not alone here. There are many programs that I have labelled as "too volatile". (In fact *most* programs are!) But the volatility is vs my own engine for my own testing purposes. The volatility could just as easily be caused by opening book interactions. (I always play using my own books.)
Alessandro Scotti wrote: Ouch Michael... this screams "bug" all the way! :-( I must say that I haven't noticed _extremely_ wild fluctuations in my tests, but yes there is almost definitely something strange in my engine...

P.S. Version 0.2 is particularly buggy though!
The version of Hamsters that I tested was one of the early releases. I haven't played games vs Hamsters since then.

Ron

bob
Posts: 20478
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Observator bias or...

Post by bob » Sun Jun 03, 2007 8:34 pm

Uri Blass wrote:
bob wrote:
hgm wrote:Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable. But of course you cannot get rid of the randomness induced by SMP that way.

For this reason I still want to implement the tree comparison idea I proposed here lately. This would eliminate the randomness not by sampling enough games and relying on the (tediously slow) 1/sqrt(N) convergence, but by exhaustively generating all possible realizations of the game from a given initial position. If the versions under comparison are quite close (the case that is most difficult to test with conventional methods), the entire game tree might consist of fewer than 100 games, but might give you the accuracy of 10,000 games that are subject to chance effects.
A fixed number of nodes is absolutely worthless. To prove that to yourself, do the following. Play a match using the same starting position, where _both_ programs search a fixed number of nodes (say 20,000,000). Record the results. Then re-play, but have both search 20,010,000 nodes (10K more nodes than before). Now look at the results. They won't be anywhere near the same. Which one is more correct? Answer: that's hopeless, as you take a small random sample (the games with 20M nodes per side) from a much larger population of random results, and base your decisions on that. May as well flip a coin...

my upcoming ICGA paper will show just how horrible this is...
I disagree
Of course, if you do not have enough games the weaker program may win, but the advantage of a fixed number of nodes is that you do not need to worry about one program being slowed down by a significant factor, and you can do other things on the same computer at the same time without changing the result.

I use fixed number of nodes in tests.
I am not sure that every change is an improvement because of not having enough games but hopefully the total result after many changes is an improvement.

Uri
You can disagree all you want. But if you are comparing two _versions_ of the same program, meaning they are pretty close in strength, 200 games won't tell you a thing about which is better. When the ICGA paper is finished, I can show you hundreds of thousands of games based on 200 game matches, to show just how bad depending on 200 games really is.

the 20M vs 20.01M node test is a big hint...

User avatar
hgm
Posts: 23631
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Observator bias or...

Post by hgm » Mon Jun 04, 2007 9:52 am

I don't see how this observation can in any way be interpreted as a drawback of the fixed-number-of-nodes testing method. IMO you would get exactly the same variability if you compared testing at 20 sec per move vs 20.01 sec per move.

But if you repeat two tests of the same engines at 20M nodes/move, you will reproduce the result exactly. Testing the same version twice at 20 sec/move will give you different, and most likely completely independent, results.

If you do very small changes in the program, e.g. in the evaluation of an end-game situation that rarely occurs, comparing the two versions at fixed nr of nodes would eliminate a large amount of variance from the difference, as in most games there would be no difference at all. That testing with 0.05% more nodes would cause both the result of the old and the new version to be different, is completely irrelevant, as even in that case, it would likely be the same. (Unless the program change would have changed the result, which is exactly what you want to measure.)

bob
Posts: 20478
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Observator bias or...

Post by bob » Mon Jun 04, 2007 6:58 pm

hgm wrote:I don't see how this observation in anyway can be interpreted as a drawback of the fixed-nr-of-nodes testing method. IMO you would get exactly the same variability if you would compare testing at 20 sec per move vs 20.01 sec per move.

This observation is simply based on the central limit theorem of statistics. The population of possible games two programs can play is huge. Taking a _small_ sample of those games can end up choosing a sub-set of games that make the new version look better, worse, or the same compared to the old version. And with so many possible "small sample sizes" available, picking just one is just a random choice.

If you used fixed nodes and make _any_ change to the search or evaluation, then even though you search the same number of nodes per move in the new test, the trees are differently shaped due to the changes you made. Again you take a tiny sample from a very large population and you can't reliably draw conclusions.

Here's some samples from crafty data. Using the 40 silver positions, two games per position alternating colors. program identities not provided to avoid starting a tangential discussion:

Code: Select all

 
 run:   W/  D/  L  (W-L)
  1:  30/ 16/ 34 ( -4)
  2:  31/ 19/ 30 (  1)
  3:  31/ 18/ 31 (  0)
  4:  24/ 21/ 35 (-11)
  5:  35/ 16/ 29 (  6)
 
Those 5 tests were run consecutively, same time control for each 80 game test, same starting positions, same everything. Both programs using a time limit per move. No pondering. No SMP search. No opening book. No tablebases. No other users on the machines being used whatsoever. All conditions repeated for each run as near to identical as possible.

Now which one of those "tells the truth"??? You could pick two of them and conclude your changes are better, pick two and conclude your changes are worse, or pick one and conclude there was no change at all.

That's why I said 80 games is _not_ enough. I have 4 programs I use to test, one of which is Crafty. Any two programs show this behavior when played against each other.
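The spread in those five runs is, in fact, about what binomial noise predicts. A quick back-of-the-envelope calculation (my own, using win/draw/loss rates roughly matching the runs above) gives the expected standard deviation of the net score over 80 games:

```python
import math

def net_score_sd(p_win, p_draw, p_loss, n_games):
    """Standard deviation of the net score (wins - losses) over n_games,
    treating each game result as an i.i.d. value in {+1, 0, -1}.
    Draws contribute 0, so only p_win and p_loss enter the variance."""
    mean = p_win - p_loss
    var = (p_win + p_loss) - mean ** 2   # E[x^2] - E[x]^2 for one game
    return math.sqrt(n_games * var)

# Rates roughly matching the five 80-game runs (an assumption, not data
# from the thread): ~38% wins, ~22% draws, ~40% losses.
sd = net_score_sd(0.38, 0.22, 0.40, 80)
print(round(sd, 1))   # roughly 8 games of net score per one sigma
```

With a one-sigma swing of about eight games of net score, observed results from -11 to +6 are entirely consistent with two essentially equal engines.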


But if you repeat two tests of the same engines at 20M nodes/move, you will exactly reproduce the result. Testing the same version twice at 20 sec/move, will give you different, and most likely completely independent results.
Yes, but what's the point of that testing? If you make one tiny change to anything, the results will change, because now you have a different sample of games, since parts of the trees will expand or contract.

To recap:

Search a fixed number of nodes and record the results. Make any change to the program you want. Search a fixed number of nodes again, and the results are absolutely useless to compare with the first run.

Is that hard to understand given the data I provided above?

Remember, I've been able to run hundreds of thousands of games in analyzing this effect, something probably no one else here could do in their lifetime.



If you do very small changes in the program, e.g. in the evaluation of an end-game situation that rarely occurs, comparing the two versions at fixed nr of nodes would eliminate a large amount of variance from the difference, as in most games there would be no difference at all. That testing with 0.05% more nodes would cause both the result of the old and the new version to be different, is completely irrelevant, as even in that case, it would likely be the same. (Unless the program change would have changed the result, which is exactly what you want to measure.)
You should first try the test before stating whether or not the results will be irrelevant. I can tell you with 100% certainty your thoughts are absolutely wrong here, based on a huge volume of data.

User avatar
hgm
Posts: 23631
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Observator bias or...

Post by hgm » Tue Jun 05, 2007 2:24 pm

bob wrote:search a fixed number of nodes, record the results. Make any change to the program you want. Search a fixed number of nodes and the results are absolutely useless to compare with the first run.

Is that hard to understand given the data I provided above?
What is hard to understand is why the results would be useless. I could understand that they would be different. I would think it very useful that they are different: this can tell you whether the change was good or bad, provided you play enough games to make the difference statistically significant.

The smaller a change, the closer the average results will be, and the more games you will need to get a statistically meaningful test result. On the other hand, the smaller the change, the more the trees will start to look like each other, and the more of the moves played in the games of the two versions will be the same. If I make a change in the evaluation to use a different piece-square table for the bare King in KBNK, to prevent it from taking shelter in the wrong corner, that change will not have any effect on any of the moves played in games that do not end in KBNK. Perhaps 0.1% of the games would end such, so 99.9% of the games would be identical for both versions. The only games that differed would be those 0.1% ending in KBNK, where the improved version would now win them all, where the old version would bungle about half of them.

To measure the score difference of 0.05% would require millions of games if the games played by the two versions were independent (because the OS taking CPU time in an irreproducible way would make them independent when you allot fixed wall-clock time per move). By eliminating this source of randomness, you could do with ~10,000 games (which would then contain ~10 games that ended in KBNK, of which ~5 were different between the versions).

This is absolutely standard statistical procedure.
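The arithmetic behind these two sample sizes can be checked with a standard power calculation. This is a sketch of the reasoning, not code from the thread: with independent games, a 0.05% score shift must be resolved against per-game noise of roughly sd 0.4 (a typical figure for chess results with draws), while with deterministic fixed-node games only the ~0.1% of games that reach the changed code contribute any variance at all.

```python
import math

def games_needed_independent(delta, per_game_sd=0.4, z=1.96):
    """Games needed so that a score shift of `delta` is ~2 sigma of the
    match mean, assuming independent games with the given per-game sd."""
    return math.ceil((z * per_game_sd / delta) ** 2)

def games_needed_paired(affected_fraction, affected_games_wanted=10):
    """With fixed nodes the two versions play identical games except in
    positions that hit the changed code; just play enough games to
    collect a handful of those."""
    return math.ceil(affected_games_wanted / affected_fraction)

print(games_needed_independent(0.0005))   # order of millions
print(games_needed_paired(0.001))         # order of ten thousand
```

The two-to-three orders of magnitude between the results is exactly the variance-reduction argument being made: pairing the games removes all the noise from the 99.9% of games that are unaffected by the change.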

bob
Posts: 20478
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Observator bias or...

Post by bob » Wed Jun 06, 2007 1:31 am

hgm wrote:
bob wrote:search a fixed number of nodes, record the results. Make any change to the program you want. Search a fixed number of nodes and the results are absolutely useless to compare with the first run.

Is that hard to understand given the data I provided above?
What is hard to understand is why the results would be useless. I could understand that they were different. I would think it is very useful that they were different. This can tell you if the change was good or bad, provided you do enough games to make the difference statistically significant.
Useful -> something that produces information that is of some use. In this case, a random set of 80 games chosen from a huge population of potential games simply doesn't provide any useful information. See the five matches I posted. Pick any one of the five and you have a 3/5 chance of being wrong (I happen to know which version is actually better, based on 10,000 games in this test). So with 2/5 being the opposite of reality, and 1/5 indicating equality, which is also wrong, what possible use is that data?

The way to determine if your testing is worthwhile is to repeat the test N times and see how consistent things are. You might be surprised. If your test is wrong only 1/4 times, that gives you a good chance of getting an erroneous result and making a poor decision.
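The "repeat the test N times" suggestion is easy to simulate. The Monte Carlo sketch below (my own illustration, not from the thread) gives two engines a true ~10 Elo gap and counts how often an 80-game match fails to show the better one ahead:

```python
import random

def simulate_mismatch_rate(elo_gap=10, n_games=80, draw_rate=0.4,
                           trials=10_000, seed=1):
    """Fraction of n_games matches in which the truly stronger engine
    fails to score above 50%. Draw rate is an assumed parameter."""
    rng = random.Random(seed)
    expected = 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))  # ~0.514 for 10 Elo
    p_win = expected - draw_rate / 2.0   # chosen so E[score] == expected
    misleading = 0
    for _ in range(trials):
        score = 0.0
        for _ in range(n_games):
            r = rng.random()
            if r < p_win:
                score += 1.0
            elif r < p_win + draw_rate:
                score += 0.5
        if score <= n_games / 2.0:       # stronger engine not ahead
            misleading += 1
    return misleading / trials

frac = simulate_mismatch_rate()
print(frac)
```

With these assumptions the stronger engine fails to come out ahead in roughly a third of 80-game matches, which is consistent with the 3-out-of-5 misleading runs above: a single short match is close to a coin flip for small Elo differences.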


The smaller a change, the closer the average results will be, and the more games you will need to get a statistically meaningful test result. On the other hand, the smaller the change, the more the trees will start to look like each other, and the more of the moves played in the games of the two versions will be the same.
Without serving as a "spoiler" I suggest you check that yourself. I suspect you will be hugely surprised. The same two programs won't play the same moves every time given the same starting position, much less two different versions of the same program. And the variance is absolutely huge...

If I make a change in the evaluation to use a different piece-square table for the bare King in KBNK, to prevent it from taking shelter in the wrong corner, that change will not have any effect on any of the moves played in games that do not end in KBNK.
I would not bet on that. Just turn on 4-piece endgame tables and look at how many pieces are on the board when you get your first tablebase hit. I have seen hits with only 8 pieces removed here and there, and I see hits in significant numbers when half of the initial 32 pieces are gone. And many games reach that point.
Perhaps 0.1% of the games would end such, so 99.9% of the games would be identical for both versions. The only games that differed would be those 0.1% ending in KBNK, where the improved version would now win them all, where the old version would bungle about half of them.
I disagree with that so strongly that "strongly" is not in the right ballpark. I'd challenge you to take the same two programs, same starting point, and play games until you get at least two that match move for move. Plan on taking a while to do that.

To measure the score difference of 0.05% would require millions of games if the games played by the two versions were independent (because the OS taking CPU time in an irreproducible way would make them independent when you allot fixed wall-clock time per move). By eliminating this source of randomness, you could do with ~10,000 games (which would than contain ~10 games that ended in KBNK, of which ~5 were different between the versions).
That is simply statistically unsound. You have a huge population of games that you can potentially play. You artificially capture a subset of that population by fixing the nodes (or depth, or anything else that produces identical games each time given the same starting position). But what says that the random subset of games you have limited yourself to is a reasonable representation of the total game population? I've been testing this very hypothesis over the past couple of months, and believe me, it is just as error-prone as using elapsed time, because you are picking a random subset that is a tiny slice of the overall population and hoping that subset is indicative of something. Then you change any single thing in the program, everything changes, and now you are suddenly observing a different subset of the game population and comparing subsets that are not easily comparable. Without the kind of hardware I have been using, these experiments would have been totally hopeless. I play 100,000-game matches regularly. Who can reproduce that kind of data? (And I am talking 5+5 or 10+10 type time controls here.)


This is absolutely standard statistical procedure.
Not quite. No one takes a tiny sample of a huge population and draws conclusions from that tiny sample when it is important. Larger sample sizes, or more sets of samples, are required...
