I seem to have reached a plateau with Hamsters where new features and even bug fixes hardly contribute any elo to it.
This would bother me little if not for the fact that each and every test starts by feeding me the illusion of improvement, only to fall back later to the same old stats.
A 700-game test tournament might go like this:
- games 0-100: 53% (hey, not bad)
- games 101-200: 54% (great!)
- games 201-300: 54% (yo-hoo!... goes out to buy champagne!)
- games 301-400: 53% (just a little glitch!)
- games 401-500: 52% (some bad luck here...)
- games 501-600: 51% (you son of a...)
- games 601-700: 50% (nooooooooooooooo!!!)
So is it just my imagination, or does this happen to you too? One would think that 400 games would already provide a good approximation, yet...
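The pattern above is exactly what random fluctuation produces. As a sketch (not from the thread: a hypothetical simulation of two equal engines, draws ignored for simplicity), the cumulative score routinely starts a few points away from 50% and drifts back:

```python
import random

def running_percentages(n_games=700, block=100, p_win=0.5, seed=1):
    # Simulate a match between two equal engines (true score 50%) and
    # report the cumulative score percentage after each 100-game block.
    # Each game is a coin flip; draws are ignored for simplicity.
    rng = random.Random(seed)
    score = 0
    cumulative = []
    for g in range(1, n_games + 1):
        score += 1 if rng.random() < p_win else 0
        if g % block == 0:
            cumulative.append(round(100 * score / g, 1))
    return cumulative
```

Running this with different seeds shows cumulative lines that open at 53-54% and settle toward 50%, even though nothing about the "engines" changed mid-match.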
Observator bias or...
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Observator bias or...
In "Scalable Search for Computer Chess" Ernst Heinz suggests a minimum of 800 games to reach a decision for closely matched engines.
Consider also these two entries from the SSDF rating list:
  #  Program            Hardware                Elo    +    -  Games  Score  Av.Opp
  5  Junior 10          256MB Athlon 1200 MHz  2851  +25  -24    874    70%    2703
  6  Hiarcs 10 HypMod   256MB Athlon 1200 MHz  2845  +22  -21   1238    73%    2672
Even after both have played well over 800 games, we can't really decide which one is stronger.
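That claim can be made concrete. Assuming the +/- numbers in the SSDF list are the bounds of a 95% confidence interval (an assumption about the list's conventions, not stated in the thread), a minimal overlap check:

```python
def ratings_distinguishable(elo_a, plus_a, minus_a, elo_b, plus_b, minus_b):
    # True if the two (assumed 95%) confidence intervals do NOT overlap,
    # i.e. the list separates the engines at that confidence level.
    lo_a, hi_a = elo_a - minus_a, elo_a + plus_a
    lo_b, hi_b = elo_b - minus_b, elo_b + plus_b
    return max(lo_a, lo_b) > min(hi_a, hi_b)

# Junior 10 (2851 +25 -24) vs Hiarcs 10 (2845 +22 -21)
print(ratings_distinguishable(2851, 25, 24, 2845, 22, 21))  # False: intervals overlap
```

The intervals [2827, 2876] and [2824, 2867] overlap almost entirely, so the 6-Elo gap between the two entries is far inside the noise.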
Re: Observator bias or...
Hi Alessandro,
And yet the reason can't be the number of games only; see the observation below that I made a couple of years ago.
http://members.home.nl/matador/testing.htm
Very depressing.
Regards,
Ed
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Observator bias or...
Did you ever solve this puzzle, Ed?
It seems that +/- 1% is really a bit better than you can expect over 800 games: the standard error in a single run should be 0.4/sqrt(800) = 1.4%, but the difference of the result of two independent runs should be sqrt(2) larger, i.e. 2%. And in 32% of the cases the difference would be larger than that.
If I calculate the spread of the 3 percentages you quote, it is 1.9%. This is a bit larger than expected, but not so much that it is wildly improbable.
But one might argue that the different runs are not sufficiently independent. If this is the case one would really have to know the experimental spread of the results of the run under exactly the same condition.
If it turns out to be a real effect, the only thing I can think of is that the OS is to blame. I don't know under what conditions you played these matches (ponder on/off, single/dual core), but if there is competition between the engines for CPU and/or memory, the OS might systematically favor one because of its position in the process table (determined by the order of starting them). Or the GUI might of course cheat, giving a higher priority to the engine it starts with...
- Posts: 10300
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Observator bias or...
One note: I do not play so many games, but I test only with a fixed number of nodes, because my changes do not cause significant changes in nodes per second.
Testing with a fixed number of nodes means I do not need to worry about one engine being slowed down by a significant factor, and I can safely do other things on the same computer knowing they will not influence the result (I have already seen games where an engine lost simply because it was slowed down by a significant factor).
Unfortunately, unlike you I do not have enough time to run 700 games, and I am often happy with 200 games to decide which version to accept. Hopefully I will accept the better version more often than not, so productive changes will outnumber counterproductive ones and I will get an improvement.
Uri
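How much a 200-game verdict is worth can be quantified with the likelihood of superiority (LOS), a standard measure in engine testing. A minimal sketch (the 60/50/90 numbers are hypothetical, not from the thread); draws cancel out in this approximation:

```python
import math

def los(wins, losses):
    # Likelihood of superiority: probability that the observed surplus of
    # wins over losses is not just luck, from decisive games only.
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

# hypothetical 200-game match: 60 wins, 50 losses, 90 draws
confidence = los(60, 50)
```

A 10-game surplus over 200 games gives a LOS of only about 0.83, so roughly one in six versions "accepted" on such a margin would in fact be no better.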
Re: Observator bias or...
Thanks for the answers so far, guys... to sum up, it seems I'm doomed either way, whether by not playing enough games or by imponderable factors... any news on that, Ed?
Uri, I like your proposal of playing with node count limit, I'll try to test it. Running 700 games takes several days for me, but it's not a problem because that's exactly the time it takes to bring out even a slightly modified version so I'm not in a hurry!
Re: Observator bias or...
hgm wrote: Did you ever solve this puzzle, Ed?
Everything was done to suppress randomness as much as possible: single CPU, no PB, same openings in reverse, no learning, no TB, etc.
Not satisfactorily. I reran the whole thing at a higher time control (40/20) and the problem disappeared. But at that rate 800 games lasted 800 hours, which was unacceptable for me, even with 4 PCs at my disposal at the time.
My educated guess is that tiny Windows interferences stealing 100-200 ms every now and then (the infamous Windows swap file comes to mind) make a program go one iteration deeper or one ply less deep than in a former game, resulting in different moves; thus randomness plays a role after all.
The higher the time control, the less often such things occur.
Ed
Re: Observator bias or...
Just to make sure, we are talking about testing version A against version B of the same program, right?
What about estimating the actual variance of the experiment using, for instance, cross-validation or bootstrap methods?
This would of course also need a large number of games to be useful, but it might turn out that the variance is larger than expected, which could explain a lot.
-Casper
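The bootstrap idea could look like the following minimal sketch (my illustration, not Casper's code): resample the game results with replacement and measure the empirical spread of the mean score, with no distributional assumptions.

```python
import random

def bootstrap_se(results, n_boot=2000, seed=1):
    # results: one score per game (1 = win, 0.5 = draw, 0 = loss).
    # Resample the games with replacement many times and take the standard
    # deviation of the resampled mean scores: an empirical standard error.
    rng = random.Random(seed)
    n = len(results)
    means = []
    for _ in range(n_boot):
        means.append(sum(results[rng.randrange(n)] for _ in range(n)) / n)
    mu = sum(means) / n_boot
    return (sum((m - mu) ** 2 for m in means) / (n_boot - 1)) ** 0.5
```

If the empirical standard error comes out clearly above the theoretical 0.4/sqrt(N) discussed earlier in the thread, that would be evidence for the extra variance Casper suspects.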
Re: Observator bias or...
Maybe the first program loaded got access to faster RAM.
I remember a similar problem a long long long time ago
After I extended my system with new RAM of a different speed than the original, I got very fluctuating n/s...
/Peter
- Posts: 397
- Joined: Sun Oct 29, 2006 4:38 am
- Location: Schenectady, NY
Re: Observator bias or...
Hi Alessandro,
If I make a coding change and the results of the first 100 games are bad, I tend to stop testing and start re-coding. But if the first 100 games go well, I continue running test games. Because of this tendency the result of the first 100 games always seems to be a bit better than the hundreds of games that follow.
Ron
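Ron's tendency is a form of optional stopping, and its size can be simulated. A hypothetical sketch with two exactly equal engines: only runs whose first 100 games score at least 50% are "kept", and we look at the average first-100 result of the kept runs.

```python
import random

def kept_first_blocks(n_trials=2000, seed=7):
    # Two equal engines: every game is a coin flip (true score 50%).
    # Testing continues only if the first 100 games score 50% or more;
    # return the average first-100 score of those kept runs.
    rng = random.Random(seed)
    kept = []
    for _ in range(n_trials):
        first100 = sum(rng.random() < 0.5 for _ in range(100))
        if first100 >= 50:
            kept.append(first100)
    return sum(kept) / len(kept)
```

The kept runs average roughly 54% even though the true strength is exactly 50%, which matches the observation that the first 100 games "always seem a bit better" than the games that follow.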