Re: Observer bias or...
Posted: Thu May 31, 2007 2:56 pm
H. G. makes a very good point that the games might not be independent... Now I'm going to stop for a while and make sure everything is OK before running another test.
My current testing methodology is to play 40 positions, once as black and once as white, do this 32 times (64 games per position), and repeat it against multiple opponents. This gives pretty stable results and allows me to compare two versions with reasonable reliability. Anything less is not enough, based on a few hundred thousand games of testing.

hgm wrote:
> Alessandro Scotti wrote:
> > I remember since testing with Kiwi that results over 100 games are very unreliable. It sometimes happens that a version gets a bad start but does better by the end of a long test. On the other hand, I had a version reach 64% after 100 games and finish with a disappointing 50% after 720 games... I will now increase the number to 800 and see if that brings some benefit (not much is expected, though).
>
> 64% after 100 games between approximately equal engines is extreme: the standard error over 100 games should be 0.4/sqrt(100) = 4%, so a 14% deviation represents 3.5 sigma. This should happen on average only about 1 in 4000 tries.
>
> I noted a very strange effect when I was testing uMax in self-play. The standard error over 100 games should be 4%, but when I played 1000 games between the same versions and looked at the scores of the ten individual 100-game runs, those results deviated on average much more from each other (and from the final average) than you would expect from the calculated standard error. This can only happen if the games are not independent! Indeed, I cannot exclude this, as all the games were played in a single run, each using the random seed the previous game ended with. So with a bad randomizer, if a single game repeats due to an equal or very close seed at the start of the game, the following game might repeat as well, destroying the independence of the games.
>
> Whatever the cause, the effect was that the error in the win percentage was always a lot larger than you would expect based on the number of games.
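hgm's numbers here are easy to check. Below is a minimal Python sketch; the 0.4 per-game standard deviation is his figure for roughly equal engines (it depends on the draw rate; 0.5 would be the draw-free maximum), and the tail probability uses a normal approximation:

```python
import math

# Per-game standard deviation of the score; ~0.4 is hgm's assumption
# for roughly equal engines with a typical draw rate.
sigma_game = 0.4
n_games = 100

se = sigma_game / math.sqrt(n_games)   # standard error of the match score
deviation = 0.64 - 0.50                # 64% observed vs. 50% expected
z = deviation / se                     # deviation measured in sigmas

# One-sided tail probability of a >= z-sigma deviation, normal model.
p = 0.5 * math.erfc(z / math.sqrt(2.0))

print(f"standard error over {n_games} games: {se:.1%}")  # 4.0%
print(f"deviation: {z:.1f} sigma")                       # 3.5 sigma
print(f"expected about 1 in {1 / p:.0f} tries")
```

The one-sided tail for 3.5 sigma comes out near 1 in 4300, which matches hgm's "1 in ~4000" estimate.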
Hello Harm,

hgm wrote:
> 64% after 100 games between approximately equal engines is extreme: the standard error over 100 games should be 0.4/sqrt(100) = 4%, so a 14% deviation represents 3.5 sigma. [...]
Fixed number of nodes is absolutely worthless. To prove that to yourself, do the following: play a match using the same starting position, where _both_ programs search a fixed number of nodes (say 20,000,000). Record the results. Then re-play, but have both search 20,010,000 nodes (10K more than before). Now look at the results. They won't be anywhere near the same. Which one is more correct? Answer: neither; it's hopeless, because you are taking a small random sample (the games at 20M nodes per side) from a much larger set of random results and basing your decisions on that. You may as well flip a coin...

hgm wrote:
> Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable. But of course you cannot get rid of the randomness induced by SMP that way.
>
> For this reason I still want to implement the tree comparison idea I proposed here lately. This would eliminate the randomness not by sampling enough games and relying on the (tediously slow) 1/sqrt(N) convergence, but by exhaustively generating all possible realizations of the game from a given initial position. If the versions under comparison are quite close (the case that is most difficult to test with conventional methods), the entire game tree might consist of fewer than 100 games, yet might give you the accuracy of 10,000 games that are subject to chance effects.
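That 1/sqrt(N) convergence really is tedious: halving the error bar costs four times the games. A quick sketch, using the same assumed ~0.4 per-game standard deviation hgm quotes:

```python
import math

# Assumed per-game standard deviation of the score for evenly matched
# engines (hgm's ~0.4 figure; the exact value depends on the draw rate).
SIGMA_GAME = 0.4

def games_needed(margin, k=2.0):
    """Games required so that k standard errors of the match score fit
    inside `margin` (a score fraction, e.g. 0.01 for 1%)."""
    return math.ceil((k * SIGMA_GAME / margin) ** 2)

for pct in (5, 2, 1):
    print(f"to resolve a {pct}% score difference at 2 sigma: "
          f"{games_needed(pct / 100):,} games")
```

Resolving a 1% score difference (roughly 7 Elo) at two sigma already takes 6,400 independent games, which is why a scheme that removes the randomness instead of averaging it out is so attractive.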
I disagree.

bob wrote:
> Fixed number of nodes is absolutely worthless. [...] May as well flip a coin...

hgm wrote:
> Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable. [...]
My upcoming ICGA paper will show just how horrible this is...
Hi Alessandro,

Alessandro Scotti wrote:
> I seem to have reached a plateau with Hamsters, where new features and even bug fixes hardly contribute any Elo.
>
> This would bother me little if not for the fact that each and every test starts by feeding me the illusion of improvement, and only later falls back to the same old stats.
>
> A 700-game test tournament might go like this:
>
> - games 1-100: 53% (hey, not bad)
> - games 101-200: 54% (great!)
> - games 201-300: 54% (yoo-hoo!... goes out to buy champagne!)
> - games 301-400: 53% (just a little glitch!)
> - games 401-500: 52% (some bad luck here...)
> - games 501-600: 51% (you son of a...)
> - games 601-700: 50% (nooooooooooooooo!!!)
>
> So is it just my imagination, or does this happen to you too?! One would think that 400 games already provide a good approximation, yet...
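It is not just your imagination, but it is exactly what chance predicts for an even match. A toy Monte Carlo (with a hypothetical 30% win / 40% draw / 30% loss split between two exactly equal engines) reproduces the same kind of log:

```python
import random

# Simulate a 700-game match between two *exactly equal* engines and
# report the running score every 100 games, like the tournament log above.
random.seed(2007)  # fixed seed so the run is reproducible

def game_score():
    # Assumed result distribution: 30% win, 40% draw, 30% loss.
    r = random.random()
    return 1.0 if r < 0.30 else (0.5 if r < 0.70 else 0.0)

total = 0.0
for games in range(100, 800, 100):
    total += sum(game_score() for _ in range(100))
    print(f"after {games:3d} games: {100 * total / games:.1f}%")
```

With different seeds the individual 100-game chunks wander a few percent either side of 50% (the standard error per chunk is about 4%), which is all it takes to produce the champagne-then-despair pattern above.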
I disagree with you.

Michael Sherwin wrote:
> Hi Alessandro, [...]
I just saw your post. I have been busy giving and receiving grief in the CTF, and wow, was that a lot of fun. Luckily I was able to get my humour circuits powered up again for some light entertainment!

You should have sent me an email or a PM about this. You should know that I have played more games against Hamsters than anyone else, except possibly you. I have played many thousands of RomiChess vs. Hamsters games.
Yes, I have noticed this behavior with RomiChess also, but almost exclusively when playing against Hamsters: wild variation in results. One 100-game match from fixed positions had 56 games with different results than the match before, and all I had changed in my code was an 8 to a 16 between the two tests.

There seems to be some random factor (?) in Hamsters that causes it to play differently a good part of the time. There are times when Romi just blows Hamsters off the board, and then, from the same position in the next test, there seems to be nothing Romi can do against Hamsters, and she is herself often blown away.

Ron has also noticed this about Hamsters. I am surprised that he did not mention it to you. We have both labeled Hamsters as too volatile for reliable testing!

If Rybka were as random as Hamsters seems to be, I would bet that Rybka would suffer a rating drop of several hundred points. Randomness limits an engine's upper ceiling, no matter how many improvements are made; it will lead to a certain percentage of losses regardless of how strong the engine is.

This is just an educated guess on my part from playing so many games against Hamsters; I really do not know for sure. I came to the conclusion that you must have put 5 different hamsters in your program, all with different personalities, selected randomly before each game!

Unless you put some randomness into Hamsters on purpose, I do not see how it can be there. I hope that something in all I have said might be of some help.
Mike
Ouch Michael... this screams "bug" all the way! I must say that I haven't noticed _extremely_ wild fluctuations in my own tests, but yes, there is almost definitely something strange in my engine...

Michael Sherwin wrote:
> There seems to be some random factor (?) in Hamsters that causes it to play differently a good part of the time. [...]