hgm wrote: Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable.
Fixing the number of nodes doesn't make any sense to me. Let's say I make a mod that quadruples the NPS. If the number of nodes searched is fixed, then the program reaches them faster but doesn't play any stronger. I think all of us would expect a strength improvement from a 4x speedup.
Observator bias or...
- 
				CRoberson
- Posts: 2094
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
Re: Observator bias or...
- 
				Uri Blass
- Posts: 10905
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Observator bias or...
bob wrote: search a fixed number of nodes, record the results. Make any change to the program you want. Search a fixed number of nodes and the results are absolutely useless to compare with the first run.
hgm wrote: What is hard to understand is why the results would be useless. I could understand that they were different. I would think it is very useful that they were different. This can tell you if the change was good or bad, provided you do enough games to make the difference statistically significant.
bob wrote: Useful -> something that produces information that is of some use. In this case, a random set of 80 games chosen from a huge population of potential games simply doesn't provide any useful information. See the 5 matches I posted. Pick any one of the 5 and you have a 3/5 chance of being wrong (I happen to know which is actually better based on 10,000 games in this test, as an example). So with 2/5 being opposite of reality, and 1/5 indicating equality which is wrong, what possible use is that data?
Is that hard to understand given the data I provided above?
The way to determine if your testing is worthwhile is to repeat the test N times and see how consistent things are. You might be surprised. If your test is wrong only 1/4 of the time, that still gives you a good chance of getting an erroneous result and making a poor decision.
hgm wrote: The smaller a change, the closer the average results will be, and the more games you will need to get a statistically meaningful test result. On the other hand, the smaller the change, the more the trees will start to look like each other, and the more of the moves played in the games of the two versions will be the same.
bob wrote: Without serving as a "spoiler" I suggest you check that yourself. I suspect you will be hugely surprised. The same two programs won't play the same moves every time given the same starting position, much less two different versions of the same program. And the variance is absolutely huge...
hgm wrote: If I make a change in the evaluation to use a different piece-square table for the bare King in KBNK, to prevent it from taking shelter in the wrong corner, that change will not have any effect on any of the moves played in games that do not end in KBNK.
bob wrote: I would not bet on that. Just turn on 4 piece endgame tables and look at how many pieces are on the board when you get your first tablebase hit. I have seen a hit with only 8 pieces removed here and there. And I see hits in significant numbers when 1/2 of the initial 32 pieces are gone. And many games reach that point.
hgm wrote: Perhaps 0.1% of the games would end such, so 99.9% of the games would be identical for both versions. The only games that differed would be those 0.1% ending in KBNK, where the improved version would now win them all, where the old version would bungle about half of them.
bob wrote: I disagree with that so strongly that "strongly" is not in the right ballpark. I'd challenge you to take the same two programs, same starting point, and play games until you get at least two that match move for move. Plan on taking a while to do that.
hgm wrote: To measure the score difference of 0.05% would require millions of games if the games played by the two versions were independent (because the OS taking CPU time in an irreproducible way would make them independent when you allot fixed wall-clock time per move). By eliminating this source of randomness, you could do with ~10,000 games (which would then contain ~10 games that ended in KBNK, of which ~5 were different between the versions).
bob wrote: That is simply statistically unsound. You have a huge population of games that you can potentially play. You artificially capture a sub-set of that population by fixing the nodes (or depth, or anything else that produces identical games each time given the same starting position). But what says that the random subset of games you have limited yourself to is a reasonable representation of the total game population? I've been testing this very hypothesis during the past couple of months, and believe me, it is just as error-prone as using elapsed time, because you are just picking a random subset that is a tiny slice of the overall population. And you hope that subset is indicative of something. And then you change any single thing in the program and everything changes and now you are suddenly observing a different subset of the game population, and comparing them when they are not easily comparable. Without the kind of hardware I have been using these experiments would have been totally hopeless. I play 100,000 game matches regularly. Who can reproduce that kind of data (and I am talking 5+5 or 10+10 type time controls here)?
hgm wrote: This is absolutely standard statistical procedure.
bob wrote: Not quite. No one takes a tiny sample of a huge population and draws any conclusion from that tiny sample, when it is important. Larger sample sizes, or more sets of samples are required...
Some comments:
1) I agree that when the difference is small you need many games to know which version is better, even if you choose a fixed number of nodes; a few hundred games are not enough.
2) The advantage of a fixed number of nodes relative to fixed time is that fewer problems can happen.
I do not need to worry that one engine was slowed down by a significant factor during the game. I have seen games where one program lost simply because it reached only a small depth, which can be explained only by that program having been slowed down by a significant factor.
3) There may be evaluation changes that change only a small minority of games at a fixed number of nodes.
In the case of KBN vs K (assuming the program is not using tablebases), it is possible to have a root-dependent evaluation and not use the knowledge about KBN vs K when the root position is not KBN vs K, so no games except those that reach KBN vs K are going to be changed.
Uri
- 
				Uri Blass
- Posts: 10905
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Observator bias or...
hgm wrote: Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable.
CRoberson wrote: Fixing the number of nodes doesn't make any sense to me. Let's say I make a mod that quadruples the NPS. If the number of nodes searched is fixed, then the program reaches them faster but doesn't play any stronger. I think all of us would expect a strength improvement from a 4x speedup.
Fixing the number of nodes clearly makes sense when you make changes that have no significant effect on the number of nodes per second.
You can also use it when the nodes per second do change; you only need to change how the nodes are counted.
If some modification made your program 4 times faster, then you can increase the node count by 1/4 every time you make a move in the search, instead of increasing it by 1.
Uri
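The node-count adjustment suggested in the post above can be sketched as follows. This is only an illustration, not code from any engine mentioned in this thread: `NodeBudget`, `visit` and `nodes_searched` are hypothetical names, and the sketch assumes the speedup factor is known in advance and stays constant during the search.

```python
from fractions import Fraction


class NodeBudget:
    """Counts search effort in 'reference nodes' (hypothetical helper).

    A version that runs `speedup` times faster is charged 1/speedup per
    visited node, so a fixed budget credits it with more real nodes.
    """

    def __init__(self, limit, speedup=1):
        self.limit = limit                    # budget in reference nodes
        self.cost = 1 / Fraction(speedup)     # charge per visited node
        self.used = Fraction(0)               # reference nodes consumed
        self.visited = 0                      # real nodes visited

    def visit(self):
        """Charge one node; return False once the budget is exhausted."""
        if self.used >= self.limit:
            return False
        self.used += self.cost
        self.visited += 1
        return True


def nodes_searched(limit, speedup):
    """Stand-in for a search loop: visit nodes until the budget runs out."""
    budget = NodeBudget(limit, speedup)
    while budget.visit():
        pass
    return budget.visited


print(nodes_searched(1000, 1))  # baseline version
print(nodes_searched(1000, 4))  # version assumed to be 4x faster
```

Exact fractions are used so that the accumulated charge does not drift through rounding; an engine would more likely use an integer counter scaled by a fixed factor.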
- 
				bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Observator bias or...
Uri Blass wrote: Some comments:
1) I agree that when the difference is small you need many games to know which version is better, even if you choose a fixed number of nodes; a few hundred games are not enough.
2) The advantage of a fixed number of nodes relative to fixed time is that fewer problems can happen.
I do not need to worry that one engine was slowed down by a significant factor during the game. I have seen games where one program lost simply because it reached only a small depth, which can be explained only by that program having been slowed down by a significant factor.
That is the thing I believe is totally wrong. A tiny evaluation change makes a tiny change to the tree shape. Which makes a tiny change to the moves and how they are searched. Which makes a tiny change... And eventually you get a different best move or best score. And it doesn't take but a few such searches to produce a different game. I ran exactly this test myself, in fact. I made a minor eval change (and I mean changing one term in a small way) and it changed the 80-game (40 position) result significantly, even though I am absolutely certain that the changed version is no better or worse.
It just doesn't take much of a change to the nodes searched to blow things way out of proportion.
Uri Blass wrote: 3) There may be evaluation changes that change only a small minority of games at a fixed number of nodes.
In the case of KBN vs K (assuming the program is not using tablebases), it is possible to have a root-dependent evaluation and not use the knowledge about KBN vs K when the root position is not KBN vs K, so no games except those that reach KBN vs K are going to be changed.
That's a simplistic case. I don't do any such root preprocessing at all; any eval change permeates the entire tree search.
- 
				hgm  
- Posts: 28395
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Observator bias or...
bob wrote: Useful -> something that produces information that is of some use. In this case, a random set of 80 games chosen from a huge population of potential games simply doesn't provide any useful information. See the 5 matches I posted. Pick any one of the 5 and you have a 3/5 chance of being wrong (I happen to know which is actually better based on 10,000 games in this test, as an example). So with 2/5 being opposite of reality, and 1/5 indicating equality which is wrong, what possible use is that data?
Well, the number 80 is something that is entirely your fabrication. I never mentioned any number, and everything I said so far applies as well to 80 million games. The information produced in 80 games is useful, as you first have to play 80 games before you can have played 80 million games. If one threw away the results of each 80 games because they were 'useless', one would never get a result on a larger number of games. Each game is exactly equally useful, and contains as much information as any other game. How large the batch it belonged to was does not have any effect on this.
bob wrote: Without serving as a "spoiler" I suggest you check that yourself. I suspect you will be hugely surprised. The same two programs won't play the same moves every time given the same starting position, much less two different versions of the same program. And the variance is absolutely huge...
I experience exactly the opposite problem. When testing uMax I could not play more than 2 games against most opponents (one with white and one with black), or the same 2 games would be exactly repeated over and over. And of the opponents that randomized their opening play, most were not suitable for automatic testing, as they crashed and hung the system. I finally solved it by playing Nunn matches, forcing the dozen or so opponents that would not hang the system to play 20 different games each.
bob wrote: I would not bet on that. Just turn on 4 piece endgame tables and look at how many pieces are on the board when you get your first tablebase hit. I have seen a hit with only 8 pieces removed here and there. And I see hits in significant numbers when 1/2 of the initial 32 pieces are gone. And many games reach that point.
That you probe it is not enough to affect the tree. The coarse evaluation of any KBNK position is +6, and before any of those scores get in window the game has to be long since decided. The subtle positional differences between having the bare King in a white or black corner will never change the nature of the fail until the root score gets very close to +6 or -6. And if that happens in the opening, a revision of your book is in order...
 
hgm wrote: Perhaps 0.1% of the games would end such, so 99.9% of the games would be identical for both versions. The only games that differed would be those 0.1% ending in KBNK, where the improved version would now win them all, where the old version would bungle about half of them.
bob wrote: I disagree with that so strongly that "strongly" is not in the right ballpark. I'd challenge you to take the same two programs, same starting point, and play games until you get at least two that match move for move. Plan on taking a while to do that.
Like I said before, I already had to go through substantial trouble NOT to get that. If uMax (and many of the opponents I tried for it) would not repeat the game move for move, I probably should send my computer back for repairs under guarantee, for there must be a hardware error then. It is not normal when a computer program with a precisely specified algorithm does not give the same result every time you run it, as computer languages are defined such that the result of any statement is uniquely defined and leaves no room for a nondeterministic outcome.
bob wrote: That is simply statistically unsound. You have a huge population of games that you can potentially play. You artificially capture a sub-set of that population by fixing the nodes (or depth, or anything else that produces identical games each time given the same starting position). But what says that the random subset of games you have limited yourself to is a reasonable representation of the total game population? I've been testing this very hypothesis during the past couple of months, and believe me, it is just as error-prone as using elapsed time, because you are just picking a random subset that is a tiny slice of the overall population.
Are you now saying that testing is never any good, because even if you played billions times billions of independent games, the game tree of Chess is so large that whatever you tried would always be just a tiny slice of all possible games that the program could be made to play by varying the number of nodes or seconds it could search? Well, in standard statistical theory it is only the size of the sample that determines the reliability of the result, not the size of the population the sample was drawn from.
Anyway, we seem to be making progress, as now at least it is 'just as error prone' to fix the number of nodes as to fix the number of seconds, instead of 'totally useless'.
hgm wrote: This is absolutely standard statistical procedure.
bob wrote: Not quite. No one takes a tiny sample of a huge population and draws any conclusion from that tiny sample, when it is important. Larger sample sizes, or more sets of samples are required...
Elections are the only counterexample I know of. Everyone uses sampling when confronted with a huge or infinite data set. How big the sample has to be to draw conclusions with a certain reliability from it can be calculated, and can be strongly dependent on the method of sampling (e.g. stratified sampling).
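The kind of sample-size calculation referred to above can be sketched as follows, assuming independent games. The per-game standard deviation of 0.4 points and the 2-sigma threshold are illustrative assumptions, not figures from this thread, and `standard_error` and `games_needed` are hypothetical helper names. The size of the population of possible games does not appear anywhere in the formula.

```python
import math


def standard_error(games, per_game_sd=0.4):
    """Standard error of the mean score per game after `games` independent games.

    per_game_sd = 0.4 is an assumed value for results in {0, 0.5, 1}.
    """
    return per_game_sd / math.sqrt(games)


def games_needed(score_diff, per_game_sd=0.4, z=2.0):
    """Games per version needed to resolve `score_diff` at `z` sigma.

    Two versions are each measured with independent games, so the variance
    of the difference is 2 * sd**2 / N. Solving z * sqrt(2 * sd**2 / N)
    = score_diff for N gives the expression below.
    """
    return math.ceil(2.0 * (z * per_game_sd / score_diff) ** 2)


# One 80-game match: uncertainty of the score fraction (1 sigma).
print(round(standard_error(80), 4))

# Resolving a 0.05% score difference (0.0005 as a fraction) with independent games.
print(games_needed(0.0005))
```

Under these assumptions the second figure lands in the millions, in line with the "millions of games" estimate quoted earlier for the independent case, while a single 80-game match carries an uncertainty of several percent in score.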
- 
				bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Observator bias or...
hgm wrote: Well, the number 80 is something that is entirely your fabrication.
Your point would be? I _specifically_ said the 80 game matches were my own data, I gave 5 different 80 game match results to illustrate the point. Namely that 80 games against a single opponent is not enough to avoid high variance in the results. I said nothing more, nothing less...
hgm wrote: I never mentioned any number, and everything I said so far applies as well to 80 million games. The information produced in 80 games is useful, as you first have to play 80 games before you can have played 80 million games.
Statistically, that's simply wrong. Suppose you have a die with an unknown number of sides. How many "rolls" do you need before you can accurately determine how many faces there are? Certainly 1 or 2 won't cut it. And 80 games puts you into that level of uncertainty when that kind of "die" has millions of faces.
hgm wrote: If one would throw the results of each 80 games away because they were 'useless', you would never get a result on a larger number of games.
Come on. That's stupid, you know it, I know you know it, and I know that you know that I know it. I said 80 is _not_ enough. That does _not_ mean that you can never get a good result. Only an idiot would run 80 games, say "no good here" and throw them away and start over. The logical person would continue to add to that set of games until it is sufficient to interpret...
So let's not go out into never-never land...
hgm wrote: Each game is exactly equally useful, and contains as much information as any other game. How large the batch was it belonged to does not have any effect on this. I experience exactly the opposite problem. When testing uMax I could not play more than 2 games against most opponents (one with white and one with black), or the same 2 games would be exactly repeated over and over. And of the opponents that randomized their opening play, most were not suitable for automatic testing, as they crashed and hung the system. I finally solved it by playing Nunn matches, forcing the dozen or so opponents that would not hang the system to play 20 different games each.
You need to play games against Crafty, Fruit, Glaurung, ArasanX (to name just 4). You will _not_ get just two games repeated over and over with those, unless you play so badly you get smashed within 20 moves...
hgm wrote: That you probe it is not enough to affect the tree. The coarse evaluation of any KBNK position is +6, and before any of those get in window the game has to be long since decided. The subtle positional differences between having the bare King in a white or black corner will never change the nature of the fail until the root score gets very close to +6 or -6. And if that happens in the opening, a revision of your book is in order...
Something is wrong in your testing. I have played Crafty vs Arasan, Glaurung and Fruit and observed this wild fluctuation in game results. I have repeated it with Glaurung vs Fruit and Arasan, and Fruit vs Arasan. So I don't see how you can get repeatable results when I cannot, and I have played _tons_ of games trying to repeat 'em...
- 
				hgm  
- Posts: 28395
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Observator bias or...
Well, excuse me for saying so, but if all you said was that 80 games is not enough, then I don't understand why you said that at all. The topic under discussion, or at least the one that I addressed and that you reacted to, was the one raised by Uri: whether it is preferable to test based on time or based on node count, not how many games are enough:
hgm wrote: Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable. But of course you cannot get rid of the randomness induced by SMP that way.
bob wrote: fixed number of nodes is absolutely worthless. To prove that to yourself, do the following. Play a match using the same starting position, where _both_ programs search a fixed number of nodes (say 20,000,000). Record the results. Then re-play but have both search 20,010,000 nodes (10K more nodes than before). Now look at the results. They won't be anywhere near the same. Which one is more correct? Answer: that's hopeless as you take a small random sample (the games with 20M nodes per side) from a much larger set of random results, and you base your decisions on that? May as well flip a coin...
Here you make the general statement that testing with a fixed number of nodes is useless, without referring to any number of games. And I don't think that 'useless' is the same as 'not enough' (for a particular purpose). Even if you want to stick to the 80 games that suddenly popped out of nowhere, if I have two versions and one of them scored 45 out of 80, while the other scored 0 out of 80, then 80 games are clearly enough to draw the far-reaching conclusion that you have broken something, and it would be plain silly to continue playing 100,000 games with this version. But all of that is standard statistics, which was never an issue in this discussion thread.
For this reason I still want to implement the tree comparison idea I proposed here lately. This would eliminate the randomness not by sampling enough games and relying on the (tediously slow) 1/sqrt(N) convergence, but by exhaustively generating all possible realizations of the game from a given initial position. If the versions under comparison are quite close (the case that is most difficult to test with conventional methods), the entire game tree might consist of fewer than 100 games, but might give you the accuracy of 10,000 games that are subject to chance effects.
my upcoming ICGA paper will show just how horrible this is...
To talk about things that are absolutely useless: testing uMax against Crafty, Fruit, Arasan comes pretty close to that. The more games I would play, the more useless it would be, for in 100,000 games both the old and the improved version of uMax would score 0 points. So how would I know if my improvement worked, or if I had completely broken it? It would just be a giant waste of time. An easy analysis shows that you obtain maximum information per game (so that you get the desired reliability with the smallest number of games) if you test against engines of about equal strength.
On top of that, for those that are testing engines in a higher ELO range, I would even reverse the statement: if Crafty, Fruit, Arasan,... are not able to reproduce their games despite 'random' being switched off, and despite being set for a fixed ply depth (so that random factors outside of the engines cannot affect their logic), they are clearly not suitable test opponents and are best excluded from any gauntlets you make to evaluate tiny changes in your engine, as using such unpredictable engines needlessly adds an enormous statistical variance to the quantity under measurement. Better stick to engines that behave according to specifications.
After all, the idea is to make testing to a certain accuracy as easy as possible. That you could also make it much harder on yourself by picking certain engines with nasty peculiarities is quite irrelevant if you are smart enough to stay away from them!
- 
				bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
time for some real data
Here are 32 80-game matches, where + is a win, - is a loss and = is a draw. The 80-game matches are between the same two opponents, using the same 40 starting positions (each played once with white and once with black), same time control, same everything. Look at this and then tell me that two programs play the same games often if nothing is changed...
What is interesting is to look at each column (not so easy as formatted by HTML). Each column represents the same starting position and color. You might expect for each column to have the same character/result since the two programs, the time control, the starting position and starting color are identical within a column. But expectations and actual results don't match up. Some columns are pretty deterministic, although finding any column with identical results from top to bottom is rare. Then compare any two rows to see how often a match produces the same or very close to the same results game by game.
Variance is for real, and it is a huge problem.
Code:
01:  -+-=-++=-+=++-=-+-+=+-+--=-+=--++=-++--=-+=-===+-+===+=++-=++--+-+--+--+=--++---
02:  -+=+=++-=+--+-=-+-+==-+----+=-=++--+=--+--+---==-+-==+=+---++--+--+-=--+-=--+---
03:  -+--=++==+-++-+-+-+-+-==-+-++--+=---+--+-++--+-+===+=+-=+-==+==+---=---++--++=+-
04:  -+=-=+=--=-++-+-+-+=+-+--+-+-==++=-+==-+--+=-+-+-+=+=--+---==--+-++----++-=++---
05:  -+=--+=--+=-+-+---=-+-+----++=-++=-+--=+=-+--+-+-+=+=+-+===++--+-+-==--+----+-+-
06:  -+---=+=-+-+====+-+-+-+-=+-+==-=+--++--==++-=+---+=+=+-+--=++--+=-+-=--+=-----=-
07:  -+-+--+--+=-+==-=-+-+-+--+=-=-=++-=++=-+-=+--+-+-+=+-+-++-=++-=+--=-=--++---+=+-
08:  -+-+-=+--+-==-=-+=+=+=+--+-==-=+=--++----++--+=--+=--+-=+-----=+-++==--++=--+--=
09:  -+-+-=+--+-+==--+-+-+-+--==----++==++=-+=++--+=+-=-=-+-+=--++-=+--=-+=---=-+--+-
10:  -+=+-++--+-++-==+-+-+-==-+-+=-=++-=++-=+==+--+-=-+-==+==+--++--+=++==--+----+-+=
11:  -+---====+-++=--==+---+--+-+----+--+=-=+-=+--+=+-+-+-+-==--=+--+=++-+--++=-++-==
12:  -+=+=++--+-++-+-=-+-+-+=-+-+-=--+--+=--+=-+==+-=-+=--+-=+-==+-=+-++-=--++---+---
13:  -+-+--+==+--==+-+-+-=-+=-+-+=--+-=-++-=--++--+===--+=+=+---++-=+-=+-----+---+---
14:  -+-=-++--+-+==+-=-+=+-+=-+=+=--++=-++--+==+--==-===+-+=-+-=-+-=+--+-=-=++--+=-+-
15:  -+-+-++--+-=+==-+-+-+-+--+-+=---+--++-=+-++---=+-+==-+-+---==-=+--=-+--+=--++-+=
16:  -+=+=++=-+=++=+=--+-+=+--+=+=---+--++--+=-+==+==-+-+=+-++---+-=+-=+-==-++===+-+-
17:  -=-+-=+--+-++==-+-+-+-+--+==+-=++--++--+==+-=+-==+=--+-++--++-=+--+----++---+-+-
18:  -+==-++--+=====-+=+-+-+--+-++--++=-+=--+-++--+=+-+=+-+=++=-++-=+==+==--+--=++---
19:  =+-+=+-=-=-=+-+---+-+-+--+-++--++=-++-==-++-=+-+++=-=+-++--=+--+-++-==-++--++-=-
20:  -+---++-=+=-+-+=+-=-+-+--+-+=--++=-+=-=+-++--+=+=+--=+-++--++--+-=+-+--++--++-+-
21:  -+=-=++--+=+=-+=+-+-+-+----=--=++=-=---+-++---=+-+=+-+=+--=-+--+--+=+--=+---=---
22:  -+-=-=+-==--+=+-=-+-+=+----+=-=++-=++====++--+---+===+=++--++--+--+---=++---+-=-
23:  -+-=-++=-+-+=---=-=-+--=-+=+-===-=-+=--==++--+=+=+=+=+=++--++-=+--+-----+-----+-
24:  -+-+-+-=-+-++-+-+---+-+--+-=----+--++--+-=+--=-====+==-++--++-=--=+--=-++--++---
25:  =+---+==-+-+==+-=-+-+-+=++-+=--++--=+--+==+--+==-+-===-++--++=-+-==--==++=--+-+-
26:  -+=+-++--+-=+=+=+-=-+=+--+-+-=-=+-=++--=-++--+-+-+=-=+-++---+-=+-=+-=--=+=-++==-
27:  -+-+-++--=-+====+-=-+-==-+=-+---+--+=--=-=+===-+-=-+-+-++--=+--+-++-=--++=-++-+=
28:  =+---+---+-++-+-+-+-+----=-+===++--++-=+==+--+---+-+-+-+=-==+-=+--+==-=+=---+-+-
29:  =+-+-++-===+--+=+-+-+-=--+=+---++--+=-==-++-=+---+=+=+--+-=++--+-++==--++-=++--=
30:  -+=+-++-=+-=+-+=+-=-+-+--+-+---++--++-=+=++-==-+=--+=+=++--++----+=-+--+---++--=
31:  -+=+==+--+-++=+-+-+=+-+=-+-++=-++--+=--+=-+--+-+=--+=+-+--=++--+--+-=--++=--+---
32:  -+-=--+-=+-+-=-=+-+-+-+--+-+=--++=-+=-=+-++=-==--+-+=--++--++--+-++-+-=++=-++-+-
32 distinct runs (2560 games) found
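For anyone who wants to check the rows and columns mechanically rather than by eye, here is a minimal Python sketch. The helper names are made up for illustration, and the two example strings are only the first ten characters of rows 01 and 02 above; paste in the full 80-character rows to analyse the real data.

```python
def match_score(row):
    """Points scored in one match: + is a win, = is a draw, - is a loss."""
    return row.count("+") + 0.5 * row.count("=")

def column_agreement(rows):
    """Fraction of columns (same position and color) with identical results in every match."""
    same = sum(1 for col in zip(*rows) if len(set(col)) == 1)
    return same / len(rows[0])

# Truncated excerpts of rows 01 and 02, used here only as a small example.
rows = [
    "-+-=-++=-+",
    "-+=+=++-=+",
]

for i, row in enumerate(rows, start=1):
    print(i, match_score(row), "out of", len(row))
print("columns with identical results:", column_agreement(rows))
```

The spread of `match_score` across the 32 rows shows how much a single 80-game match can move, and `column_agreement` shows how few positions are actually replayed with the same outcome every time.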
- 
				hgm  
- Posts: 28395
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: time for some real data
Yes, and now I would be interested to see a similar picture where you did not give both programs equal time, but instead an equal number of nodes. (Or, if that setting is not supported, an equal search depth.)
Only then we could conclude if this makes the variability better or worse...
- 
				bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: time for some real data
If you give them a specific node count, there is _zero_ variance in the results, as expected, because there is no other source of randomness under that condition. But if you play a match with node count = N, then a match with node count = N+1000, the results show the same scattering effect...
hgm wrote:Yes, and now I would be interested to see a similar picture where you did not give both programs equal time, but instead an equal number of nodes. (Or, if that setting is not supported, an equal search depth.)
Only then we could conclude if this makes the variability better or worse...
There is no way to do an exact node test, because one tiny change to either program moves you into a new random sample of result games...