An objective test process for the rest of us?
Moderator: Ras
-
- Posts: 28359
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: An objective test process for the rest of us?
After some more thinking, this actually seems a very plausible scenario:

Uri Blass wrote: I think that the most logical explanation for Bob's results is that the events are not independent (for example, maybe one engine was slowed down during a match of 80 games by a significant factor, and it did not happen in one game but in many games).
Some engines might be very sensitive to the mapping of main memory to cache, and lose significant NPS when this mapping is unfavorable, causing collisions between frequently used data structures.
So if one does the mini-matches by starting the engine up once and leaving it loaded in memory during the mini-match, using 'new' commands to start new games, in some mini-matches you might be very unlucky in the memory (page) allocation the OS gives you, as most operating systems seem to be rather dumb in making this allocation. Even if your own engine were immune to this effect, the opponent might not be.
So in that case there could be a strong correlation between games within a run, as some runs might be played with an engine or an opponent that is slowed down by perhaps as much as 20%.
This might even happen when you restart the engine process for each game, as the OS might use a FIFO system for allocating memory pages and would simply assign you the same pages that the previous incarnation of the engine just freed. In this case the correlation might even extend beyond the mini-matches (i.e., beyond restarts of the engine process).
It would be interesting to analyze the auto-correlation function of the data, to see how the correlation between individual games (which must exist to push the variance above the theoretical maximum for uncorrelated results) depends on the distance between the games. Would it stay constant during a mini-match and disappear between games belonging to different mini-matches? Would it decay with the number of intervening games, irrespective of the mini-match to which they belong? Or would it decay with the time elapsed between the games (e.g., the number of moves)? If the test results were generated on a cluster, it would of course be essential to know which games were generated on which machine.
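A minimal sketch of such an autocorrelation check, assuming the per-game results have been flattened into one chronological sequence scored 1/0.5/0 (the probabilities, sequence length, and lag range below are purely illustrative):

Code:

import numpy as np

def autocorrelation(results, max_lag):
    # Sample autocorrelation of a sequence of game scores (1, 0.5, 0).
    x = np.asarray(results, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x) / len(x)
    return [np.dot(x[:-k], x[k:]) / ((len(x) - k) * var)
            for k in range(1, max_lag + 1)]

# Illustrative: 5120 games in the order they were played. Under
# independence every lag should stay within roughly 2/sqrt(N) of zero;
# a slow decay instead would point at the effects described above.
rng = np.random.default_rng(1)
games = rng.choice([1.0, 0.5, 0.0], size=5120, p=[0.35, 0.22, 0.43])
for lag, r in enumerate(autocorrelation(games, 5), start=1):
    print(f"lag {lag}: r = {r:+.3f}")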
Depending on the situation, one could design strategies to eliminate this artifact (e.g., by restarting the engine process at critical times, or by closely monitoring the NPS and taking action when it falls out of range).
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An objective test process for the rest of us?
Do you ever read? I believe I answered that question _exactly_. I don't recall why I picked that 32-match set, but the four 80-game matches I gave were the first 4 played. As I said. Now as to whether the complete set of 32 matches (I run dozens of such 32-match tests) was one that started off particularly badly or not, I don't remember. But I didn't just choose the 4 worst cases. If you think about it, you could tell that from the data. How could the average be close to zero with those big negative results thrown in, unless there were some positive scores as well?

hgm wrote: Oh well, everyone is entitled to his personal opinion... Usually personal opinions are very revealing.

bob wrote: Personally, I find that particular comment absolutely beyond stupid. Asinine, ignorant, incompetent all come to mind to describe it, in fact. Feel free to make up what you want. In fact, based on that comment, feel free to continue this discussion with yourself. I certainly have better things to do than deal with that level of ignorance...
This one, for instance, reveals a lot about your scientific abilities and statistical skills. You give an example in a scientific discussion that represents a one-in-a-million event (4 sigma, plus a second rather large deviation), so it would certainly be very important to know if this is a hypothetical case, selected actual data, or randomly chosen actual data.
Just stop and think for a moment before making ridiculous statements...
That you 'crash through the ice' if someone asks you whether you used a hypothetical example is remarkable. I used a hypothetical example before, about the children's heights. I don't consider that a crime of any sort in a scientific discussion, and your wording certainly did not exclude it. You must have severe psychological problems on this subject to react like this.
You didn't ask if I used a hypothetical example. And the question was not necessary, since I had already stated where the data came from. You asked "are you making this stuff up?" which is a _far_ different question. And, in fact, it is more accusation than question. I am not interested in wasting time on that kind of conversation. Maybe one day you will produce your own data, rather than telling me how mine should look.
That I can't address, other than to say specifically that those were the first four 80-game matches played in a set of either 32 or 64. I don't keep the data because it would be overwhelming to keep up with. I have hundreds of summaries that look like this:
Normally, since the data is so different in character from the 32 traces you posted before, I would assume this is highly selected data, and that you could not repeat such an extreme deviation, or anything near it, in the next 100 mini-matches (although your reaction sets me thinking). That would make it a very _misleading_ example.
Code:
=============== glaurung results ->
64 distinct runs (5120 games) found
win/ draw/ lose (score)
1: 30/ 25/ 25 ( 5)
2: 28/ 23/ 29 ( -1) 29/ 24/ 27 ( 2)
3: 30/ 13/ 37 ( -7)
4: 29/ 19/ 32 ( -3) 29/ 16/ 34 ( -5) 29/ 20/ 30 ( -1)
5: 37/ 17/ 26 ( 11)
6: 26/ 20/ 34 ( -8) 31/ 18/ 30 ( 1)
7: 32/ 18/ 30 ( 2)
8: 30/ 20/ 30 ( 0) 31/ 19/ 30 ( 1) 31/ 18/ 30 ( 1) 30/ 19/ 30 ( 0)
9: 27/ 17/ 36 ( -9)
10: 28/ 20/ 32 ( -4) 27/ 18/ 34 ( -6)
11: 41/ 15/ 24 ( 17)
12: 33/ 21/ 26 ( 7) 37/ 18/ 25 ( 12) 32/ 18/ 29 ( 2)
13: 34/ 14/ 32 ( 2)
14: 25/ 22/ 33 ( -8) 29/ 18/ 32 ( -3)
15: 39/ 10/ 31 ( 8)
16: 32/ 19/ 29 ( 3) 35/ 14/ 30 ( 5) 32/ 16/ 31 ( 1) 32/ 17/ 30 ( 2) 31/ 18/ 30 ( 0)
17: 22/ 20/ 38 (-16)
18: 34/ 20/ 26 ( 8) 28/ 20/ 32 ( -4)
19: 34/ 20/ 26 ( 8)
20: 34/ 10/ 36 ( -2) 34/ 15/ 31 ( 3) 31/ 17/ 31 ( 0)
21: 24/ 23/ 33 ( -9)
22: 37/ 18/ 25 ( 12) 30/ 20/ 29 ( 1)
23: 32/ 17/ 31 ( 1)
24: 29/ 24/ 27 ( 2) 30/ 20/ 29 ( 1) 30/ 20/ 29 ( 1) 30/ 19/ 30 ( 0)
25: 26/ 14/ 40 (-14)
26: 35/ 16/ 29 ( 6) 30/ 15/ 34 ( -4)
27: 37/ 16/ 27 ( 10)
28: 32/ 19/ 29 ( 3) 34/ 17/ 28 ( 6) 32/ 16/ 31 ( 1)
29: 40/ 18/ 22 ( 18)
30: 30/ 20/ 30 ( 0) 35/ 19/ 26 ( 9)
31: 30/ 18/ 32 ( -2)
32: 31/ 21/ 28 ( 3) 30/ 19/ 30 ( 0) 32/ 19/ 28 ( 4) 32/ 17/ 29 ( 3) 31/ 18/ 29 ( 1) 31/ 18/ 30 ( 1)
33: 33/ 23/ 24 ( 9)
34: 38/ 14/ 28 ( 10) 35/ 18/ 26 ( 9)
35: 32/ 18/ 30 ( 2)
36: 25/ 18/ 37 (-12) 28/ 18/ 33 ( -5) 32/ 18/ 29 ( 2)
37: 25/ 23/ 32 ( -7)
38: 32/ 18/ 30 ( 2) 28/ 20/ 31 ( -2)
39: 36/ 15/ 29 ( 7)
40: 36/ 14/ 30 ( 6) 36/ 14/ 29 ( 6) 32/ 17/ 30 ( 2) 32/ 17/ 30 ( 2)
41: 35/ 16/ 29 ( 6)
42: 34/ 10/ 36 ( -2) 34/ 13/ 32 ( 2)
43: 23/ 28/ 29 ( -6)
44: 33/ 15/ 32 ( 1) 28/ 21/ 30 ( -2) 31/ 17/ 31 ( 0)
45: 30/ 19/ 31 ( -1)
46: 35/ 11/ 34 ( 1) 32/ 15/ 32 ( 0)
47: 28/ 21/ 31 ( -3)
48: 30/ 21/ 29 ( 1) 29/ 21/ 30 ( -1) 30/ 18/ 31 ( 0) 31/ 17/ 31 ( 0) 31/ 17/ 30 ( 0)
49: 29/ 17/ 34 ( -5)
50: 30/ 21/ 29 ( 1) 29/ 19/ 31 ( -2)
51: 36/ 15/ 29 ( 7)
52: 36/ 14/ 30 ( 6) 36/ 14/ 29 ( 6) 32/ 16/ 30 ( 2)
53: 28/ 22/ 30 ( -2)
54: 30/ 23/ 27 ( 3) 29/ 22/ 28 ( 0)
55: 27/ 16/ 37 (-10)
56: 29/ 21/ 30 ( -1) 28/ 18/ 33 ( -5) 28/ 20/ 31 ( -2) 30/ 18/ 30 ( 0)
57: 26/ 21/ 33 ( -7)
58: 29/ 20/ 31 ( -2) 27/ 20/ 32 ( -4)
59: 32/ 17/ 31 ( 1)
60: 27/ 17/ 36 ( -9) 29/ 17/ 33 ( -4) 28/ 18/ 32 ( -4)
61: 29/ 18/ 33 ( -4)
62: 26/ 19/ 35 ( -9) 27/ 18/ 34 ( -6)
63: 29/ 19/ 32 ( -3)
64: 27/ 14/ 39 (-12) 28/ 16/ 35 ( -7) 27/ 17/ 34 ( -7) 28/ 18/ 33 ( -5) 29/ 18/ 32 ( -2) 30/ 18/ 31 ( -1)
That is a real run, comparing Crafty to Glaurung, nothing edited out, nothing added in. Do you see the same kind of variability I have been talking about all along? Results from +18 to -16? I have others that are worse. This was the _first_ in my file, so since you seem to think I "pick and choose", I just chose the first one that I saved. And no, I don't save them all, as there is just too much data.
So we can end this discussion with the conclusion that you not only failed as a scientist, presenting obviously faulty data or misleading people by presenting highly selected data as if it were typical, without warning, but also as a human being, for being unable to engage in polite discussion. Too bad, I had hoped I could learn something from you...
That is actually pretty funny. Did I accuse _you_ of making things up? Did I accuse _you_ of exaggerating? And _I_ can't engage in polite discussion?

You are _way_ out there, let me tell you. _WAY_ out there. With that kind of attitude, I doubt you can learn anything from anybody...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An objective test process for the rest of us?
That doesn't happen in my tests. For each game played, Crafty and the opponent run on the same box. I have "warning code" in Crafty that reports when my NPS drops. And it occasionally drops (for long periods of time) when someone decides to run something on a node without going through the job scheduler. Before I accept a match, I make certain that Crafty did not complain about a drop in NPS during any game in the match. This is a rare problem, but it does happen. Add one extra compute-bound process and my NPS will be cut in half. So that is not going on in my results...

Uri Blass wrote: Ignoring Bob's insulting comments:

hgm wrote: Oh well, everyone is entitled to his personal opinion... Usually personal opinions are very revealing.

bob wrote: Personally, I find that particular comment absolutely beyond stupid. Asinine, ignorant, incompetent all come to mind to describe it, in fact. Feel free to make up what you want. In fact, based on that comment, feel free to continue this discussion with yourself. I certainly have better things to do than deal with that level of ignorance...
This one, for instance, reveals a lot about your scientific abilities, and statistical skills. You give an example in a scientific discussion that represents a one-in-a-million event (4 sigma, plus a second rather large deviation), so it would certainly be very important to know if this is a hypothetical case, selected actual data, or randomly chosen actual data.
That you 'crash through the ice' if someone asks you whether you used a hypothetical example is remarkable. I used a hypothetical example before, about the children's heights. I don't consider that a crime of any sort in a scientific discussion, and your wording certainly did not exclude it. You must have severe psychological problems on this subject to react like this.
Normally, since the data is so different in character from the 32 traces you posted before, I would assume this is highly selected data, and that you could not repeat such an extreme deviation, or anything near it, in the next 100 mini-matches (although your reaction sets me thinking). That would make it a very _misleading_ example.
So we can end this discussion with the conclusion that you not only failed as a scientist, presenting obviously faulty data or misleading people by presenting highly selected data as if it were typical, without warning, but also as a human being, for being unable to engage in polite discussion. Too bad, I had hoped I could learn something from you...
Bob explained in his last post that he used real data, and that he used the first matches and did not look for the worst case.
I think that the most logical explanation for Bob's results is that the events are not independent (for example, maybe one engine was slowed down during a match of 80 games by a significant factor, and it did not happen in one game but in many games).
Uri
I think part of it is that there is simply a far larger random component to playing games than anyone is giving credit for. For example, Zappa's undefeated string at the WCCC. It probably won't happen again. Zappa was good, but not _that_ good. Random events do happen, and sometimes they all land on the same side of the curve rather than being equally distributed.
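The kind of NPS sanity check described above can also be done after the fact from per-game logs. A minimal sketch, assuming an average NPS figure is available for each game (the numbers and the 20% tolerance are made up for illustration):

Code:

import statistics

def flag_slow_games(nps_per_game, tolerance=0.20):
    # Flag games whose average NPS falls more than `tolerance` below the
    # match median, which would suggest outside interference on the node.
    floor = (1.0 - tolerance) * statistics.median(nps_per_game)
    return [i for i, nps in enumerate(nps_per_game, start=1) if nps < floor]

# Illustrative 8-game match; game 5 lost roughly half its speed.
nps = [2.01e6, 1.98e6, 2.03e6, 1.99e6, 1.02e6, 2.00e6, 1.97e6, 2.02e6]
slow = flag_slow_games(nps)
if slow:
    print("discard this match; slow games:", slow)  # prints [5]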
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An objective test process for the rest of us?
Already answered. It was an unknown run, one that I had just finished. It was discarded when it showed that a change was for the worse. I just posted another complete 64-match set that was the _first_ one in my large summary file. See if you can now find fault in that random selection approach...

hgm wrote: Yes, that could certainly be an explanation (falling under the header "faulty measurement").

Uri Blass wrote: Ignoring Bob's insulting comments:
Bob explained in his last post that he used real data, and that he used the first matches and did not look for the worst case.
I think that the most logical explanation for Bob's results is that the events are not independent (for example, maybe one engine was slowed down during a match of 80 games by a significant factor, and it did not happen in one game but in many games).
But I don't think Bob's reply was completely unambiguous concerning the amount of selection involved: these were the first 4 mini-matches of a longer run, sure. But was the run selected, or was it the first run he ever did, or was it randomly selected from all the runs he ever did?
Note that in the other thread he gave another run that also started with mini-matches numbered 1, 2, 3, ..., and that data actually looked unsuspect from a statistical point of view. This still prompts the question of why he is posting these four particular initial traces, rather than the first four of the data set in the other thread.
That is really a marvelous scientific mind you have there. "Looks unsuspect". Have you never seen long runs of random data that appeared to be non-random? Do you look at large volumes of data at all? Didn't think so...
1: _always_ means "first run". 2: _always_ means "second run". Every time. At least in the data I present.
If the claim is that variance like this is typical, i.e., if randomly selected mini-matches between the same engine versions would actually have a result _distribution_ (given as a histogram) whose variance puts most of the events outside the theoretical maximum range that independent games (within the mini-match) would allow, it would suggest that effects like the ones you mention interfered with the measurements all the time. Or it could be that, say, 90% of the mini-matches are distributed normally, but the average is spoiled by 10% of perturbed measurements that fall in a very wide, non-integrable tail.
Either way, to diagnose the problem it would be necessary to see the complete result distribution for a typical run of 5K mini-matches. And even then it would be important to know whether the extreme samples occur randomly in the sequence or typically cluster near the start of the run. In the absence of this, we will not be able to speculate very accurately as to what exactly causes the problem.
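The null hypothesis here is easy to simulate. A sketch that generates 5,000 mini-matches of 80 independent games and compares the observed spread of the win-minus-loss scores with what independence predicts (the 40/30/30 probabilities are illustrative):

Code:

import numpy as np

rng = np.random.default_rng(7)
p_win, p_draw, p_loss = 0.40, 0.30, 0.30          # illustrative
n_matches, n_games = 5000, 80

# Each game contributes +1 (win), 0 (draw) or -1 (loss) to the score.
outcomes = rng.choice([1, 0, -1], size=(n_matches, n_games),
                      p=[p_win, p_draw, p_loss])
scores = outcomes.sum(axis=1)

# Independence predicts sd = sqrt(n * (p_win + p_loss - (p_win - p_loss)**2)).
predicted = np.sqrt(n_games * (p_win + p_loss - (p_win - p_loss) ** 2))
print(f"observed sd {scores.std():.2f}, predicted sd {predicted:.2f}")

# Crude histogram of the result distribution asked for above.
for lo in range(-20, 37, 4):
    n = int(np.sum((scores >= lo) & (scores < lo + 4)))
    print(f"{lo:+3d}..{lo+3:+3d}: {'#' * (n // 50)}")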
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An objective test process for the rest of us?
Err... perhaps because those were the first ones I had handy at that point, and by the time I posted the others, those were the ones I had handy at _that_ point.

nczempin wrote: Yes, I believe you are right.

hgm wrote: This still prompts the question of why he is posting these four particular initial traces, rather than the first four of the data set in the other thread.
Note that by now, I have played over 100,000 of these 80-game matches. I don't keep them all. If a change shows up as "worse", I dump the change, and I dump the match results (64 x 80 games x 4 opponents) that showed it was bad, since the results are now meaningless.
Why not stop and think about what is going on before making comments? Have you ever tried to save 8 million game results, most of which are worthless because they showed that your recent addition was bad? What would you do with 8 million games where most are worthless?
Just stop and think about it for a minute...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An objective test process for the rest of us?
I'm running Linux. Once a program is loaded into memory, it remains loaded in memory for the duration, because I wrote the software that runs the matches. I also monitor the load on a node to make sure no rogue process comes in and causes a problem by dropping the NPS (although the NPS would drop equally for both programs, since they share a CPU with pondering disabled). The randomness comes specifically from the timing issues I have already mentioned. Nothing else.

hgm wrote: After some more thinking, this actually seems a very plausible scenario:

Uri Blass wrote: I think that the most logical explanation for Bob's results is that the events are not independent (for example, maybe one engine was slowed down during a match of 80 games by a significant factor, and it did not happen in one game but in many games).
Some engines might be very sensitive to the mapping of main memory to cache, and lose significant NPS when this mapping is unfavorable, causing collisions between frequently used data structures.
So if one does the mini-matches by starting the engine up once and leaving it loaded in memory during the mini-match, using 'new' commands to start new games, in some mini-matches you might be very unlucky in the memory (page) allocation the OS gives you, as most operating systems seem to be rather dumb in making this allocation. Even if your own engine were immune to this effect, the opponent might not be.
So in that case there could be a strong correlation between games within a run, as some runs might be played with an engine or an opponent that is slowed down by perhaps as much as 20%.
This might even happen when you restart the engine process for each game, as the OS might use a FIFO system for allocating memory pages and would simply assign you the same pages that the previous incarnation of the engine just freed. In this case the correlation might even extend beyond the mini-matches (i.e., beyond restarts of the engine process).
It would be interesting to analyze the auto-correlation function of the data, to see how the correlation between individual games (which must exist to push the variance above the theoretical maximum for uncorrelated results) depends on the distance between the games. Would it stay constant during a mini-match and disappear between games belonging to different mini-matches? Would it decay with the number of intervening games, irrespective of the mini-match to which they belong? Or would it decay with the time elapsed between the games (e.g., the number of moves)? If the test results were generated on a cluster, it would of course be essential to know which games were generated on which machine.
Depending on the situation, one could design strategies to eliminate this artifact (e.g., by restarting the engine process at critical times, or by closely monitoring the NPS and taking action when it falls out of range).
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An objective test process for the rest of us?
Give me a break. What other interpretation can one make of "four sequential runs"??? Four means 4, or 1+1+1+1. Sequential means they follow each other in succession, with no gaps, omissions, or additions.

hgm wrote: Well, from the phrasing "When 4 sequencial game runs produce" it was not clear to me whether this was actual or hypothetical data. For real data I would have expected something more like "This morning's results, for instance, started with the 4 following traces". Taking the unlikeliness of the presented data into account, I consider this a quite normal question, and I would ask it again of anyone who reported once-in-a-million events. People who cannot handle critical questions emotionally do not belong in the scientific arena!

nczempin wrote: I do think that hgm's hypothesis about you making it up was perhaps a little uncalled for; it should have been clear that you are quoting from your results.
Actually, the explanation that this had been merely a hypothetical example would have been the least worrisome of all the possibilities I mentioned. There is nothing wrong with using hypothetical examples to illustrate a scientific point; I just did so with the children-measurement sampling, where I completely made up the quoted population variances and averages. But presenting highly selected data without cautioning the reader is so misleading that I would consider it a scientific crime, as would be the uncritical publishing of obviously erroneous data.
I don't see how that could be any clearer than it was written. If I say "I flipped 4 heads in succession", do you interpret that in some twisted way also, or do you assume I flipped the coin 4 times and got 4 heads, all in one batch???
I know how _I_ interpret that.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An objective test process for the rest of us?
I am not claiming _EVERYBODY_ needs anything. I specifically stated that for the programs I tested, in the set {Crafty, Glaurung, Glaurung2, Fruit2, Arasan 9, Arasan 10, gnuchessx 5.something, and a couple of others I have not named}, this randomness is present when any two of them play. Several of those programs _are_ being tested by _everybody_. So in that case, _everybody_ does need more games. That is a pretty decent sample of good programs, and since any two of them have a strong random factor in the outcome, I believe there is a good chance that _most_ programs have this. Perhaps not the ones that use a primitive time allocation algorithm, but most...

nczempin wrote: Can we agree on this: one match of both sides of 40 positions is one match, one sample. So 320 games would be 4 samples. They are not 320 samples.

bob wrote: When 4 sequential 80 game runs produce these kinds of results:
1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)
It is hard to draw any conclusions about which represents the real world. I can tell you that after 32 such matches, the average score was -5. So 2 are pretty close, and 2 are so far off it isn't funny. Add 'em all up and they are still well away from the truth (-14 or so). But two very close, two way off. If you just do 80 games, which one do you get? For those runs, it's 50-50 whether you think the change is a lemon or think it is OK, assuming the original version was scoring, say, -8. Even 320 games doesn't tell me what I need to know. It is off on the wrong side and would convince me to toss a change that actually works. I can run the same version twice and conclude that the second version is worse, even though they are the same program.
So that's where this is coming from. I've not said that statistics do not work. They do. They just come with a large error term for small sample sizes. I didn't expect it. In fact, we tested like this (but using just 160-game matches, 4 games per position) for a good while before it became obvious it was wrong.
Also: you didn't expect that from 4 samples you would get such a high variance. Well, it would not have surprised me, especially given the situation that Crafty is in. But it is entirely possible that other engines will show lower variability, and that given experiments will show significance more quickly.
The whole point of all those statistical methodologies is to let you decide if you need more samples or not, whether you can make a decision with a certain confidence despite the low number of samples or not.
What you're saying is that your tests showed that you needed more samples. Fine. But the thing you are claiming after that, that everybody needs more samples, is not a valid conclusion, because not everybody is getting the same test results.
Also (again), remember that the variance you should be interested in is the theoretical variance of the underlying distribution. That can only be estimated, and, yes, the more games you use, the more accurate this estimate will get. And the number of samples you need before you can decide that the estimate is good enough is not a magic number; it depends on the actual situation.
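To make that concrete: treating each game as an independent draw from a fixed win/draw/loss distribution, the per-game variance, and hence the standard deviation of an 80-game match score, follows directly. A sketch (the probabilities are purely illustrative):

Code:

import math

def match_sd(p_win, p_draw, n_games=80):
    # Standard deviation of the total score of an n-game match,
    # scoring 1 / 0.5 / 0 per game and assuming independent games.
    mean = p_win + 0.5 * p_draw
    var = p_win + 0.25 * p_draw - mean ** 2
    return math.sqrt(n_games * var)

# For roughly even engines, a single 80-game match score wobbles by
# about +/- 2 sd around its true mean from sampling noise alone.
print(f"sd = {match_sd(0.35, 0.30):.2f} points out of 80")   # ~3.74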
Look: I agree that many times in computer chess tournaments, people tend to ascribe more significance to the results than is appropriate. But this fact seems to have turned you towards the other extreme; not everybody makes this mistake, certainly not hgm, and I hope I don't make it either.
I agree that short events don't prove conclusively which engine is the best in the world. But neither do the Olympic Games, the Super Bowl, or the World Series of Poker. Shouting this fact around, though, merely makes one sound like a sore loser.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: An objective test process for the rest of us?
As I have mentioned repeatedly, you can see what causes the variability by running this test:

jwes wrote: What they are saying is that the variances you are quoting are much higher than you would get if it were a stochastic process. E.g., if the probabilities of program A against Crafty are 40% wins, 30% draws, and 30% losses, and you wrote a program that randomly generated sequences of 100 trials with the above probabilities, you would not have nearly the differences between those sequences that you have been getting. This would strongly suggest problems with the experimental design.

bob wrote:
And suppose the first 100 games end up 80-20, and the second hundred (which you choose not to play) end up 20-80? Then what?
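jwes's proposed null-model program is only a few lines. A sketch using his 40% / 30% / 30% probabilities and 100-game sequences (the seed and the number of sequences are arbitrary):

Code:

import random

def minimatch(n=100, p_win=0.40, p_draw=0.30):
    # One sequence of n games at fixed probabilities; returns wins - losses.
    score = 0
    for _ in range(n):
        r = random.random()
        score += 1 if r < p_win else (0 if r < p_win + p_draw else -1)
    return score

random.seed(3)
# Independence predicts a mean of +10 and an sd of sqrt(100 * 0.69) ~ 8.3;
# results spreading far beyond that would support jwes's point.
print([minimatch() for _ in range(20)])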
Pick two programs that allow you to search until a specific number of nodes have been examined, and then stop the search at that point immediately.
Run these two programs allowing them to search some number of nodes, say 1,000,000. Now re-run the test but allow them to search 1,001,000 nodes and compare the results. Repeat until you are convinced that changing the node count limit by as little as 1000, or as much as you want, changes the results in a very significant way.
Now examine that node count difference: 10,000 nodes for the first test I ran. On the hardware in question, Crafty searches about 2M nodes per second (64-bit Xeon, dual-CPU, not dual-core). 10K nodes represents 10,000 / 2,000,000 of a second, or 1/200th of a second, i.e., 5 milliseconds. Can you time accurately to the nearest 5 milliseconds? Nope. Even if you check the time used after each and every node searched, the O/S has a built-in error that is larger than that. The PC clock ticks at just over 18 ms per tick, so on a PC the best you can get is an 18 ms accuracy. That is 18/1000 of 2,000,000, or 36,000 nodes of variance between two different runs.
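The granularity of the system clock on any particular box is easy to measure directly. A minimal sketch that reports the smallest step two successive reads of a clock can resolve (which clocks exist and how fine they are depends on the OS and the era):

Code:

import time

def smallest_step(clock, samples=1000):
    # Smallest nonzero difference between successive reads: the clock's
    # granularity for coarse clocks, mere call overhead for fine ones.
    steps = []
    for _ in range(samples):
        t0 = clock()
        t1 = clock()
        while t1 == t0:              # spin until the clock advances
            t1 = clock()
        steps.append(t1 - t0)
    return min(steps)

print(f"time.time():         {smallest_step(time.time) * 1e3:.4f} ms")
print(f"time.perf_counter(): {smallest_step(time.perf_counter) * 1e3:.6f} ms")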
That's the way it is, and the only solution would be to modify every program in the test to just finish the last iteration that was feasible and stop. And that reduces the sample size from all the potential games to just the set with that one specific node count for each search. Is that a good sample? I'm not convinced it is, because when you run the test I described above, 1K nodes changes the results significantly. If you used a fixed node count you would either get A or B, but not the sum of both. Is that a good representation of the total set of games we will see when we play in a real event? Probably not.
The thing that is wrong with the experimental setup is that I have never seen anyone notice or mention how such a small change in the nodes searched can produce such wildly varying results over 80-game matches. And when you then resort to normal time allocation, the node variance is much higher, producing even more different match results.
If you claim that is a fault of the setup, then feel free to suggest a solution. But the solution has to involve not modifying all the programs to search in a way that is different from how they normally work.
It's an interesting and serious problem. And again, I have never seen anyone mention what a small node count change can do to repeatability. I now know. And it was highly surprising, as all of our team and the local helpers on the faculty here can attest.