non-overlapping error bars

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jswaff
Posts: 105
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

non-overlapping error bars

Post by jswaff »

I have two tests running - version A of a program vs. version B. Both tests are running the same builds of each version (the MD5 signatures of the binaries match) and are running with the same time control under identical conditions, with the exception that the sets of starting positions are different. All starting positions are from arasan-depth-20.pgn.

The tests are running on an older laptop that has two physical cores and four logical processors. The laptop is not doing anything else. Pondering is not enabled. I am using cutechess-cli, but not with the concurrency option.

After 5700 games each, the first process is reporting an Elo difference of 18.49 +/- 7.07. The second process is reporting an Elo difference of 2.13 +/- 7.13. That is completely surprising to me. What is the confidence level associated with those error bars?
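
For what it's worth, the margins cutechess-cli prints are 95% confidence intervals (as Alayan notes further down). A minimal sketch of the standard calculation behind such error bars, using hypothetical win/draw/loss counts rather than the actual test results:

```python
# Sketch of the usual Elo-difference error bar calculation.
# The W/D/L counts below are hypothetical, not from the actual test.
import math

wins, draws, losses = 2100, 1800, 1800   # hypothetical counts, n = 5700
n = wins + draws + losses

mu = (wins + 0.5 * draws) / n            # mean score per game
# Per-game variance of the score, from the trinomial outcome.
var = (wins * (1 - mu) ** 2 + draws * (0.5 - mu) ** 2 + losses * mu ** 2) / n
stderr = math.sqrt(var / n)              # standard error of the mean score

def to_elo(score):
    """Map an expected score in (0, 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# 95% confidence interval: +/- 1.96 standard errors on the score,
# pushed through the (nonlinear) Elo mapping.
print(f"Elo {to_elo(mu):+.2f}, 95% CI "
      f"{to_elo(mu - 1.96 * stderr):+.2f} .. {to_elo(mu + 1.96 * stderr):+.2f}")
```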

I am going to let it continue to see if perhaps the results will begin to converge at some point, or at least get to the point where the error bars overlap, but I am starting to think there is something amiss. A resource issue, perhaps? But I would think that running N concurrent games without pondering on a machine that has N cores should be OK?

Any insight would be appreciated.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: non-overlapping error bars

Post by brianr »

Not surprising to me. If you are using timed games (instead of fixed nodes), timing is not precise enough to prevent different moves from occasionally being selected, even with just one CPU searching. Of course, with more than one, variability increases a lot.
The different opening sets will also highlight different "holes" in the engine.
You can also do a "mirror" test, flipping the sides of each position; all evals should be equal, and if not, something is broken (unless the asymmetry is intentional, but I don't subscribe to that approach).
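
A sketch of such a mirror test using the python-chess library; the engine path, test position, and depth limit here are placeholder assumptions:

```python
# Mirror-test sketch: a color-symmetric evaluation should score a
# position and its mirror (board flipped, colors swapped) identically.
import chess
import chess.engine

ENGINE_PATH = "./myengine"   # placeholder path to the UCI engine

fens = [
    # placeholder test position (after 1.e4 e5 2.Nf3 Nc6)
    "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3",
]

with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
    for fen in fens:
        board = chess.Board(fen)
        mirrored = board.mirror()            # flip board, swap colors
        limit = chess.engine.Limit(depth=1)  # shallow search ~ static eval
        a = engine.analyse(board, limit)["score"].relative
        b = engine.analyse(mirrored, limit)["score"].relative
        # Scores are from the side to move, so they should be equal.
        if a.score(mate_score=100000) != b.score(mate_score=100000):
            print(f"asymmetry at {fen}: {a} vs {b}")
```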

I generally use the Ordo tool. There is a "manual" with a section on the math at the end.

I run a "sanity check" from time to time with fixed nodes per search. Tinker is still only one search thread. Those results should be extremely close to 50/50 if run for a while. If not, something in my testing methodology is broken. This has been a weak spot in my testing for 20 years.
jswaff
Posts: 105
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

Re: non-overlapping error bars

Post by jswaff »

Hi Brian!

I've run fixed-depth tests with results exactly 0.500, as expected. The tests I'm posting about are simply game in 5+0.25, so of course there will be some variability - that's expected. I have recently run tests with the same version and got a result that was close to, but not exactly, 0.500, which again is expected. The thing that surprises me isn't that there is variability, but that the error bars for two identical matches don't even overlap.

As an update: after 6600 games, process one is showing an Elo difference of +16.78 +/- 6.56, and process two an Elo difference of -0.68 +/- 6.63. If the error bars are to be trusted, that shouldn't be possible, or at least should be extremely unlikely. I'll brush up on the math, but I'm starting to suspect the difference in results is related to resource contention somehow.
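
A quick back-of-the-envelope check of how unlikely this discrepancy is, treating the two quoted margins as 95% half-widths of independent, normally distributed errors (an approximation):

```python
# If both matches measured the same underlying Elo difference, how
# surprising is a 17.5-Elo gap? Treat each quoted +/- margin as the
# 95% half-width of an independent normal error.
import math

elo1, margin1 = 16.78, 6.56   # process one
elo2, margin2 = -0.68, 6.63   # process two

se1, se2 = margin1 / 1.96, margin2 / 1.96   # 95% margin -> std error
z = (elo1 - elo2) / math.sqrt(se1 ** 2 + se2 ** 2)
p = math.erfc(abs(z) / math.sqrt(2.0))      # two-sided p-value
print(f"z = {z:.2f}, p = {p:.4f}")          # about z = 3.67, p = 0.0002
```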

I'm going to let it continue to run just to see how it ends up after 20k games per "process", then repeat the experiment running one match at a time. Unfortunately that is going to take several days, but I feel like I need to get to the bottom of this to have any confidence going forward.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: non-overlapping error bars

Post by Sven »

The explanation is simple: the first set of starting positions favors A slightly more than the second does. Getting fully reproducible results would require using 100% identical openings.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
jswaff
Posts: 105
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

Re: non-overlapping error bars

Post by jswaff »

+15 Elo from one half of a set of positions to the other half seems like a lot, which is why I didn't consider it, but maybe it is that simple. As I replied to Brian, I'll just have to repeat the experiment on a single processor to see whether the results hold.
nionita
Posts: 175
Joined: Fri Oct 22, 2010 9:47 pm
Location: Austria

Re: non-overlapping error bars

Post by nionita »

jswaff wrote: Tue Aug 04, 2020 8:10 pm with the exception that the sets of starting positions are different.
But what is the reason to test with different sets of positions? I do not see any utility in this kind of test, unless you are testing with endgame positions and one version is supposed to have better knowledge of specific endgames, in which case I would expect to see different Elo numbers.

Nicu
Alayan
Posts: 550
Joined: Tue Nov 19, 2019 8:48 pm
Full name: Alayan Feh

Re: non-overlapping error bars

Post by Alayan »

95% error bars aren't very reliable.
jswaff
Posts: 105
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

Re: non-overlapping error bars

Post by jswaff »

That's a fair question, and I don't really have an answer other than I thought more positions == more variability == better?

The point has already been made to remove the different positions as a variable, and I agree. I had assumed that the match score using the first 10k positions in the test set would be about equal to the match score using the latter 10k, but maybe that was not a good assumption. I'll have to do more testing to find out. I hope it is that simple.
Alayan
Posts: 550
Joined: Tue Nov 19, 2019 8:48 pm
Full name: Alayan Feh

Re: non-overlapping error bars

Post by Alayan »

It's not a bad thing to test with more positions, but you can't expect identical results from different sets.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: non-overlapping error bars

Post by hgm »

Note that the error bars quoted by rating-extraction software only reflect the statistical error in the game results. To that you have to add the errors due to the selection of start positions, the selection of opponents, and the variability of the hardware; being independent, these sources combine as the square root of the sum of their squares.

It is interesting, but little investigated, how large the variability due to position selection and opponent selection is. We can think of a set of opponents as a random sample from a larger pool of all engines, and if not all engines would produce the same result (in the limit of infinitely many games and positions), the random sampling would cause a statistical error in the result. Likewise for selecting positions.
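
A toy Monte Carlo of the position-selection effect; the per-opening spread assumed below (expected score 0.53 with standard deviation 0.05) is invented purely for illustration:

```python
# How much does the measured score move just because of which openings
# were drawn? Assume every opening has its own "true" expected score
# for version A; the 0.53 / 0.05 figures are illustrative assumptions.
import random

random.seed(1)
POOL, SAMPLE, TRIALS = 20000, 1000, 2000

# Hypothetical per-opening expected scores for version A.
pool = [random.gauss(0.53, 0.05) for _ in range(POOL)]
true_mean = sum(pool) / POOL

means = []
for _ in range(TRIALS):
    subset = random.sample(pool, SAMPLE)   # openings used in one test
    means.append(sum(subset) / SAMPLE)

# Spread of the test's expected score caused purely by opening
# selection, before any game-result noise is added on top.
sd = (sum((m - true_mean) ** 2 for m in means) / TRIALS) ** 0.5
print(f"std dev from opening selection alone: {sd:.4f} "
      f"(~{695 * sd:.1f} Elo near a 50% score)")
```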

As for hardware effects: on most OSes the paging algorithm is not 'cache proof'. One process could get assigned a set of memory pages that happen to map mostly into the same part of the cache, making it behave as if it had a much smaller cache. If you don't restart the engine processes for every game, this effect persists for the entire test, so there is also a statistical error that goes down with the number of restarts. Simple restarts might not even be enough to reduce this error, as the OS might decide to always give the new process the memory that was just released by its previous instance (into which it would of course fit perfectly). So you would have to randomize the order in which you restart the engines, so that there cannot be any systematics in which engine gets which memory allocation.