Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by hgm »

syzygy wrote: Fri Apr 21, 2023 12:37 am What is totally unrealistic is that adding noise would help. Just plug in a random generator and you'll instantly get more accurate statistics. Right.

Both intuitively and in reality noise is unwelcome. That you are confused by the knowledge that "draws do not count" is just that: it confused you.

Some time ago you were suggesting that a 1 Elo difference could not be measured. Clearly it can be, with the right testing setup and enough games. But noise is not going to help there.
The problem here is that your claims have nothing to do with 'reality'. You did not actually run an engine match and then add randomness to the engines, or play them at a much faster TC to increase their intrinsic evaluation noise, in order to compare the match results. You just pulled two totally arbitrary match results (10/90/0 and 55/0/45) out of your hat, for which the one with more draws happened to be statistically more significant. It is just as easy to doctor the comparison so that the result with fewer draws is more significant: just pull 10/90/0 and 100/0/0 out of the hat...

That the win/loss difference should be preserved on changing TC, rather than the win/loss ratio or some more subtle relation between the two, is just your conviction. Nothing more. And it seems to be based on nothing: no actual result, no underlying model... Only the flawed logic that what I say must be wrong because I am not willing to admit that it is.

As for 'intuitively', you can only speak for yourself. And your intuition might very well be flawed.

As a matter of fact there are many cases where noise improves a result, and this is common knowledge amongst testers. E.g. when playing from a suite of opening positions, randomly picking one per game or pair of games greatly improves the significance of the match compared to always picking the same start position (where the stronger engine might easily lose 100-0 by playing the same game 50 times).
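To make the opening-suite point concrete, here is a toy simulation with entirely made-up numbers (not real engine data): each opening gets its own win probability for the stronger engine, and we compare always replaying the single most unfavourable opening against sampling the whole suite at random.

```python
import random

random.seed(42)

# Hypothetical per-opening win probabilities for the stronger engine.
# Individual openings can be badly biased even if the suite is fair on average.
suite = [random.uniform(0.2, 0.8) for _ in range(50)]

def match_score(openings, games=100):
    # Score a match of 'games' games; each game uses an opening drawn
    # at random from the given list.
    return sum(random.random() < random.choice(openings) for _ in range(games))

fixed = match_score([min(suite)])  # always replay the worst opening
mixed = match_score(suite)         # pick a random opening every game
print(fixed, mixed)
```

With a fixed unlucky opening the stronger engine scores far below its true strength; random sampling recovers it.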

To give a more mathematically inclined example:

If engines playing a game are modelled by randomly blundering away some evaluation score compared to the theoretically best move, the score will perform a random walk along the score axis during the game, and after many moves the distribution of the evaluation of the game position will approach a normal (Gaussian) distribution. When the level of play is very high, the score blundered per move is on average low, and the width of this distribution will only increase slowly. When the initial score is close to equality (as it will be in the usual case of testing from balanced positions), the width might still be narrow compared to the threshold score you need to win the game. So although the score has diffused in the direction of a win for the stronger engine, you would have to be very far in the wings of the distribution to actually exceed the win threshold. The probability for this might be astronomically small, so that you would basically not observe any wins at all, just draws.
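This random-walk picture can be put into a small Monte-Carlo sketch (all parameters are invented for illustration): each move adds a small drift toward the stronger engine plus symmetric Gaussian blunder noise, and the final score is compared against fixed win/loss thresholds.

```python
import random

random.seed(0)

def play_match(games, moves, drift, blunder_sd, threshold):
    # The score performs a random walk: a per-move drift toward the
    # stronger engine plus symmetric Gaussian 'blunder' noise.
    wins = draws = losses = 0
    for _ in range(games):
        score = sum(drift + random.gauss(0.0, blunder_sd) for _ in range(moves))
        if score >= threshold:
            wins += 1
        elif score <= -threshold:
            losses += 1
        else:
            draws += 1
    return wins, draws, losses

quiet = play_match(2000, 100, 0.01, 0.05, 3.0)  # little noise: nearly all draws
noisy = play_match(2000, 100, 0.01, 0.15, 3.0)  # more noise: wins appear
print(quiet, noisy)
```

With little noise the drift alone never reaches the win threshold; tripling the noise produces many wins while losses stay rare.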

Adding noise, even by symmetrically perturbing the evaluation (e.g. by now and then forcing a randomly picked engine to play a poorer move than it would have picked on its own, thus adding or subtracting some evaluation score), would increase the width of the distribution, but not the drift towards the win threshold. A much larger part of the distribution can then stick out above the win threshold, resulting in many more wins, while the average final score is still so far from the loss threshold that losses remain negligible. (E.g. you could be 1 standard deviation away from the win threshold and 3 from the loss threshold, to get ~16% wins and 0.1% losses, while with half the standard deviation you would have been 2 and 6 standard deviations away from the respective thresholds, resulting in 2.2% wins and 0.0000001% losses.) And a 16/84/0 result would be far more significant than a 2/98/0 result. So the added noise really helps improve the significance of the result, in this example.
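The tail probabilities quoted above follow directly from the standard normal distribution; a minimal check using only the Python standard library:

```python
import math

def upper_tail(z):
    # P(X > z) for a standard normal X, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Wide distribution: mean 1 sd below the win threshold, 3 sd above the loss one.
p_win, p_loss = upper_tail(1), upper_tail(3)    # ~16% and ~0.1%
# Halving the standard deviation doubles both distances (in sd units).
p_win2, p_loss2 = upper_tail(2), upper_tail(6)  # ~2.2% and ~1e-9
print(p_win, p_loss, p_win2, p_loss2)
```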

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by hgm »

Jakob Progsch wrote: Fri Apr 21, 2023 6:44 am I wouldn't trust a situation where different engines share a physical cpu core. For two instances of the same engine it seems reasonable to assume the instruction slots will even out between them (although even there I'd want to do some experiments). But I don't think one can expect "fair" instruction scheduling for different engines that will have a different instruction mix. Also the presence of wide vector instructions for example can affect boost behavior and result in behavior bleeding from one engine into the other.
This actually is a valid argument. But there are many cases where the resulting bias is not harmful to your purpose.

For instance, I often run parallel matches with the same engine (self play), for comparing the balance of the start position. Then the effect does not occur at all.

When you are running gauntlets for several test versions of the same engine, which presumably have the same instruction mix (but perhaps different evaluation weights), it might be that some opponents manage to grab a larger fraction of the core they share with the engine under test than other opponents do. But that just mimics the effect of such an opponent being a bit stronger than it would be on separate hardware. As long as this bias is the same against all the different versions of your own engine, it won't affect the comparison of the latter.

Besides, you could test this in advance, by comparing how the nps of each engine changes under hyper-threading for the various opponents it shares a core with, and avoid using engines for which this is problematic.
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by syzygy »

hgm wrote: Fri Apr 21, 2023 9:42 am You just pulled two totally arbitrary match results (10/90/0 and 55/0/45) out of your hat, for which the one with more draws happened to be statistically more significant. It is just as easy to doctor the comparison such that the results with fewer draws is more significant: just pull 10/90/0 and 100/0/0 out of the hat...
You understand very well that the extreme numeric example was designed to illustrate the point that 2xdraw is good and win+loss is bad, as I even cared to spell out multiple times. But this was your last straw, so you have no option but to cling to it.

Not going to waste my time on fighting intellectual dishonesty.

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by hgm »

Well, now that I have proven with mathematical rigor that adding noise can increase the significance of the result, there wouldn't be much for you to say anyway, right? Babble about intuition, dishing out a few more ad hominems, repeating your erroneous beliefs like a mantra... That just doesn't cut it against hard math.

But you are right about one thing: you picked the 'extreme' case 10/90/0 to mislead the reader into thinking draws are good. Had you picked the more likely outcome 6/90/4 to compare with 55/0/45, the statistical significance of the 6-4 would of course be far worse than that of 55-45.
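Since draws carry no information about which engine is stronger, one quick way to compare the significance of W/D/L results is a sign test on the decisive games alone, z = (W − L)/√(W + L). This is a simple approximation (it ignores any modelling of the draw rate), but it makes the comparison concrete:

```python
import math

def z_score(wins, losses):
    # Sign test on decisive games only: under the null hypothesis of equal
    # strength each decisive game is a fair coin, so W - L has variance W + L.
    return (wins - losses) / math.sqrt(wins + losses)

print(z_score(6, 4))    # 6/90/4  -> ~0.63, not significant
print(z_score(55, 45))  # 55/0/45 -> 1.0, still weak
print(z_score(10, 0))   # 10/90/0 -> ~3.16, clearly significant
```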