Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

syzygy · Post by **syzygy** » Sun Apr 16, 2023 4:13 pm

Modern Times wrote: ↑Sun Apr 16, 2023 12:48 pm
syzygy wrote: ↑Sat Apr 15, 2023 11:57 pm
2. When doing ultrabullet testing on such a CPU, should you run more parallel matches than there are physical cores (each match being between two single-threaded engines without pondering)?

This is probably not worth it. You get more games but of lower quality (since nps per engine will be much lower). If you increase the time control for each game to correct for the nps loss, you still get a bit more games than without hyperthreading, but there will be a lot more noise which decreases the statistical relevance of the results. Noise means you need many more games to get the same statistical significance. (And if your statistical model does not reflect this, you will have results that are less reliable than you think.)

Stefan Pohl does exactly this for all his Stockfish and other rating lists: he runs 20 threads on his 12-core Ryzens. Everyone seems happy with the results, I guess because of the large number of games he runs.

It would be interesting to see how the draw rate compares to 12 parallel matches on 12 cores.

hgm · Post by **hgm** » Sun Apr 16, 2023 10:16 pm

syzygy wrote: ↑Sun Apr 16, 2023 4:10 pm You are making assumptions that you don't know are true. The point of hyperthreading is to increase utilisation of execution units, not to serve the hyperthreads running on a core fairly.

This is a pretty safe bet. The CPU back end knows nothing about hyper-threading; this is purely a front-end matter (fetcher and decoder). Performance would suffer if it did not always schedule the oldest execution-ready micro-op in the re-order buffer, because leaving an old instruction unexecuted can stall the retire unit (and thus the entire pipeline). It wouldn't care at all which hyperthread put that micro-op in the re-order buffer. I think it is safe to assume that the designers are not stupid, and I would have to see proof of that before I believe it.

Noise increases, so draw rate goes down and statistical significance goes with it. You'll have to increase the number of games to get the same reliability.

Statistical significance goes up if draw rate goes down. Drawn games do not provide any info at all, it is like they were never played. The strength comparison comes purely from the ratio of wins and losses, and the more you have of those, the smaller the statistical error in it will be.

Last time I checked hyperthreading did not give 50% more speed.

That depends on the program. If a program uses a unique unit every cycle, there would be no speedup at all if you run two of those on one core.

syzygy · Post by **syzygy** » Mon Apr 17, 2023 5:11 am

hgm wrote: ↑Sun Apr 16, 2023 10:16 pm Statistical significance goes up if draw rate goes down. Drawn games do not provide any info at all, it is like they were never played. The strength comparison comes purely from the ratio of wins and losses, and the more you have of those, the smaller the statistical error in it will be.

No, win+loss is worse than 2xdraw.

My example was:

On your 6700K, A wins 10 games, draws 90 games. "Outcome" is 55-45.
On your 7950X, A wins 55 games, loses 45 games. "Outcome" is 55-45.

6700K: (wins - losses) / sqrt(wins + losses) = (10 - 0) / sqrt(10 + 0) = 10 / sqrt(10) ≈ 3.16.
7950X: (wins - losses) / sqrt(wins + losses) = (55 - 45) / sqrt(55 + 45) = 10 / sqrt(100) = 1.

So you get much more information from the 10 decided games on the 6700K than from the 100 decided games on the (noisy) 7950X.

In practice it won't be this extreme, but the point is that a lower draw rate as a result of noise decreases statistical significance. (And nothing else would make sense.)
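The formula in the example can be evaluated with a minimal Python sketch. It assumes only the simple normal approximation used above, z = (wins - losses) / sqrt(wins + losses); the function name `decisive_z` is illustrative and not taken from any testing tool.

```python
import math

def decisive_z(wins: int, losses: int) -> float:
    """Normal-approximation z-score computed from decided games only (draws ignored)."""
    decided = wins + losses
    if decided == 0:
        return 0.0  # no decided games, so no evidence either way
    return (wins - losses) / math.sqrt(decided)

# The two hypothetical 100-game matches from the example, both scoring 55-45.
print(f"6700K (+10 =90 -0):  z = {decisive_z(10, 0):.2f}")
print(f"7950X (+55 =0 -45):  z = {decisive_z(55, 45):.2f}")
```

Note that the comparison holds the final score fixed at 55-45 in both cases; it does not model how noise would actually change the win and loss counts.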

hgm · Post by **hgm** » Mon Apr 17, 2023 7:34 am

That is not a relevant comparison. Of course the result would not stay the same if you decrease draw rate by adding noise. The noisy result would be more like 60+/30=/10-.

syzygy · Post by **syzygy** » Tue Apr 18, 2023 1:41 am

hgm wrote: ↑Mon Apr 17, 2023 7:34 am That is not a relevant comparison. Of course the result would not stay the same if you decrease draw rate by adding noise. The noisy result would be more like 60+/30=/10-.

Well, I know you are the last person in the world willing to admit being wrong. You seem to believe the universe would explode if you do.

Going from LTC to STC effectively increases noise. I don't think the general experience is that the rating difference is affected by a lot, i.e. the outcome in terms of point difference is essentially the same. But of course the draw rate goes down and with the draw rate the statistical significance (at same number of games).

hgm wrote: Statistical significance goes up if draw rate goes down. Drawn games do not provide any info at all, it is like they were never played. The strength comparison comes purely from the ratio of wins and losses, and the more you have of those, the smaller the statistical error in it will be.

hgm · Post by **hgm** » Tue Apr 18, 2023 7:54 am

So when you run out of arguments, you think an ad hominem will prove your point?

The rating differences are affected a lot, and testers abundantly complained about this. Rating scales get compressed when the draw rate goes up. People have even proposed new rating systems (discounting draws) to combat this effect.

The LOS usually does not suffer much, because error bars get similarly compressed. The assumption that noise will turn all draws into wins and losses in an exact 50-50 ratio, even if one of the engines is far stronger, is just wrong.
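The compression of the rating scale by draws can be illustrated with a short Python sketch. It assumes the usual logistic Elo model, where the expected score is 1 / (1 + 10^(-d/400)); the helper name `elo_diff` is hypothetical, and the two results are the ones mentioned earlier in the thread (10 wins with 90 draws, and the suggested 60+/30=/10-).

```python
import math

def elo_diff(wins: int, draws: int, losses: int) -> float:
    """Elo difference implied by a match result under the logistic model."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if not 0.0 < score < 1.0:
        raise ValueError("Elo difference is undefined for a 0% or 100% score")
    return -400.0 * math.log10(1.0 / score - 1.0)

# Results given as (wins, draws, losses).
for result in [(10, 90, 0), (60, 30, 10)]:
    print(result, f"{elo_diff(*result):+.1f} Elo")
```

Whether the same pair of engines would really move from the first result to the second when the draw rate drops is exactly the point under dispute; the sketch only converts scores to Elo.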

syzygy · Post by **syzygy** » Tue Apr 18, 2023 11:38 pm

hgm wrote: ↑Tue Apr 18, 2023 7:54 am So when you run out of arguments, you think an ad hominem will prove your point?

Just a factual observation.
You started out with "wrong on several levels" and now you're just slowly retreating and changing position.

The LOS usually does not suffer much, because error bars get similarly compressed.

You were arguing that LOS went up...

The assumption that noise will turn all draws into wins and losses in an exact 50-50 ratio, even if one of the engines is far stronger, is just wrong.

The 10-0-90 example was extreme, as I wrote, but still there is no reason to expect that considerable noise favours the stronger engine.
In real tests the Elo difference will be very small, which means even small noise will have a considerable impact, and it will just lead to more wins AND losses. Again, win+loss is much worse than 2xdraw.

This is wrong:

hgm wrote: Statistical significance goes up if draw rate goes down. Drawn games do not provide any info at all, it is like they were never played. The strength comparison comes purely from the ratio of wins and losses, and the more you have of those, the smaller the statistical error in it will be.

A draw does not give information, but win and loss decreases information.
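The "win+loss is worse than 2xdraw" claim can be sketched in Python under the assumption that the match score stays the same while draws are replaced by wins and losses (the assumption disputed in this thread). The sketch treats each game as an independent trinomial outcome and computes the standard error of the mean score; the function name `score_stderr` is hypothetical.

```python
import math

def score_stderr(wins: int, draws: int, losses: int) -> tuple[float, float]:
    """Return (mean score per game, standard error of that mean)."""
    games = wins + draws + losses
    w, d, l = wins / games, draws / games, losses / games
    mean = w + 0.5 * d
    # Per-game variance of the score, which takes values 1, 0.5 or 0.
    var = w * (1.0 - mean) ** 2 + d * (0.5 - mean) ** 2 + l * (0.0 - mean) ** 2
    return mean, math.sqrt(var / games)

# Results given as (wins, draws, losses); both score 55%.
for result in [(10, 90, 0), (55, 0, 45)]:
    mean, se = score_stderr(*result)
    print(result, f"score = {mean:.3f}, stderr = {se:.4f}, (score - 0.5) / stderr = {(mean - 0.5) / se:.2f}")
```

At a fixed score, every pair of draws that becomes one win and one loss adds variance without moving the mean, so the error bar widens.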

hgm · Post by **hgm** » Wed Apr 19, 2023 7:43 am

You must misunderstand what I wrote, because I did not retreat from anything. You are still arguing from the totally unrealistic fiction that 'noise' (e.g. lower-quality games because of a faster TC) would replace draws by an equal number of wins and losses. But it won't if the engines are not equally strong. If one of the engines were really as much stronger as your oh-so-reliable 10/90/0 result suggests, resolving the draws would more likely result in 95/0/5, not in the 55/0/45 that you compare it with.
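The three results being compared can be put side by side with a small Python sketch. It assumes the commonly used normal-approximation formula for likelihood of superiority, LOS = 0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))), which ignores draws; the function name `los` is illustrative.

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decided games (normal approximation)."""
    decided = wins + losses
    if decided == 0:
        return 0.5  # no decided games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decided)))

# Results given as (wins, draws, losses), as in the post above.
for wins, draws, losses in [(10, 90, 0), (95, 0, 5), (55, 0, 45)]:
    print(f"{wins}/{draws}/{losses}: LOS = {los(wins, losses):.4f}")
```

Which of the two drawless results is the realistic counterpart of 10/90/0 is the open question; the sketch only shows how strongly the answer affects the LOS.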

And I did not say anything about LOS, other than that it is determined by the number of wins and losses, before the sentence you quote. This is just your imagination.

syzygy · Post by **syzygy** » Fri Apr 21, 2023 12:37 am

hgm wrote: ↑Wed Apr 19, 2023 7:43 am You must misunderstand what I wrote, because I did not retreat from anything. You are still arguing from the totally unrealistic fiction

What is totally unrealistic is that adding noise would help. Just plug in a random generator and you'll instantly get more accurate statistics. Right.

Both intuitively and in reality noise is unwelcome. That you are confused by the knowledge that "draws do not count" is just that, it confused you.

Some time ago you were suggesting that a 1 Elo difference could not be measured. Clearly it can be, with the right testing setup and enough games. But noise is not going to help there.

Jakob Progsch · Post by **Jakob Progsch** » Fri Apr 21, 2023 6:44 am

I wouldn't trust a situation where different engines share a physical CPU core. For two instances of the same engine it seems reasonable to assume the instruction slots will even out between them (although even there I'd want to do some experiments). But I don't think one can expect "fair" instruction scheduling for different engines that will have a different instruction mix. Also, the presence of wide vector instructions, for example, can affect boost behavior and result in behavior bleeding from one engine into the other.
