Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by hgm »

mvanthoor wrote: Tue Apr 11, 2023 12:00 pm That is 48 mouse clicks for every tournament I want to run. Why would I want to run 16 instances of CuteChess? It can run games concurrently all by itself, and when running 16 threads, I can just put CuteChess in the background and keep using the computer.
Well, 48 mouse clicks take perhaps 20 seconds. I suppose the resulting match takes appreciably longer; otherwise using concurrency would be a completely pointless exercise. If you do that a thousand times, you will have spent about 6 hours on it. Whether that would be worth it to you to save a few thousand dollars on hardware depends on your typical hourly wage.

You would start 16 instances to be able to specify their affinity masks separately. Of course it would be better if the tournament manager you would be using would set the affinities all by itself in the way you specified (e.g. one or two games per core). The lesson here is that software optimized for the task you want to perform can save you a ton of performance compared to just grabbing the first thing that you happen to encounter that does something vaguely similar to what you want.

Investing a lot of money in some 50% over-capacity 'to keep using the computer' would be wasteful if in practice you use that over-capacity only a small fraction of the time. For the interactive things I do in parallel with heavy-duty background tasks (surfing the internet, entering text, compiling), 1 or 2 hyperthreads are typically sufficient, so reserving 16 for it ensures that 14 of those are almost always idle. The advantage of using a general task manager to control this is that you can alter the affinity settings any time you want, so you are not dependent on anything you specified when the match started. That is also one of the reasons I like to control the amount of concurrency dynamically: when I unexpectedly need to run a heavy-duty task, I can simply reduce the concurrency of the ongoing background tournament to free as much CPU power as I think I will need, and crank it back up later.
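The per-instance pinning described above can be sketched in a few lines of Python. This is a Linux-only sketch using os.sched_setaffinity(); it assumes the common numbering where logical CPU core+16 is the hyperthread sibling of physical core core (verify with lscpu on your own machine), and the cutechess-cli arguments in the usage comment are hypothetical:

```python
import os
import subprocess

def ht_pair(core: int, n_physical: int = 16) -> set[int]:
    """CPU set containing a physical core and its hyperthread sibling.

    Assumes the common Linux numbering where logical CPU core + n_physical
    is the second hyperthread of physical core `core`; check this against
    lscpu or /proc/cpuinfo on your own machine.
    """
    return {core, core + n_physical}

def launch_pinned(core: int, args: list[str]) -> subprocess.Popen:
    """Start one tournament-manager instance pinned to one physical core.

    Linux-only: os.sched_setaffinity() in the child restricts the whole
    process tree (both engines of the match) to the two sibling
    hyperthreads of that core.
    """
    cpus = ht_pair(core)
    return subprocess.Popen(
        args,
        preexec_fn=lambda: os.sched_setaffinity(0, cpus),
    )

# Hypothetical usage: 16 instances, one per physical core.
# for core in range(16):
#     launch_pinned(core, ["cutechess-cli", "-tournament", f"part{core}.json"])
```

This replaces the 48 mouse clicks with one loop, and the affinity is applied before the engines start, so no game ever runs unpinned.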
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by mvanthoor »

hgm wrote: Tue Apr 11, 2023 12:28 pm Well, 48 mouse clicks take perhaps 20 seconds. I suppose the resulting match takes appreciably longer; otherwise using concurrency would be a completely pointless exercise. If you do that a thousand times, you will have spent about 6 hours on it. Whether that would be worth it to you to save a few thousand dollars on hardware depends on your typical hourly wage.

You would start 16 instances to be able to specify their affinity masks separately. Of course it would be better if the tournament manager you would be using would set the affinities all by itself in the way you specified (e.g. one or two games per core). The lesson here is that software optimized for the task you want to perform can save you a ton of performance compared to just grabbing the first thing that you happen to encounter that does something vaguely similar to what you want.

Investing a lot of money in some 50% over-capacity 'to keep using the computer' would be wasteful if in practice you use that over-capacity only a small fraction of the time. For the interactive things I do in parallel with heavy-duty background tasks (surfing the internet, entering text, compiling), 1 or 2 hyperthreads are typically sufficient, so reserving 16 for it ensures that 14 of those are almost always idle. The advantage of using a general task manager to control this is that you can alter the affinity settings any time you want, so you are not dependent on anything you specified when the match started. That is also one of the reasons I like to control the amount of concurrency dynamically: when I unexpectedly need to run a heavy-duty task, I can simply reduce the concurrency of the ongoing background tournament to free as much CPU power as I think I will need, and crank it back up later.
While you're correct that you can achieve a lot with weaker hardware by administering it like this, I don't want to spend the time to manually hand-hold the computer. I did do things like that 15-20 years ago, when 1-4 cores were all you could get (I already had a two-CPU computer in 2004), but nowadays I just want to start a tournament and at the same time keep working on the engine, start a VM with Windows to edit pictures, or even play a game.

On the i7-6700K that wasn't possible; when running a tournament with 4 concurrent games, browsing the internet was all that was still possible. With the 7950X, I can basically do whatever I want with the computer even when a tournament is running. (And yes, I tested this... results don't seem to be affected.)
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by hgm »

Well, your original question was how one could use a concurrency larger than the number of physical cores without introducing unpredictable bias. So I described how I typically do that, which happens to involve only a minimal amount of extra work compared to what you typically have to do to create a new version of the engine and submit it to the test. Whether or not you would be interested in doing it was not really part of the question.
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by mvanthoor »

hgm wrote: Tue Apr 11, 2023 1:32 pm Well, your original question was how one could use a concurrency larger than the number of physical cores without introducing unpredictable bias. So I described how I typically do that, which happens to involve only a minimal amount of extra work compared to what you typically have to do to create a new version of the engine and submit it to the test. Whether or not you would be interested in doing it was not really part of the question.
As far as I understand you just start the tournament manager and then assign it to a single core. Thus every core always has two engines loaded simultaneously. In that case, wouldn't it be the same if one just started 32 concurrent games in CuteChess on a 16-core hyperthreaded CPU?

If so, I could speed up a gauntlet even more: by just starting 32 concurrent games, a gauntlet would be done in about one hour.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by hgm »

Starting 32 concurrent games would be the same, provided the scheduler of the OS worked well and nothing else was going on on the computer.

Unfortunately the latter is seldom true: Windows can unpredictably start doing things of its own. Because there would then be more processes than virtual cores, the OS would start to service them by time-division multiplexing two processes on the same virtual core. The unlucky engine running there at that time is paused, even though its clock keeps running.

To avoid this sort of thing most testers use one core fewer than they have physical cores, so that the unpredictable tasks that pop up now and then are scheduled on the otherwise idle core. But if you run on hyperthreads, you then have to prevent the OS from moving the two processes of a match to separate physical cores, each using only a single hyperthread. It would naturally do that to utilize the available CPU power optimally: better to use two cores running at full speed than one core running at (say) 150% through its two hyperthreads plus one idle core. The affinity mask prevents that.
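Building such masks requires knowing which logical CPUs are hyperthread siblings of the same physical core. On Linux this can be read from sysfs; a sketch (the sysfs path is the standard kernel interface, but whether your CPU numbering pairs siblings as "0,16" or "0-1" varies by machine):

```python
def parse_siblings(text: str) -> set[int]:
    """Parse a Linux thread_siblings_list entry such as '0,16' or '0-1'.

    These files live at
    /sys/devices/system/cpu/cpu<N>/topology/thread_siblings_list
    and list the logical CPUs that share one physical core.
    """
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

def physical_core_masks(n_logical: int) -> list[set[int]]:
    """Group logical CPUs into physical cores by reading sysfs (Linux-only)."""
    seen: list[set[int]] = []
    for n in range(n_logical):
        path = f"/sys/devices/system/cpu/cpu{n}/topology/thread_siblings_list"
        with open(path) as f:
            sibs = parse_siblings(f.read())
        if sibs not in seen:
            seen.append(sibs)
    return seen
```

Pinning each match to one entry of physical_core_masks(), and leaving the last entry unused, implements exactly the "reserve one physical core for the OS" policy described above.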
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by syzygy »

mvanthoor wrote: Tue Apr 11, 2023 12:00 pm
hgm wrote: Tue Apr 11, 2023 11:36 am Well, work and investment are exchangeable assets. Doing things carefully and efficiently on cheap hardware can have the same performance as doing things clumsily and wastefully on very expensive hardware. Whether any extra work it might require is worth the savings on hardware price is a dilemma that every person has to decide for himself.

You can also run 16 instances of CuteChess in parallel, and assigning an affinity to each of those is just a matter of 3 mouse clicks per instance...
That is 48 mouse clicks for every tournament I want to run. Why would I want to run 16 instances of CuteChess? It can run games concurrently all by itself, and when running 16 threads, I can just put CuteChess in the background and keep using the computer.
Never heard of automation/scripting?

Why are you testing at all if proper testing is "too much work"?
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by syzygy »

mvanthoor wrote: Mon Apr 10, 2023 5:23 pm - Hyper-Threading: when this was introduced somewhere in 2002, it was tested to be bad for chess engine testing. An engine using more threads than there were cores would not become significantly stronger. Running a match with more engines than cores could mean that an engine got assigned to a hyper-thread, and thus be significantly weaker compared to it being assigned to a normal core.
Knowingly or unknowingly you mix up two things here.

1. Does a multi-threaded engine running on a CPU with hyperthreading gain from using more search threads than there are physical cores?

In the past (pre-"lazy smp") the answer was generally no, nowadays the answer may be yes. But it'll have to be tested since engines differ (and CPUs too).

2. When doing ultrabullet testing on such a CPU, should you run more parallel matches than there are physical cores (each match being between two single-threaded engines without pondering)?

This is probably not worth it. You get more games but of lower quality (since nps per engine will be much lower). If you increase the time control for each game to correct for the nps loss, you still get a bit more games than without hyperthreading, but there will be a lot more noise which decreases the statistical relevance of the results. Noise means you need many more games to get the same statistical significance. (And if your statistical model does not reflect this, you will have results that are less reliable than you think.)

By setting affinities you may be able to reduce the noise a bit, but it will still be there. If you run two matches on two hyperthreads of the same core, nps will vary quite a bit, which introduces noise. Maybe in this situation the nps variations will sufficiently average out that there is still a total net gain, but this is not so clear.
With my tests of Rustic on the 6700K, I've always stuck to 4 concurrent games, as it was a quad-core CPU. Now that I have a 7950X, I have tried a match between the same version of Rustic, running 1000 games at 16, 24, and 30 threads. There's no difference in the outcome. I assume this is because each of the engines has a 50% chance of getting assigned to a hyper-thread, and this will be equally divided. I have not yet tested this with a gauntlet.
What do you mean by "outcome"?
If you use comparable time controls, you will probably have fewer draws on your 7950X with hyperthreading.
Fewer draws means more statistical noise and is therefore bad.

Suppose you run 100 games between A and B.
On your 6700K, A wins 10 games and draws 90 games. "Outcome" is 55-45.
On your 7950X, A wins 55 games, loses 45 games. "Outcome" is 55-45.
These are wildly different outcomes. The match on your 6700K is far stronger evidence that A is better than B than the match on your 7950X.
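The two matches above can be compared quantitatively by computing the per-game variance of the score under the stated win/draw/loss frequencies (a sketch, scoring win=1, draw=0.5, loss=0):

```python
import math

def score_stats(p_win: float, p_draw: float, p_loss: float):
    """Mean and standard deviation of the per-game score (win=1, draw=0.5)."""
    mean = p_win + 0.5 * p_draw
    second_moment = p_win + 0.25 * p_draw  # E[X^2] for outcomes 1, 0.5, 0
    var = second_moment - mean * mean
    return mean, math.sqrt(var)

# 6700K match: 10 wins, 90 draws out of 100 games.
mean_a, sd_a = score_stats(0.10, 0.90, 0.00)   # mean 0.55, sd 0.15
# 7950X match: 55 wins, 45 losses.
mean_b, sd_b = score_stats(0.55, 0.00, 0.45)   # mean 0.55, sd ~0.497

# Standard error of the average score over 100 games:
se_a = sd_a / math.sqrt(100)   # 0.015
se_b = sd_b / math.sqrt(100)   # ~0.050
```

Both matches give the same 55% score, but the all-decisive match has more than three times the standard error; since the number of games needed for a given significance scales with the variance (0.2475 / 0.0225 = 11), the decisive match would need roughly eleven times as many games to be equally convincing.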

I think HGM once argued in a similar context (perhaps it was about imbalanced openings) that the extra noise/imbalance could be what you need to bring out a small difference in Elo more clearly. There is probably something to that argument. Say you need +1 to win a game but the strength difference only allows +0.75, so all games are draws. If you now randomly add +/- 0.25 in noise, you will go from all draws to some wins for the stronger engine. So there are two sides to this.
- Intel E-Cores: these are different cores compared to the normal cores. An engine running on such a core would be much slower than an engine on a normal performance core. If I had a CPU with E-Cores, I would probably disable them.
Using E-cores should be fine if you make sure that both engines in a match run on the same E-core. Ideally you give them a slower time control than engines running on faster P-cores.

I don't know if cutechess supports this. If not, then support should be added.
- Turbo Boost: My Intel 6700K just boosts a single thread to 4.4 GHz, and any load higher than a single thread gets boosted to 4.2 GHz, so when running a match the entire CPU runs at 4.2 GHz. Thus I have never disabled Turbo Boost.
That makes sense.
- Precision Boost Optimizer (AMD Ryzen): this doesn't try to hit a specific frequency, but a specific power draw or CPU temperature. If you manage to lower the CPU temperature with a bigger cooler, or lower the power draw with an undervolt, the CPU just boosts higher. (The one BIOS option to prevent this either does not work at all, or does not work on Linux.) When running a gauntlet with 16, 24 or 30 threads, the entire CPU boosts to 5.3, 5.2 and 5.0 GHz respectively, and stays pegged at 85, 84 and 78 degrees respectively. In the summer, the CPU will probably hit the 95-degree thermal target, so it will run slower than in winter. However, because all cores run at the same speed during a match or gauntlet, I see no reason to disable boosting altogether. I could, but then the CPU would be capped at 4.5 GHz, losing hundreds of MHz on each of the 16 cores. That wouldn't be an option.
Ideally time controls (or measurement of time) would be adjusted with clock frequency.

Fixed nps matches would overcome all these problems, but they introduce their own...
So what's your take on this? Testing chess engines doesn't seem to become easier...
But your hardware is now also far more powerful and allows you to do much more testing in the same time.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by hgm »

This seems wrong on several levels. For one, why would the nps vary 'quite a bit' when you run two engines on the same physical core? Even if the two HT would compete for an execution unit every clock cycle, and it would be randomly decided which of the two is stalled for that cycle, the number of stalled cycles in a millisecond at 2GHz would be 1M, and the standard deviation in it sqrt(1M) = 1000. So the standard deviation of the nps fluctuation is 0.1%, which I would call negligible. And if games last longer than 1ms, the noise in the nps averaged over the game would even be lower.

And even when there would be severe fluctuation, it doesn't appear to be very harmful: suppose I am running a test where the correct outcome would be 60% win, 40% loss. But nps fluctuations would randomly favor one engine or the other so that it gets 10% extra. So in 50% of the cases the chances are 50-50, in the other 50% they are 70-30. The combined result for a single game (i.e. flipping the coin to decide who gets the time odds, and then playing for a result) still has a 60-40 distribution. It has the same average, and the same variance. The result or its reliability is not affected at all!

Now the above assumed that the random handicapping was not so large that it saturated the result. So let us examine another case: in 10% of the games one of the engines gets so little service from the CPU that it certainly loses, no matter how much stronger it is (e.g. it forfeits). Now the total distribution for a game result is affected; the 60-40 applies only in 90% of the games (so 54-36), while the other 10% is 50-50 (so 5-5). In total 59% vs 41%. The new standard deviation in a single game would be sqrt(0.41*0.59) = 0.4918 instead of sqrt(0.4*0.6) = 0.4899, i.e. about 0.4% larger, while the deviation of the average result from 50% is reduced by 10%. The ratio of the two (which determines the LOS) would thus go down by a bit over 10%. You can make up for this by playing about 24% more games. Hyperthreading might allow you to play 50% more games in the same time, though.
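Both mixture arguments can be redone numerically in a few lines (a sketch assuming binary game results, win=1 and loss=0, as in the paragraphs above):

```python
import math

def bernoulli_sd(p: float) -> float:
    """Standard deviation of a single 0/1 game result with win probability p."""
    return math.sqrt(p * (1.0 - p))

# Time-odds mixture: half the games at 50-50, half at 70-30.
p_mix = 0.5 * 0.5 + 0.5 * 0.7            # 0.6: same mean as a plain 60-40
# For 0/1 outcomes E[X^2] = E[X], so the variance is also unchanged:
var_mix = p_mix - p_mix ** 2             # 0.24, same as 0.6 * 0.4

# Forfeit case: 10% of games decided by a coin flip, 90% at 60-40.
p = 0.9 * 0.6 + 0.1 * 0.5                # 0.59
snr_ratio = ((p - 0.5) / bernoulli_sd(p)) / ((0.6 - 0.5) / bernoulli_sd(0.6))
# Games needed for equal LOS scale as 1/snr^2:
extra_games = 1.0 / snr_ratio ** 2 - 1.0 # roughly 24% more games
```

Since even this rather extreme forfeit scenario costs only about a quarter more games, the conclusion stands: the extra games hyperthreading buys can outweigh the added noise.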

We see that the added noise is not so much of an issue, and even in these pretty extreme cases hardly affects the result. I guess this is because the noise is already very large to start with, even under perfect conditions.
Modern Times
Posts: 3703
Joined: Thu Jun 07, 2012 11:02 pm

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by Modern Times »

syzygy wrote: Sat Apr 15, 2023 11:57 pm
2. When doing ultrabullet testing on such a CPU, should you run more parallel matches than there are physical cores (each match being between two single-threaded engines without pondering)?

This is probably not worth it. You get more games but of lower quality (since nps per engine will be much lower). If you increase the time control for each game to correct for the nps loss, you still get a bit more games than without hyperthreading, but there will be a lot more noise which decreases the statistical relevance of the results. Noise means you need many more games to get the same statistical significance. (And if your statistical model does not reflect this, you will have results that are less reliable than you think.)
Stefan Pohl does exactly this for all his Stockfish and other ratings lists - he runs 20 threads on his 12-core Ryzens. Everyone seems happy with the results, I guess because of the large number of games he runs.
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Testing: Hyper-Threading, E-cores, Turbo Boost / Precision Boost

Post by syzygy »

hgm wrote: Sun Apr 16, 2023 9:49 am This seems wrong on several levels. For one, why would the nps vary 'quite a bit' when you run two engines on the same physical core? Even if the two HT would compete for an execution unit every clock cycle, and it would be randomly decided which of the two is stalled for that cycle, the number of stalled cycles in a millisecond at 2GHz would be 1M, and the standard deviation in it sqrt(1M) = 1000. So the standard deviation of the nps fluctuation is 0.1%, which I would call negligible. And if games last longer than 1ms, the noise in the nps averaged over the game would even be lower.
You are making assumptions that you don't know are true. The point of hyperthreading is to increase utilisation of execution units, not to serve the hyperthreads running on a core fairly.
And even when there would be severe fluctuation, it doesn't appear to be very harmfull: suppose I am running a test where the correct outcome would be 60% win, 40% loss. But nps fluctuations would randomly favor one engine or the other so that it gets 10% extra. So in 50% of the cases the chances are 50-50, in the other 50% they are 70-30. The combined result for a single game (i.e. flipping the coin to decide who gets the time odds, and then playing for a result) still has a 60-40 distribution. It has the same average, and the same variance. The result or its reliability is not affected at all!
Noise increases, so draw rate goes down and statistical significance goes with it. You'll have to increase the number of games to get the same reliability.
Hyperthreading might allow you to play 50% more games in the same time, though.
Last time I checked hyperthreading did not give 50% more speed.
But yes, the extra speed might be worth it if you set affinities.
syzygy wrote:By setting affinities you may be able to reduce the noise a bit, but it will still be there. If you run two matches on two hyperthreads of the same core, nps will vary quite a bit, which introduces noise. Maybe in this situation the nps variations will sufficiently average out that there is still a total net gain, but this is not so clear.
If the statistical model correctly deals with the noise, and the testing framework makes sure that there is no bias, for example by regularly restarting the engines and (randomly) alternating the order in which the two engines are started, things should be fine even if no efficiency is gained.