(Why) Is hyperthreading bad for chess engines?

vittyvirus · Post by **vittyvirus** » Tue Sep 23, 2014 7:20 pm

Is it? If so, why?

Laskos · Post by **Laskos** » Tue Sep 23, 2014 7:47 pm

vittyvirus wrote: Is it? If so, why?

Because some guys say so. With a 30% NPS speed-up going from 4 physical cores to 8 logical cores (HT), it will bring benefits. On the other hand, the achievable overclock frequency with 8 logical cores is lower. If you are an overclocker, use only physical cores and get the highest stable frequency. If not, use all logical cores with HT.

HT is also great for running 4 threads of heavy chess duties (on a machine with 4 physical cores) while leaving the rest for internet and other such lightweight crap. Your tests will be fine, although some overly cautious guys say otherwise.

bob · Post by **bob** » Tue Sep 23, 2014 8:46 pm

It is not "just because some guys say so." It is "because many guys have actually measured this carefully."

An SMP search absolutely introduces overhead; there is no way around it. Hyper-threading will improve NPS most of the time (but not all of the time; a very small program might show zero benefit). So the question is which is bigger: SMP search overhead or the hyper-threading NPS speedup. The general answer is that SMP overhead is larger, which means the overhead outweighs the gain. If someone can figure out a way to solve the roughly 30% overhead for each thread added, then this might have a chance. But overhead is a direct result of move ordering, and pushing the fail-high-on-first-move rate above the roughly 90% range is not exactly easy. By far the best comparison might be to try 4 threads at speed N, and 8 threads at speed N/2. It becomes pretty obvious which is better on average.

Laskos · Post by **Laskos** » Tue Sep 23, 2014 8:51 pm

I tested Houdini 3 _extensively_ on my (by now fried) system. On _strength_. Overclocking issues aside, my system _benefited_ from HT, with a 30-32% speed-up in NPS. R. Houdart also said that the overhead from 4 to 8 threads is about 20%, so a 30% NPS speed-up is theoretically beneficial.
I knew Bob would come to say something in his usual way.

syzygy · Post by **syzygy** » Tue Sep 23, 2014 9:08 pm

vittyvirus wrote: Is it? If so, why?

If so, it is because the benefit in terms of higher nps does not outweigh the loss in search efficiency due to a higher number of threads (i.e. more nodes needed to search to the same depth).

In some cases the benefit may well outweigh the cost. The benefit will depend on hardware, engine, maybe even hash size and type of position. The cost will depend on the engine and maybe hash size and type of position. The OS might make a difference as well.
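
One rough way to write down this trade-off, as a sketch only (the symbols $g$ for the fractional NPS gain and $o$ for the fractional growth in nodes needed to reach a fixed depth are introduced here for illustration and are not from the thread):

```latex
\[
T = \frac{\text{nodes}}{\text{NPS}}, \qquad
\frac{T_{\mathrm{HT}}}{T_{\mathrm{base}}} = \frac{1 + o}{1 + g}
\]
```

Under this simple model HT pays off only when $g > o$. With the figures quoted earlier in the thread ($g = 0.30$, $o = 0.20$) the ratio would be $1.20 / 1.30 \approx 0.92$, i.e. roughly 8% less time to reach the same depth. Real results will also depend on the hardware, engine, hash size, and position type.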

Kai wrote: On the other hand, the achievable overclock frequency with 8 logical cores is lower.

Then again, the benefit from HT increases as the ratio of CPU speed to memory speed increases. So there are lots of variables to consider.

Laskos · Post by **Laskos** » Tue Sep 23, 2014 9:25 pm

One more thing: time control. 8 threads kick in more slowly than 4 threads, so testing at ultra-fast TC is useless. Luckily, TTD (time to depth) is valid for Houdini, so it's not very hard to see which is better (take 1,000 positions at one minute each, which gives adequate depth, and in one day you are done with testing). A strength SPRT test at reasonable to long TC will take much longer; the benefit of HT on my system with H3 and 1GB hash was only about 10-15 Elo points.

syzygy · Post by **syzygy** » Tue Sep 23, 2014 9:26 pm

Laskos wrote: One more thing: time control. 8 threads kick in more slowly than 4 threads, so testing at ultra-fast TC is useless.

Yes, good point.

Sedat Canbaz · Post by **Sedat Canbaz** » Tue Sep 23, 2014 11:16 pm

vittyvirus wrote: Is it? If so, why?

Hello Syed,

A long time ago,
I opened a similar thread regarding Hyper-Threading, for more details:
http://www.talkchess.com/forum/viewtopi ... ight=sedat

Btw, I have tested with HT OFF and HT ON,
and I noticed that both setups are almost equal in strength:

Rank  Name                       Elo     +    -   games  score  oppo.  draws
   1  Houdini 2.0c Pro x64 6c    3423   19   18    1008    70%   3261    42%
   2  Houdini 2.0c Pro x64 12t   3421   16   16    1399    70%   3281    38%

For Full Standings:
https://sites.google.com/site/computers ... ct-auto232

Best,
Sedat

bob · Post by **bob** » Wed Sep 24, 2014 5:32 am

Laskos wrote:
I tested Houdini 3 _extensively_ on my (by now fried) system. On _strength_. Overclocking issues aside, my system _benefited_ from HT, with a 30-32% speed-up in NPS. R. Houdart also said that the overhead from 4 to 8 threads is about 20%, so a 30% NPS speed-up is theoretically beneficial.
I knew Bob would come to say something in his usual way.

How many games have you played? I've played MILLIONS testing hyper-threading, since this comes up so often. The best I have ever seen is break-even, UNLESS a program is very poorly optimized and makes excessive memory accesses.

I just ran a quick test on my iMac, a quad-core i7. I picked a couple of positions and ran them several times with 4 threads and then with 8.

NPS numbers were 23.5M with 4 threads and 28.0M with 8 threads, averaged over several runs and a couple of positions: an improvement of 4.5M/23.5M = 19%. The size of the trees, all searched to the same depth, was 1.01B nodes for 4 threads and 1.26B nodes for 8 threads. Tree size increased by 0.25B/1.01B = 25%, which outweighs the NPS gain and makes for a slight loss.
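
As a quick check of the arithmetic above, here is a minimal Python sketch that turns the four measured figures into a time-to-depth ratio. The function name `time_to_depth_ratio` is made up for illustration, and it assumes that search time to a fixed depth is simply nodes divided by NPS.

```python
def time_to_depth_ratio(nps_base, nps_ht, nodes_base, nodes_ht):
    """Return T_ht / T_base, assuming time to a fixed depth = nodes / NPS."""
    t_base = nodes_base / nps_base
    t_ht = nodes_ht / nps_ht
    return t_ht / t_base


# Figures from the post: 4 threads vs 8 threads on a quad-core i7.
nps_gain = 28.0e6 / 23.5e6 - 1        # fractional NPS improvement
tree_growth = 1.26e9 / 1.01e9 - 1     # fractional growth in nodes searched
ratio = time_to_depth_ratio(23.5e6, 28.0e6, 1.01e9, 1.26e9)

print(f"NPS gain:            {nps_gain:.1%}")
print(f"Tree growth:         {tree_growth:.1%}")
print(f"Time-to-depth ratio: {ratio:.3f}")
```

A ratio above 1.0 means the 8-thread run needs more time to reach the same depth; with these numbers it comes out a little under 5% slower.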

Rather than taking Houdart's number, why don't you get your own? All you have to do is start up Houdini and have it search the same position using N real cores and then 2N hyper-threaded cores. See what kind of REAL speedup you see in terms of NPS, and then in terms of tree growth. And if the growth in the tree is smaller than the NPS improvement, I would agree that it looks pretty good for that program.

However, another point of experience. If NPS and overhead are identical, I would ALWAYS choose the N-thread rather than the 2N-thread option, because 2N threads have a much higher variance in time and tree size, which is not a particularly good thing.

And please don't start on the "widening" red herring argument.

All it takes to get a better gain from hyper-threading is to be sloppy on memory accesses: don't group variables by temporal locality, don't strive for sequential accesses when possible, etc. Then a thread within the CPU stalls waiting on cache misses, the other thread actually does useful work, and the NPS climbs. I have spent a lot of time working on locality issues. And on a new i7 with 4 cores, I don't see that 30% number. I'd be concerned if I did, as the bigger the HT gain, the more stall issues a program has in a single thread, something that is always bad and which can usually at least be mitigated.

When you test using hyper-threading and only measure speed, you can't claim that hyper-threading is better in the general sense. You might have proved that the SMP overhead is smaller in Houdini than in most. Or you might have proved that the memory access locality is not done very well and hyper-threading is correcting some of that. If you just allow one degree of freedom you can measure it. Fixed depth will show perfectly the tree space growth. NPS shows how the actual program scales ignoring SMP growth. Then you can come much closer to figuring out why HT helps or hurts and why.
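
The fixed-depth measurement described above could be scripted roughly as follows. This is only a sketch under several assumptions: the engine speaks the UCI protocol, it exposes its thread count through an option named `Threads` (common, but not required by UCI), and it reports `nodes` and `nps` in its `info` lines. The helper names and any engine path passed in are hypothetical.

```python
import re
import subprocess


def parse_info(line):
    """Extract (nodes, nps) from a UCI 'info' line; None if either is absent."""
    if not line.startswith("info"):
        return None
    nodes = re.search(r"\bnodes (\d+)", line)
    nps = re.search(r"\bnps (\d+)", line)
    if nodes and nps:
        return int(nodes.group(1)), int(nps.group(1))
    return None


def search_fixed_depth(engine_path, fen, depth, threads):
    """Search one position to a fixed depth; return the last (nodes, nps) seen."""
    proc = subprocess.Popen(
        [engine_path], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        text=True, bufsize=1,
    )

    def send(cmd):
        proc.stdin.write(cmd + "\n")
        proc.stdin.flush()

    def wait_for(token):
        for out in proc.stdout:
            if out.strip() == token:
                return

    send("uci")
    wait_for("uciok")
    # Assumption: the thread-count option is called "Threads".
    send(f"setoption name Threads value {threads}")
    send("isready")
    wait_for("readyok")
    send(f"position fen {fen}")
    send(f"go depth {depth}")

    last = None
    for out in proc.stdout:
        out = out.strip()
        if out.startswith("bestmove"):
            break
        parsed = parse_info(out)
        if parsed:
            last = parsed
    send("quit")
    proc.wait()
    return last


# Example use (hypothetical engine path and depth), one run per thread count:
#   fen = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
#   base = search_fixed_depth("./engine", fen, 24, 4)
#   ht = search_fixed_depth("./engine", fen, 24, 8)
```

Comparing the node counts from the two runs shows tree growth at equal depth, and comparing the NPS values shows raw scaling. Because multi-threaded searches are not deterministic, each configuration should be repeated several times over several positions before drawing conclusions.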
