Some Notes about Hyper-Threading

Werewolf · Post by **Werewolf** » Mon Dec 12, 2011 10:06 am

bob wrote:
Problem with HT on is that if you have 4 physical cores, and search X NPS, when you go to 8 cores (HT on) the tree will grow by 30%. If your NPS doesn't grow by MORE than 30%, you see a net loss.

NPS is NOT the way to measure parallel search performance. It provides completely bogus comparisons...

Your answer is too interesting to let it slip by!
a) Why does the tree grow by 30% with HT on?
b) Is this also true if we move from 4 physical cores to 8?
c) Why do the np/s have to increase by 30% or more to maintain performance? (because surely the HT tree isn't the same tree as the Non-HT tree and therefore time to depth is misleading)

Sedat Canbaz · Post by **Sedat Canbaz** » Mon Dec 12, 2011 12:14 pm

M ANSARI wrote:I think a critical thing with HT is the OS. Win 7 is much better at dealing with HT than win XP. I would think that HT can be beneficial if the OS handles the threads properly and allocates resources as it should. HT threads can always be useful doing other non chess related stuff, and thus theoretically HT should help. The problem is that some OS's simply do not allocate the resources properly and thus you have increased latencies that would hurt rather than help performance. But if your setup with your OS is absolutely perfect and all affinities to the engine are properly resourced, then you would see a tiny increase in performance ... I mean really tiny.

Hello dear Majd,

Very interesting and thank you for the useful information

Perhaps you can be right that Windows 7 is better at dealing with HT ON than Windows XP x64 Prof

But to proof this...we need serious testings in Auto232 mode

However in my Windows operating system testings, i was not satisfied regarding the benchmark results by Crafty 22.8 x64 4 cores (on Win 7)

I mean each time,Crafty 22.8 x64 (on Win 7) produced different kns values,approx.20-30% slower bench results than Win XP x64
Even,Fritz benchmarks had similar results,as far as i remember the speed difference was approx.5-10 % slower than Win XP x64 too

Even after many Win 7 tweakings ad optimizings, i could not see same Crafty 22.8 bench results as on Windows XP x64

Note:the both operating system kns vlues were almost same by Houdini bench;Rybka bench;Toga bench

Another interesting note is that, Peter Grayson says that Win 7's Windows Defender can be the reason of lower bench performance by Crafty 22.8/Fritz Bench
http://rybkaforum.net/cgi-bin/rybkaforu ... ?tid=23547

BTW,i think the current HT OF /HT ON issue is really very interesting and very important,maybe later i can install Windows 8 preview on my i7 980X

And then i can start Auto232 testing (HT ON against HT OFF) and in case of better HT ON ELO Performance,i will start using the engines at 12 Threads

But in case of no any ELO gaining...again i will switch to 6 physical cores
And in my estimation,there will be no any ELO difference,even i expect the results to be in favor for HT OFF

Lets see the time will show up !

Greetings,
Sedat

Sedat Canbaz · Post by **Sedat Canbaz** » Mon Dec 12, 2011 1:04 pm

Vinvin wrote:One more very interesting thing to see is : HT ON but engine only use 6 threads. I heard Windows 7 manage this very well. Is the results similar to HT OFF ? If not, Win7 probably put 2 use threads on a single CPU and that's bad ...

And 1 more thing : please run this position with "HT OFF and 6 threads" 10 times and post the 10 timings here (they shouldn't be constant) ...

Thx,
Vincent.

10 times more ???

Ok...maybe later i can re-run H13.2 with 6 cores and 12 threads too (for right now my pc is busy)

Note also that for the next benchmarks,CPUs Usage (the number of cores/threads) will be checked by Task Manager Program

Kind Regards,
Sedat

Vinvin · Post by **Vinvin** » Mon Dec 12, 2011 1:32 pm

Not especially 10 "more" times if you've already the results. The point is with more than 1 thread, results vary. 10 run give a good picture of minimum, maximum and average ...

Sedat Canbaz wrote:10 times more ???
...
Kind Regards,
Sedat

bob · Post by **bob** » Mon Dec 12, 2011 2:31 pm

Werewolf wrote:
bob wrote:
Problem with HT on is that if you have 4 physical cores, and search X NPS, when you go to 8 cores (HT on) the tree will grow by 30%. If your NPS doesn't grow by MORE than 30%, you see a net loss.

NPS is NOT the way to measure parallel search performance. It provides completely bogus comparisons...
Your answer is too interesting to let it slip by!
a) Why does the tree grow by 30% with HT on?
b) Is this also true if we move from 4 physical cores to 8?
c) Why do the np/s have to increase by 30% or more to maintain performance? (because surely the HT tree isn't the same tree as the Non-HT tree and therefore time to depth is misleading)

That assumes that if you have 4 real cores, and you test with 4 threads, and then use 8 logical cores (HT on) and rul with 8 threads, then the tree will grow about 30% in size due to the parallel search overhead.

alpha/beta is a purely sequential algorithm as defined. You need to establish a bound at each node, by searching the best move first, then you use that bound to search the remaining nodes more efficiently. When you don't do this (and you can't in a parallel search) you search a larger tree to reach the same depth..

For (b) yes. It is not a "core" issue but a "number of threads" issue.

(c) think about it. Going from 4 to 8 threads makes the tree 30% larger. If you don't speed up enough with the extra 4 threads to offset that loss, you see a net decrease in performance. If the NPS increases by more than that amount, you see a (small) net gain.

Werewolf · Post by **Werewolf** » Mon Dec 12, 2011 5:00 pm

bob wrote:
That assumes that if you have 4 real cores, and you test with 4 threads, and then use 8 logical cores (HT on) and rul with 8 threads, then the tree will grow about 30% in size due to the parallel search overhead.

alpha/beta is a purely sequential algorithm as defined. You need to establish a bound at each node, by searching the best move first, then you use that bound to search the remaining nodes more efficiently. When you don't do this (and you can't in a parallel search) you search a larger tree to reach the same depth..

For (b) yes. It is not a "core" issue but a "number of threads" issue.

(c) think about it. Going from 4 to 8 threads makes the tree 30% larger. If you don't speed up enough with the extra 4 threads to offset that loss, you see a net decrease in performance. If the NPS increases by more than that amount, you see a (small) net gain.

Bob, yet again, very informative.

Please can you evaluate this statement which I've tried to write on the basis of what you've said:

i) Alpha-beta search is a sequential.
ii) When good moves are identified they can be prioritised which increases search efficiency as the search goes on.
iii) But parallel search prevents ii) to a certain extent, due to each thread potentially examining the same positions as other threads at the same time. This creates a degree of redundancy.
iv) Thus, for every doubling of threads the search tree grows by 30% without increasing depth, due to overlap.
v) Therefore, in order to see an elo performance gain when doubling threads the NPS must >1.3 times what is was before.

In addition to that statement can I make one more?

a) HT works by dividing the core's time between two threads
b) Therefore for HT to be effective a thread cannot demand more than 50% of the Core's processing power.
c) Chess demands 100% processing power of each core
d) Therefore HT will simply decrease performance for chess by 30%, for reasons stated above.

bob · Post by **bob** » Mon Dec 12, 2011 5:31 pm

Werewolf wrote:
bob wrote:
That assumes that if you have 4 real cores, and you test with 4 threads, and then use 8 logical cores (HT on) and rul with 8 threads, then the tree will grow about 30% in size due to the parallel search overhead.

alpha/beta is a purely sequential algorithm as defined. You need to establish a bound at each node, by searching the best move first, then you use that bound to search the remaining nodes more efficiently. When you don't do this (and you can't in a parallel search) you search a larger tree to reach the same depth..

For (b) yes. It is not a "core" issue but a "number of threads" issue.

(c) think about it. Going from 4 to 8 threads makes the tree 30% larger. If you don't speed up enough with the extra 4 threads to offset that loss, you see a net decrease in performance. If the NPS increases by more than that amount, you see a (small) net gain.
Bob, yet again, very informative.

Please can you evaluate this statement which I've tried to write on the basis of what you've said:

i) Alpha-beta search is a sequential.
ii) When good moves are identified they can be prioritised which increases search efficiency as the search goes on.
iii) But parallel search prevents ii) to a certain extent, due to each thread potentially examining the same positions as other threads at the same time. This creates a degree of redundancy.

Not necessarily the "same position", but redundant work. For example, suppose I give you the following task:

There is a wall in front of you with three arm-sized holes in it. Your assignment is to insert your hand into each hole, for one minute, and at the end of the test, tell me which one was the most pleasant / least unpleasant, depending on what happens.

You insert your hand into hole one and you are greeted with warm water. Then warm air. After one minute you remove your hand and insert it into hole two. You are immediately stuck with a needle. No need to stick around any longer, that is already worse than hole # 1, and it might get even worse if you wait around. You move to hole three. Here you are greeted by frigid ice-water and you again immediately bail out as it could get worse. After 1 minute + maybe 1 second for each hole, you conclude "hole one was the most pleasant / least unpleasant."

Now do that with three people, which should go 3x faster, right? Each sticks his hand into his respective hole. The guy in hole 2 can't give up after the pin prick, because the other two don't know how good or bad their respective holes are. All three have to stay the entire minute. After a total of one minute spent, helper 1 reports on his warm/wet hand, helper two reports he lost two fingers, and helper three reports his hand is frozen solid. After comparing notes, they conclude, also, that hole 1 was better.

That's how alpha/beta works. If you search the best move first, and we typically get that right 90%+ of the time, then we can quickly dismiss the inferior moves. But if we don't know the value of the first when we start to work on the second and third (in parallel) we have to do extra work...

Hope that helps.

iv) Thus, for every doubling of threads the search tree grows by 30% without increasing depth, due to overlap.

Not necessarily "overlap". Just doing "unnecessary work" (work that the normal serial search would not do).

v) Therefore, in order to see an elo performance gain when doubling threads the NPS must >1.3 times what is was before.

In addition to that statement can I make one more?

a) HT works by dividing the core's time between two threads
b) Therefore for HT to be effective a thread cannot demand more than 50% of the Core's processing power.

No. The biggest HT gain comes from memory accesses. When you get a L1 cache miss and have to wait for 20 or so cycles or whatever for L2, or longer for L3, or MUCH longer for main memory, the other logical "core" can use the resources to continue since the first thread is "blocked" (much like what happens in a multiprogramming operating system when a process does I/O and others run while it is blocked.

c) Chess demands 100% processing power of each core
d) Therefore HT will simply decrease performance for chess by 30%, for reasons stated above.

Werewolf · Post by **Werewolf** » Mon Dec 12, 2011 5:54 pm

bob wrote:
No. The biggest HT gain comes from memory accesses. When you get a L1 cache miss and have to wait for 20 or so cycles or whatever for L2, or longer for L3, or MUCH longer for main memory, the other logical "core" can use the resources to continue since the first thread is "blocked" (much like what happens in a multiprogramming operating system when a process does I/O and others run while it is blocked.

c) Chess demands 100% processing power of each core
d) Therefore HT will simply decrease performance for chess by 30%, for reasons stated above.

OK, I think I get it. But surely HT can't just _magically_ increase the performance of a core. If chess demands the core's full attention, and assuming the thread it uses is not blocked (which I'm assuming is the case) then trying to get a 2nd thread to do something on the same core would surely be like a riding a bike and trying to play the piano at the same time.

If that statement is wrong I've misunderstood things!

ernest · Post by **ernest** » Mon Dec 12, 2011 6:24 pm

Sedat Canbaz wrote:Sorry...that i can not provide you more useful HT data...

Hi Sedat,

You wrote (to Vincent):
More Hyper-Threading details about i7 980X 4.33GHz:
-Only the best results have been published
-Each engine has been tested minimum 5-6 times

I am satisfied with that...
If you had answered that to me earlier (instead of...), there would have been no argument at all.

Wishing you the best,

Ernest

Sedat Canbaz · Post by **Sedat Canbaz** » Mon Dec 12, 2011 7:14 pm

Hello dear Vincent,

Your request is done...

*Test 1:HT OFF 6 Physical cores

Conditions:
----------------
i7 980X @4.33GHz
Hyper Threading Disabled
Windows XP x64 Prof
TC:60 Minutes/Game
Ponder OFF
128 MB Hashtable
Large Pages Enabled
Position Learning OFF

Solved mate in sec:
-------------------------
1st-215s
2nd-198s
3rd-111s
4rd-209s
5th-50s
6th-168s
7th-150s
8th-110s
9th-56s
10th-158s
---------
-Hiarcs 13.2 HT OFF 6 physical cores solves the mate average in 142 sec

************************************************************

*Test 2:HT ON 12 Threads

Conditions:
----------------
i7 980X @4.33GHz
Hyper Threading Enabled
Windows XP x64 Prof
TC:60 Minutes/Game
Ponder OFF
128 MB Hashtable
Large Pages Enabled
Position Learning OFF

Solved mate in sec:
--------------------------
1st-48s
2nd-249s
3rd-98s
4rd-168s
5th-206s
6th-52s
7th-209s
8th-124s
9th-97s
10th-271s
----------
-Hiarcs 13.2 HT ON 12 Threads solves the mate position average in 152 sec

More Details:
-------------------
Hiarcs 13.2's Hashtables are cleaned before starting each bench
Hiarcs 13.2 has been tested with the same mate position:TOTAL 20 times
As we see again,the results are slightly in favor for HT OFF
Hiarcs 13.2 with 'Position Learning ON' is solving the mate much faster
Due to accurate speed testing,the benchmarks are done with Position Learning OFF

Download all HT Chess Benchmarks by Hiarcs 13.2:
http://www.sedatcanbaz.com/chess/games/ht_test.rar

Best Regards,
Sedat

Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading

Re: Some Notes about Hyper-Threading