Some Notes about Hyper-Threading

bob · Post by **bob** » Mon Dec 12, 2011 7:42 pm

Werewolf wrote:
bob wrote:
No. The biggest HT gain comes from memory accesses. When you get an L1 cache miss and have to wait for 20 or so cycles or whatever for L2, or longer for L3, or MUCH longer for main memory, the other logical "core" can use the resources to continue since the first thread is "blocked" (much like what happens in a multiprogramming operating system when a process does I/O and others run while it is blocked).

c) Chess demands 100% processing power of each core
d) Therefore HT will simply decrease performance for chess by 30%, for reasons stated above.
OK, I think I get it. But surely HT can't just _magically_ increase the performance of a core. If chess demands the core's full attention, and assuming the thread it uses is not blocked (which I'm assuming is the case), then trying to get a 2nd thread to do something on the same core would surely be like riding a bike and trying to play the piano at the same time.

If that statement is wrong I've misunderstood things!

Here's the problem. If the CPU needs something from memory, it takes a variable number of clock cycles. Say 4 clocks for an L1 hit. Maybe 20 for an L2 hit. More for an L3 hit. And hundreds of clock cycles or more if we have to go to real memory. That tends to stall a core since it can't proceed if all pending instructions depend on data values coming in from memory. During those "pauses" the other logical core can use that core's resources to work on a second instruction stream.

Very much as a modern operating system "interleaves" the execution of two processes as they block waiting on I/O. The more such "blocking" happens, the better that second logical core looks. If your working set fits entirely in L1 cache, you will barely see any HT speedup. If you don't fit in L1, and depend more on L2 or L3 or even main memory, then HT can help more...

Even though you "think" a core is 100% busy, it spends a significant amount of time waiting on data from cache and/or main memory...
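
To put rough numbers on the stall argument, here is a back-of-the-envelope sketch in Python. The L1 and L2 latencies are the ones mentioned above; the L3 and main-memory latencies, the hit rates, and the amount of useful work between loads are invented for illustration only. The model also ignores out-of-order execution and prefetching, so treat the output as a feel for the effect, not a measurement.

```python
# Toy model: what share of a core's cycles is spent waiting beyond an L1 hit?
# L1/L2 latencies are taken from the post; L3 and RAM are ASSUMED values.
LATENCY = {"L1": 4, "L2": 20, "L3": 40, "RAM": 300}

def stall_fraction(hit_rates, work_per_load):
    """hit_rates: share of loads served by each level (must sum to 1).
    work_per_load: ASSUMED cycles of useful work between two loads."""
    assert abs(sum(hit_rates.values()) - 1.0) < 1e-9
    # Only the latency in excess of an L1 hit is counted as a stall.
    extra_wait = sum(hit_rates[lvl] * (LATENCY[lvl] - LATENCY["L1"]) for lvl in LATENCY)
    return extra_wait / (extra_wait + work_per_load)

# Two HYPOTHETICAL workloads: one that lives in L1, one that spills out of it.
fits_in_l1 = {"L1": 1.00, "L2": 0.00, "L3": 0.00, "RAM": 0.00}
spills_out = {"L1": 0.90, "L2": 0.06, "L3": 0.03, "RAM": 0.01}

for name, rates in (("fits in L1", fits_in_l1), ("spills out", spills_out)):
    share = stall_fraction(rates, work_per_load=20)
    print(f"{name}: {share:.0%} of cycles spent waiting")
```

The waiting share is roughly the headroom a second logical core could fill, which is why a program that lives in L1 leaves almost nothing for HT to gain.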

bob · Post by **bob** » Mon Dec 12, 2011 7:51 pm

Sedat Canbaz wrote:Hello dear Vincent,

Your request is done...

*Test 1:HT OFF 6 Physical cores

Conditions:
----------------
i7 980X @4.33GHz
Hyper Threading Disabled
Windows XP x64 Prof
TC:60 Minutes/Game
Ponder OFF
128 MB Hashtable
Large Pages Enabled
Position Learning OFF

Solved mate in sec:
-------------------------
1st-215s
2nd-198s
3rd-111s
4th-209s
5th-50s
6th-168s
7th-150s
8th-110s
9th-56s
10th-158s
---------
-Hiarcs 13.2 HT OFF 6 physical cores solves the mate average in 142 sec

************************************************************

*Test 2:HT ON 12 Threads

Conditions:
----------------
i7 980X @4.33GHz
Hyper Threading Enabled
Windows XP x64 Prof
TC:60 Minutes/Game
Ponder OFF
128 MB Hashtable
Large Pages Enabled
Position Learning OFF

Solved mate in sec:
--------------------------
1st-48s
2nd-249s
3rd-98s
4th-168s
5th-206s
6th-52s
7th-209s
8th-124s
9th-97s
10th-271s
----------
-Hiarcs 13.2 HT ON 12 Threads solves the mate position average in 152 sec

More Details:
-------------------
Hiarcs 13.2's hashtables are cleared before each benchmark run
Hiarcs 13.2 has been tested with the same mate position a total of 20 times
As we see again, the results are slightly in favor of HT OFF
Hiarcs 13.2 with 'Position Learning ON' solves the mate much faster
For accurate speed testing, the benchmarks were run with Position Learning OFF

Download all HT Chess Benchmarks by Hiarcs 13.2:
http://www.sedatcanbaz.com/chess/games/ht_test.rar

Best Regards,
Sedat

So it is worse with HT on than off. Which has been my finding every time I have tested this, starting back with the PIV...

Not terribly worse, just "worse"...
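
For anyone who wants to check the averages, the two sets of solve times above can be summarized with a few lines of Python. The numbers are copied from the posted results; the standard deviation is an extra I added, since the run-to-run spread turns out to be several times larger than the gap between the two averages.

```python
from statistics import mean, stdev

# Solve times in seconds, copied from the two test runs quoted above.
ht_off = [215, 198, 111, 209, 50, 168, 150, 110, 56, 158]
ht_on = [48, 249, 98, 168, 206, 52, 209, 124, 97, 271]

for label, times in (("HT off, 6 threads", ht_off), ("HT on, 12 threads", ht_on)):
    print(f"{label}: mean {mean(times):.1f}s, stdev {stdev(times):.1f}s, "
          f"min {min(times)}s, max {max(times)}s")

# Relative slowdown of the HT-on average against the HT-off average.
slowdown = mean(ht_on) / mean(ht_off) - 1
print(f"HT on is {slowdown:.1%} slower on average")
```

With that much scatter, ten runs per configuration may not be enough to separate the two settings cleanly, so the averages are best read as "no gain from HT" rather than as a precise penalty.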

Sedat Canbaz · Post by **Sedat Canbaz** » Mon Dec 12, 2011 8:27 pm

bob wrote:
So it is worse with HT on than off. Which has been my finding every time I have tested this, starting back with the PIV...

Not terribly worse, just "worse"...

Dear Robert,

First of all, I'd like to thank you for all of your work over many years
I have really seen a lot of useful information in your postings

Yes... it seems the chess speed is better with HT OFF
But anyway, to finish this interesting discussion, I plan to test both systems (HT OFF against HT ON) under SCCT Auto232 conditions:
http://www.sedatcanbaz.com/chess/rating ... onditions/

BTW, in my opinion, the Bayeselo program will give us accurate Elo statistics about the performance of both systems (HT OFF/HT ON)

Best,
Sedat

Sedat Canbaz · Post by **Sedat Canbaz** » Mon Dec 12, 2011 8:51 pm

ernest wrote:
Sedat Canbaz wrote:Sorry...that i can not provide you more useful HT data...
Hi Sedat,

You wrote (to Vincent):
More Hyper-Threading details about i7 980X 4.33GHz:
-Only the best results have been published
-Each engine has been tested minimum 5-6 times

I am satisfied with that...
If you had answered that to me earlier (instead of...), there would have been no argument at all.

Wishing you the best,

Ernest

Dear Ernest,

Thank you for your kind posting...

Best Regards,
Sedat

ernest · Post by **ernest** » Mon Dec 12, 2011 9:07 pm

bob wrote:I have access to HUNDREDS of HT-enabled machines. And I have yet to find one single example where hyperthreading on provides faster time-to-depth results than hyperthreading off. NPS is irrelevant. Time to a fixed depth is much more important, as that is how you measure parallel search performance.

Hi Bob,

Given the SMP randomness, is it easy to measure "time to a fixed depth"?
What protocol do you recommend: average (± deviation), other?...

Vinvin · Post by **Vinvin** » Tue Dec 13, 2011 3:10 am

Many thanks !
It's pretty clear now ...
12 threads with HT is 7% slower !

Would you mind doing the last test (HT ON but engine set to 6 threads)?

Robert Flesher · Post by **Robert Flesher** » Tue Dec 13, 2011 5:25 am

Thanks Bob, very interesting. This explains a lot! I am going to have to test this again.

Sedat Canbaz · Post by **Sedat Canbaz** » Tue Dec 13, 2011 9:35 am

Vinvin wrote:Many thanks !
It's pretty clear now ...
12 threads with HT is 7% slower !

Would you mind doing the last test (HT ON but engine set to 6 threads)?

Not at all... OK, but only after the SCCT Auto232 HT test.

BTW, the SCCT Auto232 HT test has already started...

And the current standings after 25 games:

Individual statistics:

Houdini 2.0c Pro x64 12 Threads vs Deep Rybka 4.1 x64 4 Cores: 25 games (+10, =12, -3), 64.0 %

Some Notes:
-Houdini 2.0c Pro x64 12 Threads is playing with a 512 MB hashtable (approx. 10 Elo better than 128 MB)
*Houdini 2.0b Pro x64 6 Cores was tested with a 128 MB hashtable
-In my previous tests, Houdini 2.0c 1c performed 10 Elo better than Houdini 2.0 1c
-A minimum of 500 games is planned for Houdini 2.0c Pro x64 12 Threads

The previous match, which may be helpful for comparison:

Individual statistics:

Houdini 2.0b Pro x64 6 Cores vs Deep Rybka 4.1 x64 4 Cores: 100 games (+50, =40, -10), 70.0 %

Best,
Sedat
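
As a rough reading of the two match results above, the score percentages can be converted into an implied rating gap with the plain logistic Elo formula. This is only a sketch: it is not what Bayeselo computes (Bayeselo uses its own model and reports error margins), and the two matches also differ in engine version and hash size, so the gap between them cannot be pinned on HT alone.

```python
import math

def score(wins, draws, losses):
    """Score fraction with a draw counted as half a point."""
    return (wins + 0.5 * draws) / (wins + draws + losses)

def elo_diff(s):
    """Rating difference implied by score fraction s (logistic Elo model)."""
    return 400 * math.log10(s / (1 - s))

# (wins, draws, losses) against Deep Rybka 4.1 x64 4 Cores, from the posts above.
matches = {
    "Houdini 2.0c Pro x64 12 Threads": (10, 12, 3),
    "Houdini 2.0b Pro x64 6 Cores": (50, 40, 10),
}

for name, (w, d, l) in matches.items():
    s = score(w, d, l)
    print(f"{name}: {s:.1%} over {w + d + l} games, about {elo_diff(s):+.0f} Elo")
```

With only 25 games in the running match, the implied gap can still move a lot before the planned 500 games are played.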

bob · Post by **bob** » Tue Dec 13, 2011 6:01 pm

BTW, for the record, I use HT on EVERYWHERE now, but for Crafty I set max threads to the number of physical cores. Hyperthreading can help in other cases, such as when doing a Linux kernel build and compiling everything. But if I run max threads at the number of physical cores, I get identical performance to HT off for Crafty. And should you have a 12 physical core box (24 logical cores) and be running a 12-thread game, running another application doesn't hurt as much as it does with HT off...

diep · Post by **diep** » Tue Dec 13, 2011 7:46 pm

bob wrote:
Werewolf wrote:
bob wrote:
Problem with HT on is that if you have 4 physical cores, and search X NPS, when you go to 8 cores (HT on) the tree will grow by 30%. If your NPS doesn't grow by MORE than 30%, you see a net loss.

NPS is NOT the way to measure parallel search performance. It provides completely bogus comparisons...
Your answer is too interesting to let it slip by!
a) Why does the tree grow by 30% with HT on?
b) Is this also true if we move from 4 physical cores to 8?
c) Why do the np/s have to increase by 30% or more to maintain performance? (because surely the HT tree isn't the same tree as the Non-HT tree and therefore time to depth is misleading)
That assumes that if you have 4 real cores and test with 4 threads, and then use 8 logical cores (HT on) and run with 8 threads, the tree will grow about 30% in size due to the parallel search overhead.

alpha/beta is a purely sequential algorithm as defined. You need to establish a bound at each node, by searching the best move first, then you use that bound to search the remaining nodes more efficiently. When you don't do this (and you can't in a parallel search) you search a larger tree to reach the same depth.

For (b) yes. It is not a "core" issue but a "number of threads" issue.

(c) think about it. Going from 4 to 8 threads makes the tree 30% larger. If you don't speed up enough with the extra 4 threads to offset that loss, you see a net decrease in performance. If the NPS increases by more than that amount, you see a (small) net gain.
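
The point about bounds in the explanation quoted above can be seen on a toy tree. The sketch below is a single-threaded simulation, not how Crafty or any real engine splits work: it compares searching the root moves one after another (each reusing the best score found so far) with searching every root move on a full window, as if all of them had been started at once with no bound to share. The tree shape, the seed, and all function names are made up for the example.

```python
import random

INF = float("inf")

def make_tree(depth, width, rng):
    """Random uniform tree: a leaf is an int score, an inner node a list of children."""
    if depth == 0:
        return rng.randint(-100, 100)
    return [make_tree(depth - 1, width, rng) for _ in range(width)]

def alphabeta(node, alpha, beta, counter):
    """Fail-hard negamax alpha-beta; counter[0] is incremented per visited node."""
    counter[0] += 1
    if isinstance(node, int):
        return node
    for child in node:
        value = -alphabeta(child, -beta, -alpha, counter)
        if value >= beta:
            return beta  # cutoff: the remaining siblings are skipped
        alpha = max(alpha, value)
    return alpha

def sequential_root(tree):
    """Root moves searched in order, each one reusing the best score so far."""
    counter, best = [0], -INF
    for child in tree:
        best = max(best, -alphabeta(child, -INF, -best, counter))
    return best, counter[0]

def unshared_root(tree):
    """Root moves searched on a full window, as if started with no bound to share."""
    counter = [0]
    best = max(-alphabeta(child, -INF, INF, counter) for child in tree)
    return best, counter[0]

tree = make_tree(depth=6, width=4, rng=random.Random(2011))
seq_value, seq_nodes = sequential_root(tree)
par_value, par_nodes = unshared_root(tree)
print(f"same result: {seq_value == par_value}")
print(f"nodes with shared bound: {seq_nodes}, without: {par_nodes}")
```

Both searches return the same score, but the one without the shared bound has to visit extra nodes to get there. That extra work is the kind of overhead the 30% figure refers to.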

Hi Bob,

If your speedup is 3.1 out of 4, and 30% is the break-even point when moving to 8 cores with hyperthreading, then that would mean Crafty's speedup deteriorates a lot. Namely, the break-even works out as follows:

Assuming 100% scaling now: 4 * 1.3 = 5.2
So you get less than 3.1 out of 5.2, with 5.2 being what you get at 8 cores.

8 * 3.1 / 5.2 = 4.77 out of 8

So if even a 30% increase in NPS from hyperthreading doesn't benefit Crafty, then, assuming you get 3.1 out of 4 as a speedup, at 8 cores you get 4.77 out of 8.

For Young Brothers Wait that seems like a rather small speedup out of 8 cores to me.

Vincent
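
The arithmetic in this exchange is easy to replay. The sketch below uses only the figures from the posts (30% tree growth, a speedup of 3.1 on 4 cores, which Vincent states as a premise); the NPS gains in the loop are hypothetical values picked to show both sides of the break-even point.

```python
def net_speedup(nps_gain, tree_growth):
    """Time-to-depth change when NPS rises by nps_gain and the tree by tree_growth.
    Both are multipliers (1.30 means +30%). A result below 1.0 is a net loss."""
    return nps_gain / tree_growth

TREE_GROWTH = 1.30  # figure quoted from bob for going from 4 to 8 threads

for nps_gain in (1.15, 1.30, 1.45):  # HYPOTHETICAL NPS gains from turning HT on
    ratio = net_speedup(nps_gain, TREE_GROWTH)
    verdict = "gain" if ratio > 1 else "loss" if ratio < 1 else "break-even"
    print(f"NPS x{nps_gain:.2f}: time-to-depth x{ratio:.2f} ({verdict})")

# Vincent's extrapolation: 8 HT threads at break-even behave like
# 4 * 1.3 = 5.2 cores' worth of NPS while still delivering a speedup of 3.1.
speedup_on_4 = 3.1
core_equivalents = 4 * 1.3
speedup_on_8 = 8 * speedup_on_4 / core_equivalents
print(f"implied speedup on 8 real cores: {speedup_on_8:.2f}")
```

The extrapolation assumes the efficiency seen at 5.2 core-equivalents carries over unchanged to 8 real cores, which is Vincent's assumption rather than a measured figure.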
