Some Notes about Hyper-Threading

bob · Post by **bob** » Wed Dec 14, 2011 5:30 am

diep wrote:
bob wrote:
Werewolf wrote:
bob wrote:
Problem with HT on is that if you have 4 physical cores, and search X NPS, when you go to 8 cores (HT on) the tree will grow by 30%. If your NPS doesn't grow by MORE than 30%, you see a net loss.

NPS is NOT the way to measure parallel search performance. It provides completely bogus comparisons...
Your answer is too interesting to let it slip by!
a) Why does the tree grow by 30% with HT on?
b) Is this also true if we move from 4 physical cores to 8?
c) Why do the np/s have to increase by 30% or more to maintain performance? (because surely the HT tree isn't the same tree as the Non-HT tree and therefore time to depth is misleading)
That assumes that if you have 4 real cores, and you test with 4 threads, and then use 8 logical cores (HT on) and run with 8 threads, then the tree will grow about 30% in size due to the parallel search overhead.

alpha/beta is a purely sequential algorithm as defined. You need to establish a bound at each node, by searching the best move first, then you use that bound to search the remaining moves more efficiently. When you don't do this (and you can't in a parallel search) you search a larger tree to reach the same depth.

For (b) yes. It is not a "core" issue but a "number of threads" issue.

(c) think about it. Going from 4 to 8 threads makes the tree 30% larger. If you don't speed up enough with the extra 4 threads to offset that loss, you see a net decrease in performance. If the NPS increases by more than that amount, you see a (small) net gain.
Hi Bob,

If your speedup is 3.1 out of 4 and 30% is the break-even point when moving to 8 threads with hyperthreading, then that would mean that Crafty's speedup deteriorates a lot, namely that its break-even is at:

Assuming 100% scaling now: 4 * 1.3 = 5.2
So you get less than 3.1 out of 5.2 with 5.2 being what you get at 8 cores.

8 * 3.1 / 5.2 = 4.76 out of 8

So if even a 30% increase in NPS from hyperthreading doesn't benefit
Crafty, then that means that, assuming you get 3.1 out of 4 as a speedup,
at 8 cores you get 4.76 out of 8.

For Young Brothers Wait that seems like a rather small speedup out of 8 cores to me.

Vincent

I've posted my numbers many times. On 4, the observed number was really 3.4x for my test set. On 8, it was something over 6, but I will have to scrounge up the data to see exactly. 6.1 comes to mind...

That 30% overhead has been in Crafty for 10+ years now, and the only thing that will reduce it is more accurate move ordering, which seems unlikely to happen... Note that 30% is just a statistical approximation used to fit a straight line (estimated speedup) to data that is not exactly linear. So lower core speedups are generally understated, as the 30% was the number I got when running on a 16 cpu Cray a long while back... Going beyond 16 the speedup is overstated using that formula. But it is a ballpark...
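
A minimal sketch of the arithmetic in this exchange, not Crafty's actual code. It assumes the straight line is speedup = 1 + 0.7 * (threads - 1), which is an inference from the "3.1 out of 4" figure quoted above rather than a formula stated here, and it treats the HT break-even as NPS gain divided by tree growth. The function names are hypothetical.

```python
def estimated_speedup(threads, overhead=0.30):
    """Straight-line estimate (assumed form): every thread after the
    first contributes (1 - overhead) of a full thread."""
    return 1.0 + (threads - 1) * (1.0 - overhead)


def ht_net_gain(nps_ratio, tree_growth=1.30):
    """Time-to-depth gain when doubling threads with HT on.

    nps_ratio   -- NPS with HT on divided by NPS with HT off
    tree_growth -- how much larger the tree gets (1.30 = 30% larger)
    A result above 1.0 is a net gain, below 1.0 a net loss.
    """
    return nps_ratio / tree_growth


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, "threads ->", round(estimated_speedup(n), 2))
    for ratio in (1.20, 1.30, 1.40):
        print("NPS x", ratio, "-> net", round(ht_net_gain(ratio), 3))
```

With these assumptions the line gives 3.1 at 4 threads and 5.9 at 8, slightly below the observed 3.4 and "something over 6", which fits the remark that lower core counts are understated.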

I have several 12 core boxes now so perhaps I ought to run the test on those, as well as on our 8-core boxes which I would expect to perform a bit better on a per-cpu basis, since there is less cache contention and bandwidth required...

Sedat Canbaz · Post by **Sedat Canbaz** » Wed Dec 14, 2011 2:01 pm

More Hyper-Threading benchmarks, but this time with Houdini 2.0c x64 PRO
Note that, with respect to all, I used Houdini (instead of Hiarcs) because Houdini solves the current mate much faster
Plus I can't spend more effort on this, and I don't have much free time...
Thanks in advance for your understanding!

BTW, I installed the Windows 8 Preview, but unfortunately the Auto232 COM driver did not work
So the current HT benchmarks were run on Windows Vista Ultimate 64-bit
Another very stable and highly recommended operating system for chess (I like it very much)

Conditions:
------------
i7 980X @4.33GHz
Windows Vista Ultimate 64-bit
Large Pages ON
128 MB Hashtable

1) Houdini 2.0c HT OFF 6 physical cores

44s
21s
44s
25s
42s
31s
42s
46s
27s
59s
---
Note:
Houdini 2.0c HT OFF 6 Cores solves the mate in 38s on average

***********************************

2) Houdini 2.0c HT ON 12 Threads

45s
41s
42s
23s
26s
68s
43s
32s
42s
46s
---

Note:
Houdini 2.0c HT ON 12 Threads solves the mate in 41s on average

***********************************

3) Houdini 2.0c HT ON 6 Threads (100 % CPUs usage)

44s
69s
61s
89s
29s
86s
18s
92s
67s
---
Some Notes:
Houdini 2.0c HT ON 6 Threads solves the mate in 61s on average
Houdini 2.0c 6T was tested while another engine (Hiarcs 13.2 6T) was thinking
In other words: CPU usage across all 12 logical CPUs was 100 %

*******************************************

4) Houdini 2.0c HT ON 6 Threads (50 % CPUs usage)

59s
46s
65s
29s
38s
19s
44s
68s
62s
25s
---
Some Notes:
Houdini 2.0c HT ON 6 Threads solves the mate in 45s on average
Houdini 2.0c 6T was tested on 6 CPUs, with the remaining 6 CPUs idle
In other words: no other engine was running while Houdini was thinking
******************************************

Final Words about 12 Threads and 6 Physical cores:
The chess speed difference between HT ON and HT OFF is approx. 5-10 % in favor of Hyper-Threading disabled
And in my estimation, the Elo difference will be approx. 10 ELO in favor of HT OFF too
Let's see how HT performs in the SCCT Auto232 Rating List
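
For anyone who wants to recheck the averages, here is a small sketch that recomputes them from the solve times listed above and compares each configuration with HT OFF. The dictionary labels are my own shorthand, and note that the third run lists only 9 times.

```python
from statistics import mean

# Solve times in seconds, copied from the four runs above.
runs = {
    "HT OFF, 6 cores": [44, 21, 44, 25, 42, 31, 42, 46, 27, 59],
    "HT ON, 12 threads": [45, 41, 42, 23, 26, 68, 43, 32, 42, 46],
    "HT ON, 6 threads, other engine thinking": [44, 69, 61, 89, 29, 86, 18, 92, 67],
    "HT ON, 6 threads, other CPUs idle": [59, 46, 65, 29, 38, 19, 44, 68, 62, 25],
}

# Mean solve time per configuration.
averages = {name: mean(times) for name, times in runs.items()}
baseline = averages["HT OFF, 6 cores"]

if __name__ == "__main__":
    for name, avg in averages.items():
        # Positive percentage = slower than HT OFF.
        print(f"{name}: {avg:.1f}s ({avg / baseline - 1:+.0%} vs HT OFF)")
```

With only 9 or 10 runs per configuration and solve times ranging from 18s to 92s, a difference of a few seconds between averages should be read with caution.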

Download all HT benchmarks (including Mate 11 position):
http://www.sedatcanbaz.com/chess/games/ ... d_test.rar

Best Wishes,
Sedat

bob · Post by **bob** » Wed Dec 14, 2011 8:35 pm

Sedat Canbaz wrote:More Hyper-Threading Benchmarks,but this time with Houdini 2.0c x64 PRO
Note also that with respect to all,i used Houdini (instead of Hiarcs),due to Houdini solves the current mate much faster
Plus i can't spend more efforts and also i have no much more free time ...
Thanks in advance for your understanding !

BTW,i have installed Windows 8 Preview,but unfortunately Auto232 com driver did not work
So for the current HT benchmarks has been used Windows Vista Ultimate 64-bit
Another very stable and highly recommended operating system for chess (i like it very much)

Conditions:
------------
i7 980X @4.33GHz
Windows Vista Ultimate 64-bit
Large Pages ON
128 MB Hashtable

1) Houdini 2.0c HT OFF 6 physical cores

44s
21s
44s
25s
42s
31s
42s
46s
27s
59s
---
Note:
Houdini 2.0c HT OFF 6 Cores solves the mate average in 38s

***********************************

2) Houdini 2.0c HT ON 12 Threads

45s
41s
42s
23s
26s
68s
43s
32s
42s
46s
---

Note:
Houdini 2.0c HT ON 12 Threads solves the mate average in 41s

***********************************

3) Houdini 2.0c HT ON 6 Threads (100 % CPUs usage)

44s
69s
61s
89s
29s
86s
18s
92s
67s
---
Some Notes:
Houdini 2.0c HT ON 6 Threads solves the mate average in 61s
Houdini 2.0c 6T has been tested,when other (Hiarcs 13.2 6T) engine was thinking
In other words:All Computer's 12 CPUs usage was 100 %

*******************************************

4) Houdini 2.0c HT ON 6 Threads (50 % CPUs usage)

59s
46s
65s
29s
38s
19s
44s
68s
62s
25s
---
Some Notes:
Houdini 2.0c HT ON 6 Threads solves the mate average in 45s
Houdini 2.0c 6T has been tested with 6 CPUs,i mean rest 6 CPUs were on idle
In other words:Houdini's during thinking process,there was no any other engine
******************************************

Final Words about 12 Threads and 6 Physical cores:
Chess speed difference between HT ON and HT OFF is approx.5-10 % in favor for Hyper Threading Disabled
And in my estimation,the elo difference will be approx.10 ELO in favor for HT OFF too
Lets see...How HT will be performed in SCCT Auto232 Rating List

Download all HT benchmarks (including Mate 11 position):
http://www.sedatcanbaz.com/chess/games/ ... d_test.rar

Best Wishes,
Sedat

Matches every test I have run as well, although I only use my program. I did not understand Vincent's comments about how overclocking would somehow make HT work better. I don't see how it would change a thing, unless the overclocking is such that you do not adjust memory bus timing so it stays at the same speed. Then HT might help NPS more since the memory delays are longer and the two logical cores have a better chance of interleaving two instruction streams more effectively...

Sedat Canbaz · Post by **Sedat Canbaz** » Wed Dec 14, 2011 8:39 pm

More HT Testing...

Today I ran a few more interesting HT tests...
Honestly, I was just wondering what the chess speed would be if I ran 2 MP engines (both using 6 threads) against each other with ponder on

So... after a lot of HT testing, I noticed that both MP engines' performance and kns values drop dramatically

I mean when 2 MP engines play each other, 6 threads against 6 threads, with Ponder ON (the test was on the same PC, i7 980X @4.33GHz)

I'd just like to mention once more that:
---------------------------------------------
•Houdini 2.0c HT ON 6 Threads (on i7 980X @4.33GHz) solves the mate in 61s on average
-Houdini 2.0c 6T was tested while another engine (Hiarcs 13.2 6T) was thinking
*In other words: CPU usage across all 12 logical CPUs was 100 %
-During this HT test, Houdini's kns values also dropped dramatically, by approx. 50 % (once the opponent started pondering)

More interesting notes about HT:
----------------------------------------
•Houdini 2.0c HT ON 6 Threads (on i7 980X @4.33GHz) solves the mate in 45s on average
•Houdini 2.0c HT ON 6T was tested on 6 CPUs, with the remaining 6 CPUs idle
*I mean no other engine was pondering while Houdini was thinking
And during this HT test, Houdini's 6-thread kns values were quite satisfactory, nearly 18.000 kns

Note also that on exactly the same position, the same engine's (6 threads) kns values drop to approx. 10.000-12.000 kns (if another engine is pondering)

In other words:
-------------------
The chess speed of Houdini 2.0c Pro x64 6T on the i7 980X @4.33GHz is approx. equal to that of Houdini 2.0c Pro x64 on 4 physical cores of a QX9650 3.0GHz
But anyway, I think that is not too bad for i7 980X @4.33GHz HT 6-thread chess performance

*Note: Houdini 2.0c Pro x64 4c (on QX9650 @3.66GHz) solves the mate in 55s on average

BTW, for such engine matches on the i7 980X @4.33GHz HT, 6 Threads against 6 Threads, Ponder ON...
At least 2 PCs and an Auto232 player will not be needed

Best,
Sedat

Sedat Canbaz · Post by **Sedat Canbaz** » Wed Dec 14, 2011 9:22 pm

bob wrote:
Matches every test I have run as well, although I only use my program. I did not understand Vincent's comments about how overclocking would somehow make HT work better. I don't see how it would change a thing, unless the overclocking is such that you do not adjust memory bus timing so it stays at the same speed. Then HT might help NPS more since the memory delays are longer and the two logical cores have a better chance of interleaving two instruction streams more effectively...

Really... I did not understand it either; maybe Vincent means his Diep engine

Btw, the HT ON 12-thread results are very good so far:

Some Notes:
-The current HT test is running in Auto232 mode
-Deep Rybka 4.1 x64 4c is playing on i7 920 @3.0 GHz (Win XP x64,HT Disabled)
-Houdini 2.0c Pro x64 12t is playing on i7 980X @4.33GHz (Vista Ultimate x64,HT Enabled)
-512 MB Hashtable size for both engines
-TC:4m+2s

Greetings,
Sedat

bob · Post by **bob** » Wed Dec 14, 2011 11:43 pm

Sedat Canbaz wrote:More HT Testings...

Today,i have done a few HT interesting testings more...
Honestly i was just wondering about what will be the chess speed,if i run 2 mp engines (both using 6 threads) against each other with ponder on

So...after a lot of HT testings,i've noticed that the both mp engine performance and kns values are falling dramatically down

I mean,in case of playing 2 MP Engines,6 threads against 6 threads,between each other with Ponder ON (the test was on same PC-i7 980X @4.33GHz)

Just i'd like to mention once more that:
---------------------------------------------
•Houdini 2.0c HT ON 6 Threads (on i7980X @4.33GHz) solves the mate average in 61s
-Houdini 2.0c 6T has been tested,when other (Hiarcs 13.2 6T) engine was thinking
*In other words:All Computer's 12 CPUs usage was 100 %
-During this HT testing,Houdini's kns values fall down dramatically too,approx.50 % (when the opponent is started to pondering)

Another interesting notes about HT:
----------------------------------------
•Houdini 2.0c HT ON 6 Threads (on i7980X @4.33GHz) solves the mate average in 45s
•Houdini 2.0c HT ON 6T has been tested with 6 CPUs,where rest 6 CPUs were on idle
*I mean,Houdini's during thinking process,there was no any other engine pondering
And during this HT testings,Houdini 6 Threads's kns values were quite satisfied, nearly 18.000 kns

Note also exactly on same position and using same engine's (6 threads) kns values are fall down to approx. to 10.000-12.000 kns (if there is another engine pondering)

In other words:
-------------------
The chess speed of i7 980X @4.33GHz's Houdini 2.0c Pro x64 6T is approx. equal to QX9650 3.0GHz's Houdini 2.0c Pro x64 4 Physical Cores
But anyway i think its not too bad for i7 980X @4.33GHz HT 6 Threads Chess Performance

*Note:Houdini 2.0c Pro x64 4c (on QX9650 @3.66GHz) solves the mate average in 55s

BTW, in case of such eng-matches on i7 980X @4.33GHz HT 6 Threads against 6 Threads,Ponder ON...
Then at least it will be not needed: 2PCs and Auto232 player

Best,
Sedat

There IS a risk, however.

If you have 6 physical cores (12 logical cores) and you run two 6-thread programs, and you wanted me to compete "winner take all"... what I would do is when I am pondering I would burn in a tight L1-cache loop. That will give me MORE of that physical core than my opponent that is doing real things and accessing memory or L2/L3 cache... As a result, when I ponder, he actually gets fewer compute cycles than I do when I am not pondering...

I don't like the idea of sharing anything (including memory) between two chess programs if you play ponder=on. Ponder=off is perfectly OK, but with it on, there are unknown and unexpected side-effects that can introduce a bias that will remain hidden...

Sedat Canbaz · Post by **Sedat Canbaz** » Thu Dec 15, 2011 3:21 am

Another HT ON/OFF Fritz Chess Benchmarks:

HT Disabled 6 cores:-Windows Defender OFF:

HT Disabled 6 cores-Windows Defender ON:

Let Windows choose what's best for my computer:

Adjust for best performance:

HT ON-6 Threads:

Some Notes:
-All benchmarks are done with UAC Disabled
-i7 980X's CPU Cooler:Thermaltake Frio OCK
-CPU temperatures are higher with HT ON (and I hope the ELO performance will be higher too)

Best Regards,
Sedat

Sedat Canbaz · Post by **Sedat Canbaz** » Thu Dec 15, 2011 3:34 am

bob wrote:
Sedat Canbaz wrote:More HT Testings...

Today,i have done a few HT interesting testings more...
Honestly i was just wondering about what will be the chess speed,if i run 2 mp engines (both using 6 threads) against each other with ponder on

So...after a lot of HT testings,i've noticed that the both mp engine performance and kns values are falling dramatically down

I mean,in case of playing 2 MP Engines,6 threads against 6 threads,between each other with Ponder ON (the test was on same PC-i7 980X @4.33GHz)

Just i'd like to mention once more that:
---------------------------------------------
•Houdini 2.0c HT ON 6 Threads (on i7980X @4.33GHz) solves the mate average in 61s
-Houdini 2.0c 6T has been tested,when other (Hiarcs 13.2 6T) engine was thinking
*In other words:All Computer's 12 CPUs usage was 100 %
-During this HT testing,Houdini's kns values fall down dramatically too,approx.50 % (when the opponent is started to pondering)

Another interesting notes about HT:
----------------------------------------
•Houdini 2.0c HT ON 6 Threads (on i7980X @4.33GHz) solves the mate average in 45s
•Houdini 2.0c HT ON 6T has been tested with 6 CPUs,where rest 6 CPUs were on idle
*I mean,Houdini's during thinking process,there was no any other engine pondering
And during this HT testings,Houdini 6 Threads's kns values were quite satisfied, nearly 18.000 kns

Note also exactly on same position and using same engine's (6 threads) kns values are fall down to approx. to 10.000-12.000 kns (if there is another engine pondering)

In other words:
-------------------
The chess speed of i7 980X @4.33GHz's Houdini 2.0c Pro x64 6T is approx. equal to QX9650 3.0GHz's Houdini 2.0c Pro x64 4 Physical Cores
But anyway i think its not too bad for i7 980X @4.33GHz HT 6 Threads Chess Performance

*Note:Houdini 2.0c Pro x64 4c (on QX9650 @3.66GHz) solves the mate average in 55s

BTW, in case of such eng-matches on i7 980X @4.33GHz HT 6 Threads against 6 Threads,Ponder ON...
Then at least it will be not needed: 2PCs and Auto232 player

Best,
Sedat
There IS a risk, however.

If you have 6 physical cores (12 logical cores) and you run two 6-thread programs, and you wanted me to compete "winner take all"... what I would do is when I am pondering I would burn in a tight L1-cache loop. That will give me MORE of that physical core than my opponent that is doing real things and accessing memory or L2/L3 cache... As a result, when I ponder, he actually gets fewer compute cycles than I do when I am not pondering...

I don't like the idea of sharing anything (including memory) between two chess programs if you play ponder=on. Ponder=off is perfectly OK, but with it on, there are unknown and unexpected side-effects that can introduce a bias that will remain hidden...

Agreed...

Actually... I meant that the i7 980X @4.33GHz 6-thread speed is equal to that of the QX9650 @3.0GHz with 4 physical cores
*Note: this applies when running 6 Threads against 6 Threads, Ponder ON games on the same PC
Of course, I never plan to run such matches on the same PC, the i7 980X (HT Enabled, 6 threads against 6 threads, ponder on...)

And the best/ideal way of testing engines is Auto232 mode
In other words: I prefer testing the engines at maximum strength

Kind Regards,
Sedat

Sedat Canbaz · Post by **Sedat Canbaz** » Thu Dec 15, 2011 4:13 am

Dear Robert,

Btw, do you plan to release a new, well-optimized Crafty version that supports many cores?

For example, the currently available Crafty versions support up to 8 cores; that's why I paused the Crafty 22.8 benchmarks

And as far as I know, there are some Crafty compiles that support more than 8 cores, but unfortunately their MP scaling is not very good...
I mean the Crafty chess benchmarks do not perform well at 12 or more CPUs...

In other words, it would be great if you released a new Crafty version that supports many cores; later maybe I can resume my benchmarks with your great engine!

Best,
Sedat

Sedat Canbaz · Post by **Sedat Canbaz** » Fri Dec 16, 2011 12:27 pm

Hello Dear Friends,

Honestly, I am surprised and impressed by the latest results of Houdini 2.0c Pro x64 12 Threads (HT ON)

Of course, for a better conclusion more games are needed...
However, looking at the current results, I still don't expect the ELO performance to favor HT Enabled
Some Notes:
-Since December 2011, SCCT Auto232 participants use a 512 MB hashtable
-Houdini 2.0c Pro x64 1c performed approx. 10 ELO better than Houdini 2.0 Pro x64 1c
-Houdini 2.0 Pro x64 4c performed approx. 10 ELO better than Houdini 2.0b Pro x64 4c
In other words, Houdini 2.0c Pro x64 6c is expected to be at least 15 ELO stronger than Houdini 2.0b Pro x64 6c
And that means (as I mentioned before) Houdini 2.0c Pro x64 6c is expected to be approx. 10 ELO stronger than Houdini 2.0c Pro x64 12t

Let's see what the HT ELO performance will be after more games…

Rank Name                        Elo    +    - games score oppo. draws 
   1 Houdini 2.0c Pro x64 12t   3424   34   34   254   65%  3325   41% 
   2 Houdini 2.0b Pro x64 6c    3419   17   17  1009   69%  3300   46% 
   3 Houdini 2.0 Pro x64 4c     3362   16   16  1131   61%  3293   49% 
   4 Deep Rybka 4.1 x64 6c      3359   18   18   848   61%  3294   58% 
   5 Houdini 2.0b Pro x64 4c    3351   17   17   991   51%  3343   47% 
   6 Houdini 1.5a x64 4c        3344   16   16  1086   62%  3272   48% 
   7 Deep Rybka 4.1 x64 4c      3293   13   13  1603   47%  3314   56% 
   8 Critter 1.2 x64 4c         3288   16   16  1037   50%  3290   58% 
   9 IvanHoe 47c GH x64 4c      3281   15   15  1109   49%  3288   60% 
  10 Fire 2.2 xTreme x64 4c     3275   15   15  1165   41%  3328   59% 
  11 IvanHoe 0B.09.18 x64 4c    3270   16   16  1008   46%  3294   57% 
  12 DeepSaros 2.3i x64 4c      3269   16   16   982   48%  3279   60% 
  13 IvanHoe B47d x64 4c        3266   17   17   967   43%  3309   56% 
  14 Houdini 2.0c Pro x64 1c    3262   20   20   711   61%  3193   47% 
  15 IvanHoe B47f02 x64 4c      3260   16   16  1004   48%  3274   59% 
  16 Stockfish 2.1.1 JA x64 4c  3255   16   16  1012   46%  3275   55% 
  17 Houdini 2.0 Pro x64 1c     3253   14   13  1575   54%  3227   48% 
  18 Strelka 5.1 x64 1c         3243   17   17   921   56%  3210   53% 
  19 Rybka 4.1 x64 1c           3193   17   17   888   48%  3204   53% 
  20 Ivanhoe B46fa x64 1c       3190   27   27   359   46%  3214   57% 
  21 Komodo 3.0 x64 1c          3188   14   14  1563   40%  3248   45% 
  22 Ivanhoe B50kBf x64 1c      3186   20   20   637   47%  3206   58% 
  23 Ivanhoe B46a x64 1c        3174   17   17   949   43%  3213   57% 
  24 Naum 4.2 x64 4c            3174   18   18   898   36%  3258   46% 
  25 Stockfish 111026 x64 1c    3173   18   18   861   44%  3208   49% 

Individual statistics:

1 Houdini 2.0c Pro x64 12t  : 3424  254 (+114,=104,- 36), 65.4 %

Deep Rybka 4.1 x64 4c         : 114 (+ 61,= 41,- 12), 71.5 %
Houdini 2.0b Pro x64 4c       : 140 (+ 53,= 63,- 24), 60.4 %
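
The percentages in the individual statistics can be rechecked from the win/draw/loss counts, counting a draw as half a point. This is only a sketch of that scoring arithmetic; it does not reproduce how the rating list computes Elo, and the function name is hypothetical.

```python
def score_percent(wins, draws, losses):
    """Score as a percentage of games played, with a draw worth half a point."""
    games = wins + draws + losses
    return 100.0 * (wins + 0.5 * draws) / games


# (wins, draws, losses) copied from the individual statistics above.
results = {
    "overall": (114, 104, 36),
    "vs Deep Rybka 4.1 x64 4c": (61, 41, 12),
    "vs Houdini 2.0b Pro x64 4c": (53, 63, 24),
}

if __name__ == "__main__":
    for name, (w, d, l) in results.items():
        print(f"{name}: {w + d + l} games, {score_percent(w, d, l):.1f} %")
```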

For SCCT Auto232 Conditions:
http://www.sedatcanbaz.com/chess/rating ... onditions/

Greetings,
Sedat
