Some hyperthreading results

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 2:48 am
Location: London, UK

Re: Some hyperthreading results

Post by matthewlai » Mon Sep 12, 2016 11:40 pm

Dann Corbit wrote:
Laskos wrote:
mjlef wrote:Kai,

How can both hyperthreading on and off be tested? Was this two identical machines or on 1 machine? I ask since I see a different nps with hyperthreading being off and using the half the thread, than hyperthreading on using half the threads. So I do not see how it can be tested on one machine.

If two identical machines could be setup and connected, one with hyperthreading on and one off, then that would rule out the issue. But nodes per second does not seem to be enough.

Mark
For NPS it's straightforward without switching HT off in BIOS. For example, for 4 cores, 8 threads, start a command prompt inside the folder of Komodo and type:

Start /affinity 55 Komodo.exe

55 is the hexadecimal representation of 01010101 (from core 7 to 0), i.e. physical cores 0, 2, 4, 6.
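As a rough illustration of how such a mask is built, the sketch below sets one bit per logical CPU index and prints the result in hexadecimal. The helper name `affinity_mask` is hypothetical. It also assumes that logical CPUs 2k and 2k+1 are the two hyperthreads of the same physical core, which is a common enumeration but is not guaranteed on every system.

```python
def affinity_mask(logical_cpus):
    """Return the hex affinity mask (without 0x prefix) for the given logical CPU indices."""
    mask = 0
    for cpu in logical_cpus:
        # Bit n of the mask enables logical CPU n.
        mask |= 1 << cpu
    return format(mask, "X")


# One logical CPU per physical core on a 4-core / 8-thread machine
# (assuming siblings are numbered 0/1, 2/3, 4/5, 6/7).
print(affinity_mask([0, 2, 4, 6]))  # 55

# All 8 logical CPUs.
print(affinity_mask(range(8)))  # FF
```

The first value is the one passed to `Start /affinity` in the command above.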

For matches I do a little sloppy job: in Cutechess-Cli with "restart=off" switch, I set the affinities by hand in task manager at the beginning of the match. If I need 4 threads on 4 physical cores, I leave only 0,2,4,6 checked. If I need 8 threads on 8 logical cores, I leave all checked. It can be done separately for each running engine.
I didn't notice significant differences using affinities to physical cores and switching HT off in the BIOS in Fritz Benchmark or NPS.
What happens when you exceed the hyperthread core count?
E.g. on a machine with 6 physical cores and 12 HT cores, what happens with 13 threads and above?
Same as if you run 2 threads on a single-core CPU without hyperthreading: the scheduler will swap the threads in and out quickly to create the illusion of the threads running at the same time, but in reality only 1 thread will run at any given time. That's why we can run multiple programs at the same time even on single-core CPUs (and more than 4 programs on quad cores), even without hyperthreading.

If all the threads have the same priority and the scheduler is half-decent, each of the 13 threads should get about 12/13 of a logical CPU's time.
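The 12/13 figure can be reproduced with a small calculation. This is only a sketch under idealized assumptions that are not in the post: a perfectly fair scheduler, equal priorities, threads that are always runnable, and no context-switch cost. The function name `cpu_share_per_thread` is made up for the example.

```python
from fractions import Fraction


def cpu_share_per_thread(runnable_threads, logical_cpus):
    """Fraction of one logical CPU each thread gets under an ideal fair scheduler."""
    if runnable_threads <= logical_cpus:
        # Every thread can have a logical CPU to itself.
        return Fraction(1)
    # Oversubscribed: the available CPU time is split evenly among the threads.
    return Fraction(logical_cpus, runnable_threads)


# 13 threads on 6 physical cores with hyperthreading (12 logical CPUs).
print(cpu_share_per_thread(13, 12))  # 12/13
```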
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.

Laskos
Posts: 9527
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Some hyperthreading results

Post by Laskos » Tue Sep 13, 2016 1:18 am

lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading; test with HP off matching the physical core count (or minus one), or doubling the physical core count with hyperthreading on (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1) as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Ever since I got my first i7 desktop with 4 physical cores and Windows 7, some 5 years ago, I have always used HT ON. I divide match tests roughly in two: as many games as possible, or a longer time control (still with as many games as possible). For the first case I use 8-1=7 or 8-2=6 logical cores. For the second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed 1-2% lower than when setting affinity to physical cores or disabling HT in the BIOS, but the loss is spread equally, so it doesn't bother me. The last time I performed a statistical analysis of the match outcomes for the 8-1=7 setup was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to the Win 8 I have now. I did have problems with Vista; they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles thread scheduling.

With HT ON and concurrency 8 on 4 physical cores, the speed of a single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 it is maybe 25% lower. So, to compare different results at the same time control, one has to be consistent, as the "effective time control" is different for 4-0, 8-1 or 8-2. Sometimes I have to use 8-2=6 because YouTube or an antimalware program eats one full thread, and there are other things happening with the OS or the browser.
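One way to read "effective time control" is to scale the nominal time by the per-engine slowdown. The sketch below does just that. The function name and the 10-second example game are illustrative assumptions. The 30% and 25% slowdowns are the rough figures quoted above, and the linear scaling is a simplification.

```python
def effective_time_control(nominal_seconds, slowdown):
    """Time at 4-0 speed that searches roughly as many nodes as nominal_seconds at the slower setting."""
    if not 0.0 <= slowdown < 1.0:
        raise ValueError("slowdown must be in [0, 1)")
    return nominal_seconds * (1.0 - slowdown)


# A nominal 10 s game under each setting, expressed in "4-0 seconds".
for label, slowdown in [("4-0", 0.0), ("8-1", 0.25), ("8-0", 0.30)]:
    print(label, round(effective_time_control(10.0, slowdown), 2))
```

Under these assumptions a 10 s game at concurrency 8 is worth roughly a 7 s game at concurrency 4, which is why results from the two settings should not be mixed at the same nominal time control.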


Re: Some hyperthreading results

Post by matthewlai » Tue Sep 13, 2016 1:24 am

Laskos wrote:
lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading; test with HP off matching the physical core count (or minus one), or doubling the physical core count with hyperthreading on (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1) as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Since I had my first i7 4 physical core desktop and Windows 7, that's some 5 years, I always used HT ON. The match tests I divide roughly in 2: as many games as possible or as longer time control (with as many games too). For first case I use 8-1=7 or 8-2=6 logical cores. For second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed lower by 1-2% than setting affinity on physical cores or disabling HT in BIOS, but this is spreaded equally, so doesn't bother me. Last time I performed a statistical analysis of the match outcomes for the usage 8-1=7 was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to Win 8 I have now. I did have problems with Vista, they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles the scheduling of the threads.

With HT ON and concurrency 8 on 4 physical cores the speed of the single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 maybe 25% slower. So, to compare different results at the same time control, one has to be consistent, as "effective time control" is different for 4-0, 8-1 or 8-2. Sometime I have to use 8-2=6 because a YouTube or an antimalware eats one full thread, and there are other things happening with the OS or the browser.
That would mean a hyperthreading speedup of 0.7*8/4 = 1.4x. It's not unheard of, but that's on the very high end. I guess it makes sense since chess engines are pretty memory-intensive (due to transposition table probes), and hyper-threading can take advantage of that.

For network training (also scales linearly with almost no data sharing between threads), I found 20-24 threads to be optimal on a 16-core machine. It becomes quite a bit slower (than 16 threads) at 32 threads.

I get 1.15x to 1.2x speedup with 20-24 threads.
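The 1.4x figure is total throughput with hyperthreading divided by throughput with one job per physical core. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the concurrency-7 case is extrapolated from the "maybe 25% slower" estimate in the quoted post rather than from a separate measurement.

```python
def ht_throughput_speedup(per_thread_relative_speed, ht_concurrency, physical_cores):
    """Aggregate throughput with HT relative to one job per physical core.

    per_thread_relative_speed: speed of each job under HT, where 1.0 is the
    speed of a job that has a physical core to itself.
    """
    return per_thread_relative_speed * ht_concurrency / physical_cores


# 8 jobs on 4 cores, each about 30% slower: roughly 1.4x total throughput.
print(ht_throughput_speedup(0.70, 8, 4))

# 7 jobs on 4 cores, each about 25% slower.
print(ht_throughput_speedup(0.75, 7, 4))
```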


Re: Some hyperthreading results

Post by Laskos » Tue Sep 13, 2016 1:31 am

Dann Corbit wrote:
Laskos wrote:
mjlef wrote:Kai,

How can both hyperthreading on and off be tested? Was this two identical machines or on 1 machine? I ask since I see a different nps with hyperthreading being off and using the half the thread, than hyperthreading on using half the threads. So I do not see how it can be tested on one machine.

If two identical machines could be setup and connected, one with hyperthreading on and one off, then that would rule out the issue. But nodes per second does not seem to be enough.

Mark
For NPS it's straightforward without switching HT off in BIOS. For example, for 4 cores, 8 threads, start a command prompt inside the folder of Komodo and type:

Start /affinity 55 Komodo.exe

55 is the hexadecimal representation of 01010101 (from core 7 to 0), i.e. physical cores 0, 2, 4, 6.

For matches I do a little sloppy job: in Cutechess-Cli with "restart=off" switch, I set the affinities by hand in task manager at the beginning of the match. If I need 4 threads on 4 physical cores, I leave only 0,2,4,6 checked. If I need 8 threads on 8 logical cores, I leave all checked. It can be done separately for each running engine.
I didn't notice significant differences using affinities to physical cores and switching HT off in the BIOS in Fritz Benchmark or NPS.
What happens when you exceed the hyperthread core count?
E.g. on a machine with 6 physical cores and 12 HT cores, what happens with 13 threads and above?
These days I have been playing with some crazy things like 64 Lazy SMP threads on 4 physical cores (8 HT cores). Funny thing: with 16 threads the NPS is even 1% higher than with 8 HT threads. The gameplay is obviously weaker with 16 threads than with 8, because the per-thread speed is halved. I tested at fixed nodes up to 64 threads, and the conclusion would be that if NPS scaling with Lazy SMP is good (and it seems to be, at least up to 32 cores), then the SMP (effective speed-up) scaling is also good. This is speculative, but if I accepted an effective speed-up of 10-12 for YBW Houdini 4 on 32 cores, it seems that Lazy SF dev and Komodo have an effective speed-up of maybe more than 20 on 32 cores, which is extremely high.


Re: Some hyperthreading results

Post by Laskos » Tue Sep 13, 2016 2:37 am

matthewlai wrote:
Laskos wrote:
lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading; test with HP off matching the physical core count (or minus one), or doubling the physical core count with hyperthreading on (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1) as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Since I had my first i7 4 physical core desktop and Windows 7, that's some 5 years, I always used HT ON. The match tests I divide roughly in 2: as many games as possible or as longer time control (with as many games too). For first case I use 8-1=7 or 8-2=6 logical cores. For second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed lower by 1-2% than setting affinity on physical cores or disabling HT in BIOS, but this is spreaded equally, so doesn't bother me. Last time I performed a statistical analysis of the match outcomes for the usage 8-1=7 was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to Win 8 I have now. I did have problems with Vista, they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles the scheduling of the threads.

With HT ON and concurrency 8 on 4 physical cores the speed of the single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 maybe 25% slower. So, to compare different results at the same time control, one has to be consistent, as "effective time control" is different for 4-0, 8-1 or 8-2. Sometime I have to use 8-2=6 because a YouTube or an antimalware eats one full thread, and there are other things happening with the OS or the browser.
That would mean a hyperthreading speedup of 0.7*8/4 = 1.4x. It's not unheard of, but that's on the very high end. I guess it makes sense since chess engines are pretty memory-intensive (due to transposition table probes), and hyper-threading can take advantage of that.

For network training (also scales linearly with almost no data sharing between threads), I found 20-24 threads to be optimal on a 16-core machine. It becomes quite a bit slower (than 16 threads) at 32 threads.

I get 1.15x to 1.2x speedup with 20-24 threads.
Strength-wise, I remember a result of mine with YBW showing 6 threads being optimal for 4 physical cores, not 8.

lkaufman
Posts: 3760
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA

Re: Some hyperthreading results

Post by lkaufman » Tue Sep 13, 2016 2:46 am

Laskos wrote:
lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading; test with HP off matching the physical core count (or minus one), or doubling the physical core count with hyperthreading on (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1) as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Since I had my first i7 4 physical core desktop and Windows 7, that's some 5 years, I always used HT ON. The match tests I divide roughly in 2: as many games as possible or as longer time control (with as many games too). For first case I use 8-1=7 or 8-2=6 logical cores. For second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed lower by 1-2% than setting affinity on physical cores or disabling HT in BIOS, but this is spreaded equally, so doesn't bother me. Last time I performed a statistical analysis of the match outcomes for the usage 8-1=7 was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to Win 8 I have now. I did have problems with Vista, they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles the scheduling of the threads.

With HT ON and concurrency 8 on 4 physical cores the speed of the single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 maybe 25% slower. So, to compare different results at the same time control, one has to be consistent, as "effective time control" is different for 4-0, 8-1 or 8-2. Sometime I have to use 8-2=6 because a YouTube or an antimalware eats one full thread, and there are other things happening with the OS or the browser.
Thanks. A couple follow-up questions if you don't mind:
1. What is "longer time control" where you only use actual number of physical cores?
2. Why use fewer threads for longer time controls? Since using 7 or 8 is way more efficient than using 4 (based on your numbers, which are similar to my own), you must really distrust using 7 or 8 in longer games for some reason.
3. How do you know in advance whether to leave 1 or 2 threads free for things that are unpredictable?
4. Why leave 1 (or 2) threads free when aiming for 8 but none free when aiming for 4?
Komodo rules!


Re: Some hyperthreading results

Post by Laskos » Tue Sep 13, 2016 3:35 am

lkaufman wrote:
Laskos wrote:
lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading; test with HP off matching the physical core count (or minus one), or doubling the physical core count with hyperthreading on (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1) as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Since I had my first i7 4 physical core desktop and Windows 7, that's some 5 years, I always used HT ON. The match tests I divide roughly in 2: as many games as possible or as longer time control (with as many games too). For first case I use 8-1=7 or 8-2=6 logical cores. For second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed lower by 1-2% than setting affinity on physical cores or disabling HT in BIOS, but this is spreaded equally, so doesn't bother me. Last time I performed a statistical analysis of the match outcomes for the usage 8-1=7 was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to Win 8 I have now. I did have problems with Vista, they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles the scheduling of the threads.

With HT ON and concurrency 8 on 4 physical cores the speed of the single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 maybe 25% slower. So, to compare different results at the same time control, one has to be consistent, as "effective time control" is different for 4-0, 8-1 or 8-2. Sometime I have to use 8-2=6 because a YouTube or an antimalware eats one full thread, and there are other things happening with the OS or the browser.
Thanks. A couple follow-up questions if you don't mind:
1. What is "longer time control" where you only use actual number of physical cores?
2. Why use fewer threads for longer time controls? Since using 7 or 8 is way more efficient than using 4 (based on your numbers, which are similar to my own), you must really distrust using 7 or 8 in longer games for some reason.
What I meant is that in these games I do care about scaling with "effective time": a longer TC or a higher NPS. The test can often last for days, and I still use my PC for browsing and text editing. I feel comfortable with 8-1 only for runs lasting hours, and I usually don't disturb this PC excessively during such a run. In these longer 4-0 games (say above 30s/game), overhead, granularity of the Windows timer, and move selection noise are all of smaller importance, and I want to preserve the "quality" of the test. 8-1 games at 10s/game or less are noisy anyway.

3. How do you know in advance whether to leave 1 or 2 threads free for things that are unpredictable?


4. Why leave 1 (or 2) threads free when aiming for 8 but none free when aiming for 4?
With 8-0, the impact of a 2-3% task on fast games is bad; it can even occasionally disrupt a game, as it could stick for, say, 30ms to a thread used by an engine. At least 8-1 is mandatory. With 4-0 and HT ON (8 logical cores), a 2-3% task usually interferes minimally with the active threads, since there are plenty of free ones available, and even if it does interfere, on longer TC the Windows scheduler is fast enough not to keep it stuck. But I am not a specialist on these issues, and some things might be subjective.


Re: Some hyperthreading results

Post by lkaufman » Tue Sep 13, 2016 5:03 am

Laskos wrote:
lkaufman wrote:
Laskos wrote:
lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading; test with HP off matching the physical core count (or minus one), or doubling the physical core count with hyperthreading on (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1) as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Since I had my first i7 4 physical core desktop and Windows 7, that's some 5 years, I always used HT ON. The match tests I divide roughly in 2: as many games as possible or as longer time control (with as many games too). For first case I use 8-1=7 or 8-2=6 logical cores. For second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed lower by 1-2% than setting affinity on physical cores or disabling HT in BIOS, but this is spreaded equally, so doesn't bother me. Last time I performed a statistical analysis of the match outcomes for the usage 8-1=7 was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to Win 8 I have now. I did have problems with Vista, they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles the scheduling of the threads.

With HT ON and concurrency 8 on 4 physical cores the speed of the single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 maybe 25% slower. So, to compare different results at the same time control, one has to be consistent, as "effective time control" is different for 4-0, 8-1 or 8-2. Sometime I have to use 8-2=6 because a YouTube or an antimalware eats one full thread, and there are other things happening with the OS or the browser.
Thanks. A couple follow-up questions if you don't mind:
1. What is "longer time control" where you only use actual number of physical cores?
2. Why use fewer threads for longer time controls? Since using 7 or 8 is way more efficient than using 4 (based on your numbers, which are similar to my own), you must really distrust using 7 or 8 in longer games for some reason.
I wanted to mean that I do care in these games about scaling with "effective time", TC to be longer, or NPS higher. The test can often last for days, and I still use my PC for browsing and with text editors. I feel comfortable with 8-1 only for hours, and I usually don't disturb excessively this PC during such a run. In these 4-0 longer games (say above 30s/game) overhead, granularity of Windows timer, move selection noise, all are of smaller importance, and I want to preserve the "quality" of the test. 8-1 at 10s/game or less are anyway noisy.

3. How do you know in advance whether to leave 1 or 2 threads free for things that are unpredictable?


4. Why leave 1 (or 2) threads free when aiming for 8 but none free when aiming for 4?
With 8-0 the impact of 2-3% task on fast games is bad, it can even disrupt occasionally a game, as it could stick for say 30ms to a thread used by an engine. At least 8-1 is mandatory. With 4-0 HT ON (8 logical cores), a 2-3% task usually interferes minimally with the active threads, there are plenty of free ones available, and even if it intervenes, on longer TC Windows scheduler is sufficiently fast to not keep it stuck. But I am not a specialist on these issues, and some things might be subjective.
I see. So you think that for "real" games (30" or longer) the improved quality of the test justifies the use of only four threads. So even though you could run on 7 threads at about 20" instead of 30" and get similar search depth with more games per minute, the test might be less "fair" and so you prefer to use only four threads. If that is the case, then we are probably doing the right thing to turn off hyperthreading and use (for example) 15 threads on a 16 core machine.
But your response is interesting in another way. I always thought that the only point of testing at say 30" vs 5" is that you get more depth, since many things behave differently at different depths. But I think you are suggesting that there is just much more randomness at 5" than at 30", so even if search depth had no effect on a given idea to be tested, there is still an argument for the longer time limit. The question is this: are the random factors at the 5" level ones that will become insignificant with say 10,000 games, or are we talking about factors that might bias a test even if you played a million games? That is actually very important to know.


Re: Some hyperthreading results

Post by Laskos » Tue Sep 13, 2016 11:04 am

lkaufman wrote:
Laskos wrote:
lkaufman wrote:
Laskos wrote:
lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading; test with HP off matching the physical core count (or minus one), or doubling the physical core count with hyperthreading on (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1) as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Since I had my first i7 4 physical core desktop and Windows 7, that's some 5 years, I always used HT ON. The match tests I divide roughly in 2: as many games as possible or as longer time control (with as many games too). For first case I use 8-1=7 or 8-2=6 logical cores. For second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed lower by 1-2% than setting affinity on physical cores or disabling HT in BIOS, but this is spreaded equally, so doesn't bother me. Last time I performed a statistical analysis of the match outcomes for the usage 8-1=7 was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to Win 8 I have now. I did have problems with Vista, they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles the scheduling of the threads.

With HT ON and concurrency 8 on 4 physical cores the speed of the single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 maybe 25% slower. So, to compare different results at the same time control, one has to be consistent, as "effective time control" is different for 4-0, 8-1 or 8-2. Sometime I have to use 8-2=6 because a YouTube or an antimalware eats one full thread, and there are other things happening with the OS or the browser.
Thanks. A couple follow-up questions if you don't mind:
1. What is "longer time control" where you only use actual number of physical cores?
2. Why use fewer threads for longer time controls? Since using 7 or 8 is way more efficient than using 4 (based on your numbers, which are similar to my own), you must really distrust using 7 or 8 in longer games for some reason.
I wanted to mean that I do care in these games about scaling with "effective time", TC to be longer, or NPS higher. The test can often last for days, and I still use my PC for browsing and with text editors. I feel comfortable with 8-1 only for hours, and I usually don't disturb excessively this PC during such a run. In these 4-0 longer games (say above 30s/game) overhead, granularity of Windows timer, move selection noise, all are of smaller importance, and I want to preserve the "quality" of the test. 8-1 at 10s/game or less are anyway noisy.

3. How do you know in advance whether to leave 1 or 2 threads free for things that are unpredictable?


4. Why leave 1 (or 2) threads free when aiming for 8 but none free when aiming for 4?
With 8-0 the impact of 2-3% task on fast games is bad, it can even disrupt occasionally a game, as it could stick for say 30ms to a thread used by an engine. At least 8-1 is mandatory. With 4-0 HT ON (8 logical cores), a 2-3% task usually interferes minimally with the active threads, there are plenty of free ones available, and even if it intervenes, on longer TC Windows scheduler is sufficiently fast to not keep it stuck. But I am not a specialist on these issues, and some things might be subjective.
I see. So you think that for "real" games (30" or longer) the improved quality of the test justifies the use of only four threads. So even though you could run on 7 threads at about 20" instead of 30" and get similar search depth with more games per minute, the test might be less "fair" and so you prefer to use only four threads. If that is the case, then we are probably doing the right thing to turn off hyperthreading and use (for example) 15 threads on a 16 core machine.
But your response is interesting in another way. I always thought that the only point of testing at say 30" vs 5" is that you get more depth,since many things behave differently at different depths. But I think you are suggesting that there is just much more randomness at 5" than at 30", so even if search depth had no effect on a given idea to be tested, there is still an argument for the longer time limit. The question is this: are the random factors at 5" level ones that will become insignificant with say 10,000 games, or are we talking about factors that might bias a test even if you played a million games? That is actually very important to know.
You mean 7 threads at a longer TC than 4 threads for the same depth, with more games per minute? Yes, but the issue is mostly practical. When I am back home I often use my desktop PC even while a test is running, as it is usually more responsive than my weak laptop or tablet. While I consider this small interference negligible with 4-0, it might be a bit disturbing with 8-1. But I forgot to mention one aspect of my testing: I often test at fixed depth or nodes, and in this case one can safely go to 8-0 or, with extremely fast games, when the time lag between moves is comparable to the movetime, to a concurrency higher than 8 on 8 logical cores (4 physical). Sometimes the Cutechess-Cli output flows at something like 5-10 games per second in this case, and there is no noise problem with concurrency. Also, do not consider 8-1 games at fixed time as always sub-standard. The first thing I do when a Komodo is released is go to LittleBlitzer, set 50ms/move (with LittleBlitzer there are no Komodo time forfeits with that form of TC), which gets rid of possible time control modifications, take care of modifications to overhead, set concurrency to 8-1, and leave it for 15-20 minutes or so in games against a previous Komodo. Then I have 1000 games, with usually very informative results. Strength, NPS, depth, and time used (aside from granularity) are all there in the output. Usually the strength difference is amplified a bit, but I am no rating list.

Games at 30'' compared to 2'' differ by a systematic bias, not by white noise decaying as 1/(N games)^0.5. Even if there are no scaling issues (although at 2''/game there are very many scaling issues), the noise at short TC blurs the outcome to such a degree that the strength difference might be systematically distorted. Besides, there is the non-random noise, say due to the timer routines used and to the overhead used (which I often modify to a smaller value). I only use these TC for very close engines, say SF-related devs. The 8-1 issue is the least of my concerns. And again, 8-1 testing can be as good as and faster than 4-0 if you dedicate the whole PC to testing. You can leave 8-1 for days at long TC (say 30''/game), and if your OS tasks or antivirus don't pop up actively at some times, you should be perfectly fine. The thread scheduler in Windows is good for, say, 500ms per move play. I don't know about the Linux thread scheduler.
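
To make the difference between the two kinds of error concrete, here is a minimal Python sketch (not taken from any testing tool). It assumes independent games, a 30% draw ratio and a hypothetical fixed bias of 0.01 in score; the function names random_error and rms_error are made up for illustration. The random part shrinks as 1/sqrt(N), while a bias puts a floor under the total error however many games are played.

```python
import math

def random_error(n_games, draw_ratio=0.3, score=0.5):
    """Standard error of the mean match score, assuming independent games."""
    win = score - draw_ratio / 2.0                   # win probability implied by score and draws
    variance = win + 0.25 * draw_ratio - score ** 2  # per-game variance of the 1 / 0.5 / 0 result
    return math.sqrt(variance / n_games)

def rms_error(n_games, bias, draw_ratio=0.3, score=0.5):
    """Root-mean-square error when a fixed (hypothetical) bias is added to the random part."""
    return math.sqrt(bias ** 2 + random_error(n_games, draw_ratio, score) ** 2)

# The random part keeps shrinking with N, the total error levels off at the bias.
for n in (1_000, 10_000, 1_000_000):
    print(n, round(random_error(n), 5), round(rms_error(n, bias=0.01), 5))
```

With these assumed numbers the random error at a million games is far below the bias, so more games no longer help.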

Laskos
Posts: 9527
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Some hyperthreading results

Post by Laskos » Tue Sep 13, 2016 11:12 am

Laskos wrote:
matthewlai wrote:
Laskos wrote:
lkaufman wrote:What is your opinion about how best to test in single-thread mode on these machines that have hyperthreading: test with HT off, matching the physical core count (or minus one), or with hyperthreading on, doubling the physical core count (or minus one or two)? We used to test with HT on and doubling the physical core count, but shortly before Don died we switched to testing with HT off and using the physical core count (minus 1), as almost everyone on this forum seemed convinced that HT should be off for single-thread testing. Clearly we can play more games per minute of equal quality with hyperthreading, but the suspicion is that they are less equal to each other in terms of available resources and hence more random. Of course this has nothing to do with lazy MP, as I'm talking about SP tests, but I could also ask the same question for four-thread testing (on machines with 8 or more physical cores), and perhaps your answer would be different.
Since I got my first i7 desktop with 4 physical cores and Windows 7, some 5 years ago, I have always used HT ON. I divide the match tests roughly in 2: as many games as possible, or as long a time control as possible (with as many games as possible too). For the first case I use 8-1=7 or 8-2=6 logical cores. For the second I use 4-0=4 physical cores. I rarely use affinity. Without it, on 4 cores the speed is indeed lower by 1-2% than when setting affinity to physical cores or disabling HT in BIOS, but this is spread equally, so it doesn't bother me. The last time I performed a statistical analysis of the match outcomes for the 8-1=7 usage was some 4 years ago with Win 7. Windows has gradually improved thread scheduling from Vista to the Win 8 I have now. I did have problems with Vista, they disappeared with Win 7, and it seems even better with Win 8. I don't know how Linux handles the scheduling of the threads.

With HT ON and concurrency 8 on 4 physical cores the speed of the single-threaded engine is about 30% lower than with concurrency 4 on 4 physical cores. With concurrency 7 it is maybe 25% slower. So, to compare different results at the same time control, one has to be consistent, as the "effective time control" is different for 4-0, 8-1 or 8-2. Sometimes I have to use 8-2=6 because YouTube or an antimalware program eats one full thread, and there are other things happening with the OS or the browser.
That would mean a hyperthreading speedup of 0.7*8/4 = 1.4x. It's not unheard of, but that's on the very high end. I guess it makes sense since chess engines are pretty memory-intensive (due to transposition table probes), and hyper-threading can take advantage of that.
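
The arithmetic behind that figure, as a small Python sketch. The helper name ht_throughput_gain is just for illustration, and the 0.70 and 0.75 per-thread speeds are the rough 30% and 25% slowdowns quoted above, not measurements of mine:

```python
def ht_throughput_gain(per_thread_speed, ht_threads, physical_cores):
    """Total throughput with HT, relative to running one thread per physical core.

    per_thread_speed is the speed of each thread under HT as a fraction of
    its speed when it has a physical core to itself.
    """
    return per_thread_speed * ht_threads / physical_cores

# Concurrency 8 on 4 physical cores, each engine about 30% slower.
print(ht_throughput_gain(0.70, 8, 4))
# Concurrency 7 on 4 physical cores, each engine about 25% slower.
print(ht_throughput_gain(0.75, 7, 4))
```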

For network training (also scales linearly with almost no data sharing between threads), I found 20-24 threads to be optimal on a 16-core machine. It becomes quite a bit slower (than 16 threads) at 32 threads.

I get 1.15x to 1.2x speedup with 20-24 threads.
Strength-wise, I remember a result of mine with YBW showing 6 threads being optimal for 4 physical cores, not 8.
I checked that with Lazy Stockfish dev, 8 threads on 4 physical cores seems the best:

   # PLAYER             : RATING  ERROR    POINTS  PLAYED     (%)
   1 SF 8 Threads       : 3035.8    7.8     533.5    1000    53.4
   2 SF 6 Threads       : 3024.6    7.8     509.5    1000    51.0
   3 SF 4 Threads       : 3000.0    8.0     457.0    1000    45.7

With a 16 physical core machine things might be different, and the optimum might occur at, say, 24 threads. This is due to the reduced SMP efficiency on 16 cores compared to 4 cores.
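
As a rough cross-check of the table, a score fraction can be converted to an Elo difference with the usual logistic formula. This is only a sketch: the function name elo_from_score is illustrative, and it assumes each engine's percentage is its score against the other two in equal proportion, which is not stated explicitly above.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction under the logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

# Score percentages copied from the table above.
for name, pct in (("SF 8 Threads", 53.4), ("SF 6 Threads", 51.0), ("SF 4 Threads", 45.7)):
    print(name, round(elo_from_score(pct / 100.0), 1))
```

Under that assumption each value should come out roughly equal to the engine's rating minus the average rating of its two opponents.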
