That's a strange comment. Every SMP program I have tried shows a modest NPS improvement, because the two logical cores can hide memory latency to an extent. But not a single program I have ever tested has shown a speed (time to depth) improvement, because the search overhead (extra nodes) far exceeds the NPS gains... I've posted quite a few HT on/off NPS numbers over recent years, dating back to the Pentium IV, which was the first to offer HT.
mcostalba wrote:
zullil wrote: Hi Marco,
When I use "bench 1024 [8,16] 3 default time", I get only a 3% gain in nps when going from 8 to 16 threads. With "bench 1024 [8,16] 5 default time", the gain is still just 7%.
Thus, as Robert Houdart and Bob Hyatt have suggested, test games at short time controls are not likely to demonstrate anything---and that's also assuming that gains in nps result in better chess performance.
Both Tord and Bob have convinced me that increasing nps is not likely to increase chess performance in any meaningful way, unless time/depth is also being decreased (which doesn't seem to be happening).
Since testing at anything but the fastest time controls would completely occupy my (work) computer for an extended period, I won't be able to run this test. Perhaps someone else with more hardware can run the test for you.
Thanks again for your work with SF.
Louis
Hi Louis,
thanks to you for your very interesting tests. You are the first to report an NPS improvement from switching on HT. Of course this does not mean SF is stronger, but it is still something that until yesterday was not even thinkable for SF, because the way threads were managed did not allow it.
Thanks
Marco
Stockfish "Use Sleeping Threads" Test
Moderator: Ras
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Stockfish "Use Sleeping Threads" Test
-
PawnStormZ
- Posts: 880
- Joined: Mon Feb 15, 2010 6:43 am
Stockfish "Use Sleeping Threads" Test
I was thinking that when HT was on it changed the way the processor worked in such a way that allocating 4 of 8 threads to the engine would not be exactly the same as running it on 4 threads with the HT off.
Obviously I will defer to your slightly superior knowledge on the subject.
-
MikeB
- Posts: 4889
- Joined: Thu Mar 09, 2006 6:34 am
- Location: Pen Argyl, Pennsylvania
Re: Stockfish "Use Sleeping Threads" Test
My comment was referring to fixed depth: for me, the time to reach a fixed depth is 10% or more faster. The NPS is about 30-40% higher with HT on my Windows 7 laptop. HT is much different from how it was 5 years ago.
bob wrote: Only problem is NPS is not important; time to a fixed depth is how chess is actually played. If your NPS goes up _and_ your time to fixed depth goes up, you are not gaining anything, and the program is weaker.
MikeB wrote: that's what I saw too - it works best with both enabled, ~10% gain. i3, two physical cores, 4 logical cores, Windows 7, 64 bit.
zullil wrote: ...
while having both hyperthreading and Use Sleeping Threads enabled gives a speedup of about 10% compared to having no hyperthreading. (I have checked that my machine runs the 8 threads on 8 distinct physical cores, i.e., no hyperthreading.)
...
Mike
===========================
MT=2 on a 2 Physical Core Machine
Total time (ms) : 53352
Nodes searched : 59118660
Nodes/second : 1108087
==============================
Same machine, same program, same bench test now using all 4 Logical Cores
Total time (ms) : 41542
Nodes searched : 61921860
Nodes/second : 1490584
I have tried both long (60 second) and short (4-5 second) searches. With HT enabled and used, my laptop consistently reaches a fixed depth faster than it does with HT disabled.
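As a rough check of those figures against the bench numbers above: 1490584 / 1108087 ≈ 1.35, so the HT run shows about 35% higher NPS; it searches about 5% more nodes (61921860 vs 59118660) yet finishes in about 22% less time (41542 / 53352 ≈ 0.78). On this single bench run that is consistent with a genuine time-to-depth gain, though one run says little given the variance discussed later in the thread.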
-
IQ
- Posts: 162
- Joined: Thu Dec 17, 2009 10:46 am
Re: Stockfish "Use Sleeping Threads" Test (Crafty
But if we agree in principle, wouldn't time to solution then be a much better test than time to depth? I understand that time to depth is probably still OK, but in your reasoning you mention examples where it might fail and emphasize the need for a reasonable number of different positions to test. Why then not use time to solution in the first place? In both cases you have to choose positions carefully and use multiple runs, but time to solution should be less error-prone. Maybe somebody can do an experiment analyzing the SD of both approaches?
bob wrote: While I agree in principle, I don't agree in practice. If we were diddling around with extensions and reductions and modifying them, I would not use time-to-depth for anything. But we are not modifying the search or pruning/reduction rules. There are not very many cases where you find the answer at a different depth when using threads vs a single search. There are a few, which is why I always advocate using a significant number of positions and then weeding the oddballs out. I have several positions where 2 threads is way more than 2x faster to get the right answer. I try to make sure that I don't depend on such positions to compute speedups. If you pick a set of positions, some will show super-linear speedup, and those should count. But those should not be the _only_ ones that count...
IQ wrote: I disagree here. Even a higher displayed depth in a fixed time means nothing. It could very well be that, through the non-deterministic nature of SMP, hash-table interaction and the high selectivity of modern programs, a higher depth is reached without playing stronger. The best test in my mind would be the TIME to SOLUTION of positions with known best moves (or, as an approximation, the MOVE that a reasonably large sample of engines agree on as depth goes to infinity). If you average time to solution over a reasonable number of positions (whose estimates themselves should be averages of multiple runs) you should be fine. Don't let yourself be fooled by the depth and nodes programs display; in a parallel world, and with modern selective programs, their informative value is relative.
bob wrote: Absolutely. But incredibly unlikely. To the point of "winning-the-lottery" type probability.
zullil wrote: Is the following statement reasonable?
Suppose that for each position and each fixed amount of search time, 16 threads reaches a higher depth than 8 threads. Then 16 threads is likely to perform at least as well as 8 threads, as measured by winning chess games.
And the same rule still applies. Not just one run but several. But if you can improve the depth, then the thing will be stronger. Just watch out for flying pigs while doing this test.
Time to fixed depth is a good SMP test. There will be occasional oddities. You just have to repeat the tests enough times that they don't skew the results...
At least 95% of the time, time to depth and time to solution will be comparable when computing speedups.
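For illustration of the bookkeeping being described, here is a minimal standalone sketch of averaging time-to-depth speedups over a position set while weeding out the oddballs. Everything in it is hypothetical: the timings are placeholders and the cutoffs for discarding positions are invented for the example, not a rule anyone here proposed.
Code: Select all
#include <iostream>
#include <utility>
#include <vector>

// Sketch: average time-to-depth speedup over a test set, discarding
// "oddball" positions whose speedup says more about search luck than SMP.
int main() {
    // {time with 1 thread (ms), time with 8 threads (ms)} per position.
    // Placeholder values only.
    std::vector<std::pair<double, double>> times = {
        {120000, 18000},
        { 95000, 16000},
        {110000,  9000},   // suspiciously super-linear; will be discarded
    };

    double sum = 0.0;
    int counted = 0;
    for (const auto& t : times) {
        double speedup = t.first / t.second;
        // Arbitrary cutoffs for weeding out oddballs; not a standard rule.
        if (speedup < 1.0 || speedup > 8.0)
            continue;
        sum += speedup;
        ++counted;
    }
    if (counted > 0)
        std::cout << "mean time-to-depth speedup: " << sum / counted
                  << " over " << counted << " positions\n";
}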
-
zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Stockfish "Use Sleeping Threads" Test
I don't think I've ever seen more than a 15% gain in NPS using hyperthreading, and even that gain occurs only in bench tests with large depth. 30-40% seems very unusual.
MikeB wrote: The NPS is about 30-40% higher with HT on my Windows 7 laptop.
HT enabled and used on my laptop consistently reaches a fixed depth faster than it does without enabled.
As I've (re)discovered recently, time to depth measurements are very variable. If you are consistently getting shorter time to depth numbers using hyperthreading that would be interesting indeed (or so I imagine).
This is a core i3 processor, BTW?
-
zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Stockfish "Use Sleeping Threads" Test
MikeB wrote: ===========================
MT=2 on a 2 Physical Core Machine
Total time (ms) : 53352
Nodes searched : 59118660
Nodes/second : 1108087
==============================
Same machine, same program, same bench test now using all 4 Logical Cores
Total time (ms) : 41542
Nodes searched : 61921860
Nodes/second : 1490584
This is representative of what I'm seeing:
Code: Select all
Using bench 1024 8 24 default depth
===============================
Total time (ms) : 527925
Nodes searched : 3367481098
Nodes/second : 6378711
Using bench 1024 16 24 default depth
===============================
Total time (ms) : 558527
Nodes searched : 4173317854
Nodes/second : 7472007
-
mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: Stockfish "Use Sleeping Threads" Test
In SF, when a thread has finished searching it keeps busy-polling for new work to be done. This could be acceptable if each thread runs on a different CPU, but in the case of HT, when one thread keeps polling in a tight loop it drains resources from the sibling logical thread on the same CPU for no reason.
bob wrote: That's a strange comment. Every SMP program I have tried shows a modest NPS improvement
With the new "sleeping threads" feature, threads that have finished searching and returned to the split-point root are put to sleep. For this to work you need a fast locking and condition-variable scheme, such as the newly introduced SRWLocks and Condition Variables under Windows, so that a sleeping thread can be very quickly signaled to wake up when there is new work to do.
To properly verify that the newly introduced feature works, it is important to rely on modern hardware and on fast locking (so not the ones delivered in the official JA builds, which instead aim at backward compatibility).
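For readers who want to see concretely what "polling" versus "sleeping" means here, below is a minimal self-contained sketch. It is not Stockfish's actual thread code, and it uses the portable std::mutex / std::condition_variable instead of the Windows-specific SRWLocks and Condition Variables Marco mentions; the struct and function names are invented for the example, and the synchronization is deliberately simplified.
Code: Select all
#include <condition_variable>
#include <mutex>

// Sketch of a worker thread's idle loop: polling vs. sleeping.
struct Worker {
    std::mutex mtx;
    std::condition_variable cv;
    bool has_work = false;   // set by the master when a split point needs help
    bool quit     = false;

    void do_search() { /* search the assigned split point (stub) */ }

    // Busy-polling idle loop: keeps a logical core spinning while idle,
    // draining resources from the sibling hyperthread on the same core.
    void idle_loop_polling() {
        while (!quit)
            if (has_work) { do_search(); has_work = false; }
    }

    // Sleeping idle loop: the thread blocks on the condition variable and
    // consumes essentially no execution resources until it is signaled.
    void idle_loop_sleeping() {
        std::unique_lock<std::mutex> lock(mtx);
        while (!quit) {
            cv.wait(lock, [this] { return has_work || quit; });
            if (has_work) {
                lock.unlock();
                do_search();
                lock.lock();
                has_work = false;
            }
        }
    }

    // Called by the master thread: hand the worker new work and wake it up.
    void wake_up() {
        { std::lock_guard<std::mutex> lg(mtx); has_work = true; }
        cv.notify_one();
    }
};
The practical difference is that the polling loop keeps a logical core fully busy doing nothing, while the sleeping loop leaves the physical core's resources to its hyperthreaded sibling until wake_up() is called, at which point the condition variable lets the worker resume almost immediately.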
-
zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Stockfish "Use Sleeping Threads" Test
And this is what I see on my i3 iMac using the icc compiler:
zullil wrote:
MikeB wrote: ===========================
MT=2 on a 2 Physical Core Machine
Total time (ms) : 53352
Nodes searched : 59118660
Nodes/second : 1108087
==============================
Same machine, same program, same bench test now using all 4 Logical Cores
Total time (ms) : 41542
Nodes searched : 61921860
Nodes/second : 1490584
This is representative of what I'm seeing:
So with HT I got a 17% (wow, that's high) gain in NPS, but at the cost of a 6% rise in total time. Based on what Bob and Tord have said, this suggests that HT is actually deleterious to chess performance on my system.
Code: Select all
Using bench 1024 8 24 default depth
===============================
Total time (ms) : 527925
Nodes searched : 3367481098
Nodes/second : 6378711
Using bench 1024 16 24 default depth
===============================
Total time (ms) : 558527
Nodes searched : 4173317854
Nodes/second : 7472007
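To spell out where those percentages come from: 7472007 / 6378711 ≈ 1.17, i.e. roughly a 17% NPS gain with 16 threads, while 558527 / 527925 ≈ 1.06, i.e. roughly a 6% increase in the total time needed to reach the same depths.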
Code: Select all
Using ./stockfish bench 128 2 20
===============================
Total time (ms) : 170558
Nodes searched : 350598936
Nodes/second : 2055599
Using ./stockfish bench 128 4 20
===============================
Total time (ms) : 182437
Nodes searched : 468851716
Nodes/second : 2569937
[EDIT] I repeated the test:
Code: Select all
Using bench 128 2 20
===============================
Total time (ms) : 203558
Nodes searched : 441985063
Nodes/second : 2171297
Using bench 128 4 20
===============================
Total time (ms) : 189368
Nodes searched : 488148168
Nodes/second : 2577775
The variance of total time is very large. One would need to repeat this test many times before a meaningful measurement could be established.
-
zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Stockfish "Use Sleeping Threads" Test
Here are the results of ten searches to depth 24 of the second of Stockfish's sixteen default positions. This was using 8 physical cores on my Mac Pro. Note the variability of the total time (in ms) taken to complete each search. I calculate a mean of 90066.6 and a sample standard deviation of 26983.9. Assuming a normal distribution, these numbers suggest that about 1 in 3 searches will take either less than 63083 ms or more than 117051 ms (roughly 32% of a normal distribution lies more than one standard deviation from the mean). Perhaps you could try this experiment on your i3 machine using your SF binary. I think this illustrates why it will be difficult to demonstrate that HT yields a better-performing engine.
Code: Select all
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 640777362
Total time (ms): 97647
Nodes/second: 6562181
Best move: Bxa6
Ponder move: bxc3
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 393386958
Total time (ms): 60933
Nodes/second: 6456057
Best move: Bxa6
Ponder move: bxc3
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 416546724
Total time (ms): 64487
Nodes/second: 6459390
Best move: Bxa6
Ponder move: bxc3
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 822556975
Total time (ms): 122239
Nodes/second: 6729087
Best move: Bxa6
Ponder move: bxc3
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 395129719
Total time (ms): 60075
Nodes/second: 6577273
Best move: Bxa6
Ponder move: bxc3
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 746878001
Total time (ms): 113729
Nodes/second: 6567172
Best move: dxe6
Ponder move: Bxe2
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 560845985
Total time (ms): 84954
Nodes/second: 6601760
Best move: Bxa6
Ponder move: bxc3
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 565951159
Total time (ms): 85320
Nodes/second: 6633276
Best move: Bxa6
Ponder move: bxc3
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 486849305
Total time (ms): 74301
Nodes/second: 6552392
Best move: Bxa6
Ponder move: bxc3
Searching: r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -
infinite: 0 ponder: 0 time: 0 increment: 0 moves to go: 0
Nodes: 938499802
Total time (ms): 136981
Nodes/second: 6851313
Best move: Bxa6
Ponder move: bxc3
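For anyone who wants to reproduce the quoted statistics, here is a small standalone sketch (not part of Stockfish) that recomputes the mean and sample standard deviation from the ten "Total time (ms)" values above:
Code: Select all
#include <cmath>
#include <iostream>
#include <vector>

// Recompute the summary statistics from the ten search times above.
int main() {
    const std::vector<double> t = {97647, 60933, 64487, 122239, 60075,
                                   113729, 84954, 85320, 74301, 136981};
    double mean = 0.0;
    for (double x : t) mean += x;
    mean /= t.size();

    double ss = 0.0;
    for (double x : t) ss += (x - mean) * (x - mean);
    const double sd = std::sqrt(ss / (t.size() - 1));   // sample std deviation

    std::cout << "mean = " << mean << " ms, sd = " << sd << " ms\n";
    // Prints approximately: mean = 90066.6 ms, sd = 26983.9 ms
}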
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Stockfish "Use Sleeping Threads" Test
It will be somewhat different if, say, you have 4 physical cores, 8 logical cores, and you run 4 chess threads + another application. But that is actually more efficient (at least under Linux) than running 4 physical/logical cores and letting the operating system context switch among the 5 runnable processes...
PawnStormZ wrote: I was thinking that when HT was on it changed the way the processor worked in such a way that allocating 4 of 8 threads to the engine would not be exactly the same as running it on 4 threads with the HT off.
Obviously I will defer to your slightly superior knowledge on the subject.
But if you really have 4 threads running, the "logical" processors will only be involved in running minute bits of code (interrupt handlers and such) and they can do this without causing context switches which is a tiny bit more efficient...