I've repeated the benchmarking I did a week ago (copied below). It seems the changes made to thread-locking have resulted in significant improvement, at least in NPS scaling from 8 to 16 cores.
Thanks. Is this with hyperthreading, or do you have two of these 18-core processors?
If you were able to run a test analogous to mine and post the results, that would be very helpful. Otherwise all we see is that you have access to very expensive hardware.
Very different test conditions. Fastgm used only the starting position, and ran each search for only 20 seconds.
If I can locate the 1401513 source code, I might test that binary as I did for the current one. In the meantime, I'll run his test on the latest SF and report the results later.
Stockfish 270215 64 POPCNT by Tord Romstad, Marco Costalba and Joona Kiiski
Five 20-second searches from startpos, first with 8 threads and then with 16.
setoption name Threads value 8
go movetime 20000
info nodes 171806220 time 20000
info nodes 168273476 time 20002
info nodes 170041349 time 20005
info nodes 173130777 time 20000
info nodes 173077843 time 20001
setoption name Threads value 16
go movetime 20000
info nodes 291157212 time 20000
info nodes 284994112 time 20003
info nodes 297324033 time 20004
info nodes 288602981 time 20002
info nodes 285779255 time 20003
1447857593/856329665 = 1.69
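For anyone wanting to redo this arithmetic, here is a minimal C++ sketch of how such a ratio can be computed: sum the node counts from the "info nodes ... time ..." lines of each run and divide. Since every search used the same movetime, the node ratio approximates the NPS ratio. The helper names total_nodes and nps_scaling are made up for illustration and are not part of Stockfish.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Sum the node counts from UCI lines of the form "info nodes <N> time <ms>".
// Lines that do not start with "info nodes" (e.g. "go movetime 20000") are skipped.
inline std::uint64_t total_nodes(const std::string& log) {
    std::istringstream lines(log);
    std::string line;
    std::uint64_t total = 0;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string info, key;
        std::uint64_t nodes = 0;
        if (fields >> info >> key >> nodes && info == "info" && key == "nodes")
            total += nodes;
    }
    return total;
}

// Scaling factor: nodes searched with more threads divided by nodes
// searched with fewer threads (hypothetical helper, equal search times assumed).
inline double nps_scaling(const std::string& more_threads,
                          const std::string& fewer_threads) {
    return static_cast<double>(total_nodes(more_threads)) /
           static_cast<double>(total_nodes(fewer_threads));
}
```

Paste the five 8-thread lines into one string and the five 16-thread lines into another, then call nps_scaling(log16, log8). An empty or unparsable baseline would mean a division by zero, so a real tool should check for that first.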
Don't those two CPUs set your house on fire? Hahahahahahah. Those are monsters from hell, man; those 2699v3s are totally beastly. I am not even surprised that you have a tablebase hit on your first move. Hahahahahahaha.
zullil wrote:I've repeated the benchmarking I did a week ago (copied below). It seems the changes made to thread-locking have resulted in significant improvement, at least in NPS scaling from 8 to 16 cores.
./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched : 233436761771
Nodes/second : 21030196
./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched : 160664514528
Nodes/second : 14474279
21030196/14474279 = 1.45
Yes, spin-locks are faster... on real cores.
But what you forget to mention is that they are counter-productive in the case of HT. So we need fishtest to detect whether a given machine uses HT cores (plus a UCI option for spinlocks in SF).
For Linux workers it should be easy (procinfo). Don't know about Windows...
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
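On the Linux detection idea: assuming "procinfo" refers to /proc/cpuinfo, one possible check is to compare the "siblings" field (logical CPUs per package) with the "cpu cores" field (physical cores per package), both of which x86 Linux kernels report. The sketch below is only an illustration of that comparison; the function names smt_active and host_uses_ht are hypothetical, and this is not how fishtest actually does anything.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Parse the integer after the colon of a "key : value" line (0 if not numeric).
static int field_value(const std::string& line, std::size_t colon) {
    std::istringstream in(line.substr(colon + 1));
    int v = 0;
    in >> v;
    return v;
}

// True if the cpuinfo text reports more logical siblings than physical
// cores per package, which suggests hyper-threading is enabled.
inline bool smt_active(std::istream& cpuinfo) {
    int siblings = 0, cores = 0;
    std::string line;
    while (std::getline(cpuinfo, line)) {
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::string key = line.substr(0, colon);
        key.erase(key.find_last_not_of(" \t") + 1);  // trim trailing tabs/spaces
        if (key == "siblings")       siblings = field_value(line, colon);
        else if (key == "cpu cores") cores = field_value(line, colon);
    }
    return cores > 0 && siblings > cores;
}

// Convenience wrapper for the running machine (assumes an x86 Linux host).
inline bool host_uses_ht() {
    std::ifstream f("/proc/cpuinfo");
    return f && smt_active(f);
}
```

If the file is missing or the fields are absent (other architectures, other operating systems), the sketch simply reports false, so a worker would still need a fallback such as a manual setting.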
There is a fix for this. Add a "pause" asm instruction in the middle of the spin lock loop. "pause" tells the CPU that this thread is only spin-waiting, so the other hyper-thread on the same physical core gets the execution resources. If hyper-threading is not being used, pause is just a very short delay. You can see how to do this in the Crafty spin lock code...
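To make the suggestion concrete, here is a minimal sketch of a test-and-test-and-set spin lock with a pause in the wait loop. It is not the Crafty code and not what Stockfish uses; the SpinLock class and the CPU_RELAX macro are names invented for this example. It assumes an x86 compiler that provides the _mm_pause intrinsic, and falls back to a no-op elsewhere.

```cpp
#include <atomic>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define CPU_RELAX() _mm_pause()   // emits the x86 "pause" instruction
#else
#define CPU_RELAX() ((void)0)     // assumption: no spin-wait hint on this target
#endif

class SpinLock {
    std::atomic<bool> locked{false};

public:
    // Single attempt to take the lock; returns true on success.
    bool try_lock() {
        return !locked.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        for (;;) {
            if (try_lock())
                return;
            // Spin on a plain load so waiting threads do not keep writing
            // to the cache line, and pause on every iteration.
            while (locked.load(std::memory_order_relaxed))
                CPU_RELAX();
        }
    }

    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```

The pause sits in the inner read-only loop, which is where a waiting thread spends its time while another thread holds the lock. Whether this is enough to make spin locks competitive on HT machines would still have to be measured.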