Better NPS scaling for Stockfish

zullil · Post by **zullil** » Fri Feb 27, 2015 2:52 pm

I've repeated the benchmarking I did a week ago (copied below). It seems the changes made to thread-locking have resulted in significant improvement, at least in NPS scaling from 8 to 16 cores.

New data:

Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

C11/Stockfish/src$ ./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 258495860664
Nodes/second    : 23287758

C11/Stockfish/src$ ./stockfish bench 16384  8 300000 default time
===========================
Total time (ms) : 11100081
Nodes searched  : 159580768428
Nodes/second    : 14376540

23287758/14376540 = 1.62

Old data:

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45

Sharaf_DG · Post by **Sharaf_DG** » Fri Feb 27, 2015 3:37 pm

I agree. In fact, I get a TB hit on move 1 on my 36-core E5-2699 v3 processor; see below:

info depth 43 currmove d2d4 currmovenumber 1
info depth 43 currmove d2d3 currmovenumber 3
info depth 43 currmove c2c3 currmovenumber 9
info depth 43 currmove a2a3 currmovenumber 11
info depth 43 currmove b2b4 currmovenumber 12
info depth 43 currmove g2g3 currmovenumber 13
info depth 43 currmove h2h3 currmovenumber 14
info depth 43 currmove g1h3 currmovenumber 15
info depth 43 currmove f2f3 currmovenumber 16
info depth 43 currmove a2a4 currmovenumber 17
info depth 43 currmove h2h4 currmovenumber 18
info depth 43 currmove g2g4 currmovenumber 19
info depth 43 currmove b1a3 currmovenumber 20
info depth 43 seldepth 55 multipv 1 score cp 13 nodes 105000059144 nps 29672215 hashfull 999 tbhits 1 time 3538666 pv d2d4 g8f6 c2c4 e7e6 g1f3 b7b6 a2a3 c8b7 b1c3 d7d5 c1g5 f8e7 c4d5 e6d5 d1a4 b8d7 e2e3 a7a6 f1d3 e8g8 e1g1 h7h6 g5f4 c7c5 d3f5 b6b5 a4c2 c5c4 f1c1 f6h5 f4e5 h5f6 e5g3 d7b6 h2h3 b7c8 f3e5 d8e8 f5c8 e8c8 g3f4 e7d6 f4g3 f8e8

zullil · Post by **zullil** » Fri Feb 27, 2015 3:51 pm

Sharaf_DG wrote:I agree. In fact, I get a TB hit on move 1 on my 36-core E5-2699 v3 processor; see below:

info depth 43 currmove d2d4 currmovenumber 1
info depth 43 currmove d2d3 currmovenumber 3
info depth 43 currmove c2c3 currmovenumber 9
info depth 43 currmove a2a3 currmovenumber 11
info depth 43 currmove b2b4 currmovenumber 12
info depth 43 currmove g2g3 currmovenumber 13
info depth 43 currmove h2h3 currmovenumber 14
info depth 43 currmove g1h3 currmovenumber 15
info depth 43 currmove f2f3 currmovenumber 16
info depth 43 currmove a2a4 currmovenumber 17
info depth 43 currmove h2h4 currmovenumber 18
info depth 43 currmove g2g4 currmovenumber 19
info depth 43 currmove b1a3 currmovenumber 20
info depth 43 seldepth 55 multipv 1 score cp 13 nodes 105000059144 nps 29672215 hashfull 999 tbhits 1 time 3538666 pv d2d4 g8f6 c2c4 e7e6 g1f3 b7b6 a2a3 c8b7 b1c3 d7d5 c1g5 f8e7 c4d5 e6d5 d1a4 b8d7 e2e3 a7a6 f1d3 e8g8 e1g1 h7h6 g5f4 c7c5 d3f5 b6b5 a4c2 c5c4 f1c1 f6h5 f4e5 h5f6 e5g3 d7b6 h2h3 b7c8 f3e5 d8e8 f5c8 e8c8 g3f4 e7d6 f4g3 f8e8

Thanks. Is this with hyperthreading, or do you have two of these 18-core processors?

If you were able to run a test analogous to mine and post the results, that would be very helpful. Otherwise all we see is that you have access to very expensive hardware.

Jouni · Post by **Jouni** » Fri Feb 27, 2015 4:09 pm

But at http://www.fastgm.de/threads3.html we see that Stockfish 140513 has much better scaling:

16 cpu 8335 kn/s
8 cpu 4502 kn/s

factor 1,85

zullil · Post by **zullil** » Fri Feb 27, 2015 5:50 pm

Jouni wrote:But at http://www.fastgm.de/threads3.html we see that Stockfish 140513 has much better scaling:

16 cpu 8335 kn/s
8 cpu 4502 kn/s

factor 1,85

Very different test conditions: Fastgm used only the starting position and ran for only 20 seconds.

If I can locate the 140513 source code, I might test that binary as I did for the current one. In the meantime, I'll run his test on the latest SF and report the results later.

zullil · Post by **zullil** » Fri Feb 27, 2015 6:46 pm

Jouni wrote:But at http://www.fastgm.de/threads3.html we see that Stockfish 140513 has much better scaling:

16 cpu 8335 kn/s
8 cpu 4502 kn/s

factor 1,85

1.85 ± ?

Data generated in a comparable manner:

Stockfish 270215 64 POPCNT by Tord Romstad, Marco Costalba and Joona Kiiski
Five 20-second searches from startpos, first with 8 threads and then with 16.

setoption name Threads value 8
go movetime 20000
info nodes 171806220 time 20000
info nodes 168273476 time 20002
info nodes 170041349 time 20005
info nodes 173130777 time 20000
info nodes 173077843 time 20001

setoption name Threads value 16
go movetime 20000
info nodes 291157212 time 20000
info nodes 284994112 time 20003
info nodes 297324033 time 20004
info nodes 288602981 time 20002
info nodes 285779255 time 20003

1435527672/856329665 = 1.68

bob · Post by **bob** » Fri Feb 27, 2015 7:57 pm

Jouni wrote:But at http://www.fastgm.de/threads3.html we see that Stockfish 140513 has much better scaling:

16 cpu 8335 kn/s
8 cpu 4502 kn/s

factor 1,85

Each hardware platform behaves differently; the PC platform is ANYTHING but uniform in anything other than instruction set architecture.

APassionForCriminalJustic · Sat Feb 28, 2015 7:36 pm

Sharaf_DG wrote:I agree. In fact, I get a TB hit on move 1 on my 36-core E5-2699 v3 processor; see below:

info depth 43 currmove d2d4 currmovenumber 1
info depth 43 currmove d2d3 currmovenumber 3
info depth 43 currmove c2c3 currmovenumber 9
info depth 43 currmove a2a3 currmovenumber 11
info depth 43 currmove b2b4 currmovenumber 12
info depth 43 currmove g2g3 currmovenumber 13
info depth 43 currmove h2h3 currmovenumber 14
info depth 43 currmove g1h3 currmovenumber 15
info depth 43 currmove f2f3 currmovenumber 16
info depth 43 currmove a2a4 currmovenumber 17
info depth 43 currmove h2h4 currmovenumber 18
info depth 43 currmove g2g4 currmovenumber 19
info depth 43 currmove b1a3 currmovenumber 20
info depth 43 seldepth 55 multipv 1 score cp 13 nodes 105000059144 nps 29672215 hashfull 999 tbhits 1 time 3538666 pv d2d4 g8f6 c2c4 e7e6 g1f3 b7b6 a2a3 c8b7 b1c3 d7d5 c1g5 f8e7 c4d5 e6d5 d1a4 b8d7 e2e3 a7a6 f1d3 e8g8 e1g1 h7h6 g5f4 c7c5 d3f5 b6b5 a4c2 c5c4 f1c1 f6h5 f4e5 h5f6 e5g3 d7b6 h2h3 b7c8 f3e5 d8e8 f5c8 e8c8 g3f4 e7d6 f4g3 f8e8

Don't those two CPUs set your house on fire?

Hahahahahahah. Those are monsters from hell man; that is totally beastly those 2699v3s. I am not even surprised that you have a tablebase hit on your first move. Hahahahahahaha.

lucasart · Post by **lucasart** » Sun Mar 01, 2015 12:50 am

zullil wrote:I've repeated the benchmarking I did a week ago (copied below). It seems the changes made to thread-locking have resulted in significant improvement, at least in NPS scaling from 8 to 16 cores.

New data:

Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

C11/Stockfish/src$ ./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 258495860664
Nodes/second    : 23287758

C11/Stockfish/src$ ./stockfish bench 16384  8 300000 default time
===========================
Total time (ms) : 11100081
Nodes searched  : 159580768428
Nodes/second    : 14376540

23287758/14376540 = 1.62

Old data:

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45

Yes, spin-locks are faster... on real cores.

But what you forget to mention is that they are counter-productive in the case of HT. So we need fishtest to detect whether a given machine uses HT cores (plus a UCI option for spinlocks in SF).

For Linux workers it should be easy (procinfo). Don't know about Windows...

bob · Post by **bob** » Sun Mar 01, 2015 1:54 am

lucasart wrote:
zullil wrote:I've repeated the benchmarking I did a week ago (copied below). It seems the changes made to thread-locking have resulted in significant improvement, at least in NPS scaling from 8 to 16 cores.

New data:
Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

C11/Stockfish/src$ ./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 258495860664
Nodes/second    : 23287758

C11/Stockfish/src$ ./stockfish bench 16384  8 300000 default time
===========================
Total time (ms) : 11100081
Nodes searched  : 159580768428
Nodes/second    : 14376540

23287758/14376540 = 1.62

Old data:
./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45

Yes, spin-locks are faster... on real cores.

But what you forget to mention is that they are counter-productive in the case of HT. So we need fishtest to detect whether a given machine uses HT cores (plus a UCI option for spinlocks in SF).

For Linux workers it should be easy (procinfo). Don't know about Windows...

There is a fix for this. Add a "pause" asm instruction in the middle of the spin lock loop. "pause" says "stop this process and switch to the other" (on the other hyper-threaded core). If hyper-threading is not being used, pause does nothing. You can see how to do this in the Crafty spin lock code...
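
A minimal sketch of the idea in C++11, not Crafty's or Stockfish's actual code: a test-and-test-and-set spin lock that issues the x86 `pause` instruction (through the `_mm_pause()` intrinsic) while it waits. The names `SpinLock` and `cpu_relax` are made up for illustration, and the fallback to `std::this_thread::yield()` on non-x86 targets is an assumption of this example.

```cpp
#include <atomic>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPINLOCK_HAS_PAUSE 1
#endif

// Hint to the CPU that we are in a spin-wait loop.
inline void cpu_relax() {
#ifdef SPINLOCK_HAS_PAUSE
    _mm_pause();                // emits the x86 "pause" instruction
#else
    std::this_thread::yield();  // assumed fallback for other architectures
#endif
}

class SpinLock {
    std::atomic<bool> locked{false};

public:
    void lock() {
        for (;;) {
            // Try to grab the lock; exchange returns the previous value.
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            // Spin on a plain load so waiting threads do not keep
            // writing to the cache line, pausing on each iteration.
            while (locked.load(std::memory_order_relaxed))
                cpu_relax();
        }
    }

    bool try_lock() {
        return !locked.load(std::memory_order_relaxed) &&
               !locked.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```

Because the class provides `lock()` and `unlock()`, it can be used with `std::lock_guard<SpinLock>` in place of a mutex.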
