New SMP stuff (particularly Kai)


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

One note, this seems to be the "best behaved" Crafty ever. There is always variability, but nothing like I have seen in the past.

Some current data:

20 cpus:

13.6, 12.8, 13.2, 12.5, avg=13.0
overhead avg=15%

16 cpus:

11.6, 9.8, 10.0, 10.2, avg=10.4
overhead avg =28%

8 cpus:

5.7, 6.6, 6.4, 7.2, avg=6.5
overhead avg = 15%

only one 4 cpu run so far, 3.5 speedup
overhead avg = 14%

One thing I should do is run the new "autotune" between each batch where I change the number of threads. I would expect some gain in performance there, since I have tuned it for 20. The only problem is that a decent run takes 8-10 hours.
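
For concreteness, here is a minimal sketch of how figures like these are derived: the speedup is T(1)/T(N) for each run, and the overhead is the extra nodes searched relative to the 1-cpu baseline, each averaged over the four runs. This is not Crafty code, and the run data below are placeholders chosen only to roughly reproduce the 20-cpu numbers above.

Code: Select all

#include <cstdio>
#include <vector>

// One benchmark run: total wall time and total nodes over all 24 positions.
struct Run { double seconds, nodes; };

int main() {
    Run one = {30000.0, 1.8e9};                // placeholder 1-cpu totals
    std::vector<Run> smp = {                   // placeholder 20-cpu runs
        {2206.0, 1.980e9}, {2344.0, 2.124e9},
        {2273.0, 2.052e9}, {2400.0, 2.124e9}};

    double spd = 0.0, ovh = 0.0;
    for (const Run& r : smp) {
        spd += one.seconds / r.seconds;        // speedup T(1)/T(N)
        ovh += r.nodes / one.nodes - 1.0;      // search overhead (extra nodes)
    }
    std::printf("avg speedup  = %.1f\n", spd / smp.size());
    std::printf("avg overhead = %.0f%%\n", 100.0 * ovh / smp.size());
}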
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

Joerg Oster wrote:
bob wrote:I have decided to do the SMP speedup calculations a bit differently. The old way really was showing way too much information, going position by position, so I am trying a less favorable (more conservative) way of showing SMP speedup. I have adjusted the search depths a bit so that the 1-thread test takes around 30 minutes or so on average, as close as I can get with a fixed depth. The total time taken for all the positions is summed, and that gives a total time for the test. I am now through 2 of the 4 20-thread runs, doing the same thing. These now take under an hour each.

I am computing the speedup as simply T(1) / T(20). This seems like a reasonable way to compute this if you think of the set of positions as a single game, which is what they actually are. The early numbers for 20 cores look like this:
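
(As a small worked example of the difference: if position A takes 100 seconds at 1 thread and 5 seconds at 20, while position B takes 10 seconds and 2, the aggregate speedup is (100+10)/(5+2) ≈ 15.7, the per-position mean is (20+5)/2 = 12.5, and the geometric mean is sqrt(20*5) = 10. The aggregate number weights long positions more heavily, just as a real game would.)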

speedup: 13.6 and 12.8
nps speedup: 14.9 and 15.0
tree growth: 10% and 18%

These numbers look better if I take each position, one by one, and compute the speedup and then either take the mean or geometric mean. But this greatly reduces the amount of data.

What do you think of this? I still have the old spreadsheets, so cutting and pasting the new results is easy enough. Another thing that changed here: in equalizing the times at 1 thread, I reduced the depths of the two super-linear positions, which are now normal once again. 4 of the 24 still show a modest super-linear speedup, but those two positions with 50x and above are now behaving normally. The last iteration was a killer in those two, where the branching factor simply went insane, and the parallel search apparently helped with the problematic move ordering and produced good results.

In any case, these numbers are actually more in line with what I expected. The NPS scaling is down, partly due to a funny compiler issue I don't understand as of yet, and partly because I made a few changes to better measure the internal wait time, which was previously being under-reported. The strange compiler problem is one of those "random events": a couple of weeks back I reported a huge difference between the Intel and GCC executables. They always searched exactly the same number of nodes, but the Intel one was 10%+ faster. That has gone away; just moving things around or adding data occasionally causes this. I'm going to run VTune to see what is up, but all of these were run with the gcc compiler, which is currently producing the fastest executable...

More later. Will have all four 20-core runs in another hour or so, and have it queued up to run four 16-thread, four 8-thread, four 4-thread and four 2-thread runs back to back... The four 2-thread runs will take about 24 hours total, the four 4-thread runs about 12 hours, and the four 8-thread runs about 6. The four 16-thread runs will finish fairly close to the 20-core times...

More tomorrow. Comments about speedup computed in this way?
I agree it's a reasonable way of doing it.
Tomorrow I will give some numbers with 4 and 8 threads for Stockfish for comparison.
I posted my four 8-thread runs. I only have one 4-thread run, which I also posted. Should have all the 4's tomorrow, plus 1 or 2 two-thread runs as well. Gets slower as the thread count goes down...
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: New SMP stuff (particularly Kai)

Post by zullil »

Joerg Oster wrote:
zullil wrote:
Joerg Oster wrote:Hi Louis,

if you are interested and have some time, you might want to repeat the test with my latest patch, which you can find here: https://github.com/joergoster/Stockfish ... join_tweak.
I'd really be interested to see if this patch improves things. :D
Just one run, but less total time and a smaller tree, so maybe a change for the better:

Code: Select all

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
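# bench arguments (Stockfish bench syntax): hash size (MB), threads, search depth, position file, limit type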
===========================
Total time (ms) : 479159
Nodes searched  : 11386079093
Nodes/second    : 23762632

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 443388
Nodes searched  : 10465813469
Nodes/second    : 23604187
Obviously, the second benchmark is for your tweaked Stockfish. Will repeat this, but with greater depth.
Looks promising, many thanks.
Looking forward to your further results.
Now it's looking bad. Clearly, one run of 24 positions is not enough to say anything.

Code: Select all

./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 1068964
Nodes searched  : 29587093943
Nodes/second    : 27678288

latejoin_tweak
./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 1214284
Nodes searched  : 34913456185
Nodes/second    : 28752298
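
(For scale: that is about 14% more time and 18% more nodes for latejoin_tweak on this deeper run.)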

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

zullil wrote:
Joerg Oster wrote:
zullil wrote:
Joerg Oster wrote:Hi Louis,

if you are interested and have some time, you might want to repeat the test with my latest patch, which you can find here: https://github.com/joergoster/Stockfish ... join_tweak.
I'd really be interested to see if this patch improves things. :D
Just one run, but less total time and a smaller tree, so maybe a change for the better:

Code: Select all

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 479159
Nodes searched  : 11386079093
Nodes/second    : 23762632

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 443388
Nodes searched  : 10465813469
Nodes/second    : 23604187
Obviously, the second benchmark is for your tweaked Stockfish. Will repeat this, but with greater depth.
Looks promising, many thanks.
Looking forward to your further results.
Now it's looking bad. Clearly, one run of 24 positions is not enough to say anything.

Code: Select all

./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 1068964
Nodes searched  : 29587093943
Nodes/second    : 27678288

latejoin_tweak
./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 1214284
Nodes searched  : 34913456185
Nodes/second    : 28752298

:)

My goal, in the past, was to run N tests, where N was the number of cpus used: for 4 cpus, 4 runs, averaged. But that gets a bit out of control when you hit 16 and up. I have been running 4 times and averaging, which gives pretty reasonable data. SMP is highly variable, as we all know.
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: New SMP stuff (particularly Kai)

Post by MikeB »

bob wrote:
In any case, these numbers are actually more in line with what I expected. The NPS scaling is down, partly due to a funny compiler issue I don't understand as of yet, and partly because I made a few changes to better measure the internal wait time, which was previously being under-reported. The strange compiler problem is one of those "random events": a couple of weeks back I reported a huge difference between the Intel and GCC executables. They always searched exactly the same number of nodes, but the Intel one was 10%+ faster. That has gone away; just moving things around or adding data occasionally causes this. I'm going to run VTune to see what is up, but all of these were run with the gcc compiler, which is currently producing the fastest executable...
I currently have gcc and icc producing almost the same speed - I think gcc is a hair faster - but gcc used to be way faster. Now trying to figure out which is better at SMP.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: New SMP stuff (particularly Kai)

Post by Laskos »

zullil wrote:
Joerg Oster wrote:
zullil wrote:
Joerg Oster wrote:Hi Louis,

if you are interested and have some time, you might want to repeat the test with my latest patch, which you can find here: https://github.com/joergoster/Stockfish ... join_tweak.
I'd really be interested to see if this patch improves things. :D
Just one run, but less total time and a smaller tree, so maybe a change for the better:

Code: Select all

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 479159
Nodes searched  : 11386079093
Nodes/second    : 23762632

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 443388
Nodes searched  : 10465813469
Nodes/second    : 23604187
Obviously, the second benchmark is for your tweaked Stockfish. Will repeat this, but with greater depth.
Looks promising, many thanks.
Looking forward to your further results.
Now it's looking bad. Clearly, one run of 24 positions is not enough to say anything.

Code: Select all

./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 1068964
Nodes searched  : 29587093943
Nodes/second    : 27678288

latejoin_tweak
./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 1214284
Nodes searched  : 34913456185
Nodes/second    : 28752298

Yes, this way of using only total times needs much more than 1 run with 24 positions. I usually do it with 150 positions, but that becomes very time-consuming. This single run can be used more effectively if you keep the individual times for each position; then the easiest way is to take the geometric mean of the per-position speedups:

speedup = [ (T1(1) × T1(2) × ... × T1(24)) / (T20(1) × T20(2) × ... × T20(24)) ]^(1/24)

where T1(i) and T20(i) are the times for position i on 1 core and on 20 cores.

The average error should be no more than 5-10% even using only those 24 positions in 1 run.
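
As a concrete sketch of that computation (placeholder timings, not real data), note that accumulating the products in log space keeps the 24-term products from overflowing:

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    const int N = 24;
    double t1[N], t20[N];    // per-position times (s) on 1 core and 20 cores
    for (int i = 0; i < N; i++) {
        t1[i]  = 1800.0;     // placeholder values
        t20[i] = 140.0;
    }

    // Geometric mean of per-position speedups, accumulated as logs.
    double logsum = 0.0;
    for (int i = 0; i < N; i++)
        logsum += std::log(t1[i] / t20[i]);
    std::printf("geometric mean speedup = %.2f\n", std::exp(logsum / N));
}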
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: New SMP stuff (particularly Kai)

Post by Dann Corbit »

bob wrote:One note, this seems to be the "best behaved" Crafty ever. There is always variability, but nothing like I have seen in the past.

Some current data:

20 cpus:

13.6, 12.8, 13.2, 12.5, avg=13.0
overhead avg=15%

16 cpus:

11.6, 9.8, 10.0, 10.2, avg=10.4
overhead avg =28%

8 cpus:

5.7, 6.6, 6.4, 7.2, avg=6.5
overhead avg = 15%

only one 4 cpu run so far, 3.5 speedup
overhead avg = 14%

One thing I should do is run the new "autotune" between each batch where I change the number of threads. I would expect some gain in performance there, since I have tuned it for 20. The only problem is that a decent run takes 8-10 hours.
Strange that the overhead average should be so much higher for 16 CPUs. Is that a typo?
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: New SMP stuff (particularly Kai)

Post by Joerg Oster »

bob wrote:One note, this seems to be the "best behaved" Crafty ever. There is always variability, but nothing like I have seen in the past.

Some current data:

20 cpus:

13.6, 12.8, 13.2, 12.5, avg=13.0
overhead avg=15%

16 cpus:

11.6, 9.8, 10.0, 10.2, avg=10.4
overhead avg =28%

8 cpus:

5.7, 6.6, 6.4, 7.2, avg=6.5
overhead avg = 15%

only one 4 cpu run so far, 3.5 speedup
overhead avg = 14%

One thing I should do is run the new "autotune" between each batch where I change the number of threads. I would expect some gain in performance there, since I have tuned it for 20. The only problem is that a decent run takes 8-10 hours.
OK, here are the numbers for the current development version of Stockfish.
All runs with 512 MB hash, up to depth 23 only, but that should still give a good impression.
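
(In the tables below, the first number on each line is the raw value; the second is the ratio to the 1-thread run, or, for nodes searched, the percent change; Ø marks the average over the four runs.)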

Code: Select all

512 1 23        1 Thread
===========================
Total time (ms) : 960083
Nodes searched  : 1840499062
Nodes/second    : 1917020


512 4 23        4 Threads
===========================
Total time (ms) : 254206       3.8
Nodes searched  : 1536539248  -16.5 %
Nodes/second    : 6044464      3.15

===========================
Total time (ms) : 229652       4.18
Nodes searched  : 1472804698  -20 %
Nodes/second    : 6413202      3.35

===========================
Total time (ms) : 243197       3.95
Nodes searched  : 1534883380  -16.6 %
Nodes/second    : 6311275      3.29

===========================
Total time (ms) : 300799       3.19             Ø 3.78
Nodes searched  : 1839032053  +0.08 %           Ø -13.2 %
Nodes/second    : 6113823      3.19             Ø 3.25



512 8 23        8 Threads
===========================
Total time (ms) : 178049       5.4
Nodes searched  : 1804005395  -1.98 %
Nodes/second    : 10132072     5.28

===========================
Total time (ms) : 188949       5.08
Nodes searched  : 1914328553  +4.01 %
Nodes/second    : 10131456     5.29

===========================
Total time (ms) : 178624       5.4
Nodes searched  : 1831661239  -0.5 %
Nodes/second    : 10254284     5.35

===========================
Total time (ms) : 196369       4.89             Ø 5.2
Nodes searched  : 2008504419  +9.13 %           Ø +2.7 %
Nodes/second    : 10228215     5.34             Ø 5.32
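
(As a check on reading the columns: the second 4-thread run gives 960083 / 229652 ≈ 4.18 for the time speedup, and 1472804698 / 1840499062 ≈ 0.80, i.e. 20% fewer nodes searched.)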
Speedup with 4 threads looks good. Scaling could be better.
Most remarkably, SF searches far fewer nodes with 4 threads than with 1!

Speedup with 8 threads leaves room for improvement.
Search overhead is small.
Jörg Oster
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

Dann Corbit wrote:
bob wrote:One note, this seems to be the "best behaved" Crafty ever. There is always variability, but nothing like I have seen in the past.

Some current data:

20 cpus:

13.6, 12.8, 13.2, 12.5, avg=13.0
overhead avg=15%

16 cpus:

11.6, 9.8, 10.0, 10.2, avg=10.4
overhead avg =28%

8 cpus:

5.7, 6.6, 6.4, 7.2, avg=6.5
overhead avg = 15%

only one 4 cpu run so far, 3.5 speedup
overhead avg = 14%

One thing I should do is run the new "autotune" between each batch where I change the number of threads. I would expect some gain in performance there, since I have tuned it for 20. The only problem is that a decent run takes 8-10 hours.
Strange that the overhead average should be so much higher for 16 CPUs. Is that a typo?
No. It is a combination of non-determinism and not-quite-perfect tuning. I ran the autotune for 20 cpus only; I should run it between each different test (different numbers of threads).

For reference, the 4 overhead samples for 20 cpus were 10%, 18%, 14% and 18%, so quite a bit of variability, and these searches are way up in the billions of nodes.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

Joerg Oster wrote:
bob wrote:One note, this seems to be the "best behaved" Crafty ever. There is always variability, but nothing like I have seen in the past.

Some current data:

20 cpus:

13.6, 12.8, 13.2, 12.5, avg=13.0
overhead avg=15%

16 cpus:

11.6, 9.8, 10.0, 10.2, avg=10.4
overhead avg =28%

8 cpus:

5.7, 6.6, 6.4, 7.2, avg=6.5
overhead avg = 15%

only one 4 cpu run so far, 3.5 speedup
overhead avg = 14%

One thing I should do is run the new "autotune" between each batch where I change the number of threads. I would expect some gain in performance there, since I have tuned it for 20. The only problem is that a decent run takes 8-10 hours.
OK, here are the numbers for the current development version of Stockfish.
All runs with 512 MB hash, up to depth 23 only, but that should still give a good impression.

Code: Select all

512 1 23        1 Thread
===========================
Total time (ms) : 960083
Nodes searched  : 1840499062
Nodes/second    : 1917020


512 4 23        4 Threads
===========================
Total time (ms) : 254206       3.8
Nodes searched  : 1536539248  -16.5 %
Nodes/second    : 6044464      3.15

===========================
Total time (ms) : 229652       4.18
Nodes searched  : 1472804698  -20 %
Nodes/second    : 6413202      3.35

===========================
Total time (ms) : 243197       3.95
Nodes searched  : 1534883380  -16.6 %
Nodes/second    : 6311275      3.29

===========================
Total time (ms) : 300799       3.19             Ø 3.78
Nodes searched  : 1839032053  +0.08 %           Ø -13.2 %
Nodes/second    : 6113823      3.19             Ø 3.25



512 8 23        8 Threads
===========================
Total time (ms) : 178049       5.4
Nodes searched  : 1804005395  -1.98 %
Nodes/second    : 10132072     5.28

===========================
Total time (ms) : 188949       5.08
Nodes searched  : 1914328553  +4.01 %
Nodes/second    : 10131456     5.29

===========================
Total time (ms) : 178624       5.4
Nodes searched  : 1831661239  -0.5 %
Nodes/second    : 10254284     5.35

===========================
Total time (ms) : 196369       4.89             Ø 5.2
Nodes searched  : 2008504419  +9.13 %           Ø +2.7 %
Nodes/second    : 10228215     5.34             Ø 5.32
Speedup with 4 threads looks good. Scaling could be better.
Most remarkably, SF searches far fewer nodes with 4 threads than with 1!

Speedup with 8 threads leaves room for improvement.
Search overhead is small.
The thing that stands out is the super-linear speedup over the complete set of 24 positions (the 2nd run, I think). That is something that should be investigated; super-linear speedup on a few positions is normal. In my case, the super-linear result on one of the previous tests (not the most recent) was WAY super-linear, and that particular search was also the biggest one for 1 cpu, so a super-linear result there had a significant influence on the entire problem set, since that one position counted so much overall. I carefully equalized the search times as much as possible this time around to eliminate that bias.