One note, this seems to be the "best behaved" Crafty ever. There is always variability, but nothing like I have seen in the past.
Some current data:
20 cpus:
13.6, 12.8, 13.2, 12.5, avg=13.0
overhead avg = 15%
16 cpus:
11.6, 9.8, 10.0, 10.2, avg=10.4
overhead avg = 28%
8 cpus:
5.7, 6.6, 6.4, 7.2, avg=6.5
overhead avg = 15%
only one 4 cpu run so far, 3.5 speedup
overhead avg = 14%
One thing I should do is run the new "autotune" between each batch where I change the number of threads. I would expect some gain in performance there, since I have tuned it for 20. The only problem is that a decent run takes 8-10 hours.
New SMP stuff (particularly Kai)
Moderators: hgm, Rebel, chrisw
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New SMP stuff (particularly Kai)
bob wrote:
I have decided to do the SMP speedup calculations a bit differently. The old way really was showing way too much information, going position by position. I am trying a less favorable way of showing SMP speedup. I have adjusted the search depths a bit so that the 1 thread test takes around 30 minutes or so on average, as close as I can get with a fixed depth. The total time taken for all the positions is summed, and that gives a total time for the test. I am now through 2 out of 4 20 thread runs, doing the same thing. These now take under an hour each.

Joerg Oster wrote:
I agree it's a reasonable way of doing it.

I posted my four 8-thread runs. I only have one 4-thread run, which I also posted. Should have all the 4s tomorrow, plus 1 or 2 two-thread runs as well. Gets slower as threads go down...
I am computing the speedup as simply T(1) / T(20). This seems like a reasonable way to compute this if you think of the set of positions as a single game, which is what they actually are. The early numbers for 20 cores look like this:
speedup: 13.6 and 12.8
nps speedup: 14.9 and 15.0
tree growth: 10% and 18%
These numbers look better if I take each position, one by one, and compute the speedup and then either take the mean or geometric mean. But this greatly reduces the amount of data.
What do you think of this? I still have the old spreadsheets, so cutting and pasting the new results is easy enough. Another thing that changed here: in equalizing the times at 1 thread, I scaled back two of the super-linear positions, which now behave normally once again. 4 of the 24 show a modest super-linear speedup, but those two positions with 50x and above are now behaving normally. The last iteration was a killer in those two, where the branching factor simply went insane, and the parallel search apparently helped with the problematic move ordering and produced good results.
In any case, these numbers are actually more in line with what I expected. The NPS scaling is down, but partly due to a funny compiler issue I don't understand as of yet, and partly because I made a few changes to better measure the internal wait time, which was previously being under-reported. The strange compiler problem is one of those "random events". A couple of weeks back I reported a huge difference between the Intel and GCC executables. They always searched exactly the same number of nodes, but the Intel executable was 10%+ faster. That has gone away. Just moving things around or adding data occasionally causes this. I'm going to work on running VTune to see what is up, but all of these were run with the gcc compiler, which is currently producing the fastest executable...
More later. Will have all four 20-core runs in another hour or so, and have it queued up to run four 16-thread, four 8-thread, four 4-thread and four 2-thread runs back to back... The four 2-thread runs will take about 24 hours total, the four 4-thread runs about 12 hours, and the four 8-thread runs about 6. The four 16-thread runs will finish fairly close to the 20-core times...
More tomorrow. Comments about speedup computed in this way?
Tomorrow I will give some numbers with 4 and 8 threads for Stockfish for comparison.
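The two aggregation methods under discussion (summing all times and dividing, versus averaging per-position ratios) can be sketched as follows. The per-position times here are made-up placeholders, not the actual measurements from the thread:

```python
# Sketch: two ways to aggregate SMP speedup over a set of test positions.
# All times below are hypothetical placeholders (seconds per position).

t1  = [120.0, 95.0, 210.0, 60.0]   # 1-thread times
t20 = [ 10.0,  8.0,  14.0,  5.0]   # 20-thread times

# Method 1: treat the whole set as one long "game": T(1) / T(20).
total_speedup = sum(t1) / sum(t20)

# Method 2: compute a speedup per position, then take the mean.
ratios = [a / b for a, b in zip(t1, t20)]
mean_speedup = sum(ratios) / len(ratios)

print(total_speedup, mean_speedup)
```

The summed-time ratio is effectively a time-weighted average, so the longest positions dominate it; that is one reason it can read lower (or differently) than a plain per-position average when the slowest positions scale worst.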
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: New SMP stuff (particularly Kai)
Joerg Oster wrote:
Hi Louis,
if you are interested and have some time, you eventually want to repeat the test with my latest patch, which you can find here: https://github.com/joergoster/Stockfish ... join_tweak.
I'd really be interested to see if this patch improves things.

zullil wrote:
Just one run, but less total time and a smaller tree, so maybe a change for the better. Obviously, the second benchmark is for your tweaked Stockfish. Will repeat this, but with greater depth.
Code:
./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 479159
Nodes searched : 11386079093
Nodes/second : 23762632
./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 443388
Nodes searched : 10465813469
Nodes/second : 23604187

Joerg Oster wrote:
Looks promising, many thanks. Looking forward to your further results.

Now it's looking bad. Clearly, one run of 24 positions is not enough to say anything.
Code:
===========================
Total time (ms) : 1068964
Nodes searched : 29587093943
Nodes/second : 27678288
./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
latejoin_tweak
===========================
Total time (ms) : 1214284
Nodes searched : 34913456185
Nodes/second : 28752298
./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New SMP stuff (particularly Kai)
zullil wrote:
Now it's looking bad. Clearly, one run of 24 positions is not enough to say anything.
My goal, in the past, was to run N tests, where N was the number of CPUs used; for 4 CPUs, four runs, averaged. But this gets a bit out of control when you hit 16 and up. I have been running 4 times and averaging, which gives pretty reasonable data. SMP is highly variable, as we all know.
-
- Posts: 4889
- Joined: Thu Mar 09, 2006 6:34 am
- Location: Pen Argyl, Pennsylvania
Re: New SMP stuff (particularly Kai)
bob wrote:
In any case, these numbers are actually more in line with what I expected. The NPS scaling is down, but partly due to a funny compiler issue I don't understand as of yet, and partly because I made a few changes to better measure the internal wait time, which was previously being under-reported. The strange compiler problem is one of those "random events". A couple of weeks back I reported a huge difference between the Intel and GCC executables. They always searched exactly the same number of nodes, but the Intel executable was 10%+ faster. That has gone away. Just moving things around or adding data occasionally causes this. I'm going to work on running VTune to see what is up, but all of these were run with the gcc compiler which is currently producing the fastest executable...

I currently have gcc and icc producing almost the same speed - I think gcc is a hair faster - but gcc used to be way faster. Now trying to figure out which is better at SMP.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: New SMP stuff (particularly Kai)
zullil wrote:
Now it's looking bad. Clearly, one run of 24 positions is not enough to say anything.

Yes, this way of using only total times needs much more than one run with 24 positions. I usually do it with 150 positions, but that becomes very time-consuming. This single run can be used more effectively if you keep the individual times for each position; the easiest way is then to take the geometric average:
(product of all individual times on one core / product of all individual times on 20 cores) ^ (1/24).
The average error should be no more than 5-10% even using only those 24 positions in 1 run.
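Kai's geometric average can be computed directly from the per-position times; a minimal sketch with placeholder numbers (the real test set has 24 positions, hence the 1/24 exponent there):

```python
import math

# Hypothetical per-position times in seconds, not the actual measurements.
t1  = [120.0, 95.0, 210.0, 60.0]   # 1 core
t20 = [ 10.0,  8.0,  14.0,  5.0]   # 20 cores

n = len(t1)

# (product of 1-core times / product of 20-core times) ** (1/n),
# computed via a sum of logs so a product over many positions cannot overflow.
log_ratio = sum(math.log(a / b) for a, b in zip(t1, t20))
geo_speedup = math.exp(log_ratio / n)
```

The geometric mean damps the influence of a single outlier position, which is why it needs fewer runs than a total-time ratio to get a stable estimate.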
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: New SMP stuff (particularly Kai)
bob wrote:
One note, this seems to be the "best behaved" Crafty ever. There is always variability, but nothing like I have seen in the past.
...
16 cpus:
11.6, 9.8, 10.0, 10.2, avg=10.4
overhead avg = 28%
...

Strange that the overhead average should be so much higher for 16 CPUs. Is that a typo?
-
- Posts: 937
- Joined: Fri Mar 10, 2006 4:29 pm
- Location: Germany
Re: New SMP stuff (particularly Kai)
bob wrote:
One note, this seems to be the "best behaved" Crafty ever. There is always variability, but nothing like I have seen in the past.
...

OK, here are the numbers for the current development version of Stockfish.
All runs with 512 MB Hash, up to depth 23 (only, but it should still give a good impression).
Code:
512 1 23 1 Thread
===========================
Total time (ms) : 960083
Nodes searched : 1840499062
Nodes/second : 1917020
512 4 23 4 Threads
===========================
Total time (ms) : 254206 3.8
Nodes searched : 1536539248 -16.5 %
Nodes/second : 6044464 3.15
===========================
Total time (ms) : 229652 4.18
Nodes searched : 1472804698 -20 %
Nodes/second : 6413202 3.35
===========================
Total time (ms) : 243197 3.95
Nodes searched : 1534883380 -16.6 %
Nodes/second : 6311275 3.29
===========================
Total time (ms) : 300799 3.19 Ø 3.78
Nodes searched : 1839032053 -0.08 % Ø -13.3 %
Nodes/second : 6113823 3.19 Ø 3.25
512 8 23 8 Threads
===========================
Total time (ms) : 178049 5.4
Nodes searched : 1804005395 -1.98 %
Nodes/second : 10132072 5.28
===========================
Total time (ms) : 188949 5.08
Nodes searched : 1914328553 +4.01 %
Nodes/second : 10131456 5.29
===========================
Total time (ms) : 178624 5.4
Nodes searched : 1831661239 -0.5 %
Nodes/second : 10254284 5.35
===========================
Total time (ms) : 196369 4.89 Ø 5.2
Nodes searched : 2008504419 +9.13 % Ø +2.7 %
Nodes/second : 10228215 5.34 Ø 5.32
Most remarkably, SF searches far fewer nodes with 4 threads than with 1!
Speedup with 8 threads leaves room for improvement.
Search overhead is small.
Jörg Oster
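The ratio columns annotated in the bench listing above can be reproduced from the raw totals. A sketch using the first 4-thread run against the 1-thread baseline, with the numbers copied from the output:

```python
# Recompute the annotations from the raw bench totals quoted above:
#   speedup      = T(1) / T(n)
#   node change  = percent change in nodes searched vs 1 thread
#   NPS scaling  = NPS(n) / NPS(1)

t1_ms, t4_ms = 960083, 254206          # Total time (ms), 1 vs 4 threads
n1,    n4    = 1840499062, 1536539248  # Nodes searched
nps1,  nps4  = 1917020, 6044464        # Nodes/second

speedup     = t1_ms / t4_ms            # matches the "3.8" column
node_change = 100.0 * (n4 / n1 - 1.0)  # matches the "-16.5 %" column
nps_scaling = nps4 / nps1              # matches the "3.15" column
```

Note that speedup is roughly nps_scaling adjusted by the node change: searching fewer nodes at a higher rate is what pushes the 4-thread time speedup above its NPS scaling.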
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New SMP stuff (particularly Kai)
Dann Corbit wrote:
Strange that the overhead average should be so much higher for 16 CPUs. Is that a typo?

No. It is a combination of non-determinism and not-quite-perfect tuning. I ran the autotune for 20 cpus only. I should run it between each different test (different numbers of threads).
For reference, the four overhead samples for 20 threads were 10, 18, 14 and 18, so quite a bit of variability, and that is on searches way up in the billions of nodes.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New SMP stuff (particularly Kai)
Joerg Oster wrote:
OK, here are the numbers for the current development version of Stockfish.
...
Speedup with 4 threads looks good. Scaling could be better.

The thing that stands out is the super-linear speedup for a complete set of 24 problems (the 2nd run, I think). That is something that should be investigated. Super-linear speedup on a few positions is normal. And in my case, the super-linear result on one of the previous tests (not the most recent) was WAY super-linear, and that particular search was also the biggest for 1 cpu, so a super-linear result there had a significant influence on the entire problem set, since that one position counted so much overall. I carefully equalized the search times as much as possible this time around to eliminate that bias.