New SMP stuff (particularly Kai)

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

bob
Posts: 20562
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

New SMP stuff (particularly Kai)

Post by bob » Mon Jul 20, 2015 4:34 am

I have decided to do the SMP speedup calculations a bit differently. The old way showed too much information, going position by position, so I am trying a less favorable way of showing SMP speedup. I have adjusted the search depths so that the 1-thread test takes around 30 minutes on average, as close as I can get with a fixed depth. The times for all the positions are summed, and that gives a total time for the test. I am now through 2 out of 4 20-thread runs, doing the same thing. These now take under an hour each.

I am computing the speedup as simply T(1) / T(20). This seems like a reasonable way to compute this if you think of the set of positions as a single game, which is what they actually are. The early numbers for 20 cores look like this:

speedup: 13.6 and 12.8
nps speedup: 14.9 and 15.0
tree growth: 10% and 18%

These numbers look better if I take each position, one by one, and compute the speedup and then either take the mean or geometric mean. But this greatly reduces the amount of data.
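The difference between the two ways of aggregating can be illustrated with a short script. The per-position times below are made up for illustration (including one super-linear outlier), not measured data:

```python
# Compare three ways of summarizing SMP speedup over a set of positions.
# Times below are hypothetical illustrations, not actual test data.
from math import prod

t1  = [100.0, 200.0, 50.0, 400.0]  # single-thread times (s), one per position
t20 = [  8.0,  18.0,  1.0,  30.0]  # 20-thread times (s), same positions

# Aggregate ("whole game") speedup: sum all times first, then divide.
aggregate = sum(t1) / sum(t20)

# Per-position speedups, then arithmetic and geometric means.
ratios = [a / b for a, b in zip(t1, t20)]
arith_mean = sum(ratios) / len(ratios)
geo_mean = prod(ratios) ** (1.0 / len(ratios))

print(f"aggregate: {aggregate:.2f}")   # 13.16
print(f"mean:      {arith_mean:.2f}")  # 21.74
print(f"geomean:   {geo_mean:.2f}")    # 17.44
```

With one super-linear outlier (the 50x position), the per-position means come out noticeably higher than the aggregate ratio, because the aggregate weights each position by its actual time rather than treating all positions equally.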

What do you think of this? I still have the old spreadsheets, so cutting and pasting the new results is easy enough. One other change: in equalizing the 1-thread times, I reduced the depth for two of the super-linear positions, and they are now behaving normally again. 4 of the 24 still show a modest super-linear speedup, but those two positions that were at 50x and above are back to normal. The last iteration was a killer in those two: the branching factor simply went insane, and the parallel search apparently helped with the problematic move ordering and produced good results.

In any case, these numbers are actually more in line with what I expected. The NPS scaling is down, partly due to a funny compiler issue I don't understand yet, and partly because I made a few changes to better measure the internal wait time, which was previously being under-reported. The strange compiler problem is one of those "random events". A couple of weeks back I reported a huge difference between the Intel and GCC executables: they always searched exactly the same number of nodes, but the Intel executable was 10%+ faster. That has gone away. Just moving things around or adding data occasionally causes this. I'm going to run VTune to see what is up, but all of these were run with the GCC compiler, which is currently producing the fastest executable...

More later. I will have all four 20-core runs in another hour or so, and have the rest queued to run back to back: four 16-thread, four 8-thread, four 4-thread, and four 2-thread runs. The four 2-thread runs will take about 24 hours total, the four 4-thread runs about 12 hours, and the four 8-thread runs about 6; the four 16-thread runs will finish fairly close to the 20-core times...

More tomorrow. Comments about speedup computed in this way?

zullil
Posts: 5668
Joined: Mon Jan 08, 2007 11:31 pm
Location: PA USA
Full name: Louis Zulli

Re: New SMP stuff (particularly Kai)

Post by zullil » Mon Jul 20, 2015 3:39 pm

bob wrote:I have decided to do the SMP speedup calculations a bit differently. [...] Comments about speedup computed in this way?
I attempted an analogous test with the latest developmental version of Stockfish, using the 24 positions you posted previously (see below). Each position was searched to depth 26, with an 8 GB hash table that was cleared between positions (assuming I modified Stockfish's code correctly). The first run was with 1 thread, and the second with 20 threads. Turbo Boost and hyper-threading were disabled in BIOS. Note that the 20 thread run was done just once; it should be run several times due to the indeterminacy of the SMP search. In any case, here's the data:

Code:

./stockfish bench 8192 1 26 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 2461286
Nodes searched  : 5468499984
Nodes/second    : 2221805

./stockfish bench 8192 20 26 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 321851
Nodes searched  : 7094586497
Nodes/second    : 22043077
So these seem to yield some rather disappointing ratios:

speedup: 2461286 / 321851 = 7.6
nps speedup: 22043077 / 2221805 = 9.9
tree growth: 7094586497 / 5468499984 = 30%
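The ratios above are just the raw quotients of the bench totals; a few lines of Python reproduce them (numbers copied from the two bench outputs above):

```python
# Recompute the ratios from the two Stockfish bench runs above.
time_1t,  nodes_1t,  nps_1t  = 2_461_286, 5_468_499_984, 2_221_805
time_20t, nodes_20t, nps_20t = 321_851, 7_094_586_497, 22_043_077

speedup = time_1t / time_20t            # time-to-depth speedup
nps_speedup = nps_20t / nps_1t          # raw search-speed ratio
tree_growth = nodes_20t / nodes_1t - 1  # extra nodes searched by SMP

print(f"speedup:     {speedup:.1f}")      # 7.6
print(f"nps speedup: {nps_speedup:.1f}")  # 9.9
print(f"tree growth: {tree_growth:.0%}")  # 30%
```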

Dual Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz

Code:

r2qkbnr/ppp2p1p/2n5/3P4/2BP1pb1/2N2p2/PPPQ2PP/R1B2RK1 b kq -
r2qkbnr/ppp2p1p/8/nB1P4/3P1pb1/2N2p2/PPPQ2PP/R1B2RK1 b kq -
r2qkbnr/pp3p1p/2p5/nB1P4/3P1Qb1/2N2p2/PPP3PP/R1B2RK1 b kq -
r2qkb1r/pp3p1p/2p2n2/nB1P4/3P1Qb1/2N2p2/PPP3PP/R1B1R1K1 b kq -
r2q1b1r/pp1k1p1p/2P2n2/nB6/3P1Qb1/2N2p2/PPP3PP/R1B1R1K1 b - -
r2q1b1r/p2k1p1p/2p2n2/nB6/3PNQb1/5p2/PPP3PP/R1B1R1K1 b - -
r2q1b1r/p2k1p1p/2p5/nB6/3Pn1Q1/5p2/PPP3PP/R1B1R1K1 b - -
r2q1b1r/p1k2p1p/2p5/nB6/3PR1Q1/5p2/PPP3PP/R1B3K1 b - -
r2q1b1r/p1k2p1p/8/np6/3PR3/5Q2/PPP3PP/R1B3K1 b - -
r4b1r/p1kq1p1p/8/np6/3P1R2/5Q2/PPP3PP/R1B3K1 b - -
r6r/p1kqbR1p/8/np6/3P4/5Q2/PPP3PP/R1B3K1 b - -
5r1r/p1kqbR1p/8/np6/3P1B2/5Q2/PPP3PP/R5K1 b - -
5r1r/p2qbR1p/1k6/np2B3/3P4/5Q2/PPP3PP/R5K1 b - -
5rr1/p2qbR1p/1k6/np2B3/3P4/2P2Q2/PP4PP/R5K1 b - -
5rr1/p2qbR1p/1kn5/1p2B3/3P4/2P2Q2/PP4PP/4R1K1 b - -
4qRr1/p3b2p/1kn5/1p2B3/3P4/2P2Q2/PP4PP/4R1K1 b - -
5qr1/p3b2p/1kn5/1p1QB3/3P4/2P5/PP4PP/4R1K1 b - -
5q2/p3b2p/1kn5/1p1QB1r1/P2P4/2P5/1P4PP/4R1K1 b - - 
5q2/p3b2p/1kn5/3QB1r1/p1PP4/8/1P4PP/4R1K1 b - -
5q2/p3b2p/1k6/3QR1r1/p1PP4/8/1P4PP/6K1 b - -
5q2/p3b2p/1k6/4Q3/p1PP4/8/1P4PP/6K1 b - -
3q4/p3b2p/1k6/2P1Q3/p2P4/8/1P4PP/6K1 b - -
3q4/p3b2p/8/1kP5/p2P4/8/1P2Q1PP/6K1 b - - 
3q4/p3b2p/8/2P5/pk1P4/3Q4/1P4PP/6K1 b - -

Joerg Oster
Posts: 691
Joined: Fri Mar 10, 2006 3:29 pm
Location: Germany

Re: New SMP stuff (particularly Kai)

Post by Joerg Oster » Mon Jul 20, 2015 5:29 pm

Hi Louis,

if you are interested and have some time, perhaps you would like to repeat the test with my latest patch, which you can find here: https://github.com/joergoster/Stockfish ... join_tweak.
I'd really be interested to see if this patch improves things. :D
Jörg Oster

Laskos
Posts: 9456
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: New SMP stuff (particularly Kai)

Post by Laskos » Mon Jul 20, 2015 7:30 pm

bob wrote:I have decided to do the SMP speedup calculations a bit differently. [...] Comments about speedup computed in this way?
This is how I usually compute the SMP speedup; over the last few years I have posted several results here obtained just by dividing total times. Done this way, it's important to have a reasonably large number of positions or runs, and your four repetitions of 24 positions seem adequate. The NPS speedup is not problematic to measure; it should vary only a little from run to run. Time to depth is a difficult animal: the times cannot be short in the multi-core test, so they are much longer on a single core, and combined with the number of positions and runs, the measurement becomes very time consuming with only a few cores. Curious, did you check this Crafty version for non-widening? IIRC you had almost 30% overhead previously; did it go down? The effective speedup still seems VERY high. Before you first showed these speedups, I was more accustomed to what Louis is reporting for Stockfish: a speedup of 7 or so on 16 cores, and a little more, say 8, on 20.

Recently I took Cheng, an engine that widens considerably in its multi-core search, and tried to estimate its effective speedup without actually playing games (games being the only rigorous method when the engine widens in the parallel search). Time to depth is not adequate in this case because of the widening, so I tried to compute time to solution on a non-tactical suite like STS, with mixed results. Tactical suites are not adequate, because the overhead of the parallel search may accidentally pick up lines that are usually pruned in the single-core search, and the speedup can be artificially inflated.
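One way to make the time-to-solution idea concrete is the sketch below. The helper and the data are hypothetical (a real test would parse engine output, e.g. UCI "info ... pv" lines, and the move names are made up); the point is only the measurement: take the time at which the expected move is first reported and then held until the end, and aggregate those times as with time to depth.

```python
# Sketch of a time-to-solution measurement for one position.
# pv_history entries and move names are hypothetical illustrations.

def time_to_solution(pv_history, expected):
    """Return the time (s) after which `expected` is reported and never
    dropped again. pv_history is a chronological list of (time, best_move)
    pairs; returns None if the move is not held at the end."""
    solved_at = None
    for t, move in pv_history:
        if move == expected:
            if solved_at is None:
                solved_at = t      # first time the expected move appears
        else:
            solved_at = None       # the engine dropped the move again

    return solved_at

# One thread vs. many threads on a single (made-up) position:
one_thread   = [(0.5, "e2e4"), (3.0, "d2d4"), (9.0, "g1f3"), (20.0, "g1f3")]
many_threads = [(0.4, "d2d4"), (2.5, "g1f3"), (6.0, "g1f3")]

t1 = time_to_solution(one_thread, "g1f3")    # 9.0
tn = time_to_solution(many_threads, "g1f3")  # 2.5
print(f"speedup on this position: {t1 / tn:.1f}")  # 3.6
```

Summed over a whole suite, this measure gives a widening search credit for finding the move sooner instead of penalizing it for the extra nodes, which is why it suits a widening engine better than time to depth.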

One more thing about the super-linear positions: do they exhibit super-linear behavior in all multi-core runs? Is it a repeatable accident of the tree exploding on a single core, with the parallel search behaving nicely in all runs?

zullil
Posts: 5668
Joined: Mon Jan 08, 2007 11:31 pm
Location: PA USA
Full name: Louis Zulli

Re: New SMP stuff (particularly Kai)

Post by zullil » Mon Jul 20, 2015 8:10 pm

zullil wrote: I attempted an analogous test with the latest developmental version of Stockfish, using the 24 positions you posted previously. [...]
I repeated the test, this time searching each position to depth 27.

Code:

./stockfish bench 8192 1 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 3696244
Nodes searched  : 8230454912
Nodes/second    : 2226707

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 479159
Nodes searched  : 11386079093
Nodes/second    : 23762632
speedup: 3696244 / 479159 = 7.7
nps speedup: 23762632 / 2226707 = 10.7
tree growth: 11386079093 / 8230454912 = 38%

zullil
Posts: 5668
Joined: Mon Jan 08, 2007 11:31 pm
Location: PA USA
Full name: Louis Zulli

Re: New SMP stuff (particularly Kai)

Post by zullil » Mon Jul 20, 2015 8:11 pm

Joerg Oster wrote:Hi Louis, [...] I'd really be interested to see if this patch improves things. :D
OK, Joerg, I'll test this ASAP.

zullil
Posts: 5668
Joined: Mon Jan 08, 2007 11:31 pm
Location: PA USA
Full name: Louis Zulli

Re: New SMP stuff (particularly Kai)

Post by zullil » Mon Jul 20, 2015 8:48 pm

Joerg Oster wrote:Hi Louis, [...] I'd really be interested to see if this patch improves things. :D
Just one run, but less total time and a smaller tree, so maybe a change for the better:

Code:

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 479159
Nodes searched  : 11386079093
Nodes/second    : 23762632

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 443388
Nodes searched  : 10465813469
Nodes/second    : 23604187
Obviously, the second benchmark is for your tweaked Stockfish. Will repeat this, but with greater depth.

bob
Posts: 20562
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob » Mon Jul 20, 2015 9:18 pm

Laskos wrote:
bob wrote:I have decided to do the SMP speedup calculations a bit differently. [...] Comments about speedup computed in this way?
This is how I usually compute the SMP speedup; over the last few years I have posted several results here obtained just by dividing total times. Done this way, it's important to have a reasonably large number of positions or runs, and your four repetitions of 24 positions seem adequate. The NPS speedup is not problematic to measure; it should vary only a little from run to run. Time to depth is a difficult animal: the times cannot be short in the multi-core test, so they are much longer on a single core, and combined with the number of positions and runs, the measurement becomes very time consuming with only a few cores. Curious, did you check this Crafty version for non-widening? IIRC you had almost 30% overhead previously; did it go down? The effective speedup still seems VERY high. Before you first showed these speedups, I was more accustomed to what Louis is reporting for Stockfish: a speedup of 7 or so on 16 cores, and a little more, say 8, on 20.

Recently I took Cheng, an engine that widens considerably in its multi-core search, and tried to estimate its effective speedup without actually playing games (games being the only rigorous method when the engine widens in the parallel search). Time to depth is not adequate in this case because of the widening, so I tried to compute time to solution on a non-tactical suite like STS, with mixed results. Tactical suites are not adequate, because the overhead of the parallel search may accidentally pick up lines that are usually pruned in the single-core search, and the speedup can be artificially inflated.

One more thing about the super-linear positions: do they exhibit super-linear behavior in all multi-core runs? Is it a repeatable accident of the tree exploding on a single core, with the parallel search behaving nicely in all runs?
Yes, but I will re-test to confirm. The last test showed a 2 Elo difference between the two versions, one searching with 1 thread and the other with 8, to (I think) a fixed depth of 12.

I certainly have one or two that do. But they also show pathological behavior (for Crafty) with one thread, where at some depth the branching factor simply blows up... I am going to study this issue separately; right now I want to (a) finish this and then (b) write up the parallel search algorithm.

Joerg Oster
Posts: 691
Joined: Fri Mar 10, 2006 3:29 pm
Location: Germany

Re: New SMP stuff (particularly Kai)

Post by Joerg Oster » Mon Jul 20, 2015 9:30 pm

bob wrote:I have decided to do the SMP speedup calculations a bit differently. [...] Comments about speedup computed in this way?
I agree it's a reasonable way of doing it.
Tomorrow I will give some numbers with 4 and 8 threads for Stockfish for comparison.
Jörg Oster

Joerg Oster
Posts: 691
Joined: Fri Mar 10, 2006 3:29 pm
Location: Germany

Re: New SMP stuff (particularly Kai)

Post by Joerg Oster » Mon Jul 20, 2015 9:35 pm

zullil wrote:Just one run, but less total time and a smaller tree, so maybe a change for the better: [...] Obviously, the second benchmark is for your tweaked Stockfish. Will repeat this, but with greater depth.
Looks promising, many thanks.
Looking forward to your further results.
Jörg Oster
