I have decided to do the SMP speedup calculations a bit differently. The old way really was showing way too much information, going position by position. I am trying a less favorable way of showing SMP speedup. I have adjusted the search depths a bit so that the 1-thread test takes around 30 minutes or so on average, as close as I can get with a fixed depth. The total time taken for all the positions is summed, which gives a total time for the test. I am now through 2 of the 4 20-thread runs, doing the same thing. These now take under an hour each.
I am computing the speedup as simply T(1) / T(20). This seems like a reasonable way to compute this if you think of the set of positions as a single game, which is what they actually are. The early numbers for 20 cores look like this:
speedup: 13.6 and 12.8
nps speedup: 14.9 and 15.0
tree growth: 10% and 18%
These numbers look better if I take each position, one by one, and compute the speedup and then either take the mean or geometric mean. But this greatly reduces the amount of data.
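The difference between the two aggregation methods is easy to see in a few lines of Python. The times below are made-up illustrative numbers, not measured data: the ratio of totals weights the slow positions most heavily, while a per-position mean lets an occasional lucky parallel search pull the average up.

```python
from statistics import geometric_mean

# Illustrative per-position times in seconds -- invented numbers,
# chosen only to show how the aggregates can differ.
t1  = [1800, 2100, 1500, 2400]   # 1-thread times
t20 = [140, 150, 160, 130]       # 20-thread times

# Method 1 (used above): treat the whole suite as one long game.
total_speedup = sum(t1) / sum(t20)

# Method 2: compute per-position ratios, then take a mean.
ratios = [a / b for a, b in zip(t1, t20)]
arith_mean = sum(ratios) / len(ratios)
geo_mean = geometric_mean(ratios)

print(f"total-time speedup: {total_speedup:.2f}")
print(f"arithmetic mean   : {arith_mean:.2f}")
print(f"geometric mean    : {geo_mean:.2f}")
```

With these numbers the arithmetic mean comes out highest and the ratio of totals sits between it and the geometric mean, which matches the observation that per-position averaging tends to look better.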
What do you think of this? I still have the old spreadsheets, so cutting and pasting the new results is easy enough. Another thing changed here: in equalizing the times at 1 thread, I reduced the depth on two of the super-linear positions, and they now behave normally again. 4 of the 24 show a modest super-linear speedup, but those two positions with 50x and above are now behaving normally. The last iteration was a killer in those two: the branching factor simply went insane, and the parallel search apparently helped with the problematic move ordering and produced good results.
In any case, these numbers are actually more in line with what I expected. The NPS scaling is down, partly due to a funny compiler issue I don't understand yet, and partly because I made a few changes to better measure the internal wait time, which was previously being under-reported. The strange compiler problem is one of those "random events". A couple of weeks back I reported a huge difference between the Intel and GCC executables: they always searched exactly the same number of nodes, but the Intel binary was 10%+ faster. That has gone away. Just moving things around or adding data occasionally causes this. I'm going to run VTune to see what is up, but all of these were run with the GCC compiler, which is currently producing the fastest executable...
More later. I will have all four 20-core runs in another hour or so, and have it queued up to run four 16-thread, four 8-thread, four 4-thread and four 2-thread runs back to back... The four 2-thread runs will take about 24 hours total, the four 4-thread runs about 12 hours, and the four 8-thread runs about 6. The four 16-thread runs will finish fairly close to the 20-core times...
More tomorrow. Comments about speedup computed in this way?
New SMP stuff (particularly Kai)
Moderators: hgm, Rebel, chrisw
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: New SMP stuff (particularly Kai)
bob wrote: […]

I attempted an analogous test with the latest development version of Stockfish, using the 24 positions you posted previously (see below). Each position was searched to depth 26, with an 8 GB hash table that was cleared between positions (assuming I modified Stockfish's code correctly). The first run was with 1 thread, and the second with 20 threads. Turbo Boost and hyper-threading were disabled in the BIOS. Note that the 20-thread run was done just once; it should really be repeated several times because of the nondeterminism of the SMP search. In any case, here's the data:
Code:
./stockfish bench 8192 1 26 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 2461286
Nodes searched : 5468499984
Nodes/second : 2221805
./stockfish bench 8192 20 26 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 321851
Nodes searched : 7094586497
Nodes/second : 22043077
speedup: 2461286 / 321851 = 7.6
nps speedup: 22043077 / 2221805 = 9.9
tree growth: 7094586497 / 5468499984 = 30%
Dual Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
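The three ratios above are not independent: since time = nodes / NPS for each run, the time-to-depth speedup equals the NPS speedup divided by the tree-growth factor. A quick sanity check, using only the totals from the bench output above:

```python
# Sanity-check the posted ratios. Because T = nodes / NPS,
#   T(1)/T(20) = (NPS20 / NPS1) * (N1 / N20),
# i.e. time-to-depth speedup = NPS speedup / tree-growth factor.
t1_ms,  n1,  nps1  = 2_461_286, 5_468_499_984, 2_221_805
t20_ms, n20, nps20 = 321_851, 7_094_586_497, 22_043_077

speedup     = t1_ms / t20_ms    # ~7.6
nps_speedup = nps20 / nps1      # ~9.9
growth      = n20 / n1          # ~1.30, i.e. a 30% larger tree

# The identity holds up to rounding in the reported NPS figures.
print(speedup, nps_speedup / growth)
```

The same identity fits the Crafty numbers earlier in the thread (14.9 / 1.10 ≈ 13.6 and 15.0 / 1.18 ≈ 12.8), so it is a useful cross-check on any of these runs.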
Code:
r2qkbnr/ppp2p1p/2n5/3P4/2BP1pb1/2N2p2/PPPQ2PP/R1B2RK1 b kq -
r2qkbnr/ppp2p1p/8/nB1P4/3P1pb1/2N2p2/PPPQ2PP/R1B2RK1 b kq -
r2qkbnr/pp3p1p/2p5/nB1P4/3P1Qb1/2N2p2/PPP3PP/R1B2RK1 b kq -
r2qkb1r/pp3p1p/2p2n2/nB1P4/3P1Qb1/2N2p2/PPP3PP/R1B1R1K1 b kq -
r2q1b1r/pp1k1p1p/2P2n2/nB6/3P1Qb1/2N2p2/PPP3PP/R1B1R1K1 b - -
r2q1b1r/p2k1p1p/2p2n2/nB6/3PNQb1/5p2/PPP3PP/R1B1R1K1 b - -
r2q1b1r/p2k1p1p/2p5/nB6/3Pn1Q1/5p2/PPP3PP/R1B1R1K1 b - -
r2q1b1r/p1k2p1p/2p5/nB6/3PR1Q1/5p2/PPP3PP/R1B3K1 b - -
r2q1b1r/p1k2p1p/8/np6/3PR3/5Q2/PPP3PP/R1B3K1 b - -
r4b1r/p1kq1p1p/8/np6/3P1R2/5Q2/PPP3PP/R1B3K1 b - -
r6r/p1kqbR1p/8/np6/3P4/5Q2/PPP3PP/R1B3K1 b - -
5r1r/p1kqbR1p/8/np6/3P1B2/5Q2/PPP3PP/R5K1 b - -
5r1r/p2qbR1p/1k6/np2B3/3P4/5Q2/PPP3PP/R5K1 b - -
5rr1/p2qbR1p/1k6/np2B3/3P4/2P2Q2/PP4PP/R5K1 b - -
5rr1/p2qbR1p/1kn5/1p2B3/3P4/2P2Q2/PP4PP/4R1K1 b - -
4qRr1/p3b2p/1kn5/1p2B3/3P4/2P2Q2/PP4PP/4R1K1 b - -
5qr1/p3b2p/1kn5/1p1QB3/3P4/2P5/PP4PP/4R1K1 b - -
5q2/p3b2p/1kn5/1p1QB1r1/P2P4/2P5/1P4PP/4R1K1 b - -
5q2/p3b2p/1kn5/3QB1r1/p1PP4/8/1P4PP/4R1K1 b - -
5q2/p3b2p/1k6/3QR1r1/p1PP4/8/1P4PP/6K1 b - -
5q2/p3b2p/1k6/4Q3/p1PP4/8/1P4PP/6K1 b - -
3q4/p3b2p/1k6/2P1Q3/p2P4/8/1P4PP/6K1 b - -
3q4/p3b2p/8/1kP5/p2P4/8/1P2Q1PP/6K1 b - -
3q4/p3b2p/8/2P5/pk1P4/3Q4/1P4PP/6K1 b - -
-
- Posts: 937
- Joined: Fri Mar 10, 2006 4:29 pm
- Location: Germany
Re: New SMP stuff (particularly Kai)
Hi Louis,
if you are interested and have some time, perhaps you could repeat the test with my latest patch, which you can find here: https://github.com/joergoster/Stockfish ... join_tweak.
I'd really be interested to see whether this patch improves things.
Jörg Oster
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: New SMP stuff (particularly Kai)
bob wrote: […]

It's how I usually do the SMP speedup; I have posted several results here over the last few years computed just by dividing total times. Done this way, it's important to have a reasonably high number of positions or runs, and your 4 repetitions of the 24 positions seem adequate. The NPS speedup measurement is not problematic; it should vary only a little from run to run. Time to depth is a more difficult animal: the multi-core times cannot be short, so the single-core times will be much longer, and combined with the number of positions and runs the test becomes very time consuming with just a few cores. Curious: did you check this Crafty version for non-widening? IIRC you had almost 30% overhead previously; did it go down? The effective speedup still seems VERY high. Before you first showed these speedups, I was more accustomed to what Louis is reporting for Stockfish: a speedup of 7 or so on 16 cores, and a little more, say 8, on 20.
Recently I took Cheng, an engine that widens a great deal on multiple cores, and tried to estimate its effective speedup without actually playing games (playing games being the only rigorous method when the engine widens in the parallel search). Time to depth is not adequate in this case because of the widening, so I tried to compute time to solution on a non-tactical suite like STS, with mixed results. Tactical suites are not adequate, because the overhead of the parallel search may accidentally pick up lines that are usually pruned in the single-core search, and the speedup can be artificially inflated.
One thing about the superlinear positions: do they exhibit superlinear behavior in all multicore runs? Or is it a repeatable accident of the tree exploding on a single core, with the parallel search behaving nicely in all runs?
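Since SMP search is nondeterministic, repeated multi-core runs should be summarized with their run-to-run spread rather than a single number. A small sketch of that bookkeeping, with hypothetical run totals (invented numbers, not measurements from this thread):

```python
import statistics

# Hypothetical suite totals in seconds: one 1-thread run and four
# 20-thread runs of the same positions. Illustrative numbers only.
t1_total = 43_200                       # ~12 hours single-threaded
t20_runs = [3_180, 3_360, 3_050, 3_290]

speedups = [t1_total / t for t in t20_runs]
mean_s  = statistics.mean(speedups)
stdev_s = statistics.stdev(speedups)    # sample std dev across runs

print(f"speedup: {mean_s:.1f} +/- {stdev_s:.1f} over {len(speedups)} runs")
```

Reporting the spread makes it obvious whether a difference between two configurations (say, a patched and an unpatched engine) is larger than the normal run-to-run noise.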
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: New SMP stuff (particularly Kai)
zullil wrote: […]

I repeated the test, this time searching each position to depth 27. These seem to yield some rather disappointing ratios:
Code:
./stockfish bench 8192 1 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 3696244
Nodes searched : 8230454912
Nodes/second : 2226707
./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 479159
Nodes searched : 11386079093
Nodes/second : 23762632
speedup: 3696244 / 479159 = 7.7
nps speedup: 23762632 / 2226707 = 10.7
tree growth: 11386079093 / 8230454912 = 38%
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: New SMP stuff (particularly Kai)
Joerg Oster wrote: […]

OK, Joerg, I'll test this ASAP.
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: New SMP stuff (particularly Kai)
Joerg Oster wrote: […]

Just one run, but less total time and a smaller tree, so maybe a change for the better:
Code:
./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 479159
Nodes searched : 11386079093
Nodes/second : 23762632
./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 443388
Nodes searched : 10465813469
Nodes/second : 23604187
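Taking the two totals above at face value (one run each, so the differences are well within normal SMP run-to-run noise), the change can be quantified with a couple of ratios:

```python
# Baseline vs. join_tweak runs, from the two totals above
# (20 threads, depth 27, one run each).
base_ms,  base_nodes  = 479_159, 11_386_079_093
tweak_ms, tweak_nodes = 443_388, 10_465_813_469

time_saving = 1 - tweak_ms / base_ms          # fraction of time saved
tree_shrink = 1 - tweak_nodes / base_nodes    # fraction of nodes saved

print(f"time: {100 * time_saving:.1f}% less time to depth 27")
print(f"tree: {100 * tree_shrink:.1f}% fewer nodes searched")
```

Roughly 7-8% on both measures, which is consistent: NPS barely moved between the runs, so almost all of the time saving comes from the smaller tree.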
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: New SMP stuff (particularly Kai)
Laskos wrote: […] Curious: did you check this Crafty version for non-widening? IIRC you had almost 30% overhead previously; did it go down?

Yes, but I will re-test to confirm. The last test showed a 2 Elo difference between the two versions, one with 1 thread vs. the other with 8, at (I think) a fixed depth of 12.

Laskos wrote: One thing about the superlinear positions: do they exhibit superlinear behavior in all multicore runs?
I certainly have one or two that do. But they also show pathological behavior (for Crafty) with one thread, where at some depth the branching factor simply blows up... I am going to study this issue separately, and right now I want to (a) finish this and then (b) try to write up the parallel search algorithm.
-
- Posts: 937
- Joined: Fri Mar 10, 2006 4:29 pm
- Location: Germany
Re: New SMP stuff (particularly Kai)
bob wrote: […]

I agree it's a reasonable way of doing it.
Tomorrow I will give some numbers with 4 and 8 threads for Stockfish for comparison.
Jörg Oster
-
- Posts: 937
- Joined: Fri Mar 10, 2006 4:29 pm
- Location: Germany
Re: New SMP stuff (particularly Kai)
zullil wrote: […] Obviously, the second benchmark is for your tweaked Stockfish. Will repeat this, but with greater depth.

Looks promising, many thanks.
Looking forward to your further results.
Jörg Oster