Actual speedups from YBWC and ABDADA on 8+ core machines?


TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 7:22 pm

Actual speedups from YBWC and ABDADA on 8+ core machines?

Post by TomKerrigan » Fri Jul 10, 2015 10:21 pm

Hi guys,

On a whim, I bought a circa-2010 rack-mounted server on eBay with 24 AMD cores for only $200. It has been a fun thing to own so far. I put it in my apartment's storage room (which has an electrical outlet) because it's absurdly loud. Happily I can ssh into it no problem thanks to a $9 wifi adapter that I bought on Amazon. It does use a lot of power (~200W when idle, ~300W under load) so I only turn it on once I have good confidence in the code I want to run on it.

I wanted to see how my chess engine (YBWC) would scale on it, since the biggest machine I had run my engine on up until recently was quad-core.

The results were initially depressing--due to some memory and lock contention, running with more than 5 cores slowed the engine down. But after some changes, I was able to get speedups going up to 20 cores.

I ran the Bratko Kopec positions to 11 ply and got the following speedups:

1 thread: 1x (obviously)
2 threads: 1.62
4 threads: 2.56
8 threads: 4.06
12 threads: 5.82
16 threads: 6.31
20 threads: 6.75

Running BK to deeper depths got somewhat higher speedups, as did running other positions like some randomly selected from ECM.

(A note about my YBWC implementation--I seem to be running into some lock contention or cache coherency issues, since 24 threads search at the same NPS as 20 threads. I think I can address a lot of this but it would require more work than just a couple days of tinkering.)

I also implemented ABDADA (since it only takes a couple hours) along with the cutoff check that I described in a separate thread, and got the following speedups for that:

1 thread: 1x
2 threads: 1.43
4 threads: 2.25
8 threads: 3.33
12 threads: 4.79
16 threads: 5.50
20 threads: 5.94
24 threads: 7.07

So, I was wondering if these results are in line with what others are seeing. I've browsed through a number of old posts, plus this more recent one:

http://www.talkchess.com/forum/viewtopic.php?t=56019

This seems to indicate that my speedups are comparable to what others are seeing. For example, from the linked post, the YBWC engines are seeing a 2.8-3.0 speedup with 4 cores vs. my 2.56, and a ~4.5 speedup with 8 cores vs. my 4.06 speedup. Already my results are pretty close, and once I rewrite some stuff to reduce contention, I expect to see almost exactly the same speedups.

Then again, I have reason to wonder, based on various numbers published in the academic papers for this stuff. For example, the ABDADA paper claims a speedup of ~10x for 16 cores whereas my speedup is only 5.5x, and the YBWC paper claims a ~300x speedup on 1000 cores, i.e. an efficiency of ~30% on 1000 cores, whereas my engine is already down to 40% efficiency on 16 cores.

So I wonder if these academic speedups are based on completely different circumstances that wouldn't apply to my engine or if something is going wrong with my parallel implementations.

Anyway, I'd be interested in any feedback!
Talk to you guys soon,
Tom

Daniel Shawul
Posts: 3757
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by Daniel Shawul » Sat Jul 11, 2015 12:00 am

ABDADA is a very good algorithm, unlike the lazy SMP approach that seems to be the new favorite in this forum. Search overhead is _properly_ taken care of in ABDADA, unlike the ad hoc schemes used in the lazy approach (e.g. searching depths d and d+1 in parallel, etc.). It takes almost the same effort to implement ABDADA and lazy SMP. I have implemented ABDADA, YBW, and SHT (aka lazy SMP with no fancy/vague search-overhead-reduction method). ABDADA gave me a speedup of 1.87x on 2 processors, equal to YBW. The thing I like about ABDADA is that it is fault tolerant! A processor can go offline and the remaining processors will still search its moves on their second pass and finish the search.
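The fault-tolerance point comes from ABDADA's two-pass move loop. Here is a rough, self-contained sketch of that idea (the function name and the `std::set` stand-in for the shared "busy" table are hypothetical, not Scorpio's actual code): on the first pass a thread skips moves another thread is already searching, and on the second pass it searches whatever is still unsearched, so the moves of a helper that goes offline are still covered.

```cpp
#include <set>
#include <vector>

// Hypothetical sketch of ABDADA's two-pass move loop. "busy" stands in for
// the shared currently-being-searched table; moves are just ints here.
std::vector<int> search_order(const std::vector<int>& moves,
                              const std::set<int>& busy) {
    std::vector<int> order, deferred;
    for (int m : moves) {
        if (busy.count(m))
            deferred.push_back(m);   // pass 1: skip moves other threads are on
        else
            order.push_back(m);
    }
    // pass 2: come back to the deferred moves; even if the other thread
    // never finishes (goes offline), these moves still get searched here.
    for (int m : deferred)
        order.push_back(m);
    return order;
}
```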

See my effort here:
http://talkchess.com/forum/viewtopic.ph ... ght=abdada

Kai had better include Scorpio in that list, with all of its parallel algorithms: YBW, ABDADA, and SHT. Unfortunately that guy does his pseudo-studies with strong engines only :) Comparing the performance of different algorithms across different engines is vague to say the least.

Btw, I implemented EGBBs in tscp as a demonstration some time ago.

bob
Posts: 20555
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob » Sat Jul 11, 2015 12:26 am

TomKerrigan wrote:Hi guys,

On a whim, I bought a circa-2010 rack-mounted server on eBay with 24 AMD cores for only $200. It has been a fun thing to own so far. I put it in my apartment's storage room (which has an electrical outlet) because it's absurdly loud. Happily I can ssh into it no problem thanks to a $9 wifi adapter that I bought on Amazon. It does use a lot of power (~200W when idle, ~300W under load) so I only turn it on once I have good confidence in the code I want to run on it.

I wanted to see how my chess engine (YBWC) would scale on it, since the biggest machine I had run my engine on up until recently was quad-core.

The results were initially depressing--due to some memory and lock contention, running with more than 5 cores slowed the engine down. But after some changes, I was able to get speedups going up to 20 cores.

I ran the Bratko Kopec positions to 11 ply and got the following speedups:

1 thread: 1x (obviously)
2 threads: 1.62
4 threads: 2.56
8 threads: 4.06
12 threads: 5.82
16 threads: 6.31
20 threads: 6.75

Running BK to deeper depths got somewhat higher speedups, as did running other positions like some randomly selected from ECM.

(A note about my YBWC implementation--I seem to be running into some lock contention or cache coherency issues, since 24 threads search at the same NPS as 20 threads. I think I can address a lot of this but it would require more work than just a couple days of tinkering.)

I also implemented ABDADA (since it only takes a couple hours) along with the cutoff check that I described in a separate thread, and got the following speedups for that:

1 thread: 1x
2 threads: 1.43
4 threads: 2.25
8 threads: 3.33
12 threads: 4.79
16 threads: 5.50
20 threads: 5.94
24 threads: 7.07

So, I was wondering if these results are in line with what others are seeing. I've browsed through a number of old posts, plus this more recent one:

http://www.talkchess.com/forum/viewtopic.php?t=56019

This seems to indicate that my speedups are comparable to what others are seeing. For example, from the linked post, the YBWC engines are seeing a 2.8-3.0 speedup with 4 cores vs. my 2.56, and a ~4.5 speedup with 8 cores vs. my 4.06 speedup. Already my results are pretty close, and once I rewrite some stuff to reduce contention, I expect to see almost exactly the same speedups.

Then again, I have reason to wonder, based on various numbers published in the academic papers for this stuff. For example, the ABDADA paper claims a speedup of ~10x for 16 cores whereas my speedup is only 5.5x, and the YBWC paper claims a ~300x speedup on 1000 cores, i.e. an efficiency of ~30% on 1000 cores, whereas my engine is already down to 40% efficiency on 16 cores.

So I wonder if these academic speedups are based on completely different circumstances that wouldn't apply to my engine or if something is going wrong with my parallel implementations.

Anyway, I'd be interested in any feedback!
Talk to you guys soon,
Tom
Current Crafty uses YBW, but it is doing some pretty unique things to make it work. I don't have a full set of data yet, but I have run the old CB/DTS data (24 positions) with 1 core and 20 cores. The 20-core speedup sits at 15.3x at the moment. That is from 4 separate runs, each run taking 2-5 minutes per position with 20 cores, and then taking the geometric mean of the 4 speedups.

I am just starting the 16 core tests, same idea... after two runs, the pure average speedup is 11.3x (10.78x and 11.8x faster for the two samples) and the two geometric means for the two runs were 12.4x and 13.1x...

This is still using the classic recursive negamax. I am in the process of running two more 16-cpu tests, then four 8-cpu tests, four 4-cpu tests, and four 2-cpu tests. The 1-cpu test takes forever, but it has already been completed so I could compute the 20-cpu speedups...
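The averaging described above (per-position speedup is the 1-cpu time over the N-cpu time, then take the geometric mean across positions) amounts to a few lines. A hedged sketch -- the function name and the sample timings in the usage note are made up for illustration; the geometric mean is the natural average for ratios since it treats 2x and 0.5x symmetrically:

```cpp
#include <cmath>
#include <vector>

// Geometric mean of per-position speedups t1[i] / tn[i], computed in
// log space to avoid overflow when multiplying many ratios together.
double geometric_mean_speedup(const std::vector<double>& t1,
                              const std::vector<double>& tn) {
    double log_sum = 0.0;
    for (std::size_t i = 0; i < t1.size(); ++i)
        log_sum += std::log(t1[i] / tn[i]);
    return std::exp(log_sum / static_cast<double>(t1.size()));
}
```

For example, two positions taking 8s on 1 cpu and 2s and 4s on N cpus give speedups of 4x and 2x, for a geometric mean of sqrt(8) ≈ 2.83x.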

TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 7:22 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by TomKerrigan » Sat Jul 11, 2015 1:19 am

Hi Bob,

A couple of questions--

1) Do you have an EPD or similar file of the positions you use for testing? Some quick internet searching didn't turn up anything for me.

2) I see from the graph I linked to in my OP that e.g. Stockfish is using YBW and getting a 4.76x improvement with 8 threads. That's 60% efficient. You are saying you are getting significantly more efficiency (76.5%) with more than twice the threads (20). Can you speculate about the difference, since Stockfish, Houdini, and my own engine seem to be in the same ballpark and you're doing much much better?

TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 7:22 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by TomKerrigan » Sat Jul 11, 2015 1:34 am

I've seen a number of references to Lazy SMP but I can't find an explanation of it anywhere. Could somebody provide a link please?

If Lazy SMP is inferior to ABDADA though, I'm not really sure why anybody would use it, since ABDADA is crazy easy to implement.

Actually I didn't even bother using my main hash table for ABDADA, I just created a small hash table to store which moves were currently being searched. Access to it doesn't really need to be synchronized, because what's the worst that could happen if there's a race condition? And there doesn't even have to be a "number of threads searching" field, since the only thing you need to know is if ANY thread finished searching the position.

All in all, I probably made less than 50 lines of changes to my code to implement ABDADA.
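A minimal sketch of that scheme (the names and table size here are hypothetical, not Tom's actual code): one small array of atomic keys, separate from the main hash table, with no locking at all -- the "worst that can happen is a harmless race" argument above means a collision or lost update only costs a little duplicated search, never a wrong result.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// A tiny "currently being searched" table, separate from the main hash.
// One atomic 64-bit key per slot; collisions simply overwrite, which is
// harmless here -- a stale or wrong answer only adds a bit of overhead.
constexpr std::size_t TABLE_SIZE = 1 << 16;
std::atomic<std::uint64_t> busy[TABLE_SIZE];

inline std::size_t slot(std::uint64_t key) { return key & (TABLE_SIZE - 1); }

// Mark a position as being searched by some thread.
void start_searching(std::uint64_t key) {
    busy[slot(key)].store(key, std::memory_order_relaxed);
}

// Any thread that finishes the position clears the flag; no thread count
// is needed, only "is ANY thread still on this position".
void finish_searching(std::uint64_t key) {
    busy[slot(key)].store(0, std::memory_order_relaxed);
}

// Deferred-move check: if another thread appears to be on this position,
// skip it on the first pass and come back to it on the second.
bool defer_move(std::uint64_t key) {
    return busy[slot(key)].load(std::memory_order_relaxed) == key;
}
```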

jdart
Posts: 3824
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by jdart » Sat Jul 11, 2015 2:05 am

I tried ABDADA years ago and found that the increased hash table contention just killed performance. Your mileage may vary. The only good thing is that it is easy to implement.

Re the parent post's question: I think 4x speedup on 8 cores is not particularly good. But for me the factor depends on game phase. I am seeing less efficiency in the endgame. See http://talkchess.com/forum/viewtopic.ph ... 97&t=47930 although these are not very recent results.

-Jon

TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 7:22 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by TomKerrigan » Sat Jul 11, 2015 2:28 am

I wonder if we're talking about the same thing. The post you linked to talked about nodes per second whereas the numbers I posted were time-to-depth (total time to search all the BK positions to 11 ply).

For 8 threads, I'm searching 6.5x as many NPS. I'm sure this could be improved somewhat, but obviously not a huge amount.

My NPS scaling for ABDADA isn't quite as good as with YBW, but the curves are different. ABDADA seems to keep scaling at a good rate past 20 cores, whereas my current implementation of YBW stalls out.

bob
Posts: 20555
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob » Sat Jul 11, 2015 2:38 am

TomKerrigan wrote:Hi Bob,

A couple of questions--

1) Do you have an EPD or similar file of the positions you use for testing? Some quick internet searching didn't turn up anything for me.

2) I see from the graph I linked to in my OP that e.g. Stockfish is using YBW and getting a 4.76x improvement with 8 threads. That's 60% efficient. You are saying you are getting significantly more efficiency (76.5%) with more than twice the threads (20). Can you speculate about the difference, since Stockfish, Houdini, and my own engine seem to be in the same ballpark and you're doing much much better?
Yes. However the test is somewhat tricky. I have to set a specific depth for each position so that I can test with 1, 2, 4, ... cores. What I do is this: take the positions which I will post below, set the time limit to 5 minutes per position, and run them with max cores. Look at the output and, for each position, note the depth where you get a full iteration done within say 1-4 minutes. This will be different for each position obviously. Now you use this depth for all tests, which means the 1-cpu runs will take forever and then some.

Here is the FEN for the positions. I'll also leave in my "sd=n" command, which shows the depth I search these to using a 2x10-core box, to produce an average of 2-3 minutes per position. I modified my FEN parsing to accept the sd=n option to set the depth for that position. You can remove it if you want. Also, the best move is just the move played by Cray Blitz in this game back around 1988 or so; they are not always the best moves. But Crafty complains if a test position doesn't have a best move / avoid move parameter, so I gave it one. :)

These were the same positions that I used for the DTS paper published in the ICCA Journal in maybe 1992 or so... Nothing says they are great positions. One thing I do for testing is to clear the hash between positions since these positions represent consecutive moves in a real game, and leaving the hash intact lets one move greatly influence the next, sometimes getting instant moves and such which I don't want.

Code: Select all

r2qkbnr/ppp2p1p/2n5/3P4/2BP1pb1/2N2p2/PPPQ2PP/R1B2RK1 b kq -  ; bm Na5; sd=28
r2qkbnr/ppp2p1p/8/nB1P4/3P1pb1/2N2p2/PPPQ2PP/R1B2RK1 b kq - ; bm c6; sd=27
r2qkbnr/pp3p1p/2p5/nB1P4/3P1Qb1/2N2p2/PPP3PP/R1B2RK1 b kq - ; bm cxb5; sd=26
r2qkb1r/pp3p1p/2p2n2/nB1P4/3P1Qb1/2N2p2/PPP3PP/R1B1R1K1 b kq - ; bm Kd7; sd=28
r2q1b1r/pp1k1p1p/2P2n2/nB6/3P1Qb1/2N2p2/PPP3PP/R1B1R1K1 b - - ; bm bxc6; sd=29
r2q1b1r/p2k1p1p/2p2n2/nB6/3PNQb1/5p2/PPP3PP/R1B1R1K1 b - - ; bm f2+; sd=29
r2q1b1r/p2k1p1p/2p5/nB6/3Pn1Q1/5p2/PPP3PP/R1B1R1K1 b - - ; bm Kc7; sd=26
r2q1b1r/p1k2p1p/2p5/nB6/3PR1Q1/5p2/PPP3PP/R1B3K1 b - - ; bm cxb5; sd=26
r2q1b1r/p1k2p1p/8/np6/3PR3/5Q2/PPP3PP/R1B3K1 b - - ; bm Bd6; sd=27
r4b1r/p1kq1p1p/8/np6/3P1R2/5Q2/PPP3PP/R1B3K1 b - - ; bm Be7; sd=28
r6r/p1kqbR1p/8/np6/3P4/5Q2/PPP3PP/R1B3K1 b - - ; bm Raf8; sd=30
5r1r/p1kqbR1p/8/np6/3P1B2/5Q2/PPP3PP/R5K1 b - - ; bm Kb6; sd=32
5r1r/p2qbR1p/1k6/np2B3/3P4/5Q2/PPP3PP/R5K1 b - - ; bm Rxf7; sd=31
5rr1/p2qbR1p/1k6/np2B3/3P4/2P2Q2/PP4PP/R5K1 b - - ; bm Qe8; sd=32
5rr1/p2qbR1p/1kn5/1p2B3/3P4/2P2Q2/PP4PP/4R1K1 b - - ; bm Rxf7; sd=34
4qRr1/p3b2p/1kn5/1p2B3/3P4/2P2Q2/PP4PP/4R1K1 b - - ; bm Rxf8; sd=33
5qr1/p3b2p/1kn5/1p1QB3/3P4/2P5/PP4PP/4R1K1 b - - ; bm Qd8; sd=32
5q2/p3b2p/1kn5/1p1QB1r1/P2P4/2P5/1P4PP/4R1K1 b - - ; bm bxa4; sd=30
5q2/p3b2p/1kn5/3QB1r1/p1PP4/8/1P4PP/4R1K1 b - - ; bm Nxe5; sd=28
5q2/p3b2p/1k6/3QR1r1/p1PP4/8/1P4PP/6K1 b - - ; bm Rxe5; sd=29
5q2/p3b2p/1k6/4Q3/p1PP4/8/1P4PP/6K1 b - - ; bm a6; sd=27
3q4/p3b2p/1k6/2P1Q3/p2P4/8/1P4PP/6K1 b - - ; bm Kb5; sd=25
3q4/p3b2p/8/1kP5/p2P4/8/1P2Q1PP/6K1 b - - ; bm Kb4; sd=25
3q4/p3b2p/8/2P5/pk1P4/3Q4/1P4PP/6K1 b - - ; bm Qd5; sd=26
As far as what is different between Crafty and Stockfish/Houdini, I do not know, since I have not looked at nor tried to understand their parallel searches.

The new version of Crafty has the following significant changes from older versions:

(1) very lightweight splitting code. When a processor goes idle, any active thread will try to split and give it something to do. In fact, this is pretty useful as with 20 cores, when a thread goes idle, it will likely see 6-8 potential split points at a minimum, and it gets to choose which one to accept. All a splitting thread does is grab a new split block, link it to the parent, copy the necessary data, and then continue searching.

(2) threads now exclusively do what Cray Blitz (DTS) called a "late join". Once a thread requests a split, it enters a tight loop looking for split points. As soon as it finds one (or more) it chooses the best and then joins by allocating a split block and repeating the parent's actions for itself.

(3) Near the root, threads "pre-split" before being asked, since (a) there are not that many plies near the root, meaning the overhead is very low, and (b) positions near the root are better split candidates since those sub-trees are larger computational tasks. There are limits (probably not yet optimized) as to how far from the root this can be done and how many "pre-splits" are allowed (i.e., once you have done this N times and all of your current pre-split points are still "unjoined", there is little point in doing yet another one; or if you have already pre-split at ply=4, there is no point in pre-splitting at ply=6 later, since the ply=4 split point is a better one; etc.). There is probably a bit more performance to be had, as there are now about 10 parallel search parameters that need optimizing. My next project is to complete the "autotune" code so that Crafty can run a test repeatedly and tune the parameters to fit the specific machine it runs on.

(4) I only have one global lock that is used to abort searches when there is an unexpected fail high at a split point. No other global locks of any kind, and this one is not acquired very frequently even in worst-case scenarios.

(5) The split point locks are REALLY limited, as with large numbers of cores, small conflicts turn into significant wait time.

In the early 90's, Cray Blitz was producing similar speedups. I didn't think I would reach that point again, but the pre-splitting and late-join changes, along with almost no locking, seem to have gotten back to that level. Of course we are searching 2.5x deeper today, which helps this look better; it is harder to search a 10-ply tree that fast.

I'm sure there are several other things I have not mentioned...
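A much-simplified illustration of the "late join" idea in items (1) and (2) above -- this is not Crafty's actual code; the SplitPoint fields and the selection rule are assumptions made for the sketch. An idle thread scans the advertised split points itself and joins the best available one (here, simply the joinable point nearest the root, since those sub-trees are the larger jobs), rather than waiting to be handed work:

```cpp
#include <atomic>
#include <vector>

// Hypothetical split-point record; a real engine copies far more state
// (board, move list, bounds) when a helper joins.
struct SplitPoint {
    int ply;                      // distance from root -- closer means a bigger job
    std::atomic<bool> joinable;   // still accepting helpers?
    SplitPoint(int p, bool j) : ply(p), joinable(j) {}
};

// "Late join" in miniature: scan every advertised split point and pick
// the joinable one nearest the root. Returns nullptr if none is available,
// in which case the idle thread would loop and scan again.
SplitPoint* pick_split_point(const std::vector<SplitPoint*>& points) {
    SplitPoint* best = nullptr;
    for (SplitPoint* sp : points)
        if (sp->joinable.load() && (!best || sp->ply < best->ply))
            best = sp;
    return best;
}
```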

bob
Posts: 20555
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob » Sat Jul 11, 2015 3:00 am

BTW I have some fairly recent SMP data on my web page (raw Crafty log files and .xlsx spreadsheets showing all the data in a one-page view). I am in the middle of re-doing the test with the latest code (the old data is not quite reliable either; there was a fail-high-at-root issue that didn't seem to affect SMP performance, but it was there and I want data from a flawless engine).

If you'd like to see the new data, let me know. I have the 1-cpu logs, 4 sets of 20-cpu logs, and a spreadsheet showing time to depth, NPS scaling, and tree-size growth. Should have the 16-cpu logs complete in a few minutes...

bob
Posts: 20555
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob » Sat Jul 11, 2015 3:04 am

TomKerrigan wrote:I wonder if we're talking about the same thing. The post you linked to talked about nodes per second whereas the numbers I posted were time-to-depth (total time to search all the BK positions to 11 ply).

For 8 threads, I'm searching 6.5x as many NPS. I'm sure this could be improved somewhat, but obviously not a huge amount.

My NPS scaling for ABDADA isn't as quite good as with YBW but the curves are different. ABDADA seems to keep scaling at a good rate past 20 cores, whereas my current implementation of YBW stalls out.
BTW, since you raised this issue, Crafty's NPS scaling on my 20-core box looks like this:

2 cpus: 2.0
4 cpus: 3.9
8 cpus: 7.6
16 cpus: 14.0
20 cpus: 16.3
