Actual speedups from YBWC and ABDADA on 8+ core machines?
Posted: Sat Jul 11, 2015 12:21 am
Hi guys,
On a whim, I bought a circa-2010 rack-mounted server on eBay with 24 AMD cores for only $200. It has been a fun thing to own so far. I put it in my apartment's storage room (which has an electrical outlet) because it's absurdly loud. Happily I can ssh into it no problem thanks to a $9 wifi adapter that I bought on Amazon. It does use a lot of power (~200W when idle, ~300W under load) so I only turn it on once I have good confidence in the code I want to run on it.
I wanted to see how my chess engine (YBWC) would scale on it, since the biggest machine I had run my engine on up until recently was quad-core.
The results were initially depressing--due to some memory and lock contention, running with more than 5 cores slowed the engine down. But after some changes, I was able to get speedups going up to 20 cores.
I ran the Bratko Kopec positions to 11 ply and got the following speedups:
1 thread: 1x (obviously)
2 threads: 1.62
4 threads: 2.56
8 threads: 4.06
12 threads: 5.82
16 threads: 6.31
20 threads: 6.75
Running BK to deeper depths got somewhat higher speedups, as did running other positions like some randomly selected from ECM.
(A note about my YBWC implementation--I seem to be running into lock contention or cache coherency issues, since 24 threads search at the same NPS as 20 threads. I think I can address a lot of this, but it would require more work than just a couple of days of tinkering.)
I also implemented ABDADA (since it only takes a couple hours) along with the cutoff check that I described in a separate thread, and got the following speedups for that:
1 thread: 1x
2 threads: 1.43
4 threads: 2.25
8 threads: 3.33
12 threads: 4.79
16 threads: 5.50
20 threads: 5.94
24 threads: 7.07
So, I was wondering if these results are in line with what others are seeing. I've browsed through a number of old posts, plus this more recent one:
http://www.talkchess.com/forum/viewtopic.php?t=56019
This seems to indicate that my speedups are comparable to what others are seeing. For example, from the linked post, the YBWC engines are seeing a 2.8-3.0 speedup with 4 cores vs. my 2.56, and a ~4.5 speedup with 8 cores vs. my 4.06 speedup. Already my results are pretty close, and once I rewrite some stuff to reduce contention, I expect to see almost exactly the same speedups.
Then again, I have reason to wonder, based on various numbers published in the academic papers for this stuff. For example, the ABDADA paper claims a speedup of ~10x on 16 cores, whereas my speedup is only 5.5x. The YBWC paper claims a ~300x speedup on 1000 cores, which means an efficiency of ~30% on 1000 cores, whereas my engine is already down to roughly 40% efficiency on 16 cores.
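For anyone who wants to compare numbers the same way, here is a minimal sketch of the efficiency arithmetic used above (efficiency = speedup / number of threads). The speedups are copied from the two tables in this post; the function and variable names are just ones I made up for illustration, not anything from the engine itself.

```python
# Parallel efficiency = speedup / thread count.
# Speedups are copied from the tables in this post (threads -> speedup vs. 1 thread).
# All names below are hypothetical, chosen only for this illustration.
ybwc = {1: 1.00, 2: 1.62, 4: 2.56, 8: 4.06, 12: 5.82, 16: 6.31, 20: 6.75}
abdada = {1: 1.00, 2: 1.43, 4: 2.25, 8: 3.33, 12: 4.79, 16: 5.50, 20: 5.94, 24: 7.07}


def efficiency(speedup, threads):
    """Fraction of ideal linear scaling achieved (1.0 would be perfect)."""
    return speedup / threads


def efficiency_table(speedups):
    """Map each thread count to its efficiency as a fraction."""
    return {n: efficiency(s, n) for n, s in speedups.items()}


if __name__ == "__main__":
    for name, data in (("YBWC", ybwc), ("ABDADA", abdada)):
        print(name)
        for n, eff in efficiency_table(data).items():
            print(f"  {n:2d} threads: {data[n]:.2f}x speedup, {100 * eff:.0f}% efficiency")
```

The same function applied to the paper figures quoted above (300x on 1000 cores, 10x on 16 cores) gives the efficiencies I'm comparing against.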
So I wonder if these academic speedups are based on completely different circumstances that wouldn't apply to my engine or if something is going wrong with my parallel implementations.
Anyway, I'd be interested in any feedback!
Talk to you guys soon,
Tom