Hi all,
I've been doing some work on my parallel search algorithm, which seems to be an annual-ish thing for me.
I have achieved an average speedup of 10.1 on 16 cores (on Bob Hyatt's parallel test positions) which, to my knowledge, is pretty competitive. (DTS speedup was reported to be 11.1.)
Here are some notes, in case anyone wants to borrow or discuss ideas:
- I noticed that if I ran 16 separate copies of my engine on my 16-core computer, each copy would achieve ~98% of the NPS of a single copy running alone. So, as a baseline, I reworked my engine to do identical searches in each thread: essentially running 16 copies of the engine, but in threads instead of processes. This exercise was useful because I found some bugs and places where I was hammering on shared memory when I didn't intend to be. In particular, my history heuristic matrix was global when I had intended it to be local. Once all the demons were excised, I was searching at 15.6x single-thread NPS with 16 threads, which I figured was pretty good. (And all of the searches were reporting the same node counts, etc., as a sanity check.)
- Then I changed the hash table to be global instead of per-thread, i.e., a shared hash table. I think an important part of this is to have a lock as part of each hash table entry. In the past, I used an array of locks, with each lock corresponding to many entries. That was pretty stupid: locks are small, so an array of them fits in a few pages of memory, and all my threads were hammering on those pages; the cache coherency system must have been going nuts. Colocating your locks with your entries spreads them out over many megabytes and minimizes contention.
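Concretely, an entry with its own embedded spinlock might look something like this (the layout, field sizes, and helper names are illustrative, not my engine's actual code):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative entry layout. The point is that each lock shares a cache
// line with the data it protects, so lock traffic is spread across the
// whole table instead of concentrated in a tiny, hot array of locks.
struct HashEntry {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    uint64_t key = 0;
    int16_t score = 0;
    uint8_t depth = 0;
    uint8_t flags = 0;
};

struct HashTable {
    HashEntry* entries;
    uint64_t mask; // table size must be a power of two

    explicit HashTable(uint64_t size_pow2)
        : entries(new HashEntry[size_pow2]), mask(size_pow2 - 1) {}
    ~HashTable() { delete[] entries; }

    void store(uint64_t key, int16_t score, uint8_t depth, uint8_t flags) {
        HashEntry& e = entries[key & mask];
        while (e.lock.test_and_set(std::memory_order_acquire)) {} // spin
        e.key = key; e.score = score; e.depth = depth; e.flags = flags;
        e.lock.clear(std::memory_order_release);
    }

    bool probe(uint64_t key, int16_t& score, uint8_t& depth, uint8_t& flags) {
        HashEntry& e = entries[key & mask];
        while (e.lock.test_and_set(std::memory_order_acquire)) {} // spin
        bool hit = (e.key == key);
        if (hit) { score = e.score; depth = e.depth; flags = e.flags; }
        e.lock.clear(std::memory_order_release);
        return hit;
    }
};
```

Since each lock only guards its own entry, two threads spin against each other only when they land on the exact same entry at the exact same time.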
- Simply having a shared hash table gave me a non-trivial speedup: ~7.5 times faster. I was very surprised by this. I understand that if the threads start at slightly different times, one thread might finish a move before another thread arrives at it, but then it seems like they would both just search the next move simultaneously and the speedup would be minimal. I wasn't expecting 7.5x, but there you go.
- Then I added ABDADA. Basically, before making a non-PV move, I hash it and see if another thread is already searching it. If so, I add it to a "deferred move" array and search it later. Instead of making a separate hash table for these move hashes, I added a new field to my main hash table. Again, this spreads the move hashes across memory and minimizes contention.
- With ABDADA, I was seeing speedups of ~8.5. Not as much of an improvement as I had expected. (I am still surprised and confused at how much improvement you get just from having a shared hash table.)
- Then I added "cutoff checks." With ABDADA, the main problem (in my opinion) is that you might defer a move that's a beta cutoff, and then you waste a bunch of time searching almost every other move before you get back to it. My idea for "cutoff checks" is this: at the beginning of the move loop, if there are any deferred moves, check the hash table to see if some other thread found a beta cutoff for the current position. Only do this if depth is greater than some threshold so the number of hash table probes doesn't explode.
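The shape of the move loop with cutoff checks is roughly this (helper names are hypothetical; the probe callback stands in for the real hash table lookup, and the threshold value is something you'd tune per engine):

```cpp
#include <cstdint>

// Stand-in for a transposition-table lookup that reports whether another
// thread has already stored a score >= beta for this position.
typedef bool (*CutoffProbe)(uint64_t pos_key, int depth, int beta);

const int CUTOFF_CHECK_MIN_DEPTH = 4; // assumed threshold; tune per engine

// Walks a move loop and returns how many moves were examined before a
// remote beta cutoff (reported by probe) let us stop early. The
// first_deferred_at parameter simulates a move being deferred to a peer.
int moves_searched_before_cutoff(uint64_t pos_key, int depth, int beta,
                                 int num_moves, int first_deferred_at,
                                 CutoffProbe probe)
{
    int deferred_moves = 0;
    for (int i = 0; i < num_moves; i++) {
        // Only pay for the extra probe when a move has actually been
        // deferred (otherwise no peer can beat us to a cutoff here) and
        // the subtree is deep enough to make the probe worthwhile.
        if (deferred_moves > 0 && depth > CUTOFF_CHECK_MIN_DEPTH &&
            probe(pos_key, depth, beta))
            return i; // another thread refuted the position; stop
        if (i == first_deferred_at)
            deferred_moves++; // pretend this move was deferred to a peer
        // ... normal make/search/unmake would go here ...
    }
    return num_moves;
}
```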
- With cutoff checks, the speedup increased to ~9.2. Almost as much of an improvement as ABDADA itself!
- ABDADA is a self-synchronizing algorithm, but in practice, a thread will sometimes get bogged down searching an unnecessary subtree and stop contributing to the main search. I made a system where the first thread to finish a ply can stop all the other threads, then all the threads start the next ply at roughly the same time with the same parameters.
- The end result (which I'm pretty happy with) is an algorithm that searches 10.12x faster on 16 cores, with a 15.16x speedup in terms of NPS.
- I don't have access to a 32-core machine, but my 16-core machine does have hyperthreading, so I can get a good idea of the tree size if I run with 32 threads. The tree grows to 1.96x the single-threaded size. So, as long as there isn't too much memory contention, I would expect a ~15x speedup on 32 cores.
Good parallel speedup with ABDADA and cutoff checks
jdart (arasanchess.org)
Re: Good parallel speedup with ABDADA and cutoff checks
The last time I tried this, the increased hash table access from ABDADA hurt performance, vs YBWC. But that was a long time ago, many generations of hardware.
--Jon
TomKerrigan
Re: Good parallel speedup with ABDADA and cutoff checks
jdart wrote: The last time I tried this, the increased hash table access from ABDADA hurt performance, vs YBWC. But that was a long time ago, many generations of hardware.

I admit, I am not doing ABDADA as described in the paper.
The core idea of ABDADA is that if another thread is searching a non-PV move, skip the move and search it later.
The paper suggests making a move, recursing, probing the hash table, finding out that the move is already being searched, un-recursing, and un-making the move.
This is a pretty crazy amount of work when all you have to do is hash a move and check a separate table. With the added bonus that you can control the overhead of these checks according to depth.
Also note that if you don't skip a move correctly, it's not the end of the world. So with the ABDADA hash table, you don't need to worry about collisions or locks. (Also, the ABDADA paper suggests keeping a count of the number of processors searching a move, but never suggests doing anything with this count, so you don't need to worry about that either.)
Pseudocode follows. It works well for me. I have checked and my search ends up with many deferred moves at all different depths.
search()
{
for (i = 0; i < moves; i++)
{
// mix the move into the position's hash key; the multiplier and
// increment are the Numerical Recipes LCG constants
unsigned int temp = (move * 1664525) + 1013904223;
int64 move_hash = current_position_hash_key ^ temp;
if (already_being_searched(move_hash, depth))
{
deferred_move[deferred_moves++] = move;
continue;
}
going_to_search(move_hash, depth);
make_move(move);
search();
unmake_move();
finished_searching(move_hash, depth);
...
}
}
bool already_being_searched(int64 move_hash, int depth)
{
if (depth <= 2) // don't overwhelm memory by doing a bunch of probes at lower depths
return false;
int n = move_hash & abdada_hash_mask;
if (abdada_hash_table[n] == move_hash)
return true;
return false;
}
void going_to_search(int64 move_hash, int depth)
{
if (depth <= 2)
return; // void function; "return false" here was a bug
int n = move_hash & abdada_hash_mask;
abdada_hash_table[n] = move_hash;
}
void finished_searching(int64 move_hash, int depth)
{
if (depth <= 2)
return;
int n = move_hash & abdada_hash_mask;
abdada_hash_table[n] = 0; // may clear a colliding entry; harmless by design
}
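For completeness, the elided part of search() eventually has to come back to the deferred moves. A rough sketch of that second pass, reusing the names above (simplified; your alpha-beta bookkeeping will differ):

for (i = 0; i < deferred_moves; i++)
{
move = deferred_move[i];
// no already_being_searched() check here; the move was already
// deferred once, and somebody has to search it eventually
make_move(move);
score = -search(-beta, -alpha, depth - 1);
unmake_move();
if (score >= beta)
return beta; // a deferred move turned out to be the cutoff
if (score > alpha)
alpha = score;
}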
hgm (H G Muller, Amsterdam)
Re: Good parallel speedup with ABDADA and cutoff checks
If I run more than 2 copies of my engine on my i7, the NPS already drops ~20%.
TomKerrigan
Re: Good parallel speedup with ABDADA and cutoff checks
hgm wrote: If I run more than 2 copies of my engine on my i7, the NPS already drops ~20%.

The only thing that could cause such a slowdown is memory contention, so the working set for your engine must be bigger than the L2 caches of your CPU.
I don't know about your engine in particular, but one thing I did to limit memory traffic with my engine is decrease the size of my pawn hash table to 128k, so that it fits nicely in the 256k L2 cache of recent Intel chips.
Also if you want to run on many cores, I would recommend against probing the hash table in your quiescence search, in case that's something you're doing.
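For reference, the 128k pawn hash works out like this with a typical 16-byte entry (the struct below is illustrative, not my actual layout):

```cpp
#include <cstdint>

// Illustrative pawn hash entry; the exact fields are engine-specific.
struct PawnHashEntry {
    uint64_t key;                 // pawn-structure hash key
    int16_t mg_score, eg_score;   // middlegame/endgame pawn eval
    uint8_t passed[2], shield[2]; // per-side passed-pawn/shield info
};                                // 16 bytes total

const int PAWN_HASH_BYTES   = 128 * 1024; // 128k target, under a 256k L2
const int PAWN_HASH_ENTRIES = PAWN_HASH_BYTES / sizeof(PawnHashEntry);
```

That yields 8192 entries, a power of two, so the index is a simple mask of the pawn key.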
matthewlai (London, UK)
Re: Good parallel speedup with ABDADA and cutoff checks
hgm wrote: If I run more than 2 copies of my engine on my i7, the NPS already drops ~20%.

That could be due to Turbo Boost if it's enabled.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
TomKerrigan
Re: Good parallel speedup with ABDADA and cutoff checks
hgm wrote: If I run more than 2 copies of my engine on my i7, the NPS already drops ~20%.
matthewlai wrote: That could be due to Turbo Boost if it's enabled.

Oh whoops, good catch.
I do all my speedup testing with turbo boost disabled.
Hope your 2670s are treating you well. I couldn't be more pleased with mine. I see that similar systems have become very rare and expensive on eBay, we really purchased at a good time!
matthewlai (London, UK)
Re: Good parallel speedup with ABDADA and cutoff checks
TomKerrigan wrote: Oh whoops, good catch. I do all my speedup testing with turbo boost disabled. Hope your 2670s are treating you well. I couldn't be more pleased with mine. I see that similar systems have become very rare and expensive on eBay, we really purchased at a good time!

They are! I love mine, too! We really lucked out I think.
TomKerrigan
Re: Good parallel speedup with ABDADA (32 cores, 64 threads)
TomKerrigan wrote: ... I don't have access to a 32-core machine, but my 16-core machine does have hyperthreading, so I can get a good idea of the tree size if I run with 32 threads. The tree increases to 1.96x. So, as long as there isn't too much memory contention, I would expect a ~15x speedup on 32 cores.

I just rented a 32 core, 64 thread machine via Amazon (EC2). They sure give you a lot of computer for $3.50 per hour.
My engine scales a bit better than I had predicted: I said "~15x," and it's actually 15.84x faster with 32 cores.
I was able to measure the tree size with 64 threads: it's 2.39x. I predict a ~24x speedup with 64 cores, which might be a little conservative again, but I figure memory contention due to the shared hash table must become a factor at SOME point.
I'd be interested in comparing these numbers with what other people are seeing.
TomKerrigan
Re: Good parallel speedup with ABDADA, "bug" fix
I found a "bug" in my code related to negascout vs. ABDADA.
I was marking a move as being searched even if it failed the null-window search and was being re-searched.
This is not obviously wrong, because if the move is being re-searched, then technically it IS still being searched.
The problem is that such a move is likely to become the best move, and the efficiency of alpha-beta (parallel or not) obviously relies on the best move being searched as early as possible. Marking the move as being searched causes other threads to search it later, which means there's an unnecessary delay before they can help with the re-search and/or acquire the new bounds.
The fix is to only mark a move as being searched if it's a null-window search.
This results in a small but non-trivial improvement for Hyatt's positions:
1 core = 1.0
2 cores = 1.87
4 cores = 3.45
8 cores = 6.31
16 cores = 10.33
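In pseudocode, the fix sits in the negascout move loop roughly like this (simplified, reusing the names from the earlier listing):

if (first_move)
{
make_move(move);
score = -search(-beta, -alpha, depth - 1); // PV move: full window, never marked
unmake_move();
}
else
{
going_to_search(move_hash, depth); // mark for the null-window search only
make_move(move);
score = -search(-alpha - 1, -alpha, depth - 1);
unmake_move();
finished_searching(move_hash, depth); // unmark BEFORE any re-search
if (score > alpha && score < beta)
{
make_move(move);
score = -search(-beta, -alpha, depth - 1); // re-search is NOT marked
unmake_move();
}
}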