Hi all,
I've been doing some work on my parallel search algorithm, which seems to be an annual-ish thing for me.
I have achieved an average speedup of 10.1 on 16 cores (on Bob Hyatt's parallel test positions) which, to my knowledge, is pretty competitive. (DTS speedup was reported to be 11.1.)
Here are some notes, in case anyone wants to borrow or discuss ideas:
- I noticed that if I ran 16 separate copies of my engine on my 16-core computer, each copy would achieve ~98% of the NPS of a single copy running alone. So, as a baseline, I reworked my engine to do identical searches in each thread: essentially running 16 copies of the engine, but in threads instead of processes. This exercise was useful because I found some bugs and places where I was hammering on shared memory when I didn't intend to be. In particular, my history heuristic matrix was global when I had intended it to be local. Once all the demons were excised, I was searching at 15.6x single-thread NPS with 16 threads, which I figured was pretty good. (And all of the searches were reporting the same node counts, etc., as a sanity check.)
- Then I changed the hash table to be global instead of per-thread, i.e., a shared hash table. I think an important part of this is to have a lock as part of each hash table entry. In the past, I used an array of locks, with each lock corresponding to many entries. That was pretty stupid: locks are small, so an array of them fits in a few pages of memory, and all my threads were hammering on those pages; the cache coherency system must have been going nuts. Colocating your locks with your entries spreads them out over many megabytes and minimizes contention.
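Concretely, an entry with its own embedded spinlock might look something like this (the layout, field sizes, and helper names are illustrative, not my engine's actual code):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative entry layout. The point is that each lock shares a cache
// line with the data it protects, so lock traffic is spread across the
// whole table instead of concentrated in a tiny, hot array of locks.
struct HashEntry {
    std::atomic_flag lock = ATOMIC_FLAG_INIT;
    uint64_t key = 0;
    int16_t score = 0;
    uint8_t depth = 0;
    uint8_t flags = 0;
};

struct HashTable {
    HashEntry* entries;
    uint64_t mask; // table size must be a power of two

    explicit HashTable(uint64_t size_pow2)
        : entries(new HashEntry[size_pow2]), mask(size_pow2 - 1) {}
    ~HashTable() { delete[] entries; }

    void store(uint64_t key, int16_t score, uint8_t depth, uint8_t flags) {
        HashEntry& e = entries[key & mask];
        while (e.lock.test_and_set(std::memory_order_acquire)) {} // spin
        e.key = key; e.score = score; e.depth = depth; e.flags = flags;
        e.lock.clear(std::memory_order_release);
    }

    bool probe(uint64_t key, int16_t& score, uint8_t& depth, uint8_t& flags) {
        HashEntry& e = entries[key & mask];
        while (e.lock.test_and_set(std::memory_order_acquire)) {} // spin
        bool hit = (e.key == key);
        if (hit) { score = e.score; depth = e.depth; flags = e.flags; }
        e.lock.clear(std::memory_order_release);
        return hit;
    }
};
```

Since each lock only guards its own entry, two threads spin against each other only when they land on the exact same entry at the exact same time.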
- Simply having a shared hash table gave me a non-trivial speedup: ~7.5 times faster. I was very surprised by this. I understand that if the threads start at slightly different times, one thread might finish a move before another thread arrives at it, but then it seems like they would both just search the next move simultaneously and the speedup would be minimal. I wasn't expecting 7.5x, but there you go.
- Then I added ABDADA. Basically, before making a non-PV move, I hash it and see if another thread is already searching it. If so, I add it to a "deferred move" array and search it later. Instead of making a separate hash table for these move hashes, I added a new field to my main hash table. Again, this spreads the move hashes across memory and minimizes contention.
- With ABDADA, I was seeing speedups of ~8.5. Not as much of an improvement as I had expected. (I am still surprised and confused at how much improvement you get just from having a shared hash table.)
- Then I added "cutoff checks." With ABDADA, the main problem (in my opinion) is that you might defer a move that's a beta cutoff, and then you waste a bunch of time searching almost every other move before you get back to it. My idea for "cutoff checks" is this: at the beginning of the move loop, if there are any deferred moves, check the hash table to see if some other thread found a beta cutoff for the current position. Only do this if depth is greater than some threshold so the number of hash table probes doesn't explode.
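The shape of the move loop with cutoff checks is roughly this (helper names are hypothetical; the probe callback stands in for the real hash table lookup, and the threshold value is something you'd tune per engine):

```cpp
#include <cstdint>

// Stand-in for a transposition-table lookup that reports whether another
// thread has already stored a score >= beta for this position.
typedef bool (*CutoffProbe)(uint64_t pos_key, int depth, int beta);

const int CUTOFF_CHECK_MIN_DEPTH = 4; // assumed threshold; tune per engine

// Walks a move loop and returns how many moves were examined before a
// remote beta cutoff (reported by probe) let us stop early. The
// first_deferred_at parameter simulates a move being deferred to a peer.
int moves_searched_before_cutoff(uint64_t pos_key, int depth, int beta,
                                 int num_moves, int first_deferred_at,
                                 CutoffProbe probe)
{
    int deferred_moves = 0;
    for (int i = 0; i < num_moves; i++) {
        // Only pay for the extra probe when a move has actually been
        // deferred (otherwise no peer can beat us to a cutoff here) and
        // the subtree is deep enough to make the probe worthwhile.
        if (deferred_moves > 0 && depth > CUTOFF_CHECK_MIN_DEPTH &&
            probe(pos_key, depth, beta))
            return i; // another thread refuted the position; stop
        if (i == first_deferred_at)
            deferred_moves++; // pretend this move was deferred to a peer
        // ... normal make/search/unmake would go here ...
    }
    return num_moves;
}
```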
- With cutoff checks, the speedup increased to ~9.2. Almost as much of an improvement as ABDADA itself!
- ABDADA is a self-synchronizing algorithm, but in practice, a thread will sometimes get bogged down searching an unnecessary subtree and stop contributing to the main search. I made a system where the first thread to finish a ply can stop all the other threads, then all the threads start the next ply at roughly the same time with the same parameters.
- The end result (which I'm pretty happy with) is an algorithm that searches 10.12x faster on 16 cores, with a 15.16x speedup in terms of NPS.
- I don't have access to a 32-core machine, but my 16-core machine does have hyperthreading, so I can get a good idea of the tree size if I run with 32 threads. The tree grows to 1.96x the single-threaded size. So, as long as there isn't too much memory contention, I would expect a ~15x speedup on 32 cores.
Good parallel speedup with ABDADA and cutoff checks
jdart (arasanchess.org)
Re: Good parallel speedup with ABDADA and cutoff checks
The last time I tried this, the increased hash table access from ABDADA hurt performance, vs YBWC. But that was a long time ago, many generations of hardware.
--Jon
TomKerrigan
Re: Good parallel speedup with ABDADA and cutoff checks
jdart wrote: The last time I tried this, the increased hash table access from ABDADA hurt performance, vs YBWC. But that was a long time ago, many generations of hardware.

I admit, I am not doing ABDADA as described in the paper.
The core idea of ABDADA is that if another thread is searching a non-PV move, skip the move and search it later.
The paper suggests making a move, recursing, probing the hash table, finding out that the move is already being searched, un-recursing, and un-making the move.
This is a pretty crazy amount of work when all you have to do is hash a move and check a separate table. With the added bonus that you can control the overhead of these checks according to depth.
Also note that if you don't skip a move correctly, it's not the end of the world. So with the ABDADA hash table, you don't need to worry about collisions or locks. (Also, the ABDADA paper suggests keeping a count of the number of processors searching a move, but never suggests doing anything with this count, so you don't need to worry about that either.)
Pseudocode follows. It works well for me. I have checked and my search ends up with many deferred moves at all different depths.
search()
{
for (i = 0; i < moves; i++)
{
// mix the move into the position's hash key; the multiplier and
// increment are the Numerical Recipes LCG constants
unsigned int temp = (move * 1664525) + 1013904223;
int64 move_hash = current_position_hash_key ^ temp;
if (already_being_searched(move_hash, depth))
{
deferred_move[deferred_moves++] = move;
continue;
}
going_to_search(move_hash, depth);
make_move(move);
search();
unmake_move();
finished_searching(move_hash, depth);
...
}
}
bool already_being_searched(int64 move_hash, int depth)
{
if (depth <= 2) // don't overwhelm memory by doing a bunch of probes at lower depths
return false;
int n = move_hash & abdada_hash_mask;
if (abdada_hash_table[n] == move_hash)
return true;
return false;
}
void going_to_search(int64 move_hash, int depth)
{
if (depth <= 2)
return; // void function; "return false" here was a bug
int n = move_hash & abdada_hash_mask;
abdada_hash_table[n] = move_hash;
}
void finished_searching(int64 move_hash, int depth)
{
if (depth <= 2)
return;
int n = move_hash & abdada_hash_mask;
abdada_hash_table[n] = 0; // may clear a colliding entry; harmless by design
}
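For completeness, the elided part of search() eventually has to come back to the deferred moves. A rough sketch of that second pass, reusing the names above (simplified; your alpha-beta bookkeeping will differ):

for (i = 0; i < deferred_moves; i++)
{
move = deferred_move[i];
// no already_being_searched() check here; the move was already
// deferred once, and somebody has to search it eventually
make_move(move);
score = -search(-beta, -alpha, depth - 1);
unmake_move();
if (score >= beta)
return beta; // a deferred move turned out to be the cutoff
if (score > alpha)
alpha = score;
}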
hgm (H G Muller, Amsterdam)
Re: Good parallel speedup with ABDADA and cutoff checks
If I run more than 2 copies of my engine on my i7, the NPS already drops ~20%.
TomKerrigan
Re: Good parallel speedup with ABDADA and cutoff checks
hgm wrote: If I run more than 2 copies of my engine on my i7, the NPS already drops ~20%.

The only thing that could cause such a slowdown is memory contention, so the working set for your engine must be bigger than the L2 caches of your CPU.
I don't know about your engine in particular, but one thing I did to limit memory traffic with my engine is decrease the size of my pawn hash table to 128k, so that it fits nicely in the 256k L2 cache of recent Intel chips.
Also if you want to run on many cores, I would recommend against probing the hash table in your quiescence search, in case that's something you're doing.
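For reference, the 128k pawn hash works out like this with a typical 16-byte entry (the struct below is illustrative, not my actual layout):

```cpp
#include <cstdint>

// Illustrative pawn hash entry; the exact fields are engine-specific.
struct PawnHashEntry {
    uint64_t key;                 // pawn-structure hash key
    int16_t mg_score, eg_score;   // middlegame/endgame pawn eval
    uint8_t passed[2], shield[2]; // per-side passed-pawn/shield info
};                                // 16 bytes total

const int PAWN_HASH_BYTES   = 128 * 1024; // 128k target, under a 256k L2
const int PAWN_HASH_ENTRIES = PAWN_HASH_BYTES / sizeof(PawnHashEntry);
```

That yields 8192 entries, a power of two, so the index is a simple mask of the pawn key.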
matthewlai (London, UK)
Re: Good parallel speedup with ABDADA and cutoff checks
hgm wrote: If I run more than 2 copies of my engine on my i7, the NPS already drops ~20%.

That could be due to Turbo Boost if it's enabled.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
TomKerrigan
Re: Good parallel speedup with ABDADA and cutoff checks
hgm wrote: If I run more than 2 copies of my engine on my i7, the NPS already drops ~20%.
matthewlai wrote: That could be due to Turbo Boost if it's enabled.

Oh whoops, good catch.
I do all my speedup testing with turbo boost disabled.
Hope your 2670s are treating you well. I couldn't be more pleased with mine. I see that similar systems have become very rare and expensive on eBay, we really purchased at a good time!
matthewlai (London, UK)
Re: Good parallel speedup with ABDADA and cutoff checks
TomKerrigan wrote: Oh whoops, good catch. I do all my speedup testing with turbo boost disabled. Hope your 2670s are treating you well. I couldn't be more pleased with mine. I see that similar systems have become very rare and expensive on eBay, we really purchased at a good time!

They are! I love mine, too! We really lucked out I think.
TomKerrigan
Re: Good parallel speedup with ABDADA (32 cores, 64 threads)
TomKerrigan wrote: ... I don't have access to a 32-core machine, but my 16-core machine does have hyperthreading, so I can get a good idea of the tree size if I run with 32 threads. The tree increases to 1.96x. So, as long as there isn't too much memory contention, I would expect a ~15x speedup on 32 cores.

I just rented a 32 core, 64 thread machine via Amazon (EC2). They sure give you a lot of computer for $3.50 per hour.
My engine scales a bit better than I had predicted: I said "~15x," and it's actually 15.84x faster with 32 cores.
I was able to measure the tree size with 64 threads: it's 2.39x. I predict a ~24x speedup with 64 cores, which might be a little conservative again, but I figure memory contention due to the shared hash table must become a factor at SOME point.
I'd be interested in comparing these numbers with what other people are seeing.
TomKerrigan
Re: Good parallel speedup with ABDADA, "bug" fix
I found a "bug" in my code related to negascout vs. ABDADA.
I was marking a move as being searched even if it failed the null-window search and was being re-searched.
This is not obviously wrong, because if the move is being re-searched, then technically it IS still being searched.
The problem is that such a move is likely to become the best move, and the efficiency of alpha-beta (parallel or not) obviously relies on the best move being searched as early as possible. Marking the move as being searched causes other threads to search it later, which means there's an unnecessary delay before they can help with the re-search and/or acquire the new bounds.
The fix is to only mark a move as being searched if it's a null-window search.
This results in a small but non-trivial improvement for Hyatt's positions:
1 core = 1.0
2 cores = 1.87
4 cores = 3.45
8 cores = 6.31
16 cores = 10.33
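In pseudocode, the fix sits in the negascout move loop roughly like this (simplified, reusing the names from the earlier listing):

if (first_move)
{
make_move(move);
score = -search(-beta, -alpha, depth - 1); // PV move: full window, never marked
unmake_move();
}
else
{
going_to_search(move_hash, depth); // mark for the null-window search only
make_move(move);
score = -search(-alpha - 1, -alpha, depth - 1);
unmake_move();
finished_searching(move_hash, depth); // unmark BEFORE any re-search
if (score > alpha && score < beta)
{
make_move(move);
score = -search(-beta, -alpha, depth - 1); // re-search is NOT marked
unmake_move();
}
}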