Implementation of multithreaded search in Jazz

hgm · Post by **hgm** » Tue Apr 23, 2013 3:41 pm

lucasart wrote:You're taking an edge case. What you should look at is the general case instead. In the general case, the first move is best, and the score is within the aspiration window's bounds.

That is exactly the case I was talking about. There is nothing you can do about the time it takes to search the best move when you only split at the root. But most other moves would be below it, and these could then be quickly refuted even without knowing the exact score of the first move. So it is very possible that 3 other threads can quickly prove all other moves fail low compared to the aspiration alpha, while the first thread is doing the first move, so that in the end your search time is reduced to the duration of the latter.

Even when some other moves score close to the best move, when you do LMR you could postpone their full-depth search until you have refuted all other moves at LMR depth. By that time the result of the best move might be in, and they could be searched with the same alpha that a single-threaded search would also have used (except now 4 at a time).

Evert · Post by **Evert** » Tue Apr 23, 2013 3:52 pm

hgm wrote:I had a similar experience when I tried to implement SMP by running two (otherwise unmodified) processes with the hash table in a shared memory area. This caused such a large slowdown that none of the tricks I later added to prevent them from doing the same thing could overcome it. Even my most advanced solution was slower than searching with the single process.

Actually, I never saw an advantage from doing something similar either, but it was due to a bug in setting up my zero-window for the parallel search.
I do know that Jazz' parallel speed is hindered by its dynamic piece square tables though (which aren't shared, I can't use the lockless scheme for them and using a spinlock will probably slow things down too much, but I may experiment with that a bit), which seem to make the cores fight over available cache space. At least I get significantly more parallel speed-up if I disable the piece square tables, but I also get a significantly weaker program...

I never found out why that was. (But I spent only 3 days on it; after that the engine had to play in the tourney I was preparing it for (the Olympiad 2010), and I never got back to it afterwards.)

Whatever engine this was (Spartacus?), what I want to see is DeepMax.

hgm · Post by **hgm** » Tue Apr 23, 2013 4:08 pm

It was HaQiKi D.

So far there is not even a micro-Max derivative that can ponder. (Which is bad, because it means it also cannot support analyze mode.)

jdart · Post by **jdart** » Tue Apr 23, 2013 8:37 pm

I do know that Jazz' parallel speed is hindered by its dynamic piece square tables though (which aren't shared, I can't use the lockless scheme for them and using a spinlock will probably slow things down too much, but I may experiment with that a bit), which seem to make the cores fight over available cache space.

Piece square tables should be small so it should be possible to move these into a per-thread memory structure, so each thread has a copy.

--Jon

syzygy · Post by **syzygy** » Wed Apr 24, 2013 2:53 am

Evert wrote:I had to make two changes: the first was to store score and permutation data in the move list (they weren't in there before) and the second was to do away with the 64k repetition hash table (I didn't bench it, but building it up from scratch at each split point seemed like a certain way to make things extremely slow).

One way to keep the repetition hash table is by not copying the full state at a split, but only the list of moves from the root position to the split node, and then performing those moves including repetition hash table updates on thread-private structures. After a thread finishes searching a subtree it undoes all moves to get its private structures back to the root position.

I suppose this might also help for your dynamic piece square tables.

Copying only the moves from root to split has the additional advantage that in principle idle threads can perform a split on their own close to the root while the master thread keeps searching somewhere far from the root without being interrupted.

You only need one set of private structures per thread, even if you implement helpful master, because the helpful master by definition is searching a subtree of the node at which it has run out of work and decided to help one of its slaves.

The second concerns time control and is something I need to go and fix properly: I use node counts to decide when to check the clock, specifically (slightly simplified, but you get the idea)

I don't know if this would run into portability problems on WIndows, but at least on Unix this can be easily fixed by a having a signal handler set a flag when time runs out and letting the search check the flag. Of course this could also be done by having a separate thread that blocks until a timer runs out and then sets a flag.

Evert · Post by **Evert** » Wed Apr 24, 2013 4:46 pm

jdart wrote:
I do know that Jazz' parallel speed is hindered by its dynamic piece square tables though (which aren't shared, I can't use the lockless scheme for them and using a spinlock will probably slow things down too much, but I may experiment with that a bit), which seem to make the cores fight over available cache space.
Piece square tables should be small so it should be possible to move these into a per-thread memory structure, so each thread has a copy.

Indeed, but the important detail is "dynamic piece square table". What I meant by that is that Jazz calculates piece square tables (mainly based on the pawn structure) and stores these in a cache for later re-use (this is fairly peculiar for Jazz, although I think at least Rookie uses a similar approach). Each thread has its own cache already but I don't copy anything over from the parent thread, so every thread has to recalculate the piece square tables it needs when it needs them.

It's a good point though: it's possible that it's this overhead that's causing the problem, rather than all threads trying to pull things from memory. I'll have to check that.

Evert · Post by **Evert** » Wed Apr 24, 2013 4:54 pm

syzygy wrote: One way to keep the repetition hash table is by not copying the full state at a split, but only the list of moves from the root position to the split node, and then performing those moves including repetition hash table updates on thread-private structures. After a thread finishes searching a subtree it undoes all moves to get its private structures back to the root position.

That could help. It's still a bit more tricky though, and it does make splits more expensive. I'll check how much the repetition hash table is actually worth (in terms of time savings) first.

I suppose this might also help for your dynamic piece square tables.

Possibly. Certainly something I'll check!

I don't know if this would run into portability problems on WIndows, but at least on Unix this can be easily fixed by a having a signal handler set a flag when time runs out and letting the search check the flag. Of course this could also be done by having a separate thread that blocks until a timer runs out and then sets a flag.

I don't think it's portable to Windows, but apart from that as far as I know you can only set the alarm for an integer number of seconds, which is too coarse for time management.
Using a background thread would work though, in combination with a sleep() call (also not portable to Windows, but there are ways around that).

syzygy · Post by **syzygy** » Wed Apr 24, 2013 9:06 pm

Evert wrote:
syzygy wrote: One way to keep the repetition hash table is by not copying the full state at a split, but only the list of moves from the root position to the split node, and then performing those moves including repetition hash table updates on thread-private structures. After a thread finishes searching a subtree it undoes all moves to get its private structures back to the root position.
That could help. It's still a bit more tricky though, and it does make splits more expensive. I'll check how much the repetition hash table is actually worth (in terms of time savings) first.

I would say the splits become very cheap. It's just a matter of copying a few moves. Making/unmaking the moves is done in the thread's own time and does not block any other threads.

You could also defer the unmaking until you have the moves from the root to the next split node to search, because you might not have to back up all the way to the root. For me the difference was not measurable, though.

Copying the full board state including any bigger structures such as a repetition hash table is of course way more expensive and this has to be done either by the master thread while the slave thread is blocked or by the slave thread while the master thread is blocked.

I don't know if this would run into portability problems on WIndows, but at least on Unix this can be easily fixed by a having a signal handler set a flag when time runs out and letting the search check the flag. Of course this could also be done by having a separate thread that blocks until a timer runs out and then sets a flag.
I don't think it's portable to Windows, but apart from that as far as I know you can only set the alarm for an integer number of seconds, which is too coarse for time management.

Use setitimer().

Daniel Shawul · Post by **Daniel Shawul** » Thu Apr 25, 2013 6:43 pm

That could help. It's still a bit more tricky though, and it does make splits more expensive. I'll check how much the repetition hash table is actually worth (in terms of time savings) first.

That is exactly what i do for my cluster search. Only the moves from the root are sent over the network , and the processor that receives it does a make/unmake on its local board to get at the right position. At first I thought of only reducing the message size, but later it occurred to me that this approach can be good on smp systems as well. It only makes the necessary makes/unmakes so the cost is very minimal. Usually you don't have to go all the way back up to the root before you get a hashkey match so the make/unmakes are very short with a proper work sharing scheme. I have asked about this methodology some time ago but didn't get any response. The fact that it is a competitive approach (while being easier) is that i get almost the same nps scaling as the one which copies board states. In the worst case all you have to do is raise the minimum split depth by 1.

Daniel Shawul · Post by **Daniel Shawul** » Thu Apr 25, 2013 6:58 pm

I guess that is the reason why MTDf is more suitable for parallelization since you don't have to worry about updating bounds. It makes easier automatic parallelization e.g Cilk and Cilkchess, and also Transposition driven scheduling works better with MTDf.

As to root splitting, I get about 1.25 speed up even if all the root moves from the 3rd are reduced by 1. And I wait until the first root move is finished before starting search on others. This is ofcourse not same us what you tested since with the standard shared hash table approach both threads start searching on the same moves. You need some pre-cautions to avoid redundant work. With a best-first search, a virtual loss is assigned to a node when one thread starts working on it so that other threads pick other moves. In alpha-beta , I think ABDADA does something similar through the hash table. I don't think you tried these things so maybe you should to see if you get some improvement. The virtual loss trick is apparently an important improvement for MCTS search.

Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz