Lazy SMP

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

JuLieN
Posts: 2949
Joined: Mon May 05, 2008 12:16 pm
Location: Bordeaux (France)
Full name: Julien Marcel

Lazy SMP

Post by JuLieN »

Has anyone tried to split the move generator and the evaluation function, instead of splitting the search tree?

If, say, you have eight threads and 16 chessmen on the board for your side, you'd give each of those threads two pieces to generate the legal moves for. The only blocking part of the code would be the function pushing each generated legal move onto the move list, and, maybe, the in-check routine when it needs to move a piece on the board.

Likewise, you could compute all the terms of your evaluation function in parallel threads. If you have more terms than threads, then as soon as a thread finishes computing a term you'd assign it a new term to compute, and so on.

Has this ever been tried? Being a very lazy programmer, I was contemplating the easiest way to implement SMP in my engine, and came up with this "lazy SMP" idea.
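
For illustration, a minimal C++ sketch of the eval-term half of this idea (my own sketch, not code from any engine; the term functions, scores and thread count are made-up placeholders): a fixed set of threads pulls term indices from a shared atomic counter, so a thread that finishes one term immediately grabs the next.

// Hypothetical sketch: evaluation terms computed by a pool of threads.
// The term functions and their scores are placeholders, not real engine code.
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Position { /* board state would live here */ };

// Placeholder terms; a real eval would have material, mobility, king safety, ...
static int term_material(const Position&)       { return 25; }
static int term_mobility(const Position&)       { return 10; }
static int term_king_safety(const Position&)    { return -5; }
static int term_pawn_structure(const Position&) { return  3; }

int evaluate_parallel(const Position& pos, unsigned num_threads) {
    const std::vector<std::function<int(const Position&)>> terms = {
        term_material, term_mobility, term_king_safety, term_pawn_structure
    };
    std::atomic<std::size_t> next{0};   // shared index of the next term to compute
    std::atomic<int> score{0};          // accumulated evaluation

    auto worker = [&] {
        // Each thread keeps grabbing the next unclaimed term until none are left.
        for (std::size_t i = next.fetch_add(1); i < terms.size(); i = next.fetch_add(1))
            score.fetch_add(terms[i](pos));
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t)
        pool.emplace_back(worker);
    for (auto& th : pool)
        th.join();
    return score.load();
}

int main() {
    Position pos;
    std::printf("eval = %d\n", evaluate_parallel(pos, 4));   // 25 + 10 - 5 + 3 = 33
}

Note that creating the threads inside evaluate_parallel, as done here for simplicity, is exactly the per-call overhead questioned in the replies below; in practice the threads would have to be persistent and merely woken up.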
"The only good bug is a dead bug." (Don Dailey)
[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Lazy SMP

Post by jdart »

There is some overhead to creating or assigning a new thread, so most chess programmers try not to do that too often during the search, which is why they split the tree and do it only when sufficient depth remains. Splitting in the movegen or eval would be too fine-grained to be efficient, I think.
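
A generic, non-chess illustration of that coarse-grain principle (my own toy sketch, not from any engine): recurse in parallel only while enough depth remains to amortize the cost of handing work to another thread, and stay serial below the cutoff.

// Illustrative only: split work in a recursive search only when the remaining
// depth is large enough to justify the cost of a thread/task hand-off.
#include <cstdio>
#include <future>

static const int MIN_SPLIT_DEPTH = 12;   // made-up cutoff for this toy example

// Toy recursive workload standing in for a search; it just counts leaf nodes.
long long count_nodes(int depth) {
    if (depth == 0)
        return 1;
    if (depth >= MIN_SPLIT_DEPTH) {
        // Plenty of work below this point: hand one subtree to another thread.
        std::future<long long> left = std::async(std::launch::async, count_nodes, depth - 1);
        long long right = count_nodes(depth - 1);
        return left.get() + right;
    }
    // Too little work left: splitting here would cost more than it saves.
    return count_nodes(depth - 1) + count_nodes(depth - 1);
}

int main() {
    std::printf("nodes = %lld\n", count_nodes(16));   // 2^16 leaves
}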

--Jon
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Lazy SMP

Post by zamar »

JuLieN wrote:Has anyone tried to split the move generator and the evaluation function, instead of splitting the search tree?

If, say, you have eight threads and 16 chessmen on the board for your side, you'd give each of those threads two pieces to generate the legal moves for. The only blocking part of the code would be the function pushing each generated legal move onto the move list, and, maybe, the in-check routine when it needs to move a piece on the board.

Likewise, you could compute all the terms of your evaluation function in parallel threads. If you have more terms than threads, then as soon as a thread finishes computing a term you'd assign it a new term to compute, and so on.

Has this ever been tried? Being a very lazy programmer, I was contemplating the easiest way to implement SMP in my engine, and came up with this "lazy SMP" idea.
I'd expect that Thread/Process communication overhead would be enormous here.
Joona Kiiski
JuLieN
Posts: 2949
Joined: Mon May 05, 2008 12:16 pm
Location: Bordeaux (France)
Full name: Julien Marcel

Re: Lazy SMP

Post by JuLieN »

jdart wrote:There is some overhead to creating or assigning a new thread, so most chess programmers try not to do that too often during the search, which is why they split the tree and do it only when sufficient depth remains. Splitting in the movegen or eval would be too fine-grained to be efficient, I think.

--Jon
Thanks Jon, good points about the overhead. I was reading a rule of thumb about that in a tutorial dedicated to parallel procedures in Free Pascal, which says that each chunk of work should last at least a few milliseconds to be worth it:
Overhead, slow down

The overhead heavily depends on the system (number and types of cores, type of shared memory, speed of critical sections, cache size). Here are some general hints:
- Each chunk of work (index) should take at least some milliseconds.
- The overhead is independent of recursive levels of parallel procedures.
- Multi-threading overhead, which is independent of the MTProcs units and simply results from today's computer architectures:
  - As soon as one thread is created, your program becomes multi-threaded and the memory manager must use critical sections, which slows things down. So even if you do nothing with the thread, your program might become slower.
  - On some systems the cmem heap manager is much faster for multi-threading. In my benchmarks, especially on Intel systems and especially under OS X, the speed difference can be more than 10 times.
  - Strings and interfaces are globally reference-counted. Each access needs a critical section. Processing strings in multiple threads will therefore hardly give any speedup. Use PChars instead.
- Each chunk of work (index) should work on a disjoint part of memory to avoid cross-cache updates.
- Do not work on vast amounts of memory. On some systems a single thread is fast enough to saturate the memory bus. Once the memory bus is saturated, any further thread will slow things down instead of speeding them up.
As it looks quite simple to implement, I might give it a try anyway, just for the sake of measuring this overhead...
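
For anyone who wants to put a rough number on that rule of thumb, here is a hypothetical microbenchmark (my own sketch, in C++ rather than Free Pascal): it times the cost of just creating and joining a thread that does nothing, which is the fixed price every chunk of work has to amortize.

// Rough sketch of measuring per-thread dispatch overhead. The absolute numbers
// vary wildly by OS and hardware, so treat the output as order-of-magnitude only.
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    const int iterations = 1000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread t([] { /* an empty chunk of work */ });
        t.join();                        // pay the full create + join cost
    }
    auto stop = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(stop - start).count();
    std::printf("average create+join cost: %.1f microseconds\n", us / iterations);
    // If a chunk of real work takes only microseconds, this fixed cost dominates;
    // chunks in the millisecond range amortize it comfortably.
}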
"The only good bug is a dead bug." (Don Dailey)
[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP

Post by bob »

JuLieN wrote:Has anyone tried to split the move generator and the evaluation function, instead of splitting the search tree?

If, say, you have eight threads and 16 chessmen on the board for your side, you'd give each of those threads two pieces to generate the legal moves for. The only blocking part of the code would be the function pushing each generated legal move onto the move list, and, maybe, the in-check routine when it needs to move a piece on the board.

Likewise, you could compute all the terms of your evaluation function in parallel threads. If you have more terms than threads, then as soon as a thread finishes computing a term you'd assign it a new term to compute, and so on.

Has this ever been tried? Being a very lazy programmer, I was contemplating the easiest way to implement SMP in my engine, and came up with this "lazy SMP" idea.
That still leaves a LOT of serial code, which means that even a 2x speedup would be a real challenge.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Lazy SMP

Post by Joost Buijs »

JuLieN wrote:Has anyone tried to split the move generator and the evaluation function, instead of splitting the search tree?

Likewise, you could compute all the terms of your evaluation function in parallel threads. If you have more terms than threads, then as soon as a thread finishes computing a term you'd assign it a new term to compute, and so on.

Has this ever been tried? Being a very lazy programmer, I was contemplating the easiest way to implement SMP in my engine, and came up with this "lazy SMP" idea.
This is something I tried in 2001, when I had a machine with only 2 cores available. My evaluation function was split into 2 parts, one for black and one for white. When the threads are already running and waiting, there is hardly any overhead at all. It gave me a speedup of about 1.7. The nice thing is that you can program this in less than one hour.
The search was handled by one of the 2 processors, so the second processor did only half of the evaluation, and still it gave a reasonable speedup.
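
For readers curious about the shape of such a scheme, here is a hedged reconstruction (my own sketch, not Joost's actual code; the per-side evaluations are placeholders): one persistent helper thread waits on a condition variable, evaluates the black half while the search thread evaluates the white half, and the two partial scores are combined.

// Hypothetical reconstruction of a two-thread eval split: the main (search)
// thread computes White's terms while a persistent helper computes Black's.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

struct Position { /* board state would live here */ };

// Placeholder per-side evaluations.
static int eval_white(const Position&) { return 37; }
static int eval_black(const Position&) { return 29; }

class EvalHelper {
public:
    EvalHelper() : worker_(&EvalHelper::loop, this) {}
    ~EvalHelper() {
        { std::lock_guard<std::mutex> lk(m_); quit_ = true; start_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    // Called by the search thread for every evaluation.
    int evaluate(const Position& pos) {
        { std::lock_guard<std::mutex> lk(m_); pos_ = &pos; done_ = false; start_ = true; }
        cv_.notify_one();                       // wake the helper: "do the black half"
        int white = eval_white(pos);            // meanwhile, do the white half ourselves
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return done_; });    // wait for the black half
        return white - black_;
    }
private:
    void loop() {
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [&] { return start_; });
            start_ = false;
            if (quit_) return;
            black_ = eval_black(*pos_);
            done_ = true;
            cv_.notify_one();                   // tell the search thread we're done
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    const Position* pos_ = nullptr;
    int black_ = 0;
    bool start_ = false, done_ = false, quit_ = false;
    std::thread worker_;                        // declared last: started after the rest
};

int main() {
    Position pos;
    EvalHelper helper;
    std::printf("score = %d\n", helper.evaluate(pos));   // 37 - 29 = 8
}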
JuLieN
Posts: 2949
Joined: Mon May 05, 2008 12:16 pm
Location: Bordeaux (France)
Full name: Julien Marcel

Re: Lazy SMP

Post by JuLieN »

Joost Buijs wrote:
JuLieN wrote:Has anyone tried to split the move generator and the evaluation function, instead of splitting the search tree?

Likewise, you could compute all the terms of your evaluation function in parallel threads. If you have more terms than threads, then as soon as a thread finishes computing a term you'd assign it a new term to compute, and so on.

Has this ever been tried? Being a very lazy programmer, I was contemplating the easiest way to implement SMP in my engine, and came up with this "lazy SMP" idea.
This is something I tried in 2001, when I had a machine with only 2 cores available. My evaluation function was split into 2 parts, one for black and one for white. When the threads are already running and waiting, there is hardly any overhead at all. It gave me a speedup of about 1.7. The nice thing is that you can program this in less than one hour.
I'll definitely try it! :) Yes, there's very little to modify in the code to get such "threads" going, so it's definitely worth the... "effort".

As Bob says, there's still plenty of code left outside the threaded part (for instance list sorting and fast list evaluation), but if these simple experiments give results then maybe other parts of the code could be added.

Another advantage for SMP beginners is that this approach is probably less likely to introduce new logical bugs.
"The only good bug is a dead bug." (Don Dailey)
[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Lazy SMP

Post by Joost Buijs »

I think this is only useful if you have just a few cores and want some quick results. The problem is that you have to split your code into pieces with approximately the same execution time.
ZirconiumX
Posts: 1334
Joined: Sun Jul 17, 2011 11:14 am

Re: Lazy SMP

Post by ZirconiumX »

Actually, I have thought about this - and approached it from a different perspective.
E-Mail to Marco Costalba wrote: In SMP search there will come a point where there is no/negative improvement from extra CPUs.

This got me thinking: what to do with the extra cores?

The solution I came up with is a risky one: use the other cores to speed up eval or whatever.

What we do is almost like prefetching the eval result. Just after MakeMove() we place our position on a to-do stack, are given a reference ID, and recurse. When the eval is needed we spin on a DONE flag in our slot.

The hope is that we can be "served" quickly enough to speed up that node's EVAL - you could try a multithreaded "helpful master" approach if you wanted some more gain.

The risk is that the node gets pruned before the result is required, giving us mostly (though, thanks to the eval cache, not completely) wasted work.

Of course, for now this idea is less than worthless, because we have yet to hit that limit.

Matthew:out
To which he replied:
E-Mail from Marco Costalba wrote: Hi Matthew,

this is a creative idea !

Unfortunately I am not so optimistic: apart from the fact that some logic would be required to skip pruned nodes (which are the majority), there is also the problem that after do_move() is called, we very quickly move on to call evaluate() at the next ply. My guess is that we need the evaluation before a slave thread has had time to calculate it.

Anyhow, it is an idea that deserves some extra thought. Regarding the SMP limit, I don't see it as a problem, because we could use a slave's idle time to have it compute the eval, independently of the number of threads.

Marco
This would only really work on a full-width search with no selectivity - something that will not be the case for some time.
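
For concreteness, a hedged sketch of the mechanism described in the e-mail above (my own reconstruction, not Matthew's code; the slot count, names and the dummy evaluation are placeholders): the search thread posts a position into a slot right after making a move, a helper thread evaluates posted slots, and the search spins on a per-slot done flag when it actually needs the score.

// Illustrative reconstruction of the "prefetch the eval" idea: a request queue
// of slots, a helper thread that fills them in, and a spin-wait on a per-slot
// DONE flag. Slot reuse and multiple helpers are left out for brevity.
#include <array>
#include <atomic>
#include <cstdio>
#include <thread>

struct Position { int material = 0; /* real board state would live here */ };

static int evaluate(const Position& pos) { return pos.material; }   // stand-in eval

struct EvalSlot {
    Position pos;
    std::atomic<int> score{0};
    std::atomic<bool> requested{false};
    std::atomic<bool> done{false};
};

std::array<EvalSlot, 64> slots;          // the "to-do stack"
std::atomic<bool> stop{false};

// Search side: post a position just after MakeMove() and keep the reference ID.
int request_eval(const Position& pos, int id) {
    EvalSlot& s = slots[id];
    s.pos = pos;
    s.done.store(false, std::memory_order_relaxed);
    s.requested.store(true, std::memory_order_release);
    return id;
}

// Search side: when the score is actually needed, spin on the DONE flag.
int wait_eval(int id) {
    EvalSlot& s = slots[id];
    while (!s.done.load(std::memory_order_acquire))
        std::this_thread::yield();
    return s.score.load(std::memory_order_relaxed);
}

// Helper thread: scan the slots and evaluate whatever has been requested.
void helper_loop() {
    while (!stop.load(std::memory_order_relaxed)) {
        for (EvalSlot& s : slots) {
            if (s.requested.exchange(false, std::memory_order_acquire)) {
                s.score.store(evaluate(s.pos), std::memory_order_relaxed);
                s.done.store(true, std::memory_order_release);
            }
        }
        std::this_thread::yield();
    }
}

int main() {
    std::thread helper(helper_loop);
    Position pos;
    pos.material = 42;
    int id = request_eval(pos, 0);       // "just after MakeMove()"
    // ... the search would continue here; the risk is this node gets pruned ...
    std::printf("prefetched eval = %d\n", wait_eval(id));
    stop.store(true);
    helper.join();
}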

Matthew:out
Some believe in the almighty dollar.

I believe in the almighty printf statement.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP

Post by bob »

Joost Buijs wrote:
JuLieN wrote:Has anyone tried to split the move generator and the evaluation function, instead of splitting the search tree?

Likewise, you could compute all the terms of your evaluation function in parallel threads. If you have more terms than threads, then as soon as a thread finishes computing a term you'd assign it a new term to compute, and so on.

Has this ever been tried? Being a very lazy programmer, I was contemplating the easiest way to implement SMP in my engine, and came up with this "lazy SMP" idea.
This is something I tried in 2001, when I had a machine with only 2 cores available. My evaluation function was split into 2 parts, one for black and one for white. When the threads are already running and waiting, there is hardly any overhead at all. It gave me a speedup of about 1.7. The nice thing is that you can program this in less than one hour.
The search was handled by one of the 2 processors, so the second processor did only half of the evaluation, and still it gave a reasonable speedup.
Your numbers seem a bit off. They imply that about 2/3 of your total search time is spent in the evaluation, which seems incredibly high: if you take your eval time and divide it by 2 (fully parallel eval) while keeping the rest of your code serial, you need a really expensive eval to get that kind of overall gain. Do you REALLY spend that much time in your eval, so that parallelizing only that gives you a 1.7x speedup?

For reference, using 2 CPUs in a fully parallelized alpha/beta search gives me a 1.7x-1.8x speedup. And that is doing almost EVERYTHING in parallel.
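
As a back-of-the-envelope check of this argument (my own arithmetic, under the simplifying assumption that only the eval is parallelized, perfectly, across two cores): the ideal speedup is 1 / (1 - e/2), where e is the fraction of serial runtime spent in evaluation. That gives about 1.5x for e = 2/3, and reaching 1.7x would need e of roughly 0.82, which makes the eval look even more dominant.

// Sketch of the Amdahl's-law arithmetic: only the eval (fraction e of the serial
// runtime) is split perfectly across two cores; everything else stays serial.
#include <cstdio>

int main() {
    const double fractions[] = {0.50, 2.0 / 3.0, 0.80, 0.82};
    for (double e : fractions) {
        double speedup = 1.0 / (1.0 - e / 2.0);
        std::printf("eval fraction %.2f -> ideal speedup %.2fx\n", e, speedup);
    }
}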