"How To" guide to parallel-izing an engine, easy a

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: "How To" guide to parallel-izing an engine, ea

Post by Michel »

Not really. Let's say you have 16 threads and they're searching 50 moves each, which seems unlikely but let's consider this worst-case scenario for the sake of argument. So you have a total of 800 moves that you have to store in your "currently-searching" hash table. If your hash table is 2^20 entries, it'll only be 0.07% full. The odds of a hash collision when inserting a new move are 1 in 1310. So it doesn't seem like something you'd have to worry about too much. But if you are worried about collisions, it would be trivial to change the hash table code so that nothing ever gets overwritten. I think that would add 10-15 lines of code to the pseudocode that I've posted. But again, I don't think this is a problem that needs to be solved. The worst thing that happens if there's a collision is that multiple threads search the same move at the same time, which just means that the move gets searched faster.
But the deeper threads will not be just writing 800 moves. They will be be constantly writing. So the probability that a valid high depth entry gets overwritten is higher than what you imply.

This is easily fixable by using linear probing: do not overwrite but search for a free entry. Since the load of the hashtable is low, linear probing would give little overhead.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: "How To" guide to parallel-izing an engine, ea

Post by hgm »

TomKerrigan wrote:I'm heading out to lunch now but will look at this idea more later. If I end up changing my code to add this improvement, is that okay with you? Would you like to be credited in the pseudocode file? And if so, how? Thanks!
Feel free to use it; you don't have to credit me. It is hard for me anyway to believe someone would not have thought of this before.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: "How To" guide to parallel-izing an engine, ea

Post by lucasart »

TomKerrigan wrote:
lucasart wrote:Interesting. I'll give this a try.

But it will need some adaptation to work. The main problem is that, with shared hashtable (aka lazy), you certainly don't want to have all threads searching the same depth, or running their own iterative deepening loop in a desynchronized manner. ...
I've never understood this about Lazy SMP, what's the point of having different threads search different depths?

What kind of speedups do you get on Hyatt's positions with your Lazy SMP implementation?
Measure elo instead of TTD speedup, and you will understand...
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: "How To" guide to parallel-izing an engine, ea

Post by TomKerrigan »

lucasart wrote:
TomKerrigan wrote:
lucasart wrote:Interesting. I'll give this a try.

But it will need some adaptation to work. The main problem is that, with shared hashtable (aka lazy), you certainly don't want to have all threads searching the same depth, or running their own iterative deepening loop in a desynchronized manner. ...
I've never understood this about Lazy SMP, what's the point of having different threads search different depths?

What kind of speedups do you get on Hyatt's positions with your Lazy SMP implementation?
Measure elo instead of TTD speedup, and you will understand...
Surely the point of parallel search is to make the search faster, and to measure how much faster you've made something, you time it.

If you're unable to post a speedup number to describe how much faster your search is with 4 cores, I'm at a loss... ?
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: "How To" guide to parallel-izing an engine, ea

Post by TomKerrigan »

Michel wrote:
Not really. Let's say you have 16 threads and they're searching 50 moves each, which seems unlikely but let's consider this worst-case scenario for the sake of argument. So you have a total of 800 moves that you have to store in your "currently-searching" hash table. If your hash table is 2^20 entries, it'll only be 0.07% full. The odds of a hash collision when inserting a new move are 1 in 1310. So it doesn't seem like something you'd have to worry about too much. But if you are worried about collisions, it would be trivial to change the hash table code so that nothing ever gets overwritten. I think that would add 10-15 lines of code to the pseudocode that I've posted. But again, I don't think this is a problem that needs to be solved. The worst thing that happens if there's a collision is that multiple threads search the same move at the same time, which just means that the move gets searched faster.
But the deeper threads will not be just writing 800 moves. They will be be constantly writing. So the probability that a valid high depth entry gets overwritten is higher than what you imply.

This is easily fixable by using linear probing: do not overwrite but search for a free entry. Since the load of the hashtable is low, linear probing would give little overhead.
Ah, you're absolutely right, thank you. I just checked and higher-depth moves are indeed being overwritten by lower-depth moves, albeit not very frequently. (Probably because I've depth-limited the algorithm.)

Although in terms of speedup numbers this seems to make no measurable difference.

Another factor to consider is that if a move gets overwritten, then the next thread to search the move will re-write the move to the hash table, so the impact of the move being overwritten is pretty small.

I'm reluctant to add linear probing to the algorithm because that could potentially get pretty hairy with race conditions, which means I would have to add locks, which adds to the complication and starts to defeat the purpose of "Simplified" ABDADA... :/ Not sure how to best address the situation. I'm open to ideas...? I will keep thinking about it. Thanks for bringing this to my attention.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: "How To" guide to parallel-izing an engine, ea

Post by TomKerrigan »

hgm wrote:
TomKerrigan wrote:I'm heading out to lunch now but will look at this idea more later. If I end up changing my code to add this improvement, is that okay with you? Would you like to be credited in the pseudocode file? And if so, how? Thanks!
Feel free to use it; you don't have to credit me. It is hard for me anyway to believe someone would not have thought of this before.
In all the papers and posts I've read about YBW and ABDADA, I don't think I've seen such an idea mentioned.

Although after some more runs of my test suite, it seems that the improvement I measured might have just been statistical noise. :(
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: "How To" guide to parallel-izing an engine, ea

Post by lucasart »

TomKerrigan wrote:
lucasart wrote:
TomKerrigan wrote:
lucasart wrote:Interesting. I'll give this a try.

But it will need some adaptation to work. The main problem is that, with shared hashtable (aka lazy), you certainly don't want to have all threads searching the same depth, or running their own iterative deepening loop in a desynchronized manner. ...
I've never understood this about Lazy SMP, what's the point of having different threads search different depths?

What kind of speedups do you get on Hyatt's positions with your Lazy SMP implementation?
Measure elo instead of TTD speedup, and you will understand...
Surely the point of parallel search is to make the search faster, and to measure how much faster you've made something, you time it.

If you're unable to post a speedup number to describe how much faster your search is with 4 cores, I'm at a loss... ?
I've already posted my TTD speedup numbers in the other thread on ABDADA.

But you still don't get it. TTD is the wrong measure, because the result of Depth=N with Threads=T doesn't just depend on N. It also depends on T. This is the so-called "widening effect". And it gives incredible elo scaling to "lazy SMP" approch, provided well tuned (as shown in Stockfish, Cheng, Demolito, among other).

Sure, the TTD scaling is inferior. But that's only a problem for academic idelogues, who only write pseudo-code and papers. It's not a problem for pragmatic people who write real code, and care about real results (ie. Elo).

By the way, I consider ABDADA as part of "lazy SMP". Lazy SMP is just a synonym for Shared Hashtable. That's the base (think margarita pizza). On top of it, you add several improvements, like: depth repartition (olives), aborting immediately threads searching obsolete iterations (mushrooms), ABDADA (anchovies), and whatever else, provided it works in testing (Elo not TTD).

The difference between "lazy" and "not lazy" is whether your SMP algorithm involves recursive tree splitting. This is where I draw the line. Recursive tree splitting algorithms (eg. YBW) are much harder to implement, and especially to debug.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: "How To" guide to parallel-izing an engine, ea

Post by TomKerrigan »

lucasart wrote: ...
I've already posted my TTD speedup numbers in the other thread on ABDADA.

But you still don't get it. TTD is the wrong measure, because the result of Depth=N with Threads=T doesn't just depend on N. It also depends on T. This is the so-called "widening effect". And it gives incredible elo scaling to "lazy SMP" approch, provided well tuned (as shown in Stockfish, Cheng, Demolito, among other).
Sorry, I go for months (sometimes years) without reading this forum or posting to it. Searching it via Google seems to be impossible and sifting through the keyword search results is difficult. If you could give me a link to the discussion you're talking about, I would be interested to read it.

As for the idea of "widening," I've seen this mentioned a few times and infer that it has to do with late move reductions. The idea being that you search moves to different depths depending on move ordering. And some Lazy SMP implementations randomize move ordering to some degree, meaning that the late moves are pushed forward and searched deeper and that results in an improvement in playing strength. Is that right?

The fact that "widening" works pretty well seems to be proof that these depth reductions are unsound though, doesn't it? I mean, if it's causing the single-threaded search to miss a lot of good moves?
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: "How To" guide to parallel-izing an engine, ea

Post by TomKerrigan »

TomKerrigan wrote: ...
Ah, you're absolutely right, thank you. I just checked and higher-depth moves are indeed being overwritten by lower-depth moves, albeit not very frequently. (Probably because I've depth-limited the algorithm.)

Although in terms of speedup numbers this seems to make no measurable difference.

Another factor to consider is that if a move gets overwritten, then the next thread to search the move will re-write the move to the hash table, so the impact of the move being overwritten is pretty small.

I'm reluctant to add linear probing to the algorithm because that could potentially get pretty hairy with race conditions, which means I would have to add locks, which adds to the complication and starts to defeat the purpose of "Simplified" ABDADA... :/ Not sure how to best address the situation. I'm open to ideas...? I will keep thinking about it. Thanks for bringing this to my attention.
I've hit on a solution to the problem: make the hash table 4-way. Very simple, no need for locks, works brilliantly. I've been running test positions for half an hour and at no point have any of these 4-way "buckets" filled up. Thanks again for the help!
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: "How To" guide to parallel-izing an engine, ea

Post by lucasart »

TomKerrigan wrote:As for the idea of "widening," I've seen this mentioned a few times and infer that it has to do with late move reductions. The idea being that you search moves to different depths depending on move ordering. And some Lazy SMP implementations randomize move ordering to some degree, meaning that the late moves are pushed forward and searched deeper and that results in an improvement in playing strength. Is that right?

The fact that "widening" works pretty well seems to be proof that these depth reductions are unsound though, doesn't it? I mean, if it's causing the single-threaded search to miss a lot of good moves?
All these are good comments. But, really, the key to sucess is to stop trying to reason on a theoretical level. There are so many complex interactions and trade offs, you cannot reason with it (unless you limit yourself to sub 2000 elo alpha-beta toy engine).

I can tell you from experience that search reductions are worth *hundreds* of elo. Not bad for something that people from academia dismiss as "unsound". Same goes for lazy SMP, and many more things.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.