"How To" guide to parallel-izing an engine, easy a

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: "How To" guide to parallel-izing an engine, ea

Post by Dann Corbit »

TomKerrigan wrote:
lucasart wrote: ...
I've already posted my TTD speedup numbers in the other thread on ABDADA.

But you still don't get it. TTD is the wrong measure, because the result of Depth=N with Threads=T doesn't just depend on N. It also depends on T. This is the so-called "widening effect". And it gives incredible Elo scaling to the "lazy SMP" approach, provided it is well tuned (as shown in Stockfish, Cheng, and Demolito, among others).
Sorry, I go for months (sometimes years) without reading this forum or posting to it. Searching it via Google seems to be impossible and sifting through the keyword search results is difficult. If you could give me a link to the discussion you're talking about, I would be interested to read it.

As for the idea of "widening," I've seen this mentioned a few times and infer that it has to do with late move reductions. The idea being that you search moves to different depths depending on move ordering. And some Lazy SMP implementations randomize move ordering to some degree, meaning that the late moves are pushed forward and searched deeper and that results in an improvement in playing strength. Is that right?

The fact that "widening" works pretty well seems to be proof that these depth reductions are unsound though, doesn't it? I mean, if it's causing the single-threaded search to miss a lot of good moves?
I think that "widening" in the context of LazySMP usually means accidental widening.

Which is to say, it is an unintended consequence of the method itself.

When you have a giant pile of threads analyzing a position (some possibly with minor offsets), all reading from the hashes of the others, you will get some extra "bushiness" in the tree. The outcome of this extra bushiness seems to be stronger play on the whole.

It is not hard to understand why this might be the case. All the threads are looking for PV nodes, and each new bit of information written to the hash can make the searches a little bit smarter. So (for instance) we might find some clever new PV nodes as a result of this searching overlap.
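The mechanism being described can be sketched with a toy example (all names here are hypothetical, and the "positions" are just nested tuples with integer leaf scores): each thread runs its own iterative-deepening negamax on the same root, and the transposition table is the only shared state.

```python
import threading

tt = {}                  # (position, depth) -> score; the only shared state
tt_lock = threading.Lock()

def negamax(pos, depth):
    """Toy negamax over nested tuples; an int is a leaf with that score."""
    if isinstance(pos, int):
        return pos
    if depth == 0:
        return 0
    key = (pos, depth)
    with tt_lock:
        if key in tt:    # hash hit: reuse work another thread already did
            return tt[key]
    score = max(-negamax(child, depth - 1) for child in pos)
    with tt_lock:
        tt[key] = score  # publish the result for every other thread
    return score

def lazy_worker(tid, root, max_depth):
    # The only per-thread coordination: odd threads start one ply deeper.
    for depth in range(1 + (tid & 1), max_depth + 1):
        negamax(root, depth)

def lazy_smp(root, max_depth, n_threads=4):
    workers = [threading.Thread(target=lazy_worker, args=(t, root, max_depth))
               for t in range(n_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return tt[(root, max_depth)]
```

A real engine would use a lock-free shared table rather than a mutex-guarded dict, and the depth stagger is optional; the point of the sketch is only that nothing synchronizes the threads except the table itself.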

There is a downside to the widening effect as well: the search does not get as deep. You can easily verify this by comparing the last non-Lazy Stockfish with the Lazy version. The newer version was measurably stronger, and the non-Lazy version searched measurably deeper.

Other Lazy-versus-non-Lazy experiments show the same thing. As for strength, there are some experiments that do not show Lazy to be superior, but I think that is really still an open question.

What is not in dispute is that LazySMP works really well and is fairly simple to implement correctly.

There is something counter-intuitive about wider and bushier being stronger than deep, but LazySMP has a lot of counter-intuitive things going for it.

For instance, we should not be able to simply add up the nodes going from one thread to two. Yet two LazySMP threads seem to scale almost magically to nearly 2x in both nodes and Elo. Quite frankly, I don't get it, and I would love to hear a truly sensible explanation of why it should be like that.

Mental picture of 4 threads with no stagger trying to analyze the same root position:
Lots and lots of hash table hits, and these are not zero cost. It is good that we avoided re-searching, but we spent time attempting work that was already completed.

Given enough threads, one would think we would get almost all hits except at the last ply. I would think that would be bad rather than good, since so much compute would go into lookups instead of into searching unsearched positions.

Unless we count hash hits as new nodes, I would also think that the NPS would plummet.
And if the hash hits are new nodes, then I think we are getting a false indication of nodes searched.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: "How To" guide to parallel-izing an engine, ea

Post by TomKerrigan »

lucasart wrote:
TomKerrigan wrote:As for the idea of "widening," I've seen this mentioned a few times and infer that it has to do with late move reductions. The idea being that you search moves to different depths depending on move ordering. And some Lazy SMP implementations randomize move ordering to some degree, meaning that the late moves are pushed forward and searched deeper and that results in an improvement in playing strength. Is that right?

The fact that "widening" works pretty well seems to be proof that these depth reductions are unsound though, doesn't it? I mean, if it's causing the single-threaded search to miss a lot of good moves?
All these are good comments. But, really, the key to success is to stop trying to reason on a purely theoretical level. There are so many complex interactions and trade-offs that you cannot reason your way through them (unless you limit yourself to a sub-2000-Elo alpha-beta toy engine).

I can tell you from experience that search reductions are worth *hundreds* of Elo. Not bad for something that people from academia dismiss as "unsound". The same goes for lazy SMP, and many other things.
Sounds like you have a big axe to grind here. Instead of slagging off academics and theory and basically the idea of understanding how stuff works, it seems like there's an opportunity here to improve how these reductions are done.
TomKerrigan

Re: "How To" guide to parallel-izing an engine, ea

Post by TomKerrigan »

Dann Corbit wrote:
TomKerrigan wrote:
lucasart wrote: ...
I've already posted my TTD speedup numbers in the other thread on ABDADA.

But you still don't get it. TTD is the wrong measure, because the result of Depth=N with Threads=T doesn't just depend on N. It also depends on T. This is the so-called "widening effect". And it gives incredible Elo scaling to the "lazy SMP" approach, provided it is well tuned (as shown in Stockfish, Cheng, and Demolito, among others).
Sorry, I go for months (sometimes years) without reading this forum or posting to it. Searching it via Google seems to be impossible and sifting through the keyword search results is difficult. If you could give me a link to the discussion you're talking about, I would be interested to read it.

As for the idea of "widening," I've seen this mentioned a few times and infer that it has to do with late move reductions. The idea being that you search moves to different depths depending on move ordering. And some Lazy SMP implementations randomize move ordering to some degree, meaning that the late moves are pushed forward and searched deeper and that results in an improvement in playing strength. Is that right?

The fact that "widening" works pretty well seems to be proof that these depth reductions are unsound though, doesn't it? I mean, if it's causing the single-threaded search to miss a lot of good moves?
I think that "widening" in the context of LazySMP usually means accidental widening.

Which is to say, it is an unintended consequence of the method itself.

When you have a giant pile of threads analyzing a position (some possibly with minor offsets), all reading from the hashes of the others, you will get some extra "bushiness" in the tree. The outcome of this extra bushiness seems to be stronger play on the whole.

It is not hard to understand why this might be the case. All the threads are looking for PV nodes, and each new bit of information written to the hash can make the searches a little bit smarter. So (for instance) we might find some clever new PV nodes as a result of this searching overlap. ...
Hey Dann, good to hear from you. Hope things have been going well. :)

The latter part is the one I don't follow. If you search the same position to the same depth with the same alpha-beta window, it seems like you should get the same stuff written back to the hash table, regardless of which thread is doing the search.

The only way I can imagine that smarter/better/different data is written to the hash table by a thread is if pruning is done in a non-deterministic way?
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: "How To" guide to parallel-izing an engine, ea

Post by Evert »

lucasart wrote: All these are good comments. But, really, the key to success is to stop trying to reason on a purely theoretical level. There are so many complex interactions and trade-offs that you cannot reason your way through them (unless you limit yourself to a sub-2000-Elo alpha-beta toy engine).
You know, understanding how things work and why things are the way they are is not a bad thing. It's brought us many great things over the past centuries.
I can tell you from experience that search reductions are worth *hundreds* of elo. Not bad for something that people from academia dismiss as "unsound".
Qualifying reductions as "unsound" is not dismissing them as useless. It means that you change the tree in a way that may shift the root evaluation for a given depth away from the actual minimax value. In practice it means that at fixed depth, reductions are a loss in terms of playing strength. This is in fact true. The point is that in games you use wall-time-limited searches rather than depth-limited ones, and there reductions win back more than they cost you.
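That fixed-depth cost is easy to demonstrate on a toy tree (a hypothetical sketch, not any particular engine's reduction scheme): reduce "late" children by one ply, and re-search at full depth only if the reduced search beats alpha. With no move ordering, a good move hiding late in the move list can simply be missed at fixed depth.

```python
INF = 10 ** 9

def search(pos, depth, alpha, beta, use_lmr=True):
    """Toy fail-soft negamax over nested tuples; ints are leaf scores."""
    if isinstance(pos, int):
        return pos
    if depth == 0:
        return 0                              # placeholder static eval
    best = -INF
    for i, child in enumerate(pos):           # "moves" are just child nodes
        # Reduce late children by one ply (a crude stand-in for LMR).
        red = 1 if (use_lmr and i >= 2 and depth >= 2) else 0
        score = -search(child, depth - 1 - red, -beta, -alpha, use_lmr)
        if red and score > alpha:             # reduced search looked good:
            score = -search(child, depth - 1, -beta, -alpha, use_lmr)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:                     # beta cutoff
            break
    return best
```

On the tree `((1, 2, 3), (4, 5, 6), (0, -1, 2), (7, 8, 9))` at depth 2, the full search returns 7 (the best line is the last child) while the reduced search returns 4: at fixed depth the reduction really does lose the move. The engine only comes out ahead because the saved nodes buy extra depth within the same wall time.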
Evert

Re: "How To" guide to parallel-izing an engine, ea

Post by Evert »

TomKerrigan wrote: The latter part is the one I don't follow. If you search the same position to the same depth with the same alpha-beta window, it seems like you should get the same stuff written back to the hash table, regardless of which thread is doing the search.

The only way I can imagine that smarter/better/different data is written to the hash table by a thread is if pruning is done in a non-deterministic way?
That seems to be the idea. History and killer tables are per-thread, and this changes move ordering to the degree that some threads pick up cut-moves hiding among the late moves before others do, so you find (and resolve) these faster than you otherwise would. To benefit from this, the threads must be de-synchronised. Thing is, you'd expect YBW to benefit from the same thing, and you'd hope this would be rare if move ordering is any good...
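The per-thread ordering effect can be sketched directly (hypothetical names; Python's `threading.local` gives each thread its own independent copy of the state): once one thread records a killer move, its move ordering diverges from every other thread's, which is exactly the de-synchronisation described above.

```python
import threading

class ThreadState(threading.local):
    """Per-thread search state: each thread gets its own fresh copy."""
    def __init__(self):
        self.killers = {}        # ply -> move that last caused a beta cutoff

state = ThreadState()

def record_cutoff(move, ply):
    """Called on a beta cutoff; remembered only by the current thread."""
    state.killers[ply] = move

def order_moves(moves, ply):
    """Try this thread's killer first (if present); otherwise keep order."""
    k = state.killers.get(ply)
    if k in moves:
        return [k] + [m for m in moves if m != k]
    return list(moves)
```

If the main thread records `"Nf3"` as a killer at ply 0, its ordering of `["e4", "d4", "Nf3"]` becomes `["Nf3", "e4", "d4"]`, while a freshly started helper thread, whose own killer table is still empty, searches the moves in the original order.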
TomKerrigan

Re: "How To" guide to parallel-izing an engine, ea

Post by TomKerrigan »

Evert wrote:
TomKerrigan wrote: The latter part is the one I don't follow. If you search the same position to the same depth with the same alpha-beta window, it seems like you should get the same stuff written back to the hash table, regardless of which thread is doing the search.

The only way I can imagine that smarter/better/different data is written to the hash table by a thread is if pruning is done in a non-deterministic way?
That seems to be the idea. History and killer tables are per-thread, and this changes move ordering to the degree that some threads pick up cut-moves hiding among the late moves before others do, so you find (and resolve) these faster than you otherwise would. To benefit from this, the threads must be de-synchronised. Thing is, you'd expect YBW to benefit from the same thing, and you'd hope this would be rare if move ordering is any good...
Got it. I don't do late move reductions in my engine; not that I have anything against them, I just haven't gotten around to them. But that means my parallel search almost always yields exactly the same results as my single-threaded search, just faster.

I would be interested to know how effective my parallel techniques are when applied to a more modern engine, i.e., one that does LMR.