RMO - Randomized Move Order - yet another Lazy SMP derivate

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

Michel wrote: Fri Jan 03, 2020 10:53 pm
Daniel wrote: No that is not true. Threads work at two depths, d and d+1, simultaneously. While this separation may result in reduced overhead, each group of threads working on the same depth is surely going to result in a ton of overhead!
When I said "overhead" I meant "administrative overhead" (like in YBW). Lazy trades this overhead (deregulation :)) for the overhead resulting from double work.

In SHT there is only the hash table to make the threads desynchronize a little. It seemed to work well in Toga II up to 4 threads (where according to CCRL its scaling was more or less on par with the YBW engines of the time). But it seems unreasonable to extend it beyond that. Your tests seem to confirm this.
Ok got it. Btw YBW is still clearly better than SHT so far.

Code: Select all

Score of scorpio-ybw vs scorpio-sht: 44 - 18 - 70  [0.598] 132
I have to stop this test now since the machine is needed for other work. It would be interesting to see how ABDADA fares against
YBW at longer time controls, if somebody is willing to run the test.

That combined with the previous result,

Code: Select all

Score of scorpio-ybw vs scorpio-abdada: 32 - 67 - 101  [0.412] 200
suggests that, for the conditions I tested (40/60 tc on a 12-thread SMP machine), ABDADA > YBW > SHT.
If Lazy helps in some way, I guess LazyABDADA will work even better.

I glanced at Stockfish code to see what Breadcrumbs does, and indeed it looks like ABDADA in disguise with a different name.
The "leaving breadcrumbs" part is exactly what ABDADA does when it stores a busy flag in the hash table.
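For readers unfamiliar with the mechanism, a minimal sketch of such a busy marker in a small shared table might look like this (table size and all names are mine, not Scorpio's or Stockfish's). A thread that finds a position flagged busy can defer that move to a second pass instead of duplicating work.

Code: Select all

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// Sketch of an ABDADA-style "busy" table: a thread flags a position's
// Zobrist key as currently being searched; other threads check the flag
// and postpone that move to a second pass. Key 0 is reserved for "empty".
struct BusyTable {
    static constexpr std::size_t SIZE = 1 << 12;  // illustrative size
    std::atomic<std::uint64_t> slot[SIZE];

    BusyTable() {
        for (auto& s : slot) s.store(0, std::memory_order_relaxed);
    }
    // True if some thread has flagged this key as being searched right now.
    bool isBusy(std::uint64_t key) const {
        return slot[key % SIZE].load(std::memory_order_relaxed) == key;
    }
    void setBusy(std::uint64_t key) {
        slot[key % SIZE].store(key, std::memory_order_relaxed);
    }
    void clearBusy(std::uint64_t key) {
        std::uint64_t expected = key;  // only clear if we still own the slot
        slot[key % SIZE].compare_exchange_strong(expected, 0);
    }
};
```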
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Dann Corbit »

Where do you imagine that DTS sits within this grid?
On top? In the middle? On the bottom?

I guess that the main problem here is that ZCT is the only open source DTS program I can think of.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

Dann Corbit wrote: Sat Jan 04, 2020 2:44 am Where do you imagine that DTS sits within this grid?
On top? In the middle? On the bottom?

I guess that the main problem here is that ZCT is the only open-source DTS program I can think of.
Well, I don't know, but times may have changed. As Michel suggested, maybe the less "bookkeeping" (locks, split blocks, and in the case of DTS an iterative search) a method needs, the better it scales. But I feel like a smarter algorithm (including DTS) should be able to improve upon Lazy SMP by replacing the part where all threads search at the same depth. I can't see pure SHT overcoming its so-far poor performance without using ABDADA-like tricks... Oh well, who knows; I originally thought YBW would be the best at 12 threads...
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Dann Corbit »

Why not a hybrid?

Half of the threads are using Lazy SMP to update the hash table.
The other half of the threads are using DTS/YBW/ABDADA to update the hash table.
Or some other mix as a function of thread count, hash table size, and time.
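As a rough illustration of what such a mix could look like, one could assign each thread a scheme from its id; the thresholds and the 50/50 split below are pure guesses, not tested values.

Code: Select all

```cpp
// Hypothetical thread partition for the hybrid idea: pick a parallel
// search scheme per thread from its id and the total thread count.
// The <=4 threshold and the half/half split are assumptions.
enum class Scheme { Lazy, ABDADA, YBW };

Scheme schemeFor(int threadId, int threadCount) {
    if (threadCount <= 4) return Scheme::YBW;  // few cores: YBW alone
    // Larger machines: first half of the threads Lazy, second half ABDADA.
    return (threadId < threadCount / 2) ? Scheme::Lazy : Scheme::ABDADA;
}
```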
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

Dann Corbit wrote: Sat Jan 04, 2020 5:04 am Why not a hybrid?

Half of the threads are using Lazy SMP to update the hash table.
The other half of the threads are using DTS/YBW/ABDADA to update the hash table.
Or some other mix as a function of thread count, hash table size, and time.
Talk is cheap. Show us your code :lol:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by smatovic »

Just to summarize...

- test Elo, not time-to-depth (TTD), when comparing Lazy SMP with other parallel searches
- the plain SHT approach can be successfully extended with depth+1 techniques
- an overly simple engine may not be able to profit from Lazy SMP
- but for some engines Lazy SMP simply does not work out
- different implementations of different parallel search algorithms vary in effectiveness
- SHT and Lazy SMP scale better nps-wise than YBW and ABDADA because of lower overhead
- people still discuss why Lazy SMP works
- there is a break-even point for YBW, maybe between 16 and 512 cores or threads
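The depth+1 technique in the second bullet can be sketched in a few lines; the 50/50 split and the names here are illustrative, not taken from any particular engine.

Code: Select all

```cpp
// Illustrative Lazy SMP depth staggering: helper threads are spread over
// the root depth d and d+1 so the two groups desynchronize a little.
int helperDepth(int rootDepth, int threadId) {
    // Odd-numbered helpers search one ply deeper than the main depth.
    return rootDepth + (threadId & 1);
}
```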

Despite the discussions on Lazy SMP vs. XY, I am personally more interested in
the last point, which Bob mentioned: the break-even point of YBW. There must
also be one for ABDADA. Currently we have at most 64 cores per CPU socket, and
Moore's Law still has something in the pipeline, so it is relatively certain
that within a decade we will have to switch to something new. I tested different
parallel searches with up to 256 SIMD units, acting as 256 workers, on a GPU
architecture, but it remains open how, and if, my results are applicable to pro
CPU engines. Because of the architecture I was not able to implement YBW, and I
did not compare SHT/Lazy SMP Elo-wise, but I observed a possible break-even
point at >=128 threads and made a proposal, RMO, to overcome this. Curious what
others will come up with in the coming years...
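Going by the name alone, the core of a randomized move order could look like the sketch below; this is my reading of the idea, not code from Zeta. Each helper thread keeps the hash move in place and shuffles the rest with a per-thread seed, so threads desynchronize without any locks or coordination.

Code: Select all

```cpp
#include <vector>
#include <random>
#include <algorithm>
#include <cstddef>

// Guessed sketch of randomized move order: keep the first (hash) move and
// shuffle the tail with a per-thread seed. Seeding constant and the
// keepFirst default are illustrative assumptions.
void randomizeTail(std::vector<int>& moves, unsigned threadId,
                   std::size_t keepFirst = 1) {
    if (moves.size() <= keepFirst + 1) return;
    std::mt19937 rng(threadId * 0x9E3779B9u + 1);  // deterministic per thread
    std::shuffle(moves.begin() + keepFirst, moves.end(), rng);
}
```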

--
Srdja
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

Ronald wrote: Fri Jan 03, 2020 11:35 pm
Michel wrote: Fri Jan 03, 2020 9:50 pm Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small, that (combined with a few primitive tricks to avoid that threads do too much double work) it wins against (some?) smarter work dividing algorithms that have much more overhead.
I think you use Occam's razor on the wrong aspect. The only thing that is shared in Lazy SMP is the hash table. In plain Lazy SMP every thread executes an identical search; the hash table is what makes multiple threads behave differently, and a major part of that is that together they create a "richer" tree. Even 2 threads at a fixed depth perform much better than 1 thread, and this has nothing to do with TTD. So the important questions are why the tree is richer, and how this richer tree differs from the 1-thread tree, so that we can try to create this richer tree with 1 thread as well.
A "richer" tree is an undefined concept. I assume you mean a wider tree.

I think it still needs to be established that Lazy indeed searches a substantially wider tree (this is testable to some extent). It is not very obvious to me which mechanism might cause this as all the threads use basically the same search. I seem to recall some tests on Fishtest that tried to explicitly widen the tree for some of the threads and they failed. I am sure others have tried this too.

If Lazy searches indeed a wider tree then the question remains (as many people have pointed out) why a similar widening could not be implemented in a single threaded search.
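One way to make the "wider tree" question testable, as suggested above: record the set of visited Zobrist keys per thread and compare their union against the largest single-thread set. The instrumentation is engine-specific and not shown; the comparison itself is trivial (names here are mine).

Code: Select all

```cpp
#include <set>
#include <vector>
#include <cstdint>
#include <cstddef>

// Ratio of the union of all threads' visited-node sets to the largest
// single set. Values well above 1.0 would indicate the threads together
// search a wider tree than any one of them alone.
double wideningRatio(const std::vector<std::set<std::uint64_t>>& perThread) {
    std::set<std::uint64_t> all;
    std::size_t largest = 0;
    for (const auto& s : perThread) {
        all.insert(s.begin(), s.end());
        if (s.size() > largest) largest = s.size();
    }
    return largest ? double(all.size()) / double(largest) : 1.0;
}
```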
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by chrisw »

Michel wrote: Sat Jan 04, 2020 11:02 am
Ronald wrote: Fri Jan 03, 2020 11:35 pm
Michel wrote: Fri Jan 03, 2020 9:50 pm Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small, that (combined with a few primitive tricks to avoid that threads do too much double work) it wins against (some?) smarter work dividing algorithms that have much more overhead.
I think you use Occam's razor on the wrong aspect. The only thing that is shared in Lazy SMP is the hash table. In plain Lazy SMP every thread executes an identical search; the hash table is what makes multiple threads behave differently, and a major part of that is that together they create a "richer" tree. Even 2 threads at a fixed depth perform much better than 1 thread, and this has nothing to do with TTD. So the important questions are why the tree is richer, and how this richer tree differs from the 1-thread tree, so that we can try to create this richer tree with 1 thread as well.
A "richer" tree is an undefined concept. I assume you mean a wider tree.
All else being equal (e.g. width), the richer tree would be the one that contains more sensible moves and fewer dumb moves. Presumably.


I think it still needs to be established that Lazy indeed searches a substantially wider tree (this is testable to some extent). It is not very obvious to me which mechanism might cause this as all the threads use basically the same search. I seem to recall some tests on Fishtest that tried to explicitly widen the tree for some of the threads and they failed. I am sure others have tried this too.

If Lazy searches indeed a wider tree then the question remains (as many people have pointed out) why a similar widening could not be implemented in a single threaded search.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

Dann Corbit wrote: Sat Jan 04, 2020 5:04 am Why not a hybrid?

Half of the threads are using Lazy SMP to update the hash table.
The other half of the threads are using DTS/YBW/ABDADA to update the hash table.
Or some other mix as a function of thread count, hash table size, and time.
That is definitely possible; you would only need a single algorithm if it were superior to all others in every scenario.
For instance, I don't see YBW being beaten with few cores (though I haven't tested lately), so one could divide
the processors into groups of, say, 4, and use Lazy on top, YBW at the leaves, etc.
- SHT and Lazy SMP scale better nps-wise than YBW and ABDADA because of lower overhead
ABDADA has the same nps as SHT, at least for me.
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by smatovic »

Daniel Shawul wrote: Sat Jan 04, 2020 2:24 pm
Dann Corbit wrote: Sat Jan 04, 2020 5:04 am Why not a hybrid?

Half of the threads are using Lazy SMP to update the hash table.
The other half of the threads are using DTS/YBW/ABDADA to update the hash table.
Or some other mix as a function of thread count, hash table size, and time.
That is definitely possible; you would only need a single algorithm if it were superior to all others in every scenario.
For instance, I don't see YBW being beaten with few cores (though I haven't tested lately), so one could divide
the processors into groups of, say, 4, and use Lazy on top, YBW at the leaves, etc.
I ran ABDADA for the first 64 workers and RMO for up to 256 in Zeta v099k,
it works.
Daniel Shawul wrote: Sat Jan 04, 2020 2:24 pm
- SHT and Lazy SMP scale better nps-wise than YBW and ABDADA because of lower overhead
ABDADA has the same nps as SHT, at least for me.
Ah, yes, I had to hack some things on the GPU; my version uses atomic functions, which are probably not
necessary for ABDADA on the CPU.

--
Srdja