RMO - Randomized Move Order - yet another Lazy SMP derivate

lucasart · Post by **lucasart** » Fri Jan 03, 2020 12:21 pm

Daniel Shawul wrote: ↑Thu Jan 02, 2020 11:41 pm I did some quick tests on a ryzen 3900x with 12-physical cores (24 threads but I used only 12) at tc 40/60

ABDADA beats plain SHT by a score of 10-3-15, and YBW beats ABDADA by a score of 9-6-9 so far.
Not so many games I know but It is clear to me YBW > ABDADA > SHT even at 12 threads.

Lazy SMP could definitely benefit from ABDADA ideas i.e. using transposition table to coordinate work-sharing
between threads rather than rely only on the threads searching at 2 depths. Michel mentioned about "Breadcrumbs" using ABDADA ideas -- I definitely agree threads working on the same depth could benefit from ABDADA ideas. Also, don't forget ABDADA is the original real lazy SMP implementation...so all this hype about lazy SMP is because stockfish uses it yikes! It happens to "widen" search somehow and helped it but it might not work in other engines which don't prune as heavily as stockifsh. Also, LazySMP is rather uninteresting for parallel search literature because it moves goal post to achieve something else ... The work should have been the same for all algorithms but LazySMP apparently changes selectivity even though I never tried it.

ABDADA is good on paper, but Bread crumbs is better in the real world

It adresses the problems of ABDADA which are:

1/ HT real estate: There is not a bit to waste on HT entries, no room for nproc. Need another table, preferable a much smaller one (see next point).

2/ Contention: HT is used (in a modern engine tuned for elo, not a toy engine) at every node, including qs nodes. So one needs to be careful with contention on hashEntry.nProc, especially as they have to be updated with atomic RMW operations (increment and decrement) with acquire/release ordering. So you want a smaller table of (key,nproc) and use it at high enough depths to minimise contention with large number of threads.

3/ Move ordering: Messiging around with move ordering is a very bad idea, in modern engines where move ordering is so massively fine tuned and path dependant. You just cannot mess up move ordering for the sake of SMP. I am ready to accept the superiority of ABDADA in toy engines where move ordering is deterministic (ie. no path dependance from history CMH etc.). And what about LMR and LMP ? ABDADA's re-ordering will completely screw that, which is worth hundreds of elo. But in real world modern engine, it's a terrible idea. Also, it messes up the codebase and introduces serious code smell (the logic of re-ordering moves). The Bread Crumbs solution of simply increasing the reduction is much superior. There is power in simplicity.

Daniel Shawul · Post by **Daniel Shawul** » Fri Jan 03, 2020 1:21 pm

I happen to have introduced a bug to ABDADA at the root when I added some root move ordering stuff a while ago.
So I fixed that, and re-run the match overnight with 12 threads.
To my surprise, it is smashing even YBW now that I am wondering if I am disadvantaging YBW somehow

Code: Select all

Score of scorpio-ybw vs scorpio-abdada: 13 - 32 - 51  [0.405] 95

ABDADA good on paper ?? give me a break!

I will have to check if the short 40/60 time control and split_depth of 8 (which btw I use for ABDADA too) are hurting YBW performance.

1/ HT real estate: There is not a bit to waste on HT entries, no room for nproc. Need another table, preferable a much smaller one (see next point).

I don't store nproc but a 1-bit flag to indicate if the move is taken by another processor. So nproc goes into some flag i already had with no extra
cost -- i think most people do that too.

2/ Contention: HT is used (in a modern engine tuned for elo, not a toy engine) at every node, including qs nodes. So one needs to be careful with contention on hashEntry.nProc, especially as they have to be updated with atomic RMW operations (increment and decrement) with acquire/release ordering. So you want a smaller table of (key,nproc) and use it at high enough depths to minimise contention with large number of threads.

I use lockless hashtables for both ABDADA/YBW so there is no contention as regards the hashtable. Scorpio does HT for quiescence search too.

3/ Move ordering: Messiging around with move ordering is a very bad idea, in modern engines where move ordering is so massively fine tuned and path dependant. You just cannot mess up move ordering for the sake of SMP. I am ready to accept the superiority of ABDADA in toy engines where move ordering is deterministic (ie. no path dependance from history CMH etc.). And what about LMR and LMP ? ABDADA's re-ordering will completely screw that, which is worth hundreds of elo. But in real world modern engine, it's a terrible idea. Also, it messes up the codebase and introduces serious code smell (the logic of re-ordering moves). The Bread Crumbs solution of simply increasing the reduction is much superior. There is power in simplicity.

Move ordering is the same even on the second pass for ABDADA so I don't know what you are talking about. I don't even actually generate the moves on the second pass since I use iterative search. It goes through the same set of moves that the first pass generated, and also doesn't even check for legality. I also take extra precaution to use the same selectivity (e.g. LMR) when on the first pass some moves are skipped. E.g. I base the LMR amount based on the move index not on so-far-searched moves etc. That way I don't pollute the experiment f.i. unlike what LazySMP does with "widening" and all ...

Does LazySMP advantage really come from "widening effect"? I have doubts about it now that ABDADA beats YBW.

Michel · Post by **Michel** » Fri Jan 03, 2020 4:12 pm

_If_ Lazy would be really worse than YBW in TTD (I appreciate Lucas' high quality data, but there should also be data on the YBW side for comparison) then its strength must come from widening the search. However it is not clear to me what mechanism in Lazy would be responsible for widening the search. AFAIK all threads use the same pruning and reductions.

bob · Post by **bob** » Fri Jan 03, 2020 5:46 pm

No one has done the RIGHT experiment to test Lazy vs YBW. The _correct_ test would be to take the same program, and create a YBW version and a Lazy version, and then play them in a long match. I generally believe in time to depth. And for modest numbers of threads (maybe up to 32-64) I will always believe that YBW is better. It will search a similar tree to the purely serial tree, only deeper. Lazy (to me) seems to address the scaling issue and NOT the quality issue. IE for large numbers of threads (> 32/64) it continues to improve in performance where YBW becomes much harder to match. There are lots of tricks that many YBW programs don't use (mainly avoiding locks except where absolutely necessary as opposed to where they make it much easier to program around race conditions).

It seems we come back to this "widening" issue again which I consider to be irrelevant. If widening helps Lazy, why not modify the serial search to widen, since it MUST help there as well. I notice some mention widening, some mention depth, but nobody mentions real test results which is the only way to resolve this question about which is better.

My summary: YBW will ALWAYS be superior to Lazy and friends, until you reach the point where additional threads/cores fail to improve depth performance. That's the YBW asymptote that is well-known (although its precise slope is not really known since I doubt any existing YBW program is truly optimal in implementation). Once you reach this magic point where additional cores don't help YBW, now you begin to step into the arena where Lazy will catch up and surpass YBW. But that is, so far as I can imagine, the ONLY advantage of Lazy. It will scale much better in terms of performance gained per core added, but only for larger numbers of cores than YBW can cope with. Perhaps one day a master's student somewhere or a Ph.D. student will decide to tackle this question and do a definitive comparison where all is equal except for the parallel search approach. So far we don't have such data, except maybe for when Stockfish made the conversion. I've not seen that data. For YBW (really more like DTS in my case) I have data up to 32 cores and performance is quite good.

I've worked on this problem for a LONG time (back to late 70's). I can state with certainty that for up to 16 cores, Lazy will never play as well as YBW. For cores beyond 512, YBW will never play as well as Lazy most likely. But within that band of 16-512, where the actual break-even point lies is unknown. I am almost certain that it is above 32 cores however, and my guess would be that by 64 cores, lazy might be better unless someone really does some performance tweaking. I went through this stuff with vtune several years ago to eliminate all the various points of contention (locks, shared memory accesses, waiting, etc). I doubt my parallel search is optimal. But I do know for certain that it is very good. Placing that break-even barrier would be some useful data. But it would burn a ton of CPU cycles to do properly, on a machine that won't be cheap. And when you hit the NUMA issues it becomes even more complex.

dragontamer5788 · Post by **dragontamer5788** » Fri Jan 03, 2020 7:08 pm

bob wrote: ↑Fri Jan 03, 2020 5:46 pm No one has done the RIGHT experiment to test Lazy vs YBW. The _correct_ test would be to take the same program, and create a YBW version and a Lazy version, and then play them in a long match.

While that's the correct, general idea, there are numerous implementation details that will hamper such a test. Lockless (atomic-compare-and-swap) vs spinlocks, NUMA computers vs UMA computers, Intel Skylake vs AMD Zen (vs IBM Power9), Hyperthreads vs Physical cores.

There are so many variables at play here.

Daniel Shawul · Post by **Daniel Shawul** » Fri Jan 03, 2020 7:48 pm

Final score

Code: Select all

Score of scorpio-ybw vs scorpio-abdada: 32 - 67 - 101  [0.412] 200

On the other hand, YBW looks like it will smash SHT, full result tomorrow

Code: Select all

Score of scorpio-ybw vs scorpio-sht: 6 - 1 - 6  [0.692] 13

30-sec search on initial position
========================
NPS on 1-core is about 1.2 mnps
ABDADA/SHT on 12 cores = 12.5 mnps
YBW on 12 cores = 9.5 mnps

YBW is getting about 7 mnps in the games ( 1.5 sec per move on average) so I expect its result will improve if I used longer time control.
In any case, ABDADA is certainly not an only-good-on-paper algorithm...

I don't have LazySMP implemented, which "parallelizes on depth". But my question is if the SHT approach that has all threads working on the same depth is this bad compared to ABDADA/SMP, whatever LazySMP brings by parallelizing with depth, you can improve upon by using ABDADA or YBW for the threads working on the same depth, no? That is a LazyABDADA or LazyYBW algorithm ...

Therefore, If I can't get pure SHT (without d/d+1 stuff) to beat ABDADA/YBW it means point proven by induction.
Q.E.D

Dann Corbit · Post by **Dann Corbit** » Fri Jan 03, 2020 8:30 pm

Am I the only one who feels repugnance for "throw it at the wall and see if it sticks" searches?

Now, Lucas has pointed out, and I think quite eloquently, that Lazy SMP is beautiful in its simplicity.
I think we were all shocked at Dan Homan's published results. And therein lies the rub.

Lazy SMP works GREAT. And I don't think anyone has ever given an adequate explanation of why. And adding gasoline to the already blazing pyre of conundrum, Dr. Hyatt has pointed out that Lazy SMP seems to scale better with massive CPU load. The bizarre thing here is that my gedankenexperiment says the opposite. As cores increase, the hash table should fill up faster and we would have more and more collisions. At very long time scale or very high core count eventually we overwrite everything before it can even be used to a large degree. So why would Lazy SMP scale better and better rather than worse and worse?

Another algorithm that REALLY sticks in my craw is MCTS. Monte Carlo, for gambling, of course. So, instead of intelligent and guided search, we cast a net blindly left right and center. Maybe it lands in the water. Maybe it lands on a tree. Maybe it catches a bunch of fish. Why should this turn out so well? Why is it that intelligent guidance should not dominantly prevail over randomized searching? I do realize that MCTS "learns" from the good and bad results, but it still seems faulty to me.

As for experiments proving which technique is best, I think that it only shows that implementation X of algorithm Y is better than implementation A of algorithm B. An algorithm analysis where O(f(n)) is proven would show why algorithm A or X should be preferred. Err... complicated by locking.

The theory that Lazy SMP fills in the cracks where the intelligent pruning made mistakes also seems strange to me. Are not the Lazy searches using the exact same pruning algorithms in all the threads? Where then, does this magical crack filling come from?

I guess the real problem here is that algorithms like YBW are easy to understand why they work. Lazy SMP working so fabulously well is a real puzzle (yes, it is elegant and simple, but why does it work so DARN WELL?!)

I think also that most of our guesses about which algorithms are best come from folklore. What I mean by that is Anthony Cozzie's Zappa searched so efficiently that we all assumed (at the time) DTS is best. And it might be. Or maybe not. Perhaps it was simply a superb implementation. He simply smoked everyone at time to depth:
http://www.talkchess.com/forum3/viewtopic.php?t=56019
And Zappa was one of the top programs of the day, without a doubt.

Another piece of folklore came from the switch of Stockfish from YBW to Lazy SMP. The Elo of Stockfish rose, so we all assumed that Lazy SMP was better (despite the loss in time to ply) because of the gain in Elo. But was it the algorithm or the implementation or was it really simply superior?

On the whole, though, I think that experimentation with search algorithms is the most fascinating and fruitful part of computer chess.

grahamj · Post by **grahamj** » Fri Jan 03, 2020 9:08 pm

Dann Corbit wrote: ↑Fri Jan 03, 2020 8:30 pm Another algorithm that REALLY sticks in my craw is MCTS. Monte Carlo, for gambling, of course. So, instead of intelligent and guided search, we cast a net blindly left right and center. Maybe it lands in the water. Maybe it lands on a tree. Maybe it catches a bunch of fish. Why should this turn out so well? Why is it that intelligent guidance should not dominantly prevail over randomized searching? I do realize that MCTS "learns" from the good and bad results, but it still seems faulty to me.

The MC in MCTS is monumentally confusing. The search algorithm used by A0 and LC0 is deterministic. PUCT (predictor + upper confidence tree) is a better acronym.

Dann Corbit wrote: ↑Fri Jan 03, 2020 8:30 pm On the whole, though, I think that experimentation with search algorithms is the most fascinating and fruitful part of computer chess.

I agree.

dragontamer5788 · Post by **dragontamer5788** » Fri Jan 03, 2020 9:21 pm

grahamj wrote: ↑Fri Jan 03, 2020 9:08 pm
Dann Corbit wrote: ↑Fri Jan 03, 2020 8:30 pm Another algorithm that REALLY sticks in my craw is MCTS. Monte Carlo, for gambling, of course. So, instead of intelligent and guided search, we cast a net blindly left right and center. Maybe it lands in the water. Maybe it lands on a tree. Maybe it catches a bunch of fish. Why should this turn out so well? Why is it that intelligent guidance should not dominantly prevail over randomized searching? I do realize that MCTS "learns" from the good and bad results, but it still seems faulty to me.
The MC in MCTS is monumentally confusing. The search algorithm used by A0 and LC0 is deterministic. PUCT (predictor + upper confidence tree) is a better acronym.

I really like your PUCT acronym, and that really will prevent future confusion.

MCTS is "monte carlo" because you assume that the various branches are a one-armed bandit. You then perform the statistical algorithm based on prior visits to calculate "probability of win" on any branch, and then balance exploration vs exploitation across your choices. Its a deterministic algorithm that assumes randomness on behalf of the nodes.

Dann Corbit wrote: ↑Fri Jan 03, 2020 8:30 pm On the whole, though, I think that experimentation with search algorithms is the most fascinating and fruitful part of computer chess.
I agree.

Agree x3.

Dann Corbit wrote: As for experiments proving which technique is best, I think that it only shows that implementation X of algorithm Y is better than implementation A of algorithm B. An algorithm analysis where O(f(n)) is proven would show why algorithm A or X should be preferred. Err... complicated by locking.

Locking doesn't "complicate" things. Its just one more O() calculation. Overall, we should be aiming at roughly 3x O() calculations to theoretically understand our programs:

1. Number of CPU instructions executed
2. Number of memory load/stores executed (in L1, L2, L3 cache, and DDR4 RAM)
3. The number of MESI messages passed between CPU cores.

The optimal algorithm will minimize all three. Remember that only one core can actually claim any memory location for its L1 cache. When one core is writing to a particular cache line, no other core can write to it until the core sends an "Invalidate" message to the other cores.

LazySMP has a very small number for #3, because of the nature of the shared hash table. There's less contention obviously, because the hash table is large and core counts are small relative to the hash table. In fact, the cache line is probably returned to DDR4 memory (ie: no core actually is writing) before the next core writes to the data, minimizing contention.

Dann Corbit · Post by **Dann Corbit** » Fri Jan 03, 2020 9:47 pm

Something is missing from the minimization of these:
1. Number of CPU instructions executed
2. Number of memory load/stores executed (in L1, L2, L3 cache, and DDR4 RAM)
3. The number of MESI messages passed between CPU cores.

Consider a program that does eval() on the root node, and then waits until time expires and returns the move choice.
1. Very few instructions
2. Very few load/store operations
3. Very few CPU messages

But it is obviously bad. Something has to measure the usefulness of the outputs and we need measures like depth achieved and Elo gained to factor in somehow.
The chess search is a function of several variables. I guess that ultimately we have time as the x axis and Elo as the y axis.

RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate