RMO - Randomized Move Order - yet another Lazy SMP derivate

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

Sigh... Everyone claims "it is a mystery that Lazy works so well".

I do not see the mystery.

Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small that, combined with a few primitive tricks to keep threads from doing too much duplicate work, it wins against (some?) smarter work-dividing algorithms that have much more overhead.

As far as I know, no one has discredited this explanation.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by dragontamer5788 »

Dann Corbit wrote: Fri Jan 03, 2020 9:47 pm Something is missing from the minimization of these:
1. Number of CPU instructions executed
2. Number of memory load/stores executed (in L1, L2, L3 cache, and DDR4 RAM)
3. The number of MESI messages passed between CPU cores.

Consider a program that does eval() on the root node, and then waits until time expires and returns the move choice.
1. Very few instructions
2. Very few load/store operations
3. Very few CPU messages

But it is obviously bad. Something has to measure the usefulness of the outputs and we need measures like depth achieved and Elo gained to factor in somehow.
The chess search is a function of several variables. I guess that ultimately we have time as the x axis and Elo as the y axis.
Perhaps now is a good time for me to plug my test methodology? I have invented a "game" for my personal study (maybe others have invented this simple game themselves already; I'm still relatively new to this community). It is the simplest possible game I could imagine, so that I can focus on the pure task of finding a better search.

1. A move is represented by a 5-bit number, between 0 and 31.
2. Players alternate by choosing a move.
3. After a set depth, the game is over. A 10-depth game (5 moves per player) results in a 50-bit final position.
4. Hash(chosen numbers) == the precise value of the final position. Assume that the hash creates a 16-bit signed integer (say, the bottom 16-bits of MD5, Adler32, CRC16, or whatever other hash function you can think of). Negative numbers favor Player2, Positive numbers favor Player1. Zero is exactly a tie.

For example, if Player1 chooses "move 5", and then Player2 chooses "move 22", the position would be the 10-bit binary number: 0010110110b (hex 0x0B6).
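For concreteness, the encoding and evaluation described above can be sketched in a few lines of Python. CRC32 is used here as the stand-in hash, and the function names are my own:

```python
import zlib

def encode(moves):
    """Pack a move sequence into one integer, 5 bits per move."""
    pos = 0
    for m in moves:
        pos = (pos << 5) | m
    return pos

def evaluate(moves):
    """Value of a final position: bottom 16 bits of a hash, reinterpreted
    as a signed integer (negative favors Player2, positive Player1)."""
    h = zlib.crc32(encode(moves).to_bytes(8, "big")) & 0xFFFF
    return h - 0x10000 if h & 0x8000 else h

print(hex(encode([5, 22])))  # -> 0xb6, i.e. 00101 and 10110 packed together
```

Any deterministic hash would do equally well; the point is only that the value of a position is cheap to compute but essentially unpredictable from its neighbors.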

The "Search Question" is to discover the optimal moves of the two players. AlphaBeta is simple and obvious for this simple game, and the various search strategies (YBW, ABDADA, Shared Hash Table, LazySMP) all would apply to this game as well. Its how I've been experimenting with GPU-searches the past few months.

The ultimate question therefore, is which search methodology can calculate an exhaustive, depth 10 search over this game in the shortest amount of time? (or other benchmarks: such as the Big-O of instructions, memory operations, or CPU messages)
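A plain single-threaded alpha-beta baseline for this game might look like the sketch below (Python, with the CRC32 evaluation as my stand-in for the hash; the demo runs at depth 3 rather than 10, since interpreted Python is far too slow for an exhaustive depth-10 search):

```python
import zlib

def evaluate(moves):
    # Signed 16-bit value of a final position (CRC32 as a stand-in hash).
    pos = 0
    for m in moves:
        pos = (pos << 5) | m
    h = zlib.crc32(pos.to_bytes(8, "big")) & 0xFFFF
    return h - 0x10000 if h & 0x8000 else h

def alphabeta(moves, depth, alpha, beta, maximizing):
    if depth == 0:
        return evaluate(moves)
    best = -32768 if maximizing else 32767
    for m in range(32):          # every move 0..31 is always legal
        moves.append(m)
        score = alphabeta(moves, depth - 1, alpha, beta, not maximizing)
        moves.pop()
        if maximizing:
            best = max(best, score)
            alpha = max(alpha, best)
        else:
            best = min(best, score)
            beta = min(beta, best)
        if alpha >= beta:
            break                # cutoff
    return best

print(alphabeta([], 3, -32768, 32767, True))
```

Because leaf values are effectively random, move ordering cannot help here, so alpha-beta prunes close to its worst case; that is itself an argument for Michel's point below about the game lacking chess-like correlations.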
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

dragontamer5788 wrote: Fri Jan 03, 2020 9:57 pm
Perhaps now is a good time for me to plug my test methodology? I have invented a "game" for my personal study. It is the simplest possible game I could imagine, so that I can focus on the pure task of finding a better search.

1. A move is represented by a 5-bit number, between 0 and 31.
2. Players alternate by choosing a move.
3. After a set depth, the game is over. A 10-depth game (5 moves per player) results in a 50-bit final position.
4. Hash(chosen numbers) == the precise value of the final position. Assume that the hash creates a 16-bit signed integer (say, the bottom 16-bits of MD5, Adler32, CRC16, or whatever other hash function you can think of). Negative numbers favor Player2, Positive numbers favor Player1. Zero is exactly a tie.

For example, if Player1 chooses "move 5", and then Player2 chooses "move 22", the position would be the 10-bit binary number: 0010110110b (hex 0x0B6).

The "Search Question" is to discover the optimal moves of the two players. AlphaBeta is simple and obvious for this simple game, and the various search strategies (YBW, ABDADA, Shared Hash Table, LazySMP) all would apply to this game as well. Its how I've been experimenting with GPU-searches the past few months.

The ultimate question therefore, is which search methodology can calculate an exhaustive, depth 10 search over this game in the shortest amount of time? (or other benchmarks: such as the Big-O of instructions, memory operations, or CPU messages)
But this is not a realistic "chess like" tree. In chess neighboring positions in the tree are heavily correlated. Moreover evaluation errors are also heavily correlated. Modern chess engines (A/B and MCTS) exploit these correlations. You will not capture this with a simple random tree.

Generating realistic synthetic chess like trees to test search algorithms would be an interesting project.
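One simple model along these lines gives each child the parent's value plus random noise, so that nearby positions are correlated the way Michel describes. A minimal sketch, with all parameter names and values my own choices rather than anything from the post:

```python
import random

def correlated_tree(parent_value, depth, branching, sigma, rng):
    """Build a tree in which each child's 'true' value is its parent's
    value plus Gaussian noise, so sibling positions are correlated."""
    if depth == 0:
        return parent_value          # leaf evaluation
    return [correlated_tree(parent_value + rng.gauss(0.0, sigma),
                            depth - 1, branching, sigma, rng)
            for _ in range(branching)]

rng = random.Random(42)
tree = correlated_tree(0.0, depth=3, branching=4, sigma=50.0, rng=rng)
print(len(tree))  # 4 subtrees under the root
```

In such a tree a good child of a good parent is likely, which is exactly what move ordering and pruning exploit; dragontamer's purely random game has none of this structure.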
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

Michel wrote: Fri Jan 03, 2020 9:50 pm Sigh... Everyone claims "it is a mystery that Lazy works so well".

I do not see the mystery.

Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small that, combined with a few primitive tricks to keep threads from doing too much duplicate work, it wins against (some?) smarter work-dividing algorithms that have much more overhead.

As far as I know, no one has discredited this explanation.
No, that is not true. Threads work at two depths, d and d+1, simultaneously. While this separation may reduce overhead, each group of threads working on the same depth is surely going to produce a ton of overhead!

I think the goal should be to prove that SHT (no d/d+1 stuff) can be better than the existing algorithms (YBW, ABDADA, DTS, etc.). If not, Lazy (implicitly LazySHT) is going to lose to one of LazyYBW, LazyABDADA, LazyDTS, etc., where you use the smarter algorithm for the threads searching at the same depth.
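The d/d+1 split Daniel refers to is typically implemented by nothing more than offsetting the iterative-deepening depth of half the helper threads; schematically (an illustration, not any particular engine's code):

```python
def helper_depth(thread_id, root_depth):
    """Half the helpers search at depth d, the other half at d+1,
    giving the two-group split discussed above."""
    return root_depth + (thread_id % 2)

print([helper_depth(t, 10) for t in range(8)])  # [10, 11, 10, 11, 10, 11, 10, 11]
```

Daniel's objection is that within each of these two groups the threads still coordinate only through the hash table, so the intra-group duplicate work remains.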
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by dragontamer5788 »

Michel wrote: Fri Jan 03, 2020 10:08 pm
dragontamer5788 wrote: Fri Jan 03, 2020 9:57 pm
Perhaps now is a good time for me to plug my test methodology? I have invented a "game" for my personal study. It is the simplest possible game I could imagine, so that I can focus on the pure task of finding a better search.

1. A move is represented by a 5-bit number, between 0 and 31.
2. Players alternate by choosing a move.
3. After a set depth, the game is over. A 10-depth game (5 moves per player) results in a 50-bit final position.
4. Hash(chosen numbers) == the precise value of the final position. Assume that the hash creates a 16-bit signed integer (say, the bottom 16-bits of MD5, Adler32, CRC16, or whatever other hash function you can think of). Negative numbers favor Player2, Positive numbers favor Player1. Zero is exactly a tie.

For example, if Player1 chooses "move 5", and then Player2 chooses "move 22", the position would be the 10-bit binary number: 0010110110b (hex 0x0B6).

The "Search Question" is to discover the optimal moves of the two players. AlphaBeta is simple and obvious for this simple game, and the various search strategies (YBW, ABDADA, Shared Hash Table, LazySMP) all would apply to this game as well. Its how I've been experimenting with GPU-searches the past few months.

The ultimate question therefore, is which search methodology can calculate an exhaustive, depth 10 search over this game in the shortest amount of time? (or other benchmarks: such as the Big-O of instructions, memory operations, or CPU messages)
But this is not a realistic "chess like" tree. In chess neighboring positions in the tree are heavily correlated. Moreover evaluation errors are also heavily correlated. Modern chess engines (A/B and MCTS) exploit these correlations. You will not capture this with a simple random tree.

Generating realistic synthetic chess like trees to test search algorithms would be an interesting project.
That's a good point.

Hmm... I guess there can be no better test than simply matching engines against each other in chess. That's the ultimate test.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

Daniel wrote: No, that is not true. Threads work at two depths, d and d+1, simultaneously. While this separation may reduce overhead, each group of threads working on the same depth is surely going to produce a ton of overhead!
When I said "overhead" I meant "administrative overhead" (like in YBW). Lazy trades this overhead (deregulation :)) for the overhead resulting from double work.

In SHT there is only the hash table to make the threads desynchronize a little. It seemed to work well in Toga II up to 4 threads (where according to CCRL its scaling was more or less on par with the YBW engines of the time). But it seems unreasonable to extend it beyond that. Your tests seem to confirm this.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
grahamj
Posts: 43
Joined: Thu Oct 11, 2018 2:26 pm
Full name: Graham Jones

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by grahamj »

Michel wrote: Fri Jan 03, 2020 10:08 pm But this is not a realistic "chess like" tree. In chess neighboring positions in the tree are heavily correlated. Moreover evaluation errors are also heavily correlated. Modern chess engines (A/B and MCTS) exploit these correlations. You will not capture this with a simple random tree.

Generating realistic synthetic chess like trees to test search algorithms would be an interesting project.
I have been working on that... If anyone knows of previous work along these lines, please let me know. As well as correlations, I think heavy-tailed distributions are important, since even a generally accurate evaluation function will sometimes make gross errors.
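A cheap way to get such heavy tails in a synthetic tree is a mixture model: mostly small Gaussian error, occasionally a gross blunder. A sketch, with the parameter values being my own arbitrary choices:

```python
import random

def eval_error(rng, p_blunder=0.01, sigma=20.0, blunder_sigma=500.0):
    """Evaluation error: usually small, but with probability p_blunder
    drawn from a much wider Gaussian, a simple heavy-tailed mixture."""
    if rng.random() < p_blunder:
        return rng.gauss(0.0, blunder_sigma)   # gross error
    return rng.gauss(0.0, sigma)               # normal inaccuracy

rng = random.Random(1)
errors = [eval_error(rng) for _ in range(10000)]
print(max(abs(e) for e in errors))  # the tail occasionally produces gross errors
```

A Student's t distribution with few degrees of freedom would serve the same purpose; the mixture form just makes the "rare gross error" interpretation explicit.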
Graham Jones, www.indriid.com
Ronald
Posts: 160
Joined: Tue Jan 23, 2018 10:18 am
Location: Rotterdam
Full name: Ronald Friederich

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Ronald »

Michel wrote: Fri Jan 03, 2020 9:50 pm Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small that, combined with a few primitive tricks to keep threads from doing too much duplicate work, it wins against (some?) smarter work-dividing algorithms that have much more overhead.
I think you use Occam's razor on the wrong aspect. The only thing that is shared by Lazy SMP is the hashtable. In plain Lazy SMP every thread executes an identical search; the hashtable is what makes multiple threads behave differently, and a major part of that is that together they create a "richer" tree. Even 2 threads at a fixed depth perform much better than 1 thread, and this has nothing to do with time to depth (TTD). So the important questions are why the tree is richer, and how different this richer tree is from the 1-thread tree, so that we can try to create this richer tree with 1 thread also.
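The mechanism Ronald describes can be shown in miniature: identical searches that interact only through a shared table, where entries stored by one thread change what another probes and hence the tree it walks. A schematic negamax sketch over a toy game, with a lock-protected dict standing in for the table (real engines use lockless entries; every name here is mine):

```python
import threading

tt = {}                      # shared transposition table: node -> (depth, score)
tt_lock = threading.Lock()   # real engines avoid the lock entirely

def eval_leaf(node):
    return (hash(node) % 201) - 100    # arbitrary deterministic toy evaluation

def children(node):
    return [node + (m,) for m in range(4)]

def search(node, depth):
    with tt_lock:
        hit = tt.get(node)
    if hit is not None and hit[0] >= depth:
        return hit[1]        # another thread may already have done this work
    if depth == 0:
        score = eval_leaf(node)
    else:
        score = max(-search(c, depth - 1) for c in children(node))
    with tt_lock:
        tt[node] = (depth, score)
    return score

# Two identical workers sharing only the table, as in plain SHT / Lazy SMP:
workers = [threading.Thread(target=search, args=((), 6)) for _ in range(2)]
for w in workers: w.start()
for w in workers: w.join()
print(search((), 6))         # now answered straight from the shared table
```

With exact minimax values the table only saves work, but in a real alpha-beta search the entries also carry move-ordering information, which is one plausible ingredient of the "richer tree" effect.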
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by bob »

dragontamer5788 wrote: Fri Jan 03, 2020 7:08 pm
bob wrote: Fri Jan 03, 2020 5:46 pm No one has done the RIGHT experiment to test Lazy vs YBW. The _correct_ test would be to take the same program, and create a YBW version and a Lazy version, and then play them in a long match.
While that's the correct, general idea, there are numerous implementation details that will hamper such a test. Lockless (atomic-compare-and-swap) vs spinlocks, NUMA computers vs UMA computers, Intel Skylake vs AMD Zen (vs IBM Power9), Hyperthreads vs Physical cores.

There are so many variables at play here.
I don't disagree. But so long as you use the same program, different search algorithms, the test ought to work. Might need to tune each version to each new type of architecture, but so long as basic chess knowledge remains the same... both programs are, in reality, going to get run on all sorts of architectures, so worrying too much about tweaking for max performance is probably overkill anyway.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

Ronald wrote: Fri Jan 03, 2020 11:35 pm
Michel wrote: Fri Jan 03, 2020 9:50 pm Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small that, combined with a few primitive tricks to keep threads from doing too much duplicate work, it wins against (some?) smarter work-dividing algorithms that have much more overhead.
I think you use Occam's razor on the wrong aspect. The only thing that is shared by Lazy SMP is the hashtable. In plain Lazy SMP every thread executes an identical search; the hashtable is what makes multiple threads behave differently, and a major part of that is that together they create a "richer" tree. Even 2 threads at a fixed depth perform much better than 1 thread, and this has nothing to do with time to depth (TTD). So the important questions are why the tree is richer, and how different this richer tree is from the 1-thread tree, so that we can try to create this richer tree with 1 thread also.
Exactly!
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.