RMO - Randomized Move Order - yet another Lazy SMP derivate

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

Sigh... Everyone claims "it is a mystery that Lazy works so well".

I do not see the mystery.

Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small that, combined with a few primitive tricks to keep threads from doing too much duplicate work, it wins against (some?) smarter work-dividing algorithms that have much more overhead.

As far as I know, no one has discredited this explanation.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by dragontamer5788 »

Dann Corbit wrote: Fri Jan 03, 2020 9:47 pm Something is missing from the minimization of these:
1. Number of CPU instructions executed
2. Number of memory load/stores executed (in L1, L2, L3 cache, and DDR4 RAM)
3. The number of MESI messages passed between CPU cores.

Consider a program that does eval() on the root node, and then waits until time expires and returns the move choice.
1. Very few instructions
2. Very few load/store operations
3. Very few CPU messages

But it is obviously bad. Something has to measure the usefulness of the outputs and we need measures like depth achieved and Elo gained to factor in somehow.
The chess search is a function of several variables. I guess that ultimately we have time as the x axis and Elo as the y axis.
Perhaps now is a good time for me to plug my test methodology? I have invented a "game" for my personal study (maybe others have invented this simple game themselves already; I'm still relatively new to this community). It is the simplest possible game I could imagine, so that I can focus on the pure task of finding a better search.

1. A move is represented by a 5-bit number, between 0 and 31.
2. Players alternate by choosing a move.
3. After a set depth, the game is over. A 10-depth game (5 moves per player) results in a 50-bit final position.
4. Hash(chosen numbers) == the precise value of the final position. Assume that the hash creates a 16-bit signed integer (say, the bottom 16-bits of MD5, Adler32, CRC16, or whatever other hash function you can think of). Negative numbers favor Player2, Positive numbers favor Player1. Zero is exactly a tie.

For example, if Player1 chooses "move 5", and then Player2 chooses "move 22", the position would be the 10-bit binary number: 0010110110b (hex 0x0B6).
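For concreteness, the encoding and evaluation described above can be sketched in a few lines of Python. CRC32 is used here as the stand-in hash, and the function names are my own:

```python
import zlib

def encode(moves):
    """Pack a move sequence into one integer, 5 bits per move."""
    pos = 0
    for m in moves:
        pos = (pos << 5) | m
    return pos

def evaluate(moves):
    """Value of a final position: bottom 16 bits of a hash, reinterpreted
    as a signed integer (negative favors Player2, positive Player1)."""
    h = zlib.crc32(encode(moves).to_bytes(8, "big")) & 0xFFFF
    return h - 0x10000 if h & 0x8000 else h

print(hex(encode([5, 22])))  # -> 0xb6, i.e. 00101 and 10110 packed together
```

Any deterministic hash would do equally well; the point is only that the value of a position is cheap to compute but essentially unpredictable from its neighbors.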

The "Search Question" is to discover the optimal moves of the two players. AlphaBeta is simple and obvious for this simple game, and the various search strategies (YBW, ABDADA, Shared Hash Table, LazySMP) all would apply to this game as well. Its how I've been experimenting with GPU-searches the past few months.

The ultimate question therefore, is which search methodology can calculate an exhaustive, depth 10 search over this game in the shortest amount of time? (or other benchmarks: such as the Big-O of instructions, memory operations, or CPU messages)
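A plain single-threaded alpha-beta baseline for this game might look like the sketch below (Python, with the CRC32 evaluation as my stand-in for the hash; the demo runs at depth 3 rather than 10, since interpreted Python is far too slow for an exhaustive depth-10 search):

```python
import zlib

def evaluate(moves):
    # Signed 16-bit value of a final position (CRC32 as a stand-in hash).
    pos = 0
    for m in moves:
        pos = (pos << 5) | m
    h = zlib.crc32(pos.to_bytes(8, "big")) & 0xFFFF
    return h - 0x10000 if h & 0x8000 else h

def alphabeta(moves, depth, alpha, beta, maximizing):
    if depth == 0:
        return evaluate(moves)
    best = -32768 if maximizing else 32767
    for m in range(32):          # every move 0..31 is always legal
        moves.append(m)
        score = alphabeta(moves, depth - 1, alpha, beta, not maximizing)
        moves.pop()
        if maximizing:
            best = max(best, score)
            alpha = max(alpha, best)
        else:
            best = min(best, score)
            beta = min(beta, best)
        if alpha >= beta:
            break                # cutoff
    return best

print(alphabeta([], 3, -32768, 32767, True))
```

Because leaf values are effectively random, move ordering cannot help here, so alpha-beta prunes close to its worst case; that is itself an argument for Michel's point below about the game lacking chess-like correlations.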
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

dragontamer5788 wrote: Fri Jan 03, 2020 9:57 pm
Perhaps now is a good time for me to plug my test methodology? I have invented a "game" for my personal study. It is the simplest possible game I could imagine, so that I can focus on the pure task of finding a better search.

1. A move is represented by a 5-bit number, between 0 and 31.
2. Players alternate by choosing a move.
3. After a set depth, the game is over. A 10-depth game (5 moves per player) results in a 50-bit final position.
4. Hash(chosen numbers) == the precise value of the final position. Assume that the hash creates a 16-bit signed integer (say, the bottom 16-bits of MD5, Adler32, CRC16, or whatever other hash function you can think of). Negative numbers favor Player2, Positive numbers favor Player1. Zero is exactly a tie.

For example, if Player1 chooses "move 5", and then Player2 chooses "move 22", the position would be the 10-bit binary number: 0010110110b (hex 0x0B6).

The "Search Question" is to discover the optimal moves of the two players. AlphaBeta is simple and obvious for this simple game, and the various search strategies (YBW, ABDADA, Shared Hash Table, LazySMP) all would apply to this game as well. Its how I've been experimenting with GPU-searches the past few months.

The ultimate question therefore, is which search methodology can calculate an exhaustive, depth 10 search over this game in the shortest amount of time? (or other benchmarks: such as the Big-O of instructions, memory operations, or CPU messages)
But this is not a realistic "chess like" tree. In chess neighboring positions in the tree are heavily correlated. Moreover evaluation errors are also heavily correlated. Modern chess engines (A/B and MCTS) exploit these correlations. You will not capture this with a simple random tree.

Generating realistic synthetic chess like trees to test search algorithms would be an interesting project.
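One simple model along these lines gives each child the parent's value plus random noise, so that nearby positions are correlated the way Michel describes. A minimal sketch, with all parameter names and values my own choices rather than anything from the post:

```python
import random

def correlated_tree(parent_value, depth, branching, sigma, rng):
    """Build a tree in which each child's 'true' value is its parent's
    value plus Gaussian noise, so sibling positions are correlated."""
    if depth == 0:
        return parent_value          # leaf evaluation
    return [correlated_tree(parent_value + rng.gauss(0.0, sigma),
                            depth - 1, branching, sigma, rng)
            for _ in range(branching)]

rng = random.Random(42)
tree = correlated_tree(0.0, depth=3, branching=4, sigma=50.0, rng=rng)
print(len(tree))  # 4 subtrees under the root
```

In such a tree a good child of a good parent is likely, which is exactly what move ordering and pruning exploit; dragontamer's purely random game has none of this structure.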
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

Michel wrote: Fri Jan 03, 2020 9:50 pm Sigh... Everyone claims "it is a mystery that Lazy works so well".

I do not see the mystery.

Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small that, combined with a few primitive tricks to keep threads from doing too much duplicate work, it wins against (some?) smarter work-dividing algorithms that have much more overhead.

As far as I know, no one has discredited this explanation.
No, that is not true. Threads work at two depths, d and d+1, simultaneously. While this separation may reduce overhead, each group of threads working on the same depth is surely going to produce a ton of overhead!

I think the goal should be to prove that SHT (no d/d+1 stuff) can be better than the existing algorithms (YBW, ABDADA, DTS, etc.). If not, Lazy (implicitly LazySHT) is going to lose to one of LazyYBW, LazyABDADA, LazyDTS, etc., where you use the smarter algorithm for the threads searching at the same depth.
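The d/d+1 split Daniel refers to is typically implemented by nothing more than offsetting the iterative-deepening depth of half the helper threads; schematically (an illustration, not any particular engine's code):

```python
def helper_depth(thread_id, root_depth):
    """Half the helpers search at depth d, the other half at d+1,
    giving the two-group split discussed above."""
    return root_depth + (thread_id % 2)

print([helper_depth(t, 10) for t in range(8)])  # [10, 11, 10, 11, 10, 11, 10, 11]
```

Daniel's objection is that within each of these two groups the threads still coordinate only through the hash table, so the intra-group duplicate work remains.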
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by dragontamer5788 »

Michel wrote: Fri Jan 03, 2020 10:08 pm
dragontamer5788 wrote: Fri Jan 03, 2020 9:57 pm
Perhaps now is a good time for me to plug my test methodology? I have invented a "game" for my personal study. It is the simplest possible game I could imagine, so that I can focus on the pure task of finding a better search.

1. A move is represented by a 5-bit number, between 0 and 31.
2. Players alternate by choosing a move.
3. After a set depth, the game is over. A 10-depth game (5 moves per player) results in a 50-bit final position.
4. Hash(chosen numbers) == the precise value of the final position. Assume that the hash creates a 16-bit signed integer (say, the bottom 16-bits of MD5, Adler32, CRC16, or whatever other hash function you can think of). Negative numbers favor Player2, Positive numbers favor Player1. Zero is exactly a tie.

For example, if Player1 chooses "move 5", and then Player2 chooses "move 22", the position would be the 10-bit binary number: 0010110110b (hex 0x0B6).

The "Search Question" is to discover the optimal moves of the two players. AlphaBeta is simple and obvious for this simple game, and the various search strategies (YBW, ABDADA, Shared Hash Table, LazySMP) all would apply to this game as well. Its how I've been experimenting with GPU-searches the past few months.

The ultimate question therefore, is which search methodology can calculate an exhaustive, depth 10 search over this game in the shortest amount of time? (or other benchmarks: such as the Big-O of instructions, memory operations, or CPU messages)
But this is not a realistic "chess like" tree. In chess neighboring positions in the tree are heavily correlated. Moreover evaluation errors are also heavily correlated. Modern chess engines (A/B and MCTS) exploit these correlations. You will not capture this with a simple random tree.

Generating realistic synthetic chess like trees to test search algorithms would be an interesting project.
That's a good point.

Hmm... I guess there can be no better test than simply matching engines against each other in chess. That's the ultimate test.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

Daniel wrote: No, that is not true. Threads work at two depths, d and d+1, simultaneously. While this separation may reduce overhead, each group of threads working on the same depth is surely going to produce a ton of overhead!
When I said "overhead" I meant "administrative overhead" (like in YBW). Lazy trades this overhead (deregulation :)) for the overhead resulting from double work.

In SHT there is only the hash table to make the threads desynchronize a little. It seemed to work well in Toga II up to 4 threads (where according to CCRL its scaling was more or less on par with the YBW engines of the time). But it seems unreasonable to extend it beyond that. Your tests seem to confirm this.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
grahamj
Posts: 43
Joined: Thu Oct 11, 2018 2:26 pm
Full name: Graham Jones

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by grahamj »

Michel wrote: Fri Jan 03, 2020 10:08 pm But this is not a realistic "chess like" tree. In chess neighboring positions in the tree are heavily correlated. Moreover evaluation errors are also heavily correlated. Modern chess engines (A/B and MCTS) exploit these correlations. You will not capture this with a simple random tree.

Generating realistic synthetic chess like trees to test search algorithms would be an interesting project.
I have been working on that... If anyone knows of previous work along these lines, please let me know. As well as correlations, I think heavy-tailed distributions are important, since even a generally accurate evaluation function will sometimes make gross errors.
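A cheap way to get such heavy tails in a synthetic tree is a mixture model: mostly small Gaussian error, occasionally a gross blunder. A sketch, with the parameter values being my own arbitrary choices:

```python
import random

def eval_error(rng, p_blunder=0.01, sigma=20.0, blunder_sigma=500.0):
    """Evaluation error: usually small, but with probability p_blunder
    drawn from a much wider Gaussian, a simple heavy-tailed mixture."""
    if rng.random() < p_blunder:
        return rng.gauss(0.0, blunder_sigma)   # gross error
    return rng.gauss(0.0, sigma)               # normal inaccuracy

rng = random.Random(1)
errors = [eval_error(rng) for _ in range(10000)]
print(max(abs(e) for e in errors))  # the tail occasionally produces gross errors
```

A Student's t distribution with few degrees of freedom would serve the same purpose; the mixture form just makes the "rare gross error" interpretation explicit.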
Graham Jones, www.indriid.com
Ronald
Posts: 160
Joined: Tue Jan 23, 2018 10:18 am
Location: Rotterdam
Full name: Ronald Friederich

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Ronald »

Michel wrote: Fri Jan 03, 2020 9:50 pm Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small that, combined with a few primitive tricks to keep threads from doing too much duplicate work, it wins against (some?) smarter work-dividing algorithms that have much more overhead.
I think you use Occam's razor on the wrong aspect. The only thing that is shared by Lazy SMP is the hashtable. In plain Lazy SMP every thread executes an identical search; the hashtable is what makes multiple threads behave differently, and a major part of that is that together they create a "richer" tree. Even 2 threads at a fixed depth perform much better than 1 thread, and this has nothing to do with time to depth (TTD). So the important questions are why the tree is richer, and how different this richer tree is from the 1-thread tree, so that we can try to create this richer tree with 1 thread also.
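The mechanism Ronald describes can be shown in miniature: identical searches that interact only through a shared table, where entries stored by one thread change what another probes and hence the tree it walks. A schematic negamax sketch over a toy game, with a lock-protected dict standing in for the table (real engines use lockless entries; every name here is mine):

```python
import threading

tt = {}                      # shared transposition table: node -> (depth, score)
tt_lock = threading.Lock()   # real engines avoid the lock entirely

def eval_leaf(node):
    return (hash(node) % 201) - 100    # arbitrary deterministic toy evaluation

def children(node):
    return [node + (m,) for m in range(4)]

def search(node, depth):
    with tt_lock:
        hit = tt.get(node)
    if hit is not None and hit[0] >= depth:
        return hit[1]        # another thread may already have done this work
    if depth == 0:
        score = eval_leaf(node)
    else:
        score = max(-search(c, depth - 1) for c in children(node))
    with tt_lock:
        tt[node] = (depth, score)
    return score

# Two identical workers sharing only the table, as in plain SHT / Lazy SMP:
workers = [threading.Thread(target=search, args=((), 6)) for _ in range(2)]
for w in workers: w.start()
for w in workers: w.join()
print(search((), 6))         # now answered straight from the shared table
```

With exact minimax values the table only saves work, but in a real alpha-beta search the entries also carry move-ordering information, which is one plausible ingredient of the "richer tree" effect.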
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by bob »

dragontamer5788 wrote: Fri Jan 03, 2020 7:08 pm
bob wrote: Fri Jan 03, 2020 5:46 pm No one has done the RIGHT experiment to test Lazy vs YBW. The _correct_ test would be to take the same program, and create a YBW version and a Lazy version, and then play them in a long match.
While that's the correct, general idea, there are numerous implementation details that will hamper such a test. Lockless (atomic-compare-and-swap) vs spinlocks, NUMA computers vs UMA computers, Intel Skylake vs AMD Zen (vs IBM Power9), Hyperthreads vs Physical cores.

There are so many variables at play here.
I don't disagree. But so long as you use the same program, different search algorithms, the test ought to work. Might need to tune each version to each new type of architecture, but so long as basic chess knowledge remains the same... both programs are, in reality, going to get run on all sorts of architectures, so worrying too much about tweaking for max performance is probably overkill anyway.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by lucasart »

Ronald wrote: Fri Jan 03, 2020 11:35 pm
Michel wrote: Fri Jan 03, 2020 9:50 pm Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small that, combined with a few primitive tricks to keep threads from doing too much duplicate work, it wins against (some?) smarter work-dividing algorithms that have much more overhead.
I think you use Occam's razor on the wrong aspect. The only thing that is shared by Lazy SMP is the hashtable. In plain Lazy SMP every thread executes an identical search; the hashtable is what makes multiple threads behave differently, and a major part of that is that together they create a "richer" tree. Even 2 threads at a fixed depth perform much better than 1 thread, and this has nothing to do with time to depth (TTD). So the important questions are why the tree is richer, and how different this richer tree is from the 1-thread tree, so that we can try to create this richer tree with 1 thread also.
Exactly!
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.