Another take on DTS?


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Another take on DTS?

Post by bob »

Nothing wrong with that. EXCEPT that you are accepting a significant performance degradation (no global transposition table) in return for a smaller performance loss when probing the table. There is probably a broad spectrum of such approaches (several have been mentioned over the years), each of which trades off table utility (how many hits you get) against table overhead (the latency of remote lookups, or restricting them to reduce the hit on NPS).

The nice thing about SMP/NUMA implementations is that the latency is near zero for SMP and not a whole lot higher for reasonable NUMA configurations. When you factor in Ethernet, InfiniBand, etc., however, either the latency goes WAY up, or else you have to reduce the benefit the table offers relative to an unrestricted architecture...
petero2
Posts: 687
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Another take on DTS?

Post by petero2 »

bob wrote: Sun Oct 27, 2019 9:19 pm Nothing wrong with that. EXCEPT that you are accepting a significant performance degradation (no global transposition table)
...
The nice thing about SMP/NUMA implementations is that the latency is near zero for SMP and not a whole lot higher for reasonable NUMA configurations.
It is certainly true that this algorithm is less efficient on an Ethernet-connected cluster than an SMP algorithm running on an SMP/NUMA machine with as many cores as the whole cluster combined. The problem is of course that building such an SMP/NUMA machine is very expensive, or impossible if the total core count is too large.

So the interesting question to me is: If you have a cluster of modern computers (e.g. Intel Core CPUs) connected with Ethernet networking, what is the best way to use this cluster to maximize playing strength of your chess engine?

So far I have found no better algorithm than the one I currently use in Texel, i.e. an asynchronously shared transposition table and an asynchronous ABDADA-like search algorithm. In one test I measured +76 Elo going from 1 to 4 cluster nodes, compared to +205 Elo from instead increasing the thinking time by a factor of 4, so the cluster algorithm achieved 76/205 = 37% of the theoretical maximum Elo gain. I tried the same thing with Cluster Toga and got 19/103 = 18%. I also tried Scorpio, but was not able to find parameter settings that gained any Elo. I am not aware of any other publicly available cluster-capable chess engines. (Actually, Stockfish has a cluster branch based on some of the ideas I use in Texel. I should test that too sometime.)
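For reference, the "asynchronously shared" part works roughly like this (a simplified sketch; the names and the depth threshold are illustrative, not Texel's actual code):

Code: Select all

#include <cstdint>
#include <deque>
#include <mutex>

// Simplified sketch of asynchronous transposition table sharing between
// cluster nodes. Names and the depth threshold are illustrative only.
struct TTEntry { std::uint64_t key; std::int16_t score; std::uint8_t depth; };

void storeLocal(const TTEntry& e);   // normal SMP-style local TT store, defined elsewhere

std::deque<TTEntry> sendQueue;       // entries waiting to be broadcast to the other nodes
std::mutex sendMutex;

// Called by the search on every TT store; the search never blocks on the network.
void storeAndShare(const TTEntry& e) {
    storeLocal(e);
    if (e.depth >= 8) {              // only deep entries are worth the bandwidth
        std::lock_guard<std::mutex> lk(sendMutex);
        sendQueue.push_back(e);      // a background thread ships these in batches
    }
}
// A matching receiver thread calls storeLocal() for each entry arriving from a
// peer, so remote results appear in the local table with some delay.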

I find it very unlikely that the Texel algorithm is the best possible for an Ethernet cluster though. The "asynchronously shared" transposition table idea can be combined with other search algorithms like YBW and DTS. That could be interesting to explore.

I don't think a truly distributed transposition table has much hope of working though, i.e. one where each cluster node holds 1/N of the total table and a node that wants to probe an entry stored on a different node has to use message passing. On typical PC/Ethernet hardware that would make a single hash probe as expensive as searching maybe 500 nodes using a local transposition table. (For scale: an Ethernet round trip is on the order of 100 microseconds, and a node searching 5 million nodes per second covers 500 nodes in that time.)
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Another take on DTS?

Post by dragontamer5788 »

petero2 wrote: Sun Oct 27, 2019 7:16 am * Each cluster node has its own local transposition table, which works the same as in a normal SMP engine. In Texel's case I use your lockless hashing algorithm. Thanks for inventing that.
What's this lockless hashing algorithm? I'm definitely interested in it.
petero2 wrote: Sun Oct 27, 2019 11:02 pmSo the interesting question to me is: If you have a cluster of modern computers (e.g. Intel Core CPUs) connected with Ethernet networking, what is the best way to use this cluster to maximize playing strength of your chess engine?
One thing that I like about my "Lazy Futures" approach is that it's basically a message-passing algorithm, which would probably work decently well in an Ethernet cluster setting. I've got a topic somewhere else on the subject, but the short story is that the alpha/beta bounds are expression trees of futures instead of plain ints. A drawing from my last topic is as follows:

[image: the full futures tree from the earlier topic]

The full tree assumes 3 children per position. It's quite big, but you can see how the alpha-beta bounds are calculated from the root if you prefer. The root position P starts with alpha-beta bounds of [-inf, +inf], and each row can be evaluated in parallel without any loss in work-efficiency.

The alpha/beta expressions are trees written in reverse Polish notation. For example, P31's alpha is "inf - F1 max F2 max", which evaluates to max(F2, max(F1, -infinity)). Negate ("-") is a unary RPN operator that applies to the single item immediately to its left, so the leading "inf -" is the negative infinity in P31's alpha.
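When every future in an expression happens to be bound, evaluating it is just an ordinary RPN stack walk. A toy sketch (the token names are invented for illustration):

Code: Select all

#include <algorithm>
#include <limits>
#include <vector>

// Toy evaluator for the RPN bound expressions above.
// PUSH_INF pushes +infinity, NEG negates the top item ("-"),
// MAX pops two items and pushes their maximum, FUT pushes a bound future's value.
enum class Op { PUSH_INF, NEG, MAX, FUT };
struct Token { Op op; double futureValue; };   // futureValue is used only by FUT

double evalRPN(const std::vector<Token>& expr) {
    std::vector<double> stack;
    for (const Token& t : expr) {
        switch (t.op) {
        case Op::PUSH_INF: stack.push_back(std::numeric_limits<double>::infinity()); break;
        case Op::NEG:      stack.back() = -stack.back(); break;
        case Op::FUT:      stack.push_back(t.futureValue); break;
        case Op::MAX: {
            double top = stack.back(); stack.pop_back();
            stack.back() = std::max(stack.back(), top); break;
        }
        }
    }
    return stack.back();
}
// P31's alpha "inf - F1 max F2 max" with F1 = 55 and F2 = 20:
// {PUSH_INF, NEG, FUT(55), MAX, FUT(20), MAX} evaluates to 55.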

"F" values (such as F1 or F2) are futures, values that are not known yet because they are being evaluated in parallel. F1 represents the future-evaluation of position P1. (While F131 would be the future-evaluation of position P131). A future-binding is simply a message, such as "F1 equals +55 centipawns", possibly "LowerBound: F1 > +20 centipawns" and "Upperbound: F1 < +90 centipawns". Such a message would only need a 64-bit future-identifier + 16-bit centipawn evaluation: 10 bytes broadcast. I don't know if lower-bound or upper-bound messages would be useful in my scheme, but its possible.

"Beta-blocked" is only relative to work-efficiency. If you wanted a 100% work efficient algorithm, you'd have to wait until Score vs Beta were both fully bound. For example,Beta-Blocked P32 would need F31, F1, and F2 to be bound to a evaluation if you wanted an equivalent work-efficiency to sequential alpha-beta. Its pretty natural to execute P32 speculatively however.

Anyway, a speculative P32 would look like: alpha: [inf - F1 max F2 max] (identical to P31, its "older brother"; equivalently negate(beta) from its father P3), while P32's beta would be [inf - F31 max -]... that is: negate(max(parent's alpha, older brother's future)). All children, such as P321, P3211, P32111, will remember that F31 is still pending, and a sanity check on the alpha-beta bounds at a later time (when more values are bound) will tell you whether those positions have entered beta-cutoff and can be discarded early.

--------

So in effect:

1. Sort all the positions on your cluster by work-efficiency. For example, P33 speculates on two values (F32 and F31), P32 speculates only on F31, and P31 has no speculation at all. As such, P31 is fully work-efficient (sequential alpha-beta must, and always will, visit P31), while P32, P33, P34... are more and more speculative.

2. If you run out of memory, focus on the greatest-depth positions to "clean up" memory. This will force some degree of speculation.

3. Load-balance by message-passing positions to other computers in your cluster. However, every position you pass to another computer increases the Ethernet bandwidth required. For example, if you load-balance P32 to another computer, the bindings of F1, F2, and F31 will need to be broadcast to all computers in the cluster (by the time F1, F2, and F31 are complete, you don't know whether the other computer has load-balanced P32's children to yet other computers, so you should probably do an expensive full broadcast).

That being said, with the futures explicitly laid out like this, you can load-balance in a way that minimizes communication and saves Ethernet bandwidth. For example, P31 is only waiting on F1 and F2, so the "cost" of load-balancing P31 is 33% less than that of broadcasting P32.

--------

So yeah... it seems like this scheme would work out very well for message passing. Not that I've done it or anything; I'm focusing on a GPU implementation for now. A GPU is what I've got at home, so that's what I'm going for. But I've spent some time hypothesizing about how I'd generalize the approach to 4 GPUs instead of 1... so that's why I know what to type up here, in theory :-)
petero2
Posts: 687
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Another take on DTS?

Post by petero2 »

dragontamer5788 wrote: Mon Oct 28, 2019 5:49 am
petero2 wrote: Sun Oct 27, 2019 7:16 am * Each cluster node has its own local transposition table, which works the same as in a normal SMP engine. In Texel's case I use your lockless hashing algorithm. Thanks for inventing that.
What's this lockless hashing algorithm? I'm definitely interested in it.
Lockless hashing has also been referred to as "the xor trick". It is described in detail here: http://www.craftychess.com/hyatt/hashing.html
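The gist of it (a minimal sketch, not Crafty's actual code): each entry stores its key XORed with its data, so an entry torn by two racing writers fails verification when probed:

Code: Select all

#include <cstdint>

// Minimal sketch of the xor trick. Each entry is two 64-bit words:
// the position key xored with the data, and the data itself.
struct Entry { std::uint64_t keyXorData; std::uint64_t data; };

void store(Entry& e, std::uint64_t key, std::uint64_t data) {
    e.keyXorData = key ^ data;   // two separate stores; another thread may interleave
    e.data       = data;
}

bool probe(const Entry& e, std::uint64_t key, std::uint64_t& dataOut) {
    std::uint64_t d = e.data;
    if ((e.keyXorData ^ d) != key)   // a torn (half-updated) entry fails this check
        return false;                // and is simply treated as a miss
    dataOut = d;
    return true;
}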
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Another take on DTS?

Post by dragontamer5788 »

petero2 wrote: Mon Oct 28, 2019 7:15 am
dragontamer5788 wrote: Mon Oct 28, 2019 5:49 am
petero2 wrote: Sun Oct 27, 2019 7:16 am * Each cluster node has its own local transposition table, which works the same as in a normal SMP engine. In Texel's case I use your lockless hashing algorithm. Thanks for inventing that.
What's this lockless hashing algorithm? I'm definitely interested in it.
Lockless hashing has also been referred to as "the xor trick". It is described in detail here: http://www.craftychess.com/hyatt/hashing.html
Definitely pretty nifty. Thanks for the info. Reminds me of the "one pointer" doubly-linked list trick: storing only (prev XOR next) per node.
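For anyone who hasn't seen it: you traverse by keeping two adjacent node pointers and recovering the next one with an XOR. A quick sketch (not production code):

Code: Select all

#include <cstdint>

// Xor-linked list node: a single "pointer" field holds prev XOR next.
struct Node {
    int value;
    std::uintptr_t link;   // (uintptr_t)prev ^ (uintptr_t)next
};

// Given two adjacent nodes, the one after cur is link ^ prev.
Node* next(Node* prev, Node* cur) {
    return reinterpret_cast<Node*>(
        cur->link ^ reinterpret_cast<std::uintptr_t>(prev));
}
// Walk forward with (prev, cur) = (nullptr, head), then repeatedly
// (prev, cur) = (cur, next(prev, cur)). Starting from the tail instead
// walks the list backward with the exact same function.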
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Another take on DTS?

Post by bob »

BTW, for the record, this was an idea from Harry Nelson (one of the Cray Blitz team) way, way back... Tim Mann re-discovered it when porting Crafty to the Alpha. Not sure whether he found Harry's old paper that mentioned it, or re-discovered it on his own. In either case, it was not "my" idea.
petero2
Posts: 687
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Another take on DTS?

Post by petero2 »

bob wrote: Mon Oct 28, 2019 7:53 pm BTW, for the record, this was an idea from Harry Nelson (one of the Cray Blitz team) way, way back... Tim Mann re-discovered it when porting Crafty to the Alpha. Not sure whether he found Harry's old paper that mentioned it, or re-discovered it on his own. In either case, it was not "my" idea.
In that case I want to change my earlier statement to: Thanks for making that idea known to me. :-)
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Another take on DTS?

Post by mar »

dragontamer5788 wrote: Mon Oct 28, 2019 8:54 am Reminds me of the "one pointer" doubly-linked list trick: storing only (prev XOR next) per node.
Sorry for a bit of OT, but how is that supposed to work? You'd need to know the pointer of one of the two neighbors, so deleting an item from the list would still require O(N) traversal, just like a singly-linked list (instead of the O(1) of a classical doubly-linked list, which is its primary advantage).
Perhaps the only advantage of such an approach would be that you can traverse from either end, but I don't see much use for that in practice.
Also debugging garbage pointers doesn't sound like much fun. Just wondering.
Martin Sedlak
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Another take on DTS?

Post by dragontamer5788 »

mar wrote: Mon Oct 28, 2019 9:38 pm Perhaps the only advantage of such approach would be that you can traverse from either end, but I don't see much use for that in practice.
A singly-linked list seems like it can easily implement a stack. I'm pretty sure a singly-linked list can implement a queue too (but it takes a bit of thought). A deque however... that is, push-front, push-back, pop-front, and pop-back... seems to require a doubly-linked list. And the "XOR-pointer" trick is sufficient for a deque (as you only ever pull data from the front or back, never from the middle).

Otherwise, you'd have to double your pointer space by creating a classic doubly-linked list, which has the (here useless) feature of letting you remove from the middle.

--------

A deque is very useful for load-balancing tree-traversal searches. Your local thread does push_back and pop_back (i.e. depth-first search locally), while "other" threads can invade and pop_front whenever they need to steal work from your thread (i.e. breadth-first distribution while load-balancing). It takes some effort to mitigate the race conditions through locks or compare-and-swap operations... but a deque is definitely useful.
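In sketch form (one lock for clarity; a real implementation would want a lock-free design such as the Chase-Lev work-stealing deque):

Code: Select all

#include <deque>
#include <mutex>
#include <optional>

// Sketch of a stealable deque: the owner works depth-first at the back,
// thieves take the oldest (shallowest, largest) work from the front.
template <typename Work>
class StealableDeque {
    std::deque<Work> dq;
    std::mutex m;
public:
    void pushBack(Work w) {                 // owner: push a child position
        std::lock_guard<std::mutex> lk(m);
        dq.push_back(std::move(w));
    }
    std::optional<Work> popBack() {         // owner: continue depth-first
        std::lock_guard<std::mutex> lk(m);
        if (dq.empty()) return std::nullopt;
        Work w = std::move(dq.back()); dq.pop_back();
        return w;
    }
    std::optional<Work> steal() {           // thief: work-steal from the front
        std::lock_guard<std::mutex> lk(m);
        if (dq.empty()) return std::nullopt;
        Work w = std::move(dq.front()); dq.pop_front();
        return w;
    }
};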
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Another take on DTS?

Post by mar »

dragontamer5788 wrote: Mon Oct 28, 2019 10:32 pm A singly-linked list seems like it can easily implement a stack. I'm pretty sure a singly-linked list can implement a queue too (but it takes a bit of thought). A deque however... that is, push-front, push-back, pop-front, and pop-back... seems to require a doubly-linked list. And the "XOR-pointer" trick is sufficient for a deque (as you only ever pull data from the front or back, never from the middle).

Otherwise, you'd have to double your pointer space by creating a classic doubly-linked list, which has the (here useless) feature of letting you remove from the middle.

--------

A deque is very useful for load-balancing tree-traversal searches. Your local thread does push_back and pop_back (i.e. depth-first search locally), while "other" threads can invade and pop_front whenever they need to steal work from your thread (i.e. breadth-first distribution while load-balancing). It takes some effort to mitigate the race conditions through locks or compare-and-swap operations... but a deque is definitely useful.
Ok, thanks. But I'm still not sold on the linked list idea to implement a double-ended queue.
My implementation uses a (dynamic) circular buffer with two indices, which seems more compact to me (the only expensive operation is growing as needed, but push is still amortized O(1) and you can always reserve in advance) => elements are tightly packed together, no need to store any kind of pointer/index per element.
EDIT: as a bonus, I get random access to the queue elements for free
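In outline, something like this (a stripped-down sketch, not the actual code; growing and bounds checks are omitted):

Code: Select all

#include <cstddef>
#include <vector>

// Ring-buffer deque: two indices into a power-of-two sized buffer.
// Elements are tightly packed; no per-element pointers needed.
template <typename T>
class RingDeque {
    std::vector<T> buf;                      // capacity must be a power of two
    std::size_t head = 0, count = 0;         // head = index of the front element
    std::size_t wrap(std::size_t i) const { return i & (buf.size() - 1); }
public:
    explicit RingDeque(std::size_t capacityPow2) : buf(capacityPow2) {}
    void pushBack(const T& v)  { buf[wrap(head + count)] = v; ++count; }
    void pushFront(const T& v) { head = wrap(head + buf.size() - 1); buf[head] = v; ++count; }
    T popBack()  { --count; return buf[wrap(head + count)]; }
    T popFront() { T v = buf[head]; head = wrap(head + 1); --count; return v; }
    T& operator[](std::size_t i) { return buf[wrap(head + i)]; }   // free random access
    std::size_t size() const { return count; }
};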
Martin Sedlak