SMP, first shot at implementation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

abulmo2
Posts: 433
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: SMP, first shot at implementation

Post by abulmo2 »

mvanthoor wrote: Sat Sep 12, 2020 8:21 pm Do you have a link to that non-locking mechanism for using the hash table? In my engine, it's (almost) impossible to write the thing single-threaded because of the way Rust works. I'm going to need a thread for the UCI loop and a thread for the search, so I'll be starting out with two threads already. Going from there to using 2-4 threads for the search instead of one will be a relatively easy step, as platform-independent concurrency is built right into the language. A non-locking hash table mechanism would be really useful.
The article is available online: https://craftychess.com/hyatt/hashing.html
I am not convinced this reduces the overall rate of collisions by a significant margin. I used the lock-less approach in Amoeba 3.0 & 3.1 but removed it in Amoeba 3.2; this simplification was worth a couple of Elo. The shared hash table in chess is against the philosophy of the Rust language. Race conditions are acceptable in a chess engine accessing/writing the shared hash table.
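For reference, here is a minimal sketch (in Rust, using atomics) of what the XOR "lockless hashing" scheme from the linked article looks like. The packed 64-bit data layout and the names are made up for illustration; this is not Amoeba's or Crafty's actual code.

```rust
// Sketch of the XOR trick: each slot stores (key XOR data) and data in two
// separate atomics, so a probe can detect when the two halves were written
// by different threads. Entry layout is hypothetical.
use std::sync::atomic::{AtomicU64, Ordering};

pub struct Slot {
    key_xor_data: AtomicU64,
    data: AtomicU64, // score, depth, bound and move packed into 64 bits
}

pub struct TranspositionTable {
    slots: Vec<Slot>,
}

impl TranspositionTable {
    pub fn new(num_slots: usize) -> Self {
        let slots = (0..num_slots)
            .map(|_| Slot {
                key_xor_data: AtomicU64::new(0),
                data: AtomicU64::new(0),
            })
            .collect();
        Self { slots }
    }

    // Two independent relaxed stores, no lock. If another thread interleaves
    // its own store, the slot ends up "torn"; the probe below notices that.
    pub fn store(&self, zobrist: u64, data: u64) {
        let slot = &self.slots[(zobrist as usize) % self.slots.len()];
        slot.key_xor_data.store(zobrist ^ data, Ordering::Relaxed);
        slot.data.store(data, Ordering::Relaxed);
    }

    // Only trust an entry if the two halves still belong together.
    pub fn probe(&self, zobrist: u64) -> Option<u64> {
        let slot = &self.slots[(zobrist as usize) % self.slots.len()];
        let key = slot.key_xor_data.load(Ordering::Relaxed);
        let data = slot.data.load(Ordering::Relaxed);
        if key ^ data == zobrist {
            Some(data)
        } else {
            None
        }
    }
}
```

Whether the extra XOR is worth it is exactly the question here: the scheme only detects torn entries, it does not change the normal key-collision rate.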
Richard Delorme
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: SMP, first shot at implementation

Post by mvanthoor »

abulmo2 wrote: Sat Sep 12, 2020 10:19 pm The shared hash table in chess is against the philosophy of the Rust language. Race conditions are acceptable in a chess engine accessing/writing the shared hash table.
Yes, true. The shared hash table itself is not the problem, IF mutex locking is used.

According to FabianVDW (of FabChess, which is also an engine written in Rust), mutex locking works, but it becomes too slow once an engine uses more than 8 threads: the overhead of the locks/unlocks grows so large that extra threads no longer add any speed. He says that "racey access to the hash table is necessary." That can be done with unsafe Rust, but it is not my preference, as it drops you into manual memory management territory: YOU make sure that everything is correct, instead of the compiler being able to check it for you. If you wrap code in unsafe { }, the compiler won't help you at all.

If the problem only starts at more than 8 threads, I think I'll first implement proper locked hash table sharing (when the time for SMP comes for my engine... after I hit at least 2850 Elo on a single thread), and if it isn't fast enough when running more than 4 threads, I'll look into lockless/unsafe Rust then.
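A rough sketch of that locked first step, assuming a hypothetical Entry layout and a single Mutex around one Vec (the simplest safe option in Rust, and exactly the place where the contention with many threads comes from):

```rust
// Sketch of a mutex-protected shared table. Every probe/store serializes on
// the single lock, which is what makes this approach slow with many threads.
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Clone, Copy, Default)]
struct Entry {
    key: u64,
    depth: u8,
    score: i16,
}

fn main() {
    let table = Arc::new(Mutex::new(vec![Entry::default(); 1 << 20]));

    let handles: Vec<_> = (0..4u64)
        .map(|id| {
            let table = Arc::clone(&table);
            thread::spawn(move || {
                // Stand-in for a search thread: store and probe one entry.
                let zobrist = 0x9E37_79B9_7F4A_7C15u64.wrapping_mul(id + 1);
                let mut tt = table.lock().unwrap(); // blocks all other threads
                let len = tt.len();
                let idx = (zobrist as usize) % len;
                tt[idx] = Entry { key: zobrist, depth: 1, score: 0 };
                let _hit = tt[idx].key == zobrist;
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```

Sharding the table into many independently locked regions is a common middle ground before going fully lockless.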
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

mvanthoor wrote: Sat Sep 12, 2020 8:21 pm
Joost Buijs wrote: Sat Sep 12, 2020 3:27 pm
Kieren Pearson wrote: Sat Sep 12, 2020 3:08 pm At this point the only thing you need to do is make the transposition table thread-safe. By using mutexes in C++ I was able to do this with minimal changes to the code.
Maybe you can get away with this when you don't use the transposition table in quiescence; in my experience, locking the transposition table with a mutex is way too slow to be useful.
Do you have a link to that non-locking mechanism for using the hash table? In my engine, it's (almost) impossible to write the thing single-threaded because of the way Rust works. I'm going to need a thread for the UCI loop and a thread for the search, so I'll be starting out with two threads already. Going from there to using 2-4 threads for the search instead of one will be a relatively easy step, as platform-independent concurrency is built right into the language. A non-locking hash table mechanism would be really useful.
The original idea is from Robert Hyatt; he describes it here: https://craftychess.com/hyatt/hashing.html

From CPW: https://www.chessprogramming.org/Shared_Hash_Table
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

abulmo2 wrote: Sat Sep 12, 2020 10:19 pm
mvanthoor wrote: Sat Sep 12, 2020 8:21 pm Do you have a link to that non-locking mechanism for using the hash table? In my engine, it's (almost) impossible to write the thing single-threaded because of the way Rust works. I'm going to need a thread for the UCI loop and a thread for the search, so I'll be starting out with two threads already. Going from there to using 2-4 threads for the search instead of one will be a relatively easy step, as platform-independent concurrency is built right into the language. A non-locking hash table mechanism would be really useful.
The article is available online: https://craftychess.com/hyatt/hashing.html
I am not convinced this reduces the overall rate of collisions by a significant margin. I used the lock-less approach in Amoeba 3.0 & 3.1 but removed it in Amoeba 3.2; this simplification was worth a couple of Elo. The shared hash table in chess is against the philosophy of the Rust language. Race conditions are acceptable in a chess engine accessing/writing the shared hash table.
Locking or not has no influence on the rate of collisions at all; as you said, it is used to avoid race conditions when accessing/writing the shared hash table. The lock-less scheme of XORing the hash key with the data field doesn't avoid race conditions either, but it gives you a simple means of detecting, in the majority of cases, whether a race condition has occurred. Of course this is not perfect, but I can't imagine that it costs Elo; if it does, there must be something else causing it.
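A tiny self-contained illustration of that point: the XOR does not prevent the race, it only makes a torn entry recognizable so the probe can treat it as a miss (the values below are arbitrary):

```rust
// The stored key field is (zobrist XOR data); an entry is trusted only if
// the two halves still reproduce the probing key.
fn is_consistent(key_field: u64, data_field: u64, zobrist: u64) -> bool {
    key_field ^ data_field == zobrist
}

fn main() {
    let zobrist = 0xDEAD_BEEF_CAFE_F00Du64;
    let data = 0x0000_0012_0034_0056u64;

    // Clean entry: both halves come from the same store.
    assert!(is_consistent(zobrist ^ data, data, zobrist));

    // Torn entry: another thread overwrote the data half in between.
    let foreign_data = 0x0000_0099_0088_0077u64;
    assert!(!is_consistent(zobrist ^ data, foreign_data, zobrist));

    println!("torn entries are detected and simply treated as misses");
}
```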
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: SMP, first shot at implementation

Post by Dann Corbit »

For Rust, if you want to stay safe, why not just use a lockless, thread-safe hash table? There are lots of them; a web search will turn up plenty. Then you won't have to use unsafe. I would be surprised if Rust does not already have one; Java and C# both do.
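As one concrete illustration of that suggestion, assuming the third-party dashmap crate (which uses sharded locking internally rather than being strictly lock-free, but stays entirely in safe Rust); the key and value layout here is made up:

```rust
// Concurrent map from a zobrist key to a packed 64-bit entry, usable from
// many threads without any unsafe code. Requires the `dashmap` crate.
use dashmap::DashMap;
use std::sync::Arc;
use std::thread;

fn main() {
    let tt: Arc<DashMap<u64, u64>> = Arc::new(DashMap::new());

    let handles: Vec<_> = (0..4u64)
        .map(|id| {
            let tt = Arc::clone(&tt);
            thread::spawn(move || {
                let zobrist = 0x1234_5678_9ABC_DEF0u64 ^ id;
                tt.insert(zobrist, 0x42); // packed score/depth/bound/move
                if let Some(entry) = tt.get(&zobrist) {
                    assert_eq!(*entry, 0x42);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```

The catch, as discussed further down the thread, is that a general-purpose map grows instead of overwriting old entries, so it does not give the fixed-size, replace-in-place behaviour a chess transposition table usually wants.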
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SMP, first shot at implementation

Post by chrisw »

Joost Buijs wrote: Sat Sep 12, 2020 7:24 pm
chrisw wrote: Sat Sep 12, 2020 6:57 pm Yes, could be, although I was using only four of the six available cores and the CPU load was reporting at 60% or so.
I have figured for some time now that nps is quite search-efficiency dependent. It also depends on where the node counter gets upticked; in my case, when a MakeMove gets done. If there's some "smart" move ordering and so on, then an efficient search with more cuts tends to spend more time on smarts and less on doing MakeMoves (they get pruned or thrown out), so a more efficient search (from thread cross-pollination) ought not (maybe) to scale nps with thread count. Dunno, maybe.
The best is to keep all global variables that belong to a search thread (including the node counter) in a struct that is aligned to 64 bytes (the typical width of a cache line), to prevent different threads from writing into the same cache line. At the end of the search you can collect the node counters from the different threads and add them together to get the final node count. Letting different threads write into the same cache line (even for a single counter) completely kills performance.
Done, thanks. At first sight it gains a 1.5% speed-up (as measured by nps). I think the only globals remaining now are in a debug test that updates/checks the maximum movelist width since the beginning of time. I guess I can throw those away by now.
However, I still have more or less the same non-linear scaling of nps with thread count. Does anyone have stats from their own experience on how linearly nps increases with thread count? From memory of past reports, I recollect that non-linear scaling is the norm.
Yes! Right now it takes no care at all with hash entries, neither read nor write. No locks, no mutex. I do have some quite rigorous tests to check hash data: the tt move has to validate, else the entire entry is abandoned. I also double-check (because I overflowed the normal 64-bit read/write) that what just got read (or written) is still there by the time of the next instruction. The hash code is marked "go back and clean this up".
Checking hash data validity should work; there is no need to lock the transposition table at all if you are sure that your hash data is valid.
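A sketch in Rust of the 64-byte-aligned per-thread struct Joost describes above (the same idea works with alignas(64) in C++); the field set is illustrative only:

```rust
// Each search thread owns one ThreadData. Because the struct is aligned to
// 64 bytes (a typical cache-line width), counters of different threads never
// share a cache line, so there is no false sharing.
#[repr(align(64))]
#[derive(Default)]
pub struct ThreadData {
    pub nodes: u64,
    pub tb_hits: u64,
    pub sel_depth: u8,
    // killers, history tables, etc. would live here as well
}

// Collected once at the end of the search, as described above.
pub fn total_nodes(threads: &[ThreadData]) -> u64 {
    threads.iter().map(|t| t.nodes).sum()
}

fn main() {
    let threads: Vec<ThreadData> = (0..4).map(|_| ThreadData::default()).collect();
    assert_eq!(std::mem::align_of::<ThreadData>(), 64);
    println!("total nodes: {}", total_nodes(&threads));
}
```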
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

chrisw wrote: Sun Sep 13, 2020 12:01 pm Done, thanks. At first sight it gains a 1.5% speed-up (as measured by nps). I think the only globals remaining now are in a debug test that updates/checks the maximum movelist width since the beginning of time. I guess I can throw those away by now.
However, I still have more or less the same non-linear scaling of nps with thread count. Does anyone have stats from their own experience on how linearly nps increases with thread count? From memory of past reports, I recollect that non-linear scaling is the norm.
With YBWC non-linear nps scaling is the norm; with lazy SMP, nps scaling should be almost linear, apart from some CPU effects (the CPU probably boosts to a higher frequency when only a single core is used). For instance, in the opening phase my engine does ~2.5 Mnps on a single core and 70 to 75 Mnps on 32 cores, which is almost linear.

If your nps scaling with lazy SMP is much worse, there must be some kind of bottleneck; from a distance it's difficult to figure out what it could be.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SMP, first shot at implementation

Post by chrisw »

Joost Buijs wrote: Sun Sep 13, 2020 2:35 pm
chrisw wrote: Sun Sep 13, 2020 12:01 pm Done, thanks. At first sight it gains a 1.5% speed-up (as measured by nps). I think the only globals remaining now are in a debug test that updates/checks the maximum movelist width since the beginning of time. I guess I can throw those away by now.
However, I still have more or less the same non-linear scaling of nps with thread count. Does anyone have stats from their own experience on how linearly nps increases with thread count? From memory of past reports, I recollect that non-linear scaling is the norm.
With YBWC non-linear nps scaling is the norm; with lazy SMP, nps scaling should be almost linear, apart from some CPU effects (the CPU probably boosts to a higher frequency when only a single core is used). For instance, in the opening phase my engine does ~2.5 Mnps on a single core and 70 to 75 Mnps on 32 cores, which is almost linear.

If your nps scaling with lazy SMP is much worse, there must be some kind of bottleneck; from a distance it's difficult to figure out what it could be.
OK, thanks, I'll keep looking. By the way, is there a server blitz match this month (I stupidly missed the last one, convinced it was on Sunday rather than Saturday), and is it possible to arrange some test games with a comparable engine on the server? When I log on there's usually nobody there; I guess one has to make an arrangement beforehand?
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

chrisw wrote: Sun Sep 13, 2020 3:27 pm
Joost Buijs wrote: Sun Sep 13, 2020 2:35 pm
chrisw wrote: Sun Sep 13, 2020 12:01 pm Done, thanks. At first sight it gains a 1.5% speed-up (as measured by nps). I think the only globals remaining now are in a debug test that updates/checks the maximum movelist width since the beginning of time. I guess I can throw those away by now.
However, I still have more or less the same non-linear scaling of nps with thread count. Does anyone have stats from their own experience on how linearly nps increases with thread count? From memory of past reports, I recollect that non-linear scaling is the norm.
With YBWC non-linear nps scaling is the norm; with lazy SMP, nps scaling should be almost linear, apart from some CPU effects (the CPU probably boosts to a higher frequency when only a single core is used). For instance, in the opening phase my engine does ~2.5 Mnps on a single core and 70 to 75 Mnps on 32 cores, which is almost linear.

If your nps scaling with lazy SMP is much worse, there must be some kind of bottleneck; from a distance it's difficult to figure out what it could be.
OK, thanks, I'll keep looking. By the way, is there a server blitz match this month (I stupidly missed the last one, convinced it was on Sunday rather than Saturday), and is it possible to arrange some test games with a comparable engine on the server? When I log on there's usually nobody there; I guess one has to make an arrangement beforehand?
It's our intention to have the next blitz tourney on Saturday the 19th of September. I keep the server online 24/7. Usually there are some engines logged on; there is no need to make an arrangement beforehand.
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: SMP, first shot at implementation

Post by mvanthoor »

Dann Corbit wrote: Sun Sep 13, 2020 11:03 am For Rust, if you want to stay safe, why not just use a lockless, thread-safe hash table? There are lots of them; a web search will turn up plenty. Then you won't have to use unsafe. I would be surprised if Rust does not already have one; Java and C# both do.
It has a HashMap, but bucket replacement schemes are important in chess engines, at least at the somewhat higher levels. I don't know yet how such schemes could be implemented using the Rust HashMap. By default, it uses "Robin Hood bucket stealing":

https://en.wikipedia.org/wiki/Hash_tabl ... od_hashing

Maybe that scheme is already good enough (or maybe it even matches the default in the better chess engines: replace the least useful bucket entry with one that has a better hit rate). I'll see when I get to implementing the transposition table.
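To make the replacement-scheme point concrete: an engine TT is usually not a general-purpose hash map at all, but a fixed-size array of small buckets in which the engine itself picks the victim entry. A minimal single-threaded sketch with a hypothetical entry layout and a simple depth-preferred policy:

```rust
// Fixed-size table of 4-entry buckets. On a store, reuse the slot for the
// same position or an empty slot; otherwise overwrite the shallowest entry.
#[derive(Clone, Copy, Default)]
struct Entry {
    key: u64,
    depth: u8,
    score: i16,
}

const BUCKET_SIZE: usize = 4;

struct TranspositionTable {
    buckets: Vec<[Entry; BUCKET_SIZE]>,
}

impl TranspositionTable {
    fn new(num_buckets: usize) -> Self {
        Self { buckets: vec![[Entry::default(); BUCKET_SIZE]; num_buckets] }
    }

    fn store(&mut self, key: u64, depth: u8, score: i16) {
        let idx = (key as usize) % self.buckets.len();
        let bucket = &mut self.buckets[idx];
        let slot = match bucket.iter().position(|e| e.key == key || e.key == 0) {
            Some(i) => i,
            None => bucket
                .iter()
                .enumerate()
                .min_by_key(|(_, e)| e.depth)
                .map(|(i, _)| i)
                .unwrap(),
        };
        bucket[slot] = Entry { key, depth, score };
    }

    fn probe(&self, key: u64) -> Option<Entry> {
        let idx = (key as usize) % self.buckets.len();
        self.buckets[idx].iter().copied().find(|e| e.key == key)
    }
}

fn main() {
    let mut tt = TranspositionTable::new(1 << 16);
    tt.store(0xABCD_EF01_2345_6789, 8, 35);
    assert!(tt.probe(0xABCD_EF01_2345_6789).is_some());
}
```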
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL