SMP, first shot at implementation

fabianVDW · Post by **fabianVDW** » Sun Sep 13, 2020 9:09 pm

mvanthoor wrote: ↑Sat Sep 12, 2020 11:38 pm
According to FabianVDW (of FabChess, which is also an engine in Rust), mutex locking works, but is too slow if an engine uses more than 8 threads. Then the overhead of the locks/unlocks becomes so big that more threads don't add any more speed. He says that "racey access to the hash table is necessary." That can be done with unsafe Rust, but it is not the preference, as this will drop you into manual memory management territory: YOU make sure that everything is correct instead of the compiler being able to check it for you. If you wrap code by unsafe { }, the compiler won't help you at all.

If the problem starts with 8 threads only, I think I'll first implement proper locked hash table sharing (when the time for SMP comes for my engine... after I hit at least 2850 ELO on a single thread), and if it isn't fast enough when running more than 4 threads, I'll look into lockless/unsafe Rust then.

I think I stand by what I said in the past.IIRC with the XOR trick I measured nearly no collision, since the writes are atomic. It doesn't change the fact that everything surrounding the hash table is inherently unsafe - funnily this does not even change using safe Rust only, although one might trick oneself into thinking so. Hash key collisions happen often enough at TCEC timecontrols and hardware, such that a lot of engines (including FabChess) heavily crashed in pre-season TCEC testing. Of course this is a different kind of unsafe than we usually envision in the context of Rust, but a non crashing engine is ultimately the goal we want to achieve. The easiest solution is to treat everything coming out of the hash table as complete garbage(mainly we can check if TTMove bytes correspond to a move enconding and the move contained in the move encoding is valid in the position). Then one can also easily disregard any kinds of data races - simply because validness of TTMove will be checked anyway(the score might change, but I think we can handwave that - or can we?).

Nonetheless, my feeling is that when thread-> infinity the strategy of allowing data races is suboptimal.

mvanthoor · Post by **mvanthoor** » Mon Sep 14, 2020 12:27 am

fabianVDW wrote: ↑Sun Sep 13, 2020 9:09 pm The easiest solution is to treat everything coming out of the hash table as complete garbage(mainly we can check if TTMove bytes correspond to a move enconding and the move contained in the move encoding is valid in the position).

You could just decode the 64-bit integer coming out of the TT, and check if the resulting move is actually legal; but doesn't this cause a lot of performance loss? The procedure should be faster than actually making, evaluating and unmaking the move, but still, having to check the hash entry for validity, does negate part of why you'd actually use a hash table: to minimize (re)computation of known values.

Nonetheless, my feeling is that when thread-> infinity the strategy of allowing data races is suboptimal.

Yes, but you're also correct that if you have a huge number of threads using a single object, locking and unlocking that object for each and every thread will eventually take so much time that lots of threads will just be sitting around waiting for their tiny time slice to come along. Then you'd get to the point where adding more threads is actually detrimental to the performance of the engine.

Maybe, if an engine has so many cores available, it would be better to run two evaluations at the same time: an alpha/beta one, and a neural network one, both on the same position, but with their own hashtable, then pick the move that has the highest score. Or three evaluations, or four... and it probably has been done already. (I remember the Shredding GUI a "triple brain" feature where it runs three chess engines at once, and of the three resulting moves, it chooses the one with the highest score; this is basically the same thing I'm proposing here, but with three different engines instead of three different evaluations.)

abulmo2 · Post by **abulmo2** » Mon Sep 14, 2020 9:47 am

chrisw wrote: ↑Sun Sep 13, 2020 12:01 pm However, I still have more or less the same non-linear scaling of nps with thread count. Does anyone have any stats of their own experience of linearity of

Amoeba got the following results, on a ryzen 7 1700x (8 cores / 16 threads) all cores at 3.5 Ghz :

On 1 thread it is slightly faster (2 Mnps) than on 2 to 8 threads (1.67xNthreads), but the relationship is pretty linear. On 9 to 16 threads, using SMP, the speed increase per threads is halved (0.86xNthreads) but still with a good linearity.

chrisw · Post by **chrisw** » Mon Sep 14, 2020 10:58 am

abulmo2 wrote: ↑Mon Sep 14, 2020 9:47 am
chrisw wrote: ↑Sun Sep 13, 2020 12:01 pm However, I still have more or less the same non-linear scaling of nps with thread count. Does anyone have any stats of their own experience of linearity of
Amoeba got the following results, on a ryzen 7 1700x (8 cores / 16 threads) all cores at 3.5 Ghz :

On 1 thread it is slightly faster (2 Mnps) than on 2 to 8 threads (1.67xNthreads), but the relationship is pretty linear. On 9 to 16 threads, using SMP, the speed increase per threads is halved (0.86xNthreads) but still with a good linearity.

Thanks, useful. I’ve not begun any experiments to get to the bottom of this, but I’m wondering just how much ones memory usage plays in this. So far my engine has been one thread only, with design feature built in that it would one day (two days ago) become multi-threaded. My programming tends to be quite sloppy anyway, and I’ve made practically zero effort to be RAM size efficient, so I’m wondering if my non-linearity isn’t just a function of N threads blowing everything out of cache limits. Generally cleaning up over-lengthy arrays won’t take me too long, so that’s on todo list, then repeat tests.
One thing I never bothered to research, because never had to, is caching. I always wondered but never looked up, does each core have some own cache, or not? Or does this depend on type of cache?

chrisw · Post by **chrisw** » Mon Sep 14, 2020 11:10 am

mvanthoor wrote: ↑Mon Sep 14, 2020 12:27 am
fabianVDW wrote: ↑Sun Sep 13, 2020 9:09 pm The easiest solution is to treat everything coming out of the hash table as complete garbage(mainly we can check if TTMove bytes correspond to a move enconding and the move contained in the move encoding is valid in the position).
You could just decode the 64-bit integer coming out of the TT, and check if the resulting move is actually legal; but doesn't this cause a lot of performance loss? The procedure should be faster than actually making, evaluating and unmaking the move, but still, having to check the hash entry for validity, does negate part of why you'd actually use a hash table: to minimize (re)computation of known values.

Nonetheless, my feeling is that when thread-> infinity the strategy of allowing data races is suboptimal.
Yes, but you're also correct that if you have a huge number of threads using a single object, locking and unlocking that object for each and every thread will eventually take so much time that lots of threads will just be sitting around waiting for their tiny time slice to come along. Then you'd get to the point where adding more threads is actually detrimental to the performance of the engine.

Maybe, if an engine has so many cores available, it would be better to run two evaluations at the same time: an alpha/beta one, and a neural network one, both on the same position, but with their own hashtable, then pick the move that has the highest score. Or three evaluations, or four... and it probably has been done already. (I remember the Shredding GUI a "triple brain" feature where it runs three chess engines at once, and of the three resulting moves, it chooses the one with the highest score; this is basically the same thing I'm proposing here, but with three different engines instead of three different evaluations.)

That’s been done (at a relatively simplistic level), it’s relatively easy with python and python chess to launch engines simultaneously and process results in real time to give you a composite engine.
In C it’s perfectly possible also to the same, but take it further. One thought did occur to me, what about using different eval or even different search AND a common hash table. At first sight this might seem crazy with (Random) N different based evals coming back out of a hash entry, but I wonder ....
Perhaps not use the eval, but use the search guidance. No idea if anyone ever tried this.

Dann Corbit · Post by **Dann Corbit** » Mon Sep 14, 2020 2:15 pm

mvanthoor wrote: ↑Sun Sep 13, 2020 7:49 pm
Dann Corbit wrote: ↑Sun Sep 13, 2020 11:03 am For Rust, if you want to stay safe, why not just use a lockless thread safe hash table. There are lots of them. A web search will turn up plenty. Then you won't have to use unsafe. I am surprised if Rust does not already have one. Java and C# both have one.
It has a hashmap, but bucket replacement schemes are important in chess engines; at least at the somewhat higher levels. I don't know yet in what way such schemes can be implemented using the Rust hashmap. By default, it uses "Robin Hood Bucket Stealing":

https://en.wikipedia.org/wiki/Hash_tabl ... od_hashing

Maybe that scheme is already good enough (or even the default in the better chess engines: replace the worst used bucket entry with one that has a better hit rate). I'll see when I get to implementing the transposition table.

Make sure that the hashmap is thread safe.
C# (for instance) thas thread safe containers and containers that are not thread safe.

Henk · Post by **Henk** » Mon Sep 14, 2020 3:35 pm

Yes you could use ImmutableDictionary. But it has its price.
For instance I don't get much faster than 0.1 million nodes per second.
I remember speed of 1 million nodes per second in the past.

Maybe used too many immutable variables and datastructures.
Don't know. Or assigning priorities to moves taking too much time now.
But might even have a NodeCount bug.

Joost Buijs · Post by **Joost Buijs** » Mon Sep 14, 2020 7:21 pm

All this fancy stuff will put you in a straitjacket, I don't know about Rust ('Rust roest' we say here in Holland), but I still think C/C++ is the only serious language to write a chess engine in. Most other languages output slower code (or the compilers are worse), for instance I keep a shadow copy of my engine in Pascal (Free Pascal to be exact) and it is at least 30 to 40% slower as my C++ version.

fabianVDW · Post by **fabianVDW** » Mon Sep 14, 2020 8:16 pm

Joost Buijs wrote: ↑Mon Sep 14, 2020 7:21 pm All this fancy stuff will put you in a straitjacket, I don't know about Rust ('Rust roest' we say here in Holland), but I still think C/C++ is the only serious language to write a chess engine in. Most other languages output slower code (or the compilers are worse), for instance I keep a shadow copy of my engine in Pascal (Free Pascal to be exact) and it is at least 30 to 40% slower as my C++ version.

I would love to disprove this quote with an actual fast chess engine, but sadly FabChess is not, since I am not the best programmer. Have a look at Asymptote or Rustic though, they are both very fast. Most compiled Rust code compiles to similar speed as C/C++. Convince yourself with a quick google search about Rust performance vs C++ performance if you want. Fact is that you can port C code to Rust with unsafe and achieve the same performance pretty much guaranteed - this would include a lot of unsafe that one doesn't want though.

mvanthoor wrote: ↑Mon Sep 14, 2020 12:27 am
fabianVDW wrote: ↑Sun Sep 13, 2020 9:09 pm The easiest solution is to treat everything coming out of the hash table as complete garbage(mainly we can check if TTMove bytes correspond to a move enconding and the move contained in the move encoding is valid in the position).
You could just decode the 64-bit integer coming out of the TT, and check if the resulting move is actually legal; but doesn't this cause a lot of performance loss? The procedure should be faster than actually making, evaluating and unmaking the move, but still, having to check the hash entry for validity, does negate part of why you'd actually use a hash table: to minimize (re)computation of known values.

This is exactly what I do: I take the 16-bit integer which represents the TTMove and check if the resulting move is actually legal with a different procedure than make, solely written for this purpose(first I check if the 16bits even describe a correct move, because I don't save the GameMove enum in the HashTable, but rather 16bytes, to save memory). You are also right in that this also results in a performance loss - however after implementing it in FabChess instead of locks I remember even a slight increase in NPS even on a single thread, much faster on multiple threads. Also: What are you going to do about it? If you don't use perfect hashing(which will cost storage) you have to validate if the move is legal or your engine has to be resilient to making illegal moves in some positions and not crashing.

lucasart · Post by **lucasart** » Tue Sep 15, 2020 2:35 am

fabianVDW wrote: ↑Sun Sep 13, 2020 9:09 pm I think I stand by what I said in the past.IIRC with the XOR trick I measured nearly no collision, since the writes are atomic. It doesn't change the fact that everything surrounding the hash table is inherently unsafe - funnily this does not even change using safe Rust only, although one might trick oneself into thinking so. Hash key collisions happen often enough at TCEC timecontrols and hardware, such that a lot of engines (including FabChess) heavily crashed in pre-season TCEC testing. Of course this is a different kind of unsafe than we usually envision in the context of Rust, but a non crashing engine is ultimately the goal we want to achieve. The easiest solution is to treat everything coming out of the hash table as complete garbage(mainly we can check if TTMove bytes correspond to a move enconding and the move contained in the move encoding is valid in the position).

Completely agree. It's important to highlight, for those who still believe in Father Christmas, that the XOR trick does NOT eliminate hash collisions. It simply reduces their occurrence by some orders of magnitude.

A while ago Demolitio had a fatal bug exposed by HT race (it wasn't the usual playing a of illegal HT move, but something far less obvious). Instead of finding the root cause and fixing it, I got lazy and implemented the XOR hashing and it "solved" the problem. That just hid it under the carpet, nothing more…

Of course the bug didn't occur in my home testing conditions, and woke up in TCEC.

SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation

Re: SMP, first shot at implementation