Crafty 25.6 search stability

Cardoso · Post by **Cardoso** » Fri May 01, 2020 10:45 am

I take my hat off to Ras and Bob for your patience, because I was almost loosing mine just by reading this thread. Must be my old age, but it is frustrating to see people just wanting to argue and not wanting to read carefully and making no tests/experiments, even with readily available source code to change and test with.
jhaglund2 has to be a very young man.

MikeB · Post by **MikeB** » Fri May 01, 2020 5:12 pm

Cardoso wrote: ↑Fri May 01, 2020 10:45 am I take my hat off to Ras and Bob for your patience, because I was almost loosing mine just by reading this thread. Must be my old age, but it is frustrating to see people just wanting to argue and not wanting to read carefully and making no tests/experiments, even with readily available source code to change and test with.
jhaglund2 has to be a very young man.

Ditto +1.

bob · Post by **bob** » Sat May 02, 2020 3:27 am

I didn't think anything Joshua wrote was complaining. He seemed to be interested in what was going on and could not understand the odd behavior the transposition table causes (not to mention move ordering can also change which affects LMR/LMP as well. - history counters and such.)

This is one of the reasons I eliminated the "new" command in Crafty. I decided it was better to simply terminate and restart, rather than figuring out what could have been changed and change it back.

petero2 · Post by **petero2** » Sun May 03, 2020 11:36 am

bob wrote: ↑Sat Apr 25, 2020 11:41 pm The threads are not synchronized, if they were performance would be terrible.

I agree, but a little less terrible than what you suggest here:

bob wrote: ↑Sun Apr 26, 2020 8:00 pm About all you can do is allow each thread to search one move and then pause while the other threads do the same, one at a time. And in the same order. Which means no speedup at all, and would actually be slower than a normal alpha/beta search.

I was able to implement a deterministic search where using 2 threads is better than 1 thread, and using 4 threads is slightly better than 2, on modern Intel CPUs. Using more than 4 threads will probably be worse than 4 threads though.

To have a deterministic multithreaded search it is enough that all threads access all shared resources one at a time in a pre-determined order. Between such accesses more than one thread could for example perform position evaluation in parallel.

Exactly how terrible such an implementation would be depends on how much shared state the chess engine has and how often it wants to access it. In Texel I only share the main transposition table. Other data like pawn hash tables and history tables are thread local. I also use a hybrid ABDADA / lazy SMP implementation where threads only have to communicate with each other (except through the transposition table) when thread 0 starts a new search at the root node. This happens very seldomly compared to transposition table access. Also Texel does not use the transposition table below the first ply in q-search, so the number of accesses per second should be lower than for other engines that probe at all plies in the q-search.

As a test I implemented the following in Texel:

Before each read or write access to shared state, each thread is required to call a lock function. After the shared access is completed, the thread has to call an unlock function.
The lock/unlock functions guarantee that at any time, at most one thread has access to the shared state.
The lock function is guaranteed to return in a pre-determined order for all threads, regardless of the order in which the threads call the lock function.
Any thread that is not allowed to run because another thread either currently holds the lock, or is pre-determined to run before the thread that called lock, will simply busy wait until the other threads have finished their shared state manipulation.

The calls to lock/unlock are pre-determined to return in the following order:

Code: Select all

Thread 0, lock
Thread 0, unlock
Thread 1, lock
Thread 1, unlock
...
Thread N-1, lock
Thread N-1, unlock
Thread 0, lock
Thread 0, unlock
...

This implementation has a number of consequences:

Busy waiting in user space is the only realistic option to be able to do millions of lock/unlock operations per second. Busy waiting in user space tends to perform extremely badly though if the number of software threads is larger than the number of free hardware threads (cores or hyperthreads). I have seen the NPS go from 4 million to 100 when this happens.
The search is deterministic in the sense that if you start from the same state and search the same number of nodes, the result will be the same. The state includes hash table size and content, history table content, and number of search threads. For example, the following input:
Code: Select all
```
setoption name threads value 4
setoption name hash value 1024
go depth 18
```
produces the following result:
Code: Select all
```
info depth 18 score cp 28 time 3603 nodes 14225843 nps 3948332 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 d4e5 b8c6 e1g1 f8e7 h2h3 e8g8 b1c3 e4c3 b2c3 h7h6 a1b1 b7b6
...
info nodes 18213121 nps 4012584 hashfull 125 time 4539
```
The only thing that changes from run to run is the time and NPS values.
Because the result is completely deterministic, a side effect in the initial implementation was that all threads except the first searched exactly the same tree, so using more than one helper thread (i.e. more than 2 threads in total) was completely useless. To work around this I made two changes.
The ABDADA code does a read-modify-write operation on the transposition table to decide which move to check and to mark it as currently being searched. Initially this was implemented as:
1. lock
2. read from TT
3. unlock
4. some other code
5. lock
6. write "busy bit" to TT
7. unlock
The fix was to remove steps 3 and 5 so the whole read-modify-write operation becomes atomic with respect to other threads.
I only use the ABDADA code when the remaining search depth is >= 7 and rely on lazy SMP for smaller depths. To get search divergence also for smaller depths I added a small thread-dependent delay before the helper threads start searching:
Code: Select all
```
  for (int i = 0; i < threadNo*2; i++) {
    lock();
    unlock();
  }
```
In practice the search will not be deterministic when playing games, because games are typically constrained by time control and not by number of searched nodes. This effect is present also for single threaded search in almost all chess engines.

Results

I ran tests between single threaded Texel, the standard non-deterministic multi-threaded search, and the modified deterministic multi-threaded search. The time control was 24 seconds + 0.24 seconds/move. I got the following results:

Code: Select all

2 threads deterministic vs 1 thread                 : +19 elo
2x time 2 threads deterministic vs 2x time 1 thread : +25 elo
4 threads deterministic vs 1 thread                 : +32 elo
2 threads standard vs 1 thread                      : +82 elo
2x time 1 thread vs 1x time 1 thread                : +111 elo
2 threads deterministic vs 2 threads standard       : -65 elo

All matches were at least 1000 games long so the results are statistically significant, but differences between close results may not be significant.

This means that using a deterministic search removed about 75% of the strength increase from going from 1 to 2 threads, compared to a non-deterministic search.

For engines that have more shared state than Texel, 2 threads may indeed not be stronger than 1 thread.

I don't think this code is actually useful for anything, but it was an interesting experiment.

Code

The source code for this experiment is available here: https://github.com/peterosterlund2/texe ... ts/regular

Compiled executables are available here: https://www.dropbox.com/s/h3ilbco04ced2 ... ic.7z?dl=1

bob · Post by **bob** » Sun May 03, 2020 9:55 pm

I don't see how that can solve the problem. You say "always access the TT in the same order". Take two threads, A and B. If they access at the same time, you can certainly make it A then B. But what if B does a probe and A has not gotten there yet. In fact, you don't even know that A will EVER get there. How do you handle that. You let B go first this time, then A follows a few microseconds (or more) later, violating the A/B order.

I don't see any way to solve that other than to execute one instruction from each thread, in the same order as the serial search.

petero2 · Post by **petero2** » Tue May 05, 2020 6:59 pm

bob wrote: ↑Sun May 03, 2020 9:55 pm I don't see how that can solve the problem. You say "always access the TT in the same order". Take two threads, A and B. If they access at the same time, you can certainly make it A then B. But what if B does a probe and A has not gotten there yet. In fact, you don't even know that A will EVER get there. How do you handle that.

It basically works like this:

All threads are seated at a round table.
One thread is given a token.
Only the thread that has the token is allowed to access the shared state.
When a thread is finished with the token it gives it to the neighbor on the right.
When a thread has finished all work (never wants the token again) it leaves the table.

To understand the source code I linked to before, perform the following mapping:

Wait for the token : Call the lock() function.
Give the token to the right neighbor. Call the unlock() function.
Leave the table : Call the quit() function.

In the source code there is a "flags" array. It holds one integer value for each thread with the following meaning:
0 : The thread currently does not hold the token.
1 : The thread currently does hold the token.
2 : The thread has left the table.

There are also padding values in the flags array to make sure each used flag is in a separate cache line.

In order to guarantee progress the threads must follow some additional rules:

A thread must in a timely manner call either the lock() or quit() function if it does not currently have the token.
A thread must in a timely manner call the unlock() function if it currently does have the token.
After a thread has called the quit() function, it must not call the lock()/unlock()/quit() functions again, any must not access the shared state.

I create new threads each time the engine starts thinking. When a thread is finished thinking for a move it calls the quit() function. When all threads have called the quit() function the move to play has been decided and is reported to the GUI. The threads can then be killed.

bob · Post by **bob** » Tue May 05, 2020 7:57 pm

Do ALL threads sit at the "same table" when accessing (say) the transposition/refutation table (hash table)? If so this would really seem to slow things down. Suppose one of them is in the q-search and doesn't even want to access the thing? Or suppose it does, but at a later time after the others have "left the table"?

This is the issue I brought up about any sort of synchronization to provide determinism. Either the cost is prohibitive or it doesn't quite work...

BTW, the "never wants the token again"??? How can that be known in advance? Who knows when a deeper transposition will bring us back to this same position, and now the order has been violated.

jhaglund2 · Post by **jhaglund2** » Fri May 08, 2020 4:11 am

Thank you Peter Osterlund! You are a true master and marvel. My hero. Thank you for understanding!

I can verify the results (PV lines) are deterministic for each thread value versus the same thread value.
Your results are interesting. I'm currently running some tests with Texel.

I'm very pleased to see this come to life!! I'm so impressed.

The usefulness comes later.
The purpose of this was, not only to help Crafty, and point out that, we need to do some sort of sanity check at the end of the PV lines.

I had a solution. I call it PV jumping. You jump to the end of your PV line, and see how your are doing. e.g., Ply 20 + sanity check (20 plies (or whatever works best for you)). That's a total of ply 40. You work backwards in the PV line, if the results are unfavorable. These methods can be used to get game results.

The other purpose was to lay a foundation for testing grounds. To get solid, reliable, repeatable results each time around, even if it means slightly weaker right now.

While, we can get around 100-200 million nodes per second. I think it's time we start calculating games per second to help decide your chess moves.

NPS @ 100 moves per game.
2m = 20k gps
10m = 100k gps
20m = 200k gps
100m = 1m gps
200m = 2m gps

How many games per second do you think you could generate?

Then, would follow the announcement of Crafty with a new MCST-like method. Calculating it's move to play from Games per Second (gps) results.

Thanks again Peter!

petero2 · Post by **petero2** » Fri May 08, 2020 11:49 pm

bob wrote: ↑Tue May 05, 2020 7:57 pm Do ALL threads sit at the "same table" when accessing (say) the transposition/refutation table (hash table)?

Yes, there is only "one table", in order to guarantee deterministic results.

bob wrote: ↑Tue May 05, 2020 7:57 pm If so this would really seem to slow things down. Suppose one of them is in the q-search and doesn't even want to access the thing?

This is correct. If one thread is delayed every other thread will eventually have to wait for it. It will not cause deadlocks though because even a complicated q-search will eventually terminate.

bob wrote: ↑Tue May 05, 2020 7:57 pm This is the issue I brought up about any sort of synchronization to provide determinism. Either the cost is prohibitive or it doesn't quite work...

My only point was that the cost was slightly smaller than you suggested. Instead of 2 threads being weaker than 1 thread I found that in Texel 2 threads are slightly better (+20 Elo) than one thread.

I think we both agree that this is a horrible way to implement multithreaded search in an alpha-beta based chess engine.

bob wrote: ↑Tue May 05, 2020 7:57 pm BTW, the "never wants the token again"??? How can that be known in advance? Who knows when a deeper transposition will bring us back to this same position, and now the order has been violated.

When the GUI sends the "go" command to the engine, the engine creates all the search threads and puts them around the table. At some later point in time, the engine decides that it is now time to make a move. At that point all threads have to somehow be informed that they should stop searching. It is only when a thread has stopped searching it will "leave the table".

For the next search the old threads are killed and new threads are put around the table again.

bob · Post by **bob** » Sat May 09, 2020 6:18 pm

There are lots of "non-solution solutions."

For example, you can allow each thread to handle one node and then wait for the rest to do the same, in thread-order. But it offers no speedup of any kind. The primary goal of parallel search is to search faster. Or search a deterministic game tree. Unfortunately not both. Everyone I know of has gone for faster non-deterministic search. Which is more important? Higher quality moves or exactly reproducible analysis? You can't have both.

Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability

Re: Crafty 25.6 search stability