Crafty 25.6 search stability

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jhaglund2
Posts: 65
Joined: Mon Jan 16, 2017 6:28 pm

Re: Crafty 25.6 search stability

Post by jhaglund2 »

Ras - <...> multi-threading with several threads writing to the hash table isn't deterministic, and it depends on when which thread writes and reads in relation to other worker threads.
With multithreading, it is effectively random how many reads will return the same value before it changes.
Why read something that is going to change before it's used?
Why write information if it will change?
Why have multiple threads accessing the same data at the same time, when there is other work to be done?
Right. Believe me, there really is NO solution to this, if the goal is to have the parallel program run faster than the serial program. Painful but true. You can try any parallel chess program you can get your hands on, and see the same non-deterministic behavior.
The goal is to have the same PV lines displayed for each ply, each time you do a search on a given position, with and without multiple threads.
Think of it like a perft. You will have x number of positions. There's no reason the result should ever differ. It would be like saying there's no such thing as a multithreaded perft. If the information isn't the same, it's not being handled correctly...

I don't understand why you're suggesting synchronization doesn't work with multiple threads.

Synchronization Pthread Example - Mutexes
A mutex lock is a mechanism that can be acquired by only one thread at a time. For other threads to get the same mutex, they must wait until it is released by the current owner of the mutex.

The key advantage of multithreaded code is that all threads see the same memory, so data is already shared between threads. But failing to coordinate access to that data can lead to incorrect results due to problems such as data races. The mutex lock is one way of synchronizing access to shared data.

1. Synchronization with Mutex
The mutual exclusion lock is the simplest and most primitive synchronization variable. It provides a single, absolute owner for the section of code (aka a critical section) that it brackets between the calls to pthread_mutex_lock() and pthread_mutex_unlock(). The first thread that locks the mutex gets ownership, and any subsequent attempts to lock it will fail, causing the calling thread to go to sleep. When the owner unlocks it, one of the sleepers will be awakened, made runnable, and given the chance to obtain ownership.
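As a minimal sketch of the bracketing described above (the worker/counter names and iteration counts are invented for illustration), several threads increment a shared counter inside a pthread_mutex_lock()/pthread_mutex_unlock() critical section:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Each worker increments the shared counter; the mutex brackets the
   critical section so no increment is lost. */
static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);   /* first locker gets ownership */
        counter++;                   /* critical section */
        pthread_mutex_unlock(&lock); /* a sleeper, if any, is awakened */
    }
    return NULL;
}

long run_workers(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return counter;
}
```

Without the lock, the increments race and the final count is usually lower than expected; with it, every run produces the same total.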
2. Synchronization with Mutex and Condition Variable
A condition variable enables threads to communicate about state changes. In other words, a condition variable allows one thread to inform other threads about changes in the state of a shared resource, and allows the other threads to wait for such notification. It lets a thread sleep (wait) until another thread signals that some condition has arisen and it must respond. Without condition variables, the waiting thread would have to poll to check whether the condition has become true.

Code: Select all

#include <mutex>
#include <condition_variable>

std::mutex mtx;             // mutex for critical section
std::condition_variable cv; // condition variable for critical section
bool ready = false;         // tell threads to run
int current = 0;            // current count

void wait_turn(int num)     // each thread waits until its number comes up
{
    std::unique_lock<std::mutex> lck(mtx);
    while (num != current || !ready) { cv.wait(lck); }
    current++;
}
Notify:

Code: Select all

void run()
{  
    std::unique_lock<std::mutex> lck(mtx);  
    ready = true;  
    cv.notify_all();
}
3. Synchronization with Spin Locks
Spin locks are essentially mutex locks that busy-wait instead of sleeping.
A spin lock polls its lock condition repeatedly until that condition becomes true. Spin locks are most often used on multiprocessor systems where the expected wait time for a lock is small. In these situations, it is often more efficient to poll than to block the thread, which involves a context switch and the updating of thread data structures.
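A spin lock along these lines can be sketched with a C11 atomic flag (the spin_lock/spin_unlock names are invented for the example; production code would also consider a pause instruction and back-off):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag flag = ATOMIC_FLAG_INIT;

/* Poll (busy-wait) until the flag was previously clear; acquire ordering
   ensures the critical section's accesses happen after the lock is taken. */
static void spin_lock(void)
{
    while (atomic_flag_test_and_set_explicit(&flag, memory_order_acquire))
        ;  /* spin: cheaper than a context switch when waits are short */
}

static void spin_unlock(void)
{
    atomic_flag_clear_explicit(&flag, memory_order_release);
}
```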

4. Synchronization with Semaphores
A semaphore is a counting and signaling mechanism. We use it to allow threads access to a specified number of items. If there is a single item, then a semaphore is virtually the same as a mutex.
However, it is more commonly used in a situation where there are multiple items to be managed. Semaphores can also be used to signal between threads or processes. For example, to tell another thread that there is data present in a queue. There are two types of semaphores: named and unnamed semaphores.

Code: Select all

#include <semaphore.h>

int main()
{
        sem_t mySemaphore;
        sem_init(&mySemaphore, 0, 5);
        //...
        sem_destroy(&mySemaphore);
        return 0;
}
In the code, we initialized a semaphore with a count of 5. The second parameter of the call to sem_init() is 0, which makes the semaphore private to the process (shared only among its threads). Passing a non-zero value would enable the semaphore to be shared between multiple processes.
Producer / Consumer (Bounded buffer) problem
We will address the producer/consumer (aka Bounded Buffer) problem.
Suppose there are one or more producer threads and one or more consumer threads. Producers produce data items and wish to place them in a buffer. Consumers then grab data items out of the buffer and consume the data in some way.
Because the bounded buffer is a shared resource, we must of course require synchronized access to it to avoid any race condition. To understand this problem better, let us examine some actual code:

Code: Select all

sem_t empty, full;        // counting semaphores (declarations added)
sem_init(&empty, 0, MAX); // MAX buffers are empty to begin with...
sem_init(&full, 0, 0);    // ... and 0 are full
void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&empty);
        put(i);
        sem_post(&full);
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&full);
        int b = get();
        sem_post(&empty);
        printf("%d\n", b);
    }
}
Suppose there are multiple producers and multiple consumers. We now have a problem: a race condition.

Imagine two producers both calling into put() at the same time. Assume producer 1 gets to run first, and just starts to fill the first buffer entry (fill is still 0 at the line buffer[fill] = value;). Before the producer gets a chance to increment the fill counter to 1, it is interrupted. Producer 2 starts to run, and at the same line of code, it also puts its data into the 0th element of buffer, which means that the old data there is overwritten!

As we can see, what we've forgotten here is mutual exclusion. The filling of a buffer and the incrementing of the index into the buffer form a critical section, and thus must be guarded carefully. So let's use a binary semaphore and add some locks.

Code: Select all

sem_t mutex;
sem_init(&mutex, 0, 1);    // mutex = 1 since it is a lock
void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&mutex); 
        sem_wait(&empty);
        put(i);
        sem_post(&full);
        sem_post(&mutex); 
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&mutex); 
        sem_wait(&full);
        int b = get();
        sem_post(&empty);
        sem_post(&mutex); 
        printf("%d\n", b);
    }
}
Now we’ve added some locks around the entire put()/get() parts of the code. However, it still doesn’t work. Why? Deadlock!
Imagine the consumer runs first. It acquires the mutex, then calls sem_wait(&full); because no data has been produced yet, the consumer blocks while still holding the mutex. A producer then runs. It has data to produce, and if it were able to run, it would wake the consumer thread and all would be good. Unfortunately, the first thing it does is call sem_wait(&mutex) on the binary mutex semaphore. The lock is already held. Hence, the producer is now stuck waiting too.

Solution: We simply must reduce the scope of the lock.

Code: Select all

void *producer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&empty);
 
        // scope of lock reduced
        sem_wait(&mutex);
        put(i);
        sem_post(&mutex);

        sem_post(&full);
    }
}

void *consumer(void *arg) {
    int i;
    for (i = 0; i < loops; i++) {
        sem_wait(&full);

        // scope of lock reduced
        sem_wait(&mutex);
        int b = get();
        sem_post(&mutex);

        sem_post(&empty);
        printf("%d\n", b);
    }
}
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty 25.6 search stability

Post by bob »

Your questions at the top. Hash table is the problem. 1, 2 or more threads can store and read the same position. That's called a race condition. If you synchronize there it becomes a huge bottleneck. There are other things that need to be read and written by different threads at the same time. Pawn hash for example. The act of synchronization is slow. Involves lots of cache invalidation on every other processor when you write something. Requires waiting when another thread has exclusive access to an entry.

Trust me, MANY have looked at this issue. You can mitigate the issue by limiting probe depth so there are not so many race conditions, but then you lose most of the benefit of the hash table. It is just something we all chose to accept to produce the speed gains we see...
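One mitigation worth noting in this context is the lockless ("XOR") transposition-table technique described by Hyatt and Mann, which detects torn entries instead of locking. A minimal sketch, with an invented two-word entry layout:

```c
#include <stdint.h>

/* Lockless hashing sketch: store key ^ data instead of the key. If two
   threads interleave their two 64-bit writes, the XOR check fails on
   probe and the torn entry is simply treated as a miss. */
typedef struct {
    uint64_t key_xor_data;
    uint64_t data;
} tt_entry;

static void tt_store(tt_entry *e, uint64_t key, uint64_t data)
{
    e->key_xor_data = key ^ data;
    e->data         = data;
}

/* Returns 1 and fills *out on a valid hit, 0 on a miss or torn entry. */
static int tt_probe(const tt_entry *e, uint64_t key, uint64_t *out)
{
    uint64_t data = e->data;
    if ((e->key_xor_data ^ data) != key)
        return 0;
    *out = data;
    return 1;
}
```

This accepts an occasional lost entry in exchange for zero synchronization cost on the per-node fast path.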

Further down... I didn't say synchronization doesn't work with multiple threads. I said it doesn't work with chess, specifically. Too much shared data. If you look up a good paper on parallel programming, you will find they typically split the data into chunks and let each thread work on its own data for a while, then synchronize with the other threads to combine the results. Can't do this in chess. No way to divide the tree because you don't know what it looks like until you search it, and then it is too late.

BTW I DO use locks in crafty. You have to in specific places. But NOT in places that are executed once per node, as an example...

For the rest of your questions, you simply have to trust me. I taught parallel programming for 40+ years. Everything you wrote is pretty much spot on, it just doesn't apply / work with chess...

For an example, it would be pretty easy, if your assignment is to dig 400 post holes, to use 4 people and get it done 4x faster.

But imagine 16 or 32 people trying to analyze a chess position at the same time. Efficiently with no duplication of effort. MUCH harder.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Crafty 25.6 search stability

Post by Michael Sherwin »

bob wrote: Tue Apr 28, 2020 11:25 pm Your questions at the top. Hash table is the problem. 1, 2 or more threads can store and read the same position. That's called a race condition. If you synchronize there it becomes a huge bottleneck. There are other things that need to be read and written by different threads at the same time. Pawn hash for example. The act of synchronization is slow. Involves lots of cache invalidation on every other processor when you write something. Requires waiting when another thread has exclusive access to an entry.

Trust me, MANY have looked at this issue. You can mitigate the issue by limiting probe depth so there are not so many race conditions, but then you lose most of the benefit of the hash table. It is just something we all chose to accept to produce the speed gains we see...

Further down... I don't say synchronization doesn't work with multiple threads. I said it doesn't work with chess, specifically. Too much shared data. If you look up a good paper on parallel programming, you will find they typically split the data into chunks and let each thread works on its own data for a while, then synchronizes with other threads to combine the results. Can't do this in chess. No way to divide the tree because you don't know what it looks like until you search it, and then it is too late.

BTW I DO use locks in crafty. You have to in specific places. But NOT in places that are executed once per node, as an example...

For the rest of your questions, you simply have to trust me. I taught parallel programming for 40+ years. Everything you wrote is pretty much spot on, it just doesn't apply / work with chess...
All these problems with parallelism in chess are why, as soon as I get my new engine up and running, I want to experiment with the idea of having one or more dedicated threads to process hash requests. The idea is for a search thread to put in a request for a hash read before it is needed, in hopes that the result will be waiting in its mailbox when it is needed. Any idea how many search threads could be serviced by a dedicated 'librarian' thread before delays would happen? I don't know why someone more capable does not test this idea. It would not be that difficult to implement.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Crafty 25.6 search stability

Post by Ras »

jhaglund2 wrote: Tue Apr 28, 2020 11:01 pmWhy have multiple threads accessing the same data at the same time
Because shared data in the hash table is how the worker threads exchange data! If you synchronise, that will be so slow that you might as well just go back to a single thread. If each thread gets its own separate hash table, then the threads won't exchange search data anymore, and the speedup from multithreading will be negligible.
I don't understand why you're suggesting synchronization doesn't work with multiple threads.
That's because you misunderstand the whole situation. You think people wouldn't even know basic stuff like thread synchronisation - come on, this is beginner level. Of course people know mutexes, semaphores and spin locks! They have even tried them and found that the performance is abysmal in this context.

Another point is that determinism would be nice, but the whole purpose of multiple worker threads in a chess engine is getting faster search results, not deterministic ones. If the result is a different move and slightly different evaluation in each run, that's completely OK as long as they are of similar quality.
Rasmus Althoff
https://www.ct800.net
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty 25.6 search stability

Post by bob »

Michael Sherwin wrote: Wed Apr 29, 2020 4:39 am
bob wrote: Tue Apr 28, 2020 11:25 pm Your questions at the top. Hash table is the problem. 1, 2 or more threads can store and read the same position. That's called a race condition. If you synchronize there it becomes a huge bottleneck. There are other things that need to be read and written by different threads at the same time. Pawn hash for example. The act of synchronization is slow. Involves lots of cache invalidation on every other processor when you write something. Requires waiting when another thread has exclusive access to an entry.

Trust me, MANY have looked at this issue. You can mitigate the issue by limiting probe depth so there are not so many race conditions, but then you lose most of the benefit of the hash table. It is just something we all chose to accept to produce the speed gains we see...

Further down... I don't say synchronization doesn't work with multiple threads. I said it doesn't work with chess, specifically. Too much shared data. If you look up a good paper on parallel programming, you will find they typically split the data into chunks and let each thread works on its own data for a while, then synchronizes with other threads to combine the results. Can't do this in chess. No way to divide the tree because you don't know what it looks like until you search it, and then it is too late.

BTW I DO use locks in crafty. You have to in specific places. But NOT in places that are executed once per node, as an example...

For the rest of your questions, you simply have to trust me. I taught parallel programming for 40+ years. Everything you wrote is pretty much spot on, it just doesn't apply / work with chess...
All these problems with parallelism in chess is why, as soon as I get my new engine up and running, I want to experiment with the idea of having a dedicated thread(s) to process hash request. The idea is for a search thread to put in a request for a hash read before it is needed in hopes that it will be waiting in its mailbox when it is needed. Any idea how many search threads can be serviced by a dedicated 'librarian' thread before delays would happen? I don't know why someone more capable does not test this idea. It would not be that difficult to implement.
Won't work. Two or more threads can STILL try to access the same hash entry, and you would have to somehow order them. Order them in a way that each time you run the position, they access the TT in the same order. But what happens when one thread is a bit slow due to its core handling an interrupt or whatever? There is no way to order the requests identically when you might not even have one (or more) of the requests yet.

Single HT access was common in distributed chess programs like Sun Phoenix, plus Platt's program and others, to solve the problem of distributed memory as opposed to the shared memory in today's SMP or NUMA platforms.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty 25.6 search stability

Post by bob »

Ras wrote: Wed Apr 29, 2020 10:36 am
jhaglund2 wrote: Tue Apr 28, 2020 11:01 pmWhy have multiple threads accessing the same data at the same time
Because shared data in the hashtable is how the worker threads exchange data! If you synchronise, that will be so slow that you could as well just go back to a single thread. If each thread gets its own separate hash table, then the threads won't exchange search data anymore, and the speedup from multithreading will be negligible.
I don't understand why you're suggesting synchronization doesn't work with multiple threads.
That's because you misunderstand the whole situation. You think people wouldn't even know basic stuff like thread synchronisation - come on, this is beginner level. Of course people know mutexes, semaphores and spin locks! They have even tried them and found that the performance is abysmal in this context.

Another point is that determinism would be nice, but the whole purpose of multiple worker threads in a chess engine is getting faster search results, not deterministic ones. If the result is a different move and slightly different evaluation in each run, that's completely OK as long as they are of similar quality.
Well said. And as I mentioned, there are SOME places where synchronization is needed. I eliminated most of the common ones in Crafty dealing with thread splitting. But when two searches (parallel threads) complete at the same time, you have to synchronize, or you can back up the wrong score and PV due to a simple race.

For example

Thread A completes and asks "is my score of 0.30 the best at this ply so far?" The answer is yes. Then thread A gets suspended while its core handles either an interrupt or a higher-priority task. Along comes thread B which asks "is my score of 3.25 the best so far?" Answer = yes. Thread B writes its result. Then thread A resumes and overwrites thread B's result, and you just missed winning a piece. So some sync points are mandatory. But hopefully, if you limit splitting in a reasonable way, it won't happen very often, and the cases where one thread waits on another can be made VERY small. If you look at Crafty, you would see that version 25.0 was all about REDUCING the number of synchronization points, rather than increasing them.
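The sequence described above is the classic lost-update race; the mandatory sync point amounts to something like this sketch (names invented, not Crafty's actual code):

```c
#include <pthread.h>

static pthread_mutex_t best_lock = PTHREAD_MUTEX_INITIALIZER;
static int best_score = -32768;

/* Called by a thread when its parallel search at this ply completes.
   Without the lock, thread A's stale 0.30 could overwrite B's 3.25:
   both compare first, then both write, and the last writer wins. */
static void report_result(int score /*, PV, best move, ... */)
{
    pthread_mutex_lock(&best_lock);
    if (score > best_score) {
        best_score = score;
        /* copy the PV and best move here, inside the same lock,
           so the score and its line can never get out of sync */
    }
    pthread_mutex_unlock(&best_lock);
}
```

The compare and the write must sit inside one critical section; locking only the write would not fix the race.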
jhaglund2
Posts: 65
Joined: Mon Jan 16, 2017 6:28 pm

Re: Crafty 25.6 search stability

Post by jhaglund2 »

bob - Hash table is the problem. 1, 2 or more threads can store and read the same position. That's called a race condition. If you synchronize there it becomes a huge bottleneck.
Let's focus on the hash table.
Can you help me demonstrate this in Crafty? I want to experience this bottleneck.
Ras - They have even tried them and found that the performance is abysmal in this context.
I don't care about speed. Consistency and accuracy are more important to me.
Ras - <...> determinism would be nice, but the whole purpose of multiple worker threads in a chess engine is getting faster search results, not deterministic ones.
I think multiple threads should be used to do math faster, which is deterministic.

Say you want to count nodes to 10 billion. 1 thread counts to 10 billion in 16 seconds. 2 threads count in 8 seconds. 4 threads count in 4 seconds. 8 threads count in 2 seconds. All scenarios, it counts to 10 billion each time, no matter how many times you do the math.
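The counting example described here is embarrassingly parallel: if each thread gets a disjoint chunk and writes only its own slot, the total is the same on every run. A sketch (chunk math and names invented, with the count scaled down):

```c
#include <pthread.h>
#include <stdint.h>

#define NTHREADS 4
static const uint64_t TOTAL = 10000000ULL;  /* scaled down from 10 billion */

static uint64_t partial[NTHREADS];

/* Each thread counts its own disjoint chunk; there are no shared writes,
   so the combined total is identical regardless of thread scheduling. */
static void *count_chunk(void *arg)
{
    intptr_t id = (intptr_t)arg;
    uint64_t lo = TOTAL * id / NTHREADS;
    uint64_t hi = TOTAL * (id + 1) / NTHREADS;
    uint64_t c = 0;
    for (uint64_t i = lo; i < hi; i++)
        c++;
    partial[id] = c;
    return NULL;
}

uint64_t parallel_count(void)
{
    pthread_t t[NTHREADS];
    for (intptr_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, count_chunk, (void *)i);
    uint64_t total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    return total;
}
```

The catch, as the replies in this thread explain, is that an alpha-beta tree cannot be pre-partitioned into disjoint chunks this way.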
Ras - If the result is a different move and slightly different evaluation in each run, that's completely OK as long as they are of similar quality.
That was the reason it was brought to my attention. I detected a bad PV line in Crafty. I retested the position and it was significantly improved, and then it kept changing slightly. This raised concerns about its stability and reliability. It appeared to be getting "bad" information someplace.
Michael Sherwin - I want to experiment with the idea of having a dedicated thread(s) to process hash request. The idea is for a search thread to put in a request for a hash read before it is needed in hopes that it will be waiting in its mailbox when it is needed.
I woke up a bit ago and had a similar idea. Why not have a hash table dedicated to each root move per thread? The master thread would generate the move list, and then tell each dedicated thread by ID what root move to work on.

Why can't you split the hash table into sections dedicated to its thread ID?

Code: Select all

#include <iostream>
#include <thread>

void foo() { /* per-thread work would go here */ }

int main()
{
	unsigned int n = std::thread::hardware_concurrency();
	std::cout << n << " concurrent threads are supported.\n";

	std::thread t1(foo);
	std::thread::id t1_id = t1.get_id();

	std::thread t2(foo);
	std::thread::id t2_id = t2.get_id();

	std::thread t3(foo);
	std::thread::id t3_id = t3.get_id();

	t1.join();
	t2.join();
	t3.join();
}
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Crafty 25.6 search stability

Post by Ras »

jhaglund2 wrote: Thu Apr 30, 2020 10:01 am
Ras - They have even tried them and found that the performance is abysmal in this context.
I don't care about speed. Consistency and accuracy is more important to me.
Then I suggest that you go with single thread. Multithreaded with synchronisation in hash table access wouldn't be faster anyway.
I think of multiple threads should be used to do math faster, which is deterministic.
Please re-read what bob has already explained: http://talkchess.com/forum3/viewtopic.p ... 10#p841405
Say you want to count nodes to 10 billion. 1 thread counts to 10 billion in 16 seconds. 2 threads count in 8 seconds. 4 threads count in 4 seconds. 8 threads count in 2 seconds. All scenarios, it counts to 10 billion each time, no matter how many times you do the math.
Re-read the explanations.
Why can't you split the hash table into sections dedicated to it's thread ID?
Please re-read: http://talkchess.com/forum3/viewtopic.p ... 20#p841785
Rasmus Althoff
https://www.ct800.net
jhaglund2
Posts: 65
Joined: Mon Jan 16, 2017 6:28 pm

Re: Crafty 25.6 search stability

Post by jhaglund2 »

Ras- if each thread gets its own separate hash table, then the threads won't exchange search data anymore, and the speedup from multithreading will be negligible.
I forgot about that part. Each thread could have its own search(), with the results compared in the main thread. Picture a Multi-PV.
How does it use the pawn hashtable, if it doesn't exchange data?
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Crafty 25.6 search stability

Post by Ras »

jhaglund2 wrote: Thu Apr 30, 2020 1:16 pm
Ras- if each thread gets its own separate hash table, then the threads won't exchange search data anymore, and the speedup from multithreading will be negligible.
I forgot about that part. Each thread could have it's own search() and compared in the main.
The answer is right in what you even quoted, but still didn't read.

Look, Crafty's source code is public. You can just slap in some locking for hash table access and benchmark what happens. You don't seem to believe anything else anyway, so just do the work yourself.
Rasmus Althoff
https://www.ct800.net