I am currently working on implementing Lazy SMP in my engine, but I am stuck at the transposition table. I have read some articles discussing splitting the transposition table into parts, each comprising multiple entries, and then having a lock for each of those parts. I have also heard something about lockless hashing for 128-bit entries (unfortunately my entries are too big for this), but I don't really understand it. How does it work?
I also read a guide discussing having a 4-byte spinlock for every table entry, but I think this might take up too much space, which would result in a huge transposition table just to get a decent number of entries.
What do/did you do to make the transposition table thread-safe in your engines? I am pretty much stuck until I find some way of doing it.
Thanks in advance
Transposition table and multithreaded search
Moderators: hgm, Rebel, chrisw
-
- Posts: 174
- Joined: Thu Nov 26, 2020 10:06 am
- Full name: Niels Abildskov
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Transposition table and multithreaded search
One option is to protect the transposition table with a mutex. Then you can just lock the TT when one of the threads writes to it. At that point, however, the other threads must wait. That should work fine. It has been observed by some engine authors, however, that when you have a lot of threads (like 16+), the constant locking can take so much time that the engine grinds to a halt. For that I have no solution (yet), but I'll look into it when I actually reach that point with my engine.
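A minimal sketch of this locking approach (the Entry layout and all names here are invented for illustration). A common refinement over one global mutex is to stripe the table, giving each group of entries its own lock so threads probing different regions rarely contend:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical entry layout, just for illustration.
struct Entry {
    uint64_t key   = 0;
    int16_t  score = 0;
    uint8_t  depth = 0;
};

// One mutex per stripe of entries instead of one global lock,
// so threads touching different regions rarely block each other.
class LockedTT {
    static constexpr size_t kStripes = 256;  // power of two
    std::vector<Entry> table;
    std::mutex locks[kStripes];
public:
    // entries must be a power of two for the mask below to work
    explicit LockedTT(size_t entries) : table(entries) {}

    void store(uint64_t key, int16_t score, uint8_t depth) {
        size_t idx = key & (table.size() - 1);
        std::lock_guard<std::mutex> guard(locks[idx & (kStripes - 1)]);
        Entry& e = table[idx];
        if (depth >= e.depth)            // depth-preferred replacement
            e = Entry{key, score, depth};
    }

    bool probe(uint64_t key, Entry& out) {
        size_t idx = key & (table.size() - 1);
        std::lock_guard<std::mutex> guard(locks[idx & (kStripes - 1)]);
        if (table[idx].key != key) return false;
        out = table[idx];
        return true;
    }
};
```

Striping reduces contention but does not remove the locking cost per probe, which is why many engines prefer a lockless scheme instead.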
-
- Posts: 27809
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Transposition table and multithreaded search
Basically the idea of lockless hashing is that you accept that occasionally an entry gets corrupted by a race between processes, but that the entry contains a kind of checksum that would detect such corruption, and ignore the data in that case. This is done in a clever way so that you don't need to reserve any extra space in the entry for the checksum. Instead you combine it with the hash signature that you already store in the entry.
The technique is not limited to 128-bit entries. What you do is that you XOR all other words of the entry with the word that contains the hash signature, before storing the latter. When you later read back the entry, you do the same thing again. Normally this would recover the original word with the hash signature. In a corrupted entry, where the other data is no longer the same as when it was written, you will then not recover the original signature. So the probing routine will not recognize it as the sought entry.
To maximize the probability of discovering the corruption, you should take care that the bits that get XORed with the hash signature are as variable as possible. (The hash signature might not fill an entire word, but perhaps only half of it, and you would not be able to detect any corruption in the other half.) It would for instance not be very helpful to XOR with the field containing the depth, as almost all depths in the table will be 0 or 1. XORing with the move might not be that useful either, as a large fraction of the entries might contain the code for 'invalid' in the move field (if they were fail lows). Aligning the score field with the signature could be more effective. Ideally you would have part of the signature in each word of the entry, so they XOR with each other. In fact you would not even have to XOR anything then; concatenating two key parts that do not belong with each other would also prevent recognition. But the downside of that is that you could not test for a matching signature in one step.
There never is an absolute guarantee that a corruption will be detected. But the effect of that is not worse than having a key collision between uncorrupted entries. A corrupted hash signature just performs like any other independently created signature. So you must be able to live with that.
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Transposition table and multithreaded search
Hi Niels,
lockless hashing is the way to go and it is much simpler than you might think. I will try to explain...
Here is an example of how the simple (single-threaded) write operation can look. You will see the difference to lockless hashing directly after that.

Code: Select all

void TT::write(pos_t* pos, int move, int score, int depth, int ply, int flag)
{
    hashslot_t* slot = &table[pos->key & (slotcount - 1)];

    if(depth >= slot->data.depth || age != slot->data.age)
    {
        slot->data.depth = depth;
        slot->data.age   = age;
        slot->data.flag  = flag;
        slot->data.move  = move;
        slot->data.score = scoreIntoSlot(score, ply);
        slot->key        = pos->key;
    }
}

Now the lockless version. After that I will explain the difference.

Code: Select all

void TT::write(pos_t* pos, int move, int score, int depth, int ply, int flag)
{
    hashslot_t* slot = &table[pos->key & (slotcount - 1)];
    uint64_t* data = (uint64_t*) &slot->data;

    if(depth >= slot->data.depth || age != slot->data.age) {
        // same assignments to slot->data as before, but...
        slot->key = pos->key ^ *data; // lockless hashing
    }
}

As you can see, only the following lines have changed...

Code: Select all

1. uint64_t* data = (uint64_t*) &slot->data;
2. slot->key = pos->key ^ *data;

The data structure behind it looks like this:

Code: Select all

typedef struct {
    uint16_t move;
    int16_t  score;
    int16_t  unused;
    uint16_t depth : 7, age : 7, flag : 2;
} hashdata_t;

typedef struct {
    uint64_t   key;
    hashdata_t data;
} hashslot_t;

Now, the point is: if different threads write different data into the slot, the stored key will be different. When you retrieve the data, it works like this:

Code: Select all

hashslot_t* hslot = TT::get_link(pos);
uint64_t* data = (uint64_t*) &hslot->data;

if(hslot->key == (pos->key ^ *data)) {
    // lookup
}

If different threads were writing different data at the same time, the current key of the position XORed with the stored data will not result in the stored key.
The principle can be applied to larger data (more than 64 bits) too. But in this context 64 bits allow an atomic read and write operation, so you can handle the data in one go.
To summarize: you XOR the data into the key when writing. When reading, the current position key XORed with the stored data must result in the stored key. (Otherwise the data was from another thread with a different key.)
Hope it helps a little bit.
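To see the detection in action, here is a small self-contained check of the same scheme (type and field names copied from the post above; scoreIntoSlot is omitted and memcpy is used instead of the pointer cast for portability; the "torn write" is simulated by hand):

```cpp
#include <cstdint>
#include <cstring>

typedef struct {
    uint16_t move;
    int16_t  score;
    int16_t  unused;
    uint16_t depth : 7, age : 7, flag : 2;
} hashdata_t;

typedef struct {
    uint64_t   key;
    hashdata_t data;
} hashslot_t;

// Lockless write for one slot, as in the post.
void lockless_write(hashslot_t* slot, uint64_t poskey,
                    uint16_t move, int16_t score, int depth) {
    std::memset(&slot->data, 0, sizeof slot->data);
    slot->data.move  = move;
    slot->data.score = score;
    slot->data.depth = depth;
    uint64_t data;
    std::memcpy(&data, &slot->data, sizeof data);  // integer view of the 8 data bytes
    slot->key = poskey ^ data;                     // fold the data into the key
}

// Probe: XORing the stored data back out must recover the position key.
bool lockless_probe(const hashslot_t* slot, uint64_t poskey) {
    uint64_t data;
    std::memcpy(&data, &slot->data, sizeof data);
    return slot->key == (poskey ^ data);
}
```

Writing an entry for key A, then overwriting only its data part with another thread's entry (the torn-write scenario), makes the probe for key A fail, which is exactly the corruption detection described above.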
-
- Posts: 174
- Joined: Thu Nov 26, 2020 10:06 am
- Full name: Niels Abildskov
Re: Transposition table and multithreaded search
Hello Desperado.
Thank you very much for your answer! I think that I've got the point now, but I just have one question: I don't understand how changing the data stored in the hashdata_t instance, that the hashslot_t is holding, has anything to do with the "data" pointer itself.
Won't the "data" pointer be the same regardless of the information stored in the structure?
(e.g. if I change slot->data.score = 100 to slot->data.score = -250, won't &slot->data be unaffected?)
-
- Posts: 27809
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Transposition table and multithreaded search
&slot->data will be unaffected, but *(&slot->data) will not be unaffected.
This casting of pointers to other types is a bit of a dirty way to do it; cleaner would be to define the data as a union of a uint64_t and the hashdata_t struct. You can then use the uint64_t member to do the XOR.
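A sketch of this union variant (struct layout copied from Desperado's post; note that reading the inactive union member is formally undefined behaviour in strict C++, though it is well-defined in C and supported as an extension by the major compilers):

```cpp
#include <cstdint>

typedef struct {
    uint16_t move;
    int16_t  score;
    int16_t  unused;
    uint16_t depth : 7, age : 7, flag : 2;
} hashdata_t;

typedef union {
    hashdata_t fields;   // the packed entry data...
    uint64_t   word;     // ...and the same 8 bytes viewed as one integer
} hashdata_u;

typedef struct {
    uint64_t   key;
    hashdata_u data;
} hashslot_t;

// Write: fold the integer view of the data into the stored key.
void lockless_write(hashslot_t* slot, uint64_t poskey, hashdata_t d) {
    slot->data.fields = d;
    slot->key = poskey ^ slot->data.word;
}

// Probe: XORing the data back out must recover the position key.
bool matches(const hashslot_t* slot, uint64_t poskey) {
    return slot->key == (poskey ^ slot->data.word);
}
```

No pointer cast is needed; the union gives both views of the same 8 bytes directly.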
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Transposition table and multithreaded search
niel5946 wrote: ↑Wed Feb 03, 2021 4:41 pm Hello Desperado.
Thank you very much for your answer! I think that I've got the point now, but I just have one question: I don't understand how changing the data stored in the hashdata_t instance, that the hashslot_t is holding, has anything to do with the "data" pointer itself.
Won't the "data" pointer be the same regardless of the information stored in the structure?
(e.g. if I change slot->data.score = 100 to slot->data.score = -250, won't &slot->data be unaffected?)
1. The connection between hashdata_t* and uint64_t* is the start address of the data: address(hashdata_t*) == address(uint64_t*)
2. The difference is that I can do operations with a 64-bit integer, like *int64_t == 112, but I cannot do *hashdata_t == 123.
So, to make a comparison or do a XOR operation on the memory block with an integer, I need a type cast. It is an implementation trick.
Last edited by Desperado on Wed Feb 03, 2021 5:18 pm, edited 4 times in total.
-
- Posts: 160
- Joined: Tue Jan 23, 2018 10:18 am
- Location: Rotterdam
- Full name: Ronald Friederich
Re: Transposition table and multithreaded search
In current CPUs, memory is moved from RAM to cache in chunks of 64 bytes (a cache line) before the CPU can work with the data. This process is very slow compared to the execution time of instructions. Because it has taken so much time to fetch the 64 bytes, it's much more efficient to use the full 64 bytes for multiple hash entries.
From a programming perspective the best and easiest way is to use locks for accessing the hash table. There is, however, a (big) time penalty for using locks.
So what can be the problems when you don't use locks?
1: Getting the hash data of the wrong hash key. It's possible that thread X reads the hash data belonging to hash key A while at the same time thread Y writes the hash data belonging to hash key B into the same entry. This can result in thread X using the hash data belonging to the wrong hash entry. Multiple tests show that incidentally using wrong hash data in the search doesn't affect the outcome of the search much, so in practice this isn't a problem you need to solve. There might be a problem with using the hash move, though. Because the hash move belongs to another position, it will usually be illegal in the thread's position, so you have to check the legality of the hash move in the position. A solution to this is the already mentioned XORing of data and key, which makes it possible to check whether hash data and hash key belong to each other. It is still advisable to check the hash move, because it is possible that different positions lead to an identical hash key, although with good hash values the chance is small. After some tests I abandoned the XOR solution in rofChade, because it seemed a little bit weaker.
2: Getting corrupted hash data. When two threads concurrently access the same data, with current CPUs only one thread can access a register at a time. With 64-bit registers this means that only one thread has access to those 64 bits. It's not possible for those 64 bits to become a mix of data from both threads.
If a hash entry is not properly aligned and crosses a 64-bit boundary, it is possible that part of the hash entry is from thread X and the other part is from thread Y, which can lead to unpredictable results, for instance when the hash move is split across the two parts. With problem 1 a hash move is always a real move; you don't have to check the "syntax" of the move, only whether the move is possible in the current position. With problem 2 you can get a hash move from square 8 to square 196, etc., so you also have to check the syntax of the hash move.
To solve problem 2 you have to make sure your entries are properly aligned and an entry must not cross a 64-bit boundary. Also make sure that you read and write whole entries in one statement and do not use multiple assignments for different parts of the entry.
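The "aligned, whole entry in one statement" rule can be made explicit with std::atomic, which guarantees the 64-bit access is indivisible (the packing layout below is invented for illustration; a real entry would pack its own fields):

```cpp
#include <atomic>
#include <cstdint>

// One entry kept inside a single aligned 64-bit word. A reader can
// never observe half of one thread's write and half of another's,
// because the atomic load/store moves all 64 bits at once.
struct alignas(8) PackedEntry {
    std::atomic<uint64_t> word{0};
};

// Hypothetical packing: move | score | depth | 16 bits of key as signature.
uint64_t pack(uint16_t move, int16_t score, uint16_t depth, uint16_t key16) {
    return  (uint64_t)move
          | ((uint64_t)(uint16_t)score << 16)
          | ((uint64_t)depth           << 32)
          | ((uint64_t)key16           << 48);
}

void store_entry(PackedEntry& e, uint64_t v) {
    e.word.store(v, std::memory_order_relaxed);   // one indivisible 64-bit write
}

uint64_t load_entry(const PackedEntry& e) {
    return e.word.load(std::memory_order_relaxed); // one indivisible 64-bit read
}
```

On x86-64 an aligned 64-bit load or store is atomic anyway, so the relaxed atomic compiles to a plain move; the std::atomic wrapper mainly stops the compiler from splitting or reordering the access.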