I am currently working on implementing Lazy SMP in my engine, but I am stuck at the transposition table. I have read some articles discussing splitting the transposition table into parts, each comprising multiple entries, and then having a lock for each of those parts. I have also heard something about lockless hashing for 128-bit entries (unfortunately my entries are too big for this), but I don't really understand it. How does it work?
I also read a guide discussing having a 4-byte spinlock for every table entry, but I think this might take up too much space, which would result in a huge transposition table just to get a decent number of entries.
What do/did you do to make the transposition table thread-safe in your engines? I am pretty much stuck until I find some way of doing it.
Thanks in advance
Transposition table and multithreaded search
Moderators: hgm, Rebel, chrisw
-
- Posts: 174
- Joined: Thu Nov 26, 2020 10:06 am
- Full name: Niels Abildskov
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Transposition table and multithreaded search
One option is to protect the transposition table with a mutex. Then you can just lock the TT when one of the threads writes to it. At that point, however, the other threads must wait. That should work fine. It has been observed by some engine authors, however, that when you have a lot of threads (like 16+), the constant locking can take so much time that the engine grinds to a halt. For that I have no solution (yet), but I'll look into it when I actually reach that point with my engine.
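A minimal sketch of this locking approach (the Entry layout and all names here are invented for illustration). A common refinement over one global mutex is to stripe the table, giving each group of entries its own lock so threads probing different regions rarely contend:

```cpp
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical entry layout, just for illustration.
struct Entry {
    uint64_t key   = 0;
    int16_t  score = 0;
    uint8_t  depth = 0;
};

// One mutex per stripe of entries instead of one global lock,
// so threads touching different regions rarely block each other.
class LockedTT {
    static constexpr size_t kStripes = 256;  // power of two
    std::vector<Entry> table;
    std::mutex locks[kStripes];
public:
    // entries must be a power of two for the mask below to work
    explicit LockedTT(size_t entries) : table(entries) {}

    void store(uint64_t key, int16_t score, uint8_t depth) {
        size_t idx = key & (table.size() - 1);
        std::lock_guard<std::mutex> guard(locks[idx & (kStripes - 1)]);
        Entry& e = table[idx];
        if (depth >= e.depth)            // depth-preferred replacement
            e = Entry{key, score, depth};
    }

    bool probe(uint64_t key, Entry& out) {
        size_t idx = key & (table.size() - 1);
        std::lock_guard<std::mutex> guard(locks[idx & (kStripes - 1)]);
        if (table[idx].key != key) return false;
        out = table[idx];
        return true;
    }
};
```

Striping reduces contention but does not remove the locking cost per probe, which is why many engines prefer a lockless scheme instead.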
-
- Posts: 27809
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Transposition table and multithreaded search
Basically the idea of lockless hashing is that you accept that occasionally an entry gets corrupted by a race between processes, but that the entry contains a kind of checksum that would detect such corruption, and ignore the data in that case. This is done in a clever way so that you don't need to reserve any extra space in the entry for the checksum. Instead you combine it with the hash signature that you already store in the entry.
The technique is not limited to 128-bit entries. What you do is that you XOR all other words of the entry with the word that contains the hash signature, before storing the latter. When you later read back the entry, you do the same thing again. Normally this would recover the original word with the hash signature. In a corrupted entry, where the other data is no longer the same as when it was written, you will then not recover the original signature. So the probing routine will not recognize it as the sought entry.
To maximize the probability of discovering the corruption, you should take care that the bits that get XORed with the hash signature are as variable as possible. (The hash signature might not fill an entire word, but perhaps only half of it, and you would not be able to detect any corruption in the other half.) It would for instance not be very helpful to XOR with the field containing the depth, as almost all depths in the table will be 0 or 1. XORing with the move might not be that useful either, as a large fraction of the entries might contain the code for 'invalid' in the move field (if they were fail lows). Aligning the score field with the signature could be more effective. Ideally you would have part of the signature in each word of the entry, so they XOR with each other. In fact you would not even have to XOR anything then; concatenating two key parts that do not belong with each other would also prevent recognition. But the downside of that is that you could not test for a matching signature in one step.
There never is an absolute guarantee that a corruption will be detected. But the effect of that is not worse than having a key collision between uncorrupted entries. A corrupted hash signature just performs like any other independently created signature. So you must be able to live with that.
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Transposition table and multithreaded search
Hi Niels,
lockless hashing is the way to go and it is much simpler than you might think. I will try to explain...
Here is an example of how the simple (single-threaded) write operation can look. You will see the difference to lockless hashing directly after that.

Code: Select all

void TT::write(pos_t* pos, int move, int score, int depth, int ply, int flag)
{
    hashslot_t* slot = &table[pos->key & (slotcount - 1)];

    if(depth >= slot->data.depth || age != slot->data.age)
    {
        slot->data.depth = depth;
        slot->data.age   = age;
        slot->data.flag  = flag;
        slot->data.move  = move;
        slot->data.score = scoreIntoSlot(score, ply);
        slot->key        = pos->key;
    }
}

Now the lockless version. After that I will explain the difference.

Code: Select all

void TT::write(pos_t* pos, int move, int score, int depth, int ply, int flag)
{
    hashslot_t* slot = &table[pos->key & (slotcount - 1)];
    uint64_t* data = (uint64_t*) &slot->data;

    if(depth >= slot->data.depth || age != slot->data.age) {
        // same assignments to slot->data as before, but...
        slot->key = pos->key ^ *data; // lockless hashing
    }
}

As you can see, only the following lines have changed...

Code: Select all

1. uint64_t* data = (uint64_t*) &slot->data;
2. slot->key = pos->key ^ *data;

The data structure behind it looks like this:

Code: Select all

typedef struct {
    uint16_t move;
    int16_t  score;
    int16_t  unused;
    uint16_t depth : 7, age : 7, flag : 2;
} hashdata_t;

typedef struct {
    uint64_t   key;
    hashdata_t data;
} hashslot_t;

Now, the point is: if different threads write different data into the slot, the stored key will be different. When you retrieve the data, it works like this:

Code: Select all

hashslot_t* hslot = TT::get_link(pos);
uint64_t* data = (uint64_t*) &hslot->data;

if(hslot->key == (pos->key ^ *data)) {
    // lookup
}

If different threads were writing different data at the same time, the current key of the position XORed with the stored data will not result in the stored key.
The principle can be applied to larger data (more than 64 bits) too. But in this context 64 bits allow an atomic read and write operation, so you can handle the data in one go.
To summarize: you XOR the data into the key when writing. When reading, the current position key XORed with the stored data must result in the stored key. (Otherwise the data was from another thread with a different key.)
Hope it helps a little bit.
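To see the detection in action, here is a small self-contained check of the same scheme (type and field names copied from the post above; scoreIntoSlot is omitted and memcpy is used instead of the pointer cast for portability; the "torn write" is simulated by hand):

```cpp
#include <cstdint>
#include <cstring>

typedef struct {
    uint16_t move;
    int16_t  score;
    int16_t  unused;
    uint16_t depth : 7, age : 7, flag : 2;
} hashdata_t;

typedef struct {
    uint64_t   key;
    hashdata_t data;
} hashslot_t;

// Lockless write for one slot, as in the post.
void lockless_write(hashslot_t* slot, uint64_t poskey,
                    uint16_t move, int16_t score, int depth) {
    std::memset(&slot->data, 0, sizeof slot->data);
    slot->data.move  = move;
    slot->data.score = score;
    slot->data.depth = depth;
    uint64_t data;
    std::memcpy(&data, &slot->data, sizeof data);  // integer view of the 8 data bytes
    slot->key = poskey ^ data;                     // fold the data into the key
}

// Probe: XORing the stored data back out must recover the position key.
bool lockless_probe(const hashslot_t* slot, uint64_t poskey) {
    uint64_t data;
    std::memcpy(&data, &slot->data, sizeof data);
    return slot->key == (poskey ^ data);
}
```

Writing an entry for key A, then overwriting only its data part with another thread's entry (the torn-write scenario), makes the probe for key A fail, which is exactly the corruption detection described above.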
-
- Posts: 174
- Joined: Thu Nov 26, 2020 10:06 am
- Full name: Niels Abildskov
Re: Transposition table and multithreaded search
Hello Desperado.
Thank you very much for your answer! I think that I've got the point now, but I just have one question: I don't understand how changing the data stored in the hashdata_t instance, that the hashslot_t is holding, has anything to do with the "data" pointer itself.
Won't the "data" pointer be the same regardless of the information stored in the structure?
(e.g. if I change slot->data.score = 100 to slot->data.score = -250, won't &slot->data be unaffected?)
-
- Posts: 27809
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Transposition table and multithreaded search
&slot->data will be unaffected, but *(&slot->data) will not be unaffected.
This casting of pointers to other types is a bit of a dirty way to do it; cleaner would be to define the data as a union of a uint64_t and the hashdata_t struct. You can then use the uint64_t member to do the XOR.
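A sketch of this union variant (struct layout copied from Desperado's post; note that reading the inactive union member is formally undefined behaviour in strict C++, though it is well-defined in C and supported as an extension by the major compilers):

```cpp
#include <cstdint>

typedef struct {
    uint16_t move;
    int16_t  score;
    int16_t  unused;
    uint16_t depth : 7, age : 7, flag : 2;
} hashdata_t;

typedef union {
    hashdata_t fields;   // the packed entry data...
    uint64_t   word;     // ...and the same 8 bytes viewed as one integer
} hashdata_u;

typedef struct {
    uint64_t   key;
    hashdata_u data;
} hashslot_t;

// Write: fold the integer view of the data into the stored key.
void lockless_write(hashslot_t* slot, uint64_t poskey, hashdata_t d) {
    slot->data.fields = d;
    slot->key = poskey ^ slot->data.word;
}

// Probe: XORing the data back out must recover the position key.
bool matches(const hashslot_t* slot, uint64_t poskey) {
    return slot->key == (poskey ^ slot->data.word);
}
```

No pointer cast is needed; the union gives both views of the same 8 bytes directly.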
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Transposition table and multithreaded search
niel5946 wrote: ↑Wed Feb 03, 2021 4:41 pm Hello Desperado.
Thank you very much for your answer! I think that I've got the point now, but I just have one question: I don't understand how changing the data stored in the hashdata_t instance, that the hashslot_t is holding, has anything to do with the "data" pointer itself.
Won't the "data" pointer be the same regardless of the information stored in the structure?
(e.g. if I change slot->data.score = 100 to slot->data.score = -250, won't &slot->data be unaffected?)
1. The connection between hashdata_t* and uint64_t* is the start address of the data: address(hashdata_t*) == address(uint64_t*)
2. The difference is that I can do operations with a 64-bit integer, like *int64_t == 112, but I cannot do *hashdata_t == 123.
So, to make a comparison or do a XOR operation on the memory block with an integer, I need a type cast. It is an implementation trick.
Last edited by Desperado on Wed Feb 03, 2021 5:18 pm, edited 4 times in total.
-
- Posts: 160
- Joined: Tue Jan 23, 2018 10:18 am
- Location: Rotterdam
- Full name: Ronald Friederich
Re: Transposition table and multithreaded search
In current CPUs, memory is moved from RAM to cache in chunks of 64 bytes (a cache line) before the CPU can work with the data. This process is very slow compared to the execution time of instructions. Because it has taken so much time to fetch the 64 bytes, it's much more efficient to use the full 64 bytes for multiple hash entries.
From a programming perspective the best and easiest way is to use locks for accessing the hash table. There is, however, a (big) time penalty for using locks.
So what can be the problems when you don't use locks?
1: Getting the hash data of the wrong hash key. It's possible that thread X reads the hash data belonging to hash key A while at the same time thread Y writes the hash data belonging to hash key B into the same entry. This can result in thread X using the hash data belonging to the wrong hash entry. Multiple tests show that incidentally using wrong hash data in the search doesn't affect the outcome of the search much, so in practice this isn't a problem you need to solve. There might be a problem with using the hash move, though. Because the hash move belongs to another position, it will usually be illegal in the thread's position, so you have to check the legality of the hash move in the position. A solution to this is the already mentioned XORing of data and key, which makes it possible to check whether hash data and hash key belong to each other. It is still advisable to check the hash move, because it is possible that different positions lead to an identical hash key, although with good hash values the chance is small. After some tests I abandoned the XOR solution in rofChade, because it seemed a little bit weaker.
2: Getting corrupted hash data. When two threads concurrently access the same data, with current CPUs only one thread can access a register at a time. With 64-bit registers this means that only one thread has access to those 64 bits. It's not possible for those 64 bits to become a mix of data from both threads.
If a hash entry is not properly aligned and crosses a 64-bit boundary, it is possible that part of the hash entry is from thread X and the other part is from thread Y, which can lead to unpredictable results, for instance when the hash move is split across the two parts. With problem 1 a hash move is always a real move; you don't have to check the "syntax" of the move, only whether the move is possible in the current position. With problem 2 you can get a hash move from square 8 to square 196, etc., so you also have to check the syntax of the hash move.
To solve problem 2 you have to make sure your entries are properly aligned and an entry must not cross a 64-bit boundary. Also make sure that you read and write whole entries in one statement and do not use multiple assignments for different parts of the entry.
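The "aligned, whole entry in one statement" rule can be made explicit with std::atomic, which guarantees the 64-bit access is indivisible (the packing layout below is invented for illustration; a real entry would pack its own fields):

```cpp
#include <atomic>
#include <cstdint>

// One entry kept inside a single aligned 64-bit word. A reader can
// never observe half of one thread's write and half of another's,
// because the atomic load/store moves all 64 bits at once.
struct alignas(8) PackedEntry {
    std::atomic<uint64_t> word{0};
};

// Hypothetical packing: move | score | depth | 16 bits of key as signature.
uint64_t pack(uint16_t move, int16_t score, uint16_t depth, uint16_t key16) {
    return  (uint64_t)move
          | ((uint64_t)(uint16_t)score << 16)
          | ((uint64_t)depth           << 32)
          | ((uint64_t)key16           << 48);
}

void store_entry(PackedEntry& e, uint64_t v) {
    e.word.store(v, std::memory_order_relaxed);   // one indivisible 64-bit write
}

uint64_t load_entry(const PackedEntry& e) {
    return e.word.load(std::memory_order_relaxed); // one indivisible 64-bit read
}
```

On x86-64 an aligned 64-bit load or store is atomic anyway, so the relaxed atomic compiles to a plain move; the std::atomic wrapper mainly stops the compiler from splitting or reordering the access.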