SMP hashing problem

bob · Post by **bob** » Wed Jan 28, 2009 2:55 am

jwes wrote:Perhaps I can explain this more clearly. Thread one writes one QWORD and is interrupted before it writes the second QWORD. Thread two reads 2 QWORDS before thread one returns from the interrupt, the first is the new one written by thread one, the second was written some time earlier and you have invalid data. Length of cache lines and cache coherency has nothing to do with it.

You got it. Vincent is _never_ going to get it, because he can't stop typing long enough to read what we are discussing and understand the issue. He's gotta keep that keyboard clicking, even though the result is irrelevant...

bob · Post by **bob** » Wed Jan 28, 2009 2:57 am

wgarvin wrote:Do you agree that its possible for an interrupt to occur between two instructions? ANY two instructions, even a pair of stores?

That's the scenario Bob described in his very first post. First store instruction executes, then interrupt happens. Second store instruction does not get executed. Half of the table entry (the key half) is written to the cache line, the other half (the data half) does not get written until a zillion years later when the thread resumes. Meanwhile (say, after about half of a zillion years), another thread tries to read the same table entry.

Yes, its highly unlikely to happen on any particular execution of that code. When it DOES happen, the second thread reads exactly what the cache line contains: the first half of a new entry, and the second half of an OLD one from long before (because the thread suspended right before the store instruction that would have stored the second half of the new entry, and it still hasn't been executed---in Bob's case, it might NEVER have be executed because the other thread of Crafty would crash first )

You are doing what is often referred to as "pissing into the wind." All you are going to get, for the effort, is wet pants legs and shoes. Absolutely nothing good will come of this. You understand the problem. Most everyone else understands the problem. Vincent is hung up in memory-controller land which is the wrong continent in this discussion...

bob · Post by **bob** » Wed Jan 28, 2009 3:00 am

wgarvin wrote:
diep wrote:And ALIGN YOUR HASHTABLE.

Vincent
Most x86 malloc implementations have 16-byte alignment anyways so that double and SSE data will be properly aligned (I know Microsoft's does).

With 32-byte pawn hash entries, the only thing he's risking is that he'll have to fetch two cache lines instead of one. (I'm not sure but x86 processors might hide that kind of latency pretty well... if it was two 16-byte reads into XMM registers they would, anyway).

But you're right that manually aligning it to a 32-byte boundary is probably a good idea (better safe than sorry!)

It wouldn't fix Bob's interrupt bug though, which has nothing to do with alignment and everything to do with atomicity.

Correct. I believe I still align hash tables on 16 byte boundary. I did not do phash because until recently, I was using processes and using shmget() to malloc hash and phash, and that is automatically aligned to a page boundary instead... I have fixed the phash alignment in 23.0 already, but it has nothing to do with this specific issue...

Why he can't get this I don't understand...

Zach Wegner · Post by **Zach Wegner** » Wed Jan 28, 2009 4:24 am

Perhaps this will end this discussion for a little bit. This was compiled and run on a Q6600. It's a bit long and only for Unix, sorry. mmap of course returns memory aligned to page width. Try with any number of cpus or size of table...

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/mman.h>

#define CPUS 4

typedef unsigned long long BITBOARD;

typedef struct
{
        BITBOARD a;
        BITBOARD b;
} block;

void *shared_alloc(BITBOARD size)
{
        void *mem;

        if ((mem = mmap(0, size, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_SHARED /*| MAP_HASSEMAPHORE*/, -1, 0)) == MAP_FAILED)
        {
                perror("mmap failed");
                exit(EXIT_FAILURE);
        }

        return mem;
}

int main(void)
{
        BITBOARD size;
        block temp;
        block *mem;
        int *pid;
        int c;
        int id;
        int i;
        int count;
        const int l = 16;

        size = l * sizeof(block);
        mem = shared_alloc(size);
        pid = shared_alloc(CPUS * sizeof(int));
        pid[0] = getpid();
        id = 0;
        for (c = 1; c < CPUS; c++)
        {
                i = fork();
                if (i == 0)
                {
                        id = c;
                        break;
                }
                else
                        pid[c] = i;
        }

        // loop
        c = id * 4;
        count = 0;
        while (1)
        {
                temp = mem[c];

                if (temp.a != temp.b)
                {
                        printf("ERROR(cpu %i)! [%i %i] in slot %i, count=%i\n",
                                        id, (int)temp.a, (int)temp.b, c, count);

                        for (c = 1; c < CPUS; c++)
                                if (c != id)
                                        kill(pid[c], SIGKILL);

                        munmap(mem, size);
                        exit(EXIT_FAILURE);
                }

                temp.a = id;
                temp.b = id;
                mem[c] = temp;
                c = (c + 1) & (l - 1);
                count++;
        }
}

The output:

Code: Select all

ERROR(cpu 1)! [0 1] in slot 7, count=563

bob · Post by **bob** » Wed Jan 28, 2009 7:19 am

Zach Wegner wrote:Perhaps this will end this discussion for a little bit. This was compiled and run on a Q6600. It's a bit long and only for Unix, sorry. mmap of course returns memory aligned to page width. Try with any number of cpus or size of table...

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/mman.h>

#define CPUS 4

typedef unsigned long long BITBOARD;

typedef struct
{
        BITBOARD a;
        BITBOARD b;
} block;

void *shared_alloc(BITBOARD size)
{
        void *mem;

        if ((mem = mmap(0, size, PROT_READ | PROT_WRITE,
                MAP_ANON | MAP_SHARED /*| MAP_HASSEMAPHORE*/, -1, 0)) == MAP_FAILED)
        {
                perror("mmap failed");
                exit(EXIT_FAILURE);
        }

        return mem;
}

int main(void)
{
        BITBOARD size;
        block temp;
        block *mem;
        int *pid;
        int c;
        int id;
        int i;
        int count;
        const int l = 16;

        size = l * sizeof(block);
        mem = shared_alloc(size);
        pid = shared_alloc(CPUS * sizeof(int));
        pid[0] = getpid();
        id = 0;
        for (c = 1; c < CPUS; c++)
        {
                i = fork();
                if (i == 0)
                {
                        id = c;
                        break;
                }
                else
                        pid[c] = i;
        }

        // loop
        c = id * 4;
        count = 0;
        while (1)
        {
                temp = mem[c];

                if (temp.a != temp.b)
                {
                        printf("ERROR(cpu %i)! [%i %i] in slot %i, count=%i\n",
                                        id, (int)temp.a, (int)temp.b, c, count);

                        for (c = 1; c < CPUS; c++)
                                if (c != id)
                                        kill(pid[c], SIGKILL);

                        munmap(mem, size);
                        exit(EXIT_FAILURE);
                }

                temp.a = id;
                temp.b = id;
                mem[c] = temp;
                c = (c + 1) & (l - 1);
                count++;
        }
}

The output:

Code: Select all

ERROR(cpu 1)! [0 1] in slot 7, count=563

None of that matters. The world's foremost authority on parallel programming and computer architecture has already spoken... You just don't know how to properly align the data so this won't happen.

Sven · Post by **Sven** » Wed Jan 28, 2009 2:55 pm

wgarvin wrote:Do you agree that its possible for an interrupt to occur between two instructions? ANY two instructions, even a pair of stores?

That's the scenario Bob described in his very first post. First store instruction executes, then interrupt happens.[...]

I think such an interrupt is always possible since there is also an operating system running on the computer that rarely does things from time to time. So even if most of the time each of our N cpus is busy with its part of the tree search, there are very short moments where "something else" gets a little bit of CPU time.

If, however, an interrupt *is* possible during our parallel search (even if "highly unlikely" but not impossible) then the next question would be whether such an interrupt can cause the first of the two MOVQ instructions to get executed *before* and the second *after* the interrupt, which could in turn lead to the "cache intersection" problem that Bob has described. My spontaneous answer would be "yes, it can" since I have no proof to exclude that. But processor experts may prove me wrong.

Long story short:
1. Is it possible that an interrupt occurs during parallel search? (I assume yes, and I'm quite sure about this.)
2. If yes, are the two subsequent MOVQ's interruptible in the middle? (I also assume yes but I'm not that sure in this case.)

Sven

bob · Post by **bob** » Wed Jan 28, 2009 6:33 pm

Sven Schüle wrote:
wgarvin wrote:Do you agree that its possible for an interrupt to occur between two instructions? ANY two instructions, even a pair of stores?

That's the scenario Bob described in his very first post. First store instruction executes, then interrupt happens.[...]
I think such an interrupt is always possible since there is also an operating system running on the computer that rarely does things from time to time. So even if most of the time each of our N cpus is busy with its part of the tree search, there are very short moments where "something else" gets a little bit of CPU time.

If, however, an interrupt *is* possible during our parallel search (even if "highly unlikely" but not impossible) then the next question would be whether such an interrupt can cause the first of the two MOVQ instructions to get executed *before* and the second *after* the interrupt, which could in turn lead to the "cache intersection" problem that Bob has described. My spontaneous answer would be "yes, it can" since I have no proof to exclude that. But processor experts may prove me wrong.

Long story short:
1. Is it possible that an interrupt occurs during parallel search? (I assume yes, and I'm quite sure about this.)
2. If yes, are the two subsequent MOVQ's interruptible in the middle? (I also assume yes but I'm not that sure in this case.)

Sven

If you look at the intel programmer's guide stuff, particularly the one on system programming, you find that an interrupt can occur between _any_ two instructions, so long as interrupts are not disabled. But you don't even need that. All you need is any sort of "gap" between the two instructions.

Zach wrote an example program to produce this and it happens quickly enough that it is highly likely that an interrupt is not always the "trigger'. All that is needed is for something to introduce any sort of delay between the two stores. A pipeline stall. The inability to retire the two stores at the same time (back to back gives a tiny wedge of time where another processor can slip in and start this problem).

It has nothing to do with cache alignment, or a hardware design fault, it has everything to do with concurrent execution without synchronization.

Why Vincent doesn't get this is totally beyond me. He has somehow mentally "locked in" on alignment as the problem even though it is completely unrelated to the problem...

Oh well...

Cardoso · Post by **Cardoso** » Thu Jan 29, 2009 12:10 am

I don't see the lockless hashing in crafty 22.9 will it be in crafty 22.10?

Alvaro

bob · Post by **bob** » Thu Jan 29, 2009 2:33 am

Cardoso wrote:I don't see the lockless hashing in crafty 22.9 will it be in crafty 22.10?

Alvaro

No. But it is in 23.0 for both hash (not that important) and pawn hash where it is critical. 23.0 will be out as soon as I get this new automatic SMP search tuning code to do something that looks reasonable...

bob · Post by **bob** » Thu Jan 29, 2009 8:07 pm

At least it looks like everyone now understands the problem, how/why it occurs, and a couple of courses of action that can be used to prevent it completely, one that is efficient (XOR), one that is not so efficient (locks).

Not sure why it took so long for it to "sink in". But it is a real problem, but fortunately, it is one of those serious issues that has a very inexpensive and easy-to-understand fix...

SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem