assembler for locking at AMD magny cours

diep · Post by **diep** » Fri Nov 06, 2020 1:50 am

So I built, for fun, an old 48-core box with old Magny-Cours CPUs for a fraction of the price the great Threadrippers go for.

All sorts of benchmarks ran fine.

When I compiled old Diep from 2012, which in computer chess terms is of course ancient history,
and ran it on 48 cores, the box froze.

Now one of the possibilities is the locking code I use.
More specifically, the spinlock code. This is very old, ugly assembler I grabbed from here one day.

Does anyone here have some inline assembler for Linux that works on this sort of processor?

Thanks in Advance,
Vincent

p.s. when i try to attach a photo of the box (4.8MB only) it says file too large here.

mar · Post by **mar** » Fri Nov 06, 2020 3:41 am

Hi Vincent, long time no see

I think a simple spinlock might go like this:

while (!compare_and_swap(lock, 0, 1))
{
	while (atomic_load(lock))
		pause();
}

according to gcc mnemonics:
compare_and_swap = __sync_bool_compare_and_swap
atomic_load = __atomic_load_n(..., __ATOMIC_SEQ_CST), actually not sure you need sequential consistency here
pause = asm volatile("pause"); (x86/x64 only)

C++ has a built-in library for atomics, perhaps there's something similar in C, probably an implementation of spinlock itself
I'm pretty sure Linux will have some internal headers with a spinlock available (linux/spinlock.h maybe?), but no idea if it's low level or not

there should be a simple standard way of doing this

EDIT: to unlock, simply atomic_store(lock, 0) = __atomic_store_n(...)
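
On the "something similar in C" question: C11 ships `<stdatomic.h>`, and its `atomic_flag` type is enough for a basic test-and-set lock. What follows is only a minimal sketch, assuming a C11 compiler. The names `spinlock_t`, `SPINLOCK_INIT`, `cpu_pause` and the `spinlock_*` functions are made up for illustration and are not from any library. It is a plain test-and-set loop rather than the load-then-swap loop above, and it uses acquire/release ordering instead of sequential consistency.

```c
#include <stdatomic.h>

/* Hypothetical wrapper type for illustration; not a library type. */
typedef struct { atomic_flag flag; } spinlock_t;
#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Spin-wait hint; compiles to nothing on non-x86 targets. */
static inline void cpu_pause(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");
#endif
}

static inline void spinlock_lock(spinlock_t *l)
{
    /* test_and_set returns the previous value: true means already held */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        cpu_pause();
}

/* Returns 1 if the lock was taken, 0 if somebody else holds it. */
static inline int spinlock_trylock(spinlock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire);
}

static inline void spinlock_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}
```

Usage would be `static spinlock_t l = SPINLOCK_INIT;` followed by `spinlock_lock(&l); ... spinlock_unlock(&l);` around the critical section.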

diep · Post by **diep** » Fri Nov 06, 2020 4:49 pm

mar wrote: ↑Fri Nov 06, 2020 3:41 am Hi Vincent, long time no see

I think a simple spinlock might go like this:
while (!compare_and_swap(lock, 0, 1))
{
	while (atomic_load(lock))
		pause();
}
according to gcc mnemonics:
compare_and_swap = __sync_bool_compare_and_swap
atomic_load = __atomic_load_n(..., __ATOMIC_SEQ_CST), actually not sure you need sequential consistency here
pause = asm volatile("pause"); (x86/x64 only)

C++ has a built-in library for atomics, perhaps there's something similar in C, probably an implementation of spinlock itself
I'm pretty sure Linux will have some internal headers with a spinlock available (linux/spinlock.h maybe?), but no idea if it's low level or not

there should be a simple standard way of doing this

EDIT: to unlock, simply atomic_store(lock, 0) = __atomic_store_n(...)

Yeah that's for JAVA type architects who do not mind losing factor 1000.
Pausing is gonna push it into the runqueue which fires every 10 ms.
You could just as well play checkers then

10ms penalty - that's suicide

that's why we use assembler.

You work at a government job i suppose?

Just kidding.

mar · Post by **mar** » Fri Nov 06, 2020 5:08 pm

diep wrote: ↑Fri Nov 06, 2020 4:49 pm Yeah that's for JAVA type architects who do not mind losing factor 1000.
Pausing is gonna push it into the runqueue which fires every 10 ms.
You could just as well play checkers then

10ms penalty - that's suicide

that's why we use assembler.

You work at a government job i suppose?
Just kidding.

what?!

intrinsics are very efficient and boil down to efficient machine code. the compiler will manage the registers plus they are portable

pause only hints that the hypercore waiting can do other useful stuff

see this https://github.com/torvalds/linux/blob/ ... spinlock.h

you wonder what cpu_relax does? guess what - it's the pause instruction (=rep nop)
see here https://github.com/torvalds/linux/blob/ ... rocessor.h

so they do exactly the same thing except that they load first (assuming contention), then compare and swap

diep · Post by **diep** » Fri Nov 06, 2020 6:50 pm

mar wrote: ↑Fri Nov 06, 2020 5:08 pm
diep wrote: ↑Fri Nov 06, 2020 4:49 pm Yeah that's for JAVA type architects who do not mind losing factor 1000.
Pausing is gonna push it into the runqueue which fires every 10 ms.
You could just as well play checkers then

10ms penalty - that's suicide

that's why we use assembler.

You work at a government job i suppose?
Just kidding.
what?!

intrinsics are very efficient and boil down to efficient machine code. the compiler will manage the registers plus they are portable

pause only hints that the hypercore waiting can do other useful stuff

see this https://github.com/torvalds/linux/blob/ ... spinlock.h

you wonder what cpu_relax does? guess what - it's the pause instruction (=rep nop)
see here https://github.com/torvalds/linux/blob/ ... rocessor.h

so they do exactly the same thing except that they load first (assuming contention), then compare and swap

You point at kernel stuff. You do not want any sort of kernel call nor compiler that can mess up with spinlocks.
There is no hypercores here only real cores. This is AMD magny cours. 12 cores a cpu.

Though the board isn't very stable - seems like a duck so far.

Only 1 store on ebay sells those boards for not too much - and when it arrived here it was obvious they wrapped that plastic too tight around it and the fans on the board were totally worn out. Not many other choices there.

If i sell some 3d printers (hah could take some time) i wouldn't build this same box. Losing too much time to it. Time i can't lose as otherwise such 3d printer never sees the market place.

As for GCC, that's the worst compiler to use intrinsics with.

For an FFT implementation (for some prime numbers) I used a bunch of intrinsics some time ago, and it went totally out of its mind there, generating all kinds of nonsense around them.

When you expect to have tons of cores, then losing system time around the thing that matters, the xchg instruction, can strike back at you exponentially.

mar · Post by **mar** » Fri Nov 06, 2020 7:38 pm

diep wrote: ↑Fri Nov 06, 2020 6:50 pm You point at kernel stuff. You do not want any sort of kernel call nor compiler that can mess up with spinlocks.
There is no hypercores here only real cores. This is AMD magny cours. 12 cores a cpu.

I see, if no hyperthreading then pause won't help you.
I was merely pointing out the spinlock implementation in kernel

as for gcc codegen quality, this:

void spin_lock(int *lock)
{
    do
    {
        while (__atomic_load_n((volatile int *)lock, __ATOMIC_SEQ_CST));
    } while(!__sync_bool_compare_and_swap((volatile int *)lock, 0, 1));
}

void spin_unlock(int *lock)
{
    __atomic_store_n((volatile int *)lock, 0, __ATOMIC_SEQ_CST);
}

compiles to

spin_lock(int*):
        mov     edx, 1
.L2:
        mov     eax, DWORD PTR [rdi]
        test    eax, eax
        jne     .L2
        lock cmpxchg    DWORD PTR [rdi], edx
        jne     .L2
        ret
spin_unlock(int*):
        xor     eax, eax
        xchg    eax, DWORD PTR [rdi]
        ret

of course it can be inlined, this is just the code for the function. not bad I guess
perhaps a swap could be used instead of CAS, but otherwise seems fine to me
also not sure you need sequential consistency everywhere, but seems decent to me

(note that I removed the pause instruction here)
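
On the sequential-consistency question: a lock only needs acquire ordering when it is taken and release ordering when it is dropped. Below is a sketch of the same load-then-CAS loop with weaker orderings, assuming GCC or Clang. The names `spin_lock_acq` and `spin_unlock_rel` are made up here to keep them apart from the functions above. On x86-64 the release store typically becomes a plain `mov` instead of `xchg`, but check your own compiler output.

```c
/* Sketch only: same structure as spin_lock above, weaker memory orders. */
static inline void spin_lock_acq(int *lock)
{
    for (;;) {
        /* read-only spin: relaxed is enough, we only wait to see a 0 */
        while (__atomic_load_n(lock, __ATOMIC_RELAXED))
            ;
        int expected = 0;
        /* strong CAS 0 -> 1; acquire on success, relaxed on failure */
        if (__atomic_compare_exchange_n(lock, &expected, 1, 0,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return;
    }
}

static inline void spin_unlock_rel(int *lock)
{
    /* release: writes made inside the critical section become visible
       before the lock is seen as free */
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);
}
```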

mar · Post by **mar** » Fri Nov 06, 2020 7:59 pm

actually this

void spin_lock(int *lock)
{
    do
    {
        while (__atomic_load_n((volatile int *)lock, __ATOMIC_SEQ_CST));
    } while(__atomic_exchange_n((volatile int *)lock, 1, __ATOMIC_SEQ_CST));
}

void spin_unlock(int *lock)
{
    __atomic_store_n((volatile int *)lock, 0, __ATOMIC_SEQ_CST);
}

might work as well, but I haven't tested it

produces

spin_lock(int*):
        mov     edx, 1
.L2:
        mov     eax, DWORD PTR [rdi]
        test    eax, eax
        jne     .L2
        mov     eax, edx
        xchg    eax, DWORD PTR [rdi]
        test    eax, eax
        jne     .L2
        ret
spin_unlock(int*):
        xor     eax, eax
        xchg    eax, DWORD PTR [rdi]
        ret
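
Since the exchange-based version is described as untested, here is one way it could be exercised: a small pthread harness in which several threads bump a deliberately non-atomic counter under the lock. This is a sketch, not a proof of correctness. `run_stress`, `worker`, `g_lock` and `g_counter` are hypothetical names, and the 64-thread cap is an arbitrary choice.

```c
#include <pthread.h>
#include <stddef.h>

static int  g_lock;     /* 0 = free, 1 = held */
static long g_counter;  /* plain long on purpose; only g_lock protects it */

static void spin_lock(int *lock)
{
    do {
        while (__atomic_load_n(lock, __ATOMIC_SEQ_CST))
            ;
    } while (__atomic_exchange_n(lock, 1, __ATOMIC_SEQ_CST));
}

static void spin_unlock(int *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_SEQ_CST);
}

static void *worker(void *arg)
{
    long iters = *(long *)arg;
    for (long i = 0; i < iters; i++) {
        spin_lock(&g_lock);
        g_counter++;            /* would lose updates without the lock */
        spin_unlock(&g_lock);
    }
    return NULL;
}

/* Returns the final counter: nthreads * iters if every thread started
   and the lock gave mutual exclusion. */
static long run_stress(int nthreads, long iters)
{
    pthread_t tid[64];
    int started = 0;

    if (nthreads > 64)
        nthreads = 64;
    g_lock = 0;
    g_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tid[i], NULL, worker, &iters) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return g_counter;
}
```

Calling `run_stress(48, 1000000)` on the box in question and comparing the result with `48 * 1000000` would be a quick sanity check. Build with `-pthread`.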

diep · Post by **diep** » Fri Nov 06, 2020 8:09 pm

Now that last is looking really good Martin!

p.s. always use the 'volatile' keyword in C

in C :

volatile int lock;

otherwise the compiler will totally screw you and D* you.

or in this case it's probably

LinuxAssemblerLock(volatile int *lock) { ..
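
For anyone who still wants hand-written inline assembler, as asked at the top of the thread, here is a minimal sketch in GCC extended asm that mirrors the compiler output shown earlier. The names `xchg_int`, `asm_spin_lock` and `asm_spin_unlock` are hypothetical, and the asm path is x86 only. Note the qualifier sits on the pointee (`volatile int *`); `int *volatile` would only make the pointer variable itself volatile. Also, `volatile` alone does not make a read-modify-write atomic: the `xchg` does that.

```c
/* Atomically swap *p with v and return the old value. */
static inline int xchg_int(volatile int *p, int v)
{
#if defined(__x86_64__) || defined(__i386__)
    /* xchg with a memory operand is implicitly locked on x86;
       the "memory" clobber stops the compiler moving accesses across it */
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(v), "+m"(*p)
                         :
                         : "memory");
    return v;
#else
    /* fallback so the sketch still builds elsewhere */
    return __atomic_exchange_n(p, v, __ATOMIC_SEQ_CST);
#endif
}

static inline void asm_spin_lock(volatile int *lock)
{
    do {
        while (*lock)   /* volatile pointee: a real load on every pass */
            ;
    } while (xchg_int(lock, 1));   /* old value 0 means we took the lock */
}

static inline void asm_spin_unlock(volatile int *lock)
{
    xchg_int(lock, 0);  /* same xchg-based store as in the output above */
}
```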

Vinvin · Post by **Vinvin** » Sat Nov 07, 2020 2:51 am

diep wrote: ↑Fri Nov 06, 2020 1:50 am So I built, for fun, an old 48-core box with old Magny-Cours CPUs for a fraction of the price the great Threadrippers go for.
...

Hi !
I suppose CPU's are "Opteron 6174" as shown here : https://www.anandtech.com/show/2978/amd ... ore-xeon/2
and here : https://www.cpubenchmark.net/cpu.php?cp ... cpuCount=4

What are the speed and power consumption (in watts) when running Stockfish 12 (4 GB hash) and Stockfish 11 at full speed on this box?

Thanks for this information,
Vincent

Dann Corbit · Post by **Dann Corbit** » Sat Nov 07, 2020 4:31 am

I got 47 million NPS from Opteron. You can find this entry on Ipman Chess benchmark:
47.371.167 4x AMD Opteron 6276 @2.3ghz 64threads base Dann Corbit

I guess that a good place to look for assembler locking is the Asmfish code. The original stuff by Mohammed was brilliant.
He also wrote the first effective SMP routines in assembly. Since it is x86, it ought to be portable.
