lucasart wrote: But what you forget to mention is that they are counter-productive in the case of HT.
Didn't forget to mention this. Actually knew nothing about it.
Better NPS scaling for Stockfish
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Better NPS scaling for Stockfish
-
- Posts: 5569
- Joined: Tue Feb 28, 2012 11:56 pm
For those that might find this of interest, the C++11 code for the spinlock:
Code:
#include <atomic>

class Spinlock {
    std::atomic_int lock;

public:
    Spinlock() { lock = 1; } // Init here to work around a bug with MSVC 2013
    void acquire() {
        while (lock.fetch_sub(1, std::memory_order_acquire) != 1)
            while (lock.load(std::memory_order_relaxed) <= 0) {}
    }
    void release() { lock.store(1, std::memory_order_release); }
};

// Takes a pointer, matching the mangled name _Z9spin_lockP8Spinlock below
int spin_lock(Spinlock *s)
{
    s->acquire();
    return 0;
}
I've added the spin_lock() function to simulate pthread_spin_lock(). Let's see what various compilers make of this (alignment directives removed).
Using gcc-4.7.3:
Code:
_Z9spin_lockP8Spinlock:
.L2:
        movl    (%rdi), %eax
.L4:
        leal    -1(%rax), %edx
        movl    %eax, %ecx
        lock cmpxchgl   %edx, (%rdi)
        jne     .L4
        cmpl    $1, %ecx
        je      .L12
.L7:
        movl    (%rdi), %eax
        testl   %eax, %eax
        jg      .L2
        jmp     .L7
.L12:
        xorl    %eax, %eax
        ret
Both gcc-4.8.1 and gcc-4.9.0 give:
Code:
_Z9spin_lockP8Spinlock:
.L2:
        movl    $-1, %eax
        lock xaddl      %eax, (%rdi)
        cmpl    $1, %eax
        je      .L7
.L5:
        movl    (%rdi), %eax
        testl   %eax, %eax
        jle     .L5
        jmp     .L2
.L7:
        xorb    %al, %al
        ret
The C++11 code is modelled after glibc's pthread_spin_lock() implementation:
Code:
pthread_spin_lock:
        mov     4(%esp), %eax
1:      LOCK
        decl    0(%eax)
        jne     2f
        xor     %eax, %eax
        ret
        .align  16
2:      rep
        nop
        cmpl    $0, 0(%eax)
        jg      1b
        jmp     2b
The code produced by gcc-4.7.3 isn't very good, as it uses a lock cmpxchg loop to implement the atomic subtraction of 1.
The code produced by gcc-4.8.1/4.9.0 uses lock xadd, which is better. Strangely, these compilers miss the fact that subtracting 1 / adding -1 can be implemented using a decrement operation. They also miss the fact that there is no need to compare the original lock value with 1; it is sufficient to test the decremented value for 0 (which does not need the cmp). But this is a minor problem.
So how about the pause instruction? Glibc has it: pause has the same opcode as "rep nop". To add it to the C++11 implementation:
Code:
#include <immintrin.h>
...
    void acquire() {
        while (lock.fetch_sub(1, std::memory_order_acquire) != 1)
            while (lock.load(std::memory_order_relaxed) <= 0) { _mm_pause(); }
    }
Using gcc-4.8.1:
Code:
_Z9spin_lockP8Spinlock:
.L2:
        movl    $-1, %eax
        lock xaddl      %eax, (%rdi)
        cmpl    $1, %eax
        je      .L7
.L5:
        movl    (%rdi), %eax
        testl   %eax, %eax
        jg      .L2
        rep nop
        jmp     .L5
.L7:
        xorb    %al, %al
        ret
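For anyone who wants to try the lock rather than just read the assembly, here is a minimal sketch that wraps the pause variant of the Spinlock around a shared counter. The names cpu_relax() and contended_count(), and the thread and iteration counts, are my own illustrative assumptions; they are not from the post above or from Stockfish. The wrapper emits pause on x86 and does nothing on other targets.

```cpp
#include <atomic>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define HAVE_X86_PAUSE 1
#endif

// Hypothetical helper: spin-wait hint on x86, a no-op elsewhere.
inline void cpu_relax() {
#ifdef HAVE_X86_PAUSE
    _mm_pause();
#endif
}

class Spinlock {
    std::atomic_int lock;  // 1 = free, <= 0 = taken

public:
    Spinlock() { lock = 1; }
    void acquire() {
        // Try to take the lock with an atomic decrement; on failure,
        // spin on plain loads until the lock looks free, then retry.
        while (lock.fetch_sub(1, std::memory_order_acquire) != 1)
            while (lock.load(std::memory_order_relaxed) <= 0)
                cpu_relax();
    }
    void release() { lock.store(1, std::memory_order_release); }
};

// Hypothetical driver: each thread increments a plain (non-atomic)
// counter `iterations` times while holding the lock.
long contended_count(int threads, int iterations) {
    Spinlock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.acquire();
                ++counter;
                lock.release();
            }
        });
    for (auto& th : pool)
        th.join();
    return counter;
}
```

If the lock gives mutual exclusion, contended_count(threads, iterations) returns exactly threads * iterations. This only checks correctness; it says nothing about NPS scaling or about whether pause helps on a given machine.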