Better NPS scaling for Stockfish

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Better NPS scaling for Stockfish

Post by zullil »

lucasart wrote: But what you forget to mention is that they are counter-productive in the case of HT.
Didn't forget to mention this. Actually knew nothing about it. :wink:
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Better NPS scaling for Stockfish

Post by syzygy »

For those that might find this of interest, the C++11 code for the spinlock:

Code: Select all

#include <atomic>

class Spinlock &#123;
  std&#58;&#58;atomic_int lock;
public&#58;
  Spinlock&#40;) &#123; lock = 1; &#125; // Init here to workaround a bug with MSVC 2013
  void acquire&#40;) &#123;
    while &#40;lock.fetch_sub&#40;1, std&#58;&#58;memory_order_acquire&#41; != 1&#41;
        while &#40;lock.load&#40;std&#58;&#58;memory_order_relaxed&#41; <= 0&#41; &#123;&#125;
  &#125;
  void release&#40;) &#123; lock.store&#40;1, std&#58;&#58;memory_order_release&#41;; &#125;
&#125;;

int spin_lock&#40;Spinlock s&#41;
&#123;
  s.acquire&#40;);
  return 0;
&#125;
I've added the spin_lock() function to simulate pthread_spin_lock(). Let's see what various compilers make of this (removing alignment instructions).

Using gcc-4.7.3:

Code: Select all

_Z9spin_lockP8Spinlock&#58;
.L2&#58;
        movl    (%rdi&#41;, %eax
.L4&#58;
        leal    -1&#40;%rax&#41;, %edx
        movl    %eax, %ecx
        lock cmpxchgl   %edx, (%rdi&#41;
        jne     .L4
        cmpl    $1, %ecx
        je      .L12
.L7&#58;
        movl    (%rdi&#41;, %eax
        testl   %eax, %eax
        jg      .L2
        jmp     .L7
.L12&#58;
        xorl    %eax, %eax
        ret
Both gcc-4.8.1 and gcc-4.9.0 give:

Code: Select all

_Z9spin_lockP8Spinlock&#58;
.L2&#58;
        movl    $-1, %eax
        lock xaddl      %eax, (%rdi&#41;
        cmpl    $1, %eax
        je      .L7
.L5&#58;
        movl    (%rdi&#41;, %eax
        testl   %eax, %eax
        jle     .L5
        jmp     .L2
.L7&#58;
        xorb    %al, %al
        ret
The C++11 code is modelled after glibc's pthread_spin_lock() implementation:

Code: Select all

pthread_spin_lock&#58;
        mov     4&#40;%esp&#41;, %eax
1&#58;      LOCK
        decl    0&#40;%eax&#41;
        jne     2f
        xor     %eax, %eax
        ret

        .align  16
2&#58;      rep
        nop
        cmpl    $0, 0&#40;%eax&#41;
        jg      1b
        jmp     2b
The code produced by gcc-4.7.3 isn't very good as it uses a lock cmpxchg loop to implement the atomic subtraction of 1.

The code produced by gcc-4.8.1/4.9.0 uses lock xadd which is better. Strangely these compilers miss the fact that subtracting 1 / adding -1 can be implemented using a decrement operation. They also miss the fact that there is no need to compare the original lock value with 1; it is sufficient to test the decremented value for 0 (which does not nead the cmp). But this is a minor problem.

So how about the pause instruction. Glibc has it: pause has the same opcode as "rep nop". To add it to the C++11 implementation:

Code: Select all

#include <immintrin.h>
...

  void acquire&#40;) &#123;
    while &#40;lock.fetch_sub&#40;1, std&#58;&#58;memory_order_acquire&#41; != 1&#41;
        while &#40;lock.load&#40;std&#58;&#58;memory_order_relaxed&#41; <= 0&#41; &#123; _mm_pause&#40;); &#125;
Using gcc-4.8.1:

Code: Select all

_Z9spin_lockP8Spinlock&#58;
.L2&#58;
        movl    $-1, %eax
        lock xaddl      %eax, (%rdi&#41;
        cmpl    $1, %eax
        je      .L7
.L5&#58;
        movl    (%rdi&#41;, %eax
        testl   %eax, %eax
        jg      .L2
        rep nop
        jmp     .L5
.L7&#58;
        xorb    %al, %al
        ret