Regarding AVX2

Rebel · Post by **Rebel** » Wed Nov 03, 2021 9:11 pm

I have an Intel I7 with AVX2 support and get mixed NPS results.

Stockfish 14.1 from abrok

BMI    : 692K
AVX2   : 688K
Modern : 679K
SSE3   : 680K
X64    : 458K
X32    : 253K

Koivisto 6.23


AVX2   : 1.514K
SSE4.2 : 1.636K
SSE    : 1.503K
Popc   : 1.607K

Berserk 6 is odd...


AVX2 : 1.651K
Popc : 1.914K ?

Seer 2.3.0 is a bit odd as well...


AVX2 : 447K
AVX  : 457K
SSE2 : 520K ?

I also tested Komodo Dragon 2.5 and Ethereal 13.25 with AVX2 versus non-AVX2 builds; neither showed a speed improvement from AVX2.

Two engines that do profit: Minic and Arasan.

Arasan 23.0.1


AVX2 : 519K
BMI2 : 102K
Popc : 104K

Minic 3.17


AVX2 : 496K (Skylake compile)
Popc : 409K (Sandybridge compile)

Maybe the authors can say which compiler they are using.

xr_a_y · Post by **xr_a_y** » Wed Nov 03, 2021 9:50 pm

A lot of things can be going on here. For instance, if your tests were single-threaded while the machine was otherwise idle, thermal throttling and turbo boost can really affect the results. Whether a build uses FMA will also play an important role.

At the very least, your precise CPU model is needed before any analysis can start.

If you want to read some very interesting material on the subject (especially for AVX-512), I'd recommend this: https://travisdowns.github.io/blog/2020 ... freq1.html

xr_a_y · Post by **xr_a_y** » Wed Nov 03, 2021 9:55 pm

Also, I recently tried to test things a bit on small vectors, after having issues on TCEC hardware. Here is some code to start with, maybe:


#include <cstddef> // size_t
#include <iostream>

// Highly inspired by/copied from https://github.com/xianyi/OpenBLAS, same naming convention here.

/*
My gcc (and clang) gives those macros for simd extension :

>> gcc -march=skylake-avx512 -dM -E - < /dev/null | egrep "SSE|AVX" | sort

#define __AVX__ 1
#define __AVX2__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1
#define __AVX512F__ 1
#define __AVX512VL__ 1
#define __MMX_WITH_SSE__ 1
#define __SSE__ 1
#define __SSE2__ 1
#define __SSE2_MATH__ 1
#define __SSE3__ 1
#define __SSE4_1__ 1
#define __SSE4_2__ 1
#define __SSE_MATH__ 1
#define __SSSE3__ 1
*/

#if defined(_MSC_VER)
#define DOT_INLINE __inline
#elif defined(__GNUC__)
#if defined(__STRICT_ANSI__)
#define DOT_INLINE __inline__
#else
#define DOT_INLINE inline
#endif
#else
#define DOT_INLINE
#endif

#ifdef _MSC_VER
#define DOT_FINLINE static __forceinline
#elif defined(__GNUC__)
#define DOT_FINLINE static DOT_INLINE __attribute__((always_inline))
#else
#define DOT_FINLINE static
#endif

/** SSE **/
#ifdef __SSE__
#include <xmmintrin.h>
#endif
/** SSE2 **/
#ifdef __SSE2__
#include <emmintrin.h>
#endif
/** SSE3 **/
#ifdef __SSE3__
#include <pmmintrin.h>
#endif
/** SSSE3 **/
#ifdef __SSSE3__
#include <tmmintrin.h>
#endif
/** SSE41 **/
#ifdef __SSE4_1__
#include <smmintrin.h>
#endif
/** AVX **/
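#if defined(__AVX__) || defined(__FMA__)
#include <immintrin.h>
#endif
/** FMA4 (AMD): GCC and Clang declare _mm_macc_ps in x86intrin.h, not immintrin.h **/
#if defined(__FMA4__) && defined(__GNUC__)
#include <x86intrin.h>
#endif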

//----------------------------------
// AVX512
//----------------------------------
#if defined(__AVX512VL__)
#define V_SIMD 512
typedef __m512 v_f32;
#define v_nlanes_f32 16
#define v_add_f32    _mm512_add_ps
#define v_sub_f32    _mm512_sub_ps
#define v_mul_f32    _mm512_mul_ps
#define v_muladd_f32 _mm512_fmadd_ps
#ifdef AVX512_IMPL1

DOT_FINLINE float v_sum_f32(v_f32 a) {
    return _mm512_reduce_add_ps(a);
}
#elif defined AVX512_IMPL2
DOT_FINLINE float v_sum_f32(v_f32 a) {
    __m512 tmp = _mm512_add_ps(a,_mm512_shuffle_f32x4(a,a,_MM_SHUFFLE(0,0,3,2)));
    __m128 r = _mm512_castps512_ps128(_mm512_add_ps(tmp,_mm512_shuffle_f32x4(tmp,tmp,_MM_SHUFFLE(0,0,0,1))));
    r = _mm_hadd_ps(r,r);
    return _mm_cvtss_f32(_mm_hadd_ps(r,r));
}
#else
DOT_FINLINE float v_sum_f32(v_f32 a) {
   __m512 h64   = _mm512_shuffle_f32x4(a, a, _MM_SHUFFLE(3, 2, 3, 2));
   __m512 sum32 = _mm512_add_ps(a, h64);
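   __m512 h32   = _mm512_shuffle_f32x4(sum32, sum32, _MM_SHUFFLE(0, 0, 0, 1)); // bring 128-bit lane 1 down to lane 0 (the old mask fetched lane 2 and dropped lanes 1 and 3 from the sum)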
   __m512 sum16 = _mm512_add_ps(sum32, h32);
   __m512 h16   = _mm512_permute_ps(sum16, _MM_SHUFFLE(1, 0, 3, 2));
   __m512 sum8  = _mm512_add_ps(sum16, h16);
   __m512 h4    = _mm512_permute_ps(sum8, _MM_SHUFFLE(2, 3, 0, 1));
   __m512 sum4  = _mm512_add_ps(sum8, h4);
   return _mm_cvtss_f32(_mm512_castps512_ps128(sum4));
}
#endif

#define v_load_f32(PTR) _mm512_loadu_ps((const __m512*)(PTR))
#define v_zero_f32       _mm512_setzero_ps

//----------------------------------
// AVX
//----------------------------------
#elif defined(__AVX2__)
#define V_SIMD 256
typedef __m256 v_f32;
#define v_nlanes_f32 8
#define v_add_f32    _mm256_add_ps
#define v_sub_f32    _mm256_sub_ps
#define v_mul_f32    _mm256_mul_ps
#ifdef __FMA__
#define v_muladd_f32 _mm256_fmadd_ps
#else
DOT_FINLINE v_f32 v_muladd_f32(v_f32 a, v_f32 b, v_f32 c) { return v_add_f32(v_mul_f32(a, b), c); }
#endif
DOT_FINLINE float v_sum_f32(__m256 a) {
   __m256 sum_halves = _mm256_hadd_ps(a, a);
   sum_halves        = _mm256_hadd_ps(sum_halves, sum_halves);
   __m128 lo         = _mm256_castps256_ps128(sum_halves);
   __m128 hi         = _mm256_extractf128_ps(sum_halves, 1);
   __m128 sum        = _mm_add_ps(lo, hi);
   return _mm_cvtss_f32(sum);
}
#define v_load_f32 _mm256_loadu_ps
#define v_zero_f32  _mm256_setzero_ps

//----------------------------------
// SSE
//----------------------------------
#elif defined(__SSE2__)
#define V_SIMD 128
typedef __m128 v_f32;
#define v_nlanes_f32 4
#define v_add_f32    _mm_add_ps
#define v_sub_f32    _mm_sub_ps
#define v_mul_f32    _mm_mul_ps
#ifdef __FMA__
#define v_muladd_f32 _mm_fmadd_ps
#elif defined(__FMA4__)
#define v_muladd_f32 _mm_macc_ps
#else
DOT_FINLINE v_f32 v_muladd_f32(v_f32 a, v_f32 b, v_f32 c) { return v_add_f32(v_mul_f32(a, b), c); }
#endif
DOT_FINLINE float v_sum_f32(__m128 a) {
#ifdef __SSE3__
   __m128 sum_halves = _mm_hadd_ps(a, a);
   return _mm_cvtss_f32(_mm_hadd_ps(sum_halves, sum_halves));
#else
   __m128 t1 = _mm_movehl_ps(a, a);
   __m128 t2 = _mm_add_ps(a, t1);
   __m128 t3 = _mm_shuffle_ps(t2, t2, 1);
   __m128 t4 = _mm_add_ss(t2, t3);
   return _mm_cvtss_f32(t4);
#endif
}
#define v_load_f32 _mm_loadu_ps
#define v_zero_f32  _mm_setzero_ps
#endif

#ifndef V_SIMD
#define V_SIMD 0
#endif

template<size_t N> [[nodiscard]] float dotProductFma(const float* x, const float* y) {
   size_t i  = 0;
   float dot = 0.0f;
   if constexpr (N <= 0) return dot;

#if V_SIMD
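   constexpr size_t vstep    = v_nlanes_f32;
   constexpr size_t unrollx4 = N & ~(vstep * 4 - 1); // largest multiple of 4*vstep not above N
   constexpr size_t unrollx  = N & ~(vstep - 1);     // largest multiple of vstep not above N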
   v_f32 vsum0 = v_zero_f32();
   v_f32 vsum1 = v_zero_f32();
   v_f32 vsum2 = v_zero_f32();
   v_f32 vsum3 = v_zero_f32();
   while (i < unrollx4) {
      vsum0 = v_muladd_f32(v_load_f32(x + i), v_load_f32(y + i), vsum0);
      vsum1 = v_muladd_f32(v_load_f32(x + i + vstep), v_load_f32(y + i + vstep), vsum1);
      vsum2 = v_muladd_f32(v_load_f32(x + i + vstep * 2), v_load_f32(y + i + vstep * 2), vsum2);
      vsum3 = v_muladd_f32(v_load_f32(x + i + vstep * 3), v_load_f32(y + i + vstep * 3), vsum3);
      i += vstep * 4;
   }
   vsum0 = v_add_f32(v_add_f32(vsum0, vsum1), v_add_f32(vsum2, vsum3));
   while (i < unrollx) {
      vsum0 = v_muladd_f32(v_load_f32(x + i), v_load_f32(y + i), vsum0);
      i += vstep;
   }
   dot = v_sum_f32(vsum0);
#else
   constexpr size_t n1 = N & ~size_t(3); // largest multiple of 4 not above N
   for (; i < n1; i += 4) { dot += y[i] * x[i] + y[i + 1] * x[i + 1] + y[i + 2] * x[i + 2] + y[i + 3] * x[i + 3]; }
#endif
   while (i < N) {
      dot += y[i] * x[i];
      ++i;
   }
   return dot;
}

#include <chrono>

int main(int , char **){
    constexpr long long int N = NDEF;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    static float m[N*N]; // static storage: 4 MiB at N = 1024 is too large for some default stacks
    static float v[N];
    
    for(auto i = 0; i < N*N; ++i) m[i] = (i%7-0.5f)*1.f/((i+1)*N);
    for(auto i = 0; i < N; ++i) v[i] = (i%3-0.5f)*1.f/((i+1)*N);

    float sum = 0;

    for(auto i = 0; i < 1024*1024/N; ++i){
        for(auto j = 0; j < N; ++j){
            sum += dotProductFma<N>(m+N*j,v);
        }
        for(auto k = 0; k < N; ++k) v[k] *= 0.99f; // k, so the outer loop counter i is not shadowed
    }
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    std::cout << N << " " << sum << " " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]" << std::endl;
    return 0;    
}

You can play with it this way, for instance:


#!/bin/bash
for arch in x86-64 core2 nehalem sandybridge skylake skylake-avx512 native; do
   echo "arch : $arch"
   for N in 16 32 64 128 256 512 1024; do
     g++ -march=$arch -DNDEF=$N -O3 dot.cpp -std=c++17
     #g++ -march=$arch -mno-fma -DNDEF=$N -O3 dot.cpp -std=c++17
     ./a.out
   done
   echo "******************************"
done

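arch=native   # previously left over implicitly from the loop above; the AVX512_IMPL* variants only matter on an AVX-512 target
for impl in AVX512_IMPL1 AVX512_IMPL2; do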
   echo "alt $impl"
   for N in 16 32 64 128 256 512 1024; do
     g++ -march=$arch -D$impl -DNDEF=$N -O3 dot.cpp -std=c++17
     #g++ -march=$arch -mno-fma -DNDEF=$N -O3 dot.cpp -std=c++17
     ./a.out
   done
   echo "******************************"
done

xr_a_y · Post by **xr_a_y** » Wed Nov 03, 2021 9:57 pm

And finally, the Minic "sandybridge" compile is not only popcnt!


-- Intel --
* minic_X.YY_linux_x64_skylake         : fully optimized Linux64 (popcnt+avx2+bmi2)  => and FMA !
* minic_X.YY_linux_x64_sandybridge     : optimized Linux64 (popcnt+avx)  
* minic_X.YY_linux_x64_nehalem         : optimized Linux64 (popcnt+sse4.2)  
* minic_X.YY_linux_x64_core2           : basic Linux64 (nopopcnt+sse3)  

-- AMD --
* minic_X.YY_linux_x64_znver3          : fully optimized Linux64 (popcnt+avx2+bmi2)   => and FMA !
* minic_X.YY_linux_x64_znver1          : almost optimized Linux64 (popcnt+avx2)   => and FMA !
* minic_X.YY_linux_x64_bdver1          : optimized Linux64 (nopopcnt+avx)   => and FMA4 !
* minic_X.YY_linux_x64_barcelona       : optimized Linux64 (nopopcnt+sse4A)  
* minic_X.YY_linux_x64_athlon64-sse3   : basic Linux64 (nopopcnt+sse3)

CMCanavessi · Post by **CMCanavessi** » Wed Nov 03, 2021 10:51 pm

Rebel wrote (Wed Nov 03, 2021 9:11 pm): "I have an Intel I7 with AVX2 support and get mixed NPS results."

What exact model of I7?

AndrewGrant · Post by **AndrewGrant** » Thu Nov 04, 2021 1:18 am

Something to note, with an example to share. I have no numbers right now, but I can get them.

I have a Ryzen 3700x and a Ryzen 1950x. Both have AVX2 and all the earlier instruction sets needed for NNUE. I have a version of Ethereal that uses AVX, and another that uses AVX2.

On the 3700x, the AVX2 is a significant increase in speed over the AVX version. It is clear from simple benches that the speed is there, and no rigorous testing is needed.

On the 1950x, there is no clear speedup between the AVX2 and AVX versions. How can this be? Well, it turns out that the 1950x "supports" the AVX2 instruction set, but executes each 256-bit operation as two 128-bit operations instead of one. Someone with more knowledge can explain exactly how that impacts things, but at a high level there is no speed gain from AVX2 for work that can be done just as easily with two 128-bit instructions.

jdart · Post by **jdart** » Thu Nov 04, 2021 2:08 am

The latest Arasan release has no SIMD support except for AVX2, which is why the other compiles are so much slower.

The current dev branch does support SIMD with older microarchitectures, so it does much better. AVX2 and "modern" builds are about equivalent.

Latest results (1 run, 1 core, Intel Xeon 2690v3):
AVX2+BMI2: 1214085 nps
AVX2: 1157468 nps
modern (SSE2 + SSSE3 + SSE4.1): 1162187 nps
generic (SSE2 only): 1099976 nps
