Galois Field Affine Transformation

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Galois Field Affine Transformation

Post by dangi12012 »

Going forward it seems like AMD and Intel will have simd hardware support for 8x8 binary matrices in their higher end products in the form of AVX-512 extensions named Galois field extensions.
Namely these instructions:

Code: Select all

VGF2P8AFFINEQB - Galois field affine transformation
VGF2P8AFFINEINVQB - Galois field affine transformation inverse
VGF2P8MULB - Galois field multiply bytes
This is very interesing since VGF2P8AFFINEQB can do 8x8 binary matrix multiplication in hardware as a single instruction.
My favourite chatbot will explain it better than I could:

Explain avx512 Galois field instructions
AVX-512 Galois Field instructions are a set of specialized instructions for performing operations on Galois Field (GF) elements, which are a mathematical construct used in error-correcting codes and other applications. These instructions are part of the AVX-512 instruction set, which is an extension to the Advanced Vector Extensions (AVX) instruction set for Intel processors. AVX-512 Galois Field instructions allow for efficient implementation of GF operations such as multiplication, division, and exponentiation. They also enable support for different Galois Field sizes, such as GF(2^8) and GF(2^16). These instructions can be used in various applications such as data storage, communications, and cryptography.
Can they be used to multiply two 8x8 binary matrices represented by 64bit numbers?
Yes, AVX-512 Galois Field instructions can be used to multiply two 8x8 binary matrices represented by 64-bit numbers.

The AVX-512 instruction set includes specialized instructions for performing Galois Field (GF) operations on 512-bit registers. These operations can be used to multiply two 8x8 binary matrices represented by 64-bit numbers. Since each element of the 8x8 matrix can be represented by a single bit, a 64-bit number can hold the entire matrix. The AVX-512 Galois Field instructions can then be used to perform the matrix multiplication efficiently.

It's worth noting that the AVX-512 instruction set is an extension of the AVX instruction set, which means that it requires a compatible processor and operating system to run. Also it's important to keep in mind that the implementation of these instructions might be specific to the software library or programming language used.
Give me an example in C++ using _mm512_gf2p8affine_epi64_epi8

Code: Select all

#include <immintrin.h>

__m512i matrix1, matrix2, result;

// Load the two matrices into the __m512i variables
matrix1 = _mm512_loadu_si512((__m512i *)matrix1_data);
matrix2 = _mm512_loadu_si512((__m512i *)matrix2_data);

// Perform the matrix multiplication
result = _mm512_gf2p8affine_epi64_epi8(matrix1, matrix2, 0);

// Store the result in the output array
_mm512_storeu_si512((__m512i *)output_data, result);
Now back to myself. What is very cool is that this has direct application in replacing some parts of nnue of cours but I was more interested in slider movegen of "bitrotation" this is a strong improvement upon hyperbola qsc where byteswap is replaced with bitrotation.

This really just 1 instructions - and I cant wait to get my hands on Zen4

Code: Select all

__m512i reverse_epi8(__m512i input) {
    const __m512i select = _mm512_set1_epi64(bit_shuffle_const(7, 6, 5, 4, 3, 2, 1, 0));
    return _mm512_gf2p8affine_epi64_epi8(input, select, 0x00);
}
Page 5 has some good examples: https://builders.intel.com/docs/network ... -guide.pdf
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
Ras
Posts: 2696
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Galois Field Affine Transformation

Post by Ras »

dangi12012 wrote: Fri Jan 13, 2023 9:40 pmWhat is very cool is that this has direct application in replacing some parts of nnue of cours
That would be relevant because the eval is a big part of where engines spend their computing time...
but I was more interested in slider movegen
... while movegen is not.
Rasmus Althoff
https://www.ct800.net
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Galois Field Affine Transformation

Post by dangi12012 »

Ras wrote: Sat Jan 14, 2023 1:50 am
dangi12012 wrote: Fri Jan 13, 2023 9:40 pmWhat is very cool is that this has direct application in replacing some parts of nnue of cours
That would be relevant because the eval is a big part of where engines spend their computing time...
but I was more interested in slider movegen
... while movegen is not.
Depends on the engine. On the canonical CPU alphabeta it's true. On others it's not so clear. Also some search extensions might use movegen for mobility etc to check whether to extend or not.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3226
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Galois Field Affine Transformation

Post by smatovic »

dangi12012 wrote: Fri Jan 13, 2023 9:40 pm [...]
My favourite chatbot will explain it better than I could:
[...]
Can they be used to multiply two 8x8 binary matrices represented by 64bit numbers?
[...]
Give me an example in C++ using _mm512_gf2p8affine_epi64_epi8
+1

Hahaha, Gen Z chess engine development ;) (I know, you can do it on your own...)

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Galois Field Affine Transformation

Post by dangi12012 »

I took the time and implemented the algorithm in software fallback here:
https://github.com/Gigantua/Chess_Moveg ... sField.hpp

and described it here:
https://www.talkchess.com/forum3/viewto ... 30#p941529

Unluckily I dont have any AVX512 capable hardware yet - so I cannot say what the performance is. Interestingly being a native 512 bit algorithm it can solve 2 Queens or 4 Rooks, Bishops at once.

The core algorithm is this:
having all 8 masks in 512 bits (which can be looked up)
Run this algorithm: ((o - sq) ^ bit_reverse(bit_reverse(o) - sq_rev)) & mask;

bit_reverse with 8x64bits will still only use 2 instructions (_mm512_gf2p8affine_epi64_epi8 and _mm512_shuffle_epi8)
counting correctly this will be: 8 Instructions + some inititialisatoin

The ratio of instructions to solved pieces is the best that exists - performance with hardware support remains to be seen.

I think this is also the natural evolution.
Solving one square at a time -> Mailslot and Loops
Solving one piece at a time-> Many BB algos
Solving many pieces at once -> This algorithm
Consolidate movegen and evaluation into matrix math -> (Future)

Having 8x8 binary GEMM available in hardware has some very big implications for chess in many other topics as well.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Galois Field Affine Transformation

Post by connor_mcmonigle »

dangi12012 wrote: Sat Jan 14, 2023 10:41 am
Ras wrote: Sat Jan 14, 2023 1:50 am
dangi12012 wrote: Fri Jan 13, 2023 9:40 pmWhat is very cool is that this has direct application in replacing some parts of nnue of cours
That would be relevant because the eval is a big part of where engines spend their computing time...
but I was more interested in slider movegen
... while movegen is not.
Depends on the engine. On the canonical CPU alphabeta it's true. On others it's not so clear. Also some search extensions might use movegen for mobility etc to check whether to extend or not.
Which others are you referring to? The only two competitive approaches at present are "relatively small network predicting value + PVS search" and "large network predicting value and policy + PUCT search". Neither of these approaches spends a significant amount of time on move generation (the latter even less so than the former). Though this doesn't preclude the possibility of some novel competitive approach which is more sensitive to move generation speed, the trend is certainly not going in this direction.

Also, no search extension heuristic (at least found in top engines) requires fast move generation. In general, move generation speed currently is largely orthogonal to Elo.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Galois Field Affine Transformation

Post by dangi12012 »

Actually quite nice - when using Intels Software Developer Emulator (SDE) https://www.intel.com/content/www/us/en ... lator.html

You can run any application with it and all child processes will report an emulated cpuid and be able to execute avx512 code.
For example my Zen3 predends now it is knights landing with AVX512F support.
Pretty cool when developing - can even start visual studio with it to execute and develop for older and newer ISAs:

Code: Select all

...
     -spr                Set chip-check and CPUID for Intel(R) Sapphire Rapids CPU
     -gnr                Set chip-check and CPUID for Intel(R) Granite Rapids CPU
     -srf                Set chip-check and CPUID for Intel(R) Sierra Forest CPU
     -grr                Set chip-check and CPUID for Intel(R) Grand Ridge CPU
     -future             Set chip-check and CPUID for Intel(R) Future chip CPU
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Galois Field Affine Transformation

Post by dangi12012 »

Glorious.

Here are 3 Rooks and 1 bishop. Solve all 16 rays please and give me back 8 bitboards.
https://godbolt.org/z/vYjaefEnP

Code: Select all

attack8(unsigned long, unsigned long, unsigned long, long long __vector(8)):                     # @attack8(unsigned long, unsigned long, unsigned long, long long __vector(8))
        vpbroadcastq    zmm1, rdi
        vpandq  zmm1, zmm1, zmm0
        vpbroadcastq    zmm2, rsi
        vpbroadcastq    zmm3, rdx
        vpbroadcastq    zmm4, qword ptr [rip + .LCPI0_0] # zmm4 = [9241421688590303745,9241421688590303745,9241421688590303745,9241421688590303745,9241421688590303745,9241421688590303745,9241421688590303745,9241421688590303745]
        vpsubq  zmm2, zmm1, zmm2
        vgf2p8affineqb  zmm1, zmm1, zmm4, 0
        vpshufd zmm1, zmm1, 78                  # zmm1 = zmm1[2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13]
        vpsubq  zmm1, zmm1, zmm3
        vgf2p8affineqb  zmm1, zmm1, zmm4, 0
        vpshufd zmm1, zmm1, 78                  # zmm1 = zmm1[2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13]
        vpternlogq      zmm0, zmm2, zmm1, 96
        ret
That is 1 Lookup + 12 instructions total for 4 pieces at once. Now my mind is at piece with that topic.
That being said vgf2p8affineqb has much more potential thant the multiplication with the transposed identity which is used here to do 8xbswap.
Fen Compression, Evaluation, Adding an additional layer to NN to go from Bitboard to bytes directly, and many more.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer