[Question] Efficiently generate ray masks?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: [Question] Efficiently generate ray masks?

Post by lithander »

Here's what I came up with for Leorik:

Code: Select all

        ulong GetDiagonal(int square)
        {
            int rank = square >> 3;
            int file = square & 7;
            return VerticalShift(DIAGONAL, file - rank);
        }

        ulong GetAntidiagonal(int square)
        {
            int rank = square >> 3;
            int file = square & 7;
            return VerticalShift(ANTIDIAGONAL, 7 - file - rank);
        }

        ulong GetVertical(int square)
        {
            return VERTICAL << (square & 7);
        }

        ulong GetHorizontal(int square)
        {
            return HORIZONTAL << (square & 56);
        }

        //sign of 'ranks' decides between left shift or right shift. Then convert signed ranks to a positiver number of bits to shift by. Each rank has 8 bits e.g. 1 << 3 == 8
        ulong VerticalShift(ulong bb, int ranks) => ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
        
        const ulong DIAGONAL = 0x8040201008040201UL;
        const ulong ANTIDIAGONAL = 0x0102040810204080UL;
        const ulong HORIZONTAL = 0x00000000000000FFUL;
        const ulong VERTICAL = 0x0101010101010101UL;
        
The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: [Question] Efficiently generate ray masks?

Post by dangi12012 »

lithander wrote: Mon Jan 17, 2022 7:13 pm Here's what I came up with for Leorik:

Code: Select all

ulong GetDiagonal(int square)
        {
            int rank = square >> 3;
            int file = square & 7;
            return VerticalShift(DIAGONAL, file - rank);
        }

        ulong GetAntidiagonal(int square)
        {
            int rank = square >> 3;
            int file = square & 7;
            return VerticalShift(ANTIDIAGONAL, 7 - file - rank);
        }

        ulong GetVertical(int square)
        {
            return VERTICAL << (square & 7);
        }

        ulong GetHorizontal(int square)
        {
            return HORIZONTAL << (square & 56);
        }

        //sign of 'ranks' decides between left shift or right shift. Then convert signed ranks to a positiver number of bits to shift by. Each rank has 8 bits e.g. 1 << 3 == 8
        ulong VerticalShift(ulong bb, int ranks) => ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
        
        const ulong DIAGONAL = 0x8040201008040201UL;
        const ulong ANTIDIAGONAL = 0x0102040810204080UL;
        const ulong HORIZONTAL = 0x00000000000000FFUL;
        const ulong VERTICAL = 0x0101010101010101UL;
        
The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.
this looks a lot faster. Especially if using a template or inline on VerticalShift.
If this is C++ get into the habit of using constexpr/consteval/constinit instead of const. It really makes a huge difference in terms of performance.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: [Question] Efficiently generate ray masks?

Post by dangi12012 »

lithander wrote: Mon Jan 17, 2022 7:13 pm The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.
Oh I just saw its C#. Anyways its 10% faster than the original answer. Thanks for posting this!
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: [Question] Efficiently generate ray masks?

Post by dangi12012 »

Final result

Code: Select all

#include <stdint.h>
template<uint64_t bb>
static constexpr uint64_t mask_shift(int ranks) {
    return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}
template<int dir>
static constexpr uint64_t mask(int square) {
    int rank = square >> 3;
    int file = square & 7;

    if constexpr (dir == 0)
        return 0xFFull << (square & 56); //HORIZONTAL
    else if constexpr (dir == 1)
        return 0x0101010101010101ull << (square & 7); //VERTICAL
    else if constexpr (dir == 2)
        return mask_shift<0x8040201008040201ull>(file - rank); //Diagonal
    else {
        return mask_shift<0x0102040810204080ull>(7 - file - rank); //Antidiagonal
    }
}
Produces perfect assembly. Probably faster than looking it up on many architectures.
https://godbolt.org/z/Yej3ajb7v

The insight about this: a mov is always a mov. Nothing can change it. But having inline assembly that generates a constant in 2-3 instruction can be hidden inside and overlapped with other instructions by the compiler to get a higher overall IPC.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: [Question] Efficiently generate ray masks?

Post by pedrojdm2021 »

in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: [Question] Efficiently generate ray masks?

Post by Desperado »

pedrojdm2021 wrote: Sun Jan 23, 2022 4:17 am in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations

Hi. I think everybody is doing it like that. In this thread the goal is explizitly to get a solution that avoids any lookup.
So the fastet "initialization" code may be used for real time computation. It is more about a theoretical context than a practical one.
(move generation context). So if your code for initialization is a fast one compared to the other ones, that would be interesting.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: [Question] Efficiently generate ray masks?

Post by dangi12012 »

Desperado wrote: Sun Jan 23, 2022 9:49 am
pedrojdm2021 wrote: Sun Jan 23, 2022 4:17 am in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations

Hi. I think everybody is doing it like that. In this thread the goal is explizitly to get a solution that avoids any lookup.
So the fastet "initialization" code may be used for real time computation. It is more about a theoretical context than a practical one.
(move generation context). So if your code for initialization is a fast one compared to the other ones, that would be interesting.
Its a practical problem since on gpu architectures lookups are more expensive. So one lookup is more like 25 branchless lines of code which is good.
Also for CPU lookup with this inline version you will not get L1 pollution with all these small tables and can be sure that the cache hirarchy stays clean (so more performance in some scenarios)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: [Question] Efficiently generate ray masks?

Post by dangi12012 »

I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45

Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?

Code: Select all

template<uint64_t bb>
uint64_t mask_shift(int ranks) {
	return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}

#	define dir_HO(X) (0xFFull << (X & 56))
#	define dir_VE(X) (0x0101010101010101ull << (X & 7))
#	define dir_D1(X) (mask_shift<0x8040201008040201ull>((X & 7) - (X >> 3)))
#	define dir_D2(X) (mask_shift<0x0102040810204080ull>(7 - (X & 7) - (X >> 3)))

uint64_t Optimize(int sq, uint64_t occ)
{
    return dir_HO(sq) ^ dir_VE(sq) ^  dir_D1(sq) ^ dir_D2(sq);
}
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 869
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: [Question] Efficiently generate ray masks?

Post by Mike Sherwin »

dangi12012 wrote: Sun Mar 06, 2022 10:49 pm I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45

Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?

Code: Select all

template<uint64_t bb>
uint64_t mask_shift(int ranks) {
	return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}

#	define dir_HO(X) (0xFFull << (X & 56))
#	define dir_VE(X) (0x0101010101010101ull << (X & 7))
#	define dir_D1(X) (mask_shift<0x8040201008040201ull>((X & 7) - (X >> 3)))
#	define dir_D2(X) (mask_shift<0x0102040810204080ull>(7 - (X & 7) - (X >> 3)))

uint64_t Optimize(int sq, uint64_t occ)
{
    return dir_HO(sq) ^ dir_VE(sq) ^  dir_D1(sq) ^ dir_D2(sq);
}
I see one possible optimization if I'm thinking clearly although the compiler might do it automatically. Since (X & 7) is done three times it should be done only once at the beginning. And the same for (X >> 3). But if compilers are as smart as what people say they are it probably won't change anything. And I do not understand templates anyway. Other than that I don't see anyway to make it faster. :oops:
Mike Sherwin
Posts: 869
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: [Question] Efficiently generate ray masks?

Post by Mike Sherwin »

Mike Sherwin wrote: Sun Mar 06, 2022 11:16 pm
dangi12012 wrote: Sun Mar 06, 2022 10:49 pm I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45

Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?

Code: Select all

template<uint64_t bb>
uint64_t mask_shift(int ranks) {
	return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}

#	define dir_HO(X) (0xFFull << (X & 56))
#	define dir_VE(X) (0x0101010101010101ull << (X & 7))
#	define dir_D1(X) (mask_shift<0x8040201008040201ull>((X & 7) - (X >> 3)))
#	define dir_D2(X) (mask_shift<0x0102040810204080ull>(7 - (X & 7) - (X >> 3)))

uint64_t Optimize(int sq, uint64_t occ)
{
    return dir_HO(sq) ^ dir_VE(sq) ^  dir_D1(sq) ^ dir_D2(sq);
}
I see one possible optimization if I'm thinking clearly although the compiler might do it automatically. Since (X & 7) is done three times it should be done only once at the beginning. And the same for (X >> 3). But if compilers are as smart as what people say they are it probably won't change anything. And I do not understand templates anyway. Other than that I don't see anyway to make it faster. :oops:
Okay yes, I do see something, maybe. In this code:

Code: Select all

	uint64_t mask_shift(int ranks) {
		return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
	}
Why introduce an unneeded branch instead of having two functions:

Code: Select all

uint64_t mask_shift_up(int ranks) {
    return bb << -(ranks << 3);
}

Code: Select all

uint64_t mask_shift_down(int ranks) {
    return bb >> (ranks << 3);
}
:?: