[Question] Efficiently generate ray masks?

lithander · Post by **lithander** » Mon Jan 17, 2022 7:13 pm

Here's what I came up with for Leorik:

        ulong GetDiagonal(int square)
        {
            int rank = square >> 3;
            int file = square & 7;
            return VerticalShift(DIAGONAL, file - rank);
        }

        ulong GetAntidiagonal(int square)
        {
            int rank = square >> 3;
            int file = square & 7;
            return VerticalShift(ANTIDIAGONAL, 7 - file - rank);
        }

        ulong GetVertical(int square)
        {
            return VERTICAL << (square & 7);
        }

        ulong GetHorizontal(int square)
        {
            return HORIZONTAL << (square & 56);
        }

        //sign of 'ranks' decides between left shift or right shift. Then convert signed ranks to a positiver number of bits to shift by. Each rank has 8 bits e.g. 1 << 3 == 8
        ulong VerticalShift(ulong bb, int ranks) => ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
        
        const ulong DIAGONAL = 0x8040201008040201UL;
        const ulong ANTIDIAGONAL = 0x0102040810204080UL;
        const ulong HORIZONTAL = 0x00000000000000FFUL;
        const ulong VERTICAL = 0x0101010101010101UL;

The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.

dangi12012 · Post by **dangi12012** » Mon Jan 17, 2022 8:51 pm

lithander wrote: ↑Mon Jan 17, 2022 7:13 pm Here's what I came up with for Leorik:

Code: Select all

ulong GetDiagonal(int square)
        {
            int rank = square >> 3;
            int file = square & 7;
            return VerticalShift(DIAGONAL, file - rank);
        }

        ulong GetAntidiagonal(int square)
        {
            int rank = square >> 3;
            int file = square & 7;
            return VerticalShift(ANTIDIAGONAL, 7 - file - rank);
        }

        ulong GetVertical(int square)
        {
            return VERTICAL << (square & 7);
        }

        ulong GetHorizontal(int square)
        {
            return HORIZONTAL << (square & 56);
        }

        //sign of 'ranks' decides between left shift or right shift. Then convert signed ranks to a positiver number of bits to shift by. Each rank has 8 bits e.g. 1 << 3 == 8
        ulong VerticalShift(ulong bb, int ranks) => ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
        
        const ulong DIAGONAL = 0x8040201008040201UL;
        const ulong ANTIDIAGONAL = 0x0102040810204080UL;
        const ulong HORIZONTAL = 0x00000000000000FFUL;
        const ulong VERTICAL = 0x0101010101010101UL;

The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.

this looks a lot faster. Especially if using a template or inline on VerticalShift.
If this is C++ get into the habit of using constexpr/consteval/constinit instead of const. It really makes a huge difference in terms of performance.

dangi12012 · Post by **dangi12012** » Mon Jan 17, 2022 9:20 pm

lithander wrote: ↑Mon Jan 17, 2022 7:13 pm The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.

Oh I just saw its C#. Anyways its 10% faster than the original answer. Thanks for posting this!

dangi12012 · Post by **dangi12012** » Mon Jan 17, 2022 9:26 pm

Final result

Code: Select all

#include <stdint.h>
template<uint64_t bb>
static constexpr uint64_t mask_shift(int ranks) {
    return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}
template<int dir>
static constexpr uint64_t mask(int square) {
    int rank = square >> 3;
    int file = square & 7;

    if constexpr (dir == 0)
        return 0xFFull << (square & 56); //HORIZONTAL
    else if constexpr (dir == 1)
        return 0x0101010101010101ull << (square & 7); //VERTICAL
    else if constexpr (dir == 2)
        return mask_shift<0x8040201008040201ull>(file - rank); //Diagonal
    else {
        return mask_shift<0x0102040810204080ull>(7 - file - rank); //Antidiagonal
    }
}

Produces perfect assembly. Probably faster than looking it up on many architectures.
https://godbolt.org/z/Yej3ajb7v

The insight about this: a mov is always a mov. Nothing can change it. But having inline assembly that generates a constant in 2-3 instruction can be hidden inside and overlapped with other instructions by the compiler to get a higher overall IPC.

pedrojdm2021 · Post by **pedrojdm2021** » Sun Jan 23, 2022 4:17 am

in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations

Desperado · Post by **Desperado** » Sun Jan 23, 2022 9:49 am

pedrojdm2021 wrote: ↑Sun Jan 23, 2022 4:17 am in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations

Hi. I think everybody is doing it like that. In this thread the goal is explizitly to get a solution that avoids any lookup.
So the fastet "initialization" code may be used for real time computation. It is more about a theoretical context than a practical one.
(move generation context). So if your code for initialization is a fast one compared to the other ones, that would be interesting.

dangi12012 · Post by **dangi12012** » Sun Jan 23, 2022 10:30 pm

Desperado wrote: ↑Sun Jan 23, 2022 9:49 am
pedrojdm2021 wrote: ↑Sun Jan 23, 2022 4:17 am in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations

Hi. I think everybody is doing it like that. In this thread the goal is explizitly to get a solution that avoids any lookup.
So the fastet "initialization" code may be used for real time computation. It is more about a theoretical context than a practical one.
(move generation context). So if your code for initialization is a fast one compared to the other ones, that would be interesting.

Its a practical problem since on gpu architectures lookups are more expensive. So one lookup is more like 25 branchless lines of code which is good.
Also for CPU lookup with this inline version you will not get L1 pollution with all these small tables and can be sure that the cache hirarchy stays clean (so more performance in some scenarios)

dangi12012 · Post by **dangi12012** » Sun Mar 06, 2022 10:49 pm

I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45

Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?

Code: Select all

template<uint64_t bb>
uint64_t mask_shift(int ranks) {
	return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}

#	define dir_HO(X) (0xFFull << (X & 56))
#	define dir_VE(X) (0x0101010101010101ull << (X & 7))
#	define dir_D1(X) (mask_shift<0x8040201008040201ull>((X & 7) - (X >> 3)))
#	define dir_D2(X) (mask_shift<0x0102040810204080ull>(7 - (X & 7) - (X >> 3)))

uint64_t Optimize(int sq, uint64_t occ)
{
    return dir_HO(sq) ^ dir_VE(sq) ^  dir_D1(sq) ^ dir_D2(sq);
}

Mike Sherwin · Post by **Mike Sherwin** » Sun Mar 06, 2022 11:16 pm

dangi12012 wrote: ↑Sun Mar 06, 2022 10:49 pm I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45

Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?
Code: Select all
template<uint64_t bb>
uint64_t mask_shift(int ranks) {
	return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}

#	define dir_HO(X) (0xFFull << (X & 56))
#	define dir_VE(X) (0x0101010101010101ull << (X & 7))
#	define dir_D1(X) (mask_shift<0x8040201008040201ull>((X & 7) - (X >> 3)))
#	define dir_D2(X) (mask_shift<0x0102040810204080ull>(7 - (X & 7) - (X >> 3)))

uint64_t Optimize(int sq, uint64_t occ)
{
    return dir_HO(sq) ^ dir_VE(sq) ^  dir_D1(sq) ^ dir_D2(sq);
}

I see one possible optimization if I'm thinking clearly although the compiler might do it automatically. Since (X & 7) is done three times it should be done only once at the beginning. And the same for (X >> 3). But if compilers are as smart as what people say they are it probably won't change anything. And I do not understand templates anyway. Other than that I don't see anyway to make it faster.

Mike Sherwin · Post by **Mike Sherwin** » Mon Mar 07, 2022 12:29 am

Mike Sherwin wrote: ↑Sun Mar 06, 2022 11:16 pm
dangi12012 wrote: ↑Sun Mar 06, 2022 10:49 pm I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45

Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?
Code: Select all
template<uint64_t bb>
uint64_t mask_shift(int ranks) {
	return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
}

#	define dir_HO(X) (0xFFull << (X & 56))
#	define dir_VE(X) (0x0101010101010101ull << (X & 7))
#	define dir_D1(X) (mask_shift<0x8040201008040201ull>((X & 7) - (X >> 3)))
#	define dir_D2(X) (mask_shift<0x0102040810204080ull>(7 - (X & 7) - (X >> 3)))

uint64_t Optimize(int sq, uint64_t occ)
{
    return dir_HO(sq) ^ dir_VE(sq) ^  dir_D1(sq) ^ dir_D2(sq);
}
I see one possible optimization if I'm thinking clearly although the compiler might do it automatically. Since (X & 7) is done three times it should be done only once at the beginning. And the same for (X >> 3). But if compilers are as smart as what people say they are it probably won't change anything. And I do not understand templates anyway. Other than that I don't see anyway to make it faster.

Okay yes, I do see something, maybe. In this code:

Code: Select all

	uint64_t mask_shift(int ranks) {
		return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
	}

Why introduce an unneeded branch instead of having two functions:

Code: Select all

uint64_t mask_shift_up(int ranks) {
    return bb << -(ranks << 3);
}

Code: Select all

uint64_t mask_shift_down(int ranks) {
    return bb >> (ranks << 3);
}

[Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?

Re: [Question] Efficiently generate ray masks?