ulong GetDiagonal(int square)
{
int rank = square >> 3;
int file = square & 7;
return VerticalShift(DIAGONAL, file - rank);
}
ulong GetAntidiagonal(int square)
{
int rank = square >> 3;
int file = square & 7;
return VerticalShift(ANTIDIAGONAL, 7 - file - rank);
}
ulong GetVertical(int square)
{
return VERTICAL << (square & 7);
}
ulong GetHorizontal(int square)
{
return HORIZONTAL << (square & 56);
}
//sign of 'ranks' decides between left shift or right shift. Then convert signed ranks to a positiver number of bits to shift by. Each rank has 8 bits e.g. 1 << 3 == 8
ulong VerticalShift(ulong bb, int ranks) => ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
const ulong DIAGONAL = 0x8040201008040201UL;
const ulong ANTIDIAGONAL = 0x0102040810204080UL;
const ulong HORIZONTAL = 0x00000000000000FFUL;
const ulong VERTICAL = 0x0101010101010101UL;
The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.
Minimal Chess (simple, open source, C#) - Youtube & Github Leorik (competitive, in active development, C#) - Github & Lichess
ulong GetDiagonal(int square)
{
int rank = square >> 3;
int file = square & 7;
return VerticalShift(DIAGONAL, file - rank);
}
ulong GetAntidiagonal(int square)
{
int rank = square >> 3;
int file = square & 7;
return VerticalShift(ANTIDIAGONAL, 7 - file - rank);
}
ulong GetVertical(int square)
{
return VERTICAL << (square & 7);
}
ulong GetHorizontal(int square)
{
return HORIZONTAL << (square & 56);
}
//sign of 'ranks' decides between left shift or right shift. Then convert signed ranks to a positiver number of bits to shift by. Each rank has 8 bits e.g. 1 << 3 == 8
ulong VerticalShift(ulong bb, int ranks) => ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
const ulong DIAGONAL = 0x8040201008040201UL;
const ulong ANTIDIAGONAL = 0x0102040810204080UL;
const ulong HORIZONTAL = 0x00000000000000FFUL;
const ulong VERTICAL = 0x0101010101010101UL;
The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.
this looks a lot faster. Especially if using a template or inline on VerticalShift.
If this is C++ get into the habit of using constexpr/consteval/constinit instead of const. It really makes a huge difference in terms of performance.
lithander wrote: ↑Mon Jan 17, 2022 7:13 pm
The idea is pretty simple: Just take the line through the center that you want (the constant) and then shift them in the right way so they go through the square. But I think it's more or less coming down to exactly the same thing the previous posts suggest just written differently.
Oh I just saw its C#. Anyways its 10% faster than the original answer. Thanks for posting this!
The insight about this: a mov is always a mov. Nothing can change it. But having inline assembly that generates a constant in 2-3 instruction can be hidden inside and overlapped with other instructions by the compiler to get a higher overall IPC.
in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations
pedrojdm2021 wrote: ↑Sun Jan 23, 2022 4:17 am
in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations
Hi. I think everybody is doing it like that. In this thread the goal is explizitly to get a solution that avoids any lookup.
So the fastet "initialization" code may be used for real time computation. It is more about a theoretical context than a practical one.
(move generation context). So if your code for initialization is a fast one compared to the other ones, that would be interesting.
pedrojdm2021 wrote: ↑Sun Jan 23, 2022 4:17 am
in my engine i just populate them in arrays, and i get them by squareIndex. I don't calculate them on the fly. the only computation needed is to get the item from the array, but i don't know if it is actually more expensive that generating them by doing some binary operations
Hi. I think everybody is doing it like that. In this thread the goal is explizitly to get a solution that avoids any lookup.
So the fastet "initialization" code may be used for real time computation. It is more about a theoretical context than a practical one.
(move generation context). So if your code for initialization is a fast one compared to the other ones, that would be interesting.
Its a practical problem since on gpu architectures lookups are more expensive. So one lookup is more like 25 branchless lines of code which is good.
Also for CPU lookup with this inline version you will not get L1 pollution with all these small tables and can be sure that the cache hirarchy stays clean (so more performance in some scenarios)
I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45
Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?
dangi12012 wrote: ↑Sun Mar 06, 2022 10:49 pm
I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45
Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?
I see one possible optimization if I'm thinking clearly although the compiler might do it automatically. Since (X & 7) is done three times it should be done only once at the beginning. And the same for (X >> 3). But if compilers are as smart as what people say they are it probably won't change anything. And I do not understand templates anyway. Other than that I don't see anyway to make it faster.
dangi12012 wrote: ↑Sun Mar 06, 2022 10:49 pm
I want to warm up this thread because it just became more relevant than ever.
Is there a way to optimize this code further: https://godbolt.org/z/qKfWerq45
Is there a way to get more optimal assembly? I think Horizontal and Vertical lookup is literally the best that can be done - but is there a way to get away with the conditional shift in D1 and D2 - or are there other ways to generate the diagonal masks faster (without lookups)?
I see one possible optimization if I'm thinking clearly although the compiler might do it automatically. Since (X & 7) is done three times it should be done only once at the beginning. And the same for (X >> 3). But if compilers are as smart as what people say they are it probably won't change anything. And I do not understand templates anyway. Other than that I don't see anyway to make it faster.
Okay yes, I do see something, maybe. In this code: