Over the whole integer range there are only 4 solutions: 2015959759, 2015960475, 2016185679, 2017106723
Now when reading the cpw - I saw that the first xor is not needed because xor is assosciative and assignments introduce a dependencychain. https://www.chessprogramming.org/BitScan
So this code:
Code: Select all
int bitScanForward(uint64_t bb) {
unsigned int folded;
bb ^= bb - 1;
folded = (int) bb ^ (bb >> 32);
return lsb_64_table[folded * 0x78291ACF >> 26];
}
Code: Select all
int bitScanForward(uint64_t bb) {
uint32_t folded = (int32_t)(((bb - 1) >> 32) ^ (bb - 1));
return lsb_64_table[folded * 0x78291ACF >> 26];
}
Of course compilers know about XOR associativity too - but the reassignment of bb is not needed algorithmically (xor twice is a noop). With my version the asm blsmsk gets replaced with dec which is a much simpler instruction. So the reciprocal throughput for this single instruction goes from 0.5 to 0.25 which is twice as many per clock and the faster x86 simple op-decoders can decode this opcode.
Proof of improvement (around 3%)
https://godbolt.org/z/xbbPnEs1h
Now I also did some further playing around and found an alternative solution which is not useful in any way because there are faster solutions - but its a sort of divide and conquer down to 8 bit:
Code: Select all
int bitScanForward(uint64_t k)
{
k = (((k - 788440361 ) >> 32) ^ (k - 1913312527));
k = (((k - 1342754306) >> 16) ^ (k - 531570821 ));
k = (((k - 1854703209) >> 8 ) ^ (k - 1093726077));
return lsb_256_table[k & 0xFF];
}
Code: Select all
static bool bitScanForward_solve(int seed)
{
uint64_t m = 0;
for (int i=0;i<64;i++)
{
uint64_t bb = 1ull << i;
bb = (bb - 1) - ((bb - 1) >> 32);
uint32_t idx = (static_cast<uint32_t>(bb) * seed) >> 26;
uint64_t idx_mask = 1ull << idx;
if (m & idx_mask) return false;
m |= idx_mask;
}
std::cout << "SOLVED! ";
std::cout << std::format("({})\n", seed);
return true;
}
#pragma omp parallel for
for (int i = -2147483647; i < 2147483647; i++)
{
bitScanForward_solve(i);
}
In my mind its still 2005 with singlecore cpus and finding solutions over the whole integer range takes a few minutes. We live in great times.
Greetings - Daniel