Optimizing Matt Taylor's Folding trick

dangi12012 · Post by **dangi12012** » Fri Apr 22, 2022 1:22 am

When wanting to know the LSB of 64 bits without harware acceleration then there is a nice trick to do it with 32 bit multiplication.
Over the whole integer range there are only 4 solutions: 2015959759, 2015960475, 2016185679, 2017106723
Now when reading the cpw - I saw that the first xor is not needed because xor is assosciative and assignments introduce a dependencychain. https://www.chessprogramming.org/BitScan

So this code:

Code: Select all

int bitScanForward(uint64_t bb) {
   unsigned int folded;
   bb ^= bb - 1;
   folded = (int) bb ^ (bb >> 32);
   return lsb_64_table[folded * 0x78291ACF >> 26];
}

Can be shortened to this:

Code: Select all

int bitScanForward(uint64_t bb) {
   uint32_t folded = (int32_t)(((bb - 1) >> 32) ^ (bb - 1));
   return lsb_64_table[folded * 0x78291ACF >> 26];
}

Of course compilers know about XOR associativity too - but the reassignment of bb is not needed algorithmically (xor twice is a noop). With my version the asm blsmsk gets replaced with dec which is a much simpler instruction. So the reciprocal throughput for this single instruction goes from 0.5 to 0.25 which is twice as many per clock and the faster x86 simple op-decoders can decode this opcode.

Proof of improvement (around 3%)
https://godbolt.org/z/xbbPnEs1h

Now I also did some further playing around and found an alternative solution which is not useful in any way because there are faster solutions - but its a sort of divide and conquer down to 8 bit:

Code: Select all

int bitScanForward(uint64_t k)
{
	k = (((k - 788440361 ) >> 32)  ^ (k - 1913312527));
	k = (((k - 1342754306) >> 16)  ^ (k - 531570821 ));
	k = (((k - 1854703209) >> 8 )  ^ (k - 1093726077));
	return lsb_256_table[k & 0xFF];
}

Also a nice thing of todays age is that any 32 bit constant for any non complex problem can be found in below 1 second on modern systems with a simple #pragma omp parallel for

Code: Select all

static bool bitScanForward_solve(int seed)
{
	uint64_t m = 0;
	for (int i=0;i<64;i++)
	{
		uint64_t bb = 1ull << i;
		bb = (bb - 1) - ((bb - 1) >> 32);
		uint32_t idx = (static_cast<uint32_t>(bb) * seed) >> 26;

		uint64_t idx_mask = 1ull << idx;
		if (m & idx_mask) return false;
		m |= idx_mask;
	}
	std::cout << "SOLVED! ";
	std::cout << std::format("({})\n", seed);
	return true;
}

#pragma omp parallel for
for (int i = -2147483647; i < 2147483647; i++)
{
     bitScanForward_solve(i);
}

Maybe all of the above is already known - but I still wanted to share it.
In my mind its still 2005 with singlecore cpus and finding solutions over the whole integer range takes a few minutes. We live in great times.
Greetings - Daniel