Fastest pawn quiet move generation I was able to come up with

Sven · Post by **Sven** » Sat Jun 10, 2017 9:25 pm

cdani wrote:Sure someone can improve this, but I think it is really fast.

b = bitboard of pawns
increment = 8 or 16

	uint8_t cas;
	uint8_t casd;
	switch (popcount(b)) {
	case 8:
		cas = lsb(b);
		b &= b - 1;
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	// ...
	case 2:
		cas = lsb(b);
		b &= b - 1;
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	case 1:
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	case 0:
		return ml;
	default: // > 8 pawns for strange games or positions
	altremovpeo2:
		if (b == 0)
			return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
	}

Obviously you do this twice during generation of quiet pawn moves: once for all pawns on rank 2-6 (for white) that can advance one rank, with increment=8, and once for all pawns on rank 2 that can advance two ranks, with increment=16. So your modification adds two calls to popcount(). How can it still be faster than the "standard" implementation (example below) that does not need any popcount (and also no "goto")?

Code: Select all

	while (b) {
		uint8_t cas = lsb(b);
		b &= b - 1;
		(ml++)->moviment = MakeMove(cas, cas + increment);
	}
	return ml;
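To make the two passes concrete, here is a minimal sketch for white, assuming a1 = bit 0 and a bitboard `empty` of unoccupied squares. The `Move` struct, the `MakeMove` encoding, the portable `lsb`, the rank masks and the helper names are all assumptions for illustration; only the loop itself and the `moviment` field name come from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical move-list entry; only the field name `moviment` is from the thread. */
typedef struct { uint16_t moviment; } Move;

/* Assumed encoding: from-square in bits 0-5, to-square in bits 6-11. */
static uint16_t MakeMove(unsigned from, unsigned to) {
    return (uint16_t)(from | (to << 6));
}

/* Portable stand-in for a bit-scan intrinsic; b must be nonzero. */
static unsigned lsb(uint64_t b) {
    unsigned sq = 0;
    assert(b != 0);
    while ((b & 1) == 0) { b >>= 1; sq++; }
    return sq;
}

#define RANKS_2_TO_6 0x0000FFFFFFFFFF00ULL  /* a push from rank 7 would promote */
#define RANK_2       0x000000000000FF00ULL

/* White pawns whose single push is a quiet move (target square empty). */
static uint64_t single_push_sources(uint64_t pawns, uint64_t empty) {
    return pawns & RANKS_2_TO_6 & (empty >> 8);
}

/* White pawns on rank 2 with both squares in front of them empty. */
static uint64_t double_push_sources(uint64_t pawns, uint64_t empty) {
    return pawns & RANK_2 & (empty >> 8) & (empty >> 16);
}

/* The plain loop: one move per set bit, no popcount needed. */
static Move *gen_pushes(uint64_t b, unsigned increment, Move *ml) {
    while (b) {
        unsigned from = lsb(b);
        b &= b - 1;                       /* clear the bit just used */
        (ml++)->moviment = MakeMove(from, from + increment);
    }
    return ml;
}

/* Two passes: increment 8 for single pushes, increment 16 for double pushes. */
static Move *gen_white_quiet_pawn_moves(uint64_t pawns, uint64_t empty, Move *ml) {
    ml = gen_pushes(single_push_sources(pawns, empty), 8, ml);
    return gen_pushes(double_push_sources(pawns, empty), 16, ml);
}
```

Called on the initial position (pawns `0xFF00`, ranks 3 to 6 empty), this yields 16 moves: eight single pushes followed by eight double pushes.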

cdani wrote:
mar wrote:What was the performance gain versus a loop? (and overall speed gain?)
Of course minimal, something like 1100000 n/s to 1100500 n/s to give an idea. It is more for fun than anything else, and it is also interesting to share and to get other ideas.

1) About how many positions did you use to test the performance difference? 10, 100, 1000, ...?
2) What was the previous solution to compare against (the one with 1100000 n/s): a loop similar to my example above, or something different?

cdani wrote:Any elo gain is due to accumulating a few of those things.

Certainly correct. However, accumulating micro-optimizations can sometimes also turn out to be neutral (in terms of speed and/or rating difference) if at least one of them is somehow "unclear", i.e. it sometimes helps and sometimes hurts.

mkchan · Post by **mkchan** » Sat Jun 10, 2017 9:43 pm

My mistake, for some reason I assumed it was in a loop. Now I see you basically unrolled it. Not sure if that's actually faster than the loop itself

jwes · Post by **jwes** » Sat Jun 10, 2017 11:41 pm

cdani wrote:Sure someone can improve this, but I think it is really fast.

Code: Select all

b = bitboard of pawns
increment = 8 or 16

	uint8_t cas;
	uint8_t casd;
	switch (popcount(b)) {
	default: // > 8 pawns for strange games or positions
	altremovpeo2:
		if (b == 0)
			return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
	}

It won't make any difference, but if you don't like gotos, you can write the default as:

Code: Select all

	default: // > 8 pawns for strange games or positions
            for (; b; b &= b - 1)
            {
                cas = lsb(b);
                casd = cas + increment;
                ml->moviment = MakeMove(cas, casd);
                ml++;
            }
            return ml;

mar · Post by **mar** » Sat Jun 10, 2017 11:43 pm

Sven Schüle wrote:How can it still be faster than the "standard" implementation (example below) that does not need any popcount (and also no "goto")?

Well popcount is 1 instruction and I don't see any goto except a jump table. Also, if you hint the compiler that default is unreachable, it may be even faster.

However I agree with you that such micro-optimizations are a dubious win at best, making the code more complicated. We're certainly talking 0 elo here.
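For reference, a sketch of what the "unreachable default" hint could look like. The `UNREACHABLE()` macro is my own wrapper around `__assume(0)` (MSVC) and `__builtin_unreachable()` (GCC/Clang); `Move`, the `MakeMove` encoding, the portable `lsb`/`popcount` and the `EMIT_PUSH` macro are likewise assumptions, not cdani's actual code. Note that with the hint in place, passing a bitboard with more than 8 bits set is undefined behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper around the compiler-specific "cannot happen" hints. */
#if defined(_MSC_VER)
#  define UNREACHABLE() __assume(0)
#elif defined(__GNUC__)
#  define UNREACHABLE() __builtin_unreachable()
#else
#  define UNREACHABLE() ((void)0)
#endif

/* Hypothetical move-list entry; only the field name `moviment` is from the thread. */
typedef struct { uint16_t moviment; } Move;

/* Assumed encoding: from-square in bits 0-5, to-square in bits 6-11. */
static uint16_t MakeMove(unsigned from, unsigned to) {
    return (uint16_t)(from | (to << 6));
}

/* Portable stand-in for a bit-scan intrinsic; b must be nonzero. */
static unsigned lsb(uint64_t b) {
    unsigned sq = 0;
    assert(b != 0);
    while ((b & 1) == 0) { b >>= 1; sq++; }
    return sq;
}

/* Portable stand-in for a popcount instruction. */
static int popcount(uint64_t b) {
    int n = 0;
    for (; b; b &= b - 1) n++;
    return n;
}

/* One unrolled step: take the lowest pawn, clear it, store the move. */
#define EMIT_PUSH() do {                                        \
        unsigned from = lsb(b);                                 \
        b &= b - 1;                                             \
        (ml++)->moviment = MakeMove(from, from + increment);    \
    } while (0)

/* Unrolled generation; the caller must guarantee popcount(b) <= 8. */
static Move *gen_pushes_unrolled(uint64_t b, unsigned increment, Move *ml) {
    switch (popcount(b)) {
    case 8: EMIT_PUSH(); /* fall through */
    case 7: EMIT_PUSH(); /* fall through */
    case 6: EMIT_PUSH(); /* fall through */
    case 5: EMIT_PUSH(); /* fall through */
    case 4: EMIT_PUSH(); /* fall through */
    case 3: EMIT_PUSH(); /* fall through */
    case 2: EMIT_PUSH(); /* fall through */
    case 1: EMIT_PUSH(); /* fall through */
    case 0: return ml;
    default: UNREACHABLE();  /* lets the compiler drop the range check */
    }
    return ml;
}
```

Whether the compiler really removes the bounds check on the jump table depends on the compiler and flags, so it is worth looking at the generated assembly before relying on it.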

Sven · Post by **Sven** » Sun Jun 11, 2017 12:58 am

mar wrote:
Sven Schüle wrote:How can it still be faster than the "standard" implementation (example below) that does not need any popcount (and also no "goto")?
Well popcount is 1 instruction

But does it use only 1 CPU cycle?

mar wrote:and I don't see any goto except a jump table.

I meant this (within code that is unreachable for standard chess, though):

Code: Select all

      goto altremovpeo2;

hgm · Post by **hgm** » Sun Jun 11, 2017 6:58 am

I can add that I have seen cases where deleting unreachable code caused a slowdown of nearly 15% (in qperft). The assembly for the reachable code looked identical in both cases. I don't know if modern CPUs still can exhibit such a paradoxical behavior.

Gerd Isenberg · Post by **Gerd Isenberg** » Sun Jun 11, 2017 8:16 am

I would not do it, even if it seems to give a small gain. Popcount + a likely mispredicted indirect jump + nine branch target buffer entries, versus a small loop which only costs one entry in the BTB ...

Code: Select all

if (b) do {...} while (b &= b-1);
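Spelled out as a full function, the one-liner above might look like the sketch below. The loop shape is the one from the post; `Move`, the `MakeMove` encoding, the portable `lsb` and the function name are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical move-list entry; only the field name `moviment` is from the thread. */
typedef struct { uint16_t moviment; } Move;

/* Assumed encoding: from-square in bits 0-5, to-square in bits 6-11. */
static uint16_t MakeMove(unsigned from, unsigned to) {
    return (uint16_t)(from | (to << 6));
}

/* Portable stand-in for a bit-scan intrinsic; b must be nonzero. */
static unsigned lsb(uint64_t b) {
    unsigned sq = 0;
    assert(b != 0);
    while ((b & 1) == 0) { b >>= 1; sq++; }
    return sq;
}

/* Test for "no pawns" once up front, then loop with a single
 * conditional branch at the bottom. */
static Move *gen_pushes_dowhile(uint64_t b, unsigned increment, Move *ml) {
    if (b) do {
        unsigned from = lsb(b);
        (ml++)->moviment = MakeMove(from, from + increment);
    } while ((b &= b - 1) != 0);   /* clear lowest bit, continue while any remain */
    return ml;
}
```

The clearing of the lowest bit doubles as the loop condition, so each iteration ends in exactly one branch.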

http://www.agner.org/optimize/microarchitecture.pdf

https://stackoverflow.com/questions/373 ... ke-jmp-rax

mar · Post by **mar** » Sun Jun 11, 2017 9:26 am

Sven Schüle wrote:But does it use only 1 CPU cycle?

If we can trust this table then it should be much faster than bit scan
http://www.agner.org/optimize/instruction_tables.pdf

mar · Post by **mar** » Sun Jun 11, 2017 10:54 am

hgm wrote:I can add that I have seen cases where deleting unreachable code caused a slowdown of nearly 15% (in qperft). The assembly for the reachable code looked identical in both cases. I don't know if modern CPUs still can exhibit such a paradoxical behavior.

This is interesting, I recently experienced something similar I can't explain (yet the code is different):

Code: Select all

.loop:
mov eax,[edi+4]
mov ecx,[eax*4 + ofs]
mov edx,[edi]
mov ebx,[edx*4 + ofs]
cmp ecx, ebx
jle .skip
mov eax,[edi+4]    ; slower if omitted
mov ecx,[eax*4 + ofs]
mov edx,[edi]        ; slower if omitted
mov ebx,[edx*4 + ofs]
mov [eax*4 + ofs], ebx
mov [edx*4 + ofs], ecx
.skip:
...
... loop ...

The above (dumb) machine code is generated by a program of mine (inner loop of worst-case bubble sort).
I managed to avoid useless reloads (marked with `;`), yet after that the code ran ~6% slower...

cdani · Post by **cdani** » Sun Jun 11, 2017 1:01 pm

I have tested again, this time including the new ideas given here.

The tests use a little more than 710,000 FEN positions from all phases of the game, generating the moves 100 times for each position.

Each variant is compiled (PGO) on an i7 5820K; one test is run on that same machine, and another on an AMD FX-8350.

Variant 1

Code: Select all

	uint8_t cas;
	uint8_t casd;
	switch (popcount(b)) {
	case 8:
		cas = lsb(b);
		b &= b - 1;
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	case 7:
        ...
	case 1:
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	case 0:
		return ml;
	default: // strange cases
	altremovpeo2:
		if (b == 0)
			return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
	}

Variant 2

Code: Select all

		altremovpeo2:
		if (b == 0)
		    return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;

Variant 3

Code: Select all

	for (; b; b &= b - 1)
	{
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	}
	return ml;

Variant 4

Code: Select all

	uint8_t cas;
	uint8_t casd;
	int numpeons;
	ml += (numpeons = popcount(b));
	switch (numpeons) {
	case 8:
		cas = lsb(b);
		b &= b - 1;
		casd = cas + increment;
		ml[-8].moviment = MakeMove(cas, casd);
	case 7:
        ...
	case 1:
		cas = lsb(b);
		casd = cas + increment;
		ml[-1].moviment = MakeMove(cas, casd);
	case 0:
		return ml;
	default: // strange cases
		ml -= numpeons; // undo the advance above, otherwise the list gets a gap
	altremovpeo2:
		if (b == 0)
			return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
	}

I have run each of the 4 variants 3 times on each computer and taken the median time in ms.

Code: Select all

Median time in ms, by variant
Variant  1      2      3      4
Intel    110853 112580 112631 112714
AMD      140561 144129 144220 142313

So there is no doubt that the fastest one is the first. I thought that the 4th could be the fastest, but that was not the case. Maybe Visual Studio is not optimizing it very well.
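To put the timings in proportion, here is a small sketch that takes the median of three runs and expresses one variant's time as a percentage gain over another. The function names are hypothetical, and the only inputs I use with it are the medians posted above.

```c
#include <assert.h>

/* Median of three run times (any order). */
static long median3(long a, long b, long c) {
    if (a > b) { long t = a; a = b; b = t; }
    if (b > c) { long t = b; b = c; c = t; }
    if (a > b) { long t = a; a = b; b = t; }
    return b;
}

/* Percent by which candidate_ms is shorter than baseline_ms. */
static double percent_faster(long baseline_ms, long candidate_ms) {
    assert(baseline_ms > 0);
    return 100.0 * (double)(baseline_ms - candidate_ms) / (double)baseline_ms;
}
```

With the posted medians, `percent_faster(112631, 110853)` (variant 3 vs. variant 1 on Intel) comes out at roughly 1.6%, and `percent_faster(144220, 140561)` (same comparison on AMD) at roughly 2.5%. These are differences in the isolated move-generation test, not in overall engine speed.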
