Fastest pawn quiet move generation I was able to come with

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Fastest pawn quiet move generation I was able to come wi

Post by Sven »

cdani wrote:Sure someone can improve this, but I think is really fast.

Code: Select all

b = bitboard of pawns
increment = 8 or 16

	uint8_t cas;
	uint8_t casd;
	switch (popcount(b)) {
	case 8:
		cas = lsb(b);
		b &= b - 1;
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	// ...
	case 2:
		cas = lsb(b);
		b &= b - 1;
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	case 1:
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	case 0:
		return ml;
	default: //> 8 pawns for strange games or positions
	altremovpeo2 :
		if (b == 0)
			return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
	}
Obviously you do this twice during generation of quiet pawn moves: once for all pawns on rank 2-6 (for white) that can advance one rank, with increment=8, and once for all pawns on rank 2 that can advance two ranks, with increment=16. So your modification adds two calls to popcount(). How can it still be faster than the "standard" implementation (example below) that does not need any popcount (and also no "goto")?

Code: Select all

	while (b) {
		uint8_t cas = lsb(b);
		b &= b - 1;
		(ml++)->moviment = MakeMove(cas, cas + increment);
	}
	return ml;
cdani wrote:
mar wrote:What was the performance gain versus a loop? (and overall speed gain?)
Of course minimal, something like 1100000 n/s to 1100500 n/s to give an idea. Is more for fun than for anything other, and also is interesting to share and to have other ideas.
1) About how many positions did you use to test the performance difference? 10, 100, 1000, ...?
2) What was the previous solution to compare against (the one with 1100000 n/s): a loop similar to my example above, or something different?
cdani wrote:Any elo gain is due to accumulating a few of those things.
Certainly correct. However, accumulating micro-optimizations can sometimes also turn out to be neutral (in terms of speed and/or rating difference) if at least one of the micro-optimizations is somehow "unclear", can sometimes help and sometimes hurt, etc.
mkchan
Posts: 88
Joined: Thu Oct 06, 2016 9:17 pm
Location: India

Re: Fastest pawn quiet move generation I was able to come wi

Post by mkchan »

My mistake, for some reason I assumed it was in a loop. Now I see you basically unrolled it. Not sure if that's actually faster than the loop itself
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Fastest pawn quiet move generation I was able to come wi

Post by jwes »

cdani wrote:Sure someone can improve this, but I think is really fast.

Code: Select all

b = bitboard of pawns
increment = 8 or 16

	uint8_t cas;
	uint8_t casd;
	switch (popcount(b)) {
	default: //> 8 pawns for strange games or positions
	altremovpeo2 :
		if (b == 0)
			return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
	}
It won't make any difference, but if you don't like gotos, you can write the default as:

Code: Select all

	default: //> 8 pawns for strange games or positions
            for (;b;b &= b - 1)
            {
                cas = lsb(b);
                casd = cas + increment;
                ml->moviment = MakeMove(cas, casd);
                ml++;
            }
            return ml;
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Fastest pawn quiet move generation I was able to come wi

Post by mar »

Sven Schüle wrote:How can it still be faster than the "standard" implementation (example below) that does not need any popcount (and also no "goto")?
Well popcount is 1 instruction and I don't see any goto except a jump table. Also, if you hint the compiler that default is unreachable, it may be even faster.

However I agree with you that such micro-optimizations are a dubious win at best, making the code more complicated. We're certainly talking 0 elo here.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Fastest pawn quiet move generation I was able to come wi

Post by Sven »

mar wrote:
Sven Schüle wrote:How can it still be faster than the "standard" implementation (example below) that does not need any popcount (and also no "goto")?
Well popcount is 1 instruction
But does it use only 1 CPU cycle?
mar wrote:and I don't see any goto except a jump table.
I meant this (within code that is unreachable for standard chess, though):

Code: Select all

      goto altremovpeo2;
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Fastest pawn quiet move generation I was able to come wi

Post by hgm »

I can add that I have seen cases where deleting unreachable code caused a slowdown of nearly 15% (in qperft). The assembly for the reachable code looked identical in both cases. I don't know if modern CPUs still can exhibit such a paradoxical behavior.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Fastest pawn quiet move generation I was able to come wi

Post by Gerd Isenberg »

I would not do it - even if it seems to give small gain. Popcount + likely miss predicted indirect jump + nine branch target buffer entries versus a small loop which only costs one entry in the btb ...

Code: Select all

if (b) do {...} while (b &= b-1);
http://www.agner.org/optimize/microarchitecture.pdf

https://stackoverflow.com/questions/373 ... ke-jmp-rax
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Fastest pawn quiet move generation I was able to come wi

Post by mar »

Sven Schüle wrote:But does it use only 1 CPU cycle?
If we can trust this table then it should be much faster than bit scan
http://www.agner.org/optimize/instruction_tables.pdf
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Fastest pawn quiet move generation I was able to come wi

Post by mar »

hgm wrote:I can add that I have seen cases where deleting unreachable code caused a slowdown of nearly 15% (in qperft). The assembly for the reachable code looked identical in both cases. I don't know if modern CPUs still can exhibit such a paradoxical behavior.
This is interesting, I recently experienced something similar I can't explain (yet the code is different):

Code: Select all

.loop:
mov eax,[edi+4]
mov ecx,[eax*4 + ofs]
mov edx,[edi]
mov ebx,[edx*4 + ofs]
cmp ecx, ebx
jle .skip
mov eax,[edi+4]    ; slower if omitted
mov ecx,[eax*4 + ofs]
mov edx,[edi]        ; slower if omitted
mov ebx,[edx*4 + ofs]
mov [eax*4 + ofs], ebx
mov [edx*4 + ofs], ecx
.skip:
...
... loop ...
The above (dumb) machine code is generated by a program of mine (inner loop of worst-case bubble sort).
I managed to avoid useless reloads (marked with `;`), yet after that the code ran ~6% slower...
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Fastest pawn quiet move generation I was able to come wi

Post by cdani »

I have tested again and also the new ideas given here.

The test are with a little more of 710,000 fen positions of any phase of the game, generating the moves 100 times for each position.

Each variant is compiled (PGO) in an I7 5820K, and one test is run on itself, and another in an AMD-FX 8350.

Variant 1

Code: Select all

	uint8_t cas;
	uint8_t casd;
	switch (popcount(b)) {
	case 8:
		cas = lsb(b);
		b &= b - 1;
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	case 7:
        ...
	case 1:
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	case 0:
		return ml;
	default: //strange cases
	altremovpeo2 :
		if (b == 0)
			return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
	}
Variant 2

Code: Select all

		altremovpeo2 :
		if (b == 0)
		    return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
Variant 3

Code: Select all

	for (; b; b &= b - 1)
	{
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
	}
	return ml;
Variant 4

Code: Select all

	uint8_t cas;
	uint8_t casd;
	int numpeons;
	ml += (numpeons = popcount(b));
	switch (numpeons) {
	case 8:
		cas = lsb(b);
		b &= b - 1;
		casd = cas + increment;
		ml[-8].moviment = MakeMove(cas, casd);
	case 7:
        ...
	case 1:
		cas = lsb(b);
		casd = cas + increment;
		ml[-1].moviment = MakeMove(cas, casd);
	case 0:
		return ml;
	default: //strange cases
	altremovpeo2 :
		if (b == 0)
			return ml;
		cas = lsb(b);
		casd = cas + increment;
		ml->moviment = MakeMove(cas, casd);
		ml++;
		b &= b - 1;
		goto altremovpeo2;
	}
I have run each of the 4 variants 3 times on each computer and taken the median of the ms.

Code: Select all

Intel
1      2      3      4
110853 112580 112631 112714

Amd
1      2      3      4
140561 144129 144220 142313
So there is no doubt that the fastest one is the first. I thought than the 4th could be the fastest but was not the case. Maybe is Visual studio that is not optimizing very well.