Branchless make/unmake logic

leanchess · Post by **leanchess** » Tue Aug 22, 2023 11:03 am

It's been quite a while since I promised this, but I believe it finally works.

A move is defined as a pair of half-moves:

Code: Select all

typedef union {
	uint64_t v;
	struct {
		half_move_t prim;
		half_move_t sec;
	};
} move_t;

A half-move consists of two piece-squares:

Code: Select all

typedef union {
	uint32_t v;
	struct {
		piece_square_t from;
		piece_square_t to;
	};
} half_move_t;

A piece-square contains (surprise!) the piece and the square:

Code: Select all

typedef union {
	uint16_t v;
	struct {
		union {
			uint8_t piece;
			struct {
				uint8_t index4  : 4;
				uint8_t type    : 4;
			};
			struct {
				uint8_t index   : 5;
				uint8_t type3   : 3;
			};
		};
		union {
			uint8_t square;
			struct {
				uint8_t square7 : 7;
				uint8_t virgin  : 1;
			};
		};
	};
} piece_square_t;

The colour of the piece is shared between its index (as MSB) and type (as LSB).

Since piece indices are invariant, index4 changes semantics in to piece-squares, namely,

prim.to.index4 value is XORed into the board's ep_file field (0-7 for a-h files, 8 for N/A), while
sec.to.index4 value is XORed into the board's castle field (a 4-bit bitmask).

This means that those values have to be calculated for every position.

Castling rights delta is deferred from the colour/side, the 2 LSBs of index4 and virgin (kings, A-castling rook and H-castling are placed in indices 0, 1 and 2 respectively). virgin bits are only set in from piece-squares, allowing to.square to be used as a whole for extra efficiency.

Castling moves (before delta adjustments):

Code: Select all

K 05A087A206688460
Q 03A080A102688460
k 3DB0BFB23E78BC70
q 3BB0B8B13A78BC70

The current implementation supports bitboards, an 8x8 mailbox and a 2x16 piece list, but this could be easily adapted to 0x88.

leanchess · Post by **leanchess** » Sun Aug 27, 2023 2:25 am

It looks like nobody is interested.

How about efficiently compressing the board to 192 bits using this idea?

The board structure (perhaps pio_board_t should be more appropriate?):

Code: Select all

typedef struct {
	uint64_t occ;
	uint64_t v1;
	uint64_t v2;
} pico_board_t;

int128_t would have been ideal, but that'd cause an alignment issue (any ideas regarding a solution?)

The encoding function:

Code: Select all

void pico_write(const board_t* b, pico_board_t* p) {
	uint128_t v = 0;
	uint8_t s = 0, sq8 = board_get_ep_square(b) ^ 8;
	for (uint64_t n = p->occ = b->occ; n; n &= n - 1, s += 4) {
		piece_square_t ps = b->mbox[_ctz_u64(n)];
		uint128_t t = ps.type == TYPE_KING_BLACK && b->turn
			? TYPE_KING_BLACK_TURN
			: ((ps.type3 == TYPE3_PAWN && ps.square == sq8) ||
				(ps.type3 == TYPE3_ROOK && ps.virgin
					&& (b->castle & 1 << ((ps.index & 1) | ps.color << 1)))
						? ps.color | TYPE3_SPEC << 1
						: ps.type);
		v |= t << s;
	}
	p->v1 = (uint64_t)v;
	p->v2 = v >> 64;
}

Decoding is slightly more complicated, but still acceptable:

Code: Select all

result_t pico_read(board_t* b, const pico_board_t* p) {
	uint8_t c[2] = {0, 0};
	board_clear(b);
	b->ep_file = FILE_COUNT;
	uint128_t v = p->v1 | (uint128_t)p->v2 << 64;
	result_t r;
	for (uint64_t n = b->occ = p->occ; n; n &= n - 1, v >>= 4) {
		piece_square_t ps = {
			.type   = v & 15,
			.square = _ctz_u64(n)
		};
		if ((r = read_piece(b, ps, c)).error) {
			return r;
		}
		ps = (piece_square_t){r.v16};
		uint64_t m = _blsi_u64(n);
		b->bbs[ps.type3] |= m; 
		b->bbs[ps.color] |= m;
		b->mbox[ps.square7] = ps;
	}
	if ((r = update_indices(b, c)).error) {
		return r;
	}
	update_castles(b);
	return (result_t){0};
}

I will publish the full source should anyone request it. I believe this could be useful in e.g. legal positions number estimation.

Constructive feedback would be greatly appreciated.

Edit: Passing/returning the picoboard by value for better performance:

Code: Select all

pico_board_t pico_write(const board_t* b) {
	// ...
	return (pico_board_t){
		.occ = b->occ,
		.v1 = (uint64_t)v,
		.v2 = v >> 64
	};
}

leanchess · Post by **leanchess** » Sun Aug 27, 2023 7:14 am

leanchess wrote: ↑Sun Aug 27, 2023 2:57 am Edit: We need to encode white king to move if we want to stay colour-agnostic, as Pio puts it. Flipping would then only require swapping the nibbles while flipping the LSBs.

Let's do that!

Code: Select all

uint128_t get_type(const board_t* b, piece_square_t ps, uint8_t sq8) {
	return ps.type3 == TYPE3_KING && b->turn == ps.color
		? ps.color | TYPE3_TURN << 1
		: ((ps.type3 == TYPE3_PAWN && ps.square == sq8) ||
			(ps.type3 == TYPE3_ROOK && ps.virgin
				&& (b->castle & 1 << ((ps.index & 1) | ps.color << 1)))
					? ps.color | TYPE3_SPEC << 1
					: ps.type);
}

Now we can finally have some fun with PEXT and PDEP:

Code: Select all

pico_board_t pico_flip(pico_board_t p) {
	pico_board_t q = {.occ = _bswap64(p.occ), .v1 = 0, .v2 = 0};
	uint64_t bb0, bb;
	for (uint64_t m = 0x1111111111111111, i = 0; i < 4; ++i, m <<= 1) {
		bb0 = _pext_u64(p.v1, m) | _pext_u64(p.v2, m) << 16;
		bb = _pext_u64(_bswap64(_pdep_u64(i ? bb0 : ~bb0, p.occ)), q.occ);
		q.v1 |= _pdep_u64((uint16_t)bb, m);
		q.v2 |= _pdep_u64(bb >> 16, m);
	}
	return q;
}

This makes me wonder whether we chose the wrong bit order. Maybe we should move from deposited mailbox to deposited quadboards?

leanchess · Post by **leanchess** » Sun Aug 27, 2023 8:38 am

Updated type:

Code: Select all

enum { PICO_BB_COUNT = 4};

typedef struct {
	uint64_t occ;
	union {
		struct {
			uint64_t v1;
			uint64_t v2;
		};
		uint32_t bbs[PICO_BB_COUNT];
	};
} pico_board_t;

Technically, these are no bitboards, but I've been coding all night, and I'm out of naming ideas.

Updated encoder:

Code: Select all

pico_board_t pico_write(const board_t* b) {
	uint128_t v = 0;
	uint8_t sq8 = board_get_ep_square(b) ^ 8;
	for (uint64_t m = 0x0000000100000001, n = b->occ; n; n &= n - 1, m <<= 1) {
		piece_square_t ps = b->mbox[_ctz_u64(n)];
		uint8_t t = get_type(b, ps, sq8);
		v |= _pdep_u64(t, m) | (uint128_t)_pdep_u64(t >> 2, m) << 64;
	}
	return (pico_board_t){
		.occ = b->occ,
		.v1 = (uint64_t)v,
		.v2 = v >> 64
	};
}

(the return type of get_type() is changed to uint8_t).

Updated decoder:

Code: Select all

result_t pico_read(board_t* b, pico_board_t p) {
	uint8_t c[2] = {0, 0};
	board_clear(b);
	b->ep_file = FILE_COUNT;
	uint128_t v = p.v1 | (uint128_t)p.v2 << 64;
	result_t r;
	for (uint64_t m = 0x0000000100000001, n = b->occ = p.occ; n; n &= n - 1, m <<= 1) {
		piece_square_t ps = {
			.type   = (uint8_t)(_pext_u64((uint64_t)v, m) | _pext_u64(v >> 64, m) << 2),
			.square = _ctz_u64(n)
		};
		if ((r = read_piece(b, ps, c)).error) {
			return r;
		}
		ps = (piece_square_t){r.v16};
		uint64_t l = _blsi_u64(n);
		b->bbs[ps.type3] |= l; 
		b->bbs[ps.color] |= l;
		b->mbox[ps.square7] = ps;
	}
	if ((r = update_indices(b, c)).error) {
		return r;
	}
	update_castles(b);
	return (result_t){0};
}

These make me wish for 128-bit PEXT and PDEP instructions... m for mask, l for lazy/late at morning.

Finally, nice and (hopefully) fast colour flip:

Code: Select all

pico_board_t pico_flip(pico_board_t p) {
	uint64_t occ = _bswap64(p.occ);
	for (uint64_t i = 0; i < PICO_BB_COUNT; ++i) {
		p.bbs[i] = (uint32_t)_pext_u64(_bswap64(_pdep_u64(i ? p.bbs[i] : ~p.bbs[i], p.occ)), occ);
	}
	p.occ = occ;
	return p;
}

Copyright (C) 2023 Dmitry Shechtman

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License Version 2.

leanchess · Post by **leanchess** » Mon Aug 28, 2023 10:20 am

leanchess wrote: ↑Tue Aug 22, 2023 11:03 am Castling rights delta is deferred from the colour/side, the 2 LSBs of index4 and virgin (kings, A-castling rook and H-castling are placed in indices 0, 1 and 2 respectively). virgin bits are only set in from piece-squares, allowing to.square to be used as a whole for extra efficiency.

This got me thinking. Deriving the castle delta from index isn't very pretty. What if we could use two MSBs of piece-square instead of one?

Well, as it turns out, we can! The 6th bit of square is only required for captures, where it is set in sec.to.square. Therefore, if we are careful, we can safely reuse it in both prim.from and sec.from. So square7 becomes square6, and the virgin bit is replaced with two castle bits:

Code: Select all

typedef union {
	uint16_t v;
	struct {
		union {
			uint8_t piece;
			struct {
				uint8_t index4  : 4;
				uint8_t type    : 4;
			};
			struct {
				uint8_t index   : 5;
				uint8_t type3   : 3;
			};
			struct {
				uint8_t         : 4;
				uint8_t color   : 1;
				uint8_t         : 3;
			};
		};
		union {
			uint8_t square;
			struct {
				uint8_t square6 : 6;
				uint8_t castle  : 2;
			};
		};
	};
} piece_square_t;

New castling moves (again, before delta adjustments):

Code: Select all

K 05A087A2-0668C460
Q 03A040A1-0268C460
k 3DB0BFB2-3E78FC70
q 3BB078B1-3A78FC70

Note the H-rooks are unchanged, since they still have just their single uppermost bits set.

Supporting 0x88 will become slightly more complicated, but still possible - just spread the castling rights between bits 3 and 7 and use square & 0x77 instead of square6.

dangi12012 · Post by **dangi12012** » Wed Aug 30, 2023 4:58 pm

Good that someone is looking into that direction as well. I have been using this for a while now and here is how my approach works. I worked very long on this and I consider it optimal. You will see why.

Here is my board structure:
__m256i YMM; //really a struct 64x4 with sensible overloaded operators

The pieces are encoded as follows:
//64x4 =>() Pawns, Rooks, Bish, Knight
//Pawn & Rooks = Special. When on Rank0 it is castling rook. Otherwise Enpassant.
//Rooks & Bish = Queen
//Bish & Knight = King

So this encodes castling and EP as well as the color implicitly.

1) the board is color agnostic so its always from the perspective of the player about to move. This has some nice implications for nnue inference and zobrist. It implies that we need to rotate the board after each move. This is unconditional:

Code: Select all

	static inline u64x4 mirror_horizontal(u64x4 x) {
		return _mm256_or_si256(
			_mm256_shuffle_epi8(lu_lo, _mm256_and_si256(x.YMM, epi8)),
			_mm256_shuffle_epi8(lu_up, _mm256_and_si256(_mm256_srli_epi16(x.YMM, 4), epi8))
		);

Now here comes the cool part. You can for instance in the lookup of pext not go into a uint64_t bitboard of the resulting bits, but store a list of delta represantations the delta representation can be updated to take the current situation into account.

Now other than the rotation the actual application of the move is this:
__m256i next_pos = _mm256_xor_si256(YMM, delta);

To undo a move its this:
__m256i prev_pos = _mm256_xor_si256(YMM, delta);

There are more details with how to get the delta to work unconditionally since some flags like EP decay on their own. Its only EP target flag that has has the property of any change on the board not having to do with the moved piece.

But at the core of it is a single XOR. At the end I want to mention that if you inject a template into alphabeta (visitor pattern) the stack itself can do the undo move for you. When your board fits in a single YMM register - there is no need to do that kind of bookkeeping yourself.

hgm · Post by **hgm** » Mon Sep 04, 2023 4:32 pm

You advertize branchless Make/UnMake logic. But it is not clear to me how you handle castling to achieve that. Do you always move the Rook, but make that operation a dummy for all moves that are not castlings? I have thought about that, but always rejected the idea, as I felt putting the extra code needed for castling in an if-statement that does require a branch. Castlings are only possible during a small part of the game, and often occur (or see the rights destroyed) in the opening, so that they are in fact never possible while the engine thinks. So such a branch should be 100% predictably skip the extra code.

E.p. captures might also require additional actions, but these can be performed by simply shifting the capture square away from the to-square. But that would not save the piece on the to-square, openeing a possibility to make the engine crash by a garbage move (and thus require more testing for validating the hash move). E.p. captures are also very rare in the tree, so it is very questionable whether always doing the extra work (as a dummy when it is not needed) would be faster than just suffering the mispredict when you encounter the e.p. capture.

The thing I worry about most is how to handle creation of e.p. rights. Pawn double pushes are not all that rare. So putting the code for that behind a branch does cause significant misprediction. That suggests it would pay off to let it handle by always-executed code. Perhaps it can be combined with updating castling rights; moves that create e.p. rights would not destroy castling rights. You can keep track of the virginity of the Pawns the same way as you update castling rights, by rights &= spoiler[fromSqr] & spoiler[toSqr].

One way to get e.p. rights would be to duplicate the Pawn virginity bits, and let all spoilers clear those bits again. Except the spoilers for 2nd- and 4th-rank, which would leave alone the one in the same file. In addition the 2nd-rank spoiler would clear the original virginity bit for that file. The you would end up with a word that contains the new virginity of the 2nd-rank and castlings, plus a bit per file that indicates whether it has e.p. rights.

leanchess · Post by **leanchess** » Mon Sep 04, 2023 4:47 pm

hgm wrote: ↑Mon Sep 04, 2023 4:32 pm You advertize branchless Make/UnMake logic. But it is not clear to me how you handle castling to achieve that. Do you always move the Rook, but make that operation a dummy for all moves that are not castlings? I have thought about that, but always rejected the idea, as I felt putting the extra code needed for castling in an if-statement that does require a branch. Castlings are only possible during a small part of the game, and often occur (or see the rights destroyed) in the opening, so that they are in fact never possible while the engine thinks. So such a branch should be 100% predictably skip the extra code.

E.p. captures might also require additional actions, but these can be performed by simply shifting the capture square away from the to-square. But that would not save the piece on the to-square, openeing a possibility to make the engine crash by a garbage move (and thus require more testing for validating the hash move). E.p. captures are also very rare in the tree, so it is very questionable whether always doing the extra work (as a dummy when it is not needed) would be faster than just suffering the mispredict when you encounter the e.p. capture.

The thing I worry about most is how to handle creation of e.p. rights. Pawn double pushes are not all that rare. So putting the code for that behind a branch does cause significant misprediction. That suggests it would pay off to let it handle by always-executed code. Perhaps it can be combined with updating castling rights; moves that create e.p. rights would not destroy castling rights. You can keep track of the virginity of the Pawns the same way as you update castling rights, by rights &= spoiler[fromSqr] & spoiler[toSqr].

One way to get e.p. rights would be to duplicate the Pawn virginity bits, and let all spoilers clear those bits again. Except the spoilers for 2nd- and 4th-rank, which would leave alone the one in the same file. In addition the 2nd-rank spoiler would clear the original virginity bit for that file. The you would end up with a word that contains the new virginity of the 2nd-rank and castlings, plus a bit per file that indicates whether it has e.p. rights.

It's really simple. As I said, each move is handled as two half-moves:

1. prim is the quiet part (with the optional type change for promotion), while
2. sec is capture (including e.p.), castling or nothing.

Everything is handled transparently.

Two unused index nibbles serve as deltas:

1. e.p. file delta - XORed into the board's ep_file value (0-8) and
2. caste delta - XORed into the board's castle value (4 separate bits).

Of course, those need to be set depending on the board's state.

The beauty of it is that make and unmake is exacly the same (until we need to update halfmove clock and fullmove count).

This seems to be working perfectly with pseudolegal movegen so far. Legal movegen is a whole different issue, which I'm yet to solve.

hgm · Post by **hgm** » Mon Sep 04, 2023 7:14 pm

leanchess wrote: ↑Mon Sep 04, 2023 4:47 pmThe beauty of it is that make and unmake is exacly the same (until we need to update halfmove clock and fullmove count).

I am not sure that is a good thing; normally UnMake can be simpler, by just restoring the old value. XORing two variables needs two memory loads, an XOR, and a store. That is 4 instructions. If you spend a 5th for storing the original value in a save place, copying that back would require only a load and a store. Together only 3 instructions.

leanchess · Post by **leanchess** » Tue Sep 05, 2023 2:04 am

hgm wrote: ↑Mon Sep 04, 2023 7:14 pm I am not sure that is a good thing; normally UnMake can be simpler, by just restoring the old value. XORing two variables needs two memory loads, an XOR, and a store. That is 4 instructions. If you spend a 5th for storing the original value in a save place, copying that back would require only a load and a store. Together only 3 instructions.

A move is 64 bits, so it should be in a register. I suppose you could replace the deltas with values in make, but I'm not sure whether that would be optimal.

P.S. My code for updating the e.p. file, casting rights and side to move:

Code: Select all

	b->flags ^= (m.v & 0xF0000000F0000) | 1;

Branchless make/unmake logic

Branchless make/unmake logic

Re: Branchless make/unmake logic

Re: Branchless make/unmake logic

Re: Branchless make/unmake logic

Re: Branchless make/unmake logic

Re: Branchless make/unmake logic

Re: Branchless make/unmake logic

Re: Branchless make/unmake logic

Re: Branchless make/unmake logic

Re: Branchless make/unmake logic