My Perft Results

Mike Sherwin · Post by **Mike Sherwin** » Sat Nov 05, 2022 6:57 pm

dangi12012 wrote: ↑Sat Nov 05, 2022 10:48 am
Mike Sherwin wrote: ↑Sat Nov 05, 2022 2:39 am Well my generator is pseudo legal and I know no other way than to make the move, call InCheck and then unmake the move for the last ply. If that is stupid then please explain.
Do you need special code for castling? - the only move where the position is legal before and after, but yet the move might be illegal?

[fen]rn2k1nr/ppp3pp/3p4/4pp1b/8/8/PPPP1PPP/R3K1NR w KQkq - 0 1[/fen]

I'm not sure what you mean by special code. It is just common in in that it looks for occupied between squares and certain attacked squares. The algorithm for doing that I kind of feel is special. First I have two king types WK and WKC. A WK on the board does not consider castleing at any time. A WKC on the board considers castleing only if there is a WRC on h1 or a1.

Code: Select all

    case WKC:
      gbb[ply][fs] = kingMoves[fs] & notme;
      abb[ply] |= kingMoves[fs];
      gbb[ply][fs] |= (u64)((board[h1] == WRC) + ((occ & SWCS) == 0) + (AttackedByBlack(t, AWCS) == 0) == 3) << g1;
      gbb[ply][fs] |= (u64)((board[a1] == WRC) + ((occ & SWCL) == 0) + (AttackedByBlack(t, AWCL) == 0) == 3) << c1;
      break;
 
 s32 AttackedByBlack(Thread* t, u64 bb) {
  s32 n = 0;
  u64 bbn;
  u64 occ = piece[BLACK] | piece[WHITE];

  do {
	s32 sq = std::countr_zero(bb);
	bbn = (knightMoves[sq] & pctypbb[BN])
	    | (wPawnCapts[sq] & pctypbb[BP])
	    | (kingMoves[sq] & pctypbb[BK])
	    | (((ray[std::countr_zero(ray[sq].NW & occ)].NW
		   | ray[std::countr_zero(ray[sq].NE & occ)].NE
		   | rev[std::countl_zero(ray[sq].SE & occ)].SE
		   | rev[std::countl_zero(ray[sq].SW & occ)].SW)
		   ^ bishopMoves[sq])
		   & (pctypbb[BB] | pctypbb[BQ]))
	    | (((ray[std::countr_zero(ray[sq].NN & occ)].NN
		   | ray[std::countr_zero(ray[sq].EE & occ)].EE
		   | rev[std::countl_zero(ray[sq].SS & occ)].SS
		   | rev[std::countl_zero(ray[sq].WW & occ)].WW)
		   ^ rookMoves[sq])
		   & (pctypbb[BR] | pctypbb[BQ]));
	n += std::popcount(bbn);
	bb ^= 1ull << sq;
  } while (bb);

  return n;
}

Mike Sherwin · Post by **Mike Sherwin** » Sat Nov 05, 2022 7:59 pm

hgm wrote: ↑Sat Nov 05, 2022 9:17 am
Mike Sherwin wrote: ↑Sat Nov 05, 2022 2:39 amWell my generator is pseudo legal and I know no other way than to make the move, call InCheck and then unmake the move for the last ply. If that is stupid then please explain.
The 'stupid' qualification only referred to the case where you would make and unmake moves without doing anything in between. Testing for check would be a valid excuse for doing it. But it would be a very inefficient method for testing move legality. And that could be justifiably be held against the method for move generation. Most moves can very well be tested for legality without making them.

E.g. moves of a non-royal piece can only be illegal when that piece was pinned. (Or when you were in check to begin with, but that should be handled by a special-case move generator.) To be pinned such pieces have to be aligned with the friendly King. Most pieces won't be so aligned, and it is quite cheap to test for (e.g. if(aligned[fromSqr-kingSqr]) ... ), and failing the test excludes that the move can be illegal. And thus removes the necessity for making and unmaking them just for check testing.

And you don't even have to do that test for every individual move: you can do it for all moves of the piece at once, as these all have the same fromSqr. So you could do it in the move generator before generating moves for that piece, complete the test by looking for a pinner in the few cases that is is aligned with its King, and fall back on a special case of move generation only for moves along the pin ray when they are actually pinned, instead of generating all their moves.

This is basically what Qperft does. Except instead of subjecting all the pieces it generates moves for to the alignment test, it subjects all opponent sliders to such a test. (There are only 5 of those, while it could have 16 pieces of its own.) Again alignment is not common, but if it occurs it tests for a single piece on the ray to King, and then that piece is pinned and exempted from normal move generation, and uses the on-ray move generator instead.

None of the above is a 'perft trick'; it is just a method for making a speedy move generator from which an engine benefits just as well.

The above optimizations do not apply to King moves; actually Qperft does make all King moves, even in 'bulk counting' mode, to test whether they are legal by an IsSquareAttacked() call on the toSqr after the move. In principle this is still wasteful. Making the move before doing the IsSquareAttacked test only makes a difference when the King would have 'stepped into its own shadow', and this is only possible when it was in check to begin with. Which probably is known, but even if not, fully making the move could simply be replaced by temporarily taking the King off board during the test, so that it cannot block any slider attacks on the square behind it. Even worse, doing tests for all King destinations separately is wasteful. Because once you found out a slider attacks one such square, it usually attacks more. And the same test could have revealed that.

E.g. you could subject every enemy piece an alignment test with the king neighborhood through a lookup in bulkAligned[sqr-kingSqr], which would give you a mask where each bit corresponds to a square adjacent to King, set to 1 if the piece is aligned with it. You could have a second set of bits in the same word to indicate whether these are distant attacks. For those you would have to vet the attack for not being blocked. But usually finding a blocker would block all attacks of that piece into the King neighborhood. The masks with the unblocked attacks for all pieces could be ORed together to get the illegal King destinations, and then you only generate King moves to the remaining squares.

Again, this would not be a 'perft trick', but just an optimization of the move generator.

Good information!

Notes:
I do not call everything a trick.
Cool it is not my perft that is stupid.

It is my mg that is stupid.

np all's

JoAnnP38 · Post by **JoAnnP38** » Thu Nov 24, 2022 12:42 am

It's been a couple of weeks since I posted on this thread, but in that time I decided to redesign my move generator to use simple bitboards. I just wasn't able to achieve the search speed I was shooting for using mailbox and skip-lists. There seems to be a performance boundary I am running into by virtue of my engine design, my skill and the language I am using (i.e. C#). Even using simple bitboards I was not able to achieve sub 5 second perft 6 timings. When I finally got to the point that my "optimizations" were much more likely to negatively impact performance I decided enough was enough. Here are the best timings I was able to achieve:

NO TRANSPOSITION TABLES FOR SIMPLE SINGLE THREADED PERFT
Single threaded results:
1: Elapsed = 00:00:00.0076171, Mnps: 0.00, nodes = 20
2: Elapsed = 00:00:00.0000682, Mnps: 5.87, nodes = 400
3: Elapsed = 00:00:00.0011546, Mnps: 7.71, nodes = 8902
4: Elapsed = 00:00:00.0244530, Mnps: 8.07, nodes = 197281
5: Elapsed = 00:00:00.3723415, Mnps: 13.07, nodes = 4865609
6: Elapsed = 00:00:07.7679397, Mnps: 15.33, nodes = 119060324

This is my simple perft test. No tricks. Just brute force:

Code: Select all

        public ulong Execute(int depth)
        {
            if (depth == 0)
            {
                return 1ul;
            }

            ulong nodes = 0;

            board.GenerateMoves(moveList);
            Span<Move> moves = moveList.GetMoves(board.Ply);
            for (int i = 0; i < moves.Length; ++i)
            {
                if (board.MakeMove(moves[i]))
                {
                    nodes += Execute(depth - 1);
                    board.UnmakeMove();
                }
            }
            return nodes;
        }

However, after getting to this point and since I've also decided that my engine will use SMP to the best of my ability (probably Lazy SMP or some form of tree-splitting) I decided to see what that would do for my perft 6 performance. Since I will need to be able to support lockless transposition tables I also introduced them to the multi-threaded perft test and here is what I was able to achieve:

USING LOCKLESS TRANSPOSITION (XOR DATA & KEY) TABLES
Multi-threaded results (2 threads):
1: Elapsed = 00:00:00.0379713, Mnps: 0.00, Nodes = 20
2: Elapsed = 00:00:00.0092862, Mnps: 0.04, Nodes = 400
3: Elapsed = 00:00:00.0082186, Mnps: 1.08, Nodes = 8902
4: Elapsed = 00:00:00.0152597, Mnps: 12.93, Nodes = 197281
5: Elapsed = 00:00:00.1069914, Mnps: 45.48, Nodes = 4865609
6: Elapsed = 00:00:01.5681940, Mnps: 75.92, Nodes = 119060324

Multi-threaded results (4 threads):
1: Elapsed = 00:00:00.0082904, Mnps: 0.00, Nodes = 20
2: Elapsed = 00:00:00.0106227, Mnps: 0.04, Nodes = 400
3: Elapsed = 00:00:00.0085897, Mnps: 1.04, Nodes = 8902
4: Elapsed = 00:00:00.0125251, Mnps: 15.75, Nodes = 197281
5: Elapsed = 00:00:00.0592803, Mnps: 82.08, Nodes = 4865609
6: Elapsed = 00:00:00.7690445, Mnps: 154.82, Nodes = 119060324

Multi-threaded results (6 threads):
1: Elapsed = 00:00:00.0090133, Mnps: 0.00, Nodes = 20
2: Elapsed = 00:00:00.0087692, Mnps: 0.05, Nodes = 400
3: Elapsed = 00:00:00.0107237, Mnps: 0.83, Nodes = 8902
4: Elapsed = 00:00:00.0119909, Mnps: 16.45, Nodes = 197281
5: Elapsed = 00:00:00.0457344, Mnps: 106.39, Nodes = 4865609
6: Elapsed = 00:00:00.5560614, Mnps: 214.11, Nodes = 119060324

The transposition tables were the real winner here. I didn't preserve the timings for multi-threaded SMP and no transposition tables, but it was almost linear for perft 5 or 6. Of course, for perft 1-3, the overhead of setting up multiple threads is actually a detriment as the tables above show.

Now on to search (finally).

okidoki · Post by **okidoki** » Thu Nov 24, 2022 5:14 pm

JoAnnP38 wrote: ↑Thu Nov 24, 2022 12:42 am It's been a couple of weeks since I posted on this thread, but in that time I decided to redesign my move generator to use simple bitboards. I just wasn't able to achieve the search speed I was shooting for using mailbox and skip-lists. There seems to be a performance boundary I am running into by virtue of my engine design, my skill and the language I am using (i.e. C#). Even using simple bitboards I was not able to achieve sub 5 second perft 6 timings. When I finally got to the point that my "optimizations" were much more likely to negatively impact performance I decided enough was enough. Here are the best timings I was able to achieve:

NO TRANSPOSITION TABLES FOR SIMPLE SINGLE THREADED PERFT
Single threaded results:
1: Elapsed = 00:00:00.0076171, Mnps: 0.00, nodes = 20
2: Elapsed = 00:00:00.0000682, Mnps: 5.87, nodes = 400
3: Elapsed = 00:00:00.0011546, Mnps: 7.71, nodes = 8902
4: Elapsed = 00:00:00.0244530, Mnps: 8.07, nodes = 197281
5: Elapsed = 00:00:00.3723415, Mnps: 13.07, nodes = 4865609
6: Elapsed = 00:00:07.7679397, Mnps: 15.33, nodes = 119060324

This is my simple perft test. No tricks. Just brute force:
Code: Select all
        public ulong Execute(int depth)
        {
            if (depth == 0)
            {
                return 1ul;
            }

            ulong nodes = 0;

            board.GenerateMoves(moveList);
            Span<Move> moves = moveList.GetMoves(board.Ply);
            for (int i = 0; i < moves.Length; ++i)
            {
                if (board.MakeMove(moves[i]))
                {
                    nodes += Execute(depth - 1);
                    board.UnmakeMove();
                }
            }
            return nodes;
        }
However, after getting to this point and since I've also decided that my engine will use SMP to the best of my ability (probably Lazy SMP or some form of tree-splitting) I decided to see what that would do for my perft 6 performance. Since I will need to be able to support lockless transposition tables I also introduced them to the multi-threaded perft test and here is what I was able to achieve:

USING LOCKLESS TRANSPOSITION (XOR DATA & KEY) TABLES
Multi-threaded results (2 threads):
1: Elapsed = 00:00:00.0379713, Mnps: 0.00, Nodes = 20
2: Elapsed = 00:00:00.0092862, Mnps: 0.04, Nodes = 400
3: Elapsed = 00:00:00.0082186, Mnps: 1.08, Nodes = 8902
4: Elapsed = 00:00:00.0152597, Mnps: 12.93, Nodes = 197281
5: Elapsed = 00:00:00.1069914, Mnps: 45.48, Nodes = 4865609
6: Elapsed = 00:00:01.5681940, Mnps: 75.92, Nodes = 119060324

Multi-threaded results (4 threads):
1: Elapsed = 00:00:00.0082904, Mnps: 0.00, Nodes = 20
2: Elapsed = 00:00:00.0106227, Mnps: 0.04, Nodes = 400
3: Elapsed = 00:00:00.0085897, Mnps: 1.04, Nodes = 8902
4: Elapsed = 00:00:00.0125251, Mnps: 15.75, Nodes = 197281
5: Elapsed = 00:00:00.0592803, Mnps: 82.08, Nodes = 4865609
6: Elapsed = 00:00:00.7690445, Mnps: 154.82, Nodes = 119060324

Multi-threaded results (6 threads):
1: Elapsed = 00:00:00.0090133, Mnps: 0.00, Nodes = 20
2: Elapsed = 00:00:00.0087692, Mnps: 0.05, Nodes = 400
3: Elapsed = 00:00:00.0107237, Mnps: 0.83, Nodes = 8902
4: Elapsed = 00:00:00.0119909, Mnps: 16.45, Nodes = 197281
5: Elapsed = 00:00:00.0457344, Mnps: 106.39, Nodes = 4865609
6: Elapsed = 00:00:00.5560614, Mnps: 214.11, Nodes = 119060324

The transposition tables were the real winner here. I didn't preserve the timings for multi-threaded SMP and no transposition tables, but it was almost linear for perft 5 or 6. Of course, for perft 1-3, the overhead of setting up multiple threads is actually a detriment as the tables above show.

Now on to search (finally).

It would be nice to see another C# engine.
I too am stuck with about 14-15Mnps in perft. From posts of Leorik engine, twice the speed can be extracted. It is very likely how it is being designed by using structs and other advanced C# concepts for native performance.
Currently my knowledge of the language or the concepts is restricting me from getting there.

Hoping it is not too much to ask for you to share your code for comparison.

Mine can be found at https://github.com/frankcastle64/Hansel ... StaticImpl

JoAnnP38 · Post by **JoAnnP38** » Thu Nov 24, 2022 6:01 pm

okidoki wrote: ↑Thu Nov 24, 2022 5:14 pm It would be nice to see another C# engine.
I too am stuck with about 14-15Mnps in perft. From posts of Leorik engine, twice the speed can be extracted. It is very likely how it is being designed by using structs and other advanced C# concepts for native performance.
Currently my knowledge of the language or the concepts is restricting me from getting there.

Hoping it is not too much to ask for you to share your code for comparison.

Mine can be found at https://github.com/frankcastle64/Hansel ... StaticImpl

No problem but be forewarned. There are no comments, the code is in high flux, and only move generation/perft is implemented at the moment. Also keep in mind, that my intentions are for this engine to be used for experimentation. Therefore, I am still trying to use features of C# that allow me to easily read, update and debug the code. Performance, while important to the project (and my ego) has to be further down the list or priorities as I move on to my real task: self-leaning evaluation. Let me know if you have any issues accessing my repository. I'm not really a GIT pro although I've used it for many years and I've never had need to share a personal repository since before I retired, I was strictly a corporate developer and there were others that managed security.

https://github.com/JoAnnP38/Emerita

JacquesRW · Post by **JacquesRW** » Thu Nov 24, 2022 8:36 pm

I looked through your move generation code, and I think probably the easiest performance improvement would be to replace your sliding piece code with an alternative (branchless) method. Classical sliding piece attacks are very easy to implement, and I expect they would outperform the current code quite significantly.

This is the code I'm currently looking at btw, in your Board.cs file:

Code: Select all

#region Generate Bishop Moves

            bb1 = bbaPieces[sideToMove, Constants.PIECE_BISHOP];
            while (bb1 != 0)
            {
                from = ChessMath.LowestSetBitIndex(bb1);
                ChessMath.ResetLowestSetBit(ref bb1);
                for (int dir = Constants.DIR_NE; dir < Constants.MAX_DIRECTIONS; dir += 2)
                {
                    for (to = qrbMoves[from, dir]; to > -1; to = qrbMoves[to, dir])
                    {
                        if ((bbaMask[to] & bbAll) != 0)
                        {
                            if ((bbaMask[to] & bbaUnits[OpponentColor]) != 0)
                            {
                                moveList.Add(Constants.PIECE_BISHOP, from, to, MoveFlags.Capture,
                                    capture: board[to], score: captureScores[board[to], Constants.PIECE_BISHOP]);
                            }
                            break;
                        }
                        moveList.Add(Constants.PIECE_BISHOP, from, to, score: history[from, to]);
                    }
                }
            }

            #endregion

Mike Sherwin · Post by **Mike Sherwin** » Fri Nov 25, 2022 12:38 am

Some work has been done on classic bitboards lately to speed them up quite a bit.

Here is what they were in RomiChess from 2005.

Code: Select all

       case WQUEEN:
        h->moves[id] = ((ray7p[fs] & belowInc[FirstBit(ray7p[fs] & aPieces)])
                     |  (ray9p[fs] & belowInc[FirstBit(ray9p[fs] & aPieces)])
                     |  (ray7m[fs] & aboveInc[LastBit(ray7m[fs] & aPieces)])
                     |  (ray9m[fs] & aboveInc[LastBit(ray9m[fs] & aPieces)])
                     |  (ray8p[fs] & belowInc[FirstBit(ray8p[fs] & aPieces)])
                     |  (ray1p[fs] & belowInc[FirstBit(ray1p[fs] & aPieces)])
                     |  (ray8m[fs] & aboveInc[LastBit(ray8m[fs] & aPieces)])
                     |  (ray1m[fs] & aboveInc[LastBit(ray1m[fs] & aPieces)]));
        h->attacked |= h->moves[id];
        h->moves[id] &= notMe;
        continue;

In the FirstBit() and LastBit() functions is a hidden 'if' statement. Still it was quite snappy. In the more modern version there are no hidden 'if' statements and it uses faster intrinsics and less instructions.

Code: Select all

    case WQ:
      gbb[ply][fs]
        = (ray[std::countr_zero(ray[fs].NW & occ)].NW
         | ray[std::countr_zero(ray[fs].NE & occ)].NE
         | rev[std::countl_zero(ray[fs].SE & occ)].SE
         | rev[std::countl_zero(ray[fs].SW & occ)].SW
         | ray[std::countr_zero(ray[fs].NN & occ)].NN
         | ray[std::countr_zero(ray[fs].EE & occ)].EE
         | rev[std::countl_zero(ray[fs].SS & occ)].SS
         | rev[std::countl_zero(ray[fs].WW & occ)].WW) 
         ^ queenMoves[fs];
      abb[ply] |= gbb[ply][fs];
      gbb[ply][fs] &= notme;
      break;

Code: Select all

// Robert Hyatt's and Michael Sherwin's classical bitboard approach
// generate moves for the sliding pieces.

//Improvements by Daniel Infuehr 21.04.2022

#include <stdint.h>
#include <bit>
#include <array>

#	define dir_HO(X) (0xFFull << (X & 56))
#	define dir_VE(X) (0x0101010101010101ull << (X & 7))
#	define dir_D1(X) (mask_shift(0x8040201008040201ull, (X & 7) - (X >> 3)))
#	define dir_D2(X) (mask_shift(0x0102040810204080ull, 7 - (X & 7) - (X >> 3)))
#	define GetLower(X) ((1ull << X) - 1)
#	define GetUpper(X) (0xFFFFFFFFFFFFFFFE << (X))

namespace Chess_Lookup::BobAdvanced {

	struct Ray8 
	{
		uint64_t rayNW; uint64_t rayNN; uint64_t rayNE; uint64_t rayEE;
		uint64_t raySE; uint64_t raySS; uint64_t raySW; uint64_t rayWW;
	};

	static constexpr uint64_t mask_shift(uint64_t bb, int ranks)
	{
		if (ranks > 0) return bb >> (ranks << 3);
		else return bb << -(ranks << 3);
	}

	constexpr std::array<Ray8, 65> ray = []() {
		std::array<Ray8, 65> ray;
		for (int sq = 0; sq < 64; sq++)
		{
			ray[sq].rayNW = dir_D2(sq) & GetUpper(sq); // Northwest
			ray[sq].rayNN = dir_VE(sq) & GetUpper(sq); // North
			ray[sq].rayNE = dir_D1(sq) & GetUpper(sq); // Northeast
			ray[sq].rayEE = dir_HO(sq) & GetUpper(sq); // East
			ray[sq].raySE = dir_D1(sq) & GetLower(sq); // Southeast
			ray[sq].raySW = dir_D2(sq) & GetLower(sq); // Southwest
			ray[sq].raySS = dir_VE(sq) & GetLower(sq); // South
			ray[sq].rayWW = dir_HO(sq) & GetLower(sq); // West
		}
		ray[64] = ray[63];
		return ray;
	}();

	constexpr std::array<Ray8, 65> rev = []() {
		std::array<Ray8, 65> rev;
		for (int sq = 0; sq < 64; sq++)
		{
			rev[sq] = ray[63 - sq];
		}
		rev[64] = ray[0];
		return rev;
	}();

	constexpr std::array<uint64_t, 64> queens = []()
	{
		std::array<uint64_t, 64> q = {};
		for (int sq = 0; sq < 64; sq++)
		{
			q[sq] = (dir_HO(sq) | dir_VE(sq) | dir_D1(sq) | dir_D2(sq)) ^ (1ull << sq);
		}
		return q;
	}();

	constexpr auto Size = sizeof(ray);

	static constexpr uint64_t Queen(int sq, uint64_t occ) 
	{
		uint64_t bb = (
			  ray[std::countr_zero(ray[sq].rayNW & occ)].rayNW
			| ray[std::countr_zero(ray[sq].rayNN & occ)].rayNN
			| ray[std::countr_zero(ray[sq].rayNE & occ)].rayNE
			| ray[std::countr_zero(ray[sq].rayEE & occ)].rayEE
			| rev[std::countl_zero(ray[sq].raySE & occ)].raySE
			| rev[std::countl_zero(ray[sq].raySS & occ)].raySS
			| rev[std::countl_zero(ray[sq].raySW & occ)].raySW
			| rev[std::countl_zero(ray[sq].rayWW & occ)].rayWW)
			^ queens[sq];

		return bb; // Rook(sq, occ) | Bishop(sq, occ);
	}
#undef dir_HO
#undef dir_VE
#undef dir_D1
#undef dir_D2
#undef GetLower
#undef GetUpper
}

However, for best speed one should either use Black Magic or PEXT bitboards.

lithander · Post by **lithander** » Fri Nov 25, 2022 1:13 am

okidoki wrote: ↑Thu Nov 24, 2022 5:14 pm It would be nice to see another C# engine.
I too am stuck with about 14-15Mnps in perft. From posts of Leorik engine, twice the speed can be extracted. It is very likely how it is being designed by using structs and other advanced C# concepts for native performance.
Currently my knowledge of the language or the concepts is restricting me from getting there.

A while ago there was a thread called Why C++ instead of C#?

We tried to compare the speed of different programming languages in a typical chess programming workload. It contains Perft implementations in C and C# and Java that are based on the QBBEngine by Fabio Gobbato. Here's the repo: https://github.com/lithander/QBB-Perft

It's fast: Computes perft 6 on the startpos in less than 2 seconds with a pseudo-legal movegen and no tricks. (e.g. every move is played and checked for legality) In the thread you'll find information and source code on how to go waaay beyond! (~170M nps or if I remember correctly but that was with a now legal movegen and cutting edge intrinsics etc)

The discussion and seeing the side-by-side implementation of the same algorithms in three different languages might help if you really want to go down that rabbit hole of the squeezing C-like performance out of C#.

JoAnnP38 · Post by **JoAnnP38** » Fri Nov 25, 2022 7:33 pm

I just wanted to give a big thanks to JacquesRW for taking time to look over my code. I really wasn't fishing for an unprompted code review, but it was *much* appreciated! It forced me to take a hard look at my slider move generation logic. I also want to thank Mike Sherwin for posting the relevant attack generation code. I'll have to admit that I didn't understand it much at first, but after reviewing an excellent web posting by Rhys Rustad-Elliott the slider attack generation from classical bitboards started to make *much* more sense. And with that spark I started to understand the code that Mike posted and why the vector (or ray) arrays had one extra element (i.e. to account for LZCNT(0) or TZCNT(0)) and why there needed to be a reverse representation of the same array. It allowed me to concoct my own version of branchless slider attacks (example below).

Code: Select all

            
        public ulong GetBishopAttacksBranchless(int from)
        {
            Ray ray = maskVectors[from];
            ulong bb = ChessMath.AndNot(ray.NorthEast, maskVectors[ChessMath.TrailingZeroCount(ray.NorthEast & bbAll)].NorthEast) |
                       ChessMath.AndNot(ray.NorthWest, maskVectors[ChessMath.TrailingZeroCount(ray.NorthWest & bbAll)].NorthWest) |
                       ChessMath.AndNot(ray.SouthEast, revMaskVectors[ChessMath.LeadingZeroCount(ray.SouthEast & bbAll)].SouthEast) |
                       ChessMath.AndNot(ray.SouthWest, revMaskVectors[ChessMath.LeadingZeroCount(ray.SouthWest & bbAll)].SouthWest);
            return bb;
        }

Unfortunately, while this was a totally (*totally*!!) worthwhile learning endeavor, it didn't yield much in terms of extra Perft performance (about 8% speedup.) It's another situation where Amdahl's law indicates that somewhere in my code there are probably much larger issues at play in terms of performance. But -- progress is being made! Thanks again guys!!

JoAnnP38 · Post by **JoAnnP38** » Mon Dec 05, 2022 10:39 am

I was curious enough yesterday to update my slider move/attack algorithm to use magic bitboards instead of just classical, but branchless bitboard methods. I found any performance improvement to be very small (on the order of 1%-2%). I suppose it helps to future enable my engine to run under future CPUs with larger caches.

My Perft Results

Re: My Perft Results

Re: My Perft Results

Re: My Perft Results

Re: My Perft Results

Re: My Perft Results

Re: My Perft Results

Re: My Perft Results

Re: My Perft Results

Re: My Perft Results

Re: My Perft Results