Board representation via AVX2, move gen via AVX2, evaluation via AVX2 - looking forward to the complete engine, keep it up.
--
Srdja
Faster than fancy magic - fits in L1
Moderator: Ras
-
- Posts: 3225
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
-
- Posts: 29
- Joined: Thu Sep 15, 2022 4:51 pm
- Full name: Elad Yosef Cohen
Re: Faster than fancy magic - fits in L1
But it does use the Galois field AVX-512 instruction even though it operates on AVX2 registers, right?
dangi12012 wrote: ↑Sat Apr 01, 2023 1:26 am
So now it's truly coming together - a full running implementation of SIMD __m256i (which is 4x uint64_t).
I have a full move implementation that goes from a native SIMD board via shuffle directly into the new move generator - which spits out the target bits again in 4x uint64_t SIMD-wise manner.
These bitboards get enumerated like you would normally do it - but 4 bitboards at a time.
For example, all sliders - independent of bishop, rook, queen - can be consolidated into this loop:
Pins and check evasion are solved by respective masks - pin_valid and check_evasion, like in the Gigantua movegen. Now these bitboards can be shuffled and masked in a way that the output of the respective least significant bits feeds back into the position directly - to change the bits in the way that corresponds to a move.
Code: Select all
u64x4 sliders = brd.rrbb_vec() & pin_valid;
Bitloop_4x64(sliders) {
    u64x4 source_squares = sliders.get_lsb();
    auto mv = MV::Quad_Sliders(sliders.find_first_set(), occ) & check_evasion;
    Bitloop_4x64(mv) {
        apply_move_sliders<core>(brd, source_squares, mv.get_lsb(), mv.is_not_zero_mask());
    }
}
Most helpful was this resource: https://www.intel.com/content/www/us/en ... hs=AVX_ALL
Now this V2 movegen does not use templates anymore - so compilation is fast. It does not use PEXT or fancy magic but the optimized Galois field movegen, which also works very well on AVX2.
Todos: Hash for the leaf nodes
This is still with AVX2 - there are many functions that can be made much faster with AVX-512, but I have hit a wall for now.
A true novelty: with AVX2 it's quite cheap to invert all bits in 4x uint64_t - so my pawns don't walk up and down the board (<< 8 / >> 8); instead, the color to move is always on the left of the board, so a pawn marching forward is always pawn >> 1. This has the huge advantage of not needing "if white", lookup[color], or separate code paths, and of not needing a mask on the a and h files for the pawns. Many constants also consolidate away, etc. Truly color-agnostic chess programming helps a lot in terms of performance.
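The flip behind this can be sketched in scalar form (a hedged sketch - the real implementation works on 4x uint64_t with AVX2; reverse_bits here is just the classic SWAR bit reversal): a full 64-bit reversal maps square i to 63 - i, so reversing the board after every move keeps the side to move in the same orientation and yields one pawn-push code path for both colors.

```cpp
#include <cstdint>

// Classic SWAR 64-bit bit reversal: square i maps to 63 - i. The AVX2
// version does this for 4 bitboards at once; this scalar sketch only
// illustrates the color-agnostic orientation idea.
static uint64_t reverse_bits(uint64_t b) {
    b = ((b >> 1)  & 0x5555555555555555ULL) | ((b & 0x5555555555555555ULL) << 1);
    b = ((b >> 2)  & 0x3333333333333333ULL) | ((b & 0x3333333333333333ULL) << 2);
    b = ((b >> 4)  & 0x0F0F0F0F0F0F0F0FULL) | ((b & 0x0F0F0F0F0F0F0F0FULL) << 4);
    // swap bytes, 16-bit halves, then 32-bit halves to finish the reversal
    b = ((b >> 8)  & 0x00FF00FF00FF00FFULL) | ((b & 0x00FF00FF00FF00FFULL) << 8);
    b = ((b >> 16) & 0x0000FFFF0000FFFFULL) | ((b & 0x0000FFFF0000FFFFULL) << 16);
    return (b >> 32) | (b << 32);
}
```

Applying the reversal twice is the identity, which is what makes "flip on every move" cheap to reason about.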
-
- Posts: 915
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Faster than fancy magic - fits in L1
Okay, this is cool on so many different fronts. That it works well on AVX2 hardware makes a huge difference, imo. And color-agnostic code will pay dividends in a real engine in multiple ways. You are making a proper engine this time, right? Because you really should!
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Faster than fancy magic - fits in L1
Yes, I think I will continue this momentum towards an engine - and the best part is that the juicy instructions that replace whole blocks of code with 0.5 cycles are still missing, since I don't have my hands on a 7800X3D yet. All completely vanilla AVX2 atm.
And still with vanilla AVX2, the technology S-curve - already surpassing the old Gigantua - has not flattened. Making one day's progress on yesterday -
no templates, no TT, no bulk counting, single thread
Code: Select all
Perft aggregate: 18999768562 25686ms 739.667 MNodes/s
Perft aggregate: 18999768562 24049ms 790.016 MNodes/s
For the future:
Since the movegen algorithm fits in L1 completely, I am looking forward to what parallelisation will bring (only after a robust alpha-beta or MCTS implementation).
Core scaling with L1 algorithms can be much more linear compared to L2 or L3 algorithms like PEXT, but it remains to be seen how that plays with the TT and so on.
And of course SIMD eval with a non-square-centric board will be interesting, but there are promising ideas for how to extract squares and piece sparse-update information for NNUE efficiently with our old friend _mm256_gf2p8affine_epi64_epi8.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: Faster than fancy magic - fits in L1
When Harald Lussen helped me bring out what he coined Sherwin Bitboards (actually he did all the work), they were a little faster than Rotated Bitboards. But I got to take no victory lap because Magic Bitboards came out first. That was the beginning of my quest to outperform Magic. I knew that the Split Index method would save memory and therefore give me a theoretical chance to do so. SISSY was an improvement on Sherwin (wait a minute, that does not sound so good... oh well). All along I knew that if I could get SISSY down to two lookups rather than 8, then SISSY would outperform Magic, even Black Magic. Martin Sedlak tried to help, but I do not think he fully understood what needed to be done. It was not till I read about Kindergarten Bitboards a few dozen times that I realized 'Kindergarten Math' was exactly what I needed to efficiently reduce SISSY to just two lookups. And voilà - victory over Magic at last, after ~16 years. But upon ascending the throne the King dies. Long live the King. Congrats Daniel. You are the new King!
The only thing that you may wish to check is the power consumption and heat production of running max threads with a heavy avx2 workload. It may cause problems for some users if their cpu cooling is not sufficient to handle it.

-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Faster than fancy magic - fits in L1
Update on that topic - it seems that a color-agnostic board structure is really paying off when evaluating with NNUE.
There, too, the idea was implemented that it does not make sense to differentiate between black and white, but rather between "self" and "enemy".
What that means in practice is that over on SF the code looks like this:
But OrientTBL is only for rotating bits (incidentally, the same orientation my board structure has by default), making this whole effort 0 ops; the two remaining tables can be consolidated as well, and perspective no longer needs to be a template parameter when using separate tables.
And other parts become more readable:
Code: Select all
template<Color Perspective>
inline IndexType make_index(int s, int pc, int ksq) const {
    return IndexType((s ^ OrientTBL[Perspective][ksq])
                   + PieceSquareIndex[Perspective][pc]
                   + KingBuckets[Perspective][ksq]);
}
Code: Select all
inline IndexType make_index(int s, int pc, int ksq) const {
    return IndexType(s + OwnSquareIndex[pc] + OwnKingOffset[ksq]);
}

// Equals
inline int make_index_self(int s, int kpc) const {
    return s + square_info[kpc]; // k_sq * 16 + piece
}
Code: Select all
// Stockfish (two perspectives):
const auto psqt = (psqtAccumulation[perspectives[0]][bucket] - psqtAccumulation[perspectives[1]][bucket]) / 2;
// Color-agnostic (own/enemy):
const int psqt = (own_psqt[bucket] - opp_psqt[bucket]) >> 1;
So all the adjustments of "I am white but it's black's turn" - which table do I use now, and how do I rotate (which is two levels of indirection) - are not even needed anymore.
Why this is really important:
The game of chess itself demands of us which structures to use. It's not a free choice at all - when the developer's choice is good, suddenly all the edge conditions go away. (It's the same in physics, really - with more clarity, complicated physical formulas consolidate into a few pretty lines and cannot be reduced further.)
I felt the same way with SIMD and my current board structure - all edge conditions of pins, checks, ep etc. are solved without any separate code paths. Bishops, rooks and queens are not handled separately; when your own pawns always walk via pawn >> 1 instead of if (white) pawn << 8 else pawn >> 8, you don't need to mask the edges, and without the conditionals suddenly all of chess fits in SIMD (AVX or NEON) registers, with intrinsic functions available to work on them.
Also, looking up 4 pieces in one instruction is cheap! You just have to make knight[64] and king an array of [65], because countr_zero will return 64 for an empty board, and that can look up a 0 without any interference - and use the appropriate gather instructions.
Once I have my hands on Zen 4, it's time for enabling AVX-512 code paths and building a whole engine. The updated SIMD Galois field move generator will be part of that release.
I fully expect to hit above 1 billion nodes/s for raw move generation on a single thread, with no TT and no bulk-counting tricks etc., on Zen 4.
As for the incremental eval, it remains to be seen how fast the "accumulator" and the part behind it can really be done. Maybe I just iteratively deepen and have to bulk-eval all leaf nodes on the GPU after all.
Interesting times ahead - and some already completed.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer