Been thinking about schemes other than (half)king-pawn/(half)king-all. An even more massive over-parameterization which can bake in a greater appreciation for pawn structure, but also move away from the focus on the King, which I think is overrated. King placement is often of little consequence, I feel, or matters only in subtle ways that are not revealed by eval anyway.
Something like pawn-piece, where you take each pawn and its relation to each other piece (possibly including other pawns?). You are looking at roughly 8x the update cost for a given non-pawn move, and either 8x or roughly 16x for pawn moves, depending on whether the pawn-to-pawn mappings are excluded or included.
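For concreteness, here is a rough Python sketch of what such a pawn-piece input layout and its per-move update counts could look like. All names and the exact layout (48 pawn squares x 10 non-pawn piece codes x 64 squares) are my own illustrative assumptions, not a settled design:

```python
# Hypothetical pawn-piece feature layout: one input per
# (pawn square, non-pawn piece code, piece square) triple.
PAWN_SQUARES = 48   # pawns never occupy the first or last rank
PIECE_CODES = 10    # N, B, R, Q, K for each color; pawns handled separately
SQUARES = 64

NUM_FEATURES = PAWN_SQUARES * PIECE_CODES * SQUARES  # 30720 per perspective

def pawn_piece_index(pawn_sq48, piece_code, piece_sq):
    """Flat index of the relation between one pawn and one piece."""
    return (pawn_sq48 * PIECE_CODES + piece_code) * SQUARES + piece_sq

def nonpawn_move_updates(num_pawns):
    """A quiet non-pawn move changes that piece's relation to every pawn:
    one feature removed (from-square) and one added (to-square) per pawn."""
    return 2 * num_pawns

def pawn_move_updates(num_other_pawns, num_pieces, include_pawn_pawn):
    """A quiet pawn move changes the moved pawn's relation to every piece,
    and optionally to every other pawn as well."""
    changed = num_pieces + (num_other_pawns if include_pawn_pawn else 0)
    return 2 * changed
```

With all 16 pawns and 14 non-pawn pieces on the board, this gives 32 row updates for a quiet piece move and 28 or 58 for a quiet pawn move depending on whether pawn-pawn relations are kept, which is where the rough 8x/16x figures come from.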
Clearly, identifying factorizations becomes more complex, but there is also significantly more "value" in those factorizations, I feel. Instead of simply factoring out the PSQT and King-Piece distance maps, you are also adding the explicit relation between a pawn and a piece: things we've hand-coded, like having minors behind pawns, having pawns which match/don't match our bishop, and even a sort of directional mobility.
When I ran the numbers initially, I recall the incremental update being something like a 10% speedup overall, so, generously, a 25% speedup to the actual NNUE evaluation itself. The increased burden of the updates is thus somewhat eaten by our incremental updates.
Under this scheme you also never have the concept of "throwing out the accumulator and restarting", which you have now when the King moves: you toss everything and do (up to) 31 additions. Here, at most a move impacts two pawns, a pawn and a piece, or two pieces, so I feel you would never have to refresh. There is some time saved there.
I can hand-wave and say this is a better encoding, since it amounts to taking halfkp and adding this on top of it. By definition, more knowledge on the inputs should be better, or at least not worse, since halfkp + king-nonpawn is a subset of this proposed encoding.
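To make the never-refresh claim concrete, a minimal accumulator-update sketch (NumPy, with toy sizes and made-up feature ids; this illustrates the generic incremental add/subtract, not any engine's actual code):

```python
import numpy as np

HIDDEN = 16          # toy hidden width; real nets use 256 or more
NUM_FEATURES = 1024  # toy feature count

rng = np.random.default_rng(42)
W = rng.standard_normal((NUM_FEATURES, HIDDEN)).astype(np.float32)
bias = np.zeros(HIDDEN, dtype=np.float32)

def refresh(active):
    """Full recomputation from the active feature set (the costly path
    a king-relative scheme must take whenever the king moves)."""
    acc = bias.copy()
    for f in active:
        acc += W[f]
    return acc

def update(acc, removed, added):
    """Incremental update on a move: subtract rows for features that
    disappeared, add rows for features that appeared."""
    for f in removed:
        acc -= W[f]
    for f in added:
        acc += W[f]
    return acc

# sanity check: the incremental path matches a from-scratch refresh
active = {3, 17, 512}
acc = refresh(active)
acc = update(acc, removed=[17], added=[900, 901])
assert np.allclose(acc, refresh({3, 512, 900, 901}), atol=1e-5)
```

Since no feature in a pawn-piece layout depends on the king square, every move reduces to a bounded number of these row additions/subtractions, and `refresh` is only ever needed once at the root.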
Alternatives to King-Pawn, King-All NNUE encoding
-
- Posts: 1754
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Alternatives to King-Pawn, King-All NNUE encoding
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
-
- Posts: 27793
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Alternatives to King-Pawn, King-All NNUE encoding
King-Pawn is crucial to Shogi, which is a race to checkmate. It doesn't seem that useful for Chess, where Pawn structure often plays a dominant role. So it would seem useful to have Pawn-Pawn tables. Or perhaps even Pawn-Pawn-Pawn tables.
-
- Posts: 530
- Joined: Sun Sep 06, 2020 4:40 am
- Full name: Connor McMonigle
Re: Alternatives to King-Pawn, King-All NNUE encoding
Given a non-pawn move, won't count(pawn) updates to the feature transformer be required with your proposed input features? I think this could be prohibitively costly to compute.
Something I've considered and experimented with very briefly was to learn a one-hot encoding with a network ("index network") taking king+pawn features as input. The proposed learned indices would take the place of the king square index in the halfKA/halfKP inputs. The learned indices could be efficiently stored in a king+pawn cache, effectively eliminating the cost associated with computing the index network. The advantages of this approach should be clear: similar cost as compared to halfKA/halfKP features, more expressive input features, freedom to adjust the number of input features, etc.
The great difficulty with this approach originates from training the index network, which amounts to a "dynamic routing layer". Clearly, the index network should predict a distribution over all possible indices [0, N), where N is the chosen number of routes (yielding N*12*64 input features). To obtain a distribution over N, the softmax function can be used and, at inference, one can just use the greatest logit to find the index with the greatest probability. When I experimented with this approach, I used the gumbel-softmax distribution ("reparameterization trick") to backpropagate through sampling from the distribution predicted by the index network. The results were uninspiring and mode collapse occurred frequently, though perhaps factorization could combat the observed mode collapse (I wasn't using factorization back when I experimented with this). Some alternatives which I've not tried: use the REINFORCE algorithm to learn the indices (treating the index network as a policy network). Another approach would be to compute every route on each forward pass and train the index network to predict *softmin(loss_0, loss_1 ... loss_(N-1)), where loss_i is the loss associated with taking the ith route for the given sample, and train the primary network to minimize *softargmin(loss_0, loss_1 ... loss_(N-1)).
*softmin(x) = exp(-x) / sum(exp(-x))
*softargmin(x) = -ln(sum(exp(-x)))
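A small numeric illustration of those two definitions, plus a Gumbel-softmax sample of the kind described above (NumPy; the math follows the post's formulas, but the function names and the temperature parameter are my own choices):

```python
import numpy as np

def softmin(x):
    """softmin(x) = exp(-x) / sum(exp(-x)): a distribution that puts
    most mass on the smallest entries (the lowest losses)."""
    e = np.exp(-(x - x.min()))          # shift for numerical stability
    return e / e.sum()

def softargmin(x):
    """softargmin(x) = -ln(sum(exp(-x))): a smooth lower bound on min(x),
    so minimizing it pushes down the best route's loss."""
    return -np.log(np.sum(np.exp(-x)))

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling a one-hot index
    ("reparameterization trick"); tau -> 0 approaches a hard one-hot."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1)
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

losses = np.array([0.9, 0.2, 1.5])
p = softmin(losses)
assert np.isclose(p.sum(), 1.0)
assert p.argmax() == losses.argmin()     # most mass on the smallest loss
assert softargmin(losses) <= losses.min()
```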
-
- Posts: 1754
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Alternatives to King-Pawn, King-All NNUE encoding
I started doing the exhaustive math, looking at FENs to count average piece/pawn counts and the search-tree probability of certain moves. But short of posting all that,
I'll just handwave that under a pawn-piece setup, you never have to refresh, and you encode about 8x the information into the input as you do with HALFKP. So you should be able to use a smaller number of neurons to do the job (?).
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
-
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Alternatives to King-Pawn, King-All NNUE encoding
I did try king-pawn-pawn with some restrictions some time ago, and while it was a [small] visible improvement in what the net was able to learn, I can't see it being positive once the cost of the actual implementation is taken into account. What I tried was the following:
- (king_square, pawn_square, this_pawn_color, pawn_offset, other_pawn_color) over all the pawns on the board
- pawn_square is 48 values
- pawn_offset = (pawn_offset_file + 1) * 6 + pawn_offset_rank - 1, where pawn_offset_file is in -1..1 and pawn_offset_rank is in 0..5; also exclude if pawn_offset < 0; total of 3*6 - 1 = 17 offsets. (The board is seen from the perspective's side, i.e. for black's perspective positive values mean towards lower ranks)
- in total 64*48*2*17*2=208896 features
- worst case max active features is probably 128, but most of the time it should be <64
- requires virtual (pawn_square, pawn_color, pawn_offset, pawn_color) features
- not easy to update incrementally on pawn move but results can be cached
- 64x2, appended to the normal feature transformer output (to the respective halves; 1/8th of the normal FT output). ~30MB
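The counts in the bullet points above can be checked mechanically. A short Python sketch reproducing the offset formula and the 208896 feature total (the flat index layout is my own assumption; only the offset formula and the counts come from the description):

```python
def pawn_offset(file_off, rank_off):
    """pawn_offset = (file_off + 1) * 6 + rank_off - 1,
    with file_off in -1..1, rank_off in 0..5; offsets < 0 are excluded."""
    off = (file_off + 1) * 6 + rank_off - 1
    return off if off >= 0 else None

valid = [o for f in (-1, 0, 1) for r in range(6)
         if (o := pawn_offset(f, r)) is not None]
assert len(valid) == 17                  # 3*6 - 1 = 17 offsets

def feature_index(king_sq, pawn_sq48, this_color, offset, other_color):
    """One possible flat layout over the tuple
    (king_square, pawn_square, this_pawn_color, pawn_offset, other_pawn_color)."""
    return ((((king_sq * 48 + pawn_sq48) * 2 + this_color) * 17
             + offset) * 2 + other_color)

NUM_FEATURES = 64 * 48 * 2 * 17 * 2
assert NUM_FEATURES == 208896
assert feature_index(63, 47, 1, 16, 1) == NUM_FEATURES - 1
```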
implementation https://github.com/Sopel97/nnue-pytorch ... fb591e8775
loss values indicate that it could maybe gain 10-15 elo at 20k nodes per move. I'm not sure it would even be better than HalfKA_v2-640-16-32-1
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 1754
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Alternatives to King-Pawn, King-All NNUE encoding
Pawn-Pawn(-Pawn) would probably work quite well with the hand-crafted evaluations. The pawn hash has an extremely high hit rate, almost to the point where no eval function is too expensive. Unfortunately we've drifted away from that too much for me to try it at this point.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
-
- Posts: 105
- Joined: Thu Jun 18, 2020 3:21 pm
- Location: Moscow
- Full name: Alexander Litov
Re: Alternatives to King-Pawn, King-All NNUE encoding
AndrewGrant wrote: ↑Mon Sep 06, 2021 9:34 pm Pawn-Pawn(-Pawn) would probably work quite well with the hand-crafted evaluations. Pawn hash has an extremely high hit rate, almost to the point where no eval function is too expensive. Unfortunately we've drifted away from that too much for me to try it at this point.
That seems like an interesting idea. Maybe I'll try this in the future.
-
- Posts: 65
- Joined: Sun Aug 19, 2018 10:57 pm
- Full name: Nickolas Reynolds
Re: Alternatives to King-Pawn, King-All NNUE encoding
How over-parameterized are we talking? If a 1 in your input vector represented something like {My Square, My Piece Type, Square, Piece Type or Empty, Attacked By Me, My Color}, you could encode a lot more probably-relevant (?) info with only 64x6x64x7x2x2 = 688,128 bits!
There are also various HCE features (like SEE) where it's hard to imagine an NNUE architecture approximating them efficiently. I'm not sure how you could reasonably integrate SEE into the sparse input vector, but you could normalize it to fit within quantization bounds and just throw it in as extra input to a later layer.
Maybe SEE is a bad example, though. As I understand it, SEE's main purpose is to cut evaluation short, so maybe it's not useful in positions where you'd bother to use NNUE? But if there are other things that are sort of philosophically similar to SEE ... where it's some quasi-deep calculation already implemented optimally by humans, and not only is it likely the NNUE architecture can't encode the process very well, you wouldn't even want it to try ...
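For the "normalize SEE and feed it to a later layer" idea, a hedged NumPy sketch (the centipawn scale, the clamp range, and the layer shapes are all invented for illustration):

```python
import numpy as np

def normalized_see(see_cp, scale=256.0):
    """Clamp a centipawn SEE score into [-1, 1] so it can sit alongside
    quantized activations without overflowing their bounds."""
    return float(np.clip(see_cp / scale, -1.0, 1.0))

def layer_with_extra_input(hidden, see_feature, W, b):
    """Concatenate the scalar SEE feature onto an intermediate layer's
    input vector and apply a plain ReLU layer."""
    x = np.concatenate([hidden, [see_feature]])
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
hidden = rng.standard_normal(32).astype(np.float32)
W = rng.standard_normal((16, 33)).astype(np.float32)  # 32 hidden + 1 SEE
b = np.zeros(16, dtype=np.float32)

out = layer_with_extra_input(hidden, normalized_see(350), W, b)
assert out.shape == (16,)
```

The appeal is that the feature transformer and its incremental updates are untouched; only one dense layer grows by a single column.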