Alternatives to King-Pawn, King-All NNUE encoding


AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Alternatives to King-Pawn, King-All NNUE encoding

Post by AndrewGrant »

I've been thinking about schemes other than (half)king-pawn/(half)king-all: an even more massive over-parameterization which can bake in a greater appreciation for pawn structure, but also move away from the focus on the King, which I think is overrated. King placement is often of little consequence, I feel, or matters only in subtle ways that are not revealed by the eval anyway.

Something like pawn-piece, where you take each pawn and its relation to every other piece (possibly including other pawns?). You're looking at about 8x the update cost for a given non-pawn move, and either 8x or roughly 16x for pawn moves, depending on whether other pawn mappings are excluded or included.
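A minimal sketch of what a pawn-piece feature index and its per-move update count could look like under this idea. Everything here is an assumption for illustration (the index layout, the piece-kind encoding, the 48-square pawn range), not a reference implementation.

Code:

# Hedged sketch of a pawn-piece feature index (assumed layout).
# A feature is (pawn_square, piece_kind_with_color, piece_square):
# 48 pawn squares x 12 piece kinds x 64 squares = 36,864 indices per perspective.

N_PAWN_SQ = 48            # pawns only occupy ranks 2-7
N_PIECE_KINDS = 12        # 6 piece types x 2 colors
N_SQ = 64

def pawn_piece_index(pawn_sq48, piece_kind, piece_sq):
    # pawn_sq48 is the pawn square remapped to 0..47 (square - 8); piece_kind in 0..11
    return (pawn_sq48 * N_PIECE_KINDS + piece_kind) * N_SQ + piece_sq

def updates_for_piece_move(pawn_squares48, piece_kind, from_sq, to_sq):
    # A non-pawn move touches one feature per pawn on the board, so with ~8-16
    # pawns that is ~8-16 sub/add pairs, versus a single pair in plain HalfKP --
    # roughly the ~8x estimate above.
    removed = [pawn_piece_index(p, piece_kind, from_sq) for p in pawn_squares48]
    added = [pawn_piece_index(p, piece_kind, to_sq) for p in pawn_squares48]
    return removed, added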

Clearly, identifying factorizations becomes more complex, but I feel there is also significantly more "value" in those factorizations. Instead of simply factoring out the PSQT and King-Piece distance maps, you are also adding the explicit relation between a pawn and a piece: things we've hand-coded, like having minors behind pawns, having pawns which match/don't match our bishop, and even a sort of directional mobility.

When I ran the numbers initially, I recall the "incremental" update being something like a 10% speedup overall -- so, generously, a 25% speedup to the actual NNUE evaluation itself. The increased burden of the updates is therefore somewhat absorbed by our incremental updates.

Under this scheme you also never have the concept of "throwing out the accumulator and restarting", which you have now when the King is moved and you toss everything and redo (up to) 31 additions. Here, at most a move impacts either 2 pawns, a pawn and a piece, or two pieces, so I feel like you would never have to "refresh". There is some time saved there.
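A minimal sketch of the refresh-free incremental update, reusing the same assumed pawn-piece layout as above; the weight shapes and names are illustrative only.

Code:

import numpy as np

HIDDEN = 256
N_FEATURES = 48 * 12 * 64                                    # assumed pawn-piece layout
FT_WEIGHTS = np.zeros((N_FEATURES, HIDDEN), dtype=np.int16)  # illustrative feature-transformer weights

def pawn_piece_index(pawn_sq48, piece_kind, piece_sq):
    return (pawn_sq48 * 12 + piece_kind) * 64 + piece_sq

def apply_piece_move(acc, pawn_squares48, piece_kind, from_sq, to_sq):
    """Incrementally update the accumulator for a non-pawn move: one sub/add
    pair per pawn on the board, never a full king-triggered rebuild."""
    for p in pawn_squares48:
        acc -= FT_WEIGHTS[pawn_piece_index(p, piece_kind, from_sq)]
        acc += FT_WEIGHTS[pawn_piece_index(p, piece_kind, to_sq)]
    return acc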

I can hand-wave and say this is a better encoding -- just by taking HalfKP and adding this on top of it. By definition, more knowledge on the inputs should be better, or at least not worse, since HalfKP + king-nonpawn is a subset of this proposed encoding.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
hgm
Posts: 27793
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Alternatives to King-Pawn, King-All NNUE encoding

Post by hgm »

King-Pawn is crucial to Shogi, which is a race to checkmate. It doesn't seem that useful for Chess, where Pawn structure often plays a dominant role. So it would seem useful to have Pawn-Pawn tables. Or perhaps even Pawn-Pawn-Pawn tables.
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Alternatives to King-Pawn, King-All NNUE encoding

Post by connor_mcmonigle »

Given a non-pawn move, won't count(pawn) updates to the feature transformer be required with your proposed input features? I think this could be prohibitively costly to compute.

Something I've considered and experimented with very briefly was learning a one-hot encoding with a network (an "index network") taking king+pawn features as input. The proposed learned indices would take the place of the king-square index in the halfKA/halfKP inputs. The learned indices could be stored efficiently in a king+pawn cache, effectively eliminating the cost associated with computing the index network. The advantages of this approach should be clear: similar cost compared to halfKA/halfKP features, more expressive input features, freedom to adjust the number of input features, etc.
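A minimal sketch of how such a learned index could replace the king-square index at inference, with a cache keyed by a king+pawn hash. The index-network shape, the king+pawn encoding, and the cache layout are all assumptions for illustration.

Code:

import torch

N_ROUTES = 64                              # chosen number of learned "buckets" (stands in for the 64 king squares)

# Hypothetical index network: maps a king+pawn feature vector to route logits.
index_net = torch.nn.Sequential(
    torch.nn.Linear(2 * 64 + 2 * 48, 32),  # both kings one-hot + both pawn occupancies (assumed encoding)
    torch.nn.ReLU(),
    torch.nn.Linear(32, N_ROUTES),
)

index_cache = {}                           # king+pawn key -> learned index

def learned_index(kp_key, kp_features):
    """At inference: take the greatest logit, cached on the king+pawn structure."""
    if kp_key not in index_cache:
        with torch.no_grad():
            index_cache[kp_key] = int(index_net(kp_features).argmax())
    return index_cache[kp_key]

def feature_index(route, piece_kind, square):
    # route replaces the king-square index of halfKA-style features: N_ROUTES*12*64 inputs
    return (route * 12 + piece_kind) * 64 + square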

The great difficulty with this approach originates from training the index network, which amounts to a "dynamic routing layer". Clearly, the index network should predict a distribution over all possible indices [0, N), where N is the chosen number of routes (yielding N*12*64 input features). To obtain a distribution over N, the softmax function can be used and, at inference, one can just take the greatest logit to find the index with the greatest probability. When I experimented with this approach, I used the Gumbel-softmax distribution (the "reparameterization trick") to backpropagate through sampling from the distribution predicted by the index network. The results were uninspiring and mode collapse occurred frequently, though perhaps factorization could combat the observed mode collapse (I wasn't using factorization back when I experimented with this).

Some alternatives which I've not tried: use the REINFORCE algorithm to learn the indices (treating the index network as a policy network). Another approach would be to compute every route on each forward pass, train the index network to predict *softmin(loss_0, loss_1, ..., loss_(N-1)), where loss_i is the loss associated with taking the i-th route for the given sample, and train the primary network to minimize *softargmin(loss_0, loss_1, ..., loss_(N-1)).

*softmin(x) = exp(-x) / sum(exp(-x))
*softargmin(x) = -ln(sum(exp(-x)))
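A rough PyTorch sketch of the two training ideas above: straight-through Gumbel-softmax sampling of the route, and the softmin/softargmin objectives computed from per-route losses. This is just one possible wiring of those formulas, not what was actually run.

Code:

import torch
import torch.nn.functional as F

def gumbel_route(route_logits, tau=1.0):
    """Differentiable (straight-through) sample of a one-hot route during training."""
    return F.gumbel_softmax(route_logits, tau=tau, hard=True)

def routing_losses(route_logits, per_route_losses):
    """per_route_losses: tensor of shape (N,) -- loss_i from evaluating route i.

    Index-network target: softmin over the losses (lower loss -> higher probability).
    Primary-network objective: softargmin, a smooth lower bound on min(loss_i)."""
    target_dist = F.softmax(-per_route_losses.detach(), dim=-1)              # softmin(x)
    index_loss = -(target_dist * F.log_softmax(route_logits, dim=-1)).sum(dim=-1).mean()
    primary_loss = -torch.logsumexp(-per_route_losses, dim=-1)               # softargmin(x)
    return index_loss, primary_loss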
AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Alternatives to King-Pawn, King-All NNUE encoding

Post by AndrewGrant »

I started doing the exhaustive math and looking at FENs to count average piece/pawn counts and the search-tree probability of certain moves. But short of posting all that,

I'll just hand-wave and say that under a Pawn-Piece setup you never have to refresh, and you encode about 8x the information into the input as you do with HalfKP. So you should be able to use a smaller number of neurons to do the job (???).
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Alternatives to King-Pawn, King-All NNUE encoding

Post by Sopel »

hgm wrote: Sun Sep 05, 2021 2:41 pm King-Pawn is crucial to Shogi, which is a race to checkmate. It doesn't seem that useful for Chess, where Pawn structure often plays a dominant role. So it would seem useful to have Pawn-Pawn tables. Or perhaps even Pawn-Pawn-Pawn tables.
I did try king-pawn-pawn with some restrictions some time ago, and while it was a [small] visible improvement in what the net was able to learn, I can't see it being positive once the cost of the actual implementation is taken into account. What I tried was the following (a rough sketch of the index computation follows the list):
  • (king_square, pawn_square, this_pawn_color, pawn_offset, other_pawn_color) over all the pawns on the board
  • pawn_square is 48 values
  • pawn_offset = (pawn_offset_file + 1) * 6 + pawn_offset_rank - 1, where pawn_offset_file is in -1..1 and pawn_offset_rank is in 0..5; combinations with pawn_offset < 0 are excluded, for a total of 3*6-1 = 17 offsets. (The board is seen as if from the perspective's side, i.e. for black's perspective positive values mean towards lower ranks.)
  • in total 64*48*2*17*2=208896 features
  • worst case max active features is probably 128, but most of the time it should be <64
  • requires virtual (pawn_square, this_pawn_color, pawn_offset, other_pawn_color) features
  • not easy to update incrementally on a pawn move, but results can be cached
  • 64x2, appended to the normal feature transformer output (to the respective halves; 1/8th of the normal FT output). ~30MB
this is on top of HalfKA_v2-512x2-16-32-1
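A minimal sketch of how the feature index described above might be computed. This is my reading of the listed layout; the term ordering, mirroring, and helper names are assumptions, not taken from the linked implementation.

Code:

WHITE, BLACK = 0, 1

def pawn_offset(this_sq, other_sq, perspective):
    """Offset of the other pawn relative to this pawn, seen from `perspective`'s side.
    Returns None for combinations that are excluded (pawn_offset < 0 or out of window)."""
    if perspective == BLACK:                      # mirror so positive ranks point "forward"
        this_sq, other_sq = this_sq ^ 56, other_sq ^ 56
    d_file = (other_sq % 8) - (this_sq % 8)
    d_rank = (other_sq // 8) - (this_sq // 8)
    if abs(d_file) > 1 or not (0 <= d_rank <= 5):
        return None                               # outside the -1..1 file / 0..5 rank window
    offset = (d_file + 1) * 6 + d_rank - 1
    return offset if offset >= 0 else None        # 17 valid offsets

def kpp_index(king_sq, pawn_sq, this_color, offset, other_color, perspective):
    """(king_square, pawn_square, this_pawn_color, pawn_offset, other_pawn_color)
    -> flat index in [0, 64*48*2*17*2) = [0, 208896)."""
    if perspective == BLACK:
        king_sq, pawn_sq = king_sq ^ 56, pawn_sq ^ 56
    p48 = pawn_sq - 8                             # pawns live on ranks 2..7 -> 0..47
    return (((king_sq * 48 + p48) * 2 + this_color) * 17 + offset) * 2 + other_color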


implementation https://github.com/Sopel97/nnue-pytorch ... fb591e8775

loss values indicate that it could maybe gain 10-15 elo at 20k nodes per move. I'm not sure it would even be better than HalfKA_v2-640-16-32-1
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
AndrewGrant
Posts: 1754
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Alternatives to King-Pawn, King-All NNUE encoding

Post by AndrewGrant »

hgm wrote: Sun Sep 05, 2021 2:41 pm King-Pawn is crucial to Shogi, which is a race to checkmate. It doesn't seem that useful for Chess, where Pawn structure often plays a dominant role. So it would seem useful to have Pawn-Pawn tables. Or perhaps even Pawn-Pawn-Pawn tables.
Pawn-Pawn(-Pawn) would probably work quite well with hand-crafted evaluations. The pawn hash has an extremely high hit rate, almost to the point where no eval function is too expensive. Unfortunately we've drifted away from that too much for me to try it at this point.
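A minimal sketch of the caching idea: because the pawn key changes rarely, even an expensive pawn-structure term can be memoized behind it. The table layout is illustrative, and `pos.pawn_key` / `expensive_pawn_eval` are hypothetical stand-ins.

Code:

PAWN_HASH_SIZE = 1 << 16                       # entries; power of two for cheap indexing

pawn_table = [None] * PAWN_HASH_SIZE           # each slot: (pawn_key, score)

def pawn_structure_score(pos):
    """Memoized pawn(-pawn) evaluation keyed by the position's pawn hash."""
    key = pos.pawn_key                         # hypothetical Zobrist key over pawns only
    slot = key & (PAWN_HASH_SIZE - 1)
    entry = pawn_table[slot]
    if entry is not None and entry[0] == key:  # hit rate is typically very high
        return entry[1]
    score = expensive_pawn_eval(pos)           # hypothetical: Pawn-Pawn(-Pawn) terms go here
    pawn_table[slot] = (key, score)
    return score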
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
No4b
Posts: 105
Joined: Thu Jun 18, 2020 3:21 pm
Location: Moscow
Full name: Alexander Litov

Re: Alternatives to King-Pawn, King-All NNUE encoding

Post by No4b »

AndrewGrant wrote: Mon Sep 06, 2021 9:34 pm
hgm wrote: Sun Sep 05, 2021 2:41 pm King-Pawn is crucial to Shogi, which is a race to checkmate. It doesn't seem that useful for Chess, where Pawn structure often plays a dominant role. So it would seem useful to have Pawn-Pawn tables. Or perhaps even Pawn-Pawn-Pawn tables.
Pawn-Pawn(-Pawn) would probably work quite well with the hand-crafted evaluations. Pawn hash has an extremely high hitrate, almost to the point where no eval function is too expensive. Unfortunately we've drifted away from that too much for me to try it at this point.
That seems like an interesting idea. Maybe I'll try this in the future.
DrCliche
Posts: 65
Joined: Sun Aug 19, 2018 10:57 pm
Full name: Nickolas Reynolds

Re: Alternatives to King-Pawn, King-All NNUE encoding

Post by DrCliche »

How over-parameterized are we talking? If a 1 in your input vector represented something like {My Square, My Piece Type, Square, Piece Type or Empty, Attacked By Me, My Color}, you could encode a lot more probably (?) relevant info with only 64x6x64x7x2x2 = 688,128 bits!
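A rough sketch of the index arithmetic behind that count; the field order in the tuple is an assumption made for illustration.

Code:

# {my_square, my_piece_type, square, piece_type_or_empty, attacked_by_me, my_color}
#  64        * 6            * 64    * 7                  * 2             * 2  = 688,128
def feature_bit(my_sq, my_piece, sq, occupant, attacked_by_me, my_color):
    """occupant in 0..6 (6 piece types + empty); the two flags are 0/1."""
    idx = my_sq
    idx = idx * 6 + my_piece
    idx = idx * 64 + sq
    idx = idx * 7 + occupant
    idx = idx * 2 + attacked_by_me
    idx = idx * 2 + my_color
    return idx                                 # in [0, 688128)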

There are also various HCE features (like SEE) where it's hard to imagine an NNUE architecture approximating them efficiently. I'm not sure how you could reasonably integrate SEE into the sparse input vector, but you could normalize it to fit within the quantization bounds and just feed it in as an extra input to a later layer.
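A minimal PyTorch sketch of feeding one pre-normalized scalar (e.g. a SEE-derived value) into a later dense layer alongside the accumulator output. The layer sizes, the /1000 scaling, and the clipped activation are arbitrary choices for illustration.

Code:

import torch

class TinyHead(torch.nn.Module):
    """Dense head that takes the accumulator output plus one extra scalar input."""
    def __init__(self, acc_size=512):
        super().__init__()
        self.fc1 = torch.nn.Linear(acc_size + 1, 32)    # +1 slot for the scalar
        self.fc2 = torch.nn.Linear(32, 1)

    def forward(self, acc_out, see_score):
        # see_score: raw centipawn-ish value, squashed into [-1, 1] to respect quantization bounds
        see_norm = torch.clamp(see_score / 1000.0, -1.0, 1.0).unsqueeze(-1)
        x = torch.cat([acc_out, see_norm], dim=-1)
        x = torch.clamp(self.fc1(x), 0.0, 1.0)           # clipped-ReLU style activation
        return self.fc2(x)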

Maybe SEE is a bad example, though. As I understand it, SEE's main purpose is to cut evaluation short, so maybe it's not useful in positions where you'd bother to use NNUE? But if there are other things that are sort of philosophically similar to SEE ... where it's some quasi-deep calculation already implemented optimally by humans, and not only is it likely the NNUE architecture can't encode the process very well, you wouldn't even want it to try ...