Been thinking about schemes other than (half)king-pawn/(half)king-all. An even more massive over-parameterization which can bake in a greater appreciation for pawn structure, but also move away from the focus on the King, which I think is overrated. King placement is often of little consequence, I feel, or matters only in subtle ways that are not revealed by eval anyway.
Something like pawn-piece, where you take each pawn and its relation to each other piece (possibly including other pawns?). You are looking at roughly 8x the update cost for a given non-pawn move, and either 8x or roughly 16x for pawn moves, depending on whether the pawn-to-pawn mappings are excluded or included.
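For concreteness, here is a rough Python sketch of what such a pawn-piece input layout and its per-move update counts could look like. All names and the exact layout (48 pawn squares x 10 non-pawn piece codes x 64 squares) are my own illustrative assumptions, not a settled design:

```python
# Hypothetical pawn-piece feature layout: one input per
# (pawn square, non-pawn piece code, piece square) triple.
PAWN_SQUARES = 48   # pawns never occupy the first or last rank
PIECE_CODES = 10    # N, B, R, Q, K for each color; pawns handled separately
SQUARES = 64

NUM_FEATURES = PAWN_SQUARES * PIECE_CODES * SQUARES  # 30720 per perspective

def pawn_piece_index(pawn_sq48, piece_code, piece_sq):
    """Flat index of the relation between one pawn and one piece."""
    return (pawn_sq48 * PIECE_CODES + piece_code) * SQUARES + piece_sq

def nonpawn_move_updates(num_pawns):
    """A quiet non-pawn move changes that piece's relation to every pawn:
    one feature removed (from-square) and one added (to-square) per pawn."""
    return 2 * num_pawns

def pawn_move_updates(num_other_pawns, num_pieces, include_pawn_pawn):
    """A quiet pawn move changes the moved pawn's relation to every piece,
    and optionally to every other pawn as well."""
    changed = num_pieces + (num_other_pawns if include_pawn_pawn else 0)
    return 2 * changed
```

With all 16 pawns and 14 non-pawn pieces on the board, this gives 32 row updates for a quiet piece move and 28 or 58 for a quiet pawn move depending on whether pawn-pawn relations are kept, which is where the rough 8x/16x figures come from.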
Clearly, identifying factorizations becomes more complex, but there is also significantly more "value" in those factorizations, I feel. Instead of simply factoring out the PSQT and King-Piece distance maps, you are also adding the explicit relation between a pawn and a piece: things we've hand-coded, like having minors behind pawns, having pawns which match/don't match our bishop, and even a sort of directional mobility.
When I ran the numbers initially, I recall the incremental update being something like a 10% speedup overall, so, generously, a 25% speedup to the actual NNUE evaluation itself. The increased burden of the updates is thus somewhat eaten by our incremental updates.
Under this scheme you also never have the concept of "throwing out the accumulator and restarting", which you have now when the King moves: you toss everything and do (up to) 31 additions. Here, at most a move impacts two pawns, a pawn and a piece, or two pieces, so I feel you would never have to refresh. There is some time saved there.
I can hand-wave and say this is a better encoding, since it amounts to taking halfkp and adding this on top of it. By definition, more knowledge on the inputs should be better, or at least not worse, since halfkp + king-nonpawn is a subset of this proposed encoding.
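To make the never-refresh claim concrete, a minimal accumulator-update sketch (NumPy, with toy sizes and made-up feature ids; this illustrates the generic incremental add/subtract, not any engine's actual code):

```python
import numpy as np

HIDDEN = 16          # toy hidden width; real nets use 256 or more
NUM_FEATURES = 1024  # toy feature count

rng = np.random.default_rng(42)
W = rng.standard_normal((NUM_FEATURES, HIDDEN)).astype(np.float32)
bias = np.zeros(HIDDEN, dtype=np.float32)

def refresh(active):
    """Full recomputation from the active feature set (the costly path
    a king-relative scheme must take whenever the king moves)."""
    acc = bias.copy()
    for f in active:
        acc += W[f]
    return acc

def update(acc, removed, added):
    """Incremental update on a move: subtract rows for features that
    disappeared, add rows for features that appeared."""
    for f in removed:
        acc -= W[f]
    for f in added:
        acc += W[f]
    return acc

# sanity check: the incremental path matches a from-scratch refresh
active = {3, 17, 512}
acc = refresh(active)
acc = update(acc, removed=[17], added=[900, 901])
assert np.allclose(acc, refresh({3, 512, 900, 901}), atol=1e-5)
```

Since no feature in a pawn-piece layout depends on the king square, every move reduces to a bounded number of these row additions/subtractions, and `refresh` is only ever needed once at the root.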
Alternatives to King-Pawn, King-All NNUE encoding
-
- Posts: 1754
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Alternatives to King-Pawn, King-All NNUE encoding
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
-
- Posts: 27793
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Alternatives to King-Pawn, King-All NNUE encoding
King-Pawn is crucial to Shogi, which is a race to checkmate. It doesn't seem that useful for Chess, where Pawn structure often plays a dominant role. So it would seem useful to have Pawn-Pawn tables. Or perhaps even Pawn-Pawn-Pawn tables.
-
- Posts: 530
- Joined: Sun Sep 06, 2020 4:40 am
- Full name: Connor McMonigle
Re: Alternatives to King-Pawn, King-All NNUE encoding
Given a non-pawn move, won't count(pawn) updates to the feature transformer be required with your proposed input features? I think this could be prohibitively costly to compute.
Something I've considered and experimented with very briefly was to learn a one-hot encoding with a network ("index network") taking king+pawn features as input. The proposed learned indices would take the place of the king square index in the halfKA/halfKP inputs. The learned indices could be efficiently stored in a king+pawn cache, effectively eliminating the cost associated with computing the index network. The advantages of this approach should be clear: similar cost as compared to halfKA/halfKP features, more expressive input features, freedom to adjust the number of input features, etc.
The great difficulty with this approach originates from training the index network, which amounts to a "dynamic routing layer". Clearly, the index network should predict a distribution over all possible indices [0, N), where N is the chosen number of routes (yielding N*12*64 input features). To obtain a distribution over N, the softmax function can be used and, at inference, one can just use the greatest logit to find the index with the greatest probability. When I experimented with this approach, I used the gumbel-softmax distribution ("reparameterization trick") to backpropagate through sampling from the distribution predicted by the index network. The results were uninspiring and mode collapse occurred frequently, though perhaps factorization could combat the observed mode collapse (I wasn't using factorization back when I experimented with this). Some alternatives which I've not tried: use the REINFORCE algorithm to learn the indices (treating the index network as a policy network). Another approach would be to compute every route on each forward pass and train the index network to predict *softmin(loss_0, loss_1 ... loss_(N-1)), where loss_i is the loss associated with taking the ith route for the given sample, and train the primary network to minimize *softargmin(loss_0, loss_1 ... loss_(N-1)).
*softmin(x) = exp(-x) / sum(exp(-x))
*softargmin(x) = -ln(sum(exp(-x)))
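A small numeric illustration of those two definitions, plus a Gumbel-softmax sample of the kind described above (NumPy; the math follows the post's formulas, but the function names and the temperature parameter are my own choices):

```python
import numpy as np

def softmin(x):
    """softmin(x) = exp(-x) / sum(exp(-x)): a distribution that puts
    most mass on the smallest entries (the lowest losses)."""
    e = np.exp(-(x - x.min()))          # shift for numerical stability
    return e / e.sum()

def softargmin(x):
    """softargmin(x) = -ln(sum(exp(-x))): a smooth lower bound on min(x),
    so minimizing it pushes down the best route's loss."""
    return -np.log(np.sum(np.exp(-x)))

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling a one-hot index
    ("reparameterization trick"); tau -> 0 approaches a hard one-hot."""
    rng = rng or np.random.default_rng()
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1)
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

losses = np.array([0.9, 0.2, 1.5])
p = softmin(losses)
assert np.isclose(p.sum(), 1.0)
assert p.argmax() == losses.argmin()     # most mass on the smallest loss
assert softargmin(losses) <= losses.min()
```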
-
- Posts: 1754
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Alternatives to King-Pawn, King-All NNUE encoding
I started doing the exhaustive math, looking at FENs to count average piece/pawn counts and the search-tree probability of certain moves. But short of posting all that,
I'll just handwave that under a pawn-piece setup, you never have to refresh, and you encode about 8x the information into the input as you do with HALFKP. So you should be able to use a smaller number of neurons to do the job (?).
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
-
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Alternatives to King-Pawn, King-All NNUE encoding
I did try king-pawn-pawn with some restrictions some time ago, and while it was a [small] visible improvement in what the net was able to learn, I can't see it being positive once the cost of the actual implementation is taken into account. What I tried was the following:
- (king_square, pawn_square, this_pawn_color, pawn_offset, other_pawn_color) over all the pawns on the board
- pawn_square is 48 values
- pawn_offset = (pawn_offset_file + 1) * 6 + pawn_offset_rank - 1, where pawn_offset_file is in -1..1 and pawn_offset_rank is in 0..5; also exclude if pawn_offset < 0; total of 3*6 - 1 = 17 offsets. (The board is seen from the perspective's side, i.e. for black's perspective positive values mean towards lower ranks)
- in total 64*48*2*17*2=208896 features
- worst case max active features is probably 128, but most of the time it should be <64
- requires virtual (pawn_square, pawn_color, pawn_offset, pawn_color) features
- not easy to update incrementally on pawn move but results can be cached
- 64x2, appended to the normal feature transformer output (to the respective halves; 1/8th of the normal FT output). ~30MB
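The counts in the bullet points above can be checked mechanically. A short Python sketch reproducing the offset formula and the 208896 feature total (the flat index layout is my own assumption; only the offset formula and the counts come from the description):

```python
def pawn_offset(file_off, rank_off):
    """pawn_offset = (file_off + 1) * 6 + rank_off - 1,
    with file_off in -1..1, rank_off in 0..5; offsets < 0 are excluded."""
    off = (file_off + 1) * 6 + rank_off - 1
    return off if off >= 0 else None

valid = [o for f in (-1, 0, 1) for r in range(6)
         if (o := pawn_offset(f, r)) is not None]
assert len(valid) == 17                  # 3*6 - 1 = 17 offsets

def feature_index(king_sq, pawn_sq48, this_color, offset, other_color):
    """One possible flat layout over the tuple
    (king_square, pawn_square, this_pawn_color, pawn_offset, other_pawn_color)."""
    return ((((king_sq * 48 + pawn_sq48) * 2 + this_color) * 17
             + offset) * 2 + other_color)

NUM_FEATURES = 64 * 48 * 2 * 17 * 2
assert NUM_FEATURES == 208896
assert feature_index(63, 47, 1, 16, 1) == NUM_FEATURES - 1
```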
implementation https://github.com/Sopel97/nnue-pytorch ... fb591e8775
loss values indicate that it could maybe gain 10-15 elo at 20k nodes per move. I'm not sure it would even be better than HalfKA_v2-640-16-32-1
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 1754
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Alternatives to King-Pawn, King-All NNUE encoding
Pawn-Pawn(-Pawn) would probably work quite well with the hand-crafted evaluations. The pawn hash has an extremely high hit rate, almost to the point where no eval function is too expensive. Unfortunately we've drifted away from that too much for me to try it at this point.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
-
- Posts: 105
- Joined: Thu Jun 18, 2020 3:21 pm
- Location: Moscow
- Full name: Alexander Litov
Re: Alternatives to King-Pawn, King-All NNUE encoding
AndrewGrant wrote: ↑Mon Sep 06, 2021 9:34 pm Pawn-Pawn(-Pawn) would probably work quite well with the hand-crafted evaluations. Pawn hash has an extremely high hit rate, almost to the point where no eval function is too expensive. Unfortunately we've drifted away from that too much for me to try it at this point.
That seems like an interesting idea. Maybe I'll try this in the future.
-
- Posts: 65
- Joined: Sun Aug 19, 2018 10:57 pm
- Full name: Nickolas Reynolds
Re: Alternatives to King-Pawn, King-All NNUE encoding
How over-parameterized are we talking? If a 1 in your input vector represented something like {My Square, My Piece Type, Square, Piece Type or Empty, Attacked By Me, My Color}, you could encode a lot more probably-relevant (?) info with only 64x6x64x7x2x2 = 688,128 bits!
There are also various HCE features (like SEE) where it's hard to imagine an NNUE architecture approximating them efficiently. I'm not sure how you could reasonably integrate SEE into the sparse input vector, but you could normalize it to fit within quantization bounds and just throw it in as extra input to a later layer.
Maybe SEE is a bad example, though. As I understand it, SEE's main purpose is to cut evaluation short, so maybe it's not useful in positions where you'd bother to use NNUE? But if there are other things that are sort of philosophically similar to SEE ... where it's some quasi-deep calculation already implemented optimally by humans, and not only is it likely the NNUE architecture can't encode the process very well, you wouldn't even want it to try ...
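For the "normalize SEE and feed it to a later layer" idea, a hedged NumPy sketch (the centipawn scale, the clamp range, and the layer shapes are all invented for illustration):

```python
import numpy as np

def normalized_see(see_cp, scale=256.0):
    """Clamp a centipawn SEE score into [-1, 1] so it can sit alongside
    quantized activations without overflowing their bounds."""
    return float(np.clip(see_cp / scale, -1.0, 1.0))

def layer_with_extra_input(hidden, see_feature, W, b):
    """Concatenate the scalar SEE feature onto an intermediate layer's
    input vector and apply a plain ReLU layer."""
    x = np.concatenate([hidden, [see_feature]])
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
hidden = rng.standard_normal(32).astype(np.float32)
W = rng.standard_normal((16, 33)).astype(np.float32)  # 32 hidden + 1 SEE
b = np.zeros(16, dtype=np.float32)

out = layer_with_extra_input(hidden, normalized_see(350), W, b)
assert out.shape == (16,)
```

The appeal is that the feature transformer and its incremental updates are untouched; only one dense layer grows by a single column.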