How to make a double-sized net as good as SF NNUE in a few easy steps


chrisw
Posts: 4319
Joined: Tue Apr 03, 2012 4:28 pm

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by chrisw »

hgm wrote: Sun Feb 28, 2021 9:06 pm But with the usual method of learning (= weight adjusting), what is the same will stay the same. The old and the new half would never diverge from each other.
One third, two thirds, then? (i.e. scale one copy of the weights by 1/3 and the other by 2/3, so the two halves start out different.)
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by dkappe »

hgm wrote: Sun Feb 28, 2021 9:06 pm But with the usual method of learning (= weight adjusting), what is the same will stay the same. The old and the new half would never diverge from each other.
I hope Chris gives it a shot. Would love to see the results of his 512x2x16x16 net.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
connor_mcmonigle
Posts: 533
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

An interesting idea for sure. I'd be very curious to see the results. It might also be interesting to try starting with a much smaller network and then iteratively growing the first layer's size. Also, zero initialization would work totally fine, as the gradient of the loss function w.r.t. the weight matrix of a given affine transform layer is the outer product of the input to that layer and the gradient of the loss function w.r.t. the output of that layer. Therefore, weights == zero does not imply gradient == zero.
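
To make that concrete, here is a minimal numpy sketch (a toy affine layer with made-up sizes, not actual NNUE trainer code) of the outer-product form of the weight gradient:

import numpy as np

rng = np.random.default_rng(0)

# Toy affine layer y = W x + b with a zero-initialized weight matrix.
x = rng.normal(size=4)        # nonzero input to the layer
W = np.zeros((3, 4))          # weights == zero
b = np.zeros(3)
y = W @ x + b                 # output is zero...

# ...but the gradient w.r.t. W is the outer product of the upstream
# gradient dL/dy (nonzero whenever later layers pass something back)
# with the input x.
grad_y = rng.normal(size=3)   # stand-in for dL/dy from backprop
grad_W = np.outer(grad_y, x)  # dL/dW = dL/dy x^T
print(np.any(grad_W != 0))    # True: weights == zero does not force gradient == zero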

There's no basis to the claim that Albert used this approach. As Dietrich points out, the remaining layers are 16->16 in the FF2 network. However, these weights are super easy to train relative to the first "feature transformer" layer. The first layer is the most relevant to the strength of the network and very difficult to obtain good weights for. This means Albert could have used the suggested approach and saved a lot of work, but I'm quite confident he did not, as visual inspection of the first layer of FF2 reveals that it is quite different (it appears Albert didn't use factorization).
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

connor_mcmonigle wrote: Sun Feb 28, 2021 10:29 pm Also, zero initialization would work totally fine, as the gradient of the loss function w.r.t. the weight matrix of a given affine transform layer is the outer product of the input to that layer and the gradient of the loss function w.r.t. the output of that layer. Therefore, weights == zero does not imply gradient == zero.
It does when two successive layers are zero. Any infinitesimal change of a weight in the first layer would then not affect the network output, because the second layer would not transmit it. And any such change in the second layer would have no effect because the cell it comes from gets zero activation. So the gradient is zero. In essence you are seeing the product rule for differentiation here: (f*g)' = f'*g + f*g', where f is the weight of one layer, and g of the other. If f = g = 0, it doesn't matter that f' and g' !=0.

1/3 and 2/3 would also not work. (And beware that you add two layers of weights; if you make the weights in both layers 3 times smaller, the contribution of that part of the net gets 9 times smaller.) If the nets are just multiples of each other, their weights will affect the result in the same proportion, so they will also be modified in the same proportion. They would keep doing the same thing forever. You have to make sure they do something essentially different from the start.

Randomizing one layer and zeroing the other would work, though! This would not affect the initial net output.
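
For what it's worth, a tiny numpy sketch of that kind of initialization with made-up sizes (a generic two-layer stand-in, not the real NNUE topology or trainer): the old weights go into the left half, the new cells get random incoming weights and zero outgoing weights, and the enlarged net initially computes exactly what the old one did.

import numpy as np

rng = np.random.default_rng(0)
relu = lambda v: np.maximum(v, 0.0)

# "Old" trained toy net: input -> 4 hidden cells -> output (sizes made up).
n_in, n_old, n_out = 8, 4, 2
W1 = rng.normal(size=(n_old, n_in))
W2 = rng.normal(size=(n_out, n_old))
x = rng.normal(size=n_in)
y_old = W2 @ relu(W1 @ x)

# Doubled hidden layer: copy the old weights into the left half, randomize
# the new cells' incoming weights, zero their outgoing weights.
W1_big = np.vstack([W1, rng.normal(size=(n_old, n_in))])  # new rows randomized
W2_big = np.hstack([W2, np.zeros((n_out, n_old))])        # new columns zeroed
y_big = W2_big @ relu(W1_big @ x)

print(np.allclose(y_old, y_big))  # True: the initial output is unchanged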
connor_mcmonigle
Posts: 533
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

hgm wrote: Sun Feb 28, 2021 11:25 pm
connor_mcmonigle wrote: Sun Feb 28, 2021 10:29 pm Also, zero initialization would work totally fine, as the gradient of the loss function w.r.t. the weight matrix of a given affine transform layer is the outer product of the input to that layer and the gradient of the loss function w.r.t. the output of that layer. Therefore, weights == zero does not imply gradient == zero.
It does when two successive layers are zero. Any infinitesimal change of a weight in the first layer would then not affect the network output, because the second layer would not transmit it. And any such change in the second layer would have no effect because the cell it comes from gets zero activation. So the gradient is zero. In essence you are seeing the product rule for differentiation here: (f*g)' = f'*g + f*g', where f is the weight of one layer, and g of the other. If f = g = 0, it doesn't matter that f' and g' !=0.

1/3 and 2/3 would also not work. (And beware that you add two layers of weights; if you make the weights in both layers 3 times smaller, the contribution of that part of the net gets 9 times smaller.) If the nets are just multiples of each other, their weights will affect the result in the same proportion, so they will also be modified in the same proportion. They would keep doing the same thing forever. You have to make sure they do something essentially different from the start.
Yes. If two layers are zero initialized in a simple fully connected network, the gradient will be zero for both layers (higher order methods can overcome this). That's correct. However, that's not what chrisw was suggesting. In chrisw's scheme, only part of the first layer would be zero initialized. The successive layers would still have nonzero weights.
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

Where do you read that? I just see:
Step 1. Initialise our new 1024 neuron network with all weights = 0.0

Step 2. Unload all SF NNUE weights (from topography 512) and use them to fill the left hand side of the 1024 topography net.
The new part of the network has connections both from the input to the 512 new cells and from those cells to the 32-cell layer. These had no counterpart in the 512-wide NNUE. These cells will always stay completely dead. Unless you use second-order methods, as you say. But that would still not solve the problem that they all contribute at second order in exactly the same way.
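
For concreteness, a small numpy sketch of that situation (a generic two-layer stand-in with made-up sizes, not NNUE code): a hidden cell whose incoming and outgoing weights are both zero gets a zero first-order gradient on both sets of weights, so plain gradient descent never wakes it up.

import numpy as np

rng = np.random.default_rng(1)

n_in, n_hidden, n_out = 6, 4, 1
x = rng.normal(size=n_in)

W1 = rng.normal(size=(n_hidden, n_in))
W2 = rng.normal(size=(n_out, n_hidden))
W1[2, :] = 0.0   # "new" cell 2: incoming weights zeroed
W2[:, 2] = 0.0   # ...and its outgoing weights zeroed as well

h = W1 @ x                     # cell 2 gets zero activation
grad_y = np.ones(n_out)        # stand-in for dL/dy
grad_h = W2.T @ grad_y         # backprop through the second layer
grad_W1 = np.outer(grad_h, x)  # dL/dW1 (activation omitted for simplicity)
grad_W2 = np.outer(grad_y, h)  # dL/dW2

print(np.allclose(grad_W1[2, :], 0.0))  # True: no gradient into the dead cell
print(np.allclose(grad_W2[:, 2], 0.0))  # True: and no gradient out of it either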
connor_mcmonigle
Posts: 533
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

Read step 2 :wink:
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

Sorry, I was still editing my post, to copy step 2 in it too. It won't help. The added cells ("the right-hand side") have zero input and output, in this prescription.
connor_mcmonigle
Posts: 533
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by connor_mcmonigle »

hgm wrote: Sun Feb 28, 2021 11:47 pm Sorry, I was still editing my post, to copy step 2 in it too. It won't help. The added cells have zero input and output, in this prescription.
No worries. The output of a given layer isn't immediately relevant to calculating the gradient w.r.t. its weight matrix (we need the gradient w.r.t. the output, not the output itself). The layer's input will also be nonzero, as the layer is given the board (halfKP) features as input. As both the gradient w.r.t. the first layer's output and the input vector are nonzero, the gradient w.r.t. the first layer's weight matrix is also nonzero.
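
As a sanity check on that chain-rule argument, a small numpy sketch (toy sizes, a sparse 0/1 vector standing in for the halfKP features, not actual trainer code): with a zero-initialized first layer but a nonzero following layer, the gradient w.r.t. the first layer's weight matrix comes out nonzero.

import numpy as np

rng = np.random.default_rng(2)

n_feat, n_hidden, n_out = 10, 3, 1
x = np.zeros(n_feat)
x[[0, 3, 7]] = 1.0                       # a few active features, like sparse halfKP input
W1 = np.zeros((n_hidden, n_feat))        # zero-initialized first layer
W2 = rng.normal(size=(n_out, n_hidden))  # nonzero following layer

grad_y = np.ones(n_out)        # stand-in for dL/dy
grad_h = W2.T @ grad_y         # dL/dh: nonzero because W2 is nonzero
grad_W1 = np.outer(grad_h, x)  # dL/dW1 = dL/dh x^T (activation omitted for simplicity)

print(np.any(grad_W1 != 0))    # True: zero weights, nonzero gradient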
chrisw
Posts: 4319
Joined: Tue Apr 03, 2012 4:28 pm

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by chrisw »

connor_mcmonigle wrote: Sun Feb 28, 2021 11:52 pm
hgm wrote: Sun Feb 28, 2021 11:47 pm Sorry, I was still editing my post, to copy step 2 in it too. It won't help. The added cells have zero input and output, in this prescription.
No worries. The output of a given layer isn't immediately relevant to calculating the gradient w.r.t. its weight matrix (we need the gradient w.r.t. the output, not the output itself). The layer's input will also be nonzero, as the layer is given the board (halfKP) features as input. As both the gradient w.r.t. the first layer's output and the input vector are nonzero, the gradient w.r.t. the first layer's weight matrix is also nonzero.
One could, instead of zeroing the right-hand side, initialise it with what would have gone into the left-hand side (the weights from the factoriser). Then the RHS weights to the hidden layer are set to duplicate the already existing ones from the LHS. Would that work?
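
If I read this right, the proposed initialization would look something like the sketch below (again a generic two-layer stand-in with made-up sizes; W1_factoriser is a hypothetical array standing in for the factoriser's contribution, not something read from an actual trainer). Whether the two halves then diverge under training is exactly the question being asked.

import numpy as np

rng = np.random.default_rng(3)

n_in, n_old, n_out = 8, 4, 2
W1_old = rng.normal(size=(n_old, n_in))         # stand-in for the trained first-layer weights
W2_old = rng.normal(size=(n_out, n_old))        # stand-in for the trained next-layer weights
W1_factoriser = rng.normal(size=(n_old, n_in))  # hypothetical: the factoriser's weights

# RHS of the first layer gets the factoriser weights instead of zeros,
# and the RHS columns of the next layer duplicate the existing LHS columns.
W1_big = np.vstack([W1_old, W1_factoriser])
W2_big = np.hstack([W2_old, W2_old.copy()])

print(W1_big.shape, W2_big.shape)  # (8, 8) and (2, 8)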