One third, two thirds
How to make a double-sized net as good as SF NNUE in a few easy steps
Moderators: hgm, Rebel, chrisw
-
- Posts: 4319
- Joined: Tue Apr 03, 2012 4:28 pm
-
- Posts: 1631
- Joined: Tue Aug 21, 2018 7:52 pm
- Full name: Dietrich Kappe
Re: How to make a double-sized net as good as SF NNUE in a few easy steps
I hope Chris gives it a shot. Would love to see the results of his 512x2x16x16 net.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
-
- Posts: 533
- Joined: Sun Sep 06, 2020 4:40 am
- Full name: Connor McMonigle
Re: How to make a double-sized net as good as SF NNUE in a few easy steps
An interesting idea for sure. I'd be very curious to see the results. It might also be interesting to try starting with a much smaller network and then iteratively growing the first layer's size. Also, zero initialization would work totally fine, as the gradient of the loss function w.r.t. the weight matrix of a given affine transform layer is the outer product of the input to that layer and the gradient of the loss function w.r.t. the output of the given layer. Therefore, weights == zero does not imply gradient == zero.
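That outer-product claim is easy to check numerically. A minimal sketch (a hypothetical toy layer, not SF NNUE code): a single affine layer y = W x with squared-error loss. Even with W all zeros, dL/dW = (y - t) ⊗ x, which is nonzero whenever the input and the output-gradient are.

```python
# Toy single affine layer: y = W x, loss L = 0.5 * ||y - t||^2.
# Gradient of L w.r.t. W is the outer product (y - t) ⊗ x.

def affine(W, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def grad_W(x, dL_dy):
    # dL/dW_ij = dL/dy_i * x_j  (outer product of output-gradient and input)
    return [[g * x_j for x_j in x] for g in dL_dy]

x = [1.0, 0.0, 1.0]                  # nonzero input (think: active halfKP features)
W = [[0.0] * 3 for _ in range(2)]    # zero-initialized weights
t = [0.5, -0.5]                      # target

y = affine(W, x)                     # all zeros, since W == 0
dL_dy = [y_i - t_i for y_i, t_i in zip(y, t)]
g = grad_W(x, dL_dy)
print(g)                             # nonzero: weights == 0 does not imply gradient == 0
```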
There's no basis to the claim that Albert used this approach. As Dietrich points out, the remaining layers are 16->16 in the FF2 network. However, these weights are super easy to train relative to the first "feature transformer" layer. The first layer is most relevant to the strength of the network and very difficult to obtain good weights for. This means Albert could have used the approach suggested and saved a lot of work, but I'm quite confident he did not as visual inspection of the first layer of FF2 reveals that it is quite different (it appears Albert didn't use factorization).
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How to make a double-sized net as good as SF NNUE in a few easy steps
connor_mcmonigle wrote: ↑Sun Feb 28, 2021 10:29 pm
Also, zero initialization would work totally fine as the gradient of the loss function w.r.t. the weight matrix of a given affine transform layer is the outer product of the input to that layer and the gradient of the loss function w.r.t. the output of the given layer. Therefore, weights == zero does not imply gradient == zero.

It does when two successive layers are zero. Any infinitesimal change of a weight in the first layer would then not affect the network output, because the second layer would not transmit it. And any such change in the second layer would have no effect, because the cell it comes from gets zero activation. So the gradient is zero. In essence you are seeing the product rule for differentiation here: (f*g)' = f'*g + f*g', where f is the weight of one layer, and g of the other. If f = g = 0, it doesn't matter that f' and g' != 0.
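The product-rule argument can be seen with two chained scalar weights (a toy illustration, not engine code): out = b * (a * x), so d(out)/da = b*x and d(out)/db = a*x, and with a == b == 0 both gradients vanish.

```python
# Two chained weights, out = b * (a * x).
# By the product rule: d(out)/da = b * x, d(out)/db = a * x.

def grads(a, b, x):
    return b * x, a * x   # (d out/da, d out/db)

x = 3.0
print(grads(0.0, 0.0, x))   # both zero: gradient descent never moves
print(grads(0.0, 1.0, x))   # zeroing only one of the two layers still learns
```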
1/3 and 2/3 would also not work. (And beware that you add two layers of weights; if you make the weights in both layers 3 times smaller, the contribution of that part of the net gets 9 times smaller.) If the nets are just multiples of each other, their weights will affect the result in the same proportion, so they will also be modified in the same proportion. They would keep doing the same thing forever. You have to make sure they do something essentially different from the start.
Randomizing one layer, and zeroing the other would work, though! This would not affect the initial net output.
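A toy sketch of this suggestion (invented numbers, not actual NNUE weights): add a new hidden cell whose input weights are randomized but whose output weight is zero. The net's output is unchanged at first, yet the zero output weight still receives a nonzero gradient because the cell's activation is nonzero, so training can grow the new part.

```python
# One new hidden cell: random input weights, zero output weight.

x = [1.0, -2.0]                  # inputs feeding the new cell
w_in = [0.5, -0.25]              # "randomized" first-layer weights (illustrative values)
w_out = 0.0                      # zeroed second-layer weight

act = max(0.0, sum(w * xi for w, xi in zip(w_in, x)))  # ReLU activation
contribution = w_out * act       # 0.0: the initial net output is unaffected
dL_dout = 1.0                    # some upstream gradient
grad_w_out = dL_dout * act       # nonzero, so w_out can move off zero
print(contribution, grad_w_out)
```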
-
- Posts: 533
- Joined: Sun Sep 06, 2020 4:40 am
- Full name: Connor McMonigle
Re: How to make a double-sized net as good as SF NNUE in a few easy steps
hgm wrote: ↑Sun Feb 28, 2021 11:25 pm
It does when two successive layers are zero. Any infinitesimal change of a weight in the first layer would then not affect the network output, because the second layer would not transmit it. [...] If f = g = 0, it doesn't matter that f' and g' != 0.

Yes. If two layers are zero-initialized in a simple fully connected network, the gradient will be zero for both layers (higher-order methods can overcome this). That's correct. However, that's not what chrisw was suggesting. In chrisw's scheme, only part of the first layer would be zero-initialized; the successive layers would still have nonzero weights.
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How to make a double-sized net as good as SF NNUE in a few easy steps
Where do you read that? I just see:

Step 1. Initialise our new 1024 neuron network with all weights = 0.0
Step 2. Unload all SF NNUE weights (from topography 512) and use them to fill the left hand side of the 1024 topography net.

The new part of the network has connections from the 512 new cells both to the input and to the 32-cell layer. These had no counterpart in the 512-wide NNUE. These cells will always stay completely dead. Unless you use second-order methods, as you say. But that would still not solve the problem that they all contribute to the second order in exactly the same way.
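Steps 1 and 2 can be sketched with toy sizes (shapes invented for illustration; the real SF NNUE first layer is far larger): zero-initialize a double-width first layer, then copy the old net's weights into its left half.

```python
# Toy stand-ins for the 512-wide old net and the 1024-wide new net.
OLD, NEW, FEATS = 4, 8, 6        # hidden widths and feature count, illustrative only

old_W = [[0.1 * (i + j) for j in range(OLD)] for i in range(FEATS)]  # "trained" weights
new_W = [[0.0] * NEW for _ in range(FEATS)]    # step 1: all weights = 0.0

for i in range(FEATS):           # step 2: fill the left-hand side with the old weights
    for j in range(OLD):
        new_W[i][j] = old_W[i][j]

# Left half now matches the old net; the right half is the all-zero new part,
# which (with the corresponding second-layer columns also zero) is exactly
# the "completely dead" part discussed above.
print(new_W[0])
```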
-
- Posts: 533
- Joined: Sun Sep 06, 2020 4:40 am
- Full name: Connor McMonigle
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How to make a double-sized net as good as SF NNUE in a few easy steps
Sorry, I was still editing my post, to copy step 2 in it too. It won't help. The added cells ("the right-hand side") have zero input and output, in this prescription.
-
- Posts: 533
- Joined: Sun Sep 06, 2020 4:40 am
- Full name: Connor McMonigle
Re: How to make a double-sized net as good as SF NNUE in a few easy steps
No worries. The output of a given layer isn't immediately relevant to calculating the gradient w.r.t. its weight matrix (we need the gradient w.r.t. the output, not the output itself). The layer's input will also be nonzero, as the layer is given the board (halfKP) features as input. Since both the gradient w.r.t. the first layer's output and the input vector are nonzero, the gradient w.r.t. the first layer's weight matrix is also nonzero.
-
- Posts: 4319
- Joined: Tue Apr 03, 2012 4:28 pm
Re: How to make a double-sized net as good as SF NNUE in a few easy steps
connor_mcmonigle wrote: ↑Sun Feb 28, 2021 11:52 pm
No worries. The output of a given layer isn't immediately relevant to calculating the gradient w.r.t. its weight matrix (we need the gradient w.r.t. the output, not the output itself). [...]

One could, instead of zeroing the right-hand side, initialise it with what would have gone into the left-hand side (the weights from the factoriser). The RHS weights to the hidden layer would then be set to duplicate the ones already existing from the LHS. Would that work?
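One caveat, per hgm's earlier point about proportional copies: a toy check (invented numbers, not FF2 weights) shows that if a new cell exactly duplicates an existing one, both cells receive identical activations and identical gradients, so after identical update steps they remain clones and the doubled net learns nothing new. Initializing from the factoriser only helps insofar as those weights differ from the LHS.

```python
# Gradients for one hidden cell (ReLU assumed active for these inputs).
def cell_grads(w_in, w_out, x, dL_dy):
    act = max(0.0, sum(w * xi for w, xi in zip(w_in, x)))
    # (gradient w.r.t. input weights, gradient w.r.t. output weight)
    return [dL_dy * w_out * xi for xi in x], dL_dy * act

x = [1.0, -0.5]
old_cell = ([0.6, 0.2], 0.3)
new_cell = ([0.6, 0.2], 0.3)     # RHS cell initialized as an exact copy of the LHS cell

g_old = cell_grads(*old_cell, x, 1.0)
g_new = cell_grads(*new_cell, x, 1.0)
print(g_old == g_new)            # identical gradients: clones stay clones
```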