How to make a double-sized net as good as SF NNUE in a few easy steps

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

How to make a double-sized net as good as SF NNUE in a few easy steps

Post by chrisw »

How to make a double-sized net as good as SF NNUE in a few easy steps.

Say we have an NNUE with topology 512 x 32 x 32 x 1 and we want to make a better (we hope) NNUE with topology 1024 x 32 x 32 x 1.

Step 1. Initialise our new 1024-neuron network with all weights = 0.0.

Step 2. Unload all the SF NNUE weights (from the 512 topology) and use them to fill the left-hand side of the 1024-topology net.

Step 3. Check that the 1024 net gives the same outputs as the 512 net. It should, since the zero-weighted half contributes nothing to the result.

Step 4. Start normal training at a very low learning rate; gradually the zero-weighted side will fill in and the SF-NNUE-weighted side will adjust.

If you get it right, with a low enough LR, there will be few or no regressions, and you'll have a brand spanking new double-size NNUE. Bingo.

Next question: how do you hunt down correlations between the weights of an original net and one doubled from it? There must be some, right?
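
A minimal PyTorch sketch of steps 1-3, assuming toy dense layers and made-up sizes (the real SF NNUE uses a sparse feature transformer and quantised integer weights, which this ignores):

import torch
import torch.nn as nn

N_FEATURES = 768   # toy feature count, nothing like the real input size

def widen_to_1024(old_l1: nn.Linear, old_l2: nn.Linear):
    """Steps 1-2: zero-init a 1024-wide layer pair, then copy the 512-wide weights in."""
    new_l1 = nn.Linear(N_FEATURES, 1024)
    new_l2 = nn.Linear(1024, 32)
    with torch.no_grad():
        # Step 1: everything starts at zero.
        new_l1.weight.zero_(); new_l1.bias.zero_()
        new_l2.weight.zero_()
        # Step 2: the old weights fill the "left-hand side" of the new net.
        new_l1.weight[:512] = old_l1.weight
        new_l1.bias[:512] = old_l1.bias
        new_l2.weight[:, :512] = old_l2.weight
        new_l2.bias.copy_(old_l2.bias)
    return new_l1, new_l2

# Step 3: the widened net gives the same outputs as the original,
# because the zero-weighted half contributes nothing.
old_l1, old_l2 = nn.Linear(N_FEATURES, 512), nn.Linear(512, 32)
new_l1, new_l2 = widen_to_1024(old_l1, old_l2)
x = torch.rand(4, N_FEATURES)
act = torch.relu   # stand-in for NNUE's clipped ReLU
assert torch.allclose(old_l2(act(old_l1(x))), new_l2(act(new_l1(x))), atol=1e-5)

Step 4 would then just be resuming the normal training loop on the widened layers at a reduced learning rate.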
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by dkappe »

chrisw wrote: Sun Feb 28, 2021 7:26 pm How to make a double-sized net as good as SF NNUE in a few easy steps.

Say we have an NNUE with topology 512 x 32 x 32 x 1 and we want to make a better (we hope) NNUE with topology 1024 x 32 x 32 x 1.

Step 1. Initialise our new 1024-neuron network with all weights = 0.0.

Step 2. Unload all the SF NNUE weights (from the 512 topology) and use them to fill the left-hand side of the 1024-topology net.

Step 3. Check that the 1024 net gives the same outputs as the 512 net. It should, since the zero-weighted half contributes nothing to the result.

Step 4. Start normal training at a very low learning rate; gradually the zero-weighted side will fill in and the SF-NNUE-weighted side will adjust.

If you get it right, with a low enough LR, there will be few or no regressions, and you'll have a brand spanking new double-size NNUE. Bingo.

Next question: how do you hunt down correlations between the weights of an original net and one doubled from it? There must be some, right?
“Double size” is already 512x2x32x32, since “regular” is 256x2x32x32. A more interesting experiment would be turning a 256x2x32x32 into a 512x2x16x16, which would be closer in terms of performance.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

Did you really try that?

Backpropagation through an NN to adjust the weights by gradient descent on the total squared error would adjust weights in proportion to how much they contributed to the error. If they are all zero, they don't contribute anything, so they would never be adjusted. For this reason AlphaZero started with random weights rather than zero weights.

It might do something in this case, because you are tuning only one layer that is already connected to cells that contribute to the output. But it is still not clear how the 256 new KPST would ever become different from each other.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by chrisw »

hgm wrote: Sun Feb 28, 2021 8:07 pm Did you really try that?
No, but I did wonder if that was how the Fat Fritz net was developed.

Backpropagation through an NN to adjust the weights by gradient descent on the total squared error would adjust weights in proportion to how much they contributed to the error. If they are all zero, they don't contribute anything, so they would never be adjusted. For this reason AlphaZero started with random weights rather than zero weights.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by chrisw »

hgm wrote: Sun Feb 28, 2021 8:07 pm Did you really try that?

Backpropagation through an NN to adjust the weights by gradient descent on the total squared error would adjust weights in proportion to how much they contributed to the error. If they are all zero, they don't contribute anything, so they would never be adjusted. For this reason AlphaZero started with random weights rather than zero weights.

It might do something in this case, because you are tuning only one layer that is already connected to cells that contribute to the output. But it is still not clear how the 256 new KPST would ever become different from each other.
Well, if the new second half were initialised with very small random values and the training set were suitably different, then the net weights would be forced to re-adjust themselves, no? I'd guess that such a development would leave a SIM-test footprint.
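
For what it's worth, the small-random variant is a one-line change to the zero-padding sketch above; the noise scale here is a pure guess:

import torch
import torch.nn as nn

def widen_with_noise(old_l1: nn.Linear, old_l2: nn.Linear, scale: float = 1e-3):
    """Like the zero-padding sketch, but the new half gets small random values."""
    half = old_l1.out_features
    new_l1 = nn.Linear(old_l1.in_features, 2 * half)
    new_l2 = nn.Linear(2 * half, old_l2.out_features)
    with torch.no_grad():
        # Old half copied verbatim, as before.
        new_l1.weight[:half] = old_l1.weight
        new_l1.bias[:half] = old_l1.bias
        new_l2.weight[:, :half] = old_l2.weight
        new_l2.bias.copy_(old_l2.bias)
        # New half: small random noise instead of zeros, so gradients can flow
        # and the new cells all start out slightly different from each other.
        new_l1.weight[half:].normal_(0.0, scale)
        new_l1.bias[half:].normal_(0.0, scale)
        new_l2.weight[:, half:].normal_(0.0, scale)
    return new_l1, new_l2

The output now differs from the original net only through the new path, roughly on the order of scale squared, so the "same outputs" check of step 3 holds approximately rather than exactly.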
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by dkappe »

hgm wrote: Sun Feb 28, 2021 8:07 pm Did you really try that?

Backpropagation through an NN to adjust the weights by gradient descent on the total squared error would adjust weights in proportion to how much they contributed to the error. If they are all zero, they don't contribute anything, so they would never be adjusted. For this reason AlphaZero started with random weights rather than zero weights.

It might do something in this case, because you are tuning only one layer that is already connected to cells that contribute to the output. But it is still not clear how the 256 new KPST would ever become different from each other.
I've distilled the old t10 Leela nets to various sizes, but the most weight-twiddling I've done with NNUE is scaling the final 32 weights to inflate some of my nets.
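
Whatever "inflating" means precisely, the mechanical part of scaling a final 32-to-1 layer is trivial; a toy sketch (not dkappe's code, just the general shape of the operation):

import torch
import torch.nn as nn

def scale_output_layer(out_layer: nn.Linear, factor: float) -> None:
    """Multiply the final 32 -> 1 weights (and bias) in place; the evaluation
    the net reports is simply rescaled by the same factor."""
    with torch.no_grad():
        out_layer.weight.mul_(factor)   # shape (1, 32)
        out_layer.bias.mul_(factor)

final = nn.Linear(32, 1)
scale_output_layer(final, 1.10)   # e.g. stretch evaluations by 10%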
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

chrisw wrote: Sun Feb 28, 2021 8:23 pm Well, if the new second half were initialised with very small random values and the training set were suitably different, then the net weights would be forced to re-adjust themselves, no? I'd guess that such a development would leave a SIM-test footprint.
I suppose so, albeit initially very slowly. The new training material would initially have far more effect on the old part of the network, because the weights there are already large. But I think it is essential to have some random initialization: both to ensure that anything will happen at all, and so that the 512 new cells will not keep doing the same thing forever and effectively act as only a single new cell.

Because the remark in my previous posting was wrong. If I remember the NNUE topology correctly, the added cells would require both new input weights connecting them to the inputs and new output weights connecting them to the existing layer of 32. If the weights in both of these layers start at zero, a change in a single such weight would never change the output, and thus never be adjusted, as the other layer would still block all transmission.
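
A quick numeric check of that last point, using the same toy dense layers as the sketch earlier in the thread: with both the new input weights and the new output weights zeroed, the gradients on the new half come out exactly zero, so gradient descent never moves them.

import torch
import torch.nn as nn

torch.manual_seed(0)
l1, l2, out = nn.Linear(768, 512), nn.Linear(512, 32), nn.Linear(32, 1)
half = 256                                   # pretend the upper 256 cells are "new"
with torch.no_grad():                        # zero the new cells in both layers
    l1.weight[half:].zero_(); l1.bias[half:].zero_()
    l2.weight[:, half:].zero_()

x = torch.rand(16, 768)
loss = out(torch.relu(l2(torch.relu(l1(x))))).pow(2).mean()
loss.backward()

print(l1.weight.grad[half:].abs().max())     # tensor(0.) -- new input weights are stuck
print(l2.weight.grad[:, half:].abs().max())  # tensor(0.) -- new output weights are stuck
print(l1.weight.grad[:half].abs().max())     # > 0        -- the old half still trains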
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by dkappe »

chrisw wrote: Sun Feb 28, 2021 8:23 pm
hgm wrote: Sun Feb 28, 2021 8:07 pm Did you really try that?

Backpropagation through an NN to adjust the weights by gradient descent on the total squared error would adjust weights in proportion to how much they contributed to the error. If they are all zero, they don't contribute anything, so they would never be adjusted. For this reason AlphaZero started with random weights rather than zero weights.

It might do something in this case, because you are tuning only one layer that is already connected to cells that contribute to the output. But it is still not clear how the 256 new KPST would ever become different from each other.
Well, if the new second half were initialised with very small random values and the training set were suitably different, then the net weights would be forced to re-adjust themselves, no? I'd guess that such a development would leave a SIM-test footprint.
The FF2 net is 512x2x16x16. So would you average the 32-wide layers down to 16? Or some other approach? I think you'd end up with a net so weak that you'd be better off starting from scratch.
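
For what a literal pairwise averaging might look like (toy shapes, purely illustrative): rows of the layer producing the 32 outputs get averaged in pairs, and the matching columns of the layer consuming them get summed. That only preserves the output when the two merged neurons happen to behave identically, which is presumably why the result would be so weak.

import torch
import torch.nn as nn

def merge_pairs(prod: nn.Linear, cons: nn.Linear):
    """Shrink a 32-wide hidden layer to 16 by merging neurons pairwise."""
    n = prod.out_features // 2
    new_prod = nn.Linear(prod.in_features, n)
    new_cons = nn.Linear(n, cons.out_features)
    with torch.no_grad():
        # Average the weights/biases of each neuron pair in the producing layer...
        new_prod.weight.copy_(prod.weight.view(n, 2, -1).mean(dim=1))
        new_prod.bias.copy_(prod.bias.view(n, 2).mean(dim=1))
        # ...and sum the corresponding column pairs in the consuming layer.
        new_cons.weight.copy_(cons.weight.view(cons.out_features, n, 2).sum(dim=2))
        new_cons.bias.copy_(cons.bias)
    return new_prod, new_cons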
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by chrisw »

hgm wrote: Sun Feb 28, 2021 8:31 pm
chrisw wrote: Sun Feb 28, 2021 8:23 pm Well, if the new second half were initialised with very small random values and the training set were suitably different, then the net weights would be forced to re-adjust themselves, no? I'd guess that such a development would leave a SIM-test footprint.
I suppose so, albeit initially very slowly. The new training material would initially have far more effect on the old part of the network, because the weights there are already large. But I think it is essential to have some random initialization: both to ensure that anything will happen at all, and so that the 512 new cells will not keep doing the same thing forever and effectively act as only a single new cell.

Because the remark in my previous posting was wrong. If I remember the NNUE topology correctly, the added cells would require both new input weights connecting them to the inputs and new output weights connecting them to the existing layer of 32. If the weights in both of these layers start at zero, a change in a single such weight would never change the output, and thus never be adjusted, as the other layer would still block all transmission.
Another neuron-doubling method could be to divide the layer-one weights by two and use them as initialisers for the two equal halves.
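
One way to read that which keeps the output exactly unchanged: duplicate the 512 accumulator cells and halve the outgoing weights of each copy (halving the incoming layer-one weights themselves would change the values coming out of the activation). A toy sketch, same caveats as the earlier ones:

import torch
import torch.nn as nn

def widen_by_duplication(old_l1: nn.Linear, old_l2: nn.Linear):
    new_l1 = nn.Linear(old_l1.in_features, 2 * old_l1.out_features)
    new_l2 = nn.Linear(2 * old_l1.out_features, old_l2.out_features)
    with torch.no_grad():
        # Two identical copies of every first-layer cell...
        new_l1.weight.copy_(torch.cat([old_l1.weight, old_l1.weight], dim=0))
        new_l1.bias.copy_(torch.cat([old_l1.bias, old_l1.bias], dim=0))
        # ...each feeding the next layer with half the original weight, so the
        # two copies together contribute exactly what the original cell did.
        new_l2.weight.copy_(torch.cat([old_l2.weight, old_l2.weight], dim=1) / 2)
        new_l2.bias.copy_(old_l2.bias)
    return new_l1, new_l2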
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How to make a double-sized net as good as SF NNUE in a few easy steps

Post by hgm »

But with the usual method of learning (= weight adjusting), what is the same will stay the same: the old and the new halves would never diverge from each other.
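
That is easy to confirm numerically with a toy duplicate-and-halve net like the one sketched above: both copies of every cell receive identical gradients, so plain gradient descent keeps them identical step after step.

import torch
import torch.nn as nn

torch.manual_seed(0)
old_l1, old_l2, out = nn.Linear(768, 512), nn.Linear(512, 32), nn.Linear(32, 1)
l1, l2 = nn.Linear(768, 1024), nn.Linear(1024, 32)
with torch.no_grad():                        # duplicate cells, halve outgoing weights
    l1.weight.copy_(torch.cat([old_l1.weight, old_l1.weight], dim=0))
    l1.bias.copy_(torch.cat([old_l1.bias, old_l1.bias], dim=0))
    l2.weight.copy_(torch.cat([old_l2.weight, old_l2.weight], dim=1) / 2)
    l2.bias.copy_(old_l2.bias)

x = torch.rand(16, 768)
loss = out(torch.relu(l2(torch.relu(l1(x))))).pow(2).mean()
loss.backward()

# The two copies of every cell get identical gradients, so an update leaves
# them identical, and by induction they never diverge.
print(torch.allclose(l1.weight.grad[:512], l1.weight.grad[512:]))        # True
print(torch.allclose(l2.weight.grad[:, :512], l2.weight.grad[:, 512:]))  # True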