You've trained a brilliant NN(UE) King-Piece Network. Now what?
Posted: Thu Nov 19, 2020 8:57 am
Skip to the bottom if you don't want to humor me....
Monologue
::::
So I wrote an NNTraining tool in C, made it very well threaded, cranked up the optimizations, wrote AVX2 code for portions of it to perform weight updates faster, and came up with a novel addition to Adam to improve throughput when dealing with sparse inputs: batching in fixed sizes and refusing to shuffle intra-batch, only inter-batch.
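Roughly, the batching trick looks like this (a simplified sketch, not my actual trainer code; BATCH_SIZE and the sample layout are placeholders): shuffle the order of the fixed-size batches each epoch, but never the samples inside a batch, so the sparse-input handling stays predictable.

/* Illustrative sketch of the fixed-size-batch, inter-batch-only shuffle. */

#include <stdlib.h>

#define BATCH_SIZE 16384

typedef struct {
    /* sparse feature indices, labels, etc. for BATCH_SIZE positions,
       stored in the order they were generated */
    int nsamples;
} Batch;

static void shuffle_batches(Batch **batches, int nbatches) {
    /* Fisher-Yates over whole batches: batch order changes every epoch,
       sample order inside a batch never does. */
    for (int i = nbatches - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        Batch *tmp = batches[i];
        batches[i] = batches[j];
        batches[j] = tmp;
    }
}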
The result? An NNUE 2x[40960x256] => 512x32x32x1 that beats Ethereal master in Fischer Random Chess by +200 elo in even-nodes gameplay. That was trained with 300M positions, which is far from the "standard", and was also trained on Ethereal's HCE. There is every reason to believe that additional games, and a regeneration of the tuning data using the new NNUE evaluation to label it, will net even more elo.
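For reference, in C terms the shape is roughly the following (a sketch with my own illustrative names, not the trainer's actual structures, in float before any quantization; I take the "2x" at face value as two per-perspective input tables, which is also what lands near the 20 million weights mentioned further down):

#define N_INPUTS  40960   /* HalfKP-style features per perspective */
#define N_ACC     256     /* accumulator width per perspective     */
#define N_L1      32
#define N_L2      32

typedef struct {
    float in_weights[2][N_INPUTS][N_ACC]; /* one table per perspective */
    float in_biases[N_ACC];
    float l1_weights[2 * N_ACC][N_L1];    /* 512 -> 32 */
    float l1_biases[N_L1];
    float l2_weights[N_L1][N_L2];         /* 32 -> 32  */
    float l2_biases[N_L2];
    float out_weights[N_L2];              /* 32 -> 1   */
    float out_bias;
} Network;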
So... what to do with it? That is the question. I tried, and failed, to port it to SF's structure. SF's optimization requires playing games with the values of the weights and biases, something you don't get by default from floating-point trained networks. I've not coded the actual "incremental" part yet, but with some measurements I can show that my net will perform 400+ knps worse than "Etherlito" even under best-case assumptions. That is quite damning. Ethereal was already somewhat "Fast and Dumb", so the reduced speed is painful.
So let's test the speed theory further. I trained another network, this time 2x[40960x64] => 128x32x32x1, which has 1/4 of the input weights, resulting in about 1/3 of the total FMAs needed (trust the math). What does that do (in FRC)? It beats Ethereal master by about 14 elo, despite not doing incremental updates, i.e., recomputing the first layer every time.
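For the "trust the math" part, here is the back-of-the-envelope count I have in mind, assuming roughly 30 active features per perspective (an assumed figure, not a measurement) and a full first-layer recompute per evaluation. It lands somewhere between a quarter and a third of the big net's FMAs, depending on how many features are actually active:

/* Rough FMA count per evaluation, full first-layer recompute,
   assuming ~30 active features per perspective. */
#include <stdio.h>

int main(void) {
    const int active = 30;
    long big   = 2L * active * 256 + 512 * 32 + 32 * 32 + 32;  /* ~32,800 */
    long small = 2L * active * 64  + 128 * 32 + 32 * 32 + 32;  /*  ~8,992 */
    printf("big: %ld  small: %ld  ratio: %.2f\n",
           big, small, (double)small / (double)big);            /* ~0.27 */
    return 0;
}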
So you might say: Andrew, you've shown that you _can_ beat master with an NNUE, so why don't you man up and code the incremental update, then come cry on the forums? Well, that is a good question. I'll get around to it, at the expense of other things, soon enough.
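For anyone following along, the incremental update in question is conceptually simple; a minimal scalar float sketch (illustrative names, not a proposed implementation) of what a quiet move costs, compared to the full ~30-feature rebuild above:

/* Incremental accumulator update for a quiet move: subtract the column
   of the feature that disappeared, add the column of the one that
   appeared. One perspective shown. */

#define N_ACC 256

void accumulator_move(float *acc,                     /* [N_ACC] */
                      const float (*in_weights)[N_ACC],
                      int removed_feature, int added_feature) {
    for (int i = 0; i < N_ACC; i++)
        acc[i] += in_weights[added_feature][i]
                - in_weights[removed_feature][i];
}

A capture removes one more feature, and in HalfKP a king move invalidates that perspective's accumulator entirely, but the common case is a couple of column updates instead of a full first-layer recompute.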
::::
So floating-point NN implementations on the CPU have some downsides and some upsides. A downside is that you will not get deterministic behavior across platforms, which is an issue since I rely on that. As an upside, it's trivial: multiplying and adding floats together takes very few brain cells to program. Since a 32-bit float times a 32-bit float is, once again, 32 bits, everything works out nicely. When you deal with int16_t, you don't get that luxury.
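To illustrate what I mean by losing that luxury, compare a float dot product with an int16_t one: in the integer case the products no longer fit the operand type, so you are forced into a wider accumulator plus explicit fixed-point scaling (a toy sketch; SCALE_SHIFT is a placeholder, not a recommended value):

#include <stdint.h>

/* float: the product of two 32-bit floats is still a 32-bit float,
   so the accumulator type is just the operand type. */
float dot_f32(const float *a, const float *b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* int16_t: a 16x16 multiply needs 32 bits, the sum needs 32 bits (or
   more), and the fixed-point scale baked into the quantized weights
   has to be divided back out at the end. */
#define SCALE_SHIFT 6

int32_t dot_i16(const int16_t *a, const int16_t *b, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum / (1 << SCALE_SHIFT);
}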
So, my real question, I suppose: quantizing down to int16_t does not appear, to me, to be worth very much. You can perform some extra ops -- speed up ReLUs and such -- but the multiplication is not a perfect 2x gain, I believe. Am I wrong? Can you line up AVX instructions such that this is not a concern?
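The way I understand the nominal 2x: _mm256_fmadd_ps retires 8 float multiply-adds per instruction, while _mm256_madd_epi16 retires 16 int16 multiplies (pair-summed into 8 int32 lanes), but on AVX2 you then need a separate add to accumulate those int32s, so both naive loops end up around 8 MACs per arithmetic instruction. A toy sketch of both inner loops (assumes n is a multiple of 16, compile with -mavx2 -mfma):

#include <immintrin.h>
#include <stdint.h>

/* 8 float MACs per fmadd instruction. */
float dot_f32_avx(const float *a, const float *b, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    return tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];
}

/* 16 int16 multiplies per madd instruction, pair-summed into 8 int32
   lanes, then an extra add_epi32 to accumulate. */
int32_t dot_i16_avx(const int16_t *a, const int16_t *b, int n) {
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < n; i += 16) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(va, vb));
    }
    int32_t tmp[8];
    _mm256_storeu_si256((__m256i *)tmp, acc);
    return tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];
}

As I read it, the int16 path still halves the loads and memory traffic, but the extra accumulate is where the clean 2x goes; AVX-512 VNNI's vpdpwssd fuses the multiply-and-accumulate, plain AVX2 does not.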
Assuming not, this means you have to dive down into the world of int8_t. That is a scary world. At that point, it appears to me, you have to take special action during the training process to ensure that the weights stay within some smallish range. Gary pointed this out in his pytorch thread with regard to "effectively" (my quotes) multiplying the NN output by 600 at the end, to keep the final set of weights from having to grow very large.
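The kind of special action I mean is, for example, clipping the weights back into a representable range after every optimizer step, so that a later int8_t quantization at some fixed scale cannot overflow. A hedged sketch of that idea, with the scale and bound as placeholder values rather than anything Gary's trainer actually uses:

#include <stdint.h>

/* Placeholder quantization scale: weight w maps to round(w * QSCALE),
   so weights must stay within +/- 127/QSCALE to fit an int8_t. */
#define QSCALE      64.0f
#define WEIGHT_MAX  (127.0f / QSCALE)   /* ~1.98 */

/* Clamp weights after each Adam step so training never produces a
   value the int8_t quantizer cannot represent. */
void clip_weights(float *weights, int n) {
    for (int i = 0; i < n; i++) {
        if (weights[i] >  WEIGHT_MAX) weights[i] =  WEIGHT_MAX;
        if (weights[i] < -WEIGHT_MAX) weights[i] = -WEIGHT_MAX;
    }
}

/* Quantize to int8_t with simple round-to-nearest. */
void quantize_weights(const float *weights, int8_t *out, int n) {
    for (int i = 0; i < n; i++) {
        float q = weights[i] * QSCALE;
        out[i] = (int8_t)(q >= 0.0f ? q + 0.5f : q - 0.5f);
    }
}

The multiply-by-600 trick reads, to me, like the same idea from the other end: keep the trained output, and therefore the last layer's weights, small, and scale up to engine-sized scores only at the very end.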
::::
One thing that has made me quite upset is that there appears to be very little literature online discussing the nuances of these things and how to actually implement them in C. It looks like reading CFish is the best crash course you could get, but I'm not too keen, since I'm not a fan of the programming style, and I also don't want to just duplicate that effort. I wrote the trainer with no direct contact with the SF code, so I'd like to see if I can do the implementation on my own as well.
I suppose the end result might be that I don't care to do the implementation. In which case it's weird that I have a big function of 20 million weights sitting on my desktop, which would gain 100+ elo if implemented well, but which no one will ever see.