Neural network quantization

 Posts: 157
 Joined: Fri Apr 11, 2014 8:45 am
 Full name: Fabio Gobbato
Neural network quantization
I have a neural network with floating point weights and I want them to fit in int8_t. The weights range from +64 to -57, and some are very small, like 1e-5.
How can I translate all these values to int8_t without losing accuracy? Is there a way to train the net so that the weights translate easily to int8_t?
I have seen that Stockfish uses a >>6, so if I have understood correctly all the weights are between -2.0 and 1.984375 with 1/64 precision. How can all the weights be in that range?
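For reference, the range quoted in the question follows from reading an int8_t as a fixed-point number with 6 fractional bits. A minimal sketch, assuming round-to-nearest and saturation; the helper names `to_fixed` and `to_float` are made up for illustration and are not taken from Stockfish:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kFracBits = 6;                     // matches the >>6 in the question
constexpr float kScale = float(1 << kFracBits);  // 64 steps per unit

// Hypothetical helper: float -> int8_t fixed point, rounded and saturated.
inline int8_t to_fixed(float w) {
    float scaled = std::round(w * kScale);
    return static_cast<int8_t>(std::max(-128.0f, std::min(127.0f, scaled)));
}

// Hypothetical helper: int8_t fixed point -> float.
inline float to_float(int8_t q) { return static_cast<float>(q) / kScale; }
```

With these definitions `to_float(-128)` is -2.0, `to_float(127)` is 1.984375 and one step is 1/64. A weight of +64 saturates to 127 and 1e-5 rounds to 0, which is the accuracy problem the question describes.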
Re: Neural network quantization
just some random thoughts:
- "vector quantization": per layer, or per fixed-size block of weights, you compute a "palette" of 256 values that minimizes the error (k-means, but you'll need a good initial seed as it only finds a local optimum). of course you need far more than 256 weights in such a block to save space
- try a nonlinear mapping: reserve 1 bit for the sign and quantize the magnitude like pow(a, b), where a is abs(weight) and b is some constant you have to optimize => not sure if this would apply to NN weights, but it might. I've had great success with this elsewhere
for this to work, you should probably pre-normalize the values, e.g. divide by max(a) over all weights. in the end, you end up with 8-bit quantized values per block + 2 float (or better, half-float) values: b and the denormalization scale
in both cases, decoding will be straightforward: simply build a 256-entry LUT of floats and use the bytes to index it
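The palette idea from the first bullet can be sketched as a one-dimensional k-means over a block of weights. This is only an illustration: the `Palette` struct, the function names, the quantile-based seeding and the fixed iteration count are all assumptions of this sketch, not something the post specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct Palette {
    std::vector<float> centers;    // up to 256 palette entries (the decode LUT)
    std::vector<uint8_t> indices;  // one byte per weight
};

// Index of the palette entry closest to v (linear scan, fine for a sketch).
inline uint8_t nearest(const std::vector<float>& centers, float v) {
    size_t best = 0;
    for (size_t k = 1; k < centers.size(); ++k)
        if (std::fabs(centers[k] - v) < std::fabs(centers[best] - v)) best = k;
    return static_cast<uint8_t>(best);
}

// 1-D k-means (Lloyd iterations), seeded from evenly spaced quantiles of the
// sorted weights. The seeding choice is an assumption of this sketch.
inline Palette build_palette(const std::vector<float>& w, size_t k = 256, int iters = 20) {
    assert(!w.empty() && k >= 1 && k <= 256);
    std::vector<float> sorted(w);
    std::sort(sorted.begin(), sorted.end());
    Palette p;
    p.centers.resize(k);
    for (size_t c = 0; c < k; ++c)
        p.centers[c] = sorted[(c * (sorted.size() - 1)) / (k > 1 ? k - 1 : 1)];
    p.indices.resize(w.size());
    for (int it = 0; it < iters; ++it) {
        std::vector<double> sum(k, 0.0);
        std::vector<size_t> cnt(k, 0);
        for (size_t i = 0; i < w.size(); ++i) {   // assignment step
            p.indices[i] = nearest(p.centers, w[i]);
            sum[p.indices[i]] += w[i];
            ++cnt[p.indices[i]];
        }
        for (size_t c = 0; c < k; ++c)            // update step: mean of each cluster
            if (cnt[c]) p.centers[c] = static_cast<float>(sum[c] / cnt[c]);
    }
    // Final assignment against the updated centers.
    for (size_t i = 0; i < w.size(); ++i) p.indices[i] = nearest(p.centers, w[i]);
    return p;
}
```

Decoding is then just `p.centers[p.indices[i]]`, the LUT lookup the post mentions. As noted there, the result depends on the seed, so a real implementation would probably want a more careful initialization.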
Martin Sedlak
Re: Neural network quantization
here's pseudocode for a simple nonlinear pow-based quantization (actually it's what I quickly prototyped in my scripting language)
it produces the following output (qval is the quantized values, scale and power are the reconstruction params, and output is the reconstructed input after dequantization)
of course in the real world you'd have a much larger input, but this basically outlines the simple idea...
Code:
best_err=0.0255281 power=2.2
qval=[9]{115, 90, 14, 145, 205, 0, 0, 65, 255} scale=66 power=2.2
input: [9]{53, 31, 0.5, -0.75, -22, 0, 1e-05, 15, -66}
output: [9]{53.0531, 30.9392, 0.516009, -0.790976, -21.9511, 0, 0, 15.1212, -66}
Code:
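constexpr float FLT_MAX = 1e+30f;
//
// table layout: the first half holds the non-negative values
// pow(i/(half-1), power)*scale, the second half holds the same values negated
void build_dequant_table(float[] table, float scale, float power)
{
    for (int i : table.size/2)
        table[i] = pow(cast float i / (table.size/2-1), power) * scale;
    for (int i = table.size/2 : table.size)
        table[i] = -table[i - table.size/2];
}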
//
// returns the index of the table entry closest to value
int quantize_value(float value, const float[] dequant)
{
    int res = 0;
    float besterr = FLT_MAX;
    // note: naive and slow
    for (int i : dequant.size)
    {
        auto delta = dequant[i] - value;
        auto err = delta*delta;
        if (err < besterr)
        {
            besterr = err;
            res = i;
        }
    }
    return res;
}
//
// quantizes a block to bytes; scale and power receive the reconstruction params
array<byte> quantize(const float[] block, float &scale, float &power)
{
    array<byte> res;
    res.resize(block.size);
    // dequantization table
    float dequant[256];
    // scale = largest magnitude in the block
    float amax = 0.0f;
    for (int i : block.size)
        amax = max(amax, abs(block[i]));
    // avoid div by zero
    if (!amax)
        amax = 1.0f;
    scale = amax;
    power = 1.0f;
    float besterr = FLT_MAX;
    array<byte> tmp;
    tmp.resize(block.size);
    // brute-force search for the power with the lowest squared error
    // range 0.1 to 4.0 seems reasonable, YMMV
    for (float p = 0.1f; p<=4.0f; p += 0.1f)
    {
        // build dequantization table
        build_dequant_table(dequant, scale, p);
        float sqerr = 0.0f;
        for (int i : block.size)
        {
            int b = quantize_value(block[i], dequant);
            tmp[i] = cast byte b;
            auto delta = dequant[b] - block[i];
            sqerr += delta*delta;
        }
        if (sqerr < besterr)
        {
            besterr = sqerr;
            power = p;
            res = tmp;
        }
    }
    // print the best error and power found
    "best_err=%t power=%t\n", besterr, power;
    return res;
}
//
void main()
{
    const float input[] = {
        53, 31, 0.5, -0.75, -22, 0, 1e-5, 15, -66
    };
    float scale, power;
    auto qval = quantize(input, scale, power);
    // quantized values, scale and power => all that's needed to decode
    // quantized data
    "qval=%t scale=%t power=%t\n", qval, scale, power;
    float output[input.size];
    float dequant[256];
    build_dequant_table(dequant, scale, power);
    // decoding is straightforward and fast
    for (auto i : output.size)
        output[i] = dequant[qval[i]];
    "input: %t\n", input;
    "output: %t\n", output;
}
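Since the listing above is written in a private scripting language, here is a rough C++ port for anyone who wants to try it. It is a sketch, not the author's code: it assumes the second half of the table holds the negated values of the first half, the `Quantized` struct and `dequantize` helper are additions for convenience, and the search over the power uses an integer step counter instead of `p += 0.1f` to avoid floating-point drift.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

using Table = std::array<float, 256>;

// First half: pow-curve magnitudes in [0, scale]; second half: their negatives.
inline Table build_dequant_table(float scale, float power) {
    Table t{};
    const int half = 128;
    for (int i = 0; i < half; ++i)
        t[i] = std::pow(float(i) / (half - 1), power) * scale;
    for (int i = half; i < 256; ++i)
        t[i] = -t[i - half];
    return t;
}

// Brute-force search for the nearest table entry, as in the pseudocode.
inline int quantize_value(float value, const Table& t) {
    int res = 0;
    float best = std::numeric_limits<float>::max();
    for (int i = 0; i < 256; ++i) {
        float d = t[i] - value;
        if (d * d < best) { best = d * d; res = i; }
    }
    return res;
}

struct Quantized {            // hypothetical container, not in the original
    std::vector<uint8_t> qval;
    float scale;
    float power;
};

inline Quantized quantize(const std::vector<float>& block) {
    float amax = 0.0f;
    for (float v : block) amax = std::max(amax, std::fabs(v));
    if (amax == 0.0f) amax = 1.0f;  // avoid division by zero

    Quantized best{std::vector<uint8_t>(block.size()), amax, 1.0f};
    float best_err = std::numeric_limits<float>::max();
    std::vector<uint8_t> tmp(block.size());
    for (int step = 1; step <= 40; ++step) {  // powers 0.1 .. 4.0
        float p = 0.1f * step;
        Table t = build_dequant_table(amax, p);
        float sqerr = 0.0f;
        for (size_t i = 0; i < block.size(); ++i) {
            int b = quantize_value(block[i], t);
            tmp[i] = static_cast<uint8_t>(b);
            float d = t[b] - block[i];
            sqerr += d * d;
        }
        if (sqerr < best_err) { best_err = sqerr; best.power = p; best.qval = tmp; }
    }
    return best;
}

// Decoding is a plain table lookup.
inline std::vector<float> dequantize(const Quantized& q) {
    Table t = build_dequant_table(q.scale, q.power);
    std::vector<float> out;
    for (uint8_t b : q.qval) out.push_back(t[b]);
    return out;
}
```

On the nine-value sample from the post this should give a reconstruction close to the listed output, although the exact power picked may differ slightly because of floating-point details.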
Martin Sedlak

 Posts: 157
 Joined: Fri Apr 11, 2014 8:45 am
 Full name: Fabio Gobbato
Re: Neural network quantization
But if I have understood correctly, this way you get an index into an array, whereas the weights of a neural net are multiplied by the inputs and added together.
If I have understood correctly, Stockfish uses fixed-point arithmetic: the output is shifted right by 6 places, which is like a division by 64. If so, all the weights must be in a predefined range.
Re: Neural network quantization
Fabio Gobbato wrote: ↑Tue Sep 08, 2020 8:31 am
"I have a neural network with floating point weights and I want them to fit in int8_t."
I thought you wanted to reduce the size of the net while minimizing the overall error, so I probably misunderstood what you want.
if they shift by 6 then they use FP2:6; they still need a sign bit though, so the real range should be ~-1.9999 to 1.9999. I've no idea what magnitude restrictions there are on the weights.
however, the NNUE data bundled with SF actually seems to be 16-bit, so now I'm puzzled
Martin Sedlak
Re: Neural network quantization
either way, if you load the quantized weights as 8-bit, you can unpack them as floats or whatever and do what you want.
the quantization I propose (8-bit) should be superior to plain linear quantization anyway, so I still don't see any problem
Martin Sedlak

 Posts: 157
 Joined: Fri Apr 11, 2014 8:45 am
 Full name: Fabio Gobbato
Re: Neural network quantization
But then the net calculations would all be done on floating point values, while Stockfish uses only integer arithmetic. Using only ints is an advantage because the net calculation is faster. But I am missing something, because during training the weights differ a lot and int8_t gives low accuracy. Maybe it uses some tricks in the training.
Re: Neural network quantization
Fabio Gobbato wrote: ↑Tue Sep 08, 2020 8:31 am
"I have a neural network with floating point weights and I want them to fit in int8_t. The weights range from +64 to -57, and some are very small, like 1e-5.
How can I translate all these values to int8_t without losing accuracy? Is there a way to train the net so that the weights translate easily to int8_t?
I have seen that Stockfish uses a >>6, so if I have understood correctly all the weights are between -2.0 and 1.984375 with 1/64 precision. How can all the weights be in that range?"
Maybe they just shift out the digits after the decimal point, thus flooring the number.

 Posts: 2208
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Neural network quantization
Fabio Gobbato wrote: ↑Tue Sep 08, 2020 7:24 pm
"But the net calculations would be all on floating point values while stockfish use only integer arithmetic. It's an advantage to use only ints because the net calculation is faster. But I miss something because while training the net the weights differs a lot and int8_t gives low accuracy. But maybe it uses some tricks in the training."
But the vast majority of the weights (256x41024) are int16 (their outputs are incrementally updated in make move); only the remaining layers of 2x256x32 ReLU, 32x32 ReLU and 32x1 use 8-bit weights, implemented in the AffineTransform and ClippedReLU classes.
See the post from the "Don't understand NNUE" thread.
Considering AVX2 in AffineTransform only: they use _mm256_maddubs_epi16 (vpmaddubsw) to get intermediate saturated 16-bit results, each the sum of two 8-bit multiplications, then _mm256_madd_epi16 (vpmaddwd) as a multiplication by 1 with a horizontal add of consecutive 16-bit values into 32-bit ints, accumulated via _mm256_add_epi32 in sum. After a further horizontal add and some shuffling, the 32-bit sums are written as 32-bit ints to output. In ClippedReLU, the 32-bit results are packed to 16-bit integers using signed saturation (_mm256_packs_epi32, aka vpackssdw) and arithmetically shifted right by kWeightScaleBits = 6 (a floor division by 64). Finally, the 16-bit words0 and words1 are packed to 8-bit integers using signed saturation (_mm256_packs_epi16), before _mm256_max_epi8 implements the 8-bit ReLU.
But I have not studied the backprop training code yet, whether they use intermediate floating point weights and finally map to integers, or do all the math with integers (nope).
Double and Half Float so far
https://github.com/nodchip/Stockfish/bl ... earner.cpp
https://github.com/nodchip/Stockfish/bl ... lf_float.h
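The integer pipeline described in this post can be sketched in scalar form, which may be easier to follow than the intrinsics. This is a simplified illustration, not Stockfish code: the function names, the row-major weight layout, the bias vector and the unsigned 8-bit inputs are assumptions of the sketch, and it accumulates directly in 32 bits, so it does not reproduce the intermediate 16-bit saturation of the SIMD path.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kWeightScaleBits = 6;  // name taken from the post; a weight of 64 means 1.0

// Affine layer: 32-bit accumulation of 8-bit inputs times 8-bit weights.
// weights is row-major (weights[o * in.size() + i]); biases are assumed here.
inline std::vector<int32_t> affine(const std::vector<uint8_t>& in,
                                   const std::vector<int8_t>& weights,
                                   const std::vector<int32_t>& biases) {
    assert(weights.size() == in.size() * biases.size());
    std::vector<int32_t> out(biases);
    for (size_t o = 0; o < out.size(); ++o)
        for (size_t i = 0; i < in.size(); ++i)
            out[o] += int32_t(weights[o * in.size() + i]) * int32_t(in[i]);
    return out;
}

// Clipped ReLU: drop the 6 fractional bits, then clamp to [0, 127].
inline std::vector<uint8_t> clipped_relu(const std::vector<int32_t>& in) {
    std::vector<uint8_t> out(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        // Right shift of a negative value is arithmetic on mainstream compilers
        // (and guaranteed from C++20); negatives are clamped to 0 anyway.
        int32_t v = in[i] >> kWeightScaleBits;
        out[i] = static_cast<uint8_t>(std::max<int32_t>(0, std::min<int32_t>(127, v)));
    }
    return out;
}
```

For example, an input of 64 multiplied by a weight of 64 accumulates to 4096, and the shift by 6 brings it back to 64, so the weight behaves like a factor of 1.0.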
Re: Neural network quantization
The most common way of doing NN weight quantization is, AFAIK, bog-simple linear quantization. Since the weights are for the linear step, it's fine to have a 1e-5 weight go to 0, because it's effectively 0 anyway compared to that +64 or -57 weight. You store the scale factors per layer (sometimes also per channel); just min/max is usually fine. No need for fancy VQ or nonlinear quantizers.
More advanced techniques exist, like cutting off outliers from the min/max, an extra pass of training (fine-tuning) after quantization to shake out some more accuracy, and various forms of quantization-aware training.
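A minimal sketch of the linear scheme described above, with one scale factor per layer. It assumes a symmetric variant (scale derived from the largest magnitude rather than separate min and max), and the optional `clip` parameter is a crude stand-in for the outlier cut-off; the struct and function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

struct QuantizedLayer {
    std::vector<int8_t> q;  // quantized weights
    float scale;            // real value is approximately q * scale
};

// Symmetric linear quantization with a single scale per layer.
// clip > 0 caps the magnitude used for the scale (outlier cut-off);
// clip == 0 uses the plain maximum of |w|.
inline QuantizedLayer quantize_linear(const std::vector<float>& w, float clip = 0.0f) {
    assert(!w.empty());
    float amax = 0.0f;
    for (float v : w) amax = std::max(amax, std::fabs(v));
    if (clip > 0.0f) amax = std::min(amax, clip);
    if (amax == 0.0f) amax = 1.0f;  // all-zero layer: any scale works
    QuantizedLayer out{std::vector<int8_t>(w.size()), amax / 127.0f};
    for (size_t i = 0; i < w.size(); ++i) {
        float r = std::round(w[i] / out.scale);
        out.q[i] = static_cast<int8_t>(std::max(-127.0f, std::min(127.0f, r)));
    }
    return out;
}

inline float dequantize(const QuantizedLayer& l, size_t i) { return l.q[i] * l.scale; }
```

With weights such as {64, -57, 1e-5, 0.5} the scale is 64/127, the 1e-5 weight becomes 0 and 0.5 becomes a single step. Passing a `clip` value trades saturation of the largest weights for finer resolution on the rest.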