Neural network quantization
Moderators: Harvey Williamson, Dann Corbit, hgm

 Posts: 157
 Joined: Fri Apr 11, 2014 8:45 am
 Full name: Fabio Gobbato
 Contact:
Neural network quantization
I have a neural network with floating point weights and I want them to fit in int8_t. The weights go from +64 to -57 and some are very small, like 1e-5.
How can I translate all these values to int8_t without losing accuracy? Is there a way to train the net so that the weights translate easily to int8_t?
I have seen that Stockfish uses a >>6, so if I have understood correctly all the weights are between -2.0 and 1.984375 with 1/64 precision. How can all the weights be in that range?
Re: Neural network quantization
just some random thoughts:
- "vector quantization": per layer, or per some fixed block size of weights, you compute a "palette" of 256 values that minimizes the error (k-means, but you'll need a good initial seed as it only finds a local optimum); of course you need way more than 256 weights in such a block to save space
- try a nonlinear mapping: say, reserve 1 bit for sign and quantize the magnitude like pow(a, b), where a is abs(weight) and b is some constant you have to optimize => not sure if this applies to NN weights, but it might. I had great success with this elsewhere.
For this to work, you should probably pre-normalize the values, e.g. divide by max(a) over all weights. In the end, you end up with 8-bit quantized values per block + 2 float (or better, half-float) values: b and the denormalization scale.
In both cases decoding is straightforward: simply build a 256-entry LUT of floats and use the bytes to index it.
Martin Sedlak
Re: Neural network quantization
here's pseudocode of a simple nonlinear pow-based quantization (actually it's what I quickly prototyped in my scripting language)
it produces the following output (qval is the quantized values, scale and power are the reconstruction params, and output is the reconstructed input after dequantization)
of course in the real world you'd have a much larger input, but this basically outlines the simple idea...
Code:
best_err=0.0255281 power=2.2
qval=[9]{115, 90, 14, 145, 205, 0, 0, 65, 255} scale=66 power=2.2
input: [9]{53, 31, 0.5, -0.75, -22, 0, 1e-05, 15, -66}
output: [9]{53.0531, 30.9392, 0.516009, -0.790976, -21.9511, 0, 0, 15.1212, -66}
Code:
constexpr float FLT_MAX = 1e+30f;
//
void build_dequant_table(float[] table, float scale, float power)
{
for (int i : table.size/2)
table[i] = pow(cast float i / (table.size/2 - 1), power) * scale;
for (int i = table.size/2 : table.size)
table[i] = -table[i - table.size/2];
}
//
int quantize_value(float value, const float[] dequant)
{
int res = 0;
float besterr = FLT_MAX;
// note: naive and slow
for (int i : dequant.size)
{
auto delta = dequant[i] - value;
auto err = delta*delta;
if (err < besterr)
{
besterr = err;
res = i;
}
}
return res;
}
//
array<byte> quantize(const float[] block, float &scale, float &power)
{
array<byte> res;
res.resize(block.size);
// dequantization table
float dequant[256];
float amax = 0.0f;
for (int i : block.size)
amax = max(amax, abs(block[i]));
// avoid div by zero
if (!amax)
amax = 1.0f;
scale = amax;
power = 1.0f;
float besterr = FLT_MAX;
array<byte> tmp;
tmp.resize(block.size);
// range 0.1 to 4.0 seems reasonable, YMMV
for (float p = 0.1f; p<=4.0f; p += 0.1f)
{
// build dequantization table
build_dequant_table(dequant, scale, p);
float sqerr = 0.0f;
for (int i : block.size)
{
int b = quantize_value(block[i], dequant);
tmp[i] = cast byte b;
auto delta = dequant[b] - block[i];
sqerr += delta*delta;
}
if (sqerr < besterr)
{
besterr = sqerr;
power = p;
res = tmp;
}
}
"best_err=%t power=%t\n", besterr, power;
return res;
}
//
void main()
{
const float input[] = {
53, 31, 0.5, -0.75, -22, 0, 1e-5, 15, -66
};
float scale, power;
auto qval = quantize(input, scale, power);
// quantized values, scale and power => all that's needed to decode
// quantized data
"qval=%t scale=%t power=%t\n", qval, scale, power;
float output[input.size];
float dequant[256];
build_dequant_table(dequant, scale, power);
// decoding is straightforward and fast
for (auto i : output.size)
output[i] = dequant[qval[i]];
"input: %t\n", input;
"output: %t\n", output;
}
Martin Sedlak

 Posts: 157
 Joined: Fri Apr 11, 2014 8:45 am
 Full name: Fabio Gobbato
 Contact:
Re: Neural network quantization
But if I have understood correctly, this way you get an index into an array, whereas the weights of the neural net are multiplied with the input and added together.
If I have understood correctly, Stockfish uses fixed point arithmetic, because the output is shifted by 6 places, which is like a division by 64. If that's so, all the weights are in a predefined range.
Re: Neural network quantization
Fabio Gobbato wrote: ↑Tue Sep 08, 2020 8:31 am
I have a neural network with floating point weights and I want them to fit in int8_t.
I thought you wanted to reduce the size of the net while minimizing the overall error, so I probably misunderstood what you want.
if they shift by 6 then they use FP2:6; they still need a sign bit though, so the real range should be ~-1.9999 to 1.9999. I've no idea what magnitude restrictions there are on the weights.
however, the NNUE data bundled with SF seems to actually be 16-bit, so now I'm puzzled
Martin Sedlak
Re: Neural network quantization
either way, if you load the quantized weights as 8-bit, you can unpack them as floats or whatever and do what you want.
the quantization I propose (8-bit) should be superior to plain linear quantization anyway, so I still don't see any problem
Martin Sedlak

 Posts: 157
 Joined: Fri Apr 11, 2014 8:45 am
 Full name: Fabio Gobbato
 Contact:
Re: Neural network quantization
But then the net calculations would all be on floating point values, while Stockfish uses only integer arithmetic. It's an advantage to use only ints because the net calculation is faster. But I'm missing something, because while training the net the weights differ a lot, and int8_t gives low accuracy. But maybe it uses some tricks in the training.
Re: Neural network quantization
Fabio Gobbato wrote: ↑Tue Sep 08, 2020 8:31 am
I have a neural network with floating point weights and I want them to fit in int8_t. The weights go from +64 to -57 and some are very small, like 1e-5.
How can I translate all these values to int8_t without losing accuracy? Is there a way to train the net so that the weights translate easily to int8_t?
I have seen that Stockfish uses a >>6, so if I have understood correctly all the weights are between -2.0 and 1.984375 with 1/64 precision. How can all the weights be in that range?
Maybe they just shift out the numbers after the decimal point, thus flooring the number.

 Posts: 2220
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Neural network quantization
Fabio Gobbato wrote: ↑Tue Sep 08, 2020 7:24 pm
But then the net calculations would all be on floating point values, while Stockfish uses only integer arithmetic. It's an advantage to use only ints because the net calculation is faster. But I'm missing something, because while training the net the weights differ a lot, and int8_t gives low accuracy. But maybe it uses some tricks in the training.
But the vast majority of weights (256x41024) are int16 (whose outputs are incrementally updated in make move); only the remaining layers of 2x256x32 ReLU, 32x32 ReLU and 32x1 use 8-bit weights, implemented in the AffineTransform and ClippedReLU classes.
See the post from the Don't understand NNUE thread.
Considering AVX2 in AffineTransform only, they use _mm256_maddubs_epi16 (vpmaddubsw) for intermediate saturated 16-bit results as the sum of two 8-bit multiplications, and _mm256_madd_epi16 (vpmaddwd) as a mul with 1 plus horizontal add of consecutive 16-bit to 32-bit ints, accumulated via _mm256_add_epi32 in sum. After a further horizontal add and some shuffling, the 32-bit sums are written as 32-bit ints to the output. In ClippedReLU, the 32-bit results are packed to 16-bit integers using signed saturation (_mm256_packs_epi32 aka vpackssdw), and arithmetically shifted right by kWeightScaleBits = 6. Finally, the 16-bit words0 and words1 are packed to 8-bit integers using signed saturation (_mm256_packs_epi16), before _mm256_max_epi8 implements the 8-bit ReLU.
But I have not studied the backprop training code yet, whether they use intermediate floating point weights and finally map to integers, or do all the math with integers (nope). Double and half float so far:
https://github.com/nodchip/Stockfish/bl ... earner.cpp
https://github.com/nodchip/Stockfish/bl ... lf_float.h
Re: Neural network quantization
The most common way of doing NN weight quantization is AFAIK bog-simple linear quantization. Since the weights are for the linear step, it's fine to have a 1e-5 weight go to 0, because it's effectively 0 anyway compared to that +64 or -57 weight. You store the scale factors per layer (sometimes also per channel); just min/max is usually fine. No need for fancy VQ or nonlinear quantizers.
More advanced techniques exist, like cutting off outliers from the min/max, an extra pass of training (fine-tuning) after quantization to shake out some more accuracy, and various forms of quantization-aware training.