TensorRT Int8 results on volta

Daniel Shawul · Post by **Daniel Shawul** » Tue Feb 05, 2019 10:18 pm

I have finally finished support for INT8 on GPUs that support it. The older GPUs (10xx series) actually support int8 but not half precision (fp16).
To use int8, an additional calibration step is required to minimize the loss of information, measured by the Kullback-Leibler divergence between the fp32 and int8 models. I have now added the calibration step and measured performance on a 12x128 net, using 10 batches of 1024 positions for calibration. Here are the results on Volta, which supports both fp16 and int8 besides fp32:

TensorFlow pb : 24849 nodes/s
TensorRT FP32 : 20790 nodes/s
TensorRT FP16 : 53405 nodes/s
TensorRT INT8 : 44355 nodes/s
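
For readers unfamiliar with the calibration step mentioned above, here is a rough sketch of the general entropy-calibration idea: histogram the fp32 activations, try several clipping thresholds, and keep the one whose 128-level approximation has the smallest Kullback-Leibler divergence from the clipped reference. This is a simplified illustration and not TensorRT's internal code or the code used for these measurements. The names `NUM_BINS`, `candidate_kl` and `calibrate_threshold`, and the synthetic Gaussian activations, are all assumptions made for the example.

```python
import math
import random

NUM_BINS = 2048    # fine-grained histogram of |activation| (assumed size)
NUM_LEVELS = 128   # non-negative int8 magnitudes 0..127


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two unnormalized histograms of equal length."""
    p_sum, q_sum = float(sum(p)), float(sum(q))
    if q_sum == 0.0:
        return math.inf
    kl = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            pn = pi / p_sum
            qn = max(qi / q_sum, eps)  # floor avoids log of zero
            kl += pn * math.log(pn / qn)
    return kl


def build_histogram(values, max_abs):
    """Count |v| into NUM_BINS equal-width bins covering [0, max_abs]."""
    hist = [0] * NUM_BINS
    for v in values:
        idx = min(int(abs(v) / max_abs * NUM_BINS), NUM_BINS - 1)
        hist[idx] += 1
    return hist


def candidate_kl(hist, i):
    """KL between the reference clipped at bin i and its 128-level version."""
    sliced = hist[:i]
    p = list(sliced)
    p[-1] += sum(hist[i:])        # clipped outliers saturate into the last bin
    merged = i // NUM_LEVELS      # fine bins that share one int8 level
    q = [0.0] * i
    for level in range(NUM_LEVELS):
        start, stop = level * merged, (level + 1) * merged
        total = sum(sliced[start:stop])
        nonzero = [k for k in range(start, stop) if p[k] > 0]
        for k in nonzero:
            q[k] = total / len(nonzero)  # spread the level's mass evenly
    return kl_divergence(p, q)


def calibrate_threshold(values):
    """Return the clipping threshold with the smallest KL divergence."""
    max_abs = max(abs(v) for v in values)
    hist = build_histogram(values, max_abs)
    candidates = range(NUM_LEVELS, NUM_BINS + 1, NUM_LEVELS)
    best_i = min(candidates, key=lambda i: candidate_kl(hist, i))
    return best_i * (max_abs / NUM_BINS)


random.seed(0)
# Synthetic stand-in for one layer's fp32 activations, plus one outlier.
activations = [random.gauss(0.0, 1.0) for _ in range(10000)] + [20.0]
threshold = calibrate_threshold(activations)
scale = threshold / 127.0
print(f"clip threshold = {threshold:.3f}, int8 scale = {scale:.5f}")
```

In a real calibration run the activations would come from feeding the calibration batches through the fp32 network, one histogram per tensor, rather than from a random generator.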

Though int8 is about twice as fast as FP32, it is slower than FP16 on Volta. I don't know why that is, but on standard image recognition samples INT8 was slightly faster than FP16, so I need to investigate why mine is slower.
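
To put numbers on "about twice", the ratios can be worked out directly from the four throughput figures above. The snippet below is only that arithmetic, with TensorRT FP32 taken as the baseline by assumption.

```python
# Throughput figures copied from the results above (nodes/s).
throughput = {
    "TensorFlow pb": 24849,
    "TensorRT FP32": 20790,
    "TensorRT FP16": 53405,
    "TensorRT INT8": 44355,
}

baseline = throughput["TensorRT FP32"]
speedup = {name: nps / baseline for name, nps in throughput.items()}

for name, ratio in speedup.items():
    print(f"{name:<14} {throughput[name]:>6} nodes/s  {ratio:.2f}x vs TensorRT FP32")

int8_vs_fp16 = throughput["TensorRT INT8"] / throughput["TensorRT FP16"]
print(f"INT8 reaches {int8_vs_fp16:.0%} of the FP16 throughput")
```

So INT8 is roughly 2.13x and FP16 roughly 2.57x the TensorRT FP32 rate in this measurement.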

Comparing the policy values of the FP16 and INT8 models, there does not seem to be much loss of information.

FP16

# Move   Value=(V,P,V+P)   Policy  Visits                  PV
#----------------------------------------------------------------------------------
#  1   (0.523,0.512,0.521)  26.45   62233   e2-e4 c7-c5 Ng1-f3 e7-e6 d2-d4 c5xd4 Nf3xd4 a7-a6 Bf1-d3 Qd8-c7 Ke1-g1 Ng8-f6 Qd1-e2 d7-d6 c2-c4 Nb8-d7 Nb1-c3 Bf8-e7 f2-f4 Ke8-g8
#  2   (0.530,0.530,0.530)  22.62  425256   d2-d4 d7-d5 Ng1-f3 c7-c5 c2-c4 c5xd4 Qd1xd4 Ng8-f6 c4xd5 Qd8xd5 Nb1-c3 Qd5xd4 Nf3xd4 a7-a6 e2-e4 e7-e5 Nd4-c2 Bf8-c5 Bc1-e3 Bc5xe3 Nc2xe3 Nb8-c6 Ne3-d5 Nf6xd5 Nc3xd5
#  3   (0.519,0.530,0.522)  16.61   47308   Ng1-f3 d7-d5 d2-d4 c7-c5 c2-c4 c5xd4 Qd1xd4 Ng8-f6 c4xd5 Qd8xd5 Nb1-c3 Qd5xd4 Nf3xd4 a7-a6 e2-e4 e7-e5 Nd4-c2 Bf8-c5 Bc1-e3 Bc5xe3 Nc2xe3
#  4   (0.520,0.494,0.515)  11.42   18343   c2-c4 e7-e5 Nb1-c3 Ng8-f6 Ng1-f3 Nb8-c6 g2-g3 d7-d5 c4xd5 Nf6xd5 Bf1-g2 Nd5-b6 Ke1-g1 Bf8-e7 d2-d3 Ke8-g8 a2-a3 Bc8-e6 b2-b4
#  5   (0.509,0.474,0.502)   5.85    6069   g2-g3 d7-d5 Bf1-g2 e7-e5 d2-d3 Nb8-c6 Ng1-f3 Ng8-f6 Ke1-g1 Bf8-e7 e2-e4 Ke8-g8 Nb1-c3 d5xe4 d3xe4 Qd8xd1
#  6   (0.482,0.446,0.475)   4.18    2448   f2-f4 d7-d5 Ng1-f3 Bc8-g4 e2-e3 Nb8-d7 Bf1-e2 Ng8-f6 Ke1-g1 e7-e6 b2-b3 Bf8-d6 Bc1-b2
#  7   (0.508,0.509,0.508)   2.30    2839   Nb1-c3 c7-c5 e2-e4 Nb8-c6 g2-g3 g7-g6 Bf1-g2 Bf8-g7 d2-d3 d7-d6 f2-f4 e7-e6 Ng1-f3 Ng8-e7
#  8   (0.486,0.456,0.480)   1.56    1001   b2-b3 e7-e5 Bc1-b2 Nb8-c6 e2-e3 Ng8-f6 Bf1-b5 Bf8-d6 Nb1-a3 Ke8-g8 Na3-c4 Rf8-e8 Nc4xd6 c7xd6 Ng1-e2
#  9   (0.487,0.430,0.476)   1.51     902   b2-b4 e7-e5 Bc1-b2 Bf8xb4 Bb2xe5 Ng8-f6 Ng1-f3 Ke8-g8 e2-e3 Nb8-c6 Be5-b2 d7-d5 Bf1-e2
# 10   (0.476,0.466,0.474)   1.49     868   c2-c3 d7-d5 d2-d4 Ng8-f6 Bc1-f4 c7-c5 e2-e3 Nb8-c6 Ng1-f3 Bc8-g4 h2-h3 Bg4xf3
# 11   (0.482,0.504,0.486)   1.34     966   d2-d3 d7-d5 e2-e4 d5xe4 d3xe4 Qd8xd1 Ke1xd1 Ng8-f6 Bf1-d3 Nb8-c6 Nb1-c3 e7-e5 Bc1-g5
# 12   (0.481,0.530,0.491)   1.25     990   e2-e3 Ng8-f6 Ng1-f3 g7-g6 c2-c4 Bf8-g7 Nb1-c3 Ke8-g8 Bf1-e2 d7-d6 Ke1-g1 e7-e5
# 13   (0.472,0.416,0.461)   0.55     265   Nb1-a3 e7-e5 c2-c4 Ng8-f6 Na3-c2 d7-d5 c4xd5 Nf6xd5 Ng1-f3 Nb8-c6 d2-d3 Bf8-e7
# 14   (0.470,0.506,0.477)   0.54     328   a2-a3 d7-d5 Ng1-f3 c7-c5 g2-g3 Nb8-c6 d2-d4 c5xd4 Nf3xd4 e7-e5
# 15   (0.469,0.453,0.466)   0.51     263   f2-f3 e7-e5 c2-c4 Ng8-f6 Nb1-c3 d7-d5 c4xd5 Nf6xd5 e2-e4 Nd5-b4 d2-d3 Nb8-c6
# 16   (0.470,0.415,0.459)   0.44     206   Ng1-h3 d7-d5 d2-d4 c7-c5 c2-c3 Nb8-c6 Nh3-f4 c5xd4 c3xd4
# 17   (0.462,0.471,0.464)   0.43     214   h2-h3 e7-e5 c2-c4 Ng8-f6 e2-e3 d7-d5 c4xd5 Nf6xd5 a2-a3
# 18   (0.456,0.437,0.452)   0.34     147   a2-a4 e7-e5 e2-e4 Ng8-f6 Nb1-c3 Bf8-b4 Ng1-f3 Ke8-g8 Nf3xe5 d7-d5
# 19   (0.452,0.383,0.439)   0.25      93   g2-g4 d7-d5 Bf1-g2 Bc8xg4 c2-c4 c7-c6 c4xd5 c6xd5 Qd1-b3 e7-e6
# 20   (0.460,0.437,0.456)   0.25     113   h2-h4 d7-d5 Ng1-f3 c7-c5 g2-g3 Nb8-c6 Bf1-g2 e7-e5

# nodes = 14581824 <34% qnodes> time = 10689ms nps = 1364189 eps = 753686 nneps = 50488
# Tree: nodes = 17740947 depth = 28 pps = 53405 visits = 570853 
#       qsearch_calls = 387 search_calls = 0

INT8

# Move   Value=(V,P,V+P)   Policy  Visits                  PV
#----------------------------------------------------------------------------------
#  1   (0.507,0.526,0.511)  25.07  216737   e2-e4 d7-d6 d2-d4 Ng8-f6 Nb1-c3 g7-g6 Bc1-g5 Bf8-g7 Qd1-d2 h7-h6 Bg5xf6 Bg7xf6 f2-f4 Nb8-c6 Ng1-f3
#  2   (0.494,0.526,0.500)  21.00   90126   d2-d4 Ng8-f6 Ng1-f3 g7-g6 c2-c4 Bf8-g7 Nb1-c3 d7-d6 e2-e4 e7-e5 d4xe5 d6xe5 Qd1xd8 Ke8xd8 Nf3xe5
#  3   (0.495,0.526,0.501)  16.75   83362   Ng1-f3 Ng8-f6 c2-c4 e7-e6 g2-g3 d7-d5 b2-b3 Bf8-e7 Bf1-g2 Ke8-g8 Bc1-b2 Nb8-d7 d2-d3 c7-c6 Nb1-d2 b7-b6 Qd1-c2
#  4   (0.497,0.491,0.496)  13.04   34174   c2-c4 Ng8-f6 Nb1-c3 g7-g6 g2-g3 Bf8-g7 Bf1-g2 d7-d6 d2-d4 Nb8-d7 Ng1-f3 Ke8-g8 Ke1-g1 e7-e5
#  5   (0.490,0.458,0.483)   6.13    8009   g2-g3 e7-e5 Bf1-g2 d7-d5 d2-d3 Nb8-c6 Ng1-f3 Ng8-f6 c2-c3 Bf8-e7 b2-b4 Ke8-g8 b4-b5 e5-e4 Nf3-d4 Nc6xd4
#  6   (0.500,0.458,0.491)   4.13    8010   f2-f4 d7-d5 Ng1-f3 Bc8-g4 Nf3-e5 Bg4-f5 e2-e3 Nb8-d7 Ne5xd7 Qd8xd7 Bf1-e2 Ng8-f6
#  7   (0.491,0.440,0.481)   2.01    2283   b2-b3 e7-e5 e2-e3 d7-d5 Bc1-b2 Nb8-d7 g2-g3 Ng8-f6 Bf1-g2 Bf8-d6 Ng1-e2
#  8   (0.494,0.526,0.500)   1.99    8209   Nb1-c3 e7-e5 Ng1-f3 Nb8-c6 g2-g3 d7-d5 d2-d3 Ng8-f6 Bf1-g2 Bf8-e7 Bc1-g5 d5-d4 Bg5xf6 Be7xf6
#  9   (0.498,0.447,0.488)   1.94    2988   c2-c3 Ng8-f6 d2-d4 g7-g6 Bc1-g5 Bf8-g7 Nb1-d2 d7-d6 Ng1-f3 Nb8-d7 e2-e4 h7-h6 Bg5-h4
# 10   (0.498,0.496,0.498)   1.53    4701   d2-d3 d7-d5 Ng1-f3 Bc8-g4 Nb1-d2 Nb8-c6 e2-e4 d5xe4 d3xe4 e7-e6 Bf1-e2
# 11   (0.499,0.430,0.485)   1.42    1916   b2-b4 e7-e5 a2-a3 d7-d5 e2-e3 Bf8-d6 Bc1-b2 Ng8-f6 c2-c4 c7-c6 Ng1-f3 e5-e4
# 12   (0.495,0.526,0.501)   1.36    8027   e2-e3 e7-e5 c2-c4 Ng8-f6 Nb1-c3 Bf8-b4 Ng1-e2 Bb4xc3 Ne2xc3 Ke8-g8 Bf1-e2
# 13   (0.493,0.500,0.494)   0.65    1460   a2-a3 e7-e5 e2-e4 Ng8-f6 Nb1-c3 Nb8-c6 Ng1-f3 Bf8-c5 Bf1-e2 d7-d6 d2-d3 a7-a6 Bc1-g5 h7-h6
# 14   (0.490,0.420,0.476)   0.52     493   Nb1-a3 e7-e5 c2-c4 Ng8-f6 Na3-c2 d7-d5 c4xd5 Nf6xd5 Ng1-f3 Nb8-c6 d2-d3
# 15   (0.498,0.491,0.497)   0.49    1355   h2-h3 e7-e5 e2-e4 Ng8-f6 Nb1-c3 Nb8-c6 Ng1-f3 d7-d5 e4xd5 Nf6xd5 Bf1-b5 Nd5xc3 b2xc3
# 16   (0.487,0.436,0.477)   0.46     445   f2-f3 d7-d5 d2-d4 c7-c5 c2-c3 Nb8-c6 d4xc5 e7-e5 b2-b4
# 17   (0.493,0.415,0.477)   0.44     433   Ng1-h3 e7-e5 e2-e3 d7-d5 d2-d4 Nb8-c6 d4xe5 Nc6xe5 Nb1-d2 Ng8-f6 Nh3-f4 Bf8-d6
# 18   (0.492,0.437,0.481)   0.38     426   a2-a4 e7-e5 e2-e4 Ng8-f6 Nb1-c3 Nb8-c6 Ng1-f3 d7-d5 e4xd5 Nf6xd5 Bf1-b5 Nd5xc3
# 19   (0.495,0.383,0.473)   0.31     269   g2-g4 e7-e5 Bf1-g2 Ng8-f6 g4-g5 Nf6-h5 d2-d3 d7-d5 Ng1-f3 Nb8-c6
# 20   (0.461,0.437,0.456)   0.28     163   h2-h4 e7-e5 c2-c4 Ng8-f6 Nb1-c3 d7-d5 c4xd5 Nf6xd5 Ng1-f3

# nodes = 12742111 <40% qnodes> time = 10677ms nps = 1193416 eps = 793153 nneps = 41199
# Tree: nodes = 13932845 depth = 21 pps = 44355 visits = 473587 
#       qsearch_calls = 396 search_calls = 0
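
As a quick check on the "not much loss" impression, the Policy column of the two outputs above can be compared move by move. The sketch below does this for the six highest-policy root moves only, with the numbers copied from the tables. It assumes the Policy column is the network prior in percent, and it says nothing about the value head or about playing strength.

```python
# Policy column for the six highest-policy root moves, copied from the tables above.
fp16_policy = {"e2-e4": 26.45, "d2-d4": 22.62, "Ng1-f3": 16.61,
               "c2-c4": 11.42, "g2-g3": 5.85, "f2-f4": 4.18}
int8_policy = {"e2-e4": 25.07, "d2-d4": 21.00, "Ng1-f3": 16.75,
               "c2-c4": 13.04, "g2-g3": 6.13, "f2-f4": 4.13}

# Signed difference INT8 minus FP16 for each move.
diffs = {move: int8_policy[move] - fp16_policy[move] for move in fp16_policy}
max_abs_diff = max(abs(d) for d in diffs.values())
mean_abs_diff = sum(abs(d) for d in diffs.values()) / len(diffs)

for move, d in diffs.items():
    print(f"{move:<7} fp16={fp16_policy[move]:6.2f} int8={int8_policy[move]:6.2f} diff={d:+.2f}")
print(f"max |diff| = {max_abs_diff:.2f}, mean |diff| = {mean_abs_diff:.2f}")
```

The ordering of these six moves by policy is the same in both outputs, and the largest shift is about 1.6 points.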

The good thing about supporting int8 is that on older GPUs that support int8 but not fp16, there is now a way to get that 2x speedup.

Daniel

grahamj · Post by **grahamj** » Wed Feb 06, 2019 8:40 am

Interesting. The LC0 developers tried an experiment like that, but the result in terms of Elo was disappointing. There was a suggestion that if the training games were generated using INT8 it would be better, but that wouldn't apply to supervised learning. The interactions between tensor cores, the Winograd transform, INT8, fp16, and NVIDIA's libraries are probably very complicated.

Greyguy · Post by **Greyguy** » Sat Feb 16, 2019 1:43 pm

More about TensorCore:
http://blog.gpueater.com/en/2018/03/20/ ... nchmark_2/

And without TensorCore:
http://blog.gpueater.com/en/2018/03/07/ ... nchmark_1/

Daniel Shawul · Post by **Daniel Shawul** » Sat Feb 16, 2019 3:32 pm

I had misunderstood: Volta tensor cores cannot do int8 calculations, which is why int8 nps did not go above fp16 performance. The int8 calculation was being done with the older DP4A instruction instead.
However, Turing tensor cores (RTX 2080 Ti) can do int8 as well as int4.
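
For reference, DP4A is a four-way int8 dot product with a 32-bit accumulate: two 32-bit words each hold four signed 8-bit values, and the instruction returns the accumulator plus the sum of the four products. The Python below only emulates that arithmetic to make the operation concrete. It is not GPU code, the helper names are made up for the example, and it does not model 32-bit wraparound of the accumulator.

```python
import struct


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one signed 32-bit word."""
    return struct.unpack("<i", struct.pack("<4b", *values))[0]


def unpack_int8x4(word):
    """Split a signed 32-bit word back into four signed 8-bit integers."""
    return struct.unpack("<4b", struct.pack("<i", word))


def dp4a(a_word, b_word, acc):
    """Emulate DP4A: acc + sum of the four int8 products."""
    a = unpack_int8x4(a_word)
    b = unpack_int8x4(b_word)
    return acc + sum(x * y for x, y in zip(a, b))


a = pack_int8x4([1, -2, 3, -4])
b = pack_int8x4([5, 6, -7, 8])
print(dp4a(a, b, 10))  # 10 + (5 - 12 - 21 - 32) = -50
```

On the GPU each call like this is a single instruction per thread (exposed in CUDA as the `__dp4a` intrinsic), whereas a tensor core processes a small matrix tile per operation.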

smatovic · Post by **smatovic** » Fri Feb 22, 2019 5:55 pm

Just a guess, but it was previously stated that the 3x3 convolutions of LC0 are not able to fully utilize Turing's Tensor Cores, and HGM noted that chess (compared to Go) could profit from 4x4 convolutions.

So maybe a switch to 4x4 convolutions could be a win-win?

--
Srdja
