# TensorFlow Lite 8-bit quantization specification

The following document outlines the specification for TensorFlow Lite's 8-bit
quantization scheme. This is intended to assist hardware developers in providing
hardware support for inference with quantized TensorFlow Lite models.

## Specification summary

We are providing a specification, and we can only provide some guarantees on
behaviour if the spec is followed. We also understand different hardware may
have preferences and restrictions that cause slight deviations when
implementing the spec, resulting in implementations that are not bit-exact.
While that may be acceptable in most cases (and we will provide a suite of
tests that, to the best of our knowledge, include per-operation tolerances
gathered from several models), the nature of machine learning (and deep
learning in the most common case) makes it impossible to provide any hard
guarantees.

8-bit quantization approximates floating point values using the following
formula:

$$real\_value = (int8\_value - zero\_point) \times scale$$

Per-axis (aka per-channel in Conv ops) or per-tensor weights are represented by
`int8` two’s complement values in the range `[-127, 127]` with zero-point equal
to 0. Per-tensor activations/inputs are represented by `int8` two’s complement
values in the range `[-128, 127]`, with a zero-point in the range `[-128, 127]`.

There are other exceptions for particular operations that are documented below.

Note: In the past our quantization tooling used per-tensor, asymmetric, `uint8`
quantization. New tooling, reference kernels, and optimized kernels for 8-bit
quantization use this spec.

## Signed integer vs unsigned integer

TensorFlow Lite quantization primarily prioritizes tooling and kernels for
`int8` quantization for 8-bit. This is for the convenience of symmetric
quantization being represented by zero-point equal to 0.
Additionally, many
backends have additional optimizations for `int8xint8` accumulation.

## Per-axis vs per-tensor

Per-tensor quantization means that there is one scale and/or zero-point per
entire tensor. Per-axis quantization means that there is one scale and/or
`zero_point` per slice in the `quantized_dimension`. The quantized dimension
specifies the dimension of the tensor's shape that the scales and zero-points
correspond to. For example, a tensor `t`, with `dims=[4, 3, 2, 1]` and
quantization params `scale=[1.0, 2.0, 3.0]`, `zero_point=[1, 2, 3]`,
`quantization_dimension=1` will be quantized across the second dimension of `t`:

    t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
    t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
    t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3

Often, the `quantized_dimension` is the `output_channel` of the weights of
convolutions, but in theory it can be the dimension that corresponds to each
dot-product in the kernel implementation, allowing more quantization granularity
without performance implications. This yields large accuracy improvements.

TFLite has per-axis support for a growing number of operations. At the time of
this document, support exists for Conv2d and DepthwiseConv2d.

## Symmetric vs asymmetric

Activations are asymmetric: they can have their zero-point anywhere within the
signed `int8` range `[-128, 127]`. Many activations are asymmetric in nature and
a zero-point is a relatively inexpensive way to effectively get up to an extra
binary bit of precision. Since activations are only multiplied by constant
weights, the constant zero-point value can be optimized heavily.

Weights are symmetric: forced to have zero-point equal to 0. Weight values are
multiplied by dynamic input and activation values.
This means that there is an
unavoidable runtime cost of multiplying the zero-point of the weight with the
activation value. By enforcing that the zero-point is 0 we can avoid this cost.

Explanation of the math: this is similar to section 2.3 in
[arXiv:1712.05877](https://arxiv.org/abs/1712.05877), except for the difference
that we allow the scale values to be per-axis. This generalizes readily, as
follows:

$A$ is a $m \times n$ matrix of quantized activations. <br />
$B$ is a $n \times p$ matrix of quantized weights. <br />
Consider multiplying the $j$th row of $A$, $a_j$, by the $k$th column of
$B$, $b_k$, both of length $n$. The quantized integer values and
zero-point values are $q_a$, $z_a$ and $q_b$, $z_b$ respectively.

$$a_j \cdot b_k = \sum_{i=0}^{n} a_{j}^{(i)} b_{k}^{(i)} =
\sum_{i=0}^{n} (q_{a}^{(i)} - z_a) (q_{b}^{(i)} - z_b) =
\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)} - \sum_{i=0}^{n} q_{a}^{(i)} z_b -
\sum_{i=0}^{n} q_{b}^{(i)} z_a + \sum_{i=0}^{n} z_a z_b$$

<!-- Don't change these `\\(` `\\)` to `$`. mathjax fails here with `$`-->

The \\(\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)}\\) term is unavoidable since it’s
performing the dot product of the input value and the weight value.

The \\(\sum_{i=0}^{n} q_{b}^{(i)} z_a\\) and \\(\sum_{i=0}^{n} z_a z_b\\) terms are
made up of constants that remain the same per inference invocation, and thus can
be pre-calculated.

The \\(\sum_{i=0}^{n} q_{a}^{(i)} z_b\\) term needs to be computed every inference
since the activation changes every inference. By enforcing weights to be
symmetric we can remove the cost of this term.
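As a quick sanity check on the expansion above, here is a NumPy sketch with illustrative values (not a real kernel implementation) that separates the unavoidable dot-product term, the precomputable constants, and the per-inference correction that vanishes when weights are symmetric:

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
q_a = rng.integers(-128, 128, size=n).astype(np.int32)  # quantized activations
q_b = rng.integers(-127, 128, size=n).astype(np.int32)  # quantized weights
z_a, z_b = 5, 3                                         # illustrative zero-points

# Direct evaluation of the dot product of dequantized integer values.
direct = np.sum((q_a - z_a) * (q_b - z_b))

# Expanded form, matching the four terms in the equation above.
core        = np.sum(q_a * q_b)                    # unavoidable dot product
const_terms = -z_a * np.sum(q_b) + n * z_a * z_b   # constants: precompute offline
runtime     = -z_b * np.sum(q_a)                   # depends on this inference's activations

assert direct == core + const_terms + runtime

# With symmetric weights (z_b = 0) the per-inference correction vanishes:
# direct reduces to core + const_terms.
```

With `z_b = 0` the `runtime` term is identically zero, which is exactly the cost the symmetric-weight constraint removes.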
104 105## int8 quantized operator specifications 106 107Below we describe the quantization requirements for our int8 tflite kernels: 108 109``` 110ADD 111 Input 0: 112 data_type : int8 113 range : [-128, 127] 114 granularity: per-tensor 115 Input 1: 116 data_type : int8 117 range : [-128, 127] 118 granularity: per-tensor 119 Output 0: 120 data_type : int8 121 range : [-128, 127] 122 granularity: per-tensor 123 124AVERAGE_POOL_2D 125 Input 0: 126 data_type : int8 127 range : [-128, 127] 128 granularity: per-tensor 129 Output 0: 130 data_type : int8 131 range : [-128, 127] 132 granularity: per-tensor 133 restriction: Input and outputs must all have same scale/zero_point 134 135CONCATENATION 136 Input ...: 137 data_type : int8 138 range : [-128, 127] 139 granularity: per-tensor 140 Output 0: 141 data_type : int8 142 range : [-128, 127] 143 granularity: per-tensor 144 restriction: Input and outputs must all have same scale/zero_point 145 146CONV_2D 147 Input 0: 148 data_type : int8 149 range : [-128, 127] 150 granularity: per-tensor 151 Input 1 (Weight): 152 data_type : int8 153 range : [-127, 127] 154 granularity: per-axis (dim = 0) 155 restriction: zero_point = 0 156 Input 2 (Bias): 157 data_type : int32 158 range : [int32_min, int32_max] 159 granularity: per-axis 160 restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0) 161 Output 0: 162 data_type : int8 163 range : [-128, 127] 164 granularity: per-tensor 165 166DEPTHWISE_CONV_2D 167 Input 0: 168 data_type : int8 169 range : [-128, 127] 170 granularity: per-tensor 171 Input 1 (Weight): 172 data_type : int8 173 range : [-127, 127] 174 granularity: per-axis (dim = 3) 175 restriction: zero_point = 0 176 Input 2 (Bias): 177 data_type : int32 178 range : [int32_min, int32_max] 179 granularity: per-axis 180 restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0) 181 Output 0: 182 data_type : int8 183 range : [-128, 127] 184 granularity: per-tensor 185 186FULLY_CONNECTED 187 Input 0: 188 
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-tensor
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-tensor
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

L2_NORMALIZATION
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 128.0, 0)

LOGISTIC
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 256.0, -128)

MAX_POOL_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

MUL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

RESHAPE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

RESIZE_BILINEAR
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

SOFTMAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 256.0, -128)

SPACE_TO_DEPTH
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

TANH
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 128.0, 0)

PAD
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

GATHER
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

BATCH_TO_SPACE_ND
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

SPACE_TO_BATCH_ND
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

TRANSPOSE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

MEAN
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SUB
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SQUEEZE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

LOG_SOFTMAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (16.0 / 256.0, 127)

MAXIMUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

ARG_MAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

MINIMUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

LESS
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

PADV2
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

GREATER
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

GREATER_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

LESS_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SLICE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: Input and outputs must all have same scale/zero_point

EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

NOT_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SHAPE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

QUANTIZE (Requantization)
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
```

## References

[arXiv:1712.05877](https://arxiv.org/abs/1712.05877)