# TensorFlow Lite 8-bit quantization specification

The following document outlines the specification for TensorFlow Lite's 8-bit
quantization scheme. This is intended to assist hardware developers in providing
hardware support for inference with quantized TensorFlow Lite models.

## Specification summary

We are providing a specification, and we can only provide some guarantees on
behaviour if the spec is followed. We also understand different hardware may
have preferences and restrictions that cause slight deviations when
implementing the spec, resulting in implementations that are not bit-exact.
While that may be acceptable in most cases (and we will provide a suite of
tests that, to the best of our knowledge, include per-operation tolerances
gathered from several models), the nature of machine learning (and deep
learning in the most common case) makes it impossible to provide any hard
guarantees.

8-bit quantization approximates floating point values using the following
formula:

$$real\_value = (int8\_value - zero\_point) \times scale$$
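
As an illustration, this mapping can be sketched in a few lines of Python. The
helper names below are ours for illustration only, not part of any TFLite API,
and the rounding mode is simplified (NumPy's round-half-to-even) rather than a
statement of what TFLite kernels do:

```python
import numpy as np

def quantize(real_values, scale, zero_point):
    """Map float values to int8 via real = (q - zero_point) * scale."""
    q = np.round(np.asarray(real_values) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(int8_values, scale, zero_point):
    """Recover approximate float values from int8 values."""
    return (int8_values.astype(np.int32) - zero_point) * scale

reals = np.array([-1.0, 0.0, 0.5, 1.0])
q = quantize(reals, scale=0.0078125, zero_point=0)   # scale = 1/128
approx = dequantize(q, scale=0.0078125, zero_point=0)
# 1.0 cannot be represented exactly: 127 * (1/128) = 0.9921875
```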

Per-axis (aka per-channel in Conv ops) or per-tensor weights are represented by
`int8` two’s complement values in the range `[-127, 127]` with zero-point equal
to 0. Per-tensor activations/inputs are represented by `int8` two’s complement
values in the range `[-128, 127]`, with a zero-point in range `[-128, 127]`.

There are other exceptions for particular operations that are documented below.

Note: In the past our quantization tooling used per-tensor, asymmetric, `uint8`
quantization. New tooling, reference kernels, and optimized kernels for 8-bit
quantization will use this spec.

## Signed integer vs unsigned integer

TensorFlow Lite quantization primarily prioritizes tooling and kernels for
`int8` quantization for 8-bit. This is because symmetric quantization is
conveniently represented by a zero-point equal to 0. Additionally, many
backends have extra optimizations for `int8xint8` accumulation.
42
43Per-tensor quantization means that there will be one scale and/or zero-point per
44entire tensor. Per-axis quantization means that there will be one scale and/or
45`zero_point` per slice in the `quantized_dimension`. The quantized dimension
46specifies the dimension of the Tensor's shape that the scales and zero-points
47correspond to. For example, a tensor `t`, with `dims=[4, 3, 2, 1]` with
48quantization params: `scale=[1.0, 2.0, 3.0]`, `zero_point=[1, 2, 3]`,
49`quantization_dimension=1` will be quantized across the second dimension of `t`:
50
51    t[:, 0, :, :] will have scale[0]=1.0, zero_point[0]=1
52    t[:, 1, :, :] will have scale[1]=2.0, zero_point[1]=2
53    t[:, 2, :, :] will have scale[2]=3.0, zero_point[2]=3
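
A possible NumPy sketch of per-axis dequantization for the example above; the
function name is ours and this is not a TFLite API, just an illustration of
broadcasting one `(scale, zero_point)` pair per slice:

```python
import numpy as np

def dequantize_per_axis(q, scale, zero_point, quantized_dimension):
    """Dequantize int8 tensor q with one (scale, zero_point) pair per slice
    along quantized_dimension."""
    scale = np.asarray(scale, dtype=np.float64)
    zero_point = np.asarray(zero_point, dtype=np.int32)
    # Reshape the params so they broadcast along the quantized dimension.
    shape = [1] * q.ndim
    shape[quantized_dimension] = -1
    return (q.astype(np.int32) - zero_point.reshape(shape)) * scale.reshape(shape)

q = np.full([4, 3, 2, 1], 5, dtype=np.int8)
real = dequantize_per_axis(q, scale=[1.0, 2.0, 3.0],
                           zero_point=[1, 2, 3], quantized_dimension=1)
# real[:, 0] = (5 - 1) * 1.0, real[:, 1] = (5 - 2) * 2.0, real[:, 2] = (5 - 3) * 3.0
```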

Often, the `quantized_dimension` is the `output_channel` of the weights of
convolutions, but in theory it can be the dimension that corresponds to each
dot-product in the kernel implementation, allowing more quantization granularity
without performance implications. This can substantially improve accuracy.

TFLite has per-axis support for a growing number of operations. At the time of
writing, support exists for Conv2d and DepthwiseConv2d.

## Symmetric vs asymmetric

Activations are asymmetric: they can have their zero-point anywhere within the
signed `int8` range `[-128, 127]`. Many activations are asymmetric in nature and
a zero-point is a relatively inexpensive way to effectively get up to an extra
binary bit of precision. Since activations are only multiplied by constant
weights, the constant zero-point value can be optimized fairly heavily.

Weights are symmetric: forced to have zero-point equal to 0. Weight values are
multiplied by dynamic input and activation values. This means that there is an
unavoidable runtime cost of multiplying the zero-point of the weight with the
activation value. By enforcing that zero-point is 0 we can avoid this cost.

Explanation of the math: this is similar to section 2.3 in
[arXiv:1712.05877](https://arxiv.org/abs/1712.05877), except for the difference
that we allow the scale values to be per-axis. This generalizes readily, as
follows:

$A$ is an $m \times n$ matrix of quantized activations. <br />
$B$ is an $n \times p$ matrix of quantized weights. <br />
Consider multiplying the $j$th row of $A$, $a_j$, by the $k$th column of
$B$, $b_k$, both of length $n$. The quantized integer values and
zero-point values are $q_a$, $z_a$ and $q_b$, $z_b$ respectively.

$$a_j \cdot b_k = \sum_{i=0}^{n} a_{j}^{(i)} b_{k}^{(i)} =
\sum_{i=0}^{n} (q_{a}^{(i)} - z_a) (q_{b}^{(i)} - z_b) =
\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)} - \sum_{i=0}^{n} q_{a}^{(i)} z_b -
\sum_{i=0}^{n} q_{b}^{(i)} z_a + \sum_{i=0}^{n} z_a z_b$$
<!-- Don't change these `\\(` `\\)` to `$`. mathjax fails here with `$`-->

The \\(\sum_{i=0}^{n} q_{a}^{(i)} q_{b}^{(i)}\\) term is unavoidable since it’s
performing the dot product of the input value and the weight value.

The \\(\sum_{i=0}^{n} q_{b}^{(i)} z_a\\) and \\(\sum_{i=0}^{n} z_a z_b\\) terms
are made up of constants that remain the same per inference invocation, and thus
can be pre-calculated.

The \\(\sum_{i=0}^{n} q_{a}^{(i)} z_b\\) term needs to be computed every
inference since the activation changes every inference. By enforcing weights to
be symmetric we can remove the cost of this term.
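
The decomposition can be checked numerically. The sketch below is our own
illustration (not TFLite code): it compares the direct dot product of the
offset-corrected values against the expanded four-term form:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
q_a = rng.integers(-128, 128, size=n).astype(np.int32)  # quantized activations
q_b = rng.integers(-127, 128, size=n).astype(np.int32)  # quantized weights
z_a, z_b = 3, 5  # zero-points (z_b would be 0 for symmetric weights)

# Direct dot product of the offset-corrected integer values.
direct = np.sum((q_a - z_a) * (q_b - z_b))

# Expanded form: only the sum(q_a) * z_b term depends on the activations,
# so it is the only one that must be recomputed at every inference.
expanded = (np.sum(q_a * q_b) - np.sum(q_a) * z_b
            - np.sum(q_b) * z_a + n * z_a * z_b)
```

Setting `z_b = 0` (symmetric weights) removes the activation-dependent term
entirely, which is the point of the restriction above.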

## int8 quantized operator specifications

Below we describe the quantization requirements for our int8 tflite kernels:

```
ADD
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

AVERAGE_POOL_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

CONCATENATION
  Input ...:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

CONV_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 0)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-axis
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

DEPTHWISE_CONV_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-axis (dim = 3)
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-axis
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

FULLY_CONNECTED
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1 (Weight):
    data_type  : int8
    range      : [-127, 127]
    granularity: per-tensor
    restriction: zero_point = 0
  Input 2 (Bias):
    data_type  : int32
    range      : [int32_min, int32_max]
    granularity: per-tensor
    restriction: (scale, zero_point) = (input0_scale * input1_scale[...], 0)
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

L2_NORMALIZATION
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 128.0, 0)

LOGISTIC
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 256.0, -128)

MAX_POOL_2D
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

MUL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

RESHAPE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

RESIZE_BILINEAR
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

SOFTMAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 256.0, -128)

SPACE_TO_DEPTH
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

TANH
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (1.0 / 128.0, 0)

PAD
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

GATHER
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

BATCH_TO_SPACE_ND
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

SPACE_TO_BATCH_ND
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

TRANSPOSE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

MEAN
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SUB
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SQUEEZE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

LOG_SOFTMAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
    restriction: (scale, zero_point) = (16.0 / 256.0, 127)

MAXIMUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

ARG_MAX
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

MINIMUM
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

LESS
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

PADV2
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

GREATER
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

GREATER_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

LESS_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SLICE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  restriction: Input and outputs must all have same scale/zero_point

EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

NOT_EQUAL
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Input 1:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

SHAPE
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor

QUANTIZE (Requantization)
  Input 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
  Output 0:
    data_type  : int8
    range      : [-128, 127]
    granularity: per-tensor
```
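
As an example of applying the CONV_2D restrictions above, a converter or
backend delegate might validate the bias quantization parameters roughly as
follows. This is a hedged sketch under our own naming; the function and its
parameters are illustrative, not part of TFLite:

```python
def check_conv2d_bias_quantization(input_scale, weight_scales, bias_scales,
                                   bias_zero_points, rel_tol=1e-6):
    """Check the CONV_2D bias restriction:
    (scale, zero_point) = (input0_scale * input1_scale[...], 0),
    applied per output channel for per-axis quantized weights."""
    for w_scale, b_scale, b_zp in zip(weight_scales, bias_scales,
                                      bias_zero_points):
        if b_zp != 0:                      # bias zero_point must be 0
            return False
        expected = input_scale * w_scale   # bias scale is the product
        if abs(b_scale - expected) > rel_tol * expected:
            return False
    return True

ok = check_conv2d_bias_quantization(
    input_scale=0.5,
    weight_scales=[0.1, 0.2],
    bias_scales=[0.05, 0.1],
    bias_zero_points=[0, 0])
```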

## References

[arXiv:1712.05877](https://arxiv.org/abs/1712.05877)