# Post-training quantization

Post-training quantization is a conversion technique that can reduce model size
while also improving CPU and hardware accelerator latency, with little
degradation in model accuracy. You can quantize an already-trained float
TensorFlow model when you convert it to TensorFlow Lite format using the
[TensorFlow Lite Converter](../convert/).

Note: The procedures on this page require TensorFlow 1.15 or higher.

### Optimization methods

There are several post-training quantization options to choose from. Here is a
summary table of the choices and the benefits they provide:

| Technique                  | Benefits                     | Hardware                        |
| -------------------------- | ---------------------------- | ------------------------------- |
| Dynamic range quantization | 4x smaller, 2x-3x speedup    | CPU                             |
| Full integer quantization  | 4x smaller, 3x+ speedup      | CPU, Edge TPU, Microcontrollers |
| Float16 quantization       | 2x smaller, GPU acceleration | CPU, GPU                        |

The following decision tree can help determine which post-training quantization
method is best for your use case:

![post-training optimization options](images/optimization.jpg)

### Dynamic range quantization

The simplest form of post-training quantization statically quantizes only the
weights from floating point to integer, which has 8 bits of precision:

<pre>
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
<b>converter.optimizations = [tf.lite.Optimize.DEFAULT]</b>
tflite_quant_model = converter.convert()
</pre>

At inference, weights are converted from 8 bits of precision to floating point
and computed using floating-point kernels. This conversion is done once and
cached to reduce latency.

To further improve latency, "dynamic-range" operators dynamically quantize
activations to 8 bits based on their range and perform computations with 8-bit
weights and activations. This optimization provides latencies close to fully
fixed-point inference. However, the outputs are still stored using floating
point, so the speedup with dynamic-range ops is less than that of a full
fixed-point computation.
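
For example, you can inspect the converted model with the TensorFlow Lite
interpreter to confirm that weight tensors are now stored as 8-bit integers
while the model interface stays float. This is a minimal sketch that assumes
`tflite_quant_model` holds the result of the conversion above:

<pre>
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter.allocate_tensors()

# Weight tensors report an int8 dtype; the model input, output and
# activations remain float32.
for tensor in interpreter.get_tensor_details():
  print(tensor["name"], tensor["dtype"])
</pre>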

### Full integer quantization

You can get further latency improvements, reductions in peak memory usage, and
compatibility with integer only hardware devices or accelerators by making sure
all model math is integer quantized.

For full integer quantization, you need to calibrate or estimate the range, i.e.,
(min, max) of all floating-point tensors in the model. Unlike constant tensors
such as weights and biases, variable tensors such as model input, activations
(outputs of intermediate layers) and model output cannot be calibrated unless we
run a few inference cycles. As a result, the converter requires a representative
dataset to calibrate them. This dataset can be a small subset (around 100-500
samples) of the training or validation data. Refer to the
`representative_dataset()` function below.

<pre>
def representative_dataset():
  for data in tf.data.Dataset.from_tensor_slices((images)).batch(1).take(100):
    yield [tf.dtypes.cast(data, tf.float32)]
</pre>

For testing purposes, you can use a dummy dataset as follows:

<pre>
def representative_dataset():
  for _ in range(100):
    data = np.random.rand(1, 244, 244, 3)
    yield [data.astype(np.float32)]
</pre>

#### Integer with float fallback (using default float input/output)

To fully integer quantize a model, but fall back to float operators when they
don't have an integer implementation (which ensures conversion occurs smoothly),
use the following steps:

<pre>
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
<b>converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset</b>
tflite_quant_model = converter.convert()
</pre>

Note: This `tflite_quant_model` won't be compatible with integer only devices
(such as 8-bit microcontrollers) and accelerators (such as the Coral Edge TPU)
because the input and output remain float in order to keep the same interface
as the original float-only model.
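
As a quick check (a sketch, assuming `tflite_quant_model` is the model produced
above), you can verify with the interpreter that the model interface is still
float32:

<pre>
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter.allocate_tensors()

# Both dtypes are still numpy.float32 for this model.
print(interpreter.get_input_details()[0]['dtype'])
print(interpreter.get_output_details()[0]['dtype'])
</pre>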

#### Integer only

*Creating integer only models is a common use case for
[TensorFlow Lite for Microcontrollers](https://www.tensorflow.org/lite/microcontrollers)
and [Coral Edge TPUs](https://coral.ai/).*

Note: Starting with TensorFlow 2.3.0, we support the `inference_input_type` and
`inference_output_type` attributes.

Additionally, to ensure compatibility with integer only devices (such as 8-bit
microcontrollers) and accelerators (such as the Coral Edge TPU), you can enforce
full integer quantization for all ops, including the input and output, by using
the following steps:

<pre>
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
<b>converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]</b>
<b>converter.inference_input_type = tf.int8</b>  # or tf.uint8
<b>converter.inference_output_type = tf.int8</b>  # or tf.uint8
tflite_quant_model = converter.convert()
</pre>

Note: The converter will throw an error if it encounters an operation it cannot
currently quantize.
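
Because the interface is now int8, float inputs have to be quantized (and int8
outputs dequantized) by the caller, using the scale and zero point reported by
the interpreter. The following is a rough sketch with a random placeholder
input, assuming `tflite_quant_model` is the integer-only model produced above:

<pre>
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Quantize a float input using the input tensor's scale and zero point.
scale, zero_point = input_details['quantization']
float_input = np.random.rand(*input_details['shape']).astype(np.float32)  # placeholder
int8_input = np.round(float_input / scale + zero_point).astype(np.int8)

interpreter.set_tensor(input_details['index'], int8_input)
interpreter.invoke()
int8_output = interpreter.get_tensor(output_details['index'])

# Dequantize the output back to float.
out_scale, out_zero_point = output_details['quantization']
float_output = (int8_output.astype(np.float32) - out_zero_point) * out_scale
</pre>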

### Float16 quantization

You can reduce the size of a floating point model by quantizing the weights to
float16, the IEEE standard for 16-bit floating point numbers. To enable float16
quantization of weights, use the following steps:

<pre>
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
<b>converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]</b>
tflite_quant_model = converter.convert()
</pre>

The advantages of float16 quantization are as follows:

*   It reduces model size by up to half (since all weights become half of their
    original size).
*   It causes minimal loss in accuracy.
*   It supports some delegates (e.g., the GPU delegate), which can operate
    directly on float16 data, resulting in faster execution than float32
    computations.

The disadvantages of float16 quantization are as follows:

*   It does not reduce latency as much as quantization to fixed-point math.
*   By default, a float16 quantized model will "dequantize" the weight values
    to float32 when run on the CPU. (Note that the GPU delegate will not perform
    this dequantization, since it can operate on float16 data.)
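
As a rough way to verify the size reduction (a sketch only), you can convert the
same SavedModel with and without float16 quantization and compare the sizes of
the resulting flatbuffers:

<pre>
import tensorflow as tf

# Baseline float32 conversion.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
tflite_float_model = converter.convert()

# Float16 quantized conversion.
converter_fp16 = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter_fp16.optimizations = [tf.lite.Optimize.DEFAULT]
converter_fp16.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter_fp16.convert()

# The float16 model should be roughly half the size.
print('float32 model: %d bytes' % len(tflite_float_model))
print('float16 model: %d bytes' % len(tflite_fp16_model))
</pre>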

### Integer only: 16-bit activations with 8-bit weights (experimental)

This is an experimental quantization scheme. It is similar to the "integer only"
scheme, but activations are quantized based on their range to 16 bits, weights
are quantized to 8-bit integers, and biases are quantized to 64-bit integers.
This mode is referred to as 16x8 quantization below.

The main advantage of this quantization is that it can improve accuracy
significantly while only slightly increasing model size.

<pre>
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.representative_dataset = representative_dataset
<b>converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8]</b>
tflite_quant_model = converter.convert()
</pre>

If 16x8 quantization is not supported for some operators in the model, then the
model can still be quantized, but unsupported operators are kept in float. Add
the following option to the `target_spec` to allow this:

<pre>
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.representative_dataset = representative_dataset
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8,
                                       <b>tf.lite.OpsSet.TFLITE_BUILTINS</b>]
tflite_quant_model = converter.convert()
</pre>

Examples of use cases where the accuracy improvements provided by this
quantization scheme are valuable include:

*   super-resolution,
*   audio signal processing such as noise cancelling and beamforming,
*   image de-noising,
*   HDR reconstruction from a single image.

The disadvantages of this quantization are as follows:

*   Currently inference is noticeably slower than 8-bit full integer due to the
    lack of optimized kernel implementations.
*   Currently it is incompatible with the existing hardware accelerated TFLite
    delegates.

Note: This is an experimental feature.

A tutorial for this quantization mode can be found
[here](post_training_integer_quant_16x8.ipynb).

### Model accuracy

Since weights are quantized post training, there could be an accuracy loss,
particularly for smaller networks. Pre-trained fully quantized models are
provided for specific networks in the
[TensorFlow Lite model repository](../models/). It is important to check the
accuracy of the quantized model to verify that any degradation in accuracy is
within acceptable limits. There are tools to evaluate
[TensorFlow Lite model accuracy](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/evaluation/tasks){:.external}.
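
For a quick, self-contained check (a sketch only), you can also run the
quantized model over a labeled test set with the Python interpreter and measure
top-1 accuracy. Here, `test_images` and `test_labels` are hypothetical arrays,
and the model is assumed to have a float interface:

<pre>
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_content=tflite_quant_model)
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]['index']
output_index = interpreter.get_output_details()[0]['index']

correct = 0
for image, label in zip(test_images, test_labels):
  interpreter.set_tensor(input_index, image[np.newaxis, ...].astype(np.float32))
  interpreter.invoke()
  prediction = np.argmax(interpreter.get_tensor(output_index))
  correct += int(prediction == label)

print('Quantized model accuracy: %.4f' % (correct / len(test_images)))
</pre>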

Alternatively, if the accuracy drop is too high, consider using
[quantization aware training](https://www.tensorflow.org/model_optimization/guide/quantization/training).
However, doing so requires modifications during model training to add fake
quantization nodes, whereas the post-training quantization techniques on this
page use an existing pre-trained model.

### Representation for quantized tensors

8-bit quantization approximates floating point values using the following
formula:

$$real\_value = (int8\_value - zero\_point) \times scale$$

The representation has two main parts:

*   Per-axis (aka per-channel) or per-tensor weights represented by int8 two’s
    complement values in the range [-127, 127] with zero-point equal to 0.

*   Per-tensor activations/inputs represented by int8 two’s complement values in
    the range [-128, 127], with a zero-point in range [-128, 127].

For a detailed view of our quantization scheme, please see our
[quantization spec](./quantization_spec.md). Hardware vendors who want to plug
into TensorFlow Lite's delegate interface are encouraged to implement the
quantization scheme described there.
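
As a toy illustration of this formula (a sketch, not tied to any particular
model), the following quantizes a small float vector with a symmetric per-tensor
scale, as used for weights (zero point 0, values in [-127, 127]), and
reconstructs the approximate real values:

<pre>
import numpy as np

real_values = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)

# Symmetric per-tensor quantization: zero_point = 0, values in [-127, 127].
scale = np.abs(real_values).max() / 127.0
int8_values = np.clip(np.round(real_values / scale), -127, 127).astype(np.int8)

# real_value = (int8_value - zero_point) * scale
reconstructed = (int8_values.astype(np.float32) - 0) * scale
print(int8_values)
print(reconstructed)
</pre>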