1# Building a quantization paradigm from first principles
2
3**TLDR:** If you prefer example code over theory, look at
4[doc/quantization_example.cc](quantization_example.cc).
5
6## Overview
7
8gemmlowp allows to perform calculations on matrices on uint8 values, but these
9matrices are only useful insofar as they somehow approximate matrices of real
10numbers. By a _quantization paradigm_ we mean a correspondence between matrices
11of quantized 8bit values and matrices of real numbers. The choice of a
12quantization paradigm affects the calculations that gemmlowp itself needs to
13perform, specifically, it affects how one goes from internal 32bit accumulator
14to final 8bit outputs.
15
16The part of gemmlowp transforming internal 32bit accumulator to final
178bit outputs is the "output pipeline" described in [output.md](output.md).
18
19gemmlowp's `GemmWithOutputPipeline` entry point allows specifying an arbitrary
20output pipeline, allowing the user to implement their own preferred quantized
21arithmetic paradigm.
22
23In the present document, our purpose is to show how, reasoning from first
24principles and some domain-specific knowledge of neural networks, we can arrive
25naturally at some specific quantization paradigm, and how that can be
26implemented using a specific output pipeline.
27
28We also aim to show how that differs from the older, legacy quantization
29paradigm implemented by gemmlowp's legacy interfaces and why the change to the
30newer quantization paradigm described in this document was useful as far as some
31applications of gemmlowp were concerned.
32
33## Quantization as an affine map.
34
35In order for arithmetic on real values to map directly to arithmetic on
36quantized uint8 values, the mapping between real and quantized uint8 values must
37be affine, which means it must be of the form
38
39```
40real_value = A * quantized_value + B             (1)
41```
42
43for some constants A, B, or equivalently, of the form
44
45```
46real_value = C * (quantized_value + D)           (2)
47```
48
49for some constants C, D. Indeed, anything else than such an affine map would
50mean that the result of the quantized calculations do no longer readily provide
51an approximation to the result of the real-numbers calculation.
52
53## Domain-specific constraint: the real value 0 must be exactly representable.
54
55Here a domain-specific constrain from neural networks appears: for some neural
56network layers, it is very useful for optimized implementations that the
57real-value 0 be exactly representable.
58
59For instance, in a Convolutional or Pooling layer with padding, it is useful to
60be able to implement the padding by zero-padding the input array, so that
61optimized loops do not need to become more complex to avoid overrunning the
62array bounds.
63
64In order for such zero-padding to be feasible in a quantized implementation of
65such layers, it is important that the real value '0' be exactly representable in
66quantized form, i.e. that it correspond exactly to some quantized value, which
67we call the _zero-point_.
68
69Indeed, if '0' were not exactly representable, then we would have to use some
70quantized value for padding, that does not exactly correspond to the real value
71'0'. That would typically introduce inaccuracy in the result. In fact, using
72always the same such value would be worse: it would introduce _bias_ in the
73result.
74
75## The final form of the quantization equation
76
77Now let us phrase what this constraint — that the real value 0 be exactly
78representable — means in either quantization equations, (1) and (2).
79
80In equation (1), plugging `real_value = 0` and `quantized_value = zero_point`,
81we get:
82
83```
840 = A * zero_point + B
85```
86
87equivalently:
88
89```
90zero_point = -B / A
91```
92
93We are thus left with a rather awkward constraint: the real number `-B / A` must
94somehow be guaranteed to be exactly integral, so that the special uint8 value
95`zero_point` can be exactly equal to it. Quite awkward!
96
97Now let us look at equation (2). Plugging `real_value = 0` and
98`quantized_value = zero_point`, we get:
99
100```
1010 = C * (zero_point + D)
102```
103
104Conveniently, the constant `C` plays no role anymore, so this equation
105simplifies to:
106
107```
1080 = zero_point + D
109```
110
111In other words, `D = -zero_point`. This suggests rewriting the quantization
112equation (2) into the following form (3), which will be the final form that we
113will consistently use:
114
115```
116real_value = scale * (quantized_value - zero_point)        (3)
117```
118
119To go from (2) to (3), we merely renamed `C` to `scale` and `D` to
120`-zero_point`.
121
122With this quantization equation (3), the condition that 0 be exactly
123representable is vacuously satisfied: `zero_point` is by definition one of the
124possible `quantized_value`'s, and equation (3) maps it to a `real_value` of
125exactly 0.
126
127Note that the final quantizaton equation (3) depends on two constants, one
128integral, the other an arbitrary positive real number:
129
130*   `zero_point` is integral, more specifically is one of the possible quantized
131    values (i.e. typically is a uint8 value).
132*   `scale` is a positive real number. Thus at this stage we have not yet shown
133    how to eliminate all usage of floating-point arithmetic. That will come
134    below.
135
136## Quantizing a matrix multiplication
137
138Now that we know — equation (3) — how real numbers are to correspond
139to quantized values (typically uint8), we turn to applying this knowledge to
140rewriting a multiplication of matrices of real numbers, by the equivalent
141multiplication of matrices of quantized values.
142
143Say that we have two matrices of real values `lhs_real_matrix`,
144`rhs_real_matrix`. Each entry of their product is the sum (accumulation) of many
145products of individual matrix entries, say `lhs_real_value * rhs_real_value`.
146
147Now suppose that we have already quantized these two matrices according to the
148above equation (3), with some already-known quantization parameters `lhs_scale`,
149`rhs_scale`, `lhs_zero_point`, `rhs_zero_point`, so that their matrix entries
150are quantized as
151
152```
153lhs_real_value[i] = lhs_scale * (lhs_quantized_value[i] - lhs_zero_point)
154rhs_real_value[i] = rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
155```
156
157We then rewrite the matrix product accumulator accordingly:
158
159```
160result_real_value
161  = Sum_over_i(lhs_real_value[i] * rhs_real_value[i])
162  = Sum_over_i(
163        lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) *
164        rhs_scale * (rhs_quantized_value[i] - rhs_zero_point)
165    )
166  = lhs_scale * rhs_scale * Sum_over_i(
167        (lhs_quantized_value[i] - lhs_zero_point) *
168        (rhs_quantized_value[i] - rhs_zero_point)
169    )                                                      (4)
170```
171
172Now our goal is to represent this result itself as a quantized matrix, i.e.
173still according to equation (3), for some pre-established quantization
174parameters `result_scale` and `result_zero_point`, as
175
176```
177result_real_value = result_scale *
178    (result_quantized_value - result_zero_point)
179```
180
181Here we need to keep in mind that our goal is to specify what the quantized
182matrix multiplication should do, i.e. how to compute `result_quantized_value`.
183The last equation above is equivalent to
184
185```
186result_quantized_value = result_zero_point +
187    result_real_value / result_scale
188```
189
190Now we can use equation (4) above to plug into this the expression of
191result_real_value in terms of the quantized operands, and we obtain:
192
193```
194result_quantized_value = result_zero_point +
195    (lhs_scale * rhs_scale / result_scale) *
196        Sum_over_i(
197            (lhs_quantized_value[i] - lhs_zero_point) *
198            (rhs_quantized_value[i] - rhs_zero_point)
199        )                                                  (5)
200```
201
202Equation (5) is the conclusion of this general discussion of how to specify what
203"quantized matrix multiplication" should actually compute, in order to be able
204to replace real matrix multiplications.
205
206## Implementation of quantized matrix multiplication
207
208Having obtained the mathematical form (5) of quantized matrix multiplication, we
209now turn to its actual implementation.
210
211The inner-most part of (5),
212
213```
214int32_accumulator =
215    Sum_over_i(
216        (lhs_quantized_value[i] - lhs_zero_point) *
217        (rhs_quantized_value[i] - rhs_zero_point)
218)
219```
220
221is the "kernel" accumulation loop. It is where the bulk of the computational
222cost goes. Luckily, it only involves integers: the quantized operands matrix
223entries, and their `zero_point` quantization parameters. Typically, all of these
224values are uint8. Typically, the above differences of uint8 values would be
225represented as signed int16; their products as signed int32.
226
227It is out of scope of the present doc to discuss how to avoid the overhead of
228having to subtract these `zero_point` constants in this inner loop; refer to
229[this section of
230low-precision.md](low-precision.md#efficient-handling-of-offsets) for that. The
231gist of it is that a mathematical trick allows us to take the handling of these
232`zero_point` constants out of this accumulation loop, so that it simplifies to
233
234```
235int32_accumulator =
236    Sum_over_i(
237      lhs_quantized_value[i] *
238      rhs_quantized_value[i]
239    )                                                      (6)
240```
241
242Anyway, the result is a `int32_accumulator` that we now plug back into the rest
243of (5):
244
245```
246result_quantized_value = result_zero_point +
247    (lhs_scale * rhs_scale / result_scale) * int32_accumulator       (7)
248```
249
250The difficulty here is of course that `(lhs_scale * rhs_scale / result_scale)`
251is a positive real number, not an integer in general. It is a constant, though.
252So what we have to implement here is the (approximate) scaling of a int32 value
253by some arbitrary positive constant multiplier.
254
255Moreover, it is safe to assume that this positive constant multiplier is smaller
256than one — each of the `scale` values here is typically smaller than one,
257as we are typically mapping the `[0..255]` quantized uint8 value range to an
258interval of real values that is much narrower than that, typically within
259`[-10,10]` in most neural networks. For example, a neural network using Relu6
260activation functions will typically have real activation values in the interval
261[0,6].
262
263So how do we implement the multiplication of a int32 value by a positive real
264constant that is smaller than one? Typically, by multiplying by a fixed-point
265constant multiplier in the normalized interval `[1/2,1)`, and right-shifting
266the result to achieve the correct multiplier.
267
268At this point we have obtained the int32 value of the product
269
270```
271(lhs_scale * rhs_scale / result_scale) * int32_accumulator
272```
273
274Looking at (7), it only remains to add to it the integral value
275`result_zero_point`, and we are done.
276
277## How this is implemented in gemmlowp
278
279The different parts of gemmlowp implementing aspects of the above discussion
280are:
281
282*   The packing stage (see [packing.md](packing.md)) implements the special
283    mathematical trick to handle `lhs_offset`, `rhs_offset` that we alluded to
284    above, see [this section of
285    low-precision.md](low-precision.md#efficient-handling-of-offsets) for
286    details. Thanks to is, the rest of the calculation can proceed as if
287    `lhs_offset`, `rhs_offset` were 0.
288
289*   The compute/kernel stage (see [kernel.md](kernel.md)) performs the core
290    accumulation loop producing the `int32_accumulator`, see equation (6) above.
291
292*   The unpacking stage feeds into the output pipeline (see
293    [output.md](output.md)), which implements the rest of the evaluation of the
294    above equation (5), that we discussed in the previous section.
295
296Now, the point of gemmlowp's flexible output-pipelines mechanism (see
297[output.md](output.md)) is to support different quantization paradigms, so we
298now have to specify which particular flavor of output pipeline corresponds to
299the particular quantization paradigm that we detailed above in this document.
300
301The specific output pipeline stage implementing the present quantization
302paradigm, i.e. implementing the precise computation detailed in the previous
303section (equation (5)), is
304`OutputStageQuantizeDownInt32ByFixedPoint`.
305
306Please refer to the comment explaining it in
307[public/output_stages.h](../public/output_stages.h).
308
309## How this differs from the older legacy gemmlowp quantization paradigm
310
311The difference between the older legacy quantization paradigm described in
312[low-precision.md](low-precision.md) and the newer one described in this
313document boils down to the difference between the legacy output stage
314implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new output
315stage implementing the new paradigm,
316`OutputStageQuantizeDownInt32ByFixedPoint`.
317
318Please refer to the comments in
319[public/output_stages.h](../public/output_stages.h) for details about these two
320output stages and how they differ.
321
322Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are:
323
3241.  The int32 accumulators (inputs to the output stage) undergo a plain int32
325    multiplication with a int32 multiplier, which may overflow. By contrast, in
326    the newer `OutputStageQuantizeDownInt32ByFixedPoint`, this
327    integer multiplication becomes a fixed-point multiplication and cannot
328    overflow.
329
330    *   In practice, to limit the risk of overflow, this pushes users to choose
331        smaller values for this integer multiplier, which means limited
332        multiplicative accuracy, which may cause multiplicative bias depending
333        on how it is used.
334
3352.  Note how the order of multiplying by the multipler and adding the
336    `result_offset` are swapped. This reflects a quantizatin equation of the
337    form (1) above, as opposed to the form (2)/(3) that the new quantization
338    paradigm uses. As a result, it is essentially impossible to guarantee that 0
339    is an exactly-representable value, which as discussed above is an issue at
340    least in some convolutional neural network applications.
341
342## Example code illustrating the new quantization paradigm
343
344Example code showing how to perfom a quantized matrix multiplication in the
345quantization paradigm discussed here is in
346[doc/quantization_example.cc](quantization_example.cc).
347