1# Building a quantization paradigm from first principles 2 3**TLDR:** If you prefer example code over theory, look at 4[doc/quantization_example.cc](quantization_example.cc). 5 6## Overview 7 8gemmlowp allows to perform calculations on matrices on uint8 values, but these 9matrices are only useful insofar as they somehow approximate matrices of real 10numbers. By a _quantization paradigm_ we mean a correspondence between matrices 11of quantized 8bit values and matrices of real numbers. The choice of a 12quantization paradigm affects the calculations that gemmlowp itself needs to 13perform, specifically, it affects how one goes from internal 32bit accumulator 14to final 8bit outputs. 15 16The part of gemmlowp transforming internal internal 32bit accumulator to final 178bit outputs is the "output pipeline" described in [output.md](output.md). 18 19gemmlowp's `GemmWithOutputPipeline` entry point allows specifying an arbitrary 20output pipeline, allowing the user to implement their own preferred quantized 21arithmetic paradigm. 22 23In the present document, our purpose is to show how, reasoning from first 24principles and some domain-specific knowledge of neural networks, we can arrive 25naturally at some specific quantization paradigm, and how that can be 26implemented using a specific output pipeline. 27 28We also aim to show how that differs from the older, legacy quantization 29paradigm implemented by gemmlowp's legacy interfaces and why the change to the 30newer quantization paradigm described in this document was useful as far as some 31applications of gemmlowp were concerned. 32 33## Quantization as an affine map. 34 35In order for arithmetic on real values to map directly to arithmetic on 36quantized uint8 values, the mapping between real and quantized uint8 values must 37be affine, which means it must be of the form 38 39``` 40real_value = A * quantized_value + B (1) 41``` 42 43for some constants A, B, or equivalently, of the form 44 45``` 46real_value = C * (quantized_value + D) (2) 47``` 48 49for some constants C, D. Indeed, anything else than such an affine map would 50mean that the result of the quantized calculations do no longer readily provide 51an approximation to the result of the real-numbers calculation. 52 53## Domain-specific constraint: the real value 0 must be exactly representable. 54 55Here a domain-specific constrain from neural networks appears: for some neural 56network layers, it is very useful for optimized implementations that the 57real-value 0 be exactly representable. 58 59For instance, in a Convolutional or Pooling layer with padding, it is useful to 60be able to implement the padding by zero-padding the input array, so that 61optimized loops do not need to become more complex to avoid overrunning the 62array bounds. 63 64In order for such zero-padding to be feasible in a quantized implementation of 65such layers, it is important that the real value '0' be exactly representable in 66quantized form, i.e. that it correspond exactly to some quantized value, which 67we call the _zero-point_. 68 69Indeed, if '0' were not exactly representable, then we would have to use some 70quantized value for padding, that does not exactly correspond to the real value 71'0'. That would typically introduce inaccuracy in the result. In fact, using 72always the same such value would be worse: it would introduce _bias_ in the 73result. 74 75## The final form of the quantization equation 76 77Now let us phrase what this constraint — that the real value 0 be exactly 78representable — means in either quantization equations, (1) and (2). 79 80In equation (1), plugging `real_value = 0` and `quantized_value = zero_point`, 81we get: 82 83``` 840 = A * zero_point + B 85``` 86 87equivalently: 88 89``` 90zero_point = -B / A 91``` 92 93We are thus left with a rather awkward constraint: the real number `-B / A` must 94somehow be guaranteed to be exactly integral, so that the special uint8 value 95`zero_point` can be exactly equal to it. Quite awkward! 96 97Now let us look at equation (2). Plugging `real_value = 0` and 98`quantized_value = zero_point`, we get: 99 100``` 1010 = C * (zero_point + D) 102``` 103 104Conveniently, the constant `C` plays no role anymore, so this equation 105simplifies to: 106 107``` 1080 = zero_point + D 109``` 110 111In other words, `D = -zero_point`. This suggests rewriting the quantization 112equation (2) into the following form (3), which will be the final form that we 113will consistently use: 114 115``` 116real_value = scale * (quantized_value - zero_point) (3) 117``` 118 119To go from (2) to (3), we merely renamed `C` to `scale` and `D` to 120`-zero_point`. 121 122With this quantization equation (3), the condition that 0 be exactly 123representable is vacuously satisfied: `zero_point` is by definition one of the 124possible `quantized_value`'s, and equation (3) maps it to a `real_value` of 125exactly 0. 126 127Note that the final quantizaton equation (3) depends on two constants, one 128integral, the other an arbitrary positive real number: 129 130* `zero_point` is integral, more specifically is one of the possible quantized 131 values (i.e. typically is a uint8 value). 132* `scale` is a positive real number. Thus at this stage we have not yet shown 133 how to eliminate all usage of floating-point arithmetic. That will come 134 below. 135 136## Quantizing a matrix multiplication 137 138Now that we know — equation (3) — how real numbers are to correspond 139to quantized values (typically uint8), we turn to applying this knowledge to 140rewriting a multiplication of matrices of real numbers, by the equivalent 141multiplication of matrices of quantized values. 142 143Say that we have two matrices of real values `lhs_real_matrix`, 144`rhs_real_matrix`. Each entry of their product is the sum (accumulation) of many 145products of individual matrix entries, say `lhs_real_value * rhs_real_value`. 146 147Now suppose that we have already quantized these two matrices according to the 148above equation (3), with some already-known quantization parameters `lhs_scale`, 149`rhs_scale`, `lhs_zero_point`, `rhs_zero_point`, so that their matrix entries 150are quantized as 151 152``` 153lhs_real_value[i] = lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) 154rhs_real_value[i] = rhs_scale * (rhs_quantized_value[i] - rhs_zero_point) 155``` 156 157We then rewrite the matrix product accumulator accordingly: 158 159``` 160result_real_value 161 = Sum_over_i(lhs_real_value[i] * rhs_real_value[i]) 162 = Sum_over_i( 163 lhs_scale * (lhs_quantized_value[i] - lhs_zero_point) * 164 rhs_scale * (rhs_quantized_value[i] - rhs_zero_point) 165 ) 166 = lhs_scale * rhs_scale * Sum_over_i( 167 (lhs_quantized_value[i] - lhs_zero_point) * 168 (rhs_quantized_value[i] - rhs_zero_point) 169 ) (4) 170``` 171 172Now our goal is to represent this result itself as a quantized matrix, i.e. 173still according to equation (3), for some pre-established quantization 174parameters `result_scale` and `result_zero_point`, as 175 176``` 177result_real_value = result_scale * 178 (result_quantized_value - result_zero_point) 179``` 180 181Here we need to keep in mind that our goal is to specify what the quantized 182matrix multiplication should do, i.e. how to compute `result_quantized_value`. 183The last equation above is equivalent to 184 185``` 186result_quantized_value = result_zero_point + 187 result_real_value / result_scale 188``` 189 190Now we can use equation (4) above to plug into this the expression of 191result_real_value in terms of the quantized operands, and we obtain: 192 193``` 194result_quantized_value = result_zero_point + 195 (lhs_scale * rhs_scale / result_scale) * 196 Sum_over_i( 197 (lhs_quantized_value[i] - lhs_zero_point) * 198 (rhs_quantized_value[i] - rhs_zero_point) 199 ) (5) 200``` 201 202Equation (5) is the conclusion of this general discussion of how to specify what 203"quantized matrix multiplication" should actually compute, in order to be able 204to replace real matrix multiplications. 205 206## Implementation of quantized matrix multiplication 207 208Having obtained the mathematical form (5) of quantized matrix multiplication, we 209now turn to its actual implementation. 210 211The inner-most part of (5), 212 213``` 214int32_accumulator = 215 Sum_over_i( 216 (lhs_quantized_value[i] - lhs_zero_point) * 217 (rhs_quantized_value[i] - rhs_zero_point) 218) 219``` 220 221is the "kernel" accumulation loop. It is where the bulk of the computational 222cost goes. Luckily, it only involves integers: the quantized operands matrix 223entries, and their `zero_point` quantization parameters. Typically, all of these 224values are uint8. Typically, the above differences of uint8 values would be 225represented as signed int16; their products as signed int32. 226 227It is out of scope of the present doc to discuss how to avoid the overhead of 228having to subtract these `zero_point` constants in this inner loop; refer to 229[this section of 230low-precision.md](low-precision.md#efficient-handling-of-offsets) for that. The 231gist of it is that a mathematical trick allows us to take the handling of these 232`zero_point` constants out of this accumulation loop, so that it simplifies to 233 234``` 235int32_accumulator = 236 Sum_over_i( 237 lhs_quantized_value[i] * 238 rhs_quantized_value[i] 239 ) (6) 240``` 241 242Anyway, the result is a `int32_accumulator` that we now plug back into the rest 243of (5): 244 245``` 246result_quantized_value = result_zero_point + 247 (lhs_scale * rhs_scale / result_scale) * int32_accumulator (7) 248``` 249 250The difficulty here is of course that `(lhs_scale * rhs_scale / result_scale)` 251is a positive real number, not an integer in general. It is a constant, though. 252So what we have to implement here is the (approximate) scaling of a int32 value 253by some arbitrary positive constant multiplier. 254 255Moreover, it is safe to assume that this positive constant multiplier is smaller 256than one — each of the `scale` values here is typically smaller than one, 257as we are typically mapping the `[0..255]` quantized uint8 value range to an 258interval of real values that is much narrower than that, typically within 259`[-10,10]` in most neural networks. For example, a neural network using Relu6 260activation functions will typically have real activation values in the interval 261[0,6]. 262 263So how do we implement the multiplication of a int32 value by a positive real 264constant that is smaller than one? Typically, by multiplying by a fixed-point 265constant multiplier in the normalized interval `[1/2,1)`, and right-shifting 266the result to achieve the correct multiplier. 267 268At this point we have obtained the int32 value of the product 269 270``` 271(lhs_scale * rhs_scale / result_scale) * int32_accumulator 272``` 273 274Looking at (7), it only remains to add to it the integral value 275`result_zero_point`, and we are done. 276 277## How this is implemented in gemmlowp 278 279The different parts of gemmlowp implementing aspects of the above discussion 280are: 281 282* The packing stage (see [packing.md](packing.md)) implements the special 283 mathematical trick to handle `lhs_offset`, `rhs_offset` that we alluded to 284 above, see [this section of 285 low-precision.md](low-precision.md#efficient-handling-of-offsets) for 286 details. Thanks to is, the rest of the calculation can proceed as if 287 `lhs_offset`, `rhs_offset` were 0. 288 289* The compute/kernel stage (see [kernel.md](kernel.md)) performs the core 290 accumulation loop producing the `int32_accumulator`, see equation (6) above. 291 292* The unpacking stage feeds into the output pipeline (see 293 [output.md](output.md)), which implements the rest of the evaluation of the 294 above equation (5), that we discussed in the previous section. 295 296Now, the point of gemmlowp's flexible output-pipelines mechanism (see 297[output.md](output.md)) is to support different quantization paradigms, so we 298now have to specify which particular flavor of output pipeline corresponds to 299the particular quantization paradigm that we detailed above in this document. 300 301The specific output pipeline stage implementing the present quantization 302paradigm, i.e. implementing the precise computation detailed in the previous 303section (equation (5)), is 304`OutputStageQuantizeDownInt32ByFixedPoint`. 305 306Please refer to the comment explaining it in 307[public/output_stages.h](../public/output_stages.h). 308 309## How this differs from the older legacy gemmlowp quantization paradigm 310 311The difference between the older legacy quantization paradigm described in 312[low-precision.md](low-precision.md) and the newer one described in this 313document boils down to the difference between the legacy output stage 314implementing it, `OutputStageQuantizeDownInt32ToUint8Scale`, and the new output 315stage implementing the new paradigm, 316`OutputStageQuantizeDownInt32ByFixedPoint`. 317 318Please refer to the comments in 319[public/output_stages.h](../public/output_stages.h) for details about these two 320output stages and how they differ. 321 322Issues with the old output stage `OutputStageQuantizeDownInt32ToUint8Scale` are: 323 3241. The int32 accumulators (inputs to the output stage) undergo a plain int32 325 multiplication with a int32 multiplier, which may overflow. By contrast, in 326 the newer `OutputStageQuantizeDownInt32ByFixedPoint`, this 327 integer multiplication becomes a fixed-point multiplication and cannot 328 overflow. 329 330 * In practice, to limit the risk of overflow, this pushes users to choose 331 smaller values for this integer multiplier, which means limited 332 multiplicative accuracy, which may cause multiplicative bias depending 333 on how it is used. 334 3352. Note how the order of multiplying by the multipler and adding the 336 `result_offset` are swapped. This reflects a quantizatin equation of the 337 form (1) above, as opposed to the form (2)/(3) that the new quantization 338 paradigm uses. As a result, it is essentially impossible to guarantee that 0 339 is an exactly-representable value, which as discussed above is an issue at 340 least in some convolutional neural network applications. 341 342## Example code illustrating the new quantization paradigm 343 344Example code showing how to perfom a quantized matrix multiplication in the 345quantization paradigm discussed here is in 346[doc/quantization_example.cc](quantization_example.cc). 347