1# The low-precision paradigm in gemmlowp, and how it's implemented
2
3## Introduction
4
5"Low-precision" means that the input and output matrix entries are integers on
6at most 8 bits. The scalar type is uint8_t.
7
8This isn't the same as just doing plain matrix arithmetic over uint8_t, because
9that would overflow. To avoid overflow, we internally accumulate results on more
10than 8 bits, and at the end we keep only some significant 8 bits. This relies on
11the caller providing suitable offset/multiplier/shift parameters, which
12effectively govern how we extract some significant 8 bit from our more-than-8bit
13temporary accumulators.
14
15## Low-precision paradigms
16
17gemmlowp is flexible enough to support multiple low-precision paradigms, i.e.
18multiple ways that a meaning is attached to 8bit values so that a computation
19can rely on a 8bit GEMM provided by gemmlowp.
20
21### The current flexible design with arbitrary "output pipelines".
22
23See [output.md](output.md) for more details about output pipelines. This is a
24mechanism by which gemmlowp becomes generic enough to support multiple 8bit
25computation paradigms, by allowing the user to set up a chain of transformations
26to be performed on internal 32bit accumulators to obtain the final outputs.
27
28The public entry point in [public/gemmlowp.h](../public/gemmlowp.h) allowing to
29set up an arbitrary output pipeline is `GemmWithOutputPipeline`.
30
31Refer to [quantization.md](quantization.md) for details of how one gets from
32first principles to the actual output pipelines to assemble for successful
33real-world quantized calculations.
34
35For the scope of the present document, it suffices to say that quantized matrix
36multiplication takes the following parameters:
37
38-   The lhs matrix of uint8 quantized values.
39-   The rhs matrix of uint8 quantized values.
40-   A int32 lhs_offset, that will be added to each entry of the lhs matrix.
41-   A int32 rhs_offset, that will be added to each entry of the rhs matrix.
42-   An output pipeline, that will process int32 accumulators into final outputs.
43
44The overall computation goes through the following steps:
45
461.  Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
472.  Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
483.  Compute the int32 matrix product of the resulting lhs times rhs.
494.  Apply the output pipeline on these int32 accumulators, to obtain the final
50    outputs.
51
52### The legacy low-precision paradigm
53
54This older paradigm is the one exposed by the following entry points:
55
56*   In [public/gemmlowp.h](../public/gemmlowp.h), the `Gemm` entry point.
57*   The deprecateed `eight_bit_int_gemm` directory.
58
59Originally, gemmlowp started an implementation of the (now deprecated)
60EightBitIntGemm paradigm, where quantized matrix multiplication takes the
61following input parameters: - the lhs matrix of uint8 quantized values - the rhs
62matrix of uint8 quantized values - the following int32 "quantization
63parameters", which control how the uint8 quantized values in the matrices are to
64be interpreted during the matrix computation: - lhs_offset - rhs_offset -
65result_offset - result_mult_int - result_shift
66
67In that legacy paradigm, the mathematical expression to be computed is the
68result of the following steps:
69
701.  Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
712.  Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
723.  Compute the int32 matrix product of the resulting lhs times rhs.
734.  Add result_offset to each entry of the result.
745.  Multiply each entry of the result by the following fraction, and round to
75    the nearest integer:
76
77```
78result_mult_int
79---------------                             (1)
802^result_shift
81```
82
831.  Clamp the resulting int32 values to the `[0..255]` range and cast to uint8.
84
85Again, this paradigm is not recommended for new usage. See
86[quantization.md](quantization.md) for how reasoning from first principles, one
87arrives to a substantially different quantization paradigm.
88
89In addition, note that the integer multiplication by the numerator in the above
90step 5. risks overflowing. That concern is avoided in the currently recommended
91output stages by performing a fixed-point multiplication instead of an ordinary
92integer multiplication.
93
94# Efficient handling of offsets
95
96At first glance it may seem like the above-described quantized computation
97scheme requires adding the lhs_offset and rhs_offset to each of the lhs and rhs
98matrix entries.
99
100Doing that in the GEMM kernel would incur substantial overhead: - It would mean
101extra arithmetic work in the GEMM kernel; - It would require storing the
102lhs_offset and rhs_offset in registers, which would eat into the register space
103available for the rest of the GEMM kernel.
104
105One may then consider adding the lhs_offset and rhs_offset once and for all to
106lhs and rhs blocks, in a GEMM implementation operating on one lhs block and one
107rhs block at a time. However, doing so would require storing lhs and rhs blocks
108in 32 bit (or at least in 16 bit in real-world cases), which would partially
109negate the memory bandwidth benefits of low-precision computation.
110
111Fortunately, there is another way to handle these offsets that has none of the
112costs of the approaches described above. The idea is as follows.
113
114Let `P` denote the matrix shaped like `lhs`, but filled with 1's.
115
116Let `Q` denote the matrix shaped like `rhs`, but filled with 1's.
117
118Adding lhs_offset to each entry of `lhs`, means adding `lhs_offset * P` to
119`lhs`.
120
121Adding rhs_offset to each entry of `rhs`, means adding `rhs_offset * Q` to
122`rhs`.
123
124Thus, as far as handling `lhs_offset` and `rhs_offset` goes, the matrix product
125to be computed is:
126
127```
128(lhs + lhs_offset * P) * (rhs + rhs_offset * Q)
129```
130
131Expanding this (using distributivity of matrix multiplication over addition), we
132see that the above product is equal to the following sum of 4 terms:
133
134```
135  lhs * rhs                                 (2)
136+ lhs_offset * P * rhs
137+ lhs * rhs_offset * Q
138+ lhs_offset * rhs_offset * P * Q
139```
140
141The first term, `lhs * rhs`, is just the matrix multiplication ignoring the
142offsets, i.e. as if `lhs_offset==rhs_offset==0`. Our claim here is that this is
143all what we have to compute in the GEMM kernel.
144
145In the second term, `lhs_offset * P * rhs`, notice that since P is filled with
1461's, `P * rhs` has all its rows equal to each other, and equal to the row-vector
147of sums of all the entries in each column of rhs.
148
149Thus, we can compute the second term, `lhs_offset * P * rhs`, by summing each
150column of rhs. This produces a single row-vector, and in order to add the second
151term, we simply need to add this row-vector (multiplied by lhs_offset) to each
152row of the result. This is just a rank one update of the result (equivalently,
153the second term is a rank one matrix), and we can efficiently store it as a
154single vector.
155
156The third term, `lhs * rhs_offset * Q`, is entirely similar to the second one,
157and can be similarly computed by summing each row of lhs, storing this in a
158single column-vector, and later multiplying these sums by rhs_offset.
159
160The fourth term is a single constant, repeated into all the entries of the
161matrix. The matrix `P * Q` is filled with the single constant value 'depth' (the
162depth of the matrix product i.e. the number of columns of the lhs). Thus the
163fourth term is simply the rank zero update adding this constant to each matrix
164entry:
165
166```
167lhs_offset * rhs_offset * depth
168```
169
170# Implementation of this technique in gemmlowp
171
172In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
173to prepare them for efficient repeated traversal by the kernel), we compute the
174sum of each row of the lhs block and the sum of each column of the rhs block.
175
176See in [internal/pack.h](../internal/pack.h), in the PackedSideBlock class, the
177following member:
178
179```
180// Handle on the additional buffer backing the vector of sums of slices
181// associated with this block. Owned.
182Allocator::Handle sums_of_each_slice_handle_;
183```
184
185sums_of_each_slice_handle_ is the handle to the buffer allocated to store the
186vector containing sums of rows of lhs, or of sums of columns of rhs.
187
188After these rank one updates have been computed at the packing stage, they are
189ignored at the compute kernel stage, since that stage is only concerned with the
190first of the four terms in (2); they are only used at the unpacking stage. See
191the default/reference implementation, `UnpackResultImpl`, in
192[internal/unpack.h](../internal/unpack.h).
193