1# Computation with less than 8 bits in gemmlowp
2
3## Introduction
4
5We assume familiarity with gemmlowp's low-precision uint8 computation paradigm,
6which is described in [low-precision.md](low-precision.md).
7
8This document is about the possibility of further reducing precision below 8
9bits.
10
11That allows to get higher arithmetic throughput on some architectures, at the
12cost of decreased accuracy.
13
14## The past, present, and future of less-than-8-bit computation in gemmlowp
15
16A meta note is needed here as to how this fits with the general gemmlowp design.
17
18### The past
19
20Less-than-8-bit computation was initially designed and implemented in gemmlowp
21as a drop-in replacement for regular 8bit computation, a plain optimization. The
22idea was that to automatically requantize 8bit operands to less-than-8bit during
23the O(N^2) packing stage, then take advantage of the lower bit depth during the
24O(N^3) compute stage. For large enough matrices, that should be worth it.
25
26### The present
27
28TODO(benoitjacob): update this documentation. This 'present' state just
29became the past (February 2017).
30
31At the moment, this less-than-8-bit mode of gemmlowp is not much used in
32practice, because the implicit requantization of operands from 8bit to
33less-than-8bit turned out to be more expensive than initially expected, both in
34terms of speed and accuracy:
35
361.  Speed: the O(N^2) requantization is only negligible compared to the O(N^3)
37    compute kernel when the matrix size N is large enough; in practice, smaller
38    matrix sizes turned out to be very important, making the requantization
39    approach slower than expected.
40
412.  Accuracy: As neural networks were optimized for size, their sensitivity to
42    numerical accuracy increased. Then the approach of requantizing
43    already-quantized data turned out to be more wasteful of accuracy than we
44    could afford.
45
46### The future
47
48Less-than-8bit still probably has good prospects; what should be dropped is the
49requantization. In other words, in the future, we might have neural networkds
50trained right away for some bit depth lower than 8 bits. The resulting values
51would probably still be stored as 8 bits (unless the bit depth eventually
52becomes very low). Thus, no particular work would be needed in the packing
53stage; no overhead or loss of accuracy would be incurred anymore.
54
55In other words: the design of less-than-8-bit kernels is probably useful in the
56long run; what is on the way out is requantization and packing/unpacking-stage
57aspects.
58
59With that said, the rest of this page retains its old content about the present
60approach:
61
62## Public interface
63
64### The BitDepthSetting parameter in the EightBitIntGemm interface
65
66Accessing less-than-8-bit computation via the EightBitIntGemm is very simple:
67EightBitIntGemm takes a BitDepthSetting enum which allows to choose among a
68fixed set of supported bit-depth combinations.
69
70### The BitDepthParams parameter in the public/gemmlowp.h interface
71
72The public/gemmlowp.h interface exposes more extensive control over
73quantization, by means of a BitDepthParams template parameter, which is a type
74parameter, carrying information about: 1. The LHS and RHS bit depth, which can
75be set arbitrarily and independently; 2. The 'RoundingStrategy', which is the
76heuristic used to choose a rounding mode, based on the accumulation size (a.k.a.
77the "depth" dimension of the Gemm). Details can be seen in public/bit_depth.h.
78
79### How does BitDepth{Setting,Params} affect input/output uint8 matrix data?
80
81Input/output matrix data is all uint8's, ranging from 0 to 255, regardless of
82the BitDepth{Setting,Params}.
83
84So the BitDepth{Setting,Params} is only an internal detail. It only means to
85allow gemmlowp to use lower precision internally, but the input/output data
86format is unaffected.
87
88As far as the API contract goes, the only thing that the
89BitDepth{Setting,Params} does is to relax the accuracy requirement. With
90standard 8bit/8bit computation, gemmlowp is required to return the exact result
91as specified in [low-precision.md](low-precision.md). With lower bit depths,
92gemmlowp is no longer required to return an exact result.
93
94## Implementation
95
96Here we refer to the 3 stages of computation as described in
97[design.md](design.md), namely: packing, computation kernel, unpacking.
98
99The general idea is that at the packing stage, we requantize input (Lhs/Rhs)
100data to less-than-8-bit depths by scaling them, thus shrinking the range of the
101packed matrix entries; for instance, if the Rhs bit depth is to be 5 bits then
102packed Rhs matrix entries will be in the range [0 ... 31]. This then allows the
103GEMM kernel to use narrower accumulators without risking overflow, thus
104achieving higher arithmetic throughput. Finally, at the unpacking stage, it only
105remains to scale the result values to compensate for the scalings applied
106earlier.
107
108Let us go into more detail for each of those stages:
109
110### Packing stage
111
112The packing stage is where most of the work specific to the BitDepthParams takes
113place.
114
115Here, we have to scale input matrix values from their original range of [0 ...
116255] to the range specified by the BitDepthParams, which is [0 ... (2^N)-1]
117where N is the number of bits for the matrix at hand (Lhs or Rhs). For example,
118for a bit depth of 5 bits, we need to scale down to [0 ... 31].
119
120This scaling is what we call "requantization". The pedantic name matches the
121fact that this is actually quite nontrivial to do correctly i.e. in such a way
122that the result accuracy will be good enough for real-world applications. See
123the section below on requantization details.
124
125Concretely, this work happens in PackingRegisterBlock::Pack(), which calls
126Requantize(). This is in internal/pack.h. This code can be overridden for
127specific architectures, see internal/pack_neon.h.
128
129This requantization work is costly and makes packing slower. This means that, at
130least in our approach, less-than-8-bit computation is only interesting for
131large-enough, square-enough GEMMs where packing is only a small fraction of the
132overall cost. In cases where packing overhead is more prevalent (highly
133rectangular cases), less-than-8-bit is probably a waste of time as long as we
134treat it as an internal computation detail. What might help there, might be if
135we shrunk the input/output data format to lower memory bandwidth usage.
136
137### Computation kernel stage
138
139In principle, the computation kernel stage simply doesn't have to care about the
140bit depth at all. In fact, on architectures where we do not have specific
141optimized kernels for less-than-8-bit cases, we simply use our standard kernel
142there, and that's correct!
143
144However, while the kernel doesn't have to know about the fact that the operands
145are on less than 8 bits, it can use that information to make special
146optimizations that would be incorrect in the general 8-bit case and become
147correct here thanks to the more restricted range of inputs. That's the whole
148point of this less-than-8-bit computation idea.
149
150With Lhs entries guaranteed to be smaller than 2^N, and Rhs entries guaranteed
151to be smaller than 2^M, each product is thus guaranteed to be smaller than
1522^(M+N). Thus, one may accumulate 2^(16-(M+N)) such products and still be
153guaranteed that such an accumulator will be smaller than 2^16, and thus can be
154stored as a uint16 without risking overflow.
155
156For example, in the L7R5 case, the Lhs enties are on 7 bits (N=7) and the Rhs
157entries are on 5 bits (M=5), so each product fits in 12 bits and one can thus
158accumulate 16 ( = 2^(16-12)) such products into uint16 accumulators with no risk
159of overflow.
160
161This means that a computation kernel may use uint16 accumulators for several
162loop iterations (16 in the above example), provided that it is allowed to assume
163that inputs are in such restricted range.
164
165After this fixed number of loop iterations, the kernel must accumulate the local
166uint16 accumulators back into global uint32 accumulators.
167
168On SIMD architectures with suitable uint16 arithmetic, this in principle allows
169to multiply arithmetic throughput by up to 2x, since twice more accumulators now
170fit in each SIMD vector register. This is partially offset by the cost of
171accumulating back into global uint32 accumulators every several loop iterations,
172but our experience on ARM NEON has been that we still get quite close to a 2x
173speedup. See internal/kernel_neon.h, specifically
174NEON32Kernel12x4Depth2Assuming12BitProducts.
175
176### Unpacking stage
177
178At the unpacking stage, it only remains to scale the result values to compensate
179for the scaling of the inputs. This is easier because now we are expanding the
180range instead of shrinking it, so we don't need to worry about ways to minimize
181a loss of accuracy. We simply need to multiply result values by a constant
182fraction, rounding to nearest.
183
184Since the inputs were scaled by factors of (2^lhs_bits - 1)/255 and
185(2^rhs_bits - 1)/255 respectively, the scaling of the outputs needs to be by the
186following factor:
187
188                 255 * 255
189    -----------------------------------
190    (2^lhs_bits - 1) * (2^rhs_bits - 1)
191
192This is done by a MultiplyByConstantFraction function, see internal/unpack.h
193
194## Requantization details
195
196Here we go into more detail on the Requantize() function used at the packing
197stage to requantize input matrix data. See this function in internal/pack.h.
198
199It depends on the bit depth and on a rounding mode, and requantizes an input
200value in [0 ... 255] to the range [0 ... (2^N)-1] specified by the bit depth N.
201
202### Naive, bad rounding, that's plainly biased
203
204Naive and inaccurate ways to achieve this requantization include: 1. By shifting
205bits rights by (8-N) bits; 2. By multiplying by ((2^N) - 1) and dividing by 255.
206
207Both of those are biased in some way: 1. has the wrong "derivative", since it
208approximates (((2^N) - 1) / 255) by ((2^N) / 256) ; 2. has bias since it
209effectively implements rounding towards 0.
210
211In practice, both of the above requantization functions give results that are
212too inaccurate in practice for the neural network that we tried (GoogLeNet).
213
214### Round-to-nearest rounding: unbiased in principle but not in practice
215
216The simplest fix is to avoid the bias in 2. by rounding-to-nearest instead of
217rounding towards 0. This can be achieved by doing
218
219dst = (src * maxval + rounding_offset) / 255;
220
221Where maxval = ((2^N) - 1) is the highest requantized value, and the
222rounding_offset can be set to
223
224rounding_offset = 127
225
226to achieve rounding-to-nearest (while the above rounding towards 0 corresponded
227to rounding_offset = 0).
228
229In principle, rounding-to-nearest is unbiased and optimal in various ways.
230
231In practice though, our input data is not random real numbers, but
232already-quantized 8-bit values. That means that even in the best case, there
233would be at most 255 different possible input values; in practice, we generally
234see the input values distributed non-uniformly in that range, so that a majority
235of input values tend to be in a much smaller range. See test/test_data.cc.
236
237Having a large part of the input values in a very small finite set, means that
238the corresponding rounding errors are also in a very small finite set, which can
239be small enough that the mean of these rounding errors is significantly
240different from 0. That rounding-to-nearest is "unbiased" only means that over a
241sufficiently large set of input values, the bias would become arbitrarily close
242to 0; here, the set of input values is effectively small enough that the
243resulting bias is significant.
244
245This leads to biasing the matrix product entries, resulting in an error that
246grows linearly with the depth dimension of the GEMM.
247
248### Probabilistic rounding: unbiased even on small finite input distributions
249
250To address that, we can instead use probabilistic rounding. The idea is that for
251instance if we have to round the value 3.8 to the nearest integer, we can round
252it to 3 with 20% probability and to 4 with probability 80%. If that value 3.8
253occurs many times, the mean requantized value will thus tend to 3.8.
254
255This amounts to keeping the above requantization formula,
256
257dst = (src * maxval + rounding_offset) / 255;
258
259but now the rounding_offset is a random value in [0 .. 254].
260
261This guarantees zero bias no matter how small the distribution of input values
262is.
263
264On the other hand, the variance of the error term here is higher than with
265rounding-to-nearest --- one can check that it is 2x higher.
266
267So the error term coming from the Central Limit Theorem, which grows with the
268square root of the accumulator depth i.e. the GEMM depth, will be 2x higher
269here.
270
271Still, for large enough GEMM depth, that is better than rounding-to-nearest
272which has an error term growing linearly with the GEMM depth.
273
274### Switching between rounding-to-nearest and probabilistic rounding
275
276Thus, for fixed input values and bit depths, we expect that probabilistic
277rounding will give more accurate results for large enough GEMM depths, while
278rounding-to-nearest will be more accurate for smaller GEMM depths.
279
280That is why use switch between these rounding modes based on GEMM depth, see
281ChooseRoundingMode in internal/bit_depth_util.h.
282
283It is based on a constant, kProbabilisticRoundingThreshold, defined in
284internal/common.h and empirically determined. See the comment there. It would be
285nice to better understand the statistics here and come up with better heuristics
286for this switching.
287
288### Choice of pseudorandom number generator
289
290We provide two PRNGs. The first is an 8-bit Xorshift. It is fast, naturally
291produces values ranging over an interval of width 255, which is what we need
292here (as opposed to an interval of width 256), and turns out, from empirical
293tests, to produce better results than a linear congruential generator (LCG).
294That's unfortunate, as a 8-bit LCG performs better (we confirmed that on various
295ARM devices) but we need as perfect un-biased-ness as we can get.
296
297The second is an "add-mod" sequence generator, which generates non-random values
298in the sequence x = (x + 97) % 255. This generates a low-discrepancy sequence
299that minimizes the "clumpiness" of the random offsets (Thus, for example,
300quantizing a 3x3 matrix will have a maximum additive error of about 200 from the
301random offsets). While not random, this sequence performs well empirically for
302many quantizations. (For information about why 97 is a good value, see
303https://en.wikipedia.org/wiki/Low-discrepancy_sequence#Additive_recurrence and
304http://mollwollfumble.blogspot.com/2011/03/subrandom-numbers.html 97/255 = 0.38;
3050.382 is the best choice. For discrete numbers, the choice must be relatively
306prime to the modulus. 97 is prime, so it is safely relatively prime to 255. 107
307is another near-optimal choice.
308
309The low-discrepancy sequence generator is the default.
310
311More details and results are given in a comment on the default PRNG in
312internal/pack.h. Interested users can change the PRNG used by setting
313DefaultRoundingGenerator in bit_depth_util.h.
314