# Overview of gemmlowp design

## Primer on GEMM, kernels, and cache friendliness

gemmlowp, like most GEMMs, implements the straightforward matrix multiplication
algorithm, which takes n^3 multiply-accumulate instructions for n*n sized
matrices. Because the arithmetic complexity grows more quickly than the memory
complexity (n^3 vs. n^2), memory accesses are redundant (each matrix entry is
accessed n times). A large part of a GEMM's performance and design goes toward
minimizing the inefficiency resulting from these redundant memory accesses.
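
For reference, here is the straightforward algorithm in plain C++ (a minimal
sketch with illustrative names, not gemmlowp code):

```
// Textbook n^3 matrix multiplication: result = lhs * rhs, with all matrices
// n x n and stored row-major in plain arrays. Each entry of lhs and rhs is
// read n times, which is the redundancy discussed above.
void NaiveGemm(const float* lhs, const float* rhs, float* result, int n) {
  for (int r = 0; r < n; r++) {
    for (int c = 0; c < n; c++) {
      float accum = 0;
      for (int d = 0; d < n; d++) {
        accum += lhs[r * n + d] * rhs[d * n + c];
      }
      result[r * n + c] = accum;
    }
  }
}
```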

Ultimately, once values are loaded into CPU registers, they cost nothing to
access, so as long as we can work within registers, this problem doesn't exist.
Thus, in order to be efficient, a GEMM's inner loops must make wise use of the
available registers to do as much arithmetic work as possible before loading
more data from memory into registers. This means that a GEMM implementation
needs architecture-specific inner loops, tailored to details such as the number
of registers, and typically written in assembly. This architecture-specific
'inner loops' component is referred to as the GEMM kernel. (More details about
kernels are in [kernel.md](kernel.md)).
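
To make the register-reuse idea concrete, here is a scalar C++ analogue of what
a kernel does (a minimal sketch; the 4x4 tile shape and names are illustrative,
not gemmlowp's actual kernel format, and real kernels use SIMD registers and
assembly):

```
// Accumulate a 4x4 tile of the result in local variables, which the compiler
// can keep in registers. Each loaded lhs/rhs value is reused 4 times before
// the next load, instead of once as in the naive loop above.
// Assumes row0/col0 are chosen so that the 4x4 tile fits inside the matrices.
void Kernel4x4(const float* lhs, const float* rhs, float* result, int n,
               int row0, int col0) {
  float acc[4][4] = {};
  for (int d = 0; d < n; d++) {
    float lhs_vals[4], rhs_vals[4];
    // 8 loads per depth step...
    for (int i = 0; i < 4; i++) lhs_vals[i] = lhs[(row0 + i) * n + d];
    for (int j = 0; j < 4; j++) rhs_vals[j] = rhs[d * n + (col0 + j)];
    // ...feeding 16 multiply-accumulates on register-resident values.
    for (int i = 0; i < 4; i++)
      for (int j = 0; j < 4; j++) acc[i][j] += lhs_vals[i] * rhs_vals[j];
  }
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++) result[(row0 + i) * n + (col0 + j)] = acc[i][j];
}
```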

However, only small blocks can fit at a given time in registers, so at larger
scales one needs to repeatedly load blocks of matrices from memory, and these
accesses are redundant for the reason outlined above. The way that one minimizes
the resulting inefficiency is by organizing for cache locality, so that most of
these accesses hit the L1 cache, and most of the remaining ones hit the L2
cache, etc.

This is achieved by subdividing the matrices into blocks sized to fit in L2
cache, and subdividing these blocks into sub-blocks sized to fit in L1 cache,
and performing the matrix multiplication one such block at a time.

In practice, it tends to pay off to "pack" input blocks for optimally efficient
traversal by the kernel, since they will be traversed multiple times. "Packing"
means at least reordering the data layout for 1) simple access patterns that fit
the CPU's cache behavior (in particular, the cache line size), and 2) simple
loading into SIMD vector registers by the kernel.
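
For illustration, here is a minimal packing routine (a sketch only; the actual
packed formats used by gemmlowp are more involved and are described in
[packing.md](packing.md)):

```
#include <cstdint>
#include <vector>

// Copy a depth x width block of a row-major rhs matrix into a contiguous
// buffer, laid out column by column so that the kernel can read it with a
// simple forward traversal along the depth dimension.
void PackRhsBlock(const std::uint8_t* rhs, int rhs_cols, int col0, int width,
                  int depth, std::vector<std::uint8_t>* packed) {
  packed->resize(width * depth);
  std::uint8_t* dst = packed->data();
  for (int c = 0; c < width; c++) {    // one packed column at a time
    for (int d = 0; d < depth; d++) {  // contiguous along the depth dimension
      *dst++ = rhs[d * rhs_cols + (col0 + c)];
    }
  }
}
```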

So a typical GEMM, in pseudo-code, tends to look like this:

```
allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        kernel(some_lhs_sub_block, some_rhs_sub_block);
      }
    }
  }
}
```

## Impact of low-precision computation on gemmlowp design

Refer to [low-precision.md](low-precision.md) for specifics of the
low-precision-computation paradigm and how it's implemented in gemmlowp.

Inputs and outputs are matrices of uint8 values, but internally we are
accumulating int32 values, only converting them back to uint8 at the end. This
means that we need to store a block of int32 accumulators at a time. We compute
a block of the result in int32 accumulators and then we "unpack" it into the
destination matrix all at once. In this way, we minimize the amount of memory
used to store int32 values at a given time.
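
For illustration, the final int32-to-uint8 conversion is of the following
general shape (a sketch only, using offset/multiplier/shift parameters like
those visible in the code excerpt further below; the exact arithmetic and the
output pipeline gemmlowp actually uses are described in
[low-precision.md](low-precision.md) and [output.md](output.md)):

```
#include <algorithm>
#include <cstdint>

// Convert one int32 accumulator to a uint8 result value: apply an offset,
// an integer multiplier and a rounding right shift, then clamp to [0, 255].
std::uint8_t Requantize(std::int32_t accum, std::int32_t result_offset,
                        std::int32_t result_mult_int,
                        std::int32_t result_shift) {
  const std::int32_t rounding =
      (result_shift < 1) ? 0 : (1 << (result_shift - 1));
  const std::int32_t scaled =
      ((accum + result_offset) * result_mult_int + rounding) >> result_shift;
  return static_cast<std::uint8_t>(
      std::min<std::int32_t>(255, std::max<std::int32_t>(0, scaled)));
}
```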

Because of that, besides the "pack" and "kernel" stages outlined above, a third
stage is needed in gemmlowp, which we call "unpack". Thus we arrive at the
3-stage computation scheme that gemmlowp uses:

1.  Pack lhs/rhs blocks from the input matrices.
2.  Compute the product of the packed blocks, using the kernel.
3.  Unpack the result block into the output matrix.

The pseudo-code overview of gemmlowp now looks like:

```
allocate(some_lhs_L2_block);
allocate(some_rhs_L2_block);
// new: temp storage for int32 accums
allocate(some_int32_accumulators_block);
for (some_lhs_L2_block) {
  pack(some_lhs_L2_block);
  for (some_rhs_L2_block) {
    pack(some_rhs_L2_block);
    for (some_lhs_sub_block in some_lhs_L2_block) {
      for (some_rhs_sub_block in some_rhs_L2_block) {
        // new: pass int32 accums to kernel
        kernel(&some_int32_accumulators_block,
               some_lhs_sub_block,
               some_rhs_sub_block);
      }
    }
    // new: unpack int32 accums into destination matrix
    unpack(some_int32_accumulators_block);
  }
}
```

## Exploring gemmlowp code

The design outlined above can be readily matched to gemmlowp source code, in
particular in this file, which gives a simple GEMM implementation fitting in one
rather small function:

```
internal/single_thread_gemm.h
```

The reader can compare the above pseudo-code to the actual code in this file:

```
for (int r = 0; r < rows; r += block_params.l2_rows) {
  int rs = std::min(block_params.l2_rows, rows - r);

  PackLhs(&packed_lhs, lhs.block(r, 0, rs, depth));

  for (int c = 0; c < cols; c += block_params.l2_cols) {
    int cs = std::min(block_params.l2_cols, cols - c);

    if (!pack_rhs_once) {
      PackRhs(&packed_rhs, rhs.block(0, c, depth, cs));
    }

    Compute(kernel, block_params, &packed_result, packed_lhs, packed_rhs);

    auto result_block = result->block(r, c, rs, cs);
    UnpackResult(&result_block, packed_result, packed_lhs, packed_rhs, depth,
                 result_offset, result_mult_int, result_shift);
  }
}
```

The files in `internal/` fall into a few categories:

There are two top-level GEMM implementations:

*   [internal/single_thread_gemm.h](../internal/single_thread_gemm.h)
*   [internal/multi_thread_gemm.h](../internal/multi_thread_gemm.h)

They both call into pack/compute/unpack stages (see [kernel.md](kernel.md) and
[packing.md](packing.md)) implemented in the following files:

*   [internal/pack.h](../internal/pack.h)
*   [internal/compute.h](../internal/compute.h)
*   [internal/unpack.h](../internal/unpack.h)
    *   This in turn calls into [internal/output.h](../internal/output.h) for
        the output pipeline (see [output.md](output.md))

The pack.h and unpack.h files contain generic templated code that can be
overridden by optimized code in template specializations; for example, see the
NEON optimized code here:

*   [internal/pack_neon.h](../internal/pack_neon.h)
*   [internal/unpack_neon.h](../internal/unpack_neon.h)
    *   This in turn calls into
        [internal/output_neon.h](../internal/output_neon.h)
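
This generic/specialized split is the standard C++ template specialization
pattern; the following sketch illustrates the idea with hypothetical type
names, not gemmlowp's actual ones:

```
// Generic, portable implementation, parameterized on the packed block format.
template <typename PackedBlockFormat>
struct PackImpl {
  static void Run(/* generic packing code for any format */) {}
};

// A format for which a faster, NEON-optimized implementation exists.
struct SomeNeonFriendlyFormat {};

// Specialization that overrides the generic code for that particular format.
template <>
struct PackImpl<SomeNeonFriendlyFormat> {
  static void Run(/* hand-optimized NEON packing code */) {}
};
```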

The compute stage contains generic code in compute.h that only calls into
optimized code through the Kernel::Run() entry point. Each kernel is basically
just a struct offering a Run() implementation; see the NEON kernels in:

*   [internal/kernel_neon.h](../internal/kernel_neon.h)
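
The general shape of such a kernel struct is sketched below (hypothetical names
and signature, not the actual gemmlowp kernel interface; see
[kernel.md](kernel.md) for the real kernel format):

```
#include <cstdint>

// A kernel is a struct whose Run() method implements the architecture-specific
// inner loops, accumulating into a block of int32 accumulators.
struct SomeArchKernel {
  void Run(std::int32_t* accum_block, const std::uint8_t* packed_lhs_block,
           const std::uint8_t* packed_rhs_block, int depth) const {
    // Architecture-specific inner loops, typically written in (inline)
    // assembly, go here.
  }
};
```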