1METAPROGRAMMED GEMM 2=================== 3 4The two main goals of this library are: 5- providing a new matrix multiplication kernel. 6- providing the optimized codepaths for as many possible user scenarios without 7 enforcing additional input data constraints (padding, sizes, strides, layout) 8 9To enable this code add -DGEMMLOWP_USE_META_FASTPATH to your build setup. 10 11The new kernel 12-------------- 13 14The multiplication kernel - the most inner loop of the matrix multiplication, 15which is responsible for the row/column products was rewritten. The new code 16produces a 3x3 result patch and processes the row/column arrays in 8 element 17packs (the kernel 'shape' is 3x3x8 compared to the previous 12x4x2). By using 18specialized 8bit multiplication, aggregating to vector aggregators and then 19reduction with parallel horizontal addition we devised code that achieved 20higher arithmetical density (arithmetical operation per assembly instruction). 21The arithmetical performance of the new kernel exceeds 18 GOps/s on a vanilla 22Nexus 5 phone (which is practically peak for this device). 23 24In order to feed the kernel with input data and minimize the number of 25instructions other than the arithmetical operations a different packing 26scheme was used. Three rows (columns) are interweaved every 8 elements so that 27they can be read from continuous memory in one op inside the kernel. Additional 28memory preload hint operations are inserted into the kernel to hide memory 29latency behind arithmetical operations. 30 31Generated code 32-------------- 33 34The basic kernel used in this approach is of shape 3x3x8. Obviously this 35kernel can be easily applied to multipications where matrix sizes are: 36M x K, K x N where M and N are multiplies of 3 and K is a multiply of 8. 37 38We rejected two obvious solutions of: padding the matrix sizes to appropriate 39sizes, or using the reference implementation for the leftovers. Neither did 40we consider enforcing extra constraints on the caller. 41 42In order to allow all matrix sizes the kernels processing all combinations of 431, 2 or 3 rows and 1, 2 or 3 columns are required. Similarily to allow all 44possible depths the leftover values (up to 7 elements) needed to be handled. 45 46Instead of writing those kernels by hand we decided to generate them with 47some python scripts. 9 Versions of the multiplication kernel were prepared. 48Additionally packing and unpacking code for different row/column counts and 49depth leftovers was generated. Finally different code was generated for 50aligned memory reads/writes and unaligned memory reads/writes. 51 52Using those multiplication and packing/unpacking primitives 144 gemm function 53versions were prepared. On top of them one high level gemm function that would 54switch to one of those preoptimized versions. 55 56This approach allowed moving all unnecessary branching and conditional execution 57outside of the inner loops. It also allowed removing of all short loops required 58for leftover handling. Finally aligned memory reads/writes are used everywhere 59where the provided input data allows. 60 61Results 62------- 63 64The library shows up to 35% faster gemm execution in some cases (e.g. ImageNet 65benchmark). 66 67Files 68----- 69 70single_thread_gemm.h 71-- generated ARM/NEON 8bit x 8bit gemm implementation. Contains all the 72 optimized, unrolled and curried pack/unpack, and multiply procedures and 73 a single gemm function that switches between the optimized versions based 74 on the runtime parameters. 75 76multi_thread_gemm.h 77-- a simple parallelization scheme for the gemm function. 78 79generators/gemm_NxMxK_neon.py 80-- script that generates the single_thread_gemm.h header library. 81 Usage: python gemm_NxMxK_neon > single_thread_gemm.h 82