Kernels in gemmlowp
*******************

Kernels provide an inner-loop implementation, and a format
==========================================================

Here we assume familiarity with the concepts of kernels and of packing
as explained in doc/design.txt.

gemmlowp is designed to be easily extensible to different architectures and
other low-level details, while achieving high performance. Thus a line had to
be drawn between the generic GEMM code and the specific parts that need to
be manually designed for each architecture. The design choice made in
gemmlowp is to have easily swappable GEMM kernels.

In itself, a GEMM kernel is just an implementation of the inner-most loop
in a GEMM (that inner-most loop has to be over the 'depth' dimension, so as
to be able to accumulate into a small enough number of accumulators to fit
in registers).

Thus, by itself, a GEMM kernel should be just a function computing a block
of GEMM.

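To make this concrete, here is a plain-C++ sketch of such an inner loop. This
is not gemmlowp code: the function name and the 2x2 block size are made up for
illustration, and a real kernel would use SIMD instructions. What it shows is
the essential shape: one loop over depth, accumulating a small fixed-size
block of the result in local variables that can live in registers.

```cpp
#include <cstdint>

// Hypothetical scalar micro-kernel computing a 2x2 block of dst += lhs * rhs.
// lhs is a packed 2 x depth block (row-major), rhs a packed depth x 2 block
// (column-major), so both sides are traversed contiguously along depth.
void MicroKernel2x2(const std::uint8_t* lhs, const std::uint8_t* rhs,
                    int depth, std::int32_t* dst) {
  // Four accumulators: few enough to stay in registers for the whole loop.
  std::int32_t acc00 = 0, acc01 = 0, acc10 = 0, acc11 = 0;
  for (int d = 0; d < depth; d++) {
    const std::int32_t l0 = lhs[0 * depth + d];  // lhs row 0, depth d
    const std::int32_t l1 = lhs[1 * depth + d];  // lhs row 1, depth d
    const std::int32_t r0 = rhs[0 * depth + d];  // rhs col 0, depth d
    const std::int32_t r1 = rhs[1 * depth + d];  // rhs col 1, depth d
    acc00 += l0 * r0;
    acc01 += l0 * r1;
    acc10 += l1 * r0;
    acc11 += l1 * r1;
  }
  // Store the finished 2x2 block (row-major).
  dst[0] = acc00;  dst[1] = acc01;
  dst[2] = acc10;  dst[3] = acc11;
}
```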
However, GEMM kernels may need to differ not just in how they implement this
computation, but also in the format of data that they operate on. Indeed,
in order to maximize the ratio of arithmetic instructions to memory access
instructions, GEMM kernels want to handle blocks as wide as possible given
the number of registers of the CPU architecture.

Thus, in order to allow efficient specialization to diverse architectures,
gemmlowp allows each GEMM kernel to dictate the format of data that it expects,
in addition to providing its inner-loop implementation.

The former is given by a 'Format' typedef, and the latter by a 'Run'
method.

A good example is to look at internal/kernel_neon.h, and specifically at
the NEONKernel12x4Depth2 kernel, which specifies its format as

typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>,
                     KernelSideFormat<CellFormat<4, 2>, 1> > Format;

The meaning of these terms is explained in the lengthy comment at the
top of internal/kernel.h. Here, they mean that this kernel handles at
each iteration (along the depth dimension):
  - 3 'cells' of size 4x2 each of the lhs, so a total lhs block
    of size 12x2
  - 1 'cell' of size 2x4 of the rhs.
In other words, this kernel handles 12 rows of the lhs and 4 columns of the
rhs, and handles two levels of depth at once. The 'cells' and 'CellFormat'
detail the layout of these 12x2 and 2x4 blocks.

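The compile-time bookkeeping behind such a Format typedef can be sketched as
follows. These are hypothetical, simplified re-creations, not the real
definitions (those live in internal/kernel.h and carry more parameters, such
as the cell's storage order); the point is that each term of the typedef
exposes sizes that other code can query at compile time.

```cpp
// Simplified, hypothetical stand-ins for the format traits in
// internal/kernel.h.
template <int tWidth, int tDepth>
struct CellFormat {
  static const int kWidth = tWidth;  // rows (lhs) or columns (rhs) per cell
  static const int kDepth = tDepth;  // depth levels per cell
};

template <typename tCellFormat, int tCells>
struct KernelSideFormat {
  typedef tCellFormat Cell;
  static const int kCells = tCells;
  static const int kWidth = kCells * Cell::kWidth;  // total width of the side
};

template <typename tLhs, typename tRhs>
struct KernelFormat {
  typedef tLhs Lhs;
  typedef tRhs Rhs;
  static const int kDepth = Lhs::Cell::kDepth;  // depth handled per iteration
};

// The NEONKernel12x4Depth2 format from above, restated with these traits:
typedef KernelFormat<KernelSideFormat<CellFormat<4, 2>, 3>,
                     KernelSideFormat<CellFormat<4, 2>, 1> > Format;
```

With these definitions, Format::Lhs::kWidth is 12, Format::Rhs::kWidth is 4,
and Format::kDepth is 2, matching the 12x4-with-depth-2 description above.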
This kernel then loads these 12x2 and 2x4 blocks and computes the corresponding
12x4 GEMM; for ease of reference let us paste the critical comment and code here:

"loop_NEONKernel12x4Depth2_%=:\n"

// Overview of register layout:
//
// A 2x4 cell of Rhs is stored in 16bit in d0--d1 (q0).
// A 12x2 block of 3 4x2 cells Lhs is stored in 16bit in d2--d7
// (q1--q3).
// A 12x4 block of accumulators is stored in 32bit in q4--q15.
//
//                 +-----+-----+-----+-----+
//                 |d0[0]|d0[1]|d0[2]|d0[3]|
//            Rhs  +-----+-----+-----+-----+
//                 |d1[0]|d1[1]|d1[2]|d1[3]|
//                 +-----+-----+-----+-----+
//
//                 |     |     |     |     |
//
//   Lhs           |     |     |     |     |
//
// +--+--+ - - - - +-----+-----+-----+-----+
// |d2|d3|         | q4  | q5  | q6  | q7  |
// |d2|d3|         | q4  | q5  | q6  | q7  |
// |d2|d3|         | q4  | q5  | q6  | q7  |
// |d2|d3|         | q4  | q5  | q6  | q7  |
// +--+--+ - - - - +-----+-----+-----+-----+
// |d4|d5|         | q8  | q9  | q10 | q11 |
// |d4|d5|         | q8  | q9  | q10 | q11 |
// |d4|d5|         | q8  | q9  | q10 | q11 |
// |d4|d5|         | q8  | q9  | q10 | q11 |
// +--+--+ - - - - +-----+-----+-----+-----+
// |d6|d7|         | q12 | q13 | q14 | q15 |
// |d6|d7|         | q12 | q13 | q14 | q15 |
// |d6|d7|         | q12 | q13 | q14 | q15 |
// |d6|d7|         | q12 | q13 | q14 | q15 |
// +--+--+ - - - - +-----+-----+-----+-----+
//
//                        Accumulator

// Load 1 Rhs cell of size 2x4
"vld1.8 {d0}, [%[rhs_ptr]:64]!\n"

// Load 3 Lhs cells of size 4x2 each
"vld1.8 {d2}, [%[lhs_ptr]:64]!\n"
"vld1.8 {d4}, [%[lhs_ptr]:64]!\n"
"vld1.8 {d6}, [%[lhs_ptr]:64]!\n"

// Expand Lhs/Rhs cells to 16 bit.
"vmovl.u8 q0, d0\n"
"vmovl.u8 q1, d2\n"
"vmovl.u8 q2, d4\n"
"vmovl.u8 q3, d6\n"

// Multiply-accumulate, level of depth 0
"vmlal.u16 q4, d2, d0[0]\n"
"vmlal.u16 q5, d2, d0[1]\n"
"vmlal.u16 q6, d2, d0[2]\n"
"vmlal.u16 q7, d2, d0[3]\n"
"vmlal.u16 q8, d4, d0[0]\n"
"vmlal.u16 q9, d4, d0[1]\n"
"vmlal.u16 q10, d4, d0[2]\n"
"vmlal.u16 q11, d4, d0[3]\n"
"vmlal.u16 q12, d6, d0[0]\n"
"vmlal.u16 q13, d6, d0[1]\n"
"vmlal.u16 q14, d6, d0[2]\n"
"vmlal.u16 q15, d6, d0[3]\n"

// Multiply-accumulate, level of depth 1
"vmlal.u16 q4, d3, d1[0]\n"
"vmlal.u16 q5, d3, d1[1]\n"
"vmlal.u16 q6, d3, d1[2]\n"
"vmlal.u16 q7, d3, d1[3]\n"
"vmlal.u16 q8, d5, d1[0]\n"
"vmlal.u16 q9, d5, d1[1]\n"
"vmlal.u16 q10, d5, d1[2]\n"
"vmlal.u16 q11, d5, d1[3]\n"
"vmlal.u16 q12, d7, d1[0]\n"
"vmlal.u16 q13, d7, d1[1]\n"
"vmlal.u16 q14, d7, d1[2]\n"
"vmlal.u16 q15, d7, d1[3]\n"

// Loop. Decrement loop index (depth) by 2, since we just handled 2
// levels of depth (Kernel::kDepth=2).
"subs %[run_depth], #2\n"
"bne loop_NEONKernel12x4Depth2_%=\n"


Packing code adapts to the format chosen by the kernel
======================================================

As explained in doc/design.txt, gemmlowp starts by packing blocks of the
lhs and rhs matrices for optimally efficient traversal by the kernel. This
depends on fine details of the kernel format, in ways that can only be
efficiently handled by knowing these kernel format details at compile-time.

This is the reason why all the code in internal/pack.h is templated on
the corresponding kernel format.

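For instance, packing one 12x2 lhs block for the NEONKernel12x4Depth2 format
above might look like the following. This is a hypothetical, unoptimized
sketch, not gemmlowp's pack.h; it assumes a depth-major layout within each
4x2 cell, consistent with how the kernel's vmovl step splits each loaded
cell into a depth-0 half (d2) and a depth-1 half (d3).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: pack a 12x2 lhs block (row-major, given row stride)
// into 3 contiguous 4x2 cells, so the kernel can load each cell with a
// single 64-bit vld1.
std::vector<std::uint8_t> PackLhsBlock12x2(const std::uint8_t* lhs,
                                           int stride) {
  std::vector<std::uint8_t> packed;
  packed.reserve(24);
  for (int cell = 0; cell < 3; cell++) {  // 3 cells of 4 rows each
    for (int d = 0; d < 2; d++) {         // depth-major within a cell
      for (int r = 0; r < 4; r++) {       // 4 rows per cell
        packed.push_back(lhs[(4 * cell + r) * stride + d]);
      }
    }
  }
  return packed;
}
```

Each group of 8 packed bytes is one cell: 4 rows at depth 0 followed by the
same 4 rows at depth 1, matching the d2/d3 split in the kernel above.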
The code in internal/pack.h isn't tightly optimized by itself, but it is
structured in such a way that the critical code is in a template,
PackingRegisterBlock, that can easily be specialized to override the slow
generic code with fast specific packing code for specific formats, on
specific platforms.

See internal/pack_neon.h, which provides NEON specializations of the
packing code for the particular kernel formats that are used by the NEON
kernels in internal/kernel_neon.h.


Wrapping up: how to optimize gemmlowp for a CPU architecture
============================================================

In conclusion, the key feature of gemmlowp when it comes to efficiently
supporting a specific CPU architecture is that it allows one to freely replace
the inner loop of the GEMM by providing one's own GEMM kernel, which is
also free to dictate its required data layout; each data layout then also
needs optimized packing code. The steps are thus:
  1) Freely design a GEMM kernel with a freely chosen data layout.
  2) Implement the GEMM kernel, similar to internal/kernel_neon.h.
  3) Implement the optimized packing code, similar to internal/pack_neon.h.