The low-precision paradigm in gemmlowp, and how it's implemented
****************************************************************

Introduction
============

"Low-precision" means that the input and output matrix entries are integers
|
|
on at most 8 bits. The scalar type is uint8_t.
|
|
|
|
This isn't the same as just doing plain matrix arithmetic over uint8_t,
|
|
because that would overflow. To avoid overflow, we internally accumulate
|
|
results on more than 8 bits, and at the end we keep only some significant
|
|
8 bits. This relies on the caller providing suitable offset/multiplier/shift
|
|
parameters, which effectively govern how we extract some significant 8 bit
|
|
from our more-than-8bit temporary accumulators.
|
|
|
|
Gemmlowp supports further reducing precision below 8 bits. That is not
the subject of this document; for that, refer to doc/less-than-8-bit.txt.

The low-precision paradigm
==========================

gemmlowp is an implementation of the EightBitIntGemm paradigm, where quantized
matrix multiplication takes the following input parameters:
- the lhs matrix of uint8 quantized values
- the rhs matrix of uint8 quantized values
- the following int32 "quantization parameters", which control how the
  uint8 quantized values in the matrices are to be interpreted during the
  matrix computation:
  - lhs_offset
  - rhs_offset
  - result_offset
  - result_mult_int
  - result_shift

The mathematical expression to be computed is the result of the following steps:
1. Cast lhs entries from uint8 to int32 and add lhs_offset to each of them.
2. Cast rhs entries from uint8 to int32 and add rhs_offset to each of them.
3. Compute the int32 matrix product of the resulting lhs times rhs.
4. Add result_offset to each entry of the result.
5. Multiply each entry of the result by the following fraction, and round
   to the nearest integer:

     result_mult_int
     ---------------    (1)
     2^result_shift

6. Clamp the resulting int32 values to the [0..255] range and cast to uint8.

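As an illustration, the six steps above can be sketched as a scalar reference
implementation (the function name, argument order and row-major storage
convention here are our own, not gemmlowp's actual API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Reference (slow, scalar) implementation of the quantized matrix
// multiplication described in steps 1-6. lhs is rows x depth, rhs is
// depth x cols, both row-major.
std::vector<std::uint8_t> ReferenceEightBitIntGemm(
    const std::vector<std::uint8_t>& lhs, const std::vector<std::uint8_t>& rhs,
    int rows, int depth, int cols, std::int32_t lhs_offset,
    std::int32_t rhs_offset, std::int32_t result_offset,
    std::int32_t result_mult_int, std::int32_t result_shift) {
  std::vector<std::uint8_t> result(rows * cols);
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      // Steps 1-3: accumulate in int32 the product of offset-adjusted entries.
      std::int32_t acc = 0;
      for (int d = 0; d < depth; d++) {
        const std::int32_t lhs_val = lhs[r * depth + d] + lhs_offset;
        const std::int32_t rhs_val = rhs[d * cols + c] + rhs_offset;
        acc += lhs_val * rhs_val;
      }
      // Step 4: add result_offset.
      acc += result_offset;
      // Step 5: multiply by result_mult_int / 2^result_shift, rounding to
      // nearest by adding half of the divisor before shifting.
      acc *= result_mult_int;
      const std::int32_t rounding =
          result_shift > 0 ? (1 << (result_shift - 1)) : 0;
      acc = (acc + rounding) >> result_shift;
      // Step 6: clamp to [0..255] and cast to uint8.
      acc = std::min(255, std::max(0, acc));
      result[r * cols + c] = static_cast<std::uint8_t>(acc);
    }
  }
  return result;
}
```

For example, with 1x1 matrices lhs = {2}, rhs = {3}, both offsets 1,
result_offset 4, result_mult_int 1 and result_shift 2, the accumulator is
(2+1)*(3+1) + 4 = 16, and (16 + 2) >> 2 = 4.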
Thus the caller of this interface is expected to have precomputed suitable
quantization parameters.

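For instance, one common recipe for precomputing such parameters (this is our
own illustration; the document does not prescribe how the caller quantizes)
maps a known real-number range [rmin, rmax] onto uint8 by choosing a scale and
a zero-point such that real_value = scale * (uint8_value - zero_point); the
matrix offsets above would then typically be derived from the negated
zero-points:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper, not part of gemmlowp's API.
struct QuantizationParams {
  float scale;
  std::int32_t zero_point;  // uint8 value representing real 0
};

QuantizationParams ChooseQuantizationParams(float rmin, float rmax) {
  // Ensure that real 0 is exactly representable, as is conventional.
  rmin = std::min(rmin, 0.f);
  rmax = std::max(rmax, 0.f);
  const float scale = (rmax - rmin) / 255.f;
  // The zero-point is the uint8 value that maps to real 0, clamped to
  // the representable range.
  std::int32_t zero_point =
      static_cast<std::int32_t>(std::round(-rmin / scale));
  zero_point = std::min(255, std::max(0, zero_point));
  return {scale, zero_point};
}
```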
The rationale for these parameters is as follows:
- The three offsets may improve quantization accuracy in cases where the
  range of values is limited, and they also conveniently allow reducing all
  eight combinations of signednesses to just the unsigned*unsigned->unsigned
  case. One may at first glance worry that these offsets would incur
  substantial overhead in the GEMM computation, but that is actually not the
  case, thanks to a trick described below (see "Efficient handling of
  offsets").
- The result_mult_int and result_shift parameters allow approximating
  arbitrarily closely any real multiplier, as a fraction of the form given
  in (1) above, without using floating-point arithmetic and without using
  a division instruction (only a right shift).

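For example, a caller might pick the fraction (1) as follows (a sketch with a
hypothetical helper name; real code must additionally check that
result_mult_int and the intermediate products stay within int32 range):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Approximate a real multiplier by result_mult_int / 2^result_shift.
// Fixing a larger shift gives a finer approximation, as long as
// result_mult_int and the scaled accumulators still fit in int32.
void ApproximateMultiplier(double multiplier, std::int32_t* result_mult_int,
                           std::int32_t* result_shift) {
  const std::int32_t shift = 16;  // 16 fractional bits, chosen for this sketch
  *result_shift = shift;
  *result_mult_int =
      static_cast<std::int32_t>(std::round(multiplier * (1 << shift)));
}
```

For a multiplier of 0.1 this yields result_mult_int = round(0.1 * 65536) =
6554, i.e. the approximation 6554 / 65536 ~= 0.100006.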
Efficient handling of offsets
=============================

At first glance it may seem like the above-described quantized computation
scheme requires adding the lhs_offset and rhs_offset to each of the lhs and
rhs matrix entries.

Doing that in the GEMM kernel would incur substantial overhead:
- It would mean extra arithmetic work in the GEMM kernel;
- It would require storing the lhs_offset and rhs_offset in registers,
  which would eat into the register space available for the rest of the
  GEMM kernel.

One may then consider adding the lhs_offset and rhs_offset once and for all
to lhs and rhs blocks, in a GEMM implementation operating on one lhs block
and one rhs block at a time. However, doing so would require storing lhs and
rhs blocks in 32 bit (or at least in 16 bit in real-world cases), which would
partially negate the memory bandwidth benefits of low-precision computation.

Fortunately, there is another way to handle these offsets that has none of
the costs of the approaches described above. The idea is as follows.

Let P denote the matrix shaped like lhs, but filled with 1's.
Let Q denote the matrix shaped like rhs, but filled with 1's.

Adding lhs_offset to each entry of lhs means adding lhs_offset * P to lhs.
Adding rhs_offset to each entry of rhs means adding rhs_offset * Q to rhs.

Thus, as far as handling lhs_offset and rhs_offset goes, the matrix product
to be computed is:

  (lhs + lhs_offset * P) * (rhs + rhs_offset * Q)

Expanding this (using distributivity of matrix multiplication over addition),
we see that the above product is equal to the following sum of 4 terms:

    lhs * rhs                              (2)
  + lhs_offset * P * rhs
  + lhs * rhs_offset * Q
  + lhs_offset * rhs_offset * P * Q

The first term, lhs * rhs, is just the matrix multiplication ignoring the
offsets, i.e. as if lhs_offset==rhs_offset==0. Our claim here is that this
is all we have to compute in the GEMM kernel.

In the second term, lhs_offset * P * rhs, notice that since P is filled
with 1's, P * rhs has all its rows equal to each other, and equal to the
row-vector of sums of all the entries in each column of rhs.

Thus, we can compute the second term, lhs_offset * P * rhs, by summing
each column of rhs. This produces a single row-vector, and in order to add the
second term, we simply need to add this row-vector (multiplied by lhs_offset)
to each row of the result. This is just a rank one update of the result
(equivalently, the second term is a rank one matrix), and we can efficiently
store it as a single vector.

The third term, lhs * rhs_offset * Q, is entirely similar to the second one,
and can be similarly computed by summing each row of lhs, storing this in a
single column-vector, and later multiplying these sums by rhs_offset.

The fourth term is a single constant, repeated into all the entries of the
matrix. The matrix P * Q is filled with the single constant value 'depth'
(the depth of the matrix product, i.e. the number of columns of the lhs).
Thus the fourth term is simply the rank zero update adding this constant
to each matrix entry:

  lhs_offset * rhs_offset * depth

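The whole decomposition can be checked numerically on a tiny example (this is
our own illustrative test code, not part of gemmlowp):

```cpp
#include <cstdint>

// Verifies, on a 2x2 * 2x2 example of depth 2, that the direct
// computation with offsets equals the sum of the four terms in (2).
bool OffsetDecompositionMatches() {
  const std::int32_t lhs[] = {1, 2, 3, 4};  // 2x2, row-major
  const std::int32_t rhs[] = {5, 6, 7, 8};  // 2x2, row-major
  const std::int32_t lhs_offset = 10, rhs_offset = 20;
  const int rows = 2, depth = 2, cols = 2;
  for (int r = 0; r < rows; r++) {
    for (int c = 0; c < cols; c++) {
      // Direct computation: offsets added to every lhs/rhs entry.
      std::int32_t direct = 0;
      // First term, plus the ingredients of the two rank one updates.
      std::int32_t lhs_times_rhs = 0, col_sum = 0, row_sum = 0;
      for (int d = 0; d < depth; d++) {
        direct += (lhs[r * depth + d] + lhs_offset) *
                  (rhs[d * cols + c] + rhs_offset);
        lhs_times_rhs += lhs[r * depth + d] * rhs[d * cols + c];
        col_sum += rhs[d * cols + c];   // sum of column c of rhs
        row_sum += lhs[r * depth + d];  // sum of row r of lhs
      }
      const std::int32_t four_terms =
          lhs_times_rhs                       // lhs * rhs
          + lhs_offset * col_sum              // lhs_offset * P * rhs
          + rhs_offset * row_sum              // lhs * rhs_offset * Q
          + lhs_offset * rhs_offset * depth;  // rank zero term
      if (direct != four_terms) return false;
    }
  }
  return true;
}
```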
Implementation of this technique in gemmlowp
============================================

In gemmlowp, at the packing stage (where we traverse blocks of the lhs and rhs
to prepare them for efficient repeated traversal by the kernel), we compute
the sum of each row of the lhs block and the sum of each column of the rhs
block.

See in internal/pack.h, in the PackedSideBlock class, the following member:

  // Handle on the additional buffer backing the vector of sums of slices
  // associated with this block. Owned.
  Allocator::Handle sums_of_each_slice_handle_;

sums_of_each_slice_handle_ is the handle to the buffer allocated to store
the vector containing the sums of rows of lhs, or the sums of columns of rhs.

After these rank one updates have been computed at the packing stage, they are
ignored at the compute kernel stage, since that stage is only concerned
with the first of the four terms in (2); they are only used at the unpacking
stage. See the default/reference implementation, UnpackResultImpl, in
internal/unpack.h.