The packing stage in gemmlowp
*****************************


Introduction
============

We assume familiarity with doc/design.txt and with the overall
3 stages of computations described there: packing, kernel, unpacking.

This page goes into more details about the first stage: packing.

We also assume familiarity with doc/kernels.txt, as it describes the
packed format requirements that the kernels expect; that is essentially
the contract that the packing stage must honor.

Some parts below also assume familiarity with doc/low-precision.txt,
as the packing stage also has to compute the vectors of sums of entries
along the depth dimension, as described there.


The storage order of packed blocks, partly hidden behind sequential access
==========================================================================

As explained in doc/design.txt, the primary purpose of packing is to
ensure that when the kernel traverses Lhs/Rhs matrix data, it can do
so efficiently thanks to having the data stored in an order that is
as similar as possible to the order in which the compute stage has to
traverse this data.

This traversal order is nontrivial for the reasons outlined in
doc/design.txt: at the innermost level, one tries to work within registers
as much as possible; at the next level, one tries to stay within L1 cache
as much as possible. The packed blocks that we handle are supposed
to fit entirely in L2 cache.

Thus it has become standard in GEMM design to describe complicated
"Z-order" or "fractal order" storage for packed blocks.

However, we should keep in mind that the whole point of the packed storage
order is to be as similar as possible to the order of traversal during
the compute stage. The storage order doesn't matter in itself; the only
thing that matters is simple access patterns during the compute stage.

This suggests the following approach to implementing packing: take the
exact same hierarchy of nested loops of the compute stage, drop the loops
that are not relevant to the side (Lhs or Rhs) being packed, and try to use
mostly sequential access to the destination packed data.

This hierarchy of nested loops can be seen in PackSideBlockImpl
(PackL2, PackL1, PackRun); compare with the similar hierarchy of loops
in internal/compute.h.
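
To make the parallel concrete, here is a minimal, self-contained sketch (not
the actual gemmlowp code) of what this reduced loop hierarchy looks like for
packing one side. The PackParams struct and the loop bounds are illustrative
stand-ins for the real block parameters; only the names PackL2, PackL1 and
PackRun correspond to actual gemmlowp functions.

  #include <algorithm>

  struct PackParams {        // hypothetical stand-in for the real block params
    int l2_width, l2_depth;  // dimensions of the whole packed (L2) block
    int l1_width, l1_depth;  // dimensions of one L1-sized sub-block
    int kernel_width;        // width handled by one kernel run
  };

  // Innermost loop: packs one run, writing the destination sequentially.
  void PackRun(int start_width, int start_depth, int width, int depth) {
    // ... copy 'width' x 'depth' source entries into the packed destination ...
  }

  // Middle loop: traverse one L1-sized block, one kernel run at a time.
  void PackL1(const PackParams& p, int start_width, int width,
              int start_depth, int depth) {
    for (int w = 0; w < width; w += p.kernel_width) {
      PackRun(start_width + w, start_depth,
              std::min(p.kernel_width, width - w), depth);
    }
  }

  // Outer loop: traverse the whole packed (L2) block in L1-sized pieces.
  void PackL2(const PackParams& p) {
    for (int d = 0; d < p.l2_depth; d += p.l1_depth) {
      const int depth = std::min(p.l1_depth, p.l2_depth - d);
      for (int w = 0; w < p.l2_width; w += p.l1_width) {
        const int width = std::min(p.l1_width, p.l2_width - w);
        PackL1(p, w, width, d, depth);
      }
    }
  }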

In this way, the more intricate small-scale details of the packed data format
never need to be made explicit (which would be complicated). We still use
some "seeking", but only at larger scales, where the storage order is less
complicated to describe.


Sequential access to PackedSideBlock data
-----------------------------------------

See PackedSideBlock in internal/pack.h, specifically the following data
members:

  // Handle on the buffer backing this packed block. Owned.
  Allocator::Handle data_handle_;

and:

  // pos_ is the current position in the buffer, which we access
  // sequentially, like a file.
  // The idea is that we pack data in the same order as it is
  // going to be traversed during the computation, which for
  // cache-friendliness reasons is complicated to random-access,
  // as the offsets calculations would be intricate. So we
  // give up random-access addressing, and instead content ourselves
  // with sequential access.
  //
  // pos_ is mutable because during the computation we will want to
  // be able to iterate on the data in a const PackedSideBlock.
  mutable int pos_;

The methods exposing sequential access are:

  std::uint8_t* current_data() {
    return allocator_->GetPointer<std::uint8_t>(data_handle_) + pos_;
  }

and:

  void seek_next_cell() const { pos_ += KernelSideFormat::Cell::kSize; }

  void seek_forward_n_cells(int n) const {
    pos_ += n * KernelSideFormat::Cell::kSize;
  }
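
As an illustration (this is not code from gemmlowp), a consumer of a packed
block only ever needs a loop of the following shape, where each iteration
reads one cell and then advances the sequential position. PackedBlock,
num_cells and ConsumeCell are hypothetical stand-ins.

  #include <cstdint>

  // Hypothetical usage sketch of the sequential-access interface.
  template <typename PackedBlock, typename CellConsumer>
  void WalkCells(PackedBlock& block, int num_cells, CellConsumer ConsumeCell) {
    for (int i = 0; i < num_cells; ++i) {
      const std::uint8_t* cell = block.current_data();  // start of current cell
      ConsumeCell(cell);       // reads Cell::kSize bytes from the current cell
      block.seek_next_cell();  // advance pos_ sequentially to the next cell
    }
  }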


Random access to PackedSideBlock data at larger scales
------------------------------------------------------

We still need some random access at larger scales (with high granularity),
which is unavoidable since GEMM is O(n^3) and has to traverse each of the
O(n^2) inputs O(n) times.

The watershed between sequential access and random access is at the level
of a 'Run'. Throughout gemmlowp we consistently use the term 'Run' to refer
to the innermost GEMM loop in the depth dimension. That's the critical
inner loop that must be as fast as possible, and for which we absolutely
want sequential access to packed data, so that the storage order is optimal
by construction. At larger scales, i.e. between runs, we accept that
the storage order is less optimal; since it's also less intricate, it's
not too hard to implement random access there.

This is done by the seek_run method:

  void seek_run(int start_width, int start_depth) const {
    int kernel_run_depth =
        std::min<int>(params_.l1_depth, params_.l2_depth - start_depth);
    pos_ = params_.l2_width * start_depth + start_width * kernel_run_depth;
  }

We see that the formula involves the l1_depth parameter, which is how the
packed storage order depends on L1 cache size. Again, the whole packed
block is supposed to fit in L2 cache.
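
For intuition, here is a worked example with made-up numbers (these are not
values that gemmlowp necessarily uses):

  // Suppose l2_width = 12, l2_depth = 1024, l1_depth = 256 (hypothetical).
  // Then seek_run(/*start_width=*/8, /*start_depth=*/512) computes:
  //   kernel_run_depth = min(256, 1024 - 512) = 256
  //   pos_ = 12 * 512 + 8 * 256 = 6144 + 2048 = 8192
  // i.e. skip the 12 * 512 entries packed for all earlier depth ranges, then,
  // within the current 256-deep slice, skip the 8 * 256 entries packed for the
  // lower width positions.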


The innermost loop of the packing stage, PackRun, and PackingRegisterBlock
==========================================================================

Keeping with our consistent usage of the term 'Run' throughout gemmlowp,
the innermost loop is called PackRun().

Here we recall a very important principle that was explained in
doc/kernels.txt: the kernel is free to dictate the precise data format
that it expects; the packing code has to honor it. So there's an asymmetry
here: the kernel is the master, the packing is the slave. That's why the
packing code is templatized in the KernelSideFormat. At larger scales,
the packing is independent of kernel format details, but inside PackRun is
where we take care of the small-scale details that do depend on the kernel
format details. That's why it's a good thing that we only need sequential
access here, as it would be very complicated to spell out random access
at this scale.

Anyway, PackRun.

Since it is the critical inner loop, it is what we want to allow specializing
for particular CPU architectures. To allow that, we handle blocks of fixed
dimensions at a time, which are intended to be friendly enough to optimization.
These blocks are PackingRegisterBlocks, and their dimensions are:
  width = KernelWidth
  depth = kRegisterSize
See doc/kernels.txt and internal/kernel.h for the former, and internal/common.h
for the latter.

See the comments around PackingRegisterBlock in internal/pack.h:

  // A PackingRegisterBlock is a small fixed-size block of a matrix being
  // packed. This class is the generic non-optimized implementation,
  // it is inherited by the generic implementation of PackingRegisterBlock,
  // which may be overriden by template specialization. Overriding it is how
  // one may provide optimized packing code paths.
  //
  // The packing of a block proceeds in two steps:
  // 1. Ensuring that we have a complete block of source data, i.e. a block of
  //    the compile-time prescribed size. This is where we handle unaligned
  //    boundaries: if we don't have a complete block of source data, then
  //    we copy and zero-extend it into a local temporary (complete_src_),
  //    see MakeCompleteSrc. In the generic case, we do have a complete block,
  //    so we just use it in-place, see UseCompleteSrcInPlace.
  // 2. Packing a complete block into the destination, see Pack. This is the
  //    most critical part, so it's convenient that unaligned boundaries have
  //    already been handled in step 1.
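
The following is a simplified, hypothetical sketch of those two steps, not the
actual PackingRegisterBlockBase code: the source layout is assumed to be
depth-major per width position, and the trivial copy in Pack() stands in for
the reordering that the real code performs according to the kernel format.

  #include <cstdint>
  #include <cstring>

  template <int kBlockWidth, int kBlockDepth>  // stand-ins for KernelWidth, kRegisterSize
  struct PackingRegisterBlockSketch {
    std::uint8_t complete_src_[kBlockWidth * kBlockDepth];
    const std::uint8_t* src_ = nullptr;

    // Step 1, unaligned boundary: copy the partial source block and
    // zero-extend it into the local temporary.
    void MakeCompleteSrc(const std::uint8_t* src, int src_width, int src_depth) {
      std::memset(complete_src_, 0, sizeof(complete_src_));
      for (int w = 0; w < src_width; ++w) {
        std::memcpy(complete_src_ + w * kBlockDepth, src + w * src_depth,
                    src_depth);
      }
      src_ = complete_src_;
    }

    // Step 1, generic case: the source block is complete, use it in place.
    void UseCompleteSrcInPlace(const std::uint8_t* src) { src_ = src; }

    // Step 2: store the complete block into the packed destination. The real
    // code reorders entries into the kernel-dictated format; a plain copy
    // stands in for that here.
    void Pack(std::uint8_t* dst) const {
      std::memcpy(dst, src_, kBlockWidth * kBlockDepth);
    }
  };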


Other things that the packing stage has to do
=============================================

Besides storing matrix entries in a suitable order, the packing stage also has
two other things to do.

First, packing has to compute the vectors of sums of entries along the depth
dimension. If this sounds mysterious, read doc/low-precision.txt. These sums
will only be used at the unpacking stage.
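
As an illustration (not the actual gemmlowp code), the accumulation can simply
be folded into the copy loop that packs a block; all names below are
hypothetical, and the destination layout is a plain copy rather than the real
kernel format.

  #include <cstdint>

  // Packs one width-major block while accumulating, for each width position,
  // the sum of its entries along the depth dimension.
  void PackWithSums(const std::uint8_t* src, int width, int depth,
                    std::uint8_t* packed_dst, std::int32_t* sums) {
    for (int w = 0; w < width; ++w) {
      std::int32_t sum = 0;
      for (int d = 0; d < depth; ++d) {
        const std::uint8_t entry = src[w * depth + d];
        packed_dst[w * depth + d] = entry;  // real code stores in kernel format
        sum += entry;
      }
      sums[w] = sum;  // consumed later, at the unpacking stage
    }
  }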

Second, if the BitDepthSetting requires less than 8 bits of precision, then at
the packing stage we have to requantize inputs accordingly.
See doc/less-than-8-bit.txt for details. This is the Requantize() function.
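
A minimal sketch of such a requantization, assuming a plain linear rescale with
round-to-nearest; the actual Requantize() follows doc/less-than-8-bit.txt and
may use a different (e.g. probabilistic) rounding strategy.

  #include <cstdint>

  // Requantizes an 8-bit value to kBits bits of precision (kBits < 8),
  // keeping the result stored in a uint8.
  template <int kBits>
  std::uint8_t RequantizeSketch(std::uint8_t value) {
    const std::int32_t max_output = (1 << kBits) - 1;
    return static_cast<std::uint8_t>((value * max_output + 127) / 255);
  }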


Specialized packing paths for specific formats on specific CPU architectures
============================================================================

Please refer to internal/pack_neon.h for examples of doing that. The piece
of code to be specialized is PackingRegisterBlock. However, inside of it,
only the Pack() method typically needs to be specialized (the rest is unlikely
to be performance-critical). So one typically specializes PackingRegisterBlock
while still inheriting PackingRegisterBlockBase to keep the generic parts, and
overrides the Pack() method.

Template specialization for the right template parameters is how one specifies
in which case a given path is to be used in place of the generic packing code.
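
Schematically, such a specialization looks like the following. This is not
copied from internal/pack_neon.h: SrcMapType and PackedSideBlockType are
hypothetical placeholders for the actual gemmlowp types that select a given
path, and the primary-template declarations are simplified stand-ins for the
real ones in internal/pack.h.

  // Hypothetical placeholder types standing in for the real template arguments.
  struct SrcMapType {};
  struct PackedSideBlockType {};

  // Simplified stand-ins for the generic classes in internal/pack.h.
  template <typename SrcMap, typename PackedSideBlock>
  class PackingRegisterBlockBase { /* generic two-step packing */ };

  template <typename SrcMap, typename PackedSideBlock>
  class PackingRegisterBlock
      : public PackingRegisterBlockBase<SrcMap, PackedSideBlock> {};

  // The specialization that provides an optimized path for one format.
  template <>
  class PackingRegisterBlock<SrcMapType, PackedSideBlockType>
      : public PackingRegisterBlockBase<SrcMapType, PackedSideBlockType> {
   public:
    // Only Pack() is overridden: it writes the complete source block into the
    // kernel-dictated packed layout, typically with SIMD instructions.
    void Pack(PackedSideBlockType* dst, int start_width) {
      (void)dst;
      (void)start_width;
      // ... optimized packing code ...
    }
  };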

It is entirely possible to set the value of kRegisterSize differently based on
the CPU architecture (for example, 32 on x86 with AVX) as long as all the
specialized packing paths used on that CPU architecture are consistent with it.