Bit-sliced matrix multiply

There’s a matrix multiplication in the middle of a lot of popular things, right now, and it’s in an operation which frequently tolerates remarkably low precision.

In the SIMD world precision usually bottoms out at 8-bit. You don’t save much CPU time by trying to get less precise than that, except for marginal savings in how often you have to mitigate overflows.

But here’s an operation which I think should exist. Like matrix multiply, but one argument is a vector of 8-bit values, and the other argument is a vector of 64 1-bit values.

Rather than calculating a vector of eight-bit-by-eight-bit multiplies with 16-bit results, it computes a vector of sums of one-bit-by-eight-bit multiplies, with eleven-bit results.

In other words, a vector of conditional sums of eight different eight-bit inputs.

(As an additional detail the pragmatist might want an accumulating version of this, which also adds in the previous value of Vdst.)

So you take one of your input matrices and slice it into bitplanes. So that plane n represents a packed bitmap of bit n of each value in the original matrix.

Now for each bit-plane you perform the matrix multiplications with an instruction using conditional additions rather than multiplication. You get eight multiply-accumulate operations done for the price of one multiply-accumulate in eight-bit precision with smaller intermediates and rarer overflow conditions; but only considering one bit of one of the inputs.

In order to get the full precision one started with it’s necessary to repeat the operation for each bit-plane, and then shift and add the results together. But those additions happen outside of the innermost loop so that cost is amortised, and the smaller intermediates may offset that cost entirely.

But more importantly; if you only need five or six bits of precision, then just quit early and save the remaining CPU time.

It seems to fit neatly into a SIMD architecture, but I’ll have to go into that detail when I have more time for it.

TODO: fill all this in.

Widening, narrowing, and overflow

A general frustration with SIMD multiplication is that the results are wider than the operands and so you typically need to issue the instruction twice to get a complete result. Either by issuing two widening operations for each half of the input vectors, or by issuing separate high and low operations with results the same width as the inputs.

In this instance the needed precision may be so low that it avoids overflow even without widening. If the input is only five bits wide then the sum will fit in an eight-bit result; although if it’s an accumulating sum then that won’t quite work.

I wonder if separate high and low operations taking the top or bottom four bits of the eight-bit operand would be the best way to go. This saves an extra bit of headroom for an accumulation without overflow (but just the one; is that actually useful?), and the two separate sums can be fused together with another weighted shift and addition at the end of the loop.

Beyond 64-bit

The instruction illustrated above distributes eight eight-bit values across what might be taken as the full length of the vector. If the vector type is much longer than 64 bits this can create a locality-of-refence problem in its implementation, and it may be preferable to break off and use a new set of eight-bit values after 64-bits – making this a self-contained 64x64-bit operation vectorised across many 64-bit chunks.

Separate 64-bit splat instructions can redistribute the initial values as far as needed, or other rows or columns can be spliced in as needed if the original matrices are not especially wide, so that the potential gains are not overshadowed by alignment padding.

Number encodings other than two’s complement

For signed data it would be an easy default to take the most significant bit as having a negative weight and then all the other bits building the weight back towards zero. This isn’t the only way to distribute the weights.

Base -2 can be used, for example, with weights 1, -2, 4, -8, etc., which can represent arbitrary signed values but in a way that the high bits become zero for small values even if they’re negative. This can be helpful for things like run-length encoding individual planes to avoid large chunks of redundant work.

Since the bit weights are applied outside the inner loop, they don’t necessarily have to be powers of two, either. They can be scaled for better approximation of other distributions, or to bear some redundancy so that lossy techniques may be tolerable in the above chunking/optimisation/compression.