tīkōuka.dev

CISC-V: code compression in the style of a CISC architecture

2026-05-31T00:00:00+00:00

Given a mixed-length instruction stream like Thumb or RVC you encounter a few different headaches. The most infamous case is the complexity of handling un-aligned 32-bit instructions which can straddle architectural boundaries like page or cache line boundaries. There are also caching and security implications of being able to jump into the middle of an instruction, and multi-issue problems with recognising correct instruction start points, and so on…

Such a scheme also introduces some of the complexity of CISC while leaving behind other CISC advantages, like explicit macro-op encoding for often-fused operations in higher-performance machines. But by blithely throwing in CISC instructions to get code size down you give up the simplicity of implementation for compact designs.

So I’ve been tinkering to try to find a different compromise which allows both fast and compact implementations to do things in a way that suits their separate goals.

I had hoped that this post, when I got to it, would be where I published a more complete proof of concept over what I have discussed before, but things have not gone to plan and I’m rushing this out tonight.

On the plus side, at least I have the name, now: CISC-V

And as we all know, naming is half the battle.

Instruction-packet-based compression

While strict alignment could be imposed on an existing 16-bit instruction compression scheme, that gives up coding space to express things which are no longer legal. It makes more sense to use 32-bit packets (or, conceivably, 64-bit packets) which can, in turn, exploit the intrinsic pairing of two opcodes within a packet for other gains.

That’s the plan, here. One 32-bit opcode may occupy a whole packet, or two compact instructions can occupy the same space instead.

Since a lot of assembly refers to the same register repeatedly from one instruction to the next it also makes sense to exploit this redundancy within packets to regain some coding space. CISC typically exploits this by using the same register as destination and one of its sources, but we can take things further by sharing with an adjacent instruction.

The down side is that some instructions which should be compact will fail to be compact because they can’t be moved adjacent to something they can share the packet with.

Other constraints can be imposed on the packet, as well.

In particular, both instructions must execute (give or take exceptions). There’s no branching to just the second instruction in the packet (all branch offsets must be 32-bit aligned, so relative branches reach further). And any branch out of the packet is coded as the second instruction, so it can’t prevent the execution of the other instruction in the packet.

(There’s an alternative model, here, where the first instruction can be a forward branch to the next packet, rendering the second instruction in the packet conditional, and the implementation can decide how it’s going to handle that)

The architecture still has to allow for exception restarts at the second instruction, but it doesn’t have to encode relative branches, and use of such return addresses can be restricted to appropriate instructions and/or privilege levels. Or difficult restarts can just be emulated in software.

Decoding models

Baby steps

Rather than decoding four bytes starting at the 16-bit address and advancing either two or four bytes accordingly, we always examine a 32-bit packet and decode it in one of two different ways depending on whether our instruction pointer is at an odd or even 16-bit boundary.

Alternatively, upon entry to a 32-bit packet, we decode the first instruction and begin to execute it, and concurrently we shuffle the bits around to produce a 32-bit opcode to decode on the next cycle using the same instruction decoder (with a bit of extra logic so exception resume can ignore the first instruction when necessary).

This is the target of the “as-if model” for other approaches. Ambiguous cases, if they’re allowed (preferably not!), should be resolved in terms of what this model would do. Here we risk producing those gnarly situations that cause high-performance implementations to do pipeline flushes, so try very hard to avoid this.

Omnomnomnom!

If you have the sort of implementation which likes to pick up dozens of opcodes at once and throw them all down the pipeline in parallel to be sorted out later, then you probably don’t want to hear about decoding each 32-bit packet twice and producing up to twice as many instructions as packets, even if most of them are no-op placeholders for instructions which didn’t decode.

To avoid that headache I propose pushing it back to the µ-op fission stage, since that exists already and it’s already in the habit of splitting things up which are too complicated.

What could possibly go wrong? I don’t know, so I assume it’s fine.

As a secondary benefit, it creates opportunities to not break some instructions, and to treat some packets as pre-fused macro-ops.

Exceptions

A restartable exception may be triggered within an instruction pair, and the architecture has to allow for this.

My working assumption is that bit 1 in the program counter or return address signals that the first instruction is to be ignored (it has already executed and retired), but this will not be used in any normal operation outside of exceptions. No calls can set this bit, no returns outside of exceptions should accept it being set, and relative branches are in 32-bit increments.
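A minimal sketch of that convention, assuming 32-bit packets and that bit 1 of the program counter is the "first instruction already retired" flag. The function names are hypothetical and only illustrate the bookkeeping, not any real specification.

```python
PACKET_BYTES = 4  # one 32-bit packet


def split_pc(pc: int) -> tuple[int, bool]:
    """Split a program counter into (packet address, skip-first flag)."""
    if pc & 1:
        raise ValueError("pc must be 16-bit aligned")
    packet_addr = pc & ~(PACKET_BYTES - 1)
    skip_first = bool(pc & 2)  # bit 1: first instruction already retired
    return packet_addr, skip_first


def exception_return_pc(packet_addr: int, first_retired: bool) -> int:
    """Return address to save when an exception lands inside a packet."""
    return packet_addr | (2 if first_retired else 0)


def branch_target(packet_addr: int, offset_packets: int) -> int:
    """Relative branches count whole packets, so bit 1 is never set."""
    return packet_addr + offset_packets * PACKET_BYTES
```

Because branch offsets are counted in packets, only the exception path can ever produce an address with bit 1 set.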

Also, while instruction pairing may suggest the use of a direct data path from one instruction to the next within the packet, without the need to land the result in a register, this data still has to be exposed for save, restore, and inspection by an exception handler. So a temporary register must be available.

The experiment

I vibe-coded a tool to try to maximise pairs of instructions by reordering instructions in ways that were functionally equivalent (in particular, by noting when registers were dead) but which would open up pairing opportunities. Unfortunately Claude soon became mired in its own bad code, and I was spending more time asking it to clean up its messes than I was trying new experiments.

The whole effort was kind of a bust, and I needed to spend more time than I had doing it all from scratch. I could, theoretically, vibe-code it from scratch asking for a much more restricted tool where I could do the hard parts myself, but I don’t have time for that.

What I did get, though, was a better feel for what pairing rules are actually useful if the compiler were to put things in an order that exploited them.

Pairing types

The rules I found which most often identify pairable instructions are:

load/store at adjacent memory locations (AArch64’s ldp and stp opcodes) (top by a large margin)
pairs of independent mv or li instructions
double-indirect memory accesses (load a pointer, then load/store whatever that pointer points to)
pre/post increment memory operations
pairs of independent arithmetic in two-operand form (rsd, rs2)
compare-branch chains (branch depends on result of comparison)
load-branch chains (load followed by conditional branch on loaded value; which is then discarded)
arithmetic chains

Chain rules

Chains are where the second instruction depends on the first, and in the experiments that I did the result of the first must also be discarded after use by the second, so the intermediate value needn’t be exposed in an architectural register – except that it’s still needed for single-stepping implementations and exceptions. My solution, here, is to use x31 or x15 as the hard-coded temporary register, with a rule that the content of the register is undefined after the second instruction.

Some pre-increment rules could be chains, too. These are cases where a value is added to a register before using that register as the base of a memory operation, and the base register is discarded after use. Pre-increment achieves this but it overwrites the original base register with the modified address, which is not always the desired outcome (maybe 50:50, varying significantly by testcase).

So, instruction pairs like:

op0, rd0, ra0, rb0
op1, rd1, rd0, rb1

op0, rd1, ra0, rb0
op1, rd1, rd1, rb1

can be replaced with:

op0, x31, ra0, rb0
op1, rd1, x31, rb1

and then x31 is not coded at all, but rather deduced from the pairing category.
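The rewrite can be sketched as a small function. This is an illustration only: the `Insn` tuple, the `to_chain` name, and the `dead_after` set (standing in for whatever liveness analysis a real tool would run) are all assumptions, and the sketch conservatively handles just the single-use pattern shown above.

```python
from typing import NamedTuple, Optional


class Insn(NamedTuple):
    op: str
    rd: str
    ra: str
    rb: str


TEMP = "x31"  # hard-coded temporary assumed by the chain rule


def to_chain(first: Insn, second: Insn,
             dead_after: set[str]) -> Optional[tuple[Insn, Insn]]:
    """Rewrite a dependent pair so the intermediate value lives in TEMP.

    dead_after holds registers known to be dead after the pair.
    Returns None when the pair does not fit the pattern.
    """
    mid = first.rd
    if second.ra != mid:
        return None  # second does not consume first's result as ra
    if second.rb in (mid, TEMP):
        return None  # keep to the simple pattern; TEMP would be clobbered
    # The intermediate may vanish only if the second instruction
    # overwrites it or nothing reads it afterwards.
    if second.rd != mid and mid not in dead_after:
        return None
    return first._replace(rd=TEMP), second._replace(ra=TEMP)
```

Both forms listed above are covered: either the intermediate register is dead after the pair, or the second instruction overwrites it anyway.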

This pattern has further sub-classes where the relationship between op0 and op1 is constrained to save coding space.

First, exclude the two-operand instructions (one-in-one-out), like li and mv because they don’t benefit.

If the second instruction is load/store, then the first is preparing its base address (pre-increment, no write-back), and shifts and bit manipulation aren’t likely to be any use, so make a category and exclude those possibilities. At the same time the load/store instruction encodes a data size and this can restrict possibilities for the other instruction; eg., by choosing the right X for shXadd so we need only one entry in the set for all the adds.

Some instructions might not themselves be common enough to make available in both slots in a generic chain, but when they’re used we can guess at what they’ll pair with and make a mini set for them. For example, slli is often followed by srli, srai, add, sub, or or.

I had hoped to prohibit load-use-discard altogether and to keep the emphasis on things that paired without huge unknowable delays between the two, but the reality is that it’s too common to ignore in a compression scheme.

So if the first instruction is a load it may go on to generic arithmetic (which usually has other pairing opportunities anyway), but a lot of cases involve conditional branches or second indirections, and these are very hard to ignore.

I’d be curious to see how load-branch-discard could play out if it’s signalled explicitly in the instruction stream. Could branch prediction handle it differently from the signals it gets from the ALU?

Similarly, could explicitly-coded double-indirection (load-load-discard) suggest data prediction optimisations where they’re not otherwise warranted?

Two-operand arithmetic rules

The best-known code size optimisation is to encode three-operand instructions with two operands by re-using the first source as the destination register.

In theory this could be another mode for chain rules, where the first result is saved for later and also forwarded to the second instruction; but that doesn’t seem to come up that frequently in the general case (a notable exception is pre-increment addressing).

Really a pair of these just sweeps up a lot of arithmetic which doesn’t otherwise fit a chain rule. This is redundant with chain rules if two of these in a packet use the same destination register (reserve for future use?), and the chain version is slightly more flexible in that it starts with two read-only inputs rather than just one.

But these can also be used to fill the pairing space with compact non-arithmetic instructions, like updating the stack pointer before function return, or implementing pre-increment with writeback before a memory access.

It’s tempting to take aside some operations which are uninteresting when ra and rb are the same register, and to re-purpose those as unary operations like neg, not, and slli rd, 1, rd.

Another special case here would be addi sp, imm. The immediate here has to be a multiple of 16, so the low bits need not be encoded and it can reach further, which is not the case for addi rd, sp, imm.

Two-in-two-out rules

There’s a group of operations which are often paired implicitly in CISC architectures, like mul/mulh or div/rem, and it’s potentially beneficial to fuse these because most of the work overlaps.

Instructions like those take the same two inputs for two different operations which produce two different outputs.

Other obvious pairs would be min/max, add/sub, and/andn, but these are only meaningful for coding efficiency (if you want to keep the inputs unmodified).

For coding efficiency I also put mv/mv, li/li, and mv/li (li/mv is uninteresting) into this set, where each instruction simply takes one or other argument directly.

We can also put the load-pair and store-pair instructions in this category, with a small wrinkle that the second instruction does the same thing but takes the immediate with an extra offset (the data size of the memory operation).
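A minimal sketch of the adjacency test for load-pair and store-pair candidates. The `MemOp` fields and the `is_pair` name are hypothetical, and the extra checks on loads are a conservative assumption rather than anything decided above.

```python
from typing import NamedTuple


class MemOp(NamedTuple):
    kind: str    # "load" or "store"
    size: int    # access size in bytes
    reg: str     # data register
    base: str    # base address register
    offset: int  # immediate offset


def is_pair(first: MemOp, second: MemOp) -> bool:
    """True if second repeats first at the next adjacent address."""
    if first.kind != second.kind or first.size != second.size:
        return False
    if first.base != second.base:
        return False
    # Conservative: reject loads that overwrite the shared base register
    # or that write the same destination twice.
    if first.kind == "load" and first.reg in (first.base, second.reg):
        return False
    # The second immediate is the first plus the data size.
    return second.offset == first.offset + first.size
```

With this shape only one immediate needs to be encoded, since the second is implied by the data size.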

The compressible instruction sets

I simply haven’t had enough time to decide what to put in my straw-man proposal, yet. And I’ve run out of time to work it out. I blame AI.

I’ve cobbled together enough fundamentals to make a Linux kernel smaller than its RVC build (if my vibe-coded analysis tool is to be trusted), but when I tried the same on a compiled Godot binary results were much, much poorer. I blame C++ for that.

Instruction encoding

I’ve avoided dealing with this. What I do instead is count up the number of bits I need to encode all the fields, and then count the number of combinations this creates and add these to the total, and try to keep that total less than 2^30.
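That bookkeeping can be sketched as below. The category names and field widths are made up purely for illustration and are not the actual proposal; only the counting method follows the description above.

```python
BUDGET = 2 ** 30  # code points assumed available for paired packets


def combinations(field_bits: list[int]) -> int:
    """Code points used by one pairing category with these field widths."""
    return 2 ** sum(field_bits)


def total_code_points(categories: dict[str, list[int]]) -> int:
    """Sum the code points over every category."""
    return sum(combinations(bits) for bits in categories.values())


# Hypothetical categories and field widths, for illustration only.
example = {
    "chain":       [4, 4, 5, 5, 5, 5],  # op0, op1, ra0, rb0, rd1, rb1
    "load_pair":   [2, 5, 5, 5, 9],     # size, rd0, rd1, base, imm
    "two_operand": [4, 5, 5, 4, 5, 5],  # two (op, rd, rs2) triples
}

used = total_code_points(example)
print(f"{used} of {BUDGET} code points used; fits: {used < BUDGET}")
```

Counting code points rather than fixed bit fields lets categories of different widths share the space without committing to a layout.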

Actually laying the bits out in an instruction packet doesn’t seem terribly interesting. I guess it’s nice to align the source and destination registers of the first instruction with those of the 32-bit instructions (in a compact implementation there’s another cycle to use to redistribute the bits for a second interpretation).

In-filling 3D prints with holes

2026-01-18T00:00:00+00:00

I got to thinking about 3D printing infill the other day, and eventually I decided that there should be ways of scooping large chunks out of the middle, rather than in-filling with a homogeneous sparse pattern, while retaining some or all of the original strength of the homogeneous fill.

I was thinking a sphere, originally. And applying that subtraction recursively in all of the solid areas left by the previous removals. Why a sphere? Well, that’s the mathematical ideal. Unfortunately you can’t print a sphere without something inside to support the roof (and maybe floor) while it’s printing. Also, the continuous symmetry of a sphere doesn’t really mean much when it’s sliced into layers and has different strength characteristics in different directions.

So spheres are obviously not ideal at all. But lots of things won’t be ideal, here. Let’s start compromising!

First, the sphere’s internal support. Ideally this support structure would not be too rigid. This is because a rigid support undermines the even distribution of forces we’re trying to get from a sphere. I’m no civil engineer but I have played Bridge Builder Game, and I know that if you make one part too strong you can force other parts to fail because they then get all the stress.

Ignoring that problem, I tried to get OrcaSlicer to add whatever it thought appropriate to support the roof of a sphere. Because my sphere was a void inside of a cube it was technically an external support, and it gave me this tree-like thing made of rings, which added a lot to the print time.

I think the ideal support would have been Lightning but I didn’t see that as an option. I guess it would deface the print if it were used as an external support, but my “external” is actually internal to the model and I won’t be trying to remove it. Another caveat there is you probably wouldn’t want a deliberately flimsy printing support to break off and rattle around inside the model at some later point.

So I stopped messing about with that and made a different shape which loosely approximated a sphere but tapered to points at each end. The so-called “fusiform”. It looks to me like an onion:


Click to view in 3D

I think the tapers should be better than 30° overhang, but OrcaSlicer disagreed and only stopped adding supports when I lowered the threshold to about 27°. I figured it was a rounding error, so I just switched supports off and assumed my effort was good enough.

This shape is at least circular in one axis, meaning that it should be resistant to buckling. It comes to a point at the top and bottom, but unlike a bridge those points are points on a 2D plane supported all around by thicker material, and hopefully that’s good enough. Plus I have a few millimetres clearance before hitting the outside wall, with infill to spread that load (don’t try that excuse when building a bridge!).

download STL

With OrcaSlicer’s default settings (15% crosshatch infill, and some walls and stuff) the cost of the onion’s walls inside of a 10cm cube approaches the cost of the infill it replaces. It’s less material but only about 10% less time.

In order to make myself look more successful I changed the infill configuration to use more expensive infill. Presumably stronger infill, and still relevant – maybe more relevant – when there’s a huge hole in the middle.

10cm cube	15% c/hatch	30% c/hatch	15% cubic	20% cubic	20% gyroid
Filament, solid	78.63m	133.39m	79.15m	98.45m	94.99m
Filament, onion	66.77m	100.35m	66.12m	77.91m	76.47m
Time, solid	7h19m	12h46m	5h28m	6h38m	11h57m
Time, onion	6h28m	10h08m	5h26m	6h14m	9h21m
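The relative savings in the 10cm table can be worked out with a short script. The figures are copied from the table above; the helper names are just for this sketch.

```python
def minutes(hm: str) -> int:
    """Convert a duration like '7h19m' to minutes."""
    hours, rest = hm.split("h")
    return int(hours) * 60 + int(rest.rstrip("m"))


def saving(solid: float, onion: float) -> float:
    """Fraction saved by the onion void relative to the solid print."""
    return (solid - onion) / solid


# (filament solid, filament onion, time solid, time onion)
table_10cm = {
    "15% c/hatch": (78.63, 66.77, "7h19m", "6h28m"),
    "30% c/hatch": (133.39, 100.35, "12h46m", "10h08m"),
    "15% cubic":   (79.15, 66.12, "5h28m", "5h26m"),
    "20% cubic":   (98.45, 77.91, "6h38m", "6h14m"),
    "20% gyroid":  (94.99, 76.47, "11h57m", "9h21m"),
}

for name, (fil_solid, fil_onion, t_solid, t_onion) in table_10cm.items():
    f = saving(fil_solid, fil_onion)
    t = saving(minutes(t_solid), minutes(t_onion))
    print(f"{name:12s} filament {f:6.1%}  time {t:6.1%}")
```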

I also tried adding extra, smaller onions in the corners to eat up more volume, but it only made things worse – wall thickness remaining constant, wall area shrinking, but enclosed volume shrinking much faster. So never mind that; but it did highlight that I should test a smaller cube, and I did that instead.

5cm cube	15% c/hatch	30% c/hatch	15% cubic	20% gyroid	30% gyroid
Filament, solid	12.11m	18.62m	12.02m	13.94m	18.19m
Filament, onion	11.80m	15.45m	11.49m	12.70m	15.11m
Time, solid	1h20m	2h00m	1h07m	1h50m	2h36m
Time, onion	1h24m	1h50m	1h16m	1h41m	2h08m

The big question, though, is: is it as strong as, or stronger than, the homogeneous infill?

Well, I don’t know! I don’t have a 3D printer, and I don’t have the means to test it scientifically, and there are a lot of different ways to define “stronger”. This is all abstract and theoretical.

The next step is to try to increase the volume of the void without deviating too much further from our spherical ideal.

Enter, The Sphube!

Click to view in 3D

You might be more familiar with the squircle, and this is just a 3D extension of that idea. These squircles and hypersquircles have the benefit of being continuous curves, and so should be a bit more resistant to buckling than a flat-faced cube would be. That makes them a more viable monocoque.

What we’re looking at in the general form would be some kind of shrunken, rounded monocoque approximation of the real model – a rigid empty shell – and on top of that we build up using a practical in-fill pattern, and on top of that we build the desired outer shape of the model. This construction should work like a truss arch bridge, with the infill acting as truss, the monocoque providing the arch(es), and the external model being the road surface people drive across.

Consider how the cross-section looks something like a bridge (two bridges):

This is a really half-arsed bridge. The infill is just a regular pattern rather than triangles with carefully-chosen dimensions and placement. But it might be sufficient. Baby steps.

The down-side is that trusses add tensile stress (where steel excels, but 3D prints do not), which means that layer adhesion becomes a much more concerning factor. Maybe that’s why this isn’t a standard approach already.

But that’s really the big idea, here. Make all the walls into trusses with the underside of the truss being a strong monocoque which resists compression by being wholly convex.

Right now I don’t have the means to convert an arbitrary model to its eroded, curved form. I tried searching for model erosion tools and mostly just found ways to make things look weathered.

But I do at least know how to erode a square down to a squircle. That can be done with something like a smoothmax(), or a generalised mean of x and y coordinates. As in “we’re inside the squircle if smoothmax(abs(dx), abs(dy)) < r”.

If you imagine using max() in place of smoothmax() then that would give you a square boundary. And if you replaced max() with sqrt(dx^2 + dy^2) then a circle. smoothmax() and generalised mean can pick functions somewhere in between, with a parameter that allows them to express both.
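A minimal sketch of that inside test. It assumes a p-norm as the smoothmax(), which is only one of several possible choices: p = 2 gives the circle, and larger p approaches the square. The 3D version is the same idea with a third coordinate.

```python
import math


def smoothmax(values: list[float], p: float) -> float:
    """p-norm of non-negative values; tends to max(values) as p grows."""
    if math.isinf(p):
        return max(values)
    return sum(v ** p for v in values) ** (1.0 / p)


def inside_squircle(dx: float, dy: float, r: float, p: float = 4.0) -> bool:
    """True if (dx, dy) lies inside the squircle of 'radius' r."""
    return smoothmax([abs(dx), abs(dy)], p) < r


def inside_sphube(dx: float, dy: float, dz: float, r: float,
                  p: float = 4.0) -> bool:
    """The same test extended to three dimensions."""
    return smoothmax([abs(dx), abs(dy), abs(dz)], p) < r
```

A point such as (0.8, 0.8) with r = 1 falls outside the circle (p = 2) but inside both the p = 4 squircle and the square.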

After a bit of digging I found a simple way to get from those simple equations to an .STL file.

STL sphubes (lower-resolution STL sphubes)

But as we learned with the sphere, this isn’t going to work because of the roof problem, and my lack of access to an “external” lightning fill. Now it’s a bit worse because that roof is wider and flatter.

download STL anyway

So back to the compromises I must go…

Or I can just post this as-is and go tinker with sphubes generalised to other platonic solids for no clear reason at all:

Click to view in Desmos

Choosing n different colours for graphs

2025-11-22T00:00:00+00:00

One way to generate a palette of colours for distinguishing different lines and objects in diagrams is to take regular steps around the hue parameter of the HSL colour wheel. If you know how many you’ll need then you can subdivide the space evenly, or if you do not then you can use 1/φ as the interval instead. But this has limitations…

Of course a much simpler solution is to just pick a bunch of reasonable colours and put them in a table (eg., 1 2 3). But doing things the hard way is more interesting. Also, a list which includes both black and white isn’t solving quite the right problem for this post…

Spoiler alert: this won’t (directly) attempt to address accessibility for colour-blind users.

I’ve written in the past about trying to draw diagrams and graphs on web pages. The essential point is that you can embed SVG with a transparent background but you must use currentColor as the pen colour when you do this, so that the image is drawn in the same colour as the text, rather than assuming that the background is always white so you need to draw black on top. If you use something like Dark Reader you’ll often see this go awry.

Alternatively you can force the background of the image to be a known colour, but then on a contrasting background that can still be hard to look at.

So I know how to draw lines with reasonable contrast from the background, without assuming that the background will be light or dark. The next problem is to add to that palette some extra colours which also contrast with the background but are visibly distinct from each other. Like three lines on a graph.

Single-parameter variation

A quick-and-dirty notion of “contrast” is having a different brightness. Having a different colour but the same brightness can be very hard to look at. So for starters, let’s look at just varying the colour while keeping brightness at a single value chosen to contrast with the text or the background.

$\frac{n}{\varphi} \mod 1$ has the property that every new $n$ falls inside one of the largest gaps, and inside the largest span of contiguous largest gaps (when there are many largest-equal gaps), etc., subdividing that gap/span by 1:φ, which is tolerably close to 1:2.

That is to say that each new value is as far as possible from as many previous values as possible without deciding in advance how many values you’ll need or changing step sizes at different stages in the sequence.
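As a rough sketch, the hue sequence and the size of its largest remaining gap can be computed like this. The function names are illustrative, and the default saturation and lightness are simply the 60% and 70% used in the HSL swatches of this section.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def hue(n: int) -> float:
    """Hue in degrees for index n, stepping 1/φ of a turn each time."""
    return (n / PHI % 1.0) * 360.0


def css_hsl(n: int, s: int = 60, l: int = 70) -> str:
    """CSS colour string for index n at fixed saturation and lightness."""
    return f"hsl({hue(n):.1f}deg {s}% {l}%)"


def largest_gap(count: int) -> float:
    """Largest gap, in turns, between neighbours among the first hues."""
    pts = sorted(n / PHI % 1.0 for n in range(count))
    gaps = [b - a for a, b in zip(pts, pts[1:])]
    gaps.append(1.0 - pts[-1] + pts[0])  # wrap-around gap
    return max(gaps)
```

Printing `largest_gap(k)` for increasing k shows the gaps shrinking steadily without the count being fixed in advance.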

Anyway, let’s have a look. These colours step around the hue of the HSL space:

[Interactive swatch grid: colours numbered 0 to 62]

HSL(n / φ % 1 × 360°, 60%, 70%)

It’s interactive. You can resize the box to change the way rows line up, so you can put different colours next to each other for comparison.

And if you do that you’ll see a problem. It seems to visit relatively few colours before coming back around to use something very similar to a colour that’s already been used. So things get indistinct much sooner than one might hope.

Fun fact: When taking steps of 1/φ mod 1 those “kind of similar” colours occur at distances which are Fibonacci numbers. Resize the box to have a Fibonacci number of columns and you’ll see stripes.
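The Fibonacci effect is easy to check numerically. This sketch measures how far apart two hues are when their indices differ by a given step; the function name is just for illustration.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def hue_distance(step: int) -> float:
    """Circular distance, in turns, between hue n and hue n + step."""
    frac = step / PHI % 1.0
    return min(frac, 1.0 - frac)


for step in range(1, 35):
    print(f"{step:2d} {hue_distance(step) * 360:6.1f} degrees")
```

In the printed list, each Fibonacci step (1, 2, 3, 5, 8, 13, 21, 34) sets a new record for closeness, which is where the stripes come from.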

HSL is tied to the numerical coding of colour in RGB. It’s made out of up and down ramps of R and G and B without regard to how they’re perceived. OKLCh, on the other hand, is tied more closely to human perception. Maybe that’ll help:

[Interactive swatch grid: colours numbered 0 to 62]

OKLCh(75% 30% (n / φ % 1 × 360°))

This has the unfortunate effect (normally a feature) of flattening the lightness of each colour, so none of the colours are distinguished by the perceptual lightness variations which would sneak through HSL. Maybe the hues are more evenly spread, but I can’t see it.

On the positive side, the contrast with the numbers written on the boxes is more even. That’s important.

Another problem with OKLCh is that it’s so easy to stumble out of gamut (the range of colours which the display can represent) and this brings gamut mapping into play. The way to do that is not well defined right now and it may never be defined in a way that’s useful for these purposes. It’s not always obvious how and when the test swatches I’m using here will be clipped to fit the display capabilities, so it’s hard to be confident that everybody sees the same thing.

That’s a problem with human perception anyway, but this makes it so much worse.

But let’s persevere with it a while longer…

Multi-parameter variation

Changing just the one parameter doesn’t seem to get us a lot of distinct choices. The next thing we can change without interfering with our fixed brightness constraint is saturation, or C for “chromatic intensity” in OKLCh. Alternatively, C represents the distance which the a (green-red) and b (blue-yellow) values are from 0,0 in OKLab, so we could vary a and b instead of C and h.

So how do you get the properties of $\frac{n}{\varphi} \mod 1$ in two dimensions? It turns out a (the?) generalisation takes us to the plastic ratio (ρ=1.3247), next. In short, multiply $n$ by (1/ρ, 1/ρ²), or (0.7548776662, 0.5698402910), mod 1.

This maximises the minimum distance between any two points in two dimensions.

Here’s how that looks in OKLab:

[Interactive swatch grid: colours numbered 0 to 62]

OKLab(.75 (n / ρ % 1 × .4 - .2) (n / ρ² % 1 × .4 - .2))

This gives uniform coverage of a square in the chroma plane, so it has pointy corners where the saturation reaches further out than it can near the edges. It’s probably going out of gamut and being clipped in unpredictable ways.

In another problem space we could use rejection sampling to avoid those ugly corners, but then we can’t define a colour as a simple function of $n$. Instead, a technique to map two uniform random values (a square) to a uniform distribution over a disc is to take one value as the radius and the other as an angle around that circle. Taking the square root of the value used as radius compensates for the over-concentration of points around the centre (proof left as an exercise for Google search).
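A minimal sketch of the square-to-disc mapping, using the square root on the radius term as the OKLCh formula that follows does. The function name and the `radius` parameter are illustrative.

```python
import math


def square_to_disc(u: float, v: float,
                   radius: float = 1.0) -> tuple[float, float]:
    """Map (u, v) in the unit square to a point in a disc, area-uniformly."""
    r = radius * math.sqrt(u)   # sqrt spreads points evenly by area
    theta = 2.0 * math.pi * v   # v picks the angle around the circle
    return r * math.cos(theta), r * math.sin(theta)
```

The reasoning is that the area inside radius r grows as r², so a quarter of the u values should land within half the radius, and the square root arranges exactly that.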

Does this retain the mathematical rigor of low-discrepancy sequences? No. Not at all. But it’s the best I have right now.

And here’s what that gives us for OKLCh:

[Interactive swatch grid: colours numbered 0 to 62]

OKLCh(.75 (sqrt(n / ρ % 1) × .2) (n / ρ² % 1 × 360°))

Something really unfortunate about the plastic ratio shows up, here. It’s too close to 4/3. This has the consequence that one parameter appears nearly periodic mod 4, with a very slow precession. For example, in the polar test case, we start at 0 so the first radius is zero (grey), and every fourth colour after that is very close to grey as well, and it takes a long time to climb out of that hole.
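The slow climb out of grey can be seen by printing the chroma term for every fourth index. This sketch finds ρ as the real root of x³ = x + 1 using Newton's method; the function names are illustrative.

```python
def plastic_ratio(iterations: int = 60) -> float:
    """Real root of x**3 = x + 1, found by Newton's method."""
    x = 1.3
    for _ in range(iterations):
        x -= (x ** 3 - x - 1) / (3 * x ** 2 - 1)
    return x


RHO = plastic_ratio()


def chroma(n: int) -> float:
    """Chroma term of the polar formula: sqrt(n / ρ % 1) × .2."""
    return (n / RHO % 1.0) ** 0.5 * 0.2


for n in range(0, 29, 4):
    print(n, round(chroma(n), 3))
```

Since 4/ρ is only about 0.02 past a whole number, each fourth index advances the radius parameter by a tiny amount, and the chroma stays well below its 0.2 ceiling for a long run.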

Switching the axes around makes the problem manifest in the hue instead:

[Interactive swatch grid: colours numbered 0 to 62]

OKLCh(.75 (sqrt(n / ρ² % 1) × .2) (n / ρ % 1 × 360°))

For completeness, let’s also try OKLCh but with fixed C and varying the lightness instead.

[Interactive swatch grid: colours numbered 0 to 62]

OKLCh((n/ρ % 1 × .25 + .63) .12 (n / ρ² % 1 × 360°))

Or swapping the axes:

[Interactive swatch grid: colours numbered 0 to 62]

OKLCh((n/ρ² % 1 × .25 + .63) .12 (n / ρ % 1 × 360°))

Varying all three parameters

Next step is to make adjustments to all three parameters; but only modest adjustments so that all results still have strong contrast with the background colour.

I don’t know of a name for what comes after Golden and Plastic, but its value is g=1.22074408460575947536, and the reciprocals of the powers are (0.8191725134, 0.6710436067, 0.5497004779).
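Numerically, the three constants used in this post (φ, ρ, and g) are the positive roots of x^(d+1) = x + 1 for d = 1, 2, and 3. The sketch below finds them by bisection and produces the per-axis step sizes; the function names are made up for this example.

```python
def generalised_ratio(d: int) -> float:
    """Positive root of x**(d + 1) = x + 1, by bisection on [1, 2]."""
    lo, hi = 1.0, 2.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid ** (d + 1) - mid - 1 > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def steps(d: int) -> list[float]:
    """Per-axis step sizes: reciprocals of the first d powers of the root."""
    g = generalised_ratio(d)
    return [g ** -(k + 1) for k in range(d)]


def point(n: int, d: int) -> list[float]:
    """The n-th point of the d-dimensional sequence, each axis in [0, 1)."""
    return [n * s % 1.0 for s in steps(d)]


print(generalised_ratio(3), steps(3))
```

Each axis value from `point(n, 3)` can then be scaled and offset into whatever lightness and chroma ranges are wanted.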

The lightness figure needs compression to ensure things don’t wander too far and start failing to meet the original contrast limitation.

[Interactive swatch grid: colours numbered 0 to 62]

OKLab((n/g % 1 × .25 + .63) (n/g² % 1 × .35 - .175) (n/g³ % 1 × .35 - .175))

But I preferred the result with the terms in a different order:

[Interactive swatch grid: colours numbered 0 to 62]

OKLab((n/g² % 1 × .25 + .63) (n/g³ % 1 × .35 - .175) (n/g % 1 × .35 - .175))

I wasn’t sure about the appropriateness of compressing an axis of an LDS the way I was doing it, so I tried using a smaller modulo instead:

[Interactive swatch grid: colours numbered 0 to 62]

OKLab((n/g² % .25 + .63) (n/g³ % 1 × .35 - .175) (n/g % 1 × .35 - .175))

but this version becomes distinctly worse at intervals of 22. Which is respectable, but it’s not as good as the previous version.

Those are all OKLab, so they have pointy saturation corners – though I did reduce the range a little to compensate. Let’s try another OKLCh:

[Interactive swatch grid: colours numbered 0 to 62]

OKLCh((n/g² % 1 × .25 + .63) (sqrt(n/g³ % 1) × .2) (n/g % 1 × 360°))

And back to HSL:

[Interactive swatch grid: colours numbered 0 to 62]

HSL((n/g % 1 × 360°), (sqrt(n/g³ % 1) × 70%), (n/g² % 1 × 25% + 55%))

And HSL with the axes rearranged:

[Interactive swatch grid: colours numbered 0 to 62]

HSL((n/g³ % 1 × 360°), (sqrt(n/g % 1) × 70%), (n/g² % 1 × 25% + 55%))

So many choices… also, you can add some arbitrary starting value to pick a handful of colours you like the look of, and the subsequent colours will only come up in extreme cases.

Putting it into context

Given some function or other to turn an index into a colour, that colour still has to make sense for the way it’s being used. Coloured lines want contrast with the background while being distinguishable from each other, but if you fill in a box you probably want that fill to have contrast with any text that goes inside, so it should be close to the background colour.

In my totally unscientific tinkering I’ve found that low-saturation light colours (pastels) work well for lines on dark backgrounds and for fill colours behind dark text, and that high-saturation dark colours (“deep” colours) work well for lines on light backgrounds and fill colours behind light text.

Also, fills turn out to be easier to distinguish from each other than lines, so lines might need their saturation amplified a bit to compensate. Maybe. I don’t want to go that deep right now.

All that said, one should have other means to distinguish things, because not everybody sees colour the same way.

Code plz!

In CSS you can deduce a contrasting background colour with something like: HSL(from currentColor 0, 0, clamp(0, l * -100 + 50, 1)). This negates the luminance and amplifies it 100-fold so as to hit the limits imposed by clamp() right away, resulting in either black or white being chosen.

One can also deduce that a low saturation might be desired when currentColor has a low lightness value, and high saturation is desired when currentColor has a high lightness value.

It’s easier to do this in two steps, first making a “mask” colour, and then using that mask as the basis for palette colours:

* {
  --stroke-mask: oklab(from currentColor
      clamp(.40, l *  100 - 50, .9)
      clamp(.15, l * -100 + 50, .3)
      clamp(.15, l * -100 + 50, .3));
  --fill-mask: oklab(from currentColor
      clamp(.40, l * -100 + 50, .9)
      clamp(.15, l *  100 - 50, .3)
      clamp(.15, l *  100 - 50, .3));

  --colour-stroke: oklab(from var(--stroke-mask)
    calc(calc(mod(.6710436067 * var(--n), 1) - l) * .25 + l)
    calc(calc(mod(.5497004779 * var(--n), 1) - 0.5) * a)
    calc(calc(mod(.8191725134 * var(--n), 1) - 0.5) * b));
  --colour-fill: oklab(from var(--fill-mask)
    calc(calc(mod(.6710436067 * var(--n), 1) - l) * .25 + l)
    calc(calc(mod(.5497004779 * var(--n), 1) - 0.5) * a)
    calc(calc(mod(.8191725134 * var(--n), 1) - 0.5) * b));
}

Where --n is an integer colour index. Just set --n to different numbers for each group of objects which should have the same colour, and use var(--colour-stroke) and/or var(--colour-fill) as appropriate within that.
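For checking the arithmetic outside a browser, the same calculation can be ported to Python. This is only a sketch: it assumes the lightness of currentColor arrives as a number between 0 and 1 (as it does in oklab()), and the function names are invented for illustration.

```python
def clamp(lo, value, hi):
    """Same argument order as CSS clamp(MIN, VAL, MAX)."""
    return max(lo, min(value, hi))


def stroke_mask(l):
    """Mask for lines: dark text (low l) gives a deep, saturated mask."""
    return (clamp(0.40, l * 100 - 50, 0.9),
            clamp(0.15, l * -100 + 50, 0.3),
            clamp(0.15, l * -100 + 50, 0.3))


def fill_mask(l):
    """Mask for fills: dark text (low l) gives a light, pastel mask."""
    return (clamp(0.40, l * -100 + 50, 0.9),
            clamp(0.15, l * 100 - 50, 0.3),
            clamp(0.15, l * 100 - 50, 0.3))


def palette_colour(mask, n):
    """OKLab colour for integer index n, built around a mask colour."""
    ml, ma, mb = mask
    return ((0.6710436067 * n % 1 - ml) * 0.25 + ml,
            (0.5497004779 * n % 1 - 0.5) * ma,
            (0.8191725134 * n % 1 - 0.5) * mb)


# Example: stroke and fill colours for index 3 with black text.
stroke = palette_colour(stroke_mask(0.0), 3)
fill = palette_colour(fill_mask(0.0), 3)
```

Because of the 100-fold amplification, the masks only ever take one of two values unless the text lightness is very close to 0.5.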

TODO: here’s where I’d demonstrate boxes and lines in different colours, and on different backgrounds, but I don’t really have time right now.

Designing a Lego card shuffler

2025-11-20T00:00:00+00:00

A problem with mechanical card shufflers is that they do things like riffles with mechanical precision, and mechanical precision tends to produce predictable outcomes (at least in theory). Thinking about this gave me the idea that I could do my own but with deliberate and controlled use of robust random numbers in order to produce a true shuffle.

I figured the thing to do would be to enumerate the cards randomly and then radix-sort them into place. This seemed like a comparatively (comparatively) easy mechanism to build. As a side effect, enumerating and reordering means that you could also add a camera and then identify and sort the cards by their actual value. It’s also much easier to verify that sorting has been done correctly than it is to verify that shuffling has been done correctly.

In fact, the user could choose whether to sort or to shuffle simply by placing the cards face-up or face-down. Or if it’s a real mess then the deck can be separated into face-up and face-down stacks in one pass.

Equivalence of shuffling, sorting, and riffling

What’s a riffle?

A riffle can be modelled as dividing the cards into two stacks and randomly picking either the left or right stack to deliver the next card to the result, over and over until there are no more cards. Each choice is based on probabilities proportional to the number of cards in each stack, and this model implies the dealer tries to mix the two stacks evenly rather than letting one side expire early and then simply dropping the rest of the cards from the other stack on top.
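That model is small enough to sketch in Python. The cut point is an assumption here (the deck is simply split in half), and the function name is made up; the drop probabilities follow the description above.

```python
import random


def riffle(deck, rng):
    """One riffle: merge two stacks, picking each next card from the left
    or right stack with probability proportional to that stack's size."""
    cut = len(deck) // 2  # assumption: cut exactly in half
    left, right = deck[:cut], deck[cut:]
    li = ri = 0
    out = []
    while li < len(left) or ri < len(right):
        remaining_left = len(left) - li
        remaining_right = len(right) - ri
        # Pick left with probability remaining_left / (both remaining).
        if rng.randrange(remaining_left + remaining_right) < remaining_left:
            out.append(left[li])
            li += 1
        else:
            out.append(right[ri])
            ri += 1
    return out


deck = list(range(52))
shuffled_once = riffle(deck, random.Random(1))
```

Note that the cards from each stack keep their relative order, which is why a single riffle is so far from a full shuffle.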

Unfortunately if you have too much precision then the outcome is that you interleave the cards in a regular left-right-left-right pattern, which is completely predictable. Some people can do this deliberately!

If it is ideally unpredictable then you need to do at least six of these to get a fair shuffle in a deck of 52 cards. Probably more, but certainly not less.

What’s a radix sort?

Radix sorting is a multi-pass binning operation, where the cards are sent to one of n (n will be two in this build) different bins depending on whether they should be at the front or the back of the sorted list. Doing this in multiple stages means making the decision based on different conditions on each pass. You might separate even and odd cards, then place one pile on the other for the next round, then low and high numbered cards mod-4, then mod-8, etc., with the final pass separating the red and black cards.
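A minimal sketch of that two-bin process, assuming the cards are represented by non-negative integers and ignoring any physical reversal of the piles (the function name is hypothetical):

```python
def radix_sort_two_bins(keys):
    """LSD radix sort using two bins, one pass per bit.

    Each pass keeps the order of cards within a bin (it is stable),
    then stacks the 'one' bin after the 'zero' bin for the next pass.
    """
    pile = list(keys)
    passes = max(pile, default=0).bit_length()
    for bit in range(passes):
        zeros = [k for k in pile if not (k >> bit) & 1]
        ones = [k for k in pile if (k >> bit) & 1]
        pile = zeros + ones
    return pile
```

The first pass separates even from odd, the second looks at the next bit up, and so on; the sort only works because every pass preserves the order left by the previous one.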

Technically separating the cards into two bins is the opposite of riffling; but the overall effect in either case is a permutation which can be identified by the binary decisions made along the way.

To use a sorting machine as a shuffler you can randomly assign unique numbers to each card, and then sort the cards by their associated numbers.

What you should not do is to take a sorting algorithm and then make random decisions at each comparison. That rarely works. Radix sorting might perform comparatively well in this arrangement, but it’s still wrong. In fact it’s radix sort’s good (but not perfect) performance that makes riffle shuffling converge on a strong shuffle after only one or two extra rounds beyond the theoretical minimum.

How are they similar?

If every card remembers whether it came from the left pile or the right pile, for every riffle step in a shuffle, then it would come out with six or seven boolean values, which you can combine as bits into a number, which represents its index in the shuffled pile. In essence the process gives each card a random 6-bit number and then sorts them by those numbers.

A radix sort replays that same string of decisions but in reverse order. Reading those same index numbers from the other end.

But look out. Just assigning numbers this way allows the possibility that two cards could have the same number, and then their final order won’t be changed from their initial order. More riffles add more bits to the numbers, decreasing the chances of two cards having the same number and being “stuck together” for the whole shuffle, but it’s a coarse approximation of picking reliably unique numbers.
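To put a number on that, here is a small sketch of the chance that every card ends up with a distinct number. It assumes each card’s bits are independent and uniformly random, which is an idealisation of a real riffle, and the function name is invented for illustration.

```python
def probability_all_distinct(cards, bits):
    """Chance that `cards` independent random `bits`-bit numbers are all
    different (the birthday problem)."""
    values = 2 ** bits
    if cards > values:
        return 0.0  # more cards than numbers: a clash is certain
    p = 1.0
    for i in range(cards):
        # The i-th card must avoid the i numbers already taken.
        p *= (values - i) / values
    return p


# How the odds improve as riffles (bits) are added for a 52-card deck.
odds = {bits: probability_all_distinct(52, bits) for bits in range(5, 13)}
```

With five bits there are only 32 numbers for 52 cards, so clashes are guaranteed; every extra riffle improves the odds but never makes clashes impossible.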

An ideal shuffle chooses unique indices for each card, and then sorts by that value. Moreover, an ideal shuffle chooses one of the 52! possible permutations and then puts the cards in that order, and that order can be described by numbering the cards according to where they land. A pair of cards could still come out in the same order as they went in, but only with a suitably low probability.

That’s a thing we can do trivially in a microcontroller, but not-at-all-trivially in a Victorian-era mechanical contraption.

How to build a thing for that?

To build a radix-sorting machine I need to be able to take cards one at a time from the source deck and to deliver them to one of two (or n) other bins according to logic of some sort, and once all the cards are redistributed, to combine those two piles and bring them back to the starting point for the next round.

Binning

Starting with the easiest bit; capturing the cards in multiple bins and bringing them back together into a single pile for the next round.

For this I decided on a vertical column with three shelves. The source pile at the top, and two output piles below that. A shuttle (also acting as the bottom shelf) can then lift the cards to the top, but as it lifts the cards, the shelves above have to get out of the way while depositing their cards on top of those already on the shuttle.

To achieve this, I made the shelves a pair of forks, on diagonal sliders. Upward pressure from below would push the forks backwards out of the way, while the wall they retreated into would keep the cards lined up with the shuttle as it rose. When the shuttle passed the forks they would drop back into place behind it.

Then the shuttle needs to deposit the cards on the top shelf and go back to the bottom. To achieve this it’s made of overlapping wings which lift up and slip between the forks on the way back down, leaving the cards on top of those forks.

And that actually worked! Hurrah!

Here I would offer a picture, but the kids stripped my build for parts and now we have a Lego Porsche 911 instead of a card shuffler. So I’m going to offer a quick and dirty 3D mockup instead:

[3D mockup of the shelves and shuttle]

Dealing

Next we have to deal the cards one at a time from the top pile, and decide where they should be delivered. Dealing cards with Lego is a problem that seems really hard. How do I ensure only one card is drawn at a time? How does a printer do that?

I ran out of time for the project before I could build any prototype, but I had thoughts and I hope to revisit the problem imminently.

My thinking, such as it is, involves a roller (motorised Lego wheel) on top of the deck which pushes at least the top card out, while a brush sits beneath where the top card protrudes and tries to sweep back any other cards which got dragged along with it. I’m not sure if the brush is necessary, but I feel like I at least have a plan if it turns out it is.

Once the top card is protruding far enough that I think the brush has isolated it, slightly faster rollers can pick it up and get it moving on its way.

This mechanism has to have a bit of vertical freedom so that it can adapt to the shrinking pile, obviously, and I guess the smart thing would be to sense when the pile is empty (i.e., when there’s no card supporting the roller, and it falls beyond a threshold). It also has to get out of the way when the shuttle is trying to refill the pile. I figure that the refill action should lift both mechanisms together, and then replace both mechanisms together.

I intend to use the same motor to drive the rollers and also to raise and lower the shuttle and rollers. Why? Because I only bought two motors and two motor controllers. This means turning the motor in one direction will lift things up and disengage the rollers, and then turning that motor in the other direction will lower everything and at the point it’s seated further motion on the motor toggles over to driving the rollers.

Slightly fiddly, but probably easier overall than adding more motors.

TODO: draw a diagram

Routing

Every card drawn has to be directed to one of the two bins, and so some kind of switch is in order. I figure that’s basically just a slide which can be raised or lowered to point at the appropriate bin. The main complication comes from wanting to make sure that timing errors don’t cause a card to get jammed in a destructive way, so there has to be clearance for the card to find a safe escape path if things move at the wrong time, and it has to jump over this gap in normal operation.

Also, I need a sensor to regulate the timing of the switching. One which will tell me when the next card is passing by. Or, in the fancy version a sensor to read the face value of the cards, and also that the next card is passing by.

There’s no Lego sensor for the second version, so I’m sticking with the first (though I do have a thing in a box somewhere…?).

TODO: more diagrams?

Actuation

controller

For the actual control logic I went with a micro:bit, because it’s cheap and because my employer gave me one to celebrate an anniversary. Also my boss gave me another one because he thought he’d never have time to use his.

Moreover, at the time I felt that the EV3 brick was unreasonably expensive and I wanted to do my part in making that cheaper so that Lego education kits could be stretched a bit further. But that whole thing is for another blog post.

Here’s all the bits I needed on some breadboard:

Parts: edge connector, driver, connector.

Since then Lego has changed its connector standard (again). I have the older motors right now, but I think I should re-do the build for the modern connectors at some point. Maybe Lego will stop changing the connector standard now?

motors

Lego Mindstorms “servo” motors are a combination of 9V DC motor and quadrature encoder. That and a PWM output from the controller are enough for a PID control loop to manage speed and position, but it won’t know its absolute position at boot.
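For anyone who hasn’t met one, a position-holding PID loop looks roughly like the following. This is a generic sketch rather than the code written for the micro:bit: the gains, the output range of -1 to 1 for PWM duty, and the toy motor model in the usage example are all assumptions.

```python
class PID:
    """Textbook PID controller with a clamped output."""

    def __init__(self, kp, ki, kd, out_limit=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit  # assumed PWM duty range: -limit..+limit
        self.integral = 0.0
        self.prev_error = None

    def update(self, target, measured, dt):
        """Return a PWM duty given target and measured encoder counts."""
        error = target - measured
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no history on the first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        out = (self.kp * error
               + self.ki * self.integral
               + self.kd * derivative)
        return max(-self.out_limit, min(self.out_limit, out))


# Toy usage: a hypothetical motor that turns 900 counts/s at full duty.
pid = PID(kp=0.05, ki=0.0, kd=0.0)
position = 0.0
for _ in range(2000):
    duty = pid.update(target=360, measured=position, dt=0.01)
    position += duty * 900 * 0.01
```

In a real build the measured value would come from the quadrature decoder, and the poll rate (dt) matters as much as the gains.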

One solution to this is to have a bump switch to confirm the zero position at start-up. Alternatively, Lego has a clutch gear and at boot time you can just over-extend the position and let that slip to reset the zero position. This introduces the risk of drift and may require periodic re-homing, but (depending on levers and stuff) it may also lessen the damage if a card gets jammed in the wrong place.

I have to say, playing with a PID loop on a Lego motor is fun and everybody should try it at least once. It’s interesting to feel how the poll rate and parameters affect the feel of the motor as it resists you pushing on it. It can feel disconcertingly solid in contrast with the elastic feeling of a motor without control – depending on parameters.

logic

With all the machinery in place I have to actually write some code. Well, I wrote some code early on. Starting with a driver for the quadrature decoder which is provided by the nRF51822 on the micro:bit, and the PID controller… but none of that means much without a machine to attach it to. Which I don’t have. I just have a racing car, a bunch of rubber-band launchers, and some stuff the dog’s been chewing on.

But what I would do is:

With the cards set on the top shelf, turn the first motor forwards continuously, feeding cards one at a time towards the ramp. In the first pass it doesn’t matter where the ramp points, because we’re just counting the cards (or checking the current face value order if we’re fancy). Keep rolling until a sensor tells us we’re out of cards.

Here we stop and do some thinking to decide what order we want to put the cards into. If we saw n cards we enumerate them in random order from 0..n-1, and that’s going to be our target order. Knowing how many cards we have we also know we’ll have to do ceil(log2(n)) passes, one for each bit of the largest index.

Next reverse the motor, which stops the rollers and lifts the shuttle and rollers. Keep doing that until [TBD], then turn the motor forwards again to begin lowering everything. At this point rollers are still disengaged and the cards are on the shuttle which is above the top shelf.

On the way down the shuttle deposits the cards on the top shelf and passes between the forks. Once the shuttle hits the bottom, the [TBD] mechanism disengages from the shuttle and begins turning the rollers again. As each card passes by, move the ramp up or down to direct the card appropriately for its planned position in the final sort.

This is just an LSD radix sort. Odd numbered cards go up, even numbered cards go down, or whatever. Keep going until we run out of cards. Make sure the count is consistent with last time, or we’ve done a whoopsie.

Shuttle up, shuttle down. Repeat. This time the ramp position is determined by the next bit in the cards’ indices.

Over and over until we’ve done enough passes to fully shuffle the cards. All done. Yay!
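The whole procedure can be simulated without any Lego at all. The sketch below is an idealisation with made-up names: it numbers the cards from 0, treats each bin as keeping cards in the order they were dealt (ignoring any physical reversal of the piles), and uses Python’s random module where a real build would want a stronger source such as random.SystemRandom.

```python
import random


def passes_needed(n):
    """ceil(log2(n)) for n >= 1, computed without floating point."""
    return (n - 1).bit_length() if n > 0 else 0


def machine_shuffle(cards, rng):
    """Counting pass, random enumeration, then one binning pass per bit."""
    n = len(cards)            # first pass: just count the cards
    targets = list(range(n))
    rng.shuffle(targets)      # targets[i] is where the i-th card should land
    pile = list(zip(targets, cards))
    for bit in range(passes_needed(n)):
        # Ramp down for a 0 bit, up for a 1 bit (or the other way around).
        low = [(t, c) for t, c in pile if not (t >> bit) & 1]
        high = [(t, c) for t, c in pile if (t >> bit) & 1]
        if len(low) + len(high) != n:
            raise RuntimeError("card count changed between passes")
        pile = low + high     # shuttle up, shuttle down
    return [c for _, c in pile], targets


result, plan = machine_shuffle(list("A23456789TJQK"), random.Random(42))
```

After the last pass every card sits at the position it was assigned, so checking the result against the plan is the same easy verification as checking a sort.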

Now I just have to rebuild what I used to have, build and test the bits I didn’t already have, write the code, ???, and profit!

One day. When I’m retired, or whatever.

Getting even light from long LED strips

2025-09-27T00:00:00+00:00

Something you may notice for very long runs of LED strip is that they can be bright at one end and dim at the other. That’s because the strips are two long power rails with a bit of internal resistance, and current through the LEDs at the far end has more of that resistance to travel through.

Here’s how LED strips are typically wired:

The vertical stack of LEDs is distributed along the strip somewhat, which is why you’re restricted to cutting the strip at regular intervals of every three or six LEDs.

Let’s simplify that by treating the LED circuits as lamps:

You can hover over a lamp to see where the current flows. The further you go from the power supply the greater the cumulative resistance of the power rails.

The expedient but costly solution is to reinforce the power rails in the strip by soldering some heavy hookup wire onto the strip at some of the cut points where you don’t actually cut it (once every metre should be plenty; more frequently would be unnecessarily tedious).

But if you happen to be running the strip in a loop, such that the ends end up somewhat close to each other, or if you’re willing to run one length of hookup wire alongside the strip, then there’s a simpler fix for that difference in brightness.

Connect one side of the power supply to the near end of the strip, and connect the other side of the power supply to the far end of the strip. Be careful to still connect minus to minus and plus to plus in the usual way, though. Like so:

This way the length of the circuit through each LED is (approximately) the same, and so the resistance is the same and they all come out the same brightness all the way along the strip.
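A quick way to see the size of the effect is to add up the voltage lost in the rails on the way to each lamp. The sketch below makes simplifying assumptions that are not from the wiring diagrams: every lamp draws the same fixed current, every rail segment between lamps has the same resistance, and the hookup wires are ignored. The function name is made up.

```python
def lamp_voltage_drops(n_lamps, seg_ohms, lamp_amps, opposite_ends):
    """Rail voltage lost before reaching each lamp, nearest lamp first.

    opposite_ends=False: both supply wires join the strip at the near end.
    opposite_ends=True:  plus joins at the near end, minus at the far end.
    """
    drops = []
    for k in range(1, n_lamps + 1):
        # Plus rail: segment j feeds lamps j..n, so carries (n - j + 1) lamps.
        plus = sum((n_lamps - j + 1) * lamp_amps * seg_ohms
                   for j in range(1, k + 1))
        if opposite_ends:
            # Minus rail runs on to the far end: the segment after lamp j
            # carries the return current of lamps 1..j.
            minus = sum(j * lamp_amps * seg_ohms
                        for j in range(k, n_lamps))
        else:
            minus = plus  # return path retraces the same distance
        drops.append(plus + minus)
    return drops


same_end = lamp_voltage_drops(4, 1, 1, opposite_ends=False)
far_end = lamp_voltage_drops(4, 1, 1, opposite_ends=True)
```

Under these assumptions the near-end hookup gives drops that keep growing along the strip, while the opposite-end hookup leaves only a small difference between the ends and the middle, which fits the “approximately” above.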

You’ll see this in some prefabricated lighting strings which are not designed to be cut. They’ll have a third wire which is not connected directly to the LEDs, but at the end it’ll be connected to one of the other wires, and that will complete the circuit from the far end back to the power supply to balance things out. If you cut those then they won’t work anymore, because you would need to reconnect two of the wires at the cut point.

It doesn’t matter if the power supply hookup lines are different lengths. Having the total length be unnecessarily long will be less energy efficient, but they affect all the LEDs equally regardless of whether or not both sides are the same length.

It might be tempting to link both ends of the strip together in tee intersections with the power supply. That should work, and you’ll get more light out of the system overall, but you may still see a bit of dimming in the middle of the loop.

Initialisation at declaration considered harmful

2025-08-29T00:00:00+00:00

Suppose you have a variable x.

int x;

Hello x.

Now suppose you decide that under some circumstances x should have a particular value.

if (some_circumstances) x = particular_value;

And later on x’s good buddy y wants to have its own value based on x’s value.

int y = f(x);

Hey y, how’s it going? What’s that? You say you don’t feel so good?

Oh dear. It looks like somebody’s coming down with a touch of the Undefined Behaviour. Perhaps some_circumstances wasn’t the only case we should have addressed, here.

Conventional wisdom says you should avoid this situation by always initialising your variables when you define them. Ideally you do this by declaring them only when you know what they should be. But sometimes you have only partial information when you want to put a value in there, and so in the alternate case you can only make something up:

int x = 0;
if (some_circumstances) x = particular_value;
int y = f(x);

But what if your intention was not to set y to f(0)? What if the real bug was in failing to consider another case and come up with a suitable result in that case as well? What if x was actually a uid_t? Should a uid be initialised to zero as a “safe default” in the case of logic bugs?

Well, you could initialise x with a value so absurd that the mistake was bound to be highly visible in some way or other. Good choices are signalling NaNs, nullptr, etc., or something you’ll catch in an assert() eventually (if you remember, and if you have the test coverage). That’s problematic if your type can only represent legal and appropriate values (very often the case for DSP work).

You could use a bigger type as a temporary, or use std::optional<> which includes an explicit flag saying whether or not the variable has been initialised. But these require that extra checks be manually implemented before the variable is used. Otherwise they’ll likely produce silent failures of their own. And checks might not be put in all the necessary places, because they’re a manual effort.

The thing is, though, leaving the variable uninitialised is setting it to an illegal value which the compiler will try to prove cannot escape:

int x;  // will definitely get overwritten
if (some_circumstances) x = particular_value;
if (some_other_circumstances) x = different_value;
if (unusual_circumstances) x = spooky_value;
int y = f(x);

Ideally, if (some_circumstances || some_other_circumstances || unusual_circumstances) isn’t provably true then the compiler will gripe about this and you’ll have to revisit the code and make it right. This is most valuable if the code was clean before you made changes and afterwards this warning suddenly turns up.

Sadly, Clang and GCC really only care if they’re going to produce an undefined value, and with optimisations enabled most of these cases are obviated by replacing predicates with constants. That might cover security vulnerabilities but it’s no help with logic bugs. To get the job done properly you need to run Clang with --analyze, or use your own favourite static analyser.

Clearly the compiler’s still not going to get all of them, and the static analyser might miss something too, so being the diligent you that you are you’ll hopefully catch the remaining cases when you run your unit tests with -fsanitize=memory.

But if you do initialise the variable before you know what should be in that variable, then those checks will never work. Consequently you can introduce bugs which cause the initialiser you chose (before you knew what the value should be) to become the final value, and neither the compiler nor the sanitiser will be able to tell you that you’ve done so. You’d have been better off knowing you just broke something, but instead you’ll just get that “safe” value you initialised with.

Modern tooling has made an uninitialised variable the implicit signalling illegal state. But it’s also long-established bad style, so people have put time and effort into hiding bugs which would have been surfaced by the tools had they not tried to improve their code.

It’s unfortunate that there’s no consistent way to explicitly declare a variable as having an illegal state which should raise an error if it’s used. All we have is well-known ad-hoc solutions like nullptr, NAN, std::numeric_limits<T>::signaling_NaN(), maybe T::end(), etc.

I would prefer explicit syntax for “I don’t know yet” initialisers which still allow the tools to do their job but can drop in default fill values when the tools reach their limits. Like C++26’s erroneous behaviour, but made explicit so as to stave off those generic “uninitialised variable” warnings. Perhaps name it undecided{} or uncommitted{} or provisional{}, with an optional value argument if you don’t want to leave that choice to the implementation, reflecting that the developer hasn’t chosen a value and any attempt to read it before it changes would be a mistake, but without implying that it could be uninitialised.

int x = uncommitted<int>{};
if (some_circumstances) x = particular_value;
if (some_other_circumstances) x = different_value;
if (unusual_circumstances) x = spooky_value;
int y = f(x);  // invoke C++26 erroneous behaviour as needed

Ideally the compiler or static analyser would pick up any oversights in that logic. If not, -fsanitize=memory might pick it up provided you have a test case that covers it. If not, then a default value is inserted as chosen by uncommitted{}, or the value you specify if you choose to do so (even though you’ve clearly never tested it). One might expect uncommitted{} to choose a signalling NaN for floating-point types and nullptr for any pointer type.

C++26 might achieve that if you leave the variable uninitialised at definition, but that just looks like a mistake, and it’s a landmine if you don’t have your compiler configured appropriately.

Additionally, if you could be explicit, you can be more explicit about other things, like:

int f(Result& result, int arg) {
    result = uncommitted<Result>{};
    // ...
    result = work_in_progress;
    // ...
    if (accident_happened) {
        result = uncommitted<Result>{};
        return -1;
    }
    // ...
    return 0;
}

And let the tools ensure that result is left untouched when it’s in an undefined state.

An experimental RISCV instruction compression

2025-08-04T00:00:00+00:00

I wanted to experiment with a means of reducing compiled RISCV code size in a way that did not allow for the creation of un-aligned 32-bit opcodes, so I had a bit of a tinker with 32-bit packets containing instruction pairs.

Rationale

RISCV sees implementations ranging from lightweight scalar to wide OOE superscalar, each needing to take very different approaches to how the instruction stream is ingested.

Things like the large number of instruction entrypoints with unaligned 32-bit opcodes are problematic for out-of-order machines; while the low-end processors still want to minimise code size and icache burden.

I’ve previously mused over the idea of aligned 32-bit packets of 16-bit instructions with extra constraints to try to make it easy to ingest the packet as a single opcode, and then to split it into micro-ops later in the pipeline, where everything else gets split into micro-ops already.

And at the same time I observed that overlapping rd and rs1 operands is not the only way to overload the bits in an opcode.

So without any real insight into the technicalities of how those things would work out in practice, I set about making my own little straw-man.

I’ve taken inspiration from other proposals, and tried to make such extensions available as pairs of more pedestrian opcodes within the same 32-bit packet. So what might look like a CISC instruction can be dressed up as two compressed RISC instructions instead; even if one were to implement it as a single fused instruction.

Design objectives

Support only 32-bit opcode packets, but squeeze pairs of instructions into those packets for compression.
Ensure that every such packet can be interpreted in two passes as two independent instructions, each conforming to the standard RISCV ISA model (2-in-1-out, etc.).
Restrict branching to only the final operation of a packet.
Try to exploit the frequent sharing of common registers in adjacent instructions to aid compression.
Capture some proposed instruction extensions which could be implemented as macro-op fusion instructions and formalise them as pairs within one 32-bit packet.
Use no more than 1/4 (30 bits) of the opcode space.
Make code smaller.

References

Qualcomm Znew/Zics:

https://lists.riscv.org/g/tech-profiles/attachment/332/0/code_size_extension_rvi_20231006.pdf

Macro-op fusion stuff:

RISCV reference card:

http://riscvbook.com/greencard-20181213.pdf (warning: non-SSL link)
https://www.cl.cam.ac.uk/teaching/1617/ECAD+Arch/files/docs/RISCVGreenCardv8-20151013.pdf

A provisional attempt

With no statistical model of instruction-pair frequency, I just guessed at what might work and came up with the following.

For expediency I’ve only counted the number of instructions in each class and laid them out sequentially. It would be folly to try to arrange the specific bit patterns for efficient decoding before the supported instruction set is decided.

       0x0+0x10000000: 14: arith4  rsd,rsd,rs_imm          14: arith4  rsd,rsd,rs_imm          (28 bits)  660 hits
0x10000000+0x10000000: 14: arith4  t6,rs1,rs_imm           14: arith4  rd,t6,rs_imm            (28 bits)  0 hits
0x20000000  +0x800000: 14: arith5i rsd,rsd,imm5             9: arith5i rsd,rsd,{imm}           (23 bits)  79 hits
0x20800000  +0x800000: 14: arith5r rsd,rsd,rs2              9: arith5r rsd,rsd,{rs2}           (23 bits)  1 hits
0x21000000  +0x800000: 14: arith5i rsd,rsd,imm5             9: arith5r rsd,rsd,{rd}            (23 bits)  27 hits
0x21800000  +0x800000: 14: arith5r rsd,rsd,rs2              9: arith5r rsd,rsd,{rd}            (23 bits)  8 hits
0x22000000 +0x2000000: 14: arith4  rsd,rsd,rs_imm          11: beqz    {rd},imm11              (25 bits)  23 hits
0x24000000 +0x2000000: 14: arith4  rsd,rsd,rs_imm          11: bnez    {rd},imm11              (25 bits)  32 hits
0x26000000 +0x1000000: 13: cmpi    t6,rs1,imm5             11: beqz    t6,imm11                (24 bits)  0 hits
0x27000000 +0x1000000: 13: cmpi    t6,rs1,imm5             11: bnez    t6,imm11                (24 bits)  0 hits
0x28000000 +0x2000000: 14: arith4  rsd,rsd,rs_imm          11: j       imm11                   (25 bits)  76 hits
0x2a000000 +0x2000000: 14: arith4  rsd,rsd,rs_imm          11: jal     ra,imm11                (25 bits)  92 hits
0x2c000000  +0x100000: 15: arith5  rsd,rsd,rs_imm           5: jr      rs2                     (20 bits)  16 hits
0x2c100000  +0x100000: 15: arith5  rsd,rsd,rs_imm           5: jalr    ra,rs2                  (20 bits)  7 hits
0x2c200000  +0x200000: 21: --reserved--                    (21 bits)  0 hits
0x2c400000  +0xc00000: 19: pair.a  rd,rs1,rs2               5: {opcode:pair} rd,{rs1},{rs2}    (24 bits)  0 hits
0x2d000000 +0x1000000: 24: ldst    rd,imm10(rs1)            0: {opcode} {rd:next},{imm:next}({rs1})  (24 bits)  341 hits
0x2e000000 +0x8000000: 13: arith3  rsd,rsd,rs_imm          14: ldst    rd,0(rs1)               (27 bits)  364 hits
0x36000000 +0x8000000: 14: ldst    rd,0(rs1)               13: arith3  rsd,rsd,rs_imm          (27 bits)  635 hits
total size: 0x3e000000,  bits: 30
saved=2361, total=10258
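The bookkeeping in that table can be checked mechanically. The sketch below copies the size and hit columns from the table above and confirms that the classes add up to the stated total, fit within a quarter of the 32-bit opcode space, and that the hits add up to the “saved” figure. What unit “saved” and “total” are counted in is not stated, so the ratio at the end is only a ratio.

```python
# (size of encoding class, hits) copied row by row from the table above.
ROWS = [
    (0x10000000, 660), (0x10000000, 0),
    (0x800000, 79), (0x800000, 1), (0x800000, 27), (0x800000, 8),
    (0x2000000, 23), (0x2000000, 32),
    (0x1000000, 0), (0x1000000, 0),
    (0x2000000, 76), (0x2000000, 92),
    (0x100000, 16), (0x100000, 7),
    (0x200000, 0),      # reserved
    (0xc00000, 0),      # pair.a
    (0x1000000, 341),   # ldst pair sharing base register and offset
    (0x8000000, 364), (0x8000000, 635),
]

total_size = sum(size for size, _ in ROWS)
total_hits = sum(hits for _, hits in ROWS)
quarter_of_opcode_space = 1 << 30

fits = total_size <= quarter_of_opcode_space
saved_fraction = total_hits / 10258  # "total" figure from the table

print(hex(total_size), fits, total_hits, round(saved_fraction, 3))
```

Since the hits sum to the saved figure, each hit appears to account for one unit of saving.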

Or here’s another version:

       0x0+0x10000000: 14: arith4  rsd,rsd,rs_imm          14: arith4  rsd,rsd,rs_imm          (28 bits)  658 hits
0x10000000+0x10000000: 14: arith4  t6,rs1,rs_imm           14: arith4  rd,t6,rs_imm            (28 bits)  0 hits
0x20000000  +0x800000: 14: arith5i rsd,rsd,imm5             9: arith5i rsd,rsd,{imm}           (23 bits)  78 hits
0x20800000  +0x800000: 14: arith5r rsd,rsd,rs2              9: arith5r rsd,rsd,{rs2}           (23 bits)  1 hits
0x21000000  +0x800000: 14: arith5i rsd,rsd,imm5             9: arith5r rsd,rsd,{rd}            (23 bits)  27 hits
0x21800000  +0x800000: 14: arith5r rsd,rsd,rs2              9: arith5r rsd,rsd,{rd}            (23 bits)  8 hits
0x22000000 +0x2000000: 14: arith4  rsd,rsd,rs_imm          11: beqz    {rd},imm11              (25 bits)  23 hits
0x24000000 +0x2000000: 14: arith4  rsd,rsd,rs_imm          11: bnez    {rd},imm11              (25 bits)  32 hits
0x26000000 +0x1000000: 13: cmpi    t6,rs1,imm5             11: beqz    t6,imm11                (24 bits)  0 hits
0x27000000 +0x1000000: 13: cmpi    t6,rs1,imm5             11: bnez    t6,imm11                (24 bits)  0 hits
0x28000000 +0x2000000: 14: arith4  rsd,rsd,rs_imm          11: j       imm11                   (25 bits)  79 hits
0x2a000000 +0x2000000: 14: arith4  rsd,rsd,rs_imm          11: jal     ra,imm11                (25 bits)  92 hits
0x2c000000  +0x100000: 15: arith5  rsd,rsd,rs_imm           5: jr      rs2                     (20 bits)  18 hits
0x2c100000  +0x100000: 15: arith5  rsd,rsd,rs_imm           5: jalr    ra,rs2                  (20 bits)  7 hits
0x2c200000  +0x800000: 15: arith5  rsd,rsd,rs_imm           8: sw      {rd},imm8(sp)           (23 bits)  1 hits
0x2ca00000  +0x800000: 15: arith5  rsd,rsd,rs_imm           8: sd      {rd},imm8(sp)           (23 bits)  0 hits
0x2d200000  +0x800000: 13: lw      rd,imm8(sp)             10: arith5  {rd},{rd},rs_imm        (23 bits)  1 hits
0x2da00000  +0x800000: 13: ld      rd,imm8(sp)             10: arith5  {rd},{rd},rs_imm        (23 bits)  1 hits
0x2e200000  +0x200000: 21: --reserved--                    (21 bits)  0 hits
0x2e400000  +0xc00000: 19: pair.a  rd,rs1,rs2               5: {opcode:pair} rd,{rs1},{rs2}    (24 bits)  0 hits
0x2f000000 +0x1000000: 19: ldst    rd,imm5(rs1)             5: {opcode} rd,{imm:next}({rs1})   (24 bits)  402 hits
0x30000000 +0x8000000: 13: arith3  rsd,rsd,rs_imm          14: ldst    rd,0(rs1)               (27 bits)  372 hits
0x38000000 +0x8000000: 14: ldst    rd,0(rs1)               13: arith3  rsd,rsd,rs_imm          (27 bits)  620 hits
total size: 0x40000000,  bits: 30
saved=2420, total=10258

Other opcodes like breakpoint can be overloaded in the rd=0 space. Or fall back to 32-bit encoding.

The mem,mem operations essentially mimic the load/store pair instructions proposed by Qualcomm, but lacking pre/post-increment because that would break the 2-in-1-out contract in a two-round implementation. These share the base register and immediate offset arguments, and the destination register is a consecutive pair.

The arithmetic,mem and mem,arithmetic pairs provide the pre/post-increment operations proposed by Qualcomm, but are then generalised to offer other arithmetic operations as well. There are details to work out, here, regarding how the implicit shift produced by a load operation should interact with various types of arithmetic.

The mem,arithmetic pairs should probably be defined to prohibit use of the load result in the second operation, even though this is probably a very reasonable thing to expect to do in general.

And the cmp,b pairs produce the beqi and bnei instructions from the Qualcomm proposal.

The notes about hits and saved (you need to scroll right) are how many times that pair was used by a simplistic regex (currently only considering adjacent pairs) on a trivial benchmark which I ran through qemu. In the case of duplication the first viable row takes the credit.
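The counting loop amounts to something like the sketch below. The two patterns are made-up stand-ins for illustration (they are not the expressions behind the tables above), and the disassembly syntax is assumed to be plain `op rd,rs1,rs2` text. It only shows the "adjacent pairs, first viable row takes the credit" behaviour.

```python
import re

# Hypothetical rows: (label, pattern matched against "insn1\ninsn2").
# These are illustrative assumptions, not the real table rows.
ROWS = [
    ("arith, bnez {rd}",
     re.compile(r"^addi?w?\s+(\w+),\1,[^\n]+\nbnez\s+\1,")),
    ("ldst, arith on base",
     re.compile(r"^l[bhwd]u?\s+\w+,0\((\w+)\)\nadd\w*\s+\1,\1,")),
]

def count_pairs(insns):
    """Greedily pair adjacent instructions; the first matching row wins."""
    hits = {label: 0 for label, _ in ROWS}
    i = 0
    while i + 1 < len(insns):
        text = insns[i] + "\n" + insns[i + 1]
        for label, pattern in ROWS:
            if pattern.match(text):
                hits[label] += 1
                i += 2          # both instructions are consumed by the pair
                break
        else:
            i += 1              # no row fits; try pairing from the next one
    return hits

example = [
    "addi a0,a0,-1",
    "bnez a0,.L1",
    "lw a5,0(a4)",
    "addi a4,a4,4",
    "ret",
]
print(count_pairs(example))
```

Because the scan is greedy and never looks past the next instruction, it undercounts anything a compiler could fix by reordering.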

About 2400 instructions out of 10000 were squeezed into the preceding instruction. The original RVC compression used about 5500 16-bit opcodes, so to compare like-for-like that means I used 4800 “16-bit opcodes”.
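The like-for-like conversion is just counting bytes: folding an instruction into its neighbour saves four bytes, which is the same saving as turning two instructions into 16-bit ones. A small sketch of that arithmetic, using the `total` and `saved` figures from the second table and treating the RVC count of 5500 as the rough figure it is:

```python
def packet_bytes(total_insns, paired):
    """Code size when `paired` instructions ride along in a neighbour's packet."""
    return (total_insns - paired) * 4

def rvc_bytes(total_insns, compressed):
    """Code size when `compressed` instructions use 16-bit encodings."""
    return (total_insns - compressed) * 4 + compressed * 2

def equivalent_16bit(paired):
    """Each folded instruction saves as much as two 16-bit opcodes would."""
    return paired * 2

TOTAL = 10258        # from the tables above
PAIRED = 2420        # "saved" in the second table
RVC_16BIT = 5500     # approximate figure quoted in the text

print(equivalent_16bit(PAIRED))          # 16-bit-opcode equivalent
print(packet_bytes(TOTAL, PAIRED))       # bytes with packet pairing
print(rvc_bytes(TOTAL, RVC_16BIT))       # bytes with RVC (approximate)
```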

I don’t think that’s too bad considering that no tuning has been done either in my opcode selection or in the compiler to put things in a viable order. And I’ve put a lot of space into things the compiler obviously wouldn’t generate without modification.

Big caveat regarding the quality of my regular expressions, though.

load/store ops

| RV32  | RV64  | RV128 |
|-------|-------|-------|
| `lb`  | `lb`  | `lb`  |
| `lh`  | `lh`  | `lh`  |
| `lw`  | `lw`  | `lw`  |
|  --   | `ld`  | `ld`  |
|  --   |  --   | `lq`  |
| `sb`  | `sb`  | `sb`  |
| `sh`  | `sh`  | `sh`  |
| `sw`  | `sw`  | `sw`  |
|  --   | `sd`  | `sd`  |
|  --   |  --   | `sq`  |
| `lbu` | `lbu` | `lbu` |
| `lhu` | `lhu` | `lhu` |
| `flw` | `lwu` | `lwu` |
|  --   | `fld` | `fld` |
| `fsw` |  --   |  --   |
|  --   | `fsd` | `fsd` |

The x/y options differ between RV32, RV64, and RV128; if the unsigned version would be identical to the signed version because that is the native word size, then this instruction is repurposed as a native-sized floating-point load or store instead (resulting in RV128 having no floating-point load or store – oh well).

arithmetic ops

| 3 bits, 50% immediates|
|-----------------------|
| `addi`     # imm+0    |
| `addi`     # imm+32   |
| `addi`     # imm-64   |
| `addi`     # imm-32   |
| `add`                 |
| `sub`                 |
| `and`                 |
| `or`                  |

|4 bits, 50% immediates |
|-----------------------|
| `addi`     # imm+0    |
| `addi`     # imm-32   |
| `addiw`    # imm+0    |
| `addiw`    # imm-32   |
| `addi4spn` # imm+0    |
| `addi4spn` # imm+32   |
| `andi`     # imm+0    |
| `andi`     # imm-32   |
| `add`                 |
| `addw`                |
| `sub`                 |
| `subw`                |
| `and`                 |
| `bic`                 |
| `or`                  |
| `xor`                 |

|5 bits, 50% immediates |
|-----------------------|
| `addi`     # imm+0    |
| `addi`     # imm-32   |
| `addiw`    # imm+0    |
| `addiw`    # imm-32   |
| `andi`     # imm+0    |
| `andi`     # imm-32   |
| `addi4spn` # imm+0    |
| `addi4spn` # imm+32   |
| `slli`     # imm+0    |
| `slli`     # imm+32   |
| `srli`     # imm+0    |
| `srli`     # imm+32   |
| `srai`     # imm+0    |
| `srai`     # imm+32   |
| `rsbi`     # imm+0    |
| `rsbi`     # imm+32   |
| `add`                 |
| `addw`                |
| `sub`                 |
| `subw`                |
| `and`                 |
| `bic`                 |
| `or`                  |
| `xor`                 |
| `mul`                 |
| `mulh`                |
| `div`                 |
| `rem`                 |
| `sll`                 |
| `srl`                 |
| `sra`                 |

For the addi*4spn instruction, the rsd field is used simply as rd and sp is used as the new rs1. Also the immediate is multiplied by four. I suppose this should be an unsigned immediate because that’s where all the useful data is. A couple of other operations need unsigned immediates, too.

Where the second operation borrows its rs2_imm argument from the first operation it doesn’t have free choice between a register or immediate value. Consequently one bit of the encoding is redundant. I’ll fix that later. In fact, while sharing the immediate between two instructions makes sense (eg., shl/shr patterns), it’s less clear that the extra bit of free choice for immediate serves any purpose. But it’s harder to recycle that bit.

|cmp (3 bits, all immediates)|
|----------------------------|
| `slti`     # imm+0         |
| `slti`     # imm-32        |
| `sltiu`    # imm+0         |
| `sltiu`    # imm+32        |
| `seqi`     # imm+0         |
| `seqi`     # imm-32        |
| `bittesti` # imm+0         |
| `bittesti` # imm+32        |

I don’t think bittest is a thing in any RISC-V extension? But I’m throwing it in here because it fills a niche. The immediate operand is the bit index to test and to branch on.

Some instructions I just made up to fill in gaps while I didn’t want to think about it.

| pairs (4 bits, no immediates) |           |
|-----------|-----------|
| `add`     | `sltu`    |
| `sub`     | `add`     |
| `min`     | `max`     |
| `minu`    | `maxu`    |
| `and`     | `bic`     |
| `mulhsu`  | `mul`     |
| `mulh`    | `mul`     |
| `mulhu`   | `mul`     |
| `div`     | `rem`     |
| `divu`    | `remu`    |
| `???`     | `???`     |
| `???`     | `???`     |

The use of an add,sltu pair forms an add-with-carry, but is problematic in its definition. It breaks the pattern of sharing both source registers, needing the result of the previous add instead, unless the sltu part is instead redefined to be a different operation which simply computes the carry from the inputs.
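To make the difference concrete, here is a sketch modelling both definitions on 64-bit values. The function names are mine, and the "carry from the inputs" form shown (an unsigned compare of the inverted first source against the second) is just one way such an operation could be defined, not a settled choice.

```python
MASK = (1 << 64) - 1  # model 64-bit registers

def add_then_sltu(a, b):
    """Conventional sequence: the sltu consumes the result of the add."""
    total = (a + b) & MASK
    carry = int(total < a)          # sltu carry, total, a
    return total, carry

def carry_from_sources(a, b):
    """Alternative: carry computed from the shared sources only.

    a + b overflows exactly when b > MASK - a, and MASK - a is ~a.
    """
    return int((~a & MASK) < b)

for a, b in [(5, 7), (MASK, 1), (1 << 63, 1 << 63), (0, MASK)]:
    total, carry = add_then_sltu(a, b)
    assert carry == carry_from_sources(a, b)
    print(hex(a), hex(b), hex(total), carry)
```

The second form keeps the "both halves read the same two sources" contract, at the cost of no longer being a plain sltu.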

TODO: Extracting carry like this raises questions on whether overflow is also warranted, and also if there should be branching versions of the same ops for efficiently handling small arithmetic with low-overhead escapes to longer arithmetic as needed (signed and/or unsigned, as is needed in Python and JavaScript).

Caveats

Arithmetic paired with ldst are affected by the ld/st width (yikes?), which means that if you overwrite the load with a breakpoint you still need to be able to encode the effect on the adjacent op.
Also, I didn’t think too hard about statistical merits of any of these choices. I took some guidance from the existing compressed instruction extension to keep it in roughly the right place, but my changes may add their own implications.
There might be much too much overlap between the different register sharing modes for arithmetic. This needs to be looked at still.

Variations

For mem,mem the immediate could be smaller and the pair of destination registers could be arbitrary, consistent with the arithmetic instructions which share both source registers.
op.full rd,rs,rs ; =~op.full =rd,=rd,rs is also 25 bits and could probably be more use in that it doesn’t corrupt the original sources. Just have to pick a sensible ~op.full.
As well as the usual overloading of Rd=x0, it might make sense, for example, if Rsd=t6 then to read that as sp and write t6 in the first opcode, and then read Rsd in the second operation as t6 and write Rsd as the register actually specified. Or something like that.
I notice there’s this RV32E profile, which discards the top 16 registers from the register file. This might be a reasonable compromise to repurpose a couple of bits, and presumably it’s easier to get the compiler to generate test code for it.

Questions

I didn’t do anything about an optimisation using the same rd for both instructions (implicit discard of the first result after it’s used by the second). Why is that?
When an arithmetic instruction has an implicit shift provided by being paired with a load or store (which has a data size), when should it apply? Should it affect only immediates? (I say no!) Should it affect only add and sub operations? Should it affect only operations whose destination register is the same as the base register in the memory operation?
What are the alignment constraints of these mem,mem ops? I don’t know!
Did I choose the right basic arithmetic instructions?

Observations

Because it reserves only a portion of the coding space for compressed instructions, this is different from Thumb. One doesn’t have to squeeze everything in. If something is difficult it can be ignored and left to the 32-bit encoding, leaving coding space to allow anything else to capture more cases.

On the other hand it’s tempting to hang on to some of the CISC-like tuples on the basis that they are strong candidates for fusion, and sometimes that is a squeeze. It’s bad form to pre-suppose the implementation in the ISA, but it’s still tempting to make such an optimisation available.

Next steps

I really need more data about why each instruction fails to fall into a pair. Is it because I chose the wrong shortlist of opcodes, or because the operand constraints don’t fit, or because the immediate is too big? A lot of this hangs on choices the compiler made, which in turn reflect the instruction set it was aiming for, but I don’t think I’m capable of iterating over the compiler’s notion of available instructions, so I’ll just use proxy configurations instead.

As a general guide I plan to use:

qemu-riscv64-static -d nochain,in_asm,execxx ./benchmark

(or something like that) to collect translation blocks of instructions and count the number of times each block is executed. These blocks, compiled in different ways, can be used for a casual measure of the compression ratio, but it would rely on some re-ordering of instructions and a contract in the compiler to not use the t6 register because I borrowed it for some operation pairs with throwaway results.

What would be better is to see how different arrangements fare in an actual compiler trying to optimise for them, but I don’t know if that’s a realistic thing to experiment with.

How I made a gzip encoder faster than memcpy

2025-07-02T00:00:00+00:00

In the compression world it’s usual to compare the time spent compressing and decompressing data with the time difference in transmitting the compressed or uncompressed data over a given network. In this experiment I managed to make the compression faster than the bandwidth to RAM. Sort of. Under special circumstances and with no apologies for the egregious clickbait headline.

In the simplest possible terms this compression works by maintaining a dictionary of pre-cooked strings, appending those to the output stream, and noting when they’ve already been emitted recently (a simple index to last use with bounds check) and emitting a backreference code instead of the full string in those cases.

Non-pre-cooked strings are not supported efficiently. It’s an encoder restricted to very specific applications. Probably.

The bit-packing overhead is obviated by contriving Huffman codes which always fall on byte boundaries. This is impossible for a generic octet stream in the Deflate format, but is achievable for UTF-8 text.

The hard part turned out to be the checksum calculation. When I thought of the idea I assumed (hoped) it would be an Adler32 checksum where it is easy to reason about appending precomputed checksums to the running checksum. It turned out gzip uses CRC32, and gzip is the preferred format over zlib in web browsers. So I had to figure out how to append CRC checksums as efficiently as possible.

It turns out you can precompute the string checksum and store the string length as a multiplier to be applied to the running checksum via clmul and folding that with a 64-bit crc32 operation.
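As a rough illustration of the arithmetic (not the clmul implementation in the linked code), appending a string to a running CRC-32 is a multiplication of the running value by x^(8·len) modulo the CRC polynomial, followed by an XOR with the string's own CRC. The sketch below does the carry-less multiply one bit at a time in plain Python and checks itself against `zlib.crc32`; the function names are my own.

```python
import zlib

POLY = 0xEDB88320  # CRC-32 polynomial, bit-reflected as used by gzip

def multmodp(a, b):
    """Carry-less multiply of a and b modulo POLY (reflected bit order)."""
    product = 0
    mask = 1 << 31                 # bit 31 holds the x^0 coefficient
    while mask:
        if a & mask:
            product ^= b
        mask >>= 1
        b = (b >> 1) ^ POLY if b & 1 else b >> 1   # b *= x
    return product

def x_pow(n):
    """x^n mod POLY by square-and-multiply."""
    result, base = 1 << 31, 1 << 30   # the polynomials 1 and x
    while n:
        if n & 1:
            result = multmodp(result, base)
        base = multmodp(base, base)
        n >>= 1
    return result

def precook(s):
    """Per-string constants: its CRC and the multiplier for its length."""
    return zlib.crc32(s), x_pow(8 * len(s))

def crc32_append(running, cooked):
    crc, multiplier = cooked
    return multmodp(multiplier, running) ^ crc

pieces = [b"hello ", b"world", b"hello "]
running = 0                        # CRC-32 of the empty string
for cooked in map(precook, pieces):
    running = crc32_append(running, cooked)
assert running == zlib.crc32(b"".join(pieces))
```

In a real encoder the `multmodp` step is where a hardware carry-less multiply and a reduction would go; the bit loop here is only for checking the maths.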

Arm has CPU instructions for both of these operations, but x86 only has the former (its CRC instruction uses the wrong polynomial), which means using clmul to calculate the crc as well. Typically this is optimised for SIMD use, but a scalar operation is all that’s needed here. I suspect the extra work to batch it into SIMD chunks would be worse than the savings.

TODO: a bunch of extra exposition

Here’s the code: defl-8bit.

possible improvements

Write a preprocessor to break input text into strings at the most appropriate boundaries, adding flexibility in random string generation.
Implement the higher-level backref operator so multiple backreferences can be consolidated and their checksums can be computed as the difference between start and end of previous copy.
Make larger backrefs using the conventional rolling hash thing, but on the precomputed string fragments rather than every byte.
Or, remember previous backref distance and merge them when possible.
Clean up the code.
Figure out a proper generic interface with virtual methods in places that make sense and don’t have scary performance implications.
Add a practical fallback implementation for CRC for webasm compatibility (all that work for nothing!).
Does the Adler-32 implementation even work?
Tweak the clmul crc for performance.
Tweak everything else for performance.
Clean up this post.

Hiding messages in machine-generated text

2025-06-14T00:00:00+00:00

On a whim I thought I’d try getting ChatGPT to do a bit of steganography for me. There are a bazillion ways (give or take) for hiding a secret message in unrelated cleartext, where there’s a trade-off of secret bandwidth against cleartext flexibility. I chose Morse code encoded in the last letter of each word, because it’s obvious and easy to express as rules that anybody can follow.

Anybody except for ChatGPT, it turns out.

The rules I gave were simple enough:

A word ending with a vowel represents a dot.
A word ending with a consonant represents a dash.
A word ending with a y represents the gap between letters.
Express a message in morse code, using the above substitutions of words for symbols.
The words should be chosen to form coherent sentences.

That seemed to leave plenty of freedom for choosing words.
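For checking candidate sentences against those rules, a decoder is only a few lines. This is a sketch under the rules exactly as listed (last letter of each word, `y` as the letter gap, any other consonant as a dash); the `?` for an unrecognised group is my own convention.

```python
import re

MORSE = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z",
}

def to_morse(text):
    """Map each word to dot/dash/gap according to its final letter."""
    groups, current = [], ""
    for word in re.findall(r"[A-Za-z]+", text):
        last = word[-1].lower()
        if last == "y":            # gap between letters
            groups.append(current)
            current = ""
        elif last in "aeiou":      # dot
            current += "."
        else:                      # any other consonant: dash
            current += "-"
    if current:
        groups.append(current)
    return " ".join(g for g in groups if g)

def decode(text):
    return "".join(MORSE.get(group, "?") for group in to_morse(text).split())

print(to_morse("Then sleepily he sat. Slowly"))   # two letters' worth of words
print(decode("Then sleepily he sat. Slowly"))
```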

It turns out schemes like these are called acrostics or telestichs (the latter in my case). The extra layer of using morse code and groups of letters helps to make it less obvious than traditional acrostics, but it takes several words to make a letter of the hidden message.

I thought an LLM should be able to churn out a coherent paragraph under those constraints, and I’m sure that it could if it could remember what it was supposed to do, but I had some difficulties.

For the secret message example, ChatGPT gave me:

Alone Henry left a trail then slept quietly, and followed slowly too.

I don’t know what that means or why it wrote it, but it decodes as:

. -.--- -- .

Which comes back out as E�ME. That’s not quite right.

Let’s try fixing it by hand:

Alone, Henry left the tree. Then sleepily he sat. Slowly following on my horse. What could he say? The adverbs are too many to try.

. -..- .- -- .--. .-.. .

There. Fixed. But that wasn’t as much fun as I thought it would be. The adverbs are indeed too many. While there are plenty of words ending in y to choose from, it gets hard to think of things that aren’t adverbs or adjectives. And that need comes up too frequently.

Obviously having only words ending in one letter to choose from for that marker is too restrictive, and y is an especially distracting choice. Some markers could simply be dropped because there are a lot of cases where it’s not ambiguous, but unambiguous cases are at the end of infrequent letters, where it’s frustrating already.

I won’t try to fix this because it’s not my priority. All I wanted was something with the simplicity of a children’s game. Unfortunately I feel like it’s a bit too tedious for many kids to encode their own messages. At least without the support of an editor or thesaurus to offer up practical synonyms.

Or, of course, one could just make an LLM do it, as was the original plan. I’m sure it should have no trouble if it can be put in a suitable wrapper to keep it on track. But I’m too lazy, which is why I went to ChatGPT in the first place.

But there are many related schemes which could be devised, and I think technology is in a place, right now, where it should be trivial to automate. I just can’t be bothered doing that.

Generating nonsense text even more efficiently

2025-04-16T00:00:00+00:00

In my previous post, for the purpose of defining performance expectations I compared the way in which I was generating text (Mad-Libs style string substitution and concatenation) to Lempel-Ziv text decompression. That is, it’s simply the task of scheduling a series of string copies of various lengths and offsets. The complexity of deciding which strings to copy comes in one case from decoding the input stream to get the instructions, or in the other case from navigating tables and picking randomly from them.

Well-optimised LZ decompressors with low-complexity input decoders advertise rates as high as a couple of GB/s on top-end machines. After a bit of tweaking I got my JavaScript-based generator up to 20MB in around 200ms, or 100MB/s. Just one order of magnitude less; which is probably OK.

This only needed heavy optimisation because I cannot get crawlers to execute JavaScript on their end.

But actually there is a programmable client-side mechanism I might be able to work with. They have the gzip decoder. Normally one would generate text and compress it; but that wouldn’t be any help in reducing server-side burden. Instead it would be more use to synthesise the compressed bitstream directly.

Doing that isn’t entirely silly, since a gzip decompressor is just using a Huffman decoder to unpack a schedule of literal bytes and string copies. We mostly just want the string copies. We need only send enough raw text to get things rolling, and then because the vocabulary is so limited it can be all string copies from that point on.

Unfortunately that bit-packing is [comparatively] expensive and we don’t want to spend time on it, except we kind of have to because that is the standard.

Much of this pain can be alleviated by not bothering with any of the usual statistics of Huffman coding and instead contriving a fixed set of symbols which always fall on byte boundaries. Or at least small tuples of usable symbols which end on byte boundaries. Then in our pool of strings we disregard the strings themselves and instead keep note of the places to copy from and the length of the copies. Pre-coded into a packed bit-string which happens to be a whole number of bytes long so there’s no bit packing to do.

So instead of performing whole string copies, we just copy a couple of bytes with some touch-ups.

That last bit is a little complicated. First, strings can’t be allowed to fall out of the 32kB back-reference window. If they do then there’s nowhere to copy them from and we’d have to insert them from literals. Second, those touch-ups are about encoding the relative distance back to the previous use of the target string; and different distances have different encodings, so there’s going to be some extra translation between distance and symbols.

Another hassle is that the construction of the tables is always Canonical Huffman, which doesn’t leave any room for gaps. Symbols can’t just look like whatever we want them to. They’re allocated in a prescribed order. Even if we can’t get what we want and it did boil down to manually bit-packing, hard-coding a symbol table can make the job much easier than having to support dynamically changing symbol sizes.

And then there’s the checksum problem. The checksums of the individual strings can be pre-computed to save having to work on that, and the previous state of the checksum can be quickly fast-forwarded and added in to the precomputed checksum. It’s a bit of a fiddle, but hopefully not too slow.

Alternatively, it might be possible to ignore the checksum. Maybe nobody would notice?

The work so far…

In gzip we have variable-length codes in two dictionaries to consider. The first is a unified literal and copy-length dictionary, and the second, used whenever a copy length is read, is the back-reference distance. The copy lengths and the back references consume some number of “extra bits”, depending on the specific code. To make these fixed-length symbols the variable-length code will have to be complementary in size to those extra bits.

Looking at the distance codes first there are two codes with 13 extra bits, so one bit will be needed to distinguish between the two codes, and one more will be needed to distinguish between those codes and all the others. Four codes have zero extra bits, and the longest those can be is 15 bits.

That all fits very neatly. All the distances can be made a constant 15 bits long:

|      |Extra|             |                 |
| Code | Bits|   Distance  |       VLC       |
|------|-----|-------------|-----------------|
|   0  |   0 |       1     | 111111111111100 |
|   1  |   0 |       2     | 111111111111101 |
|   2  |   0 |       3     | 111111111111110 |
|   3  |   0 |       4     | 111111111111111 |
|   4  |   1 |      5,6    | 11111111111100x |
|   5  |   1 |      7,8    | 11111111111101x |
|   6  |   2 |      9-12   | 1111111111100xx |
|   7  |   2 |     13-16   | 1111111111101xx |
|   8  |   3 |     17-24   | 111111111100xxx |
|   9  |   3 |     25-32   | 111111111101xxx |
|  10  |   4 |     33-48   | 11111111100xxxx |
|  11  |   4 |     49-64   | 11111111101xxxx |
|  12  |   5 |     65-96   | 1111111100xxxxx |
|  13  |   5 |     97-128  | 1111111101xxxxx |
|  14  |   6 |    129-192  | 111111100xxxxxx |
|  15  |   6 |    193-256  | 111111101xxxxxx |
|  16  |   7 |    257-384  | 11111100xxxxxxx |
|  17  |   7 |    385-512  | 11111101xxxxxxx |
|  18  |   8 |    513-768  | 1111100xxxxxxxx |
|  19  |   8 |   769-1024  | 1111101xxxxxxxx |
|  20  |   9 |   1025-1536 | 111100xxxxxxxxx |
|  21  |   9 |   1537-2048 | 111101xxxxxxxxx |
|  22  |  10 |   2049-3072 | 11100xxxxxxxxxx |
|  23  |  10 |   3073-4096 | 11101xxxxxxxxxx |
|  24  |  11 |   4097-6144 | 1100xxxxxxxxxxx |
|  25  |  11 |   6145-8192 | 1101xxxxxxxxxxx |
|  26  |  12 |  8193-12288 | 100xxxxxxxxxxxx |
|  27  |  12 | 12289-16384 | 101xxxxxxxxxxxx |
|  28  |  13 | 16385-24576 | 00xxxxxxxxxxxxx |
|  29  |  13 | 24577-32768 | 01xxxxxxxxxxxxx |

The distance symbols only appear after a copy length. Since the distances are all 15 bits, we need the lengths to all be 9 bits so the pair falls on a byte boundary.
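A quick sanity check on "fits very neatly": if every distance code is given a length of 15 minus its extra bits, the lengths should exactly fill the code space (Kraft sum of 1), otherwise the table would not be a legal complete Huffman code. The sketch below assumes the standard Deflate rule for distance extra bits; the helper name is mine.

```python
from fractions import Fraction

def distance_extra_bits(code):
    """Extra bits for Deflate distance codes 0..29."""
    return 0 if code < 4 else code // 2 - 1

# Code length chosen so that code + extra bits is always 15 bits.
vlc_length = {code: 15 - distance_extra_bits(code) for code in range(30)}

kraft = sum(Fraction(1, 2 ** n) for n in vlc_length.values())
print(kraft)                               # 1 means the code space is exactly full

# A 9-bit length symbol plus a 15-bit distance symbol is a whole number of bytes.
print((9 + 15) // 8, (9 + 15) % 8)
```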

Length codes use the same alphabet as literals – the 256 different byte values – and the end code (code 256):

|     |Extra|         |           |
|Code | Bits| Length  |   VLC     |
|-----|-----|---------|-----------|
| 257 |   0 |    3    | ????????? |
| 258 |   0 |    4    | ????????? |
| 259 |   0 |    5    | ????????? |
| 260 |   0 |    6    | ????????? |
| 261 |   0 |    7    | ????????? |
| 262 |   0 |    8    | ????????? |
| 263 |   0 |    9    | ????????? |
| 264 |   0 |   10    | ????????? |
| 265 |   1 |  11,12  | ????????x |
| 266 |   1 |  13,14  | ????????x |
| 267 |   1 |  15,16  | ????????x |
| 268 |   1 |  17,18  | ????????x |
| 269 |   2 |  19-22  | 0111000xx |
| 270 |   2 |  23-26  | 0111001xx |
| 271 |   2 |  27-30  | 0111010xx |
| 272 |   2 |  31-34  | 0111011xx |
| 273 |   3 |  35-42  | 011000xxx |
| 274 |   3 |  43-50  | 011001xxx |
| 275 |   3 |  51-58  | 011010xxx |
| 276 |   3 |  59-66  | 011011xxx |
| 277 |   4 |  67-82  | 01000xxxx |
| 278 |   4 |  83-98  | 01001xxxx |
| 279 |   4 |  99-114 | 01010xxxx |
| 280 |   4 | 115-130 | 01011xxxx |
| 281 |   5 | 131-162 | 0000xxxxx |
| 282 |   5 | 163-194 | 0001xxxxx |
| 283 |   5 | 195-226 | 0010xxxxx |
| 284 |   5 | 227-257 | 0011xxxxx |
| 285 |   0 |   258   | ????????? |

Here things get a little complicated, which is why the table above is incomplete.

In order to be able to write out literals, those literals are going to have to be exactly 8 bits long. There’s not going to be enough space for all of them alongside the already-allocated encoding for length codes, growing the leftovers to 16-bit codes isn’t legal, and so sacrifices must be made.

Control characters

Most control characters aren’t important for plain text. Only line feeds are essential, and HTML can probably get by even without those. That’s a little extreme, though, so let’s try to keep them.

If it did turn out to be challenging to squeeze in a line feed as an 8-bit code, then the alternative would be to find something to pair it with to make it a round number of bytes. Carriage return fills that role neatly so it would be possible to make a 12-bit code for carriage return, and a 12-bit code for line feed, and send the two of them together, and most recipients should have no complaint about that.

Printable ASCII

These really need to be eight-bit codes. Thankfully there are only 95 of them to worry about, leaving some space for a couple of other codes of the same length or longer.

The end-of-block code

Code 256, not mentioned in the length table above, is used to signal the end of the block. Because this isn’t an optimised dynamic Huffman system there’s no point ending the block more than once, so once that’s done it’ll be padded to the end of the byte and the file is finalised. So it doesn’t matter how long this code is.

Non-ASCII characters

Trying to find byte-aligned encoding for the rest of the characters in, eg., ISO-8859 would probably be impossible. There’s not that much room in the 8-bit symbol space along with everything else we need, and there’s no guarantee that things will appear in tuples which combine to a multiple of eight bits.

Thankfully UTF-8 is where it’s at, and that does have constraints we can work with.

UTF-8

When UTF-8 introduces a multi-byte codepoint it starts with a value that tells us how many extension bytes follow, and then we have that many bytes with values that can’t appear anywhere else in a legal UTF-8 stream.

If we start with the latter and assign them n-bit symbols, then we can calculate from that how many bits the corresponding prefix code must be to round out the whole codepoint to a multiple of 8 bits.

By choosing a 10-bit extension code we can determine that UTF-8 codepoints with one extension byte (beginning with 0xC0-0xDF) need the first symbol to be either six or 14 bits long. Six is too short for the 32 different prefixes, so 14 it is. If there are two extension bytes then that’s 20 bits needing a 12-bit prefix, and three extension bytes need a 10-bit prefix.

That’s not going to be good for some languages. An alternative might be to use 9-bit extension codes (64 of those), requiring 30 prefixes of length 7, 16 prefixes of length 6, and 8 prefixes of length 5, and then stripping the ASCII alphabet down to the bare minimum to support HTML, but it’s a tight fit and may still not be possible.
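The prefix lengths fall out of a congruence: the prefix code plus all of its extension codes has to total a multiple of eight bits, and no Deflate code can be longer than 15 bits. A small sketch listing the candidates for both the 10-bit and 9-bit extension choices (the function name is mine):

```python
def prefix_lengths(ext_bits, ext_bytes, max_len=15):
    """Prefix code lengths that round a whole codepoint up to whole bytes."""
    return [n for n in range(1, max_len + 1)
            if (n + ext_bits * ext_bytes) % 8 == 0]

for ext_bits in (10, 9):
    for ext_bytes in (1, 2, 3):
        print(ext_bits, ext_bytes, prefix_lengths(ext_bits, ext_bytes))
```

This only lists lengths that land on a byte boundary; whether enough codes of that length are actually available still depends on what else is competing for the code space.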

It might be tempting to code the control codes with no explicit coding as non-canonical UTF-8, but those codes aren’t legal and the behaviour is associated with security attacks which means it’ll raise alarms. So don’t bother with that.

The result

This all has to be coded in a header which describes the length of each symbol. The header also uses variable-length codes and a bit of run-length syntax, so it may be necessary to fiddle things about a little to make it end at a byte boundary; but this shouldn’t be too difficult. Once it’s done it can be handled as a hard-coded blob.

Here are the rest of the variable-length codes I ended up with:

8-bit codes:
01111000....... BEL  (7)
01111001....... BS  (8)
01111010....... TAB  (9)
01111011....... LF  (10)
01111100....... VT  (11)
01111101....... FF  (12)
01111110....... CR  (13)
01111111....... ESC  (27)
10000000....... ASCII    (32)
10000001....... ASCII !  (33)
...
11011101....... ASCII }  (125)
11011110....... ASCII ~  (126)
11011111....... End of block  (256)
11100000x...... length 11-12  (265)
11100001x...... length 13-14  (266)
11100010x...... length 15-16  (267)
11100011x...... length 17-18  (268)

9-bit codes:
111001000...... NUL  (0)
111001001......   (1)
111001010......   (2)
111001011......   (3)
111001100......   (4)
111001101......   (5)
111001110......   (6)
111001111...... DEL  (127)
111010000...... length 3  (257)
111010001...... length 4  (258)
111010010...... length 5  (259)
111010011...... length 6  (260)
111010100...... length 7  (261)
111010101...... length 8  (262)
111010110...... length 9  (263)
111010111...... length 10  (264)
111011000...... length 258  (285)

10-bit codes:
1110110010..... UTF-8 ext  (128)
1110110011..... UTF-8 ext  (129)
...
1111110000..... UTF-8 ext  (190)
1111110001..... UTF-8 ext  (191)
1111110010..... UTF-8 4-byte prefix (240)
1111110011..... UTF-8 4-byte prefix (241)
...
1111111000..... UTF-8 4-byte prefix (246)
1111111001..... UTF-8 4-byte prefix (247)

12-bit codes:
111111101000... UTF-8 3-byte prefix (224)
111111101001... UTF-8 3-byte prefix (225)
...
111111110110... UTF-8 3-byte prefix (238)
111111110111... UTF-8 3-byte prefix (239)

14-bit codes:
11111111100000. UTF-8 2-byte prefix (192)
11111111100001. UTF-8 2-byte prefix (193)
...
11111111111110. UTF-8 2-byte prefix (222)
11111111111111. UTF-8 2-byte prefix (223)

Since we don’t have any control over the order of the symbols (they’re ordered by length and then by code), there’s going to need to be a look-up table to convert even simple ASCII into the byte-aligned codes we have.
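The ordering can be reproduced by feeding the code lengths into the canonical assignment procedure from the Deflate specification (count codes per length, compute the first code of each length, then hand them out in symbol order). The sketch below takes the lengths implied by the lists above, with the four groups of longer length codes from the earlier length table, and prints a few of the resulting codes. The helper names are mine.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes; lengths maps symbol -> bit length."""
    max_len = max(lengths.values())
    count = [0] * (max_len + 1)
    for n in lengths.values():
        count[n] += 1
    next_code, code = [0] * (max_len + 1), 0
    for bits in range(1, max_len + 1):
        code = (code + count[bits - 1]) << 1
        next_code[bits] = code
    codes = {}
    for sym in sorted(lengths):            # same length: increasing symbol order
        n = lengths[sym]
        codes[sym] = format(next_code[n], "0{}b".format(n))
        next_code[n] += 1
    return codes

lengths = {}

def assign(bits, *ranges):
    for first, last in ranges:
        for sym in range(first, last + 1):
            lengths[sym] = bits

assign(4, (281, 284))
assign(5, (277, 280))
assign(6, (273, 276))
assign(7, (269, 272))
assign(8, (7, 13), (27, 27), (32, 126), (256, 256), (265, 268))
assign(9, (0, 6), (127, 127), (257, 264), (285, 285))
assign(10, (128, 191), (240, 247))   # UTF-8 extension bytes, 4-byte prefixes
assign(12, (224, 239))               # 3-byte prefixes
assign(14, (192, 223))               # 2-byte prefixes

codes = canonical_codes(lengths)
for sym in (10, 32, 126, 256, 265, 0, 285, 128, 224, 223):
    print(sym, codes[sym])
```

Symbols left out of `lengths` (the unused control characters and bytes 248-255) simply get no code.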

And the lengths and distances get even worse treatment. They’re a combination of big-endian and little-endian coded data, as an unfortunate complication of the way the bit packing was defined for a little-endian architecture against the way bitstreams naturally parse.

Generation

The idea would be to have an engine which emitted pre-cooked strings and made a note of where it had last put them. When a string is needed, if it’s in-range then emit the length and distance tuple. If it’s new or if it’s fallen out of range then emit the raw literals for the string. Either way, update the last-emitted position.
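In outline that engine might look like the sketch below. It records abstract operations rather than real Deflate symbols, and it assumes every pre-cooked string is between 3 and 258 bytes so that a single length/distance pair can cover it; the class and field names are hypothetical.

```python
WINDOW = 32768   # Deflate back-reference window in bytes

class Emitter:
    def __init__(self):
        self.position = 0        # bytes of uncompressed output so far
        self.last_use = {}       # string -> position where it was last emitted
        self.ops = []            # recorded output operations

    def emit(self, s):
        previous = self.last_use.get(s)
        if previous is not None and self.position - previous <= WINDOW:
            # Still in range: refer back to the previous copy.
            self.ops.append(("copy", len(s), self.position - previous))
        else:
            # New, or fallen out of the window: send the raw bytes.
            self.ops.append(("literal", s))
        self.last_use[s] = self.position   # either way, note the latest use
        self.position += len(s)

out = Emitter()
for piece in (b"hello ", b"world", b"hello "):
    out.emit(piece)
print(out.ops)
```

In the byte-aligned scheme the `copy` tuples would be replaced by the pre-coded three-byte length and distance symbols, with only the distance needing to be patched in.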

Part of the pre-cooking would be to compute the Adler-32 checksum of the string. Upon emitting a given string the rolling checksum can be fast-forwarded by the number of bytes in the string, and the pre-computed string checksum can be added to that.
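The fast-forward is simple because Adler-32 is two running sums modulo 65521: appending a string adds its byte sum to the low half, and the high half picks up the string's own high half plus the string length times the old low half (less the initial 1). A sketch checked against `zlib.adler32`, with a function name of my own choosing:

```python
import zlib

MOD = 65521  # largest prime below 2**16, as used by Adler-32

def adler32_append(running, piece_checksum, piece_length):
    """Combine a running Adler-32 with the precomputed checksum of a string."""
    a1, b1 = running & 0xFFFF, running >> 16
    a2, b2 = piece_checksum & 0xFFFF, piece_checksum >> 16
    a = (a1 + a2 - 1) % MOD                        # both sums start from 1
    b = (b1 + b2 + piece_length * (a1 - 1)) % MOD  # fast-forward by the length
    return (b << 16) | a

pieces = [b"hello ", b"world", b"hello "]
cooked = [(zlib.adler32(p), len(p)) for p in pieces]   # the pre-cooking step

running = zlib.adler32(b"")        # 1
for checksum, length in cooked:
    running = adler32_append(running, checksum, length)
assert running == zlib.adler32(b"".join(pieces))
```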

This would all make much less sense if the intended output was not already a recurring concatenation of a small set of repeated strings.

And now to put the whole thing out of my head and get on with my life. Or not. Oh well.