Once I moved beyond the 6502, I ended up porting a 16-bit PRNG from 16-bit x86 to provide me with a decent source of randomness for my projects. After that, in one of my nonretrocoding endeavours, I ended up making use of an enhanced 32-bit RNG that would pass a demanding suite of tests for PRNG quality. That’s actually quite nice, and the source code for the generator is also pleasingly tiny:

```c
uint32_t xorshift64star(void) {
    uint64_t x = rng_state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    rng_state = x;
    return (x * 0x2545F4914F6CDD1DLLU) >> 32;
}
```

This code is compact enough that it doesn’t really need to be optimized. In fact, on a fully modern 64-bit system, every operation in this function can be done in a single machine instruction, so there’s really nothing to optimize in the first place.

On a less fully modern machine, though, we only really get 32-bit registers to work in, and all of these operations involve multiprecision math. I have been known to occasionally race against compilers to see how much better I can make hand-tuned assembly code run. When compiling the 16-bit PRNG, gcc’s 32-bit ARM compiler was clever enough that it was able to produce smaller code than hand-written, self-modifying x86 code. Can it do as well when compiling 64-bit math to a 32-bit system? *(Spoiler alert: no. No, it can’t. But in both the ARM32 and 32-bit x86 cases, it used some tricks I’d missed in my first pass that let me improve the handwritten versions even further.)*

## So, What Are We Doing Here?

We’ll be making hand-coded implementations of this function for three 32-bit architectures, each of which will need to approach the problems of multiprecision math differently: ARMv4 (1994), Intel 386 (1985), and the Motorola 68000 (1979). Each of these chips made different decisions about how to handle multiprecision operations, and between them we can explore a variety of techniques for addressing the issues multiprecision math can pose.

Our basic problem is that there are four 64-bit operations being done here, and we need to do them on a 32-bit chip:

- Assignment
- Exclusive Or
- Bit shifts
- Multiplication

Assignment and exclusive-or are basically free. Because each bit in the result is copied or mutated individually, one 64-bit operation can simply be done as two 32-bit operations (or indeed four 16-bit operations or eight 8-bit ones—the solutions here are structurally similar all the way back to the Z80 and 6502, if we were so inclined). The only real optimization we can do here is that ARM and 68000 have “load/store multiple” instructions that let us save some instruction space while doing multiword copies into or out of registers.
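In C terms, with the 64-bit value held as a pair of 32-bit words (the struct and function names here are just for illustration), the decomposition looks like this:

```c
#include <stdint.h>

/* A 64-bit value split into two 32-bit halves; the names are illustrative. */
typedef struct { uint32_t lo, hi; } u64_pair;

/* 64-bit XOR decomposes into two independent 32-bit XORs, because each
 * result bit depends only on the same bit position in the two inputs. */
u64_pair xor64(u64_pair a, u64_pair b) {
    u64_pair r;
    r.lo = a.lo ^ b.lo;
    r.hi = a.hi ^ b.hi;
    return r;
}
```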

The other two cases are more fun. We’ll cover them below the fold.

## Bit Shifts

Multiprecision bit shifts on the 8-bit chips were done in a loop. We would shift the “furthest” end of the value one bit, which pushed the bit that needed to go into the next word into the carry bit. You would then do a series of rotate-bits-through-carry instructions to propagate it through the rest of the value, and then do this whole process once for each bit you wanted to shift over. We had to do that in both our 6502 and Z80 implementations of the 16-bit xorshift PRNG. The three chips we are considering here, however, all have capabilities that make this a little easier.
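That bit-at-a-time loop can be sketched in C, with the carry flag modeled as an explicit variable since C doesn’t expose it (the function name is mine):

```c
#include <stdint.h>

/* Shift a 64-bit value, held as two 32-bit words, right by n bits,
 * one bit per iteration -- the same structure as the 8-bit loop,
 * with the carry flag modeled as an explicit variable. */
void shift64_right(uint32_t *hi, uint32_t *lo, unsigned n) {
    while (n--) {
        uint32_t carry = *hi & 1;           /* bit falling off the high word */
        *hi >>= 1;
        *lo = (*lo >> 1) | (carry << 31);   /* rotate it in "through carry" */
    }
}
```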

The 386 gives us the most direct tool for this: the `SHLD` and `SHRD` (**SH**ift **L**eft/**R**ight **D**ouble) instructions take three arguments instead of the usual two: a register to shift and an amount to shift it, as usual, but then also a second register from which to pull bits to fill in the gaps the shift produces. A 64-bit shift may then be performed by doing this, then shifting that second register the same amount on its own. These two instructions will shift the value in `EDX:EAX` right 12 bits:

```
shrd eax, edx, 12
shr  edx, 12
```
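What the `SHRD`+`SHR` pair computes can be modeled in C for a shift amount strictly between 0 and 32 (the function name is mine):

```c
#include <stdint.h>

/* Model of the SHRD+SHR pair: shift the 64-bit value hi:lo right by n,
 * for 0 < n < 32.  SHRD fills the vacated top bits of lo from hi;
 * the plain SHR then finishes off the high word. */
void shrd64(uint32_t *hi, uint32_t *lo, unsigned n) {
    *lo = (*lo >> n) | (*hi << (32 - n));   /* shrd lo, hi, n */
    *hi >>= n;                              /* shr  hi, n */
}
```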

The ARM and 68000 do not have an instruction like this, but it’s not too difficult to build one up out of simpler components. The basic technique is to shift as if we were doing a double-shift instruction, which will just leave zeroes in the gap. We can then shift the other half of the value in the opposite direction by the complementary amount, and logical-or that result into place to fill the gap. Finally, we shift the other half the same way we would have in the double-shift case. Here, for instance, is some 68000 code for shifting the value of `d0-d1` (most significant dword in `d0`) to the left by 25:

```
moveq   #25,d3
lsl.l   d3,d0
move.l  d1,d2
lsr.l   #7,d2
or.l    d2,d0
lsl.l   d3,d1
```

One nasty quirk of the 68000 is that we can only order shifts of up to 8 bits at a time if we want to use immediate values, so we end up needing to use a register to store the 25 that is our shift amount. The shift of 7 (25 + 7 = 32, our register size) is small enough to be embedded right in the instruction, at least.

### Merging Bitshifts With Exclusive-Or

We aren’t just bitshifting arbitrarily, though. Every shift we do of a value is followed by XOR-ing that shifted value against its original value and storing it in the same place. When I asked gcc to compile the C version of this routine, it did that by copying the 64-bit value to two other registers, doing a 64-bit shift on it, then doing a 64-bit XOR to knit them all together. That’s still basically optimal on the 386, but on the 68000 we can trim some of that work away. Because the post-shift bits never overlap, we don’t need to create a complete 64-bit temporary value, and we don’t need to logical-or the fragmentary pieces together. As long as we are careful about the order, we can simply exclusive-or each piece right back into the source, as seen here:

```
moveq   #25,d3
move.l  d0,d2
lsl.l   d3,d2
eor.l   d2,d0
move.l  d1,d2
lsr.l   #7,d2
eor.l   d2,d0
move.l  d1,d2
lsl.l   d3,d2
eor.l   d2,d1
```
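The same in-place trick can be expressed in C with the value as a hi/lo pair (names mine). The order matters: the low word must still be untouched when the high word absorbs the bits that cross the word boundary.

```c
#include <stdint.h>

/* In-place x ^= x << 25 on a hi/lo pair, with no 64-bit temporary.
 * Because the shifted pieces never overlap, each one can be XORed
 * straight back into the source -- as long as lo is updated last. */
void xorshift_left25(uint32_t *hi, uint32_t *lo) {
    *hi ^= *hi << 25;   /* high half of the shifted value */
    *hi ^= *lo >> 7;    /* bits that cross the word boundary */
    *lo ^= *lo << 25;   /* low half last: hi needed the old lo */
}
```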

With ARM we can do even better, because when applying an argument to an operation (like, say, exclusive-or) we can perform a temporary shift of a register argument on the way in *as part of that instruction*. This lets us remove all copies and all standalone shift instructions:

```
eor r1, r1, r1, lsl #25
eor r1, r1, r0, lsr #7
eor r0, r0, r0, lsl #25
```

This produces the rather impressive result that, despite not having any hardware support for multiprecision shifting, the ARM code for doing the multiprecision shift-and-xor-in-place ends up still taking up three fewer bytes than the x86 code. Normally, ARM code ends up denser than x86 and 68000 code because it allows the destination operand to be different from both source operands (“three-address code” as opposed to the “two-address code” of the other chips), which saves you a bunch of copy operations, but here it’s the automatic shifting that carries the day.

## 64-bit Multiplication

Multi-precision multiplication is finicky and awful. We at least have the advantage here that all three of these chips even *have* a hardware multiplier—our 8-bit chips lacked multiplication and division capability entirely, and as a result all multiplication had to be done through a sum of partial products. Wikipedia has a pretty good rundown on how multiplication in binary works, and how it can be written out in a way that makes it similar to computing products by hand in decimal. If we take that process and break it down into pseudocode that restricts itself to simple machine operations, we get this algorithm to multiply two *n*-bit numbers *A* and *B*:

- Set the product to zero.
- If the lowest bit of *A* is 1, add *B* to the product.
- Shift *A* right one and *B* left one.
- Repeat the previous two steps until every 1 bit in *A* has been processed.

Since a right-shift operation ends up shifting the lowest bit into the carry bit, we can simplify this algorithm by replacing step 2 with “Shift *A* right 1, and if the carry bit is set, add *B* to the product” and then only shift *B* in step 3. We can also just iterate *n* times instead of doing a zero-check on *A* after each shift. Here’s some code for the 68000 which will multiply two 64-bit numbers in `d0-d1` and `d2-d3` and put the result in `d4-d5`:

```
        clr.l   d4
        clr.l   d5
        moveq   #63,d6
m64:    lsr.l   #1,d0
        roxr.l  #1,d1
        bcc.s   m64_next
        add.l   d3,d5
        addx.l  d2,d4
m64_next:
        lsl.l   #1,d3
        roxl.l  #1,d2
        dbra    d6,m64
```
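The refined loop can be sketched in C (the function name is mine; like the 68000 routine, it only tracks the low 64 bits of the product):

```c
#include <stdint.h>

/* Shift-and-add multiply, truncated to 64 bits -- the same loop as the
 * 68000 routine above, which also keeps only the low 64 bits. */
uint64_t mul64_shift_add(uint64_t a, uint64_t b) {
    uint64_t product = 0;
    for (int i = 0; i < 64; i++) {
        int carry = (int)(a & 1);   /* the bit shifted out of A */
        a >>= 1;
        if (carry)
            product += b;           /* add B when that bit was set */
        b <<= 1;                    /* shift B up for the next bit */
    }
    return product;
}
```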

The product of two 64-bit numbers is actually a 128-bit number, but with only 8 data registers there’s not enough room for both operands, the product, and the loop counter. We don’t need the top 64 bits anyway, though—we’re only actually returning bits 32-63 of the result (`d4` in this routine), so nothing above that contributes to the final answer. We do still need to accurately compute the low 32 bits, though, in case the partial products carry into the higher bits.

### Taking Advantage of Mid-1980s Hardware

All three of the chips we’re looking at don’t have to do things with rotates and adds, though. Each includes an instruction that will multiply two registers together and produce a result with twice the precision. The 386’s `MUL` instruction will multiply the value of its argument with the `EAX` register and store the result in `EDX:EAX`—which is to say, the low 32 bits go in `EAX` and the high 32 bits go in `EDX`. Likewise, ARMv4 introduced the `UMULL` (**U**nsigned **MUL**tiply **L**ong) instruction with two destination registers, producing a similar result. Simpler versions (`IMUL` on Intel, `MUL` on ARM) will multiply two 32-bit values into a 32-bit result, and the Intel version will also accept constant arguments right in the instruction if needed. The 68000 series got an instruction to do this around the time that the 386 was released (with the 68020 chip), but the 68000 itself only allows for multiplying 16-bit values into a 32-bit product. The operands are treated as word operands when used as sources and dword operands when used as destinations.

Access to a 32-bit multiplier helps us a lot. We saw with the binary multiplication algorithm that there isn’t a lot of difference between multiplying in base 2 or in base 10, and the technique generalizes to any number of digits or bits. So, if we have two 64-bit numbers, we can arrange partial products of them as if they were two-digit numbers in base 2^32 in the same way:

```
     A:B
   x C:D
  ------
      BD
     AD
     BC
  + AC
  ------
  result
```

When we translate this into our problem statement, we want the third column of that result (counting from the left; these are bits 32-63). Furthermore, since there is only one entry in the fourth column, we don’t even have to worry about carries coming in from below; we just add the high 32 bits of the product B*D and then the low 32 bits of each of the products A*D and B*C.
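In C, with hypothetical names for the digits, the whole thing collapses to three 32-bit multiplies and two adds; only B*D needs its full 64-bit result, for the carry into our column:

```c
#include <stdint.h>

/* Bits 32-63 of x * (C:D), using only 32x32 multiplies.
 * Only B*D needs its full 64-bit product (its high half is the carry
 * into our column); A*D and B*C contribute only their low halves,
 * and A*C lands entirely above the bits we keep. */
uint32_t mulhi_lo32(uint64_t x) {
    const uint32_t C = 0x2545F491u, D = 0x4F6CDD1Du; /* constant's halves */
    uint32_t A = (uint32_t)(x >> 32), B = (uint32_t)x;
    uint32_t digit1 = (uint32_t)(((uint64_t)B * D) >> 32); /* carry of B*D */
    digit1 += A * D;   /* low half of A*D */
    digit1 += B * C;   /* low half of B*C */
    return digit1;
}
```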

The situation is much less pleasant on the 68000. This now becomes a product of four-digit numbers, and that means there are sixteen subproducts:

```
       ABCD
     x EFGH
-----------
         DH
        DG
       DF
      DE
        CH
       CG
      CF
     CE
       BH
      BG
     BF
    BE
      AH
     AG
    AF
 + AE
-----------
     result
```

Sixteen subproducts and *ten* of them need to be factored into the answer, including some very ugly cases where we need to add carry bits into the *middle* of registers as we move along. My attempt to implement this took over a hundred bytes of code on its own and was fragile enough that I can’t say I trust it even though it did give correct answers for my test cases. Prior to the 68020, I think it’s probably best simply to use the same algorithm you’d use on the 8-bits. At least that way everything will still be kept in registers.
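A C sketch of the 16-bit-digit approach (the digit arrays and the function name are mine) shows why it balloons: ten subproducts, each potentially carrying into the next digit:

```c
#include <stdint.h>

/* Bits 32-63 of a 64-bit product using only 16x16->32 multiplies,
 * the way a pre-68020 chip would have to.  Digit 0 is the least
 * significant; only the ten subproducts that touch digits 0-3 matter. */
uint32_t mul64_mid32_16bit(uint64_t a, uint64_t b) {
    uint32_t ad[4], bd[4], acc[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; i++) {
        ad[i] = (uint32_t)(a >> (16 * i)) & 0xFFFFu;
        bd[i] = (uint32_t)(b >> (16 * i)) & 0xFFFFu;
    }
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j + i < 4; j++) {  /* digits >= 4 are discarded */
            uint32_t p = ad[i] * bd[j];    /* one 16x16 -> 32 subproduct */
            acc[i + j] += p & 0xFFFFu;
            if (i + j + 1 < 4)
                acc[i + j + 1] += p >> 16; /* high half lands one digit up */
        }
    }
    acc[1] += acc[0] >> 16;                /* propagate carries upward */
    acc[2] += acc[1] >> 16;
    acc[3] += acc[2] >> 16;
    return ((acc[3] & 0xFFFFu) << 16) | (acc[2] & 0xFFFFu);
}
```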

## The Final Implementations

We now have all the tools we need. Let’s produce our three implementations of this routine.

### The 386 Implementation

The 386 implementation is, overall, the most direct of the three. We will need four registers: two to hold the initial value and two to hold the shifted copy we XOR back into it. We’ll use the four registers from `EAX` through `EDX`, and the traditional x86 ABI says that we need to save `EBX` and are allowed to trash the other three.

```
push ebx
```

Then we load our random seed `rng` into `ECX:EBX` with a pair of move operations.

```
mov ebx, [rng]
mov ecx, [rng+4]
```

Our basic xorshift operation involves copying it to `EDX:EAX`, shifting that, then XORing it back into place.

```
mov  eax, ebx
mov  edx, ecx
shrd eax, edx, 12
shr  edx, 12
xor  ebx, eax
xor  ecx, edx
```

Two more xorshifts after that.

```
mov  eax, ebx
mov  edx, ecx
shld edx, eax, 25
shl  eax, 25
xor  ebx, eax
xor  ecx, edx

mov  eax, ebx
mov  edx, ecx
shrd eax, edx, 27
shr  edx, 27
xor  ebx, eax
xor  ecx, edx
```

Store the result back out into the random seed.

```
mov [rng], ebx
mov [rng+4], ecx
```

Then we do the 64-bit multiply. Even though we can hardcode constants here, we store the low dword in a register anyway so we can reuse it without having to spend space specifying it twice.

```
mov  eax, 0x4f6cdd1d
imul ecx, eax
mul  ebx
add  ecx, edx
imul eax, ebx, 0x2545f491
add  eax, ecx
```

One trick that gcc taught me here is that `IMUL` has a three-address form as long as one argument is an immediate constant. That lets us put the result we actually care about directly in `EAX`, where it needs to be as the return value. After that we just need to restore the register we weren’t allowed to trash and return:

```
pop ebx
ret
```

Being more careful about reusing our values, and not computing anything we didn’t absolutely need for the result, meant that our version came in about 25% smaller than gcc’s output with the -O2 option.

### The ARMv4 Implementation

The ARM implementation is pretty similar to the 386 version in its data usage. However, the ARM ABI actually allows us to trash the first four registers, which means we don’t have to save off anything at all. Unlike the x86, though, ARM generally doesn’t let us use arbitrary 32-bit immediate values. We use a special syntax to indicate that a constant should be loaded into a register from a table called a *literal pool*. Literal pools are created and filled by the assembler in ways that the programmer can influence but not completely dictate; I’m glossing over that for now. Once we have the address, we can use the load-multiple-registers instruction to load the 64-bit value into two registers at once:

```
ldr   r2, =rng
ldmia r2, {r0, r1}
```

Then we do the three xorshift operations, which merge into very tight code as described in the earlier discussion:

```
eor r0, r0, r0, lsr #12
eor r0, r0, r1, lsl #20
eor r1, r1, r1, lsr #12

eor r1, r1, r1, lsl #25
eor r1, r1, r0, lsr #7
eor r0, r0, r0, lsl #25

eor r0, r0, r0, lsr #27
eor r0, r0, r1, lsl #5
eor r1, r1, r1, lsr #27
```

Then we store it back:

```
stmia r2, {r0, r1}
```

Then we do the 64-bit multiply piece by piece in a way almost identical to the x86 code, losing the immediate multiplies (since ARM lacks them) and coalescing the last two arithmetic instructions into a single `MLA` (**M**u**L**tiply-**A**ccumulate) instruction:

```
ldr   r2, =0x4F6CDD1D
mul   r1, r2, r1
umull r2, r3, r0, r2
add   r1, r1, r3
ldr   r2, =0x2545F491
mla   r0, r2, r0, r1
```

And with no registers saved, we don’t need to restore any registers before we return, either. The thing is, ARM doesn’t keep a call stack on its own. The return address goes into a special register called the *link register* and it’s up to you to put it on a stack if you have one and need it. Since we didn’t call any functions ourselves, though, we didn’t need it, and we can thus return from the function by copying the link register back to the program counter.

```
mov pc, lr
```

With gcc targeting ARMv7, it kept the same redundant operations we saw in the x86 case, leading to a very similar 25% or so size reduction in our hand-written code. Despite targeting the ARMv7, it didn’t use any instructions that weren’t also in ARMv4, so this code would run fine on a Game Boy Advance, but not an Acorn Archimedes.

### The MC68000 Implementation

Although the m68k instruction set lacks a lot of instructions that would make our lives easier, the instructions it does have are quite flexible and let us phrase our operations in ways similar to the ARM and x86 cases. Our initial load ends up looking like the ARM case. Unlike ARM, we are *allowed* to use a hardcoded address in our load operation, but it’s more efficient to keep it in a register so we can reuse it when we store the result back. The m68k ABI lets us use `d0-d1/a0-a1` as scratch space, but we’ll need seven registers to do the 64-bit multiply at the end, so we’ll need to save off every data register we use beyond the first two, with a move-multiple that updates the target register (the stack pointer, in this case, instead of a base address as in the 64-bit load) as it copies.

```
lea.l   rng,a0
movem.l (a0),d0-d1
movem.l d2-d6,-(sp)
```

Then we do our xorshift operation the hard way…

```
moveq   #12,d3
moveq   #20,d4
move.l  d1,d2
lsr.l   d3,d2
eor.l   d2,d1
move.l  d0,d2
lsl.l   d4,d2
eor.l   d2,d1
move.l  d0,d2
lsr.l   d3,d2
eor.l   d2,d0
```

… and the other two are slightly easier because one of the shifts is in the 1-8 range so we can just put it right in the instruction without any cost in code size.

```
moveq   #25,d3
move.l  d0,d2
lsl.l   d3,d2
eor.l   d2,d0
move.l  d1,d2
lsr.l   #7,d2
eor.l   d2,d0
move.l  d1,d2
lsl.l   d3,d2
eor.l   d2,d1

moveq   #27,d3
move.l  d1,d2
lsr.l   d3,d2
eor.l   d2,d1
move.l  d0,d2
lsl.l   #5,d2
eor.l   d2,d1
move.l  d0,d2
lsr.l   d3,d2
eor.l   d2,d0
```

Storing the value back is as easy as it was on ARM.

```
movem.l d0-d1,(a0)
```

Using the actual multiply instruction is, as we saw earlier, kind of more trouble than it’s worth when we can only do it 16 bits at a time. So we do a more primitive shift-and-add 64-bit multiplication routine instead.

```
        move.l  #$2545F491,d2
        move.l  #$4F6CDD1D,d3
        clr.l   d4
        clr.l   d5
        moveq   #63,d6
m64:    lsr.l   #1,d0
        roxr.l  #1,d1
        bcc.s   m64_next
        add.l   d3,d5
        addx.l  d2,d4
m64_next:
        lsl.l   #1,d3
        roxl.l  #1,d2
        dbra    d6,m64
```

The return value goes in `d0`. I suppose I could have made this instruction unnecessary by juggling the rest of the registers and having this result start out in `d0`.

```
move.l d4,d0
```

Computation accomplished, we restore our non-scratch registers and return.

```
movem.l (sp)+,d2-d6
rts
```

My m68k compilers couldn’t actually deal with the C code for this routine, so I have nothing to compare it against. If anything, that seems like an even stronger reason to not mess with 64-bit math on a mostly-16-bit chip like the 68000. But if I ever really needed it on the Genesis or the Atari ST or something, now I have it.