Category Archives: retrocoding

To DLL Hell And Back

I’d mentioned in the previous post that I’d now written “a program” for each of the machines I’d grown up with. This is an interesting proposition because the notion of what it means to be “a program,” or to “load” and “run,” has changed over the years. Let’s take a tour through the various ways you get from “a program on some storage medium” to “a program that is actually running.”

This is in roughly increasing order of the complexity of the loading system, which is not quite historical order and not even quite increasing order of the complexity of loading any individual program.



An Ambition, Completed

I belong to that somewhat narrow cohort that straddles Generation X and the Millennials—we grew up with the microcomputer omnipresent but we remember a time before The Internet. The cohort has gotten various attempts at nicknames but I’ve always been fond of the name Oregon Trail generation, after the famous educational software.

My parents were also somewhat technically inclined (or at least gadget-oriented), and as a result I had grown up alongside a number of game consoles and home computers. A lot of the work I did on those, both as a hobby and for schoolwork, shaped my talents in the years to come.

One of my goals here on Bumbershoot over the past few years has been to demystify the machines I grew up with. I’m now a genuine software professional, after all—surely I’ve learned enough to see how the tricks were done and to join in myself, if a bit late.

As such, once I realized that this was what I was doing, I formalized the nature of the task and started getting systematic about it. And now, with this latest release, I have finished it.

The Rules

Here are the rules I had set for this task.

  • Childhood ends in 1992. This is a somewhat arbitrary division point, but it turns out I can pick any year from there to my 20th birthday and not change the list of targeted systems, and 1992 means I don’t have to distinguish between when a platform was released in North America and when I got access to it.
  • Systems require a full year of exposure to count as “one I grew up with.” I had brief acquaintance with the Macintosh and the SNES, but both were systems I didn’t really have any experience with until adulthood. I first beat Super Metroid in 2010.
  • Programs must run on the metal. Interpreted languages are permitted only to the extent that they can then transfer control to machine code. This started out as “BASIC and LOGO don’t count” but eventually became a requirement that, for at least one project, I produce a binary where every single byte in the program exists because I put it there.
  • Programs must be for the platform. Writing a highly portable C program and jamming it through a compiler for each system doesn’t count. I should make use of a generous set of system-specific capabilities that let it show off what it’s good at. Any stunt that reaches demoeffect quality automatically achieves this goal.
  • The platform must have been intended to receive third-party programs. That means the graphing calculators we had in high school don’t count, even though they had Z80 chips in them and you could program them with special connectors and PC programs that no longer run on 64-bit operating systems.
  • Period tools are not required. Since half the targets are game consoles, cross-development tools were the norm even at the time.

So, without further ado, the systems I grew up with.


Release: The Cyclic Cellular Automaton for the Sega Genesis/Mega Drive

I’ve finished my project: an implementation of the Cyclic Cellular Automaton for the Sega Genesis. This uses a very simple rule across a grid to let that grid evolve from a chaotic start condition to a highly ordered end condition:


The grid in this program is 128×128, with 16 possible states. The screen can only display an 80×56 window into that grid at a time, though. You can scroll around the grid with the controller’s D-Pad; since the map is also a torus, it will wrap around if you keep scrolling in any direction. The START button will reset the display with a new, randomized grid.

This program uses the controller, the music chip, both the scroll layers, and a digital sound effect during initialization. It doesn’t use sprites or the legacy sound chip, but those were covered in their own articles. That’s enough work with the platform that I feel that I can say I’ve put the console as a whole through its paces.


Download the SMD ROM image here. It should run in any Genesis emulator—I’ve tested it on Gens, DGen, Kega Fusion, and Exodus. I have not yet had the opportunity to test it on real hardware—if you have a flashcart and hardware I’d love to hear how it goes.


Genesis: The YM2612 Music Chip

The last piece of this Genesis CCA application is background music using the actual Yamaha YM2612 chip instead of the TI SN76489 that’s left over from the Sega Master System. The general principle turns out to be quite similar.

The general plan is still to hook the Z80 to the new-frame interrupt and send new data to the sound chip as needed each time. This task is simplified for the YM2612 because it has an actual envelope generator—we don’t need to manually step down the volume each frame; we just need to mark key on/off transitions. What makes the task a bit more complicated is that we may need to do nearly arbitrarily large amounts of work each frame—configuring an instrument is a lot of work.

To address that, I’ve taken my cues from id Software’s old music format for early PCs. This was a collection of register writes to the AdLib card along with how long to wait until the next register write; that wait was permitted to be zero. In my system I provide a number of frames to wait before the next block of writes, how many writes are in this block, and then that many writes. This restricts me to the first three channels on the YM2612 as written, but the sample song I’m using only has two voices anyway, so it should be fine. It could be easily extended to have two blocks, one for the first bank of registers and one for the second. Under most circumstances, one of those blocks would be empty at any given interruption point.
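As a concrete sketch (in C, with conventions of my own invention—the real byte layout may differ), a per-frame player for such a stream might look like this, assuming each block is laid out as [wait][count][reg, val pairs] and a wait byte of $FF ends the song:

```c
#include <stddef.h>

/* A register write targeting the sound chip. */
typedef void (*reg_write_fn)(unsigned char reg, unsigned char val);

/* Advance the song one frame. `pos` points at the wait byte of the
 * next block; `wait` counts frames already spent waiting on it.
 * Returns the new position, or NULL once the song is over. */
const unsigned char *song_tick(const unsigned char *pos,
                               unsigned char *wait,
                               reg_write_fn write)
{
    if (pos == NULL || *pos == 0xFF)
        return NULL;                /* end-of-song sentinel */
    if (*wait < *pos) {             /* still counting down this block's delay */
        ++*wait;
        return pos;
    }
    /* Delay satisfied: flush this block, plus any zero-wait blocks
     * that are supposed to land on the same frame. */
    do {
        unsigned char n = pos[1];
        pos += 2;
        while (n--) {
            write(pos[0], pos[1]);
            pos += 2;
        }
    } while (*pos == 0);
    *wait = 0;
    return pos;
}

/* A stand-in register write for testing; a real player would poke the
 * YM2612's address/data ports here instead. */
static int write_count;
static unsigned char last_reg, last_val;
static void capture_write(unsigned char reg, unsigned char val)
{
    last_reg = reg;
    last_val = val;
    ++write_count;
}
```

The zero-wait case is what lets an arbitrarily large instrument configuration land inside a single frame: it just becomes several blocks chained with zero delays.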

There is also an interesting side effect: apparently it doesn’t matter whether you write multiple key-on values—the note will not re-gate unless you turn the note off, then on again. That matters because if a note has already been held so long that it has faded to silence, you won’t hear later notes without that gap. So our stream of notes has to include a tiny rest at the end of every note, or the next one won’t start properly.

The song data that I’ve been feeding my playroutines is generally automatically generated from textual sources, and this has started to get sufficiently onerous that I decided to make a more compact representation. I based this on the PLAY command supported by the BASICs on the IBM PC and the Commodore 128. This was a music macro language that was sufficient for one voice but broke down rather dramatically when you needed more than one (as was sadly demonstrated by the C128). The BASIC implementation would block if you queued up a note for a voice that was already playing a note, so you had to rapidly cycle between all the voices from longest note to shortest so that you never missed a cue.

That’s more tedious than I’d generally like, so while I implemented most of the macro language I kept each voice separate and then used a separate collation step to fold it all together. That collation results in a series of register writes that look a lot like the C128 multivoice case, but here the challenge is that we never get to block at all, because the interrupt timer is pitiless and unforgiving.
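The collation step itself is essentially a stable merge of per-voice event lists sorted by time. A minimal C sketch (structure and names are mine, not the project’s actual code):

```c
#include <stddef.h>

/* One timed register write, tagged with the frame it should land on. */
typedef struct {
    unsigned int frame;
    unsigned char reg, val;
} Event;

/* Merge two per-voice event lists (each already sorted by frame) into
 * one stream sorted by frame. The merge is stable, so voice A's write
 * wins ties—ordering between same-frame writes is then predictable. */
size_t collate(const Event *a, size_t na,
               const Event *b, size_t nb,
               Event *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i].frame <= b[j].frame) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];   /* drain whichever list remains */
    while (j < nb) out[k++] = b[j++];
    return k;
}
```

From the merged stream, the wait-count-writes blocks above fall out by grouping consecutive events that share a frame and emitting the frame gaps as waits.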

We’ve done what we came to do, though. This project will be done soon.

Further Reading

Official documentation for the YM2612 was sparse and misleading. The members of the SMS Power forum seem to have done most of the work in determining things like how to actually compute frequency numbers for given targets. This forum post in particular, from 2001, was critical to getting my songs to sound right.

Optimizing the CCA

Last time, we wrote a basic implementation of the CCA automaton in m68k assembly language, but it wasn’t running as fast as we’d like. In this article, we will speed up that routine by a factor of four. The techniques that we’ll be using to do this aren’t really specific to 68000 assembly language, or indeed assembly language coding at all, so much of the work in this post will also be shown in C equivalents.

That said, remember the two rules of optimization:

  1. Don’t do it.
  2. (Experts only) Don’t do it yet.

It’s been a week, so now we can do it. Also, we get to skip what is usually the main work of performance tuning, in that we do not need a profiler to tell us which function is slowing us down—there is only one function doing any significant work at all here. Working on hopped-up toy problems has its advantages.

Our Starting Point

Here is some C code that is roughly equivalent to our initial implementation from last time:

#define INDEX(x, y) ((((y) & 0x7f) << 7) | ((x) & 0x7f))
#define CHECK(x, y) if (in[INDEX(x, y)] == target) val = target

void full_check(const char *in, char *out, int x, int y)
{
    int val, target;
    val = in[INDEX(x, y)];
    target = (val + 1) & 0x0f;
    CHECK(x-1, y);
    CHECK(x+1, y);
    CHECK(x, y-1);
    CHECK(x, y+1);
    out[INDEX(x, y)] = val;
}

void cca_step_0(const char *in, char *out)
{
    int x, y;
    for (y = 0; y < 128; ++y) {
        for (x = 0; x < 128; ++x) {
            full_check(in, out, x, y);
        }
    }
}
I’ve made a couple of organizational changes from the assembly language code; the loop body has become a subroutine in its own right, and the “internal subroutines” of the original implementation have been rendered as macros and are being inlined in the code itself. These would both be somewhat questionable implementation decisions if you were writing a CCA routine in C for real, but they will make it easier to reason about the code we already have.

Optimization 1: Remove redundant checks

The 68000 is an early enough chip that we don’t really have to worry about pipelining or cache misses or anything like that—if we want a computation to take less time, our only real option is to do less work. On the plus side, this also means that any time we find ourselves repeating a bunch of work in an inner loop, we know it’s ultimately going to be slowing us down a bunch.

There isn’t a whole lot we can do about the fact that we need to visit all 16,384 cells in the grid, so our initial focus will be on full_check. As written, we are calling that function 16,384 times, so any time we can save in there will be magnified greatly. What can we save?

Not much, in the most general case, as it turns out. We need to check each neighbor, and that means we need to do the INDEX computation at least five times. We could spare ourselves one, perhaps, if we cached the result from the first call (for INDEX(x, y)). But that’s missing the forest for one rather large tree: The only reason we need INDEX at all instead of just taking fixed offsets from our starting point is that we’re worried about wrapping around. We can drastically simplify the non-edge case, breaking out the edges and corners into their own loop afterwards:

void cca_step_1(const char *in, char *out)
{
    int x, y, index, val, target;
    for (y = 1; y < 127; ++y) {
        for (x = 1; x < 127; ++x) {
            index = INDEX(x, y);
            val = in[index];
            target = (val + 1) & 0x0f;
            if (in[index - 128] == target ||
                in[index -   1] == target ||
                in[index +   1] == target ||
                in[index + 128] == target) {
                val = target;
            }
            out[index] = val;
        }
    }
    for (y = 0; y < 128; ++y) {
        full_check(in, out, 0, y);
        full_check(in, out, y, 0);
        full_check(in, out, y, 127);
        full_check(in, out, 127, y);
    }
}

One very nice thing about this is that it’s saving us a lot of largish bit-shift operations. It turns out that even though the 68000’s instruction set lets you specify large shifts in a single instruction, those instructions take longer the further you shift. A shift of seven, as we use in INDEX, costs as much as five accesses to memory.
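To make that saving concrete, here is one way to get rid of the per-cell shift entirely (a sketch of my own—not necessarily the next optimization the post series actually performs): compute the row’s base index once, then just increment it as we walk across the row. Edges would still be handled by full_check as before.

```c
/* Interior cells only; one shift per row instead of one per cell.
 * The neighbor offsets -128/+128 come from the 128-byte row stride. */
void cca_step_rowbase(const char *in, char *out)
{
    int x, y, val, target;
    for (y = 1; y < 127; ++y) {
        int index = (y << 7) + 1;          /* the only shift in the loop */
        for (x = 1; x < 127; ++x, ++index) {
            val = in[index];
            target = (val + 1) & 0x0f;
            if (in[index - 128] == target ||
                in[index -   1] == target ||
                in[index +   1] == target ||
                in[index + 128] == target)
                val = target;
            out[index] = val;
        }
    }
}
```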


Genesis: First draft of the CCA

It’s time to move on to actually implementing the core algorithm for the Cyclic Cellular Automaton. I’ll start by describing the plan in English, and then below the fold we’ll break it down into actual code.

This will end up being a two-parter, I think, because this was more of an adventure than I expected.

Before We Begin

I was recently linked this memoir of an adventure from the same era. It’s a fun read, a pretty good explanation of why memory overrun vulnerabilities are bad, and (OMINOUS FORESHADOWING) might serve as a counterpoint to some of the adventures I had in my own implementation.

The Overall Plan

At the very highest level, here’s our plan:

  • We have two chunks of memory; one that is our current automaton state and one of which is an equally-sized scratch space.
  • Check to see if the user’s pushed the START button.
    • If they haven’t:
      • Compute the next frame, putting it in the scratch space.
      • Make the scratch space the new current animation state.
    • If they have, randomize the contents of the automaton state.
  • Hand the new frame to the renderer.
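The plan above can be sketched in C as a single per-tick routine (all names here are mine, and the real routines are in assembly; the point is the control flow and the pointer swap):

```c
typedef struct {
    unsigned char *state;    /* current automaton state */
    unsigned char *scratch;  /* equally-sized scratch space */
} Buffers;

/* One tick of the main loop. Returns the buffer to hand the renderer. */
unsigned char *one_tick(Buffers *b, int start_was_pressed,
                        void (*randomize)(unsigned char *),
                        void (*step)(const unsigned char *, unsigned char *))
{
    if (start_was_pressed) {
        randomize(b->state);
    } else {
        unsigned char *tmp;
        step(b->state, b->scratch);
        /* Swapping pointers makes the scratch space the new state. */
        tmp = b->state;
        b->state = b->scratch;
        b->scratch = tmp;
    }
    return b->state;
}

/* Toy stand-ins so the loop can be exercised without the real routines. */
static void toy_randomize(unsigned char *s) { s[0] = 7; }
static void toy_step(const unsigned char *in, unsigned char *out)
{
    out[0] = (unsigned char)((in[0] + 1) & 0x0f);
}
```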

Let’s take each of these elements in turn.


Our automaton state is a 128×128 grid, which takes up 16KB of RAM. The scratch space is another 16KB. That’s half our RAM, and since the processor’s own stack for tracking procedure calls and such lives at the top of RAM, we’ll just dedicate the entire bottom half of RAM to our state and scratch space.

We will not, however, be fixing which part of that space is the automaton state and which part is the scratch space. We’ll instead be using pointers into the space to distinguish that as needed.

Checking the START Button

We’ve already covered the basics of reading the controllers, but our task here is a little more tricky. We don’t really want to simply check whether the START button is down right now. If we do that, we’ll keep resetting the state every tick for as long as the button is held down, and if the user pushes START but releases it before the end of the frame, that input would be forgotten. What we need to know is whether the START button has transitioned from not-pressed to pressed since the last time we checked.
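That transition test is the classic “edge detection” trick: remember last frame’s pad state, and fire only on bits that are set now but weren’t before. A C sketch (START_BIT is a placeholder mask, not the Genesis’s actual bit assignment):

```c
#define START_BIT 0x80   /* placeholder position for the START bit */

static unsigned char prev_pad;   /* pad state from the previous check */

/* Fires exactly once per press, no matter how long START is held. */
int start_just_pressed(unsigned char pad_state)
{
    /* Set now, AND not set last time: a 0-to-1 transition. */
    int fired = (pad_state & ~prev_pad & START_BIT) != 0;
    prev_pad = pad_state;
    return fired;
}
```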

Computing the Next Frame

We have a 128×128 grid, and each cell in that grid can have one of 16 states. To get the next state, we need to do these steps for each cell:

  • Load the current value for that cell.
  • Compute the value that eats that value: 15 is eaten by 0, and every other value x is eaten by x+1.
  • If the cells north, south, east, or west of this cell (wrapping around if necessary) are of the color that eats this cell, write that value into the corresponding cell of the scratch space.
  • Otherwise, write the current value into the corresponding cell of the scratch space.

Because we’re using pointers to identify the automaton state and the scratch space, the “make the scratch space the new automaton state” step is simply swapping the scratch space and automaton state pointers—a much cheaper operation than actually copying 16KB around.

In fact, you may have noticed that all our boundaries in this system are built out of powers of two—16 states, 16KB buffers, 128×128 grids. This lets most of our operations happen at the level of individual bits. In fact, since we simply allocated the entire automaton state to the 32KB starting at $FF0000, we can alter which half of it we point to simply by toggling the $4000 bit on the pointer.

Likewise, we can compute the value that eats our cell by adding one and masking out all bits but the lower 4 (15 = $0F, $0F + 1 = $10, $10 & $0F = $00 = 0), and when working out the neighbors of our cells, we can just increment or decrement our X and Y coordinate, mask out all bits but the ones that represent 0-127, and then slot those bits into the appropriate parts of our pointer. The way our rendering code works, the X coordinate occupies bits 0-6 exactly (which may be extracted from our pointer by applying the bitmask $7F), and the Y coordinate occupies bits 7-13 (for a mask of $3F80). Slotting those bits into place produces a value we can add to our base pointer to produce the address of the neighbor we care about.
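All of those power-of-two tricks are easier to see as code than as prose. Here they are in C (addresses illustrative; $FF0000 is where the Genesis maps its work RAM):

```c
#define BUF_TOGGLE 0x4000u   /* flips between the two 16KB halves */

/* Swap current state and scratch space by toggling one address bit. */
unsigned long swap_buffers(unsigned long state_ptr)
{
    return state_ptr ^ BUF_TOGGLE;
}

/* Pack wrapped coordinates into an offset: X in bits 0-6, Y in bits 7-13.
 * The masks make the torus wrap-around automatic. */
unsigned int cell_offset(int x, int y)
{
    return ((unsigned)(y & 0x7f) << 7) | (unsigned)(x & 0x7f);
}

/* The "eats" rule: add one, keep the low four bits. */
unsigned int eater_of(unsigned int v)
{
    return (v + 1) & 0x0f;
}
```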

Randomizing the State

This is pretty straightforward; it’s just filling each spot in the automaton state memory with a random value between 0 and 15. The only real tricky bit here is that we’d prefer that the user get different configurations each time they play, and since we’re initializing our PRNG from values in the ROM, the same patterns will come out unless we mix it up somehow. I pick a first value that lets the automaton work through all its various states, and then every time a new generation is computed, I advance the RNG one step by collecting a random number and throwing it away.
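In outline (my sketch; the project’s actual generator and seed are different), the “advance and discard” trick looks like this:

```c
/* Stand-in PRNG: a classic LCG. Seed value is hypothetical. */
static unsigned long rng_state = 0x2545F491u;

/* Produce a cell value in 0-15 and advance the generator. */
unsigned int rng_next(void)
{
    rng_state = rng_state * 1103515245u + 12345u;
    return (unsigned int)(rng_state >> 16) & 0x0f;
}

/* Called once per generation: burn one value so that consecutive
 * resets start the sequence from different points. */
void rng_stir(void)
{
    (void)rng_next();
}
```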

It’s not a perfect system—we’re consuming so many values on a reset that the difference of a few generations ends up producing very similar maps—but it’s better than nothing.

Handing the Frame to the Renderer

We’ve already set up a variable that can be used to signal the renderer that it’s got new data to copy. We preprocess the new state into a form the VDP can interpret, and then we set that signal. The only real wrinkle is that we only have one place to put that processed VDP data, and so we cannot start writing to it until we’ve finished rendering the previous frame. We thus first wait for the renderer to signal back that it’s done rendering. (It does that by clearing the signal we sent ourselves, so we can go back and forth with a single byte of state.)
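The single byte of shared state works as a tiny two-way handshake. In C terms (names mine), the main loop spins on the flag until the renderer clears it, copies its data, and sets the flag again:

```c
/* 1 = the renderer owns the prepared data; 0 = the main loop may
 * overwrite it. Declared volatile because the interrupt handler and
 * the main loop both touch it. */
static volatile unsigned char frame_ready;

/* Main-loop side: busy-wait until the renderer is finished. */
int renderer_idle(void)
{
    return frame_ready == 0;
}

/* Main-loop side: called after copying new data into place. */
void signal_frame_ready(void)
{
    frame_ready = 1;
}

/* Interrupt side: called at the end of the VBLANK blit. */
void signal_render_done(void)
{
    frame_ready = 0;
}
```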

Implementation and Complications

I’ll be walking through my first working implementation of this below, or, at least, my almost-first implementation. My first attempt did work, mind you, at least more or less…


… but it was incredibly, incredibly slow. In the earlier posts I was worried about whether I would have enough time between generations for the two frames it takes to actually blit one out to matter. My first implementation required 558 frames per generation, for a total animation speed of about 0.11 FPS. It was still getting the correct results, though, and it wasn’t actually crashing or corrupting memory or anything. So that means it must be basically correct code, right?

Not so much, but given what the problem was, you’d normally think the consequences would be far more dire. Since I needed to compute the neighbors of each cell, which meant extracting the X and Y coordinates of each cell, I decided to cut out the middle man and just loop through each cell by looping through each row, and then looping through each column within each row. I could then build my pointer out of those values and go from there.

The mistake that I made was that instead of looping through the X and Y coordinates from 0 to 127 decimal, I wound up looping through them from 0 to 127 hex—that is, 0 to $127, or 295 decimal. Ordinarily this would mean that instead of keeping all my computation within the 16KB of an automaton state, it would blow out over 21KB past that point. This should have, at minimum, corrupted the graphics state, and possibly started consuming the actual data that controlled the simulation and display routines.

But that didn’t happen. Our renegade programs did not escape. Why not?

In the malfunctioning lightcycles program I linked up top, the computer opponent blasted a hole in the outer wall and drove out into system memory, where it began treating the contents of RAM overall as a set of obstacles to be avoided. It knew where it was and it was examining its surroundings. What saved CCA’s integrity here is that any value that would ever touch memory, even indirectly, was aggressively wrapped around so that only some of its bits were ever used to form an address. No matter how far you run on the surface of the Earth, you’ll never be able to reach the Moon. Likewise, our mad loop variables couldn’t reach outside the 16KB allotted to them.

So that’s why it didn’t crash. But not only did it not crash, it was also producing the right answer. This was, to handwave a little bit, because I’d written the CCAStep routine to be purely functional. It received a buffer (the current state) and only ever read from it, and it read only from that state to compute the next state. That next state, in turn, was treated as write-only. This means that while the loop strayed wildly beyond its intended bounds, it was constrained to remain in one place, and since its input was read-only, it got the same answer every time. We were repeating our work excessively, but we kept getting the same, correct answer.

Fixing those loop bounds sped it up by a factor of approximately five, bringing us to 124 frames per generation, or a 0.48 FPS animation speed.

We can do better than that, but premature optimization is the root of all evil, so let’s start by working through our simple, slow implementation and save optimization for the next post.


Genesis: Scrolling and Interrupts

The next step for this CCA project is to get a scrollable and updatable bitmap display. We want scrolling, in particular, to work smoothly no matter how busy the automaton itself is. We’d also really prefer to not have to care about when frames begin and end while computing the next step of the automaton.

This is a problem we ran into when we were implementing a different automaton for a different system, and the solution is going to be roughly similar—anything that requires reliable display timing will need to be pushed into an interrupt handler for the vertical blanking period.

What’s different about this system is that the Video Display Processor is extremely stateful, and as a result if we do any writing of VRAM during the interrupt, we’ll need to ensure that all VRAM writes are banished to interrupts so that our state doesn’t get corrupted by race conditions. (If we didn’t care about smooth animation quite so aggressively, we could allow the occasional frame of lag by turning off the VBLANK interrupt every time the main thread needed to touch VRAM, but that won’t really meet our needs.)

That doesn’t turn out to be a problem for us. We can have our main program loop look like this:

  • Compute the next frame of the automaton.
  • Wait for the previous frame to finish rendering, if necessary.
  • Copy the next frame into the renderer’s source space.
  • Signal the renderer that it may start rendering this new data.
  • Make the ‘next’ frame the ‘current’ frame and the previous frame the new ‘next’ frame.

The VBLANK interrupt then becomes a sort of “rendering thread”, which looks like a loop that does this:

  • Read the gamepad and update scroll values as necessary.
  • If there is data to render, render a half-screen’s worth of data.
  • If there is not data to render, signal the main program that it may copy over the new data as needed.
  • Do nothing until the next VBLANK.

This implementation may eventually need some extra time delay—most likely, we’ll want to make sure that the animation is limited to, say, 5 FPS—but this will do for a starting point.

We’ve already covered the rendering and the gamepad reading in previous posts; today’s update involves setting the scroll values and configuring interrupts on the MC68000.


The Genesis doesn’t really have a “background layer”. It offers two “scroll layers”, and the cells within them can have their priority set with respect to each other and to sprites independently. These can be set at a much more fine-grained level than we’re used to seeing, as well; the horizontal scrolling information isn’t a simple register but is instead a set of tables for looking up scroll data at a character-cell or even individual-scanline level. The vertical scrolling information is another table, allowing vertical scroll values to vary every sixteen pixels.

We don’t really need any of those things, though. We’re scrolling both scroll layers in sync with one another and they are also being scrolled as a complete unit. The kind of scrolling is configured by VDP register $0B but we don’t really have to worry about it because the reset code has thoughtfully prepared that for us to begin with. The reset code also allocates the horizontal scroll information to $AC00 in VRAM, and vertical scroll information has its own special block of “memory.” I’m using scare quotes for that because I get the impression that the color memory and vertical scroll memory are actually some sort of on-chip register bank, but as far as our programming interface is concerned, they’re just chunks of memory we must write a word at a time.

The actual scroll values are straightforward values that let us range over the fullness of our 64×64 maps, and since excess bits are ignored we move around the map with simple increments and decrements to our X and Y values. The only part we have to be careful with is that we need to write each value to VRAM twice—once for scroll layer A and once for scroll layer B.


We’ve already looked at interrupts on the Z80 chip, but those barely count as interrupts—there isn’t even a dedicated instruction for atomically re-enabling interrupts on the way out of a routine. It’s just a normal function that gets called as a side effect of some external signal.

The Motorola 68000 is more exciting. Hardware signals are assigned one of seven interrupt levels, and those priorities are fixed in hardware. When in supervisor mode, we can set our interrupt mask to any value between 0 and 7, and any interrupt whose level is larger than the mask will get through. One strange exception to this is that even though you can set a mask of 7, interrupt level 7 is nonmaskable and always comes through.

(We’ve sort of seen this before, with the 6502 family of chips. IRQ and NMI can be treated as levels 0 and 1. NMI even stands for “nonmaskable interrupt”.)

The actual interrupt level is part of the flags register. When we’re in supervisor mode, this is called the status register, is named sr, and is sixteen bits wide. The interrupt mask level is in bits 8-10. The supervisor bit itself is bit 13, and we need to make sure that one stays on. The other bits should go to zero. (Bits 0-7 are where things like the zero flag live.) Thus, we can set the interrupt mask to n with the instruction MOVE.W #$2n00, sr.

The Genesis assigns VBLANK to interrupt level 6 and HBLANK to interrupt level 4. The command MOVE.W #$2500, sr will get us ready to go. We also need to set a bit in VDP register $01 to make the VBLANK interrupt even be attempted. That register also controls DMA, vertical display height, and the overall controls for enabling the display at all, so we will need to juggle our needs with a bit of care.
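Putting those two pieces together, the startup might look something like this (a sketch; $74 is one common value for register $01—display on, VBLANK interrupt on, DMA on, Mode 5—but the exact value depends on what else the program needs from that register):

```
        ; Unmask interrupt levels 6 and 7 (VBLANK and above).
        move.w  #$2500, sr

        ; Enable the VBLANK interrupt via VDP register $01.
        ; Register writes go through the control port as $8rvv.
        move.w  #$8174, $C00004
```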

All in all though, there still isn’t a whole lot that’s terribly difficult going on here. With a simple countdown variable like we used on the Game Boy, the two-phase screen update, and the joystick and scroll code, the complete VBLANK routine and its support functions weigh in at just 214 bytes. Those 214 bytes take most of a frame to execute sometimes, since we occasionally have to copy 8 kilobytes over an incredibly crowded data channel, but it does that at the fastest speed the system permits.

What’s left?

I still haven’t actually implemented the automaton at this point, just the ability to convert a grid into a form the VDP can interpret and then have it copied into VRAM regularly. So I’ll need to write my CCAStep routine. I’ll also need to expand the joystick routine a bit so that the START button resets the automaton.

If display tearing is too noticeable, I’ll need to reimplement the double-buffering technique I used on the Game Boy automaton here. We’ve got plenty of spare VRAM to do this, but I’m treating that as more of a bonus refinement for now.

Lastly, I still haven’t used the YM2612 sound chip to do the FM synthesis it was designed to actually do. So the final piece of mandatory work for the project will be to give it a soundtrack.

From here on out, that’s really all just ticking boxes. There’s nothing new I need to really master; it’s just a matter of doing things I’ve done before in other contexts in this new context. On we go!