Looking Back on the Mac

Earlier this month, Ars Technica ran an article on attempting to use a Macintosh IIsi in the modern day to do actual work. I was never really part of the classical Mac world—even our school computer labs were either Apple IIgs machines or 32-bit x86 systems. (While I wasn’t really aware of it at the time, based on what I remember of the labs I think they were actually running DOS under Novell NetWare.)

What I found interesting was that this wasn’t the first time they’d done a stunt like this; four years or so ago they attempted the same thing with a G3 PowerBook running System 9. This was interesting both because that author had much more trouble dealing with System 9 than the newer author had with System 7, and because it linked to Ars Technica’s comprehensive reviews of OS X 10.0 from its public beta period and initial release; those reviews were unkind indeed.

The thing that jumped out the most at me from looking through these articles was the way classic Mac OS was presented as a system designed to be aggressively micromanaged and configured within an inch of its life. I first seriously entered the Mac world around 2008, when the transition to Intel architecture was just starting to get entrenched with consumers. By then, the common stereotype was that Apple systems were designed to make basically all design and interaction decisions for the user, and that personalization of one’s computing experience was generally considered a Bad Thing by the community. This stereotype very clearly had no basis until OS X, and in the run-up to the initial OS X releases there was a palpable sense of panic in the writing that the System 9 faithful were going to be relegated to a niche market not worth serving in the new world.

In retrospect, I should have recognized some of this in advance. Around the time I was learning to develop in and for Mac OS X, I encountered Andrew Plotkin’s silly user-experience diary of installing and acclimating to Mac OS X 10.1. A lot of the trouble he had revolved around setting up his interaction model to let him work the way he wanted to. In rereading it now I’m struck by the way his “One True Way” of organizing applications in the Apple Menu matched the ways I organized my Start Menu back in the days when you organized your Start Menu rather than running a search engine over it.

I did not, and still don’t, have much interest in learning about the internals and idioms of classic Mac OS. Everything I have seen about it reminds me of what I consider the “awkward era” of personal computing, running from about 1995 through about 2005. By that point we knew, more or less, what we wanted a personal computer to be, but the only machines that could actually do it were workstation-level systems that, while they weren’t exactly filling entire rooms, still cost tens or even hundreds of thousands of dollars. After the home computer apocalypse of the early 1990s, Apple and Microsoft ended up evolving their systems towards an ideal obviously set by systems like SGI Unix workstations—but they needed to preserve the things that made their computers attractive to their users along the way. The story of Mac OS from System 7 through OS X 10.4 or so, and of Windows from Windows 95 through Windows XP, is the story of that evolution, with the steps necessary to reach that endpoint taken in very different orders.

Unlike the Windows case, though, where XP was acknowledged as the point at which the NT line of business-grade operating systems finally became compatible enough with consumer hardware and Win9x-era software that ordinary users could run an NT kernel and get all the benefits it was already known to bring, it seems undeniable that something real was lost in the transition from Mac OS 9 to OS X.

Racing Against the Atmel AVR’s Toolkit

Forays into the Sega and DOS worlds aside, my work on this blog has mostly involved old 8-bit CPUs. One might reasonably object that “old 8-bit” is a redundant phrase, but this objection is misplaced. The microcontroller market—that is, the market for programmable or programmed devices that are not computers—retains a significant 8-bit presence, and was overwhelmingly 8-bit until quite recently. Some of these chips are modern descendants of the earliest processors—the Z80, 6502, General Instrument PIC, and Intel 8051 all have living descendants today—but one of the most visible chip lines these days is a relative newcomer: the Atmel AVR.

It says something about how long-lived chip designs are that a 20-year-old system is a “relative newcomer,” but it’s undeniable. All the other examples I gave above are from 1980 or earlier. The AVR is only six years older than the x86-64 architecture that dominates the desktop market; 1997 is a respectably recent vintage overall. It’s firmly planted in the modern era of CPU design, and it’s very popular in the hobby market because it’s the core of the Arduino family of products. Let’s take a quick tour through its design and see what it means to be 8-bit at the turn of the century.

Continue reading

To DLL Hell And Back

I’d mentioned in the previous post that I’d now written “a program” for each of the machines I’d grown up with. This is an interesting proposition because the notion of what it means to be “a program,” or to “load” and “run,” has changed over the years. Let’s take a tour through the various ways you get from “a program on some storage medium” to “a program that is actually running.”

This is in roughly increasing order of the complexity of the loading system, which is not quite historical order and not even quite increasing order of the complexity of loading any individual program.

Continue reading

An Ambition, Completed

I belong to that somewhat narrow cohort that straddles Generation X and the Millennials—we grew up with the microcomputer omnipresent, but we remember a time before The Internet. The cohort has gotten various attempts at a nickname, but I’ve always been fond of the name Oregon Trail generation, after the famous educational software.

My parents were also somewhat technically inclined (or at least gadget-oriented), and as a result I grew up alongside a number of game consoles and home computers. A lot of the work I did on those machines, both as a hobby and for schoolwork, shaped my talents in the years to come.

One of my goals here on Bumbershoot over the past few years has been to demystify the machines I grew up with. I’m now a genuine software professional, after all—surely I’ve learned enough to see how the tricks were done and to join in myself, if a bit late.

As such, once I realized that this was what I was doing, I formalized the nature of the task and started getting systematic about it. And now, with this latest release, I have finished it.

The Rules

Here are the rules I had set for this task.

  • Childhood ends in 1992. This is a somewhat arbitrary division point, but it turns out I can pick any year from there to my 20th birthday and not change the list of targeted systems, and 1992 means I don’t have to distinguish between when a platform was released in North America and when I got access to it.
  • Systems require a full year of exposure to count as “one I grew up with.” I had brief acquaintance with the Macintosh and the SNES, but both were systems I didn’t really have any experience with until adulthood. I first beat Super Metroid in 2010.
  • Programs must run on the metal. Interpreted languages are permitted only to the extent that they can then transfer control to machine code. This started out as “BASIC and LOGO don’t count” but eventually became a requirement that, for at least one project, I produce a binary where every single byte in the program exists because I put it there.
  • Programs must be for the platform. Writing a highly portable C program and jamming it through a compiler for each system doesn’t count. The program should make generous use of system-specific capabilities and show off what the platform is good at. Any stunt that reaches demo-effect quality automatically achieves this goal.
  • The platform must have been intended to receive third-party programs. That means the graphing calculators we had in high school don’t count, even though they had Z80 chips in them and you could program them with special connectors and PC programs that no longer run on 64-bit operating systems.
  • Period tools are not required. Since half the targets are game consoles, cross-development tools were the norm even at the time.

So, without further ado, the systems I grew up with.

Continue reading

Release: The Cyclic Cellular Automaton for the Sega Genesis/Mega Drive

I’ve finished my project: an implementation of the Cyclic Cellular Automaton for the Sega Genesis. This uses a very simple rule across a grid to let that grid evolve from a chaotic start condition to a highly ordered end condition:

[Image: cca_defect]

The grid in this program is 128×128 cells, each with 16 possible states. The screen can only display an 80×56 window into that grid at a time, though. You can scroll around the grid with the controller’s D-Pad; since the map is a torus, it will wrap around if you keep scrolling in any direction. The START button will reset the display with a new, randomized grid.
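
In code, the wrap-around amounts to keeping the window’s origin modulo the grid size. Here is a tiny C sketch of that idea, with made-up names; the ROM itself does this in 68000 assembly as part of its controller handling:

#include <stdint.h>

#define GRID_SIZE   128          /* cells per side of the toroidal grid */
#define VIEW_WIDTH   80          /* cells visible horizontally */
#define VIEW_HEIGHT  56          /* cells visible vertically */

static uint8_t view_x, view_y;   /* top-left cell of the visible window */

/* dx and dy are -1, 0, or +1 depending on the D-Pad state this frame. */
void scroll_view(int dx, int dy)
{
    view_x = (uint8_t)((view_x + dx) & (GRID_SIZE - 1));  /* wrap mod 128 */
    view_y = (uint8_t)((view_y + dy) & (GRID_SIZE - 1));
}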

This program uses the controller, the music chip, both scroll layers, and a digital sound effect during initialization. It doesn’t use sprites or the legacy sound chip, but those were covered in their own articles. That’s enough work with the platform that I feel I can say I’ve put the console as a whole through its paces.

Downloads

Download the SMD ROM image here. It should run in any Genesis emulator—I’ve tested it on Gens, DGen, Kega Fusion, and Exodus. I have not yet had the opportunity to test it on real hardware—if you have a flashcart and hardware I’d love to hear how it goes.

Continue reading

Genesis: The YM2612 Music Chip

The last piece of this Genesis CCA application is background music using the actual Yamaha YM2612 chip instead of the TI SN76489 that’s left over from the Sega Master System. The general principle turns out to be quite similar.

The general plan is still to hook the Z80 to the new-frame interrupt and send new data to the sound chip as needed each time. This task is simplified for the YM2612 because it has an actual envelope generator—we don’t need to manually step the volume down each frame; we just need to mark key-on/off transitions. What makes the task a bit more complicated is that we may need to do a nearly arbitrarily large amount of work each frame—configuring an instrument is a lot of work.

To address that, I’ve taken my cues from id Software’s old music format for early PCs. That format was a collection of register writes to the AdLib card, each paired with how long to wait until the next register write, and that wait was permitted to be zero. In my system I provide a number of frames to wait before the next block of writes, how many writes are in this block, and then that many writes. As written, this restricts me to the first three channels on the YM2612, but the sample song I’m using only has two voices anyway, so it should be fine. It could easily be extended to carry two blocks, one for each of the chip’s two register banks; under most circumstances, one of those would be empty at any given interruption point.
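
To make the format concrete, here is a small, compilable C sketch of a player for a stream in that spirit. The exact byte layout (delay, write count, then register/value pairs), the 0xFF end-of-stream stand-in, and names like music_frame and ym2612_write are illustrative choices of mine, not the actual data format or player, which runs on the console rather than in C:

#include <stdint.h>
#include <stdio.h>

/* Stand-in for a real chip access: on hardware this would poke the
   YM2612's address and data ports; here it just logs the write. */
static void ym2612_write(uint8_t reg, uint8_t val)
{
    printf("  reg $%02X <- $%02X\n", reg, val);
}

static const uint8_t *song_pos;  /* cursor into the event stream */
static int frames_left;          /* frames until the next block is due */

void music_init(const uint8_t *song)
{
    song_pos = song;
    frames_left = *song_pos++ + 1;  /* +1: music_frame decrements first */
}

/* Call once per new-frame interrupt. */
void music_frame(void)
{
    if (--frames_left > 0)
        return;                      /* the next block isn't due yet */
    /* Perform the block that just came due, plus any following blocks
       whose delay is zero (they land on this same frame). */
    do {
        uint8_t count = *song_pos++;
        while (count--) {
            uint8_t reg = *song_pos++;
            uint8_t val = *song_pos++;
            ym2612_write(reg, val);
        }
        frames_left = *song_pos++;   /* delay before the following block */
    } while (frames_left == 0);
}

int main(void)
{
    /* Each record: delay, write count, then (register, value) pairs.
       Register $28 is the YM2612's key-on/off register; the final 0xFF
       delay stands in for end-of-song handling, which is omitted here. */
    static const uint8_t song[] = {
        0, 2, 0x28, 0x00, 0x28, 0xF0,  /* frame 0: key off, then key on */
        4, 1, 0x28, 0x00,              /* four frames later: key off */
        0xFF, 0
    };
    int frame;
    music_init(song);
    for (frame = 0; frame < 8; frame++) {
        printf("frame %d:\n", frame);
        music_frame();
    }
    return 0;
}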

There is also an interesting side effect: apparently it doesn’t matter how many times you write a key-on value—the note will not re-gate unless you turn it off and then on again. That can be important, because if a note has already been held so long that it has faded to silence, you won’t hear the later notes without that off/on gap. So our stream of notes has to include a tiny rest at the end of every note, or the next one won’t start properly.

The song data that I’ve been feeding my playroutines is generally automatically generated from textual sources, and this has started to get sufficiently onerous that I decided to make a more compact representation. I based this on the PLAY command supported by the BASICs on the IBM PC and the Commodore 128. This was a music macro language that was sufficient for one voice but broke down rather dramatically when you needed more than one (as was sadly demonstrated by the C128). The BASIC implementation would block if you queued up a note for a voice that was already playing a note, so you had to rapidly cycle between all the voices from longest note to shortest so that you never missed a cue.

That’s more tedious than I’d generally like, so while I implemented most of the macro language, I kept each voice in its own stream and then used a separate collation step to fold them all together. That collation results in a series of register writes that looks a lot like the C128 multivoice case, but here the challenge is that we never get to block at all, because the interrupt timer is pitiless and unforgiving.
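
To make that collation step concrete, here is a rough C sketch of the idea: each voice contributes a list of register writes tagged with the frame they should land on, and folding the voices together is just a sort by frame followed by grouping. The Event type, the names, and the single-byte fields are illustrative assumptions on my part, not the actual tool, which works from the textual macro sources described above:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* One register write, tagged with the frame it should land on. */
typedef struct { int frame; uint8_t reg, val; } Event;

static void emit_byte(uint8_t b) { printf("%02X ", b); }

static int by_frame(const void *a, const void *b)
{
    return ((const Event *)a)->frame - ((const Event *)b)->frame;
}

/* Fold per-voice events into the (delay, count, writes...) stream the
   player consumes: sort by frame, then group events that share a frame
   into one block.  A real tool would also split delays longer than one
   byte; this sketch assumes they fit. */
static void collate(Event *ev, int n)
{
    int i, j, k, last = 0;
    qsort(ev, n, sizeof *ev, by_frame);
    for (i = 0; i < n; i = j) {
        for (j = i; j < n && ev[j].frame == ev[i].frame; j++)
            ;                                     /* events on this frame */
        emit_byte((uint8_t)(ev[i].frame - last)); /* delay since last block */
        emit_byte((uint8_t)(j - i));              /* write count */
        for (k = i; k < j; k++) {
            emit_byte(ev[k].reg);
            emit_byte(ev[k].val);
        }
        last = ev[i].frame;
    }
    printf("\n");
}

int main(void)
{
    /* Two voices' key on/off events, listed per voice; collation
       interleaves them by frame. */
    Event song[] = {
        { 0, 0x28, 0xF0 }, { 24, 0x28, 0x00 },  /* voice on channel 1 */
        { 0, 0x28, 0xF1 }, { 12, 0x28, 0x01 },  /* voice on channel 2 */
    };
    collate(song, (int)(sizeof song / sizeof song[0]));
    return 0;
}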

We’ve done what we came to do, though. This project will be done soon.

Further Reading

Official documentation for the YM2612 was sparse and misleading. The members of the SMS Power forum seem to have done most of the work in determining things like how to actually compute frequency numbers for given targets. This forum post in particular, from 2001, was critical to getting my songs to sound right.

Optimizing the CCA

Last time, we wrote a basic implementation of the CCA in m68k assembly language, but it wasn’t running as fast as we’d like. In this article, we will speed that routine up by a factor of four. The techniques we’ll be using to do this aren’t really specific to 68000 assembly language, or indeed to assembly language coding at all, so much of the work in this post will also be shown in C equivalents.

That said, remember the two rules of optimization:

  1. Don’t do it.
  2. (Experts only) Don’t do it yet.

It’s been a week, so now we can do it. Also, we get to skip what is usually the main work of performance tuning, in that we do not need a profiler to tell us which function is slowing us down—there is only one function doing any significant work at all here. Working on hopped-up toy problems has its advantages.

Our Starting Point

Here is some C code that is roughly equivalent to our initial implementation from last time:

/* Map (x, y) onto the 128x128 grid; masking both coordinates makes the
   grid wrap around, i.e., behave as a torus. */
#define INDEX(x, y) ((((y) & 0x7f) << 7) | ((x) & 0x7f))
/* If the neighbor at (x, y) already holds the target state, the cell advances. */
#define CHECK(x, y) if (in[INDEX(x, y)] == target) val = target

void full_check(const char *in, char *out, int x, int y)
{
    int val, target;
    val = in[INDEX(x, y)];
    target = (val + 1) & 0x0f;
    CHECK(x-1, y);
    CHECK(x+1, y);
    CHECK(x, y-1);
    CHECK(x, y+1);
    out[INDEX(x, y)] = val;
}    

void cca_step_0(const char *in, char *out)
{
    int x, y;
    for (y = 0; y < 128; ++y) {
        for (x = 0; x < 128; ++x) {
            full_check(in, out, x, y);
        }
    }
}

I’ve made a couple of organizational changes from the assembly language code; the loop body has become a subroutine in its own right, and the “internal subroutines” of the original implementation have been rendered as macros and are being inlined in the code itself. These would both be somewhat questionable implementation decisions if you were writing a CCA routine in C for real, but they will make it easier to reason about the code we already have.

Optimization 1: Remove redundant checks

The 68000 is an early enough chip that we don’t really have to worry about pipelining or cache misses or anything like that—if we want a computation to take less time, our only real option is to do less work. On the plus side, this also means that any time we find ourselves repeating a bunch of work in an inner loop, we know it’s ultimately going to be slowing us down a bunch.

There isn’t a whole lot we can do about the fact that we need to visit all 16,384 cells in the grid, so our initial focus will be on full_check. As written, we are calling that function 16,384 times, so any time we can save in there will be magnified greatly. What can we save?

Not much, in the most general case, as it turns out. We need to check each neighbor, and that means we need to do the INDEX computation at least five times. We could spare ourselves one, perhaps, if we cached the result from the first call (for INDEX(x, y)). But that’s missing the forest for one rather large tree: The only reason we need INDEX at all instead of just taking fixed offsets from our starting point is that we’re worried about wrapping around. We can drastically simplify the non-edge case, breaking out the edges and corners into their own loop afterwards:

void cca_step_1(const char *in, char *out)
{
    int x, y, index, val, target;
    for (y = 1; y < 127; ++y) {
        for (x = 1; x < 127; ++x) {
            index = INDEX(x, y);
            val = in[index];
            target = (val + 1) & 0x0f;
            if (in[index - 128] == target ||
                in[index -   1] == target ||
                in[index +   1] == target ||
                in[index + 128] == target) {
                val = target;
            }
            out[index] = val;
        }
    }
    /* The edges and corners still need the wrapping INDEX logic, so they
       get the general routine. */
    for (y = 0; y < 128; ++y) {
        full_check(in, out, 0, y);
        full_check(in, out, y, 0);
        full_check(in, out, y, 127);
        full_check(in, out, 127, y);
    }
}

One very nice thing about this is that it’s saving us a lot of largish bit-shift operations. It turns out that even though the 68000’s instruction set lets you specify large shifts in a single instruction, those instructions take longer the further you shift. A shift of seven, as we use in INDEX, costs as much as five accesses to memory.

Continue reading