Genesis: First draft of the CCA

It’s time to move on to actually implementing the core algorithm for the Cyclic Cellular Automaton. I’ll start by describing the plan in English, and then below the fold we’ll break it down into actual code.

This will end up being a two-parter, I think, because this was more of an adventure than I expected.

Before We Begin

I was recently linked this memoir of an adventure from the same era. It’s a fun read, a pretty good explanation of why memory overrun vulnerabilities are bad, and (OMINOUS FORESHADOWING) might serve as a counterpoint to some of the adventures I had in my own implementation.

The Overall Plan

At the very highest level, here’s our plan:

  • We have two chunks of memory; one that is our current automaton state and one of which is an equally-sized scratch space.
  • Check to see if the user’s pushed the START button.
    • If they haven’t:
      • Compute the next frame, putting it in the scratch space.
      • Make the scratch space the new current animation state.
    • If they have, randomize the contents of the automaton state.
  • Hand the new frame to the renderer.

Let’s take each of these elements in turn.

Memory

Our automaton state is a 128×128 grid, which takes up 16KB of RAM. The scratch space is another 16KB. That’s half our RAM, and since the processor’s own stack for tracking procedure calls and such lives at the top of RAM, we’ll just dedicate the entire bottom half of RAM to our state and scratch space.

We will not, however, be fixing which part of that space is the automaton state and which part is the scratch space. We’ll instead be using pointers into the space to distinguish that as needed.

Checking the START Button

We’ve already covered the basics of reading the controllers, but our task here is a little bit more tricky. We don’t really want to simply check if the START button is down right now. If we do that, we’ll keep resetting the state each tick for as long as the button is held down, and if they push START but release it before the end of the frame, that input would be forgotten. What need is to know if the START button had transitioned from not-pressed to pressed since the last time we reset.

Computing the Next Frame

We have a 128×128 grid, and each cell in that grid can have one of 16 states. To get the next state, we need to do these steps for each cell:

  • Load the current value for that cell.
  • Compute the value that eats that value – 0 eats 15, and for all other x, x+1 eats it.
  • If the cells north, south, east, or west of this cell (wrapping around if necessary) are of the color that eats this cell, write that value into the corresponding cell of the scratch space.
  • Otherwise, write the current value into the corresponding cell of the scratch space.

Because we’re using pointers to identify the automaton state and the scratch space, the “make the scratch space the new automaton state” step is simply swapping the scratch space and automaton state pointers—a much cheaper operation than actually copying 16KB around.

In fact, you may have noticed that all our boundaries in this system are built out of powers of two—16 states, 16KB buffers, 128×128 grids. This lets most of our operations happen at the level of individual bits. In fact, since we simply allocated the entire automaton state to the 32KB starting at $FF0000, we can alter which half of it we point to simply by toggling the $4000 bit on the pointer.

Likewise, we can compute the value that eats our cell by adding one and masking out all bits but the lower 4 (15 =&nbsp$0F + 1 = $10, $10 & $0F = $00 = 0), and when working out the neighbors of our cells, we can just increment or decrement our X and Y coordinate, mask out all bits but the ones that represent 0-127, and then slot those bits into the appropriate parts of our pointer. The way our rendering code works, the X coordinate occupies bits 0-6 exactly (which may be extracted from our pointer by applying the bitmask $7F), and the Y coordinate occupies bits 7-13 (for a mask of $3F8). Slotting those bits into place produces a value we can add to our base pointer to produce the address of the neighbor we care about.

Randomizing the State

This is pretty straightforward; it’s just filling each spot in the automaton state memory with a random value between 0 and 15. The only real tricky bit here is that we’d prefer that the user get different configurations each time they play, and since we’re initializing our PRNG from values in the ROM, the same patterns will come out unless we mix it up somehow. I pick a first value that lets the automaton work through all its various states, and then every time a new generation is computed, I advance the RNG one step by collecting a random number and throwing it away.

It’s not a perfect system—we’re consuming so many values on a reset that the difference of a few generations ends up producing very similar maps—but it’s better than nothing.

Handing the Frame to the Renderer

We’ve already set up a variable that can be used to signal the renderer that it’s got new data to copy. We preprocess the new state into a form the VDP can interpret, and then we set that signal. The only real wrinkle is that we only have one place to put that processed VDP data, and so we cannot start writing to it until we’ve finished rendering the previous frame. We thus first wait for the renderer to signal back that it’s done rendering. (It does that by clearing the signal we sent ourselves, so we can go back and forth with a single byte of state.)

Implementation and Complications

I’ll be walking through my first working implementation of this below, or, at least, my almost-first implementation. My first attempt did work, mind you, at least more or less…

cca_droplet

… but it was incredibly, incredibly slow. I was worried in the earlier posts about whether I was going to have enough time between generations to worry about it taking two frames to actually blit it out. My first implementation required 558 frames per generation, for a total animation speed of about 0.11 FPS. It was still getting the correct results, though, and it wasn’t actually crashing or corrupting memory or anything. So that means it must be basically correct code, right?

Not so much, but given what the problem was, you’d normally think the consequences would be far more dire. Since I needed to compute the neighbors of each cell, which meant extracting the X and Y coordinates of each cell, I decided to cut out the middle man and just loop through each cell by looping through each row, and then looping through each column within each row. I could then build my pointer out of those values and go from there.

The mistake that I made was that instead of looping through X and Y coordinates from 0 to 127 decimal, I would up looping through them from 0 to 127 hex. Ordinarily this would mean that instead of keeping all my computation within the 16KB of an automaton state, it would blow out over 21KB past that point. This should have, at minimum, corrupted the graphics state, and possibly started consuming the actual data that controlled the simulation and display routines.

But that didn’t happen. Our renegade programs did not escape. Why not?

In the malfunctioning lightcycles program I linked up top, the computer opponent blasted a hole in the outer wall and drove out into system memory, where it began treating the contents of RAM overall as a set of obstacles to be avoided. It knew where it was and it was examining its surroundings. What saved CCA’s integrity here is that any operation that ever touched memory or even eventually touched memory was aggressively wrapped around so that only some of its bits were used to set its dimensions. No matter how far you run on the surface of the Earth, you’ll never be able to reach the Moon. Likewise, our mad loop variables couldn’t reach outside of the 16KB allotted for them.

So that’s why it didn’t crash. But not only did it not crash, it also was producing the right answer. This was, to handwave a little bit, because I’d written the CCAstep routine to be purely functional. It received a buffer (the current state) and then only read from it. It also read only from that state to compute the next state. That next state was also treated as write-only. This means that while the loop strayed wildly beyond its intended bounds, it got constrained to remain in one place and then, since access to its input was read-only, got the same answer every time. We were repeating our work excessively, but we would keep getting the same, correct answer.

Fixing those loop bounds sped it up by a factor of approximately five, bringing us to 124 frames per generation, or an 0.48 FPS animation speed.

We can do better than that, but premature optimization is the root of all evil, so let’s start by working through our simple, slow implementation and save optimization for the next post.

Continue reading

Advertisements

Genesis: Scrolling and Interrupts

The next step for this CCA project is to get a scrollable and updatable bitmap display. We want scrolling, in particular, to work smoothly no matter how busy the automaton itself is. We’d also really prefer to not have to care about when frames begin and end while computing the next step of the automaton.

This is a problem we ran into when we were implementing a different automaton for a different system, and the solution is going to be roughly similar—anything that requires reliable display timing will need to be pushed into an interrupt handler for the vertical blanking period.

What’s different about this system is that the Video Display Processor is extremely stateful, and as a result if we do any writing of VRAM during the interrupt, we’ll need to ensure that all VRAM writes are banished to interrupts so that our state doesn’t get corrupted by race conditions. (If we didn’t care about smooth animation quite so aggressively, we could allow the occasional frame of lag by turning off the VBLANK interrupt every time the main thread needed to touch VRAM, but that won’t really meet our needs.)

That doesn’t turn out to be a problem for us. We can have our main program loop look like this:

  • Compute the next frame of the automaton.
  • Wait for the previous frame to finish rendering, if necessary.
  • Copy the next frame into the renderer’s source space.
  • Signal the renderer that it may start rendering this new data.
  • Make the ‘next’ frame the ‘current’ frame and the previous frame the new ‘next’ frame.

The VBLANK interrupt then becomes a sort of “rendering thread”, which looks like a loop that does this:

  • Read the gamepad and update scroll values as necessary.
  • If there is data to render, render a half-screen’s worth of data.
  • If there is not data to render, signal the main program that it may copy over the new data as needed.
  • Do nothing until the next VBLANK.

This implementation may eventually need some extra time delay—most likely, we’ll probably want to make sure that the animation is limited to say 5 FPS—but this will do for a starting point.

We’ve already covered the rendering and the gamepad reading in previous posts; today’s update involves setting the scroll values and configuring interrupts on the MC68000.

Scrolling

The Genesis doesn’t really have a “background layer”. It offers two “scroll layers” and the cells within it can have their priority set with respect to each other and sprites independently. These can be set at a much more fine-grained level that we’re used to seeing, as well; the horizontal scrolling information isn’t a simple register but is instead a set of tables for looking up scroll data at a character cell or even individual scanline level. The vertical scrolling information is another table, allowing vertical scroll levels to vary every sixteen pixels.

We don’t really need any of those things, though. We’re scrolling both scroll layers in sync with one another and they are also being scrolled as a complete unit. The kind of scrolling is configured by VDP register $0B but we don’t really have to worry about it because the reset code has thoughtfully prepared that for us to begin with. The reset code also allocates the horizontal scroll information to $AC00 in VRAM, and vertical scroll information has its own special block of “memory.” I’m using scare quotes for that because I get the impression that the color memory and vertical scroll memory are actually some sort of on-chip register bank, but as far as our programming interface is concerned, they’re just chunks of memory we must write a word at a time.

The actual scroll values are straightforward values that let us range over the fullness of our 64×64 maps, and since excess bits are ignored we move around the map with simple increments and decrements to our X and Y values. The only part we have to be careful with is that we need to write each value to VRAM twice—once for scroll layer A and once for scroll layer B.

Interrupts

We’ve already looked at interrupts on the Z80 chip, but those barely count as interrupts—there isn’t even a dedicated instruction for atomically re-enabling interrupts on the way out of a routine. It’s just a normal function that gets called as a side effect of some external signal.

The Motorola 68000 is more exciting. Hardware signals are assigned one of seven interrupt levels and these are set in a fixed order. When in supervisor mode, we can set our interrupt mask to any value between 0 and 7. Any interrupt whose level is larger than our mask will get through. One strange exception to this is that even though you can set a mask of 7, interrupt level 7 is nonmaskable and always comes through.

(We’ve sort of seen this before, with the 6502 family of chips. IRQ and NMI can be treated as levels 0 and 1. NMI even stands for “nonmaskable interrupt”.)

The actual interrupt level is part of the flags register. When we’re in supervisor mode, this is called the status register, is named sr, and is sixteen bits wide. The interrupt mask level is in bits 8-10. The supervisor bit itself is bit 13, and we need to make sure that one stays on. The other bits should go to zero. (Bits 0-7 are where things like the zero flag live.) Thus, we can set the interrupt mask to n with the instruction MOVE.W #$2n00, sr.

The Genesis assigns VBLANK to interrupt level 6 and HBLANK to interrupt level 4. The command MOVE.W #$2500, sr will get us ready to go. We also need to set a bit in VDP register $01 to make the VBLANK interrupt even be attempted. That register also controls DMA, vertical display height, and the overall controls for enabling the display at all, so we will need to juggle our needs with a bit of care.

All in all though, there still isn’t a whole lot that’s terribly difficult going on here. With a simple countdown variable like we used on the Game Boy, the two-phase screen update, and the joystick and scroll code, the complete VBLANK routine and its support functions weigh in at just 214 bytes. Those 214 bytes take most of a frame to execute sometimes, since we occasionally have to copy 8 kilobytes over an incredibly crowded data channel, but it does that at the fastest speed the system permits.

What’s left?

I still haven’t actually implemented the automaton at this point, just the ability to convert a grid into a form interpretable by the VDP and then actually copied into VRAM regularly. So I’ll need to write my CCAStep routine. I’ll also need to expand the joystick routine a bit so that I reset the automaton with the START button.

If display tearing is too noticable, I’ll need to reimplement the double buffering technique I used on the Game Boy automaton here. We’ve got plenty of VRAM spare to do this, but I’m treating that as more of a bonus refinement for now.

Lastly, I still haven’t used the YM2612 sound chip to do the FM synthesis it was designed to actually do. So the final piece of mandatory work for the project will be to give it a soundtrack.

From here on out, that’s really all just ticking boxes. There’s nothing new I need to really master; it’s just a matter of doing things I’ve done before in other contexts in this new context. On we go!

Genesis: Badlines’ Revenge

I closed the last post by saying “time will tell what other obstacles are thrown in our way.” As it happens, the next obstacle was, in fact, time.

This will require a little background before I get to the problem itself. If you’ve been following Bumbershoot for a while, you’ll have seen my battles with the phenomenon on the C64 developers call “badlines”. (For those of you just joining us, I posted a history of my struggles with it and the fruits thereof earlier this year.) This turns out to be a specific solution to a more general problem that all systems of any complexity face.

The issue is this: multiple chips in the computer want to read or write memory, but a wire can only carry one signal at a time. The general way this is set up is that memory and peripherals and the like are all hooked up to the same set of wires (the bus) and things are arranged so that any given time exactly one chip is actually setting values to it. Any number might read it, but only one can set anything. The way this is enforced varies based on the system in question.

  • On the NES and the Atari 2600, it simply isn’t enforced. You can try to write to any point whenever you want, and if you write to places where something else is happening, you’ll get interference and probably unnecessary power usage and heat. Certain operations nevertheless involved doing something that looked a great deal like a write to ROM, and the standard practice was to write a value to the ROM that matched the value in the written location exactly. With no difference in value written, there was no contention and everything was fine. However, on the NES in particular, trying to write to VRAM when the Picture Processing Unit had its own ideas about what should be going on produced garbled displays.
  • Mitigating this a bit, the NES also kept its video memory completely separate from its CPU memory, so under ordinary circumstances the PPU and CPU simply could not interfere with each other at all. With two separate data buses, there can be no contention. The C64 likewise included a separate 4-bit data bus for fetching color memory from RAM independently of anything else the CPU or VIC-II might be doing.
  • The Game Boy had protections against contention with video memory while its PPU was working with it. However, there was nothing stopping you from trying to read or write it anyway—you just got garbage back or your write would be ignored. Keeping your writes at valid times was on you as the programmer.
  • The C64 during ordinary operation pre-arranges access to memory so that the devices that need this access never actually conflict with one another and each chip then proceeds as the please.
  • More sophisticated machines—the kind that we’d now acknowledge as true personal computers—included specialized circuitry called a bus controller or bus arbiter that served as a gatekeeper to the bus. Any device can try to access the bus whenever it wants, but if something else is using it, it will be halted until the resource is available. The C64 behaves like this during badlines—the VIC-II graphics chip serves as the arbiter here and can halt the CPU whenever the graphics system requires double-speed access to RAM—but the most familiar example to users these days would be the bus systems of the IBM PC and its descendants. The original 1981 IBM PC actually needed to block on the data bus during even ordinary operation, because its 8088 chip could only fetch 8 bits of data at a time but used the 8086 instruction set that required 16-bit machine words.

The Genesis uses a combination of bus arbitration and separated buses. There are three major chunks of memory: main CPU ROM/RAM, the VDP’s VRAM, and the Z80’s private 8KB of SRAM. These each have a bus associated with them, and they’re all independent. When a DMA operation is triggered, control of the source and destination buses are captured by the VDP for the duration of the transfer. Furthermore, the VDP is clever enough to make sure that data from VRAM is always available when needed. If you attempt to write VRAM during the display of a frame, whoever attempted the write—whether it is the DMA controller or the CPU directly—will basically be put on hold until the VDP can spare some time to slot it in.

This also means that attempting to DMA out of RAM will halt the ability of the 68000 chip to run anything until the DMA completes, and the Z80 will be halted if it tries to read from the banked-in ROM area. Of these two, the Z80 is the more problematic case because we need predictable, cycle-exact timing in order to have smooth sample playback. For the purposes of this project, though, the key is that the DMA speed varies depending on whether or not we’re in VBLANK… and we have a lot of data that we want to transfer each frame.

Transferring memory to VRAM can be done at 205 bytes per scanline in VBLANK and 18 bytes per scanline during the main screen display. If we use the narrower screen that’s only 256 pixels across, those numbers drop to 167 and 16 bytes, respectively. That’s interesting—it implies that the VDP is shutting itself down completely during HBLANK, even during VBLANK proper—but more importantly it lets us work out how much data we can send per frame.

According to the Sega docs, a complete frame will give us 36 lines of VBLANK and then 224 lines of normal display, which adds up to 11,412 bytes transferrable per frame. That reckoning only adds up to 260 scanlines, though, which appears to imply the whole video system shuts down during VSYNC. Either way, we can’t actually blit 16KB in a frame, and this is our problem: the simplest and fastest mechanism for updating the display will cause other parts of the display (noticably scrolling) to hitch as they miss their cue to update.

How about if we split it across two frames? There’s a natural breakpoint for odd/even rows, given how we’ve designed the pseudo-bitmap display. How much wiggle room do we have if we only blit 8KB in a frame?

  • In the worst case scenario, we do our preliminary work during VBLANK and then trigger a blit that costs the entire rest of the frame, but finishes in time for us to leave the interrupt routine and be refired later.
  • 224 lines at 18 bytes per line means the mid-frame blit adds up to 4,032 bytes.
  • The remaining 4,160 bytes take 21 VBLANK lines to send at 205 bytes per line. (20 and a bit, really, but we have to round up.
  • There are 36 lines in VBLANK, which leaves 15 scanlines of wiggle room.

We want the DMA to happen as much as possible inside VBLANK, because during normal display we can make much more efficient use of RAM than the system as a whole. This introduces a strong incentive to do as little else as possible inside the VBLANK routine. Since the only things that must happen are polling the joysticks and updating the scroll values, we should be in good shape overall. Every extra scanline of VBLANK we don’t use before the blit starts is more than ten scanlines saved on the full screen.

But with this strategy worked out, one way or another actually displaying the updates won’t be an issue. I’m not familiar enough with the 68000 or my ability to make it work to know just how fast it will be, processing the automaton flat out. I guess we’ll find out!

Genesis: Designing a Pseudo-Bitmap Mode

I’ve decided what I want my “full” Genesis project to be: an implementation of the Cyclic Cellular Automaton. We start with a grid full of cells in random states, from 1 to N. Each tick, if a cell in state X has a neighbor in state X+1, it is “eaten” and becomes state X+1 next tick. Otherwise it stays the same color. State N gets eaten by state 1. Only neighbors in the four cardinal directions count.

These simple rules produce a very interesting evolution from chaos to order; random static evolves into pulsing blobs of color that either eat the entire space or evolve further into ever-growing spirals. It’s one of the neatest demonstrations I’ve seen from the simplest of rules.

I first encountered this algorithm as a child in a Computer Recreations column by A. K. Dewdney, and implemented it in BASIC on a DOS system. It’s since become one of my standard first programs in a new graphical system for many, many years, but I’ve never had a chance to actually implement it for any new-to-me platform here on this blog. It relies too heavily on multicolor bitmaps as a technology, and the Genesis is the first system we’ve had where this is even remotely reasonable.

Reasonable doesn’t mean easy though. We’ll need to do some tricks to make this really work.

The Overall Plan

I like to tune the implementation so that it usually but does not always proceeds to its final evolution. My experiments have generally shown that being about 100×100 with 12-14 states. Since I started on BASIC on DOS, Those states corresponded to the original 16 CGA colors.

The closest I can get to that on the Genesis while still keeping the display in approximately those bounds, and while using the system’s capabilities to its utmost, would be a 128×128 display with 16 states. We’ll divide each character cell in the map into 2×2 subcells, and then use our 16 colors in a map palette to represent our states. It’s tradition at this point to match the CGA palette as closely as I can:

fake_cga_test_pattern

We can fit an 80×46 grid onscreen at any given time, which is too small for our purpose. We’ll need to let the user scroll around the full screen with the joystick. Our map wraps around anyway, so that means we don’t even need to do any bounds-checking.

Memory Layout

We have 64KB of work RAM. a naïve implementation assigns a single byte to each cell, and we’ll need two copies of it so that we can appear to update every cell simulataneously. That’s 128*128*2 = 32,768 bytes, or half our available work RAM. We’ll have plenty to spare on the normal RAM side, at least.

If we use a full bitmap for our 64×64 tile grid, that would cost us 64*64*32 = 131,072 bytes to hold each pattern. That’s twice as much as our entire VRAM space, so that won’t work at all. The number of total possible patterns is 16*16*16*16 = 65,536, which is also more than the total number of patterns we can define. Fitting our full display onto one scroll layer seems like it will elude us.

But we have two scroll layers.

The trick here is that we define 256 patterns, with the low and high nybbles representing the color of the upper left and upper right quadrants, respectively. We leave the entire bottom half blank. We then can interleave the rows so that they fill the entire screen, and that will produce displays like the one we showed above. We could do the offset by having the second scroll layer be four pixels below the first, but it’s actually easier to tie the two layers together completely and just tell each of the cells for the odd-number rows to be vertically flipped.

Either way, we’ve gotten what we wanted, a 128×128 16-color grid we can program at will, and quickly. The two 64×64 name tables end up costing 16 KB, and the pattern table itself only consumes 8KB. There’s plenty of VRAM left if we want to add windows or fonts or, really, anything, and we can prepare the next frame in RAM independent of our two normal buffers while still having 16KB left for our stack and incidental variables.

This seems like a very promising plan of attack. Time will tell what other obstacles are thrown in our way.

Genesis: Digital Playback

Let’s actually use the Genesis sound chip as intended, shall we? Let’s take a look at the Yamaha 2612. As a chip, the closest point of comparison we have is to the Creative Labs Sound Blaster—it’s got a Yamaha FM synthesis system and 8-bit digital audio. The FM system has slightly different tradeoffs from the Sound Blaster—it has fewer simultaneous channels but the channels have more oscillators that can in turn be combined in more complex ways—but Sega’s digital audio has a significant disadvantage: there is no DMA that feeds the sound chip. On the Sound Blaster, one could simply prepare a region of memory and then ask the card and the memory bus to do the rest. On the YM2612, though, we have to cyclecount and push the values by hand. As Sega’s sound documentation puts it:

The 8-bit Digitized Audio exists as a replacement of FM channel 6, meaning that turning on the DAC turns off FM channel 6. Unfortunately, all timing must be done by software—meaning that unless the software has been very cleverly constructed, it is impossible to use any of the FH channels at the same time as the DAC.

There’s that word “impossible” again. I do not think it means what they think it means.. And indeed, it doesn’t seem to have particularly slowed anybody down for long. This was absolutely a solved problem at least for things like drum effects by the time Sonic the Hedgehog was released, and the music and sound effects in Earthworm Jim are so smooth you’d be forgiven for thinking the Genesis had a Sound Blaster in it. Toy Story managed to actually write a proper MOD player, and I’m actually willing to grant that one “impossible trick” status since as far as I know it only got done once.

Looking through my old Genesis library, the only game I ever played that actually did suffer from this limitation was the unfortunately-named Lightening Force. The voice clips announcing what powerup you just picked up would stop the soundtrack and sound effects.

The Overall Approach

The general technique for playing a sound effect is similar to the techniques we used for cycle-counting out square waves on the Apple II. We need to create a loop that always takes exactly the same number of cycles to run, and we need to update the DAC’s value at the end of that loop.

It’s not too tough to mix that with FM music playback. Notes are turned on and off much less frequently than samples are updated, so it is entirely reasonable to spend the time between sample updates flipping a few of the FM registers. If we need to flip a lot of registers, the sample playback is fast enough that we can just put a hard limit on number of updates per sample.

That said, at this point, for us, mixing in FM is just borrowing trouble. Let’s stick to simply playing a sample for now.

Working Out the Details

The Z80 runs at 3.58 MHz, and so to play an 8kHz sample we need to write a new value to the DAC every 3580 / 8 = 447.5 cycles. We obviously won’t be hitting that exactly, but we should be able to get close.

Setting up the cycle counting itself is a bit odd. As we saw on the Game Boy, the Z80 family of CPUs sort of expect the memory bus to be running much more slowly than the CPU itself, and so instructions take a certain number of CPU cycles and a certain number of memory cycles, and the hardware designers have to bake in some way of keeping those straight outside of both (that is, on the motherboard or its equivalent). The Game Boy standardized on the memory speed, but the Genesis is running fast enough (thanks to the 8MHz Motorola CPU, I expect) that the Z80 can run at its maximum speed. Unfortunately, that does end up meaning that a lot of the instruction lengths are kind of odd and not neat multiples of each other. It’s possible that we’ll end up having to accept a bit of inaccuracy.

The timings are available from Zilog’s own programming reference manuals, but I have found CLRHOME’s interactive table to be more convenient.

Armed with these, we can start to work out our routine. Here is the tightest loop we can write in Z80 code:

        ld      b, N
loop:   djnz    loop

Loading the B register costs seven cycles, and then the decrement-and-loop operation costs thirteen cycles per branch taken and eight for the final loop. The basic formula is thus 7+13*(N-1)+8, or, after a little algebra, 2+13*N. We can use the four-cycle NOP and the seven-cycle LD B, N instructions to tune the results more carefully, if we’d like.

Below the fold, we’ll build up a complete playback routine.

Continue reading

Makefiles for Mortals

There are dozens of build systems out there these days—possibly hundreds—but the humble Makefile has continued to persist and is still the easiest system to use when you’re putting together a large number of small programs that are built in similar ways. Once things get particularly intricate, or you need to be able to support extremely heterogeneous target environments, or you need to build exactly identical artifacts on extremely heterogeneous host environments, or for build systems where compiling a dozen files is just as fast as compiling one, Make starts to strain and underperform. Those dozens of build systems were all made for reasons.

But handcrafted Makefiles can bear a greater load than it usually does. I’ve been exploiting that for my various sample programs here over the years, and there’s a few other simple techniques I haven’t yet needed but keep to hand.

In this article I will outline these techniques and show what you can buy with them.

Continue reading

Genesis: The Z80 and the SN76489

With graphics, I/O, and memory control covered, it’s time to move on to sound. The Genesis has two sound chips as part of its architecture—a Yamaha YM2612 OPN2 and a Texas Instruments SN76489 PSG. The Yamaha chip is more sophisticated and complicated to program (it’s a cousin of the YM3812 OPL2 we know better as the DOS era’s “Adlib-compatible” sound), so we’ll start with the simpler PSG for now.

The Genesis inherited the SN76489 from its predecessor, the Sega Master System. This chip and its clones were also used in the BBC Micro, the Colecovision, the ill-fated PCjr, and the rather less ill-fated Tandy 1000 line that cloned the PCjr. Its similarly-powered rival, the General Instrument AY-3-8910, was used in most computers in the ZX Spectrum line and the Atari ST. The ST is, if anything, more powerful than the Genesis, so we could reasonably expect the PSG to be enough of a powerhouse that we’d be in good hands. Oddly enough, though, the chip ends up feeling uniquely primitive given my history on this blog. The other sound chips I’ve worked with here that are worthy of the name—the C64’s SID, and the integrated sound circuitry on the NES and the Game Boy—the PSG feels like it’s about half a generation behind. It’s particularly jarring to the the Atari ST’s chip, from the generation ahead of them, treated as a peer to this one.

Still, we’ve done more with less before. Let’s see what the PSG provides us and if see what we can easily get out of it.

The PSG’s Capabilities

The PSG provides three square wave voices and one noise channel. Each of the square wave channels has its own frequency and volume controls, and the noise channel has a handful of modes that alter the kind of noise produced.

That’s it. That’s the complete list of its features. Except for the noise channel and the weighted-average performed by the volume registers, this is just a trio of programmable clock signal generators that we happen to have hooked to a speaker. Of course, the PC speaker was actually literally programmed by tying its output to the Programmable Interval Timer. If all you want is pulse waves, it’s not a terrible base design. That said, the sound of a PC speaker is incredibly harsh, and we’re going to want to be a little more clever with what additional power we have been given to get a more pleasant or musical sound.

So, what are we missing compared to our rivals? The SID is a beast apart, so unique that even today people reuse the chips for use in custom synthesizers, so let’s leave it aside for now. The NES and Game Boy sound chips were also mostly focused on pulse waves and noise, but they offered secondary waveforms as well. The NES had a triangle wave, and the Game Boy let us specify a repeating sound wave as the basis for that voice’s timbre. Both of these were really only used for bass lines or sound effects, though; the PSG doesn’t suffer too much on this score. The AY-3-8910 likewise was restricted to 3 square channels and one noise channel.

The NES and Game Boy also had, properly speaking, not square wave channels but pulse wave channels: the percentage of time spent in the high state vs the low state (the duty cycle) could be tuned to alter the character of the sound. That’s a noticable loss, but the truly striking uses of this control required finer grain than either provided. The SID provided sixteen bits of control for its duty cycles, and sweeping the duty cycle value produces an effect that’s one of the iconic parts of C64 music.

These can all mostly be shrugged off, though. The thing the PSG lacks that hurts the most is an envelope generator—circuitry for automatically making a note fade to silence once struck, or fade in as it is struck. The AY-3-8910 turns out to have had an envelope generator, but this generator was shared among all voices and as such composers generally did not use it, or repurposed it for effects that were not fully intended by the designers. (I suspect this is part of why the AY-3-8910 is grouped with the PSG more than it is with the likes of the SID or the NES’s 2A03.) We’ll need to do our own envelope management in software if we want music that sounds less grating than a PC speaker signalling an error.

Fortunately for us, we’ve got one entire 8-bit CPU dedicated completely to wrangling the sound hardware. We’ll be fine.

Programming the PSG on the Genesis

The Genesis uses memory-mapped I/O for everything. The PSG has eight input pins and so communications with it are mapped to a single byte in memory: $C00011 on the Motorola chip, and $7F11 on the Z80. Unlike the NES, C64, or Game Boy, there are not different memory locations for different registers—which register is written is part of the byte we send it. We’ve seen protocols like this before, when setting the VDP’s registers through $C00004. This mechanism is quite similar to that.

All writes start with a byte that has the high bit set, followed by the register number in bits 6-4, and then the low four bits of the value to write. For the noise and volume registers, there’s only four bits to write in the first place, so we are done. For the pitch registers (0, 2, and 4), the value to write is 10 bits wide so we then write another byte to the chip afterwards with the high six bits. (The top two bits of those secondary writes are always zero. As I understand it, the top bit is really signalling what kind of command we are sending.)

As for the pitches themselves, this works more or less like the PC speaker did or our hand-crafted square waves for the Apple II; we are selecting a half-wavelength in terms of clock cycles. The PSG gets a 3.58 MHz clock signal and this signal is divided a bit before it’s actually used. To get a tone of a specific frequency out, the ten-bit value to write will be n = 3580000 / (32*hz).

Volume, meanwhile, is a simple logarithmic scale from 0 to 15 with 15 meaning “off” and all other values meaning 2dB of extra volume drop from maximum. These registers are all odd-numbered, which is why the “shut the PSG up” code we saw at the end of system reset wrote $9F,$BF,$DF,$FF. It’s a simple volume select.

A Simple Playroutine

My initial plan was to build a playroutine that was roughly similar to the old C64 playroutine I wrote fifteen years ago, with each voice controlled independently and playing off an individual script. This worked pretty well for the SID, but it was relying a lot on the envelope generator to make notes be mostly fire and forget.

Another possibility, looking forward to eventually getting some music out of the YM2612, is to set something up not unlike iD Software’s IMF format for the Adlib card. This is simply a stream of raw register writes and the delays between them. That’s also a bit obnoxious for our purposes though because generating the register writes to wrangle our instruments will be tedious and eat a lot of space.

I ultimately decided to split the difference. I’d keep a single timeline of new notes, shared across all voices. Every time a voice got a new note, its volume would reset to its highest point. Every other frame, every voice gets one tick quieter until it fades completely. Finally, to simplify the register logic, new notes will simply be represented as the two register writes required to make the note appear in that voice.

I wrote a simple Python script to render a well-known song into the format I needed, and then created my playroutine to play it.

I think the results ended up quite good. Here is a capture of a couple of loops of the song.

Below the fold, I’ll dig into the playroutine in detail and also discuss how I got it incorporated into the rest of my toolchain.

Continue reading