Lessons Learned: Color Chart Madness Revisited

As long as I’m revisiting old work, I might as well revisit one of the first projects I undertook for this blog, four years ago: Color Chart Madness. I took a sample program from one of my reference books, took it apart, then put it back together again, better.

What’s kind of funny, looking back on it, is that I really had no idea how the VIC-II chip actually worked, and I needed to understand a lot of details about it in order to grasp what was really going on. I didn’t, however, seem to need to understand those details in order to fix all the problems I saw with it. In retrospect, it’s a case study of my past self working around my own ignorance.

What I Thought Was Happening

The program I was dissecting was, at core, a simple rasterbar effect. It wanted to display every combination of foreground and background, so it printed out text in all sixteen colors and then changed the background color sixteen times a frame to produce all the combinations. The initial implementation of this shut off the clock interrupt and then spinlocked on the raster count register in $D012 updating the background color in $D021 every eight lines. The problem with it—beyond the way that it was essentially locking up the system so hard the only way out was a soft-reset—was that the color change was happening a line too early. This meant that the character cells looked a bit uneven, and there was a spurious black line at the very bottom of the text display. Worse, setting the spinlock to terminate one line later ended up making the display incredibly unstable or, failing that, made the color changes happen on different incorrect lines.


My initial theory here as that there was somehow something wrong with the $D012 read—either the register was fundamentally unreliable, or the emulator was taking shortcuts. I rewrote the routine to instead be based on writes to $D012, replacing the old spinlock with a series of raster interrupts. That rewrite, with no other changes to the program constants, fixed the display completely. That only deepened my suspicions about the reliability of the data the first version depended on.

I couldn’t say I really understood what was going on fully, though, because there were still anomalies left behind. In particular, the raster values I had chosen matched the original version’s, but based on both Compute!’s Mapping The C64 and the C64 Programmer’s Reference Guide those values were one less than they should have been to change the color at the top of each letter. Furthermore, even though I had a display that looked the way I wanted it to, changing the raster target by a single line did not actually consistently change the color-change point by a single line.

As it turns out, these facts were only confusing me because I had gravely misunderstood the way the VIC-II builds its displays. My understanding of this system advanced in fits and starts over the course of about a year:

  • Cycle-exact Delays for the 6502, where I adapted some techniques I’d seen in an Atari 2600 demo to systematically trigger the anomalies I had encountered in Color Chart Madness.
  • Color Test: an Atari 2600 program, where I produced a complete Atari 2600 program on my own. This gave me direct experience with directing a raster display, and having to do that work “by hand” on the 2600 give me the background I needed to understand how the VIC-II did it automatically.
  • Flickering Scanlines, where I take that background and finally understand the 1996-era document that explains the operation of the VIC-II chip and thus which lets me really understand the anomalies above.
  • A taxonomy of splitscreens, where I begin to apply this knowledge to extracting reasonable effects out of the C64.
  • VIC-II Interrupt Timing, where I finally sit down, do the math, and work out the precise timing tolerances required to get various effects without relying on the advanced, cycle-exact techniques the demoscene standardized on.

That’s a little over a year’s worth of fitful exploration, research, and experimentation. At this point we should be able to just do a brief and accurate explanation of what issues I had run into and why the techniques I used fixed it.

What Was Actually Happening

The reason switching from a spinlock to a raster interrupt fixed the display is actually pretty trivial. There’s nothing at all wrong with reading $D012, and it is indeed that value that triggers the interrupt. Furthermore, it really was testing for the line before the start of any given character. However, it takes an extra 40 cycles or so to actually process the raster interrupt and get into code that we had, ourselves, written—and those 40 cycles were enough for the screen to display enough of that line in the old color to keep the display looking right. It was still setting it “too early”, but the grey background of the screen is actually an opaque foreground color so it masked our flaws. Take that away and the discontinuity is much more flagrant:


Explaining the flickering and discontinuity when we push the raster trigger one line forward is a little bit more involved. The key fact I was missing was that the C64’s CPU doesn’t actually get to run all the time—at the start of every line of text, the graphics chip (the VIC-II) steals 43 cycles worth of time from the CPU, monopolizing the memory bus so that it can load all the graphics information it needs for the next line. (Because these lines mess up your display timings, they are referred to by C64 programmers as badlines.) If we look at the initial spinlock-based implemenation, we see that actually getting around to updating the background color will take between 11 cycles (if we check the raster the very cycle it changes) and 21 cycles (if it changes the cycle just after we check). On a badline, the CPU will go dormant 13 cycles after the raster changes. That means that depending on exactly how the spinlock syncs up with the display, it will change the color either before or after the full row. Furthermore, the full spinlock cycle is 10 cycles long, and that means it won’t stay synced with the display; thus on different frames or even different points in the same frame there is no consistency. Thus, the flicker.

There’s less inconsistency with the interrupt-based implementation. It relies on the KERNAL to save and restore state and ultimately hand over control, so the 43-cycle delay of the badline will always be paid. However, adding that time into the rest of the display means that the color ends up changing halfway across the screen one line down, which means we get visible display tearing and in some sense the worst of all worlds.

But with an understanding of how that timing works, it at least is no longer surprising.


From the Archives: Diffusion Chamber

One of the first programs I wrote when I was learning to develop NES stuff was an animated simulation called “Diffusion Chamber.” I had a DOS version of it, too, in some of my old childhood programs. However, neither of them actually worked properly; the display for the DOS version was completely corrupted and the NES version would occasionally allow the walls to diffuse along with the gas molecules.

I stumbled across the original inspiration for those programs recently; it was from a collection of Scientific American columns by A.K. Dewdney and was mentioned in passing with no real code or discussion:

The diffusion program I call BLEND treats the molecules of a liquid as balls that are able to move only one step horizontally or vertically at a time. A 31-by-31 array called tank holds 930 balls, which are set up initially somewhat like a checkers game….

BLEND proceeds by selecting an array position at random, then selecting a random adjacent position. If a ball occupies the first position and no ball occupies the second, the ball is erased at its current position and redrawn at the new one. If a ball is absent at the first position or present at the second, BLEND could not care less. With microseconds of time on its hands, a movable ball will sooner or later turn up. How long will it take for the two stylized liquids to diffuse completely?

I’m not really sure why I had so much trouble with this in the past. I scale it down to 15×15 to fit on the screen, but it otherwise was extremely straightforward to implement:


How straightforward, you ask? You can fit a complete implementation in a single screen of BASIC code:


That said, I don’t recommend actually running this code. I’m not sure what kind of computer system Mr. Dewdney was running when this column was written but it had a whole lot more power than Commodore 64 BASIC. That listing is impossibly slow, with waits of multiple seconds required before a viable move is found. You could speed it up trivially by selecting a a gap by index instead of a ball by location, but that isn’t the algorithm he described.

It runs more reasonably quickly if you implement it in assembly language instead, but it doesn’t really get us any closer to answering the question it posed.

Modernizing the Code

Porting the implementation to C and making it be 31×31 instead of 15×15 is pretty trivial, but there’s a nasty complication if we want to run it for any length of time. The C standard library uses multiply-and-add PRNGs, and those have very low entropy in their lower bits and you’re not really supposed to pull more numbers out of them than the square root of their period. For drand48 that only permits about 16 million calls. Given how many random numbers we go through to find a valid move, 16 million calls hardly mixes our fluids at all.

So I went looking for a good, fast, high-period random number generator and wound up back in the xorshift space, with the “xorshift*” algorithm that, apparently, passes all the tests in the BigCrush PRNG test suite:

uint32_t xorshift64star(void)
    uint64_t x = rng_state;
    x ^= x >> 12; 
    x ^= x << 25; 
    x ^= x >> 27; 
    rng_state = x;
    return (x * 0x2545F4914F6CDD1DLLU) >> 32;

Once we have that, we need some metric for how diffused the two substances are. I decided to simply sum up the number of excess balls of each color in each column, and then average that over columns. This metric starts at 30, and then should drop over time.

I originally had hopes that this metric would approach 1, but it seems like once it gets around 3 it keeps bouncing about in the 2.5-5 range:


Based on this, I’d have to conclude that the answer is that the substances end up properly mixed after about two million cycles. From there on out it seems like the simulated Brownian motion is as likely to separate them as mix them further.

Conway’s Life on the Game Boy

I’ve finished my Game Boy project.


At the end of my previous post I’d gotten everything done but replacing the static “Game of Life” image with a scrolling message, and I’d outlined a general plan for that too:

Our general plan for displaying the scrolltext will be to set a raster interrupt on line 128, and then set YSCROLL to 0, the tile data to $9C00, and the XSCROLL to a value that updates every other frame.

Implementing that off went off surprisingly smoothly, with no hitches beyond one fatal flaw that actually wasn’t part of the new code. More on that later. In the meantime…

Size and Speed

Like all Game Boy carts that have no memory mappers, the final binary image is exactly 32KB long, but we aren’t using remotely that much of it. Some quick checks with a hex dumper show that the last nonzero byte in the image is at offset 1,504. So that’s one easy metric for how large the final program, but how much of that is filler required by the platform or imposed by the linker?

Looking through the debug dump shows that the linker actually did a very good job; only 3 bytes of space are wasted by the hard location requirements of two of our sections. The interrupt vector table takes up 104 bytes and basically has to be there, but only 6 bytes of it are ever actually executed as code; that’s another 98 bytes of filler. The cartridge metadata consumes another 76. This means our main code and data weighs in at 1,327 bytes. That’s not really a significant drop in size; either would ultimately end up implemented as a 2KB ROM. Entertainingly enough, that also means it would fit with room to spare within the smallest of the Atari 2600 cartridges.

I’d still have liked to keep the entire image below a single kilobyte, but that’s not reasonable when you not only are displaying a nontrivial amount of text but also need to pack your own font.

As for speed, the easiest way to see how quickly the system runs is to do the slowest operation and trap into a debugger just as it’s done. The reason this will work for us is because we have a variable that’s set to 15 just before the main execution loop is unleashed to run an iteration, and which is decremented every time the VBLANK interrupt runs. If we subtract the value in that counter from 16, we’ll have a list of how many frames we were doing work in.

The slowest operation we can trigger is a board scramble. I didn’t optimize the main loop particularly, so the board scramble actually happens after a normal simulation step. Each involves dozens of instructions running over hundreds of cells individually. Even so, a frame is over 17,000 cycles, so the timing is not terribly tight. After running both functions through, the frame variable is holding 12. Our longest update is 4 frames.

That means that, while changing nothing else, we could nearly double the speed of the display without risking data corruption—4 frames to update followed by 4 frames of rendering for 8 frames per simulation tick and a total speed of 7.5 FPS. The Game Boy Color has a turbo mode that lets it double its CPU speed, which would kick us up to 15 FPS, or possibly even 20 if we were more careful about doing the minimum necessary work when processing input.

Even 7.5 FPS is fast enough that it can be hard to follow the evolution of a colony though. I’m happy enough with it at 4.


The source code, with appropriate Makefile, is up on Github. There’s a certain attempt to split application- or device-specific logic away from the other code, but since no other device uses this instruction set, the attempt was perhaps a bit half-hearted.

A prebuilt binary is also available, suitable for emulators or homebrew cartridges.

Implementation notes below the fold.

Continue reading

Sundry Refinements

More progress on the polish for this Game of Life simulation. A lot of different things needed to be done, so I’ll mostly just cover each bit in its own section and put the parts with the pretty pictures on the top. Here we go!

A New Font

If I’m going to write out a message of any length, I will need to have a font to write it with, and squinting at the high score table in my old copy of Tetris only takes you so far. So I made my own font for this project, drawing inspiration from the one Tetris and Super Mario Land used. I liked the consistent 6×6 size, and I liked the way it used variable-width lines on the letterforms for a sort of art-deco effect. I was less fond of how extremely broad some of those lines were—the A’s right edge consumes half the character cell—and it seemed less consistent about where it thickened the line, producing a somewhat unbalanced effect.

To make the font I actually broke out an old tool for the Commodore 64 called Ultrafont +. This had been published by Compute! in various books, but also as a type-in program in their July 1984 issue of the Gazette. This is both pretty convenient to work with and it has a handy test mode:


The C64 and Game Boy don’t have identical tile formats, but since my plan for the font is for it to be monochrome, this forms a convenient common subset. It was also common to “double-dot” your fonts on the Commodore, because, as we discussed when talking about CGA artifact colors, changing colors too quickly in a row of pixels could distort the display on standard-definition televisions. The Game Boy LCD, however, doesn’t have to worry about that, so I’m free to use single-width vertical lines wherever I want.

Once I’d created a font I was happy with, I saved it to disk, extracted the saved character set with VICE’s c1541 utility, and used a simple Python script to both convert my characters into code rgbasm would accept and to encode my scroll text so that the tiles don’t have to be filed to match ASCII. (I had done something similar, but for the ACME assembler, when adapting another font to the NES for Galaxy Patrol.)

That scrolltext won’t get to be used for a while, though.

Continue reading

ZX81: Mind Games

I was idly fiddling around with the ZX81 again, and I found a cute little anomaly.

Start with a simple line of code like 10 PRINT "HELLO". That gets stored in memory at the top of your BASIC program space, and the first two bytes of it are a 2-byte big-endian number. We can see that because if we then type POKE 16509,39 the line number changes to 9994, and POKE 16510,15 takes it up to 9999. A moment’s work will show that 39*256+15 is indeed 9999.

9999 is also the largest permissible line number in Sinclair BASIC. If you try to type in a line with a five-digit line number, a syntax error will be flagged on the line number itself.

You may also recall from back when we were hiding machine language programs away from Sinclair BASIC that we’d do that by putting a $FF byte right after the last real line of program code—corresponding to a notional line number of at least 65280. Perhaps too-high line numbers become invisible! We can see that if we POKE a 255 – or even a 64 – into 16509 – the line vanishes, and it comes back if we put a 39 back in.

But what happens if we leave the high byte alone and just poke a 16 into 16510, for a line number of 10000? It turns out it doesn’t vanish—instead, we get a line number of A000. That looks suspiciously like hex, but it isn’t (not least because $A000 isn’t 10,000, it’s 40,960). What we’re seeing is the top digit overflowing into the characters after the numbers. The ZX81’s wildly nonstandard character set puts the letters immediately after the numbers, so overrun goes directly to letters.

A little experimentation shows that the largest value we can get a line number to take before it disappears is “G383”, which corresponds to a hex value of $3FFF. That’s interesting because it shows how it’s deciding whether or not a line is “real”—it’s not checking to see if the value is above 9999, it’s checking to see if it fits in 14 bits, because that’s how many bits it takes to represent 9999 ($270F).

That’s all well and good, but are they really BASIC lines? Creating a full program that includes lines above 10000 is challenging but not impossible—we just have to work backwards with a set of commands like these:

  • 10 PRINT "END"
  • POKE 16509,42
  • POKE 16510,248
  • 10 PRINT "MIDDLE 2"
  • POKE 16509,41
  • POKE 16510,4
  • 9000 PRINT "MIDDLE 1"

That produces this listing:


Not only that, this program actually runs: it prints out BEGINNING, MIDDLE 1, MIDDLE 2, and END just as you’d expect. Branches even work: while you can’t refer to line B000, if you add a line 2000 GOTO 11000 to the program, it will dutifully only output BEGINNING and END.

So, is this good for anything?

Probably not, really. Lines of code past 9999 are uneditable inside the program, and given the line numbers and line lengths alone for 10,000 lines of BASIC would consume 40KB before you even had any actual code that actually did anything, even on a Spectrum 128 you’d run out of memory long before you ran out of line numbers.

But if you were creating the program with an external code generator and wanted to have a piece of code at the very end that could be seen and run but not edited, this would let you do it.

Game Boy: Double Buffering

Last time, we got our Life simulator to update and display well enough to see it run. This time, we’ll clean it up so it runs acceptably.

The major issue with the current display is tearing—it takes four frames to actually draw a generation to the display, so we are spending about a third of our time showing displays that are part one generation and part the previous one. The usual technique to address this is double buffering—drawing the next frame somewhere else, then flipping to it when it’s ready.

Game Boy Tile Maps

The Game Boy’s graphics display is ultimately not too dissimilar from the Commodore 64’s. There is a set of pattern tables that define what each of the display’s 256 tiles will look like, and then a name table that arranges them into a display. In C64 terms, the pattern table is the character set, and the name tables are the screen memory. Like the C64, both the pattern tables and name tables may be redirected to different parts of memory for instantaneous, large changes.

Where the Game Boy moves beyond the C64 is that the name tables are 32×32 and the screen is a freely repositionable 20×18 window into it, wrapping around as necessary. This is the first case of real, honest-to-Cthulhu panning hardware we’ve encountered in four years of retrocoding. The only other cases that came close—the C64 and the NES—both used tricky approximations that could be exploited or corrupted but which didn’t grant the kind of per-line scrolling effects this unit gives us.

That capability will inform our implementation strategy. While we could use two 32×32 name tables and swap between them, we can instead split one in half and scroll between the top and bottom halves as needed. This fits well with our 20×16 simulation space, and it will simplify the implementation a fair amount.

Implementation details below the jump.

Continue reading

Game Boy progress: display timing

We’re getting somewhere!


Actually implementing the core Life algorithm went relatively smoothly—like I mentioned in the last post, I started out by implementing it in C, and then I refactored that C code piece by piece so that it wouldn’t rely on operations that are inconvenient on the LR35902. That meant getting rid of things like multiplication, but it also—thanks to the way addresses must live in a register and the drastic inconvenience of 16-bit math on this chip—meant I wanted to avoid pointer offsets that weren’t increments as much as possible too. Some were unavoidable, but I’m overall pretty happy with the end result.

(For the curious, or for those who have implemented this a thousand times and are curious exactly which implementation variations I did: I keep just one state array and then a “neighbors” array that’s two wider and two taller so that I never have to bounds-check. I spin through the state array and increment the neighboring cells in that array once. Then I collapse the border by adding each border cell to its appropriate wrapped-around location, then use that information to produce the new state array. It’s uses a bit more RAM than two state arrays and bounds-checking, but the inner loop logic becomes far more compact and straightforward.)

That leaves the question of getting the results to the screen. We saw earlier that the NES gets very unhappy indeed if you try to slam too much into VRAM at once, and we’ll have a similar issue with the VBLANK on the Game Boy. The display won’t corrupt, but we will lose writes. That’s pretty bad too, so we’ll need to restrict ourselves to the 1,140 cycles post-VBLANK.

The screen itself is 20×18, and it lives on a map that’s 32×32. Here’s a straightforward loop to copy from a 20×18 row-major array in RAM (pointed to by DE) into a target location in VRAM (pointed to by HL):

        ld      b, 18
.lp0a:  ld      c, 20            
.lp0b:  ld      a, [de]
        ld      [hl+], a
        inc     de
        dec     c
        jr      nz, .lp0b
        ld      a, 12           ; 12 = 32-20
        add     l
        ld      l, a
        ld      a, 0            ; Can't use "xor a" because that trashes carry
        adc     a, h
        ld      h, a
        dec     b
        jr      nz, .lp0a

The principle here is straightforward; it’s a memory blit of one row, followed by a 16-bit add to the destination pointer to move to the start of the next row.

With that code drafted, how long does it take to run?

Continue reading