Category Archives: retrocoding

Atari 800: Repairing a Mad Science Type-In

Back in February I wrote three posts about writing code for the 8-bit Atari home computers and then didn’t do anything else with it. A big part of that is that I didn’t have any good ideas for things to do with the system. My usual project is the Lights Out puzzle, but when I tried to design that for the Atari 800, I ended up designing it for the Atari 2600 instead. It’s still basically true that the display I got on the 2600 looks better than any display I’ve designed for the 800 that doesn’t more or less clone it.

What I really needed were some examples of people putting the system to good use but without pushing its apparent power far beyond its design. I needed examples of what it was supposed to be doing. I dove back into the old archives and looked through the collections of sample programs, simple games, and basic utilities. I wanted a program that was short, but which made heavy use of the Atari’s unique hardware, and which did something interesting enough to be worth the effort of reverse engineering it.

What I found was Front Attack, a game by Robert J. Neep that was published in 1985 in Compute!’s Atari Collection, Volume 2. It’s a fast-moving arcade game with joystick control, sprites, and a scrolling background. It does all of these things at machine-language speeds, but it is 100% BASIC and never calls out to a machine code routine at any point. It also, however, didn’t work when I typed it in, and kept not working even after I triple-checked the listing. Enough worked that I could tell that this was going to be a program worth investigating. I decided I would debug the game and then write up the bugfixed version. I’ve now done this, and it looks pretty good in-game:


As a bonus, along the way I got enough of a feel for the Atari 800’s improvements over the 2600 that I have a pretty good idea of how to make a proper Lights-Out game for it, too. We’ll see where that goes.

Below the fold, I’ll outline how it provides the various things it needs.

Continue reading


NES Splitscreens, Demoeffects, and “Undefined Behavior”

I’ve written extensively in the past about various techniques for getting effects out of old chips. This experience has very little to do with modern hardware, though, even when writing native-code applications, and even if that development is done in assembly language. Application developers nowadays write to published, third-party API standards, and if behavior is anomalous, we blame the device drivers. (This is not entirely true, of course. Back when Valve was porting Left 4 Dead 2 to Linux, they observed that an engine ported to a new platform could be expected to deliver about six frames per second. On the flip side, though, part of the improvement process involved working with hardware vendors to improve their graphics driver performance. Indeed, nVidia’s endless series of “Game Ready” drivers are primarily per-game performance upgrades. But even so, in both of these cases, the game developers are relying on a third-party, mostly-opaque abstraction—OpenGL or Direct3D for the graphics cards, and the C library’s wrapper around the OS system calls for the rest.) The huge diversity of PC hardware—even within a single company’s line of GPUs, as in nVidia’s case—makes targeting any lower level of abstraction unfeasible.

The 8-bit consoles and home computers, however, did not work this way. The supporting hardware on the system was absolutely fixed, and quirks of the hardware were preserved through the life of the console. Even when revisions did need to be made—the VIC-II used in the Commodore 64 had three design revisions with differences visible to the programmer—the general operating principles of the chips were very unlikely to change.

So it is in this context that I find myself lifting an eyebrow at the headline in this article: Zelda Screen Transitions Are Undefined Behaviour. The text of the article puts enough caveats on that headline that my eyebrow does lower back down, and it is also an excellent exegesis of the NES’s scrolling hardware. If you are at all interested in seeing how the NES balances its limited memory and its more generous address space to construct its display with proper panning hardware—this was, after all, its major advance over the previous generation of home computers like the C64—this article is worth reading carefully. It also goes deep enough into the console’s internals to show how it can be confused in useful ways. The technique used is so ubiquitous later, though, that it’s reasonable to question my use of the word “confused” here—after all, if the code behaves consistently (which it does) and the technique was used widely within the NES canon (which it was), then how do we get away with saying that it “confuses” the chip?

We can get away with this by looking at the rest of the programming API. In particular, there are two facts that let us conclude with high confidence that the behavior described in the article, while clearly “designed in” to the chip, was not entirely intended:

  • Deploying the technique causes a graphical glitch not “predicted” by other parts of the chip’s interface. This is the two-pixel scroll anomaly the article seeks to explain.
  • The intended effect—a mid-frame alteration in the vertical scrolling value—has a natural implementation (rewrite the vertical scroll value through $2005, just as you would during VBLANK), but this doesn’t work mid-frame, even though setting the vertical scroll is exactly the situation that register explicitly exists to handle.

These traits look familiar. In fact, these traits are both shared by the trio of techniques I describe as partial badlines on the C64. Those were definitely considered “tricks” in their own right, but they were also part of a much broader spectrum of advanced VIC-II programming techniques. If we approach the Zelda split-screen effect with the same mindset that I brought to C64 demoeffects, we get a very similar kind of summary:

  • What you do with it: Get around the PPU’s restriction on altering midline scrolling.
  • What is actually happening: The PPU’s VRAM register is overwritten by the PPU while rendering the display, and while the CPU’s writes to the actual vertical scroll value are ignored mid-frame, no other writes are. The CPU can thus alter the base address mid-frame for 8-line-precise scrolling.
  • How to do it: Rewrite the base address through $2006 midframe (ideally during HBLANK) to point to the start of the row you want to display. This will corrupt the Y scroll value to act as if you’d also scrolled down an additional two pixels.
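The last step can be made concrete with a small sketch (in Python rather than 6502 code, and the helper name is mine): the two bytes written to $2006 are just the big-endian halves of the nametable address of the tile row you want the display to resume from, since nametable 0 starts at $2000 and each row is 32 tiles wide.

```python
# Hypothetical helper: compute the two writes to $2006 needed to point
# the PPU's VRAM address at the start of tile row `row` of nametable 0.
# The target address is $2000 + 32 * row; $2006 takes the high byte first.
def ppu_addr_writes(row):
    assert 0 <= row < 30        # a nametable is 32x30 tiles
    addr = 0x2000 + 32 * row
    return (addr >> 8) & 0xFF, addr & 0xFF

# e.g. to resume rendering at tile row 4 (pixel row 32):
hi, lo = ppu_addr_writes(4)
```

On the real hardware those two values would be stored to $2006 in sequence during HBLANK; the 8-line granularity comes from the fact that this address selects tile rows, not pixel rows.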

As it happens, a modern developer would probably not do this. With more clever manipulation of the two internal address registers, complete control of mid-screen scroll state is possible. Still, that technique is more complex and may not have been developed until later in the console’s lifespan. In particular, Life Force uses a very similar technique to The Legend of Zelda but has no graphical glitches because it put its status bar at the bottom of the screen where a nonzero but fixed vertical scroll value will be unnoticeable.

Further Reading

The NESdev wiki is the central clearinghouse for NES programming information, and while it got a bit of a late start compared to the C64 in terms of deep investigation, it has long since caught up. Most relevant to the discussion here is the PPU Scrolling page which describes the interaction of the various graphics chip registers with the scrolling mechanisms in great detail, and outlines four different mechanisms for scrolling control.

Getting Off the Ground With the Other Commodore 8-bits

My Commodore work here has been restricted so far to the Commodore 64. However, that system was only one of many related systems they produced over the years:

  • The PET was the first home computer they made with a serious following. Its capabilities made it a rival to systems like the TRS-80, but little more than that. Still, that was enough to cement their reputation early on.
  • The VIC-20 was their first system to achieve truly widespread success. It was inexpensive and provided significant power for the price. Its existence, and price point, had largely annihilated any potential market space for the ZX81- and Spectrum-based Timex computers in the US. It was also the direct predecessor to the Commodore 64, and its physical appearance and programming model reflect that.
  • The Commodore Plus/4 and Commodore 16 were an attempt to capture the lower-end home computer and small business market, and specifically to compete with Timex computers. They had many enhancements to the BASIC that let the user make use of the graphics and sound capabilities more directly, but the graphics and sound hardware itself was replaced with a much simpler combined chip called the Text Editing Device, or TED. The Plus/4 also shipped with on-board productivity software. In the end, though, it appears that the market segment the Plus/4 and C16 were intended to compete with Timex in didn’t really exist in the US at all, and the systems were failures there. (The on-board productivity software was also poorly received; BYTE magazine apparently described it as “an insult.”) The Plus/4s and C16s were mostly dumped into the Eastern European marketplace at huge discounts, where they ultimately did well enough to retain a niche following and a demoscene that lasts through to the present day.
  • The Commodore 128, released a year after the Plus/4, was the true direct successor to the Commodore 64. It added a new, more PC-like video interface, a separate Z80 CPU to let it run CP/M software, twice the RAM, the extra memory mapping and control support needed to make that RAM readily usable, and a vastly more sophisticated BASIC to take command of all of these new abilities as well as its older ones. The BASIC extensions from the Plus/4 were retained, along with new control flow options, and where the Plus/4’s extensions did not reach, new ones were added for proper high-level control of the sprites, the sound chip, and the bitmap display. Unlike the Plus/4, the 128 made no compromises compared to the 64. Unfortunately, it largely avoided compromises by including all the C64 hardware directly within it and switching into a “C64 mode” that was, from a software standpoint, nearly indistinguishable from an ordinary C64, including reimposing all the limitations that the 128 was supposed to break.

Commodore made quite a few other systems as well, from the ill-fated CBM-II through the hugely popular and influential Amiga line, but the machines above formed a clear 8-bit lineage based on the 6502 family of microprocessors and a broadly consistent firmware interface that they called the KERNAL.

The KERNAL was, in effect, a jump table: a list of fixed memory locations that provided consistent ways to call into system ROM functionality independent of the particular implementation on any given system or ROM patch level. For example, $FFD2 was always the location of the CHROUT “output a character to the current output device” routine, which in turn means that the assembly language components of our Hello World program for the Commodore 64 are portable across all of these machines. In this way the KERNAL provides a functionality similar to what BIOS provided for IBM PC clones—a way for software to ignore some aspects of hardware even in the absence of a shared operating system or modular device driver facility.
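For a concrete sense of what that buys, here is a minimal sketch in 6502 assembly (the snippet assumes output is going to the default screen device, as it does after reset):

```asm
        ; Print "HI" through the KERNAL's CHROUT vector. $FFD2 is the
        ; same entry point across this whole lineage of machines; only
        ; the load address and the surrounding BASIC stub differ.
        lda #$48        ; PETSCII 'H'
        jsr $ffd2       ; CHROUT: send A to the current output device
        lda #$49        ; PETSCII 'I'
        jsr $ffd2
        rts
```

The source assembles unchanged for any of these systems; as the next paragraph notes, it is the surrounding binary packaging that has to change.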

But assembly language source compatibility does not translate to binary compatibility. We saw this, at one remove, when we looked at the relocating loader for Compute!’s automatic proofreader. In the rest of this article we’ll look at the changes each system demands.

Continue reading

C64 Fat Sprite Workarounds

When writing about retro programming topics here on Bumbershoot, I’ve generally wound up oscillating between the absolute basics of getting anything to run at all and techniques to exploit weird corner cases of the hardware to let it do things it was never intended to do. This article is a more unfortunate cousin: a weird corner case that is fully-intended, but vaguely-documented at best and incredibly inconvenient when doing apparently reasonable things.

In particular, today we will be talking about scrolling sprites off the left side of the screen on the Commodore 64.

A Quick Survey of Sprites

A “sprite” is a generic name for any block of pixels that you want to animate over a background of some kind. These days a sprite is usually two triangles rendered with a partially transparent texture. In the 20th century, it was often a highly-optimized pixel-copy routine (a “blitter”, from BLT, itself an abbreviation of “block transfer”). By the 1990s this was sufficient to the needs of the age, but on 1980s hardware blitting was slow and fraught with compromises. It worked—the Commodore Plus/4, the Apple II, and the entire ZX Spectrum line got by with software blitters over a bitmapped display—but it didn’t look great, and those parts of 80s home-computer game animation that we remember fondly relied on more than just a blitter.

In particular, systems like the C64, the NES, and the Atari game consoles and home computers provided sprite definition and placement capabilities directly in their graphics hardware. This put a limit on the number of elements that could be efficiently rendered, but allowed the animation to be much more practical and sophisticated for a given CPU power.

Interestingly enough, though, hardware sprites were not necessarily a strict upgrade. The Atari 2600 and NES used their hardware sprite support to make actual bitmap support unnecessary. The C64’s 320×200 screen requires 9KB of RAM to hold a single screen’s worth of bitmap information, while the NES’s graphics (without additional support circuitry on the cartridge) only had access to 2KB of RAM and 8KB of ROM. The result was a text-like tiled display mode, with hardware sprites as the only way to get pixel-level control over the display on the NES. This general approach of separate tile and sprite layers proved convenient and efficient enough that it was not only available on the bitmap-and-sprite-capable systems of the time (like the C64 and Atari), but saw use in lower-power and portable devices at least through the Nintendo DS.
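The arithmetic behind that comparison is quick to check (the 1,000-byte color-cell figure is my own addition, to get from the raw 8,000-byte bitmap to the roughly 9KB total):

```python
# Back-of-the-envelope check of the memory figures above.
bitmap_bytes = 320 * 200 // 8            # one bit per pixel: 8,000 bytes
color_bytes = (320 // 8) * (200 // 8)    # one color byte per 8x8 cell: 1,000
total = bitmap_bytes + color_bytes       # about 9KB for a C64 hires screen
nes_vram = 2 * 1024                      # the NES PPU's on-board 2KB of RAM
```

A single C64-style bitmap would thus be more than four times the NES PPU’s entire RAM budget, which is why the tiled approach was not optional there.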

Continue reading

Implementing SHA-256 on the 6502

After five implementations of various algorithms built mostly around bit shuffling, how about a properly modern, common, useful routine? We just need to find something that has a very similar structure and then we can deploy the knowledge we’ve learned in those other domains.

SHA-2 fits this bill neatly, and Wikipedia even has handy pseudocode we can base an implementation on. In looking at the implementations and specifications, it also seems like SHA-256 is a relatively good fit for the 6502, and may well qualify as the most sophisticated modern algorithm that the 6502 can still comfortably handle. We don’t really need to worry about endianness or word size—since we’re an 8-bit chip we can do multibyte mathematics in any direction we want and the size of the algorithm’s words really only alters our loop bounds—but SHA-256’s largest indexed elements are the round constants and the schedule arrays, each of which is 64 32-bit words long. That translates out to tables exactly 256 bytes in size, which is exactly the size of a 6502 memory page and thus the largest amount of memory that can be efficiently indexed.

Initial Implementation

Translating the pseudocode from the Wikipedia article into code was not terribly challenging. I pre-defined tables of the necessary constants in the algorithm, and then set aside a fixed block of RAM to store all the working state that changes as a sum is computed. (In a modern system, we would probably instead operate on a pointer to arbitrary blocks of RAM, but the 6502 is much, much better at operating on blocks of memory at fixed addresses than it is at working with pointers, and I wanted to give the poor CPU every edge it could get.) I then defined several 4-byte blocks of memory for use as accumulators, along with a variety of small functions that would do 32-bit big-endian math on the contents of the accumulators and which could move results into and out of the working state as needed. There were a few awkward bits here—I ended up replicating a few functions because the 256-byte index limit got inconvenient at a few points—but nothing about the implementation of the algorithm really ended up having to deviate much from the pseudocode. So the implementation step was straightforward, much as I had hoped.
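As an illustration of what one of those accumulator helpers does (the function name and calling convention here are mine, not the actual routine's), this is the byte-at-a-time, carry-chained addition the 6502 performs with its ADC instruction:

```python
# 32-bit big-endian addition over 4-byte accumulators, done one byte at
# a time with an explicit carry, the way a 6502 chains ADC instructions.
def add32_be(acc, operand):
    """acc and operand are 4-byte lists, most significant byte first."""
    carry = 0
    for i in range(3, -1, -1):      # low byte first, as the 6502 would
        total = acc[i] + operand[i] + carry
        acc[i] = total & 0xFF
        carry = total >> 8          # the carry flag feeds the next ADC
    return acc                      # carry out discarded: math is mod 2^32
```

Every 32-bit `+` in the pseudocode expands into four of these byte additions, which is a big part of why the cycle counts add up the way the next paragraph describes.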

However, the end result did not come out the way I had hoped it would. While it did give the correct answers to my test vectors, it took 580 milliseconds to consume a single block—nearly 10 milliseconds per byte! I wasn’t willing to push this too far—after all, these algorithms are tuned to be feasible only on modern chips—but we can surely do better than this.

Continue reading

Mandala Checkers: Wrap-Up

In the end, the only extra addition I made to Mandala Checkers was a little code to announce the computer’s move:


The final source code is here. I had a number of other ideas in mind, but ultimately I decided it wasn’t worth the effort to put into them. The rest of this post is a set of questions I had about the program and the porting process along the way, and the answers I found.

How Could We Improve the AI?

The original program’s AI was quite weak. It would take a capture if one were available, but it would otherwise move randomly. Even if we decide to not bother with proper move lookahead or search spaces, it’s pretty easy to devise a better set of heuristics:

  • The game is won after making seven captures. We should therefore only make captures if we’re tied or ahead, or if the capture isn’t into a place where our opponent can just capture right back.
  • Likewise, if at all possible, we just shouldn’t move pieces to where they’re being threatened.

It wouldn’t be too hard to alter our move-finder so that it can determine if a board position represents a potential capture for our opponent. The problem, however, is that this strategy completely breaks the game. This ruleset will effectively force a stalemate, as either player, right after the first move. All of the rules variations from American Checkers end up making this strategy more effective and thus making the game less playable:

  • All pieces start out moving like kings, which means there is no forward pressure. After making an initial safe advance, all a player has to do is move pieces deeper within the initial structure back and forth until the opponent provides an opportunity for attack.
  • Capturing isn’t mandatory, so there’s no need to make captures that would expose you to unnecessary retaliation, when retreat is an option (or where we have the option of moving a piece over to make any threatened counterjump impossible).
  • Since there are no multiple captures, a player who falls behind cannot meaningfully recover with a single bold stroke.

The end result is that good play produces an aggressively passive game that I’m not actually sure can ever be won except by exploiting an opponent’s blunders.

What Were the Changes in Memory Usage?

The original program took up about 5 kilobytes, with a kilobyte of that being metadata:

           Program header       116 bytes
  10- 100: Main loop            209 bytes
6000-6999: Computer move      1,843 bytes
7000-7999: Human move           624 bytes
8000-8999: Draw board           364 bytes
9000-9999: Init board           996 bytes
           Program footer       794 bytes
           Total              4,946 bytes
           Total non-metadata 4,036 bytes

The reimplementation, on the other hand, is only just over one kilobyte, and only 161 bytes of metadata:

           Program header       116 bytes
           BASIC stub            19 bytes
           Main loop            108 bytes
           Game select/init     190 bytes
           Draw board/score     204 bytes
           Human move            75 bytes
           Computer move        169 bytes
           Generic move engine  162 bytes
           Utility functions     99 bytes
           Program footer        26 bytes
           Total              1,168 bytes
           Total non-metadata 1,007 bytes

The main loop has had its size halved, the init and drawing code dropped in size by about a third, and then the game-playing logic and program metadata each end up being smaller by a factor of six.

This is a frankly implausible amount of memory savings.

How Did We Save That Much Space?

It’s not really common for machine code to save that much space over an HLL. Aggressively hand-optimizing for space usually ends up saving you five to ten percent over a properly-tuned compiler. For smaller programs, in fact, an interpreted language like BASIC is often much smaller than the equivalent machine code—most of its standard library lives in ROM and complicated mathematical formulas can be encoded very compactly: A=A+B can take up as few as five bytes, but even if those are 8-bit values you’d spend eight to ten bytes doing the memory reads and writes you need. For a 16-bit integer value it’ll be quite a few more. For Mandala, though, the code doesn’t require any particularly fancy math. It’s mostly memory iteration and complicated control flow, which machine code is very good at. Furthermore, for the most complicated math and I/O, we simply call out into the BASIC ROM’s own routines to do that work for us instead of replicating it. The only routine where we replicate or replace behavior we could steal from the ROM is our random number generator, and that’s mainly because our xorshift-based generator lets us use the least significant bits in the results and thus restrict our randomness ranges with simple bitmasks instead of also needing to pack in a multiplier.
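That RNG tradeoff is worth a concrete look. Here is a sketch of the idea in Python (the 16-bit state size and the shift constants are illustrative choices, not necessarily the ones my port uses):

```python
# A 16-bit xorshift step: the state is scrambled by three shift-and-XOR
# passes, so even the low bits churn on every call. (Shift constants
# here are illustrative, not taken from the actual port.)
def xorshift16(state):
    state ^= (state << 7) & 0xFFFF
    state ^= state >> 9
    state ^= (state << 8) & 0xFFFF
    return state

# Restricting the range with a bitmask instead of a multiply:
state = xorshift16(1)
square = state & 0x3F    # a value in 0..63, e.g. a board square index
```

Because the low bits are usable, a power-of-two range costs one AND instruction, where a generator with weak low bits would need a multiply-and-shift to scale its output down.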

Furthermore, Sinclair BASIC has some serious disadvantages for its code density. Each program line has a 4-byte penalty, which isn’t too bad on its own… but the ZX81 and the Timex Sinclair 1000 alike do not permit multiple statements on a single line of BASIC code. This means you’ll be paying that 4-byte penalty more often, and IF statements often end up having to be repeated to simulate conditional blocks. Worse, every time a numeric constant appears in the code, there is a byte spent for each digit of the number’s decimal representation, but then an additional six bytes are spent to store its preparsed binary representation as well. There are a lot of numbers in Mandala Checkers and it doesn’t use any tricks to minimize the use of actual numeric constants. Those issues are somewhat compounded by the way that the computer-move logic flails around a bunch to make up for bugs that the author seems to have noticed but never actually isolated well enough to fix.

There are techniques to work around these numeric-constant penalties—it’s smaller but slower to store the constants as strings and parse them at runtime, a handful of numeric constants are hardcoded into the BASIC bytecode, and as a last resort an often-reused constant may be cached in a variable with a short name—but this program doesn’t bother with any of them. Once your program doesn’t fit in 2KB the next smallest RAM size is 16KB, which is roomy enough that this is really no longer an issue. This BASIC program was optimized for speed, not size.

That leaves the 768 bytes that we saved out of the program footer. This is another side effect of the 2KB vs 16KB configuration of the ZX81: as we noted back when we first looked at the ZX81’s file formats, the screen memory gets saved out with every program, and the screen size is variable:

The screen memory (or “display file”, as they call it) is actually a list of 24 lines, each of which is individually terminated with a newline. If you have 4KB or more of RAM installed in the system, each line will be the full 32 characters long, which, with the terminators (and an additional newline at the start that signals the beginning of the screen), takes up 794 bytes in all. Under more constrained conditions, an empty screen is just 26 newlines.

My reimplementation is written for a 2KB machine, so it gets the fully compressed empty screen in its memory dump and thus the file ends up 768 bytes smaller. The total file size doesn’t shrink quite that much because I’ve got an extra 19 bytes to represent a tiny BASIC program that exists only to transfer control to the machine code.

Of course, the screen won’t stay 26 bytes. That leads us to the final question…

How Much Memory Does This Thing Actually Use?

I don’t bother with any memory allocation routines in this program, and because of the way the system expands and contracts its memory usage on its own, I also wasn’t comfortable with using any memory that existed past the edge of the screen memory. That meant that all my data exists, pre-initialized, within the file’s memory map itself. That’s handy because it means that our memory footprint can be measured directly from the file size. It’s not complete, but the only additional sources of memory use are the expandable screen memory and the processor stack. The question is thus: how much screen memory is actually used?

I looked into the system memory as the program was running to determine how much screen memory is used. In 16KB mode this stays solid at 794 bytes across the board; in 2KB mode, though, the last 12 bytes or so of each line are blank and thus cut out of the list, which drops the screen memory size by nearly 40%, to 487 bytes. This leaves 556 bytes spare for use by the processor stack in our own computing as well as in the calls the BIOS itself might make. With a fully-utilized screen, that headroom would drop to 199 bytes. That’s a point where I’d start getting nervous about the processor stack possibly trashing my system state.

But that’s also the point where I’m finished with this project. Time to move on to something new.

Mandala: Initial Port Complete

Last time we ported over and bugfixed the initialization and board-drawing code for Mandala Checkers. This time we’ll adapt the code that lets the players make moves.

We’ll end up reorganizing the code a bit from the BASIC original, though. In the original, the human and computer code was completely separated into their own blocks of a thousand line numbers, and they would draw their moves on the board themselves. That was necessary for the BASIC version because the BASIC code for drawing the whole board was extremely and unacceptably slow. Our machine code drawing routine, on the other hand, is fast enough that we can just call it after each move is made. This will also clean up a little infelicity in the initial implementation: in the original, the players’ scores only appear after they’ve scored a point. We draw the scores as part of the board display.

This also means that the human and computer systems now have much more in common in their logic; making a move, or checking to see if a move is legal, now can use identical code. The only thing we need to verify separately is that the right kind of piece is in the start position.

Improvements Over The Original

That also said, a rather distressing number of corners were cut or accidentally filed off in the initial implementation. The human move code, for instance, completely trusted the human player regarding the validity of moves. Executing a move was a matter of erasing the source square and then putting a human piece on the destination square. If the move was not a single square in a diagonal direction, it would also clear the square whose coordinates were the average of your start and end square, and give you credit for a capture. This would let you move opposing pieces and turn them into your own. It would let you jump and capture your own pieces for points. It would let you jump and “capture” completely empty squares for points. Or, you could just move a piece anywhere on the board, get a point for a capture, and erase some totally disconnected square that happened to have an unfortunate coordinate:


This is not really acceptable. Since our computer-move code needs to verify moves independently, we will avail ourselves of that code in the player-move code as well.

The computer-move code is not in the greatest of shape, either. It will, at least, only ever perform legal moves. However, consider this code, from where it is systematically looking for legal capture moves:

6050 LET Y=-11
6055 IF Z+Y>88 OR Z+Y<11 OR Z+2*Y>88 OR Z+2*Y<11 THEN GOTO 6070
6060 IF A(Z+Y)=H AND A(Z+2*Y)=E THEN GOTO 6100
6070 LET Y=-9*(Y=-11)+9*(Y=-9)+10*(Y=9)+(Y=100)
6080 IF Y<>0 THEN GOTO 6055

The variable Y here is supposed to represent the distance from the start square Z, and it’s checking both that moves are in-bounds (in line 6055) and legal captures (in line 6060), moving to do the capture if they are. If this is not a legal move, line 6070 is supposed to advance and consider the next direction. It is using the “do math with equality checks” trick that I discussed in the context of speeding up C64 BASIC programs. In Sinclair BASIC, true comparisons have a value of 1 instead of -1, but this test loses the plot very quickly. Y starts at -11 (thanks to line 6050) and it is supposed to cycle through -11 to -9 to 9 to 11 to cover the four directions it can move. However, instead of going on to 11, it goes to 10, which will never yield a legal move, and then there is an additional, always-false test against 100. The end result of this is that the AI will never actually consider any captures to the southeast.
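To see the cycle concretely, here is line 6070 replayed in Python (using Sinclair BASIC’s convention that a true comparison evaluates to 1), alongside what a corrected expression would presumably look like:

```python
# Line 6070 as written: the third term yields 10 rather than 11, and
# the (Y=100) term can never be true, so the cycle ends after 10.
def next_dir_buggy(y):
    return -9*(y == -11) + 9*(y == -9) + 10*(y == 9) + (y == 100)

# A presumed fix: step -11 -> -9 -> 9 -> 11, then fall through to 0.
def next_dir_fixed(y):
    return -9*(y == -11) + 9*(y == -9) + 11*(y == 9)

seq = []
y = -11
while y != 0:            # mirrors line 6080's loop-while-nonzero test
    seq.append(y)
    y = next_dir_buggy(y)
# seq ends up [-11, -9, 9, 10]: offset 11, the southeast direction,
# is never tried, and 10 is never a legal diagonal offset.
```

Running the same loop with `next_dir_fixed` visits -11, -9, 9, 11, which covers all four diagonals before terminating.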

This error is repeated for noncapture moves as well, so it also will not consider moves to the southeast. The author seems to have grasped that something was very wrong with this code, because it has multiple levels of flailing about trying to find valid moves, and if, after all of that, it still can’t find any, it concedes the game and quits.

That’s actually a valid thing to do—it requires incredibly contrived play, but it is at least theoretically possible to stalemate your opponent with this ruleset, and so if that happens the AI needs to give up instead of searching forever for a move that does not exist.

In porting over the computer’s move routines, though, I removed these faults. Taking the logic bit by bit and merging it together into a coherent whole, the rules it was using can be summarized neatly:

  • If a capture is possible, make the capture closest to the opponent’s edge.
  • If no captures are possible, randomly select a piece and attempt to make a non-capture move with it, preferring to advance on the opponent.
  • If, after 200 or so attempts to find a movable piece, you still have found no move, conclude you have stalemated and concede the game.

The BASIC implementation actually selects squares completely at random and treats that square not having a computer piece on it as a failed attempt at a move, which makes it much more likely to concede unnecessarily as it loses pieces. In my port, I only decrement the attempt counter when it finds a piece but finds it is blocked. This does mean that my implementation will hard-lock the computer if there are no pieces on the board at all, but this cannot happen according to the rules of the game; the human player will win when the computer is reduced to two pieces.
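The difference between the two counting policies can be sketched like so (the names and structure here are mine, not the original code’s):

```python
import random

# board: dict mapping square -> contents; can_move: predicate telling us
# whether the piece on a square has any legal move. The original BASIC
# burned an attempt on every randomly chosen square, occupied or not;
# this version only burns one when a real piece turns out to be blocked.
def pick_move(board, my_piece, can_move, attempts=200):
    squares = list(board)
    while attempts > 0:
        sq = random.choice(squares)
        if board[sq] != my_piece:
            continue                  # empty square: doesn't count
        if can_move(sq):
            return sq
        attempts -= 1                 # a piece exists but is stuck
    return None                       # conclude stalemate and concede
```

As noted above, the `continue` branch means this loops forever on a board with no pieces at all, which the rules of the game make unreachable.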

The Main Loop

The main loop is pretty simple. It’s a little more complicated than the BASIC version, because we need to manage selecting a game mode, and we let the player choose whether they want another game after a game ends instead of just quitting out to BASIC, but the basic concept is still very similar. It’s just a loop that alternates moves by the human and computer, drawing the board and checking for game-over after each move.

With these components in place this gives us a creditable port of both Mandala Checkers and Chopper Checkers, in a single program, that can load and run on a 2KB Timex Sinclair. That was my minimum goal, but I’m going to mess with it a little more before I conclude that I’m done with the project.