Implementing SHA-256 on the 6502

After five implementations of various algorithms built mostly around bit shuffling, how about a properly modern, common, useful routine? We just need to find something that has a very similar structure and then we can deploy the knowledge we’ve learned in those other domains.

SHA-2 fits this bill neatly, and Wikipedia even has handy pseudocode we can base an implementation on. Looking at the implementations and specifications, SHA-256 also seems like a relatively good fit for the 6502, and may well qualify as the most sophisticated modern algorithm that the 6502 can still comfortably handle. We don’t really need to worry about endianness or word size—since we’re an 8-bit chip we can do multibyte mathematics in any direction we want, and the size of the algorithm’s words really only alters our loop bounds—but SHA-256’s largest indexed elements are the round constants and the schedule array, each of which is 64 32-bit words long. That translates to tables exactly 256 bytes in size: the size of a 6502 memory page, and thus the largest amount of memory that can be efficiently indexed.

Initial Implementation

Translating the pseudocode from the Wikipedia article into code was not terribly challenging. I pre-defined tables of the necessary constants in the algorithm, and then set aside a fixed block of RAM to store all the working state that changes as a sum is computed. (In a modern system, we would probably instead operate on a pointer to arbitrary blocks of RAM, but the 6502 is much, much better at operating on blocks of memory at fixed addresses than it is at working with pointers, and I wanted to give the poor CPU every edge it could get.) I then set aside several 4-byte blocks of memory for use as accumulators, and wrote a variety of small functions that would do 32-bit big-endian math on the contents of those accumulators and could move results into and out of the working state as needed. There were a few awkward bits here—I ended up replicating a few functions because the 256-byte index limit got inconvenient at a few points—but nothing about the implementation really ended up having to deviate much from the pseudocode. So the implementation step was quite simple, much as I had hoped.
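For reference, here is a rough C sketch of the per-round arithmetic from that pseudocode: the 32-bit operations that each of those little accumulator routines has to emulate a byte at a time. The names k and w follow the Wikipedia pseudocode (the two 64-entry, 256-byte tables mentioned above); the state array s and the function names are my own shorthand, and this is an illustration of the math, not a transcription of my 6502 source.

#include <stdint.h>

/* Rotate right by n bits (n is always 1..31 here). */
static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }

/* One compression round; s[0]..s[7] are the working variables a..h. */
static void sha256_round(uint32_t s[8], const uint32_t k[64],
                         const uint32_t w[64], int i)
{
    uint32_t S1    = rotr(s[4], 6) ^ rotr(s[4], 11) ^ rotr(s[4], 25);
    uint32_t ch    = (s[4] & s[5]) ^ (~s[4] & s[6]);
    uint32_t temp1 = s[7] + S1 + ch + k[i] + w[i];
    uint32_t S0    = rotr(s[0], 2) ^ rotr(s[0], 13) ^ rotr(s[0], 22);
    uint32_t maj   = (s[0] & s[1]) ^ (s[0] & s[2]) ^ (s[1] & s[2]);
    uint32_t temp2 = S0 + maj;

    /* Shift the working variables down one slot; e and a get the new values. */
    s[7] = s[6]; s[6] = s[5]; s[5] = s[4]; s[4] = s[3] + temp1;
    s[3] = s[2]; s[2] = s[1]; s[1] = s[0]; s[0] = temp1 + temp2;
}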

However, the end result did not come out the way I had hoped. While it did give the correct answers to my test vectors, it took 580 milliseconds to consume a single 64-byte block—nearly 10 milliseconds per byte! I wasn’t willing to push this too far—after all, these algorithms are tuned to be feasible only on modern chips—but we can surely do better than this.

Continue reading


Some Notes on Player Complicity

A lot of the earlier articles on Bumbershoot Software were written “from my old notebooks”—old notes or forum posts or conversations that I’d kept around and then worked into full articles. That’s fine for retrocoding stuff, where the state of the art moves fairly slowly, but it’s not useful for fields that are still rapidly innovating. My old notebooks still have a lot of material from 2004-2014, when I was involved with the interactive fiction community, but a lot of the figures there are now reasonably well-known names in academia and industry, so many of my notes on what we’d now put under the umbrella of “interactive narrative” are extremely outdated. I do sporadically check in on Nick Montfort’s and Emily Short’s blogs to sample the academic and avant-garde commercial viewpoints in the space, but even there I’ve drifted far enough away from both the state of the art and the goals of the craft that there isn’t much I can talk about or grapple with.

However, Emily Short has a new post up on “Choice Poetics” that struck a chord. Back when I’d write nine or ten thousand words each year about the annual Interactive Fiction Competition, I refined some principles and approaches to the entries that ended up serving me pretty well in subsequent years when approaching computer games more generally. The papers linked in that article—particularly the one formally analyzing Papers, Please—have forced me to re-evaluate some of those old conclusions.

So, let’s talk about complicity in video games. I’m using this term in both a narrow and a broader sense. In the narrow sense, it’s talking about things a game can do to make a player feel bad about atrocities they commit or have their character commit within a game. More broadly, I want to move this beyond the crimes that the word “complicity” suggests and consider situations where the player feels responsible for their actions within or to a game world.

I first attempted to formalize this in 2010 when reviewing Jason McIntosh’s The Warbler’s Nest, because that game was brief and extremely pointed, and thus crystallized for me what made it feel like it was laying the responsibility for the events of the game directly at my feet. To wit, if a player is to feel responsible for an atrocity:

  1. The atrocity must be optional. Otherwise the player is no more responsible for these events than a reader is made responsible for the events in a story by the act of turning a page.
  2. The atrocity must be the easy path. Otherwise it’s just Bonus Cruelty you have to work harder for.
  3. The player must know the stakes. Otherwise their actions do not have the necessary import. Lock a player in a room with two unlabeled buttons, and make it clear that the game will do nothing else until one of the two buttons is pressed, and the player will feel no responsibility for the result. Indeed, this is only fair; were such a situation to obtain in real life, the responsibility for the consequences of this action would be firmly on the head of whoever rigged up the buttons, or locked the button-pusher in the room.

That was almost a decade ago. By now, I don’t think any of these hold up as well as they did when I originally devised them. Here are the main problems or complications that, I think, have emerged with them:

  • The player’s interaction mode may be too far removed for the analysis to apply. If a game offers a predefined good path and evil path, the default assumption for many players is that they are expected to be completionists. The atrocities on the “evil path” may be optional, but if the player seeks to see the whole of the game’s content—particularly if certain maps, challenges, characters, or other such content is gated behind progress within it—they cease being optional in pursuit of that goal. More simply, a player who has decided to rain meteors and dinosaur attacks upon their SimCity metropolis will feel no more remorse about this than they would for knocking over a tower of Jenga sticks. A player interacting with a game as an artifact to experiment with or as a library to exhaust has already made a decision that removes them from the analysis.
  • People feel responsible for their actions in real life even if they didn’t know the stakes. Blind or actively deceptive choices still fail, but there’s a lot more leeway here than I had originally formulated. Allowing the player character to feed a starving child a peanut butter sandwich and then rewarding them by having the child die in front of them from anaphylactic shock would be a rather mean thing to do, but a player might still feel some responsibility for the action, since they acted without some obvious information (the child’s peanut allergy) that they in principle “should” have known. On the other hand, one could imagine a hospital/surgery simulation where the epilogue reveals that the player character was not a surgeon at all, but a deluded lunatic who was just cutting people up in a barn. In such a game the player would be unlikely to feel even fleeting remorse for the actions presented at the time as saving lives.
  • If an atrocity is difficult enough to perform, the analysis breaks down completely. This is hard to explain without actually working through examples, so I’ll defer that to my worked examples below.

Below the fold, I’ll work through a number of games and show how the original and the modified analyses interact with them. That also means I will, by necessity, be spoiling some of the narrative mechanics in them, so I will list the games here as the spoiler warning. Proceed at your own risk if you have not played Ultima IV, Bioshock, The Warbler’s Nest, Spec Ops: The Line, Fallen London, Papers, Please, and Undertale. I’ll try to be vague about exact details that are not commonly discussed, but it’s still going to require going further than I’d like without spoiler warnings.

Continue reading

Working with DLLs on Windows, Mac, and Linux

So.

Let’s talk about dynamic libraries.

How to write them, how to build them, how to connect them to your programs.

How to get a result that actually keeps working once you send your program off to someone else.

Exactly how this stuff works, and what you need to do, varies depending on your operating system and build tools. There are four major cases I plan to cover:

  • Building on Linux with gcc or clang
  • Building on Mac with clang
  • Building on Windows under MSYS2 with gcc or clang
  • Building on Windows with Visual Studio

Irrespective of your operating system, gcc and clang often end up behaving very similarly, and clang also has a mode called clang-cl that apparently mimics the command-line interface of the Visual Studio tools.

There are two semi-major cases that I’m not going to cover: directly using the Visual Studio command-line tools, and the BSD-like systems, which have a different binary format from the three OSes I am covering.

I’m skipping the Visual Studio command-line tools because they are verbose, arcane, and exceptionally easy to find documentation for, assuming you can make it happen in Visual Studio itself. If it is very important to track down the relevant flags for some build, Visual Studio’s project property dialogs are extremely transparent about what each entry controls, listing the relevant command-line options either in the description of the setting or alongside each possible option. I’ll only bring the CLI into play if there’s something that’s so much easier to do with it that you’d normally break out of Visual Studio to do it.

BSD, on the other hand, I’m skipping because I know basically nothing about it other than that sometimes it is surprisingly and distressingly different. My advice for gcc/clang on Linux or Mac may perhaps be applicable.

Some Words of Encouragement Before The Abyss

Getting a program to load and run while depending on shared libraries is not generally considered a difficult problem. Indeed, it has been a solved problem for decades. As such, the purpose of this article is twofold.

  1. To be an example of what exactly it means for something to be “a solved problem for decades.” One gets the impression that the details of this stuff are not widely known not because they’re too simple to bother explaining, but because people have been aggressively forgetting it as hard as they can once it works well enough that they can walk away.
  2. To be a usable tutorial for people who have ended up on the rusted pointy end of building or distributing software that has custom shared libraries in it. They may be suffering impostor syndrome, wondering how they could be screwing up such an obviously basic thing. Let my barely-contained ranting reassure you, and with luck the actual instructions in here may get you where you need to be.

Continue reading

Fun With Symbol Collision

One of the basic rules of C is that you don’t get to use the same function name twice.

If I write a file with some function in it, say, lib1a.c:

#include <stdio.h>

void my_awesome_function(void)
{
    printf("This is my awesome function!\n");
}

And then I write another file lib1b.c with a different implementation of the same function:

#include <stdio.h>

void my_awesome_function(void)
{
    printf("This is my ENTIRELY DIFFERENT awesome function!\n");
}

Then if I write a simple main program to call that function…

#include "lib1.h"

int main(int argc, char **argv)
{
    my_awesome_function();
    return 0;
}

It won’t know which one I mean, and when I try to build the program I will get an error message.

/usr/bin/ld: /tmp/cc30kjUg.o: in function `my_awesome_function':
lib1b.c:(.text+0x0): multiple definition of `my_awesome_function'; /tmp/cce04mSV.o:lib1a.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status

But that error message reveals that I wasn’t quite accurate in my first sentence there. This is not a basic rule of C. It’s a basic rule of C’s linker. That linker is shared with many other languages, so similar rules end up applying to assembly language as well. Languages like C++ work around this by having the function names in the program not quite match up with the function names that the linker gets to see: C++ is said to “mangle” its symbols on the way to the linker.

But even without that, we have a little leeway. The linker isn’t quite as strict as that. It only requires that each specific linker product not export the same symbol twice. This means that C can define the same function multiple times as long as it uses the static keyword to constrain each definition to a single file. The C compiler does not export static functions as global symbols, so the linker never gets a chance to notice the duplication.
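As a minimal sketch of that escape hatch (the file and function names here are hypothetical, not part of the example programs above), each file can keep its own private copy of the function as long as something public calls it:

/* lib1a_static.c: a hypothetical variant of lib1a.c */
#include <stdio.h>

static void my_awesome_function(void)   /* local to this file; never exported */
{
    printf("This is my awesome function!\n");
}

void lib1a_entry(void)                  /* the public symbol that uses it */
{
    my_awesome_function();
}

A lib1b_static.c built the same way could carry its own static my_awesome_function, and the two object files would link together without any “multiple definition” complaint.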

Linking the Same Symbol Twice, on Purpose

We can be a little crazier, too. After all, a full program is usually made of many individual linker products. We can create two dynamic libraries that define the same symbol, and we will be permitted to link one program against both libraries simultaneously by Mac, Linux, and Windows.

What happens in these cases is, notionally, system-dependent, but when you build with GCC or Clang, the library listed first on the link command line is the one whose definition wins.

Linking the Same Symbol Twice, by Accident

We’re pretty unlikely to actually run into the situation we described above in a real system, though. Here’s a more insidious example. Let’s keep our lib1a and lib1b libraries as we had them—but now let’s create lib2a and lib2b libraries to go with them. Here is lib2a.c:

#include "lib1.h"

void function1(void)
{
    my_awesome_function();
}

And here is lib2b.c:

#include "lib1.h"

void function2(void)
{
    my_awesome_function();
}

If we then link lib2a against lib1a and lib2b against lib1b, each of these libraries looks nicely independent; neither exports any symbols that conflict with anybody else, and if we have a main program that looks like this:

#include "lib2.h"

int main(int argc, char **argv)
{
    function1();
    function2();
    return 0;
}

We won’t even need to list anything other than the apparently non-conflicting 2a and 2b libraries. We would expect, when we run this, to get output that looks like this:

This is my awesome function!
This is my ENTIRELY DIFFERENT awesome function!

And, indeed, on Windows and Mac, we do. But on Linux, we don’t; we get two copies of the same sentence, and which sentence we get depends on whether we linked 2a in first or 2b. What gives?

One- vs. Two-Level Namespaces

What gives is that Linux uses a one-level namespace for its linking and loading, and Mac and Windows do not. This is a fairly subtle distinction that almost never matters, but it does here so let’s dig into it a bit.

In a two-level linking scheme, the name of the library a symbol is from becomes, effectively, part of the symbol. Both the PE32 format used by Windows and the Mach-O format used by macOS have this mechanism, and are fairly explicit about it—Windows stack traces name their functions things like kernel32.dll!CreateFileW, and the otool -Lv command on macOS will enumerate any library’s dependencies very precisely. (Usually, given the way macOS applications are expected to work, it’s too precise for my tastes, but that is a tale for another time.)

Linux, on the other hand, uses the ELF format, which provides a single-level (or flat) namespace. The ldd command will produce a list of library dependencies much like otool. However, there is a difference: otool does not list anything that is not encoded in some way within the binary itself. The output of ldd, on the other hand, is computed each time it is run, for its results depend on other files in the system, and possibly even the environment variables at the point the process is started.

Another difference between ldd and otool becomes apparent if we run them over our main2 program. On a Mac, otool reports that main2 links against lib2a and lib2b, just like we commanded. On Linux, though, ldd will produce a list that includes not just those two libraries, but also their dependencies. All of these are loaded in at run time just as if the main program had linked directly against them. This means that, as far as Linux is concerned, we are in the exact situation that we were in when we directly linked conflicting libraries into place, and it behaves accordingly.

We can, however, work around this. The POSIX standard provides a function called dlsym that can look up any symbol in any library, even one that you did not link against in the first place. That’s a mechanism general enough that a program could implement whatever linker/loader scheme it wants on top of it. It’s got some neat nonstandard-but-widely-available extensions, too. The GNU C library and compatible systems (including macOS) define a special pseudo-handle called RTLD_NEXT, which looks up the next copy of a symbol in the search order, starting just past the library that called dlsym.
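To make that concrete, here is a hedged sketch of the mechanism (hypothetical file name, and not necessarily what I do below the fold): a shared library that defines my_awesome_function itself, prints a line of its own, and then uses RTLD_NEXT to find and call whichever definition would otherwise have been chosen.

/* interpose.c: build this as a shared library and load it ahead of the others. */
#define _GNU_SOURCE      /* glibc wants this for RTLD_NEXT; macOS does not need it */
#include <dlfcn.h>
#include <stdio.h>

void my_awesome_function(void)
{
    /* Find the next definition of this symbol in the search order, i.e.
       the one that would have been used if this library weren't here. */
    void (*next_fn)(void) =
        (void (*)(void))dlsym(RTLD_NEXT, "my_awesome_function");

    printf("Interposer saw the call first!\n");
    if (next_fn)
        next_fn();   /* then hand the call off to the original */
}

On Linux, depending on your glibc version, you may also need to link against libdl (-ldl) to get dlsym.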

We can have some fun with that. Let’s!

Continue reading

Mandala Checkers: Wrap-Up

In the end, the only addition I made to Mandala Checkers was a little code to announce the computer’s move:

[Screenshot: chopper_prompt]

The final source code is here. I had a number of other ideas in mind, but ultimately I decided it wasn’t worth the effort to put into them. The rest of this post is a set of questions I had about the program and the porting process along the way, and the answers I found.

How Could We Improve the AI?

The original program’s AI was quite weak. It would take a capture if one were available, but it would otherwise move randomly. Even if we decide to not bother with proper move lookahead or search spaces, it’s pretty easy to devise a better set of heuristics:

  • The game is won after making seven captures. We should therefore only make captures if we’re tied or ahead, or if the capture isn’t into a place where our opponent can just capture right back.
  • Likewise, if at all possible, we just shouldn’t move pieces to where they’re being threatened.

It wouldn’t be too hard to alter our move-finder so that it can determine whether a board position represents a potential capture for our opponent. The problem, however, is that this strategy completely breaks the game. Under this ruleset, either player can effectively force a stalemate right after the first move. All of the rules variations from American Checkers end up making this strategy more effective and thus making the game less playable:

  • All pieces start out moving like kings, which means there is no forward pressure. After making an initial safe advance, all a player has to do is move pieces deep within the initial structure back and forth until the opponent provides an opportunity for attack.
  • Capturing isn’t mandatory, so there’s no need to make captures that would expose you to unnecessary retaliation, when retreat is an option (or where we have the option of moving a piece over to make any threatened counterjump impossible).
  • Since there are no multiple captures, a player who falls behind cannot meaningfully recover with a single bold stroke.

The end result is that good play produces an aggressively passive game that I’m not actually sure can ever be won except by exploiting an opponent’s blunders.

What Were the Changes in Memory Usage?

The original program took up about 5 kilobytes, with a kilobyte of that being metadata:

           Program header       116 bytes
  10- 100: Main loop            209 bytes
6000-6999: Computer move      1,843 bytes
7000-7999: Human move           624 bytes
8000-8999: Draw board           364 bytes
9000-9999: Init board           996 bytes
           Program footer       794 bytes
------------------------------------------
           Total              4,946 bytes
           Total non-metadata 4,036 bytes

The reimplementation, on the other hand, is only just over one kilobyte, with only 161 bytes of that being metadata:

           Program header       116 bytes
           BASIC stub            19 bytes
           Main loop            108 bytes
           Game select/init     190 bytes
           Draw board/score     204 bytes
           Human move            75 bytes
           Computer move        169 bytes
           Generic move engine  162 bytes
           Utility functions     99 bytes
           Program footer        26 bytes
------------------------------------------
           Total              1,168 bytes
           Total non-metadata 1,007 bytes

The main loop has had its size halved, the init and drawing code dropped to about a third of its original size, and the game-playing logic and program metadata each end up being smaller by a factor of roughly six.

This is a frankly implausible amount of memory savings.

How Did We Save That Much Space?

It’s not really common for machine code to save that much space over an HLL. Aggressively hand-optimizing for space usually ends up saving you five to ten percent over a properly-tuned compiler. For smaller programs, in fact, an interpreted language like BASIC is often much smaller than the equivalent machine code—most of its standard library lives in ROM, and complicated mathematical formulas can be encoded very compactly: A=A+B can take up as few as five bytes, but even if those are 8-bit values you’d spend eight to ten bytes doing the memory reads and writes you need, and quite a few more for a 16-bit integer value. For Mandala, though, the code doesn’t require any particularly fancy math. It’s mostly memory iteration and complicated control flow, which machine code is very good at. Furthermore, for the most complicated math and I/O, we simply call out into the BASIC ROM’s own routines to do that work for us instead of replicating it. The only routine where we replicate or replace behavior we could steal from the ROM is our random number generator, and that’s mainly because our xorshift-based generator lets us use the least significant bits of the results and thus restrict our randomness ranges with simple bitmasks instead of also needing to pack in a multiplier.
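For illustration, here is a rough C sketch of that idea (a generic Marsaglia xorshift, not a transcription of the Z80 generator in the port): because the low-order bits of a xorshift generator are perfectly usable, restricting the range to a power of two is a single AND.

#include <stdint.h>

static uint32_t rng_state = 1;            /* any nonzero seed works */

static uint32_t xorshift32(void)
{
    /* Marsaglia's classic 32-bit xorshift: three shift-and-XOR steps. */
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* A random value in 0..15, say, is just a bitmask away: no multiply,
   no scaling, no division. */
static uint32_t random_nibble(void)
{
    return xorshift32() & 15;
}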

Furthermore, Sinclair BASIC has some serious disadvantages for its code density. Each program line has a 4-byte penalty, which isn’t too bad on its own… but the ZX81 and the Timex Sinclair 1000 alike do not permit multiple statements on a single line of BASIC code. This means you’ll be paying that 4-byte penalty more often, and IF statements often end up having to be repeated to simulate conditional blocks. Worse, every time a numeric constant appears in the code, there is a byte spent for each digit of the number’s decimal representation, but then an additional six bytes are spent to store its preparsed binary representation as well. There are a lot of numbers in Mandala Checkers and it doesn’t use any tricks to minimize the use of actual numeric constants. Those issues are somewhat compounded by the way that the computer-move logic flails around a bunch to make up for the bugs that the author seems to have noticed but never actually isolated well enough to fix.

There are techniques to work around these numeric-constant penalties—it’s smaller but slower to store the constants as strings and parse them at runtime, a handful of numeric constants are hardcoded into the BASIC bytecode, and as a last resort an often-reused constant may be cached in a variable with a short name—but this program doesn’t bother with any of them. Once your program doesn’t fit in 2KB, the next smallest RAM size is 16KB, which is roomy enough that this is really no longer an issue. This BASIC program was optimized for speed, not size.

That leaves the 768 bytes that we saved out of the program footer. This is another side effect of the 2KB vs 16KB configuration of the ZX81: as we noted back when we first looked at the ZX81’s file formats, the screen memory gets saved out with every program, and the screen size is variable:

The screen memory (or “display file”, as they call it) is actually a list of 24 lines, each of which is individually terminated with a newline. If you have 4KB or more of RAM installed in the system, each line will be the full 32 characters long, which, with terminators (and an additional newline at the start that signals the beginning of the screen), takes up 793 bytes: 33 bytes for each of the 24 lines, plus the initial newline. Under more constrained conditions, an empty screen is just 26 newlines.

My reimplementation is written for a 2KB machine, so it gets the fully compressed empty screen in its memory dump and thus the file ends up 768 bytes smaller. The total file size doesn’t shrink quite that much because I’ve got an extra 19 bytes to represent a tiny BASIC program that exists only to transfer control to the machine code.

Of course, the screen won’t stay 26 bytes. That leads us to the final question…

How Much Memory Does This Thing Actually Use?

I don’t bother with any memory allocation routines in this program, and because of the way the system expands and contracts its memory usage on its own, I also wasn’t comfortable using any memory past the edge of the screen memory. That meant that all my data exists, pre-initialized, within the file’s memory map itself. That’s handy because it means that our memory footprint can be measured directly from the file size. It’s not complete, but the only additional sources of memory usage are the expandable screen memory and the processor stack. So the question becomes: how much screen memory is actually used?

I looked into the system memory as the program was running to determine how much screen memory is used. In 16KB mode this stays solid at 794 bytes across the board; in 2KB mode, though, the last 12 bytes or so of each line are blank and thus cut out of the list, which drops the screen memory size by nearly 40%, to 487 bytes. This leaves 556 bytes spare for use by the processor stack in our own computing as well as in the calls the BIOS itself might make. With a fully-utilized screen, that headroom would drop to 199 bytes. That’s a point where I’d start getting nervous about the processor stack possibly trashing my system state.

But that’s also the point where I’m finished with this project. Time to move on to something new.

Mandala: Initial Port Complete

Last time we ported over and bugfixed the initialization and board-drawing code for Mandala Checkers. This time we’ll adapt the code that lets the players make moves.

We’ll end up reorganizing the code a bit from the BASIC original, though. In the original, the human and computer code was completely separated into its own blocks of a thousand line numbers each, and each block would draw its moves on the board itself. That was necessary for the BASIC version because the BASIC code for drawing the whole board was extremely and unacceptably slow. Our machine code drawing routine, on the other hand, is fast enough that we can just call it after each move is made. This will also clean up a little infelicity in the original: the players’ scores only appear after they’ve scored a point. We draw the scores as part of the board display.

This also means that the human and computer systems now have much more in common in their logic; making a move, or checking to see if a move is legal, now can use identical code. The only thing we need to verify separately is that the right kind of piece is in the start position.

Improvements Over The Original

That said, a rather distressing number of corners were cut or accidentally filed off in the original implementation. The human move code, for instance, completely trusted the human player regarding the validity of moves. Executing a move was a matter of erasing the source square and then putting a human piece on the destination square. If the move was not a single square in a diagonal direction, it would also clear the square whose coordinates were the average of your start and end squares, and give you credit for a capture. This would let you move opposing pieces and turn them into your own. It would let you jump and capture your own pieces for points. It would let you jump and “capture” completely empty squares for points. Or, you could just move a piece anywhere on the board, get a point for a capture, and erase some totally disconnected square that happened to have an unfortunate coordinate:

[Screenshot: mandala_bad_capture]

This is not really acceptable. Since our computer-move code needs to verify moves independently, we will avail ourselves of that code in the player-move code as well.

The computer-move code is not in the greatest of shape, either. It will, at least, only ever perform legal moves. However, consider this code, from where it is systematically looking for legal capture moves:

6050 LET Y=-11
6055 IF Z+Y>88 OR Z+Y<11 OR Z+2*Y>88 OR Z+2*Y<11 THEN GOTO 6070
6060 IF A(Z+Y)=H AND A(Z+2*Y)=E THEN GOTO 6100
6070 LET Y=-9*(Y=-11)+9*(Y=-9)+10*(Y=9)+(Y=100)
6080 IF Y<>0 THEN GOTO 6055

The variable Y here represents the offset from the start square Z; the code checks both that the move is in-bounds (line 6055) and that it is a legal capture (line 6060), and jumps ahead to perform the capture if it is. If this is not a legal move, line 6070 is supposed to advance to the next direction to consider. It is using the “do math with equality checks” trick that I discussed in the context of speeding up C64 BASIC programs. In Sinclair BASIC, true comparisons have a value of 1 instead of -1, but this line loses the plot very quickly regardless. Y starts at -11 (thanks to line 6050) and is supposed to cycle through -11, -9, 9, and 11 to cover the four directions a piece can move. However, instead of going from 9 to 11, it goes from 9 to 10, which can never yield a legal move, and then there is an additional, always-false test against 100. The end result is that the AI will never actually consider any captures to the southeast.
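For clarity, here is a rough C sketch of what that scan appears to be trying to do, with the direction cycle corrected to cover all four diagonals. The board layout follows the BASIC (squares numbered row-times-ten-plus-column, with the same coarse 11-to-88 bounds check), and H and E hold what line 6060 suggests are the codes for a human piece and an empty square; the function name and the cell values themselves are hypothetical.

/* Hypothetical cell codes; the real values come from the BASIC program. */
enum { E = 0, H = 1 };

/* The four diagonal offsets on a board indexed as row*10 + column. */
static const int dirs[4] = { -11, -9, 9, 11 };

/* Return the landing square of a capture available from square z, or 0 if none. */
static int find_capture(const unsigned char board[89], int z)
{
    for (int i = 0; i < 4; i++) {
        int over = z + dirs[i];        /* the square being jumped over */
        int land = z + 2 * dirs[i];    /* the square landed on */
        if (over < 11 || over > 88 || land < 11 || land > 88)
            continue;                  /* off the board (same coarse check as line 6055) */
        if (board[over] == H && board[land] == E)
            return land;               /* a capture is available in this direction */
    }
    return 0;
}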

This error is repeated for noncapture moves as well, so the AI will not consider ordinary moves to the southeast either. The original author seems to have grasped that something was very wrong with this code, because the program has multiple levels of flailing about trying to find valid moves, and if, after all of it, it still can’t find any, it concedes the game and quits.

That’s actually a valid thing to do—it requires incredibly contrived play, but it is at least theoretically possible to stalemate your opponent with this ruleset, and so if that happens the AI needs to give up instead of searching forever for a move that does not exist.

In porting over the computer’s move routines, though, I removed these faults. Taking the logic bit by bit and merging it into a coherent whole, the rules it was using can be summarized neatly:

  • If a capture is possible, make the capture closest to the opponent’s edge.
  • If no captures are possible, randomly select a piece and attempt to make a non-capture move with it, preferring to advance on the opponent.
  • If, after 200 or so attempts to find a movable piece, you still have found no move, conclude you have stalemated and concede the game.

The BASIC implementation actually selects squares completely at random and treats that square not having a computer piece on it as a failed attempt at a move, which makes it much more likely to concede unnecessarily as it loses pieces. In my port, I only decrement the attempt counter when it finds a piece but finds it is blocked. This does mean that my implementation will hard-lock the computer if there are no pieces on the board at all, but this cannot happen according to the rules of the game; the human player will win when the computer is reduced to two pieces.

The Main Loop

The main loop is pretty simple. It’s a little more complicated than the BASIC version, because we need to manage selecting a game mode, and we let the player choose whether they want another game after a game ends instead of just quitting out to BASIC, but the basic concept is still very similar. It’s just a loop that alternates moves by the human and computer, drawing the board and checking for game-over after each move.

With these components in place, we have a creditable port of both Mandala Checkers and Chopper Checkers, in a single program, that can load and run on a 2KB Timex Sinclair. That was my minimum goal, but I’m going to mess with it a little more before I conclude that I’m done with the project.

Mandala: Premature Optimization

Before moving on to tackle the implementation of the next part of Mandala Checkers, I did a quick review of my initialization and display code to comment it more thoroughly and to see where I could optimize or generalize it. I found a couple of places where I could make my support routines more powerful and this ended up shrinking my program-so-far from 492 bytes to 465—over 5%.

The optimizations I found are sound—that is, they don’t actually break anything—but they also look incredibly sleazy. With this post I’ll look a little bit at those optimizations and show how they work. It turns out they’re valuable in general and not just as one-off sneaky tricks.

Trick 1: Beyond Call Elimination

Some optimizations don’t depend on any context beyond the handful of instructions they rewrite. These are called peephole optimizations—they are essentially a search-and-replace transform in which sequences of logical but inefficient instructions are replaced with smaller, more efficient, and possibly more obscure ones.

They’re not always more obscure, though. Here are two rules involving jumps and procedure calls that are extremely logical:

  • A jump instruction to the very next instruction may simply be deleted.
  • A procedure call instruction followed immediately by a procedure return instruction may be replaced with a simple jump instruction.

When we put these rules together, we get a less obviously logical rule:

  • When two procedures are defined back-to-back, and the last thing the first procedure does is call the second, both the call to the second routine and the first routine’s return instruction may be deleted.

This third rule follows immediately from applying the second rule and then the first rule, and it suggests that if we organize our program’s functions appropriately, we can save some space by allowing some function calls to “fall through” into other functions. It turns out that my initialize-board code can reasonably fall through into the draw-board-on-screen code. That looks fine now, but it will start looking weird later when I want to redraw the board after a player makes a move—it will sort of look like I’m calling into the middle of the initialize-and-draw-board subroutine.

It gets weirder when the first procedure calls the second procedure multiple times. I don’t have an example of this yet in Mandala, but a whole lot of my test code needs to print out 16-bit hex values. I do this with three routines; hexout_16 calls hexout_8 twice, and hexout_8 calls hexout_4 twice. In each case, the final call can be eliminated, and the resulting code is one where hexout_16 is doing procedure calls into the middle of itself:

hexout_16:
        push    hl              ; save the full 16-bit value
        ld      a, h
        call    hexout_8        ; print the high byte
        pop     hl
        ld      a, l            ; low byte falls through into hexout_8
hexout_8:
        push    af              ; save the byte
        rra
        rra
        rra
        rra                     ; rotate the high nibble down into the low nibble
        call    hexout_4        ; print it
        pop     af              ; restore; low nibble falls through into hexout_4
hexout_4:
        and     15              ; isolate the nibble
        add     28              ; ZX81 character set: code 28 is '0'
        rst     $10             ; print the character in A
        ret

If you actually step through it, this code is doing some extremely strange things with its stack, especially if it’s in a context where hexout_16 is the only function that’s ever actually externally called. But if you break it up mentally into three functions that fall into each other, the design clears up right away.

Trick 2: Doing Trick 1 Backwards

The main thing I realized from my review of Mandala so far was that I was doing a lot of duplicative or unnecessary work when printing out status or prompt strings. My only routine for printing anything out was a simple one named print_at, which took a pointer to a two-byte screen location code followed by a string of characters. It would move the cursor to that location, then print out characters until it reached a value of 255. (Normally one stops at a character code of 0, but the ZX81’s wacky non-standard character set uses character zero for space.) Making this call cost six bytes, and so I had several places where I would print out a bunch of unnecessary newlines or spaces to spare myself a second call, and other places where I would call it multiple times in a row to write to various parts of the screen.

This is inconvenient. I’d be far better off generalizing the function so that if it saw, say, a 254, that would make it treat the next two bytes as a new location code. The routine started out looking something like this:

  1. Read two bytes from the string and move the cursor accordingly.
  2. Advance the string pointer two bytes to skip over the location code.
  3. Read a byte from the string.
  4. If the byte is 255, return from the function.
  5. If the byte is not 254, go to step 8.
  6. Advance the string pointer one byte to skip past the 254.
  7. Go to step 1.
  8. Print out the character read.
  9. Advance the string pointer one byte.
  10. Go to step 3.

There are a number of things we can do with this. The most obvious is to notice that if I ever need to print something out without relocating the cursor first, I can set a label print in the middle of the function, letting me effectively say “start at step 3.” This produces a result identical to what would happen if print_at moved the cursor and then called print… as long as we don’t look too closely at the loopback in step 7.

The second thing we can do is notice that there’s no reason we ever have to start at the first step stated. Instead of jumping around in the 254 case, we can move the pointer advance from step 6 up to just before step 1 as a new “step 0,” change step 5 to “If the byte is 254, go to step 0,” and delete the old steps 6 and 7 entirely.

That’s cute, and sound, and reliable, and it preserves the exact semantics of the previous function, but it’s also ugly. Is there a better way?

There is. The whole reason we need these extra jumps is that we need to increment the string pointer and then jump somewhere, and a conditional test only lets us jump or return; it can’t do other work first. But the real problem here is that we’re advancing the string pointer too late, all the way down in step 9! If we move that string pointer advance to just after step 3, we can make the test simply be “If the byte is 254, go to step 1” and our work is entirely done. It’s both cleaner and smaller.

Adding functionality does still cost a bit of space, though: interpreting 254 specially added four bytes to my print_at routine. However, setting up the pointer and making the call costs six bytes, so eliminating even one call produces a net space savings. I was able to eliminate far more than one.

Harm Reduction

But wait, modern programmers might observe. This involves some pretty aggressive use of crisscrossing GOTO statements. GOTO statements are considered harmful, they might say. Aren’t I corrupting the youth of today, they might ask.

To this I answer, no, no I am not, and for a variety of reasons. The first is that all of these jumps are within a single screenful of code, and this was considered acceptable by the more practical developers of that era. The second is that the structured-programming partisans of the “GOTO considered harmful” era also eschewed things we take for granted as part of well-structured code today, such as mid-function return statements, or break and continue statements within a loop.

And finally, I’m in the clear because the actual control flow doesn’t even require break or continue:

  • Repeat:
    • Consume two characters from the string
    • Move the cursor as specified by those two characters
    • Repeat:
      • Consume one character from the string
      • If it is not 254 or 255, print it

      While character consumed was not 254 or 255

    While character consumed was not 255

That’s rigid enough to satisfy even the most rigorously excessive of structured programming requirements. It’s not quite what I actually do, directly—the actual machine code uses the equivalents of break and mid-function return, since the Z80 instruction set has “return immediately if condition is true” as a specialized one-byte instruction, and there’s really no reason not to take advantage of it if it’s there.
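As a rough illustration, here is that control flow rendered in C. The helpers move_cursor and put_char are hypothetical stand-ins for the actual ROM calls, and this is a sketch of the structure, not a transcription of the Z80:

void move_cursor(unsigned char row, unsigned char col);   /* hypothetical ROM call */
void put_char(unsigned char c);                           /* hypothetical ROM call */

/* Print a string of ZX81 character codes: the string starts with a two-byte
   cursor location, 254 introduces a new location code, 255 terminates. */
void print_at(const unsigned char *s)
{
    for (;;) {
        move_cursor(s[0], s[1]);         /* consume a two-byte location code */
        s += 2;
        for (;;) {
            unsigned char c = *s++;      /* read a byte, advancing immediately */
            if (c == 255)
                return;                  /* end of string */
            if (c == 254)
                break;                   /* a new location code follows */
            put_char(c);                 /* ordinary character: print it */
        }
    }
}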

Trick 3: Dual-purpose Data

With the new printing procedure in hand, I went through the code and started coalescing large collections of print statements into single calls. However, there was still one nasty knot of calls left. These calls revolved around printing out the current score. It needed to print out the labels for each score, and then it needed to compute the character codes based on the current score and print those out individually.

The trick here is that this is not remotely necessary. Since the scores in this game are only ever one digit long, we can just embed the scores directly in the score string, and look at the character we’d print to determine what the score was. That moves the interpret-score-as-character code out of the board display and into the has-a-player-won logic, but it lets us get rid of a bunch of unnecessary math and lets us coalesce several more print statements.
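A tiny C sketch of that dual use (the array layout and offsets are hypothetical; on the ZX81, character code 28 is ‘0’):

enum { ZX_ZERO = 28 };                     /* ZX81 character code for '0' */

/* The score digits live directly in the text that gets printed. */
static unsigned char score_line[2] = { ZX_ZERO, ZX_ZERO };

static unsigned char get_score(int player)
{
    return score_line[player] - ZX_ZERO;   /* read the model back out of the display data */
}

static void add_point(int player)
{
    score_line[player]++;                  /* scores stay single digits, so just bump the character */
}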

This is also an instance of a somewhat more general technique where you use your display data directly as model data. This is a bad idea in modern GUI applications, where the view can change unpredictably and you want a stable model to inform any redraws, but for fixed character-cell displays like these systems, there’s no reason to not use the video matrix as a repository in memory of what is currently on the screen.