Mandala: Premature Optimization

Before moving on to tackle the implementation of the next part of Mandala Checkers, I did a quick review of my initialization and display code to comment it more thoroughly and to see where I could optimize or generalize it. I found a couple of places where I could make my support routines more powerful and this ended up shrinking my program-so-far from 492 bytes to 465—over 5%.

The optimizations I found are sound—that is, they don’t actually break anything—but they also look incredibly sleazy. With this post I’ll look a little bit at those optimizations and show how they work. It turns out they’re valuable in general and not just as one-off sneaky tricks.

Trick 1: Beyond Call Elimination

Some optimizations don’t depend on any of the context or instructions around them. These are called peephole optimizations—they are essentially a search-and-replace transform for sequences of logical but inefficient instructions that can be replaced with smaller, more efficient, and possibly more obscure instructions.

There’s not always more obscure, though. Here are two rules involving jumps and procedure calls that are extremely logical:

  • A jump instruction to the very next instruction may simply be deleted.
  • A procedure call instruction followed immediately by a procedure return instruction may be replaced with a simple jump instruction.

When we put these rules together, we get a less obviously logical rule:

  • When two procedures are defined back-to-back, and the last thing the first procedure does is call the second, both the call to the second routine and the first routine’s return instruction may be deleted.

This third rule follows immediately from applying the second rule and then the first rule, and it suggests that if we organize our program’s functions appropriately, we can save some space by allowing some function calls to “fall through” into other functions. It turns out that my initialize-board code can reasonably fall through into the draw-board-on-screen code. That looks fine now, but it will start looking weird later when I want to redraw the board after a player makes a move—it will sort of look like I’m calling into the middle of the initialize-and-draw-board subroutine.

It gets weirder when the first procedure calls the second procedure multiple times. I don’t have an example of this yet in Mandala, but a whole lot of my test code needs to print out 16-bit hex values. I do this with three routines; hexout_16 calls hexout_8 twice, and hexout_8 calls hexout_4 twice. In each case, the final call can be eliminated, and the resulting code is one where hexout_16 is doing procedure calls into the middle of itself:

hexout_16:
        push    hl
        ld      a, h
        call    hexout_8
        pop     hl
        ld      a, l
hexout_8:
        push    af
        rra
        rra
        rra
        rra
        call    hexout_4
        pop     af
hexout_4:
        and     15
        add     28
        rst     $10
        ret

If you actually step through it, this code is doing some extremely strange things with its stack, especially if it’s in a context where hexout_16 is the only function that’s ever actually externally called. But if you break it up mentally into three functions that fall into each other, the design clears up right away.

Trick 2: Doing Trick 1 Backwards

The main thing I realized from my review of Mandala so far was that I was doing a lot of duplicative or unnecessary work when printing out status or prompt strings. My only routine for printing anything out was a simple routine named print_at, which took a pointer to a two-byte screen location code followed by a string of characters. It would move the cursor to that location code, then print out characters until it reached a value of 255. (Normally one stops at a character code of 0, but the ZX81’s wacky non-standard character set uses character zero for space.) Making this call cost six bytes, and so I had several places where I would print out a bunch of unnecessary newlines or spaces to spare myself the call, and there were other places where I would call it multiple times in a row to write at various parts of the screen.

This is inconvenient. I’d be far better off generalizing the function so that if it saw, say, a 254, that would make it treat the next two bytes as a new location code. The routine started out looking something like this:

  1. Read two bytes from the string and move the cursor accordingly.
  2. Advance the string pointer two bytes to skip over the location code.
  3. Read a byte from the string.
  4. If the byte is 255, return from the function.
  5. If the byte is not 254, go to step 8.
  6. Advance the string pointer one byte to skip past the 254.
  7. Go to step 1.
  8. Print out the character read.
  9. Advance the string pointer one byte.
  10. Go to step 3.

There are a number of things we can do with this. The most obvious is to notice that if I ever need to print something out without relocating the cursor first, I can set a label print in the middle of the function, letting me effectively say “start at step 3.” This produces a result identical to what would happen if print_at moved the cursor and then called print… as long as we don’t look too closely at the loopback in step 7.

The second thing we can do is notice that there’s no reason we ever have to start at the first step stated. Instead of jumping around with 254 case, we can change step 5 to “If the byte is 254, go to step 0“, remove step 6 entirely, and move step 5 to before step 1.

That’s cute, and sound, and reliable, and it preserves the exact semantics of the previous function, but it’s also ugly. Is there a better way?

There is. The whole reason we need to have these extra jumps is because we need to increment the string pointer and then jump somewhere, and our tests only let us jump or return as the result of a test. But the real problem here is that we’re advancing the string pointer too late, all the way down in step 9! If we move that string pointer advance to just after step 3, we can make the test simply be “If the byte is 254, go to step 1” and our work is entirely done. It’s both cleaner and smaller.

Adding functionality does still cost a bit of space, though: interpreting 254 specially added four bytes to my print_at routine. However, setting up the pointer and making the call costs six bytes, so being able to wipe out even one call as a result results in a space savings. I was able to wipe out much more than one call.

Harm Reduction

But wait, modern programmers might observe. This involves some pretty aggressive use of crisscrossing GOTO statements. GOTO statements are considered harmful, they might say. Aren’t I corrupting the youth of today, they might ask.

To this I answer, no, no I am not, and I am not for a variety of reasons. The first is that all of these jumps are within a single screenful of code, and this was considered acceptable by the more practical developers of that era. The second is that the structure programming wars of the “GOTO considered harmful” era also eschewed things we take for granted as part of well-structured code today, such as mid-function return statements, or break and continue statements within a loop.

And finally, I’m in the clear here because the actual control flow here doesn’t even require break or continue:

  • Repeat:
    • Consume two characters from the string
    • Move the cursor as specified by those two characters
    • Repeat:
      • Consume one character from the string
      • If it is not 254 or 255, print it

      While character consumed was not 254 or 255

    While character consumed was not 255

That’s rigid enough to satisfy even the most rigorously excessive of structural programming requirements. It’s not quite what I actually do, directly—the actual machine code uses the equivalents of break and mid-function return, but the Z80 instruction actually has “return immediately if condition is true” as a specialized one-byte instruction. There’s really no reason to not take advantage of it if it’s there.

Trick 3: Dual-purpose Data

With the new printing procedure in hand, I went through the code and started coalescing large collections of print statements into single calls. However, there was still one nasty knot of calls left. These calls revolved around printing out the current score. It needed to print out the labels for each score, and then it needed to compute the character codes based on the current score and print those out individually.

The trick here is that this is not remotely necessary. Since the scores in this game are only ever one digit long, we can just embed the scores directly in the score string, and look at the character we’d print to determine what the score was. That moves the interpret-score-as-character code out of the board display and into the has-a-player-won logic, but it lets us get rid of a bunch of unnecessary math and lets us coalesce several more print statements.

This is also an instance of a somewhat more general technique where you use your display data directly as model data. This is a bad idea in modern GUI applications, where the view can change unpredictably and you want a stable model to inform any redraws, but for fixed character-cell displays like these systems, there’s no reason to not use the video matrix as a repository in memory of what is currently on the screen.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s