Atari 2600: A 48px Kernel

Let’s finish this.

The final thing to do is to add a logo to the top of the screen. I’m going to use three copies each of the two player sprites for twelve lines, producing a 48 by 12 pixel display:

......X.........X...X.........XXX.....X..X......
......X...X.....X..XXX........X.X....XXX.X......
......X.........X...X.........X.X.....X..X......
......X...X.XXX.XXX.X.XXX.....X.X.X.X.X..X......
......X...X.X.X.X.X.X.X.......X.X.X.X.X..X......
......X...X.X.X.X.X.X.XXX.....X.X.X.X.X..X......
......X...X.X.X.X.X.X...X.....X.X.X.X.X.........
......XXX.X.XXX.X.X.X.XXX.....XXX.XXX.X..X......
..............X.................................
..............X.................................
............X.X.................................
............XXX.................................

To produce a solid block of centered pixels, we need to set the players into “three copies close” mode. Each of those three copies will be eight pixels wide and have eight pixels of space between them. The first part of the trick is that we will interleave the two players so that there aren’t any gaps. That means that, for a solid centered display, Player 1 needs to be drawn 8 pixels to the right of Player 0, and the fourth player graphic (which is Player 1’s second copy) needs to land on pixel 80, the beginning of the second half of the screen. Working backwards from that, we find that our target pixels are 56 and 64.

We covered quite early on how to locate our sprites in fixed locations, but there’s a slight wrinkle this time. The closest we can get to our target pixels is to write on cycles 37 and 40. However, if we write on cycles 34 and 37, we save a couple of bytes in our placement code and still remain just within the range of a single corrective HMOVE:

        sta     WSYNC
        lda     #$03            ; +2 (2)
        sta     NUSIZ0          ; +3 (5)  Three copies close for P0 and P1
        sta     NUSIZ1          ; +3 (8)
        lda     #$1e            ; +2 (10)
        eor     idlemsk         ; +3 (13)
        sta     COLUP0          ; +3 (16)
        sta     COLUP1          ; +3 (19)
        sta     HMCLR           ; +3 (22)
        lda     #$80            ; +2 (24)
        sta     HMP0            ; +3 (27)
        lda     #$90            ; +2 (29)
        sta     HMP1            ; +3 (32)
        nop                     ; +2 (34)
        sta     RESP0           ; +3 (37)
        sta     RESP1

(We’re making the header bright yellow by default, but we’re deploying the idle mask here too so that it will change along with the other graphics when we’re in idle mode.)
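To double-check the target pixels worked out above, here is a small Python sketch of the arithmetic. The helper names are made up for illustration, and the 160-pixel visible width is assumed from pixel 80 being described as the start of the second half of the screen.

```python
# Hypothetical helpers for illustration; none of these names come from the game's source.
SPRITE_WIDTH = 8     # each copy is eight pixels wide
COPY_SPACING = 16    # "three copies close": 8 pixels of sprite, then 8 pixels of gap


def p0_start_for_centered_block(screen_width=160):
    """The fourth of the six copies must start at the middle of the screen."""
    middle = screen_width // 2
    # Three 8-pixel copies sit to the left of the middle.
    return middle - 3 * SPRITE_WIDTH


def copy_positions(p0_start):
    """Start pixels of all six copies, left to right, with P1 placed
    one sprite width to the right of P0 so the copies interleave."""
    p1_start = p0_start + SPRITE_WIDTH
    p0 = [p0_start + i * COPY_SPACING for i in range(3)]
    p1 = [p1_start + i * COPY_SPACING for i in range(3)]
    return sorted(p0 + p1)


start = p0_start_for_centered_block()
positions = copy_positions(start)
print(start, start + SPRITE_WIDTH)   # target pixels for P0 and P1
print(positions)
```

The six start pixels come out eight apart with no gaps, and the fourth one is pixel 80.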

With the players placed, we now need to set aside some space to draw the graphic appropriately. For now, I’m splitting the top empty space in half and removing six scanlines from it:

        ldy     #$0d
*       sta     WSYNC
        dey
        bne     -

        lda     #$ff
        sta     GRP0
        sta     GRP1
        ldy     #$0c
*       sta     WSYNC
        dey
        bne     -
        lda     #$00
        sta     GRP0
        sta     GRP1

        ldy     #$0d
*       sta     WSYNC
        dey
        bne     -

I then remove six extra scanlines from the blank area at the bottom to balance it out. This moves the game board down a bit but (using a placeholder graphic that’s just a solid block of color) shows us a decently centered total display:

[Screenshot: lightsout2600_08_blocks]

Now to turn that into our logo.

The General Approach

The STA GRPn instructions take three cycles to execute, during which time the display will advance nine pixels. We need to time writes to these registers so that the graphics are changed just before they’re used. At the time that’s happening, the other set of graphics is being consulted, so we do have a little leeway here. We also have the advantage of being able to preload some values into the graphics first, so if the first two graphics values are fine, we simply need to update the next four graphics blocks. The challenge here is that we only have the A, X and Y registers to work with, so we can only store three values in a row without some kind of load operation, and load operations all cost unacceptably large amounts of time. The crucial insight to surmounting this challenge, which I see credited primarily to Dave Staugas at Atari, is that there are a few extra places to stash data.

Vertical Delay and the Shadow Registers

GRP0 and GRP1 are the two sprite graphics registers that the TIA chip admits to having, and they’re the only ones that can be directly written. However, it also has some internal registers for storing older values of them. These were intended to be used to shift certain graphics down a scanline within a display kernel that took more than one scanline per loop. (Hence the names of the control flags for this, such as VDELP0 for Vertical DELay Player 0.) What the flags actually do is switch to displaying the “older” values of that graphic. This technique (the TIA Hardware Guide I’ve been using for reference calls it “The Venerable Six-Digit Score Trick”, and I’ve also seen it called the “Staugas Score Kernel”) uses those shadow registers to preload enough graphics data to let us display the entire 48 pixels.

These shadow registers don’t simply record the previous value of a write to a graphics register, unfortunately. Instead, the way it works is that any time you write to one player’s graphics register, the “current” value of the other player’s graphics register is copied to its “old” value. This means that for any string of writes to the registers, there can be at most three unique values stored (since writing the third value has wiped out one of the earlier ones). Fortunately, three unique values is all we need, since we’ve got A, X, and Y to cover the pending values. We’ll preload the graphics registers with three values, preload the CPU registers with the remaining three, and then juggle the values so they display properly across the whole scanline.
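The copy-on-write rule is easier to see in a toy model. The Python sketch below is only an illustration of the behavior described above, not a TIA emulation, and the class and method names are invented for it.

```python
class PlayerGraphics:
    """Toy model of the four graphics slots: a 'new' and an 'old' value per player.
    Illustrative only; the names are hypothetical."""

    def __init__(self):
        self.grp0_new = self.grp0_old = 0
        self.grp1_new = self.grp1_old = 0

    def write_grp0(self, value):
        self.grp0_new = value
        # Side effect: the OTHER player's current value is latched into its old slot.
        self.grp1_old = self.grp1_new

    def write_grp1(self, value):
        self.grp1_new = value
        self.grp0_old = self.grp0_new

    def slots(self):
        return (self.grp0_old, self.grp0_new, self.grp1_old, self.grp1_new)


tia = PlayerGraphics()
tia.write_grp0("a")
tia.write_grp1("b")
tia.write_grp0("c")
print(tia.slots())   # ('a', 'c', 'b', 'b')
```

After three writes the four slots hold only three distinct values: the most recent write always leaves the other player's "old" and "new" slots identical.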

Getting the Logo Data

We’ll be counting backwards through our loop, as usual, so we need to take our logo graphic up top and not only break it into six eight-pixel columns but also display them upside-down. I wrote a fairly simple script to do that work:

#!/usr/bin/env python3

logo = ["......X.........X...X.........XXX.....X..X......",
        "......X...X.....X..XXX........X.X....XXX.X......",
        "......X.........X...X.........X.X.....X..X......",
        "......X...X.XXX.XXX.X.XXX.....X.X.X.X.X..X......",
        "......X...X.X.X.X.X.X.X.......X.X.X.X.X..X......",
        "......X...X.X.X.X.X.X.XXX.....X.X.X.X.X..X......",
        "......X...X.X.X.X.X.X...X.....X.X.X.X.X.........",
        "......XXX.X.XXX.X.X.X.XXX.....XXX.XXX.X..X......",
        "..............X.................................",
        "..............X.................................",
        "............X.X.................................",
        "............XXX................................."]

# The kernel counts its line index downward, so store the rows bottom-up.
logo.reverse()
result = [[], [], [], [], [], []]
for i in logo:
    for j in range(6):
        # Pack each 8-pixel block into a byte, leftmost pixel in bit 7.
        block = i[j*8:j*8+8]
        val = 0
        mask = 128
        for k in block:
            if k == 'X':
                val += mask
            mask //= 2
        result[j].append(val)

for j in range(6):
    print("logo_%d: .byte   $%s" % (j, ",$".join(["%02x" % v for v in result[j]])))

If I wanted to make this a more aggressively Pythonic script, I’d probably do a string replacement of the dots and Xs with zeroes and ones and then rely on Python’s built-in function for converting strings of binary digits into integers. This works fine, though, and the underlying operations are expressible directly in more languages. The logo data falls out neatly:

logo_0: .byte   $00,$00,$00,$00,$03,$02,$02,$02,$02,$02,$02,$02
logo_1: .byte   $0e,$0a,$02,$02,$ae,$2a,$2a,$2a,$2e,$00,$20,$00
logo_2: .byte   $00,$00,$00,$00,$ab,$a8,$ab,$aa,$eb,$88,$9c,$88
logo_3: .byte   $00,$00,$00,$00,$83,$82,$82,$02,$82,$02,$02,$03
logo_4: .byte   $00,$00,$00,$00,$ba,$aa,$aa,$aa,$aa,$82,$87,$82
logo_5: .byte   $00,$00,$00,$00,$40,$00,$40,$40,$40,$40,$40,$40
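For comparison, the more Pythonic variant mentioned above might look something like this. It is only a sketch: the function name and structure are my own, and it relies on `str.translate` plus `int(text, 2)` in place of the manual bit mask.

```python
LOGO = ["......X.........X...X.........XXX.....X..X......",
        "......X...X.....X..XXX........X.X....XXX.X......",
        "......X.........X...X.........X.X.....X..X......",
        "......X...X.XXX.XXX.X.XXX.....X.X.X.X.X..X......",
        "......X...X.X.X.X.X.X.X.......X.X.X.X.X..X......",
        "......X...X.X.X.X.X.X.XXX.....X.X.X.X.X..X......",
        "......X...X.X.X.X.X.X...X.....X.X.X.X.X.........",
        "......XXX.X.XXX.X.X.X.XXX.....XXX.XXX.X..X......",
        "..............X.................................",
        "..............X.................................",
        "............X.X.................................",
        "............XXX................................."]


def logo_columns(rows, width=8):
    """Return one list of bytes per 8-pixel column, bottom row first."""
    to_bits = str.maketrans(".X", "01")
    columns = [[] for _ in range(len(rows[0]) // width)]
    for row in reversed(rows):
        bits = row.translate(to_bits)
        for j, column in enumerate(columns):
            # int(..., 2) parses the eight characters as a binary number.
            column.append(int(bits[j * width:(j + 1) * width], 2))
    return columns


for j, column in enumerate(logo_columns(LOGO)):
    print("logo_%d: .byte   %s" % (j, ",".join("$%02x" % v for v in column)))
```

It should print the same six tables as the original script.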

Real-time Graphics Updates

The Atari doesn’t really cache anything; any changes to graphics registers take immediate effect on the cycle they’re named on. We made use of that when placing sprites, but there’s more work happening during sprite placement so it doesn’t quite line up with graphics updates. For those, we rely on the usual 3-pixels-per-cycle ratio, but then we also need to assert that HBLANK is exactly 68 pixels long. Since we need our writes to happen as late as possible, we want our final write to finish as close to pixel 96 as possible without actually reaching it. That means we ideally want to land it at cycle ⌊(96+68)/3⌋=54.
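The conversion between cycles and pixels is simple enough to spell out. This Python sketch just restates the formula above; the function names are invented for the example.

```python
HBLANK_PIXELS = 68      # horizontal blank before the visible line begins
PIXELS_PER_CYCLE = 3    # the beam advances three pixels per CPU cycle


def target_cycle(pixel):
    """Floor of (pixel + HBLANK) / 3, the formula used in the text."""
    return (pixel + HBLANK_PIXELS) // PIXELS_PER_CYCLE


def pixel_at_cycle(cycle):
    """Visible pixel the beam has reached after `cycle` CPU cycles."""
    return cycle * PIXELS_PER_CYCLE - HBLANK_PIXELS


print(target_cycle(96))     # 54
print(pixel_at_cycle(54))   # 94
print(pixel_at_cycle(56))   # 100
```

Cycle 54 corresponds to pixel 94, two pixels short of the final copy, while cycle 56 would already be at pixel 100.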

Constructing the Loop

The loop we build here will end up looking a lot different from the loops we’ve made in other parts of this code, or indeed for other 6502-based platforms. Normally, we use a register to hold our loop index. We can’t do that here, though, because we need to use every register to store graphics data. We’ll need to dedicate a byte of scratch RAM to holding our loop counter.

Furthermore, we also need to use the X or Y register as the index in order to look up the graphics in our tables. That’s a bit problematic because we’ll need to store an actual value in that register too, so we’ll need to use another byte of scratch RAM for that as well.

Happily, we already declared two bytes of scratch RAM so that our make_move routine didn’t have to do any bounds checks. Since we don’t call make_move here and we don’t need to persist any values outside of this loop, we can just re-use them.

A First Cut

We need to start by turning on the vertical delay mode for our player sprites.

        lda     #$01
        sta     VDELP0
        sta     VDELP1

Then we begin the loop, counting down from an offset of 11, and starting off with our traditional synchronization for the line. We also need to reload the value of scrtch1 into the Y register so we can use it to index into our logo tables.

        ldy     #$0b
        sty     scrtch1
*       sta     WSYNC
        ldy     scrtch1         ; +3 (3)

After that we can load in the first three values:

        lda     logo_0,y        ; +4 (7)
        sta     GRP0            ; +3 (10)
        lda     logo_1,y        ; +4 (14)
        sta     GRP1            ; +3 (17)
        lda     logo_2,y        ; +4 (21)
        sta     GRP0            ; +3 (24)

At this point, the “old” value of GRP0 is the logo_0 value, the “new” value is the logo_2 value, and then both values of GRP1 are the logo_1 value. Now we’ll load values into the registers to get them ready to go. This is where we juggle some values through scrtch2:

        lda     logo_3,y        ; +4 (28)
        sta     scrtch2         ; +3 (31)
        lda     logo_4,y        ; +4 (35)
        tax                     ; +2 (37)
        lda     logo_5,y        ; +4 (41)
        ldy     scrtch2         ; +3 (44)

Now we have to cycle through the rest of the graphics. This gets a little odd because in terms of timing we’ll be (thanks to the VDEL behavior) updating the graphics for the sprite currently being drawn, and our final write is going to end up being to GRP0, to flush out the last value in the player 1 sprite. It doesn’t matter what we write there, just that we write to it. With that we then clear out the rest of the loop:

        sty     GRP1            ; +3 (47)
        stx     GRP0            ; +3 (50)
        sta     GRP1            ; +3 (53)
        sta     GRP0            ; +3 (56)
        dec     scrtch1         ; +5 (61)
        bpl     -               ; +3 (64)

That code looks great, with one unfortunate exception: our final write is on cycle 56, and our target cycle is cycle 54. We need to somehow delay for negative two cycles.

For this precise case it turns out we could optimize away those two cycles. We can replace the two instructions LDA logo_4,y; TAX with the single instruction LDX logo_4,y and save our two cycles. However, there is no LDX equivalent for reading through a pointer, so if we were doing non-fixed data (like a score, which was the original use case for this kind of display) we’d need to stick to LDA everywhere.

A simpler solution would be to notice that we do not need to actually do our WSYNC at the beginning of the loop. We can move it two instructions further down, which frees up seven cycles, five of which we then spend on no-ops. That costs us a few bytes of space but no real time (after all, at the end of the day each loop through costs exactly one scanline), and that gives us our final routine:

        lda     #$01
        sta     VDELP0
        sta     VDELP1
        ldy     #$0b
        sty     scrtch1
*       ldy     scrtch1         ; +3 (65)
        lda     logo_0,y        ; +4 (69)
        sta     WSYNC           ; +3 (72->0)
        sta     GRP0            ; +3 (3)
        lda     logo_1,y        ; +4 (7)
        sta     GRP1            ; +3 (10)
        lda     logo_2,y        ; +4 (14)
        sta     GRP0            ; +3 (17)
        lda     logo_3,y        ; +4 (21)
        sta     scrtch2         ; +3 (24)
        lda     logo_4,y        ; +4 (28)
        tax                     ; +2 (30)
        lda     logo_5,y        ; +4 (34)
        ldy     scrtch2         ; +3 (37)
        nop                     ; +2 (39)
        cpx     $80             ; +3 (42)
        sty     GRP1            ; +3 (45)
        stx     GRP0            ; +3 (48)
        sta     GRP1            ; +3 (51)
        sta     GRP0            ; +3 (54)
        dec     scrtch1         ; +5 (59)
        bpl     -               ; +3 (62)

        lda     #$00            ; +2 (63)
        sta     GRP0            ; +3 (66)
        sta     GRP1            ; +3 (69)
        sta     VDELP0          ; +3 (72)
        sta     VDELP1          ; +3 (75)

There’s one last wrinkle here. Once we’ve finished cleaning up our work, we’re on cycle 75 from the last sync. There are only 76 cycles per line, so when we go looping through the next set of blank lines, we’ll have already completed a line just cleaning up our last one here. As a result, we need to decrease the number of iterations by one. There is no requirement that every scanline be synched to; we just normally are doing so little work per scanline that we count out line-syncs just for lack of anything better to do with the last few cycles each line. Not this time, though.
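Stepping back to the loop timing itself, here is a rough Python model of why finishing the last write on cycle 56 is too late while cycle 54 works. It rests on several simplifying assumptions that are mine rather than the hardware's: a store takes effect at the pixel where its last cycle ends, the six copies start exactly at pixels 56 through 96, and with vertical delay on each copy shows its player's "old" slot for all eight pixels. Real TIA timing has more subtleties than this.

```python
HBLANK_PIXELS = 68
PIXELS_PER_CYCLE = 3
SPRITE_WIDTH = 8

# (start pixel, owning player) for the six interleaved copies.
COPIES = [(56, 0), (64, 1), (72, 0), (80, 1), (88, 0), (96, 1)]


def simulate(writes):
    """writes: list of (cycle on which the store completes, player, value).
    Returns, per copy, the set of values shown across its eight pixels."""
    events = sorted((c * PIXELS_PER_CYCLE - HBLANK_PIXELS, p, v)
                    for c, p, v in writes)
    new = [None, None]
    old = [None, None]
    shown = []
    pending = 0
    for start, player in COPIES:
        seen = set()
        for pixel in range(start, start + SPRITE_WIDTH):
            # Apply every store that has landed by this pixel.
            while pending < len(events) and events[pending][0] <= pixel:
                _, p, v = events[pending]
                new[p] = v
                old[1 - p] = new[1 - p]   # writing one player latches the other
                pending += 1
            seen.add(old[player])          # VDEL on: display the old slot
        shown.append(seen)
    return shown


# Store-completion cycles taken from the two listings; values are logo column numbers.
first_cut = [(10, 0, 0), (17, 1, 1), (24, 0, 2),
             (47, 1, 3), (50, 0, 4), (53, 1, 5), (56, 0, 5)]
final = [(3, 0, 0), (10, 1, 1), (17, 0, 2),
         (45, 1, 3), (48, 0, 4), (51, 1, 5), (54, 0, 5)]

print(simulate(final))
print(simulate(first_cut))
```

In this model the final routine shows exactly one column per copy, in order, while the first cut lets stale data leak into the first few pixels of the last four copies.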

The Final Result

We have a logo!

[Screenshot: lightsout_11]

That’s all I really wanted to do with this project. The only thing left to do is package it for release and do a summary post.