Hello World, 4 Different Ways

We’ve written some pretty complicated code on this blog over the years. Let’s write a really simple program instead.

In fact, let’s write the simplest program: Hello World. And since it’s the simplest program, let’s write it without any language support.

I’m going to do this for four different platforms, as a way of exploring how to get minimal but full programs to run in various environments.

Platform 1: Commodore 64

The Commodore 64 shipped with an 8KB ROM chip called the KERNAL that contained routines for doing basic I/O and system management tasks. Entry points for routines in the KERNAL were standardized across Commodore’s various models, so a library routine written for the VIC-20 had an excellent chance of continuing to work unmodified when incorporated into a Commodore 128 program. For a Hello World program, the routine that we need to use is CHROUT, at location $FFD2, which prints out a character and handles things like cursor placement and screen scrolling as necessary.

        .word   $0801
        .org    $0801
        .word   basend, 2016
        .byte   $9e, "2061",0
basend: .word   0

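        ldy     #$00        ; Y indexes into the message
loop:   lda     msg, y      ; fetch the next character
        beq     done        ; a zero byte ends the string
        jsr     $ffd2       ; CHROUT
        iny                 ; advance to the next character
        bne     loop        ; always taken for strings under 256 bytes
done:   rts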

msg:    .byte   "HELLO, WORLD!", 13, 0

Build command: ophis -o hello.prg hello.asm

This turns into 43 bytes, of which 14 are actually code.
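
As a cross-check on those numbers, here is a minimal Python sketch that lays the file out by hand. The opcode bytes are the standard 6502 encodings; the helper name build_prg is made up for illustration, and the sketch assumes the loop advances Y with an INY on every pass, which is what a 14-byte code count implies.

```python
import struct

LOAD_ADDR = 0x0801  # start of BASIC memory on the C64


def build_prg() -> bytes:
    """Hand-assemble the Hello World PRG (hypothetical helper, not Ophis output)."""
    # One-line BASIC program: 2016 SYS2061
    stub = struct.pack("<HH", LOAD_ADDR + 10, 2016)  # next-line pointer, line number
    stub += bytes([0x9E]) + b"2061" + b"\x00"        # SYS token, digits, end of line
    stub += struct.pack("<H", 0)                     # end of BASIC program

    code_addr = LOAD_ADDR + len(stub)                # 2061, the SYS target
    msg_addr = code_addr + 14                        # message follows the 14 code bytes
    code = bytes([
        0xA0, 0x00,                                  # ldy #$00
        0xB9, msg_addr & 0xFF, msg_addr >> 8,        # lda msg,y
        0xF0, 0x06,                                  # beq done
        0x20, 0xD2, 0xFF,                            # jsr $ffd2 (CHROUT)
        0xC8,                                        # iny
        0xD0, 0xF5,                                  # bne loop
        0x60,                                        # rts
    ])
    msg = b"HELLO, WORLD!" + bytes([13, 0])
    return struct.pack("<H", LOAD_ADDR) + stub + code + msg


if __name__ == "__main__":
    prg = build_prg()
    print(len(prg), "bytes total")
```

The total breaks down as 2 bytes of load address, 12 of BASIC stub, 14 of code, and 15 of message.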

Platform 2: MS-DOS

MS-DOS was designed to run on 16-bit x86 systems. The x86 has a special instruction, INT (for “interrupt”), that traps into system code. The interrupt number selects the subsystem being called, and the state of the registers determines what function to perform. The BIOS (what we would now consider part of the system’s firmware) reserved interrupt 16 (“10h”). MS-DOS reserved 33 (“21h”), and it’s got a call that will print a string directly. That makes the DOS version the shortest of them all:

        cpu     8086
        org     0x100
        bits    16

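        mov     dx, msg         ; DS:DX points at the string
        mov     ah, 0x09        ; DOS function 09h: print a "$"-terminated string
        int     21h
        ret                     ; back to DOS, so we don't run into the message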

msg:    db      "Hello, world!", 13, 10, "$"

Build command: nasm -f bin -o hello.com hello.asm

This assembles down to 24 bytes, of which a mere 8 are actually code. (Note that an MS-DOS newline is a CRLF instead of the C64’s raw CR, and that for some insane reason MS-DOS’s string terminator is the dollar sign instead of a null byte.)
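
The same tally can be sketched in Python by writing the .COM image out by hand. The opcodes are the usual 8086 encodings; build_com is a made-up helper name, and the sketch assumes the code ends with a one-byte RET after the INT 21h, which is what an 8-byte code count implies.

```python
ORG = 0x100  # a .COM image starts at offset 0x100 within its segment


def build_com() -> bytes:
    """Hand-assemble the DOS Hello World (hypothetical helper, not NASM output)."""
    code_len = 8
    msg_addr = ORG + code_len                    # the message follows the code
    code = bytes([
        0xBA, msg_addr & 0xFF, msg_addr >> 8,    # mov dx, msg
        0xB4, 0x09,                              # mov ah, 0x09
        0xCD, 0x21,                              # int 21h
        0xC3,                                    # ret
    ])
    msg = b"Hello, world!\r\n$"                  # CRLF, then the "$" terminator
    return code + msg


if __name__ == "__main__":
    print(len(build_com()), "bytes total")
```

There is no header of any kind here: the file is nothing but the 8 code bytes and the 16 message bytes.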

Let’s take a sudden step into the modern world.

Platform 3: Linux

Linux was designed by a PC hobbyist in the 1990s, and the raw internal mechanisms behind it kind of show this heritage. Linux syscalls work almost exactly like MS-DOS ones, except that the whole 32-bit EAX register is used to determine which call to make, and the calls map pretty closely to POSIX routines—in particular, the calls we need correspond to the routines write(2) and _exit(2).

        GLOBAL  _start
        section .text

_start: mov     eax, 4          ; sys_write
        mov     ebx, 1          ; STDOUT
        mov     ecx, msg
        mov     edx, msglen
        int     0x80            ; Do the syscall

        mov     eax, 1          ; sys_exit
        xor     ebx, ebx        ; Successful exit
        int     0x80

        section .data
msg:    db      "Hello, world!",10
        msglen equ ($-msg)

Build commands:

  • nasm -f elf hello.s
  • ld -melf_i386 -o hello hello.o

Now that we’re in something recognizable as the modern world, we have more things to keep track of. We use sections to split up our code and our data. We have to actually export a start symbol. We have to use a syscall to exit the program cleanly. (MS-DOS has a call for that too, but it’s optional for .COM files.) Most importantly of all, when we assemble the code, we no longer get a binary out—we get an object file, instead. We then need to feed that file to a linker to get an actual executable.
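
For comparison, the same two calls are reachable from a high-level language with very little in between. This is only an illustrative Python sketch: os.write and os._exit are thin wrappers over write(2) and _exit(2), while the names hello and main are made up for the example.

```python
import os

MSG = b"Hello, world!\n"


def hello(fd: int = 1) -> int:
    """Issue one write(2) on the given descriptor; returns the byte count."""
    return os.write(fd, MSG)


def main() -> None:
    hello()        # plays the role of sys_write on STDOUT
    os._exit(0)    # plays the role of sys_exit with status 0; never returns
```

Calling main() prints the line and ends the process immediately, much like the assembly version does.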

That object file is actually 624 bytes long—larger than our source code! The resulting executable shrinks down to 360 bytes, but that’s still a dramatic increase in overhead for a program that is only 31 bytes of code and 14 bytes of data.
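
To see where the 31 and 14 come from, the text section can be spelled out byte by byte. This Python sketch uses the standard 32-bit x86 encodings; MSG_ADDR is a placeholder, since the real address of msg is only assigned at link time.

```python
MSG_ADDR = 0  # placeholder: the linker decides where msg really lives


def imm32(value: int) -> bytes:
    """Encode a 32-bit little-endian immediate."""
    return value.to_bytes(4, "little")


data = b"Hello, world!\n"

text = (
    b"\xB8" + imm32(4)            # mov eax, 4       (sys_write)
    + b"\xBB" + imm32(1)          # mov ebx, 1       (STDOUT)
    + b"\xB9" + imm32(MSG_ADDR)   # mov ecx, msg
    + b"\xBA" + imm32(len(data))  # mov edx, msglen
    + b"\xCD\x80"                 # int 0x80
    + b"\xB8" + imm32(1)          # mov eax, 1       (sys_exit)
    + b"\x31\xDB"                 # xor ebx, ebx
    + b"\xCD\x80"                 # int 0x80
)

if __name__ == "__main__":
    print(len(text), "code bytes,", len(data), "data bytes")
    print(360 - len(text) - len(data), "bytes of the executable are something else")
```

Most of the code size is the five MOVs with 32-bit immediates, at 5 bytes apiece.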

(There are a variety of hilariously perverse things you can do to minimize the size of an ELF binary, but I am pointedly trying to play by the rules for these. After all, the point of Hello World is that the lessons learned in writing and building it generalize.)

With that under our belt, let’s do something really different.

Platform 4: Windows 2000

…which is also Windows XP, Vista, Windows 7, 8, 8.1, and 10. Windows is pretty good about keeping its absolutely core APIs stable.

One of the ways it does that is by having the “system calls” actually be functions that live in DLLs, which you proceed to import and call from like any other functions in any other DLL. That means we have to do a great deal more bookkeeping in a raw Win32 application, and in particular, we have to respect the ABI that the Windows calls use. (Interestingly, that ABI is not the C ABI. Signaling the compiler to that effect is why windows.h includes WINAPI declarations on all the functions.)

The Windows system call ABI uses what Microsoft C calls the __stdcall convention. Arguments are all passed on the stack, right to left (so that you get the arguments in order when you pop them, or if you add an ever-increasing index to the stack pointer); return values come back in EAX; the called function will pop the arguments off the stack for you; EAX, ECX, and EDX will be trashed but all other registers preserved.
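
The stack discipline is easy to get backwards, so here is a toy Python model of it. Everything in it is illustrative: the dictionary stands in for memory, the starting ESP value is arbitrary, and the model shows the stack just before the CALL instruction (the return address that CALL pushes is left out).

```python
def stdcall_push(args, esp=0x1000):
    """Push args right to left, 4 bytes each; return the new ESP and memory."""
    memory = {}
    for value in reversed(args):
        esp -= 4              # the stack grows downward
        memory[esp] = value
    return esp, memory


def callee_cleanup(esp, nargs):
    """Under __stdcall the called function removes its own arguments."""
    return esp + 4 * nargs


if __name__ == "__main__":
    esp, memory = stdcall_push(["first", "second", "third"])
    for i in range(3):
        print(f"[esp+{4 * i}] = {memory[esp + 4 * i]}")
```

Because the last argument is pushed first, the first argument ends up at the lowest address, so increasing offsets from ESP walk the arguments in source order.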

Wrangling the Win32 API means we need to not only use the stack for arguments but provide writable locations for return values when they’re written through pointers. Our Hello World here uses actual stack operations, almost like we were writing a real program in a proper programming language:

        GLOBAL  _start
        EXTERN  _ExitProcess@4, _GetStdHandle@4, _WriteConsoleW@20

        ;; Main program
        section .text
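_start: push    -11                     ; STD_OUTPUT_HANDLE
        call    _GetStdHandle@4         ; handle comes back in EAX
        sub     esp, 24                 ; 5 arguments + 1 result slot
        mov     [esp], eax              ; hConsoleOutput
        mov     dword [esp+4], msg      ; lpBuffer
        mov     dword [esp+8], msglen   ; nNumberOfCharsToWrite
        mov     eax, esp
        add     eax, 20
        mov     [esp+12], eax           ; lpNumberOfCharsWritten -> [esp+20]
        xor     eax, eax
        mov     [esp+16], eax           ; lpReserved = NULL
        call    _WriteConsoleW@20       ; callee pops the 20 argument bytes
        xor     eax, eax
        mov     [esp], eax              ; exit code 0, reusing the result slot
        call    _ExitProcess@4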

        ;; Our message
        section .data
msg:    dw      __utf16__("Hello, world!"),13,10
        msglen equ ($-msg)/2

Build commands:

  • nasm -f win32 hello.asm
  • link /subsystem:console /nodefaultlib /entry:start hello.obj Kernel32.Lib

(Note that these commands assume that all the relevant support files and executables are in your paths and library and link paths are also properly set. You’ll probably have a much longer LINK.EXE line than this.)

This produces a file HELLO.EXE that is exactly 3KB in size. The HELLO.OBJ file is a more modest 534 bytes (smaller than the Linux one), but the Windows executable format enforces 512-byte alignment and padding restrictions, and this blows up the file size considerably. (We can actually drop the executable size to 2.5KB by folding the data segment into the text segment.)

That said, we’ve got more raw work to do here too. The standard output handle isn’t fixed to a constant the way it is on POSIX, so we need to make a call to get it. The WriteConsole function has five arguments, and it writes through a pointer that must be valid as part of its operation. We reserve 24 bytes on the stack for this: 20 for the arguments, and another 4 to store the number of characters written. (Interestingly, Windows is kind of halfway between DOS and Linux here, using CRLF instead of LF to end lines, but using an explicit character count instead of delimiters.) Finally, we need to call ExitProcess to quit normally.

All of that gives us 59 bytes of program text from our code.
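
The effect of the 512-byte rule can be sketched with a little rounding arithmetic in Python. This does not reconstruct the real 3KB layout (the headers and import tables are not modelled); it only shows what padding does to sections this small. The figure of 19 extra text bytes anticipates the INT 3 and import stubs that show up in the disassembly further down.

```python
FILE_ALIGNMENT = 512  # on-disk section granularity described above


def align_up(size: int, alignment: int = FILE_ALIGNMENT) -> int:
    """Round size up to the next multiple of alignment."""
    return -(-size // alignment) * alignment


text_bytes = 59 + 1 + 18   # our code, one INT 3, three 6-byte import stubs
data_bytes = 15 * 2        # 15 UTF-16 code units

separate = align_up(text_bytes) + align_up(data_bytes)
merged = align_up(text_bytes + data_bytes)

if __name__ == "__main__":
    print("separate sections:", separate, "bytes on disk")
    print("merged section:   ", merged, "bytes on disk")
```

Merging the two sections saves exactly one 512-byte unit in this model, which is consistent with the drop from 3KB to 2.5KB.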

So What’s Using All That Space?

All our programs are noticeably larger than the code itself. Of course, in counting the bytes of code, I was completely ignoring the data: the actual “Hello, world!” being printed. That string is 13 characters long, the newline following it is an additional 1 or 2 characters depending on platform, and some platforms also require an additional character for a string delimiter.

Once we account for that, we’ve actually completely accounted for every single byte in the DOS program. The .COM format has zero bytes of overhead: MS-DOS is allowed to load a .COM program into any 16-byte-aligned memory location in its 640KB of program space. It sets CS, DS, and ES equal to each other, chosen so that the first byte of the .COM file is at CS:0100, which is also the first byte executed. (The first 256 bytes in the code segment store information about the process, such as the command line used to invoke it.)
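
The segment arithmetic behind that is simple enough to sketch in Python. A real-mode address is segment times 16 plus offset, which is why any 16-byte boundary will do as a load point. The segment value in the example is arbitrary, and the 0x80 offset for the command tail is the conventional location inside that 256-byte block.

```python
PSP_SIZE = 0x100  # the 256 bytes of process information ahead of the program


def linear(segment: int, offset: int) -> int:
    """Convert a real-mode segment:offset pair to a linear address."""
    return (segment << 4) + offset


def com_layout(segment: int) -> dict:
    """Linear addresses of the interesting spots for a .COM loaded at segment."""
    return {
        "psp": linear(segment, 0x0000),
        "command_tail": linear(segment, 0x0080),
        "entry": linear(segment, PSP_SIZE),
    }


if __name__ == "__main__":
    for name, addr in com_layout(0x1234).items():
        print(f"{name:13s} {addr:#07x}")
```

Whatever segment DOS picks, the program itself always sees its first byte at offset 0x100, so the code needs no relocation.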

The Commodore 64 is only slightly more expensive. The 6502 chip has a flat address space, and so has no real equivalent to a code segment register. The PRG format used by the C64 has a two-byte header that indicates where a program is to be loaded if it is loaded with a command like LOAD "HELLO",8,1. If it is instead loaded with a command like LOAD "HELLO",8, this value is ignored and the program is loaded at the start of BASIC memory (which on the C64 is $0801). We want our program to work either way, so we specify $0801 as our intended loading address, and also put in a one-line BASIC program that transfers control to our machine language program. Our BASIC program is 12 bytes long and the PRG header is another 2. That accounts for the extra space.

A case can be made that the C64 program is larger than it needs to be, too—the BASIC ROM includes a routine at $AB1E that prints a string with semantics almost exactly like our loop. Setting up and making that call can make our program shrink from 14 bytes to 7. However, for the purposes of this exercise, I am treating that as taking advantage of the BASIC runtime, and thus cheating in the same way calling _printf would be cheating for the modern cases.
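
One way to arrive at 7 bytes, sketched in Python: load the string address into A and Y and jump straight to the ROM routine, letting its own RTS return to BASIC. The register convention (low byte in A, high byte in Y) is how this routine is commonly documented, so treat it as an assumption, and build_strout_call is a made-up helper name.

```python
STROUT = 0xAB1E  # BASIC ROM string printer mentioned above


def build_strout_call(msg_addr: int) -> bytes:
    """Encode: lda #<msg / ldy #>msg / jmp $ab1e (hypothetical helper)."""
    return bytes([
        0xA9, msg_addr & 0xFF,               # lda #<msg
        0xA0, msg_addr >> 8,                 # ldy #>msg
        0x4C, STROUT & 0xFF, STROUT >> 8,    # jmp $ab1e, its RTS returns for us
    ])


if __name__ == "__main__":
    code = build_strout_call(0x080D + 7)     # message right after the code
    print(code.hex(" "), f"({len(code)} bytes)")
```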

The Linux case is more interesting. Even in a stripped binary there is actually a fair amount of data that isn’t our program, and a hex dump doesn’t see any runs of more than about 40 zeroes in a row. However, this seems to all be metadata. If we use a command like objdump -d hello, there’s no extra stuff in what it sees. Our source code is echoed back to us identically, except for numeric values for our locations and a shift to the AT&T assembler syntax.

Windows is more fun. There’s a whole lot of zeroes in it, because of the 512-byte alignment and padding requirements. The “Hello world” string is actually encoded in UTF-16, taking up twice the space, but with no impact on our binary size since the total data size is still far less than 512. But the header information is also gigantic, because the Portable Executable format (“PE32”) which Windows uses is cunningly engineered to look like a valid .EXE file to DOS as well. The DOS part is a completely different program, very similar to our DOS Hello World program, except that it prints a message about how this is a Windows program and not a DOS one. I’d have to dig into the internals to be sure, but I’m fairly certain that the PE32 format locates all of its real data through offsets that can be arbitrary. That would mean that you could put a full-scale DOS program into the PE32 and have an .EXE file that incorporated its own DOS and Windows ports. (I’m unaware of anyone doing this. There may be reasons you can’t.)
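
The hand-off between the two halves can be sketched in Python. The DOS-style header starts with “MZ”, and the 32-bit field at offset 0x3C says where the “PE” signature lives. The fake_image helper below is purely illustrative: it builds a skeleton with a stub of any size, not a loadable executable.

```python
import struct


def pe_header_offset(image: bytes) -> int:
    """Follow the field at 0x3C of the DOS header to the PE signature."""
    if image[:2] != b"MZ":
        raise ValueError("not a DOS-style executable")
    (offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[offset:offset + 4] != b"PE\0\0":
        raise ValueError("no PE signature at the advertised offset")
    return offset


def fake_image(stub_size: int) -> bytes:
    """Hypothetical skeleton: DOS header, stub_size filler bytes, PE signature."""
    header = bytearray(0x40)
    header[:2] = b"MZ"
    struct.pack_into("<I", header, 0x3C, 0x40 + stub_size)
    return bytes(header) + bytes(stub_size) + b"PE\0\0"


if __name__ == "__main__":
    for size in (64, 5000):
        print(size, "byte stub -> PE header at", hex(pe_header_offset(fake_image(size))))
```

At least as far as this one pointer goes, nothing constrains how much DOS code sits between the two headers.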

Windows is also the only case where we needed to link against somebody else’s libraries (in particular, against KERNEL32.LIB). That means we should see some stuff in our text segment that wasn’t written by us. The command DUMPBIN /disasm:BYTES HELLO.EXE lets us check. And indeed we do see the code KERNEL32.LIB added, but there’s an extra surprise as well:

Dump of file hello.exe


  00401000: 6A F5              push        0FFFFFFF5h
  00401002: E8 3B 00 00 00     call        00401042
  00401007: 83 EC 18           sub         esp,18h
  0040100A: 89 04 24           mov         dword ptr [esp],eax
  0040100D: C7 44 24 04 00 30  mov         dword ptr [esp+4],403000h
            40 00
  00401015: C7 44 24 08 0F 00  mov         dword ptr [esp+8],0Fh
            00 00
  0040101D: 89 E0              mov         eax,esp
  0040101F: 83 C0 14           add         eax,14h
  00401022: 89 44 24 0C        mov         dword ptr [esp+0Ch],eax
  00401026: 31 C0              xor         eax,eax
  00401028: 89 44 24 10        mov         dword ptr [esp+10h],eax
  0040102C: E8 17 00 00 00     call        00401048
  00401031: 31 C0              xor         eax,eax
  00401033: 89 04 24           mov         dword ptr [esp],eax
  00401036: E8 01 00 00 00     call        0040103C
  0040103B: CC                 int         3
  0040103C: FF 25 08 20 40 00  jmp         dword ptr ds:[00402008h]
  00401042: FF 25 00 20 40 00  jmp         dword ptr ds:[00402000h]
  00401048: FF 25 04 20 40 00  jmp         dword ptr ds:[00402004h]

Importing three functions cost us 18 bytes of program text, and rather more space in the .EXE for the tables that say which DLLs must be loaded along with it. The code itself is a simple indirect jump through a jump table, and the addresses in the listing (00402000h and up) put that table in our own executable’s image rather than in KERNEL32.DLL. (In Windows, DLLs and EXEs are basically identical formats, differing only in a single field that indicates primary purpose. When a DLL is loaded, it gets dumped into memory somewhere, and the table entries naming the functions we care about are replaced with those functions’ actual addresses. Windows does all that for us while loading us into memory.)
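
The byte column of the dump can be decoded by hand to confirm who calls what. In this Python sketch, call_target resolves an E8 relative call and thunk_slot reads the table address out of an FF 25 indirect jump. Both function names are made up; the byte strings are copied from the listing above.

```python
import struct


def call_target(address: int, encoded: bytes) -> int:
    """Target of a 5-byte E8 call: next instruction address plus signed offset."""
    if encoded[0] != 0xE8 or len(encoded) != 5:
        raise ValueError("not a near relative call")
    (rel,) = struct.unpack("<i", encoded[1:])
    return address + 5 + rel


def thunk_slot(encoded: bytes) -> int:
    """Address of the table entry used by a 6-byte FF 25 indirect jump."""
    if encoded[:2] != b"\xFF\x25" or len(encoded) != 6:
        raise ValueError("not an absolute indirect jump")
    (slot,) = struct.unpack("<I", encoded[2:])
    return slot


if __name__ == "__main__":
    first_call = call_target(0x00401002, bytes.fromhex("E83B000000"))
    print(hex(first_call), "->", hex(thunk_slot(bytes.fromhex("FF2500204000"))))
```

Each of our three calls lands on one of the 6-byte stubs, and each stub jumps through its own 4-byte slot starting at 00402000h.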

But then there’s that final one-byte instruction after our call to ExitProcess. INT 3 is the x86 breakpoint instruction, which Windows reports as a “Debug Breakpoint hit.” The linker has no reason to know or believe that the call to ExitProcess will never return, and it’s worried that we’re going to walk off the end of our .OBJ file. If we were to add a spurious RET instruction after the call to ExitProcess, we would actually not see this instruction generated. So that is also kind of a neat touch.

So, What Have We Learned?

We’ve learned that if you want to write an entire application in raw assembly code, you can totally still do this in the modern era. It’s just a lot more work, and in the Linux and Windows cases particularly, the parts where you hit other libraries are parts where it’s really going to feel like you might as well be letting a C compiler do the work.

