We’ve written some pretty complicated code on this blog over the years. Let’s write a really simple program instead.
In fact, let’s write the simplest program: Hello World. And since it’s the simplest program, let’s write it without any language support.
I’m going to do this for four different platforms, as a way of exploring how to get minimal but full programs to run in various environments.
Platform 1: Commodore 64
The Commodore 64 shipped with an 8KB ROM chip called the KERNAL that contained routines for doing basic I/O and system management tasks. Entry points for routines in the KERNAL were standardized across Commodore's various models, so a library routine written for the VIC-20 had an excellent chance of continuing to work unmodified when incorporated into a Commodore 128 program. For a Hello World program, the routine that we need to use is CHROUT, at location $FFD2, which prints out a character and handles things like cursor placement and screen scrolling as necessary.
            .word   $0801
            .org    $0801

            .word   basend, 2016
            .byte   $9e, "2061",0
    basend: .word   0

            ldy     #$00
    loop:   lda     msg, y
            beq     done
            jsr     $ffd2           ; CHROUT
            iny
            bne     loop
    done:   rts

    msg:    .byte   "HELLO, WORLD!", 13, 0
ophis -o hello.prg hello.asm
This turns into 43 bytes, of which 14 are actually code.
Platform 2: MS-DOS
MS-DOS was designed to run on 16-bit x86 systems. The x86 has a special instruction, INT (for “interrupt”), that traps into system code. The state of the registers determines what subsystem is being called and what function to perform. The BIOS (what we would now consider part of the system’s firmware) reserved interrupt 16 (“10h”). MS-DOS reserved 33 (“21h”). And it’s got a call that will print a string directly. That makes the DOS version the shortest of them all:
            cpu     8086
            org     0x100
            bits    16

            mov     dx, msg
            mov     ah, 0x09
            int     21h
            ret

    msg:    db      "Hello, world!", 13, 10, "$"
nasm -f bin -o hello.com hello.asm
This assembles down to 24 bytes, of which a mere 8 are actually code. (Note that an MS-DOS newline is a CRLF instead of the C64’s raw CR, and that for some insane reason MS-DOS’s string terminator is the dollar sign instead of a null byte.)
Let’s take a sudden step into the modern world.
Platform 3: Linux
Linux was designed by a PC hobbyist in the 1990s, and the raw internal mechanisms behind it kind of show this heritage. Linux syscalls work almost exactly like MS-DOS ones, except that the whole 32-bit EAX register is used to determine which call to make, and the calls map pretty closely to POSIX routines. In particular, the calls we need correspond to the routines write and exit.
            GLOBAL  _start

            section .text
    _start: mov     eax, 4          ; sys_write
            mov     ebx, 1          ; STDOUT
            mov     ecx, msg
            mov     edx, msglen
            int     0x80            ; Do the syscall
            mov     eax, 1          ; sys_exit
            xor     ebx, ebx        ; Successful exit
            int     0x80

            section .data
    msg:    db      "Hello, world!",10
    msglen  equ     ($-msg)
nasm -f elf hello.s
ld -melf_i386 -o hello hello.o
Now that we’re in something recognizable as the modern world, we have more things to keep track of. We use sections to split up our code and our data. We have to actually export a start symbol. We have to use a syscall to exit the program cleanly. (MS-DOS has a call for that too, but it’s optional for .COM files.) Most importantly of all, when we assemble the code, we no longer get a binary out—we get an object file, instead. We then need to feed that file to a linker to get an actual executable.
That object file is actually 624 bytes long—larger than our source code! The resulting executable shrinks down to 360 bytes, but that’s still a dramatic increase in overhead for a program that is only 31 bytes of code and 14 bytes of data.
(There are a variety of hilariously perverse things you can do to minimize the size of an ELF binary, but I am pointedly trying to play by the rules for these. After all, the point of Hello World is that the lessons learned in writing and building it generalize.)
With that under our belt, let’s do something really different.
Platform 4: Windows 2000
…which is also Windows XP, Vista, Windows 7, 8, 8.1, and 10. Windows is pretty good about keeping its absolutely core APIs stable.
One of the ways it does that is by having the “system calls” actually be functions that live in DLLs, which you import and call like any other functions in any other DLL. That means we have to do a great deal more bookkeeping in a raw Win32 application, and in particular, we have to respect the ABI that the Windows calls use. (Interestingly, that ABI is not the C ABI. Signaling the compiler to that effect is why windows.h includes WINAPI declarations on all the functions.)
The Windows system call ABI uses what Microsoft C calls the __stdcall convention. Arguments are all passed on the stack, right to left (so that you get the arguments in order when you pop them, or when you read at ever-increasing offsets from the stack pointer); return values come back in EAX; the called function will pop the arguments off the stack for you; EAX, ECX, and EDX will be trashed but all other registers preserved.
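To make the right-to-left rule concrete, here is a toy model of the stack, written in Python rather than assembly purely as a checking aid. It simulates nothing about real Windows; the starting stack address and the argument labels are made up for illustration (the labels are just shorthand for the five WriteConsole parameters).

```python
# Toy model of a 32-bit x86 stack, only to illustrate __stdcall argument order.
# The starting ESP value and the argument labels are illustrative assumptions.

def push_args_right_to_left(args, esp=0x1000):
    """Push 4-byte arguments right to left; return (new_esp, memory)."""
    memory = {}
    for value in reversed(args):
        esp -= 4              # the stack grows downward
        memory[esp] = value
    return esp, memory

args = ["hConsole", "lpBuffer", "nChars", "lpWritten", "lpReserved"]
esp, memory = push_args_right_to_left(args)

# Reading at increasing offsets from ESP yields the arguments in declaration order.
in_order = [memory[esp + 4 * i] for i in range(len(args))]
print(in_order)
```

The first argument ends up at the lowest address, which is why a callee (or a caller that fills in the slots by hand) can address argument number i at ESP plus four times i.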
Wrangling the Win32 API means we need to not only use the stack for arguments but provide writable locations for return values when they’re written through pointers. Our Hello World here uses actual stack operations, almost like we were writing a real program in a proper programming language:
            GLOBAL  _start
            EXTERN  _ExitProcess@4, _GetStdHandle@4, _WriteConsoleW@20

    ;; Main program
            section .text
    _start: push    -11                     ; STD_OUTPUT_HANDLE
            call    _GetStdHandle@4
            sub     esp, 24
            mov     [esp], eax
            mov     dword [esp+4], msg
            mov     dword [esp+8], msglen
            mov     eax, esp
            add     eax, 20
            mov     [esp+12], eax
            xor     eax, eax
            mov     [esp+16], eax
            call    _WriteConsoleW@20
            xor     eax, eax
            mov     [esp], eax
            call    _ExitProcess@4

    ;; Our message
            section .data
    msg:    dw      __utf16__("Hello, world!"),13,10
    msglen  equ     ($-msg)/2
nasm -f win32 hello.asm
link /subsystem:console /nodefaultlib /entry:start hello.obj Kernel32.Lib
(Note that these commands assume that all the relevant support files and executables are in your paths and library and link paths are also properly set. You’ll probably have a much longer LINK.EXE line than this.)
This produces a file HELLO.EXE that is exactly 3KB in size. The HELLO.OBJ file is a more modest 534 bytes (smaller than the Linux one), but the Windows executable format enforces 512-byte alignment and padding restrictions and this blows up the file size considerably. (We can actually drop the executable size to 2.5KB by folding the data segment into the text segment.)
That said, we’ve got more raw work to do here too. The standard output handle isn’t fixed to a constant the way it is on POSIX, so we need to make a call to get it. The WriteConsole function has five arguments, and it writes through a pointer that must be valid as part of its operation. We reserve 24 bytes on the stack for this: 20 for the arguments, and another 4 to store the number of characters written. (Interestingly, Windows is kind of halfway between DOS and Linux here, using CRLF instead of LF to end lines, but using an explicit character count instead of delimiters.) Finally, we need to call ExitProcess to quit normally.
All of that gives us 59 bytes of program text from our code.
So What’s Using All That Space?
All our programs are noticeably larger than the code itself. Of course, in counting the bytes of code, I was completely ignoring the actual data: the “Hello, world!” being printed. That string is 13 characters long, and the newline following it is an additional 1 or 2 characters depending on platform, and then some platforms also require an additional character for a string delimiter.
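As a quick sanity check on that arithmetic, here is a small Python sketch (not part of any of the four builds) that spells out the message bytes for each platform. One assumption to flag: the C64 string is written here as plain ASCII bytes, whereas the assembler actually emits PETSCII; the byte count is the same either way.

```python
# Message data as laid out in each program's source.
# Assumption: the C64 entry uses ASCII bytes as a stand-in for PETSCII.
messages = {
    "C64": b"HELLO, WORLD!" + b"\r" + b"\x00",           # CR, then a null terminator
    "MS-DOS": b"Hello, world!\r\n$",                      # CRLF, then the "$" terminator
    "Linux": b"Hello, world!\n",                          # LF, length passed explicitly
    "Windows": "Hello, world!\r\n".encode("utf-16-le"),   # CRLF in UTF-16, explicit count
}
data_sizes = {name: len(data) for name, data in messages.items()}
print(data_sizes)
```

The Linux figure matches the 14 bytes of data quoted earlier, and the DOS figure plus its 8 bytes of code gives the full 24-byte .COM file.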
Once we account for that, we’ve actually completely accounted for every single byte in the DOS program. The .COM format has zero bytes of overhead—MS-DOS is allowed to load a .COM program into any 16-byte-aligned memory location in its 640KB of program space. It sets up the CS, DS, and ES to be equal to each other and so that the first byte of the .COM file is at CS:0100, which is also the first byte executed. (The first 256 bytes in the code segment store information about the process, such as the command line used to invoke it.)
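The zero-overhead claim is easy to check by hand-assembling the file. The following Python sketch builds the 24 bytes directly; the opcode bytes are the standard 8086 encodings for these four instructions, which I believe is what NASM emits here, but I haven't diffed this against the real hello.com, so treat it as an illustration rather than a byte-for-byte guarantee.

```python
import struct

LOAD_OFFSET = 0x100  # DOS places the first byte of a .COM file at CS:0100


def build_com():
    """Hand-assemble the DOS Hello World as a raw .COM image."""
    code_len = 8
    msg_addr = LOAD_OFFSET + code_len           # the message follows the code
    code = (
        b"\xBA" + struct.pack("<H", msg_addr)   # mov dx, msg
        + b"\xB4\x09"                           # mov ah, 0x09
        + b"\xCD\x21"                           # int 21h
        + b"\xC3"                               # ret
    )
    assert len(code) == code_len
    return code + b"Hello, world!\r\n$"


image = build_com()
print(len(image), image.hex(" "))
```

There is no header, no relocation table, and no entry-point field: the address 0x0108 baked into the first instruction only works because the load offset is always 0x100.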
The Commodore 64 is only slightly more expensive. The 6502 chip has a flat address space, and so has no real equivalent to a code segment register. The PRG format used by the C64 has a two-byte header that indicates where a program is to be loaded if it was loaded with a command like LOAD "HELLO",8,1. If it were instead loaded with a command like LOAD "HELLO",8, this value is ignored and it will load the program into the start of BASIC memory (which on the C64 is $0801). We want our program to work either way, so we specify $0801 as our intended loading address, and also put in a one-line BASIC program that transfers control to our machine language program. Our BASIC program is 12 bytes long and the PRG header is another 2. That accounts for the extra space.
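Those 14 bytes can be rebuilt by hand to see where each one goes. This Python sketch mirrors the .word and .byte directives from the listing; the only outside knowledge it relies on is that $9E is the BASIC token for SYS and that a tokenized BASIC line starts with a pointer to the next line.

```python
import struct

LOAD_ADDR = 0x0801  # start of BASIC memory on the C64


def build_prg_prefix():
    """Return (header, basic_stub) for a PRG whose BASIC line is 2016 SYS 2061."""
    # Line contents after the link pointer: line number, SYS token, digits, end of line.
    line_body = struct.pack("<H", 2016) + b"\x9e" + b"2061" + b"\x00"
    next_line = LOAD_ADDR + 2 + len(line_body)        # the `basend` label in the listing
    basic = struct.pack("<H", next_line) + line_body + b"\x00\x00"  # 0 pointer ends the program
    header = struct.pack("<H", LOAD_ADDR)
    return header, basic


header, basic = build_prg_prefix()
code_start = LOAD_ADDR + len(basic)
print(len(header), len(basic), code_start)
```

The machine code begins right after the stub, at decimal 2061, which is exactly the number the SYS statement jumps to.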
A case can be made that the C64 program is larger than it needs to be, too: the BASIC ROM includes a routine at $AB1E that prints a string with semantics almost exactly like our loop. Setting up and making that call can make our program shrink from 14 bytes to 7. However, for the purposes of this exercise, I am treating that as taking advantage of the BASIC runtime, and thus cheating in the same way calling _printf would be cheating for the modern cases.
The Linux case is more interesting. Even in a stripped binary there is actually a fair amount of data that isn’t our program, and a hex dump doesn’t see any runs of more than about 40 zeroes in a row. However, this seems to all be metadata. If we use a command like objdump -d hello, there’s no extra stuff in what it sees. Our source code is echoed back to us identically, except for numeric values for our locations and a shift to the AT&T assembler syntax.
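For a rough idea of where that metadata lives, the 52-byte ELF32 header records how many program headers and section headers the file carries and how big each one is. The Python sketch below reads those fields and adds them up. The header it parses is a made-up one: the counts of 2 program headers and 4 section headers are assumptions for illustration, not measurements of the binary built above.

```python
import struct

# Little-endian ELF32 file header: e_ident, e_type, e_machine, e_version, e_entry,
# e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize,
# e_shnum, e_shstrndx. The whole header is 52 bytes.
ELF32_HEADER = struct.Struct("<16sHHIIIIIHHHHHH")


def elf32_metadata_bytes(blob):
    """Bytes taken by the ELF header plus the program and section header tables."""
    (ident, _type, _machine, _version, _entry, _phoff, _shoff, _flags,
     ehsize, phentsize, phnum, shentsize, shnum, _shstrndx) = ELF32_HEADER.unpack_from(blob)
    if ident[:4] != b"\x7fELF" or ident[4] != 1:
        raise ValueError("not a 32-bit ELF image")
    return ehsize + phentsize * phnum + shentsize * shnum


# Hypothetical header for illustration only: 2 program headers and 4 section
# headers are made-up counts, not taken from the article's executable.
ident = b"\x7fELF" + bytes([1, 1, 1]) + bytes(9)
fake = ELF32_HEADER.pack(ident, 2, 3, 1, 0x08048080, 52, 200, 0, 52, 32, 2, 40, 4, 3)
overhead = elf32_metadata_bytes(fake)
print(overhead)
```

Pointing the same function at the first 52 bytes of a real executable would give the actual figure; the remainder of the overhead would be things like the section-name string table and alignment padding.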
Windows is more fun. There’s a whole lot of zeroes in it, because of the 512-byte alignment and padding requirements. The “Hello world” string is actually encoded in UTF-16, taking up twice the space, but with no impact on our binary size since the total data size is still far less than 512. But the header information is also gigantic, because the Portable Executable format (“PE32”) which Windows uses is cunningly engineered to look like a valid .EXE file to DOS as well. The DOS side is a completely different program, very similar to our DOS Hello World, except that it prints a message about how this is a Windows program and not a DOS one. I’d have to dig into the internals to be sure, but I’m fairly certain that the PE32 format allows arbitrary offsets for all the real data. That would mean that you could put a full-scale DOS program into the PE32 and have an .EXE file that incorporated its own DOS and Windows ports. (I’m unaware of anyone doing this. There may be reasons you can’t.)
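The padding arithmetic is simple enough to sketch. This Python snippet rounds sizes up to the 512-byte granularity described above; the helper name is mine, and the two inputs are the 59 bytes of program text and the UTF-16 message from earlier.

```python
FILE_ALIGNMENT = 512  # on-disk alignment described above


def padded_size(raw_size, alignment=FILE_ALIGNMENT):
    """Round raw_size up to the next multiple of alignment."""
    return -(-raw_size // alignment) * alignment


text_on_disk = padded_size(59)                                        # our program text
data_on_disk = padded_size(len("Hello, world!\r\n".encode("utf-16-le")))  # our message
print(text_on_disk, data_on_disk, 3 * 1024 // FILE_ALIGNMENT)
```

Both pieces round up to a full 512-byte block each, and the 3KB file is six such blocks in total, so the bulk of the executable is padding and headers rather than anything we wrote.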
Windows is also the only case where we needed to link against somebody else’s libraries (in particular, against KERNEL32.LIB). That means we should see some stuff in our text segment that wasn’t written by us. The command DUMPBIN /disasm:BYTES HELLO.EXE lets us check. And indeed we do see the code KERNEL32.LIB added, but there’s an extra surprise as well:
    Dump of file hello.exe

    File Type: EXECUTABLE IMAGE

      00401000: 6A F5              push        0FFFFFFF5h
      00401002: E8 3B 00 00 00     call        00401042
      00401007: 83 EC 18           sub         esp,18h
      0040100A: 89 04 24           mov         dword ptr [esp],eax
      0040100D: C7 44 24 04 00 30  mov         dword ptr [esp+4],403000h
                40 00
      00401015: C7 44 24 08 0F 00  mov         dword ptr [esp+8],0Fh
                00 00
      0040101D: 89 E0              mov         eax,esp
      0040101F: 83 C0 14           add         eax,14h
      00401022: 89 44 24 0C        mov         dword ptr [esp+0Ch],eax
      00401026: 31 C0              xor         eax,eax
      00401028: 89 44 24 10        mov         dword ptr [esp+10h],eax
      0040102C: E8 17 00 00 00     call        00401048
      00401031: 31 C0              xor         eax,eax
      00401033: 89 04 24           mov         dword ptr [esp],eax
      00401036: E8 01 00 00 00     call        0040103C
      0040103B: CC                 int         3
      0040103C: FF 25 08 20 40 00  jmp         dword ptr ds:[00402008h]
      00401042: FF 25 00 20 40 00  jmp         dword ptr ds:[00402000h]
      00401048: FF 25 04 20 40 00  jmp         dword ptr ds:[00402004h]
Importing three functions cost us 18 bytes of program text, and rather more space in the .EXE for the tables that say which DLLs must be loaded along with it. The code itself is a simple indirect jump through a table of pointers; judging by the 00402000h addresses in the dump, that table sits inside our own image, right after the text, rather than in KERNEL32.DLL. (In Windows, DLLs and EXEs are basically identical formats, differing only in a single field that indicates primary purpose. When a DLL is loaded, it gets mapped into memory somewhere, and the table entries that name the functions you care about are replaced with those functions’ actual addresses. Windows does all that for us while loading us into memory.)
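The bytes in the dump are enough to trace each call by hand. This Python sketch decodes the three CALL instructions (opcode E8 with a 32-bit displacement relative to the next instruction) and the three indirect jumps (FF 25 followed by a 32-bit address), using only the addresses and bytes shown above. The function names are mine.

```python
import struct


def call_target(addr, raw):
    """Decode `E8 rel32`; the displacement is relative to the next instruction."""
    assert raw[0] == 0xE8 and len(raw) == 5
    (rel,) = struct.unpack("<i", raw[1:])
    return (addr + 5 + rel) & 0xFFFFFFFF


def thunk_slot(raw):
    """Decode `FF 25 addr32`; returns the address of the pointer being jumped through."""
    assert raw[:2] == b"\xFF\x25" and len(raw) == 6
    (slot,) = struct.unpack("<I", raw[2:])
    return slot


# Addresses and bytes copied from the DUMPBIN output above.
calls = {
    0x00401002: bytes.fromhex("E83B000000"),
    0x0040102C: bytes.fromhex("E817000000"),
    0x00401036: bytes.fromhex("E801000000"),
}
thunks = {
    0x0040103C: bytes.fromhex("FF2508204000"),
    0x00401042: bytes.fromhex("FF2500204000"),
    0x00401048: bytes.fromhex("FF2504204000"),
}

resolved = {
    hex(addr): hex(thunk_slot(thunks[call_target(addr, raw)]))
    for addr, raw in calls.items()
}
print(resolved)
```

Each call lands on one of the six-byte stubs, and each stub jumps through its own four-byte slot starting at 00402000h, so the three imports occupy three consecutive pointer slots.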
But then there’s that final one-byte instruction after our call to ExitProcess. INT 3 is Windows’s “Debug Breakpoint hit” interrupt. The linker has no reason to know or believe that the call to ExitProcess will never return, and it’s worried that we’re going to walk off the end of our .OBJ file. If we were to add a spurious RET instruction after the call to ExitProcess, we would actually not see this instruction generated. So that is also kind of a neat touch.
So, What Have We Learned?
We’ve learned that if you want to write an entire application in raw assembly code, you can totally still do this in the modern era. It’s just a lot more work, and in the Linux and Windows cases particularly, the parts where you hit other libraries are parts where it’s really going to feel like you might as well be letting a C compiler do the work.