Hand-coding a Linux ELF in raw machine code
Announce at start: "I'm using the hand-coding-elf-linux skill to write this as raw machine code."
Core rule: Author the bytes. You write instruction encodings and ELF header bytes directly. A mnemonic (mov eax,1, svc #0) may appear ONLY as a trailing annotation for the reader — never as the thing you author and then assemble. Producing a .c/.s or running a compiler/assembler/linker on your logic is the failure this skill prevents. xxd, printf, head -c are byte-layers, not source. Think in machine code, not assembly.
Minimal static ELF64
A static ET_EXEC ELF that makes raw syscalls just runs on Linux: no code signing, no dynamic linker to satisfy. The whole job is two structures (the ELF header and one program header), then code. The binary is a non-PIE ET_EXEC loaded at a fixed 0x400000, so absolute addresses are known at build time.
ELF64 header (64 bytes):
| field | bytes | value |
|---|---|---|
| e_ident | 16 | magic 7F 45 4C 46 (4 bytes) + 02 class=64-bit + 01 data=LE + 01 version + 00 OSABI=SysV + 00 ABIversion + 7 zero pad = 16. (Count: 4+1+1+1+1+1+7.) |
| e_type | 2 | 0200 = ET_EXEC |
| e_machine | 2 | x86-64 = 3E00 (0x3E); aarch64 = B700 (0xB7) |
| e_version | 4 | 01000000 |
| e_entry | 8 | 0x400078 = load addr + sizeof(headers) = 0x400000 + 0x40 + 0x38 |
| e_phoff | 8 | 0x40 (64) |
| e_shoff | 8 | 0 (no sections) |
| e_flags | 4 | 0 |
| e_ehsize | 2 | 4000 (64) |
| e_phentsize | 2 | 3800 (56) |
| e_phnum | 2 | 0100 (1) |
| e_shentsize/e_shnum/e_shstrndx | 2+2+2 | 0,0,0 |
Program header `PT_LOAD` (56 bytes, ELF64 field order): p_type=1, p_flags=5 (R+X — note flags come second in ELF64), p_offset=0, p_vaddr=0x400000, p_paddr=0x400000, p_filesz=<file size>, p_memsz=<file size>, p_align=0x1000. Map the whole file from offset 0. Code starts at file offset 0x78 (= entry).
p_offset and p_vaddr must be congruent mod p_align — both are 0 mod 0x1000 here, so fine.
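As a cross-check on the two tables above, here is a minimal Python sketch that packs the same fields with `struct`. The helper names (`elf64_header`, `pt_load`) and the 0xA8 file size are illustrative assumptions, not part of the original build.

```python
import struct

LOAD_ADDR = 0x400000            # fixed non-PIE load address from the text
EHDR_SIZE, PHDR_SIZE = 64, 56

def elf64_header(e_machine, e_entry):
    """Pack the 64-byte ELF64 header in the field order of the table above."""
    # magic, class=64-bit, data=LE, version, OSABI, ABI version, then 7 pad bytes
    e_ident = bytes([0x7F, 0x45, 0x4C, 0x46, 2, 1, 1, 0, 0]) + bytes(7)
    return e_ident + struct.pack(
        "<HHIQQQIHHHHHH",
        2,            # e_type = ET_EXEC
        e_machine,    # 0x3E x86-64, 0xB7 aarch64
        1,            # e_version
        e_entry,
        EHDR_SIZE,    # e_phoff: the program header follows the ELF header
        0,            # e_shoff: no sections
        0,            # e_flags
        EHDR_SIZE,    # e_ehsize
        PHDR_SIZE,    # e_phentsize
        1,            # e_phnum
        0, 0, 0,      # e_shentsize, e_shnum, e_shstrndx
    )

def pt_load(file_size, flags=5):
    """Pack one PT_LOAD entry; in ELF64 p_flags comes right after p_type."""
    return struct.pack(
        "<IIQQQQQQ",
        1,            # p_type = PT_LOAD
        flags,        # p_flags: 5 = R+X
        0,            # p_offset: map the file from offset 0
        LOAD_ADDR,    # p_vaddr
        LOAD_ADDR,    # p_paddr
        file_size,    # p_filesz
        file_size,    # p_memsz
        0x1000,       # p_align
    )

entry = LOAD_ADDR + EHDR_SIZE + PHDR_SIZE
headers = elf64_header(0x3E, entry) + pt_load(file_size=0xA8)  # 0xA8 is a placeholder size
print(hex(entry), len(headers))   # 0x400078 120
```

The size assertions are the useful part: if either structure is not exactly 64 or 56 bytes, the entry point no longer lands on the first instruction.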
The two syscall ABIs (get these exactly right)
Syscall numbers differ wildly between arches — this is the #1 source of wrong behavior.
| | x86-64 | arm64 (aarch64) |
|---|---|---|
| syscall nr in | rax | x8 |
| args in | rdi, rsi, rdx, r10, r8, r9 | x0, x1, x2, x3, x4, x5 |
| trap instruction | syscall = 0F 05 | svc #0 = D4000001 |
| write | 1 | 64 |
| exit | 60 | 93 |
| return value | rax | x0 |
x86-64 instructions you need
mov r32, imm32 is B8+r then 4-byte LE immediate (eax=0,ecx=1,edx=2,ebx=3,esp=4,ebp=5,esi=6,edi=7):
B8 01000000 mov eax,1 ; write
BF 01000000 mov edi,1 ; fd=stdout
BE <addr32> mov esi,&msg ; abs addr works because ET_EXEC is non-PIE
BA 0E000000 mov edx,14 ; len
0F 05 syscall
B8 3C000000 mov eax,60 ; exit
BF 00000000 mov edi,0 ; status
0F 05 syscall
The string address is absolute: 0x400000 + file_offset_of_msg, written little-endian into the BE immediate. Note the string's offset is constant as long as the code in front of it is fixed-length — changing the message's text or length does NOT move the string (it always starts right after the same block of code), so the BE immediate stays the same; only edx (the length) changes. (If you ever make it PIE, swap to a RIP-relative lea; for hand-coding, stay non-PIE.)
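The address arithmetic above can be sketched in Python as a byte-layer. The function names (`mov_r32_imm32`, `hello_code`) are hypothetical; the sketch assumes the exact eight-instruction sequence listed above and that the message follows the code directly.

```python
import struct

LOAD_ADDR = 0x400000
CODE_OFFSET = 0x78   # file offset where code starts (entry - LOAD_ADDR)

# Register numbers for the B8+r encoding, as listed in the text.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

SYSCALL = bytes([0x0F, 0x05])

def mov_r32_imm32(reg, imm):
    """B8+r followed by the 4-byte little-endian immediate."""
    return bytes([0xB8 + REG32[reg]]) + struct.pack("<I", imm)

def hello_code(msg):
    # Every instruction is fixed-length, so the code size (and with it the
    # string's address) does not depend on the message contents.
    code_len = 4 * 5 + 2 + 2 * 5 + 2          # 34 bytes
    msg_addr = LOAD_ADDR + CODE_OFFSET + code_len
    code = (
        mov_r32_imm32("eax", 1)               # write
        + mov_r32_imm32("edi", 1)             # fd = stdout
        + mov_r32_imm32("esi", msg_addr)      # absolute &msg
        + mov_r32_imm32("edx", len(msg))      # length
        + SYSCALL
        + mov_r32_imm32("eax", 60)            # exit
        + mov_r32_imm32("edi", 0)             # status
        + SYSCALL
    )
    assert len(code) == code_len
    return code

a = hello_code(b"Hello, world!\n")
b = hello_code(b"hi\n")
print(a[10:15].hex())   # be9a004000
```

Comparing `a` and `b` shows the point made above: the `BE` immediate is identical for both messages, and only the `BA` (edx) immediate differs.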
arm64 instructions you need
ARM64 instructions are fixed 32-bit, written little-endian (encoding 0xAABBCCDD → bytes DD CC BB AA). Encoding formulas (Rd = register number 0–31, hw = which 16-bit slot):
| op | formula (sum the disjoint pieces) |
|---|---|
| movz Xd,#imm16,lsl#(16*hw) | 0xD2800000 + (hw<<21) + (imm16<<5) + Rd |
| movk Xd,#imm16,lsl#(16*hw) | 0xF2800000 + (hw<<21) + (imm16<<5) + Rd |
| adr Xd,label | 0x10000000 + (immlo<<29) + (immhi<<5) + Rd, with off = label - addr(adr), immlo = off & 3, immhi = (off>>2) & 0x7FFFF |
| svc #imm16 | 0xD4000001 + (imm16<<5) |
adr x1,msg is PC-relative (offset from the adr instruction to the string), so no absolute address is needed and it survives any load address. Values ≤ 0xFFFF (like syscall numbers 64/93) load with a single movz; larger values need a movz for the low 16 bits, then one movk per additional 16-bit slot (hw=1, 2, 3).
D2800020 movz x0,#1 ; fd=stdout
100000E1 adr x1,msg ; &msg (PC-relative, off 0x1C here)
D28001C2 movz x2,#14 ; len
D2800808 movz x8,#64 ; write
D4000001 svc #0
D2800000 movz x0,#0 ; status
D2800BA8 movz x8,#93 ; exit
D4000001 svc #0
(movz x8,#64 = 0xD2800000 | (64<<5) | 8; movz x8,#93 likewise.)
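The formulas in the table can be checked mechanically. The sketch below is a hypothetical set of Python helpers (one function per instruction, returning the 32-bit word) that reproduces the eight words listed above and shows the little-endian byte order they take in the file.

```python
import struct

def movz(rd, imm16, hw=0):
    return 0xD2800000 + (hw << 21) + (imm16 << 5) + rd

def movk(rd, imm16, hw=0):
    return 0xF2800000 + (hw << 21) + (imm16 << 5) + rd

def adr(rd, off):
    """off = label - address of the adr instruction (signed, within +/-1 MiB)."""
    assert -(1 << 20) <= off < (1 << 20)
    immlo = off & 3
    immhi = (off >> 2) & 0x7FFFF
    return 0x10000000 + (immlo << 29) + (immhi << 5) + rd

def svc(imm16=0):
    return 0xD4000001 + (imm16 << 5)

def le(word):
    """A 32-bit instruction word as it appears in the file (little-endian)."""
    return struct.pack("<I", word)

words = [
    movz(0, 1), adr(1, 0x1C), movz(2, 14), movz(8, 64), svc(),
    movz(0, 0), movz(8, 93), svc(),
]
assert words == [0xD2800020, 0x100000E1, 0xD28001C2, 0xD2800808,
                 0xD4000001, 0xD2800000, 0xD2800BA8, 0xD4000001]

code = b"".join(le(w) for w in words)
print(code[:4].hex())   # 200080d2

# A value above 0xFFFF needs movz for the low slot plus movk for the next one,
# e.g. 0x12345 into x9:
wide = [movz(9, 0x2345), movk(9, 0x1, hw=1)]
```

The `adr` offset of 0x1C assumes the string sits immediately after these eight instructions, as in the listing above.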
Build
Lay bytes with xxd -r -p (ignores whitespace — but it also eats stray hex letters, so put NO comment text in the hex stream). No padding is needed: header + program header + code + string are contiguous. Working, tested builds for both arches:
chmod +x the output. Inspect with readelf -h / readelf -l (or file).
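One possible byte-layer for the arm64 layout described above, sketched in Python instead of `xxd -r -p`. `bytes.fromhex` skips whitespace in the same way, and it rejects any non-hex text, so comments stay outside the hex strings. The names (`ELF_HEADER`, `write_executable`) and the `Hello, world!` message are assumptions for illustration.

```python
import os
import stat

# ELF64 header (64 bytes): e_machine = B7 00 (aarch64), e_entry = 0x400078.
ELF_HEADER = bytes.fromhex("""
    7F 45 4C 46 02 01 01 00 00 00 00 00 00 00 00 00
    02 00 B7 00 01 00 00 00
    78 00 40 00 00 00 00 00
    40 00 00 00 00 00 00 00
    00 00 00 00 00 00 00 00
    00 00 00 00 40 00 38 00 01 00 00 00 00 00 00 00
""")

# PT_LOAD (56 bytes): flags 5 (R+X), vaddr 0x400000, filesz = memsz = 0xA6.
PT_LOAD = bytes.fromhex("""
    01 00 00 00 05 00 00 00
    00 00 00 00 00 00 00 00
    00 00 40 00 00 00 00 00
    00 00 40 00 00 00 00 00
    A6 00 00 00 00 00 00 00
    A6 00 00 00 00 00 00 00
    00 10 00 00 00 00 00 00
""")

# Code at file offset 0x78. Each word is byte-reversed: 0xD2800020 -> 20 00 80 D2.
CODE = bytes.fromhex("""
    20 00 80 D2
    E1 00 00 10
    C2 01 80 D2
    08 08 80 D2
    01 00 00 D4
    00 00 80 D2
    A8 0B 80 D2
    01 00 00 D4
""")

MSG = b"Hello, world!\n"   # 14 bytes, lands at file offset 0x98

image = ELF_HEADER + PT_LOAD + CODE + MSG
assert len(ELF_HEADER) == 64 and len(PT_LOAD) == 56
assert len(image) == 0xA6   # must equal p_filesz / p_memsz above

def write_executable(path, data):
    """Write the image and set the executable bits (the chmod +x step)."""
    with open(path, "wb") as f:
        f.write(data)
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Calling `write_executable("hello_arm64", image)` produces the file; the length assertion catches the common slip of editing the message without updating p_filesz and p_memsz.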
Verify with qemu
qemu-user runs a foreign-arch Linux binary on any host by emulating the CPU and translating Linux syscalls.
On a Linux host: install qemu-user-static, then just:
qemu-x86_64 ./hello_x64 # or qemu-x86_64-static
qemu-aarch64 ./hello_arm64
A native-arch binary runs directly; only the foreign one needs the explicit qemu prefix (binfmt_misc usually makes even that automatic).
On a macOS host (no qemu-user, only qemu-system-*): run inside a Linux container — see verify-with-qemu.sh, which uses Docker + qemu-user-static. Do NOT reach for qemu-system-* (full-machine emulation needs a kernel+rootfs); qemu-user is the right tool for a single static binary.
Always ship a portable runner (host arch ≠ target arch)
The host you build on is frequently a different arch than the binary (e.g. authoring an x86-64 ELF on an arm64 Mac). A bare ./bin then fails confusingly, and ad-hoc docker run … apt-get install qemu-user-static … one-liners reinstall qemu on every run and spew debconf/platform noise. So always drop a `run-elf.sh` next to the binary — a single command that runs it regardless of host.
Copy run-elf.sh into the project (or generate an equivalent). It:
- reads the target arch from the ELF's own `e_machine` bytes (offset `0x12`: `3e00` = x86-64, `b700` = arm64), no hardcoding,
- runs natively when host arch == target,
- uses `qemu-<arch>-static` when the host is Linux of a different arch,
- on macOS builds a small cached Docker image with `qemu-user-static` once, then reuses it (sub-second per run), and passes a TTY so colors render.
./run-elf.sh ls_x64 # works on Linux x86-64, Linux arm64, or macOS — same command
./run-elf.sh ls_x64 | cat -v # make ANSI escapes visible as text
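The arch-detection step can be illustrated with a short Python sketch. This is not run-elf.sh itself (that is a shell script); the function names are hypothetical, and only the Linux branches are modelled, with the macOS Docker path left out.

```python
import struct

# e_machine values named in the text, mapped to qemu-user arch names.
E_MACHINE = {0x3E: "x86_64", 0xB7: "aarch64"}

def elf_target_arch(header):
    """Read the target arch from the first 20 bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if header[4] != 2 or header[5] != 1:
        raise ValueError("expected ELF64 little-endian")
    (machine,) = struct.unpack_from("<H", header, 0x12)
    if machine not in E_MACHINE:
        raise ValueError("unsupported e_machine 0x%X" % machine)
    return E_MACHINE[machine]

def runner_command(header, host_arch, path):
    """Linux-host decision only: run natively or prefix with qemu-<arch>-static."""
    target = elf_target_arch(header)
    if host_arch == target:
        return [path]
    return ["qemu-%s-static" % target, path]

# A 20-byte prefix of an x86-64 ELF header is enough for the check.
hdr = b"\x7fELF\x02\x01\x01" + bytes(9) + b"\x02\x00\x3e\x00"
print(runner_command(hdr, "aarch64", "./hello_x64"))
# ['qemu-x86_64-static', './hello_x64']
```

In a real script `host_arch` would come from something like `uname -m`; on Linux that reports `x86_64` or `aarch64`, which is why those strings are used as keys here.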
Before you finish: hand the user the commands
After the binary builds and you've verified it, END the run by giving the user copy-pasteable commands to run/test/verify it themselves. Use the REAL filename and arch you produced, not placeholders.
If you shipped `run-elf.sh` (recommended — see above), the one command that works on any host is `./run-elf.sh <bin>`. Lead with that, then offer the per-host raw commands below for users who want to see what's underneath or don't have it:
# inspect what you built
file <bin>
readelf -h <bin> # header: arch, entry, type
readelf -l <bin> # the PT_LOAD segment
# run it
./<bin>; echo "exit=$?" # Linux, matching arch (native)
qemu-x86_64 ./<bin>; echo "exit=$?" # Linux, x86-64 binary via qemu-user (apt install qemu-user-static)
qemu-aarch64 ./<bin>; echo "exit=$?" # Linux, arm64 binary via qemu-user
# macOS host (no qemu-user): run it in a Linux container via qemu-user-static
docker run --rm -v "$PWD":/w -w /w debian:stable-slim sh -c \
'apt-get update -qq >/dev/null && apt-get install -y -qq qemu-user-static >/dev/null && \
chmod +x <bin> && qemu-x86_64-static ./<bin>; echo "exit=$?"' # swap qemu-aarch64-static for an arm64 binary
State the expected result too (e.g. prints ... and exits with code N), so the user can tell at a glance whether their run matches.
Beyond hello-world: servers, files, and big programs
hello-world is write+exit. Real programs — a TCP server, a file reader, a renderer — need more syscalls, writable memory, control flow, and subroutines, but the method is unchanged: you still author every instruction word by hand. This section is the delta. Worked example throughout: a web server that reads a markdown file from disk and renders it to HTML per request, entirely in hand-coded arm64 — a ~5.7 KB static ELF that speaks HTTP and does its own markdown parsing.
More syscalls (arm64 / x86-64 numbers)
A blocking TCP server plus a file read use these (arm64 / x86-64):
| call | arm64 | x86-64 | call | arm64 | x86-64 |
|---|---|---|---|---|---|
| socket | 198 | 41 | accept | 202 | 43 |
| setsockopt | 208 | 54 | read | 63 | 0 |
| bind | 200 | 49 | close | 57 | 3 |
| listen | 201 | 50 | openat | 56 | 257 |
Server shape: socket → setsockopt(SO_REUSEADDR) → bind → listen → loop{ accept → read(drain the request) → …work… → write → close }. sockaddr_in is 16 bytes: sin_family=2 (u16 LE), sin_port big-endian (port 8006 = 0x1F46, stored as bytes 1F 46, high byte first; if you write it with a little-endian halfword store, the register value is 0x461F), sin_addr=0 (INADDR_ANY), 8 pad. Read a file with openat(AT_FDCWD, path, O_RDONLY): AT_FDCWD is -100, so load it with movn (arm64 movn x0,#99 = ~99 = -100 = 0x92800C60), path pointer in x1, flags 0 in x2. The path resolves against the process's cwd, so use an absolute path (/data.md) when the file's location is fixed; a bare data.md depends on where the container/process was started.
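The sockaddr_in layout and the movn trick can be checked with a small Python sketch. The helper names are hypothetical; the port and the encoding base are the ones given in this document.

```python
import struct

AF_INET = 2

def sockaddr_in(port, addr=bytes(4)):
    """16 bytes: family (u16 LE), port (u16 big-endian), IPv4 address, 8 pad."""
    return struct.pack("<H", AF_INET) + struct.pack(">H", port) + addr + bytes(8)

def movn(rd, imm16):
    """movn Xd,#imm16 sets Xd to the bitwise NOT of imm16."""
    return 0x92800000 + (imm16 << 5) + rd

sa = sockaddr_in(8006)
print(sa.hex(" "))
# 02 00 1f 46 00 00 00 00 00 00 00 00 00 00 00 00

AT_FDCWD = -100
assert ~99 == AT_FDCWD             # why the immediate is 99, not 100
assert movn(0, 99) == 0x92800C60   # movn x0,#99
```

The port bytes come out high byte first (1f 46) because the network byte order for sin_port is big-endian, while sin_family stays in host (little-endian) order.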
Writable memory: read() needs it, and BSS is free
read/accept write into a buffer, so that buffer must be in a writable page. With an R+X-only PT_LOAD (p_flags=5), the first read into it fails with EFAULT (the syscall returns -14; no signal is raised). Fix: make the load segment RWX (p_flags=7), or use the stack. And big buffers cost zero file bytes: set `p_memsz` larger than `p_filesz` and the kernel zero-fills the gap on a fresh mmap (classic BSS). The example keeps a 64 KB input + 256 KB output buffer that way while the on-disk ELF stays ~5.7 KB. Buffers live past p_filesz but within p_memsz, at vaddrs you compute.
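Computing those buffer addresses is simple arithmetic; the Python sketch below shows one way to do it. The 5700-byte file size, the buffer names, and the 16-byte alignment are assumptions made for the example, not values taken from the actual server.

```python
LOAD_ADDR = 0x400000
PF_X, PF_W, PF_R = 1, 2, 4

def bss_layout(file_size, buffers, align=16):
    """Place zero-filled buffers after the file image.

    Returns (p_memsz, {name: vaddr}). Offsets are relative to the start of
    the segment, which maps the file from offset 0 at LOAD_ADDR.
    """
    cursor = file_size
    addrs = {}
    for name, size in buffers.items():
        cursor = (cursor + align - 1) & ~(align - 1)   # round up to alignment
        addrs[name] = LOAD_ADDR + cursor
        cursor += size
    return cursor, addrs

p_flags = PF_R | PF_W | PF_X     # 7: read() and accept() need a writable page
p_filesz = 5700                  # hypothetical on-disk size (~5.7 KB)
p_memsz, addrs = bss_layout(
    p_filesz, {"inbuf": 64 * 1024, "outbuf": 256 * 1024}
)

print(hex(addrs["inbuf"]), hex(addrs["outbuf"]), p_memsz)
# 0x401650 0x411650 333392
```

Only `p_memsz` grows; `p_filesz` and the file on disk stay at 5700 bytes, and the addresses in `addrs` are the values loaded into registers before each read.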
Guard syscall returns
A failed syscall returns a small negative number. Store read's return as an unsigned length without checking and -1 becomes a giant count — your copy/scan loop runs off the buffer and faults. Compare the return to 0 and branch negatives to an error path (e.g. emit a fixed HTTP/1.1 500), exactly like the fd check after openat.
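To make the failure concrete, the Python sketch below shows what a negative return looks like when treated as an unsigned 64-bit length, followed by the two arm64 words for a guard. The helper names and the 0x40-byte distance to the error path are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def as_u64(value):
    """How a signed 64-bit register value reads when treated as unsigned."""
    return value & MASK64

print(as_u64(-1))    # 18446744073709551615
print(as_u64(-14))   # 18446744073709551602  (-EFAULT)

def cmp_imm(rn, imm12):
    """cmp Xn,#imm12 (SUBS into xzr)."""
    return 0xF100001F + (imm12 << 10) + (rn << 5)

def b_cond(cond, off):
    """b.cond with a byte offset from this instruction to the target."""
    return 0x54000000 + (((off >> 2) & 0x7FFFF) << 5) + cond

LT = 11   # signed less-than

# x0 holds read()'s return value; branch to an error path assumed to sit
# 0x40 bytes after the b.lt if the value is negative.
guard = [cmp_imm(0, 0), b_cond(LT, 0x40)]
print([hex(w) for w in guard])   # ['0xf100001f', '0x5400020b']
```

The guard goes directly after the `svc #0`, before the return value is copied anywhere or used as a loop bound.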
Big programs: write a byte-emitter, not hex by hand
Hand-laying xxd hex is fine for a dozen instructions; a parser is hundreds. Author them from a small byte-emitter program instead — one function per instruction returning the exact 32-bit word you hand-encoded (movz, bl, cmp, cbz, …), plus a label/fixup layer. This is still authoring bytes: it is NOT an assembler (no mnemonic→encoding translation — you supply every encoding). It buys named labels, a two-pass layout (size everything, then fill in every PC-relative displacement, asserting the size is stable across passes), and real subroutines: bl(0x94000000 | imm26) / ret(0xD65F03C0), save x29/x30 on entry, keep long-lived state in callee-saved x19–x28. Encodings for the ops beyond the hello set (arm64; Rd/Rn/Rm/Rt/Ra = reg 0–31). The fields are disjoint, so the pieces are shown added with + (identical to OR-ing them). All verified in a shipping build:
| op | encoding (sum the pieces) |
|---|---|
| mov Xd,Xm (alias of orr) | 0xAA0003E0 + (Rm<<16) + Rd |
| add Xd,Xn,#imm12 / sub | 0x91000000 / 0xD1000000, + (imm12<<10) + (Rn<<5) + Rd |
| cmp Xn,#imm12 (SUBS→xzr) | 0xF100001F + (imm12<<10) + (Rn<<5) (32-bit cmp Wn: 0x7100001F) |
| cmp Xn,Xm (SUBS→xzr) | 0xEB00001F + (Rm<<16) + (Rn<<5) |
| b.cond label | 0x54000000 + (imm19<<5) + cond, imm19 = ((label−here)>>2) & 0x7FFFF; cond: eq0 ne1 cs2 cc3 mi4 pl5 hi8 ls9 ge10 lt11 gt12 le13 |
| cbz Xt,label / cbnz | 0xB4000000 / 0xB5000000, + (imm19<<5) + Rt, same imm19 as b.cond (32-bit: 0x34… / 0x35…) |
| b label / bl label | 0x14000000 / 0x94000000, + (imm26 & 0x3FFFFFF), imm26 = (label−here)>>2 |
| ret | 0xD65F03C0 |
| movn Xd,#imm16 | 0x92800000 + (imm16<<5) + Rd (#99 → −100 = AT_FDCWD) |
| ldrb Wt,[Xn,#imm] / strb | 0x39400000 / 0x39000000, + (imm<<10) + (Rn<<5) + Rt |
| strh Wt,[Xn,#imm] | 0x79000000 + ((imm>>1)<<10) + (Rn<<5) + Rt |
| ldr Wt,[Xn,#imm] / str | 0xB9400000 / 0xB9000000, + ((imm>>2)<<10) + (Rn<<5) + Rt |
| ldr Xt,[Xn,#imm] / str | 0xF9400000 / 0xF9000000, + ((imm>>3)<<10) + (Rn<<5) + Rt |
| udiv Xd,Xn,Xm | 0x9AC00800 + (Rm<<16) + (Rn<<5) + Rd |
| msub Xd,Xn,Xm,Xa | 0x9B008000 + (Rm<<16) + (Ra<<10) + (Rn<<5) + Rd (Xd = Xa − Xn·Xm; with Xn = the udiv quotient, Xm = the divisor, and Xa = the dividend, Xd is the remainder) |
Gotcha: ldr/str unsigned-offset immediates are scaled by the access size — byte×1, half×2, word×4, dword×8 — so divide the byte offset by the size when encoding (that's the imm>>1/2/3 above). cmp is just SUBS into the zero register (Rd=31).
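A minimal sketch of such a byte-emitter in Python, assuming the encodings from the table above. The `Emitter` class and its method names are hypothetical, and only a handful of ops are included. Because arm64 words are fixed-width, label addresses are known as soon as items are appended, so the second pass only fills in displacements.

```python
import struct

class Emitter:
    """Hypothetical arm64 byte-emitter: fixed 4-byte words plus labels."""

    def __init__(self, base):
        self.base = base      # vaddr of the first emitted word
        self.items = []       # finished words (int) or deferred fixups (callable)
        self.labels = {}

    def label(self, name):
        self.labels[name] = self.base + 4 * len(self.items)

    def word(self, w):
        self.items.append(w)

    def _branch(self, opcode, bits, shift, target, low=0):
        # Defer the displacement until every label address is known.
        def resolve(here):
            off = self.labels[target] - here
            limit = 1 << (bits + 1)            # byte range of a signed word offset
            assert off % 4 == 0 and -limit <= off < limit
            imm = (off >> 2) & ((1 << bits) - 1)
            return opcode + (imm << shift) + low
        self.items.append(resolve)

    def b(self, target):            self._branch(0x14000000, 26, 0, target)
    def bl(self, target):           self._branch(0x94000000, 26, 0, target)
    def cbz(self, rt, target):      self._branch(0xB4000000, 19, 5, target, rt)
    def b_cond(self, cond, target): self._branch(0x54000000, 19, 5, target, cond)
    def ret(self):                  self.word(0xD65F03C0)

    def str_x(self, rt, rn, byte_off):
        # Unsigned-offset form: the immediate is the byte offset divided by 8.
        assert byte_off % 8 == 0
        self.word(0xF9000000 + ((byte_off >> 3) << 10) + (rn << 5) + rt)

    def finish(self):
        out = b""
        for i, item in enumerate(self.items):
            here = self.base + 4 * i
            w = item(here) if callable(item) else item
            out += struct.pack("<I", w)
        return out

e = Emitter(base=0x400078)
e.bl("helper")          # forward reference
e.b("done")
e.label("helper")
e.str_x(0, 19, 16)      # str x0,[x19,#16]: imm field is 16 >> 3 = 2
e.ret()
e.label("done")
e.cbz(0, "helper")      # backward reference, negative displacement
code = e.finish()
print([hex(w) for w in struct.unpack("<5I", code)])
# ['0x94000002', '0x14000003', '0xf9000a60', '0xd65f03c0', '0xb4ffffc0']
```

The alignment assert in `str_x` turns the scaling gotcha into a build-time failure instead of a silently wrong offset. An emitter with variable-length instructions (x86-64) would need the size-stability check across passes that the text describes.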
Verify complex output byte-for-byte against a reference (oracle)
For anything past trivial output, "looks right" is not verification. Write a reference implementation of the exact same algorithm in a normal language (Python is fine), keep it as an uncommitted dev-time tool, and assert the machine-code program's output equals the reference byte-for-byte on real input. Design the reference to emit minimal, deterministic bytes so the machine code can match it exactly. This turns "did my parser work?" into a hard gate and points at the first diverging byte when it fails. In the example, the served HTML had to diff-clean against the Python renderer on the real document before shipping — and stayed the deterministic gate for every later change.
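A minimal sketch of the comparison gate in Python. The function names are hypothetical; `actual` would be the bytes captured from the machine-code program and `expected` the output of the reference implementation.

```python
def first_divergence(actual, expected):
    """Return None if identical, else the offset of the first differing byte."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        # One output is a strict prefix of the other.
        return min(len(actual), len(expected))
    return None

def assert_matches_reference(actual, expected, context=16):
    """Fail loudly, showing the bytes around the first mismatch."""
    at = first_divergence(actual, expected)
    if at is not None:
        lo = max(0, at - context)
        raise AssertionError(
            "first diverging byte at offset %d:\n  actual:   %r\n  expected: %r"
            % (at, actual[lo:at + context], expected[lo:at + context])
        )

print(first_divergence(b"<p>a</p>", b"<p>a</p>"))   # None
print(first_divergence(b"<p>a</p>", b"<p>b</p>"))   # 3
print(first_divergence(b"ab", b"abc"))              # 2
```

Reporting the offset rather than a plain pass/fail is what makes the gate useful: the first diverging byte usually points straight at the parser state that went wrong.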
Running on macOS + Apple Silicon
An arm64 Linux ELF runs natively in a linux/arm64 Docker container on Apple Silicon — no qemu emulation, so local behavior matches an arm64 Linux server exactly (a static ELF runs even in a busybox/scratch-style image; it needs no libc or loader). Two gotchas: (1) if docker pull/build hangs at error getting credentials, the credential helper is stalling — bypass it for public images with an empty config: mkdir -p /tmp/nc && printf '{}' >/tmp/nc/config.json && DOCKER_CONFIG=/tmp/nc docker …. (2) A server never exits, so run it detached (docker run -d …) — a foreground docker run blocks your shell until killed.
Common mistakes
- Using x86-64 syscall numbers on arm64 or vice-versa (write is 1 vs 64, exit 60 vs 93): the program does the wrong thing or hangs.
- Reading into an R+X-only segment: `read`/`accept` need writable memory; use RWX (p_flags=7) or the stack, else EFAULT on the first read.
- Using a syscall return as an unsigned length without checking for negative: one errno becomes a giant length and overruns the buffer.
- Shipping big zero buffers as file bytes: use `p_memsz > p_filesz` (BSS zero-fill) instead of padding megabytes of zeros into the ELF.
- Hand-laying hex for a large program: write a byte-emitter with labels + two-pass layout (still authoring bytes, not assembling).
- Shipping a non-trivial renderer/parser on "looks right": gate it on byte-for-byte equality with a reference implementation.
- `syscall` (0F 05) on x86-64 vs `svc #0` (D4000001) on arm64: don't cross them. Linux arm64 is `svc #0` with the number in `x8`.
- Wrong `e_machine` (0x3E vs 0xB7): the kernel/qemu rejects it with "Exec format error".
- Hardcoding a `mov esi,imm32` absolute string address but building PIE: keep it `ET_EXEC` so `0x400000` is fixed.
- Forgetting ELF64 puts `p_flags` immediately after `p_type` (different from ELF32).
- Writing a `.s`/`.c` and assembling: that is exactly what "no source code" forbids.
- Trying `qemu-system-*` for a static binary on macOS: use `qemu-user-static` in a container instead.