Hand-coding a Linux ELF in raw machine code

Announce at start: "I'm using the hand-coding-elf-linux skill to write this as raw machine code."

Core rule: Author the bytes. You write instruction encodings and ELF header bytes directly. A mnemonic (mov eax,1, svc #0) may appear ONLY as a trailing annotation for the reader — never as the thing you author and then assemble. Producing a .c/.s or running a compiler/assembler/linker on your logic is the failure this skill prevents. xxd, printf, head -c are byte-layers, not source. Think in machine code, not assembly.

Linux is the easy case

Linux has no code-signing requirement, and a static executable needs no dynamic linker. A static ET_EXEC ELF that makes raw syscalls just runs: no signing step, no interpreter, no per-OS executable metadata. That makes the whole job an ELF header, one PT_LOAD program header, and your machine code.

Minimal static ELF64

Two structures, then code. Non-PIE ET_EXEC loaded at a fixed 0x400000, so absolute addresses are known at build time.

ELF64 header (64 bytes):

field	bytes	value
e_ident	16	`7F 45 4C 46` (magic) then `02`(64-bit) `01`(LE) `01`(EI_VERSION) `00`(SysV ABI) `00`(ABI version), then 7 zero pad bytes
e_type	2	`0200` = ET_EXEC
e_machine	2	x86-64 = `3E00` (0x3E); aarch64 = `B700` (0xB7)
e_version	4	`01000000`
e_entry	8	`0x400078` = load addr + sizeof(headers) = 0x400000 + 0x40 + 0x38
e_phoff	8	`0x40` (64)
e_shoff	8	0 (no sections)
e_flags	4	0
e_ehsize	2	`4000` (64)
e_phentsize	2	`3800` (56)
e_phnum	2	`0100` (1)
e_shentsize/e_shnum/e_shstrndx	2+2+2	0,0,0
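The table is easy to cross-check mechanically. Below is a minimal sketch, assuming Python 3 with its standard `struct` module used purely as a byte layer; `elf64_header` is a hypothetical helper name. It packs the 64-byte header field by field, in table order.

```python
import struct

LOAD_ADDR = 0x400000   # fixed non-PIE load address
EHDR_SIZE = 0x40       # sizeof(Elf64_Ehdr) = 64
PHDR_SIZE = 0x38       # sizeof(Elf64_Phdr) = 56
ENTRY = LOAD_ADDR + EHDR_SIZE + PHDR_SIZE   # 0x400078: code follows the one phdr


def elf64_header(e_machine: int) -> bytes:
    """Pack the 64-byte ELF64 header for a static ET_EXEC with one PT_LOAD."""
    e_ident = bytes([
        0x7F, 0x45, 0x4C, 0x46,  # magic "\x7fELF"
        2,                       # EI_CLASS: 64-bit
        1,                       # EI_DATA: little-endian
        1,                       # EI_VERSION
        0,                       # EI_OSABI: System V
        0,                       # EI_ABIVERSION
    ]) + bytes(7)                # EI_PAD: 7 zero bytes
    rest = struct.pack(
        "<HHIQQQIHHHHHH",
        2,           # e_type = ET_EXEC
        e_machine,   # 0x3E = x86-64, 0xB7 = aarch64
        1,           # e_version
        ENTRY,       # e_entry
        EHDR_SIZE,   # e_phoff: program header right after this header
        0,           # e_shoff: no section headers
        0,           # e_flags
        EHDR_SIZE,   # e_ehsize
        PHDR_SIZE,   # e_phentsize
        1,           # e_phnum
        0, 0, 0,     # e_shentsize, e_shnum, e_shstrndx
    )
    return e_ident + rest


hdr = elf64_header(0x3E)
print(hdr.hex(" "))
```

Diffing the printed hex against your hand-typed header catches a swapped field or a miscounted pad.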

Program header `PT_LOAD` (56 bytes, ELF64 field order): p_type=1, p_flags=5 (R+X — note flags come second in ELF64), p_offset=0, p_vaddr=0x400000, p_paddr=0x400000, p_filesz=<file size>, p_memsz=<file size>, p_align=0x1000. Map the whole file from offset 0. Code starts at file offset 0x78 (= entry).

p_offset and p_vaddr must be congruent mod p_align — both are 0 mod 0x1000 here, so fine.
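Here is a matching sketch for the 56-byte PT_LOAD entry, where `pt_load` is a hypothetical helper. The struct format string makes the ELF64 field order explicit (p_flags second), and the assert encodes the congruence rule. The 0xA8 size assumes the x86-64 hello-world layout: 64 + 56 header bytes, 34 bytes of code, and a 14-byte message.

```python
import struct

LOAD_ADDR = 0x400000
P_ALIGN = 0x1000
PT_LOAD = 1
PF_X, PF_W, PF_R = 1, 2, 4   # segment permission bits


def pt_load(file_size: int) -> bytes:
    """Pack one 56-byte Elf64_Phdr that maps the whole file R+X at LOAD_ADDR."""
    p_offset, p_vaddr = 0, LOAD_ADDR
    # The loader needs p_offset and p_vaddr congruent modulo p_align.
    assert p_offset % P_ALIGN == p_vaddr % P_ALIGN
    return struct.pack(
        "<IIQQQQQQ",
        PT_LOAD,        # p_type
        PF_R | PF_X,    # p_flags = 5; second field in ELF64 (ELF32 puts it later)
        p_offset,       # p_offset: map from the start of the file
        p_vaddr,        # p_vaddr
        p_vaddr,        # p_paddr: same value, unused by the loader here
        file_size,      # p_filesz
        file_size,      # p_memsz: nothing extra (no .bss)
        P_ALIGN,        # p_align
    )


phdr = pt_load(0xA8)   # assumed 168-byte file
```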

The two syscall ABIs (get these exactly right)

Syscall numbers differ wildly between arches — this is the #1 source of wrong behavior.

	x86-64	arm64 (aarch64)
syscall nr in	`rax`	`x8`
args in	rdi, rsi, rdx, r10, r8, r9	x0, x1, x2, x3, x4, x5
trap instruction	`syscall` = `0F 05`	`svc #0` = `D4000001`
write	1	64
exit	60	93
return value	rax	x0

x86-64 instructions you need

mov r32, imm32 is B8+r then 4-byte LE immediate (eax=0,ecx=1,edx=2,ebx=3,esp=4,ebp=5,esi=6,edi=7):

B8 01000000   mov eax,1            ; write
BF 01000000   mov edi,1            ; fd=stdout
BE <addr32>   mov esi,&msg         ; abs addr works because ET_EXEC is non-PIE
BA 0E000000   mov edx,14           ; len
0F 05         syscall
B8 3C000000   mov eax,60           ; exit
BF 00000000   mov edi,0            ; status
0F 05         syscall
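Every instruction in this listing is either the 5-byte B8+r form or the 2-byte `0F 05`, so hand-typed bytes are easy to audit by reading them back. The sketch below assumes the 14-byte message sits right after this 34-byte block at file offset 0x9A, which puts `9A 00 40 00` after BE. `audit_x64` is a hypothetical name. It reads your bytes back and does not generate them, so authoring stays in hex.

```python
import struct

REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]  # B8+r order


def audit_x64(code: bytes) -> list[str]:
    """Read back bytes that use only B8+r imm32 and 0F 05, one line per instruction."""
    lines, i = [], 0
    while i < len(code):
        op = code[i]
        if 0xB8 <= op <= 0xBF:                         # mov r32, imm32
            (imm,) = struct.unpack_from("<I", code, i + 1)
            lines.append(f"mov {REG32[op - 0xB8]},{imm:#x}")
            i += 5
        elif code[i:i + 2] == b"\x0f\x05":             # syscall
            lines.append("syscall")
            i += 2
        else:
            raise ValueError(f"unexpected byte {op:#04x} at offset {i}")
    return lines


# The listing's bytes, with &msg filled in as 0x40009A (msg at file offset 0x9A).
hand = bytes.fromhex(
    "B8 01000000  BF 01000000  BE 9A004000  BA 0E000000  0F 05"
    "  B8 3C000000  BF 00000000  0F 05"
)
for line in audit_x64(hand):
    print(line)
```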

The string address is absolute: 0x400000 + file_offset_of_msg, written little-endian as the 4-byte immediate that follows the BE opcode byte. With this fixed-length code, msg starts at file offset 0x78 + 34 = 0x9A, so that immediate is `9A 00 40 00`. Changing the message's text or length does NOT move the string, because it always starts right after the same block of code. The address immediate therefore stays the same, and only the length immediate after BA (edx) and p_filesz/p_memsz change. (If you ever make it PIE, swap to a RIP-relative lea; for hand-coding, stay non-PIE.)
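The same arithmetic as a small sketch. `layout` is a hypothetical helper, and `CODE_LEN = 34` is the byte count of the x86-64 listing above. Two messages of different length produce the same address immediate but different file sizes, and therefore different p_filesz/p_memsz values.

```python
LOAD_ADDR = 0x400000
CODE_OFF = 0x78   # 64-byte ELF header + one 56-byte program header
CODE_LEN = 34     # x86-64 listing: six 5-byte movs + two 2-byte syscalls


def layout(msg: bytes) -> dict:
    """Where the string lands and which numbers depend on its length."""
    msg_off = CODE_OFF + CODE_LEN                 # string follows the code directly
    msg_addr = LOAD_ADDR + msg_off
    return {
        "msg_addr": msg_addr,
        "esi_imm": msg_addr.to_bytes(4, "little"),  # the 4 bytes after BE
        "edx_imm": len(msg).to_bytes(4, "little"),  # the 4 bytes after BA
        "file_size": msg_off + len(msg),            # p_filesz and p_memsz
    }


long_msg = layout(b"Hello, world!\n")
short_msg = layout(b"Hi\n")
print(hex(long_msg["msg_addr"]), long_msg["esi_imm"].hex(" "), long_msg["file_size"])
# 0x40009a 9a 00 40 00 168
```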

arm64 instructions you need

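ARM64 instructions are fixed 32-bit words stored little-endian (word 0xAABBCCDD → bytes DD CC BB AA), and they must be 4-byte aligned (code offset 0x78 is). The listing below shows each instruction as its 32-bit word, so in the hex stream you write each word byte-reversed (D2800020 → 20 00 80 D2). Encoding formulas (Rd = register number 0–31, hw = which 16-bit slot):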

op	formula (OR the pieces together)
`movz Xd,#imm16,lsl#(16*hw)`	`0xD2800000 | (hw<<21) | (imm16<<5) | Rd`
`movk Xd,#imm16,lsl#(16*hw)`	`0xF2800000 | (hw<<21) | (imm16<<5) | Rd`
`adr Xd,label`	`0x10000000 | (immlo<<29) | (immhi<<5) | Rd`, with `off = label - addr(adr)`, `immlo = off & 3`, `immhi = (off>>2) & 0x7FFFF`
`svc #imm16`	`0xD4000001 | (imm16<<5)`
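The four formulas translate directly into integer expressions, which is handy for checking a hand-computed word before it goes into the hex stream. This is a sketch with hypothetical function names. The `adr` helper also accepts a negative offset by masking it to the 21-bit field, which goes slightly beyond what this document's example needs.

```python
def movz(rd: int, imm16: int, hw: int = 0) -> int:
    """movz Xd,#imm16,lsl#(16*hw): write imm16 into one slot, zero the rest."""
    assert 0 <= rd <= 31 and 0 <= imm16 <= 0xFFFF and 0 <= hw <= 3
    return 0xD2800000 | (hw << 21) | (imm16 << 5) | rd


def movk(rd: int, imm16: int, hw: int = 0) -> int:
    """movk Xd,#imm16,lsl#(16*hw): write imm16 into one slot, keep the rest."""
    assert 0 <= rd <= 31 and 0 <= imm16 <= 0xFFFF and 0 <= hw <= 3
    return 0xF2800000 | (hw << 21) | (imm16 << 5) | rd


def adr(rd: int, off: int) -> int:
    """adr Xd,label where off = label - address of this adr (byte offset)."""
    assert 0 <= rd <= 31 and -(1 << 20) <= off < (1 << 20)   # 21-bit signed range
    off &= (1 << 21) - 1          # two's-complement 21-bit field
    immlo, immhi = off & 3, (off >> 2) & 0x7FFFF
    return 0x10000000 | (immlo << 29) | (immhi << 5) | rd


def svc(imm16: int = 0) -> int:
    """svc #imm16 (Linux uses svc #0)."""
    return 0xD4000001 | (imm16 << 5)


for word in (movz(0, 1), adr(1, 0x1C), movz(2, 14), movz(8, 64), svc()):
    print(f"{word:08X}")
```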

adr x1,msg is PC-relative (the offset from the adr instruction to the string, reachable within ±1 MiB), so no absolute address is needed and it survives any load address. Syscall numbers ≤ 0xFFFF (like 64 and 93) load with a single movz. Larger values need a movz for the low 16 bits, then a movk (which leaves the other bits intact) for each higher 16-bit slot.
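Here is a sketch of the movz/movk sequence for values wider than 16 bits, where `load_imm64` is a hypothetical name. Because movz zeroes the rest of the register, 16-bit slots that are zero can simply be skipped. Each movk then patches one slot and keeps the others.

```python
def load_imm64(rd: int, value: int) -> list[int]:
    """movz the low 16 bits, then movk each nonzero higher 16-bit slot."""
    assert 0 <= rd <= 31 and 0 <= value < (1 << 64)
    words = [0xD2800000 | ((value & 0xFFFF) << 5) | rd]               # movz, hw=0
    for hw in range(1, 4):
        part = (value >> (16 * hw)) & 0xFFFF
        if part:                                   # zero slots stay zero after movz
            words.append(0xF2800000 | (hw << 21) | (part << 5) | rd)  # movk
    return words


print([f"{w:08X}" for w in load_imm64(8, 93)])        # fits: one movz
print([f"{w:08X}" for w in load_imm64(8, 0x12345)])   # needs movz + movk
```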

D2800020   movz x0,#1           ; fd=stdout
100000E1   adr  x1,msg          ; &msg (PC-relative, off 0x1C here)
D28001C2   movz x2,#14          ; len
D2800808   movz x8,#64          ; write
D4000001   svc  #0
D2800000   movz x0,#0           ; status
D2800BA8   movz x8,#93          ; exit
D4000001   svc  #0

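(movz x8,#64 = 0xD2800000 | (64<<5) | 8 = 0xD2800000 | 0x800 | 8 = 0xD2800808; movz x8,#93 = 0xD2800000 | 0xBA0 | 8 = 0xD2800BA8. The adr offset is 0x1C because msg follows the eight 4-byte instructions and adr is the second one: 32 - 4 = 28.)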
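The listing shows 32-bit values, but the file stores each one little-endian. The following sketch turns the eight words into a comment-free hex stream for `xxd -r -p` and recomputes the adr offset. `le_hex_stream` is a hypothetical helper. The mnemonics appear only as trailing Python comments, not as input.

```python
ARM64_HELLO = [
    0xD2800020,  # movz x0,#1    fd=stdout
    0x100000E1,  # adr  x1,msg
    0xD28001C2,  # movz x2,#14   len
    0xD2800808,  # movz x8,#64   write
    0xD4000001,  # svc  #0
    0xD2800000,  # movz x0,#0    status
    0xD2800BA8,  # movz x8,#93   exit
    0xD4000001,  # svc  #0
]

# adr is instruction index 1; msg starts right after the last instruction.
ADR_OFF = len(ARM64_HELLO) * 4 - 1 * 4


def le_hex_stream(words: list[int]) -> str:
    """Each 32-bit word as 4 little-endian bytes; hex only, no comments."""
    return " ".join(w.to_bytes(4, "little").hex() for w in words)


stream = le_hex_stream(ARM64_HELLO)
print(stream)
```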

Build

Lay down the bytes with xxd -r -p. It ignores whitespace, but any hex digits in comment text (including the letters a–f in ordinary words) are read as data, so put NO comment text in the hex stream. No padding is needed: header, program header, code, and string are contiguous. Working, tested builds for both arches:

chmod +x the output. Inspect with readelf -h / readelf -l (or file).
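Where printf/xxd quoting gets awkward, Python's `struct` module can act as the same kind of byte layer. In the sketch below, every opcode is still a literal hex byte, and Python only packs header fields and checks lengths. It assumes a 14-byte `Hello, world!\n` message. `build_x64_hello` and any script filename are hypothetical. Treat it as an illustrative alternative to the xxd workflow, not a replacement for it.

```python
import os
import struct
import sys

LOAD = 0x400000
CODE_OFF = 0x40 + 0x38   # after the ELF header and the one program header
CODE_LEN = 34


def build_x64_hello(msg: bytes = b"Hello, world!\n") -> bytes:
    """Return a complete static x86-64 ET_EXEC that writes msg and exits 0."""
    msg_addr = LOAD + CODE_OFF + CODE_LEN
    size = CODE_OFF + CODE_LEN + len(msg)

    ehdr = (b"\x7fELF" + bytes([2, 1, 1, 0, 0]) + bytes(7)      # e_ident
            + struct.pack("<HHIQQQIHHHHHH",
                          2, 0x3E, 1, LOAD + CODE_OFF,          # ET_EXEC, x86-64, version, entry
                          0x40, 0, 0,                           # phoff, shoff, flags
                          0x40, 0x38, 1, 0, 0, 0))              # ehsize, phentsize, phnum, no sections
    phdr = struct.pack("<IIQQQQQQ",
                       1, 5, 0, LOAD, LOAD,                     # PT_LOAD, R+X, offset, vaddr, paddr
                       size, size, 0x1000)                      # filesz, memsz, align
    code = (bytes.fromhex("B8 01000000  BF 01000000  BE")       # write; fd=1; mov esi,...
            + struct.pack("<I", msg_addr)                       # ...&msg (absolute)
            + bytes.fromhex("BA") + struct.pack("<I", len(msg)) # mov edx,len
            + bytes.fromhex("0F 05  B8 3C000000  BF 00000000  0F 05"))  # syscall; exit(0)

    assert len(code) == CODE_LEN
    image = ehdr + phdr + code + msg
    assert len(image) == size
    return image


if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    with open(path, "wb") as f:
        f.write(build_x64_hello())
    os.chmod(path, 0o755)   # make it executable (rwxr-xr-x)
```

If you save it as, say, `build_hello.py`, then `python3 build_hello.py hello_x64` should write a 168-byte file. Running `./run-elf.sh hello_x64` should then print `Hello, world!` and exit with status 0.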

Verify with qemu

qemu-user runs a foreign-arch Linux binary on any host by emulating the CPU and translating Linux syscalls.

On a Linux host: install qemu-user-static, then just:

qemu-x86_64 ./hello_x64      # or qemu-x86_64-static
qemu-aarch64 ./hello_arm64

A native-arch binary runs directly; only the foreign one needs the explicit qemu prefix (binfmt_misc usually makes even that automatic).

On a macOS host (no qemu-user, only qemu-system-*): run inside a Linux container — see verify-with-qemu.sh, which uses Docker + qemu-user-static. Do NOT reach for qemu-system-* (full-machine emulation needs a kernel+rootfs); qemu-user is the right tool for a single static binary.

Always ship a portable runner (host arch ≠ target arch)

The host you build on is frequently a different arch than the binary (e.g. authoring an x86-64 ELF on an arm64 Mac). A bare ./bin then fails confusingly, and ad-hoc docker run … apt-get install qemu-user-static … one-liners reinstall qemu on every run and spew debconf/platform noise. So always drop a `run-elf.sh` next to the binary — a single command that runs it regardless of host.

Copy run-elf.sh into the project (or generate an equivalent). It:

reads the target arch from the ELF's own e_machine field (2 bytes at offset 0x12: 3e00=x86-64, b700=arm64), with no hardcoding,
runs natively when host arch == target,
uses `qemu-<arch>-static` when the host is Linux of a different arch,
on macOS builds a small cached Docker image with qemu-user-static once, then reuses it (sub-second per run), and passes a TTY so colors render.
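The arch-detection part of such a runner reduces to a two-byte read plus a small decision table. The Python sketch below covers that decision logic only. `target_arch` and `plan` are hypothetical names, and the returned strategy strings are illustrative labels, not run-elf.sh's actual output. The inputs follow `platform.system()`/`platform.machine()`, where Apple-silicon macOS reports `arm64` and Linux reports `aarch64`.

```python
import platform
import sys

E_MACHINE_OFF = 0x12                                    # e_machine: 2 bytes, little-endian
ARCH_BY_MACHINE = {0x3E: "x86_64", 0xB7: "aarch64"}     # qemu-<arch> naming
HOST_ALIASES = {"amd64": "x86_64", "arm64": "aarch64"}  # macOS/Windows spellings


def target_arch(elf: bytes) -> str:
    """Read the target arch from the ELF's own e_machine field."""
    if elf[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    machine = int.from_bytes(elf[E_MACHINE_OFF:E_MACHINE_OFF + 2], "little")
    if machine not in ARCH_BY_MACHINE:
        raise ValueError(f"unsupported e_machine {machine:#x}")
    return ARCH_BY_MACHINE[machine]


def plan(elf: bytes, host_os: str, host_machine: str) -> str:
    """Pick a strategy: native run, qemu-user on Linux, or a container elsewhere."""
    target = target_arch(elf)
    host = HOST_ALIASES.get(host_machine.lower(), host_machine.lower())
    if host_os == "Linux":
        return "native" if host == target else f"qemu-{target}-static"
    return f"docker+qemu-{target}-static"   # e.g. macOS: no Linux kernel, no qemu-user


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        print(plan(f.read(64), platform.system(), platform.machine()))
```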

./run-elf.sh ls_x64            # works on Linux x86-64, Linux arm64, or macOS — same command
./run-elf.sh ls_x64 | cat -v   # make ANSI escapes visible as text

Before you finish: hand the user the commands

After the binary builds and you've verified it, END the run by giving the user copy-pasteable commands to run/test/verify it themselves. Use the REAL filename and arch you produced, not placeholders.

If you shipped `run-elf.sh` (recommended — see above), the one command that works on any host is `./run-elf.sh <bin>`. Lead with that, then offer the per-host raw commands below for users who want to see what's underneath or don't have it:

# inspect what you built
file <bin>
readelf -h <bin>      # header: arch, entry, type
readelf -l <bin>      # the PT_LOAD segment

# run it
./<bin>; echo "exit=$?"                 # Linux, matching arch (native)
qemu-x86_64  ./<bin>; echo "exit=$?"    # Linux, x86-64 binary via qemu-user (apt install qemu-user-static)
qemu-aarch64 ./<bin>; echo "exit=$?"    # Linux, arm64 binary via qemu-user

# macOS host (no qemu-user): run it in a Linux container via qemu-user-static
docker run --rm -v "$PWD":/w -w /w debian:stable-slim sh -c \
  'apt-get update -qq >/dev/null && apt-get install -y -qq qemu-user-static >/dev/null && \
   chmod +x <bin> && qemu-x86_64-static ./<bin>; echo "exit=$?"'   # swap qemu-aarch64-static for an arm64 binary

State the expected result too (e.g. prints ... and exits with code N), so the user can tell at a glance whether their run matches.

Common mistakes

Using x86-64 syscall numbers on arm64 or vice-versa (write is 1 vs 64, exit 60 vs 93): the intended syscalls never run, and without a real exit, execution falls through into the string bytes (usually a crash).
syscall (0F 05) on x86-64 vs svc #0 (D4000001) on arm64 — don't cross them. Linux arm64 is svc #0 with the number in x8.
Wrong e_machine (0x3E vs 0xB7): the kernel or qemu rejects the binary with "Exec format error".
Hardcoding a mov esi,imm32 absolute string address but building PIE — keep it ET_EXEC so 0x400000 is fixed.
Forgetting that ELF64 puts p_flags immediately after p_type (ELF32 puts it after p_memsz).
Writing a .s/.c file and assembling or compiling it: that is exactly what the core rule forbids.
Trying qemu-system-* for a static binary on macOS — use qemu-user-static in a container instead.