Part 0 - Introduction

Registers

This is worth knowing. Registers are a term you will run into constantly in assembly and in debugging, so it pays to know exactly what they are before that happens.

#c #memory

Registers

The Paradox

Everything in RAM has an address.

Registers don’t.

You cannot take the address of a register in C. You cannot point a pointer at one. The & operator doesn’t work on them. They exist entirely outside the addressable memory space.

And yet — registers are where all computation actually happens.

Not RAM. Registers.

The CPU cannot add two numbers that are sitting in RAM. It cannot compare them, shift them, OR them. It must first load them into registers. (Even an x86 instruction that names a memory operand is split internally into a load followed by a register operation.) The operation happens in the register. The result lives in a register. Only then, maybe, does it get written back to RAM.

Every line of C you will ever write, at the bottom, is the compiler figuring out how to choreograph data moving between RAM and registers so that the actual work can happen.

What a Register Physically Is

A register is built from flip-flops.

A flip-flop is a circuit with two stable states — 0 or 1 — that holds its state as long as power is applied, without refreshing. Unlike a DRAM capacitor, it doesn’t leak. It doesn’t need to be periodically rewritten.

It’s made of transistors arranged in a feedback loop: the output feeds back to hold the input stable. A basic SRAM-style storage cell uses 6 transistors per bit; the multi-ported cells used in real register files need more. Fast. Stable. Expensive in silicon area.

A 64-bit register is 64 of these flip-flops, sitting physically inside the CPU core, typically a fraction of a millimeter from the arithmetic units that operate on them.

Access time: less than one clock cycle. The data is essentially already there.

The Register File

A CPU core doesn’t have one register. It has a register file — a small, structured set of registers.

On x86-64 — your CPU’s architecture — the programmer-visible general purpose registers are:

RAX    RBX    RCX    RDX
RSI    RDI    RSP    RBP
R8     R9     R10    R11
R12    R13    R14    R15

16 general purpose registers. Each is 64 bits — 8 bytes wide.

Plus:

  • RIP — the instruction pointer. Points to the next instruction to execute. You can’t directly write it (well — you can, via jumps and calls).
  • RFLAGS — status flags. Zero flag, carry flag, sign flag, overflow flag. Set by arithmetic operations, read by conditional jumps.
  • XMM0–XMM15 — 128-bit SSE registers for floating point and SIMD.
  • YMM0–YMM15 — 256-bit AVX registers (the lower 128 bits are the XMM registers).
  • Various segment registers, control registers, debug registers — mostly kernel territory.
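
To make the RFLAGS entry concrete, here is a minimal sketch in C that models the four flags named above for a 32-bit ADD. The struct and function names are made up for illustration; the real flags are set by hardware, not computed like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the status flags after a 32-bit ADD. */
struct add_flags {
    bool zf; /* zero flag: result is 0 */
    bool cf; /* carry flag: unsigned carry out of bit 31 */
    bool sf; /* sign flag: copy of result bit 31 */
    bool of; /* overflow flag: signed overflow */
};

struct add_flags flags_after_add32(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;   /* unsigned arithmetic wraps modulo 2^32 */
    struct add_flags f;
    f.zf = (r == 0);
    f.cf = (r < a);       /* the sum wrapped around, so a carry occurred */
    f.sf = (r >> 31) != 0;
    /* Signed overflow: the result's sign differs from the sign of both inputs. */
    f.of = (((a ^ r) & (b ^ r)) >> 31) != 0;
    return f;
}

/* Two worked cases. */
void check_add_flags(void)
{
    struct add_flags f = flags_after_add32(0x7FFFFFFFu, 1u); /* INT_MAX + 1 */
    assert(f.of && f.sf && !f.cf && !f.zf);

    f = flags_after_add32(0xFFFFFFFFu, 1u);                 /* wraps to 0 */
    assert(f.cf && f.zf && !f.of && !f.sf);
}
```

A conditional jump such as jz or jc then simply tests one of these bits.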

The Naming Hierarchy — This Is Important

x86-64 has backwards compatibility going back to 1978. The registers have sub-register names that reflect this history.

Take RAX:

RAX  = 64-bit  [bits 63 ────────────────────── 0]
EAX  = 32-bit  [                   bits 31 ── 0]
AX   = 16-bit  [                         15 ── 0]
AH   =  8-bit  [                         15 ── 8]  (high byte of AX)
AL   =  8-bit  [                               7 ── 0]  (low byte of AX)

Same physical register. Different width views into it.

Writing to EAX zero-extends into RAX — the upper 32 bits are cleared. This was a deliberate x86-64 design decision.

Writing to AX — only the lower 16 bits change. Upper 48 bits untouched.

Writing to AH or AL — only that byte changes.

This matters for exploit development. A function returns a value in RAX. If the function was compiled to write EAX, the upper 32 bits of RAX are zeroed. If it wrote AX, they’re not. Getting this wrong when reading return values causes subtle, catastrophic bugs.
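
The write rules above can be modeled with plain masking. This is a sketch, not how the hardware is built: the function names are invented here, and each one returns what RAX would hold after the corresponding partial write.

```c
#include <assert.h>
#include <stdint.h>

/* Writing EAX: the old contents are discarded, the value is zero-extended. */
uint64_t write_eax(uint64_t rax, uint32_t v)
{
    (void)rax;
    return (uint64_t)v;
}

/* Writing AX: bits 63..16 are preserved. */
uint64_t write_ax(uint64_t rax, uint16_t v)
{
    return (rax & ~UINT64_C(0xFFFF)) | v;
}

/* Writing AL: bits 63..8 are preserved. */
uint64_t write_al(uint64_t rax, uint8_t v)
{
    return (rax & ~UINT64_C(0xFF)) | v;
}

/* Writing AH: only bits 15..8 change. */
uint64_t write_ah(uint64_t rax, uint8_t v)
{
    return (rax & ~UINT64_C(0xFF00)) | ((uint64_t)v << 8);
}

/* Worked cases starting from the same RAX value. */
void check_partial_writes(void)
{
    const uint64_t rax = UINT64_C(0x1122334455667788);
    assert(write_eax(rax, 0xAABBCCDDu) == UINT64_C(0x00000000AABBCCDD));
    assert(write_ax(rax, 0xBEEF)       == UINT64_C(0x112233445566BEEF));
    assert(write_al(rax, 0xFF)         == UINT64_C(0x11223344556677FF));
    assert(write_ah(rax, 0xFF)         == UINT64_C(0x112233445566FF88));
}
```

Only the 32-bit write destroys the upper half. That asymmetry is the part to remember.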

The same sub-register naming applies:

RBX / EBX / BX / BH / BL
RCX / ECX / CX / CH / CL
RDX / EDX / DX / DH / DL
RSI / ESI / SI / SIL
RDI / EDI / DI / DIL
RSP / ESP / SP / SPL
RBP / EBP / BP / BPL
R8  / R8D  / R8W  / R8B
R9  / R9D  / R9W  / R9B
... (R10 through R15 same pattern)

What Each Register Is Used For — Calling Convention

Registers aren’t just generic storage. The x86-64 System V ABI (the calling convention your Linux system uses) assigns roles to them:

REGISTER   ROLE                          CALLER/CALLEE SAVED?
────────────────────────────────────────────────────────────
RAX        Return value / scratch        Caller-saved
RBX        Callee-saved general purpose  Callee-saved
RCX        4th argument                  Caller-saved
RDX        3rd argument                  Caller-saved
RSI        2nd argument                  Caller-saved
RDI        1st argument                  Caller-saved
RSP        Stack pointer                 Callee-saved (special)
RBP        Frame pointer                 Callee-saved
R8         5th argument                  Caller-saved
R9         6th argument                  Caller-saved
R10        Scratch / static chain        Caller-saved
R11        Scratch                       Caller-saved
R12–R15    Callee-saved general purpose  Callee-saved
RIP        Instruction pointer           Not directly writable
RFLAGS     Condition codes               Caller-saved

Caller-saved means: if you (the caller) care about the value in that register after making a function call, you must save it before the call. The callee is free to trash it.

Callee-saved means: if the function (callee) uses that register, it must save and restore it. When the function returns, those registers must have the same values they had when the function was entered.

When you call printf("hello") — the string address goes into RDI. That’s not a C concept. That’s metal. The actual instruction mov rdi, <address> happens before the call printf instruction.

When malloc returns a pointer, it’s in RAX. p = malloc(...) is the compiler emitting mov [address_of_p], rax.

Why No Address

Registers are not part of the addressable memory space because they are not memory.

They’re not on the memory bus. They’re not in the RAM chips. They’re not behind the memory controller. They’re inside the CPU core itself, wired directly to the ALU (Arithmetic Logic Unit), the load/store units, the instruction decoders.

The CPU reaches a register by hardwired logic: specific bits in the instruction encoding specify which register. The instruction add rax, rbx is encoded as specific bytes where certain bit fields say “destination: register 0 (rax), source: register 3 (rbx).” The silicon reads those bits and routes the register file outputs directly to the adder inputs.
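
To see those bit fields, here is a small sketch that builds the three bytes for a 64-bit register-to-register add from the hardware register numbers (rax=0, rcx=1, rdx=2, rbx=3, through r15=15). The function name is made up, and it assumes the opcode 0x01 form (ADD r/m64, r64); an assembler is free to pick the equivalent 0x03 form instead.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "add dst, src" for two 64-bit general-purpose registers.
   Writes exactly 3 bytes: REX prefix, opcode, ModRM. */
void encode_add_r64(unsigned dst, unsigned src, uint8_t out[3])
{
    assert(dst < 16 && src < 16);
    out[0] = (uint8_t)(0x48                 /* REX with W=1: 64-bit operand size    */
                       | ((src >> 3) << 2)  /* REX.R: bit 3 of the reg field (src)  */
                       | (dst >> 3));       /* REX.B: bit 3 of the r/m field (dst)  */
    out[1] = 0x01;                          /* opcode: ADD r/m64, r64               */
    out[2] = (uint8_t)(0xC0                 /* ModRM mod=11: register to register   */
                       | ((src & 7) << 3)   /* reg field: source register number    */
                       | (dst & 7));        /* r/m field: destination register no.  */
}
```

Calling encode_add_r64(0, 3, buf) fills buf with 48 01 D8, the bytes a disassembler shows for add rax, rbx. The register numbers are just small integers packed into the ModRM byte. There is nothing address-like about them.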

No address. No bus transaction. No memory latency. It just happens in the combinational logic.

This is why register access is measured in fractions of a nanosecond while a RAM access takes on the order of 80ns. They aren’t even the same class of thing.

The Compiler’s Job — Register Allocation

You write:

int a = 1;
int b = 2;
int c = a + b;
int d = c * 3;

The compiler’s job — specifically the register allocator pass — is to figure out which variables live in which registers, when, and when to spill them to the stack because you ran out of registers.

A smart register allocation for the above:

mov  eax, 1        ; a → EAX
mov  ecx, 2        ; b → ECX
add  eax, ecx      ; EAX = a + b = 3, this is c
lea  edx, [rax + rax*2]  ; EDX = c * 3 = 9, this is d

Notice: a, b, c, d never touched RAM. They lived entirely in registers. No memory access at all.

This is what -O2 optimization is largely doing — keeping values in registers as long as possible instead of constantly writing and reading back from the stack.

The unoptimized version (like -O0, used for debugging) religiously writes every variable to the stack and reads it back, so the debugger can always find them at known stack addresses. This is often several times slower. That’s the cost of debuggability.

Register Spilling

16 general-purpose registers. Your function might have 40 local variables.

When the register allocator runs out of registers, it spills — writes a register’s current value to a stack slot, frees the register for something else, reloads later when needed.

Spills are expensive. They’re the register allocator admitting defeat and going to RAM (well, stack — which is in RAM — which means it’ll likely hit L1 cache, but still).

Good C code, good compiler flags, and good data structure design minimize spills. Functions with fewer live variables at any one time, fewer arguments, simpler control flow — these allow the compiler to keep more in registers.

The Programmer-Invisible Registers

Modern CPUs have far more registers than you can see.

Out-of-order execution requires this. The CPU renames the architectural registers (RAX, RBX…) to a much larger pool of physical registers. Recent x86-64 cores typically have on the order of 150 to 300 physical integer registers internally, plus a separate physical file for the vector registers.

This is register renaming. When two instructions both want to write RAX, the CPU assigns them to different physical registers and tracks which version is “the real RAX” at any point. This lets the CPU execute those instructions in parallel even though they appear to write the same register.
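
A toy model makes the idea easier to hold. The sketch below is a deliberately naive rename table: the names and sizes are hypothetical, and a real core also recycles physical registers when instructions retire, which is left out here.

```c
#include <assert.h>

#define NUM_ARCH 16   /* architectural registers: RAX..R15 */
#define NUM_PHYS 64   /* hypothetical physical register count */

struct rename_table {
    int map[NUM_ARCH];  /* architectural index -> current physical index */
    int next_free;      /* naive allocator: next unused physical register */
};

void rename_init(struct rename_table *t)
{
    for (int i = 0; i < NUM_ARCH; i++)
        t->map[i] = i;              /* start with an identity mapping */
    t->next_free = NUM_ARCH;
}

/* A read uses whichever physical register currently holds the value. */
int rename_read(const struct rename_table *t, int arch)
{
    return t->map[arch];
}

/* Every write gets a fresh physical register, so an older in-flight
   value of the same architectural register is not overwritten. */
int rename_write(struct rename_table *t, int arch)
{
    assert(t->next_free < NUM_PHYS);  /* toy model: no freeing or recycling */
    t->map[arch] = t->next_free++;
    return t->map[arch];
}
```

Two back-to-back calls to rename_write for RAX (index 0) return two different physical indices, which is exactly why two writes to “the same register” no longer have to wait for each other.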

You never see these. The compiler doesn’t see these. The OS doesn’t see these. They exist entirely within the CPU microarchitecture — the hardware-level implementation that sits below even the ISA (Instruction Set Architecture).

Spectre and Meltdown exploited this speculative, out-of-order execution engine. Instructions that were executed speculatively and then discarded still left traces in the data cache, and timing later memory accesses exposed secrets from memory that the program technically never had permission to read.

That is what operating at this level looks like.

Context Switching — The Register Save Problem

Your OS runs many processes. The CPU has one set of registers.

When the kernel switches from process A to process B — it must save all of process A’s registers somewhere, and load all of process B’s registers that were saved last time B was running.

This saved register state is the context — stored in a kernel data structure per thread. On x86-64 that’s all 16 general purpose registers, RIP, RFLAGS, the segment registers, and the FPU/SSE/AVX state.

The AVX-512 register file alone is 32 registers × 64 bytes = 2048 bytes (plus eight mask registers) that must be saved and restored whenever the kernel switches away from a thread that has used them. This is why using AVX-512 can increase context switch overhead.

When an exploit does a privilege escalation — when it gets ring 3 code running as ring 0 — part of what it’s manipulating is this register save/restore mechanism. The kernel trusts the saved register state. An attacker who controls the saved state controls what registers contain when execution resumes.

The RIP Register — Control Flow Is Just a Register

The instruction pointer — RIP — contains the address of the next instruction to execute.

The CPU fetch-decode-execute cycle:

1. Read memory at address in RIP → fetch instruction bytes
2. Decode those bytes → determine what operation and operands
3. Execute the operation
4. Advance RIP to next instruction (or jump sets RIP directly)
5. Repeat

Execution is just: RIP advances. A jump is just: RIP gets set to a new value. A function call is: push the address of the next instruction (the return address) onto the stack, set RIP to the function address. A return is: pop the saved address from the stack, set RIP to it.

All control flow is register manipulation.
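
The cycle above fits in a few lines of C. This is a toy machine with invented opcodes, one data register, and an instruction pointer that is just an array index. It is not x86, only the same loop in miniature.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_HALT, OP_ADDI, OP_JMP, OP_CALL, OP_RET };  /* hypothetical opcodes */

struct insn { int op; int64_t arg; };

struct cpu {
    uint64_t ip;          /* plays the role of RIP (index into the program) */
    int64_t  acc;         /* single data register */
    uint64_t stack[16];   /* return-address stack */
    int      sp;          /* number of entries on the stack */
};

/* Run until OP_HALT (or an unknown opcode) and return the accumulator. */
int64_t run(const struct insn *prog)
{
    struct cpu c = { 0 };
    for (;;) {
        struct insn i = prog[c.ip];   /* fetch at ip                     */
        c.ip++;                       /* default: advance to next insn   */
        switch (i.op) {               /* decode and execute              */
        case OP_ADDI: c.acc += i.arg; break;
        case OP_JMP:  c.ip = (uint64_t)i.arg; break;
        case OP_CALL:
            assert(c.sp < 16);
            c.stack[c.sp++] = c.ip;   /* push address of next instruction */
            c.ip = (uint64_t)i.arg;
            break;
        case OP_RET:
            assert(c.sp > 0);
            c.ip = c.stack[--c.sp];   /* pop saved address into ip        */
            break;
        default:
            return c.acc;             /* OP_HALT */
        }
    }
}

/* Sample program: call a "function" at index 3, then add 1 and halt. */
const struct insn demo_program[] = {
    { OP_CALL, 3 },   /* 0: push 1, ip = 3 */
    { OP_ADDI, 1 },   /* 1: acc += 1       */
    { OP_HALT, 0 },   /* 2: stop           */
    { OP_ADDI, 10 },  /* 3: acc += 10      */
    { OP_RET,  0 },   /* 4: pop 1 into ip  */
};
```

run(demo_program) returns 11. Notice that OP_RET trusts whatever value is on the stack, which is the whole point of the next paragraph.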

A buffer overflow that overwrites the return address on the stack — it’s overwriting the value that will be popped into RIP. When ret executes, it pops your value into RIP, and the CPU starts fetching instructions from wherever you pointed.

That’s not a metaphor. That’s literally what happens. The CPU has no opinion about it. It executes whatever RIP points to.

Summary — What You Own After 0.3

Register         = flip-flop array, inside CPU core, no address, sub-nanosecond
16 GPRs          = RAX RBX RCX RDX RSI RDI RSP RBP R8–R15
Sub-registers    = EAX/AX/AH/AL — different width views of same register
Calling conv     = RDI RSI RDX RCX R8 R9 for args, RAX for return
Callee-saved     = RBX RBP R12–R15 — function must preserve
Caller-saved     = rest — caller must save if it cares
RIP              = instruction pointer — control flow is just this register
Register rename  = CPU has 168+ physical regs behind the 16 you see
Context switch   = OS saves/restores all registers between processes
Spill            = register allocator evicting to stack when regs exhausted

One sentence:

Registers are the only place computation actually happens — tiny, impossibly fast, nameless storage wired directly to the CPU’s arithmetic logic, invisible to the address space, and the ultimate target of everything an attacker wants to control.

0.3 Extended — Register Conventions

The Full Contract Between Every Function That Ever Calls Another

Why Conventions Exist At All

The CPU doesn’t enforce any of this.

You could write a program where function arguments go in R15, R14, R13. The CPU doesn’t care. It executes whatever instructions you give it.

The calling convention is a social contract — agreed upon by compiler writers, OS designers, and library authors so that code compiled by different compilers, from different languages, can call each other without negotiating every time.

When your C code calls printf from glibc — your compiler and the glibc compiler never met. They agreed on the convention. That’s the only reason it works.

On Linux x86-64 the convention is called System V AMD64 ABI. On Windows x86-64 it’s different — Microsoft defined their own. This is why Windows and Linux binaries are incompatible at the ABI level even on identical hardware.

The Full Register Map — Burned Into Memory

REG     ALT NAME    ROLE                          SAVED BY
──────────────────────────────────────────────────────────────
RAX     —           Return value (int/ptr)         Caller
RBX     —           General purpose               Callee
RCX     4th arg     4th integer argument           Caller
RDX     3rd arg     3rd integer argument           Caller
RSI     2nd arg     2nd integer argument           Caller
RDI     1st arg     1st integer argument           Caller
RSP     sp          Stack pointer                  Callee (special)
RBP     fp/bp       Frame pointer (base pointer)   Callee
R8      5th arg     5th integer argument           Caller
R9      6th arg     6th integer argument           Caller
R10     —           Scratch / static chain ptr     Caller
R11     —           Scratch                        Caller
R12     —           General purpose               Callee
R13     —           General purpose               Callee
R14     —           General purpose               Callee
R15     —           General purpose               Callee
RIP     pc          Instruction pointer            N/A
RFLAGS  —           Condition codes                Caller
XMM0–XMM7     Float/SSE args 1–8, XMM0 = float return value    Caller
XMM8–XMM15    Scratch float registers                           Caller

Caller-Saved vs Callee-Saved — The Exact Mental Model

Imagine you are writing a function. You’re mid-computation. You have a value in RAX you spent 20 instructions computing. You need to call malloc.

Does malloc promise to leave your RAX alone?

No. RAX is caller-saved. malloc will absolutely write its return value into RAX. Your value is gone.

If you need that value after the call — you (the caller) must push it to the stack before the call and pop it back after.

; You computed something precious, it's in RAX
push rax              ; save it — your responsibility
mov  rdi, 64          ; argument to malloc
call malloc           ; RAX now = pointer malloc returned
pop  rcx              ; restore your precious value
                      ; (into RCX, not RAX: RAX is the malloc result.
                      ;  Not RBX either: that one is callee-saved)

Now imagine you are writing a library function. You need RBX for your own work. RBX is callee-saved. The caller may have something precious in RBX and is trusting you not to destroy it.

You (the callee) must save RBX at your function entry and restore it before returning.

my_function:
    push rbx          ; save caller's RBX — your responsibility
    
    ; ... use RBX freely for your own work ...
    
    pop  rbx          ; restore it
    ret

The CPU doesn’t enforce this. If you trash RBX without restoring it, the CPU executes happily. The caller will just find garbage where their value used to be, and the resulting bug will be spectacular and nearly impossible to diagnose.

Argument Passing — The Exact Order

Integer and pointer arguments, left to right:

1st arg → RDI
2nd arg → RSI
3rd arg → RDX
4th arg → RCX
5th arg → R8
6th arg → R9
7th+ args → pushed on stack, right to left

Concrete example:

ssize_t write(int fd, const void *buf, size_t count);

When you call write(1, "hello", 5):

RDI = 1              (fd)
RSI = address of "hello"   (buf)
RDX = 5              (count)

Then call write.

Another:

int something(int a, int b, int c, int d, int e, int f, int g);
RDI = a
RSI = b
RDX = c
RCX = d
R8  = e
R9  = f
[RSP+8] = g          (on stack, above return address)
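
The mapping is mechanical enough to write down as a lookup. A minimal sketch with made-up helper names, assuming only integer or pointer arguments that each fit in one 8-byte slot:

```c
#include <assert.h>
#include <stddef.h>

/* Register that carries the n-th (1-based) integer or pointer argument
   under the System V AMD64 convention, or NULL if it goes on the stack. */
const char *sysv_int_arg_reg(int n)
{
    static const char *const regs[] = { "RDI", "RSI", "RDX", "RCX", "R8", "R9" };
    assert(n >= 1);
    return (n <= 6) ? regs[n - 1] : NULL;
}

/* Byte offset from RSP at function entry for a stack-passed argument.
   Returns -1 for arguments 1..6, which travel in registers. */
int sysv_stack_arg_offset(int n)
{
    assert(n >= 1);
    return (n >= 7) ? 8 + 8 * (n - 7) : -1;   /* [RSP] itself is the return address */
}
```

For the something() example, sysv_int_arg_reg(4) gives "RCX" and sysv_stack_arg_offset(7) gives 8, matching [RSP+8] = g.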

The Stack Arguments — Exactly How

When arguments spill to the stack, the caller pushes them before the call. The stack looks like this at function entry, immediately after the CALL instruction has pushed the return address:

HIGH ADDRESS
┌─────────────────┐
│    arg 9        │  ← RSP + 24  (if 9+ args)
├─────────────────┤
│    arg 8        │  ← RSP + 16
├─────────────────┤
│    arg 7        │  ← RSP + 8
├─────────────────┤
│  return address │  ← RSP       (pushed by CALL instruction)
└─────────────────┘
LOW ADDRESS

Inside the called function, if it uses a frame pointer:

HIGH ADDRESS
┌─────────────────┐
│    arg 9        │  ← RBP + 32
├─────────────────┤
│    arg 8        │  ← RBP + 24
├─────────────────┤
│    arg 7        │  ← RBP + 16
├─────────────────┤
│  return address │  ← RBP + 8
├─────────────────┤
│   saved RBP     │  ← RBP      (the prologue pushed caller's RBP)
├─────────────────┤
│   local vars    │  ← RBP - 8, RBP - 16 ...
└─────────────────┘
LOW ADDRESS

The ABI requires the stack to be 16-byte aligned at the point of CALL. This means RSP must be divisible by 16 immediately before the CALL instruction executes. CALL then pushes 8 bytes (the return address), so at function entry RSP is 8 bytes off a 16-byte boundary. The callee’s prologue typically pushes RBP (another 8 bytes), restoring 16-byte alignment. This matters for SSE/AVX instructions that require aligned memory operands, such as movaps: they fault if the stack slot they access isn’t aligned.
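
The alignment bookkeeping is plain arithmetic. Here is a sketch with invented helper names that tracks RSP through a CALL and a standard frame-pointer prologue, assuming the compiler rounds the local area up to a multiple of 16:

```c
#include <assert.h>
#include <stdint.h>

/* Round a local-variable area up to a multiple of 16 bytes. */
uint64_t align_up_16(uint64_t n)
{
    return (n + 15) & ~UINT64_C(15);
}

/* RSP after "push rbp; mov rbp, rsp; sub rsp, N", given RSP just before the CALL. */
uint64_t rsp_after_prologue(uint64_t rsp_before_call, uint64_t locals)
{
    assert(rsp_before_call % 16 == 0);  /* the ABI requirement at the call site      */
    uint64_t rsp = rsp_before_call;
    rsp -= 8;                           /* CALL pushes return address: rsp % 16 == 8 */
    rsp -= 8;                           /* push rbp: back to rsp % 16 == 0           */
    rsp -= align_up_16(locals);         /* sub rsp, N with N rounded up              */
    return rsp;
}
```

With RSP at 0x7ffc0000 before the call and 20 bytes of locals, the result is 0x7ffbffd0: 16 bytes for the two pushes, 32 for the rounded-up locals, and still 16-byte aligned.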

Return Values — Every Case

TYPE                          WHERE RETURNED
──────────────────────────────────────────────────────
int, long, pointer, size_t    RAX
Second 64-bit word (128-bit)  RDX (RDX:RAX pair)
float, double                 XMM0
Long double (80-bit)          ST(0) — x87 FPU stack
Struct ≤ 16 bytes             RAX + RDX for integer fields,
                              XMM0 + XMM1 for float fields
Struct > 16 bytes             Caller allocates space,
                              passes address as hidden
                              first argument in RDI,
                              function fills it, returns
                              that address in RAX

The struct return case is subtle and important. If you write:

typedef struct { long a; long b; long c; } Big;
Big get_big(void);

The compiled call looks like this under the hood:

// What the compiler actually generates, conceptually:
Big result;                    // caller allocates on stack
get_big_hidden(&result);       // passes hidden pointer in RDI
// get_big writes into *hidden_ptr
// returns the pointer in RAX (which caller ignores or uses)

This is why returning large structs by value in C isn’t “free” — the caller is silently allocating stack space and passing a hidden pointer. You won’t see this in the C source. You will see it in the disassembly.
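
Both shapes can be written in ordinary C and compared. In this sketch get_big_lowered is a hypothetical name for what the compiler arranges behind your back; the real lowered function has no separate name or symbol.

```c
#include <assert.h>

typedef struct { long a; long b; long c; } Big;  /* 24 bytes on x86-64 Linux */

/* What you write: return by value. */
Big get_big(void)
{
    Big r = { 1, 2, 3 };
    return r;
}

/* Roughly what gets generated: the caller passes the address of its own
   storage, the function fills it in and returns that same address. */
Big *get_big_lowered(Big *out)
{
    out->a = 1;
    out->b = 2;
    out->c = 3;
    return out;
}

/* Both forms leave the same bytes in the caller's object. */
void check_struct_return(void)
{
    Big x = get_big();
    Big y;
    Big *p = get_big_lowered(&y);
    assert(p == &y);
    assert(x.a == y.a && x.b == y.b && x.c == y.c);
}
```

Compile get_big with gcc -O2 -S and look for stores through RDI followed by mov rax, rdi. That is the hidden pointer at work.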

The Function Prologue — Exactly What Happens

Every standard function begins with a prologue that sets up its stack frame:

push rbp              ; save caller's frame pointer (callee-saved)
mov  rbp, rsp         ; RBP now points to our frame base
sub  rsp, N           ; allocate N bytes for local variables

And ends with an epilogue:

mov  rsp, rbp         ; restore stack pointer (undo local allocation)
pop  rbp              ; restore caller's frame pointer
ret                   ; pop return address into RIP, jump there

Or equivalently using the leave instruction:

leave                 ; does: mov rsp, rbp; pop rbp
ret

With -O2 and no frame pointer (-fomit-frame-pointer), RBP is freed as a general register and the prologue disappears or becomes minimal. Stack unwinding then uses DWARF .eh_frame metadata instead. Backtraces still work, but RBP no longer serves as the anchor.

Seeing It For Real

Take this C:

int add(int a, int b) {
    return a + b;
}

int main(void) {
    int x = add(3, 7);
    return x;
}

Compile: gcc -O0 -masm=intel -S -o out.s file.c

You’ll see something like:

add:
    push   rbp
    mov    rbp, rsp
    mov    DWORD PTR [rbp-4], edi    ; spill arg a to stack
    mov    DWORD PTR [rbp-8], esi    ; spill arg b to stack
    mov    edx, DWORD PTR [rbp-4]   ; reload a
    mov    eax, DWORD PTR [rbp-8]   ; reload b
    add    eax, edx                  ; eax = a + b
    pop    rbp
    ret                              ; return value in EAX

main:
    push   rbp
    mov    rbp, rsp
    sub    rsp, 16                   ; allocate 16 bytes for locals
    mov    esi, 7                    ; 2nd arg → ESI
    mov    edi, 3                    ; 1st arg → EDI
    call   add
    mov    DWORD PTR [rbp-4], eax    ; store return value (x)
    mov    eax, DWORD PTR [rbp-4]   ; load x into return reg
    leave
    ret

At -O2:

add:
    lea    eax, [rdi + rsi]    ; eax = a + b directly from arg regs
    ret

main:
    mov    eax, 10             ; compiler computed 3+7=10 at compile time
    ret

The -O0 version is the calling convention made visible. Every arg explicitly spilled and reloaded. Every local at a known stack address. Painfully inefficient — but debuggable.

The -O2 version is the calling convention optimized away. Arguments never touch the stack. The result is a compile-time constant. The function call itself disappears.

Variadic Functions — How printf Works

printf(const char *fmt, ...) takes variable arguments. How does it know where they are?

The ABI has rules for this too.

Integer/pointer variadic args go into the same registers as normal args, in order. If fmt is the first arg (RDI), then the first variadic arg is RSI, then RDX, etc.

But here’s the catch: printf doesn’t know at compile time how many args there are. It reads fmt to find out how many %d, %s etc. are in the format string.

It then reads those args from registers, but it can only do this because the variadic function itself (the callee) saved them to a known location on the stack at function entry. This is the register save area: in a variadic function prologue, all potential argument registers (RDI, RSI, RDX, RCX, R8, R9) are dumped to a contiguous block on the stack so that va_arg can walk through them with a simple pointer. The XMM argument registers are saved the same way when the caller signals in AL that it used any.
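
Here is the same mechanism from the C side: a minimal variadic function, with a made-up name, that walks its arguments with va_arg. Like printf, it has no way to check how many arguments were really passed. It trusts count the way printf trusts the format string.

```c
#include <assert.h>
#include <stdarg.h>

/* Sum `count` int arguments passed after `count`. */
long sum_ints(int count, ...)
{
    va_list ap;
    long total = 0;

    assert(count >= 0);
    va_start(ap, count);            /* position ap at the first variadic argument */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* fetch the next argument as an int */
    va_end(ap);
    return total;
}
```

sum_ints(3, 1, 2, 3) returns 6. Passing a count larger than the number of real arguments is undefined behavior: va_arg just keeps reading whatever the next register slot or stack slot happens to contain.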

A format string vulnerability is: you call printf(user_input). No format arguments. But printf thinks it has arguments: whatever garbage is in RSI, RDX, RCX, R8, R9, and then whatever is on the stack above the frame. %x reads and prints those. %n writes the count of characters printed so far to the address held in the corresponding argument register or stack slot.

The attacker controls the format string. They control what gets read. They control what gets written. The register convention is what makes the corruption predictable — they know exactly which register or stack slot corresponds to which %x in the format string.

Windows x64 ABI — The Differences

If you ever read Windows exploit code or reverse Windows binaries:

1st arg → RCX   (not RDI)
2nd arg → RDX   (not RSI)
3rd arg → R8    (not RDX)
4th arg → R9    (not RCX)
5th+ → stack

And critically: Windows requires the caller to allocate 32 bytes of shadow space on the stack before every call — even if the callee uses no stack args. It’s reserved space the callee can use to spill its register args if it wants. The callee doesn’t have to use it, but the space must be there.

This is why Windows shellcode and Linux shellcode are different even on the same CPU. Same ISA. Different ABI. Different stack layout. Different argument registers.
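
For comparison with the Linux mapping, a sketch of the Windows x64 layout as seen at callee entry. The helper names are invented, and it assumes integer or pointer arguments only. Each argument has an 8-byte slot: the first four slots are the shadow space, and arguments five and up live in the slots after it.

```c
#include <assert.h>
#include <stddef.h>

/* Register for the n-th (1-based) integer or pointer argument on
   Windows x64, or NULL if the argument is passed on the stack. */
const char *win64_int_arg_reg(int n)
{
    static const char *const regs[] = { "RCX", "RDX", "R8", "R9" };
    assert(n >= 1);
    return (n <= 4) ? regs[n - 1] : NULL;
}

/* Offset from RSP at callee entry of the n-th argument's stack slot.
   [RSP] is the return address. Slots 1..4 are the 32-byte shadow space
   (home slots for the register arguments), slots 5+ hold real stack args. */
int win64_arg_slot_offset(int n)
{
    assert(n >= 1);
    return 8 + 8 * (n - 1);
}
```

So the fifth argument sits at [RSP+40] on entry, not at [RSP+8] as the first stack argument does on Linux.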

Summary — The Full Contract

ARG ORDER      RDI RSI RDX RCX R8 R9 → stack (right to left)
RETURN         RAX (int/ptr), XMM0 (float), RDX:RAX (128-bit int)
CALLEE-SAVED   RBX RBP R12 R13 R14 R15 RSP
CALLER-SAVED   RAX RCX RDX RSI RDI R8 R9 R10 R11 RFLAGS XMM0-15
PROLOGUE       push rbp / mov rbp,rsp / sub rsp,N
EPILOGUE       leave / ret   or   mov rsp,rbp / pop rbp / ret
ALIGNMENT      RSP must be 16-byte aligned before CALL
WINDOWS DIFF   RCX RDX R8 R9 for args + 32-byte shadow space

One sentence:

The calling convention is the invisible contract that lets separately compiled code interoperate — it specifies exactly which register holds each argument, who saves what, where the return value appears, and how the stack is shaped at every call — and every exploit that touches function calls is exploiting a deviation from or manipulation of this contract.