The Memory Hierarchy
The Fundamental Contradiction
You have a CPU that can execute an instruction in 0.3 nanoseconds.
You have RAM that takes 80 nanoseconds to respond.
That’s a 266x speed mismatch. If the CPU had to wait for RAM on every instruction, a 4GHz processor would effectively run at 15MHz. You’d be back to 1981.
The memory hierarchy exists entirely to resolve this contradiction.
It doesn’t solve it. It hides it — using a layered system of progressively faster, progressively smaller, progressively more expensive storage, arranged so that the data you need next is almost always in the fast layer when you reach for it.
Almost always. When it isn’t — you feel it.
The Full Hierarchy
SIZE LATENCY BANDWIDTH COST/GB
┌─────────────┐
│ REGISTERS │ ~1 KB 0.3 ns ~10 TB/s N/A
└──────┬──────┘
│
┌──────┴──────┐
│ L1 CACHE │ 32–64 KB ~1 ns ~3 TB/s N/A
└──────┬──────┘ per core
│
┌──────┴──────┐
│ L2 CACHE │ 256KB–1MB ~4 ns ~1 TB/s N/A
└──────┬──────┘ per core
│
┌──────┴──────┐
│ L3 CACHE │ 8–64 MB ~40 ns ~400 GB/s N/A
└──────┬──────┘ shared
│
┌──────┴──────┐
│ DRAM │ 8–256 GB ~80 ns ~50 GB/s ~$3/GB
└──────┬──────┘
│
┌──────┴──────┐
│ NVMe SSD │ 500GB–8TB ~100 μs ~7 GB/s ~$0.10/GB
└──────┬──────┘
│
┌──────┴──────┐
│ SATA SSD │ 500GB–8TB ~500 μs ~0.5 GB/s ~$0.07/GB
└──────┬──────┘
│
┌──────┴──────┐
│ HDD │ 1TB–20TB ~10 ms ~0.1 GB/s ~$0.02/GB
└─────────────┘
Each level going down: bigger, cheaper, slower.
Each level going up: smaller, more expensive, faster.
The hierarchy works because of one empirical observation about real programs:
The Principle of Locality
Programs don’t access memory randomly. They cluster.
Temporal locality — if you accessed a memory location recently, you’ll probably access it again soon. Loop variables. Counters. Frequently called functions.
Spatial locality — if you accessed a memory location, you’ll probably access nearby locations soon. Arrays. Structs. Sequential instruction execution.
These aren’t rules the programmer enforces. They’re patterns that naturally emerge from how programs are structured. Loops iterate over arrays. Functions call other functions nearby in the binary. Data structures group related fields.
The cache hierarchy is designed to exploit these patterns. The hardware bets that what you touched recently and what lives nearby will be needed again. It keeps those things close.
When the bet is right — performance is extraordinary. When the bet is wrong — you feel the full cost of going to the next level down.
L1 Cache — The Inner Sanctum
32–64KB per core. ~1ns latency.
This is the fastest memory your program can touch without using registers. It sits physically adjacent to the execution units on the CPU die — a few hundred micrometers away.
L1 is split into two halves on most modern CPUs:
L1i — instruction cache. Holds recently fetched machine code. When the CPU is executing a tight loop, the loop’s instructions are in L1i and the fetch stage never has to go anywhere else.
L1d — data cache. Holds recently accessed data. Your loop variable, your array elements, your struct fields — if they’re hot, they live here.
32KB sounds tiny. It is tiny. But a tight inner loop might only touch a few hundred bytes of data repeatedly. 32KB is enormous relative to that.
The L1 hit rate for well-written code is 95–99%. The other 1–5% propagates down to L2.
L2 Cache — The Buffer
256KB–1MB per core. ~4ns latency.
Larger than L1. Still per-core — not shared. Acts as a victim cache and staging area.
When L1 evicts a cache line (needs room for new data), it goes to L2. When L1 misses, L2 is checked before going to L3. If L2 has it — 4ns. Still fast.
The L1→L2 boundary is where most “hot path” code lives. If your inner loop’s working set fits in L2, you’re running efficiently. If it spills into L3, you’ve probably made a data structure decision you should reconsider.
L3 Cache — The Shared Pool
8–64MB, shared across all cores. ~40ns latency.
This is the last line of defense before RAM.
Shared means: all cores on the die compete for L3 space. Core 0’s hot data and Core 7’s hot data are both in L3. When one core evicts a line, it’s gone for everyone. Under heavy multi-core workloads, L3 contention is a real performance problem.
L3 also plays a critical role in cache coherency (0.6) — when one core modifies data, L3 is where the synchronization happens before other cores see the update.
40ns sounds fast. Compare to registers at 0.3ns. An L3 miss means you’re going to RAM at 80ns. L3 is your last chance.
DRAM — You Already Know This
80ns. 50GB/s bandwidth. Everything from 0.1 and 0.2.
The key point in the hierarchy context: an L3 miss is catastrophic. You’ve burned through all three cache levels and now you’re waiting 80ns — 320 CPU cycles — for RAM to respond.
This is why cache-oblivious algorithms exist. Why data-oriented design matters. Why the layout of your structs in memory is a performance question, not just an aesthetic one.
The Cliff, Not the Slope
Look at the latency numbers again:
L1 → L2 : 1ns → 4ns = 4x slower
L2 → L3 : 4ns → 40ns = 10x slower
L3 → RAM : 40ns → 80ns = 2x slower
RAM → SSD : 80ns → 100μs = 1250x slower
SSD → HDD : 100μs → 10ms = 100x slower
The L2→L3 boundary is a 10x cliff. The RAM→SSD boundary is a 1250x cliff.
These aren’t smooth degradations. They’re discontinuities. Code that fits in L2 doesn’t run 10% slower when it spills to L3 — it runs 10x slower at the affected accesses. Code that fits in RAM doesn’t run gradually slower when it starts paging to SSD — it falls off a cliff.
Understanding where your working set sits in this hierarchy is the difference between code that performs and code that doesn’t.
Cache Geometry — How It’s Organized Internally
A cache isn’t just a bag of recently used bytes. It has precise structure.
A cache is divided into sets. Each set has ways — slots for cache lines.
N-way set-associative cache:
Set 0: [ way 0 | way 1 | way 2 | way 3 ]
Set 1: [ way 0 | way 1 | way 2 | way 3 ]
Set 2: [ way 0 | way 1 | way 2 | way 3 ]
...
A typical L1d: 32KB, 8-way, 64-byte lines → 64 sets.
When a memory address is accessed, the hardware maps it to a specific set (using bits from the address). That line can live in any of the 8 ways in that set. If all 8 ways are occupied, one must be evicted (usually LRU — least recently used).
Why does this matter to you?
Cache aliasing. If two hot data structures happen to map to the same cache set, they evict each other constantly — even if the total working set is smaller than the cache. This is a cache conflict miss — not because the cache is full, but because the geometry creates a collision.
This can happen in code where two arrays, both accessed in a loop, are allocated exactly a power-of-two distance apart in memory. Same set, different line, constant eviction.
The fix: add padding between the arrays. Or use an allocator that’s aware of cache geometry.
Most programmers never think about this. The ones who write code like Stuxnet — where every byte of size and every cycle of execution is deliberate — do.
Inclusion and Exclusion Policies
Inclusive cache — everything in L1 is also in L2, everything in L2 is also in L3. Simple coherency — to evict a line from L3, you also evict from L1/L2. Intel historically used this for L3.
Exclusive cache — each level holds different data. L1 and L2 don’t overlap. More total cache capacity for the same silicon. AMD uses this in some designs.
Non-inclusive non-exclusive (NINE) — no strict rule either way. Modern Intel (since Skylake) uses this for L3. More complex but optimal use of silicon.
Why you care: in exploit development, when you spray memory or need to predict cache state, the inclusion policy affects which levels you need to flush or prime. Cache side-channel attacks like Flush+Reload and Prime+Probe depend intimately on whether caches are inclusive.
The Hardware Prefetcher
The CPU doesn’t wait for you to miss.
It watches your memory access patterns. If it sees you accessing addresses 0x1000, 0x1040, 0x1080, 0x10C0 — sequential, stride of 64 bytes — the hardware prefetcher will start fetching 0x1100, 0x1140 before you ask for them.
When you get there — L1 hit. The prefetcher covered the latency.
The prefetcher is sophisticated. It handles regular strides. Some implementations handle irregular but repeating patterns. But it can be confused, saturated, or defeated — and when it is, every access is a cold miss.
Writing a large memcpy: prefetcher is maximally helpful — perfect sequential pattern.
Chasing a linked list through heap: prefetcher is useless — each next pointer points somewhere unpredictable. Every node is a potential L3 miss.
This is the hardware argument for arrays over linked lists in performance-critical code.
Software Prefetching
When you know your access pattern but the hardware prefetcher can’t figure it out, you can do it manually:
__builtin_prefetch(addr, rw, locality);
// rw: 0=read, 1=write
// locality: 0=no cache, 1=L3, 2=L2, 3=L1
This emits a prefetcht0/1/2 or prefetchnta instruction. It’s a hint — the CPU can ignore it. But when it works, it hides latency by starting the fetch before you need the data.
Used in database engines, compression algorithms, network packet processing — anywhere you can look ahead in your data and issue prefetches for what’s coming.
The Write Story — Write-Through vs Write-Back
When your program writes to a memory location, what happens to the cache?
Write-through: the write goes to the cache AND immediately propagates to the next level down (all the way to RAM). Simple. Consistent. Slow — every write pays full RAM latency.
Write-back: the write only updates the cache line. The line is marked dirty. RAM is not updated yet. The dirty line is written back to RAM only when it gets evicted from cache.
Modern CPUs use write-back for performance. RAM often contains stale data. The cache is the truth.
This creates complexity: when core 0 has a dirty cache line and core 1 tries to read that address, core 1 must get core 0’s version — not the stale version in RAM. This is the cache coherency problem (0.6).
Non-Temporal Stores — Bypassing the Cache
Sometimes you know you’re writing data you’ll never read again — like writing a large output buffer. Polluting the cache with it evicts useful data.
x86 has non-temporal store instructions (movnti, movntps) that write directly to RAM, bypassing the cache entirely.
#include <immintrin.h>
_mm_stream_si64((__int64*)dst, value); // non-temporal 64-bit store
Used in memset/memcpy implementations for large buffers. Used in video processing. Used whenever you have a large write-once stream.
The tradeoff: faster for large streaming writes. Catastrophic if you then immediately read that data (cache miss, had to come from RAM).
Cache Flushing — CLFLUSH
You can explicitly evict a specific cache line:
__builtin_ia32_clflush(addr);
Or the fence variant: clflushopt, clwb.
This is normally useless to application code. It’s critical in:
- Persistent memory (Intel Optane) — flushing ensures data reaches non-volatile storage
- Cache side-channel attacks — Flush+Reload attacks use CLFLUSH to evict a line, then time how long a victim process takes to access it — if fast, they accessed it (cache hit); if slow, they didn’t
Stuxnet didn’t use cache attacks. But second-generation nation-state implants absolutely have. Knowing that a single assembly instruction can evict a cache line and turn the cache into a timing oracle — that’s the level of hardware knowledge that enables those techniques.
The Abstract Machine vs The Real Machine
The C standard describes an abstract machine. In the abstract machine, memory is flat, reads and writes happen in program order, there’s no cache.
The real machine has caches, write buffers, store-to-load forwarding, speculative execution, out-of-order completion.
The C standard says your program is equivalent to the abstract machine. The CPU is allowed to do anything it wants as long as the observable behavior matches. The cache is invisible to your program — you can’t tell the difference between hitting L1 or going to RAM, functionally.
Except:
- Timing — you absolutely can tell the difference via timing. This is the foundation of every cache side-channel attack in existence.
- Multithreaded code — the C memory model (C11+) defines visibility rules for concurrent access. The cache and store buffers are part of why the rules are what they are.
- volatile — forces the compiler to actually perform the access, but does not force a cache flush. A volatile read hits the cache. Only the CPU’s cache coherency protocol guarantees RAM consistency.
The gap between the abstract machine and the real machine is where all the interesting things live.
Summary — What You Own After 0.4
Hierarchy = registers → L1 → L2 → L3 → RAM → SSD → HDD
Driving force = speed/size/cost tradeoffs at each level
Locality = temporal (reuse soon) + spatial (use neighbors)
= the assumption the entire hierarchy is built on
L1 = 32–64KB, ~1ns, per core, split I/D
L2 = 256KB–1MB, ~4ns, per core
L3 = 8–64MB, ~40ns, shared across cores
DRAM = GBs, ~80ns, the first level not on the CPU die
The cliff = not a slope — missing a level is a discrete cost
Write-back = cache holds dirty data, RAM may be stale
Prefetcher = hardware watches your pattern, fetches ahead
CLFLUSH = software can evict specific cache lines
Abstract vs real = timing and concurrency expose the real machine
One sentence:
The memory hierarchy is a pyramid of speed-vs-size tradeoffs built entirely on the bet that your next memory access will be near your last one — and every performance pathology, cache attack, and concurrency bug exists in the gap between that bet and reality.