The Science · Storage Workload

KV-Cache Offload: when your SSD becomes an extension of VRAM

The real AI behavior this models

Transformer inference keeps a key-value cache: for every token in the context, every layer stores the attention keys and values it computed, so they never need recomputing. The cache grows linearly with context length. For a 7B-class model with standard multi-head attention at FP16, that is roughly 0.5 MiB per token, or about 2 GiB at a 4,096-token context, and for larger models it can exceed the VRAM left over after the weights.
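The linear growth is easy to check with back-of-envelope arithmetic. The sketch below assumes a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128, FP16 values); the function name and parameters are illustrative, and models that use grouped-query attention or a quantized cache will come out smaller.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the KV cache for one sequence, in bytes."""
    # Factor 2: each layer stores one key vector and one value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * per_token

GIB = 1024 ** 3

# Assumed model shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes).
for ctx in (1024, 4096, 32768):
    size = kv_cache_bytes(ctx, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"{ctx:>6} tokens -> {size / GIB:.2f} GiB")
```

Under these assumptions each token costs 512 KiB, so a 32K-token context alone needs 16 GiB, more than many consumer GPUs have left after loading the weights.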

Inference engines respond by offloading KV-cache pages to system RAM, and beyond that, to NVMe. From that moment on, every new token's attention pass touches storage: the engine reads back cache pages for context it attends to (decode reads), and writes new pages as the conversation grows (prefill writes). Your SSD is now on the latency path of every word the model says.

Why generic benchmarks can't see it

The pattern is bursty, not sustained. Reads cluster between GPU compute phases; the drive idles during "think time," then must deliver instantly. Sustained QD32 sweeps measure a state this workload never enters.
It is latency-critical at low queue depth. A single inference stream keeps only one or a few requests outstanding. A drive tuned to look good at high queue depth can be mediocre exactly where KV offload lives.
It mixes reads and writes in conversation-shaped proportions — determined by how humans actually chat, not by a benchmark knob.

Exactly what CAI does

Replays an I/O trace derived from a real ShareGPT conversation log: turn lengths and context growth follow an actual multi-turn chat rather than a synthetic distribution. (If the trace dataset isn't present, CAI substitutes a statistical model of the same conversation shape and says so in the log.)
Issues decode-phase and prefill-phase reads with Direct I/O (OS cache bypassed, page-aligned buffers) at QD1, the queue depth an inference engine actually produces. Writes are not issued: the workload reads from the model file, which is treated as read-only and never written to, so write costs are modeled from the measured reads with a documented 0.70 write-penalty factor.
Counts only bytes actually transferred on reads — short reads are never credited as full ones.
Clears the OS page cache before the workload and records the suite-level cache-clear status in the exported results.
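To make the QD1 and byte-accounting points concrete, here is a minimal sketch of a one-request-at-a-time read loop, assuming a Linux/POSIX host. It is not CAI's implementation: the function name `qd1_reads`, its parameters, and the 4 KiB default block size are hypothetical. `os.O_DIRECT` is Linux-specific; on Windows the analogous mechanism is the `FILE_FLAG_NO_BUFFERING` flag.

```python
import mmap
import os
import time

def qd1_reads(path, offsets, block_size=4096, direct=False):
    """Issue one read at a time (QD1); return (bytes_transferred, latencies_ms)."""
    flags = os.O_RDONLY
    if direct:
        # Bypass the OS page cache (Linux only). Offsets and block_size must
        # then be multiples of the device's logical block size.
        flags |= os.O_DIRECT
    # Anonymous mmap memory starts on a page boundary, as Direct I/O requires.
    buf = mmap.mmap(-1, block_size)
    fd = os.open(path, flags)
    total, latencies = 0, []
    try:
        for off in offsets:
            t0 = time.perf_counter()
            # preadv blocks until the read completes, so only one request
            # is ever in flight: queue depth 1.
            n = os.preadv(fd, [buf], off)
            latencies.append((time.perf_counter() - t0) * 1e3)
            # Credit only the bytes actually returned, not block_size.
            total += n
    finally:
        os.close(fd)
        buf.close()
    return total, latencies
```

A read that runs past the end of the file returns fewer bytes than requested; because the loop adds `n` rather than `block_size`, such short reads cannot inflate the throughput figure.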

What is measured and how it is scored

Metric	Why it matters
Decode read throughput (GB/s)	Directly gates tokens/second once the cache lives on disk
Decode read latency P50 / P99 (ms)	P99 spikes appear to the user as mid-sentence stalls
Prefill write throughput (GB/s)	Gates prompt ingestion; modeled from measured reads (0.70 write-penalty factor)
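
Why report both P50 and P99? A small sketch with made-up latencies shows how a median can look perfect while the tail exposes stalls. The nearest-rank percentile definition used here is an assumption for illustration; the source does not say which percentile method CAI uses.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in ms: 98 fast reads and 2 stalls.
latencies_ms = [0.08] * 98 + [4.0, 6.5]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"P50 = {p50} ms, P99 = {p99} ms")
```

In this invented sample the median stays at 0.08 ms while P99 is 50 times higher, which is the kind of gap a user would feel as a pause mid-sentence.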

Each metric is normalized against the CAI v1.0 reference values and folded into the Storage Score by geometric mean — see the methodology for the exact math.
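
As a rough illustration of normalize-then-geometric-mean scoring, consider the sketch below. The exact math lives in the methodology; everything here is an assumption: the metric names, the reference values (which are not the CAI v1.0 references), and the choice to invert latency ratios so that lower latency scores higher.

```python
import math

def storage_score(measured, reference, lower_is_better=()):
    """Geometric mean of per-metric ratios against reference values (1.0 = reference)."""
    ratios = []
    for name, value in measured.items():
        ratio = value / reference[name]
        if name in lower_is_better:
            ratio = 1 / ratio  # invert so a higher ratio always means better
        ratios.append(ratio)
    # Averaging in log space is equivalent to taking the n-th root of the product.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical numbers for illustration only.
reference = {"decode_gbps": 2.0, "p99_ms": 1.0, "prefill_gbps": 1.4}
measured = {"decode_gbps": 4.0, "p99_ms": 2.0, "prefill_gbps": 1.4}
score = storage_score(measured, reference, lower_is_better={"p99_ms"})
print(f"score = {score:.2f}")
```

A geometric mean rewards balance: in this made-up case, doubling decode throughput is fully cancelled by doubling P99 latency, so the drive scores the same as the reference.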

Run it yourself

CAI_Benchmark_CLI --storage --model path\to\model.gguf --json results.json

The KV-cache offload result appears in the storage section of the report, with full metric detail and the suite-level cache-clear status preserved in the JSON.

The pairing trick. CAI also runs the same KV-cache workload entirely in DRAM (the memory suite's "DRAM-tier KV cache" test). The gap between the two numbers is the measured cost of spilling context to your SSD, and the clearest argument that both subsystems shape the local-AI experience.