The Science · Storage Workload
Model Load: the part of time-to-first-token your GPU can't fix
The real AI behavior this models
Before a local model can answer anything, its weights must travel from your SSD into memory. A 7B model in Q4 quantization is ~4.4 GB; a 32B model is ~20 GB; a 70B model is ~42 GB. On a SATA drive, reading that 20 GB file takes over half a minute even at the bus ceiling (about 33 s at SATA III's 600 MB/s); on a fast Gen4 NVMe it takes a few seconds. No GPU upgrade changes this number: it is set by your storage, and the user feels every second of it as "the AI is loading."
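The arithmetic behind those figures is just file size divided by sustained read throughput. Here is a minimal sketch of that estimate. The model sizes come from the paragraph above; the drive throughputs are illustrative assumptions (the SATA III line rate and a round figure for a fast Gen4 drive), not CAI measurements, and `load_seconds` is a hypothetical helper name.

```python
# Best-case load time: file size divided by sustained read throughput.
# Throughput figures are illustrative assumptions, not CAI measurements.
MODEL_SIZES_GB = {"7B Q4": 4.4, "32B Q4": 20.0, "70B Q4": 42.0}  # sizes from the text

DRIVE_GB_PER_S = {
    "SATA III (600 MB/s line rate)": 0.6,
    "Gen4 NVMe (assumed ~7 GB/s)": 7.0,
}


def load_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Return the best-case seconds needed to read size_gb (decimal GB)."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


for model, size_gb in MODEL_SIZES_GB.items():
    for drive, gb_per_s in DRIVE_GB_PER_S.items():
        print(f"{model:7s} on {drive}: {load_seconds(size_gb, gb_per_s):5.1f} s")
```

Real loads will be slower than these floor values, since the engine does more than stream bytes.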
This also isn't a once-per-day cost. Multi-model workflows — a coding model here, a chat model there, an embedding model for search — reload weights constantly, and every model swap pays the storage toll again.
Why generic benchmarks can't see it
- Sequential MB/s overstates reality. Loading a GGUF file is not one perfect sequential stream: the engine parses headers, maps tensors, and allocates as it reads. CAI measures the end-to-end load a real engine performs, not an idealized copy.
- The OS page cache lies. Load the same model twice and the second load comes from RAM at tens of GB/s. Naive benchmarks (and casual reviews) routinely publish that second number. Defending against it is the headline feature of CAI's test.
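The page-cache effect is easy to observe on your own machine. The sketch below is not CAI's code: it reads a file in chunks and reports throughput, using the hypothetical helper names `timed_read` and `throughput_mb_s`. It does not purge the cache, so the first read is only "cold" if the file has not been touched since boot.

```python
import time


def timed_read(path, chunk_size=8 * 1024 * 1024):
    """Read the whole file in chunks; return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def throughput_mb_s(n_bytes, seconds):
    """Effective throughput in decimal MB/s; infinite if no time elapsed."""
    return (n_bytes / 1e6) / seconds if seconds > 0 else float("inf")


def compare_two_reads(path):
    """Read the same file twice and return both throughputs in MB/s."""
    first_bytes, first_s = timed_read(path)
    second_bytes, second_s = timed_read(path)
    return throughput_mb_s(first_bytes, first_s), throughput_mb_s(second_bytes, second_s)
```

Calling `compare_two_reads` on a multi-gigabyte file that fits in RAM will typically show the second figure far above anything the drive can deliver, which is the number a naive benchmark ends up publishing.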
Exactly what CAI does
- Invokes a real llama.cpp binary against a real GGUF model file and parses the engine's own reported model-file load time.
- Purges the OS page cache and standby list before each run (administrator required), and records whether the purge succeeded.
- Validates every parsed time against physics: if the implied throughput exceeds what a single NVMe drive can deliver over PCIe 5.0 ×4 (~15.75 GB/s), the parsed value is rejected — it was measuring engine initialization, not file I/O — and the wall-clock load time is used instead.
- If even the wall-clock time implies impossible throughput, the run is flagged `likely_cached` and excluded from this test's aggregation. If every run is contaminated, CAI reports the result with an explicit inflation warning rather than presenting it as a clean measurement. A flattering number that violates the bus speed is not a result; it's a bug, and CAI treats it as one.
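The validation steps above can be sketched as a small decision function. This is an illustration of the logic as described, not CAI's actual source: the names `LoadResult`, `is_plausible`, and `validate_load` are hypothetical, and only the 15.75 GB/s ceiling and the `likely_cached` flag come from the text.

```python
from dataclasses import dataclass
from typing import Optional

PCIE5_X4_CEILING_GB_PER_S = 15.75  # single-drive ceiling quoted in the text


@dataclass
class LoadResult:
    load_ms: Optional[float]  # the time that was kept for this run
    source: str               # "engine" or "wall_clock"
    likely_cached: bool       # True when even wall-clock time is impossible


def implied_gb_per_s(file_bytes: int, ms: float) -> float:
    """Throughput implied by reading file_bytes in ms milliseconds."""
    return (file_bytes / 1e9) / (ms / 1000.0)


def is_plausible(file_bytes: int, ms: Optional[float]) -> bool:
    """A time is plausible if it is positive and stays under the bus ceiling."""
    if ms is None or ms <= 0:
        return False
    return implied_gb_per_s(file_bytes, ms) <= PCIE5_X4_CEILING_GB_PER_S


def validate_load(file_bytes: int, engine_ms: Optional[float], wall_ms: float) -> LoadResult:
    # 1. Prefer the engine's own reported load time when it obeys physics.
    if is_plausible(file_bytes, engine_ms):
        return LoadResult(engine_ms, "engine", False)
    # 2. Otherwise fall back to the wall-clock load time.
    if is_plausible(file_bytes, wall_ms):
        return LoadResult(wall_ms, "wall_clock", False)
    # 3. Both imply impossible throughput: flag the run as likely cached.
    return LoadResult(wall_ms, "wall_clock", True)
```

For example, a 20 GB file with an engine-reported 300 ms load implies about 67 GB/s, so that value would be rejected and the wall-clock time checked next.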
What is measured and how it is scored
| Metric | Why it matters |
|---|---|
| Model load time (ms) | The storage component of time-to-first-token |
| Effective load throughput (MB/s) | Comparable across model sizes |
| Cache-validity status | Whether each run measured the drive or the page cache |
Normalized against CAI v1.0 reference values and folded into the Storage Score by geometric mean — see the methodology.
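As a rough illustration of normalizing against reference values and combining by geometric mean, consider the sketch below. Everything specific in it is an assumption: the reference numbers are placeholders rather than the real CAI v1.0 values, the choice of which metrics are combined is hypothetical, and the orientation (lower load time scores higher) is a guess at a sensible convention. The methodology remains the authoritative description.

```python
import math

# Placeholder references: NOT the real CAI v1.0 values.
REFERENCE = {"load_time_ms": 4000.0, "throughput_mb_s": 5000.0}


def normalize(measured: float, reference: float, lower_is_better: bool) -> float:
    """Return a ratio where 1.0 matches the reference and higher is better."""
    if measured <= 0 or reference <= 0:
        raise ValueError("values must be positive")
    return reference / measured if lower_is_better else measured / reference


def geometric_mean(values) -> float:
    """Geometric mean computed via logs for numerical stability."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def model_load_subscore(load_time_ms: float, throughput_mb_s: float) -> float:
    """Combine the two normalized metrics into one sub-score (assumed scheme)."""
    parts = [
        normalize(load_time_ms, REFERENCE["load_time_ms"], lower_is_better=True),
        normalize(throughput_mb_s, REFERENCE["throughput_mb_s"], lower_is_better=False),
    ]
    return geometric_mean(parts)
```

A geometric mean keeps one outsized ratio from dominating the combined score, which is a common reason benchmark suites prefer it over an arithmetic mean.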
Run it yourself
CAI_Benchmark_CLI --ttft --llama path\to\llama-cli.exe --model path\to\model.gguf