The Science · Storage Workload

Model Load: the part of time-to-first-token your GPU can't fix

The real AI behavior this models

Before a local model can answer anything, its weights must travel from your SSD into memory. A 7B model in Q4 quantization is ~4.4 GB; a 32B model is ~20 GB; a 70B model is ~42 GB. On a SATA drive, whose bus tops out near 0.6 GB/s, the 20 GB read alone takes over half a minute at the bus ceiling; on a fast Gen4 NVMe drive (roughly 7 GB/s) it takes about three seconds. No GPU upgrade changes this number: it is set by your storage, and the user feels every second of it as "the AI is loading."
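The arithmetic behind those figures is just file size divided by sustained read bandwidth. The sketch below works it through in Python; the bandwidth values are round-number assumptions for illustration (about 0.6 GB/s for SATA, about 7 GB/s for a fast Gen4 NVMe drive), not CAI measurements, and the result is a best-case floor that real loads will not beat.

```python
# Best-case model load time: file size divided by sustained read bandwidth.
# Bandwidth figures are illustrative assumptions, not measured values.
MODEL_SIZES_GB = {"7B Q4": 4.4, "32B Q4": 20.0, "70B Q4": 42.0}
DRIVE_GB_PER_S = {"SATA (~0.6 GB/s)": 0.6, "Gen4 NVMe (~7 GB/s)": 7.0}


def min_load_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Return the fewest seconds needed to read size_gb at the given bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    for model, size_gb in MODEL_SIZES_GB.items():
        for drive, bandwidth in DRIVE_GB_PER_S.items():
            seconds = min_load_seconds(size_gb, bandwidth)
            print(f"{model:7s} on {drive:20s}: {seconds:5.1f} s")
```

Because this is a floor, any gap between it and a measured load time is overhead from parsing, mapping, and allocation rather than from the drive's raw speed.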

This also isn't a once-per-day cost. Multi-model workflows — a coding model here, a chat model there, an embedding model for search — reload weights constantly, and every model swap pays the storage toll again.

Why generic benchmarks can't see it

Sequential MB/s overstates reality. Loading a GGUF file is not one perfect sequential stream: the engine parses headers, maps tensors, and allocates as it reads. CAI measures the end-to-end load a real engine performs, not an idealized copy.
The OS page cache lies. Load the same model twice and the second load comes from RAM at tens of GB/s. Naive benchmarks (and casual reviews) routinely publish that second number. Defending against it is the headline feature of this test.

Exactly what CAI does

Invokes a real llama.cpp binary against a real GGUF model file and parses the engine's own reported model-file load time.
Purges the OS page cache and standby list before each run (administrator required), and records whether the purge succeeded.
Validates every parsed time against physics: if the implied throughput exceeds what a single NVMe drive can deliver over PCIe 5.0 ×4 (~15.75 GB/s), the parsed value is rejected — it was measuring engine initialization, not file I/O — and the wall-clock load time is used instead.
If even the wall-clock time implies impossible throughput, the run is flagged likely_cached and excluded from this test's aggregation. If every run is contaminated, CAI reports the result with an explicit inflation warning rather than pretending. A flattering number that violates the bus speed is not a result; it's a bug, and CAI treats it as one.
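The validation steps above can be sketched as a small decision function. This is an illustration of the described logic under stated assumptions, not CAI's actual implementation: the names `RunVerdict`, `implied_gb_per_s`, and `validate_run` are hypothetical, and throughput is computed in decimal gigabytes against the ~15.75 GB/s ceiling quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Single-drive ceiling quoted in the text for PCIe 5.0 x4.
PCIE5_X4_CEILING_GB_PER_S = 15.75


@dataclass
class RunVerdict:
    load_ms: Optional[float]  # accepted load time, or None if the run is excluded
    source: str               # "parsed", "wall_clock", or "none"
    likely_cached: bool


def implied_gb_per_s(file_bytes: int, elapsed_ms: float) -> float:
    """Throughput the drive would need to deliver file_bytes in elapsed_ms."""
    return (file_bytes / 1e9) / (elapsed_ms / 1000.0)


def plausible(file_bytes: int, elapsed_ms: Optional[float], ceiling: float) -> bool:
    """True if the time is positive and does not imply beating the bus."""
    if elapsed_ms is None or elapsed_ms <= 0:
        return False
    return implied_gb_per_s(file_bytes, elapsed_ms) <= ceiling


def validate_run(file_bytes: int, parsed_ms: Optional[float], wall_ms: float,
                 ceiling: float = PCIE5_X4_CEILING_GB_PER_S) -> RunVerdict:
    # Prefer the engine's own reported load time when it is physically possible.
    if plausible(file_bytes, parsed_ms, ceiling):
        return RunVerdict(parsed_ms, "parsed", False)
    # Otherwise fall back to the wall-clock load time.
    if plausible(file_bytes, wall_ms, ceiling):
        return RunVerdict(wall_ms, "wall_clock", False)
    # Both imply impossible throughput: flag the run and exclude it.
    return RunVerdict(None, "none", True)
```

For a 20 GB file, a parsed time of 200 ms implies 100 GB/s and is rejected; a 3,500 ms wall-clock time implies about 5.7 GB/s and is accepted in its place.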

What is measured and how it is scored

Metric	Why it matters
Model load time (ms)	The storage component of time-to-first-token
Effective load throughput (MB/s)	Comparable across model sizes
Cache-validity status	Whether each run measured the drive or the page cache

Results are normalized against CAI v1.0 reference values and folded into the Storage Score by geometric mean; see the methodology.
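As a rough illustration of normalization followed by a geometric mean, consider the sketch below. The reference values, the sub-scores being combined, and the function names are all hypothetical; the methodology document is the authority on CAI's actual formula.

```python
import math


def normalized(value: float, reference: float, higher_is_better: bool) -> float:
    """Ratio against a reference, oriented so that 1.0 means 'equal to reference'
    and larger always means better."""
    if value <= 0 or reference <= 0:
        raise ValueError("value and reference must be positive")
    return value / reference if higher_is_better else reference / value


def geometric_mean(scores):
    """Geometric mean of positive scores, computed in log space."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


if __name__ == "__main__":
    # Hypothetical numbers: measured vs. reference load time and throughput.
    load_score = normalized(2_000.0, 4_000.0, higher_is_better=False)    # ms
    speed_score = normalized(5_000.0, 2_500.0, higher_is_better=True)    # MB/s
    print(f"combined: {geometric_mean([load_score, speed_score]):.2f}")
```

A geometric mean is a common choice for combining ratios because a 2x gain on one component and a 2x loss on another cancel to exactly 1.0, which an arithmetic mean would not do.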

Run it yourself

CAI_Benchmark_CLI --ttft --llama path\to\llama-cli.exe --model path\to\model.gguf

Why you'll trust the smaller number. CAI will sometimes report a load throughput far below what your drive's spec sheet promises — because it is reporting what an inference engine actually achieves, parsing and allocating as it streams. That gap between spec-sheet MB/s and achieved load speed is precisely the information a buyer needs and no other benchmark publishes.