CAI Benchmark is the open standard for measuring how client SSDs and DRAM perform under real local-AI workloads — model loading, KV-cache offload, RAG queries — not synthetic transfer tests.
Why CAI exists
Local AI inference is bound by storage and memory in ways no QD32 sequential sweep can see. An 18 GB model loads once at queue depth 1. A long chat spills its KV cache to disk in bursty, latency-critical reads between GPU "think" cycles. A RAG pipeline mixes vector-database reads with embedding-update writes. CAI runs these patterns — derived from real inference traces and engine profiling — against your hardware.
The output
Every metric is normalized against a published reference system (score 50 = a good Gen4 NVMe SSD with DDR5-4800) and combined by geometric mean — the same mathematics SPEC CPU has used for decades. No marketing units. No cherry-picking.
Twelve storage workloads, from sequential model reads to concurrent multi-model load.
Memory workloads from STREAM Copy/Scale/Add/Triad to DRAM-tier KV cache.
45% storage + 55% memory — weighted for memory-bound LLM inference.
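A minimal sketch of the scoring arithmetic described above, assuming each metric is scaled so the reference system lands on 50 and that each category is summarized by a geometric mean. The function names are hypothetical, and the way the 45/55 weights are applied (a weighted arithmetic blend of the two category means) is an assumption: the text states the weights but not the exact combination formula.

```python
import math

REFERENCE_SCORE = 50.0  # from the text: the reference system scores 50


def normalize(measured, reference, higher_is_better=True):
    """Scale one metric so that matching the reference yields a score of 50.

    Bandwidth-style metrics use measured/reference; latency-style metrics
    (lower is better) use the inverse ratio.
    """
    ratio = measured / reference if higher_is_better else reference / measured
    return REFERENCE_SCORE * ratio


def geometric_mean(scores):
    """Geometric mean via logs, so no single workload dominates the result."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def combined_score(storage_scores, memory_scores, w_storage=0.45, w_memory=0.55):
    """Blend the two category scores with the 45/55 weights from the text.

    ASSUMPTION: a weighted arithmetic blend of the two geometric means.
    The published methodology may combine them differently.
    """
    storage = geometric_mean(storage_scores)
    memory = geometric_mean(memory_scores)
    return w_storage * storage + w_memory * memory
```

Under this sketch, a system that matches the reference on every metric scores 50 overall, and a drive that doubles one result while halving another ends up where it started, which is the property that makes the geometric mean resistant to cherry-picking.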
Illustrative example output
The science
Each workload exists because a real inference engine does it. Each page below documents what the engine does, why generic tools can't see it, exactly what CAI measures, and how it is scored — fully reproducible from the methodology.
Time-to-first-token starts with reading 4–40 GB of weights from disk. CAI measures real llama.cpp load time with cache-contamination detection.
Long-context chats overflow VRAM and spill the KV cache to NVMe. CAI replays a ShareGPT conversation trace against your drive.
QD1 burst reads with think-time cycles — the access pattern of layer-by-layer tensor streaming, invisible to QD32 benchmarks.
Vector-database lookups interleaved with writes — the I/O signature of retrieval-augmented generation.
Training and fine-tuning checkpoint saves: large sequential writes under time pressure.
Unload model A, load model B; hot-swap LoRA adapters. The workload of every multi-model workflow.
The canonical Copy/Scale/Add/Triad kernels — the ceiling on token generation speed for CPU inference.
L1/L2/L3/DRAM latency profiling and random-access latency under AI-realistic working sets.
The same KV-cache workload run in memory — quantifying exactly what NVMe offload costs you.
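To make the QD1 burst pattern concrete, here is a rough sketch of one read at a time, issued in short bursts separated by idle "think" gaps. It is not CAI's implementation: the function name, block size, burst length, and think time are all illustrative assumptions, and this version reads through the OS page cache instead of using Direct I/O.

```python
import os
import random
import time


def qd1_burst_latencies(path, block_size=128 * 1024, reads_per_burst=8,
                        bursts=4, think_time_s=0.05, seed=0):
    """Return per-read latencies (seconds) for a QD1 burst pattern.

    Queue depth 1 means each read completes before the next is issued.
    All default parameters are illustrative, not values taken from CAI.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    block_count = max(1, size // block_size)
    latencies = []
    # buffering=0 disables Python-level buffering only; the OS page cache
    # is still in play, so repeated runs on a small file will look too fast.
    with open(path, "rb", buffering=0) as f:
        for _ in range(bursts):
            for _ in range(reads_per_burst):
                offset = rng.randrange(block_count) * block_size
                start = time.perf_counter()
                f.seek(offset)
                f.read(block_size)
                latencies.append(time.perf_counter() - start)
            # Idle gap standing in for the GPU "think" cycle between bursts.
            time.sleep(think_time_s)
    return latencies
```

The point of the idle gap is that a drive may drop into a lower power state between bursts, so the first read of each burst can be slower than the rest; a sustained high-queue-depth sweep never exposes that behavior.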
See it in action
A full GUI with per-workload selection, live disk-I/O monitoring, and Excel/JSON export — or a scriptable CLI for labs and CI racks. Same engine, same scores.

CAI_Benchmark_GUI.exe — workload selection, model queue, score cards
> CAI_Benchmark_CLI.exe --memory --quick
============================================================
MEMORY BENCHMARKS
============================================================
STREAM Bandwidth Test (array size: 512 MB)
Copy: 27.64 GB/s
Scale: 17.42 GB/s
Add: 19.43 GB/s
Triad: 13.29 GB/s
Memory Latency Test (buffer size: 256 MB)
Random Latency: 147.07 ns
Tensor Allocation Test
Allocation Bandwidth: 5.76 GB/s
============================================================
RESULTS
============================================================
Memory Score: 26.9 (FAIR)
Total time: 9.5s
Unedited quick-mode memory pass from the development machine. The full suite adds twelve storage workloads and the combined AI Readiness Score.
Trust, engineered
A benchmark is only as good as its worst loophole. CAI validates its own measurements: Direct I/O bypasses the OS cache, the page cache is purged between runs, an independent monitor watches OS disk counters during every storage test, and headline results (model load, sequential read) implying throughput beyond the physical ceiling of a PCIe 5.0 ×4 drive (~15.75 GB/s) are rejected as cache contamination rather than reported as records.
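The physical-ceiling check can be sketched as follows. The ~15.75 GB/s figure follows from PCIe 5.0 arithmetic (32 GT/s per lane, 4 lanes, 128b/130b encoding); the function names and return format are hypothetical, and CAI's actual validation logic may differ.

```python
# PCIe 5.0 x4: 32 GT/s per lane * 4 lanes * (128/130 encoding) / 8 bits per byte
PCIE5_X4_CEILING_GBPS = 32 * 4 * (128 / 130) / 8  # about 15.75 GB/s


def throughput_gbps(bytes_read, seconds):
    """Throughput in decimal gigabytes per second (1 GB = 1e9 bytes)."""
    return bytes_read / seconds / 1e9


def classify_result(bytes_read, seconds, ceiling_gbps=PCIE5_X4_CEILING_GBPS):
    """Reject results that imply more bandwidth than the link can carry.

    A reading above the ceiling cannot have come from the drive alone,
    so it is treated as served (at least partly) from cache.
    """
    gbps = throughput_gbps(bytes_read, seconds)
    if gbps > ceiling_gbps:
        return "rejected: cache contamination suspected", gbps
    return "accepted", gbps
```

For instance, an 18 GB model that appears to load in one second implies 18 GB/s and would be rejected, while the same load taking three seconds implies 6 GB/s and passes.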
Download, run as administrator, and get your scores: --quick mode for a fast pass, the full suite for publication-grade numbers.
No account, no telemetry, no install required.