CAI Benchmark

Why CAI exists

Generic disk benchmarks answer the wrong question

Local AI inference is bound by storage and memory in ways no QD32 sequential sweep can see. An 18 GB model loads once at queue depth 1. A long chat spills its KV cache to disk in bursty, latency-critical reads between GPU "think" cycles. A RAG pipeline mixes vector-database reads with embedding-update writes. CAI runs these patterns — derived from real inference traces and engine profiling — against your hardware.

Traditional disk benchmarks

Saturated queues (QD32) that inference never produces
Uniform block sizes; no think-time, no bursts
Happily report RAM-cache speed as "disk speed"
One number with no link to AI user experience

CAI Benchmark

QD1 burst reads with inference think-time cycles
Real GGUF model files as test payloads
Direct I/O + cache clearing + physical-limit validation
Scores anchored in measured time-to-first-token

The output

Three numbers. SPEC-style rigor.

Every metric is normalized against a published reference system (score 50 = a good Gen4 NVMe SSD with DDR5-4800) and combined by geometric mean — the same mathematics SPEC CPU has used for decades. No marketing units. No cherry-picking.
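As a rough illustration of the scoring described above, the sketch below normalizes each metric so the reference system lands on 50, then combines the results by geometric mean. The metric values, the reference values, and the ratio-times-50 normalization are assumptions made for this example; CAI's published reference values are not reproduced here.

```python
import math

REFERENCE_SCORE = 50.0  # per the text: the reference system scores 50


def normalize(measured, reference, higher_is_better=True):
    """Scale one metric so a result equal to the reference scores 50.

    The ratio form is an assumption; latencies are inverted so that
    lower measured values yield higher scores.
    """
    ratio = measured / reference if higher_is_better else reference / measured
    return REFERENCE_SCORE * ratio


def geometric_mean(scores):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical metrics: (measured, reference, higher_is_better)
metrics = [
    (7.0, 7.0, True),       # bandwidth in GB/s, equal to the reference
    (14.0, 7.0, True),      # bandwidth in GB/s, twice the reference
    (200.0, 100.0, False),  # latency in microseconds, twice as slow
]
scores = [normalize(m, r, hib) for m, r, hib in metrics]
composite = geometric_mean(scores)
print(scores, round(composite, 2))
```

With a geometric mean, one metric at twice the reference and another at half the reference cancel out, which is why a single outlier cannot dominate the composite the way it would in an arithmetic average.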

Storage Score

Twelve storage workloads, from sequential model reads to concurrent multi-model load.

Memory Score

Memory workloads from STREAM Copy/Scale/Add/Triad to DRAM-tier KV cache.

AI Readiness Score

45% storage + 55% memory — weighted for memory-bound LLM inference.
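The weighting can be written out in a few lines. The text gives the weights but not the exact form of the combination, so this sketch assumes a plain weighted sum of the two sub-scores; the function name is hypothetical.

```python
STORAGE_WEIGHT = 0.45  # weights as stated in the text
MEMORY_WEIGHT = 0.55


def ai_readiness(storage_score, memory_score):
    """Weighted sum of the two sub-scores (assumed form of the combination)."""
    return STORAGE_WEIGHT * storage_score + MEMORY_WEIGHT * memory_score


# A system matching the reference on both axes stays at 50.
print(ai_readiness(50.0, 50.0))
```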

Illustrative example output

The science

Every test models a measured AI behavior

Each workload exists because a real inference engine does it. Each page below documents what the engine does, why generic tools can't see it, exactly what CAI measures, and how it is scored — fully reproducible from the methodology.

Storage

Model Load (TTFT)

Time-to-first-token starts with reading 4–40 GB of weights from disk. CAI measures real llama.cpp load time with cache-contamination detection.

Storage

KV-Cache Offload

Long-context chats overflow VRAM and spill the KV cache to NVMe. CAI replays a ShareGPT conversation trace against your drive.

Storage

Tensor Burst I/O

QD1 burst reads with think-time cycles — the access pattern of layer-by-layer tensor streaming, invisible to QD32 benchmarks.
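A minimal sketch of this access pattern, not CAI's implementation: synchronous reads issued one at a time (queue depth 1), grouped into bursts, with an idle pause standing in for the compute phase. Block size, burst length, and think time are hypothetical values, and the demo reads a small scratch file through the normal page cache, so it shows the pattern rather than true device latency. A real harness would need Direct I/O and a real model file.

```python
import os
import random
import tempfile
import time


def burst_read_latencies(path, block_size=128 * 1024, reads_per_burst=8,
                         bursts=5, think_time_s=0.01, seed=0):
    """Issue QD1 reads in bursts separated by think time.

    Returns the latency of each read in seconds. Buffered I/O is used,
    so page-cache hits are included in the timings.
    """
    rng = random.Random(seed)
    n_blocks = max(1, os.path.getsize(path) // block_size)
    latencies = []
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        for _ in range(bursts):
            for _ in range(reads_per_burst):
                offset = rng.randrange(n_blocks) * block_size
                start = time.perf_counter()
                os.lseek(fd, offset, os.SEEK_SET)
                os.read(fd, block_size)  # blocks until done: one request in flight
                latencies.append(time.perf_counter() - start)
            time.sleep(think_time_s)  # stand-in for the GPU "think" phase
    finally:
        os.close(fd)
    return latencies


# Demo against a 4 MB scratch file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    scratch = tmp.name
try:
    lat = burst_read_latencies(scratch)
    p99 = sorted(lat)[int(0.99 * (len(lat) - 1))]
    print(f"{len(lat)} reads, p99 latency {p99 * 1e6:.1f} us")
finally:
    os.remove(scratch)
```

Because each read waits for the previous one, throughput here is set by per-request latency rather than by how many requests the drive can overlap, which is the property a saturated-queue test hides.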

Storage

RAG Mixed Read/Write

Vector-database lookups interleaved with writes — the I/O signature of retrieval-augmented generation.

Storage

Checkpoint Write

Training and fine-tuning checkpoint saves: large sequential writes under time pressure.

Storage

Model Swap & LoRA Load

Unload model A, load model B; hot-swap LoRA adapters. The workload of every multi-model workflow.

Memory

STREAM Bandwidth

The canonical Copy/Scale/Add/Triad kernels — the ceiling on token generation speed for CPU inference.
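For a feel of what the Copy kernel measures, here is a standard-library approximation. It is an illustrative stand-in, not the benchmark's kernel: it times a bulk buffer copy and follows the STREAM convention of counting both the bytes read and the bytes written. The 64 MB default is an assumption chosen to keep the demo light, and the arithmetic kernels (Scale, Add, Triad) would need compiled code or an array library to measure meaningfully.

```python
import time


def stream_copy_gbps(n_bytes=64 * 1024 * 1024, repeats=5):
    """Estimate Copy bandwidth in GB/s from the best of several timed copies."""
    # Fill both buffers with non-zero bytes so every page is actually touched.
    src = bytearray(b"\x01") * n_bytes
    dst = bytearray(b"\x02") * n_bytes
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-size slice assignment: a bulk memory copy in CPython
        best = min(best, time.perf_counter() - start)
    # One copy reads n_bytes and writes n_bytes.
    return (2 * n_bytes) / best / 1e9


print(f"Copy: {stream_copy_gbps():.2f} GB/s")
```

The buffer should be well beyond the last-level cache size, otherwise the figure reflects cache bandwidth rather than DRAM bandwidth.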

Memory

Cache Hierarchy & Latency

L1/L2/L3/DRAM latency profiling and random-access latency under AI-realistic working sets.

Memory

DRAM-Tier KV Cache

The same KV-cache workload run in memory — quantifying exactly what NVMe offload costs you.

See it in action

One window. Or one command.

A full GUI with per-workload selection, live disk-I/O monitoring, and Excel/JSON export — or a scriptable CLI for labs and CI racks. Same engine, same scores.

CAI_Benchmark_GUI.exe — workload selection, model queue, score cards

PowerShell — real output

> CAI_Benchmark_CLI.exe --memory --quick

============================================================
MEMORY BENCHMARKS
============================================================

STREAM Bandwidth Test (array size: 512 MB)
  Copy:  27.64 GB/s
  Scale: 17.42 GB/s
  Add:   19.43 GB/s
  Triad: 13.29 GB/s

Memory Latency Test (buffer size: 256 MB)
  Random Latency: 147.07 ns

Tensor Allocation Test
  Allocation Bandwidth: 5.76 GB/s

============================================================
  RESULTS
============================================================

Memory Score:        26.9  (FAIR)

Total time: 9.5s

Unedited quick-mode memory pass from the development machine. The full suite adds twelve storage workloads and the combined AI Readiness Score.
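For CI use, the console output above can be turned into structured values with a small parser. This is a sketch that assumes the "Label: value unit" layout shown in the sample stays stable across versions; the JSON export is probably the more robust route, but its schema is not documented here. The regular expression and function name are hypothetical.

```python
import re

# Excerpt of the console output shown above.
SAMPLE = """\
STREAM Bandwidth Test (array size: 512 MB)
  Copy:  27.64 GB/s
  Scale: 17.42 GB/s
  Add:   19.43 GB/s
  Triad: 13.29 GB/s

Memory Latency Test (buffer size: 256 MB)
  Random Latency: 147.07 ns

Memory Score:        26.9  (FAIR)
"""

# A label at the start of a line, a colon, a number, and an optional unit.
LINE = re.compile(
    r"^[ \t]*([A-Za-z][A-Za-z ]*):[ \t]+([0-9]+(?:\.[0-9]+)?)[ \t]*(GB/s|ns)?",
    re.MULTILINE,
)


def parse_results(text):
    """Map each result label to a (value, unit) pair; unit is None for scores."""
    return {m.group(1): (float(m.group(2)), m.group(3)) for m in LINE.finditer(text)}


results = parse_results(SAMPLE)
for label, (value, unit) in results.items():
    print(label, value, unit or "")
```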

Trust, engineered

Built to be argued with

A benchmark is only as good as its worst loophole, so CAI validates its own measurements. Direct I/O bypasses the OS cache, the page cache is purged between runs, and an independent monitor watches OS disk counters during every storage test. Headline results (model load, sequential read) that imply throughput beyond the physical ceiling of a PCIe 5.0 ×4 drive (~15.75 GB/s) are rejected as cache contamination rather than reported as records.
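The physical-limit check reduces to simple arithmetic: divide bytes read by elapsed time and compare against the bus ceiling. The sketch below assumes the ~15.75 GB/s figure quoted above and uses hypothetical function names; how CAI actually applies the threshold is not specified here.

```python
PCIE5_X4_CEILING_GBPS = 15.75  # approximate ceiling quoted in the text


def implied_throughput_gbps(bytes_read, seconds):
    """Throughput implied by a timed read, in GB/s (decimal gigabytes)."""
    return bytes_read / seconds / 1e9


def validate_read(bytes_read, seconds, ceiling_gbps=PCIE5_X4_CEILING_GBPS):
    """Return (ok, gbps); ok is False when the result exceeds the ceiling."""
    gbps = implied_throughput_gbps(bytes_read, seconds)
    return gbps <= ceiling_gbps, gbps


# An 18 GB model "loaded" in 0.5 s implies 36 GB/s, more than the bus can carry,
# so the data must have come from RAM rather than the drive.
ok, gbps = validate_read(18e9, 0.5)
print(ok, gbps)
```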

Versioned methodology (CAI v1.0)
Published reference values
Physical-limit validation
Run-to-run variance reporting
Free for personal & commercial use
Vendor-neutral

Your GPU is ready for AI.
Are your storage and memory?

Benchmark your machine in minutes