Methodology · CAI v1.0
How CAI scores are computed — and why you can trust them
A benchmark earns trust by showing its work. This page documents the complete scoring pipeline: the workloads, the normalization, the aggregation, and — most importantly — the defenses against the failure mode that quietly invalidates most storage benchmarks: accidentally measuring RAM instead of the SSD.
1. SPEC-style scoring
Every workload produces raw metrics (throughput, latency, completion time). Each metric is converted to a ratio against a published reference value:
ratio = measured / reference (higher-is-better, e.g. throughput)
ratio = reference / measured (lower-is-better, e.g. latency, completion time)
Ratios are combined by geometric mean, the same aggregation SPEC CPU has used for decades. It averages relative differences rather than absolute ones, so no single metric can dominate and "20% better at everything" always means a 20% better score:
score = 50 × (ratio_1 × ratio_2 × … × ratio_n)^(1/n)
Two guards keep the aggregate honest: each ratio is capped to the range [0.4, 2.5] before the geometric mean — so no single outlier metric, good or bad, can dominate the score — and the final score is clamped to 0–100. The factor 50 anchors the scale: a score of 50 means "exactly matches the reference system."
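The scoring steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the CAI implementation: the function names and the sample metric values are hypothetical, and the score is taken to be 50 times the geometric mean of the capped ratios, as the text describes.

```python
import math

RATIO_MIN, RATIO_MAX = 0.4, 2.5  # per-ratio cap stated in the text

def ratio(measured, reference, higher_is_better):
    """Convert a raw metric to a ratio where values above 1.0 beat the reference."""
    return measured / reference if higher_is_better else reference / measured

def cai_style_score(ratios):
    """50 x geometric mean of the capped ratios, clamped to the 0-100 range."""
    capped = [min(max(r, RATIO_MIN), RATIO_MAX) for r in ratios]
    geomean = math.exp(sum(math.log(r) for r in capped) / len(capped))
    return min(max(50.0 * geomean, 0.0), 100.0)

# Hypothetical metrics: (measured, reference, higher_is_better)
metrics = [
    (7000.0, 7000.0, True),   # throughput in MB/s, equal to the reference
    (80.0, 80.0, False),      # latency in microseconds, equal to the reference
]
score = cai_style_score([ratio(m, ref, hib) for m, ref, hib in metrics])
```

With every metric equal to its reference, each ratio is 1.0 and the score lands on 50. A single extreme ratio such as 100.0 is capped to 2.5 before it enters the mean, which is what keeps one outlier from dominating.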
2. The reference system
CAI v1.0's reference values define a deliberately attainable baseline — a good Gen4 NVMe SSD paired with DDR5-4800 — so scores read naturally:
| Score | Interpretation |
|---|---|
| 70+ | Excellent — Gen4/Gen5 NVMe class, ideal for local AI |
| 50–70 | Good — solid Gen4 performance, handles most AI workloads |
| 30–50 | Moderate — Gen3 level; may bottleneck large models |
| 20–30 | Fair — below average; consider upgrading for AI work |
| <20 | Poor — SATA-class or system bottleneck |
All reference values are published in the repository and frozen per methodology version.
Scores are always labeled with their version (CAI v1.0); reference changes bump
the version so published numbers never silently change meaning.
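The interpretation bands in the table above amount to a threshold lookup. The sketch below assumes each band's lower bound is inclusive (so a score of exactly 50 reads as "Good"), which the table does not state explicitly; the function names are hypothetical.

```python
# (lower bound, label) pairs taken from the interpretation table, highest first
BANDS = [(70, "Excellent"), (50, "Good"), (30, "Moderate"), (20, "Fair")]

def interpret(score):
    """Map a 0-100 score to its interpretation band."""
    for lower, label in BANDS:
        if score >= lower:   # assumption: lower bounds are inclusive
            return label
    return "Poor"

def labeled(score, version="CAI v1.0"):
    """Render a score together with the methodology version it was computed under."""
    return f"{score:.1f} ({version}, {interpret(score)})"
```

Carrying the version string with every rendered score mirrors the rule that published numbers are always labeled with their methodology version.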
3. The AI Readiness Score
The weighting reflects how local LLM inference actually behaves: once a model is resident, token generation is overwhelmingly memory-bandwidth-bound, while storage dominates load, swap, and cache-offload events. The split is documented, fixed per version, and deliberately not configurable — comparability beats flexibility.
4. Defenses against cache contamination
The single most common error in storage benchmarking is reading a file that the operating system already holds in RAM — and reporting memory bandwidth as disk speed. CAI applies four independent defenses:
Direct I/O
Storage benchmarks open files with OS cache bypass
(FILE_FLAG_NO_BUFFERING on Windows, O_DIRECT on Linux) using
page-aligned buffers, so reads must come from the device.
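As an illustration of the Linux half of this defense, the sketch below reads a file through O_DIRECT into a page-aligned buffer. It is a minimal Python example, not CAI's code: the helper names are hypothetical, it runs only where `os.O_DIRECT` exists, and some filesystems (certain tmpfs configurations, for example) reject O_DIRECT outright.

```python
import mmap
import os

def round_up(n, align):
    """Round n up to the next multiple of align."""
    return (n + align - 1) // align * align

def read_direct(path, block_size=1 << 20):
    """Read a file end to end with O_DIRECT and return the byte count (Linux only)."""
    size = round_up(block_size, mmap.PAGESIZE)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # An anonymous mmap starts on a page boundary, which satisfies
        # O_DIRECT's buffer alignment requirement.
        with mmap.mmap(-1, size) as buf:
            total = 0
            while True:
                n = os.readv(fd, [buf])   # read straight into the aligned buffer
                total += n
                if n < size:              # a short read means end of file
                    break
            return total
    finally:
        os.close(fd)
```

Timing a call such as `read_direct("model.gguf")` (a placeholder path) measures reads that bypass the page cache, provided the filesystem honors the flag.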
Cache clearing between runs
The page cache and standby list are purged before each run (requires administrator/root). The mechanism and its success/failure are recorded in the results file — a run never silently pretends the cache was cold.
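On Linux, one common way to purge the page cache is to write `3` to `/proc/sys/vm/drop_caches` as root. The sketch below shows that mechanism together with the "record success or failure" behavior described above. Whether CAI uses exactly this mechanism is an assumption, the record's field names are hypothetical, and the Windows standby-list purge is not shown.

```python
import os

def drop_page_cache(control_path="/proc/sys/vm/drop_caches"):
    """Try to purge the Linux page cache; return a record of what happened."""
    record = {"mechanism": "linux drop_caches=3", "success": False, "error": None}
    try:
        os.sync()                         # flush dirty pages so they become droppable
        with open(control_path, "w") as f:
            f.write("3\n")                # 3 = page cache plus dentries and inodes
        record["success"] = True
    except OSError as exc:                # typically PermissionError when not root
        record["error"] = str(exc)
    return record
```

Returning the record instead of raising lets the caller store it in the results file, so a run that could not clear the cache is marked as such.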
Physical-limit validation
A single NVMe drive on PCIe 5.0 ×4 cannot exceed ~15.75 GB/s — that is the bus, not an opinion. Model-load runs whose implied throughput exceeds the 16 GB/s ceiling are flagged cached and excluded from that test's aggregation; sequential-read results above the ceiling carry an explicit cache-contamination warning in the report. Results between what real Gen5 drives achieve (~15 GB/s) and the bus ceiling are flagged as suspect. CAI would rather qualify a number than publish a flattering lie.
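The ~15.75 GB/s figure follows from the PCIe 5.0 link parameters: 32 GT/s per lane, 4 lanes, 128b/130b encoding. The sketch below reproduces that arithmetic and the three-way classification described above. The function names and the "plausible" label are hypothetical, and it assumes the 16 GB/s ceiling and the ~15 GB/s real-drive figure are used directly as thresholds.

```python
PCIE5_GT_PER_S = 32.0     # transfers per second per lane, in billions
LANES = 4
ENCODING = 128 / 130      # 128b/130b line encoding efficiency

def pcie5_x4_limit_gb_s():
    """Raw PCIe 5.0 x4 bandwidth in GB/s, before protocol overhead."""
    return PCIE5_GT_PER_S * LANES * ENCODING / 8

CEILING_GB_S = 16.0       # flagging ceiling quoted in the text
REAL_GEN5_GB_S = 15.0     # approximate best case for shipping Gen5 drives

def implied_throughput_gb_s(file_bytes, seconds):
    """Throughput implied by loading file_bytes in the given wall time."""
    return file_bytes / seconds / 1e9

def classify_throughput(gb_per_s):
    """Label a throughput as cached, suspect, or plausible."""
    if gb_per_s > CEILING_GB_S:
        return "cached"
    if gb_per_s > REAL_GEN5_GB_S:
        return "suspect"
    return "plausible"
```

For example, an 18.5 GB model that "loads" in half a second implies 37 GB/s, which no single PCIe 5.0 ×4 drive can deliver, so `classify_throughput` returns "cached".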
Independent I/O monitoring
While storage workloads run, a separate monitor samples OS-level disk counters. If the device reports almost no physical reads while the benchmark "reads" gigabytes — or per-operation latency is below what silicon physics allows — the run is classified as cache-served, a warning is emitted, and the classification is recorded alongside the metrics in the exported results.
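The read-volume half of this check can be illustrated with Linux's `/proc/diskstats`, whose sixth field counts 512-byte sectors read per device. The sketch below compares the bytes a benchmark requested with the bytes the device reports. The function names and the 50% threshold are assumptions made for illustration; the latency check and CAI's actual thresholds are not shown.

```python
SECTOR_BYTES = 512  # /proc/diskstats always counts 512-byte sectors

def sectors_read(diskstats_text, device):
    """Extract the sectors-read counter for one device from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) > 5 and fields[2] == device:
            return int(fields[5])
    raise KeyError(device)

def classify_run(bytes_requested, sectors_before, sectors_after, min_fraction=0.5):
    """Compare what the benchmark read with what the device physically served."""
    physical = (sectors_after - sectors_before) * SECTOR_BYTES
    if physical < min_fraction * bytes_requested:   # assumed threshold
        return "cache-served"
    return "device-served"
```

In use, the monitor would snapshot the counter before and after the workload, for instance with `sectors_read(open("/proc/diskstats").read(), "nvme0n1")`, where the device name is a placeholder.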
5. Run protocol and variance
- Tests run 3 times by default (configurable); reported metrics are aggregated across runs and the per-run spread is preserved in the results file.
- In the model-load test, runs flagged as cached are excluded from aggregation; if all runs are flagged, the result is reported with an explicit warning, never silently. Other workloads record their cache classification alongside the metrics.
- Real GGUF model files (up to 18.5 GB) are used as test payloads, so file sizes, access granularity, and filesystem behavior match real deployments.
- Results export to JSON (machine-readable, for comparison databases) and Excel (human-readable, with per-metric breakdowns).
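The model-load aggregation rule in the list above can be sketched as follows. The text does not say which statistic CAI reports, so the median is an assumption here, as are the function name and every field name in the result.

```python
import json
import statistics

def aggregate_model_load(runs):
    """Aggregate runs given as dicts with 'seconds' and 'cached' keys."""
    clean = [r["seconds"] for r in runs if not r["cached"]]
    warning = None
    if not clean:
        # Every run was flagged: report anyway, but never silently.
        clean = [r["seconds"] for r in runs]
        warning = "all runs flagged as cached"
    return {
        "seconds_median": statistics.median(clean),   # assumed statistic
        "seconds_min": min(clean),
        "seconds_max": max(clean),
        "runs_used": len(clean),
        "runs_total": len(runs),
        "warning": warning,
        "per_run": runs,                              # per-run spread is preserved
    }

runs = [
    {"seconds": 4.0, "cached": False},
    {"seconds": 0.5, "cached": True},    # implausibly fast, flagged upstream
    {"seconds": 4.4, "cached": False},
]
result = aggregate_model_load(runs)
report = json.dumps(result, indent=2)    # machine-readable export
```

Keeping the flagged run in `per_run` while excluding it from the summary statistics preserves the evidence without letting it distort the reported number.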
6. Reproducing any number
CAI_Benchmark_CLI --all --runs 3 --json results.json   # full suite
CAI_Benchmark_CLI --storage                            # storage only
CAI_Benchmark_CLI --memory                             # memory only
Every score in any CAI report can be regenerated with one command on comparable hardware. That property — not any institution's name — is the methodology's authority.
The methodology version (CAI v1.0) is stamped into every exported results file and is independent of
the application version (v2.x). CAI v1.0 numbers will mean the same thing in five years as
they do today.