Methodology · CAI v1.0

How CAI scores are computed — and why you can trust them

A benchmark earns trust by showing its work. This page documents the complete scoring pipeline: the workloads, the normalization, the aggregation, and — most importantly — the defenses against the failure mode that quietly invalidates most storage benchmarks: accidentally measuring RAM instead of the SSD.

1. SPEC-style scoring

Every workload produces raw metrics (throughput, latency, completion time). Each metric is converted to a ratio against a published reference value:

ratio = measured / reference (higher-is-better)
ratio = reference / measured (lower-is-better, e.g. latency)

Ratios are combined by geometric mean, the same aggregation SPEC CPU has used for decades. Because it multiplies ratios rather than adding them, a 20% gain on any one metric moves the score by the same factor no matter how large that metric's ratio already is, and "20% better at everything" always means a 20% better score:

score = 50 × ( ∏ ratioᵢ )^(1/n)

Two guards keep the aggregate honest. Each ratio is capped to the range [0.4, 2.5] before the geometric mean, so no single outlier metric, good or bad, can dominate the score. The final score is then clamped to 0–100. The factor 50 anchors the scale: a system that matches the reference on every metric scores exactly 50.
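The pipeline described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CAI's actual source: the function names and the sample metric values are hypothetical, while the caps, the factor 50, and the 0–100 clamp come from the text above.

```python
import math

RATIO_MIN, RATIO_MAX = 0.4, 2.5   # per-ratio caps from the text
SCALE = 50.0                      # a ratio of 1.0 on every metric maps to 50


def metric_ratio(measured, reference, higher_is_better=True):
    """Ratio against the reference; inverted for lower-is-better metrics."""
    return measured / reference if higher_is_better else reference / measured


def cai_score(ratios):
    """Cap each ratio, take the geometric mean, scale by 50, clamp to 0-100."""
    capped = [min(max(r, RATIO_MIN), RATIO_MAX) for r in ratios]
    # Geometric mean computed through logs: exp(mean(log r)).
    geo_mean = math.exp(sum(math.log(r) for r in capped) / len(capped))
    return min(max(SCALE * geo_mean, 0.0), 100.0)


# Hypothetical metrics: (measured, reference, higher_is_better)
metrics = [
    (7000.0, 7000.0, True),   # throughput equal to the reference
    (80.0, 100.0, False),     # latency 20% lower than the reference
]
ratios = [metric_ratio(m, ref, hib) for m, ref, hib in metrics]
print(round(cai_score(ratios), 2))
```

Because every ratio is capped first, the value before clamping in this sketch always falls between 50 × 0.4 = 20 and 50 × 2.5 = 125, so only the upper clamp at 100 can ever bind.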

2. The reference system

CAI v1.0's reference values define a deliberately attainable baseline — a good Gen4 NVMe SSD paired with DDR5-4800 — so scores read naturally:

Score	Interpretation
70+	Excellent — Gen4/Gen5 NVMe class, ideal for local AI
50–70	Good — solid Gen4 performance, handles most AI workloads
30–50	Moderate — Gen3 level; may bottleneck large models
20–30	Fair — below average; consider upgrading for AI work
<20	Poor — SATA-class or system bottleneck

All reference values are published in the repository and frozen per methodology version. Scores are always labeled with their version (CAI v1.0); reference changes bump the version so published numbers never silently change meaning.
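For tooling that consumes scores, the interpretation table above maps naturally onto a small lookup. The sketch below is hypothetical (the name `interpret` is not part of CAI), and it assumes that each band's lower bound is inclusive, which the table does not state explicitly.

```python
# Lower bounds and labels from the interpretation table, highest band first.
BANDS = [
    (70.0, "Excellent"),
    (50.0, "Good"),
    (30.0, "Moderate"),
    (20.0, "Fair"),
]


def interpret(score):
    """Return the band label for a score; anything under 20 is 'Poor'."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:   # assumption: lower bound is inclusive
            return label
    return "Poor"


print(interpret(62.5))
```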

3. The AI Readiness Score

AI Readiness = 0.45 × Storage Score + 0.55 × Memory Score

The weighting reflects how local LLM inference actually behaves: once a model is resident, token generation is overwhelmingly memory-bandwidth-bound, while storage dominates load, swap, and cache-offload events. The split is documented, fixed per version, and deliberately not configurable — comparability beats flexibility.
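As a worked example with made-up sub-scores, a Storage Score of 40 and a Memory Score of 60 give 0.45 × 40 + 0.55 × 60 = 18 + 33 = 51. The same arithmetic in Python, with a hypothetical function name:

```python
STORAGE_WEIGHT = 0.45   # fixed per methodology version
MEMORY_WEIGHT = 0.55


def ai_readiness(storage_score, memory_score):
    """Weighted sum of the two sub-scores, using the CAI v1.0 weights."""
    return STORAGE_WEIGHT * storage_score + MEMORY_WEIGHT * memory_score


# Hypothetical sub-scores for illustration only.
print(round(ai_readiness(40.0, 60.0), 2))
```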

4. Defenses against cache contamination

The single most common error in storage benchmarking is reading a file that the operating system already holds in RAM — and reporting memory bandwidth as disk speed. CAI applies four independent defenses:

Direct I/O

Storage benchmarks open files with OS cache bypass (FILE_FLAG_NO_BUFFERING on Windows, O_DIRECT on Linux) using page-aligned buffers, so reads must come from the device.
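To make the idea concrete, here is a minimal Linux-only sketch of a cache-bypassing read in Python. It is not CAI's implementation: the function name and the 1 MiB chunk size are assumptions. It relies on the fact that anonymous `mmap` memory is page-aligned, which satisfies the buffer-alignment requirement of `O_DIRECT`.

```python
import mmap
import os


def read_direct(path, chunk_size=1 << 20):
    """Read a whole file with O_DIRECT (Linux) and return the bytes read.

    chunk_size should be a multiple of the device's logical block size;
    1 MiB covers the common 512-byte and 4096-byte cases.
    """
    # Anonymous mmap memory is page-aligned, as O_DIRECT requires.
    buf = mmap.mmap(-1, chunk_size)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    total = 0
    try:
        while True:
            n = os.readv(fd, [buf])   # reads straight into the aligned buffer
            total += n
            if n < chunk_size:        # a short read means end of file
                break
    finally:
        os.close(fd)
        buf.close()
    return total
```

Some filesystems (certain tmpfs configurations, for example) reject `O_DIRECT`, in which case `os.open` raises `OSError`; a benchmark would need to record that rather than fall back silently to buffered reads.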

Cache clearing between runs

The page cache and standby list are purged before each run (requires administrator/root). The mechanism and its success/failure are recorded in the results file — a run never silently pretends the cache was cold.
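On Linux, one widely used way to purge the page cache is to call `sync` and then write `3` to `/proc/sys/vm/drop_caches`. Whether CAI uses exactly this mechanism is an assumption; the sketch below only illustrates the "record the mechanism and its outcome" behaviour described above, with a hypothetical function name and record layout.

```python
import os


def drop_page_cache(control_path="/proc/sys/vm/drop_caches"):
    """Ask the Linux kernel to drop clean page cache, dentries and inodes.

    Returns a record of what was attempted and whether it worked, so the
    caller can store it in the results file.
    """
    record = {
        "mechanism": f"sync, then write 3 to {control_path}",
        "success": False,
        "error": None,
    }
    try:
        os.sync()                        # flush dirty pages so they can be dropped
        with open(control_path, "w") as f:
            f.write("3\n")               # 3 = page cache plus dentries and inodes
        record["success"] = True
    except OSError as exc:               # typically PermissionError without root
        record["error"] = str(exc)
    return record
```

Returning a record instead of raising keeps the failure visible in the exported results, so a run without root privileges is labeled as such instead of being presented as cold-cache.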

Physical-limit validation

A single NVMe drive on PCIe 5.0 ×4 cannot exceed ~15.75 GB/s (32 GT/s per lane × 4 lanes × 128/130 encoding ÷ 8 bits per byte); that is the bus, not an opinion. Model-load runs whose implied throughput exceeds the 16 GB/s ceiling are flagged cached and excluded from that test's aggregation; sequential-read results above the ceiling carry an explicit cache-contamination warning in the report. Results between what real Gen5 drives achieve (~15 GB/s) and the bus ceiling are flagged as suspect. CAI would rather qualify a number than publish a flattering lie.
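The bus figure and the resulting classification can be reproduced with a few lines of arithmetic. This is a sketch, not CAI's code: the function name and the label strings are hypothetical, the 16 GB/s and ~15 GB/s thresholds are the ones quoted above, and 1 GB is taken as 10^9 bytes.

```python
GT_PER_S_PER_LANE = 32.0     # PCIe 5.0 signalling rate per lane
LANES = 4
ENCODING = 128 / 130         # 128b/130b line encoding

# 32 GT/s x 4 lanes x 128/130 / 8 bits per byte, roughly 15.75 GB/s
BUS_LIMIT_GBPS = GT_PER_S_PER_LANE * LANES * ENCODING / 8

CACHED_CEILING_GBPS = 16.0   # ceiling quoted in the text
SUSPECT_FLOOR_GBPS = 15.0    # approximate best real Gen5 drives, per the text


def classify_throughput(bytes_read, seconds):
    """Label a run by its implied throughput (1 GB = 1e9 bytes assumed)."""
    gbps = bytes_read / seconds / 1e9
    if gbps > CACHED_CEILING_GBPS:
        return "cached"
    if gbps > SUSPECT_FLOOR_GBPS:
        return "suspect"
    return "plausible"


print(round(BUS_LIMIT_GBPS, 2))
print(classify_throughput(18.5e9, 1.0))   # 18.5 GB in one second
```

An 18.5 GB model that "loads" in one second implies 18.5 GB/s, which no single PCIe 5.0 ×4 device can deliver, so the run is labeled cached.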

Independent I/O monitoring

While storage workloads run, a separate monitor samples OS-level disk counters. If the device reports almost no physical reads while the benchmark "reads" gigabytes, or if per-operation latency is lower than flash storage can physically deliver, the run is classified as cache-served, a warning is emitted, and the classification is recorded alongside the metrics in the exported results.
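A rough sketch of the physical-read check on Linux follows, using the cumulative "sectors read" counter from `/proc/diskstats` (always reported in 512-byte units). The function names, the label strings, and especially the 50% threshold are assumptions made for illustration; the text above does not say what CAI counts as "almost no physical reads", and the latency check is not covered here.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_read(diskstats_text, device):
    """Extract the cumulative 'sectors read' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, reads completed, reads merged, sectors read, ...
        if len(fields) > 5 and fields[2] == device:
            return int(fields[5])
    raise KeyError(device)


def classify_io(logical_bytes, before_text, after_text, device,
                min_physical_fraction=0.5):   # hypothetical threshold
    """Compare bytes the benchmark read with bytes the device served."""
    physical = (sectors_read(after_text, device)
                - sectors_read(before_text, device)) * SECTOR_BYTES
    fraction = physical / logical_bytes
    label = "cache-served" if fraction < min_physical_fraction else "device-served"
    return {
        "physical_bytes": physical,
        "physical_fraction": fraction,
        "classification": label,
    }
```

In use, `before_text` and `after_text` would be two snapshots of `open("/proc/diskstats").read()` taken around the workload, and `device` a name such as `nvme0n1`.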

5. Run protocol and variance

Tests run 3 times by default (configurable); reported metrics are aggregated across runs and the per-run spread is preserved in the results file.
In the model-load test, runs flagged as cached are excluded from aggregation; if all runs are flagged, the result is still reported, but with an explicit warning attached. Other workloads record their cache classification alongside the metrics.
Real GGUF model files (up to 18.5 GB) are used as test payloads, so file sizes, access granularity, and filesystem behavior match real deployments.
Results export to JSON (machine-readable, for comparison databases) and Excel (human-readable, with per-metric breakdowns).
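The model-load aggregation rule can be sketched as follows. The text does not say which statistic CAI uses to aggregate runs, so the median here is an assumption, as are the function name, the input schema, and the output field names.

```python
import json
import statistics


def aggregate_model_load(runs):
    """runs: list of {"seconds": float, "cached": bool} (hypothetical schema)."""
    clean = [r["seconds"] for r in runs if not r["cached"]]
    warning = None
    if not clean:
        # Every run was flagged: still report, but say so explicitly.
        clean = [r["seconds"] for r in runs]
        warning = "all runs flagged as cached"
    return {
        "methodology": "CAI v1.0",
        "median_seconds": statistics.median(clean),   # assumption: median
        "min_seconds": min(clean),                    # per-run spread is kept
        "max_seconds": max(clean),
        "runs_used": len(clean),
        "runs_total": len(runs),
        "warning": warning,
    }


# Three hypothetical runs, one of them flagged as cached.
runs = [
    {"seconds": 4.5, "cached": False},
    {"seconds": 0.5, "cached": True},
    {"seconds": 4.0, "cached": False},
]
print(json.dumps(aggregate_model_load(runs), indent=2))
```

Keeping `runs_used` next to `runs_total` lets a reader of the exported file see how many runs were dropped without re-deriving it from the per-run data.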

6. Reproducing any number

CAI_Benchmark_CLI --all --runs 3 --json results.json   # full suite
CAI_Benchmark_CLI --storage                            # storage only
CAI_Benchmark_CLI --memory                             # memory only

Every score in any CAI report can be regenerated with one command on comparable hardware. That property — not any institution's name — is the methodology's authority.

Versioning promise. CAI scores are comparable within a methodology version. Workload traces, reference values, and weights are frozen per version; changes are announced, documented, and version-bumped. The methodology version (CAI v1.0) is stamped into every exported results file and is independent of the application version (v2.x). CAI v1.0 numbers will mean the same thing in five years as they do today.