Benchmark Methodology

asiai follows established benchmarking standards (MLPerf, SPEC CPU 2017, NVIDIA GenAI-Perf) to produce reliable, reproducible, and comparable results.

Protocol

  1. Warmup: 1 non-timed generation per engine to prime caches
  2. Measured runs: Default 3 runs per prompt per engine (configurable via --runs)
  3. Sampling: temperature=0 (greedy) for deterministic output
  4. Reporting: Median tok/s as primary metric (SPEC standard), mean +/- stddev as secondary
  5. Throttling: Warning emitted if thermal_speed_limit < 100% during any run
  6. Metadata: Engine version, model format, quantization, hardware chip, macOS version stored per result
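The protocol above can be sketched in a few lines. This is an illustrative outline, not asiai's actual implementation; `generate` is a hypothetical callable that runs one generation and returns its tok/s.

```python
import statistics

def benchmark(generate, prompt, runs=3):
    """Protocol sketch: 1 untimed warmup, then `runs` timed runs.
    Median tok/s is the primary metric (SPEC style); mean/stddev
    are reported as secondary."""
    generate(prompt)  # warmup: primes caches, result discarded
    samples = [generate(prompt) for _ in range(runs)]
    return {
        "median_tok_s": statistics.median(samples),
        "mean_tok_s": statistics.mean(samples),
        "stddev_tok_s": statistics.stdev(samples) if runs > 1 else 0.0,
    }
```

With greedy sampling (temperature=0) the output tokens are deterministic, so run-to-run spread reflects system noise rather than sampling variance.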

Metrics

tok/s — Generation Speed

Tokens per second of generation time only, excluding prompt processing (TTFT).

generation_s = total_duration_s - ttft_s
tok_per_sec  = tokens_generated / generation_s

At large context sizes (e.g., 64k tokens), TTFT can dominate total duration. Excluding it from tok/s prevents fast generators from appearing slow.
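The two formulas above combine into one small function (a direct transcription, shown here for clarity):

```python
def tok_per_sec(tokens_generated, total_duration_s, ttft_s):
    # Generation speed excludes prompt processing: at 64k-token
    # contexts TTFT can dominate total duration, and dividing by it
    # would make a fast generator look slow.
    generation_s = total_duration_s - ttft_s
    return tokens_generated / generation_s
```

For example, 100 tokens generated in 12 s total with a 2 s TTFT yields 10 tok/s, not 8.3.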

TTFT — Time to First Token

Time between sending the request and receiving the first output token, in milliseconds. Measured server-side (Ollama) or client-side at the first SSE content chunk (OpenAI-compatible engines).
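A minimal sketch of the client-side variant, assuming `stream` is a hypothetical iterable of already-decoded SSE chunks (not a real client API): the clock starts at request time and stops at the first chunk that carries content, skipping empty deltas and keep-alives.

```python
import time

def measure_ttft_ms(stream):
    """Client-side TTFT: elapsed time from request send to the
    first non-empty SSE content chunk, in milliseconds."""
    start = time.monotonic()
    for chunk in stream:
        if chunk:  # skip role-only deltas / keep-alive chunks
            return (time.monotonic() - start) * 1000.0
    raise RuntimeError("stream ended before any content chunk")
```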

Power — GPU Watts

Average GPU power during execution of each specific engine, measured via sudo powermetrics. One PowerMonitor per engine — no session-wide averaging.
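The per-engine monitor can be sketched as follows. This is an assumption-laden outline, not asiai's code: samples are taken to be GPU watts polled (e.g. from `sudo powermetrics`) only while the engine under test is running, so the average is scoped to that engine alone.

```python
class PowerMonitor:
    """One instance per engine; no session-wide averaging."""

    def __init__(self):
        self.samples = []

    def record(self, watts):
        # Called for each power sample taken during this engine's run.
        self.samples.append(watts)

    def average_watts(self):
        # Mean GPU power for this engine, or None if nothing was sampled.
        return sum(self.samples) / len(self.samples) if self.samples else None
```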

tok/s/W — Energy Efficiency

tok_per_sec_per_watt = tok_per_sec / power_watts

Variance — Pooled Stddev

Pooled intra-prompt standard deviation captures run-to-run noise without mixing in inter-prompt variance.

Stability classification:

  • CV < 5% → stable
  • CV < 10% → variable
  • CV >= 10% → unstable

Where CV = (std_dev / mean) * 100.
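The pooled stddev and the CV thresholds above can be sketched like this (a standard pooled-variance formula; the function names are illustrative):

```python
import math
import statistics

def pooled_stddev(runs_per_prompt):
    """Pooled intra-prompt stddev: combines each prompt's run-to-run
    variance, weighted by degrees of freedom, without mixing in the
    variance *between* prompts."""
    num = sum((len(r) - 1) * statistics.variance(r) for r in runs_per_prompt)
    den = sum(len(r) - 1 for r in runs_per_prompt)
    return math.sqrt(num / den)

def stability(std_dev, mean):
    """Classify by coefficient of variation, CV = (std_dev / mean) * 100."""
    cv = std_dev / mean * 100.0
    if cv < 5:
        return "stable"
    if cv < 10:
        return "variable"
    return "unstable"
```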

Conformance

| Practice | Status |
| --- | --- |
| TTFT separated from tok/s | Implemented |
| Deterministic sampling (temperature=0) | Implemented |
| Token count from server API (not SSE chunks) | Implemented |
| Per-engine power monitoring | Implemented |
| 1 warmup generation per engine | Implemented |
| Default 3 runs (SPEC minimum) | Implemented |
| Median as primary metric (SPEC standard) | Implemented |
| Pooled intra-prompt stddev | Implemented |
| Thermal throttling detection + warning | Implemented |
| Engine version + model metadata stored | Implemented |
| Historical regression detection | Implemented |

Apple Silicon Considerations

Unified Memory

Apple Silicon shares memory between CPU and GPU. asiai runs engines sequentially to avoid memory contention. Only Ollama reports VRAM per model — other engines show "—".

Thermal Throttling

  • MacBook Air (no fan): severe throttling under sustained load
  • MacBook Pro (fan): mild throttling
  • Mac Mini/Studio/Pro: active cooling, minimal throttling

asiai records thermal_speed_limit per result and warns if throttling is detected.

KV Cache

Large context sizes (32k+) can cause instability on engines that pre-allocate KV cache. Set engine context length to match the actual test size for fair results.

Metadata

Every benchmark result stores: engine, engine_version, model, model_format, model_quantization, hw_chip, os_version, thermal_level, thermal_speed_limit, metrics_version. This enables fair regression comparison and cross-machine benchmarks.
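As an illustration, the stored fields map naturally onto a flat record. The field names below come from the list above; the dataclass itself is a sketch, not asiai's actual schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class ResultMetadata:
    """One record per benchmark result; flat fields make regression
    comparison and cross-machine filtering straightforward."""
    engine: str
    engine_version: str
    model: str
    model_format: str
    model_quantization: str
    hw_chip: str
    os_version: str
    thermal_level: int
    thermal_speed_limit: float
    metrics_version: str
```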