Benchmark Best Practices

Version: 0.3.2 Status: Living document — updated as methodology evolves References: MLPerf Inference, SPEC CPU 2017, NVIDIA GenAI-Perf

Overview

asiai bench follows established benchmarking standards to produce reliable, reproducible, and comparable results across inference engines on Apple Silicon. This document tracks which best practices are implemented, planned, or intentionally excluded.

Conformance Summary

Category	Practice	Status	Since
Metrics	TTFT separated from tok/s	Implemented	v0.3.1
	Deterministic sampling (temperature=0)	Implemented	v0.3.2
	Token count from server API (not SSE chunks)	Implemented	v0.3.1
	Per-engine power monitoring	Implemented	v0.3.1
	generation_duration_ms explicit field	Implemented	v0.3.1
Warmup	1 warmup generation per engine (non-timed)	Implemented	v0.3.2
Runs	Default 3 runs (SPEC minimum)	Implemented	v0.3.2
	Median as primary metric (SPEC standard)	Implemented	v0.3.2
	Mean + stddev as secondary	Implemented	v0.3.0
Variance	Pooled intra-prompt stddev	Implemented	v0.3.1
	CV-based stability classification	Implemented	v0.3.0
Environment	Sequential engine execution (memory isolation)	Implemented	v0.1
	Thermal throttling detection + warning	Implemented	v0.3.2
	Thermal level + speed_limit recorded	Implemented	v0.1
Reproducibility	Engine version stored per benchmark	Implemented	v0.3.2
	Model format + quantization stored	Implemented	v0.3.2
	Hardware chip + macOS version stored	Implemented	v0.3.2
	Open-source benchmark code	Implemented	v0.1
Regression	Historical baseline comparison (SQLite)	Implemented	v0.3.0
	Comparison by (engine, model, prompt_type)	Implemented	v0.3.1
	metrics_version filtering	Implemented	v0.3.1
Prompts	4 diverse prompt types + context fill	Implemented	v0.1
	Fixed max_tokens per prompt	Implemented	v0.1

Planned Improvements

P1 — Statistical Rigor

Practice	Description	Standard
95% confidence intervals	CI = mean +/- 2*SE. More informative than +/- stddev.	Academic
Percentiles (P50/P90/P99)	For TTFT especially — tail latency matters.	NVIDIA GenAI-Perf
Outlier detection (IQR)	Flag runs outside [Q1 - 1.5IQR, Q3 + 1.5IQR].	Statistical standard
Trend detection	Detect monotone performance degradation across runs (thermal drift).	Academic

P2 — Reproducibility

Practice	Description	Standard
Cooldown between engines	Pause 3-5s between engines to let thermals stabilize.	GPU benchmark
Token ratio verification	Warn if tokens_generated < 90% of max_tokens.	MLPerf
Export format	`asiai bench --export` JSON for community submissions.	MLPerf submissions

P3 — Advanced

Practice	Description	Standard
`ignore_eos` option	Force generation to max_tokens for throughput benchmarks.	NVIDIA
Concurrent request testing	Test batching throughput (relevant for vllm-mlx).	NVIDIA
Background process audit	Warn if heavy processes are running during benchmark.	SPEC

Intentional Deviations

Practice	Reason for deviation
MLPerf minimum 600s duration	Designed for datacenter GPUs. Local inference on Apple Silicon with 3 runs + 4 prompts already takes ~2-5 minutes. Sufficient for stable results.
SPEC 2 non-timed warmup workloads	We use 1 warmup generation (not 2 full workloads). Single warmup is sufficient for local inference engines where JIT warmup is minimal.
Population vs sample stddev	We use population stddev (N divisor) instead of sample stddev (N-1 divisor). With small N (3-5 runs), the difference is minimal and population is more conservative.
Frequency scaling control	Apple Silicon does not expose CPU governor controls. We record thermal_speed_limit instead to detect throttling.

Apple Silicon Specific Considerations

Unified Memory Architecture

Apple Silicon shares memory between CPU and GPU. Two key implications:

Never benchmark two engines simultaneously — they compete for the same memory pool. asiai bench runs engines sequentially by design.
VRAM reporting — Only Ollama reports size_vram (GPU-mapped portion). OpenAI-compatible engines don't expose this. We show "—" rather than misleading values.

Thermal Throttling

MacBook Air (no fan): severe throttling under sustained load. Results degrade after 5-10 min.
MacBook Pro (fan): throttling is mild and usually handled by the fan ramping up.
Mac Mini/Studio/Pro: active cooling, minimal throttling.

asiai bench records thermal_speed_limit per result and warns if throttling is detected (speed_limit < 100%) during any run.

KV Cache and Context Length

Large context sizes (32k+) can cause performance instability on engines that pre-allocate KV cache at model load time. Example: LM Studio defaults to loaded_context_length: 262144 (256k), which allocates ~15-25 GB of KV cache for a 35B model, potentially saturating 64 GB of unified memory.

Recommendations: - When benchmarking large contexts, set engine context length to match the actual test size (e.g. lms load model --context-length 65536 for 64k tests). - Compare engines with equivalent context length settings for fair results.

Metadata Stored Per Benchmark

Every benchmark result in SQLite includes:

Field	Example	Purpose
`engine`	"ollama"	Engine identification
`engine_version`	"0.17.4"	Detect performance changes across updates
`model`	"qwen3.5:35b-a3b"	Model identification
`model_format`	"gguf"	Differentiate format variants
`model_quantization`	"Q4_K_M"	Differentiate quantization levels
`hw_chip`	"Apple M4 Pro"	Hardware identification
`os_version`	"15.3"	macOS version tracking
`thermal_level`	"nominal"	Environment condition
`thermal_speed_limit`	100	Throttling detection
`metrics_version`	2	Formula version (prevents cross-version regression)

This metadata enables: - Fair regression comparison: only compare results with matching metadata - Cross-machine benchmarks: identify hardware differences - Community data sharing: self-describing results (planned for v1.x)