Benchmark Metrics Specification

Version: 0.4.0
Status: Implemented
Scope: asiai bench, all engines

Motivation

Benchmark results must be comparable across engines. Each metric has a single definition that all engine implementations must respect. The implementation may vary (server-side API vs client-side measurement), but the semantics must be identical.

Metrics

M1. `tok_per_sec` — Generation Speed

Definition: Tokens produced per second of generation time only, excluding prompt processing (TTFT).

generation_s = total_duration_s - ttft_s
tok_per_sec  = tokens_generated / generation_s    (if generation_s >= 0.01)
             = 0.0                                 (otherwise)

Engine	`generation_s` source
Ollama	`eval_duration / 1e9` (server API — direct)
OpenAI-compat	`elapsed_s - (ttft_ms / 1000)` (client-side)

Rationale: At large context sizes (e.g. 64k tokens), TTFT can dominate total duration. Including it in tok/s makes fast generators appear slow (e.g. 3.2 tok/s instead of 42 tok/s).
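The M1 formula can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function name and the sample numbers are hypothetical, chosen only to show how including TTFT distorts the rate.

```python
def tok_per_sec(tokens_generated: int, total_duration_s: float, ttft_s: float) -> float:
    """Generation speed per M1: tokens per second of generation time only."""
    generation_s = total_duration_s - ttft_s
    # Below 10 ms the denominator is too small to yield a meaningful rate.
    if generation_s < 0.01:
        return 0.0
    return tokens_generated / generation_s


# Hypothetical long-context run: 12 s of prompt processing, then 128 tokens in 1 s.
naive = 128 / 13.0                        # rate that wrongly includes TTFT
corrected = tok_per_sec(128, 13.0, 12.0)  # rate over generation time only
```

With these made-up inputs the naive rate is under 10 tok/s, while the corrected rate reflects the generator's real speed.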

M2. `ttft_ms` — Time to First Token

Definition: Time between sending the request and receiving the first output token, in ms.

Engine	Source
Ollama	`prompt_eval_duration / 1e6` (server API)
OpenAI-compat	`(time.monotonic() at 1st content chunk - t0) * 1000` (client)

Note: Semantics differ slightly (server vs client measurement), but on localhost the gap is ~1ms — acceptable.
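The client-side measurement for OpenAI-compatible engines can be sketched as below. The `measure_stream` helper and the `fake_stream` generator are hypothetical stand-ins for a real SSE client; skipping empty chunks is an assumption about how "first content chunk" is detected.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_stream(chunks: Iterable[str], t0: float) -> Tuple[Optional[float], float, str]:
    """Return (ttft_ms, elapsed_s, text) for a stream of content chunks.

    t0 is the time.monotonic() reading taken just before the request is sent.
    """
    ttft_ms: Optional[float] = None
    parts = []
    for chunk in chunks:
        if not chunk:
            continue  # assumption: empty chunks carry no content and do not count
        if ttft_ms is None:
            # First chunk with content marks the time to first token.
            ttft_ms = (time.monotonic() - t0) * 1000
        parts.append(chunk)
    elapsed_s = time.monotonic() - t0
    return ttft_ms, elapsed_s, "".join(parts)


def fake_stream() -> Iterator[str]:
    """Hypothetical stream: a delay, an empty chunk, then two content chunks."""
    time.sleep(0.02)
    yield ""
    yield "Hel"
    time.sleep(0.01)
    yield "lo"


t0 = time.monotonic()
ttft_ms, elapsed_s, text = measure_stream(fake_stream(), t0)
```

Using `time.monotonic()` rather than wall-clock time keeps the measurement immune to system clock adjustments during a run.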

M3. `total_duration_ms` — Total Duration

Definition: Wall-clock total request time (prompt processing + generation), in ms.

Invariant: total_duration_ms >= ttft_ms — always.

Engine	Source
Ollama	`total_duration / 1e6` (server API)
OpenAI-compat	`elapsed_s * 1000` (client wall-clock)

M4. `tokens_generated` — Token Count

Definition: Number of output tokens produced by the model.

Source (by priority):
1. Server counter: Ollama `eval_count`, OpenAI-compat `usage.completion_tokens`
2. Text length estimate: `max(1, len(text) // 4)` (heuristic: ~4 chars/token)
3. Never `len(text_parts)` (SSE chunks != tokens)
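The priority order can be sketched as a small fallback function. The name `count_tokens` is hypothetical, and treating a zero or missing server counter as "unavailable" is an assumption rather than something the specification states.

```python
from typing import Optional


def count_tokens(server_count: Optional[int], text: str) -> int:
    """Pick the token count per M4: server counter first, text estimate second."""
    # Assumption: a missing or non-positive counter means the server gave no usable value.
    if server_count is not None and server_count > 0:
        return server_count
    # Heuristic fallback: roughly 4 characters per token, never less than 1.
    return max(1, len(text) // 4)
```

The number of streamed chunks is deliberately not an input here, since a chunk may carry several tokens or only part of one.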

M5. `generation_duration_ms` — Generation Duration

Definition: Generation time only (excluding TTFT), in ms. Makes the decomposition total = ttft + generation explicit and auditable.

Engine	Source
Ollama	`eval_duration / 1e6` (server API — direct)
OpenAI-compat	`max(0, elapsed_s - ttft_s) * 1000` (computed)
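A sketch of how the two engine families could arrive at the same fields. The response field names (`total_duration`, `prompt_eval_duration`, `eval_duration`, `eval_count`) are the ones listed in the tables above; the function names and the sample payload are hypothetical, and only the server-counter path of M4 is covered.

```python
def ollama_metrics(resp: dict) -> dict:
    """Convert Ollama's nanosecond durations into the metrics M1 to M5."""
    eval_ns = resp.get("eval_duration", 0)
    generation_s = eval_ns / 1e9
    tokens = resp.get("eval_count", 0)
    return {
        "ttft_ms": resp.get("prompt_eval_duration", 0) / 1e6,
        "total_duration_ms": resp.get("total_duration", 0) / 1e6,
        "generation_duration_ms": eval_ns / 1e6,
        "tokens_generated": tokens,
        # Same 10 ms guard as the M1 definition.
        "tok_per_sec": tokens / generation_s if generation_s >= 0.01 else 0.0,
    }


def openai_compat_generation_ms(elapsed_s: float, ttft_ms: float) -> float:
    """Client-side M5: wall-clock time minus TTFT, clamped at zero."""
    return max(0.0, elapsed_s - ttft_ms / 1000) * 1000


# Made-up payload, for illustration only.
sample = {
    "total_duration": 2_600_000_000,
    "prompt_eval_duration": 500_000_000,
    "eval_duration": 2_000_000_000,
    "eval_count": 84,
}
m = ollama_metrics(sample)
```

The clamp in the client-side path keeps `generation_duration_ms` non-negative even if timer jitter makes the measured TTFT exceed the elapsed time.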

M6. `power_watts` — GPU Power

Definition: Average GPU power during execution of this specific engine, in watts.

Scope: One PowerMonitor per engine. Started before the first prompt, stopped after the last run. Each engine gets its own measurement — no session-wide averaging.

Source: sudo powermetrics (macOS).

M7. `tok_per_sec_per_watt` — Energy Efficiency

tok_per_sec_per_watt = tok_per_sec / power_watts

Uses the corrected tok/s (M1) and per-engine power (M6).
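A minimal sketch of the ratio. The specification does not say what happens when no power reading exists (for example, when `powermetrics` could not be run), so returning `None` in that case is an assumption of this example, as is the function signature.

```python
from typing import Optional


def tok_per_sec_per_watt(tok_per_sec: float, power_watts: Optional[float]) -> Optional[float]:
    """Energy efficiency per M7, or None when no usable power reading exists."""
    # Assumption: a missing or non-positive reading means efficiency is undefined.
    if power_watts is None or power_watts <= 0:
        return None
    return tok_per_sec / power_watts
```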

M8. `std_dev_tok_s` — Variance (Pooled)

Definition: Pooled intra-prompt standard deviation — captures run-to-run noise without mixing in inter-prompt variance.

For each prompt_type p with runs [v1, v2, ..., vn]:
    var_p = sum((vi - mean_p)^2) / n    (population variance)

pooled_variance = mean(var_p for all p with n >= 2)
std_dev_tok_s   = sqrt(pooled_variance)

Stability classification (unchanged):
- CV < 5% → stable
- 5% <= CV < 10% → variable
- CV >= 10% → unstable

Where CV = (std_dev_tok_s / avg_tok_s) * 100.
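The pooled computation and the CV classification can be sketched with the standard library. The function names and sample data are hypothetical, and returning 0.0 when no prompt has at least two runs is an assumption.

```python
import math
from statistics import mean, pvariance
from typing import Dict, List


def pooled_std_dev(runs_by_prompt: Dict[str, List[float]]) -> float:
    """Pooled intra-prompt standard deviation of tok/s per M8."""
    # Population variance per prompt; prompts with a single run are skipped.
    variances = [pvariance(v) for v in runs_by_prompt.values() if len(v) >= 2]
    if not variances:
        return 0.0  # assumption: no measurable run-to-run noise
    return math.sqrt(mean(variances))


def stability(std_dev_tok_s: float, avg_tok_s: float) -> str:
    """Classify run-to-run stability from the coefficient of variation."""
    cv = (std_dev_tok_s / avg_tok_s) * 100
    if cv < 5:
        return "stable"
    if cv < 10:
        return "variable"
    return "unstable"


# Made-up runs: two prompts with different speeds, one prompt with a single run.
runs = {"code": [40.0, 42.0, 44.0], "story": [20.0, 22.0], "single": [99.0]}
pooled = pooled_std_dev(runs)
```

Because each variance is taken around its own prompt mean, the large speed gap between the "code" and "story" prompts does not inflate the result.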

Implementation Map

Metric	`base.py`	`ollama.py`	`openai_compat.py`	`runner.py`	`reporter.py`
M1 tok/s	field	server API	client (excl. TTFT)	passthrough	avg
M2 ttft_ms	field	server API	client streaming	passthrough	avg
M3 total_duration_ms	field	server API	client wall-clock	passthrough	avg
M4 tokens_generated	field	server API	server or `len//4`	passthrough	avg
M5 generation_duration_ms	field	server API	computed	stored in dict	—
M6 power_watts	—	—	—	per-engine monitor	passthrough
M7 tok/s/W	—	—	—	computed	passthrough
M8 std_dev	—	—	—	—	pooled intra-prompt

Benchmark Protocol

Warmup: 1 non-timed generation per engine ("Hello", max_tokens=1) to prime caches.
Measured runs: Default 3 runs per prompt per engine (configurable via --runs).
Sampling: temperature=0 (greedy) on all engines for deterministic output.
Reporting: Median tok/s as primary metric (SPEC standard), mean +/- stddev as secondary.
Throttling: Warning emitted if thermal_speed_limit < 100% during any run.
Metadata: engine_version, model_format, model_quantization, hw_chip, os_version stored per result for reproducibility.

See benchmark-best-practices.md for full methodology audit.
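The reporting rule in the protocol above (median first, mean and standard deviation second) can be sketched as follows. The `summarize` helper and the sample values are hypothetical; population standard deviation is used here to match the convention of M8, which is an assumption about the secondary figure.

```python
from statistics import mean, median, pstdev
from typing import Dict, List


def summarize(tok_per_sec_runs: List[float]) -> Dict[str, float]:
    """Summarize the measured runs of one prompt on one engine."""
    return {
        "median": median(tok_per_sec_runs),  # primary metric
        "mean": mean(tok_per_sec_runs),      # secondary
        "stddev": pstdev(tok_per_sec_runs),  # secondary, population form
    }


# Three made-up runs, matching the default --runs value of 3.
summary = summarize([41.0, 42.0, 46.0])
```

The median is less sensitive than the mean to a single outlier run, such as the 46.0 in this sample.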
