Skip to content

Agentic Benchmark Results

This page reports real asiai bench --agentic-mode results on Apple Silicon. The agentic protocol runs an 8-phase, prefix-cache-aware conversation (--runs 5 for variance), which exercises the way an agent actually uses a model — multi-turn, long system prefix, 50K-token long-context phase — rather than a single one-shot generation.

Why agentic mode — who is this for? Agent frameworks don't drive a model like a chatbot: they reuse a large system prefix across many turns, emit tool calls, and carry long context. A one-shot throughput number misses all of that — and the ranking can even flip (an engine with great raw decode but a multi-second TTFT or a broken prefix cache is unusable for an agent). Agentic mode measures the model the way it is actually driven by agent orchestrators and coding assistants — e.g. Hermes Agent, OpenClaw, opencode, Aider, Cline, or Continue — so the result reflects real agent workloads, not a benchmark artefact.

Living document. These numbers are refreshed as engine versions, model revisions and instrumentation improve (e.g. peak-RAM capture). Each row carries the exact engine version and model file so a result is always reproducible.

Campaign 2026-06-03. Models: Qwen3.6 and the Qwopus3.6 finetune, in two architectures — 27B dense and 35B-A3B MoE (Mixture-of-Experts, ~3B active parameters per token). Engines: llama.cpp (b9430) and the MLX family (mlx-lm, mlx_vlm, omlx, rapid-mlx, vllm-mlx). MTP = the model's built-in Multi-Token Prediction head used for speculative decoding (--spec-type draft-mtp). Hardware: MacBook Pro M5 Max (128 GB) and Mac mini M4 Pro (64 GB), both in High Power Mode.

How to read the table

Verdict-first. Rows are grouped by a deterministic gate result, not just sorted:

  • best validated throughput in the block · viable · reserve (passes hard gates but mediocre latency) · eliminated (failed a gate).
  • Gates: valid ≥ 80% · TTFT ≤ 1500 ms (hard fail > 3000) · prefix-cache reuse > 0.
  • dec = sustained warm decode (tok/s) · 50K = decode at 50K context · TTFT = time-to-first-token (ms) · t/s/W = tokens per second per SoC watt (efficiency, higher is better) · RAMpk = peak engine RSS (GB, the figure that governs memory fit) · = not measured (never 0).
  • ★ ranks by throughput only. Picking a model for real work also weighs output quality (see the dev/code evaluation), which throughput does not capture.

M4 Pro and M5 Max are not comparable in absolute terms here — different quant (Q5_K_XL vs Q4_K_S). Compare within a machine block.

MacBook Pro M5 Max 128 GB · Q4

model · engine · MTP dec t/s peak 50K TTFT ms reuse t/s/W RAMpk GB valid%
★ Tier 1 — winner + fast
Qwopus-35B · llamacpp b9430 ▲MTP 123.3 127.5 83.8 67 0.8 1.590 100
Qwen-35B · llamacpp b9430 ▲MTP 118.3 123.5 82.9 62 0.8 1.513 100
Qwopus-35B · llamacpp b9430 105.7 108.3 76.1 63 0.8 1.507 100
Qwen-35B · llamacpp b9430 85.5 90.8 66.7 59 0.8 1.403 100
✓ Tier 2 — viable (slower)
Qwen-27B · llamacpp b9430 ▲MTP 28.0 29.5 22.9 118 0.8 0.378 32.2 100
Qwopus-27B · llamacpp b9430 ▲MTP 26.7 29.8 22.0 118 0.8 0.367 31.5 100
Qwopus-27B · llamacpp b9430 25.9 27.1 20.8 110 0.8 0.342 28.4 100
Qwen-27B · llamacpp b9430 23.8 24.0 19.2 111 0.8 0.340 28.9 100
⚠ Tier 3 — reserve (poor latency)
Qwopus-27B · mlx-lm 0.31.3 29.2 29.3 24.3 600 1.0 0.461 26.4 100
Qwen-27B · rapid-mlx 0.6.71 20.6 20.7 17.9 798 0.357 85
Qwen-27B · omlx 0.4.0 20.0 20.2 17.5 2150 0.82 0.346 26.7 100
✗ Tier 4 — eliminated
~~Qwen-27B · mlx_vlm 0.6.0 ▲MTP~~ ~~41.0~~ ~~10879~~ 0.0 75
~~Qwen-27B · mlx_vlm 0.6.0~~ ~~31.9~~ 26.0 ~~9578~~ 0.0 100
~~Qwen-27B · vllm-mlx 0.3.0~~ ~~20.5~~ 18.1 ~~9578~~ 24.3 100

Eliminations: mlx_vlm+MTP fails validity (75%) and breaks long-context; both mlx_vlm runs and vllm-mlx have ~9.6 s TTFT (unusable per agent turn).

Mac mini M4 Pro 64 GB · Q5

model · engine · MTP dec t/s peak 50K TTFT ms reuse t/s/W RAMpk GB valid%
★ Tier 1
Qwen-35B · llamacpp b9430 ▲MTP 44.6 50.7 32.6 143 0.8 1.557 33.0 100
Qwen-35B · llamacpp b9430 36.3 45.6 29.6 133 0.8 1.553 30.8 100
✓ Tier 2
Qwen-27B · llamacpp b9430 10.4 10.4 7.2 397 0.8 0.279 31.9 100
Qwen-27B · llamacpp b9430 ▲MTP 9.7 9.8 7.5 409 0.8 0.272 35.4 100

Key findings

  • The 35B-A3B MoE beats the 27B dense on every throughput axis on both machines — it activates only ~3B parameters per token, so it decodes ~4× faster than the dense 27B and is ~3.5× more energy-efficient (1.5 vs ~0.4 tok/s/W). Throughput is not quality, however — see the caveat below.
  • MTP gain depends on architecture × hardware. Measured decode uplift: MoE +38% (M5) / +23% (M4); dense +16% (M5) but −7% (M4) — on the slower M4 GPU the dense draft overhead is not amortised. So MTP is a per-model, per-machine measurement, not a universal win.
  • The MLX server family is throughput-only here: mlx-lm has the best MLX decode but a 600 ms TTFT floor; mlx_vlm, vllm-mlx and omlx are knocked out by TTFT (2–11 s) and/or broken prefix-cache. llama.cpp dominates first-token latency (~60–120 ms).
  • Peak vs steady RAM. mlx-lm's RSS sits at ~14.5 GB steady but peaks at 26.4 GB (lazy KV allocation + compact MLX-4bit weights); llama.cpp pre-allocates the full context KV up front (~29 GB flat). At peak they are comparable — use RAMpk for memory-fit decisions, not the steady value.

Methodology & caveats

  • asiai bench --agentic-mode --runs 5, thinking disabled (chat_template_kwargs.enable_thinking=false), server context ≥ 65536.
  • One engine resident at a time (SOLO); page cache purged between GGUF runs that share a file.
  • Quant differs by machine (M5 Q4_K_S/Q4_K_XL, M4 Q5_K_XL) → absolute numbers are not comparable across machines, only within a block.
  • High Power Mode is required on the M5 laptop (otherwise sustained GPU is throttled ~40%); the M4 mini desktop is roughly neutral to it.
  • Known instrumentation gaps (being fixed): peak RAM is missing () on some manually-launched llama.cpp servers; engine version is not yet stamped per run (shown here from a version map); prefix-cache reuse is a coarse fraction pending a true hit-rate.

See also: Benchmark methodology · Metrics spec · Community leaderboard.