Apple Silicon Agentic Inference Panel

Comparative benchmark panel across inference engines (llama.cpp, mlx-lm, LM Studio, Rapid-MLX, vLLM-MLX, oMLX, vMLX, Ollama) running Qwen 3.6 family models on Apple Silicon M-series, measured with asiai bench --agentic-mode and asiai bench --burst-mode.

Workload target: agent-orchestrator class — ~60-80 tool calls per turn, identical system prompt of ~7 KB, user message changing per call. This is the worst case for naïve prefix caching: a true cache-reuse cross-USER is required, not just cache-on-the-same-prompt.

Reading the throughput figures: Section 1 decode rates use the Qwen3 default chat template (thinking ON), so they include reasoning tokens — effective agent-throughput on a thinking model is lower. Thinking is a per-task trade-off (caveat 1), not a global on/off.

Published 2026-06 · contributions and corrections welcome via github.com/druide67/asiai.

⚠️ Known caveats before reading further

Thinking mode is a per-task trade-off. With the Qwen3 default template (thinking ON), Qwen 3.6 / Qwopus emit ~6-7× more tokens, so the Section 1 decode figures include reasoning tokens and effective agent-throughput is lower. Thinking ON is required for written multi-section deliverables (a thinking-OFF model skips the deliverable) but costs atomic tool-call cleanliness (asiai measures ~100% clean tool calls with thinking OFF vs ~77.8% with thinking ON + preserve_thinking ON, deterministic across runs; enable_thinking=on + preserve_thinking=off is unusable — a deterministic HTTP 500 once reasoning accumulates in the context). Set thinking per task-dimension, not as one global flag.
Rapid-MLX and vLLM-MLX share an engine. Rapid-MLX is a community fork of waybarrios/vllm-mlx; they appear as separate rows below because they have diverged in version and features, but the prefix-cache snapshot mechanism is the same lineage.
MTP: Qwen 3.6 has a real head; the backend matters. Qwen 3.6's official config.json carries mtp_num_hidden_layers=1 (Qwen naming — not the DeepSeek num_nextn_predict_layers key, so a nextn-only check wrongly concludes "no head"). Some re-quantized GGUF/MLX artifacts drop the MTP tensors while keeping the config flag — verify the tensors in the weight index, not just the flag. llama.cpp native MTP (--spec-type draft-mtp) requires a -MTP-GGUF that embeds the head; a plain GGUF cannot draft. Released mlx-lm does not run the head as native speculative decoding (PR ml-explore/mlx-lm#990 adds it). LM Studio routes GGUF through its llama.cpp-derived backend and MLX through mlx-engine.
Single-pass measurements, no variance reporting — Section 1 / 2 chiffres are single observations. Variance reporting (median + min + max across N passes) is supported as of --burst-runs N but the rebench is pending.

Section	Topic	Status
1	Single-call performance	🟡 8 cells, thinking-mode ON (decode includes reasoning tokens)
2	Concurrent burst (30/60/80 parallel calls)	🟡 smoke cell + 2 partial concurrent points; no normalized 30/60/80 panel
3	Caches & optimizations	✅ 8 engines covered
4	Memory & resources	✅ idle + under-load swap (+0) + footprint measured
5	Model quality (public leaderboards)	🟡 vendor/self-reported figures (llm-stats)
—	asiai direct measurements	✅ dev-quality, thinking ablation, MTP, instruction-following
6	Operational (license, endpoints, maintenance)	✅ 8 engines covered
7	Quality benchmark weighting	🟡 default weighting, override via `--weights` planned
8	Custom long-horizon eval (proposal)	🟡 scoped, not yet built

Section 1 — Single-call performance

🟠 May 2026 snapshot — indicative, not the reference numbers. This table was captured in May (thinking-mode ON, single-pass) and its source fixtures have not been re-verified. For current, reproducible decode throughput, use the asiai direct measurements section below (June, llama.cpp b9430, deterministic). What this table is reliable for is the relative TTFT / prefix-cache story (cross-USER reuse), not absolute t/s. Note in particular that the 123.9 t/s in row 5 (LM Studio GGUF+MTP) sits right next to the June llama.cpp Qwopus+MTP 123.3 t/s — LM Studio's GGUF path is a llama.cpp-derived backend, so the two measure essentially the same engine.

⚠️ Read with caveat 1 above: every figure in this table includes the Qwen3 default thinking-mode tokens (reasoning_content). Effective agent-throughput requires re-running with chat_template_kwargs={"enable_thinking": false}. The column is labeled "decode (t/s)" not "effective throughput".

The "lower-bound estimate" column is 60 × (TTFT + max_tokens/decode), assuming sequential dispatch (which Rapid-MLX single-slot enforces). It is not a production tick prediction — see Section 7 for the methodological caveat.

📌 Versions tested (May 2026): Rapid-MLX 0.6.66, LM Studio 0.4.14, llama.cpp b9270. Engine versions churn weekly on Apple Silicon — treat each figure as dated, not current. (The asiai-measurements section uses llama.cpp b9430.)

#	Engine	Model	Format	Warm decode (t/s) ¹	TTFT warm (ms)	TTFT prefix-test median (ms)	TTFT cold (ms)	Lower-bound estimate (60 calls × single-call, optimistic)	Source fixture
1	Rapid-MLX 0.6.66 (fork of vllm-mlx)	Qwopus 3.6-35B-A3B-v1 (zaydiscold MLX-4bit)	MLX-4bit	109.1 ¹	139	131	2074	~3.6 min	`cell-rapidmlx-qwopus35b.json`
2	Rapid-MLX 0.6.66	Qwen 3.6-35B-A3B-UD (MLX-4bit)	MLX-4bit	106.9 ¹	321	319	2095	~4 min	`cell-rapidmlx-35b-a3b.json`
3	Rapid-MLX 0.6.66	Qwopus 3.6-27B-v2 (Jackrong MLX-4bit)	MLX-4bit	31.8 ¹	323	323	8647	~13 min	`cell-rapidmlx-qwopus.json`
4	Rapid-MLX 0.6.66	Qwen 3.6-27B-UD (MLX-4bit)	MLX-4bit	20.5 ¹	527	527	8954	~23 min	`cell-rapidmlx-full-27bud.json`
5	LM Studio 0.4.14 (GGUF backend) ²	Qwen 3.6-35B-A3B-MTP (Unsloth GGUF)	GGUF Q4 + MTP	123.9 ¹ ²	309	5965	6063	~3.5 min warm / ~9.2 min prefix-changing	`cell-lmstudio-mtp-qwen35b.json`
6	LM Studio 0.4.14 (GGUF backend) ²	Qwopus 3.6-35B-A3B-v1 (Jackrong GGUF)	GGUF Q4_K_S	105.6 ¹	292	5785	5624	~3.5 min warm / ~9.6 min prefix-changing	`cell-lmstudio-qwopus35b.json`
7	llama.cpp b9270	Qwen 3.6-35B-A3B (UD Q5_K_XL)	GGUF Q5_K_XL	80.9 ¹	3000	3000	n/a	~8 min	(baseline reference)
8	llama.cpp b9270	Qwopus 3.6-27B-v2 (Jackrong GGUF Q4)	GGUF Q4	25.3 ¹	13000	13000	n/a	~30 min	(baseline reference)

¹ Thinking-mode caveat: figures captured with default chat template (thinking ON). Real-world effective throughput on tool-call workloads is typically 4-12 t/s on Qwopus/Qwen3.6 finetunes when reasoning tokens inflate output 6-7×. To reproduce these decode figures, pass chat_template_kwargs={"enable_thinking": false} in the request payload.

² LM Studio backend: rows 5-6 used a GGUF file, which routes through LM Studio's llama.cpp-derived backend (NOT the MLX runtime mlx-engine). The MTP claim in row 5 reflects this backend's implementation, not mlx-engine speculative decoding. Released mlx-lm does not run the MTP head as native speculative decoding (its sanitize() historically dropped MTP weights during conversion; native support is in PR ml-explore/mlx-lm#990), so a hypothetical MLX-format MTP model would not benefit on the released mlx-engine either.

Key observations

On the realistic agent pattern (identical system + changing user prompts), Rapid-MLX + Qwopus 35B-A3B-v1 delivers 131 ms median TTFT prefix-test vs 5965 ms for LM Studio GGUF backend (~44× faster). The advantage comes from the vllm-mlx prefix-cache snapshot mechanism (see Section 3 for the source-code disambiguation).
On pure decode throughput (warm path), the LM Studio GGUF backend with Unsloth MTP records 123.9 t/s vs Rapid-MLX 109.1 t/s (+13.5%). This delta reflects the LM Studio llama.cpp-derived backend's speculative decoding on a GGUF carrying the MTP head, not an Apple-MLX gain (released mlx-engine does not run the head — see footnote 2). On the native llama.cpp path, MTP is net-positive on the MoE 35B-A3B — see Section 3.
All Qwen 3.6 family configurations (hybrid DeltaNet + full-attention) fail cross-USER prefix cache except Rapid-MLX, which keeps an RNN-state snapshot. On llama.cpp / LM Studio GGUF llama_memory_can_shift=false; on mlx-lm / oMLX the recurrent/SSM state can't be split at an arbitrary token boundary. The upstream llama.cpp fix for this architecture is not merged (#23121 closed; preserve_thinking does not address it, #22615).
Single-slot serialization confirmed: smoke burst test (Section 2) shows Rapid-MLX 0.6.66 serializes concurrent calls FIFO (p50 ≈ p95 ≈ max on burst=5). For 60-80 calls/turn, total wall-time scales linearly with burst size on this engine. A multi-slot engine (e.g. llama.cpp --parallel N) would behave differently, but --parallel N on Qwen3.6 hybrid disables prefix cache per slot (architectural limitation).

Section 2 — Concurrent burst (30/60/80 parallel calls)

Pattern: 30 to 80 concurrent POST /v1/chat/completions calls fired within a ~200 ms window. Simulates an agent loop dispatching multiple MCP/tool calls in parallel. Measured natively via asiai bench --burst-mode.

🟡 Status: 1 smoke cell measured (Rapid-MLX burst-5). Full panel pending.

Smoke cell (Rapid-MLX 0.6.66 + Qwopus 35B-A3B-v1, burst=5)

burst N	wall-time (s)	p50 latency (ms)	p95 latency (ms)	max latency (ms)	agg throughput (t/s)
5	2.8	2615	2792	2812	88.8

Smoke finding: p50 ≈ p95 ≈ max indicates the 5 calls were serialized server-side (single-slot engine). Rapid-MLX 0.6.66 does not appear to support concurrent request scheduling — calls queue FIFO internally. To validate at 60/80 calls scale.

Full concurrent panel — not yet measured

A normalized 30/60/80-concurrent panel has not been run (the measurements here are sequential agentic-mode, not concurrent burst). The two partial concurrent data points that exist elsewhere:

TurboQuant (K=q8_0 V=turbo2, Qwen3-4B, M4 Pro): +9% aggregate at 4-parallel (68.5 → 74.7 t/s) even though single-stream is −8% — the KV compression buys back the parallel headroom.
oMLX continuous batching (mlx-lm BatchGenerator): ×1.8 aggregate at burst-8 (12.8 → 22.9 t/s), but it collapses at burst-30 (17.3 t/s) once a 27B-dense saturates RAM into swap — 0 crashes.

A dedicated burst-mode panel across all engines is deferred.

Section 3 — Caches & optimizations

#	Couple	Cache reuse cross-USER	Snapshot persists cross-restart	MTP support	MTP accept rate	TurboQuant compat	KV cache native types	Native parallel slots
1	Rapid-MLX + Qwopus 35B-A3B-v1	✅ YES (RNN-state snapshot, see ³ below)	✅ persistent in `~/.cache/vllm-mlx/`	❌ released MLX runtime doesn't run the MTP head as speculative decode (mlx-lm PR #990 pending)	n/a	❌ MLX only	MLX native (no quant flag exposed)	⚠️ single slot (smoke burst confirms FIFO serialization)
2	Rapid-MLX + Qwen 35B-A3B-UD	✅ YES ³	✅ persistent	❌	n/a	❌	MLX native	⚠️ single slot
3	LM Studio + Qwen 35B-A3B-MTP	❌ NO (architectural hybrid limitation)	n/a	✅ via mlx-engine v1.8.1	82.1 % (on coding task)	❌	mlx-engine v1.8.1 (4bit MLX)	configurable via GUI
4	LM Studio + Qwopus 35B-A3B-v1	❌ NO	n/a	❌ no heads	n/a	❌	mlx-engine v1.8.1 (Q4_K_S GGUF)	configurable via GUI
5	llama.cpp + Qwen 3.6-35B-A3B	❌ NO (architectural hybrid limitation)	n/a	✅ `--spec-type draft-mtp` on a `-MTP-GGUF` (a plain GGUF cannot draft). Net-positive on the MoE 35B-A3B — asiai measures +38% decode (base) / +17% (Qwopus) on M5 Max (see § asiai measurements)	benefit = intra-session decode delta (no acceptance rate logged)	✅ turbo2/3/4 V cache	`fp16`, `q8_0`, `q5_0`, `turbo2/3/4`	⚠️ `--parallel N` works mechanically but disables prefix cache per slot on hybrid arch (each slot owns its KV, the `--cache-reuse N` flag is already silently disabled here). Use with caution.
6	mlx-lm	❌ NO (PRs #923, #188, #192 pending upstream)	n/a	❌ broken on hybrid arch	n/a	❌	MLX native	❌ (single slot)
7	oMLX	❌ NO (tool calling lost post-cache-hit, issue #825)	partial	❌	n/a	❌	MLX native + tiered SSD cache	❌
8	vLLM-MLX (`waybarrios`, upstream of Rapid-MLX)	⚠️ trie prefix-cache, no documented hybrid/DeltaNet support (Rapid-MLX rows 1-2 add the RNN-state snapshot on top)	n/a	⚠️ MTP added in prerelease 0.4.0rc1	n/a	❌	MLX + paged-attention	✅

³ Rapid-MLX prefix cache: the cache stores hybrid-attention KV slabs + RNN-state snapshots, keyed per <repo>--<sys_prompt_hash> and persisted under ~/.cache/vllm-mlx/. The observed ~131 ms TTFT prefix-test is an in-RAM KV slab reattach plus the changed-user forward pass, not a from-disk reload.

oMLX large-context cache. oMLX's 2-tier paged SSD KV cache turns a 55K-token prefill from ~115 s to ~3.5 s TTFT on a same-prompt cache-hit (×33; 55,296 / 55,837 tokens cached). On small prompts (~7.5K) there's no advantage (~2-5 s, = mlx-lm) and decode is ~19 t/s (no raw-speed gain). This is same-prompt reuse, not cross-USER (which oMLX doesn't do); cross-restart persistence is documented but not yet A/B-tested.

TurboQuant KV compression (llama.cpp). K=q8_0 V=turbo2 cuts KV RAM ~28% (22.9 → 16.4 GB on a 4B model, M4 Pro) with tool-call validity unchanged (10/10), and gains +9% aggregate at 4-parallel despite −8% single-stream. The symmetric K=turbo3 V=turbo3 reaches ~−56% RAM but degrades quality (early-stop, repetition) — the asymmetric q8_0/turbo2 is the usable config.

Section 4 — Memory & resources (Apple Silicon M5 Max 128 GB)

#	Couple	Working-set RAM (GB)	Disk footprint (GB)	Swap Δ idle	Swap Δ under load	SOLO required?	Cohabitation safe?
1	Rapid-MLX + Qwopus 35B-A3B-v1	~22	19.9 (MLX-4bit)	+0	+0 MB	⚠️ SOLO (cohabit thrash to 0.4 t/s)	❌
2	Rapid-MLX + Qwen 35B-A3B-UD	~24	20.0 (MLX-4bit)	+0	+0 MB	⚠️ SOLO	❌
3	LM Studio + Qwen 35B-A3B-MTP	21.6	23.2 (Q4 + MTP heads)	+0	+0 MB	not tested	not tested
4	LM Studio + Qwopus 35B-A3B-v1	18.5	19.9 (Q4_K_S)	+0	+0 MB	not tested	not tested
5	llama.cpp + Qwen 3.6-35B-A3B (reference)	~16	~16 (Q5_K_XL)	+0	+0 MB	❌	✅ with `--parallel 2/3`

"Under load" = the 8-phase agentic bench including a 50K-token prefill (the heaviest sequential memory stress measured), M5 Max 128 GB, SOLO: swap delta 0 MB / 0 swapouts for every engine — model + KV fit in free/inactive memory with >100 GB headroom. This is sequential-load memory, not 60-concurrent memory (see Section 2). Working-set RAM is an estimate; measured RSS includes mmap'd GGUF / wired MLX pages, so the true incremental footprint is lower (the MTP head adds ~+3 GB).

Observations

Rapid-MLX requires SOLO operation on the GPU: cohabitation with another actively-decoding engine triggers a swap delta of 5.4 → 14.2 GB and a decode collapse to 0.4 t/s. Do not start a second engine on the same Apple Silicon GPU.
LM Studio MTP disk footprint is +13 % vs Q4_K_S without MTP heads, due to the MTP weight blocks. Negligible cost relative to the +17 % decode gain.
On M5 Max 128 GB unified memory: every 35B-A3B configuration tested leaves more than 100 GB headroom after load — RAM is not the limiting factor.
On M4 Pro 64 GB: Q5_K_XL does not fit alongside auxiliary models (swap thrash observed in production). Q4_K_S does fit.

Section 5 — Model quality

Public-benchmark figures here are vendor / self-reported and aggregated by leaderboards (llm-stats), not independently verified. Cross-validate at llm-stats · LiveBench · SWE-bench before relying on them. asiai's own direct measurements on Apple Silicon are in the next section.

Author-only claims (Jackrong/Qwopus, Unsloth self-eval) are flagged separately and kept out of the public-leaderboard columns.

🔴 Critical finding: the "Hessling agentic" benchmark cited on several community model cards is not independently reproducible — 16 prompts, single curator, no neutral leaderboard integration. All three advisors recommend treating it as a smoke test only.

Open-weight Qwen 3.6 base models

Public-leaderboard figures (llm-stats), self-reported. The 27B-dense outscores the 35B-A3B MoE on SWE-bench — consistent with asiai's own dev-quality finding below (the MoE base is the one that hits the tool-call empty-object bug). MTP heads are a decode-speed feature and do not change a model's quality scores.

Model	Architecture	SWE-bench Verified	GPQA Diamond	MMLU-Pro	Terminal-Bench 2.0	BFCL
Qwen 3.6-35B-A3B-Instruct	MoE 35B / 3B active	73.4%	86.0%	85.2%	24.6%	absent from board
Qwen 3.6-27B-Dense Instruct	Dense 27B hybrid	77.2%	87.8%	86.2%	59.3% (vendor)	absent from board

Terminal-Bench 2.0 is far harder than the older Terminal-Bench v1 (community cards quote ~51.5% for the 35B-A3B on v1); the 24.6% here is the 2.0 generation.

Qwopus 3.6 family — author-reported only, not independently verified

The Qwopus 3.6 finetunes published by Jackrong on HuggingFace claim substantial gains over the Qwen base. As of May 2026 these claims have not been independently reproduced on neutral leaderboards. Treat as experimental until BFCL / SWE-bench reruns by a third party are available.

Model (author claims)	MMLU-Pro	SWE-bench Verified	Hessling agentic (16 prompts)
Qwopus 3.6-35B-A3B-v1 (Jackrong)	claimed 88+	claimed 75+	claimed 88.6 ⚠ non-reproducible
Qwopus 3.6-27B-v2 (Jackrong)	claimed 87.43	claimed 75.25	n/a

⚠ The "Hessling agentic" benchmark cited on the Jackrong model cards appears to be a 16-prompt curator-specific evaluation with no neutral leaderboard integration. All three advisories queried (Grok-4, GPT-5, Gemini Advanced) recommend treating it as smoke test only.

Frontier anchors (mid-2026)

All figures are vendor / self-reported, aggregated by llm-stats — none are independently verified there. Terminal-Bench 2.0 is the exception (the tbench team re-runs submissions; rows are peak agent×model scores). GPQA are vendor "Diamond" figures and the set is near-saturated — treat as approximate.

Model	SWE-bench Verified	GPQA Diamond	MMLU-Pro	Terminal-Bench 2.0	Source
Claude Opus 4.8	88.6%	93.6%	n/a	— (no TB submission)	llm-stats / Anthropic
Claude Opus 4.7	87.6%	94.2%	n/a	90.2%	llm-stats / tbench
Claude Sonnet 4.6	79.6%	89.9%	n/a	53.4%	llm-stats / tbench
GPT-5.5	n/a* (SWE-Pro 58.6%)	93.6%	n/a	84.7%	OpenAI / tbench
GPT-5 (base)	74.9%	85.7%	n/a	49.6%	llm-stats / tbench
Gemini 3.1 Pro	80.6%	~94.4%	n/a	80.2%	llm-stats / tbench
DeepSeek-V4-Pro-Max	80.6%	90.1%	87.5%	n/a	vendor (DeepSeek)
Llama-3.3-70B-Instruct	n/a	n/a	68.9%	n/a	Meta (baseline)

* GPT-5.5 has no public SWE-bench Verified score (OpenAI reports SWE-bench Pro Public 58.6%); the "88.7% SWE-bench" figure circulating is not on any primary source. Note: Qwen 3.6 has no 235B-A22B — the open family is the 27B-dense and 35B-A3B (below); the 235B-A22B is the prior Qwen3 generation.

Same-class open-weights baselines

Model	MMLU-Pro	SWE-bench Verified	Notes
Llama-3.3-70B-Instruct	~75-80	~40-50	Older but well-characterized baseline
Mistral Codestral 25.05 / Devstral	high (coding-specialized)	medium-high	Strong editor-style completion fidelity, weaker on reasoning
GLM-4.6-Coder (Zhipu)	vendor claims very high	disputed	Significant skepticism around evaluation methodology (consensus)

Quality benchmarks deprecated for this decision

HumanEval / HumanEval+ — saturated in 2026, all frontier models above 90 %, no signal left.
GSM8K — saturated, no signal for coding agents.
MMLU (original) — superseded by MMLU-Pro.
Author-reported "Hessling agentic" 16-prompt — non-reproducible, treat as smoke test only.

Open quality questions (research gaps)

Quality-per-GB-RAM benchmark: no standard exists. Proposed proxy formula: AgentScorePerGB = (0.5·SWE + 0.3·BFCL + 0.2·TerminalBench) / RAM_resident.
Long-horizon stability (60+ tool calls): closest existing benchmarks are τ-bench, PencilPuzzleBench (>1000 turns), MultiAgentBench, TRAIL. None of them specifically measure "schema correctness and strategic coherence across 60-80 sequential tool calls" — that benchmark gap is acknowledged by all three advisors.
Conversion-aware evaluation (MLX-4bit vs GGUF Q4_K_M vs Q5_K_XL): no standardized leaderboard. Community reports diverge — some claim MLX-4bit preserves tool-calling stability worse than GGUF Q5_K_M, others say the opposite. Practical advice: run your own production workload against each quant before committing.
Qwopus 3.6 family quality validation: needs third-party BFCL + SWE-bench reruns. Author claims should not drive production decisions.

asiai direct measurements — Apple Silicon, mid-2026

What the public leaderboards above don't show: measurements asiai ran directly on Apple Silicon (M5 Max 128 GB in High Power Mode, M4 Pro 64 GB), llama.cpp b9430, deterministic (temp 0), on the public Qwen 3.6 family and the Opus-distilled Qwopus finetune. Caveat: cross-session absolute throughput on the M5 laptop is ±15% (thermal/load); only the intra-session ±MTP back-to-back deltas are tight, and M5↔M4 absolutes aren't comparable (different quants).

Dev-quality / tool-call (`asiai bench --code`)

The base Qwen 3.6-35B-A3B (MoE) collapses edit_file.edits to an empty object on the deep-context turn — 3/3 runs, at both Q4_K_S and Q5_K_XL, same chat template. Tool-call clean 87.5%, edit-turns clean 66.7%. It is the MoE base's tool-call generation behaviour, not the quant and not the template.
The dense 27B (Q5_K_XL) and Qwopus-35B-A3B (Q4_K_S) both score 100% clean / 0 bugs — Qwopus reaches dense-27B tool-call reliability at the MoE's ~4× decode rate.
Under a harder tool-call stress suite, Qwopus stays 100% / 0 while the dense 27B drops to 88.9% / 3 bugs (the same empty-object failure). But on an expression-evaluator trap (precedence of ** vs unary minus) the dense 27B is correct and Qwopus is wrong — they split. (Recovery rate is weight-sensitive and noisy — not a headline.)

Thinking ablation (`asiai bench --thinking-ablation`, Qwopus-35B-A3B, 3 deterministic runs)

Config	Tool-call clean	Note
`enable_thinking=off`	100%	the only fully-clean config
`enable_thinking=on` + `preserve_thinking=on`	77.8%	2/9 turns dirty
`enable_thinking=on` + `preserve_thinking=off`	11.1%	turns 2-8 → HTTP 500 (context corruption); avoid

MTP throughput (`--spec-type draft-mtp`, warm decode, intra-session ±MTP)

Model / hardware	MTP off	MTP on	Δ
35B-A3B base · M5 Max	85.5 t/s	118.4 t/s	+38%
Qwopus 35B-A3B · M5 Max	105.7 t/s	123.3 t/s	+17%
27B-dense · M5 Max	23.8 t/s	28.0 t/s	+18%
Qwopus 27B · M5 Max	25.9 t/s	26.7 t/s	+3%
35B-A3B MoE · M4 Pro	36.3 t/s	44.6 t/s	+23%
27B-dense · M4 Pro	10.4 t/s	9.7 t/s	−6%

MTP gain scales as (MoE > dense) × (M5 > M4) — strongly positive on the MoE, marginal-to-negative on the slow dense path (the draft overhead isn't amortised). The Qwopus finetune's MTP head is also weaker than the base (Qwopus 27B +3% / 35B +17%, vs base 27B-dense +18% / 35B-A3B +38%) — finetuning erodes the draft head. The MLX-side MTP (mlx_vlm) is disqualified: it breaks long context (empty output, 75% valid). Headline: the 35B-A3B MoE + MTP on llama.cpp sustains ~118 t/s decode on M5 Max (~44 t/s on M4 Pro), ~4× the 27B-dense, at ~1.5 tok/s/W, TTFT ~62 ms, 100% output validity.

Instruction-following (`asiai bench --instruct`, research-brief)

The thinking trade-off has teeth on multi-step deliverables: with enable_thinking=false, Qwopus-35B does the tool work but delivers the requested multi-section brief 0% of the time (it stops at the secondary step); with thinking on, the base model delivers it 100% (5/5 sections). This pulls the opposite way from the tool-call result above — thinking-off is cleanest for atomic tool calls but suppresses written deliverables — which is why asiai sets thinking per task-dimension, not as one global switch.

Perfectionist research loop (`asiai bench --instruct loop-search`)

Single-turn IFEval and research-brief saturate at 100% across these models, so neither surfaces the perfectionist research loop: a model that won't accept an ambiguous, unconfirmable search result and re-issues semantically-equivalent queries until a no-progress guardrail halts it, never delivering. A loop-search sweep (9 configs, M5, b9430, thinking on/off, two ambiguity modes) isolates it:

The 35B-A3B MoE loops to the cap — for both the base and the Qwopus finetune, in Q4 and Q8 alike. The higher quant does not fix it, so the loop is architectural to the A3B MoE, not a quant artefact.
The dense 27B never loops (Q4 / Q5 / Q8): it accepts the ambiguous result and writes the briefing.

So the throughput leader (the MoE, ~118-123 t/s) and the agentic-fitness leader (the dense 27B, ~25 t/s) are different models. For a harness such as NousResearch's Hermes Agent, loop-resistance can outweigh raw decode — the fastest model is not always the right agent. (This is the inverse of the tool-call result, where the MoE finetune was the more robust agent: fitness is per-failure-mode, so measure several.)

Section 6 — Operational

📌 Capability snapshot (mid-2026). Engine versions churn weekly on Apple Silicon — these cells are point-in-time, not a version-pinned guarantee.

#	Engine	License	Stream OAI-compat	`/v1/models`	`/health`	`/metrics` (Prometheus)	Tool calling	Auto-DL HF	Persisted prefix cache	Maintainer activity
1	Rapid-MLX 0.6.66	Apache-2.0	✅	✅	✅ (HTML page)	❌ (logs only)	✅	✅ HF Hub auto-DL on serve	✅ `~/.cache/vllm-mlx/prefix_cache/`	community (raullenchai)
2	LM Studio 0.4.14	proprietary	✅	✅	partial (websocket)	❌	✅	✅ via `lms get` CLI	❌	Element Labs
3	llama.cpp b9270	MIT	✅	✅	✅	✅ `--metrics`	✅	manual (GGUF on disk)	❌ (`--cache-reuse N` arch-disabled on hybrid)	ggerganov very active
4	mlx-lm	MIT	✅	✅	✅	❌	partial	✅ HF auto	❌	Apple ml-explore active
5	oMLX	MIT	✅	✅	✅	❌	✅ (caveat: post-cache-hit bug)	✅	partial (tiered SSD)	jundot active
6	vLLM-MLX	Apache-2.0	✅	✅	✅	✅	✅	✅	✅ paged-attention	vllm-project active
7	vMLX (Mamba/SSM)	Apache-2.0	✅	✅	✅	partial	untested	partial	untested	community
8	Ollama	MIT	✅	partial	✅ `/api/version`	❌	partial	✅ `ollama pull`	❌	Ollama Inc. very active

Section 7 — Quality benchmark weighting for agentic-coding workloads

This is the asiai default weighting for an orchestrator-class workload (60-80 sequential tool calls per turn, schema-validated output, long-context system prompts). It is informed by three frontier-LLM advisories (Grok-4, GPT-5, Gemini Advanced) queried May 2026, but is not a community consensus — treat as a starting point, not authoritative. Override via a future --weights flag (planned).

Benchmark	What it measures	Why it matters here	Consensus weight
SWE-bench Verified	Real GitHub repo navigation + patch + test repair	Best proxy for code-editing fidelity inside an agent loop	35 %
BFCL v3 (Berkeley Function Calling Leaderboard)	Multi-turn function-call accuracy, argument fidelity, schema adherence	Direct predictor of orchestrator stability across many tool calls	25 %
TerminalBench 2.0 / MCP-Atlas	CLI and MCP task execution autonomy	"Does the agent survive 40+ actions without derailing"	20 %
LiveBench Coding	Contamination-resistant coding tasks (refreshed monthly)	Catches train-test leakage that inflates HumanEval-class scores	10 %
Custom long-horizon stability eval	60-80 sequential tool calls with cumulative context growth, malformed JSON recovery	The benchmark that does not exist yet in public form — see Section 8	10 %

Benchmarks consciously dropped from the weighting

MMLU-Pro, GPQA Diamond, HumanEval+ — useful as a general capability signal, but weakly correlated with agent-loop reliability per 2026 evidence. Frontier-lab confirmations indicate single-shot reasoning scores no longer predict autonomous agent success at sufficient granularity.
Author-reported aggregates without third-party reruns (Jackrong Hessling, Unsloth self-eval, GLM-4.6-Coder vendor claims).

Section 8 — Custom "endurance" benchmark proposal (research opportunity)

All three advisors converge on the same gap: the benchmark that would best characterize an orchestrator workload does not exist publicly yet. Building one is the only way to get the missing signal.

Proposed scope

80 sequential tool calls per trajectory
Schema validation at every turn (strict JSON / structured output)
Cumulative context growth (10K → 50K tokens across the trajectory)
Interruption / recovery tests (mid-trajectory cancel + resume)
Malformed XML/JSON recovery (does the agent self-correct ?)
Repo-edit persistence (do the edits made at turn N still hold at turn 60 ?)

This is on the asiai roadmap (a long-horizon endurance mode, after burst-mode). If built, it would be the first public benchmark in this specific niche.

Methodology

Hardware: MacBook Pro M5 Max 128 GB unified memory, macOS 26.4.1.
Workload: orchestrator class — system prompt ~7 KB, user prompt ~150-200 tokens, 60-80 calls per turn.
Phases measured (single-call, agentic-mode v1.6.0):
cold: first call after fresh start
warm: same exact prompt as cold (warm cache)
prefix-test-1/2/3: identical system, user changing — measures cross-USER cache reuse
cold-prefix: identical system, after restart — measures persistent cache
Verdict prefix cache reuse: YES if median(prefix-test) / cold < 0.2, else NO.
Anti-bias measures: SOLO mode (no cohabiting engines), thermal idle baseline, mmap warm-up phase.
Quality gates (auto-tracked by asiai bench):
early_stop: at least 2 runs with <0.5× median completion
memory_pressure: swap delta >500 MB OR swapouts delta >1000
duplicate_processes: multiple engine processes detected during the bench

The full protocol is the asiai bench --agentic-mode / --burst-mode instrumentation (power/thermal, engine footprint, KV occupancy, prefix-cache phases) — see the asiai CLI docs.

Open questions

MTP on vLLM-MLX/Rapid-MLX — answered (partly). vLLM-MLX added MTP in prerelease 0.4.0rc1 (2026-05-21); the theoretical combo "MLX + MTP-equipped Qwopus 35B-A3B + cross-USER snapshot" could win on both decode and TTFT once the Rapid-MLX fork tracks 0.4.x. Track when Rapid-MLX picks up the MTP path.
MTP on the MLX runtime — current state. Released mlx-lm does not run the MTP head as native speculative decoding (sanitize() drops the MTP weights during conversion; native support is in the unmerged PR ml-explore/mlx-lm#990). LM Studio's mlx-engine wraps mlx-lm, so it inherits this — the +13.5% decode gain in Section 1 row 5 comes from LM Studio's llama.cpp-derived backend (the file is GGUF), not from mlx-engine speculative decoding.
Burst behavior on Rapid-MLX/vllm-mlx at 60-80 calls scale: smoke test confirms single-slot FIFO at burst=5. Full panel pending (Section 2). The relevant upstream issue is whether vllm-mlx plans continuous-batching / multi-slot scheduling for hybrid arch models.
llama_memory_can_shift=false on Qwen 3.6 hybrid — still broken upstream. #18497 is closed (documents full re-processing); #22384 is an issue (closed-as-completed), not a merged fix; the actual fix PR #23121 was closed unmerged (patches live only on forks). The "just enable preserve_thinking" workaround is refuted by open issue #22615 (0.67× speedup = cache stays inert). The hybrid DeltaNet layers don't expose a shiftable cache state by construction.
Qwopus 3.6 quality independent reproduction: needs third-party BFCL / SWE-bench reruns. Author-published numbers should not drive production decisions until cross-verified.
vllm-mlx vs Rapid-MLX lineage — answered. Rapid-MLX is a community hard fork of waybarrios/vllm-mlx, not a thin wrapper: it vendors the engine in-tree (package still named vllm_mlx), does not pip-depend on the upstream package, and has diverged substantially (Rapid-MLX 0.6.74 vs upstream 0.3.0). The shared vllm_mlx package name and ~/.cache/vllm-mlx/ dir are a frequent source of attribution confusion (see Section 3, caveat 2).

This panel is a living document. Contributions, corrections, and additional bench cells welcome via github.com/druide67/asiai.