Benchmark to choose. Dashboard to monitor. History to spot problems.
asiai bench
asiai web
Sound familiar?
Ollama, LM Studio, mlx-lm — each with its own CLI, formats, and metrics. No common ground.
No real-time VRAM monitoring, no power tracking, no thermal alerts. You're flying blind.
Benchmarking means curl scripts, copy-pasting numbers, and comparing in spreadsheets.
Everything you need to benchmark, monitor, and optimize local inference.
Same model on Ollama vs LM Studio vs mlx-lm. One command, real numbers. No vibes.
Measure GPU power during inference. Know your tok/s per watt — nobody else does this.
Ollama, LM Studio, mlx-lm, llama.cpp, vllm-mlx. Auto-detected, auto-configured.
stdlib Python only. No requests, no psutil, no rich. Installs in seconds.
Detects throttling during benchmarks. Alerts when your Mac overheats mid-inference.
Auto-detects performance drops after OS or engine updates. SQLite history with 90-day retention.
Full JSON API for automation. /api/snapshot, /api/status, /api/metrics — integrate with any stack.
Built-in /metrics endpoint. Plug into Grafana, Datadog, or any Prometheus-compatible tool. Zero config.
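The JSON API can be scripted with nothing but the standard library. A minimal polling sketch, assuming the dashboard serves on localhost:8000 and that the snapshot carries fields like `vram_used_gb` and `gpu_power_w` (address and field names here are illustrative, not the documented schema):

```python
import json
import urllib.request

ASIAI_URL = "http://localhost:8000"  # assumed dashboard address

def get_snapshot(base_url: str = ASIAI_URL) -> dict:
    """Fetch the current system snapshot from the JSON API."""
    with urllib.request.urlopen(f"{base_url}/api/snapshot", timeout=5) as resp:
        return json.load(resp)

def summarize(snapshot: dict) -> str:
    """One-line summary of a snapshot; field names are illustrative."""
    vram = snapshot.get("vram_used_gb", "?")
    power = snapshot.get("gpu_power_w", "?")
    return f"VRAM: {vram} GB | GPU power: {power} W"

# Usage (with the dashboard running):
#   print(summarize(get_snapshot()))
```

The same pattern works for /api/status and /api/metrics: plain HTTP GET, JSON back, no client library required.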
Real questions from r/LocalLLaMA, answered in one command.
Head-to-head comparison — the #1 question on r/LocalLLaMA.
LLMs running 24/7 for AI agents — track VRAM, thermal, and performance.
tok/s per watt between engines. Critical for 24/7 Mac Mini homelabs.
Did the Ollama or macOS update break your performance? Auto-detection via SQLite.
Long-context benchmarks via --context-size. Does your model survive 256k context?
Drift detection across benchmark runs. Unique to asiai.
MLPerf/SPEC methodology. Warmup, median, greedy decoding. Share with confidence.
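That methodology (warmup passes to prime caches, median over repeated runs, greedy decoding so output length stays deterministic) fits in a few lines. A minimal harness sketch, not asiai's actual implementation; `generate` stands in for any callable that runs one inference pass and returns its token count:

```python
import statistics
import time

def bench(generate, warmup: int = 2, runs: int = 5) -> float:
    """Median tok/s over several timed runs, after warmup passes."""
    for _ in range(warmup):
        generate()  # prime caches and compile paths; results discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        samples.append(tokens / (time.perf_counter() - start))
    return statistics.median(samples)
```

The median, not the mean, is what makes the numbers shareable: a single throttled run can't drag the result down.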
asiai doctor diagnoses system, engines, and database with fix suggestions.
Dark/light web dashboard with live charts, SSE progress, benchmark controls.
Same engine, different models. Which quantization wins?
Expose /metrics, scrape with Prometheus, visualize in Grafana. Production-grade observability.
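Regression detection of this kind is easy to sketch: store each run's tok/s in SQLite and flag the newest run when it falls well below the recent median. A minimal illustration, assuming a hypothetical `runs` table (asiai's real schema is not shown here):

```python
import sqlite3
import statistics

DRIFT_THRESHOLD = 0.10  # flag drops of more than 10% vs the recent median

def detect_drift(db: sqlite3.Connection, engine: str, model: str,
                 window: int = 10) -> bool:
    """True if the newest run is >10% below the median of prior runs."""
    rows = db.execute(
        "SELECT tok_s FROM runs WHERE engine=? AND model=? "
        "ORDER BY ts DESC LIMIT ?",
        (engine, model, window),
    ).fetchall()
    if len(rows) < 3:
        return False  # not enough history to judge
    latest, history = rows[0][0], [r[0] for r in rows[1:]]
    return latest < (1 - DRIFT_THRESHOLD) * statistics.median(history)
```

Comparing against a median window rather than the single previous run keeps normal run-to-run variance from triggering false alarms.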
Three commands. That's it.
$ brew install asiai
$ asiai detect
✔ ollama (11434)
✔ lmstudio (1234)
✔ mlx-lm (8080)
→ 3 engines found
$ asiai bench -m qwen3.5
Engine     tok/s   TTFT
lmstudio    71.2   42ms
ollama      54.8   61ms
mlx-lm      30.1   38ms
Numbers from actual benchmarks on Apple Silicon.
MLX is 2.3x faster for MoE architectures (Qwen3.5-35B-A3B) on Apple Silicon.
VRAM stays constant from 64k to 256k context with DeltaNet — not documented anywhere else.
Same model, same Mac: 30 tok/s on one engine, 71 tok/s on another. The engine matters more than the hardware.
8 metrics, consistent methodology, every run.
Generation speed (tokens/sec)
Time to first token
GPU power draw in watts
Energy efficiency (tok/s per watt)
Run-to-run variance
GPU memory footprint
Throttling state
Long-context performance scaling
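Energy efficiency is just throughput divided by power draw. A quick sketch of the arithmetic (the numbers below are illustrative, not measurements):

```python
def tokens_per_watt(tok_s: float, gpu_power_w: float) -> float:
    """tok/s per watt: generation speed normalized by GPU power draw."""
    return tok_s / gpu_power_w

# 71.2 tok/s at 40 W vs 54.8 tok/s at 25 W: the slower engine
# can still be the more efficient one for a 24/7 homelab.
print(round(tokens_per_watt(71.2, 40.0), 2))  # 1.78
print(round(tokens_per_watt(54.8, 25.0), 2))  # 2.19
```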
Install in seconds. Zero dependencies.
brew tap druide67/tap
brew install asiai
pip install asiai