Benchmark to choose. Dashboard to monitor. History to spot problems.
asiai bench
asiai web
Sound familiar?
Ollama, LM Studio, mlx-lm — each with its own CLI, formats, and metrics. No common ground.
No real-time VRAM monitoring, no power tracking, no thermal alerts. You're flying blind.
Benchmarking means curl scripts, copy-pasting numbers, and comparing in spreadsheets.
Everything you need to benchmark, monitor, and optimize local inference.
Same model on Ollama vs LM Studio vs mlx-lm. One command, real numbers. No vibes.
Measure GPU power during inference. Know your tok/s per watt — nobody else does this.
Ollama, LM Studio, mlx-lm, llama.cpp, vllm-mlx. Auto-detected, auto-configured.
stdlib Python only. No requests, no psutil, no rich. Installs in seconds.
Detects throttling during benchmarks. Alerts when your Mac overheats mid-inference.
Auto-detects performance drops after OS or engine updates. SQLite history with 90-day retention.
Full JSON API for automation. /api/snapshot, /api/status, /api/metrics — integrate with any stack.
Built-in /metrics endpoint. Plug into Grafana, Datadog, or any Prometheus-compatible tool. Zero config.
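The JSON API can be scripted with nothing but the standard library. A minimal polling sketch, assuming the dashboard serves on localhost:8000 and that the snapshot carries fields like `vram_used_gb` and `gpu_power_w` (address and field names here are illustrative, not the documented schema):

```python
import json
import urllib.request

ASIAI_URL = "http://localhost:8000"  # assumed dashboard address

def get_snapshot(base_url: str = ASIAI_URL) -> dict:
    """Fetch the current system snapshot from the JSON API."""
    with urllib.request.urlopen(f"{base_url}/api/snapshot", timeout=5) as resp:
        return json.load(resp)

def summarize(snapshot: dict) -> str:
    """One-line summary of a snapshot; field names are illustrative."""
    vram = snapshot.get("vram_used_gb", "?")
    power = snapshot.get("gpu_power_w", "?")
    return f"VRAM: {vram} GB | GPU power: {power} W"

# Usage (with the dashboard running):
#   print(summarize(get_snapshot()))
```

The same pattern works for /api/status and /api/metrics: plain HTTP GET, JSON back, no client library required.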
Real questions from r/LocalLLaMA, answered in one command.
Head-to-head comparison — the #1 question on r/LocalLLaMA.
LLMs running 24/7 for AI agents — track VRAM, thermal, and performance.
tok/s per watt between engines. Critical for 24/7 Mac Mini homelabs.
Did the Ollama or macOS update break your performance? Auto-detection via SQLite.
Long-context benchmarks via --context-size. Does your model survive 256k context?
Drift detection across benchmark runs. Unique to asiai.
MLPerf/SPEC methodology. Warmup, median, greedy decoding. Share with confidence.
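That methodology (warmup passes to prime caches, median over repeated runs, greedy decoding so output length stays deterministic) fits in a few lines. A minimal harness sketch, not asiai's actual implementation; `generate` stands in for any callable that runs one inference pass and returns its token count:

```python
import statistics
import time

def bench(generate, warmup: int = 2, runs: int = 5) -> float:
    """Median tok/s over several timed runs, after warmup passes."""
    for _ in range(warmup):
        generate()  # prime caches and compile paths; results discarded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate()
        samples.append(tokens / (time.perf_counter() - start))
    return statistics.median(samples)
```

The median, not the mean, is what makes the numbers shareable: a single throttled run can't drag the result down.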
asiai doctor diagnoses system, engines, and database with fix suggestions.
Dark/light web dashboard with live charts, SSE progress, benchmark controls.
Same engine, different models. Which quantization wins?
Expose /metrics, scrape with Prometheus, visualize in Grafana. Production-grade observability.
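Regression detection of this kind is easy to sketch: store each run's tok/s in SQLite and flag the newest run when it falls well below the recent median. A minimal illustration, assuming a hypothetical `runs` table (asiai's real schema is not shown here):

```python
import sqlite3
import statistics

DRIFT_THRESHOLD = 0.10  # flag drops of more than 10% vs the recent median

def detect_drift(db: sqlite3.Connection, engine: str, model: str,
                 window: int = 10) -> bool:
    """True if the newest run is >10% below the median of prior runs."""
    rows = db.execute(
        "SELECT tok_s FROM runs WHERE engine=? AND model=? "
        "ORDER BY ts DESC LIMIT ?",
        (engine, model, window),
    ).fetchall()
    if len(rows) < 3:
        return False  # not enough history to judge
    latest, history = rows[0][0], [r[0] for r in rows[1:]]
    return latest < (1 - DRIFT_THRESHOLD) * statistics.median(history)
```

Comparing against a median window rather than the single previous run keeps normal run-to-run variance from triggering false alarms.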
Three commands. That's it.
$ brew install asiai
$ asiai detect
✔ ollama (11434)
✔ lmstudio (1234)
✔ mlx-lm (8080)
→ 3 engines found
$ asiai bench -m qwen3.5
Engine     tok/s   TTFT
lmstudio    71.2   42ms
ollama      54.8   61ms
mlx-lm      30.1   38ms
Numbers from actual benchmarks on Apple Silicon.
MLX is 2.3x faster for MoE architectures (Qwen3.5-35B-A3B) on Apple Silicon.
VRAM stays constant from 64k to 256k context with DeltaNet — not documented anywhere else.
Same model, same Mac: 30 tok/s on one engine, 71 tok/s on another. The engine matters more than the hardware.
8 metrics, consistent methodology, every run.
Generation speed (tokens/sec)
Time to first token
GPU power draw in watts
Energy efficiency (tok/s per watt)
Run-to-run variance
GPU memory footprint
Throttling state
Long-context performance scaling
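Energy efficiency is just throughput divided by power draw. A quick sketch of the arithmetic (the numbers below are illustrative, not measurements):

```python
def tokens_per_watt(tok_s: float, gpu_power_w: float) -> float:
    """tok/s per watt: generation speed normalized by GPU power draw."""
    return tok_s / gpu_power_w

# 71.2 tok/s at 40 W vs 54.8 tok/s at 25 W: the slower engine
# can still be the more efficient one for a 24/7 homelab.
print(round(tokens_per_watt(71.2, 40.0), 2))  # 1.78
print(round(tokens_per_watt(54.8, 25.0), 2))  # 2.19
```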
Install in seconds. Zero dependencies.
brew tap druide67/tap
brew install asiai
pip install asiai