
Which LLM? Which engine?
Which combo wins on your Mac?

Benchmark to choose. Dashboard to monitor. History to spot problems.

Python 3.11+ Apache 2.0 Apple Silicon
asiai CLI benchmark

asiai bench

asiai web dashboard

asiai web

The Local LLM Problem

Sound familiar?

🧩

Fragmented

Ollama, LM Studio, mlx-lm — each with its own CLI, formats, and metrics. No common ground.

🙈

Blind

No real-time VRAM monitoring, no power tracking, no thermal alerts. You're flying blind.

📋

Manual

Benchmarking means curl scripts, copy-pasting numbers, and comparing in spreadsheets.

Built for Apple Silicon Power Users

Everything you need to benchmark, monitor, and optimize local inference.

⚔️

Head-to-Head Benchmarks

Same model on Ollama vs LM Studio vs mlx-lm. One command, real numbers. No vibes.

Energy Efficiency

Measure GPU power during inference. Know your tok/s per watt — nobody else does this.
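
The per-watt figure is plain arithmetic. A minimal sketch of how an efficiency number like this can be computed from raw samples (the function and its inputs are illustrative, not asiai's API):

```python
def tokens_per_watt(tokens_generated: int, elapsed_s: float,
                    power_samples_w: list[float]) -> float:
    """Efficiency = generation speed divided by mean GPU power draw."""
    tok_per_s = tokens_generated / elapsed_s
    avg_power_w = sum(power_samples_w) / len(power_samples_w)
    return tok_per_s / avg_power_w

# 1024 tokens in 16 s at ~20 W average: 64 tok/s -> 3.2 tok/s/W
print(round(tokens_per_watt(1024, 16.0, [19.0, 20.0, 21.0]), 2))  # → 3.2
```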

🔧

5 Engines, One CLI

Ollama, LM Studio, mlx-lm, llama.cpp, vllm-mlx. Auto-detected, auto-configured.

📦

Zero Dependencies

stdlib Python only. No requests, no psutil, no rich. Installs in seconds.

🌡️

Thermal Intelligence

Detects throttling during benchmarks. Alerts when your Mac overheats mid-inference.

📉

Regression Detection

Auto-detects performance drops after OS or engine updates. SQLite history with 90-day retention.
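
A regression check like this fits in a few lines of stdlib Python. The schema and the 10% threshold below are assumptions for illustration, not asiai's actual internals:

```python
import sqlite3
import statistics

# Illustrative schema and threshold — not asiai's actual internals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (ts TEXT, engine TEXT, model TEXT, tok_s REAL)")
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", [
    ("2025-01-01", "ollama", "qwen3.5", 55.1),
    ("2025-01-08", "ollama", "qwen3.5", 54.6),
    ("2025-01-15", "ollama", "qwen3.5", 44.0),  # first run after an update
])

# Flag a regression when the latest run drops >10% below the historical median.
history = [r[0] for r in conn.execute(
    "SELECT tok_s FROM runs WHERE engine = 'ollama' ORDER BY ts")]
baseline, latest = statistics.median(history[:-1]), history[-1]
if latest < baseline * 0.9:
    print(f"regression: {latest} tok/s vs baseline {baseline:.1f} tok/s")
```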

🌐

REST API

Full JSON API for automation. /api/snapshot, /api/status, /api/metrics — integrate with any stack.
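
Consuming the snapshot endpoint needs nothing beyond the standard library. The payload shape below is illustrative only — inspect your asiai version's actual response before relying on field names:

```python
import json

# With the dashboard running, urllib.request.urlopen(".../api/snapshot")
# would return JSON; a canned payload stands in for a live server here.
snapshot = json.loads("""
{"engines": [{"name": "ollama", "port": 11434, "up": true}],
 "gpu": {"vram_used_mb": 9216, "power_w": 18.4}}
""")

for engine in snapshot["engines"]:
    print(f'{engine["name"]}:{engine["port"]} {"up" if engine["up"] else "down"}')
print(f'VRAM {snapshot["gpu"]["vram_used_mb"]} MB @ {snapshot["gpu"]["power_w"]} W')
```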

📈

Prometheus Native

Built-in /metrics endpoint. Plug into Grafana, Datadog, or any Prometheus-compatible tool. Zero config.
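
Prometheus's text exposition format is plain text, which is why a stdlib-only /metrics endpoint is feasible. A sketch of rendering gauges in that format (the metric names are illustrative, not asiai's actual metric set):

```python
# Render gauge samples in Prometheus text exposition format:
# a "# TYPE" line followed by "name value" for each metric.
def render_metrics(samples: dict[str, float]) -> str:
    lines = []
    for name, value in samples.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

print(render_metrics({"asiai_gpu_power_watts": 18.4,
                      "asiai_tokens_per_second": 54.8}))
```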

What Will You Discover?

Real questions from r/LocalLLaMA, answered in one command.

🏆

"Which engine is fastest?"

Head-to-head comparison — the #1 question on r/LocalLLaMA.

🤖

"Monitor a multi-agent swarm"

LLMs running 24/7 for AI agents — track VRAM, thermals, and performance.

🔋

"Compare energy efficiency"

tok/s per watt between engines. Critical for 24/7 Mac Mini homelabs.

🚨

"Detect regressions after updates"

Did the Ollama or macOS update break your performance? Auto-detection via SQLite.

📏

"Test long context support"

Benchmarks with --context-size 64k and beyond. Does your model survive 256k context?

🔥

"Is my Mac thermal throttling?"

Drift detection across benchmark runs. Unique to asiai.
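
The idea behind drift detection can be sketched simply: if the later runs in a session are markedly slower than the first ones, heat soak is the likely culprit. The window size and threshold here are illustrative assumptions:

```python
# Compare the mean tok/s of the first and last runs in a session;
# a large head-to-tail drop suggests thermal throttling set in.
def thermal_drift(tok_s_runs: list[float], threshold: float = 0.15) -> bool:
    head = sum(tok_s_runs[:3]) / 3
    tail = sum(tok_s_runs[-3:]) / 3
    return (head - tail) / head > threshold

print(thermal_drift([70.0, 69.5, 70.2, 60.0, 55.0, 50.0]))  # → True
```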

📊

"Reproducible benchmarks"

MLPerf/SPEC methodology. Warmup, median, greedy decoding. Share with confidence.
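
The warmup-then-median pattern is easy to sketch: discard the first runs so caches and compilation settle, then report the median so one outlier can't skew the result. The fake results below stand in for real inference calls:

```python
import statistics

def benchmark(run, warmup: int = 2, measured: int = 5) -> float:
    for _ in range(warmup):
        run()                      # warmup results discarded
    return statistics.median(run() for _ in range(measured))

# Stand-in for a real inference call returning tok/s per run.
fake_results = iter([10.0, 12.0, 54.0, 55.0, 53.0, 71.0, 54.5])
print(benchmark(lambda: next(fake_results)))  # → 54.5
```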

🩺

"Health check in one command"

asiai doctor diagnoses system, engines, and database with fix suggestions.

💻

"Visual dashboard"

Dark/light web dashboard with live charts, SSE progress, benchmark controls.

🔄

"Compare LLMs head-to-head"

Same engine, different models. Which quantization wins?

📡

"Prometheus + Grafana monitoring"

Expose /metrics, scrape with Prometheus, visualize in Grafana. Production-grade observability.

Up and Running in 60 Seconds

Three commands. That's it.

1

Install

brew install asiai
2

Detect

$ asiai detect
✔ ollama (11434)
✔ lmstudio (1234)
✔ mlx-lm (8080)
→ 3 engines found
3

Benchmark

$ asiai bench -m qwen3.5
Engine     tok/s   TTFT
lmstudio   71.2    42ms
ollama     54.8    61ms
mlx-lm     30.1    38ms

Real Discoveries

Numbers from actual benchmarks on Apple Silicon.

2.3x

MLX vs llama.cpp

MLX is 2.3x faster for MoE architectures (Qwen3.5-35B-A3B) on Apple Silicon.

Flat

VRAM: 64k → 256k

VRAM stays constant from 64k to 256k context with DeltaNet — not documented anywhere else.

30 vs 71

Engine > Model

Same model, same Mac: 30 tok/s on one engine, 71 tok/s on another. The engine matters more.

Supported Engines

Auto-detected, zero configuration needed.

Engine      Default Port   API
Ollama      11434          Native
LM Studio   1234           OpenAI-compatible
mlx-lm      8080           OpenAI-compatible
llama.cpp   8080           OpenAI-compatible
vllm-mlx    8000           OpenAI-compatible

What We Measure

8 metrics, consistent methodology, every run.

🚀

tok/s

Generation speed (tokens/sec)

⏱️

TTFT

Time to first token

⚡

Power (W)

GPU power draw in watts

🔋

tok/s/W

Energy efficiency

📈

Stability

Run-to-run variance

💾

VRAM

GPU memory footprint

🌡️

Thermal

Throttling state

📏

Context

Long-context performance scaling

Get Started

Install in seconds. Zero dependencies.

Homebrew
brew tap druide67/tap
brew install asiai
pip
pip install asiai
