How to Benchmark LLMs on Mac

Running a local LLM on your Mac? Here's how to measure real performance — not vibes, not "it feels fast", but actual tok/s, TTFT, power consumption, and memory usage.

Why Benchmark?

The same model runs at very different speeds depending on the inference engine. On Apple Silicon, MLX-based engines (LM Studio, mlx-lm, oMLX) can be 2x faster than llama.cpp-based engines (Ollama) for the same model. Without measuring, you're leaving performance on the table.
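asiai automates this comparison, but you can sanity-check one engine by hand. The sketch below measures generation tok/s against a local Ollama server, whose `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds); the model name, prompt, and default port are assumptions for illustration.

```python
import json
from urllib import request

def generation_speed(stats: dict) -> float:
    """Tokens per second from Ollama's timing fields (eval_duration is in nanoseconds)."""
    return stats["eval_count"] / stats["eval_duration"] * 1e9

def bench_ollama(model: str, prompt: str, host: str = "http://localhost:11434") -> float:
    """One-shot, non-streaming generation against a local Ollama server; returns tok/s."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = request.Request(f"{host}/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return generation_speed(json.load(resp))

# Pure-arithmetic check with timing fields copied from a sample response:
sample = {"eval_count": 512, "eval_duration": 7_335_000_000}  # 512 tokens in ~7.3 s
print(round(generation_speed(sample), 1))  # → 69.8
```

Note that this measures generation speed only; prompt processing is reported separately by Ollama (`prompt_eval_count`, `prompt_eval_duration`).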

Quick Start (2 minutes)

1. Install asiai

pip install asiai

Or via Homebrew:

brew tap druide67/tap
brew install asiai

2. Detect your engines

asiai detect

asiai automatically finds running engines (Ollama, LM Studio, llama.cpp, mlx-lm, oMLX, vLLM-MLX, Exo) on your Mac.

3. Run a benchmark

asiai bench

That's it. asiai auto-detects the best model across your engines and runs a cross-engine comparison.

What Gets Measured

  Metric      What It Means
  tok/s       Tokens generated per second (generation only, excludes prompt processing)
  TTFT        Time to First Token — latency before generation starts
  Power       GPU + CPU watts during inference (via IOReport, no sudo needed)
  tok/s/W     Energy efficiency — tokens per second per watt
  VRAM        Memory used by the model (native API or estimated via ri_phys_footprint)
  Stability   Run-to-run variance: stable (<5% CV), variable (5-10%), unstable (>10%)
  Thermal     Whether your Mac throttled during the benchmark
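The stability label comes from the coefficient of variation (CV = stdev / mean) of tok/s across repeated runs. A minimal classifier matching the thresholds above (illustrative only, not asiai's actual code):

```python
from statistics import mean, stdev

def stability(toks_per_sec: list[float]) -> str:
    """Classify run-to-run variance by coefficient of variation (stdev / mean)."""
    if len(toks_per_sec) < 2:
        return "unknown"  # need repeated runs to judge variance
    cv = stdev(toks_per_sec) / mean(toks_per_sec)
    if cv < 0.05:
        return "stable"
    if cv < 0.10:
        return "variable"
    return "unstable"

print(stability([101.8, 102.2, 102.6]))  # → stable (CV ≈ 0.4%)
print(stability([70.0, 85.0, 100.0]))    # → unstable (CV ≈ 18%)
```

This is why `--runs 3` (or more) matters: a single run can't tell you whether a number is repeatable.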

Example Output

Mac16,11 — Apple M4 Pro  RAM: 64.0 GB  Pressure: normal

Benchmark: qwen3-coder-30b

  Engine        tok/s   Tokens Duration     TTFT       VRAM    Thermal
  lmstudio      102.2      537    7.00s    0.29s    24.2 GB    nominal
  ollama         69.8      512   17.33s    0.18s    32.0 GB    nominal

  Winner: lmstudio (+46% tok/s)

  Power Efficiency
    lmstudio     102.2 tok/s @ 12.4W = 8.23 tok/s/W
    ollama        69.8 tok/s @ 15.4W = 4.53 tok/s/W

Example output from a real benchmark on M4 Pro 64GB. Your numbers will vary by hardware and model. See more results →
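The summary lines are simple ratios. tok/s/W divides generation speed by average power draw, which works out to tokens per joule; the winner margin is the relative tok/s difference:

```python
def tok_per_watt(tok_s: float, watts: float) -> float:
    """Energy efficiency: (tokens/s) / (joules/s) = tokens per joule."""
    return tok_s / watts

def speedup_pct(winner_tok_s: float, loser_tok_s: float) -> int:
    """Relative tok/s advantage of the winning engine, in percent."""
    return round((winner_tok_s / loser_tok_s - 1) * 100)

print(round(tok_per_watt(69.8, 15.4), 2))  # → 4.53 (the ollama row above)
print(speedup_pct(102.2, 69.8))            # → 46 (the "+46% tok/s" winner line)
```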

Advanced Options

Compare specific engines

asiai bench --engines ollama,lmstudio,omlx

Multiple prompts and runs

asiai bench --prompts code,reasoning,tool_call --runs 3

Large context benchmark

asiai bench --context-size 64K

Generate a shareable card

asiai bench --card --share

Creates a benchmark card image and shares results with the community leaderboard.

Apple Silicon Tips

Memory matters

On a 16GB Mac, stick to models that load in under 14 GB. MoE models (Qwen3.5-35B-A3B, 3B active parameters per token) are attractive because they generate at small-model speed — but note that memory usage scales with total parameters, not active ones, so all experts must still fit once quantized.
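A rough rule of thumb for whether a model will fit: weights take (parameters × bits per weight / 8) bytes, plus headroom for KV cache, activations, and runtime buffers. The ~20% overhead factor below is an assumption, not a measured constant:

```python
def approx_loaded_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Back-of-envelope loaded size: quantized weights plus ~20% runtime
    headroom (assumed) for KV cache, activations, and buffers.
    For MoE models, count TOTAL parameters, not active ones."""
    weight_gb = params_billion * bits_per_weight / 8  # billions of params -> GB
    return round(weight_gb * overhead, 1)

print(approx_loaded_gb(30, 4))  # 30B total at 4-bit → 18.0 GB (too big for 16GB RAM)
print(approx_loaded_gb(7, 8))   # 7B at 8-bit       → 8.4 GB (comfortable)
```

Long contexts push real usage above this estimate, since the KV cache grows with context length.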

Engine choice matters more than you think

MLX engines are significantly faster than llama.cpp on Apple Silicon for most models. See our Ollama vs LM Studio comparison for real numbers.

Thermal throttling

A MacBook Air (fanless) typically throttles after 5-10 minutes of sustained inference. Mac mini, Studio, and Mac Pro handle sustained workloads without throttling. asiai detects and reports thermal throttling automatically.
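You can also spot throttling in your own numbers: if tok/s fades across sustained runs, heat is the usual suspect. A simple heuristic (illustrative only — not asiai's detection logic, which reads thermal state from the system):

```python
def looks_throttled(toks_per_sec: list[float], drop_threshold: float = 0.10) -> bool:
    """Heuristic: any later run losing more than 10% tok/s versus the first
    run suggests thermal throttling. Threshold is an assumption."""
    if len(toks_per_sec) < 2:
        return False
    first = toks_per_sec[0]
    return min(toks_per_sec[1:]) < first * (1 - drop_threshold)

print(looks_throttled([102.0, 101.5, 88.0]))  # fading under sustained load → True
print(looks_throttled([70.1, 69.8, 70.0]))    # steady across runs → False
```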

Compare with the Community

See how your Mac stacks up against other Apple Silicon machines:

asiai compare

Or visit the online leaderboard.

Further Reading