Ollama vs LM Studio: Apple Silicon Benchmark

Which inference engine is faster on your Mac? We benchmarked Ollama (llama.cpp backend) and LM Studio (MLX backend) head-to-head on the same model and hardware.

Test Setup

Hardware: Mac mini M4 Pro, 64 GB unified memory
Model: Qwen3-Coder-30B (MoE architecture, Q4_K_M / MLX 4-bit)
asiai version: 1.4.0
Methodology: 1 warmup + 1 measured run per engine, temperature=0, model unloaded between engines (full methodology)
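The warmup-then-measure pattern described above can be sketched as follows. This is only an illustration of the idea, not asiai's actual implementation: `measure`, `fake_engine`, and the choice to compute throughput over the decode phase only (tokens after the first) are assumptions made for this example.

```python
import time
from typing import Callable, Iterable


def measure(generate: Callable[[str], Iterable[str]], prompt: str, warmups: int = 1) -> dict:
    """Run `warmups` discarded passes, then time one measured pass."""
    for _ in range(warmups):
        for _token in generate(prompt):
            pass  # warmup output is thrown away

    start = time.perf_counter()
    first = None  # timestamp of the first streamed token
    count = 0
    for _token in generate(prompt):
        now = time.perf_counter()
        if first is None:
            first = now
        count += 1
    end = time.perf_counter()

    ttft = (first - start) if first is not None else float("nan")
    # Assumption: throughput covers the decode phase only (tokens after the first).
    decode_time = (end - first) if first is not None else 0.0
    if count > 1 and decode_time > 0:
        tok_per_s = (count - 1) / decode_time
    else:
        tok_per_s = float("nan")
    return {"tokens": count, "ttft_s": ttft, "tok_per_s": tok_per_s}


def fake_engine(prompt: str):
    """Hypothetical stand-in; a real harness would stream from the engine's API."""
    time.sleep(0.01)  # simulated prompt processing
    for word in ["def", "main", "(", ")", ":"]:
        time.sleep(0.001)  # simulated per-token decode
        yield word


result = measure(fake_engine, "write a function")
print(result)
```

A single measured run is sensitive to noise, so averaging several measured passes would give more stable figures.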

Results

| Metric | LM Studio (MLX) | Ollama (llama.cpp) | Difference (LM Studio vs Ollama) |
|---|---|---|---|
| Throughput | 102.2 tok/s | 69.8 tok/s | +46% |
| TTFT | 291 ms | 175 ms | Ollama faster |
| GPU Power | 12.4 W | 15.4 W | -20% |
| Efficiency | 8.2 tok/s/W | 4.5 tok/s/W | +82% |
| Process Memory | 21.4 GB (RSS) | 41.6 GB (RSS) | -49% |

About memory numbers

Ollama pre-allocates KV cache for the full context window (262K tokens), which inflates its memory footprint. LM Studio allocates KV cache on demand. The process RSS reflects total memory used by the engine process, not just model weights.
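To get a feel for why pre-allocating a full context window is expensive, here is a rough estimate of KV cache size. The formula is the standard one for grouped-query attention, but the architecture values (`n_layers=48`, `n_kv_heads=4`, `head_dim=128`), the 16-bit cache precision, and reading "262K" as 262,144 tokens are all assumptions for illustration, not figures from this benchmark. Check the model's own config before relying on them.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys + values, for every layer and KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Hypothetical architecture values, assumed for this sketch only.
ARCH = dict(n_layers=48, n_kv_heads=4, head_dim=128)

full_window = kv_cache_bytes(262_144, **ARCH)  # cache sized for the whole window
short_chat = kv_cache_bytes(8_192, **ARCH)     # cache sized for an 8K conversation

print(f"full window: {full_window / 1e9:.1f} GB")
print(f"8K context:  {short_chat / 1e9:.1f} GB")
```

Under these assumptions, the cache for a full window is tens of gigabytes while a short conversation needs under one, which is the kind of difference that up-front versus on-demand allocation would produce in RSS.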

Key Findings

LM Studio wins on throughput (+46%)

MLX's native Metal optimization extracts more bandwidth from Apple Silicon's unified memory, and the advantage is pronounced on MoE architectures: 102.2 vs 69.8 tok/s on Qwen3-Coder-30B. On the larger Qwen3.5-35B-A3B model, we measured an even wider gap: 71.2 vs 30.3 tok/s (2.3x).

Ollama wins on TTFT

Ollama's llama.cpp backend processes the initial prompt faster (175 ms vs 291 ms). For interactive use with short prompts and short replies, this makes Ollama feel snappier. For longer generation tasks, LM Studio's throughput advantage dominates total time.
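The trade-off between TTFT and throughput can be made concrete with a simple model: total time = TTFT + tokens / throughput. The sketch below uses the measured figures from the results table; the model itself is an assumption (it treats throughput as constant and as excluding TTFT), and the function names are made up for this example.

```python
def total_time_s(n_tokens: float, ttft_ms: float, tok_per_s: float) -> float:
    """Simplified response time: time to first token plus steady-state decoding."""
    return ttft_ms / 1000 + n_tokens / tok_per_s


def break_even_tokens(ttft_a_ms: float, tps_a: float,
                      ttft_b_ms: float, tps_b: float) -> float:
    """Output length at which engine A (slower start, faster decode) catches engine B.

    Solves ttft_a + n / tps_a == ttft_b + n / tps_b for n.
    """
    return ((ttft_a_ms - ttft_b_ms) / 1000) / (1 / tps_b - 1 / tps_a)


# Measured values from the results table.
LMSTUDIO = dict(ttft_ms=291, tok_per_s=102.2)
OLLAMA = dict(ttft_ms=175, tok_per_s=69.8)

n = break_even_tokens(LMSTUDIO["ttft_ms"], LMSTUDIO["tok_per_s"],
                      OLLAMA["ttft_ms"], OLLAMA["tok_per_s"])
print(f"break-even at about {n:.0f} output tokens")

for length in (10, 500):
    lm = total_time_s(length, **LMSTUDIO)
    ol = total_time_s(length, **OLLAMA)
    print(f"{length} tokens: LM Studio {lm:.2f} s, Ollama {ol:.2f} s")
```

Under this model the crossover falls at a few dozen output tokens, so Ollama's lead is limited to very short replies.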

LM Studio is more power-efficient (+82%)

At 8.2 tok/s per watt vs 4.5, LM Studio generates nearly twice as many tokens per joule of GPU energy (tok/s per watt is the same unit as tokens per joule). This matters for laptops on battery and for sustained workloads on always-on servers.
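The efficiency figures follow directly from throughput and GPU power, as this short calculation shows. The helper names are illustrative, and the energy estimate covers GPU power only (the only power figure reported here), not whole-system draw.

```python
def tokens_per_joule(tok_per_s: float, gpu_watts: float) -> float:
    """(tokens / second) / (joules / second) = tokens / joule."""
    return tok_per_s / gpu_watts


def gpu_watt_hours(n_tokens: float, tok_per_s: float, gpu_watts: float) -> float:
    """GPU energy needed to generate n_tokens, in watt-hours."""
    seconds = n_tokens / tok_per_s
    return gpu_watts * seconds / 3600


# Measured values from the results table.
lmstudio_eff = tokens_per_joule(102.2, 12.4)
ollama_eff = tokens_per_joule(69.8, 15.4)
print(f"LM Studio: {lmstudio_eff:.1f} tok/J, Ollama: {ollama_eff:.1f} tok/J")
print(f"ratio: {lmstudio_eff / ollama_eff:.2f}x")

# GPU energy for one million generated tokens on each engine.
print(f"LM Studio: {gpu_watt_hours(1_000_000, 102.2, 12.4):.1f} Wh")
print(f"Ollama:    {gpu_watt_hours(1_000_000, 69.8, 15.4):.1f} Wh")
```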

Memory usage: context matters

The large gap in process memory (21.4 vs 41.6 GB) is partly due to Ollama pre-allocating KV cache for its maximum context window. For a fair comparison, consider the actual context used during your workload, not the peak RSS.

When to Use Each

| Use Case | Recommended | Why |
|---|---|---|
| Maximum throughput | LM Studio (MLX) | +46% faster generation |
| Interactive chat (low latency) | Ollama | Lower TTFT (175 vs 291 ms) |
| Battery life / efficiency | LM Studio | 82% more tok/s per watt |
| Docker / API compatibility | Ollama | Broader ecosystem, OpenAI-compat API |
| Memory-constrained (16 GB Mac) | LM Studio | Lower RSS, on-demand KV cache |
| Multi-model serving | Ollama | Built-in model management, keep_alive |

Other Models

The throughput gap varies by model:

| Model | LM Studio (MLX) | Ollama (llama.cpp) | Gap |
|---|---|---|---|
| Qwen3-Coder-30B (MoE) | 102.2 tok/s | 69.8 tok/s | +46% |
| Qwen3.5-35B-A3B (MoE) | 71.2 tok/s | 30.3 tok/s | +135% |

MoE models show the largest differences because MLX handles sparse expert routing more efficiently on Metal.
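For reference, the gap figures in the table above are relative throughput differences with Ollama as the baseline. A minimal sketch of that arithmetic, using a helper name chosen for this example:

```python
def gap_percent(lmstudio_tps: float, ollama_tps: float) -> float:
    """Relative throughput advantage of LM Studio over Ollama, in percent."""
    return (lmstudio_tps / ollama_tps - 1) * 100


# (LM Studio tok/s, Ollama tok/s) as reported in the table.
results = {
    "Qwen3-Coder-30B": (102.2, 69.8),
    "Qwen3.5-35B-A3B": (71.2, 30.3),
}

for model, (lm, ol) in results.items():
    print(f"{model}: {gap_percent(lm, ol):+.0f}% ({lm / ol:.2f}x)")
```

The same helper can be applied to your own measurements to compare them with these figures.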

Run Your Own Benchmark

    pip install asiai
    asiai bench --engines ollama,lmstudio --prompts code --runs 3 --card

asiai compares engines side by side with the same model, same prompts, and same hardware. Models are automatically unloaded between engines to prevent memory contention.

View the full methodology · See the community leaderboard · How to benchmark LLMs on Mac