
TurboQuant Benchmark on Apple Silicon

TurboQuant (Google Research, ICLR 2026) compresses the KV cache of LLMs by 5x with no measurable quality loss, enabling 70B models to run on a Mac Mini with 64 GB of RAM. All numbers below are real benchmarks measured with asiai on actual hardware.

Results

Llama-3.1-70B-Instruct Q4_K_M on Mac Mini M4 Pro 64GB

Metric        Value
Throughput    6.3 tok/s (stable; 95% CI: 6.3-6.3)
TTFT          196 ms (median)
GPU Power     23.8 W
Model VRAM    44.1 GB (40 GB weights + 4 GB KV, turbo3)
Context       32,768 tokens
GPU Offload   81/81 layers on Metal
Thermal       Nominal (no throttling)
Stability     Stable (std dev 0.04 tok/s across 3 runs)

KV cache configuration: keys at q8_0 (high precision), values at turbo3 (3-bit, 5x compression).

Before vs After TurboQuant

                      Without TurboQuant      With TurboQuant (turbo3)
KV cache (32K ctx)    ~20 GB (q8_0)           ~4 GB (turbo3)
Total RAM needed      60+ GB (OOM on 64 GB)   44 GB (fits in 64 GB)
Can run 70B on 64GB?  No                      Yes
Quality               Baseline                <1% PPL increase (negligible)
NIAH retrieval        100%                    100%

What Is TurboQuant?

TurboQuant is a KV cache compression algorithm from Google Research, presented at ICLR 2026. During LLM inference, the KV cache stores intermediate attention states and grows linearly with context length. For a 70B model at 128K context in FP16, this cache alone can consume 20-40 GB of RAM.
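
To see where the 20-40 GB figure comes from, here is a back-of-the-envelope calculator. The architecture constants (80 transformer layers, 8 KV heads under grouped-query attention, head dimension 128) are assumptions taken from Llama-3.1-70B's published configuration, not something the benchmark above reports:

```python
def kv_cache_gib(ctx_tokens, layers=80, kv_heads=8, head_dim=128, bits=16):
    """Rough KV cache size in GiB: one K and one V element per layer,
    per KV head, per head dimension, per token.

    Defaults assume Llama-3.1-70B (80 layers, 8 KV heads under GQA,
    head_dim 128); `bits` is the storage width per element.
    """
    elements = 2 * layers * kv_heads * head_dim * ctx_tokens  # K + V
    return elements * bits / 8 / 2**30

# FP16 at 128K context: the cache alone is 40 GiB
print(kv_cache_gib(131072, bits=16))   # 40.0

# 3-bit (turbo3) at 32K context: under 2 GiB
print(kv_cache_gib(32768, bits=3))     # 1.875
```

With keys kept at q8_0 (roughly 8.5 effective bits per element once quantization scales are counted) and only values at turbo3, the same formula lands near the ~4 GB the benchmark above reports for 32K context.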

TurboQuant compresses this cache to 3 bits per value using:

  • Random rotation (Walsh-Hadamard transform) to Gaussianize the data
  • Optimal scalar quantization (PolarQuant) near the Shannon limit
  • QJL (Quantized Johnson-Lindenstrauss) to preserve dot products

The result: 5x memory reduction, no fine-tuning needed, and near-zero quality loss.
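
The rotate-then-quantize idea is easy to sketch end to end. This is a toy illustration, not the paper's implementation: a random sign flip followed by a fast Walsh-Hadamard transform spreads a vector's energy evenly across coordinates (approximately Gaussianizing them), and a plain 8-level uniform scalar quantizer stands in for PolarQuant:

```python
import math
import random

def walsh_hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform (len(x) must be a power of 2)."""
    y = list(x)
    n, h = len(y), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    inv = 1.0 / math.sqrt(n)
    return [v * inv for v in y]

def rotate(x, signs):
    # Random sign flip + WHT: a cheap, structured "random rotation"
    return walsh_hadamard([v * s for v, s in zip(x, signs)])

def quant3(v):
    """3-bit (8-level) uniform scalar quantizer with one scale per vector."""
    scale = max(abs(t) for t in v) / 4.0 + 1e-12
    codes = [min(3, max(-4, math.floor(t / scale))) for t in v]
    return codes, scale

def dequant3(codes, scale):
    return [(c + 0.5) * scale for c in codes]  # cell-midpoint reconstruction

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(1024)]
signs = [rng.choice((-1.0, 1.0)) for _ in range(1024)]

r = rotate(x, signs)                 # norm-preserving rotation
codes, scale = quant3(r)             # 3 bits per element + one float scale
err = math.dist(dequant3(codes, scale), r) / math.sqrt(sum(t * t for t in r))
# err is the relative L2 reconstruction error; a crude uniform quantizer
# already lands around 0.2 at 3 bits on Gaussianized data
```

The real scheme replaces the naive quantizer with near-Shannon-limit scalar quantization and adds the QJL machinery so that quantized dot products, not just reconstructions, stay accurate.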

Setup Guide

Hardware

  • Mac Mini M4 Pro, 64 GB unified memory ($2,700)
  • Any Apple Silicon Mac with 32+ GB should work (adjust model size accordingly)

Install TurboQuant llama.cpp

# Install build tools
brew install cmake

# Clone the TurboQuant fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache

# Build with Metal (Apple Silicon GPU)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)

Download a Model

# Llama 3.1 70B Q4_K_M (~40 GB)
curl -L -o llama-3.1-70b-q4_k_m.gguf \
  "https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf"

Raise macOS GPU Memory Limit

sudo sysctl iogpu.wired_limit_mb=61440
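
61440 MB is 60 GiB: the machine's 64 GB minus ~4 GB of headroom left for macOS. The helper below just makes that arithmetic explicit; the 4 GB headroom is a rule of thumb on our part, not an Apple recommendation, and note that this sysctl does not persist across reboots.

```python
def wired_limit_mb(total_ram_gb, os_headroom_gb=4):
    """MB value for iogpu.wired_limit_mb: total RAM minus headroom for macOS."""
    return (total_ram_gb - os_headroom_gb) * 1024

print(wired_limit_mb(64))  # 61440
```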

Launch the Server

./build/bin/llama-server \
  -m llama-3.1-70b-q4_k_m.gguf \
  --cache-type-k q8_0 --cache-type-v turbo3 \
  -c 32768 \
  --port 8081 \
  --host 0.0.0.0 \
  -fa 1 \
  -ngl 99 \
  -t 10 \
  --no-mmap \
  --chat-template chatml
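
Once the server is up, you can talk to it from the standard library alone, assuming the fork keeps upstream llama-server's OpenAI-compatible /v1/chat/completions endpoint (the port matches the launch command above; the response parsing assumes the usual chat-completions shape):

```python
import json
import urllib.request

def build_payload(prompt, max_tokens=256):
    """OpenAI-style chat-completions request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, url="http://localhost:8081/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat("Summarize TurboQuant in one sentence.")  # needs the server running
```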

Configuration Explained

Parameter                Setting             Why
--cache-type-k q8_0      Keys at 8-bit       Keys are sensitive to compression
--cache-type-v turbo3    Values at 3-bit     Values tolerate extreme compression (5x)
-fa 1                    Flash Attention     Required for TurboQuant
-ngl 99                  Full GPU offload    All 81 layers on Metal
-t 10                    10 threads          M4 Pro has 10 performance cores
--no-mmap                No memory mapping   Loads everything at boot, avoids page faults
--chat-template chatml   ChatML format       Best compatibility with this fork

Benchmark with asiai

pip install asiai
asiai detect --url http://localhost:8081
asiai bench --engines llamacpp --prompts code --runs 3 --kv-cache turbo3 --card

Models That Fit on 64GB with TurboQuant

Model                  Weights (Q4_K_M)   KV Cache (turbo3)   Total   Status
Llama 3.1 70B          40 GB              ~4 GB (32K ctx)     44 GB   Tested: 6.3 tok/s
Qwen2.5 72B            40 GB              ~4 GB (32K ctx)     44 GB   Should work
Llama 70B, 128K ctx    40 GB              ~16 GB              56 GB   Tight but possible
Command-R+ 104B        58 GB              ~4 GB (32K ctx)     62 GB   Very tight
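
A quick sanity check on the table: with the 60 GB wired limit set earlier, headroom is just the limit minus weights minus KV cache. The numbers below are copied from the table; the Command-R+ row comes out slightly negative, which is why it is only "very tight" rather than comfortable (it needs the wired limit raised or some spill to CPU memory):

```python
def headroom_gb(weights_gb, kv_gb, wired_limit_gb=60):
    """GPU memory left over after weights + KV cache, under the wired limit."""
    return wired_limit_gb - weights_gb - kv_gb

# Rows from the table above (weights, KV cache at turbo3)
print(headroom_gb(40, 4))    # 16 -> Llama 3.1 70B / Qwen2.5 72B: comfortable
print(headroom_gb(40, 16))   # 4  -> Llama 70B at 128K context: tight
print(headroom_gb(58, 4))    # -2 -> Command-R+ 104B: over the 60 GB limit
```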

FAQ

Can I run a 70B model on a Mac with 64GB RAM?

Yes, with TurboQuant. The KV cache is compressed 5x, so Llama 70B Q4_K_M (40GB weights) fits comfortably in 64GB with 32K context. We measured 6.3 tok/s on a Mac Mini M4 Pro.

Does TurboQuant reduce quality?

No measurable quality loss. The perplexity increase is under 1% vs q8_0, and Needle-in-a-Haystack retrieval scores 100% through 32K context.

Which TurboQuant format should I use?

Use the asymmetric combination: q8_0 for keys plus turbo3 for values. Keys are sensitive to compression (in our tests, essentially all quality degradation comes from compressing K), while values can be compressed to 2-3 bits with no measurable effect on attention quality.

Does TurboQuant work with MLX?

Community implementations exist (turboquant-mlx) but are less mature than the llama.cpp fork. For production use, we recommend TheTom/llama-cpp-turboquant.

How does this compare to standard llama.cpp?

Decode speed is ~0.9x of q8_0 (slightly slower per token), but the real gain is fitting models and contexts that simply didn't fit before. Prefill can actually be faster at long context due to reduced memory bandwidth pressure.

References