TurboQuant Benchmark on Apple Silicon
TurboQuant (Google Research, ICLR 2026) compresses the KV cache of LLMs by 5x with no quality loss, enabling 70B models to run on a Mac Mini with 64GB RAM. These are real benchmarks measured with asiai on actual hardware.
Results
Llama-3.1-70B-Instruct Q4_K_M on Mac Mini M4 Pro 64GB
| Metric | Value |
|---|---|
| Throughput | 6.3 tok/s (stable, CI 95%: 6.3-6.3) |
| TTFT | 196 ms (median) |
| GPU Power | 23.8 W |
| Model VRAM | 44.1 GB (40 GB weights + 4 GB KV turbo3) |
| Context | 32,768 tokens |
| GPU Offload | 81/81 layers on Metal |
| Thermal | Nominal (no throttling) |
| Stability | Stable (std dev 0.04 tok/s across 3 runs) |
KV cache configuration: keys at q8_0 (high precision), values at turbo3 (3-bit, 5x compression).
Before vs After TurboQuant
| Without TurboQuant | With TurboQuant (turbo3) | |
|---|---|---|
| KV cache (32K ctx) | ~20 GB (q8_0) | ~4 GB (turbo3) |
| Total RAM needed | 60+ GB (OOM on 64GB) | 44 GB (fits in 64GB) |
| Can run 70B on 64GB? | No | Yes |
| Quality | Baseline | -1% PPL (negligible) |
| NIAH retrieval | 100% | 100% |
What Is TurboQuant?
TurboQuant is a KV cache compression algorithm from Google Research, presented at ICLR 2026. During LLM inference, the KV cache stores intermediate attention states and grows linearly with context length. For a 70B model at 128K context in FP16, this cache alone can consume 20-40 GB of RAM.
TurboQuant compresses this cache to 3 bits per value using:
- Random rotation (Walsh-Hadamard transform) to Gaussianize the data
- Optimal scalar quantization (PolarQuant) near the Shannon limit
- QJL (Quantized Johnson-Lindenstrauss) to preserve dot products
The result: 5x memory reduction, no fine-tuning needed, and near-zero quality loss.
Setup Guide
Hardware
- Mac Mini M4 Pro, 64 GB unified memory ($2,700)
- Any Apple Silicon Mac with 32+ GB should work (adjust model size accordingly)
Install TurboQuant llama.cpp
# Install build tools
brew install cmake
# Clone the TurboQuant fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout feature/turboquant-kv-cache
# Build with Metal (Apple Silicon GPU)
cmake -B build -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu)
Download a Model
# Llama 3.1 70B Q4_K_M (~40 GB)
curl -L -o llama-3.1-70b-q4_k_m.gguf \
"https://huggingface.co/bartowski/Meta-Llama-3.1-70B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf"
Raise macOS GPU Memory Limit
sudo sysctl iogpu.wired_limit_mb=61440
Launch the Server
./build/bin/llama-server \
-m llama-3.1-70b-q4_k_m.gguf \
--cache-type-k q8_0 --cache-type-v turbo3 \
-c 32768 \
--port 8081 \
--host 0.0.0.0 \
-fa 1 \
-ngl 99 \
-t 10 \
--no-mmap \
--chat-template chatml
Configuration Explained
| Parameter | Value | Why |
|---|---|---|
--cache-type-k q8_0 |
Keys at 8-bit | Keys are sensitive to compression |
--cache-type-v turbo3 |
Values at 3-bit | Values tolerate extreme compression (5x) |
-fa 1 |
Flash Attention | Required for TurboQuant |
-ngl 99 |
Full GPU offload | All 81 layers on Metal |
-t 10 |
10 threads | M4 Pro has 10 performance cores |
--no-mmap |
No memory mapping | Loads everything at boot, avoids page faults |
--chat-template chatml |
ChatML format | Best compatibility with this fork |
Benchmark with asiai
pip install asiai
asiai detect --url http://localhost:8081
asiai bench --engines llamacpp --prompts code --runs 3 --kv-cache turbo3 --card
Models That Fit on 64GB with TurboQuant
| Model | Weights (Q4_K_M) | KV Cache (32K, turbo3) | Total | Status |
|---|---|---|---|---|
| Llama 3.1 70B | 40 GB | ~4 GB | 44 GB | Tested: 6.3 tok/s |
| Qwen2.5 72B | 40 GB | ~4 GB | 44 GB | Should work |
| Llama 70B 128K ctx | 40 GB | ~16 GB (turbo3) | 56 GB | Tight but possible |
| Command-R+ 104B | 58 GB | ~4 GB | 62 GB | Very tight |
FAQ
Can I run a 70B model on a Mac with 64GB RAM?
Yes, with TurboQuant. The KV cache is compressed 5x, so Llama 70B Q4_K_M (40GB weights) fits comfortably in 64GB with 32K context. We measured 6.3 tok/s on a Mac Mini M4 Pro.
Does TurboQuant reduce quality?
No measurable quality loss. The perplexity increase is under 1% vs q8_0, and Needle-in-a-Haystack retrieval scores 100% through 32K context.
Which TurboQuant format should I use?
Asymmetric: q8_0 for keys + turbo3 for values. Keys are sensitive to compression (all quality degradation comes from K compression). Values can be compressed to 2-3 bits with zero effect on attention quality.
Does TurboQuant work with MLX?
Community implementations exist (turboquant-mlx) but are less mature than the llama.cpp fork. For production use, we recommend TheTom/llama-cpp-turboquant.
How does this compare to standard llama.cpp?
Decode speed is ~0.9x of q8_0 (slightly slower per token), but the real gain is fitting models and contexts that simply didn't fit before. Prefill can actually be faster at long context due to reduced memory bandwidth pressure.
References
- Google Research Blog — TurboQuant
- TurboQuant Paper (ICLR 2026)
- TheTom/turboquant_plus — Extended implementation with Sparse V
- TheTom/llama-cpp-turboquant — llama.cpp fork with Metal kernels
- llama.cpp Discussion #20969 — Community thread