Skip to content

vllm-mlx

vLLM-MLX brings the vLLM serving framework to Apple Silicon via MLX, offering continuous batching and an OpenAI-compatible API on port 8000. It can achieve 400+ tok/s on optimized models, making it one of the fastest options for concurrent inference on Mac.

vllm-mlx brings continuous batching to Apple Silicon via MLX.

Setup

pip install vllm-mlx
vllm serve mlx-community/gemma-2-9b-it-4bit

Details

Property	Value
Default port	8000
API type	OpenAI-compatible
VRAM reporting	No
Model format	MLX (safetensors)
Detection	`/version` endpoint or `lsof` process detection

Notes

vllm-mlx supports continuous batching, making it suitable for concurrent request handling.
Can achieve 400+ tok/s on Apple Silicon with optimized models.
Uses the standard vLLM OpenAI-compatible API.

See also

Compare engines with asiai bench --engines vllm-mlx --- learn how