# mlx-lm
mlx-lm is the reference inference server for MLX, Apple's machine-learning framework for Apple Silicon. It runs models natively on the Metal GPU and serves an OpenAI-compatible API on port 8080 by default. Because Apple Silicon's unified memory is shared between CPU and GPU, model weights can be loaded without redundant host-to-GPU copies, which makes mlx-lm efficient even for large MoE (Mixture of Experts) models.
## Setup

```sh
pip install mlx-lm
mlx_lm.server --model mlx-community/gemma-2-9b-it-4bit
```
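Because the server speaks the OpenAI chat-completions protocol, it can be queried with plain HTTP. A minimal sketch using only the Python standard library (the model name and prompt are placeholders, and the request is only sent if a server is actually running on port 8080):

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str,
                       base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build an OpenAI-style /v1/chat/completions request for a local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("mlx-community/gemma-2-9b-it-4bit", "Hello!")
# Uncomment once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

The same request shape works against any OpenAI-compatible engine, which is also why port-sharing engines need extra probing to tell apart (see Notes).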
## Details
| Property | Value |
|---|---|
| Default port | `8080` |
| API type | OpenAI-compatible |
| VRAM reporting | No |
| Model format | MLX (safetensors) |
| Detection | `/version` endpoint or `lsof` process detection |
## Notes
- mlx-lm shares port 8080 with llama.cpp. asiai uses API probing and process detection to distinguish between them.
- Models use the HuggingFace/MLX community format (e.g., `mlx-community/gemma-2-9b-it-4bit`).
- Native MLX execution typically provides excellent performance on Apple Silicon.
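The port-sharing caveat above can be made concrete. asiai's exact probing logic isn't shown here; the sketch below assumes a hypothetical classifier that combines the `/version` response body (per the Details table) with the name of the process owning the port (e.g., from `lsof -i :8080`) — all matched strings are illustrative:

```python
from typing import Optional


def identify_engine(version_body: Optional[str],
                    process_name: Optional[str]) -> str:
    """Heuristically distinguish engines that share port 8080.

    version_body: raw text returned by GET /version, or None if the
        endpoint did not answer.
    process_name: name of the process bound to the port, or None if
        it could not be determined.
    """
    # Process name is the strongest signal when available.
    if process_name:
        name = process_name.lower()
        if "mlx_lm" in name or "mlx-lm" in name:
            return "mlx-lm"
        if "llama" in name:
            return "llama.cpp"
    # Fall back to fingerprinting the /version response body.
    if version_body and "llama" in version_body.lower():
        return "llama.cpp"
    return "unknown"
```

Preferring the process name over the HTTP probe avoids misclassifying an engine that happens to proxy or mimic another's endpoints.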
## See also

Compare engines with `asiai bench --engines mlxlm`.