mlx-lm

mlx-lm is the MLX team's LLM inference package for Apple Silicon; its bundled server (mlx_lm.server) exposes an OpenAI-compatible API, listening on port 8080 by default. Models run natively on the GPU via Metal, and MLX's unified memory lets the CPU and GPU share weights without copying, which makes it particularly efficient for large MoE (Mixture of Experts) models.

Setup

pip install mlx-lm
mlx_lm.server --model mlx-community/gemma-2-9b-it-4bit
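Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the Python standard library (the endpoint path and payload follow the standard OpenAI chat-completions convention; the helper names are illustrative):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # mlx_lm.server's default port


def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(model: str, prompt: str) -> str:
    """POST the payload to the server and return the first choice's text."""
    payload = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("mlx-community/gemma-2-9b-it-4bit", "Hello") returns the model's reply as a string.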

Details

Property          Value
Default port      8080
API type          OpenAI-compatible
VRAM reporting    No
Model format      MLX (safetensors)
Detection         /version endpoint or lsof process detection

Notes

  • mlx-lm's default port (8080) is also llama.cpp's default, so asiai uses API probing and process detection to distinguish between them.
  • Models use the HuggingFace/MLX community format (e.g., mlx-community/gemma-2-9b-it-4bit).
  • Native MLX execution typically provides excellent performance on Apple Silicon.

See also

Compare engines with asiai bench --engines mlxlm