Skip to content

oMLX

oMLX is a native macOS inference server that uses paged SSD KV caching to handle larger context windows than memory alone would allow, with continuous batching for concurrent requests on port 8000. It supports both OpenAI and Anthropic-compatible APIs on Apple Silicon.

oMLX is a native macOS LLM inference server with paged SSD KV caching and continuous batching, managed from the menu bar. Built on MLX for Apple Silicon.

Setup

brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

Or download the .dmg from GitHub releases.

Details

Property	Value
Default port	8000
API type	OpenAI-compatible + Anthropic-compatible
VRAM reporting	No
Model format	MLX (safetensors)
Detection	`/admin/info` JSON endpoint or `/admin` HTML page
Requirements	macOS 15+, Apple Silicon (M1+), 16 GB RAM min

Notes

oMLX shares port 8000 with vllm-mlx. asiai uses /admin/info probing to distinguish between them.
SSD KV caching enables larger context windows with lower memory pressure.
Continuous batching improves throughput under concurrent requests.
Supports text LLMs, vision-language models, OCR models, embeddings, and rerankers.
The admin dashboard at /admin provides real-time server metrics.
In-app auto-update when installed via .dmg.

See also

Compare engines with asiai bench --engines omlx --- learn how