llama.cpp

llama.cpp is the foundational high-performance C++ inference engine for GGUF models. It offers low-level control over the KV cache, thread count, and context size, and its server listens on port 8080 by default. It powers Ollama's backend but can also be run standalone for fine-grained tuning on Apple Silicon.

Setup

brew install llama.cpp
llama-server -m model.gguf

Details

Property	Value
Default port	8080
API type	OpenAI-compatible
VRAM reporting	No
Model format	GGUF
Detection	`/health` + `/props` endpoints or `lsof` process detection
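
The endpoint-based detection listed in the table can be approximated from a shell. The sketch below is not asiai's actual implementation: `probe_llamacpp` is a hypothetical helper, and it assumes that `curl` is installed and that a successful response from both `/health` and `/props` is enough to treat the listener as llama.cpp.

```shell
#!/bin/sh
# Hypothetical helper (not part of asiai or llama.cpp).
# Returns 0 if both /health and /props answer successfully, 1 otherwise.
probe_llamacpp() {
  host="${1:-127.0.0.1}"
  port="${2:-8080}"
  base="http://${host}:${port}"

  # -s: quiet, -f: treat HTTP errors (>= 400) as failures,
  # --max-time: give up after 2 seconds so a dead port does not hang the probe.
  curl -sf --max-time 2 "${base}/health" >/dev/null || return 1
  curl -sf --max-time 2 "${base}/props" >/dev/null || return 1
  return 0
}

if probe_llamacpp 127.0.0.1 8080; then
  echo "llama.cpp server detected on port 8080"
else
  echo "no llama.cpp server detected on port 8080"
fi
```

When the endpoints do not answer, process-level detection with `lsof`, as the table notes, remains the fallback.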

Notes

llama.cpp and mlx-lm both default to port 8080, so the port number alone does not identify the engine. asiai recognizes llama.cpp by probing its /health and /props endpoints.
The server can be started with custom context sizes and thread counts for tuning.
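
As a sketch of such tuning, the wrapper below passes a context size and a thread count to `llama-server` through its `-c` (`--ctx-size`), `-t` (`--threads`), and `--port` options. The function name `start_llama_server`, the `DRY_RUN` switch, and the default values are hypothetical and chosen only for illustration; suitable values depend on the model and the machine.

```shell
#!/bin/sh
# Hypothetical wrapper around llama-server; defaults are illustrative only.
start_llama_server() {
  model="$1"            # path to a GGUF file
  ctx="${2:-4096}"      # context size in tokens (-c / --ctx-size)
  threads="${3:-4}"     # CPU threads (-t / --threads)

  # Rebuild the positional parameters as the command line to run.
  set -- llama-server -m "$model" -c "$ctx" -t "$threads" --port 8080

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"           # print the command instead of running it
  else
    "$@"                # launch the server in the foreground
  fi
}

# Preview the command for an 8192-token context and 8 threads.
DRY_RUN=1 start_llama_server model.gguf 8192 8
```

Dropping `DRY_RUN=1` from the last line launches the server with those settings instead of printing the command.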

See also

Compare engines with `asiai bench --engines llamacpp`.