
oMLX
LLM inference server with continuous batching and SSD caching for Apple Silicon, managed from the macOS menu bar.
Coldcast Lens
oMLX is the best way to run LLMs on Apple Silicon right now: a menu-bar inference server with continuous batching, SSD caching that cuts time to first token (TTFT) from 30-90 seconds to under 5, and up to a 4.14x generation speedup at 8 concurrent requests. It works with Claude Code, OpenClaw, Cursor, and any OpenAI-compatible client.
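Because the server speaks the OpenAI API, any standard client can point at it. Here is a minimal sketch using the official `openai` Python SDK; the port, path, and model name are illustrative assumptions, not documented oMLX defaults.

```python
# Sketch: talking to a local OpenAI-compatible server.
# The base_url and model id below are assumptions for illustration;
# check your server's settings for the actual endpoint and model.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local endpoint
    api_key="not-needed-locally",         # local servers typically ignore this
)

response = client.chat.completions.create(
    model="mlx-community/Llama-3.1-8B-Instruct-4bit",  # example MLX model id
    messages=[
        {"role": "user", "content": "Summarize continuous batching in one sentence."}
    ],
)
print(response.choices[0].message.content)
```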
The two-tier SSD caching is the breakthrough: hot KV-cache blocks stay in RAM, cold blocks are evicted to SSD under an LRU policy, and the whole cache persists, so previously cached contexts survive restarts and can be reloaded instead of recomputed. It supports text LLMs, vision models, OCR, embeddings, and rerankers. Compared to Ollama (easier to set up, but no SSD caching), oMLX is faster for repeated contexts; compared to llama.cpp (more portable), it is optimized specifically for Apple Silicon.
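To make the tiering concrete, here is a hedged sketch of the general hot/cold pattern: an in-RAM `OrderedDict` as the hot tier, with LRU eviction spilling blocks to an SSD directory that doubles as the persistent cold tier. The class, sizes, and on-disk layout are illustrative, not oMLX's actual implementation.

```python
# Illustrative two-tier cache: hot blocks in RAM, LRU-evicted blocks
# persisted to SSD so they can be reloaded later. This sketches the
# general technique, not oMLX's actual code or on-disk format.
from collections import OrderedDict
from pathlib import Path


class TwoTierKVCache:
    """Hot tier in RAM with LRU eviction; evicted blocks persist on SSD."""

    def __init__(self, ssd_dir: str, max_hot_blocks: int = 256):
        self.hot: OrderedDict[str, bytes] = OrderedDict()  # LRU order: oldest first
        self.max_hot_blocks = max_hot_blocks
        self.ssd_dir = Path(ssd_dir)
        self.ssd_dir.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, block: bytes) -> None:
        # Keys are assumed filesystem-safe in this sketch.
        self.hot[key] = block
        self.hot.move_to_end(key)  # mark as most recently used
        while len(self.hot) > self.max_hot_blocks:
            cold_key, cold_block = self.hot.popitem(last=False)  # evict LRU block
            (self.ssd_dir / cold_key).write_bytes(cold_block)    # persist to SSD

    def get(self, key: str) -> bytes | None:
        if key in self.hot:
            self.hot.move_to_end(key)  # refresh recency
            return self.hot[key]
        cold_path = self.ssd_dir / key
        if cold_path.exists():         # cold hit: promote back into RAM
            block = cold_path.read_bytes()
            self.put(key, block)
            return block
        return None                    # miss: block was never cached
```

The key property is that eviction writes rather than discards, so a "cold" context costs one SSD read instead of a full prefill recompute.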
Use this if you're a Mac developer running local models, whether as coding-agent backends or for general inference. Skip it on any non-Apple hardware.
The catch: it's macOS-only by design, the menu-bar UX is convenient but hides complexity (debugging inference issues means digging into logs), and the MLX ecosystem is still maturing compared to CUDA and llama.cpp.
About
- Stars: 6,906
- Forks: 542