
flash-moe
Running a big model on a small laptop
Coldcast Lens
Flash-MoE runs a 397-billion-parameter model on a MacBook with 48GB of RAM at 4.4+ tokens/second. No Python, no frameworks: pure C, Objective-C, and hand-tuned Metal shaders. It exploits the MoE architecture by streaming only the 4 active experts (of 512) per layer from SSD, on demand.
This is a technical marvel. The custom Metal compute pipeline reads expert weights directly from NVMe via parallel pread() calls coordinated with GCD dispatch groups, and the OS page cache handles caching for free. It runs Qwen3.5-397B-A17B with tool-calling support. Compared to llama.cpp (broader model support, but slower for MoE) and Ollama (easier, but can't handle models this size), Flash-MoE is the only practical way to run a 400B model locally.
Use this when you want frontier-class model quality on your laptop and have a Mac with 48GB+ RAM. Skip this if you need broad model compatibility — it's optimized for one architecture.
The catch: it's macOS-only (Metal is required), it supports only MoE models with a compatible architecture, and 4.4 tok/s means waiting 30+ seconds for a paragraph. Patience required.
About
- Owner: Dan Woods (User)
- Stars: 2,010
- Forks: 172