
llama.cpp
LLM inference in C/C++
Coldcast Lens
llama.cpp is the project that proved you don't need a data center to run an LLM. Pure C/C++ inference for large language models — no Python, no PyTorch, no CUDA requirement. It runs Llama, Mistral, Phi, and dozens of other models on CPUs, Apple Silicon, and consumer GPUs. The engine behind nearly every local AI app.
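The typical workflow is short: build from source with CMake, then point the `llama-cli` binary at a GGUF model file. The commands below are a sketch of that flow — the model path is a placeholder, and you'd substitute any GGUF file you've downloaded (e.g. from Hugging Face):

```shell
# Build llama.cpp from source (CMake is the supported build system)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Run a quantized GGUF model; ./models/model.gguf is a placeholder path
./build/bin/llama-cli -m ./models/model.gguf \
    -p "Explain quantization in one sentence." -n 64
```

GPU backends (Metal, CUDA, Vulkan) are enabled via CMake flags at build time, which is part of why performance tuning is hardware-specific.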
If you want to run AI models locally — for privacy, cost savings, or offline use — llama.cpp is the foundation everything else is built on. Ollama wraps it in a friendly CLI. LM Studio wraps it in a GUI. vLLM is faster for GPU serving but Python-only. ExLlamaV2 squeezes more performance from NVIDIA GPUs.
Best for developers building local AI products or anyone who wants to understand how LLM inference actually works at the metal level.
The catch: it's C/C++, so building from source and debugging isn't for everyone. Model quantization tradeoffs (quality vs. speed vs. memory) require experimentation. Performance tuning is hardware-specific. And the project moves so fast that tutorials from three months ago may already be outdated.
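To make the memory side of the quantization tradeoff concrete, here is a back-of-envelope sketch. The bits-per-weight figures are rough illustrative values, not exact numbers for any specific GGUF quant format, and the estimate covers weights only (KV cache and activations add more on top):

```python
# Rough memory estimate for model weights at different quantization levels.
# Bits-per-weight values below are illustrative approximations, not exact
# figures for llama.cpp's GGUF quant formats.
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """GB needed for the weights alone (ignores KV cache and activations)."""
    return n_params * bits_per_weight / 8 / 1e9

params_7b = 7e9  # a typical "7B" model
for name, bpw in [("FP16", 16.0), ("8-bit", 8.0), ("~4-bit", 4.5)]:
    print(f"{name}: ~{weight_memory_gb(params_7b, bpw):.1f} GB")
```

This is why a 7B model that won't fit in 8 GB of RAM at FP16 runs comfortably once quantized to around 4 bits — at some cost in output quality that you have to evaluate for your use case.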
About
- Stars: 99,301
- Forks: 15,772