The Lens

Whisper turns speech into text, and it set the bar the moment OpenAI released it. Feed it an audio file in any of about 99 languages and you get back a transcript, optionally translated into English. The model weights and the code are MIT-licensed, so you can run the whole thing on your own machine at no cost.

Running it yourself takes a pip install plus ffmpeg, but the catch is hardware. The tiny model fits in about 1 GB of VRAM and is fast but rough; the large model wants roughly 10 GB and a real GPU to run at a sane speed. It works on a CPU, but you will wait. A newer turbo model is much faster for plain transcription, though it gives up translation.
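The sizing trade-off above can be sketched in a few lines of Python. This is a minimal sketch, not an official recipe: pick_model and transcribe_file are hypothetical helper names, the tiny and large VRAM figures come from the paragraph above, and the in-between figures are rough values assumed from the project README. The whisper import is deferred so the sizing logic runs even without the package installed.

```python
# Models ordered from smallest to largest, with approximate VRAM needs in GB.
# tiny and large are the figures quoted in the text; the others are assumed,
# rough values from the project README.
MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("turbo", 6),
    ("large", 10),
]


def pick_model(vram_gb: float, translate: bool = False) -> str:
    """Return the last model in MODELS that fits in vram_gb (hypothetical helper).

    turbo is skipped when translation is requested, since it is meant for
    plain transcription only.
    """
    choice = None
    for name, need_gb in MODELS:
        if translate and name == "turbo":
            continue
        if need_gb <= vram_gb:
            choice = name
    if choice is None:
        raise ValueError("no model fits; consider running tiny on the CPU")
    return choice


def transcribe_file(path: str, vram_gb: float, translate: bool = False) -> str:
    """Transcribe (or translate into English) one audio file (hypothetical helper)."""
    # Deferred import: needs `pip install openai-whisper` and ffmpeg on the PATH.
    import whisper

    model = whisper.load_model(pick_model(vram_gb, translate))
    task = "translate" if translate else "transcribe"
    result = model.transcribe(path, task=task)
    return result["text"]
```

With 8 GB of VRAM this sketch picks turbo for transcription and falls back to medium when translation is requested. It leaves no headroom for other processes on the GPU, so treat the thresholds as a starting point.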

For a one-off transcript, OpenAI's hosted Whisper API runs about $0.006 per minute of audio (a little over half a cent) and saves you the setup. Run it locally when the audio is sensitive, when you are processing a lot of it, or when you just do not want a per-minute bill. For solo users and small teams, running locally on a decent GPU is plenty. At higher volume, budget for a GPU box and self-host.
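To make the hosted-versus-local decision concrete, here is a rough back-of-the-envelope calculation. It assumes a rate of $0.006 per audio minute (check current pricing) and a hypothetical one-time GPU cost, and it ignores electricity and your own setup time. The function names are illustrative only.

```python
# Assumed hosted rate in USD per minute of audio; verify against current pricing.
API_RATE_PER_MIN = 0.006


def api_cost(audio_hours: float, rate_per_min: float = API_RATE_PER_MIN) -> float:
    """Hosted cost in USD for the given number of audio hours."""
    return audio_hours * 60 * rate_per_min


def breakeven_hours(gpu_cost: float, rate_per_min: float = API_RATE_PER_MIN) -> float:
    """Audio hours at which hosted spend equals a one-time GPU purchase."""
    return gpu_cost / (rate_per_min * 60)


if __name__ == "__main__":
    # Hypothetical numbers: 50 hours of audio, and a 720 USD GPU.
    print(f"50 hours via the API: ${api_cost(50):.2f}")
    print(f"Break-even for a $720 GPU: {breakeven_hours(720):.0f} audio hours")
```

Under these assumptions an hour of audio costs about 36 cents hosted, so a GPU only pays for itself on price after a couple of thousand hours. Privacy or convenience is usually the stronger reason to go local at low volume.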

The other limitation is that Whisper is a model, not an app. It does straight transcription, with no speaker labels or live captioning out of the box. If you want a GUI with those niceties, look at Buzz or Vibe, both of which wrap this same model.

