The Lens

daVinci-MagiHuman does video and audio in one model. No separate video generation, no separate voice synthesis, no stitching. One 15-billion-parameter transformer takes text and a reference image and jointly produces video and audio.

The numbers are real: a 5-second 1080p video takes 38 seconds to generate on a single H100. It supports Mandarin, Cantonese, English, Japanese, Korean, German, and French. In human evaluation it beats Ovi 1.1 (80% win rate) and LTX 2.3 (60.9% win rate). The full model stack is released: base model, distilled model, super-resolution model, and inference code.

From Shanghai's GAIR Lab and Sand.ai.

The catch: you need serious hardware. The fast inference numbers assume an H100, and a 15B-parameter model isn't running on a consumer GPU. No license file is listed, so check before commercial use. And 'joint audio-video generation' is still early: the 5-second clip limit means this is for avatars and short-form content, not video production.
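
To put the hardware claim in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (15 billion parameters, 38 seconds for a 5-second clip). The precisions listed are assumptions for illustration, since the text does not say which one the released weights use, and the estimate covers weights only, not activations or the super-resolution model.

```python
# Rough weight-memory estimate for a 15B-parameter model.
# The parameter count comes from the text; the precisions are
# illustrative assumptions, not the project's documented settings.
PARAMS = 15e9

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_gib(params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return params * bytes_per_param / 2**30


for dtype, nbytes in BYTES_PER_PARAM.items():
    print(f"{dtype}: {weight_gib(PARAMS, nbytes):.1f} GiB for weights")

# Compute time per second of output, from the 38 s / 5 s figure above.
seconds_per_output_second = 38 / 5
print(f"{seconds_per_output_second:.1f} s of compute per second of video")
```

At 2 bytes per parameter the weights alone come to roughly 28 GiB, which is already more than the 24 GB on a card such as the RTX 4090. Quantization or offloading could change that picture, but this estimate does not model either.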

