
Attention-Residuals
Research implementation of attention residual connections for transformer models.
Attention Residuals is a drop-in fix for how every transformer stacks its layers. Standard residual connections accumulate all layer outputs with fixed weights, diluting each layer's contribution as depth grows. AttnRes replaces this with softmax attention over preceding layers — giving every layer selective, content-aware access to earlier representations.
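The mechanism can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the paper's exact formulation: the current layer's output forms a query, each preceding layer's representation forms a key, and the softmax-weighted mix of those representations becomes the residual. The names `attn_residual`, `w_q`, `w_k`, and the single-vector (per-token) setup are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(current, history, w_q, w_k):
    """Residual via softmax attention over preceding layers.

    current: (d,)  output of the current layer's sublayer
    history: (L, d) representations from the L preceding layers
    w_q, w_k: (d, d) learned projections (hypothetical names)
    """
    q = current @ w_q                      # query from the current output
    k = history @ w_k                      # one key per preceding layer
    scores = k @ q / np.sqrt(q.shape[-1])  # one score per preceding layer
    weights = softmax(scores)              # content-aware mixing weights
    return current + weights @ history     # residual = attended history
```

Contrast with a standard residual, which would just add the fixed running sum of earlier outputs: here the mixing weights are recomputed per token from the content, so deep layers can reach back to any earlier representation selectively.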
Moonshot AI (the Kimi team) reports meaningful gains: MMLU 73.5 to 74.6, GPQA-Diamond 36.9 to 44.4, HumanEval 59.1 to 62.2. The Block AttnRes variant keeps the overhead under 4% during training and under 2% at inference. Against standard Pre-Norm residuals (the default in nearly every transformer), the reported numbers are uniformly better. No direct open-source competitors exist; this is a research contribution, not a product.
Use this when you're pre-training or fine-tuning transformers and want free performance gains with minimal overhead. Skip this if you're using models, not building them.
The catch: research-stage implementation. No pretrained models with AttnRes are publicly available yet — you'd need to train from scratch or adapt existing architectures. Integration requires modifying your transformer stack.
About
- Owner: Moonshot AI (Organization)
- Stars: 2,714
- Forks: 116
- Trending