
Attention-Residuals
Research implementation of attention residual connections for transformer models.
Attention Residuals is a drop-in fix for how every transformer stacks its layers. Standard residual connections accumulate all layer outputs with fixed weights, diluting each layer's contribution as depth grows. AttnRes replaces this with softmax attention over preceding layers — giving every layer selective, content-aware access to earlier representations.
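The mechanism can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the paper's exact formulation: the current layer's output forms a query, each preceding layer's representation forms a key, and the softmax-weighted mix of those representations becomes the residual. The names `attn_residual`, `w_q`, `w_k`, and the single-vector (per-token) setup are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(current, history, w_q, w_k):
    """Residual via softmax attention over preceding layers.

    current: (d,)  output of the current layer's sublayer
    history: (L, d) representations from the L preceding layers
    w_q, w_k: (d, d) learned projections (hypothetical names)
    """
    q = current @ w_q                      # query from the current output
    k = history @ w_k                      # one key per preceding layer
    scores = k @ q / np.sqrt(q.shape[-1])  # one score per preceding layer
    weights = softmax(scores)              # content-aware mixing weights
    return current + weights @ history     # residual = attended history
```

Contrast with a standard residual, which would just add the fixed running sum of earlier outputs: here the mixing weights are recomputed per token from the content, so deep layers can reach back to any earlier representation selectively.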
Moonshot AI (the Kimi team) reports meaningful gains: MMLU 73.5 to 74.6, GPQA-Diamond 36.9 to 44.4, HumanEval 59.1 to 62.2. The Block AttnRes variant keeps the overhead under 4% during training and under 2% at inference. Against standard Pre-Norm residuals (the default in nearly every transformer), the reported numbers are uniformly better. No direct open-source competitors exist; this is a research contribution, not a product.
Use this when you're pre-training or fine-tuning transformers and want free performance gains with minimal overhead. Skip this if you're using models, not building them.
The catch: research-stage implementation. No pretrained models with AttnRes are publicly available yet — you'd need to train from scratch or adapt existing architectures. Integration requires modifying your transformer stack.
About
- Owner: Moonshot AI (Organization)
- Stars: 2,714
- Forks: 116
- Trending