7 open source tools compared. Sorted by stars — scroll down for our analysis.
| Tool | Stars | Velocity | Language | License | Score |
|---|---|---|---|---|---|
| cs249r_book Machine Learning Systems | 22.9k | +121/wk | JavaScript | — | 77 |
| AutoResearchClaw Fully autonomous & self-evolving research from idea to paper. Chat an Idea. Get a Paper. 🦞 | 8.7k | +2700/wk | Python | MIT License | 80 |
| Auto-claude-code-research-in-sleep ARIS ⚔️ (Auto-Research-In-Sleep) — Lightweight Markdown-only skills for autonomous ML research: cross-model review loops, idea discovery, and experiment automation. No framework, no lock-in — works with Claude Code, Codex, OpenClaw, or any LLM agent. | 4.0k | +1752/wk | Python | MIT License | 72 |
| pi-autoresearch Autonomous experiment loop extension for pi | 2.9k | +755/wk | TypeScript | MIT License | 72 |
| Attention-Residuals Research implementation of attention residual connections for transformer models. | 2.7k | +966/wk | — | — | 70 |
| autoresearch Claude Autoresearch Skill — Autonomous goal-directed iteration for Claude Code. Inspired by Karpathy's autoresearch. Modify → Verify → Keep/Discard → Repeat forever. | 2.3k | +950/wk | Shell | MIT License | 74 |
| autoresearch-genealogy Structured prompts, vault templates, and archive guides for AI-assisted genealogy research. Built for Claude Code. | 921 | +921/wk | — | MIT License | 60 |
This is Harvard's open-source textbook on Machine Learning Systems — not how to train a model, but how to engineer the entire system around it: deployment, edge inference, privacy, MLOps, and production constraints. It's the curriculum behind CS249r, now being published by MIT Press. If you're an indie hacker building ML-powered products and want to understand why your model works in a notebook but fails in production, this book fills the gap. Fast.ai teaches you to train models. Stanford CS229 teaches the math. This book teaches the engineering. There's nothing else quite like it in the open-source textbook space. The catch: it's an academic textbook, so the writing can be dense. The hands-on labs use TinyTorch and Marimo notebooks — niche tools your team might not know. Volumes I and II arrive in Summer 2026, so the content is still evolving. And "ML Systems" is broad — some chapters will be deeply relevant to your work and others won't apply at all. Read selectively, not cover-to-cover.
AutoResearchClaw is the "just press go" button for academic papers — and that should make you nervous and excited in equal measure. Feed it a research idea, and a 23-stage pipeline handles literature review (real papers from OpenAlex and arXiv), hypothesis generation, sandboxed experiments, statistical analysis, multi-agent peer review, and LaTeX output targeting NeurIPS/ICML. It even learns from past runs. If you're an ML researcher prototyping ideas overnight, this is your unfair advantage. The anti-fabrication guards (NaN detection, citation relevance scoring) are a step above what you'd get cobbling together AutoGPT chains. Alternatives like Karpathy's original autoresearch pattern are lighter but manual. ARIS is markdown-only and less opinionated. No commercial tool does this end-to-end yet. The catch: "fully autonomous research" is a bold claim. The papers need human review before submission — treating this as a first-draft accelerator, not a paper factory, is the right call. And the pipeline is complex enough that debugging failures takes real effort.
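The staged-pipeline-with-guards idea is worth seeing in miniature. This is a hypothetical sketch, not AutoResearchClaw's actual code — the stage names and the shape of the NaN guard are assumptions based on the description above.

```python
import math

def nan_guard(results):
    """Anti-fabrication check: reject any stage output whose metrics are NaN."""
    for name, value in results.items():
        if isinstance(value, float) and math.isnan(value):
            raise ValueError(f"fabrication guard tripped: {name} is NaN")
    return results

def run_pipeline(idea, stages):
    """Run a staged research pipeline; every stage's output passes the guard
    before the next stage sees it."""
    state = {"idea": idea}
    for stage in stages:
        state = nan_guard(stage(state))
    return state

# Toy stand-ins for the real stages (literature review, experiments, ...).
def literature_review(state):
    return {**state, "citations": 12.0}

def experiment(state):
    return {**state, "accuracy": 0.91}

paper_state = run_pipeline("attention residuals", [literature_review, experiment])
print(paper_state["accuracy"])  # 0.91
```

The point of the pattern: a fabricated or broken result fails loudly at the stage boundary instead of propagating into the final LaTeX.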
ARIS is the lightweight, agent-agnostic cousin of AutoResearchClaw. It orchestrates full ML research lifecycles — literature survey, idea generation, experiment automation, paper writing — using nothing but Markdown skill files. No framework, no database, no Docker. Works with Claude Code, Codex, OpenClaw, or any LLM agent. The secret sauce is adversarial collaboration: Claude Code executes fast, GPT-5.4 reviews slowly and rigorously, probing weaknesses the executor missed. This cross-model tension produces better papers than single-model loops. Compared to AutoResearchClaw (23-stage pipeline, heavier), ARIS is more flexible. Compared to autoresearch (general-purpose), ARIS is research-specific. Use this when you want autonomous ML research that runs overnight across 20+ GPU experiments. Skip this if you need a polished paper — ARIS improves drafts, it doesn't write final submissions. The catch: cross-model collaboration means paying two API providers. And "autonomous overnight research" can burn serious GPU hours and API credits if your guard rails aren't tight.
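The adversarial executor/reviewer loop can be sketched in a few lines. The `execute` and `review` callables below are hypothetical placeholders for calls to two different model providers — this illustrates the control flow, not ARIS's actual skill files.

```python
def cross_model_loop(task, execute, review, max_rounds=3):
    """Adversarial collaboration: a fast executor drafts, a slower reviewer
    probes for weaknesses, and the executor revises until the reviewer has
    no objections left (or the round budget runs out)."""
    draft = execute(task, feedback=None)
    for _ in range(max_rounds):
        objections = review(draft)
        if not objections:
            break
        draft = execute(task, feedback=objections)
    return draft

# Toy stand-ins: the executor fixes whatever the reviewer flags.
def execute(task, feedback):
    return {"text": task, "fixed": list(feedback or [])}

def review(draft):
    return [] if "missing baseline" in draft["fixed"] else ["missing baseline"]

final = cross_model_loop("write related-work section", execute, review)
print(final["fixed"])  # ['missing baseline']
```

The cross-model tension the README describes lives in `review` coming from a different model than `execute` — a single model reviewing its own output tends to miss its own blind spots.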
pi-autoresearch brings the Karpathy autoresearch loop to the pi agent platform. Edit, commit, benchmark, log, keep or revert, repeat — fully autonomous. Works for any measurable target: test speed, bundle size, build time, Lighthouse scores, training loss. The confidence scoring after 3+ experiments is smart — it distinguishes real gains from benchmark noise. Correctness checks via autoresearch.checks.sh prevent optimizations that break things. A built-in dashboard lets you visualize experiment history. Compared to uditgoenka's autoresearch (Claude Code-specific), this is pi-native. Compared to ARIS (research-focused), this is more general-purpose. Use this when you're on the pi platform and want overnight autonomous optimization of any measurable metric. Skip this if you're not using pi — the extension is platform-specific. The catch: autonomous agents making commits in a loop can create messy git histories. And the experiment loop assumes your benchmark is deterministic — flaky tests or variable CI environments will produce misleading results.
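The "confidence after 3+ experiments" idea boils down to a noise-aware comparison. The threshold rule below is an assumption for illustration, not pi-autoresearch's actual scoring:

```python
from statistics import mean, stdev

def confident_gain(baseline, candidate, min_runs=3, z=2.0):
    """Decide whether a candidate's benchmark numbers beat the baseline by
    more than measurement noise. Requires 3+ runs of each; the improvement
    must exceed z times the larger run-to-run deviation. Lower is better
    (e.g. seconds of test time)."""
    if len(baseline) < min_runs or len(candidate) < min_runs:
        return False  # not enough evidence yet
    noise = max(stdev(baseline), stdev(candidate))
    return (mean(baseline) - mean(candidate)) > z * noise

# Three timing runs each (seconds): a real gain vs. a change lost in noise.
print(confident_gain([10.1, 10.0, 10.2], [9.0, 9.1, 8.9]))   # True
print(confident_gain([10.1, 10.0, 10.2], [9.9, 10.3, 9.8]))  # False
```

This is also why the deterministic-benchmark caveat matters: if `noise` is large relative to real gains, nothing ever clears the bar — or worse, with too few runs, noise clears it.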
Attention Residuals is a drop-in fix for how every transformer stacks its layers. Standard residual connections accumulate all layer outputs with fixed weights, diluting each layer's contribution as depth grows. AttnRes replaces this with softmax attention over preceding layers — giving every layer selective, content-aware access to earlier representations. Moonshot AI (the Kimi team) reports meaningful gains: MMLU 73.5 to 74.6, GPQA-Diamond 36.9 to 44.4, HumanEval 59.1 to 62.2. Block AttnRes reduces the overhead to under 4% during training and under 2% at inference. Compared to standard PreNorm (what nearly every modern transformer uses), the reported benchmarks favor it across the board. No direct open-source competitors exist — this is a research contribution, not a product. Use this when you're pre-training or fine-tuning transformers and want free performance gains with minimal overhead. Skip this if you're using models, not building them. The catch: research-stage implementation. No pretrained models with AttnRes are publicly available yet — you'd need to train from scratch or adapt existing architectures. Integration requires modifying your transformer stack.
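The core mechanism — a softmax-weighted combination of earlier layer outputs instead of a fixed-weight sum — can be sketched without any framework. The `scores` here are given directly for illustration; in the actual method they come from a learned, content-dependent attention computation:

```python
import math

def softmax(scores):
    """Numerically plain softmax over a small list of scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn_residual(layer_outputs, scores):
    """Combine all preceding layer outputs with softmax weights, rather
    than summing them with fixed unit weights as a standard residual
    stream does. Returns the weighted mixture, one value per dimension."""
    weights = softmax(scores)
    dim = len(layer_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, layer_outputs))
            for i in range(dim)]

# Three earlier layers' outputs (toy 2-d vectors); layer 0 scored most relevant.
outs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mixed = attn_residual(outs, scores=[2.0, 0.0, 0.0])
print(mixed)
```

Because the weights are a softmax, the mixture is convex — each layer's contribution stays bounded no matter how deep the stack grows, which is exactly the dilution problem the fixed-weight sum suffers from.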
autoresearch generalizes Karpathy's autonomous ML iteration loop to any domain. The pattern is dead simple: modify, verify, keep or discard, repeat. Claude iterates autonomously with mechanical verification and automatic rollback — works on backend code, frontend UI, content, performance, anything with a measurable outcome. The genius is the constraint: one metric, one direction, fast verification, git as memory. 608 stars in 3 days because developers recognized the pattern immediately. Compared to AutoResearchClaw (heavier, academic papers only) and ARIS (cross-model collaboration), this is the lightest and most general-purpose. It's a Claude Code skill, not a framework. Use this when you want to optimize anything measurable overnight — test speed, bundle size, Lighthouse scores, training loss. Skip this if your problem can't be reduced to a single improvable metric. The catch: autonomous iteration without clear guard rails can burn through API credits fast. And "overnight optimization" sounds magical until your agent makes 50 commits that each improve the metric by 0.01%.
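The modify → verify → keep/discard loop fits in a dozen lines. In the real skill, "keep" is a git commit and "discard" is a checkout; here a plain Python value stands in for the repo, and the hill-climbing target is a toy:

```python
import random

def autoresearch_step(state, metric, mutate):
    """One iteration of the pattern: propose a change, verify it against
    the single metric, keep it only if the metric improves."""
    candidate = mutate(state)
    return candidate if metric(candidate) > metric(state) else state

def optimize(state, metric, mutate, steps=100):
    """Repeat forever (well, `steps` times): git-as-memory reduces to
    'the last kept state'."""
    for _ in range(steps):
        state = autoresearch_step(state, metric, mutate)
    return state

# Toy target: maximize -(x - 3)^2 via random nudges; converges toward 3.0.
random.seed(0)
best = optimize(0.0, metric=lambda x: -(x - 3.0) ** 2,
                mutate=lambda x: x + random.uniform(-1, 1), steps=500)
print(round(best, 2))
```

Note how the constraint the paragraph praises is load-bearing: one metric, one direction, and a verification step cheap enough to run hundreds of times overnight.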
autoresearch-genealogy applies Karpathy's autonomous iteration pattern to family history research. Structured prompts, Obsidian vault templates, and archive guides for 12 research workflows — from OCR pipelines to oral history protocols — all built for Claude Code. This is niche but deeply thoughtful. Born from a real project that produced 105 files across 9 generations and 6 family lines. The confidence tier system and source hierarchy methodology prevent the AI from guessing when it should be verifying. Compared to generic AI genealogy tools, this has actual methodology. FamilySearch and Ancestry are commercial platforms, not research frameworks. Use this when you're doing serious genealogy research and want AI to help systematically, not just chat about ancestors. Skip this if you want a point-and-click family tree builder — this is a research methodology, not a product. The catch: 921 stars suggest early adoption. The prompts are Claude Code-specific (though adaptable), and genealogy archives vary wildly by region — the guides focus on certain sources that may not cover your family's geography.