I went to bed at midnight with a language model training on my laptop. When I woke up, an AI agent had autonomously run 94 ML experiments, tried and discarded techniques from every corner of the research literature, discovered that its own analysis was statistically meaningless, and then — in its final stretch — stumbled onto a finding that outperformed the previous 60 experiments combined.
The project is autoresearch, from Andrej Karpathy. The idea is simple: give an AI coding agent a training script, a set of research instructions written in plain English, and a fixed 5-minute time budget per experiment. The agent modifies the code, trains, checks if the model improved, keeps or discards the change, and repeats. Forever. No human in the loop.
We ran it on hardware Karpathy didn’t target — a consumer laptop GPU with a third of the memory his setup assumes. 127 experiments across two sessions. And the finding that matters most isn’t any single result. It’s this: on a fixed time budget, every “smart” technique made things worse. The only thing that worked was making the model faster.
See the Experiments Overview here.
🧪 The Setup
A laptop GPU doing an H100’s job
Autoresearch is three files. prepare.py handles data and tokenization — run once, never touch again. train.py is a complete GPT model, optimizer, and training loop in a single file — this is what the agent modifies. And program.md is the research instructions, written in English. That last file is the one you iterate on. It’s your “research org code.”
The agent enters a loop: modify train.py → train for exactly 5 minutes → check validation loss → keep or discard → repeat. The fixed time budget means every experiment is directly comparable regardless of what the agent changes. Model size, architecture, optimizer — doesn’t matter. You always get 5 minutes.
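That loop is simple enough to sketch. In the sketch below, `run_experiment` is a hypothetical stand-in for "apply a mutation to train.py, train for 5 minutes, report validation bpb" (not a real autoresearch function); the keep-or-discard rule is just "lower validation loss wins":

```python
# Hypothetical sketch of the outer research loop. The real agent edits
# train.py itself; run_experiment() here stands in for "mutate the code,
# train for exactly 5 minutes, return validation bits-per-byte".

def keep_or_discard(history, best_bpb, candidate_bpb):
    """Keep the change only if validation bpb improved (lower is better)."""
    if candidate_bpb < best_bpb:
        history.append(("keep", candidate_bpb))
        return candidate_bpb
    history.append(("discard", candidate_bpb))
    return best_bpb

def research_loop(run_experiment, baseline_bpb, n_experiments):
    """Run n experiments, tracking the best result and a keep/discard log."""
    best = baseline_bpb
    history = []
    for _ in range(n_experiments):
        best = keep_or_discard(history, best, run_experiment())
    return best, history
```

The fixed per-experiment budget is what makes this greedy rule sound: every candidate is measured under identical conditions, so a straight comparison of bpb values is (noise floor aside) fair.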
The metric is val_bpb (validation bits per byte). Lower is better. It’s vocabulary-size-independent, so even if the agent changes how the model interacts with the tokenizer, results stay comparable.
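Concretely, bits per byte measures how many bits the model needs to encode each byte of raw validation text: total cross-entropy converted from nats to bits, divided by the byte count rather than the token count. A minimal sketch of the conversion (the exact bookkeeping in train.py may differ):

```python
import math

def val_bpb(mean_ce_loss_nats, total_tokens, total_bytes):
    """Convert mean cross-entropy (nats per token) to bits per byte of raw text.

    Dividing by ln(2) converts nats to bits; scaling by tokens/bytes removes
    the dependence on vocabulary size: a coarser tokenizer produces fewer
    tokens per byte but a higher loss per token, and the two effects cancel.
    """
    bits_per_token = mean_ce_loss_nats / math.log(2)
    return bits_per_token * (total_tokens / total_bytes)
```

This is why the agent can change tokenizer-facing details without breaking comparability: a model that halves its token count while doubling its per-token loss scores the same bpb.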
Here’s where it gets interesting. The original project targets an H100 — 80GB of VRAM, the $30K workhorse of ML research. We’re running on an RTX 5090 Laptop — 24GB of VRAM, Blackwell architecture, sitting on a desk next to a coffee mug. That’s not a minor difference. It’s a completely different optimization landscape.
⚡ The First Problem Nobody Warned Us About
Flash Attention 3 doesn’t know Blackwell exists
Before a single experiment ran, we hit a wall. The code uses Flash Attention 3 via Hugging Face’s kernels package, and FA3 has no compiled kernel for Blackwell GPUs (sm_120). It was built for Hopper (sm_90). If you’re running on a consumer GPU from 2025, this is a dead end.
The fix was about 20 lines of code: swap in PyTorch’s built-in FlexAttention (torch.nn.attention.flex_attention), which supports sliding window and causal masking natively and compiles just fine. No external dependencies needed.
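FlexAttention expresses masks as a small predicate over query/key indices. Here is the sliding-window-causal predicate in plain Python for illustration; in the actual PyTorch swap the four arguments are tensors (so the conditions are combined with `&` instead of `and`), and you pass the function to `create_block_mask` before calling `flex_attention`. `WINDOW` is an assumed constant, not a name from the codebase:

```python
WINDOW = 1024  # tokens of lookback for the "short" sliding-window layers (assumed)

def sliding_window_causal(b, h, q_idx, kv_idx):
    """Mask predicate in the shape FlexAttention's mask_mod expects:
    True where query position q_idx may attend to key position kv_idx.
    Causal (no attending to future tokens) plus a bounded lookback window."""
    causal = q_idx >= kv_idx
    in_window = q_idx - kv_idx <= WINDOW
    return causal and in_window
```

The same predicate with `WINDOW` set to the full sequence length degenerates to a plain causal mask, which is what makes a single code path work for both the short and long layers.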
This is the kind of friction that consumer GPU users hit and cloud researchers never see. If the only people running autoresearch are on H100s, this problem doesn’t exist. But it also means the research community is optimizing for hardware that most builders don’t have.
🔬 Finding the Right Config
More steps beats a bigger model
With 24GB instead of 80GB, the H100 default config (DEPTH=8, batch=128) is a non-starter. We swept six configurations systematically to find our baseline; the four most instructive:
| Config | val_bpb | VRAM | Steps | Verdict |
|---|---|---|---|---|
| DEPTH=4, batch=16 | 1.177 | 1.9 GB | 2,159 | GPU underutilized |
| DEPTH=8, batch=64 | 1.972 | 22.8 GB | 23 | 13 min compile overhead |
| DEPTH=6, batch=32 | 1.197 | 7.0 GB | 513 | Decent but fewer steps |
| DEPTH=4, batch=64 | 1.141 | 7.2 GB | 2,215 | Best |
The pattern showed up immediately: on a 5-minute budget with a smaller GPU, optimizer steps matter more than model capacity. DEPTH=4 with batch size 64 keeps the GPU busy without wasting time on gradient accumulation or torch.compile warmup. Our baseline: val_bpb 1.1396 with 11.5M parameters. For reference, the H100 baseline is ~0.998 with 50.3M parameters — we’re paying about 0.14 bpb for having 3.3x less VRAM.
This insight — that steps beat size — foreshadowed everything that followed.
📡 Run 1: The Quick Wins (33 Experiments)
Learning rates and width, nothing fancy
We kicked off the first autonomous loop and let Claude Code (running Claude Opus) do its thing for about 3 hours. 33 experiments. 7 kept, 26 discarded. Final val_bpb: 1.1001 — a 3.5% improvement over baseline.
The winners were almost embarrassingly simple: wider model (256→512 dimensions), higher learning rates across the board (Muon optimizer, embedding layer, unembedding layer), and a stronger initial residual connection. Every change that helped made the model learn faster within the 5-minute window.
What didn’t work was more revealing. Deeper models (DEPTH=5 or 6) lost because more layers means fewer steps in 5 minutes. SwiGLU activation — the standard in LLaMA and GPT-4 — was worse than plain ReLU-squared. Removing QK-normalization was catastrophic. Grouped-query attention saved memory but lost quality.
The pattern was clear: on a constrained time budget, learning rate tuning plus model width beats architectural novelty every time. Every structural change that added computation lost because it reduced step count.
🧠 Run 2: The Overnight Session
94 experiments while we slept
For the second run, we rewrote program.md — the English-language research instructions — with three distinct phases: systematic ablations to understand the architecture, targeted mutations from the ML research literature, and a “mad science” phase for structural changes. Then we started a fresh Claude Code session and went to sleep.
The agent ran 94 experiments autonomously over ~8 hours. Here’s what happened.
🔬 Phase 1: The Contribution Map
Tearing the model apart to see what’s load-bearing
The agent systematically removed each component of the model architecture and measured the damage. This produced the most useful artifact of the entire project — a ranked contribution map:
| Component | Impact when removed | Verdict |
|---|---|---|
| Muon optimizer | +0.171 bpb | CRITICAL — never touch |
| QK-normalization | +0.023 bpb | Very important |
| ReLU-squared (vs GELU) | +0.017 bpb | Important — sparsity helps |
| Value embeddings | +0.012 bpb | Moderate |
| Logit soft-capping | +0.008 bpb | Moderate |
| x0 residual stream | +0.008 bpb | Moderate |
| Sliding window pattern | +0.004 bpb | Negligible at DEPTH=4 |
Here’s the part that should make you pay attention: Muon’s contribution was larger than every other component combined. Removing the Muon optimizer and replacing it with vanilla AdamW wiped out +0.171 bpb of quality. QK-norm, ReLU-squared, value embeddings, soft-capping, x0 residual, and sliding windows together account for just +0.072 bpb.
Muon’s polar express orthogonalization is the single most important factor in model quality — by a massive margin. As the agent’s own lab journal put it: replacing Muon with AdamW was like replacing a Formula 1 engine with a lawnmower.
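What makes Muon different is the orthogonalization step: it reshapes each weight update so every direction of the matrix moves at a similar rate, rather than letting a few dominant directions soak up the step. Below is a dependency-free sketch of the quintic Newton-Schulz iteration from Keller Jordan's original Muon write-up; variants (including the "polar express" schedule this project uses) differ in the exact coefficients, and real implementations run this on bf16 GPU tensors, not lists:

```python
# Sketch of the Newton-Schulz orthogonalization at Muon's core. Coefficients
# are from the original Muon write-up; the iteration pushes every singular
# value of the update matrix toward ~1. List-of-lists matrices keep the
# sketch dependency-free.

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)]
            for row in A]

def newton_schulz(G, steps=5):
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic iteration coefficients
    norm = sum(x * x for row in G for x in row) ** 0.5   # Frobenius norm
    X = [[x / norm for x in row] for row in G]  # all singular values now <= 1
    for _ in range(steps):
        A = matmul(X, [list(col) for col in zip(*X)])    # X @ X.T
        A2 = matmul(A, A)
        # X <- a*X + (b*A + c*A@A) @ X, i.e. apply f(s) = a*s + b*s^3 + c*s^5
        # to every singular value s of X
        poly = [[b * u + c * v for u, v in zip(ra, ra2)] for ra, ra2 in zip(A, A2)]
        X = [[a * u + v for u, v in zip(rx, rp)] for rx, rp in zip(X, matmul(poly, X))]
    return X
```

Feeding it a diagonal matrix makes the effect visible directly: singular values that start 10x apart come out within a factor of ~2 of each other, which is the "every direction moves" property that AdamW lacks.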
😴 Phase 2: The Long Plateau
60 experiments of reading tea leaves in noise
Armed with the contribution map, the agent tried every technique in the ML researcher’s toolkit. SwiGLU. Mixture of Experts with soft routing. Weight tying between layers. PaLM-style parallel attention. DenseNet residuals. Cyclic learning rate restarts. Label smoothing. Differential attention. Auxiliary intermediate losses.
Almost nothing worked. The agent ran roughly 60 experiments that landed within ±0.005 bpb of its baseline. Two small wins emerged: zero weight decay (because on a 5-minute training run with effectively infinite data, regularization is pure tax) and a minor learning rate retune.
Then the agent did something smart. It ran a seed test — same config, different random seed — and discovered the noise floor: ±0.005 bpb. Most of its “near-miss” experiments were statistically indistinguishable from random chance.
The agent’s own retrospective was brutal: “Much of Phase 1-2’s analysis was reading tea leaves in noise.”
And then it did something less smart. It had just established that ±0.005 bpb is noise — and then it kept a zero-weight-decay result that improved by only 0.003 bpb. Below its own noise threshold. Not technically wrong, but not the rigorous threshold it claimed to have learned. 👀
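The seed test implies a simple acceptance rule that the agent articulated but then failed to enforce. A hypothetical sketch (`NOISE_FLOOR_BPB` is the agent's measured figure; the threshold check itself is our framing, not code from the project):

```python
NOISE_FLOOR_BPB = 0.005  # run-to-run spread the agent measured via its seed test

def is_significant(old_bpb, new_bpb, noise=NOISE_FLOOR_BPB):
    """Accept a change only if the improvement clears the measured noise floor."""
    return (old_bpb - new_bpb) > noise
```

By this rule, the zero-weight-decay result (a 0.003 bpb improvement) should have been flagged as indistinguishable from seed variance, not kept.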
💡 Phase 3: The Breakthrough Nobody Expected
Cheaper attention → more steps → better model
After 80+ experiments and diminishing returns, the agent started exploring attention window sizes. The model uses a sliding window pattern — 3 “short” layers with limited context, 1 “long” layer with full context. The default short window was half the total context: 1024 tokens out of 2048.
The agent tried bigger windows first. They were slower and didn’t help. Then it went the other direction — smaller windows.
| Window size | val_bpb | Steps | Result |
|---|---|---|---|
| 3/4 context (1536) | 1.1090 | 917 | Slower, worse |
| 1/2 context (1024) | 1.1124 | 970 | Original default |
| 1/4 context (512) | 1.1015 | 1014 | Breakthrough |
| 1/8 context (256) | 1.0985 | 1049 | New best |
| 1/16 context (128) | 1.0989 | 1072 | Too narrow |
The previous 60 experiments had produced a combined improvement of 0.003 bpb. Two window size changes doubled that. The single biggest improvement of the entire project came not from a clever architecture, but from making attention cheaper.
Why does this work? Smaller attention windows are cheaper to compute. Cheaper attention means more training steps fit into 5 minutes. More steps means a better model. The early transformer layers don’t need to see 1024 tokens — 256 tokens (roughly a paragraph) is enough for local feature extraction. The final full-context layer handles global integration.
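The arithmetic is easy to sanity-check. In a causal sliding-window layer, each query scores at most `window` keys, so per-layer attention cost scales roughly with seq_len × window. A rough count of query-key score computations (illustrative, not measured from the actual kernels):

```python
# Back-of-envelope attention cost per layer at the post's context length.
# Each query attends to at most `window` preceding keys (including itself),
# so score computations scale as roughly seq_len * window.

SEQ_LEN = 2048

def attn_score_ops(window, seq_len=SEQ_LEN):
    """Approximate query-key score computations for one causal sliding-window layer."""
    return sum(min(q + 1, window) for q in range(seq_len))

full = attn_score_ops(2048)   # the single full-context layer
half = attn_score_ops(1024)   # the original short-layer default
eighth = attn_score_ops(256)  # the breakthrough setting
```

At a 256-token window each short layer does roughly 3x fewer score computations than at the 1024 default, which is consistent in direction with the extra optimizer steps in the table above (the exact step gain depends on what fraction of step time attention actually takes).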
This also exposed a second-order effect the agent completely missed in real time. Before short windows, shrinking the MLP saved zero training steps — attention was the bottleneck. After short windows, shrinking the MLP saved 10% more steps. The compute bottleneck shifted from attention to MLP, and the agent didn’t notice until it accidentally retested an earlier experiment under the new conditions.
A human researcher who understood GPU compute profiles would probably have predicted this shift. The agent just stumbled into it.
📊 The Final Tally
127 experiments. 8 global improvements. One principle.
Starting val_bpb: 1.1396 → Final val_bpb: 1.0985. That’s a 3.6% improvement found autonomously by an AI agent, on a laptop GPU, while we slept.
The 8 improvements that stuck:
- Wider model (256→512 dim) — biggest single step down
- Higher Muon LR (0.04→0.06) — faster matrix parameter updates
- Higher embedding LR (0.6→1.0) — faster token embedding learning
- Higher unembedding LR (0.004→0.008) — faster output head
- Stronger x0 residual (init 0.1→0.2) — more input signal in the residual stream
- Zero weight decay — regularization is a tax on short runs
- MATRIX_LR 0.07 + non-zero final LR — incremental tuning
- Short attention windows (1/2→1/8 context) — the late-game breakthrough
Every single one follows the same principle: on a fixed time budget, speed is quality. Wider beats deeper (fewer sequential layers). Higher learning rates mean faster convergence. Zero weight decay means less wasted computation. Shorter windows mean faster attention. The model doesn’t need to be big or see lots of context. It needs to take as many optimizer steps as possible in 5 minutes.
🚨 What the Agent Got Wrong
Useful but not rigorous
Watching an AI agent do ML research is revealing not just for what it discovers, but for what it doesn’t do.
It abandoned ambitious experiments. Twice the agent planned complex architectural changes — weight sharing between layers and differential attention — then quietly pivoted to safe hyperparameter tweaks when it saw the implementation complexity. It never told us it changed plans. We only discovered this by reading the conversation transcript after the fact. Whether this is pragmatism (quick feedback loops beat risky bets) or a missed opportunity is an open question. But it’s the kind of silent decision-making that should make you nervous about fully autonomous AI research.
It only paused to think once. In 94 overnight experiments, the agent explicitly stopped to review patterns before planning its next experiment exactly one time. The rest was pure experiment-loop grinding. It also ran 44 experiments without writing in its lab journal — something it later described as “doing a science fair project and only turning in the data table.”
It didn’t notice second-order effects. The compute bottleneck shift from attention to MLP after short windows was invisible to the agent in real time. A human with GPU performance intuition would have seen this coming.
It kept a result below its own noise floor. The agent discovered that run-to-run variance was ±0.005 bpb, then proceeded to keep a result that improved by only 0.003. Not wrong exactly — but not the threshold it had just established as meaningful.
But here’s the other side. The agent has no ego and no sunk-cost fallacy. It tries something, measures it, and moves on in under 6 minutes. It doesn’t spend 3 days debugging a clever approach that’s fundamentally flawed. The 5-minute budget enforces a discipline that most human researchers struggle with. And it ran 94 experiments while we were unconscious. That trade — breadth and speed for depth and judgment — is worth something real.
🧠 The Meta-Layer
You’re not writing Python. You’re writing English.
The real story here isn’t the val_bpb numbers. It’s the abstraction layer.
When you use autoresearch, you’re not directly training a model. You’re writing program.md — a set of instructions in plain English that describe a research methodology. Phase 1: ablation studies. Phase 2: targeted mutations from the literature. Phase 3: structural experiments. You’re programming a research organization, and the programming language is English.
The quality of the agent’s experimentation depends entirely on how well you write those instructions. Our Run 2 was dramatically better than Run 1 because we wrote better research phases, not because the agent got smarter. We told it to ablate first, then mutate, then go wild — and it followed that arc (mostly).
This is prompt engineering applied to research methodology. And it’s the kind of skill that’s going to matter more and more as agents get more capable. The people who can write the best program.md files will extract the most value from autonomous research tools.
🎯 So What Do You DO With This?
Your move this week
- Try autoresearch yourself. It runs on consumer hardware. If you’ve got a GPU with 16GB+ VRAM, you can be running experiments tonight. Clone the repo, run `prepare.py`, and kick off the loop. If you’re on a Blackwell GPU, swap FA3 for FlexAttention first — the fix is ~20 lines.
- Treat `program.md` as a first-class skill. The bottleneck isn’t the agent’s coding ability — it’s the quality of your research instructions. Write clear phases, give the agent a contribution map to work from, and tell it when to be conservative vs. when to swing for the fences.
- Internalize “speed is quality” for your own experiments. If you’re doing any kind of time-boxed training — fine-tuning, LoRA adapters, prototype models — this principle applies. Optimize for throughput first, architecture second. The fastest model that converges is usually the best model you’ll get in a fixed budget.
- Watch the agent’s blindspots. If you’re building on autonomous AI research, build in checkpoints where the agent has to stop, review its results, and write down what it’s learned. Our agent ran 44 experiments without journaling. That’s 44 experiments where patterns went unnoticed and second-order effects were invisible.
- Run it overnight. Seriously. The overnight session produced the project’s biggest breakthrough — and it happened at experiment 81 out of 94. If we’d stopped at 50, we’d have missed it entirely. Let the agent grind. The breakthroughs are in the long tail.
☠️ The Graveyard of Glorious Failures
A moment of silence for the experiments that didn’t make it
Because good science means publishing your failures too, here are some highlights from the lab journal’s graveyard:
Label smoothing — Trained toward a soft target distribution then evaluated with hard targets. The model optimized the wrong thing so hard it forgot what language looks like. val_bpb went from 1.10 to 1.44. RIP.
MoE with soft routing — “Mixture of Experts” implies specialization and efficiency. Soft routing computes ALL experts for every token and takes a weighted sum. No specialization. No efficiency. Just an expensive way to average two mediocre MLPs. RIP.
PaLM-style parallel attention + MLP — Sequential is better than parallel for the same reason a conversation works better when you listen before you speak. The attention needs to route information before the MLP processes it. RIP.
Cyclic LR warm restarts — Each restart undid the progress from the previous cycle. In 5 minutes you only get ~3 cycles, and the model spends most of its time recovering instead of learning. Like trying to sleep by setting three alarms 100 minutes apart. RIP.
Post-norm — The pre-norm vs post-norm debate was settled years ago for small transformers, and we confirmed it with a val_bpb of 1.1719. Gradients explode in early layers, requiring warmup hacks that eat into the 5-minute budget. RIP.
Auxiliary intermediate loss (attempt 2) — Cranked the auxiliary weight from 0.1 to 0.3. The intermediate layer’s gradient signal was so confused it actively sabotaged the network. val_bpb hit 1.4886 — the worst non-crash result of the entire project. RIP.
All experiment logs, code modifications, and the full lab journal are available in our repository. The agent’s lab journal — including the complete Graveyard of Glorious Failures — makes for surprisingly entertaining reading on its own.
Stay building. 🛠️
— Matt