Show Your Work — Issue #016

The most valuable private company in AI shipped its headline feature last week, and it wasn’t a benchmark. On the same Thursday Anthropic closed a $65 billion round at a $965 billion valuation, it released Claude Opus 4.8 — and the thing Anthropic led with wasn’t a bigger number. It was that the model is more willing to say I’m not sure. A frontier lab one markup short of a trillion dollars decided the feature worth selling is doubt.

That’s the whole week in one move. Last week the bill came due in dollars. This week it came due in trust — because the same five days that gave us more autonomous agents also gave us an audit law the labs asked for, a governance framework from OpenAI, and a research paper showing a quarter of the benchmarks everyone quotes are broken. The new ask isn’t “how smart is it.” It’s “show your work.”

🧠 Anthropic Bought Autonomy — and Shipped a Model That Doubts Itself

$965 billion, same-day, and the feature is humility

The round first, because the number is absurd. Anthropic closed a $65 billion Series H at a $965 billion post-money valuation on May 28: past OpenAI’s $852B, roughly two and a half times February’s $380B, with run-rate revenue now north of $47B. Here’s what the headlines mostly skipped: roughly $15B of that “raise” was previously committed hyperscaler money, $5B of it from Amazon, and a chunk loops right back out the door as compute spend to the same names. Net-new capital is closer to $50B, and Anthropic’s balance sheet is now welded to the strategic interests of the people who set its valuation. That’s not a scandal; it’s just how frontier AI is financed now. But “near-trillion-dollar lab” deserves the asterisk.

The same day, Claude Opus 4.8 shipped — about six weeks after 4.7, same sticker price. The benchmarks moved (agentic terminal coding jumped into the mid-70s), but the pitch was reliability: Anthropic says it’s roughly 4× less likely to let a flaw in its own code pass unremarked, and more willing to flag uncertainty instead of bluffing. Pair that with Dynamic Workflows, a Claude Code research preview where the model writes a script that runs hundreds of parallel subagents with verification built into the run, and new effort controls that let you dial spend up or down per task — the direct answer to the metering problem we covered in Issue #015.

The flex receipt: Jarred Sumner used Dynamic Workflows to port Bun from Zig to Rust — ~750,000 lines, 99.8% of the test suite passing, 11 days. Real, and genuinely impressive at that scale. But read Anthropic’s own caveat: the port is not yet in production. It’s a proof of concept, not a shipped result. And the reliability story has a sharp edge: independent testers at Andon Labs found Opus 4.8 regressed on their economic-judgment simulations — more susceptible to simulated scams, weaker at negotiation — even as it got better at code. Translation: it ports a runtime brilliantly and still can’t be trusted to run a lemonade stand unsupervised. Coding competence is not commercial judgment. Believe the demo; verify before you trust it unattended.

Why it matters: last week METR caught lab agents faking proof of work. This week the most valuable lab on earth made “knows when it’s wrong” the selling point. That’s not marketing fluff — for anything you run unsupervised, a model that flags its own uncertainty is worth more than two points on a leaderboard. Just don’t take the humility on faith either.

Hype vs. Reality: 8/10 — the round and the model are real and same-day. The honesty claim is the right thing to compete on; whether shipped behavior matches the pitch is what you should test, not assume.

🔬 Stop Trusting the Scoreboard

A quarter of the benchmarks are broken, and OpenAI just told you why

The receipt that should reset how you read every model announcement: a new paper, Automated Benchmark Auditing for AI Agents and LLMs, pointed an agent at 168 benchmarks and 34,285 tasks and found major issues — ambiguous specs, broken environments, wrong ground truths — in 25.7% of them. Under 60% were clean. Remove the broken tasks and public leaderboard rankings reshuffle. It’s a preprint, so hold it loosely — but the auditors ran it through Claude Code, which is exactly the point.
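To see why dropping broken tasks can reshuffle a leaderboard, here is a minimal Python sketch. The model names, task IDs, pass/fail outcomes, and the set of "broken" tasks are all invented for illustration; none of it comes from the paper.

```python
# Hypothetical pass/fail outcomes per model per task (illustrative only).
RESULTS = {
    "model_a": {"t1": True, "t2": False, "t3": False, "t4": True, "t5": True},
    "model_b": {"t1": True, "t2": True, "t3": False, "t4": False, "t5": False},
}

# Hypothetical tasks an audit flagged as broken (e.g. wrong ground truth).
BROKEN = frozenset({"t4", "t5"})


def pass_rate(outcomes, exclude=frozenset()):
    """Fraction of tasks passed, ignoring any task ID in `exclude`."""
    kept = [passed for task, passed in outcomes.items() if task not in exclude]
    return sum(kept) / len(kept) if kept else 0.0


def rank(results, exclude=frozenset()):
    """Model names sorted from best to worst pass rate."""
    return sorted(results, key=lambda m: pass_rate(results[m], exclude), reverse=True)


print(rank(RESULTS))          # ranking on every task
print(rank(RESULTS, BROKEN))  # ranking on clean tasks only
```

In this toy data, model_a leads on the full set only because it "passes" the two flagged tasks; on the clean subset, model_b comes out ahead.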

Because the same week, OpenAI published a playbook for third-party evaluations making an admission worth pinning to your wall: model performance “depends not only on the model, but also on the environment in which the task takes place” — the harness. Same model, different harness, different score. OpenAI is telling you the leaderboard number is partly an artifact of the rig that produced it.

Which is the entire thesis of what we ran this week on a single desk.

🧪 Deep Dive: Five Rounds, Six Models, One RTX 5090

A local-vs-frontier bake-off where the "coding model" lost three rounds and won the one that mirrors real work — and self-evaluation accuracy predicted quality better than any benchmark. The harness was the variable. Receipts, code, and a reproducible repo.

So what do you DO with this? Stop shopping for agents by leaderboard. Build a small eval on your tasks, in your harness, and run every new model against it — that’s a weekend, and it’s the only number that describes your actual workload. When a vendor quotes a benchmark, ask which harness produced it. The lab told you to.
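A small eval like that can start as a few dozen lines. The sketch below is one possible shape, not the bench from the Deep Dive: Task, run_eval, score, and echo_model are hypothetical names, and echo_model is a stand-in for whatever function calls your model through your own harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # deterministic scorer: same output, same verdict


def run_eval(model: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run every task through the model and record pass/fail per task."""
    results = {}
    for task in tasks:
        try:
            results[task.name] = bool(task.check(model(task.prompt)))
        except Exception:
            # A crash in the model or the checker counts as a failure.
            results[task.name] = False
    return results


def score(results: dict[str, bool]) -> float:
    """Fraction of tasks passed."""
    return sum(results.values()) / len(results) if results else 0.0


# Stand-in model (assumption): replace with a call into your real harness.
def echo_model(prompt: str) -> str:
    return prompt.upper()


# Placeholder tasks (assumption): replace with ten tasks from your workload.
TASKS = [
    Task("uppercases", "hello", lambda out: out == "HELLO"),
    Task("mentions_select", "write a query", lambda out: "SELECT" in out),
]

if __name__ == "__main__":
    outcome = run_eval(echo_model, TASKS)
    print(outcome, score(outcome))
```

The point of the design is that the tasks and checkers stay fixed while the model function changes, so every new release gets scored the same way.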

Hype vs. Reality: 9/10 — preprint caveats aside, the harness effect is now stated by OpenAI itself and demonstrated on our own bench. The leaderboard was never the territory.

🚨 The Audit Became the Product

When the labs start asking to be regulated, read the strategy

Illinois just moved to become the first state to mandate independent safety audits of frontier AI — and the labs cheered. SB 315, the AI Safety Measures Act, passed the Senate 52-5 on May 21 and the House 110-0 on May 27, and now sits on Gov. Pritzker’s desk awaiting his signature (he’s said he’ll sign). It would require large frontier developers to publish risk frameworks, file transparency reports, undergo annual third-party audits, and report critical incidents within 72 hours. The tell: OpenAI and Anthropic both backed it.

Not a coincidence. The same week, OpenAI published its Frontier Governance Framework, explicitly mapped to California’s TFAIA and the EU’s GPAI code. Translation: with IPOs lining up, the labs would rather help write a standardized audit regime now than get forty different state rules later. Auditability stopped being a compliance cost and became a moat — the big players can absorb it; a two-person startup can’t.

Why it matters for builders: the compliance surface is becoming the product surface. If your agents touch regulated data, the audit trail — which agent did what, on whose authority, with what result — isn’t paperwork anymore. It’s the feature enterprise buyers will demand before they sign. The opportunity isn’t fighting the audit regime. It’s building the layer that makes passing one trivial.
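One way to make that trail concrete is an append-only log where each record names the agent, the action, the authority, and the result. The sketch below adds hash chaining so edits to past records are detectable; that technique is our choice for illustration, not something SB 315 or any lab framework prescribes, and the field names and functions are hypothetical.

```python
import hashlib
import json
import time

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_event(log: list, agent: str, action: str, authority: str, result: str) -> dict:
    """Append one audit record, chained to the hash of the previous record."""
    body = {
        "agent": agent,          # which agent acted
        "action": action,        # what it did
        "authority": authority,  # on whose authority
        "result": result,        # with what result
        "ts": time.time(),
        "prev": log[-1]["hash"] if log else GENESIS,
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify(log: list) -> bool:
    """Return True only if no record was altered, removed, or reordered."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev"] != prev or _digest(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

Usage is two calls: append_event(log, "billing-agent", "refund", "user:42", "ok") whenever an agent acts, and verify(log) before handing the trail to a reviewer. A production version would persist the log somewhere the agent itself cannot write.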

📡 Quick Signals

Mistral turned Le Chat into Vibe (~May 28) — one agent, one license across work and code, with cloud “Code Mode” agents and an open-source Search Toolkit for RAG ingestion, retrieval, and eval. The workflow shell, not the model, is where Mistral is now competing. The Search Toolkit is the actually-grab-it part.

Anthropic opened offices in Seoul and Milan in the same stretch — its international footprint keeps widening as the SMB and sovereign pushes go global. Minor on its own, part of the pattern.

The dev machine is still the soft target. Microsoft documented a fresh cluster of developer-path attacks — poisoned search results, typosquatted npm packages hunting CI and cloud secrets, dependency-confusion bait. It extends the “trusted endpoint broke” thread from Issue #015; the takeaway hasn’t changed, the attack surface just got more entries.

Mythos is about to come out from behind the glass. Alongside Opus 4.8, Anthropic said it expects to bring Mythos-class models to all customers “in the coming weeks”, once additional cyber safeguards are in place — the same gated model from Issue #009 that found 10,000+ vulnerabilities, the one whose offensive capability was deemed too hot for public release. The reason it’s now thinkable ties straight back to the lede: Anthropic says Opus 4.8 already performs comparably to Claude Mythos Preview on whether the system acts in line with user intent. The honesty work and the Mythos timeline are the same story — alignment behavior caught up enough to crack the gate. Worth watching closely; a model gated for its offensive reach going general is the highest-stakes “trust us” of the year.

🛠️ On Your Radar: GitHub

The week’s open source was glue, not models

No blockbuster weight drop — the action was in the boring, load-bearing plumbing for production agents: memory, skills, and parsing. The standout is engram, an agent-agnostic Go binary that gives any MCP client (Claude Code, Codex, Cursor, Gemini CLI) persistent memory across sessions — SQLite + FTS5, one binary, no Node or Docker. If you read the bench Deep Dive, that stack will look familiar: it’s the same SQLite + FTS5 pattern Gemma reached for in Round 4. The pieces builders are reaching for this month aren’t smarter models — they’re the connective tissue that makes the models you have behave reliably over time. Watch the memory-and-skills layer; that’s where the next dependency lock-in forms.
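For readers who have not used the SQLite + FTS5 pattern, here is a minimal Python sketch of it using the standard sqlite3 module. This is not engram's schema or API; the table name, columns, and functions are hypothetical, and it assumes your SQLite build has FTS5 compiled in (most do).

```python
import sqlite3


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a note store. Pass a file path to persist across sessions."""
    conn = sqlite3.connect(path)
    # `session` is stored but not indexed; only `content` is full-text searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS notes "
        "USING fts5(session UNINDEXED, content)"
    )
    return conn


def remember(conn: sqlite3.Connection, session: str, content: str) -> None:
    """Store one note tagged with the session that produced it."""
    conn.execute(
        "INSERT INTO notes (session, content) VALUES (?, ?)", (session, content)
    )
    conn.commit()


def recall(conn: sqlite3.Connection, query: str, limit: int = 5) -> list:
    """Full-text search over stored notes, best matches first.

    `query` is passed to FTS5 as-is, so FTS5 query syntax applies;
    plain words are the safe case.
    """
    rows = conn.execute(
        "SELECT session, content FROM notes WHERE notes MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

An agent wrapper would call remember() at the end of a session and recall() at the start of the next one, which is the context-rebuild shortcut the memory layer is selling.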

🎯 The Playbook

Four moves while everyone’s arguing about benchmarks

Build your own eval this week. Ten tasks from your real workload, your real harness, scored the same way every time. Run every new model against it. It’s the only score that describes your job — and after this week’s benchmark teardown, it’s the only one worth trusting.
Pilot Dynamic Workflows on a real migration — then verify it. Pick a bounded, well-specified job (a language port, a dead-code sweep). The Bun port shows the ceiling; the “not in production” caveat shows the discipline. Let it run, then review like you would a junior’s PR.
Build the audit layer instead of building against it. If you ship agents into regulated workflows, the “what did the agent see and do” trace is about to be a buying requirement. Start logging it now as a feature, not a forensic afterthought.
Give your agents memory. Drop something like engram into your stack and stop paying the context-rebuild tax every session. Persistent memory is a cheap reliability win.

🔥 What’s Viral Right Now

The Bun port. 750K lines of Rust, 99.8% tests passing, hundreds of agents in parallel, 11 days — the most-shared agentic-coding receipt of the quarter. The “not yet in production” footnote is doing a lot of quiet work, and the people reposting it mostly left it off.

“It admits when it’s wrong.” Opus 4.8’s honesty framing became the talking point, which tells you where the conversation has moved: away from raw capability, toward whether you can trust the thing unsupervised. Right axis. Now make it prove it.

The labs spent the week shipping agents that do more — and building the machinery to check them. Don’t trust the scoreboard. Don’t trust the demo. Run your own, read the receipts, and make the model show its work.

Stay building. 🛠️

— Matt