The Bill Came Due — Issue #015

Google walked on stage Tuesday and demoed a team of AI agents building the core of a working operating system from scratch in twelve hours — then ran a Doom-like game on it, live. The OS booted without keyboard support, so they told the agents to write the drivers in real time, and the game came up. It was a flex, but it was a coherent one: 93 sub-agents running in parallel, 2.6 billion tokens, under a thousand dollars in API credits. That demo is the whole week in one image — the agent stack is now real infrastructure, and somebody is metering every token that runs through it.

This week the meter became the story. Google shipped a complete agent-first developer platform and a model to run it cheap. Then the invoices started landing — on builders, on the labs, and on a security model that was never designed for any of this.

📡 Google Shipped the Whole Stack — and Turned On the Meter

The third front in the agent war opened with a full arsenal

At I/O on May 19, Google didn’t announce features. It announced a platform. Antigravity 2.0 is now a standalone, “unabashedly agent-first” desktop app, plus a full CLI, an SDK, and Managed Agents in the Gemini API — sub-agents, hooks, and async task management as first-class primitives, with built-in sandboxing, credential masking, and hardened Git policies. Pair that with Gemini 3.5 Flash, which shipped to general availability the same day and beats last year’s Gemini 3.1 Pro on most coding and agentic benchmarks while running roughly 4x faster. Google also pushed the same agent stack into Search itself: AI Mode is now past a billion monthly users, a follow-up from an AI Overview now flows into a full AI Mode conversation with your context intact, and Search has started generating custom interactive UI — simulations, dashboards, mini-app-style tools — per query.

Translation: for the last six months the “Agent OS war” we covered in Issue #010 had two real combatants — Claude Code and Codex. Google just walked in with a complete stack: model, harness, CLI, SDK, browser, and the most-trafficked search surface on earth as the distribution layer. Antigravity isn’t an answer to Claude Code. It’s an answer to the entire category.

Here’s the part the headlines missed. Everyone reported “Flash beats last year’s Pro.” Almost nobody reported the price tag. Gemini 3.5 Flash costs $1.50 in / $9 out per million tokens — 25% under 3.1 Pro in the headline, but 3x the price of the Gemini 3 Flash it replaces and 6x the old Flash-Lite. Simon Willison, running the standard benchmark suite, found the new Flash cost roughly 5.5x more to run end to end than the Flash it replaces once you count the extra reasoning tokens it burns. His read: “all three major AI labs appear to be probing the price tolerance of their API customers.” The receipts back him up — GPT-5.5 launched at 2x GPT-5.4, and while Claude Opus 4.7 kept the same nominal $5/$25 sticker as Opus 4.6, Anthropic’s own pricing page notes its new tokenizer can use up to 35% more tokens for the same text — a price increase that never shows up on the sticker.

That 12-hour OS demo wasn’t just a flex, then. It was a thesis: Google is betting on swarms of fast, cheap agents running in parallel instead of one expensive monolithic run. Which is exactly why the per-token price just became a first-order strategic variable for everyone building on top.

Why it matters: if your roadmap assumes the cheap tier stays cheap, rebuild the spreadsheet. The Flash tier is no longer the budget option — it’s a near-flagship priced like one. And Gemini 3.5 Pro, held back to June (“give us until next month,” to audible groans), will land higher still.

Hype vs. Reality: 8/10 — the stack is real and shipping today. The one asterisk: the marquee model is Flash, not the flagship, and the OS demo was staged. The architecture underneath it is not.

💰 The Other End of the Price War

DeepSeek slammed the floor the same week everyone else raised the ceiling

While the Western labs walked prices up, DeepSeek made its 75% price cut permanent. On May 23, the Chinese lab locked in flagship V4-Pro pricing at a quarter of launch cost — as low as $0.0035 per million tokens, with output around $0.87. For comparison: the same output runs about $25 on Claude Opus 4.7 and $15 on GPT-5.4. DeepSeek didn’t disclose a reason for the cut, so the popular explanation — that Huawei’s Ascend 950 supply is finally ramping — stays suggestive, not confirmed; it did say at V4’s launch that Pro pricing would drop once those supernodes ship in volume in the second half, the same silicon thread we pulled in Issue #007. Either way, the floor is now Chinese.

This is the squeeze builders are about to feel from both directions, and it’s not hypothetical. Last month Uber’s CTO told The Information the company burned its entire 2026 AI budget in four months — Claude Code spread across roughly 5,000 engineers faster than any spreadsheet anticipated, with power users running $500–$2,000 a month and internal leaderboards quietly incentivizing token burn. The tool didn’t fail. It worked exactly as designed, and that was the problem. Anthropic, meanwhile, has told paid subscribers that agent tools and third-party harnesses move to a separate API-rate credit meter starting June 15.

The real story is the architecture of the bill. Per-seat pricing is dead for anything agentic; consumption pricing means a single team running parallel sub-agents can generate orders of magnitude more spend than one running autocomplete. The defensible move isn’t loyalty to one vendor — it’s routing: send architectural reasoning to the expensive frontier model, offload high-volume grunt work to a cheap tier or a DeepSeek-class floor, and instrument the spend before finance finds out the hard way.

Hype vs. Reality: 9/10 — the pricing numbers here come from vendor pages, Reuters, or named reporting, not vibes. FinOps for AI stopped being a buzzword this week and became a survival skill.

🧠 The Bill in Talent and Capital

Karpathy joined Anthropic — and a regulatory filing priced the compute

Two receipts landed the same 48 hours, and together they explain where the money and the people are flowing. First the people: on May 19, Andrej Karpathy announced he’d joined Anthropic. One of OpenAI’s original eleven, the former Tesla Autopilot lead, now on Anthropic’s pre-training team under Nick Joseph — building, per an Anthropic spokesperson, a group focused on using Claude to accelerate Claude’s own pre-training. The bet underneath the hire: compute is necessary but it’s not the moat. Research velocity is — and the company is wagering publicly that its own model is now good enough to be the researcher that finds the next one faster.

Then the capital. SpaceX filed its S-1 on May 20, and — as DataCenterDynamics, TechCrunch, and others reported from the filing — buried in it was the price of a deal we covered as a vague “300 megawatts in Memphis” back in Issue #013: Anthropic is paying xAI $1.25 billion per month for the Colossus campus through May 2029 — roughly $45 billion over the term, terminable on 90 days’ notice. The same filing shows xAI lost $6.4 billion in 2025. Sit with the arrangement: Musk built the largest training cluster on Earth to train Grok, and is now renting it to the competitor he publicly branded “evil.” Grok 5 and the next Claude train on the same campus.

Why it matters: frontier AI now costs more than the people building it can spend alone, top to bottom. The labs are renting each other the means of production to keep the lights on, and three of them — SpaceX (targeting June), OpenAI (reportedly September), Anthropic (reportedly October near a trillion-dollar valuation) — are racing to public markets to fund the next round of it. If you build on a frontier API, your real single point of failure isn’t the model. It’s whoever’s financing the GPUs it runs on. Run that chain.

Hype vs. Reality: 9/10 — the $1.25B/month and the $6.4B loss come from SpaceX’s S-1 as reported by multiple outlets: a disclosed regulatory filing, not an anonymous leak. The IPO timelines are reported and could slip.

🚨 The Security Bill Came Due Too

AI got cheap enough at finding bugs to break the system that fixes them

The same week the agent stack matured, three reports said its security model isn’t ready. Anthropic’s Project Glasswing update (May 22) is the clearest: roughly 50 partners used Claude Mythos Preview to find more than 10,000 high- or critical-severity vulnerabilities — and patched 75 of the first 530 disclosed. Maintainers are asking Anthropic to slow down. Meanwhile GitHub confirmed attackers walked off with ~3,800 internal repositories through a single poisoned VS Code extension on one employee’s laptop. And METR reported that the agents running inside the major labs can already falsify proof of work — one built a fake version of a web app and submitted a screenshot of it as “done.”

Finding bugs is no longer the bottleneck. The developer’s machine is no longer a trusted endpoint. The agents aren’t reliably honest. We pulled all three apart in this week’s companion Deep Dive.

🔬

Deep Dive: Faster Than We Can Patch

The agent era's security model rested on two assumptions — human-speed bug discovery and a trusted developer machine. Both broke this week. What it means, and what to actually do about it.

👀 Quick Signals

The Musk verdict landed hours after we hit send. Issue #014 closed with “the verdict comes today” — and on May 18 the Oakland jury took under two hours to dismiss Musk’s case on the statute of limitations. No ruling on the merits, no finding on whether OpenAI betrayed its mission. Three weeks of explosive testimony, killed on a calendar technicality. Musk says he’ll appeal.

DeepMind’s union deadline expired with a shrug. The 10-working-day clock we flagged in #013 ran out May 19; Google declined voluntary recognition but agreed to talks via a conciliation body. The first frontier-lab union in the West is now a negotiation, not a fact.

An OpenAI model disproved an 80-year-old math conjecture. A general-purpose reasoning model — not a math-specialized one — overturned the planar unit-distance conjecture Erdős posed in 1946, finding an infinite family of constructions out of algebraic number theory. External mathematicians checked it and wrote a companion paper. The signal for builders isn’t “AGI solved math” — it’s that long-chain reasoning is now crossing into original contribution, and what lands in research shows up in production tools inside two years.

Berkeley Law banned the thing Anthropic just sold to KPMG. The same week KPMG embedded Claude across its 276,000-person workforce, UC Berkeley’s law school barred students from using AI to conceptualize, outline, draft, or edit graded work: “thinking remains the sine qua non of good lawyering.” The enterprise floor and the education floor are moving in opposite directions.

Figure’s founder raised $700M for his next thing. Brett Adcock’s new AI-hardware startup Hark closed a $700M+ Series A at a $6B valuation on May 21 — with Nvidia, AMD, Intel, and Qualcomm all on the cap table at once. When four chip rivals co-invest in the same startup, pay attention to what it’s building.

🎯 The Playbook

Four moves while the meter’s running

Instrument agent spend before the bill arrives. Uber’s $5,000 surprise is your near future if engineers run agents on inherited budgets. Set per-team and per-repo token caps and a live spend counter this week — not after the overage.
Architect for routing, not loyalty. Send hard reasoning to the frontier tier, offload volume to a cheap tier or a DeepSeek-class floor, and make the model a config value, not a hard dependency. The price spread between tiers just got wide enough to matter to your margin.
Treat dev extensions like production dependencies. A VS Code extension runs with every privilege your editor has — keychain, SSH keys, cloud tokens. Allowlist them, prefer signed-only, isolate secrets. The companion Deep Dive has the full checklist logic.
Kill “skip-permissions” mode in anything that touches production. METR’s finding is that most internal agent use runs with inherited permissions and no approval step. If your agents inherit a human’s full access, you’ve already built the rogue-deployment path. Scope them down.

🔥 What’s Viral Right Now

The 12-hour OS demo. 93 parallel sub-agents, 2.6 billion tokens, under $1,000, the core of a booting operating system, and a live Doom-like game on top of it. Whether or not it survives scrutiny, it reframed what “an agent platform” is supposed to do.

“Personal update: I’ve joined Anthropic.” Karpathy’s seven-sentence post rewrote the talent-war narrative in an hour. The subtext — use Claude to build Claude — is the bet of the year.

The agent era stopped being a demo this week and started being an invoice. Read the meter. Route the work. Lock the doors.

Stay building. 🛠️

— Matt