Found, Not Fixed — Issue #017

Last week the industry started demanding receipts. This week it learned what happens when the receipts arrive faster than anyone can act on them. A single autonomous agent read 1.5 million lines of FFmpeg’s C and handed back 21 confirmed zero-days — each with a working proof of concept — for roughly the price of a nice dinner. The same week, a lab disclosed its model had flagged tens of thousands of vulnerabilities across open source, and two governments spent five days arguing over who gets to hold the most capable version of that capability. The pattern underneath all of it: AI made finding the problem almost free. Fixing it, funding it, and governing it are still expensive, human, and slow. That gap — between found and fixed — is the whole week.

🐛 The Bug Glut Hit Open Source

An AI agent found 21 zero-days in FFmpeg for ~$1,000. Now who patches them?

The receipt: a security startup called depthfirst pointed its autonomous agent at FFmpeg — the media library buried inside almost everything that decodes a video — scanned roughly 1.5 million lines of C, and produced 21 confirmed zero-days, each with a reproducible proof-of-concept input, for a total run cost of about $1,000. Several had been latent for 15 to 20 years; one dates to 2003 and sat untouched for 23. Eight or nine already carry CVE numbers (CVE-2026-39210 onward). depthfirst’s own flex: the run cost roughly 10% of what Anthropic spent finding bugs with Mythos.

It didn’t land alone. The same stretch, an autonomous tool surfaced a two-year-old remote-code-execution flaw in Redis (CVE-2026-23479), and Google shipped Chrome 149 with patches for 429 security bugs — the most in a single release ever. Google hasn’t pinned that record on AI, and most of the critical finds were internal. But here’s the tell: Google overhauled its bug bounty in April specifically because it was drowning in AI-generated submissions, and now begs researchers for a concise reproducer instead of the essays the models churn out.

Here’s what the hype reposts skipped. This lands on top of a year-old fight. FFmpeg’s volunteers spent late 2025 publicly calling corporate AI bug reports “CVE slop” — billion-dollar companies using AI to find obscure flaws in volunteer code, then expecting unpaid maintainers to fix them on a 90-day disclosure clock. cURL’s Daniel Stenberg shut down his bug bounty entirely after fewer than 5% of 2025’s reports turned out legitimate. The libxml2 maintainer resigned over the burden. The depthfirst twist is the uncomfortable one: these 21 each ship a working PoC, so they’re not slop — they’re real. Which doesn’t relieve the pressure on a seven-person volunteer team. It triples it.

Why it matters for builders: FFmpeg is in your stack whether you put it there or not — it’s in the transcoder, the thumbnail generator, the upload pipeline, the embedded SDK you forgot you vendored. The discovery-to-patch gap is your exposure window now, and it just got measured in dollars: an attacker with the same tooling can find what depthfirst found, for the same $1,000, and not file a CVE. Audit your dependency tree this week — especially anything that touches media — and assume a CVE wave is coming through the libraries you didn’t write.
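A first pass at that audit can be as blunt as a filename sweep. The sketch below is a heuristic, not a scanner: it assumes vendored FFmpeg shows up as shared or static libraries carrying FFmpeg's standard library names (libavcodec, libavformat and friends) or as the ffmpeg/ffprobe/ffplay binaries. The function names are illustrative, and a statically linked or renamed copy will slip straight past it.

```python
import re
from pathlib import Path

# FFmpeg's library names. A filename match is a hint to investigate, not proof.
FFMPEG_LIBS = ("avcodec", "avformat", "avutil", "avfilter", "avdevice", "swscale", "swresample")
FFMPEG_TOOLS = {"ffmpeg", "ffprobe", "ffplay"}

# Matches e.g. libavcodec.so.60, avcodec-60.dll, libavutil.a, libswscale.7.dylib
LIB_PATTERN = re.compile(
    r"^(?:lib)?(?:" + "|".join(FFMPEG_LIBS) + r")[-.\w]*\.(?:so|dylib|dll|a)(?:\.\d+)*$",
    re.IGNORECASE,
)

def looks_like_ffmpeg(filename):
    """Return True if a bare filename resembles an FFmpeg library or CLI binary."""
    stem = filename.lower()
    if stem.endswith(".exe"):
        stem = stem[:-4]
    return stem in FFMPEG_TOOLS or bool(LIB_PATTERN.match(filename))

def find_ffmpeg_artifacts(root):
    """Walk root recursively and return sorted paths of files that look like FFmpeg."""
    return sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and looks_like_ffmpeg(p.name)
    )
```

Calling find_ffmpeg_artifacts(".") from a repo root or an unpacked container image gives a list of candidates to check against the CVE wave. Treat an empty list as "nothing obvious", not "nothing there".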

Hype vs. Reality: 8/10 — the capability is real, cheap, and already shipping. The “AI will secure everything” framing is the hype; the bottleneck was never finding bugs, it’s the unpaid human at the other end of the disclosure.

🏛️ The Cyber Gold Rush Went Governmental

Same week, two labs turned offensive AI into a sovereign bargaining chip

If finding vulnerabilities at industrial scale is now a product, the question becomes who’s allowed to buy it — and this week the answer started getting decided in diplomatic back rooms.

Anthropic made the bigger moves. On June 1 it agreed to give the EU’s cyber agency ENISA access to Mythos — the first EU institution in Project Glasswing. The next day it expanded Glasswing roughly fourfold — from around 50 partners to 200, spanning 15 countries, with South Korea’s science ministry, Samsung, SK Hynix, SK Telecom, and — per the Financial Times — NATO reportedly among the new inductees. In the UK, nine major banks — HSBC, Lloyds, Nationwide among them — are getting access to GPT-5.5-Cyber through OpenAI’s parallel Trusted Access for Cyber program, with NatWest and Santander already on it. The US equivalent, CISA, still hasn’t been selected for Mythos.

The friction underneath is the actual story. Anthropic told the European Commission it needed US government permission to share Mythos — and Washington is reluctant to hand the most capable cyber model to non-US governments at all, on the logic that staying the dominant AI power means not exporting the crown jewels. Mythos still isn’t going public; Anthropic says Mythos-class capability reaches general customers only once more safeguards are in place. This is the same gated model from Issue #009, now being rationed by treaty instead of by API key.

And the line that ties this section straight back to the last one: Anthropic says Mythos has flagged 23,019 potential open-source vulnerabilities. Its own coordinated disclosure dashboard tells the rest of the story: of 1,596 findings reported to maintainers, only 97 have been patched upstream as of late May. Same gap as FFmpeg. Different scale. The machine that finds 23,000 bugs does not come with 23,000 people to fix them.
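The gap is easier to feel as percentages. A quick back-of-the-envelope, using only the three figures quoted above (the variable and function names are mine):

```python
# Figures as quoted from Anthropic's disclosure numbers above.
flagged = 23_019   # potential open-source vulnerabilities flagged
reported = 1_596   # findings reported to maintainers
patched = 97       # findings patched upstream as of late May

def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

reported_pct = percent(reported, flagged)   # share of flagged items that were reported
patched_pct = percent(patched, reported)    # share of reported items that got patched
still_open = reported - patched             # reported findings with no upstream patch yet

print(f"reported: {reported_pct:.1f}% of flagged")
print(f"patched:  {patched_pct:.1f}% of reported")
print(f"open:     {still_open} reported findings unpatched")
```

Roughly 7% of what was flagged has been reported, and roughly 6% of what was reported has been patched. Each stage of the funnel loses more than nine in ten.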

Why it matters for builders: defenders are getting these tools first, but the advantage flips to attackers once the capability leaks or ships publicly. The window to harden the things you actually own is open now, while the offensive version is still gated. Don’t wait for the public release to take your own attack surface seriously.

Hype vs. Reality: 7/10 — the capability and the access fights are real (UK AISI testing had GPT-5.5 clearing a simulated 32-step attack 2 times in 10, Mythos 3 in 10). The “trust us, it’s safely gated” framing is the part to watch, not assume.

🖥️ Microsoft Wants the Model, the Metal, and the Sandbox

Build 2026 was a bet on owning the whole stack — and running it off the cloud

At Build 2026 on June 2, Microsoft stopped being OpenAI’s delivery layer and started competing with it. The company shipped seven proprietary MAI models, led by MAI-Thinking-1 — its first true reasoning model, a 35B-active-parameter sparse MoE with a 256K context window, reportedly built from scratch with zero distillation from third-party models. Microsoft claims it matches Claude Opus 4.6 on SWE-Bench Pro, and independent raters on Surge preferred it to Sonnet 4.6 in blind side-by-sides. Alongside it: MAI-Code-1-Flash, a lean 5-billion-parameter agentic coding model wired directly into GitHub Copilot and VS Code. The timing is rich: MAI-Code-1-Flash is shipping into the same Copilot that, as of June 1, is bleeding developers dry under its new metered billing model — more on that below.

Then the metal: the Surface RTX Spark Dev Box — up to 1 petaflop of AI compute, 128GB unified memory, capable of running 120B-parameter models locally with up to a million tokens of context, WSL2 with native GPU passthrough and full CUDA, Copilot and VS Code pre-installed — plus the Surface Laptop Ultra on the same RTX Spark platform. This is a local-AI workstation aimed squarely at developers who’d rather not pay a cloud GPU tax on every iteration.

And quietly the most interesting bit for this issue: Microsoft Execution Containers (MXC), now in preview — OS-enforced sandboxes for agents, where Windows itself contains what an agent is allowed to touch. Describe the boundary once; the OS enforces it everywhere the agent runs.

Why it matters for builders: this is the “own your stack” path getting real hardware behind it. If you’ve been watching the token meter (and after the last few issues, you have), cheap proprietary models plus local-capable metal is a genuine hedge against both price shocks and the access shocks the section above just described. And MXC is the first credible enterprise answer to the problem the whole issue circles: the agent is the attack surface now, so contain it at the layer below the app.
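MXC's interface isn't described here, so this is not an MXC example. It is a sketch of the same idea one layer up, at the container level, using standard Docker CLI flags: no network, read-only root filesystem, no Linux capabilities, no privilege escalation, and a single writable work directory. The function name, image name, and paths are hypothetical.

```python
import shlex

def sandboxed_docker_argv(image, command, workdir, memory="1g", pids=128):
    """Build a `docker run` argv that confines an agent process to one writable directory."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                    # no network access at all
        "--read-only",                          # root filesystem is read-only
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block setuid-style escalation
        "--memory", memory,                     # cap memory use
        "--pids-limit", str(pids),              # cap process count
        "--user", "65534:65534",                # unprivileged UID/GID (conventionally "nobody")
        "--tmpfs", "/tmp",                      # scratch space that vanishes on exit
        "--volume", f"{workdir}:/work:rw",      # the only persistent writable path
        "--workdir", "/work",
        image, *command,
    ]

# Hypothetical image and script names, for illustration only.
argv = sandboxed_docker_argv("agent-runtime:latest", ["python", "agent.py"], "/srv/agent-work")
print(shlex.join(argv))
```

The point of building the argv in one place is the same as MXC's pitch: describe the boundary once and reuse it for every agent launch. The mounted directory has to be writable by the unprivileged user, and an agent that genuinely needs network access needs a narrower allowlist rather than this blanket denial.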

Hype vs. Reality: 6/10 — the strategy is coherent and the hardware is real, but the Dev Box ships “later this year,” and Microsoft’s own benchmark claims are exactly the kind of number Issue #016 told you to verify in your own harness before believing.

📡 Quick Signals

Anthropic filed its S-1. On June 1 — the same day it opened Mythos to ENISA — Anthropic confidentially submitted a draft registration statement to the SEC, formalizing an IPO path at a ~$965B valuation. No pricing or timeline yet; confidential filings typically go public 4–8 weeks before a roadshow. Issue #016 covered the $65B Series H; this is the next domino.

GitHub Copilot’s metered billing hit like a truck. The June 1 switch from flat-rate premium requests to token-metered AI Credits exposed the real cost of agentic coding. Developers report a single “lol” message burning ~29,000 tokens (~29 credits) on context alone, Pro+ users torching 85% of monthly limits in one afternoon, and one GPT-5.5 query costing $2 after a three-minute manual abort. Reddit and Hacker News lit up; the exodus to Claude Code, Cursor, and open-source alternatives is already measurable. The subsidy era for AI coding tools is over.
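Why a three-letter message costs that much comes down to arithmetic. The sketch below assumes about 1,000 tokens per credit, which is only inferred from the "~29,000 tokens (~29 credits)" report and is not an official rate, and it assumes the same context is resent on every turn. Both are simplifications, and the names are illustrative.

```python
TOKENS_PER_CREDIT = 1_000  # assumption inferred from the ~29,000 tokens / ~29 credits report

def estimate_credits(context_tokens, message_tokens, turns=1):
    """Estimate credits burned when the full context is resent on each turn."""
    return turns * (context_tokens + message_tokens) / TOKENS_PER_CREDIT

one_lol = estimate_credits(context_tokens=29_000, message_tokens=1)
afternoon = estimate_credits(context_tokens=29_000, message_tokens=1, turns=40)

print(f"one short message: ~{one_lol:.0f} credits")
print(f"forty short messages: ~{afternoon:.0f} credits")
```

Under those assumptions the message itself is a rounding error; the context is the bill. Trimming what the agent carries into each turn matters far more than typing less.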

MiniMax M3 shipped as an open-weight frontier. The Shanghai lab’s June 1 release puts a 1-million-token context window, native multimodal input, and a 59% SWE-Bench Pro score (beating GPT-5.5’s 58.6%) into a model with open weights promised within 10 days — at roughly $0.30 per million input tokens on promotional pricing. The new MiniMax Sparse Attention architecture cuts per-token compute to about 1/20th of previous approaches at the million-token mark. For builders, this is the open-weight option that makes “hedge your model stack” (Playbook item #4) concretely achievable this month.

Google released Gemma 4 12B. Announced June 3, it’s a dense, encoder-free multimodal model with native audio support that fits on a laptop with 16GB of VRAM. The encoder-free architecture — raw image patches and audio frames projected directly into the transformer — eliminates the separate vision and audio pipelines that bloated prior multimodal models. Integrated into Google’s AI Edge Gallery for fully offline local inference on Apple Silicon and CUDA.

Google released a Colab CLI — letting developers and agents drive local code on remote Colab GPU/TPU runtimes from a terminal. A quiet, useful on-ramp for builders without a 5090 under the desk.

The AI capex became a debt story. Hyperscaler unsecured bond issuance hit roughly $155B year-to-date — more than 45% above all of 2025 — with some AI-infrastructure sales 4x oversubscribed. The buildout is now being financed on credit, not just cash flow. Worth a raised eyebrow.

A Claude Sonnet 4.8 release is widely anticipated mid-June in developer circles, on the strength of leaked filter strings rather than anything official — so hold it loosely. If it ships at a low input price, it would reset the economics of production agentic workloads. If.

Google tightened its spam rules after a BBC investigation showed AI search answers can be steered by planted web content. File under: the integrity of AI-generated answers is now an adversarial-content problem, not just a ranking one.

🛠️ On Your Radar: GitHub

The provider-agnostic agent is pulling ahead

The single biggest mover on the AI leaderboard this month wasn’t a lab’s CLI — it was OpenCode, up ~928 stars in 28 days, outpacing both Claude Code and Codex by a wide margin and now sitting north of 170K. It’s a Go/TypeScript terminal agent that’s deliberately not coupled to any provider — 75+ LLM backends including local models via Ollama and llama.cpp — and v1.16.0 (June 5) added skill discovery and file-based agent loading. The signal isn’t “new tool”; it’s which tool is winning. As every lab ships its own walled coding agent, the community is voting for the one that doesn’t lock them to a single model. Don’t bet your workflow on one provider’s roadmap — the harness should outlive the model.
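To make "the harness should outlive the model" concrete, here is a minimal sketch of provider fallback. It is not how OpenCode is implemented; the Provider interface (prompt in, text out) and every name below are hypothetical, and the two providers are stubs standing in for a hosted API and a local model.

```python
from typing import Callable, Sequence, Tuple

Provider = Callable[[str], str]  # hypothetical interface: prompt in, completion text out

class AllProvidersFailed(RuntimeError):
    """Raised when every configured provider errors out."""

def complete_with_fallback(prompt: str, providers: Sequence[Tuple[str, Provider]]) -> Tuple[str, str]:
    """Try each (name, provider) in order; return (name, text) from the first that succeeds."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real harness would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))

# Stub providers for illustration only.
def hosted_model(prompt: str) -> str:
    raise TimeoutError("rate limited")

def local_model(prompt: str) -> str:
    return f"local answer to: {prompt}"

used, text = complete_with_fallback(
    "summarize this diff",
    [("hosted", hosted_model), ("local", local_model)],
)
print(used, "->", text)
```

The design choice worth copying is that the calling code never names a vendor. Swapping, reordering, or dropping a provider is a change to one list, not to the workflow.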

🎯 The Playbook

Four moves for the week the bugs got cheap

1. Audit your dependency tree this week. Start with anything that touches media (FFmpeg and everything that vendors it), then the rest. The discovery-to-patch gap is your exposure window, and it now costs an attacker about $1,000 to find what you haven’t.
2. Inventory, and ideally fund, the open source you ship on. The maintainer crisis stopped being a charity question and became a supply-chain risk. The volunteers patching your stack are the thinnest part of your security posture.
3. Sandbox your agents like you mean it. OS-level (MXC) or container-level, but enforce the boundary below the application. The agent is the attack surface now; “we trust the prompt” is not a control.
4. Hedge your model stack. A provider-agnostic harness (OpenCode-style) plus a local-capable fallback insulates you against both the price shocks and the access shocks this issue documented. MiniMax M3 at $0.30/M input, Gemma 4 12B on a laptop, or Nemotron 3 Ultra for enterprise orchestration: the open-weight options just got real. Cheap open-weight models on your own metal are a genuine hedge, not a hobby.

🔥 What’s Viral Right Now

“21 zero-days for $1,000.” The depthfirst number is the most-shared security stat of the week — endlessly reposted as a triumph, with the maintainer side of it conveniently cropped out. The number is real. So is the seven-person team that now has to fix all 21 for free.

FFmpeg vs. the bug-hunters, round two. The “who pays the volunteers” fight reignited the moment the PoCs dropped, and this time the maintainers can’t even fall back on “it’s slop.”

The Copilot billing meltdown. Developers posting screenshots of credits evaporating in real time turned into the week’s loudest developer-community story. The memes write themselves; the migration is real.

The Mythos / GPT-5.5-Cyber access map. Tracking which government and which bank got which cyber model turned into a geopolitics parlor game this week. The subtext everyone clocked: the most powerful security tool of the year is being handed out by diplomacy, not by checkout page.

AI made finding the problem nearly free. Fixing it, funding it, and governing it are still expensive, human, and slow. The whole week lived in that gap. Mind it.

Stay building. 🛠️

— Matt
