What This Is

A five-round bench harness designed to test where local 30B-class models actually compete with frontier cloud models, and where they don’t. Within each round, every model gets the same brief, an identical prompt, and a fresh per-model working directory, with no iteration unless noted. This is the companion artifact to the Five Rounds Deep Dive.
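As a minimal sketch of the "fresh per-model working directory" rule, assuming one directory per model under the round directory: the function name `fresh_workdir` and the brief filename used in the usage note are hypothetical and not taken from the repo's scripts.

```python
import shutil
from pathlib import Path


def fresh_workdir(round_dir: Path, model: str, brief: Path) -> Path:
    """Create an empty working directory for one model and copy the brief into it.

    Hypothetical helper: the real runner scripts may lay things out differently.
    """
    workdir = round_dir / model
    if workdir.exists():
        shutil.rmtree(workdir)  # discard leftovers from any earlier run
    workdir.mkdir(parents=True)
    shutil.copy(brief, workdir / brief.name)  # every model sees the same brief
    return workdir
```

Calling `fresh_workdir(Path("round1"), "some-model", Path("round1/BRIEF.md"))` twice leaves only the brief in the directory the second time, so no model inherits files from a previous attempt.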

Each round tests a different shape of coding task:

Round | Task | What it measures
------|------|-----------------
R1 | Particle simulation, 60 FPS spec | Single-shot artifact generation under a hard quantitative constraint
R2 | One round of feedback to fix R1 | How models respond to explicit failure feedback
R3 | Lisp interpreter, 58-test pytest suite as spec | Semantic precision when the spec is the test suite
R4 | notespeak Go full-stack w/ SQLite FTS5 | On-distribution boilerplate generation with multi-file scope
R5 | Agentic debug: 10 planted bugs to find and fix | Multi-turn agent loop, local vs frontier head-to-head
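When the test suite is the spec, as in R3, a round's score is simply how many tests pass. The sketch below shows one way to pull those counts out of pytest's final summary line. The function name and the regular expression are assumptions for illustration; the repo's own eval script may score differently.

```python
import re

# Matches fragments such as "55 passed" or "3 failed" in pytest's summary line.
SUMMARY_RE = re.compile(r"(\d+) (passed|failed|errors?|skipped)")


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Return pass/fail/error/skip counts from a pytest summary line."""
    counts = {"passed": 0, "failed": 0, "error": 0, "skipped": 0}
    for number, label in SUMMARY_RE.findall(line):
        key = "error" if label.startswith("error") else label  # "error" or "errors"
        counts[key] += int(number)
    return counts
```

For a made-up summary line such as `===== 3 failed, 55 passed in 2.31s =====` (illustrative only, not a bench result), the function reports 55 passed and 3 failed.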

Contestants

Model | Role
------|-----
Qwen3-Coder-30B-A3B (Q5_K_M) | Local, coding-specialized MoE
Qwen3.6-27B (Q5_K_M) | Local, thinking, general flagship
Qwen3.5-35B-A3B (Q4_K_M) | Local, thinking MoE
Gemma 4 31B IT (Q5_K_M) | Local, dense (Google)
OpenAI Codex (gpt-5.3-codex) | Frontier control, R5 only, via Pi + ChatGPT Plus OAuth
Claude Sonnet 4.6 | Frontier control, R5 only, via claude -p + Claude Pro

All six models run at $0 marginal cost: the local models are free to run, and the frontier models are accessed through existing subscriptions.

What’s in the Repo

  • round{1,2,3,4,5}/ - per-round brief, eval artifact (tests/integration script), and per-model output directories
  • scripts/ - runner scripts that drove each round, reusable for new models
  • assets/ - screenshots, composited grids, the bench-wide scoreboard image
  • RESULTS.md - longer-form internal writeup with details that didn’t fit in the Deep Dive

Reproducing the Bench

The harness is built so that adding a new model to any round takes four steps:

  1. Add it to llama-swap’s config.yaml (or expose any OpenAI-compatible endpoint)
  2. Edit the MODELS array in the round’s runner script
  3. Create the per-model directory
  4. Run the script. Pass/fail counts and wall clock append to _logs/SCORES.txt.
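The steps above can be sketched as a small runner loop. This is an illustration under stated assumptions, not the repo's actual script: the model names, the `evaluate` callable (a stand-in for generation plus the round's eval), the location of `_logs/` under the round directory, and the tab-separated line format are all hypothetical.

```python
import time
from pathlib import Path
from typing import Callable

# Step 2: hypothetical placeholder names; the real list lives in each runner script.
MODELS = ["model-a", "model-b"]


def run_round(round_dir: Path, models: list[str],
              evaluate: Callable[[Path], tuple[int, int]]) -> None:
    """Evaluate each model and append one score line per model to the log."""
    log = round_dir / "_logs" / "SCORES.txt"  # assumed location of the score log
    log.parent.mkdir(parents=True, exist_ok=True)
    for model in models:
        workdir = round_dir / model
        workdir.mkdir(parents=True, exist_ok=True)  # step 3: per-model directory
        start = time.monotonic()
        passed, failed = evaluate(workdir)  # stand-in for generation + eval
        elapsed = time.monotonic() - start
        with log.open("a", encoding="utf-8") as fh:  # append, keep earlier runs
            fh.write(f"{model}\tpass={passed}\tfail={failed}\twall={elapsed:.1f}s\n")
```

Opening the log in append mode means a new model's result is added alongside existing entries, which matches the description of scores being appended rather than rewritten.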

Hardware: the bench was run on a workstation with an RTX 5090 (32 GB VRAM). A 3090 (24 GB) works with smaller quants. The full bench runs in under an hour end-to-end.
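For a rough sense of why a 24 GB card calls for smaller quants, the sketch below estimates weight size from parameter count and bits per weight. The bits-per-weight figures in the usage note (about 5.7 for Q5_K_M and about 4.85 for Q4_K_M) are ballpark values that vary by model, and the 2 GB headroom for KV cache and runtime buffers is an arbitrary assumption; neither comes from the bench itself.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if weights plus an assumed headroom fit in the given VRAM."""
    return weight_size_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb
```

Under these assumptions, `fits(31, 5.7, 32)` holds for a 31B dense model on a 32 GB card, while `fits(31, 5.7, 24)` does not and `fits(31, 4.85, 24)` does. Actual memory use depends on context length and runtime settings, so treat this only as a first check.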

Full setup, prerequisites, and per-round walkthroughs are in the repo README.