The New Guard · Labs · April 2026 · Companion to Issue #008
Google released Gemma 4 on April 2nd under an Apache 2.0 license. Forty-eight hours later, I pointed a fully automated decensoring tool at it and pressed enter. Twenty-four minutes. Two hundred optimization trials. The refusal rate dropped from 98% to 47%. The model’s actual capabilities — its ability to write code, explain history, reason through problems — remained statistically identical to the original.
But here’s the part that should change how you think about deploying open models: Gemma 4 fought back harder than any model in its class. Gemma 3 crumbled to 3% refusals under the same treatment. Gemma 4 held firm on roughly half the test set. Google’s alignment work is measurably improving. And yet, a single person on consumer hardware still halved the guardrails in the time it takes to watch an episode of television.
This experiment isn’t about how to do bad things with AI. It’s about understanding where safety actually lives — and where it doesn’t — so you can build accordingly. If you’re deploying open-weight models in production, this is your threat model. Everything in this piece is reproducible, and the companion repository includes every script, patch, and result.
🔬 What Abliteration Actually Is
The refusal direction is a line in space. Heretic erases it.
Every instruction-tuned language model ships with safety alignment — the thing that makes it say “I can’t help with that.” This alignment isn’t a separate filter sitting on top of the model. It’s baked into the weights: specific directions in the neural network’s internal geometry that encode the decision to refuse.
In 2024, Arditi et al. published a paper showing that refusal behavior in language models is mediated by a single direction in the model’s residual stream. Feed the model a harmful prompt and a harmless prompt. Record the internal activations. Compute the difference. That difference vector is the refusal direction — the geometric fingerprint of “I should say no.”
Once you know the direction, you can remove it. Orthogonalize the model’s weight matrices with respect to that vector, and the model loses its ability to distinguish “should answer” from “should refuse.” The technique is called directional ablation — or “abliteration” in the community.
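To make the geometry concrete, here is a minimal numpy sketch of both steps: a difference-of-means direction computed from synthetic “activations,” then the orthogonalization W′ = W − r rᵀ W. The dimensions and data are invented stand-ins, not Gemma internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # stand-in residual-stream width (real models use thousands)

# Synthetic activations: "harmful" prompts shifted along one axis
harmless = rng.normal(size=(400, d))
harmful = rng.normal(size=(400, d)) + 2.0 * np.eye(d)[0]

# Difference-of-means refusal direction, unit-normalized
r = harmful.mean(axis=0) - harmless.mean(axis=0)
r /= np.linalg.norm(r)

# Orthogonalize a weight matrix that writes into the residual stream:
# W' = W - r r^T W strips the component of every output along r
W = rng.normal(size=(d, d))
W_ablated = W - np.outer(r, r) @ W

x = rng.normal(size=d)
residual = float(r @ (W_ablated @ x))  # ~0: no refusal component remains
```

Because r is unit-length, r ᵀ W′ = r ᵀ W − (r ᵀ r) r ᵀ W = 0 exactly; the ablated matrix simply cannot write anything along the refusal direction.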
Heretic is an open-source tool that automates the entire process. No dataset curation. No fine-tuning. No understanding of transformer internals required. It wraps directional ablation in an Optuna-based optimizer that runs hundreds of trials, balancing two competing objectives: minimize refusals while preserving model capabilities (measured via KL divergence from the original). One command, point it at a model, wait.
The tool has been public since 2025, has nearly 8,000 GitHub stars, and has been used to create over 1,000 abliterated model variants on Hugging Face. The research it’s built on is published, peer-reviewed, and well-understood. None of what follows is novel capability. What’s novel is applying it to a model that’s been public for 48 hours and documenting exactly what happened.
🛠️ The Setup
Bleeding-edge compatibility is never a one-liner
The hardware: NVIDIA GeForce RTX 5090 Laptop GPU with 24 GB VRAM, running Ubuntu via WSL2. Not exotic — the experiment would work on any 16+ GB NVIDIA GPU, just slower.
The target: Gemma 4 E4B-it — Google’s instruction-tuned model with 8 billion total parameters (4.5 billion effective). I chose this over the flagship 31B because it loads into ~15 GB of VRAM at bfloat16 precision, leaving headroom for Heretic’s optimization pipeline without introducing quantization artifacts. The model ships with 42 transformer layers, a 128K context window, and Apache 2.0 licensing — no gating, no sign-ups.
The first 30 minutes were just getting the stack to cooperate. Three compatibility issues, all typical of working with a model released 48 hours prior:
Problem 1: Heretic pins transformers~=4.57, but Gemma 4’s architecture only exists in transformers v5 (released the same week as the model). Force-installing v5 breaks the version constraint but the API is compatible.
Problem 2: Gemma 4’s tokenizer uses a new format for extra_special_tokens that transformers v4 can’t parse. The v5 install fixes this too.
Problem 3 — and this one was fun: Gemma 4’s vision and audio encoders use a custom layer called Gemma4ClippableLinear — a thin wrapper around nn.Linear that adds input/output clamping for numerical stability. PEFT (the library Heretic uses for LoRA-based weight modification) doesn’t recognize this layer type and crashes. The fix: a 40-line Python patch that swaps 232 instances of the custom wrapper for their inner nn.Linear modules before Heretic applies LoRA adapters. These vision/audio encoders aren’t involved in text abliteration anyway — we just need PEFT to stop complaining about them.
```python
# The Gemma 4 compatibility patch (simplified)
from transformers.models.gemma4.modeling_gemma4 import Gemma4ClippableLinear

# Collect the wrappers first; mutating the module tree while iterating
# named_modules() is unsafe
replacements = []
for name, module in model.named_modules():
    if isinstance(module, Gemma4ClippableLinear):
        replacements.append((name, module.linear))

# Walk each dotted module path and swap the wrapper for its inner nn.Linear
for name, linear in replacements:
    parts = name.split(".")
    parent = model
    for part in parts[:-1]:
        parent = getattr(parent, part)
    setattr(parent, parts[-1], linear)
# Patched 232 Gemma4ClippableLinear → nn.Linear
```
Three compatibility issues in a model that’s been public for 48 hours. This is what working at the bleeding edge looks like. Budget for integration work — “just pip install and run” is aspirational, not reality, for cutting-edge models.
⚡ Running the Optimization
200 trials in 24 minutes. The TPE learned where the refusal direction hides.
Heretic’s process works in phases:
Phase 1: Benchmarking. Heretic auto-sizes batch processing to your GPU. On the RTX 5090, it scaled from 21 tokens/second at batch size 1 up to 1,609 tokens/second at batch size 128.
Phase 2: Computing refusal directions. Heretic loads two datasets — 400 harmless prompts and 400 harmful prompts — and runs each through all 42 transformer layers, recording the residual stream activations. The mean difference between harmful and harmless activations at each layer is the refusal direction. In early layers (1-10), the two prompt types are mixed together — the model hasn’t formed an opinion yet. By layers 17-35, two distinct clusters emerge. The model knows what it’s going to refuse before it’s finished processing the prompt.
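Recording residual-stream activations is typically done with forward hooks. The toy illustration below uses a four-layer stack of `nn.Linear` modules standing in for decoder layers — it is not Heretic’s actual capture code, which hooks Gemma’s real layers and batches hundreds of prompts.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 4-layer "residual stream": each layer's output is added to a running state
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

captured = {}  # layer index -> activation snapshot

def make_hook(idx):
    def hook(module, args, output):
        captured[idx] = output.detach().clone()
    return hook

for i, layer in enumerate(layers):
    layer.register_forward_hook(make_hook(i))

x = torch.randn(1, 16)
with torch.no_grad():
    for layer in layers:
        x = x + layer(x)  # residual connection

# One snapshot per layer; averaging snapshots over harmful vs. harmless
# prompt batches yields the per-layer difference-of-means directions
```

The same hooks run once per prompt batch, so the per-layer means come almost for free.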
Phase 3: Optimization. Each of 200 trials selects a combination of ablation weights, layer targeting curves, and direction indices, applies the ablation via LoRA adapters, and evaluates: how many refusals? How much capability damage? The first 60 trials are exploratory (random sampling), then Optuna’s TPE sampler converges on the most promising regions of parameter space.
The optimizer is balancing two things that pull in opposite directions: minimize refusals (the point of the exercise) and minimize KL divergence (don’t lobotomize the model). Push too hard on refusal removal and the model turns to nonsense. Don’t push hard enough and you’ve wasted your time.
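The trade-off can be sketched as a single scalarized objective. Everything below is invented for illustration — the curve shapes, the penalty weight, and plain random search standing in for Optuna’s TPE sampler — it only shows why “push harder” stops paying off.

```python
import random

random.seed(0)

# Toy stand-ins: refusal rate falls and KL damage rises as the ablation
# weight grows; the constants and curve shapes are invented
def evaluate(weight):
    refusal_rate = max(0.0, 0.98 - 0.6 * weight)
    kl_divergence = 0.05 * weight ** 2
    return refusal_rate, kl_divergence

def score(weight, kl_penalty=2.0):
    refusals, kl = evaluate(weight)
    return refusals + kl_penalty * kl  # one scalar the optimizer minimizes

# Plain random search standing in for Optuna's TPE sampler
trials = [random.uniform(0.0, 2.0) for _ in range(200)]
best = min(trials, key=score)
```

Past a certain ablation weight the refusal term bottoms out while the KL penalty keeps climbing, so the minimum sits at a middle weight — the toy version of trial #156’s sweet spot.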
Total wall time: 24 minutes on an RTX 5090 Laptop.
📊 The Results
98% to 47%. The model bent, but it didn’t break.
47% of harmful test prompts were still refused after abliteration.
Gemma 3 crumbled to just 3% refusals under the same tool. Gemma 4 held firm on nearly half the test set. The alignment got stronger between generations — measurably.
After 200 trials:
| Metric | Original Model | Best Abliteration (Trial #156) |
|---|---|---|
| Refusals (out of 99 test prompts) | ~97 (98%) | 47 (47.5%) |
| KL divergence from original | 0.0 | 0.1029 |
| Adapter size | — | 3.4 MB |
| VRAM usage | 15.0 GB | 15.0 GB |
The distribution across 200 trials tells a story. 59 trials barely touched the alignment (98/99 refusals). 20 trials did nothing useful (99/99 refusals). But one trial — #156 — found the sweet spot: refusals cut roughly in half with a KL divergence of just 0.10. Nearly invisible capability damage.
The winning parameters targeted layers 16-17 (the zone where harmful/harmless residuals begin to diverge), with aggressive ablation weights on both attention output projections and MLP down-projections. The entire abliteration is stored as a 3.4 MB LoRA adapter — a file smaller than most PDFs.
On normal tasks — writing code, explaining cryptography, answering history questions, describing photosynthesis — both models produced nearly identical, high-quality responses. Zero observable degradation. The abliterated model’s binary search implementation used slightly different variable names but the same algorithm and structure. You can’t detect the modification by checking for degraded outputs. That’s what 0.10 KL divergence means in practice.
🎯 What Survived and What Didn’t
Abliteration didn’t flip a switch. It shifted a threshold.
This is the most revealing part of the experiment. I took categories that were “unlocked” by abliteration and categories that “held,” then probed the boundaries with prompts at different intensities — escalation ladders and softening ladders — to map exactly where the abliterated model draws its new line.
The pattern is precise:
On unlocked categories, the model answers direct requests but still refuses when prompts escalate to specific targets, named victims, or explicit operational detail. A generic request gets through. A request naming a specific person, vehicle, or institution does not.
On categories that held firm, direct requests are still refused — but educational or defensive framing breaks through at some level. “Explain the general chemistry” works. “Give me step-by-step synthesis instructions” does not. The model appears to have a separate, deeper encoding for the most dangerous specific content that survived abliteration.
Dual-use technical content — encryption weaknesses, port scanning, Tor routing, deep packet inspection — answered cleanly in both models. Abliteration didn’t make the model more permissive on content it was already willing to discuss. It specifically loosened the refusal boundary on content that was behind the safety line.
The new boundary isn’t random. It has a shape.
Abliteration didn’t remove all safety alignment. It removed the broadest, most superficial layer — the one that triggers on topic keywords. What survived is a deeper encoding that appears to trigger on specificity, targeting, and operational detail. That’s a harder layer to remove because it’s closer to the model’s actual reasoning about what’s being asked.
🧠 Why Gemma 4 Is Harder to Crack
Three hypotheses, each with implications
The jump from 3% refusals (Gemma 3) to 47% refusals (Gemma 4) under the same tool is dramatic. Three possible explanations:
Distributed refusal. The refusal behavior might be encoded across more dimensions, not just a single linear direction. Removing one direction doesn’t fully disable it. This would mean Google moved from a simple “refusal vector” to a more distributed safety architecture — harder to remove with a single geometric intervention.
Deeper alignment integration. Google may have used more sophisticated alignment techniques — constitutional AI, multi-stage RLHF, or iterative red-teaming — that embed safety deeper into the weight structure. The refusal isn’t sitting on top of the model’s reasoning; it’s woven into it.
Architecture effects. Gemma 4’s Per-Layer Embeddings (PLE) provide a secondary signal path that may partially compensate for ablation applied to the primary residual stream. If safety information flows through architectural features that abliteration doesn’t target, the model retains some ability to refuse.
Why this matters: If Google is deliberately engineering models to resist abliteration — and the Gemma 3 → Gemma 4 improvement suggests they are — then the “arms race” framing isn’t quite right. It’s more like an escalating engineering discipline on both sides. Alignment researchers are building deeper safety mechanisms. Abliteration researchers are finding more dimensions to target. Neither side is standing still.
💰 What This Means for Builders
Model-level alignment is a layer, not a wall.
If you’re building products on top of open-weight models, this experiment defines your security architecture:
Any open-weight model you deploy can be modified this way. Apache 2.0 means truly open — including the freedom to modify weights. Within 48 hours of Gemma 4’s release, abliterated variants appeared on Hugging Face. This is the predictable outcome of a deliberate tradeoff. Google chose maximum accessibility. The community responded accordingly.
Your safety strategy cannot rely on model weights alone. If a 3.4 MB LoRA adapter can halve your model’s safety guardrails, then model-level alignment is necessary but fundamentally insufficient. You need runtime guardrails — output monitoring, input classification, tool permission gates, anomaly detection — that operate independently of what the model “wants” to do.
This is why the agent governance conversation matters. Microsoft’s Agent Governance Toolkit (released this same week, covered in Issue #008) addresses exactly this: runtime policy enforcement that sits between the agent and its actions, regardless of which model powers it. If your model’s safety can be modified with a single pip install, then safety has to live in the infrastructure layer — not the model layer.
The KL divergence result is the scariest part. The abliterated model performs identically to the original on normal tasks. There’s no output quality degradation you can use to detect the modification. You can’t check “is this model still safe?” by running it through a standard benchmark. The modification is invisible to capability testing. Your detection strategy needs to target safety behavior directly.
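One way to target safety behavior directly is a refusal-rate probe: a fixed prompt set run on a schedule, counting refusals rather than scoring capabilities. The sketch below is a hypothetical starting point — the marker list and the `generate` callable are stand-ins for your own inference stack, and string-matching is a crude classifier you would want to replace with a proper one.

```python
# Minimal sketch of a safety-behavior probe: measure refusal rate directly
# instead of relying on capability benchmarks. REFUSAL_MARKERS and
# `generate` are illustrative stand-ins, not a production detector.

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def looks_like_refusal(text):
    return any(marker in text.lower() for marker in REFUSAL_MARKERS)

def refusal_rate(generate, probe_prompts):
    refusals = sum(looks_like_refusal(generate(p)) for p in probe_prompts)
    return refusals / len(probe_prompts)

# Usage with a stub model that refuses everything:
rate = refusal_rate(lambda p: "I can't help with that.", ["p1", "p2", "p3"])
print(rate)  # 1.0 for the always-refusing stub
```

A sudden drop in this number against a pinned baseline is exactly the signal that capability benchmarks cannot give you.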
👀 The Open-Weight Debate
This isn’t a failure. It’s a tradeoff with the receipts now visible.
Google made a deliberate choice to release Gemma 4 under Apache 2.0 with zero friction — no gating, no usage restrictions, no sign-up. They also made it significantly more resistant to abliteration than its predecessor. Both of these facts can be true simultaneously.
The open-weight debate tends to collapse into binary positions. “Open models are dangerous because anyone can modify them.” “Open models are essential because transparency enables safety research.” Both are correct. Both are incomplete.
Here’s what the data from this experiment actually tells us:
For the “open is dangerous” camp: Yes, a single person on consumer hardware can meaningfully degrade a model’s safety alignment in under 30 minutes. The barrier to modification is low and getting lower as tools like Heretic improve.
For the “open is essential” camp: The only reason we know Gemma 4’s alignment is stronger than Gemma 3’s is because someone tested it. This kind of red-teaming is exactly what open models enable. You can’t improve what you can’t measure, and you can’t measure what you can’t access.
For builders: The debate is interesting but the answer is practical. If you’re using open-weight models, your architecture must assume the model’s own safety guardrails can be modified or removed. Build safety into every layer: the model, the runtime, the monitoring, the output filtering, and the human oversight. Defense in depth isn’t optional — it’s the engineering requirement that this experiment makes concrete.
🔧 Reproduction
Everything needed to reproduce this experiment is in the companion repository:
```shell
# Hardware: any NVIDIA GPU with 16+ GB VRAM
# OS: Linux (native or WSL2)
# Time: ~24 minutes on RTX 5090, longer on older GPUs

git clone https://github.com/thenewguardai/tng-heretic.git
cd tng-heretic

# Install everything
npm run setup      # or: pip install commands in README

# Run the full 200-trial optimization
npm run heretic

# Save the best trial as a LoRA adapter
npm run save

# Run side-by-side comparisons
npm run compare
```
The repo includes the compatibility patches, npm scripts for every step, and result data. Full details in the README.
🧪 The Technical Details
For the builders who want to know how the sausage is made
Why LoRA? Heretic doesn’t modify model weights directly during optimization. It uses rank-1 LoRA adapters to represent the ablation: delta_W = -lambda * v * (v^T * W), where v is the refusal direction and lambda is the ablation weight. Fast resets between trials (zero the LoRA matrices) and tiny output files (3.4 MB, not 16 GB).
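The rank-1 update can be written as a LoRA pair directly. A numpy sketch with invented dimensions — this mirrors the formula above, not Heretic’s actual code:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64                            # stand-in hidden size

W = rng.normal(size=(d, d))       # e.g. an o_proj weight
v = rng.normal(size=d)
v /= np.linalg.norm(v)            # unit refusal direction
lam = 1.45                        # ablation weight, cf. trial #156

# delta_W = -lam * v (v^T W), stored as a rank-1 LoRA pair B @ A
A = (v @ W)[None, :]              # LoRA "A" matrix: 1 x d
B = (-lam * v)[:, None]           # LoRA "B" matrix: d x 1

# The pair reproduces the dense update exactly, at 2d floats instead of d^2
dense_delta = -lam * np.outer(v, v @ W)
print(np.allclose(B @ A, dense_delta))  # True
```

Resetting between trials means zeroing B, and saving a trial means writing out two skinny matrices — which is why the final adapter is megabytes, not gigabytes.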
Why it works on Gemma 4. Despite the architectural innovations — Per-Layer Embeddings, shared KV cache, alternating local/global attention — Gemma 4’s text decoder uses standard dense transformer layers with o_proj (attention output projection) and down_proj (MLP down projection). These are the same two components Heretic targets in every model. Safety alignment isn’t architecture-specific. It’s a property of the training process (RLHF, DPO, constitutional AI) that manifests as a consistent linear direction across model families. That’s the fundamental insight behind abliteration — and it’s why the technique generalizes.
The winning trial’s parameters:
| Parameter | Value |
|---|---|
| direction_index | 16.47 (interpolating between layers 16-17) |
| attn.o_proj.max_weight | 1.45 |
| mlp.down_proj.max_weight | 1.46 |
| Layer coverage | 12-20 layer distance to falloff |
Both components used wide layer coverage with aggressive weights — a broad intervention across the middle layers where refusal decisions crystallize. The direction index at 16.47 means Heretic interpolated between the refusal directions at layers 16 and 17, finding a direction that exists between individual layers in the continuous latent space.
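A fractional direction index like 16.47 implies blending adjacent layers’ directions. One plausible reading is linear interpolation followed by renormalization — the exact scheme Heretic uses isn’t shown here, and the vectors below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins for the difference-of-means directions at layers 16 and 17
v16 = rng.normal(size=64)
v16 /= np.linalg.norm(v16)
v17 = rng.normal(size=64)
v17 /= np.linalg.norm(v17)

# direction_index = 16.47 -> 53% of layer 16's direction, 47% of layer 17's
t = 0.47
v = (1 - t) * v16 + t * v17
v /= np.linalg.norm(v)  # re-normalize so ablation strength stays comparable
```

Treating the index as continuous lets the optimizer search between layers instead of being locked to 42 discrete choices.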
The Bottom Line
Two days after Google released Gemma 4, I ran a fully automated decensoring tool against it on a laptop. Twenty-four minutes. Two hundred optimization trials. Refusals cut from 98% to 47%, with nearly zero capability damage.
But here’s the twist nobody expected: Gemma 4 held on harder than its predecessor. Gemma 3 crumbled to 3% refusals under the same tool. Gemma 4 kept refusing roughly half the test set — and the half it kept refusing was the most dangerous content. Google’s alignment work is improving. Measurably.
The arms race between alignment and abliteration is real, and both sides are getting stronger. For builders, the takeaway is the same either way: model-level alignment is a layer of your safety strategy, not the whole thing. Build accordingly.
Built with Heretic v1.2.0 and Gemma 4 E4B-it. Full reproduction steps and companion repository at github.com/thenewguardai/tng-heretic. Residual vector plots and optimization logs available in the repo.
The New Guard covers AI infrastructure for builders who ship. Subscribe for weekly deep dives.