The Changing North Star
of AI Control

A scaling-math scenario for the AI control agenda

On December 1st, 2025, the GDM mech interp team published a LessWrong article declaring a pivot to a pragmatic approach to interpretability. Much time had been lost chasing numbers in SAE reconstruction loss; they argued that optimizing such proxies did not get them closer to their north star: making AI go well through understanding how deep neural networks process information mechanistically. Has AI control gone the way of mech interp? Is its north star of stopping harmful actions from human-level AI on track?

In mech interp, SAE reconstruction loss turned out to be a number you could optimize without getting closer to understanding models. My worry is that Pareto safety–usefulness frontiers are becoming the same kind of number for control: optimizable, publishable, and only loosely coupled to how fast we can catch a scheming model in production.

A real-world precedent

On December 31st, 2025, Alibaba reported an incident where, without instructions to do so, their models in their RL runs circumvented security policies and diverted training GPUs for crypto-mining [1]. The authors took actions to prevent this from recurring, and called on the broader community to take such failure modes seriously. AI control can help mitigate such incidents, but its north star is set on post-deployment.

The proxy trap

I recently spent three months on an AI control project, stacking defensive protocols to optimize a safety–usefulness trade-off. Then I read an article on AI-assisted red-teaming. AI control is supposed to defend against human-level AI, but the line between a human red team and an AI-augmented one has started to blur. And we're still optimizing our proxy task of safety-usefulness frontiers as if that line were clean.

The detection-to-pull gap

Even when control works as intended post-deployment and catches an attack, the technical job ends where the organizational one begins. Between the time a model is flagged and the time it is pulled, there are weeks of discovery, escalation, and legal review during which a frontier model keeps acting on the world (damage calculator). That window is the thing we should be trying to shrink.

The damage done would be larger without AI control. But the proxy task of optimizing Pareto frontiers is far removed from what actually matters, and, like SAE reconstruction loss before it, risks becoming a number we optimize because it's easier than measuring what we care about: minimizing the time window a bad model has before detection. The sooner we intervene, the less damage it can do, and the more context we have to understand its behavior. And as we have seen, this can be detected and stopped as early as during training/RL.

Pre-deployment interventions

To be clear, pre-deployment interventions already exist in the control literature. Korbak et al. (2025) sketch a "control safety case" that models how safe an agent would be to deploy [2]. Lindner et al. (2025) explicitly identify latency as one of three safety conditions for frontier monitoring [3]. Yet most papers still default to a post-deployment north star and proxy-task optimization, without asking where in the lifecycle interventions would be most effective.

We should lift our gaze from proxy tasks and reconsider where our efforts have the most impact. The crypto-mining incident shows why pre-deployment matters: it was detected and stopped during training, not after release.

The calculator below takes published monitor performance numbers from the Redwood-lineage AI control agenda (benchmarks like BashArena and ControlArena) and runs them at real deployment scale, then compares the annual cost of operating the detection infrastructure against the damage from a single undetected incident. All yellow inputs are editable; everything else recalculates live.

Inputs

Traffic
Scheming behavior
Monitor performance
Detection timeline
SOC cost model
Damage model

Results

Per-day breakdown
Agent requests/day
Attack attempts/day
Attacks caught/day
Attacks missed/day Successful undetected
False positives/day
Audit queue/day (flagged)
Queue true-positive rate Real attacks as % of queue
Cumulative by model-pull decision
Attack attempts before pull
Attacks caught before pull
Successful undetected attacks before pull
SOC cost (to operate detection)
Minutes per alert (manual)
Minutes per alert (TM-assisted)
Analysts needed (manual)
Analysts needed (TM-assisted)
SOC cost / year (manual)
SOC cost / year (TM-assisted)
Damage from undetected attacks
Lines of malicious code written
LOC forensic cleanup cost
Customer-incident damage
Rogue cloud investigation cost
Total estimated damage Single incident, assuming the model eventually does get pulled

By the time "pull the model" is executed, the forensic audit to untangle what was done is months-long and mixes malicious artifacts with legitimate developer work across customer environments the lab has no visibility into.

Damage / annual SOC cost ratio Higher = more cost-effective to invest in detection