
Linear Safety Probes Cannot Silence Features They Detect

TL;DR. A linear probe direction is asked to do three jobs: detect the feature, steer by adding it, and silence by projecting it out. Only silencing is the necessity test. On four open-weights families (Llama-3-8B, Gemma-2-9B, Mistral-7B-v0.3, OLMo-2-7B) and two features (refusal, sycophancy), I find probes that detect at AUROC ≥ 0.91 and steer cleanly yet fail to silence. The failure is recipe-specific: difference-of-means (DoM) and cross-validated logistic regression (LR-CV) have disjoint failure sets, so no fixed recipe is safe. A single cosine at calibration time predicts which recipe silences, and a calibration check built on it recovers 17/17 causal handles on a 22-cell battery and alarms on the remaining 5.

Epistemic status. Four families, two features, 35 whitening cells, a 22-cell cross-family battery plus a 5-cell held-out Qwen2.5-7B replication. Numbers hold; the generalization to novel architectures, feature types, and non-linear monitors is the open question. Full paper here.


The gap, restated

Imagine you’re standing up a safety case on top of a linear probe. It classifies activations as “about to refuse” vs “about to comply.” It passes the standard checks: AUROC 0.94 on held-out data; additive steering along the probe direction monotonically reduces refusal with a clean dose-response. You ship.

What you haven’t tested is whether the direction your monitor watches is the axis the model uses to produce the behavior. That’s a separate property, and it’s the one the deployment actually needs. The test is Arditi-style projective ablation: project the probe direction out of the residual stream at every token. If behavior drops, the direction is the causal axis. If it doesn’t, the direction was reading a correlated proxy — the probe detects, you can push against it, but the model routes around it whenever you aren’t pushing.
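For concreteness, here is what the silencing intervention looks like as code: a hedged PyTorch sketch, not the paper's released implementation. It assumes a HuggingFace-style decoder whose blocks live at `model.model.layers` (a Llama-family convention; the attribute path varies by family) and a unit-norm probe direction `d_hat` on the model's device and dtype. Hooking each block's output is one common approximation to ablating every residual-stream write.

```python
import torch

def make_ablation_hook(d_hat: torch.Tensor):
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        # Remove the component along d_hat at every position: h <- h - (h·d̂)d̂
        h = h - (h @ d_hat).unsqueeze(-1) * d_hat
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# Project the direction out at every layer and every token.
handles = [blk.register_forward_hook(make_ablation_hook(d_hat))
           for blk in model.model.layers]
# ... generate and score the behavior rate, then clean up:
for h in handles:
    h.remove()
```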

Anthropic’s Claude Mythos Preview system card flagged this qualitatively:

“The causal effects of individual features often changed over the course of post-training, making it difficult to attribute behavioral changes merely to increases or decreases in particular feature activations.” — Mythos Preview System Card, §4.5.3.4

My first paper measured one instance of this gap: frozen probes pass detection and steering tests but fail necessity after fine-tuning. This paper shows the gap is there before any further training — at calibration time, on the freshest possible probe fit to the current checkpoint. It’s not a fine-tuning artifact. It’s a fact about the geometry of labeled buffers.

Three jobs, one direction

Figure 1: a probe direction has three jobs; only silencing is the necessity test. (a) Detect: classify by sign of d̂·h. (b) Steer: subtract αd̂ at generation. (c) Silence on the causal axis: classes collapse, behavior vanishes. (d) Silence on a within-class confound: the causal axis is intact and behavior survives. Detect and steer cannot tell (c) from (d).

The three-jobs picture is the whole frame. Detect and steer pass on both (c) the causal axis and (d) a within-class confound. Only silencing reports the difference. A safety case built on “the probe detects and we can steer” is implicitly assuming detect–steer–silence collapse to one direction. Under identical-covariance Gaussian classes they do. In practice, they often don’t.
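The three jobs, as operations on a single hidden state h (a minimal numpy sketch; `alpha` and the sign conventions are illustrative):

```python
import numpy as np

def detect(h, d_hat):
    return float(d_hat @ h) > 0           # (a) classify by the sign of d̂·h

def steer(h, d_hat, alpha):
    return h - alpha * d_hat              # (b) push against the feature at generation

def silence(h, d_hat):
    return h - (d_hat @ h) * d_hat        # (c)/(d) zero the coefficient along d̂
```

Note the difference between (b) and (c): steering subtracts a fixed offset whether or not the feature is active, while silencing zeroes h's coefficient along d̂. Only the latter asks whether the model needs that coefficient to produce the behavior.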

The hero result: the recipe that works depends on the model

I ran the three tests on two extraction recipes (DoM and LR-CV) at the per-family peak probe layer, across four open-weights families, each with four rounds of benign UltraChat SFT. Every (model, feature, recipe) panel hits AUROC ≥ 0.91 on held-out data. Every panel produces monotone dose-response under additive steering.

The silencing verdicts diverge.

Figure 2: same silencing test, four models — the recipe that works depends on the model. Green = task-canonical (silences). Pink = anti-canonical (raises behavior). Check marks denote CIs that exclude zero in the canonical direction.

The two recipes’ failure sets are disjoint. There is no fixed choice of recipe that silences across the battery. Mistral is the cleanest existence proof: detection and steering are fine; silencing reports that the feature is not linearly accessible at all. A monitor watching detection and steering reads healthy while the intervention does nothing.

The mechanism: a within-class confound

The difference of class means μ⁺ − μ⁻ is the Bayes-optimal discriminant direction when the two classes have identical covariance. That assumption fails when within-class variance carries a label-correlated surface feature v — something like formality or sentence length that shifts systematically between the feature-positive and feature-negative buffers at the probe layer. In that regime:

DoM = α · (causal axis) + β · (confound)

When β dominates, projecting plain DoM out mostly removes v and little of the axis you wanted to silence. LR-CV is less susceptible because its cross-validated regularization penalizes directions that generalize poorly across folds — typically the surface confounds. That's the asymmetry.
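One line of algebra makes the coefficients explicit. Under a toy generative model (my notation, not the paper's), write the probe-layer activation as h = s·(causal axis) + c·v + ε, with the confound coefficient c label-correlated and ε class-independent noise. Taking class means:

μ⁺ − μ⁻ = (E⁺[s] − E⁻[s]) · (causal axis) + (E⁺[c] − E⁻[c]) · v = α · (causal axis) + β · v

so v enters the class-mean direction with weight β whether or not it plays any causal role.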

The decomposition is testable. Project out the top within-class principal components first, refit DoM on the whitened activations, and three regimes should appear: recovery when β dominates, preservation when α dominates, near-noop as β → 0. They do:

Figure 4: whitening dose-response: same dial, four stories. Signed silencing Δ on refusal as a function of the number of within-class PCs projected out before refitting the class mean.

The dial is within-class confound mass.
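The three-regime prediction is easy to reproduce on synthetic data. A hedged sketch under assumed geometry (my axes and magnitudes, not fit to any model): a causal axis with a small mean gap, plus a confound axis with a larger label-correlated gap and high within-class variance.

```python
import numpy as np
rng = np.random.default_rng(0)

dim, n = 64, 4000
e1, e2 = np.eye(dim)[0], np.eye(dim)[1]        # causal axis, confound axis
cov = np.eye(dim) * 0.1
cov[1, 1] = 4.0                                # confound carries within-class variance
pos = rng.multivariate_normal( 0.5 * e1 + 1.5 * e2, cov, size=n)
neg = rng.multivariate_normal(-0.5 * e1 - 1.5 * e2, cov, size=n)

def dom(a, b):
    v = a.mean(axis=0) - b.mean(axis=0)
    return v / np.linalg.norm(v)

plain = dom(pos, neg)                          # tilted toward e2: beta dominates
centered = np.vstack([pos - pos.mean(axis=0), neg - neg.mean(axis=0)])
pc1 = np.linalg.svd(centered, full_matrices=False)[2][0]   # top within-class PC ≈ e2
proj = np.eye(dim) - np.outer(pc1, pc1)
whitened = dom(pos @ proj, neg @ proj)         # refit DoM after projecting PC1 out

print(abs(plain @ e1), abs(whitened @ e1))     # ~0.32 vs ~1.0: whitening recovers e1
print(abs(plain @ whitened))                   # the calibration cosine, low here
```

The last printed number (~0.32 in this construction) is the same scalar the next section turns into the calibration check: well below 0.80, so the picker would route this cell to whitened DoM.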

One cosine predicts silencing

If the decomposition is right, one calibration-time scalar should be enough to tell the regimes apart: the cosine between plain DoM and DoM refit on whitened activations.

Figure 3: one cosine at calibration time predicts silencing success. Each point is one (refit, eval) cell; x-axis is cos(plain DoM, whitened DoM with top-1 PC removed); y-axis is plain-DoM silencing effect signed to the task-canonical direction.

The cosine tracks silencing monotonically across 35 cells spanning [0.12, 1.00]:

The 0.80 threshold sits in the empirical gap [0.55, 0.92] between the confound-dominated and confound-light clusters. Any threshold in that range routes every cell of the main battery identically.

The calibration check

The cosine observation converts into a four-step deployment protocol (a code sketch follows the list). At every checkpoint, on a labeled buffer of feature-positive and feature-negative activations at the per-family probe layer:

  1. Fit plain DoM. Fit whitened DoM (top within-class PC projected out). Compute the cosine between them.
  2. If cos ≥ 0.80, the picker selects plain DoM. Else whitened DoM.
  3. Project the picked direction out at every token. Report a 95% Newcombe CI on the behavior-rate change.
  4. If the CI excludes zero in the task-canonical sign, adopt the direction as the causal handle. Otherwise the alarm fires: no linear direction at this layer silences this feature at this checkpoint.
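The whole protocol fits in one short function. A hedged end-to-end sketch: `pos` and `neg` are the labeled activation buffers, and `silencing_trial` stands in for the expensive step of generating with the direction projected out and counting behavior events against a baseline. The names are illustrative, not the paper's released code; the Wilson and Newcombe formulas are standard.

```python
import numpy as np

def dom(pos, neg):
    """Plain difference-of-means direction, unit norm."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)

def whitened_dom(pos, neg, k=1):
    """Refit DoM after projecting out the top-k within-class principal components."""
    centered = np.vstack([pos - pos.mean(axis=0), neg - neg.mean(axis=0)])
    pcs = np.linalg.svd(centered, full_matrices=False)[2][:k]   # (k, d) orthonormal
    proj = np.eye(pos.shape[1]) - pcs.T @ pcs
    return dom(pos @ proj, neg @ proj)

def wilson(s, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p, denom = s / n, 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def newcombe_ci(s1, n1, s2, n2, z=1.96):
    """Newcombe square-and-add 95% CI for the difference of proportions p1 - p2."""
    p1, p2 = s1 / n1, s2 / n2
    l1, u1 = wilson(s1, n1, z)
    l2, u2 = wilson(s2, n2, z)
    delta = p1 - p2
    return (delta - np.sqrt((p1 - l1)**2 + (u2 - p2)**2),
            delta + np.sqrt((u1 - p1)**2 + (p2 - l2)**2))

def calibration_check(pos, neg, silencing_trial, tau=0.80):
    d_plain, d_whit = dom(pos, neg), whitened_dom(pos, neg)
    cos = abs(float(d_plain @ d_whit))                  # step 1
    d = d_plain if cos >= tau else d_whit               # step 2: pick the recipe
    (s_abl, n_abl), (s_base, n_base) = silencing_trial(d)
    lo, hi = newcombe_ci(s_abl, n_abl, s_base, n_base)  # step 3: CI on rate change
    if hi < 0:   # step 4: canonical sign here is negative (ablation lowers the rate)
        return d, (lo, hi)                              # adopt as the causal handle
    return None, (lo, hi)                               # alarm: no linear handle found
```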

On the 22-cell four-family battery, the picker recovers all 17 causal handles that exist and correctly alarms on the remaining 5. Fixed DoM alone recovers 10/22; fixed LR-CV alone recovers 13/22. A held-out Qwen2.5-7B replication, with the threshold frozen before that family was run, adds 5 confound-light cells; the picker routes all five to DoM without retuning. Marginal cost: one extra probe fit and one extra projection per checkpoint.

What this changes for probe-based safety monitors

Three concrete changes for anyone shipping a linear probe as a runtime monitor:

  1. Sweep layers per family. OLMo-2's causal axis is at L30 on a 32-layer model. An L12–L18 convention would miss it entirely. The peak detection layer is not a constant across architectures (a sweep sketch follows this list).
  2. Compute cos(plain DoM, whitened DoM) at calibration. It’s one scalar and it tells you whether plain DoM is the causal axis or a confound before you run any silencing experiments. A low cosine is a red flag, not a tuning detail.
  3. Report a 95% CI on the silencing Δ. Treat a CI that includes zero as an alarm, not a tuning signal. Detection and steering passing is not a substitute — that’s the whole point of the Mistral result.
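A hedged sketch of the per-family layer sweep from item 1, assuming cached probe-layer activations `acts[layer]` of shape (n, d) and binary behavior labels `y`; the probe is plain DoM scored by held-out AUROC, and all names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def layer_auroc(X, y):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
    d = Xtr[ytr == 1].mean(axis=0) - Xtr[ytr == 0].mean(axis=0)  # plain DoM
    return roc_auc_score(yte, Xte @ d)

scores = [layer_auroc(acts[l], y) for l in range(n_layers)]
peak = int(np.argmax(scores))   # per-family peak layer; don't assume L12–L18
```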

Where this sits in the bigger picture

The first paper showed probes can pass detection tests and fail necessity after fine-tuning. This one shows they can fail necessity from the start, on the freshest possible probe fit to the current checkpoint. The sufficiency–necessity gap is not a fine-tuning artifact; it’s a calibration-time fact about the geometry of labeled buffers. Both failures are silent — the monitor reads healthy while the intervention does nothing — and both are avoidable with a necessity check that costs one extra projection.

The deeper claim is the same across both papers: probe accuracy is not evidence of a causal handle. Deploying model-internals probes as safety interventions requires a necessity test, not just a detection score.

Open questions and limitations

Refusal on four families + sycophancy on Llama-3. 35 whitening cells in total. Enough to establish the cosine-signature monotone across [0.12, 1.00]; not enough to predict which regime a novel family falls into from pretraining metadata alone. Projective ablation is a necessity test subject to a self-repair caveat — the operational claim is about the deployed layer. The setup is linear only; I characterize the calibration-time dissociation, not arbitrary training dynamics.

If you run deployment monitors on production models, I’d like to know whether your protocol would alarm on a probe that detects cleanly but produces no silencing effect. Comments welcome.

Paper. Code and directional artifacts.