Writing

Linear Safety Probes Cannot Silence Features They Detect

You can build a safety monitor that perfectly spots a dangerous behavior, then use what it found to switch that behavior off, and the model does the dangerous thing anyway.

Fine-tune a model and its safety monitor quietly stops working — even though every standard test still says it's fine.