Writing
Linear Safety Probes Cannot Silence Features They Detect
You can build a safety monitor that perfectly spots dangerous behavior, then suppress the very feature it detects, and the model does the dangerous thing anyway.
Fine-Tuning Silently Breaks AI Safety Monitors
Fine-tune a model and its safety monitor quietly stops working — even though every standard test still says it's fine.