ShiftDirection: Activation Steering Under Downstream Fine-Tuning

Philipp E. Glass, Allan Tucker, Yongmin Li, Alina Miron

“Does downstream fine-tuning undo embedded activation steering?”

Activation steering can modify a language model’s behaviour by intervening on its internal representations along a feature direction, and linear steering methods can be embedded directly into the model’s weights. However, it is unclear whether such embedded interventions persist when the model undergoes further training. We investigate the stability of embedded steering under routine, non-adversarial fine-tuning across five instruction-tuned models (3B–14B parameters), two training paradigms (SFT and RLHF), and two steering targets: refusal suppression through activation ablation and brevity induction through activation amplification. We find that behavioural preservation varies with the optimisation pressure exerted by training content: steering persists when training data does not contradict the steered behaviour, and degrades when it does. Mechanistically, however, the steering modification itself remains nearly intact in weight space, exhibiting under 2% vector recovery across all conditions, even where behaviour substantially reverts. This dissociation suggests that fine-tuning does not reverse the weight edit, but rather develops alternative pathways that reduce its downstream effect. Embedded steering thus appears durable but not unconditionally robust, and behavioural re-validation after downstream training remains necessary.
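For readers unfamiliar with weight-embedded steering, the sketch below illustrates the two interventions the abstract names: directional ablation folds the edit into a weight matrix as W ← (I − r rᵀ)W, removing the component along a unit steering direction r, while amplification applies W ← (I + α r rᵀ)W to boost it. This is a minimal illustration under stated assumptions, not the paper’s implementation: the function names are hypothetical, and `vector_recovery` is one plausible reading of the abstract’s “vector recovery” metric rather than its exact definition.

```python
# Hedged sketch: embedding linear steering into a weight matrix that writes
# to the residual stream, plus one plausible "vector recovery" measure.
# All names and the metric itself are illustrative assumptions.
import torch

def embed_ablation(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Remove the r-component from every output of W: W <- (I - r r^T) W.
    W has shape (d_model, d_in); r has shape (d_model,)."""
    r = r / r.norm()
    return W - torch.outer(r, r @ W)

def embed_amplification(W: torch.Tensor, r: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Boost the r-component of W's outputs: W <- (I + alpha * r r^T) W."""
    r = r / r.norm()
    return W + alpha * torch.outer(r, r @ W)

def vector_recovery(W_orig, W_steered, W_finetuned, r) -> float:
    """Fraction of the steering edit along r that fine-tuning undid:
    0.0 means the edit is fully intact, 1.0 means fully reverted."""
    r = r / r.norm()
    erased = r @ (W_orig - W_steered)         # r-component removed by the edit
    restored = r @ (W_finetuned - W_steered)  # r-component restored by training
    return float((restored @ erased) / (erased @ erased))

# Toy usage on a random layer: a fine-tune that barely touches the
# r direction scores near 0, matching the paper's "under 2%" finding.
d_model, d_in = 64, 64
W = torch.randn(d_model, d_in)
r = torch.randn(d_model)
W_steered = embed_ablation(W, r)
W_finetuned = W_steered + 0.01 * torch.randn_like(W)
print(vector_recovery(W, W_steered, W_finetuned, r))
```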
