Optimal Affine Activation Steering Methods for Unlearning

Ross Baker

Rising social and legal standards for data privacy have drawn recent attention to machine unlearning techniques. In this paper, we explain a variety of affine activation steering methods for targeted knowledge removal and propose three methods of our own. We then compare their effects on transformers in a toy model designed to test both recall and inference of underlying token patterns.
