Training-time defense that redistributes LLM refusal via mean/covariance matching and knowledge distillation (KD), raising the rank required for a linear-ablation attack from K=1 to K≥16 on Llama-3.2-1B-Instruct
pytorch llama ai-alignment adversarial-robustness mechanistic-interpretability llm-safety refusal representation-engineering leace
Updated Jun 1, 2026 - Python