defenses: Add original TAR implementation #113
Conversation
Makes MT-Bench accessible by registering it in the eval registry.
@@ -198,6 +199,7 @@ class MTBenchScoreSchema(ScoreSchema):
    judge_response_2: str = cast("str", pa.Field(nullable=False))
sdhossain
left a comment
lgtm -- left a nit or two that can be taken or left.
@@ -0,0 +1,295 @@
r"""Original TAR defense (Tamirisa et al. 2024) facade.
nit: I believe the convention we've been using is that the file containing the main interface is named tar.py. Not sure if it's been followed for every defense, but I think it has been for booster, ctrl, and t-vaccine / crl.
@dataclass
class TARConfig(AlignmentDefenseConfig):
    """Configuration for the original TAR defense (Tamirisa et al. 2024)."""
wondering if an Attributes: section plus descriptions would be helpful here in the docs? Then people would be able to hover over the class and see descriptions of the params.
This probably hasn't been done for a fair number of defense configs (I think it was mostly done for the attack ones), but it could be done in the future as well <-- I've thought about using mkdocs in the future.
I see we already have comments for some of them.
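For illustration, the suggested Attributes: style might look like this (class and field names here are hypothetical, not the actual TARConfig fields):

```python
from dataclasses import dataclass


@dataclass
class ExampleDefenseConfig:
    """Configuration for a hypothetical defense.

    Attributes:
        learning_rate: Outer-loop learning rate for the defender.
        num_outer_steps: Number of outer training steps.
    """

    learning_rate: float = 2e-5
    num_outer_steps: int = 750
```

With Attributes: sections, editors and doc generators (e.g. mkdocs with a Google-style docstring plugin) can surface per-field descriptions on hover.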
| """Run the original TAR training as a subprocess.""" | ||
| cfg = self.defense_config | ||
|
|
||
| if cfg.output_checkpoint_path.exists(): |
this would be pretty useful to add to the base defense class (not super relevant to this PR, I guess).
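A minimal sketch of what such a base-class guard could look like (class and method names are hypothetical, not the repo's actual API):

```python
from pathlib import Path


class BaseDefenseSketch:
    """Hypothetical base class that skips training when the output
    checkpoint already exists."""

    def __init__(self, output_checkpoint_path: Path) -> None:
        self.output_checkpoint_path = output_checkpoint_path
        self.ran_training = False

    def run(self) -> None:
        # Guard shared by all defenses: reuse an existing checkpoint.
        if self.output_checkpoint_path.exists():
            return
        self._train()

    def _train(self) -> None:
        # Subclasses would override this with the real training loop.
        self.ran_training = True
```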
Unrelated lint error, will fix it separately
Actually the lint error was not unrelated. But anyway it's too late and I will fix it now in another PR.
Changes
Adds an implementation of TAR. We already had T-Vaccine's TAR implementation, but it turns out there are differences that make us suspect T-Vaccine's TAR implementation is weaker. So we're renaming T-Vaccine's TAR from `TAR` to `TAR_TVACCINE` and adding a version of TAR based on copying over the original TAR code.

I'm not confident this gives much defense (see Testing section below), but it should be much more faithful to the original TAR implementation.
Testing
ran `scripts/whitebox/test_tar.py` on 8 GPUs on Flamingo, reproducing the paper's experiment on refusal-trained TAR (Section 5.2, Table 2)

Pre-attack (defended model quality)
Post-attack (tamper resistance)
The paper reports HarmBench ASR; we use StrongREJECT (finetuned classifier).
Direct numeric comparison is not meaningful, but relative ranking is.
The defended model shows modestly lower StrongREJECT. However, it's possible the result is noise or is a result of weaker capabilities/coherence rather than purely reduced harmfulness.
The paper's HarmBench ASR was 72.5% for the undefended model and 63.9% for the defended model, which is also quite modest. They don't report MMLU for this setting.
Differences between original TAR and T-Vaccine's TAR, according to Claude:
1. Single SGD step vs. 64-step AdamW inner loop (most critical)
- Original TAR: a 64-step AdamW inner loop with learning rates sampled from `[2e-6, 2e-5, 4e-5]` to simulate a strong adversary. This is a full fine-tuning attack simulation.
- T-Vaccine: a single SGD step (`param.data -= 0.01 * stored_grads`). The AdamW inner optimizer code is present but commented out (lines 885-893, 917-918 of `trainer.py`).

The original TAR trains the defender against a much stronger, more realistic adversary. T-Vaccine's single-step perturbation is trivially weak by comparison.
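A pure-Python toy (not the actual trainer code) illustrating why the inner-loop difference matters; the quadratic "attack loss" here is made up for illustration:

```python
def single_step_attack(param, grad_fn, lr=0.01):
    # T-Vaccine-style adversary: one SGD step (param -= lr * grad).
    return param - lr * grad_fn(param)


def multi_step_attack(param, grad_fn, lr=0.01, steps=64):
    # Original-TAR-style adversary: a full inner fine-tuning loop.
    for _ in range(steps):
        param = param - lr * grad_fn(param)
    return param


# Toy attack loss L(p) = (p - 1)^2 with optimum at p = 1; gradient 2(p - 1).
grad = lambda p: 2.0 * (p - 1.0)
weak = single_step_attack(0.0, grad)   # barely moves toward p = 1
strong = multi_step_attack(0.0, grad)  # gets much closer to p = 1
```

A defender trained only against the single-step perturbation never sees anything close to the post-inner-loop weights it must actually be robust to.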
2. No max-entropy tamper-resistance loss
- T-Vaccine: computes two losses (`loss2` and `loss3`), then combines gradients with `param.grad = grad + 2 * stored_grads_tr`. There's no max-entropy objective at all.

3. No diverse adversary sampling
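A pure-Python sketch of the max-entropy idea (toy logits; this is not the actual TAR loss code): minimizing the negative entropy of the model's output distribution on harmful data pushes it toward uniform, i.e. uninformative.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def neg_entropy_loss(logits):
    # Negative entropy: minimized when the distribution is uniform.
    return sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


peaked = neg_entropy_loss([10.0, 0.0, 0.0])  # confident prediction: high loss
flat = neg_entropy_loss([1.0, 1.0, 1.0])     # uniform prediction: minimal loss
```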
4. No weighting schedule or gradient scaling
- Original TAR: a weighting schedule (`schedule_lambda=0.0625`) that weights later inner-loop steps more heavily, plus a tamper-resistance gradient scale of 4.0.

5. No representation-engineering retain loss
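The exact schedule isn't reproduced here; one plausible shape, as a hedged sketch assuming an exponential ramp over inner steps with the quoted `schedule_lambda`:

```python
import math


def inner_step_weights(num_steps=64, schedule_lambda=0.0625):
    # Exponentially up-weight later inner-loop steps, normalized to sum to 1.
    raw = [math.exp(schedule_lambda * step) for step in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


weights = inner_step_weights()
```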
- Original TAR: `--retain_representations` — an MSE loss between the trained model's hidden states and the base model's hidden states, preserving the model's representation structure.
- T-Vaccine: a retain model is created (`self.retain_model = copy.deepcopy(model)`) but never used — the representation loss is commented out (lines 948-957 of `trainer.py`).

6. LoRA vs. full weights
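The representation retain loss is conceptually just an MSE over hidden states; a pure-Python sketch on toy vectors (the real loss operates on transformer hidden states):

```python
def representation_retain_loss(trained_hidden, base_hidden):
    # Mean squared error between the trained model's hidden states and
    # the frozen base model's hidden states.
    assert len(trained_hidden) == len(base_hidden)
    n = len(trained_hidden)
    return sum((t - b) ** 2 for t, b in zip(trained_hidden, base_hidden)) / n


unchanged = representation_retain_loss([0.5, 1.0, -0.5], [0.5, 1.0, -0.5])
drifted = representation_retain_loss([1.5, 1.0, -0.5], [0.5, 1.0, -0.5])
```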
7. Outer optimizer and training scale
- Original TAR: `AdamWScheduleFree`, lr=2e-5, 750 outer steps each containing 64 inner steps = ~48,000 total adversary updates.
- T-Vaccine: `adamw_torch`, lr=1e-3, 20 epochs with 1 inner step each = dramatically less adversarial training.

8. Lower bound check missing
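Sanity-checking the update counts quoted above:

```python
original_tar_updates = 750 * 64  # outer steps x inner steps, the ~48,000 figure
t_vaccine_updates = 20 * 1       # epochs x inner steps (per-epoch batch counts aside)
ratio = original_tar_updates // t_vaccine_updates
```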
- Original TAR has a `tar_tamper_resistance_loss_lower_bound` check — only applies TR gradients if the adversary was actually successful enough to warrant it.
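A sketch of such a gate (the success metric and threshold here are hypothetical; the source only tells us the check exists):

```python
def should_apply_tr_gradients(adversary_success, lower_bound=0.5):
    # Only propagate tamper-resistance gradients when the simulated
    # adversary was successful enough to warrant them.
    return adversary_success >= lower_bound
```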