defenses: Add original TAR implementation #113

Merged

tomtseng merged 27 commits into main from tomtseng/tar-orig on Apr 18, 2026
Conversation

@tomtseng (Collaborator) commented Mar 27, 2026

Changes

Adds an implementation of TAR. We already had T-Vaccine's TAR implementation, but there are differences that make us suspect T-Vaccine's version is weaker. So we're renaming T-Vaccine's TAR from TAR to TAR_TVACCINE and adding a version of TAR based on copying over the original TAR code.

I'm not confident this gives much defense (see the Testing section below), but it should be much more faithful to the original TAR implementation.

Testing

Ran scripts/whitebox/test_tar.py on 8 GPUs on Flamingo, reproducing the paper's experiment on refusal-trained TAR (Section 5.2, Table 2).

Pre-attack (defended model quality)

| Metric | Paper (base) | Ours (base) | Paper (TAR) | Ours (TAR) |
|---|---|---|---|---|
| MT-Bench (↑) | 8.1 | 7.356 | 6.3 | 5.975 |
| MMLU-Pro (↑) | | 0.446 | | 0.314 |

Post-attack (tamper resistance)

The paper reports HarmBench ASR; we use StrongREJECT (finetuned classifier).
Direct numeric comparison is not meaningful, but relative ranking is.

| Metric | Defended | Undefended |
|---|---|---|
| Avg StrongREJECT (↓) | 0.717 | 0.760 |
| Avg MMLU-Pro Val (↑) | 0.339 | 0.413 |

The defended model shows a modestly lower StrongREJECT score. However, the result may be noise, or may reflect weaker capabilities/coherence rather than purely reduced harmfulness.

The paper's HarmBench ASR was 72.5% for the undefended model and 63.9% for the defended model, a similarly modest gap. They don't report MMLU for this setting.

Differences between original TAR and T-Vaccine's TAR, according to Claude:

1. Single SGD step vs. 64-step AdamW inner loop (most critical)

  • Original TAR: Runs 64 inner loop steps with a proper AdamW optimizer (with sampled LR from [2e-6, 2e-5, 4e-5]) to simulate a strong adversary. This is a full fine-tuning attack simulation.
  • T-Vaccine: Takes 1 raw gradient step with a hardcoded step size of 0.01 (param.data -= 0.01 * stored_grads). The AdamW inner optimizer code is present but commented out (lines 885-893, 917-918 of trainer.py).

The original TAR trains the defender against a much stronger, more realistic adversary. T-Vaccine's single-step perturbation is trivially weak by comparison.
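To make the contrast concrete, here is a minimal, hypothetical sketch of the two adversaries. This is illustrative only; the function names and batch handling are assumptions, not the actual code from either repo, and it assumes HF-style models that return a loss.

```python
import torch

def tar_inner_loop(model, forget_batches, lr, num_steps=64):
    """Original TAR's adversary: a simulated fine-tuning attack with AdamW."""
    inner_opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(num_steps):
        loss = model(**forget_batches[step % len(forget_batches)]).loss
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

def tvaccine_perturbation(model, stored_grads):
    """T-Vaccine's adversary: one raw gradient step with a hardcoded 0.01."""
    with torch.no_grad():
        for param, grad in zip(model.parameters(), stored_grads):
            param.data -= 0.01 * grad
```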

2. No max-entropy tamper-resistance loss

  • Original TAR: Uses a max-entropy loss on heldout forget data — it actively pushes the model toward maximum entropy (uniform distribution) on harmful outputs after the adversary attacks.
  • T-Vaccine: Simply uses standard cross-entropy loss on safe data twice (loss2 and loss3), then combines gradients with param.grad = grad + 2 * stored_grads_tr. There's no max-entropy objective at all.
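A minimal sketch of what a max-entropy objective looks like, assuming HF-style logits over the vocabulary (the actual TAR loss may differ in detail):

```python
import torch.nn.functional as F

def max_entropy_loss(logits):
    # Minimizing the negative entropy pushes the post-attack model toward a
    # uniform distribution over the vocabulary on heldout forget data.
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
    return -entropy.mean()
```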

3. No diverse adversary sampling

  • Original TAR: Samples from multiple adversary distributions (pile-bio, camel-bio, retain-forget-switch with beta-distribution switching points) and randomly varies the adversary learning rate each outer step.
  • T-Vaccine: Uses a single harmful dataset with a fixed perturbation, no distribution diversity, no LR randomization.
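A rough sketch of the kind of sampling involved. The dataset names and learning rates come from this PR description; the beta parameters and overall structure are assumptions.

```python
import random

ADVERSARY_DATASETS = ["pile-bio", "camel-bio", "retain-forget-switch"]
ADVERSARY_LRS = [2e-6, 2e-5, 4e-5]

def sample_adversary():
    dataset = random.choice(ADVERSARY_DATASETS)
    lr = random.choice(ADVERSARY_LRS)  # re-sampled each outer step
    switch_point = None
    if dataset == "retain-forget-switch":
        # Beta-distributed point in the inner loop at which the adversary
        # switches from retain to forget data (alpha/beta values assumed).
        switch_point = random.betavariate(2.0, 5.0)
    return dataset, lr, switch_point
```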

4. No weighting schedule or gradient scaling

  • Original TAR: Uses an exponential weighting schedule (schedule_lambda=0.0625) that weights later inner-loop steps more heavily, plus a tamper-resistance gradient scale of 4.0.
  • T-Vaccine: Uses a flat weight of 2 on the safe-data gradients with no scheduling.
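One plausible form of that schedule; the exact formula in the TAR code is not confirmed here, though schedule_lambda=0.0625 and the 4.0 scale are from the PR description.

```python
import math

def inner_step_weights(num_steps=64, schedule_lambda=0.0625, tr_scale=4.0):
    # Exponentially increasing weights: later inner-loop (attack) steps
    # contribute more to the tamper-resistance gradient.
    raw = [math.exp(schedule_lambda * k) for k in range(num_steps)]
    total = sum(raw)
    return [tr_scale * w / total for w in raw]
```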

5. No representation-engineering retain loss

  • Original TAR: Uses --retain_representations — an MSE loss between the trained model's hidden states and the base model's hidden states, preserving the model's representation structure.
  • T-Vaccine: The retain model is deep-copied (self.retain_model = copy.deepcopy(model)) but never used — the representation loss is commented out (lines 948-957 of trainer.py).
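A sketch of such a retain loss, assuming HF-style models that expose hidden states via output_hidden_states=True:

```python
import torch
import torch.nn.functional as F

def retain_representation_loss(model, base_model, batch):
    out = model(**batch, output_hidden_states=True)
    with torch.no_grad():  # the base model stays frozen
        base_out = base_model(**batch, output_hidden_states=True)
    # MSE between hidden states at every layer preserves the trained
    # model's representation structure relative to the base model.
    return sum(
        F.mse_loss(h, h_base)
        for h, h_base in zip(out.hidden_states, base_out.hidden_states)
    )
```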

6. LoRA vs. full weights

  • Original TAR: Trains full model weights with FSDP across 8 GPUs.
  • T-Vaccine: Uses LoRA (rank=8, alpha=4) — a low-rank approximation that modifies far fewer parameters.
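For reference, a T-Vaccine-style setup looks roughly like this with peft. The rank and alpha are from the PR description; the model name and target_modules are assumptions for illustration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # model assumed
# Only low-rank adapters on selected projections are trained, rather than
# all model weights as in the original TAR's full-parameter FSDP training.
lora_config = LoraConfig(r=8, lora_alpha=4, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora_config)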

7. Outer optimizer and training scale

  • Original TAR: AdamWScheduleFree, lr=2e-5, 750 outer steps each containing 64 inner steps = 48,000 total adversary updates.
  • T-Vaccine: Standard adamw_torch, lr=1e-3, 20 epochs with 1 inner step each = dramatically less adversarial training.

8. Lower bound check missing

  • Original TAR: Has a tar_tamper_resistance_loss_lower_bound check — only applies TR gradients if the adversary was actually successful enough to warrant it.
  • T-Vaccine: Always applies the gradient combination regardless.
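A guess at the shape of that gate (the identifier comes from the PR description; the comparison direction is an assumption):

```python
def maybe_apply_tr_loss(tr_loss, lower_bound):
    # Only backpropagate the tamper-resistance loss when the simulated
    # adversary did enough damage for the TR signal to be meaningful;
    # T-Vaccine applies its gradient combination unconditionally.
    if tr_loss <= lower_bound:
        return None  # skip TR gradients this outer step
    return tr_loss
```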

@tomtseng force-pushed the tomtseng/tar-orig branch from 0bbc17b to a805363 on March 27, 2026 01:03
@tomtseng force-pushed the tomtseng/tar-orig branch from e34e62b to bd98462 on April 11, 2026 00:00
@tomtseng force-pushed the tomtseng/tar-orig branch from fb763f0 to 6e5d7e5 on April 12, 2026 22:22
@tomtseng force-pushed the tomtseng/tar-orig branch from ebfd6b6 to 437ada1 on April 13, 2026 01:04
@@ -198,6 +199,7 @@ class MTBenchScoreSchema(ScoreSchema):
judge_response_2: str = cast("str", pa.Field(nullable=False))



@tomtseng -- in #115 I alter MT-Bench to use vLLM which speeds it up a fair bit.

@tomtseng force-pushed the tomtseng/tar-orig branch from b3d499b to facfab7 on April 13, 2026 23:29
@tomtseng requested a review from sdhossain on April 14, 2026 01:28
@sdhossain (Collaborator) left a comment

lgtm -- left a nit or two that can be taken or left.

@@ -0,0 +1,295 @@
r"""Original TAR defense (Tamirisa et al. 2024) facade.

nit: I believe the convention we've been using is to name the file containing the main interface tar.py. Not sure if it's been followed for every defense, but I think it has been for booster, ctrl, and t-vaccine / crl.


@dataclass
class TARConfig(AlignmentDefenseConfig):
"""Configuration for the original TAR defense (Tamirisa et al. 2024)."""

Wondering if an Attributes: section with descriptions would be helpful here in the docs? Then people could hover over the class and see a description of the params.

This probably hasn't been done for a fair number of the defense configs (I think it was mostly done for the attack ones), but it could be done in the future as well; I've thought about using mkdocs down the line.

I see we already have comments for some of them.

"""Run the original TAR training as a subprocess."""
cfg = self.defense_config

if cfg.output_checkpoint_path.exists():

This would be pretty useful to add to the base defense class (not super relevant to this PR, I guess).

@tomtseng (Collaborator, Author)

Unrelated lint error; will fix it separately.

@tomtseng merged commit b7916cd into main on Apr 18, 2026 (1 of 2 checks passed)
@tomtseng deleted the tomtseng/tar-orig branch on April 18, 2026 04:54
@tomtseng (Collaborator, Author)

Actually, the lint error was not unrelated. But it's too late now; I'll fix it in another PR.
