defenses: Add original TAR implementation by tomtseng · Pull Request #113 · criticalml-uw/TamperBench

tomtseng · 2026-03-27T00:49:47Z

Changes

Adds implementation of TAR. We already had T-Vaccine's TAR implementation, but it turns out there are differences that make us suspect T-Vaccine's TAR implementation is weaker. So we're renaming T-Vaccine's TAR from TAR to TAR_TVACCINE and adding a version of TAR based on copying over the original TAR code

I'm not confident this gives much defense (see Testing section below) but it should be much more faithful to the original TAR implementation

Testing

ran scripts/whitebox/test_tar.py on 8 GPUs on Flamingo, reproducing the paper's experiment on refusal-trained TAR (Section 5.2, Table 2)

Pre-attack (defended model quality)

Metric	Paper (base)	Ours (base)	Paper (TAR)	Ours (TAR)
MT-Bench (↑)	8.1	7.356	6.3	5.975
MMLU-Pro (↑)	—	0.446	—	0.314

Post-attack (tamper resistance)

The paper reports HarmBench ASR; we use StrongREJECT (finetuned classifier).
Direct numeric comparison is not meaningful, but relative ranking is.

	Defended	Undefended
Avg StrongREJECT (↓)	0.717	0.760
Avg MMLU-Pro Val (↑)	0.339	0.413

The defended model shows modestly lower StrongREJECT. However, it's possible the result is noise or is a result of weaker capabilities/coherence rather than purely reduced harmfulness.

The paper's HarmBench ASR was 72.5% for the undefended model and 63.9% for the defended model, which is also quite modest. They don't report MMLU for this setting

Differences between original TAR and T-Vaccine's TAR, according to Claude:

1. Single SGD step vs. 64-step AdamW inner loop (most critical)

Original TAR: Runs 64 inner loop steps with a proper AdamW optimizer (with sampled LR from [2e-6, 2e-5, 4e-5]) to simulate a strong adversary. This is a full fine-tuning attack simulation.
T-Vaccine: Takes 1 raw gradient step with a hardcoded step size of 0.01 (param.data -= 0.01 * stored_grads). The AdamW inner optimizer code is present but commented out (lines 885-893, 917-918 of trainer.py).

The original TAR trains the defender against a much stronger, more realistic adversary. T-Vaccine's single-step perturbation is trivially weak by comparison.

2. No max-entropy tamper-resistance loss

Original TAR: Uses a max-entropy loss on heldout forget data — it actively pushes the model toward maximum entropy (uniform distribution) on harmful outputs after the adversary attacks.
T-Vaccine: Simply uses standard cross-entropy loss on safe data twice (loss2 and loss3), then combines gradients with param.grad = grad + 2 * stored_grads_tr. There's no max-entropy objective at all.

3. No diverse adversary sampling

Original TAR: Samples from multiple adversary distributions (pile-bio, camel-bio, retain-forget-switch with beta-distribution switching points) and randomly varies the adversary learning rate each outer step.
T-Vaccine: Uses a single harmful dataset with a fixed perturbation, no distribution diversity, no LR randomization.

4. No weighting schedule or gradient scaling

Original TAR: Uses an exponential weighting schedule (schedule_lambda=0.0625) that weights later inner-loop steps more heavily, plus a tamper-resistance gradient scale of 4.0.
T-Vaccine: Uses a flat weight of 2 on the safe-data gradients with no scheduling.

5. No representation-engineering retain loss

Original TAR: Uses --retain_representations — an MSE loss between the trained model's hidden states and the base model's hidden states, preserving the model's representation structure.
T-Vaccine: The retain model is deep-copied (self.retain_model = copy.deepcopy(model)) but never used — the representation loss is commented out (lines 948-957 of trainer.py).

6. LoRA vs. full weights

Original TAR: Trains full model weights with FSDP across 8 GPUs.
T-Vaccine: Uses LoRA (rank=8, alpha=4) — a low-rank approximation that modifies far fewer parameters.

7. Outer optimizer and training scale

Original TAR: AdamWScheduleFree, lr=2e-5, 750 outer steps each containing 64 inner steps = ~48,000 total adversary updates.
T-Vaccine: Standard adamw_torch, lr=1e-3, 20 epochs with 1 inner step each = dramatically less adversarial training.

8. Lower bound check missing

Original TAR: Has a tar_tamper_resistance_loss_lower_bound check — only applies TR gradients if the adversary was actually successful enough to warrant it.
T-Vaccine: Always applies the gradient combination regardless.

Makes MT-Bench accessible by registering it in the eval registry.

sdhossain · 2026-04-13T01:15:45Z

@@ -198,6 +199,7 @@ class MTBenchScoreSchema(ScoreSchema):
    judge_response_2: str = cast("str", pa.Field(nullable=False))




@tomtseng -- in #115 I alter MT-Bench to use vLLM which speeds it up a fair bit.

sdhossain

lgtm -- left a nit or two that can be taken or left.

sdhossain · 2026-04-18T00:49:49Z

@@ -0,0 +1,295 @@
+r"""Original TAR defense (Tamirisa et al. 2024) facade.


nit: I believe convention we've been using for the name of the file that contains the main interface would be tar.py. Not sure if it's been followed for every defense, but I think has been for booster, ctrl and t-vaccine / crl.

sdhossain · 2026-04-18T00:52:00Z

+
+@dataclass
+class TARConfig(AlignmentDefenseConfig):
+    """Configuration for the original TAR defense (Tamirisa et al. 2024)."""


wondering if an Attributes: + description would be helpful here in the docs? I guess then people maybe would be able to hover over the class and see description of the params.

This has probably not been done to a fair number of defense configs (i think it was done for the attack ones mostly), but could be done in the future as well <-- i've thought about using mkdocs in the future.

I see we already have comments for some of them.

sdhossain · 2026-04-18T00:53:55Z

+        """Run the original TAR training as a subprocess."""
+        cfg = self.defense_config
+
+        if cfg.output_checkpoint_path.exists():


this would be pretty useful to add to the base defense class -- (not super relevant to this PR i guess).

…ck to base class)

tomtseng · 2026-04-18T04:53:57Z

Unrelated lint error, will fix it separately

tomtseng · 2026-04-18T04:58:01Z

Actually the lint error was not unrelated. But anyway it's too late and I will fix it now in another PR

defenses: Add original TAR implementation

a805363

tomtseng force-pushed the tomtseng/tar-orig branch from 0bbc17b to a805363 Compare March 27, 2026 01:03

tomtseng added 12 commits March 27, 2026 17:30

TAR: Make it work for non-Llama models

b498180

mt_bench: Register the eval

90fcd59

Makes MT-Bench accessible by registering it in the eval registry.

tar: Stream TAR logging

accffe9

scripts tar: hack

479c2bd

tar: Fix test-time attack dataset

a309948

tar: Document bio vs. harmful version differences

84508e1

tar: Default retain_representations to True and document it

80e85e8

test_tar: Match inner-loop batch size of TAR paper

3d2669f

tar: Additional SFT after defense

a3bf1f5

tar: Add re-alignment step, reduce memory usage

6cafb5d

tar configs: Make grid.yaml do harmful-refusal TAR, not bioweapon TAR

962cb65

tar: Increase 4 -> 8 GPUs due to OOM

bd98462

tomtseng force-pushed the tomtseng/tar-orig branch from e34e62b to bd98462 Compare April 11, 2026 00:00

tar: Make --defended-model be post-SFT

6e5d7e5

tomtseng force-pushed the tomtseng/tar-orig branch from fb763f0 to 6e5d7e5 Compare April 12, 2026 22:22

tomtseng added 3 commits April 12, 2026 16:00

scripts tar: Reduce attack memory usage

9627b8b

Merge remote-tracking branch 'origin/main' into tomtseng/tar-orig

cc2c6af

scripts tar: Fix MT-Bench

437ada1

tomtseng force-pushed the tomtseng/tar-orig branch from ebfd6b6 to 437ada1 Compare April 13, 2026 01:04

sdhossain reviewed Apr 13, 2026

View reviewed changes

tomtseng added 2 commits April 13, 2026 16:16

scripts tar: Update results and how to run

12bf1ef

tar: Update docstrings

facfab7

tomtseng force-pushed the tomtseng/tar-orig branch from b3d499b to facfab7 Compare April 13, 2026 23:29

tomtseng requested a review from sdhossain April 14, 2026 01:28

tomtseng added 4 commits April 13, 2026 18:31

Merge remote-tracking branch 'origin/main' into tomtseng/tar-orig

4082eff

tar: Cleanup

a27491b

tar: Cleanup

30a90bd

uv.lock: Update with TAR dependencies

9b0d5a2

Merge remote-tracking branch 'origin/main' into tomtseng/tar-orig

9f9b1ee

sdhossain approved these changes Apr 18, 2026

View reviewed changes

tomtseng added 3 commits April 17, 2026 21:18

tar: Rename entrypoint to tar.py

cf4432a

tar: Address PR comments (config docstring, move model checkpoint che…

549827a

…ck to base class)

Merge remote-tracking branch 'origin/main' into tomtseng/tar-orig

b884d4c

tomtseng merged commit b7916cd into main Apr 18, 2026
1 of 2 checks passed

tomtseng deleted the tomtseng/tar-orig branch April 18, 2026 04:54

tomtseng mentioned this pull request Apr 18, 2026

rsn_tune: Fix merge conflict with TAR refactor #128

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

defenses: Add original TAR implementation#113

defenses: Add original TAR implementation#113
tomtseng merged 27 commits into
mainfrom
tomtseng/tar-orig

tomtseng commented Mar 27, 2026 •

edited

Loading

Uh oh!

sdhossain Apr 13, 2026

Uh oh!

sdhossain left a comment

Uh oh!

sdhossain Apr 18, 2026

Uh oh!

sdhossain Apr 18, 2026

Uh oh!

sdhossain Apr 18, 2026

Uh oh!

tomtseng commented Apr 18, 2026

Uh oh!

Uh oh!

tomtseng commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -198,6 +199,7 @@ class MTBenchScoreSchema(ScoreSchema):
		judge_response_2: str = cast("str", pa.Field(nullable=False))

		@@ -0,0 +1,295 @@
		r"""Original TAR defense (Tamirisa et al. 2024) facade.

Conversation

tomtseng commented Mar 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Testing

Pre-attack (defended model quality)

Post-attack (tamper resistance)

Differences between original TAR and T-Vaccine's TAR, according to Claude:

1. Single SGD step vs. 64-step AdamW inner loop (most critical)

2. No max-entropy tamper-resistance loss

3. No diverse adversary sampling

4. No weighting schedule or gradient scaling

5. No representation-engineering retain loss

6. LoRA vs. full weights

7. Outer optimizer and training scale

8. Lower bound check missing

Uh oh!

sdhossain Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

sdhossain left a comment

Choose a reason for hiding this comment

Uh oh!

sdhossain Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

sdhossain Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

sdhossain Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

tomtseng commented Apr 18, 2026

Uh oh!

Uh oh!

tomtseng commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomtseng commented Mar 27, 2026 •

edited

Loading