eval: added lm-eval #36
Conversation
| "--device", self.eval_config.device, | ||
| "--batch_size", str(self.eval_config.batch_size), | ||
| ] | ||
|
|
there's probably some way in this function we could get the output JSON and parse the results out of it
pubmedqa_llama3_qlora_2025-10-10T15-53-15.177293.json
Example of an output json^ generated by lm_eval. We can parse this further, if required - @sdhossain
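For reference, a rough sketch of pulling the aggregate scores out of a file like that - the key names ("results", metric keys like "acc,none") depend on the lm_eval version, so treat them as assumptions to verify against the attached JSON:

import json

with open("pubmedqa_llama3_qlora_2025-10-10T15-53-15.177293.json") as f:
    report = json.load(f)

# "results" maps each task name to its aggregate metrics in recent lm_eval outputs
for task, metrics in report.get("results", {}).items():
    for metric_name, value in metrics.items():
        print(task, metric_name, value)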
In our format we'd need a bit more information - i.e. what each datapoint was, the metric score that it has, etc.
E.g. for datapoint 1: we'd need the exact input to the model, the model's exact output, and the evaluation system's score.
I've attached an example CSV we have for another evaluation (we store it as a .parquet - see StrongREJECT as an example).
Right now, from what I saw in the JSON, I mainly saw configs and an aggregate score. This would perhaps make it difficult to do further analysis, i.e. if we wanted to check responses using an LLM judge, etc.
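Roughly what I mean, as a sketch - the column names and the samples list are placeholders (not code from this PR), and I'm using polars only because the PR already returns pl.DataFrame:

import polars as pl

# One row per datapoint: exact model input, exact model output, and the eval score.
samples = [
    {"input": "Question: ...", "output": "yes", "score": 1.0},
    {"input": "Question: ...", "output": "no", "score": 0.0},
]

df = pl.DataFrame(samples)
df.write_parquet("pubmedqa_per_sample.parquet")  # stored as .parquet, like StrongREJECT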
MKowal2 left a comment
LGTM! Just left a couple nits and questions I had about the implementation. Also - nice find and update to the logging!
import subprocess
from dataclasses import dataclass
from typing import List, Dict, Union, TypeVar
nit: as of Python 3.9, can use dict and list in type hints rather than importing Dict and List
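e.g. (illustrative signature only):

def summarize(scores: list[float], by_task: dict[str, float]) -> dict[str, float]:
    ...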
eval_config = LMRunEvaluationConfig(
    tasks=["hellaswag"],
    pretrained_path="/model-weights/Qwen2.5-3B",
    tokenizer_path="/model-weights/Qwen2.5-3B",
Just wondering because I am curious: when would the pretrained_path and tokenizer_path be different?
if __name__ == "__main__":
    load_dotenv()

    with tempfile.TemporaryDirectory() as tmpdirname:
Nit: Do we need the tempfile directory if we are testing print-only? Probably makes sense to keep if we incorporate the logging and test that too though, which seems like a good idea.
Returns DataFrame response=stdout.
"""

name = "LM_EVAL_RUNNER"
For the other evals (e.g., strong reject) we set the name, score, and optimization direction like:
from safetunebed.whitebox.utils import (
EvalName,
MetricName,
OptimizationDirection,
dealloc_model_and_tokenizer
)
...
name: EvalName = EvalName.StrongReject
objective: MetricName = MetricName.STRONGREJECT_SCORE
attacker_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE
defender_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
Not sure if this should be here as well for naming-convention consistency. There are also a bunch of different tasks that could have different optimization directions, although I could imagine they all share the same direction, in which case we could include it here too.
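If we did want it, a hypothetical sketch mirroring the StrongReject pattern - the LMEval and LM_EVAL_SCORE enum members don't exist yet and the directions are just a guess:

from safetunebed.whitebox.utils import (
    EvalName,
    MetricName,
    OptimizationDirection,
)
...
name: EvalName = EvalName.LMEval                    # assumed new enum member
objective: MetricName = MetricName.LM_EVAL_SCORE    # assumed new enum member
attacker_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
defender_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE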
max_generation_length: int = 0  # required by base


class LMRunEvaluation(WhiteBoxEvaluation[S]):
I think you don't need to define S = TypeVar(...) and you could just do LMRunEvaluation(WhiteBoxEvaluation[LMRunEvaluationConfig]). Although it looks like we do this TypeVar thing throughout the code so maybe cleaning this up should be a separate PR
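i.e. something like this (assuming LMRunEvaluationConfig is the only config type this evaluation needs):

class LMRunEvaluation(WhiteBoxEvaluation[LMRunEvaluationConfig]):  # no TypeVar S needed
    ...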
print(f"\n=== Running task: {task} ===")
logs.append(self._run_single_task(task))

return pl.DataFrame(logs)  # Inference schema doesn't really make sense until now for me.
remove comment? or elaborate on the comment about what's confusing?
@tomtseng @psyonp -- can we close this PR, or are we still looking to work on it? My understanding is that, in its current state, it doesn't fully integrate with the structure of tamperbench (as different attacks have different output formats). I think we can add this as an open issue / convert it to a draft?

Changes

Summarize the changes in this PR and describe the context or motivation for them. Add a title, prepending the tag [attack], [defense], [evaluation], or [infra] if appropriate.

Testing

Describe how you tested the changes in this PR. E.g., added tests, or ran command foo and checked the results looked good.