eval: added lm-eval #36
Conversation
| "--device", self.eval_config.device, | ||
| "--batch_size", str(self.eval_config.batch_size), | ||
| ] | ||
|
|
there's probably some way in this function we could get the output JSON and parse the results out of it
pubmedqa_llama3_qlora_2025-10-10T15-53-15.177293.json
Example of an output json^ generated by lm_eval. We can parse this further, if required - @sdhossain
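For reference, a rough sketch of pulling the aggregate scores out of a file like that - the key names ("results", metric keys like "acc,none") depend on the lm_eval version, so treat them as assumptions to verify against the attached JSON:

import json

with open("pubmedqa_llama3_qlora_2025-10-10T15-53-15.177293.json") as f:
    report = json.load(f)

# "results" maps each task name to its aggregate metrics in recent lm_eval outputs
for task, metrics in report.get("results", {}).items():
    for metric_name, value in metrics.items():
        print(task, metric_name, value)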
In our format we'd need a bit more information - i.e. what each datapoint was, the metric score that it has, etc.
E.g. for datapoint 1: we'd need the exact input to the model, the model's exact output, and the evaluation system's score.
I've attached an example CSV we have for another evaluation (we store it as a .parquet - see StrongREJECT as an example).
Right now, from what I saw in the JSON, I mainly saw configs and an aggregate score. This would perhaps make it difficult to do further analysis, i.e. if we wanted to check responses using an LLM judge, etc.
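Roughly what I mean, as a sketch - the column names and the samples list are placeholders (not code from this PR), and I'm using polars only because the PR already returns pl.DataFrame:

import polars as pl

# One row per datapoint: exact model input, exact model output, and the eval score.
samples = [
    {"input": "Question: ...", "output": "yes", "score": 1.0},
    {"input": "Question: ...", "output": "no", "score": 0.0},
]

df = pl.DataFrame(samples)
df.write_parquet("pubmedqa_per_sample.parquet")  # stored as .parquet, like StrongREJECT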
MKowal2 left a comment
LGTM! Just left a couple nits and questions I had about the implementation. Also - nice find and update to the logging!
import subprocess
from dataclasses import dataclass
from typing import List, Dict, Union, TypeVar
nit: as of Python 3.9, can use dict and list in type hints rather than importing Dict and List
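e.g. (illustrative signature only):

def summarize(scores: list[float], by_task: dict[str, float]) -> dict[str, float]:
    ...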
eval_config = LMRunEvaluationConfig(
    tasks=["hellaswag"],
    pretrained_path="/model-weights/Qwen2.5-3B",
    tokenizer_path="/model-weights/Qwen2.5-3B",
Just wondering because I am curious: when would the pretrained_path and tokenizer_path be different?
if __name__ == "__main__":
    load_dotenv()

    with tempfile.TemporaryDirectory() as tmpdirname:
Nit: Do we need the tempfile directory if we are testing print-only? Probably makes sense to keep if we incorporate the logging and test that too though, which seems like a good idea.
Returns DataFrame response=stdout.
"""

name = "LM_EVAL_RUNNER"
For the other evals (e.g., strong reject) we set the name, score, and optimization direction like:
from safetunebed.whitebox.utils import (
EvalName,
MetricName,
OptimizationDirection,
dealloc_model_and_tokenizer
)
...
name: EvalName = EvalName.StrongReject
objective: MetricName = MetricName.STRONGREJECT_SCORE
attacker_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE
defender_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
Not sure if this should be here as well for naming-convention consistency. There are also a bunch of different tasks that could have different optimization directions, although I could imagine they all share the same direction, in which case we could include it here too.
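If we did want it, a hypothetical sketch mirroring the StrongReject pattern - the LMEval and LM_EVAL_SCORE enum members don't exist yet and the directions are just a guess:

from safetunebed.whitebox.utils import (
    EvalName,
    MetricName,
    OptimizationDirection,
)
...
name: EvalName = EvalName.LMEval                    # assumed new enum member
objective: MetricName = MetricName.LM_EVAL_SCORE    # assumed new enum member
attacker_direction: OptimizationDirection = OptimizationDirection.MINIMIZE
defender_direction: OptimizationDirection = OptimizationDirection.MAXIMIZE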
max_generation_length: int = 0  # required by base


class LMRunEvaluation(WhiteBoxEvaluation[S]):
I think you don't need to define S = TypeVar(...) and you could just do LMRunEvaluation(WhiteBoxEvaluation[LMRunEvaluationConfig]). Although it looks like we do this TypeVar thing throughout the code so maybe cleaning this up should be a separate PR
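i.e. something like this (assuming LMRunEvaluationConfig is the only config type this evaluation needs):

class LMRunEvaluation(WhiteBoxEvaluation[LMRunEvaluationConfig]):  # no TypeVar S needed
    ...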
print(f"\n=== Running task: {task} ===")
logs.append(self._run_single_task(task))

return pl.DataFrame(logs)  # Inference schema doesn't really make sense until now for me.
remove comment? or elaborate on the comment about what's confusing?
@tomtseng @psyonp -- can we close this PR, or are we still looking to work on it? My understanding is that, in its current state, it doesn't fully integrate with the structure of tamperbench (as different attacks have different output formats). I think we can add this as an open issue / convert it to a draft?

Changes

Summarize the changes in this PR and describe the context or motivation for them. Add a title, prepending the tag [attack], [defense], [evaluation], or [infra] if appropriate.

Testing

Describe how you tested the changes in this PR. E.g., added tests, or ran command foo and checked the results looked good.