[eval] Add script to benchmark latency #1699
Code Review
This pull request introduces a new benchmark script, latency_bench.py, to evaluate multiple models and report average latency and success rates. It also updates main.py to support evaluating multiple models via a comma-separated list. The review feedback suggests adding unit tests for the new logic, handling potential AttributeErrors when parsing log timestamps, filtering out stale logs from previous runs, and robustly parsing the comma-separated model list to avoid empty strings.
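A minimal sketch of three of the suggestions above (robust model-list parsing, guarding timestamp parsing, and skipping stale log lines). The function names and the timestamp format are assumptions made for illustration. The actual code and log format in latency_bench.py and main.py are not shown in this thread.

```python
import re
from datetime import datetime
from typing import List, Optional

# Hypothetical log-line format: "YYYY-MM-DD HH:MM:SS ..." at the start of a line.
# The real format used by the benchmark logs may differ.
_TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def parse_model_list(raw: str) -> List[str]:
    """Split a comma-separated string, trimming whitespace and dropping empty entries."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def parse_log_timestamp(line: str) -> Optional[datetime]:
    """Return the leading timestamp of a log line, or None if it cannot be parsed."""
    match = _TIMESTAMP_RE.match(line)
    if match is None:
        # Calling match.group(1) on a failed match would raise AttributeError.
        return None
    try:
        return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Digits matched the pattern but do not form a valid date or time.
        return None


def is_from_current_run(line: str, run_start: datetime) -> bool:
    """True only for lines stamped at or after the start of the current run."""
    timestamp = parse_log_timestamp(line)
    return timestamp is not None and timestamp >= run_start
```

With this shape, an input such as `"model-a, model-b,,"` yields two names rather than a trailing empty string, and lines without a timestamp are skipped instead of crashing the run.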
gspencergoog left a comment
Would it make sense to keep the argument and just have it default to the location that's currently hard-coded? I can see wanting to test things and save logs somewhere else temporarily.
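One way the suggestion could look, sketched with `argparse`. The flag name `--log-dir` and the value of `DEFAULT_LOG_DIR` are placeholders, since the actual argument name and the hard-coded path are not shown in this thread.

```python
import argparse
from pathlib import Path
from typing import List, Optional

# Placeholder only: stands in for whatever location is currently hard-coded.
DEFAULT_LOG_DIR = Path("logs")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose log directory is optional and falls back to the default."""
    parser = argparse.ArgumentParser(description="Latency benchmark (sketch)")
    parser.add_argument(
        "--log-dir",  # hypothetical flag name
        type=Path,
        default=DEFAULT_LOG_DIR,
        help="Directory to read and write logs (default: %(default)s)",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

Existing invocations keep working unchanged, while a temporary run can pass a different directory without editing the script.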
Output:
Currently just runs Gemini models. We can add more in a follow-up.