[eval] Add script to benchmark latency #1699

Open
jiahaog wants to merge 4 commits into a2ui-project:main from jiahaog:feature/benchmark-models

@jiahaog jiahaog commented Jun 18, 2026

Collaborator

Output:

========================================
Latency Benchmark Results
========================================
Task: a2ui_v0_9_eval | Model: google/gemini-3-flash-preview
  Total Duration (s) : 436.00
  Total Samples      : 47
  Avg Latency/Sample : 9.28 s
  Success Rate       : 91.5%
----------------------------------------
Task: a2ui_v0_9_eval | Model: google/gemini-3.1-flash-lite
  Total Duration (s) : 457.00
  Total Samples      : 47
  Avg Latency/Sample : 9.72 s
  Success Rate       : 93.6%
----------------------------------------
Task: a2ui_v0_9_eval | Model: google/gemini-3.1-pro-preview
  Total Duration (s) : 984.00
  Total Samples      : 47
  Avg Latency/Sample : 20.94 s
  Success Rate       : 91.5%
----------------------------------------

Currently this just runs Gemini models. We can add more in a follow-up.
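
As a sanity check on the output above, the reported per-sample latency is consistent with total duration divided by sample count. That relationship is inferred from the numbers, not taken from the script, and `avg_latency` is a hypothetical helper name used only for this sketch:

```python
def avg_latency(total_duration_s: float, total_samples: int) -> float:
    """Average wall-clock seconds per sample (hypothetical helper)."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return total_duration_s / total_samples


# (total duration in seconds, total samples), copied from the output above.
results = {
    "google/gemini-3-flash-preview": (436.00, 47),
    "google/gemini-3.1-flash-lite": (457.00, 47),
    "google/gemini-3.1-pro-preview": (984.00, 47),
}

for model, (duration, samples) in results.items():
    print(f"{model}: {avg_latency(duration, samples):.2f} s")
```

Rounded to two decimals, these reproduce the 9.28 s, 9.72 s, and 20.94 s figures in the report.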

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces a new benchmark script, latency_bench.py, to evaluate multiple models and report average latency and success rates. It also updates main.py to support evaluating multiple models via a comma-separated list. The review feedback suggests adding unit tests for the new logic, handling potential AttributeErrors when parsing log timestamps, filtering out stale logs from previous runs, and robustly parsing the comma-separated model list to avoid empty strings.
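
Three of the suggestions in this summary can be sketched roughly as follows. This is not the code in `latency_bench.py` or `main.py`: the function names are hypothetical, the ISO 8601 timestamp format is an assumption about the logs, and the two datetimes being compared are assumed to be both naive or both timezone-aware.

```python
from datetime import datetime
from typing import Optional


def parse_model_list(raw: str) -> list[str]:
    """Split a comma-separated model string, dropping blanks and whitespace."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Return a datetime for an ISO 8601 string, or None if missing or malformed.

    Checking for None up front avoids calling methods on a missing value,
    which is one common source of AttributeError.
    """
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


def is_fresh(log_time: Optional[datetime], run_start: datetime) -> bool:
    """True if the log was written at or after the start of the current run."""
    return log_time is not None and log_time >= run_start
```

With helpers like these, a trailing comma in the model list yields no empty model name, and logs with missing or older timestamps can be skipped instead of being counted in the averages.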

Comment thread eval/bin/latency_bench.py
Comment thread eval/bin/latency_bench.py Outdated
Comment thread eval/bin/latency_bench.py
Comment thread eval/bin/latency_bench.py
Comment thread eval/main.py Outdated
@jiahaog jiahaog changed the title Benchmark Models [eval] Add script to benchmark models Jun 18, 2026
@jiahaog jiahaog requested a review from gspencergoog June 18, 2026 09:00
@jiahaog jiahaog changed the title [eval] Add script to benchmark models [eval] Add script to benchmark latency Jun 18, 2026

@gspencergoog gspencergoog left a comment

Collaborator

Would it make sense to leave the argument and just have it default to the location that's currently hard-coded? I can see wanting to test things and save logs somewhere else temporarily.
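
One way this suggestion might look with `argparse`, as a minimal sketch: the flag name `--log-dir` and the `DEFAULT_LOG_DIR` value are placeholders, since the actual argument and hard-coded path are not shown in this conversation.

```python
import argparse
from pathlib import Path

# Placeholder: stands in for whatever location is currently hard-coded.
DEFAULT_LOG_DIR = Path("logs")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose log directory is optional and has a default."""
    parser = argparse.ArgumentParser(description="Latency benchmark (sketch)")
    parser.add_argument(
        "--log-dir",  # hypothetical flag name
        type=Path,
        default=DEFAULT_LOG_DIR,
        help="Directory for eval logs (default: %(default)s)",
    )
    return parser


# No flag given: falls back to the default location.
print(build_parser().parse_args([]).log_dir)

# Flag given: logs go somewhere else for a temporary experiment.
print(build_parser().parse_args(["--log-dir", "/tmp/bench-logs"]).log_dir)
```

Keeping the argument with a default leaves the usual invocation unchanged while still allowing logs to be redirected for one-off runs.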
