[eval] Add script to benchmark latency by jiahaog · Pull Request #1699 · a2ui-project/a2ui

jiahaog · 2026-06-18T08:36:05Z

Output:

========================================
Latency Benchmark Results
========================================
Task: a2ui_v0_9_eval | Model: google/gemini-3-flash-preview
  Total Duration (s) : 436.00
  Total Samples      : 47
  Avg Latency/Sample : 9.28 s
  Success Rate       : 91.5%
----------------------------------------
Task: a2ui_v0_9_eval | Model: google/gemini-3.1-flash-lite
  Total Duration (s) : 457.00
  Total Samples      : 47
  Avg Latency/Sample : 9.72 s
  Success Rate       : 93.6%
----------------------------------------
Task: a2ui_v0_9_eval | Model: google/gemini-3.1-pro-preview
  Total Duration (s) : 984.00
  Total Samples      : 47
  Avg Latency/Sample : 20.94 s
  Success Rate       : 91.5%
----------------------------------------

Currently this only runs Gemini models. We can add more in a follow-up.
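
For reference, the averages above are consistent with dividing total duration by sample count. The sketch below recomputes them under that assumption. The function name `summarize` is hypothetical, and the success counts (43 and 44 out of 47) are inferred from the reported percentages rather than taken from the script.

```python
def summarize(total_duration_s: float, total_samples: int, successes: int) -> dict:
    """Recompute the two derived figures shown in the benchmark table.

    Assumes average latency is total wall-clock duration divided by the
    number of samples, and success rate is successes over samples.
    """
    return {
        "avg_latency_s": round(total_duration_s / total_samples, 2),
        "success_rate_pct": round(100 * successes / total_samples, 1),
    }


# Totals copied from the output above; success counts are inferred.
flash = summarize(436.00, 47, 43)
flash_lite = summarize(457.00, 47, 44)
pro = summarize(984.00, 47, 43)
```

If the script measures latency per request instead of per run, the averages could differ slightly from this simple ratio.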

gemini-code-assist

Code Review

This pull request introduces a new benchmark script, latency_bench.py, to evaluate multiple models and report average latency and success rates. It also updates main.py to support evaluating multiple models via a comma-separated list. The review feedback suggests adding unit tests for the new logic, handling potential AttributeErrors when parsing log timestamps, filtering out stale logs from previous runs, and robustly parsing the comma-separated model list to avoid empty strings.
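
Two of the suggestions in the summary above, parsing the model list without producing empty strings and skipping logs left over from earlier runs, could look roughly like the following. This is a minimal sketch, not the code in latency_bench.py: `parse_models` and `fresh_logs` are hypothetical names, and using file modification time to detect stale logs is an assumption.

```python
from pathlib import Path


def parse_models(raw: str) -> list[str]:
    """Split a comma-separated model list, dropping blanks and stray spaces."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def fresh_logs(log_dir: Path, started_at: float) -> list[Path]:
    """Return log files modified at or after `started_at` (epoch seconds).

    Files written by previous runs have an older modification time and
    are excluded, so they cannot skew the latency figures.
    """
    return sorted(
        path
        for path in log_dir.iterdir()
        if path.is_file() and path.stat().st_mtime >= started_at
    )
```

With this, an input such as `"a, b,,"` yields `["a", "b"]` instead of a list containing empty model names.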

gspencergoog

Would it make sense to leave the argument and just have it default to the location that's currently hard coded? I can see wanting to test things and save logs somewhere else temporarily.
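
A sketch of what this suggestion might look like with `argparse`: the flag stays, and it defaults to the hard-coded location. The `--log-dir` flag and the `logs/latency_bench` path come from the commit titles in this PR. `build_parser` and `DEFAULT_LOG_DIR` are hypothetical names.

```python
import argparse
from pathlib import Path

# Default taken from the commit title; the constant name is illustrative.
DEFAULT_LOG_DIR = Path("logs/latency_bench")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Latency benchmark (sketch)")
    parser.add_argument(
        "--log-dir",
        type=Path,
        default=DEFAULT_LOG_DIR,
        help="Where to write eval logs (default: %(default)s)",
    )
    return parser


# No flag given: falls back to the default location.
default_args = build_parser().parse_args([])
# Flag given: logs go to a temporary location for ad hoc testing.
custom_args = build_parser().parse_args(["--log-dir", "/tmp/bench_logs"])
```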

Add latency_bench script to evaluate multiple models

433750f

github-project-automation Bot added this to A2UI Jun 18, 2026

github-project-automation Bot moved this to Todo in A2UI Jun 18, 2026

gemini-code-assist Bot reviewed Jun 18, 2026

Comment thread eval/bin/latency_bench.py

Comment thread eval/bin/latency_bench.py Outdated

Comment thread eval/bin/latency_bench.py

Comment thread eval/bin/latency_bench.py

Comment thread eval/main.py Outdated

Refactor eval/main.py to use action=append for model argument

b9068ab
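
For readers unfamiliar with `action="append"`, the idea behind this commit can be sketched as below. The flag name `--model`, the `models` destination, and the fallback model are assumptions for illustration; the actual argument definition in eval/main.py is not shown in this conversation.

```python
import argparse

# Placeholder fallback; the real default in eval/main.py may differ.
FALLBACK_MODELS = ["google/gemini-3-flash-preview"]


def parse_models_from_argv(argv: list[str]) -> list[str]:
    parser = argparse.ArgumentParser(description="Eval runner (sketch)")
    # Each occurrence of --model appends one entry, so no comma splitting
    # is needed. The default is None rather than a list because argparse
    # would otherwise append user values onto the default list.
    parser.add_argument("--model", action="append", dest="models", default=None)
    args = parser.parse_args(argv)
    return args.models if args.models else list(FALLBACK_MODELS)


selected = parse_models_from_argv(
    ["--model", "google/gemini-3-flash-preview", "--model", "google/gemini-3.1-pro-preview"]
)
```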

jiahaog changed the title ~~Benchmark Models~~ [eval] Add script to benchmark models Jun 18, 2026

Filter out stale logs and fix model arg passing in latency_bench

111f5b3

jiahaog requested a review from gspencergoog June 18, 2026 09:00

jiahaog changed the title ~~[eval] Add script to benchmark models~~ [eval] Add script to benchmark latency Jun 18, 2026

Hardcode log dir to logs/latency_bench and remove --log-dir flag

13954a3

gspencergoog approved these changes Jun 22, 2026

jiahaog wants to merge 4 commits into a2ui-project:main from jiahaog:feature/benchmark-models


2 participants
