feat(express): Integrate A2UI Express and v1.0 into Inspect-ai evaluation suite #1740

Open
gspencergoog wants to merge 4 commits into a2ui-project:main from gspencergoog:express-pr3-evaluation

Conversation

@gspencergoog

Collaborator

Summary

This PR integrates the A2UI Express compiler, decompiler, and prompt strategies directly into the Inspect-ai evaluation suite. It introduces the express evaluation strategy, adds a new A2UI v1.0 prompt evaluation dataset, and updates the tasks, solvers, scorers, and CI runners to support schema-validated A2UI v1.0 and A2UI Express evaluations.

This is a companion PR to the recently merged A2UI Express compiler implementation (PR #1726). It enables automated evaluation of an LLM's ability to generate A2UI Express DSL layouts that compile back to validated standard JSON.

Changes

  • Evaluation Strategies (eval/a2ui_eval/strategies/):
    • Implemented the express strategy in a new module express.py. This strategy:
      • Generates system prompt instructions dynamically using ExpressPromptGenerator based on the active catalog schema.
      • Invokes the model to generate layout designs in A2UI Express DSL.
      • Implements compile_express_dsl to extract the generated Express DSL, compile it into standard A2UI v1.0 JSON, and perform schema validation using A2uiSchemaManager.
    • Updated __init__.py to register the new express strategy.
  • Tasks & Dataset (eval/):
    • Added v1_0_prompts.yaml, a new evaluation dataset of prompts and target layouts designed for A2UI v1.0 and Express DSL testing.
    • Renamed the core evaluation task from a2ui_v0_9_eval to a2ui_v0_9_1_eval in tasks.py.
    • Upgraded tasks.py to dynamically load the appropriate dataset and catalog schema (v0.9 vs v1.0/Express) depending on the selected strategy.
    • Updated the default grading model in tasks.py to google/gemini-3.5-flash.
  • Scorers & Dataset Loader (eval/a2ui_eval/):
    • Upgraded a2ui_scorer in scorers.py to support parameterizable protocol versions (0.9 or 1.0), ensuring strict validation checks are performed against the correct schema.
    • Updated dataset.py to allow passing a default_catalog_path.
  • CI Runners & Reporting:
    • Updated run_ci_evals.py and report_evals.py to run and report on the new express strategy alongside existing strategies.
    • Configured main.py to expose the express strategy option.
  • Testing:
    • Updated test_strategies.py to verify the correctness of the new express solver pipeline and compilation chain.
    • Updated test_run_ci_evals.py to test the integration in the CI execution pipeline.
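
The extract, compile, and validate chain described for the express strategy can be sketched roughly as follows. This is a minimal illustration, not the code in express.py: `compile_fn` and `validate_fn` are hypothetical stand-ins for the Express compiler and the A2uiSchemaManager check, whose real signatures are not shown in this PR description, and `ExpressResult`, `extract_dsl`, and `run_express_pipeline` are names invented for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Built at runtime so this snippet can itself sit inside a fenced block.
_FENCE = "`" * 3
_FENCED_BLOCK = re.compile(_FENCE + r"[\w-]*\n(.*?)" + _FENCE, re.DOTALL)


@dataclass
class ExpressResult:
    ok: bool
    payload: Optional[dict] = None
    error: Optional[str] = None


def extract_dsl(completion: str) -> str:
    """Return the first fenced block's body, or the whole completion if unfenced."""
    match = _FENCED_BLOCK.search(completion)
    return (match.group(1) if match else completion).strip()


def run_express_pipeline(
    completion: str,
    compile_fn: Callable[[str], dict],         # hypothetical stand-in for the Express compiler
    validate_fn: Callable[[dict], List[str]],  # hypothetical stand-in for schema validation
) -> ExpressResult:
    dsl = extract_dsl(completion)
    if not dsl:
        return ExpressResult(ok=False, error="empty completion")
    try:
        payload = compile_fn(dsl)
    except Exception as exc:  # a compile failure is a scored outcome, not a crash
        return ExpressResult(ok=False, error=f"compile failed: {exc}")
    errors = validate_fn(payload)
    if errors:
        return ExpressResult(ok=False, payload=payload, error="; ".join(errors))
    return ExpressResult(ok=True, payload=payload)
```

Returning a result object instead of raising keeps compile and validation failures visible to a scorer as ordinary outcomes.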

Impact & Risks

  • No Production Risk: All changes are isolated inside the eval/ directory and do not affect any SDK runtime paths.
  • Gated Execution: The Express solver sets A2UI_EXPRESS_ENABLED=true internally to compile outputs, which is safe and self-contained.

Testing

  • Local unit tests in the eval/ suite can be executed using pytest:
    uv run pytest eval/tests/
    Both test_strategies.py and test_run_ci_evals.py have been verified.

gemini-code-assist[bot]

This comment was marked as resolved.

@gspencergoog gspencergoog force-pushed the express-pr3-evaluation branch from f30376b to 52643f2 Compare June 23, 2026 23:18
gemini-code-assist[bot]

This comment was marked as resolved.

Comment thread eval/tasks.py
Comment on lines +90 to +91
active_version = "0.9"
default_catalog_path = "specification/v0_9/catalogs/basic/catalog.json"

Collaborator

Since the function name has changed, should these paths be updated to 0.9.1 too? Or perhaps we should defer all of this to a future PR.

Comment thread eval/tasks.py
Comment on lines +93 to +96
if strategy == "express":
    active_dataset_path = DATASET_PATH_V10
    active_version = "1.0"
    default_catalog_path = "specification/v1_0/catalogs/basic/catalog.json"

Collaborator

This doesn't look right. We shouldn't have a special case for running with 1.0 inside an a2ui_v0_9_1_eval task.
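
One possible shape for this, sketched under assumptions: each protocol version gets its own task entry point, and both delegate to a shared builder, so no task needs a strategy-based version switch. The catalog paths are the ones quoted in this thread. The dataset paths are placeholders, `a2ui_v1_0_eval`, `EvalConfig`, and `build_eval` are hypothetical names, and the real tasks would return Inspect-ai task objects rather than the plain dicts used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    version: str
    catalog_path: str
    dataset_path: str


# Catalog paths are quoted from the diff; dataset paths are placeholders.
V0_9_CONFIG = EvalConfig(
    version="0.9",
    catalog_path="specification/v0_9/catalogs/basic/catalog.json",
    dataset_path="<v0.9 dataset path>",
)
V1_0_CONFIG = EvalConfig(
    version="1.0",
    catalog_path="specification/v1_0/catalogs/basic/catalog.json",
    dataset_path="<v1.0 dataset path>",
)


def build_eval(config: EvalConfig, strategy: str) -> dict:
    """Shared builder: everything version-specific comes from the config."""
    return {
        "strategy": strategy,
        "version": config.version,
        "catalog_path": config.catalog_path,
        "dataset_path": config.dataset_path,
    }


def a2ui_v0_9_1_eval(strategy: str) -> dict:
    return build_eval(V0_9_CONFIG, strategy)


def a2ui_v1_0_eval(strategy: str) -> dict:  # hypothetical task name
    return build_eval(V1_0_CONFIG, strategy)
```

With this split, the strategy only selects the solver, and the version follows from which task was invoked.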

Comment thread eval/main.py

tasks = []
for strat in selected_strategies:
    if strat not in ["direct", "subagent_tool", "express"]:

Collaborator

This should check against the constant in eval/a2ui_eval/strategies/__init__.py
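
A rough sketch of what that check could look like, assuming the constant is the STRATEGIES dict keyed by strategy name. The solver values below are stand-in strings so the snippet runs on its own, and `validate_strategies` is a hypothetical helper name.

```python
from typing import Iterable, List, Mapping

# Stand-in for the registry in the strategies package; the real values are solver factories.
STRATEGIES: Mapping[str, object] = {
    "direct": "direct_solver",
    "subagent_tool": "subagent_tool_solver",
    "express": "express_solver",
}


def validate_strategies(
    selected: Iterable[str],
    registry: Mapping[str, object] = STRATEGIES,
) -> List[str]:
    """Return the selected names, or raise if any is missing from the registry."""
    names = list(selected)
    unknown = [name for name in names if name not in registry]
    if unknown:
        raise ValueError(
            f"Unknown strategies {unknown}; expected one of {sorted(registry)}"
        )
    return names
```

Registering a new strategy then updates the accepted CLI values automatically, with no second list to keep in sync.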

Comment thread eval/main.py
parser.add_argument("--log-dir", type=str, default="logs", help="Directory to save logs")
parser.add_argument("--sample-shuffle", type=int, default=None, help="Seed for shuffling samples")
parser.add_argument("--thinking-budget", type=int, default=None, help="Thinking budget for reasoning models")
parser.add_argument("--temperature", type=float, default=None, help="Generation temperature")

Collaborator

Curious if you needed to configure this

Comment thread eval/a2ui_eval/scorers.py
Comment on lines +49 to +50
if not active_version:
    active_version = "1.0" if "v1_0" in catalog_path or "v1.0" in catalog_path else "0.9"

Collaborator

Consider making version a required param so everything is controlled by the task
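
As an illustration of the suggestion, a keyword-only argument with no default forces every caller to state the version, so the path-sniffing fallback is no longer needed. `scorer_settings` and `SUPPORTED_VERSIONS` are hypothetical names, and the actual a2ui_scorer signature is not shown in this thread.

```python
SUPPORTED_VERSIONS = ("0.9", "1.0")


def scorer_settings(catalog_path: str, *, version: str) -> dict:
    """Build scorer settings; the caller (the task) must pass the version explicitly."""
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(
            f"Unsupported version {version!r}; expected one of {SUPPORTED_VERSIONS}"
        )
    return {"catalog_path": catalog_path, "version": version}
```

Omitting `version` then fails immediately with a TypeError at the call site instead of silently defaulting to 0.9.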

@@ -0,0 +1,915 @@
U2FsdGVkX1+9nGzseXHfkWEb7BVYv9LqHITLuSayZ9Nm8x/zAKOw5c57gGH2iqD1

@jiahaog jiahaog Jun 24, 2026

Collaborator

I haven't checked this file out yet, but is there any way we can reuse the existing dataset for both 0.9 and 1.0?

STRATEGIES = {
    "direct": direct_solver,
    "subagent_tool": subagent_tool_solver,
    "express": express_solver,

Collaborator

Future idea:

It'd be great if this could end up something like:

{
    "direct": create_direct_solver(NoopInferenceFormatAdapter()),
    "direct_express": create_direct_solver(ExpressInferenceFormatAdapter()),
    "subagent_tool": create_subagent_tool_solver(NoopInferenceFormatAdapter()),
}

Where NoopInferenceFormatAdapter and ExpressInferenceFormatAdapter conform to an InferenceFormatAdapter interface that we own, so that we can easily mix and match orchestration strategies and inference formats. If someone implements a new inference format, they can plug it in here with one line and see how it performs.
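
To make the idea concrete, here is a minimal sketch of such an interface. The method names `system_prompt` and `to_a2ui_json`, the `compile_fn` argument, and the `generate` callable are all assumptions made for illustration. A real version would wrap Inspect-ai solvers rather than plain functions.

```python
from typing import Callable, Protocol


class InferenceFormatAdapter(Protocol):
    def system_prompt(self, base_prompt: str) -> str:
        """Adjust the prompt so the model answers in this format."""
        ...

    def to_a2ui_json(self, completion: str) -> str:
        """Convert the model's completion into standard A2UI JSON text."""
        ...


class NoopInferenceFormatAdapter:
    """The model already emits A2UI JSON, so nothing is changed."""

    def system_prompt(self, base_prompt: str) -> str:
        return base_prompt

    def to_a2ui_json(self, completion: str) -> str:
        return completion


class ExpressInferenceFormatAdapter:
    """The model emits Express DSL; compile_fn (hypothetical) turns it into JSON text."""

    def __init__(self, compile_fn: Callable[[str], str]) -> None:
        self._compile = compile_fn

    def system_prompt(self, base_prompt: str) -> str:
        return base_prompt + "\nRespond in A2UI Express DSL."

    def to_a2ui_json(self, completion: str) -> str:
        return self._compile(completion)


def create_direct_solver(
    adapter: InferenceFormatAdapter,
) -> Callable[[Callable[[str], str], str], str]:
    """Orchestration stays the same; only the adapter decides the inference format."""

    def solve(generate: Callable[[str], str], base_prompt: str) -> str:
        completion = generate(adapter.system_prompt(base_prompt))
        return adapter.to_a2ui_json(completion)

    return solve
```

Because the adapter is passed in, the same `create_direct_solver` serves both "direct" and "direct_express", which is the one-line plug-in described above.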

3 participants