feat(express): Integrate A2UI Express and v1.0 into Inspect-ai evaluation suite by gspencergoog · Pull Request #1740 · a2ui-project/a2ui

gspencergoog · 2026-06-23T23:03:29Z

Summary

This PR integrates the A2UI Express compiler, decompiler, and prompt strategies directly into the Inspect-ai evaluation suite. It introduces the express evaluation strategy, adds a new A2UI v1.0 prompt evaluation dataset, and updates the tasks, solvers, scorers, and CI runners to support schema-validated A2UI v1.0 and A2UI Express evaluations.

This is a companion PR to the recently merged A2UI Express compiler implementation (PR #1726). It enables automated evaluation of how well an LLM can generate A2UI Express DSL layouts that compile back to schema-validated standard A2UI JSON.

Changes

Evaluation Strategies (eval/a2ui_eval/strategies/):
- Implemented the express strategy in a new module express.py. This strategy:
  - Generates system prompt instructions dynamically using ExpressPromptGenerator based on the active catalog schema.
  - Invokes the model to generate layout designs in A2UI Express DSL.
  - Implements compile_express_dsl to extract the generated Express DSL, compile it into standard A2UI v1.0 JSON, and perform schema validation using A2uiSchemaManager.
- Updated __init__.py to register the new express strategy.
Tasks & Dataset (eval/):
- Added v1_0_prompts.yaml, a new evaluation dataset of prompts and target layouts designed for A2UI v1.0 and Express DSL testing.
- Renamed the core evaluation task from a2ui_v0_9_eval to a2ui_v0_9_1_eval in tasks.py.
- Upgraded tasks.py to dynamically load the appropriate dataset and catalog schema (v0.9 vs v1.0/Express) depending on the selected strategy.
- Updated the default grading model in tasks.py to google/gemini-3.5-flash.
Scorers & Dataset Loader (eval/a2ui_eval/):
- Upgraded a2ui_scorer in scorers.py to support parameterizable protocol versions (0.9 or 1.0), ensuring strict validation checks are performed against the correct schema.
- Updated dataset.py to allow passing a default_catalog_path.
CI Runners & Reporting:
- Updated run_ci_evals.py and report_evals.py to run and report on the new express strategy alongside existing strategies.
- Configured main.py to expose the express strategy option.
Testing:
- Updated test_strategies.py to verify the correctness of the new express solver pipeline and compilation chain.
- Updated test_run_ci_evals.py to test the integration in the CI execution pipeline.
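
The compile step described for the express strategy (extract the DSL, compile it, then validate the result) can be sketched roughly as follows. This is a minimal illustration, not the code in express.py: `extract_dsl`, `compile_and_validate`, `compile_fn`, and `validate_fn` are hypothetical names, and the real compiler and A2uiSchemaManager calls are replaced by injected callables because their signatures are not shown in this PR.

```python
from __future__ import annotations

import re
from typing import Callable

# Hypothetical helper: matches the first fenced block in a model reply.
_FENCE_RE = re.compile(r"```[\w-]*\n(.*?)```", re.DOTALL)


def extract_dsl(model_output: str) -> str:
    """Return the first fenced block, or the whole reply if none is fenced."""
    match = _FENCE_RE.search(model_output)
    return (match.group(1) if match else model_output).strip()


def compile_and_validate(
    model_output: str,
    compile_fn: Callable[[str], dict],
    validate_fn: Callable[[dict], list[str]],
) -> tuple[dict | None, list[str]]:
    """Return (compiled_json, errors); compiled_json is None on any failure.

    compile_fn stands in for the Express compiler and is assumed to raise
    ValueError on bad input. validate_fn stands in for schema validation
    and is assumed to return a list of error messages (empty when valid).
    """
    dsl = extract_dsl(model_output)
    if not dsl:
        return None, ["model output contained no Express DSL"]
    try:
        compiled = compile_fn(dsl)
    except ValueError as exc:
        return None, [f"compile error: {exc}"]
    errors = validate_fn(compiled)
    return (compiled if not errors else None), errors
```

Returning the error list instead of raising lets a scorer record why a sample failed without aborting the whole evaluation run.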

Impact & Risks

No Production Risk: All changes are isolated inside the eval/ directory and do not affect any SDK runtime paths.
Gated Execution: The Express solver sets A2UI_EXPRESS_ENABLED=true internally so that it can compile outputs; the setting is self-contained within the eval suite.
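
The PR does not show how the flag is set. As a sketch of one way to keep such a flag scoped, a small context manager can set the variable for the duration of a compile call and restore the previous value afterwards. `temporary_env` is a hypothetical helper name, not something taken from this PR.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable inside a with-block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the prior state even if the body raised.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

Usage would look like `with temporary_env("A2UI_EXPRESS_ENABLED", "true"): ...` around the compile call, so the flag does not leak into other strategies running in the same process.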

Testing

Local unit tests in the eval/ suite can be executed using pytest:
```
uv run pytest eval/tests/
```
Both test_strategies.py and test_run_ci_evals.py have been verified.

jiahaog · 2026-06-24T00:54:15Z

+    active_version = "0.9"
+    default_catalog_path = "specification/v0_9/catalogs/basic/catalog.json"


Since the function name has changed, should these paths be updated to 0.9.1 too? Or perhaps we should defer all of this to a future PR.

jiahaog · 2026-06-24T00:55:36Z

+    if strategy == "express":
+        active_dataset_path = DATASET_PATH_V10
+        active_version = "1.0"
+        default_catalog_path = "specification/v1_0/catalogs/basic/catalog.json"


This doesn't look right. Should we really have a special case that runs 1.0 inside an a2ui_v0_9_1_eval task?

jiahaog · 2026-06-24T01:16:10Z

+
+    tasks = []
+    for strat in selected_strategies:
+        if strat not in ["direct", "subagent_tool", "express"]:


This should check against the constant in eval/a2ui_eval/strategies/__init__.py.
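
A minimal sketch of what this suggestion could look like, assuming the STRATEGIES registry is the constant being referred to. The solver functions below are empty stand-ins and `validate_strategies` is a hypothetical helper name.

```python
from typing import Callable


# Stand-ins for the real solvers; only the registry keys matter here.
def direct_solver() -> None: ...
def subagent_tool_solver() -> None: ...
def express_solver() -> None: ...


STRATEGIES: dict[str, Callable] = {
    "direct": direct_solver,
    "subagent_tool": subagent_tool_solver,
    "express": express_solver,
}


def validate_strategies(selected: list[str]) -> list[str]:
    """Reject any strategy name that is not registered in STRATEGIES."""
    unknown = [name for name in selected if name not in STRATEGIES]
    if unknown:
        raise ValueError(
            f"Unknown strategies {unknown}; expected one of {sorted(STRATEGIES)}"
        )
    return selected
```

With this shape, registering a new strategy in the dict is enough for the CI runner to accept it, and the hard-coded list cannot drift out of sync.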

jiahaog · 2026-06-24T01:17:15Z

    parser.add_argument("--log-dir", type=str, default="logs", help="Directory to save logs")
    parser.add_argument("--sample-shuffle", type=int, default=None, help="Seed for shuffling samples")
+    parser.add_argument("--thinking-budget", type=int, default=None, help="Thinking budget for reasoning models")
+    parser.add_argument("--temperature", type=float, default=None, help="Generation temperature")


Curious if you needed to configure this

jiahaog · 2026-06-24T01:19:41Z

+        if not active_version:
+            active_version = "1.0" if "v1_0" in catalog_path or "v1.0" in catalog_path else "0.9"


Consider making version a required param so everything is controlled by the task.
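
As a sketch of this suggestion, the version could be a required field that is validated up front instead of being inferred from the catalog path. `ScorerConfig` and `SUPPORTED_VERSIONS` are hypothetical names for illustration; the actual a2ui_scorer signature is not shown here.

```python
from dataclasses import dataclass

# The two protocol versions the PR description says the scorer supports.
SUPPORTED_VERSIONS = ("0.9", "1.0")


@dataclass(frozen=True)
class ScorerConfig:
    version: str  # required: no default and no sniffing of path names
    catalog_path: str

    def __post_init__(self) -> None:
        if self.version not in SUPPORTED_VERSIONS:
            raise ValueError(
                f"Unsupported version {self.version!r}; "
                f"expected one of {SUPPORTED_VERSIONS}"
            )
```

Because `version` has no default, a task that forgets to pass it fails at construction time with a TypeError rather than silently falling back to "0.9".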

jiahaog · 2026-06-24T01:30:10Z

@@ -0,0 +1,915 @@
+U2FsdGVkX1+9nGzseXHfkWEb7BVYv9LqHITLuSayZ9Nm8x/zAKOw5c57gGH2iqD1


I haven't checked this file out yet, but is there any way we can reuse the existing dataset for both 0.9 and 1.0?

jacobsimionato · 2026-06-24T03:08:57Z

 STRATEGIES = {
    "direct": direct_solver,
    "subagent_tool": subagent_tool_solver,
+    "express": express_solver,


Future idea:

It'd be great if this could end up something like:

{
    "direct": create_direct_solver(NoopInferenceFormatAdapter()),
    "direct_express": create_direct_solver(ExpressInferenceFormatAdapter()),
    "subagent_tool": create_subagent_tool_solver(NoopInferenceFormatAdapter()),
}

Here NoopInferenceFormatAdapter and ExpressInferenceFormatAdapter conform to an InferenceFormatAdapter interface that we own. That way we can easily mix and match orchestration strategies and inference formats, and anyone who implements a new inference format can plug it in here with one line and see how it performs.
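
To make the idea concrete, here is a rough sketch of what such an interface might look like. Everything beyond the three names in the comment above is an assumption: the method names `build_system_prompt` and `to_a2ui_json`, the injected `compile_fn`, and the plain-function solver shape are hypothetical and do not use Inspect AI's real solver API.

```python
from typing import Callable, Protocol


class InferenceFormatAdapter(Protocol):
    """Hypothetical interface separating inference format from orchestration."""

    def build_system_prompt(self, catalog_schema: str) -> str: ...
    def to_a2ui_json(self, model_output: str) -> str: ...


class NoopInferenceFormatAdapter:
    """The model emits A2UI JSON directly, so the output passes through."""

    def build_system_prompt(self, catalog_schema: str) -> str:
        return f"Reply with A2UI JSON for this catalog:\n{catalog_schema}"

    def to_a2ui_json(self, model_output: str) -> str:
        return model_output


class ExpressInferenceFormatAdapter:
    """The model emits Express DSL; compile_fn stands in for the compiler."""

    def __init__(self, compile_fn: Callable[[str], str]) -> None:
        self._compile = compile_fn

    def build_system_prompt(self, catalog_schema: str) -> str:
        return f"Reply in A2UI Express DSL for this catalog:\n{catalog_schema}"

    def to_a2ui_json(self, model_output: str) -> str:
        return self._compile(model_output)


def create_direct_solver(adapter: InferenceFormatAdapter):
    """Build a single-call solver that defers all format work to the adapter."""

    def solve(
        generate: Callable[[str, str], str], catalog_schema: str, user_prompt: str
    ) -> str:
        raw = generate(adapter.build_system_prompt(catalog_schema), user_prompt)
        return adapter.to_a2ui_json(raw)

    return solve
```

The orchestration function never inspects the format, so a new inference format only needs a new adapter class and one extra registry entry.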

github-project-automation Bot added this to A2UI Jun 23, 2026

github-project-automation Bot moved this to Todo in A2UI Jun 23, 2026

gspencergoog mentioned this pull request Jun 23, 2026

feat(eval): Integrate A2UI Express into evaluation suite gspencergoog/A2UI#4

Closed

gspencergoog requested review from jacobsimionato and jiahaog June 23, 2026 23:04

feat(express): Integrate A2UI Express and v1.0 into Inspect-ai evaluation suite

52643f2

gspencergoog force-pushed the express-pr3-evaluation branch from f30376b to 52643f2 on June 23, 2026 23:18

refactor(eval): use state.input_text from Inspect AI TaskState

492e126

gspencergoog and others added 2 commits June 23, 2026 16:34

fix(eval): address code review feedback on evaluation strategy, arguments and duration reporting

d731a2e

Merge branch 'main' into express-pr3-evaluation

189e767

jiahaog reviewed Jun 24, 2026

jacobsimionato approved these changes Jun 24, 2026

gspencergoog wants to merge 4 commits into a2ui-project:main from gspencergoog:express-pr3-evaluation

3 participants
