feat(express): Integrate A2UI Express and v1.0 into Inspect-ai evaluation suite #1740
gspencergoog wants to merge 4 commits into
f30376b to 52643f2
| active_version = "0.9" | ||
| default_catalog_path = "specification/v0_9/catalogs/basic/catalog.json" |
Since the function name has changed, these paths should probably be updated to 0.9.1 too? Or perhaps we should defer all of this to a future PR.
| if strategy == "express": | ||
| active_dataset_path = DATASET_PATH_V10 | ||
| active_version = "1.0" | ||
| default_catalog_path = "specification/v1_0/catalogs/basic/catalog.json" |
This doesn't look right. Should we really have a special case to run with 1.0 inside an a2ui_v0_9_1_eval task?
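One way to avoid the special case is to bind version, catalog, and dataset to the task rather than to the strategy. The following is only a sketch: `EvalVersionConfig`, `TASK_CONFIGS`, the `a2ui_v1_0_eval` task name, and both dataset paths are hypothetical placeholders; only the catalog paths and `a2ui_v0_9_1_eval` come from the diff.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class EvalVersionConfig:
    """Everything that depends on the protocol version, in one place."""
    version: str
    catalog_path: str
    dataset_path: str  # hypothetical field; the real dataset constants are not shown


# Each task owns exactly one config, so no strategy needs to override the version.
TASK_CONFIGS: Dict[str, EvalVersionConfig] = {
    "a2ui_v0_9_1_eval": EvalVersionConfig(
        version="0.9",
        catalog_path="specification/v0_9/catalogs/basic/catalog.json",
        dataset_path="<path to the 0.9 dataset>",  # placeholder, not a real path
    ),
    # Hypothetical task name for a dedicated 1.0 task.
    "a2ui_v1_0_eval": EvalVersionConfig(
        version="1.0",
        catalog_path="specification/v1_0/catalogs/basic/catalog.json",
        dataset_path="<path to the 1.0 dataset>",  # placeholder, not a real path
    ),
}


def config_for_task(task_name: str) -> EvalVersionConfig:
    """Look up the config for a task, failing loudly on unknown names."""
    try:
        return TASK_CONFIGS[task_name]
    except KeyError:
        raise ValueError(f"No version config registered for task {task_name!r}") from None
```

With this shape, the express strategy would only be offered by tasks whose config says 1.0, instead of silently switching versions inside a 0.9.1 task.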
| tasks = [] | ||
| for strat in selected_strategies: | ||
| if strat not in ["direct", "subagent_tool", "express"]: |
This should check against the constant in eval/a2ui_eval/strategies/__init__.py
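A minimal sketch of that check, assuming the constant is the STRATEGIES mapping shown later in this review. The three solver functions here are stand-ins, and `validate_strategies` is a hypothetical helper name, not something in the PR.

```python
from typing import Callable, Dict, Iterable, List


# Stand-ins for the real solvers, which live in the eval package.
def direct_solver() -> str:
    return "direct"


def subagent_tool_solver() -> str:
    return "subagent_tool"


def express_solver() -> str:
    return "express"


STRATEGIES: Dict[str, Callable[[], str]] = {
    "direct": direct_solver,
    "subagent_tool": subagent_tool_solver,
    "express": express_solver,
}


def validate_strategies(selected: Iterable[str]) -> List[str]:
    """Return the selected names unchanged, or raise if any is not registered."""
    names = list(selected)
    unknown = [name for name in names if name not in STRATEGIES]
    if unknown:
        raise ValueError(
            f"Unknown strategies {unknown}; expected one of {sorted(STRATEGIES)}"
        )
    return names
```

Checking membership in the registry means a newly registered strategy is accepted by the CLI without touching a second hard-coded list.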
| parser.add_argument("--log-dir", type=str, default="logs", help="Directory to save logs") | ||
| parser.add_argument("--sample-shuffle", type=int, default=None, help="Seed for shuffling samples") | ||
| parser.add_argument("--thinking-budget", type=int, default=None, help="Thinking budget for reasoning models") | ||
| parser.add_argument("--temperature", type=float, default=None, help="Generation temperature") |
Curious if you needed to configure this
| if not active_version: | ||
| active_version = "1.0" if "v1_0" in catalog_path or "v1.0" in catalog_path else "0.9" |
Consider making version a required param so everything is controlled by the task.
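A rough sketch of the required-parameter idea, replacing the path-sniffing fallback with an explicit keyword-only argument. `make_scorer_config` and `SUPPORTED_VERSIONS` are hypothetical names used for illustration; the real a2ui_scorer signature is not shown in this thread.

```python
from typing import Dict, Tuple

# Versions named in the PR description; anything else is rejected.
SUPPORTED_VERSIONS: Tuple[str, ...] = ("0.9", "1.0")


def make_scorer_config(*, version: str, catalog_path: str) -> Dict[str, str]:
    """Build scorer settings from values the task passes in explicitly.

    `version` is keyword-only and has no default, so a caller that forgets it
    gets a TypeError instead of a version guessed from the catalog path.
    """
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(
            f"Unsupported version {version!r}; expected one of {SUPPORTED_VERSIONS}"
        )
    return {"version": version, "catalog_path": catalog_path}
```

A task would then call something like `make_scorer_config(version="1.0", catalog_path="specification/v1_0/catalogs/basic/catalog.json")`, and the substring check on the path could be dropped.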
| @@ -0,0 +1,915 @@ | |||
| U2FsdGVkX1+9nGzseXHfkWEb7BVYv9LqHITLuSayZ9Nm8x/zAKOw5c57gGH2iqD1 | |||
I haven't checked this file out yet, but is there any way we can reuse the existing dataset for both 0.9 and 1.0?
| STRATEGIES = { | ||
| "direct": direct_solver, | ||
| "subagent_tool": subagent_tool_solver, | ||
| "express": express_solver, |
Future idea:
It'd be great if this could end up something like:
{
    "direct": create_direct_solver(NoopInferenceFormatAdapter()),
    "direct_express": create_direct_solver(ExpressInferenceFormatAdapter()),
    "subagent_tool": create_subagent_tool_solver(NoopInferenceFormatAdapter()),
}
Where NoopInferenceFormatAdapter, ExpressInferenceFormatAdapter conform to the InferenceFormatAdapter interface that we own. So that we can easily mix and match orchestration strategies and inference formats, and if someone implements a new inference format, they can just plug it in here with one line and see how it performs.
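To make the idea concrete, here is a simplified sketch of such an adapter interface. Everything in it is an assumption for illustration: the method names `prompt_suffix` and `to_a2ui_json`, the synchronous `Solver` shape (real Inspect-ai solvers are async and operate on task state), and the `compile_fn` argument, which stands in for the real Express compiler.

```python
from typing import Callable, Dict, Protocol


class InferenceFormatAdapter(Protocol):
    """What an inference format must provide to plug into any strategy."""

    def prompt_suffix(self) -> str: ...

    def to_a2ui_json(self, model_output: str) -> str: ...


class NoopInferenceFormatAdapter:
    """The model already emits A2UI JSON, so nothing is added or converted."""

    def prompt_suffix(self) -> str:
        return ""

    def to_a2ui_json(self, model_output: str) -> str:
        return model_output


class ExpressInferenceFormatAdapter:
    """Asks for Express DSL and converts it with an injected compile function."""

    def __init__(self, compile_fn: Callable[[str], str]) -> None:
        self._compile_fn = compile_fn  # placeholder for the real compiler

    def prompt_suffix(self) -> str:
        return "\nRespond in A2UI Express DSL."

    def to_a2ui_json(self, model_output: str) -> str:
        return self._compile_fn(model_output)


Generate = Callable[[str], str]
Solver = Callable[[str, Generate], str]


def create_direct_solver(adapter: InferenceFormatAdapter) -> Solver:
    """Orchestration is fixed here; only the format handling varies."""

    def solve(prompt: str, generate: Generate) -> str:
        raw = generate(prompt + adapter.prompt_suffix())
        return adapter.to_a2ui_json(raw)

    return solve


STRATEGIES: Dict[str, Solver] = {
    "direct": create_direct_solver(NoopInferenceFormatAdapter()),
    # str.strip is a toy stand-in for compiling Express DSL to JSON.
    "direct_express": create_direct_solver(
        ExpressInferenceFormatAdapter(compile_fn=str.strip)
    ),
}
```

Because the solver factory depends only on the protocol, adding another inference format would amount to one new adapter class and one new entry in the mapping.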
Summary
This PR integrates the A2UI Express compiler, decompiler, and prompt strategies directly into the Inspect-ai evaluation suite. It introduces the express evaluation strategy, adds a new A2UI v1.0 prompt evaluation dataset, and updates the tasks, solvers, scorers, and CI runners to support schema-validated A2UI v1.0 and A2UI Express evaluations.
This is a companion PR to the recently merged A2UI Express compiler implementation (PR #1726). It enables automated evaluation of how well LLMs generate A2UI Express DSL layouts that compile back to validated standard JSON.
Changes
… (eval/a2ui_eval/strategies/):
- … express strategy in a new module express.py. This strategy:
  - … ExpressPromptGenerator based on the active catalog schema.
  - … compile_express_dsl to extract the generated Express DSL, compile it into standard A2UI v1.0 JSON, and perform schema validation using A2uiSchemaManager.
- … express strategy.
… (eval/):
- … a2ui_v0_9_eval to a2ui_v0_9_1_eval in tasks.py.
- … google/gemini-3.5-flash.
… (eval/a2ui_eval/):
- … a2ui_scorer in scorers.py to support parameterizable protocol versions (0.9 or 1.0), ensuring strict validation checks are performed against the correct schema.
- … default_catalog_path.
- … express strategy alongside existing strategies.
- … express solver pipeline and compilation chain.
Impact & Risks
- … eval/ directory and do not affect any SDK runtime paths.
- … A2UI_EXPRESS_ENABLED=true internally to compile outputs, which is perfectly safe and self-contained.
Testing
… eval/ suite can be executed using pytest: