Skip to content

Add LLMAgent to EnvGraphSpec generation#718

Open
qianl-nv wants to merge 36 commits into
mainfrom
qianl/dev/agentic_llm
Open

Add LLMAgent to EnvGraphSpec generation#718
qianl-nv wants to merge 36 commits into
mainfrom
qianl/dev/agentic_llm

Conversation

@qianl-nv
Copy link
Copy Markdown
Collaborator

@qianl-nv qianl-nv commented May 26, 2026

Summary

First part of the LLM env gen pipeline: LLMAgent going from prompt to LLMEnvSchema (JSON)

Detailed description

  • Reason: LLM infers task, robot & object names from prompt
    Changed: Add new LLMAgent class & output shema (defines subset of supported task/constraints ect).
  • Impact: Supports the first step in the user journey for agentic env gen

To run: python -m isaaclab_arena.llm_env_gen.try_schema

qianl-nv added 7 commits May 26, 2026 18:39
Resolver matches object/robot proposal from LLM (based on name/tag) to
the exact entries in the arena regestry.
It outputs an ArenaEnvGraphSpec with the nodes, tasks and initial
state scene graph filled in based on LLM output + resolver match.
Copy link
Copy Markdown
Contributor

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

Thanks for adding the LLM-based environment generation pipeline! The modular design with clear separation between schema definition, LLM interaction, and deterministic resolution is well thought out. The trace-based debugging will be valuable for understanding resolution decisions.

Suggestions

1. Replace assertions with explicit exceptions for runtime validation

File: isaaclab_arena/llm_env_gen/llm_agent.py (lines 57-58)

The API key validation uses assert, which can be stripped with Python's -O flag:

assert self.api_key, "API key required: set NV_API_KEY or pass api_key."

Consider using an explicit exception:

if not self.api_key:
    raise ValueError("API key required: set NV_API_KEY or pass api_key.")

2. Improve JSON extraction error handling

File: isaaclab_arena/llm_env_gen/llm_agent.py (lines 148-161)

The _extract_json method uses contextlib.suppress(json.JSONDecodeError) before attempting manual parsing. If both paths fail, it raises AssertionError. Consider:

  1. Raising a more descriptive custom exception (e.g., LLMResponseParseError)
  2. Including the raw response in error messages for debugging
class LLMResponseParseError(ValueError):
    """Raised when the LLM response cannot be parsed as JSON."""

@staticmethod
def _extract_json(content: str) -> dict:
    content = content.strip()
    # ... extraction logic ...
    raise LLMResponseParseError(f"Could not extract JSON from LLM response: {content[:200]}...")

3. Consider adding retry logic for transient API failures

File: isaaclab_arena/llm_env_gen/llm_agent.py

LLM API calls can fail transiently due to rate limits or network issues. Consider adding retry logic with exponential backoff, or at minimum documenting that callers should implement retries:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def generate_spec(self, ...):
    ...

4. Add test coverage for LLMAgent

File: isaaclab_arena/tests/test_resolver.py

The resolver tests are thorough with the FakeAssetRegistry pattern. Consider adding similar tests for LLMAgent:

  • Test _extract_json with various edge cases (markdown fences, nested JSON, malformed responses)
  • Test schema validation rejections
  • Mock the OpenAI client to test the overall flow without making real API calls

5. Document the silent failure behavior in resolver

File: isaaclab_arena/llm_env_gen/resolver.py

The design choice to never raise on LLM mistakes (recording trace events instead) is documented in the class docstring, which is good. Consider adding a helper method or property to check if resolution was "clean":

@property
def has_resolution_errors(self) -> bool:
    """Returns True if any trace events indicate failed resolution."""
    error_stages = {"item.miss", "name.miss", "task.unknown_subject", "task.unknown_target"}
    return any(e.stage in error_stages for e in self.trace)

Minor Notes

  • The try_schema.py script is useful for manual testing but consider adding it to .gitignore or documenting it as a development-only tool
  • Good use of __post_init__ in ArenaEnvGraphSpec to centralize invariant checking

Overall this is a solid foundation for the LLM environment generation feature.


Update (commit 3336f26): The PR has been substantially refactored since my initial review.

Status of Previous Suggestions

  1. Replace assertions with explicit exceptions — Fixed. The new EnvGenAgent uses explicit raise ValueError instead of assert for the API key check.

  2. Improve JSON extraction error handling — Fixed. The new AgentResponseParseError(ValueError) custom exception is exactly what I suggested, including raw response preview (capped at 500 chars for readability).

  3. ℹ️ Retry logic — Not implemented, but this was a nice-to-have suggestion. Callers can wrap as needed.

  4. Add test coverage for LLMAgent/EnvGenAgent — Fixed. The new test_env_gen_agent.py (399 lines) comprehensively covers:

    • __init__ argument/env-var precedence and missing-key guard
    • _extract_json edge cases (fenced, prose, malformed, unbalanced braces, truncated error messages)
    • generate_spec / ping with mocked OpenAI client
    • System prompt coverage for RelationKind/TaskKind literals
  5. Resolver silent failure documentation — N/A. The entire resolver module was removed in this refactor. The PR now focuses on the agent layer; resolution is presumably handled elsewhere.

New Changes Observed

  • Module renamed: llm_env_genagentic_env_gen
  • Schema renamed: SceneSpecEnvIntentSpec (with new reasoning field for chain-of-thought)
  • Added CI job test_agent_remote_e2e with proper secret validation
  • Added pydantic>=2.0 as hard dependency in setup.py
  • Removed to_dict() / to_yaml() from ArenaEnvGraphSpec — validation now runs only in from_dict()
  • Added at_pose to RelationKind literal

No new issues found in the incremental changes. The refactor improves the codebase significantly.


Update (commit fb42596): Major refactor to OpenAI-compatible structured outputs.

Changes Observed

  1. Structured outputs migration: The agent now uses response_format={"type": "json_schema", ...} with strict mode, eliminating the need for prose-parsing fallback. This is a significant reliability improvement — the wire guarantees a valid JSON envelope.

  2. New structured_output_utils.py module: Well-designed extraction of reusable utilities:

    • build_strict_schema() / apply_strict_constraints() — munges pydantic schemas for OpenAI/Bedrock strict mode (adds additionalProperties: false, populates required, strips default keys)
    • extract_response_text() — handles NVIDIA DeepSeek's reasoning_content channel quirk
    • check_structured_output_support() — deployment validator probe with rich diagnostics
  3. Removed code: _extract_json() and AgentResponseParseError removed — no longer needed since structured outputs guarantee the JSON envelope. Clean deletion.

  4. Constructor-time fail-fast: Two-stage validation on EnvGenAgent.__init__:

    • First: cheap ping() to verify connectivity/auth
    • Then: full check_structured_output_support() probe with the real schema

    This surfaces capability failures at construction time rather than mid-pipeline.

  5. Test coverage: Comprehensive — test_structured_output_utils.py (421 lines) covers:

    • Schema munging for strict mode (``, nested objects, idempotency)
    • Channel extraction fallback (contentreasoning_contentempty)
    • Probe failure modes (4xx API error, empty envelope, parse error, validation error)
    • Live endpoint test gated by @pytest.mark.agent_remote_e2e

Notes

  • The json.loads(text, strict=False) call in generate_spec is a good workaround for DeepSeek-v4-flash's unescaped control characters — documented via inline comment.
  • Good defensive caching of self._spec_schema to avoid re-walking the schema on every call.
  • The system prompt was correctly trimmed to remove the now-redundant "emit ONLY JSON" instruction.

No new issues found. This is a clean, well-tested migration to structured outputs.


Update (commit f92fa73): Two defensive improvements.

  1. Allow empty tasks list in EnvIntentSpec: Removed the @model_validator that enforced non-empty tasks. Empty tasks now legitimately maps to NoTask for static sandbox environments. The field description clearly documents this behavior.

  2. Handle empty choices array in structured output validation: Added defensive check in check_structured_output_support() for when providers (Azure content-filter, Bedrock guardrails) return HTTP 200 with an empty choices list — preventing IndexError and surfacing it cleanly as a parse_error with response_route="empty".

Both changes include test updates. No issues found.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR introduces the first stage of the agentic environment-generation pipeline: an EnvGenAgent class that calls an OpenAI-compatible structured-outputs endpoint (NVIDIA hosted by default) and parses the response into a validated EnvIntentSpec Pydantic model.

  • EnvGenAgent — constructor runs two fail-fast wire checks (cheap ping then a structured-output probe with the full EnvIntentSpec schema) so misconfigured models are rejected before generate_spec is ever called; generate_spec uses response_format=json_schema exclusively, eliminating the prose-extraction fallback from the predecessor module.
  • structured_output_utils — schema munging, the NVIDIA DeepSeek reasoning_content channel fallback, and a deployment-validator probe that correctly guards against empty choices lists — though the sibling ping() function has the same gap and is not yet guarded.
  • CI & deps — a new agent_remote_e2e job gates live tests behind a secret-presence check; openai and pydantic>=2.0 are added to RUNTIME_DEPS.

Confidence Score: 4/5

Safe to merge with one small fix: ping() needs the same empty-choices guard that was already applied to check_structured_output_support in this PR.

ping() accesses resp.choices[0] unconditionally. The same empty-choices path that check_structured_output_support was explicitly hardened against in this PR can also hit ping, causing an IndexError to propagate through EnvGenAgent.__init__() as an opaque crash rather than a diagnosable wire failure. The rest of the pipeline is well-reasoned and well-tested.

isaaclab_arena/environments/agentic_env_gen/structured_output_utils.py — the ping() function at line 154.

Important Files Changed

Filename Overview
isaaclab_arena/environments/agentic_env_gen/structured_output_utils.py New utility module for structured-output probing; check_structured_output_support correctly guards against empty choices, but ping() does not and will raise IndexError on the same empty-choices API paths.
isaaclab_arena/environments/agentic_env_gen/env_intent_spec.py New Pydantic schema for LLM output; Relation.params: dict (bare untyped dict) generates an object schema without additionalProperties: false, which may fail strict OpenAI-compatible validators.
isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py New EnvGenAgent with constructor-time ping + structured-output validation and generate_spec using response_format=json_schema; clean refactor from prose-parsing predecessor.
isaaclab_arena/environments/agentic_env_gen/try_schema.py Demo/debug script; background override now correctly rewrites both rel.subject and rel.target, fixing the previously-noted unary-relation drop.
isaaclab_arena/tests/test_env_gen_agent.py Comprehensive unit tests for EnvGenAgent covering constructor probes, request shape, channel fallback, error propagation, and an opt-in live E2E test.
isaaclab_arena/tests/test_structured_output_utils.py Thorough tests for schema munging, channel routing, and all failure modes of check_structured_output_support; the empty-choices test confirms the fix but the parallel gap in ping is untested.
.github/workflows/ci.yml Adds agent_remote_e2e CI job with explicit secret-presence guard; also excludes the new marker from the default PhysX test run.
setup.py Adds openai and pydantic>=2.0 to RUNTIME_DEPS, addressing the previously-noted missing pydantic dependency.

Sequence Diagram

sequenceDiagram
    participant U as Caller
    participant A as EnvGenAgent init
    participant P as ping
    participant C as check_support
    participant G as generate_spec
    participant LLM as Inference Endpoint

    U->>A: EnvGenAgent(model, base_url)
    A->>P: ping(client, model)
    P->>LLM: completions.create max_tokens 8
    LLM-->>P: response text
    P-->>A: text
    A->>C: check_structured_output_support(client, model, EnvIntentSpec)
    C->>LLM: completions.create response_format json_schema
    LLM-->>C: EnvIntentSpec JSON sample
    C-->>A: StructuredOutputSupport supported True
    A-->>U: EnvGenAgent instance

    U->>G: generate_spec(prompt, catalog_text)
    G->>LLM: completions.create response_format json_schema
    LLM-->>G: JSON envelope
    G->>G: extract_response_text
    G->>G: json loads strict False
    G->>G: EnvIntentSpec model_validate
    G-->>U: EnvIntentSpec plus raw text
Loading

Reviews (9): Last reviewed commit: "Allow empty tasks list in EnvIntentSpec" | Re-trigger Greptile

Comment thread setup.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py
Comment thread isaaclab_arena/llm_env_gen/schema.py Outdated
Comment thread isaaclab_arena/llm_env_gen/resolver.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
registry = AssetRegistry()
backgrounds: list[str] = []
objects: list[dict] = []
embodiments: list[str] = []
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about optional ones, like lightings, hdr image?

Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py Outdated
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_agent.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_schema.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_schema.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_schema.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_schema.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_schema.py Outdated
Comment thread isaaclab_arena/llm_env_gen/llm_schema.py Outdated
@xyao-nv
Copy link
Copy Markdown
Collaborator

xyao-nv commented May 27, 2026

Much thorough than the last one. Most comments are to make it more robust and less prone to hallucinations. Feel free to leave some as TODOs.

And one major ask is to remove LLM, and call it Agent. Classes could be called as EnvGenAgent, EnvIntentSpec

qianl-nv added 8 commits May 28, 2026 11:41
xyao-nv flagged that the env-gen prompt and LLMEnvSpec docstrings mixed
"atomic action" and "atomic task" interchangeably when referring to the
entries of the ``tasks`` field. Unify on "atomic task(s)" everywhere
since the field is ``tasks`` and the class is ``Task`` — saying
"action" introduces a third noun for the same concept. Also aligns with
the existing ``isaaclab_arena/tasks/sequential_task_base.py`` wording.

Addresses thread #3313780564 on PR #718.

Signed-off-by: Qian Lin <qianl@nvidia.com>
Two PR threads converged on the same 3-line comment, so addressing
them together:

* "Uniform spawn scale" misled by implying a 1.0 baseline — some
  object_library assets already carry intrinsic scaling, so the
  override is not strictly uniform. Drop the qualifier.
* "auto-fit ... against the tabletop bbox" leaks implementation
  details of the current placement proposer; @zhx06 is expanding
  the proposer beyond bbox-based fitting, so the comment would go
  stale. Describe the effect (auto-fit) without the mechanism.

Addresses threads #3313785736 and #3313791294 on PR #718.

Signed-off-by: Qian Lin <qianl@nvidia.com>
The Resolver was moved out of this MR into a follow-up branch, so the
three docstrings / comments in llm_schema.py that name-checked it
either point at a class that is not in this PR (the Resolver class
itself, Resolver._build_spatial_constraint) or duplicate contract text
already covered by the surrounding class docstring.

Removed:

* Module docstring — drop "by the Resolver" from the second-step
  description; the architectural intent (deterministic second pass)
  reads fine without the class reference.
* Relation class docstring — drop the trailing sentence cross-ref to
  Resolver._build_spatial_constraint; the binary-vs-unary contract is
  fully stated in the preceding sentences.
* Relation.target inline comment — drop entirely; it restated the
  binary-vs-unary contract that the class docstring already documents
  authoritatively.

Addresses thread #3313795576 on PR #718.

Signed-off-by: Qian Lin <qianl@nvidia.com>
xyao-nv asked why the system prompt restated information that the
schema could express structurally, and suggested moving rules into
``Field(description=...)`` so they live in one place instead of being
duplicated across prompt prose and pydantic types.

Rebalance:

* Per-field semantics (what each field means, defaults, optionality)
  now live as ``Field(description=...)`` strings on every member of
  Item / Relation / Task / LLMEnvSpec. pydantic embeds these in
  ``model_json_schema()`` automatically, so the existing SCHEMA block
  at the end of the prompt now carries the per-field guidance the
  RULES section used to spell out.
* Enum members no longer need explicit ``∈ {kinds}`` lines — they are
  already enforced by the ``Literal[...]`` types and show up as
  ``enum:`` arrays inside the schema JSON.
* The prompt's RULES section shrinks to GUIDANCE: only cross-cutting
  invariants (articulated objects need spatial ``on(...)`` anchors;
  distractors need ``on(distractor, background)``), few-shot task
  examples, and the output-format directive remain — these either
  span multiple fields or change global LLM behaviour and don't fit a
  single-field description.
* New anti-hallucination guidance (also from xyao-nv's review): the
  GUIDANCE block now tells the LLM to output null for unspecified
  optional fields instead of guessing, and the same nudge is echoed
  in the optional-field descriptions on ``instance_name``, ``scale``,
  and ``Task.target`` so the LLM sees it both globally and per-field.

LLMAgent no longer imports RelationKind / TaskKind / get_args; the
schema dump is the only mechanism by which enum members reach the
prompt now.

Net line count grows (+59) because Field descriptions are verbose, but
the single source of truth for any given rule is now the schema —
xyao-nv's explicit ask. Token cost in the actual API request stays
roughly flat: what left the prompt mostly reappeared inside the JSON
schema dump.

Addresses thread #3313636935 on PR #718.

Signed-off-by: Qian Lin <qianl@nvidia.com>
Add a required ``reasoning: str`` field as the FIRST entry of
LLMEnvSpec. Instruction-tuned models respect schema field order, so
listing it first forces the LLM to write a step-by-step analysis
before committing to any structured field (the "think then commit"
pattern, which measurably improves structured-output quality).

As a bonus, the reasoning trace makes downstream failures
debuggable: when the resolver drops an item or picks the wrong
asset, the trace shows which step the model got wrong instead of
leaving the malformed spec as the only signal.

The Field description prescribes a 4-step analysis (task /
foreground / background / distractors) and asks the model not to
duplicate the analysis into ``task_description``. Verified
end-to-end against the live endpoint: the model produces a
self-contained 200-500 char reasoning block that walks all four
steps, and ``task_description`` stays a one-sentence summary.

Surface the field on its own line in ``try_schema.py`` so it's
easy to spot when iterating on prompts.

Signed-off-by: Qian Lin <qianl@nvidia.com>
Pure relocation requested in PR review: the package conceptually
sits alongside the other env-construction code in
isaaclab_arena/environments/, and the directory name shifts from
llm_env_gen to agentic_env_gen to align with the broader
"agent/agentic" terminology direction (the LLM is one possible
backend; a VLM-driven agent could follow later).

Files moved:
* isaaclab_arena/llm_env_gen/{__init__,llm_agent,llm_schema,try_schema}.py
  -> isaaclab_arena/environments/agentic_env_gen/...

Updated absolute import paths in try_schema.py (CLI module path in
docstring + 2 imports) and test_llm_agent.py (docstring + 2
imports), plus comment-only path references in setup.py and
.github/workflows/ci.yml.

Strictly a directory move; class names (LLMAgent, LLMEnvSpec,
LLMResponseParseError), file names within the package, the
llm_remote_e2e pytest marker, and the test file name are all
unchanged. Those renames follow in subsequent commits to keep each
diff reviewable.

Signed-off-by: Qian Lin <qianl@nvidia.com>
Adopt the name xyao-nv proposed in PR review: "Intent means it's a
blueprint, comparing with raw response from agent. LLM is too
restricted, AgenticEnvSpec sounds too broad."

The class is the structured "env intent" the agent commits to
after parsing a natural-language prompt — calling it an *intent*
spec (rather than an *LLM* spec) keeps the name accurate when the
backend is later swapped for a VLM or a hand-written rule-based
agent.

Renames:
* class LLMEnvSpec -> EnvIntentSpec
* file llm_schema.py -> env_intent_spec.py
* test_embeds_llm_env_spec_schema -> test_embeds_env_intent_spec_schema

Updates 24 references across llm_agent.py, try_schema.py,
test_llm_agent.py, setup.py, and the renamed schema file itself
(class definition, validator return annotation, docstrings, prompt
templates, import statements, console labels, comment paths).

Refreshes the historical comment that explained why the field
initial_scene_graph was kept after the previous SceneSpec ->
LLMEnvSpec rename; now reads SceneSpec -> LLMEnvSpec ->
EnvIntentSpec so the rationale survives this rename too.

LLMAgent, LLMResponseParseError, llm_agent.py, the llm_remote_e2e
pytest marker, and test_llm_agent.py file name are deliberately
NOT touched here — they follow in F8c (agent rename) and F8d
(marker / test file / prose sweep) to keep each diff reviewable.

Signed-off-by: Qian Lin <qianl@nvidia.com>
…ckage

Continues the agentic terminology shift requested in PR review:
"Let's just use agent/agentic instead of LLM all over the code."
With this commit, the env-gen package surface no longer mentions
"LLM" in any name — the backend stays LLM-driven today, but the
contract is generic enough that a VLM- or rule-based agent could
slot in without touching consumers.

Class / file / marker / test-file renames:
* class LLMAgent              -> EnvGenAgent
* class LLMResponseParseError -> AgentResponseParseError
* file  llm_agent.py          -> env_gen_agent.py
* file  test_llm_agent.py     -> test_env_gen_agent.py
* mark  llm_remote_e2e        -> agent_remote_e2e
* job   test_llm_remote_e2e   -> test_agent_remote_e2e

AgentResponseParseError docstring rewritten to explicitly contrast
with json.JSONDecodeError (addresses xyao's review question
"AgenticResponseParseJSON error? Shall it be non overlapping with
JSON's errors?"). Now spells out that:
  * JSONDecodeError = we found JSON but it's malformed
  * AgentResponseParseError = we couldn't even find a JSON payload
…so callers can attribute failures correctly (retry vs. fix
schema) without string-matching error messages.

Prose sweep across the package (~60 sites): LLM -> agent in
docstrings, comments, console banners, error messages, and CI step
names. Three references in EnvGenAgent that describe the OpenAI
chat-completions wire (constructor docstring, generate_spec
docstring, temperature/max_tokens args) became "model" rather than
"agent" because that's the layer the OpenAI client targets — the
agent is the wrapper class.

pytest.ini marker description tightened: "live OpenAI-compatible
LLM endpoint" -> "live OpenAI-compatible chat-completions
endpoint" (model-neutral, matches the wire contract).

Signed-off-by: Qian Lin <qianl@nvidia.com>
Comment thread isaaclab_arena/environments/agentic_env_gen/env_gen_agent.py Outdated
Switch EnvGenAgent from prose-extraction to OpenAI-compatible
structured outputs (response_format=json_schema). The wire now
guarantees a JSON envelope matching EnvIntentSpec, so the agent
drops:

* _extract_json (markdown-fence stripping + brace-counting heuristic)
* AgentResponseParseError (the "no JSON found" / "unbalanced braces"
  exception class is unreachable under structured outputs)
* the "Emit ONLY the JSON object" instruction in the prompt
* the prompt's embedded JSON schema dump (the wire carries it)

Addresses xyao's review question "Can we leverage the structured
outputs API?".

Two endpoint quirks surfaced during validation against the requested
models (nvidia/deepseek-ai/deepseek-v4-flash and
aws/anthropic/bedrock-claude-opus-4-6) and are handled cleanly:

* NVIDIA DeepSeek-v4-flash routes structured outputs into
  message.reasoning_content instead of message.content. The new
  extract_response_text() reads either channel transparently.
* AWS Bedrock requires additionalProperties:false on every object
  node, which pydantic's default schema doesn't emit. The new
  build_strict_schema() deep-walks the schema and applies the
  strict-mode constraints (additionalProperties:false, every
  property required, drop default keys).

DeepSeek-v4-flash also emits literal control chars (e.g. \t) inside
JSON strings despite the structured-outputs contract. json.loads
with strict=False accepts them; pydantic's stricter
model_validate_json would reject. Documented inline.

A new structured_output_utils module owns all wire-level
diagnostics (pydantic-model-agnostic, BaseModel-typed):

* ping(client, model) — cheap liveness probe (no response_format)
* check_structured_output_support(client, model, spec_class) —
  full capability probe; returns a StructuredOutputSupport
  dataclass that separates api_error (4xx at the wire) from
  parse_error (JSON / schema validation), so deployment
  validators can attribute failures correctly without
  string-matching error messages.
* build_strict_schema / apply_strict_constraints — schema munging
* extract_response_text — channel fallback

EnvGenAgent.__init__ now runs both probes in order:

  1. ping (cheap, fails fast on dead wire / bad key)
  2. check_structured_output_support (heavier; carries the
     EnvIntentSpec schema)

A model that can't honour response_format raises RuntimeError at
construction with the diagnostic payload in the message so the
operator can attribute the cause (api_error vs parse_error vs
route='empty'). No silently-broken agent instances.

Tests:
* 28 new unit tests in test_structured_output_utils.py covering
  schema munging (incl. an EnvIntentSpec-walk that confirms every
  object node is strict-mode-compatible), channel fallback, ping,
  and check_structured_output_support failure modes.
* 16 unit tests in test_env_gen_agent.py rewritten for the new
  __init__ + generate_spec pipeline. Locks in: call order
  (ping then probe), early exit when ping fails (probe never
  fires), construction failure when probe returns unsupported
  (RuntimeError carries the diagnostic fields).
* 2 live tests guarded by the agent_remote_e2e marker:
  test_default_model_supports_structured_output pins the
  capability contract against the default model;
  test_generate_spec_against_live_endpoint exercises the full
  pipeline.

Verified end-to-end against the live nvidia/deepseek-ai/deepseek-v4-flash
endpoint (parses the EnvIntentSpec out of reasoning_content) and the
schema munging was independently confirmed against
aws/anthropic/bedrock-claude-opus-4-6 (got past Bedrock's
additionalProperties validator; geo-restricted from our test
environment, but the wire path is sound).

Signed-off-by: Qian Lin <qianl@nvidia.com>
Comment thread isaaclab_arena/environments/agentic_env_gen/structured_output_utils.py Outdated
qianl-nv added 2 commits May 28, 2026 17:56
Some providers (Azure content-filter trips, Bedrock guardrail
rejections) succeed at the HTTP level but return an empty choices
list. The naive ``resp.choices[0]`` access raised ``IndexError`` and
broke the function's contract of always returning a
``StructuredOutputSupport`` value.

Surface the condition as a ``parse_error`` with
``response_route="empty"`` so callers route it identically to the
existing "empty envelope" case, but with a distinct message
("no choices") so operators can tell the two scenarios apart in logs.

Signed-off-by: Qian Lin <qianl@nvidia.com>
Remove the ``_tasks_must_be_non_empty`` model validator so the agent
can legitimately emit an empty ``tasks`` list when the user prompt
describes a task-less scene (static playground / sandbox env). The
arena layer treats empty tasks as a ``NoTask`` null object, so
rejecting the empty list at the schema layer was a redundant guardrail
that forced the agent to invent placeholder tasks for task-less
prompts.

Expand the ``tasks`` field description to teach the agent the new
contract — empty list is valid, prefer it over a placeholder task —
so behaviour doesn't regress now that the validator no longer enforces
non-emptiness.

Signed-off-by: Qian Lin <qianl@nvidia.com>
Comment on lines +148 to +154
resp = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": "Respond with exactly: OK"}],
temperature=0,
max_tokens=8,
)
return resp.choices[0].message.content or ""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 ping() accesses resp.choices[0] unconditionally. Some API error paths (e.g. Azure content-filter trips, Bedrock guardrail rejections, and certain rate-limit responses) return HTTP 200 with an empty choices list, making this line raise IndexError — not an openai-client exception. That breaks the documented contract ("Raises: Any exception raised by the underlying openai client") and surfaces as an opaque crash from EnvGenAgent.__init__() rather than a diagnosable ping failure. check_structured_output_support in the same file was explicitly fixed in this PR to guard against this case; ping needs the same treatment.

Suggested change
resp = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": "Respond with exactly: OK"}],
temperature=0,
max_tokens=8,
)
return resp.choices[0].message.content or ""
resp = client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": "Respond with exactly: OK"}],
temperature=0,
max_tokens=8,
)
choices = getattr(resp, "choices", None) or []
if not choices:
return ""
return choices[0].message.content or ""

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants