Add LLMAgent to EnvGraphSpec generation#718
Conversation
Resolver matches object/robot proposal from LLM (based on name/tag) to the exact entries in the arena regestry. It outputs an ArenaEnvGraphSpec with the nodes, tasks and initial state scene graph filled in based on LLM output + resolver match.
… always called when loading from yaml/dict
There was a problem hiding this comment.
Review Summary
Thanks for adding the LLM-based environment generation pipeline! The modular design with clear separation between schema definition, LLM interaction, and deterministic resolution is well thought out. The trace-based debugging will be valuable for understanding resolution decisions.
Suggestions
1. Replace assertions with explicit exceptions for runtime validation
File: isaaclab_arena/llm_env_gen/llm_agent.py (lines 57-58)
The API key validation uses assert, which can be stripped with Python's -O flag:
assert self.api_key, "API key required: set NV_API_KEY or pass api_key."Consider using an explicit exception:
if not self.api_key:
raise ValueError("API key required: set NV_API_KEY or pass api_key.")2. Improve JSON extraction error handling
File: isaaclab_arena/llm_env_gen/llm_agent.py (lines 148-161)
The _extract_json method uses contextlib.suppress(json.JSONDecodeError) before attempting manual parsing. If both paths fail, it raises AssertionError. Consider:
- Raising a more descriptive custom exception (e.g.,
LLMResponseParseError) - Including the raw response in error messages for debugging
class LLMResponseParseError(ValueError):
"""Raised when the LLM response cannot be parsed as JSON."""
@staticmethod
def _extract_json(content: str) -> dict:
content = content.strip()
# ... extraction logic ...
raise LLMResponseParseError(f"Could not extract JSON from LLM response: {content[:200]}...")3. Consider adding retry logic for transient API failures
File: isaaclab_arena/llm_env_gen/llm_agent.py
LLM API calls can fail transiently due to rate limits or network issues. Consider adding retry logic with exponential backoff, or at minimum documenting that callers should implement retries:
from tenacity import retry, stop_after_attempt, wait_exponential
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def generate_spec(self, ...):
...4. Add test coverage for LLMAgent
File: isaaclab_arena/tests/test_resolver.py
The resolver tests are thorough with the FakeAssetRegistry pattern. Consider adding similar tests for LLMAgent:
- Test
_extract_jsonwith various edge cases (markdown fences, nested JSON, malformed responses) - Test schema validation rejections
- Mock the OpenAI client to test the overall flow without making real API calls
5. Document the silent failure behavior in resolver
File: isaaclab_arena/llm_env_gen/resolver.py
The design choice to never raise on LLM mistakes (recording trace events instead) is documented in the class docstring, which is good. Consider adding a helper method or property to check if resolution was "clean":
@property
def has_resolution_errors(self) -> bool:
"""Returns True if any trace events indicate failed resolution."""
error_stages = {"item.miss", "name.miss", "task.unknown_subject", "task.unknown_target"}
return any(e.stage in error_stages for e in self.trace)Minor Notes
- The
try_schema.pyscript is useful for manual testing but consider adding it to.gitignoreor documenting it as a development-only tool - Good use of
__post_init__inArenaEnvGraphSpecto centralize invariant checking
Overall this is a solid foundation for the LLM environment generation feature.
Update (commit 3336f26): The PR has been substantially refactored since my initial review.
Status of Previous Suggestions
-
✅ Replace assertions with explicit exceptions — Fixed. The new
EnvGenAgentuses explicitraise ValueErrorinstead ofassertfor the API key check. -
✅ Improve JSON extraction error handling — Fixed. The new
AgentResponseParseError(ValueError)custom exception is exactly what I suggested, including raw response preview (capped at 500 chars for readability). -
ℹ️ Retry logic — Not implemented, but this was a nice-to-have suggestion. Callers can wrap as needed.
-
✅ Add test coverage for LLMAgent/EnvGenAgent — Fixed. The new
test_env_gen_agent.py(399 lines) comprehensively covers:__init__argument/env-var precedence and missing-key guard_extract_jsonedge cases (fenced, prose, malformed, unbalanced braces, truncated error messages)generate_spec/pingwith mocked OpenAI client- System prompt coverage for RelationKind/TaskKind literals
-
⚪ Resolver silent failure documentation — N/A. The entire resolver module was removed in this refactor. The PR now focuses on the agent layer; resolution is presumably handled elsewhere.
New Changes Observed
- Module renamed:
llm_env_gen→agentic_env_gen - Schema renamed:
SceneSpec→EnvIntentSpec(with newreasoningfield for chain-of-thought) - Added CI job
test_agent_remote_e2ewith proper secret validation - Added
pydantic>=2.0as hard dependency in setup.py - Removed
to_dict()/to_yaml()fromArenaEnvGraphSpec— validation now runs only infrom_dict() - Added
at_posetoRelationKindliteral
No new issues found in the incremental changes. The refactor improves the codebase significantly.
Update (commit fb42596): Major refactor to OpenAI-compatible structured outputs.
Changes Observed
-
Structured outputs migration: The agent now uses
response_format={"type": "json_schema", ...}with strict mode, eliminating the need for prose-parsing fallback. This is a significant reliability improvement — the wire guarantees a valid JSON envelope. -
New
structured_output_utils.pymodule: Well-designed extraction of reusable utilities:build_strict_schema()/apply_strict_constraints()— munges pydantic schemas for OpenAI/Bedrock strict mode (addsadditionalProperties: false, populatesrequired, stripsdefaultkeys)extract_response_text()— handles NVIDIA DeepSeek'sreasoning_contentchannel quirkcheck_structured_output_support()— deployment validator probe with rich diagnostics
-
Removed code:
_extract_json()andAgentResponseParseErrorremoved — no longer needed since structured outputs guarantee the JSON envelope. Clean deletion. -
Constructor-time fail-fast: Two-stage validation on
EnvGenAgent.__init__:- First: cheap
ping()to verify connectivity/auth - Then: full
check_structured_output_support()probe with the real schema
This surfaces capability failures at construction time rather than mid-pipeline.
- First: cheap
-
Test coverage: Comprehensive —
test_structured_output_utils.py(421 lines) covers:- Schema munging for strict mode (``, nested objects, idempotency)
- Channel extraction fallback (
content→reasoning_content→empty) - Probe failure modes (4xx API error, empty envelope, parse error, validation error)
- Live endpoint test gated by
@pytest.mark.agent_remote_e2e
Notes
- The
json.loads(text, strict=False)call ingenerate_specis a good workaround for DeepSeek-v4-flash's unescaped control characters — documented via inline comment. - Good defensive caching of
self._spec_schemato avoid re-walking the schema on every call. - The system prompt was correctly trimmed to remove the now-redundant "emit ONLY JSON" instruction.
No new issues found. This is a clean, well-tested migration to structured outputs.
Update (commit f92fa73): Two defensive improvements.
-
Allow empty tasks list in
EnvIntentSpec: Removed the@model_validatorthat enforced non-empty tasks. Empty tasks now legitimately maps toNoTaskfor static sandbox environments. The field description clearly documents this behavior. -
Handle empty
choicesarray in structured output validation: Added defensive check incheck_structured_output_support()for when providers (Azure content-filter, Bedrock guardrails) return HTTP 200 with an emptychoiceslist — preventingIndexErrorand surfacing it cleanly as aparse_errorwithresponse_route="empty".
Both changes include test updates. No issues found.
Greptile SummaryThis PR introduces the first stage of the agentic environment-generation pipeline: an
Confidence Score: 4/5Safe to merge with one small fix:
isaaclab_arena/environments/agentic_env_gen/structured_output_utils.py — the Important Files Changed
Sequence DiagramsequenceDiagram
participant U as Caller
participant A as EnvGenAgent init
participant P as ping
participant C as check_support
participant G as generate_spec
participant LLM as Inference Endpoint
U->>A: EnvGenAgent(model, base_url)
A->>P: ping(client, model)
P->>LLM: completions.create max_tokens 8
LLM-->>P: response text
P-->>A: text
A->>C: check_structured_output_support(client, model, EnvIntentSpec)
C->>LLM: completions.create response_format json_schema
LLM-->>C: EnvIntentSpec JSON sample
C-->>A: StructuredOutputSupport supported True
A-->>U: EnvGenAgent instance
U->>G: generate_spec(prompt, catalog_text)
G->>LLM: completions.create response_format json_schema
LLM-->>G: JSON envelope
G->>G: extract_response_text
G->>G: json loads strict False
G->>G: EnvIntentSpec model_validate
G-->>U: EnvIntentSpec plus raw text
Reviews (9): Last reviewed commit: "Allow empty tasks list in EnvIntentSpec" | Re-trigger Greptile |
| registry = AssetRegistry() | ||
| backgrounds: list[str] = [] | ||
| objects: list[dict] = [] | ||
| embodiments: list[str] = [] |
There was a problem hiding this comment.
How about optional ones, like lightings, hdr image?
|
Much thorough than the last one. Most comments are to make it more robust and less prone to hallucinations. Feel free to leave some as TODOs. And one major ask is to remove LLM, and call it Agent. Classes could be called as |
xyao-nv flagged that the env-gen prompt and LLMEnvSpec docstrings mixed "atomic action" and "atomic task" interchangeably when referring to the entries of the ``tasks`` field. Unify on "atomic task(s)" everywhere since the field is ``tasks`` and the class is ``Task`` — saying "action" introduces a third noun for the same concept. Also aligns with the existing ``isaaclab_arena/tasks/sequential_task_base.py`` wording. Addresses thread #3313780564 on PR #718. Signed-off-by: Qian Lin <qianl@nvidia.com>
Two PR threads converged on the same 3-line comment, so addressing them together: * "Uniform spawn scale" misled by implying a 1.0 baseline — some object_library assets already carry intrinsic scaling, so the override is not strictly uniform. Drop the qualifier. * "auto-fit ... against the tabletop bbox" leaks implementation details of the current placement proposer; @zhx06 is expanding the proposer beyond bbox-based fitting, so the comment would go stale. Describe the effect (auto-fit) without the mechanism. Addresses threads #3313785736 and #3313791294 on PR #718. Signed-off-by: Qian Lin <qianl@nvidia.com>
The Resolver was moved out of this MR into a follow-up branch, so the three docstrings / comments in llm_schema.py that name-checked it either point at a class that is not in this PR (the Resolver class itself, Resolver._build_spatial_constraint) or duplicate contract text already covered by the surrounding class docstring. Removed: * Module docstring — drop "by the Resolver" from the second-step description; the architectural intent (deterministic second pass) reads fine without the class reference. * Relation class docstring — drop the trailing sentence cross-ref to Resolver._build_spatial_constraint; the binary-vs-unary contract is fully stated in the preceding sentences. * Relation.target inline comment — drop entirely; it restated the binary-vs-unary contract that the class docstring already documents authoritatively. Addresses thread #3313795576 on PR #718. Signed-off-by: Qian Lin <qianl@nvidia.com>
xyao-nv asked why the system prompt restated information that the
schema could express structurally, and suggested moving rules into
``Field(description=...)`` so they live in one place instead of being
duplicated across prompt prose and pydantic types.
Rebalance:
* Per-field semantics (what each field means, defaults, optionality)
now live as ``Field(description=...)`` strings on every member of
Item / Relation / Task / LLMEnvSpec. pydantic embeds these in
``model_json_schema()`` automatically, so the existing SCHEMA block
at the end of the prompt now carries the per-field guidance the
RULES section used to spell out.
* Enum members no longer need explicit ``∈ {kinds}`` lines — they are
already enforced by the ``Literal[...]`` types and show up as
``enum:`` arrays inside the schema JSON.
* The prompt's RULES section shrinks to GUIDANCE: only cross-cutting
invariants (articulated objects need spatial ``on(...)`` anchors;
distractors need ``on(distractor, background)``), few-shot task
examples, and the output-format directive remain — these either
span multiple fields or change global LLM behaviour and don't fit a
single-field description.
* New anti-hallucination guidance (also from xyao-nv's review): the
GUIDANCE block now tells the LLM to output null for unspecified
optional fields instead of guessing, and the same nudge is echoed
in the optional-field descriptions on ``instance_name``, ``scale``,
and ``Task.target`` so the LLM sees it both globally and per-field.
LLMAgent no longer imports RelationKind / TaskKind / get_args; the
schema dump is the only mechanism by which enum members reach the
prompt now.
Net line count grows (+59) because Field descriptions are verbose, but
the single source of truth for any given rule is now the schema —
xyao-nv's explicit ask. Token cost in the actual API request stays
roughly flat: what left the prompt mostly reappeared inside the JSON
schema dump.
Addresses thread #3313636935 on PR #718.
Signed-off-by: Qian Lin <qianl@nvidia.com>
Add a required ``reasoning: str`` field as the FIRST entry of LLMEnvSpec. Instruction-tuned models respect schema field order, so listing it first forces the LLM to write a step-by-step analysis before committing to any structured field (the "think then commit" pattern, which measurably improves structured-output quality). As a bonus, the reasoning trace makes downstream failures debuggable: when the resolver drops an item or picks the wrong asset, the trace shows which step the model got wrong instead of leaving the malformed spec as the only signal. The Field description prescribes a 4-step analysis (task / foreground / background / distractors) and asks the model not to duplicate the analysis into ``task_description``. Verified end-to-end against the live endpoint: the model produces a self-contained 200-500 char reasoning block that walks all four steps, and ``task_description`` stays a one-sentence summary. Surface the field on its own line in ``try_schema.py`` so it's easy to spot when iterating on prompts. Signed-off-by: Qian Lin <qianl@nvidia.com>
Pure relocation requested in PR review: the package conceptually
sits alongside the other env-construction code in
isaaclab_arena/environments/, and the directory name shifts from
llm_env_gen to agentic_env_gen to align with the broader
"agent/agentic" terminology direction (the LLM is one possible
backend; a VLM-driven agent could follow later).
Files moved:
* isaaclab_arena/llm_env_gen/{__init__,llm_agent,llm_schema,try_schema}.py
-> isaaclab_arena/environments/agentic_env_gen/...
Updated absolute import paths in try_schema.py (CLI module path in
docstring + 2 imports) and test_llm_agent.py (docstring + 2
imports), plus comment-only path references in setup.py and
.github/workflows/ci.yml.
Strictly a directory move; class names (LLMAgent, LLMEnvSpec,
LLMResponseParseError), file names within the package, the
llm_remote_e2e pytest marker, and the test file name are all
unchanged. Those renames follow in subsequent commits to keep each
diff reviewable.
Signed-off-by: Qian Lin <qianl@nvidia.com>
Adopt the name xyao-nv proposed in PR review: "Intent means it's a blueprint, comparing with raw response from agent. LLM is too restricted, AgenticEnvSpec sounds too broad." The class is the structured "env intent" the agent commits to after parsing a natural-language prompt — calling it an *intent* spec (rather than an *LLM* spec) keeps the name accurate when the backend is later swapped for a VLM or a hand-written rule-based agent. Renames: * class LLMEnvSpec -> EnvIntentSpec * file llm_schema.py -> env_intent_spec.py * test_embeds_llm_env_spec_schema -> test_embeds_env_intent_spec_schema Updates 24 references across llm_agent.py, try_schema.py, test_llm_agent.py, setup.py, and the renamed schema file itself (class definition, validator return annotation, docstrings, prompt templates, import statements, console labels, comment paths). Refreshes the historical comment that explained why the field initial_scene_graph was kept after the previous SceneSpec -> LLMEnvSpec rename; now reads SceneSpec -> LLMEnvSpec -> EnvIntentSpec so the rationale survives this rename too. LLMAgent, LLMResponseParseError, llm_agent.py, the llm_remote_e2e pytest marker, and test_llm_agent.py file name are deliberately NOT touched here — they follow in F8c (agent rename) and F8d (marker / test file / prose sweep) to keep each diff reviewable. Signed-off-by: Qian Lin <qianl@nvidia.com>
…ckage Continues the agentic terminology shift requested in PR review: "Let's just use agent/agentic instead of LLM all over the code." With this commit, the env-gen package surface no longer mentions "LLM" in any name — the backend stays LLM-driven today, but the contract is generic enough that a VLM- or rule-based agent could slot in without touching consumers. Class / file / marker / test-file renames: * class LLMAgent -> EnvGenAgent * class LLMResponseParseError -> AgentResponseParseError * file llm_agent.py -> env_gen_agent.py * file test_llm_agent.py -> test_env_gen_agent.py * mark llm_remote_e2e -> agent_remote_e2e * job test_llm_remote_e2e -> test_agent_remote_e2e AgentResponseParseError docstring rewritten to explicitly contrast with json.JSONDecodeError (addresses xyao's review question "AgenticResponseParseJSON error? Shall it be non overlapping with JSON's errors?"). Now spells out that: * JSONDecodeError = we found JSON but it's malformed * AgentResponseParseError = we couldn't even find a JSON payload …so callers can attribute failures correctly (retry vs. fix schema) without string-matching error messages. Prose sweep across the package (~60 sites): LLM -> agent in docstrings, comments, console banners, error messages, and CI step names. Three references in EnvGenAgent that describe the OpenAI chat-completions wire (constructor docstring, generate_spec docstring, temperature/max_tokens args) became "model" rather than "agent" because that's the layer the OpenAI client targets — the agent is the wrapper class. pytest.ini marker description tightened: "live OpenAI-compatible LLM endpoint" -> "live OpenAI-compatible chat-completions endpoint" (model-neutral, matches the wire contract). Signed-off-by: Qian Lin <qianl@nvidia.com>
Switch EnvGenAgent from prose-extraction to OpenAI-compatible
structured outputs (response_format=json_schema). The wire now
guarantees a JSON envelope matching EnvIntentSpec, so the agent
drops:
* _extract_json (markdown-fence stripping + brace-counting heuristic)
* AgentResponseParseError (the "no JSON found" / "unbalanced braces"
exception class is unreachable under structured outputs)
* the "Emit ONLY the JSON object" instruction in the prompt
* the prompt's embedded JSON schema dump (the wire carries it)
Addresses xyao's review question "Can we leverage the structured
outputs API?".
Two endpoint quirks surfaced during validation against the requested
models (nvidia/deepseek-ai/deepseek-v4-flash and
aws/anthropic/bedrock-claude-opus-4-6) and are handled cleanly:
* NVIDIA DeepSeek-v4-flash routes structured outputs into
message.reasoning_content instead of message.content. The new
extract_response_text() reads either channel transparently.
* AWS Bedrock requires additionalProperties:false on every object
node, which pydantic's default schema doesn't emit. The new
build_strict_schema() deep-walks the schema and applies the
strict-mode constraints (additionalProperties:false, every
property required, drop default keys).
DeepSeek-v4-flash also emits literal control chars (e.g. \t) inside
JSON strings despite the structured-outputs contract. json.loads
with strict=False accepts them; pydantic's stricter
model_validate_json would reject. Documented inline.
A new structured_output_utils module owns all wire-level
diagnostics (pydantic-model-agnostic, BaseModel-typed):
* ping(client, model) — cheap liveness probe (no response_format)
* check_structured_output_support(client, model, spec_class) —
full capability probe; returns a StructuredOutputSupport
dataclass that separates api_error (4xx at the wire) from
parse_error (JSON / schema validation), so deployment
validators can attribute failures correctly without
string-matching error messages.
* build_strict_schema / apply_strict_constraints — schema munging
* extract_response_text — channel fallback
EnvGenAgent.__init__ now runs both probes in order:
1. ping (cheap, fails fast on dead wire / bad key)
2. check_structured_output_support (heavier; carries the
EnvIntentSpec schema)
A model that can't honour response_format raises RuntimeError at
construction with the diagnostic payload in the message so the
operator can attribute the cause (api_error vs parse_error vs
route='empty'). No silently-broken agent instances.
Tests:
* 28 new unit tests in test_structured_output_utils.py covering
schema munging (incl. an EnvIntentSpec-walk that confirms every
object node is strict-mode-compatible), channel fallback, ping,
and check_structured_output_support failure modes.
* 16 unit tests in test_env_gen_agent.py rewritten for the new
__init__ + generate_spec pipeline. Locks in: call order
(ping then probe), early exit when ping fails (probe never
fires), construction failure when probe returns unsupported
(RuntimeError carries the diagnostic fields).
* 2 live tests guarded by the agent_remote_e2e marker:
test_default_model_supports_structured_output pins the
capability contract against the default model;
test_generate_spec_against_live_endpoint exercises the full
pipeline.
Verified end-to-end against the live nvidia/deepseek-ai/deepseek-v4-flash
endpoint (parses the EnvIntentSpec out of reasoning_content) and the
schema munging was independently confirmed against
aws/anthropic/bedrock-claude-opus-4-6 (got past Bedrock's
additionalProperties validator; geo-restricted from our test
environment, but the wire path is sound).
Signed-off-by: Qian Lin <qianl@nvidia.com>
Some providers (Azure content-filter trips, Bedrock guardrail
rejections) succeed at the HTTP level but return an empty choices
list. The naive ``resp.choices[0]`` access raised ``IndexError`` and
broke the function's contract of always returning a
``StructuredOutputSupport`` value.
Surface the condition as a ``parse_error`` with
``response_route="empty"`` so callers route it identically to the
existing "empty envelope" case, but with a distinct message
("no choices") so operators can tell the two scenarios apart in logs.
Signed-off-by: Qian Lin <qianl@nvidia.com>
Remove the ``_tasks_must_be_non_empty`` model validator so the agent can legitimately emit an empty ``tasks`` list when the user prompt describes a task-less scene (static playground / sandbox env). The arena layer treats empty tasks as a ``NoTask`` null object, so rejecting the empty list at the schema layer was a redundant guardrail that forced the agent to invent placeholder tasks for task-less prompts. Expand the ``tasks`` field description to teach the agent the new contract — empty list is valid, prefer it over a placeholder task — so behaviour doesn't regress now that the validator no longer enforces non-emptiness. Signed-off-by: Qian Lin <qianl@nvidia.com>
| resp = client.chat.completions.create( | ||
| model=model, | ||
| messages=[{"role": "user", "content": "Respond with exactly: OK"}], | ||
| temperature=0, | ||
| max_tokens=8, | ||
| ) | ||
| return resp.choices[0].message.content or "" |
There was a problem hiding this comment.
ping() accesses resp.choices[0] unconditionally. Some API error paths (e.g. Azure content-filter trips, Bedrock guardrail rejections, and certain rate-limit responses) return HTTP 200 with an empty choices list, making this line raise IndexError — not an openai-client exception. That breaks the documented contract ("Raises: Any exception raised by the underlying openai client") and surfaces as an opaque crash from EnvGenAgent.__init__() rather than a diagnosable ping failure. check_structured_output_support in the same file was explicitly fixed in this PR to guard against this case; ping needs the same treatment.
| resp = client.chat.completions.create( | |
| model=model, | |
| messages=[{"role": "user", "content": "Respond with exactly: OK"}], | |
| temperature=0, | |
| max_tokens=8, | |
| ) | |
| return resp.choices[0].message.content or "" | |
| resp = client.chat.completions.create( | |
| model=model, | |
| messages=[{"role": "user", "content": "Respond with exactly: OK"}], | |
| temperature=0, | |
| max_tokens=8, | |
| ) | |
| choices = getattr(resp, "choices", None) or [] | |
| if not choices: | |
| return "" | |
| return choices[0].message.content or "" |
Summary
First part of the LLM env gen pipeline: LLMAgent going from prompt to LLMEnvSchema (JSON)
Detailed description
Changed: Add new LLMAgent class & output shema (defines subset of supported task/constraints ect).
To run: python -m isaaclab_arena.llm_env_gen.try_schema