Skip to content

Add Lightwheel fetching asset retry handling#731

Open
xyao-nv wants to merge 2 commits into
mainfrom
xyao/fix/lw_api_timeout
Open

Add Lightwheel fetching asset retry handling#731
xyao-nv wants to merge 2 commits into
mainfrom
xyao/fix/lw_api_timeout

Conversation

@xyao-nv
Copy link
Copy Markdown
Collaborator

@xyao-nv xyao-nv commented May 27, 2026

Summary

Add Lightwheel fetching asset retry handling

Detailed description

  • Lightwheel SDK asset fetches can fail on transient API read timeouts.
  • Added a shared acquire_lightwheel_asset helper with scoped timeout override and retry handling.
  • Routed Lightwheel object and background acquisitions through the shared helper.

@xyao-nv xyao-nv marked this pull request as ready for review May 27, 2026 23:09
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR introduces a shared acquire_lightwheel_asset helper that wraps Lightwheel SDK fetch calls with a scoped base_timeout override and configurable retry logic, then routes all object and background acquisitions through it to handle transient API read timeouts.

  • lightwheel_utils.py — new helper implements a try/except/finally loop that sets client.base_timeout before each attempt, catches timeout-shaped exceptions via string inspection of the exception chain, sleeps between retries, and restores the original timeout in the finally block on every code path.
  • object_library.py / background_library.py — six class-level and one instance-level acquire_by_registry / get_usd call sites are mechanically replaced with acquire_lightwheel_asset.
  • test_lightwheel.py — adds a live integration test (skipped without the SDK) and a unit test verifying that the timeout is set correctly on each attempt and restored on completion.

Confidence Score: 5/5

The retry logic and timeout restoration are correct across all code paths; the change is safe to merge.

The finally block correctly restores client.base_timeout on success, on retried timeouts, and on final propagation. The loop bounds and retry-vs-raise branching are accurate. No functional defects were found in the new code paths introduced by this PR.

No files require special attention beyond the open feedback from previous review threads on lightwheel_utils.py.

Important Files Changed

Filename Overview
isaaclab_arena/assets/lightwheel_utils.py New shared helper with scoped timeout override and retry logic; two issues flagged in a previous review thread (assert vs ValueError, cause/context chain traversal) remain unresolved; logic is otherwise correct
isaaclab_arena/assets/object_library.py Six class-level acquire calls replaced with acquire_lightwheel_asset; behavior is functionally equivalent with retry wrapping added
isaaclab_arena/assets/background_library.py Instance-level acquire call in init now routed through acquire_lightwheel_asset; straightforward mechanical change
isaaclab_arena/tests/test_lightwheel.py Tests cover success-on-first-call and retry-on-first-timeout; missing coverage for the all-retries-exhausted path

Sequence Diagram

sequenceDiagram
    participant Caller
    participant acquire_lightwheel_asset
    participant client
    participant acquire_fn

    Caller->>acquire_lightwheel_asset: "call(loader, acquire_fn, attempts=3, timeout_sec=60, ...)"
    acquire_lightwheel_asset->>client: read base_timeout to old_timeout

    loop for attempt in 1..attempts
        acquire_lightwheel_asset->>client: "set base_timeout = timeout_sec"
        acquire_lightwheel_asset->>acquire_fn: "acquire_fn(**kwargs)"
        alt success
            acquire_fn-->>acquire_lightwheel_asset: result
            acquire_lightwheel_asset->>client: "restore base_timeout = old_timeout"
            acquire_lightwheel_asset-->>Caller: return result
        else TimeoutError not last attempt
            acquire_fn-->>acquire_lightwheel_asset: raise TimeoutError
            Note over acquire_lightwheel_asset: _looks_like_timeout True
            acquire_lightwheel_asset->>client: "restore base_timeout = old_timeout finally"
            acquire_lightwheel_asset->>acquire_lightwheel_asset: sleep(delay_sec)
        else TimeoutError last attempt or non-timeout error
            acquire_fn-->>acquire_lightwheel_asset: raise Exception
            acquire_lightwheel_asset->>client: "restore base_timeout = old_timeout finally"
            acquire_lightwheel_asset-->>Caller: re-raise
        end
    end
Loading

Reviews (2): Last reviewed commit: "lint" | Re-trigger Greptile

Comment on lines +24 to +25
assert attempts > 0, "attempts must be positive"
assert delay_sec >= 0, "delay_sec must be non-negative"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Using assert for runtime parameter validation is fragile — asserts are silently stripped when Python runs with -O or -OO (common in optimized/production environments). If attempts=0 is passed, the loop body never executes and execution falls through to raise AssertionError("unreachable"), which is a confusing error. Use ValueError instead so invalid arguments are always caught.

Suggested change
assert attempts > 0, "attempts must be positive"
assert delay_sec >= 0, "delay_sec must be non-negative"
if attempts <= 0:
raise ValueError("attempts must be positive")
if delay_sec < 0:
raise ValueError("delay_sec must be non-negative")

text = f"{type(current).__name__}: {current}".lower()
if "timeout" in text or "timed out" in text:
return True
current = current.__cause__ or current.__context__
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The __cause__ or __context__ expression only follows one branch of the exception chain. When an exception has both __cause__ set (explicit raise X from Y) and a __context__ (implicit context), only __cause__ is traversed and the __context__ subtree is never visited. A timeout buried exclusively in the __context__ chain would be missed, causing the exception to propagate without retrying.

Suggested change
current = current.__cause__ or current.__context__
next_exc = current.__cause__ if current.__cause__ is not None else current.__context__
current = next_exc

Copy link
Copy Markdown
Contributor

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Add Lightwheel fetching asset retry handling

Summary: This PR adds robust retry handling for transient Lightwheel SDK API timeouts by introducing a shared acquire_lightwheel_asset helper function. The implementation is clean and well-tested.

Strengths

  • Clean abstraction: The acquire_lightwheel_asset helper encapsulates timeout configuration and retry logic in one place, making it easy to apply consistently across all Lightwheel asset acquisitions.
  • Proper timeout restoration: Uses a sentinel _MISSING to correctly handle cases where the client may not have a base_timeout attribute.
  • Exception chain walking: _looks_like_timeout correctly inspects __cause__ and __context__ to catch wrapped timeout exceptions.
  • Good test coverage: Both integration and unit tests verify the retry behavior and timeout restoration.

Suggestions

1. Consider moving finally block outside the loop (minor)

The current structure restores old_timeout after every attempt:

for attempt in range(1, attempts + 1):
    try:
        ...
    except Exception as exc:
        ...
    finally:
        if old_timeout is not _MISSING:
            client.base_timeout = old_timeout  # restored each iteration

While this works because the timeout is re-set at the start of each iteration, using a try/finally around the entire loop would be clearer:

try:
    for attempt in range(1, attempts + 1):
        try:
            if timeout_sec is not None and old_timeout is not _MISSING:
                client.base_timeout = timeout_sec
            return acquire_fn(**kwargs)
        except Exception as exc:
            if not _looks_like_timeout(exc) or attempt == attempts:
                raise
            print(f"[isaaclab-arena] {description} timed out; retrying {attempt + 1}/{attempts}...")
            time.sleep(delay_sec)
finally:
    if old_timeout is not _MISSING:
        client.base_timeout = old_timeout

2. Log message consistency (nitpick)

The log prefix [isaaclab-arena] is clear. Consider using Python logging instead of print for consistency with other Isaac Lab modules, but this is optional.

Overall

Clean implementation that properly handles a real operational concern. The code is defensive, well-structured, and includes appropriate tests.

Verdict: Looks good to merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant