`create_ray_wrapped_inference_engines` drops the engine-core child's root cause on init failure #1673

When SkyRL's [`create_ray_wrapped_inference_engines`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/backends/skyrl_train/inference_engines/ray_wrapped_inference_engine.py#L86-L325) builds an [`AsyncVLLMInferenceEngine`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py#L342-L364) Ray actor whose vLLM v1 engine-core child process dies during init, the driver-side `ActorDiedError` that surfaces from [`ray.get(sleep_refs)`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/backends/skyrl_train/inference_engines/ray_wrapped_inference_engine.py#L323) bottoms out at vLLM's [`wait_for_engine_startup`](https://github.com/vllm-project/vllm/blob/v0.20.2/vllm/v1/engine/utils.py#L1130-L1182) with:

```none
File "vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

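This `RuntimeError` is hard to act on: the message says "See root cause above", but nothing above it in the driver traceback names the cause, and the failed-process set is printed as `{}`. I eventually found that the engine-core child's stderr and traceback are written only to `/tmp/ray/session_*/logs/worker-*.err`.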

## Reproduction

Run the reproducer below with Python 3.12, `skyrl==0.2.0`, `ray==2.51.1`, and `vllm==0.20.2` on a machine with at least one GPU. The failure trigger is `gpu_memory_utilization=0.999`, which makes vLLM's engine-core child raise `ValueError: Free memory on device cuda:0 ... is less than desired GPU memory utilization` inside its `request_memory()` call. Pass `--no-bug` to run the control path with `gpu_memory_utilization=0.5` instead.

```python
import argparse
import glob
import os
import sys
import time
import traceback
from pathlib import Path

import ray
from ray.exceptions import ActorDiedError

from skyrl.backends.skyrl_train.inference_engines.ray_wrapped_inference_engine import (
    create_ray_wrapped_inference_engines,
)


def find_recent_actor_err_logs(window_s: float = 180.0) -> list[Path]:
    cutoff = time.time() - window_s
    paths = [
        Path(p)
        for p in glob.glob("/tmp/ray/session_latest/logs/worker-*.err")
        if os.path.getmtime(p) >= cutoff and os.path.getsize(p) > 0
    ]
    return sorted(paths, key=os.path.getmtime, reverse=True)


def tail(path: Path, n: int = 80) -> str:
    return "\n".join(path.read_text(errors="replace").splitlines()[-n:])


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--no-bug", action="store_true")
    args = parser.parse_args()

    gpu_memory_utilization = 0.5 if args.no_bug else 0.999
    ray.init(num_cpus=4)

    kwargs = dict(
        num_inference_engines=1,
        tensor_parallel_size=1,
        pipeline_parallel_size=1,
        data_parallel_size=1,
        model_dtype="bfloat16",
        pretrain="Qwen/Qwen3-0.6B",
        seed=42,
        vllm_v1_disable_multiproc=False,
        enable_prefix_caching=True,
        enforce_eager=True,
        gpu_memory_utilization=gpu_memory_utilization,
        inference_engine_enable_sleep=True,
        async_engine=True,
        backend="vllm",
        engine_init_kwargs={"max_model_len": 2048},
    )

    try:
        create_ray_wrapped_inference_engines(**kwargs)
    except ActorDiedError as e:
        if args.no_bug:
            print("UNEXPECTED: control path raised ActorDiedError")
            traceback.print_exc()
            return 3
        print("=" * 72)
        print("DRIVER-SIDE TRACEBACK (what SkyRL surfaces to the user):")
        print("=" * 72)
        traceback.print_exception(type(e), e, e.__traceback__)
        time.sleep(2)
        print("\n" + "=" * 72)
        print("ACTOR STDERR LOGS (where the real cause actually lives):")
        print("Glob: /tmp/ray/session_latest/logs/worker-*.err")
        print("=" * 72)
        err_logs = find_recent_actor_err_logs()
        if not err_logs:
            print("(no actor stderr logs matched in window)")
            return 3
        for p in err_logs:
            print(f"\n--- {p} ---")
            print(tail(p))
        return 2

    if args.no_bug:
        print("control path: engine init succeeded (as expected)")
        return 0
    print("UNEXPECTED: bug-trigger path completed without failure")
    return 3


if __name__ == "__main__":
    sys.exit(main())
```

This will output:

```none
========================================================================
DRIVER-SIDE TRACEBACK (what SkyRL surfaces to the user):
========================================================================
Traceback (most recent call last):
  File ".../repro.py", line 60, in main
    create_ray_wrapped_inference_engines(**kwargs)
  File ".../skyrl/backends/skyrl_train/inference_engines/ray_wrapped_inference_engine.py", line 323, in create_ray_wrapped_inference_engines
    ray.get(sleep_refs)
  ...
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::AsyncVLLMInferenceEngine.__init__() ...
  File ".../skyrl/backends/skyrl_train/inference_engines/vllm/vllm_engine.py", line 370, in _create_engine
    engine = vllm.AsyncLLMEngine.from_engine_args(engine_args, stat_loggers=stat_loggers)
  ...
  File ".../vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
    raise RuntimeError(
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

```none
========================================================================
ACTOR STDERR LOGS (where the real cause actually lives):
Glob: /tmp/ray/session_latest/logs/worker-*.err
========================================================================
--- /tmp/ray/session_latest/logs/worker-<hash>-ffffffff-<pid>.err ---
(EngineCore pid=<child>)   File ".../vllm/v1/worker/gpu_worker.py", line 283, in init_device
(EngineCore pid=<child>)     self.requested_memory = request_memory(init_snapshot, self.cache_config)
(EngineCore pid=<child>)   File ".../vllm/v1/worker/utils.py", line 413, in request_memory
(EngineCore pid=<child>)     raise ValueError(
(EngineCore pid=<child>) ValueError: Free memory on device cuda:0 (78.67/79.18 GiB) on startup is less than desired GPU memory utilization (0.999, 79.1 GiB). ...
```

The driver traceback ends at `Failed core proc(s): {}` and never names the cause. The `ValueError` raised by the engine-core child appears only in the per-actor `.err` file under `/tmp/ray/session_latest/logs/`.

## Suggested fix

At [`ray_wrapped_inference_engine.py#L323`](https://github.com/NovaSky-AI/SkyRL/blob/skyrl-v0.2.0/skyrl/backends/skyrl_train/inference_engines/ray_wrapped_inference_engine.py#L323), `ray.get(sleep_refs)` blocks on engine init and is where `ActorDiedError` first reaches the driver.

The requested fix is to attach the failed actor's stderr to the re-raised exception on `ActorDiedError`. Wrap the `ray.get(sleep_refs)` line in a try/except that, on `ActorDiedError`, reads the actor's per-process log files from Ray's session directory and re-raises with the engine-core child's stderr:

```python
import contextlib
import pathlib

from ray.exceptions import ActorDiedError

try:
    ray.get(sleep_refs)
except ActorDiedError as e:
    diagnostics = []
    for engine_actor in inference_engine_actors:
        # Best effort: a failure to read logs must not mask the original exception.
        with contextlib.suppress(Exception):
            # `_resolve_actor_stderr_log_path` is a placeholder for a helper that
            # still has to be written. Ray exposes the actor's log paths via the
            # runtime context; or read from
            # RAY_TMPDIR/session_latest/logs/worker-<id>.err
            log_path = _resolve_actor_stderr_log_path(engine_actor)
            text = pathlib.Path(log_path).read_text(errors="replace")
            diagnostics.append("\n".join(text.splitlines()[-200:]))
    if not diagnostics:
        diagnostics.append("(no actor stderr logs could be read)")
    raise RuntimeError(
        "vLLM engine actor died during init. Tail of the actor stderr log(s):\n\n"
        + "\n\n--- next actor ---\n\n".join(diagnostics)
    ) from e
