[examples][infra] Anyscale 2-node Qwen3 launch flow#1689
Conversation
…cess group When a k8s pod is evicted (preemption, scale-down, node drain) the container gets SIGTERM with a 25s grace period before SIGKILL. Without a handler, in-flight NCCL collectives leak communicators and the next run may hit stale process group state. Add a SIGTERM handler inside DistributedTorchRayActor.init_worker_process_group() that: - calls torch.cuda.synchronize() to drain any in-flight CUDA work - calls torch.distributed.destroy_process_group() to release NCCL - exits cleanly with sys.exit(0) Both calls are wrapped in try/except so a partial-state worker still tears down the half that's healthy. The whole sequence is well under the 25s grace window. Each call is guarded (`torch.distributed.is_available() and is_initialized()`) so it does nothing when distributed isn't set up yet.
Skip the call entirely on CPU-only environments so we don't generate a noisy warning every time a non-CUDA worker is terminated. Only emit a warning if synchronize() actually fails on a CUDA-capable system.
Adds an Anyscale-friendly launch path for Qwen3-30B-A3B and Qwen3-235B-A22B on 2 H100 nodes. The key piece is a one-shot venv build on shared NFS that all Ray actors share via py_executable — without it every actor independently ran uv install of skyrl[megatron], causing NCCL rendezvous to time out before all 16 actors registered. Built on top of the SIGTERM teardown PR (NovaSky-AI#1688) so preemption / scale-down doesn't leak NCCL communicators.
There was a problem hiding this comment.
Code Review
This pull request introduces configuration and scripts for training and validating large-scale Qwen3 models (235B and 30B) using Megatron and Ray on Anyscale. Key additions include shared virtual environment management, model prefetching, and optimized training scripts utilizing FP8 and optimizer offloading. Feedback highlights several critical issues: the SIGTERM cleanup handler in the worker base class is bypassed by Megatron worker overrides, model-copying logic for local NVMe caching is misplaced on the head node instead of the workers, and there are potential race conditions when building the shared virtual environment or installing NVIDIA packages on shared NFS without proper locking or atomicity.
| # Clean teardown on k8s SIGTERM: drain CUDA streams + release NCCL | ||
| # communicators before the 25s grace period elapses. | ||
| rank = self._rank | ||
|
|
||
| def _sigterm_cleanup(signum, frame): | ||
| logger.warning(f"SIGTERM received in worker rank={rank}, cleaning up...") | ||
|
|
||
| if torch.cuda.is_available(): | ||
| try: | ||
| torch.cuda.synchronize() | ||
| except Exception as e: | ||
| logger.warning(f"cuda.synchronize() failed: {e}") | ||
|
|
||
| try: | ||
| if torch.distributed.is_available() and torch.distributed.is_initialized(): | ||
| torch.distributed.destroy_process_group() | ||
| except Exception as e: | ||
| logger.warning(f"destroy_process_group() failed: {e}") | ||
|
|
||
| sys.exit(0) | ||
|
|
||
| signal.signal(signal.SIGTERM, _sigterm_cleanup) |
There was a problem hiding this comment.
The SIGTERM cleanup handler is registered within init_worker_process_group. However, MegatronPolicyWorkerBase (and other Megatron-specific workers in megatron_worker.py) overrides this method without calling super().init_worker_process_group(). Consequently, this cleanup logic will not be active for Megatron training runs, which are the primary focus of this launch flow.
To ensure the handler is registered, you should either:
- Call
super().init_worker_process_group()in the Megatron worker subclasses. - Move the signal registration to a method that is guaranteed to run, such as the
__init__method ofDistributedTorchRayActor.
| if [ -d "$NFS_HF/hub/models--Qwen--Qwen3-30B-A3B-Base" ] && [ ! -d "$LOCAL_HF/hub/models--Qwen--Qwen3-30B-A3B-Base" ]; then | ||
| echo "[script] copying Qwen3-30B-A3B-Base from NFS to local" | ||
| cp -r "$NFS_HF/hub/models--Qwen--Qwen3-30B-A3B-Base" "$LOCAL_HF/hub/" | ||
| fi |
There was a problem hiding this comment.
This model-copying logic executes on the head node as part of the job's entrypoint. However, in the provided Anyscale configuration, the head node does not have GPUs and does not run training workers. The actual Ray workers on the worker nodes will not execute this shell script; they are started directly via py_executable.
As a result, the workers will continue to use the NFS path specified in the YAML's HF_HOME, and the intended optimization of using local NVMe for faster loading will not be realized on the nodes where it matters most. Consider implementing the local caching logic within the worker's initialization or using a Ray runtime environment setup hook that runs on all nodes.
| echo "[setup] building venv at $VENV ($(date -u))" | ||
| rm -rf "$VENV" |
There was a problem hiding this comment.
There is a potential race condition if multiple build jobs are triggered simultaneously or if a job is retried. The rm -rf "$VENV" followed by a non-atomic uv sync can leave the shared storage in a corrupted state for other processes.
It is safer to build the environment in a temporary directory and then use an atomic mv to place it at the final destination once the build is successful (and the stamp is created).
| if [ ! -f "$NV/nvidia/cudnn/include/cudnn.h" ]; then | ||
| echo "[bootstrap] installing nvidia headers to $NV" | tee -a "$DIAG_LOG" | ||
| /home/ray/anaconda3/bin/python3 -m pip install -q --target "$NV" \ | ||
| nvidia-cudnn-cu12 nvidia-nccl-cu12 nvidia-cublas-cu12 nvidia-cusparse-cu12 \ | ||
| nvidia-cusolver-cu12 nvidia-curand-cu12 nvidia-cufft-cu12 nvidia-cuda-runtime-cu12 \ | ||
| nvidia-cuda-nvrtc-cu12 nvidia-cuda-cupti-cu12 nvidia-nvjitlink-cu12 nvidia-nvtx-cu12 \ | ||
| nvidia-cudnn-frontend | ||
| fi |
There was a problem hiding this comment.
Installing nvidia packages directly to a shared NFS path without locking is prone to race conditions if multiple jobs are launched. A partial installation from one job could cause others to fail or use corrupted headers/libraries.
Consider using a more robust synchronization method or ensuring that this bootstrap step is performed by a single, idempotent setup job (similar to the venv build).
Why
Running SkyRL Megatron training on Anyscale (or any Ray-on-k8s cluster) with the default flow has a hidden cost:
uv run --extra megatron ….uv runand propagates the sameuv runcommand aspy_executablefor every Ray worker actor.uv installofskyrl[megatron], buildingtransformer-engine-torchfrom source (~5–10 min per actor).init_process_groupon all 16, hits the 600 s NCCL rendezvous timeout, and dies with:What this PR adds
A complete Anyscale launch flow for Qwen3-30B-A3B and Qwen3-235B-A22B on 2 H100 nodes, built around a single shared venv on NFS:
build_shared_venv.yaml— one-shot job that runsuv sync --extra megatron --no-editableinto/mnt/cluster_storage/.skyrl-venv. Idempotent; subsequent invocations no-op.anyscale_qwen3_30b_2nodes.yaml/anyscale_qwen3_235b_2nodes.yaml— training-job YAMLs that setpy_executable: /mnt/cluster_storage/.skyrl-venv/bin/pythonso every Ray actor reuses the prebuilt venv directly.run_megatron_qwen3_30b_2nodes.sh/run_megatron_qwen3_235b_2nodes.sh— the training shell scripts. Theyunset RAY_RUNTIME_ENV_HOOKbefore invoking Python so the autouv runpropagation doesn't reactivate, and forward SIGTERM to the python child so the in-process handler (from [workers] Clean teardown on SIGTERM: drain CUDA + destroy process group #1688) actually fires.clear_uv_cache.yaml,clear_venv.yaml,dump_diag.yaml,inspect_cudnn.yaml,download_model.yaml. Useful when iterating on the cluster (rebuild venv, inspect cudnn paths, prefetch the HF model into/mnt/cluster_storage/hf_cache).Effect
Validated end-to-end on Qwen3-30B-A3B (TP=4, EP=8, DP=4 — 16 actors, 2 nodes × 8 H100s):
policy_loss0.168 → 0.151 over 5 steps,reward/avg_pass_at_16trending up from 0.447 to 0.490, full eval-before-train cycle completed.Test plan
anyscale job submit -f examples/train/megatron/build_shared_venv.yaml --waitonce per cluster lifetime — confirmIMPORTS OKin the final log line.anyscale job submit -f examples/train/megatron/anyscale_qwen3_30b_2nodes.yaml— confirm the worker raylet logs show noBuilding transformer-engine-torchand the job reachesInitialized process group for RayActorGroupwithin minutes.policy_loss,reward/avg_pass_at_16,timing/step).