Fix docker scheduler state aggregation by fallintoplace · Pull Request #554 · NVIDIA-NeMo/Run

fallintoplace · 2026-06-28T13:36:52Z

What changed

fix PersistentDockerScheduler.describe() so any failed container marks the app as FAILED
only report SUCCEEDED once every container has reported and all of them succeeded
add regression coverage for mixed container states without depending on a live Docker daemon

Why

The previous aggregation logic treated the whole app as successful as soon as it saw any terminal state and any success. That misreported mixed states like [SUCCEEDED, FAILED] as SUCCEEDED, and it could also report [SUCCEEDED, RUNNING] as finished too early.

Impact

Docker-backed jobs now keep polling while some containers are still running, and they surface a failure as soon as any container fails.

Checks

uv run -- pytest test/run/torchx_backend/schedulers/test_docker.py -k aggregates_container_states
uv run --group lint -- ruff check nemo_run/run/torchx_backend/schedulers/docker.py test/run/torchx_backend/schedulers/test_docker.py

Notes

The full Docker scheduler test file still expects a live Docker daemon in this local environment, so validation here is focused on the regression coverage added in this change.

Fix docker scheduler state aggregation

0c670c5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix docker scheduler state aggregation#554

Fix docker scheduler state aggregation#554
fallintoplace wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
fallintoplace:fix/docker-scheduler-state-aggregation

fallintoplace commented Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

fallintoplace commented Jun 28, 2026

What changed

Why

Impact

Checks

Notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant