Skip to content

Fix docker scheduler state aggregation#554

Open
fallintoplace wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
fallintoplace:fix/docker-scheduler-state-aggregation
Open

Fix docker scheduler state aggregation#554
fallintoplace wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
fallintoplace:fix/docker-scheduler-state-aggregation

Conversation

@fallintoplace

Copy link
Copy Markdown

What changed

  • fix PersistentDockerScheduler.describe() so any failed container marks the app as FAILED
  • only report SUCCEEDED once every container has reported and all of them succeeded
  • add regression coverage for mixed container states without depending on a live Docker daemon

Why

The previous aggregation logic treated the whole app as successful as soon as it saw any terminal state and any success. That misreported mixed states like [SUCCEEDED, FAILED] as SUCCEEDED, and it could also report [SUCCEEDED, RUNNING] as finished too early.

Impact

Docker-backed jobs now keep polling while some containers are still running, and they surface a failure as soon as any container fails.

Checks

  • uv run -- pytest test/run/torchx_backend/schedulers/test_docker.py -k aggregates_container_states
  • uv run --group lint -- ruff check nemo_run/run/torchx_backend/schedulers/docker.py test/run/torchx_backend/schedulers/test_docker.py

Notes

The full Docker scheduler test file still expects a live Docker daemon in this local environment, so validation here is focused on the regression coverage added in this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant