fix(gpu): select single CDI GPU defaults #1675
Open
elezar wants to merge 1 commit into
Closes #1477

Add driver-owned CDI GPU inventory selection for Docker and Podman so bare GPU requests resolve to one default device without allocation tracking.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
🌿 Preview your docs: https://nvidia-preview-pr-1675.docs.buildwithfern.com/openshell
Summary
Implement driver-owned CDI GPU default selection for Docker and Podman. Bare `--gpu` requests now resolve to one NVIDIA CDI device from driver inventory, while explicit `--gpu-device` values pass through unchanged.

Related Issue
Closes #1477
Changes
- `crates/openshell-core/src/gpu.rs`: Added normalized CDI GPU inventory, naming-family selection, and a concurrency-safe round-robin cursor.
- `crates/openshell-driver-docker/`: Uses Docker `DiscoveredDevices` for NVIDIA CDI inventory, keeps `CDISpecDirs` as the CDI support gate, and peeks at the default during validation but consumes it on create.
- `crates/openshell-driver-podman/`: Maps local `/dev/nvidiaN` nodes to `nvidia.com/gpu=N` inventory and selects defaults through the same round-robin helper.
- `e2e/rust/tests/gpu_device_selection.rs`: Updates default GPU expectations from all-GPU to selected-single-GPU semantics.

Deviations from Plan
None; implemented as planned. The GPU e2e tests could not be executed locally because the environment lacked the required devices, so the modified GPU e2e target was compiled only (`--no-run`).
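
For reviewers, the inventory mapping and the peek-versus-consume behaviour listed under Changes can be illustrated with a minimal sketch. This is not the code in `crates/openshell-core/src/gpu.rs`: the names `cdi_name_from_dev_node` and `RoundRobinCursor` are hypothetical, and the sketch assumes that the inventory is a slice of CDI device names and that an atomic counter is an acceptable way to make the cursor concurrency-safe.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical helper (not from the PR): map a device node such as
/// `/dev/nvidia0` to the CDI name `nvidia.com/gpu=0`.
/// Nodes without a purely numeric suffix (e.g. `/dev/nvidiactl`) yield `None`.
pub fn cdi_name_from_dev_node(path: &str) -> Option<String> {
    let index = path.strip_prefix("/dev/nvidia")?;
    if index.is_empty() || !index.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    Some(format!("nvidia.com/gpu={index}"))
}

/// Hypothetical round-robin cursor (not from the PR). It keeps only a
/// counter, so it does not track which devices are currently allocated.
pub struct RoundRobinCursor {
    next: AtomicUsize,
}

impl RoundRobinCursor {
    pub const fn new() -> Self {
        Self { next: AtomicUsize::new(0) }
    }

    /// Validation path: report the device the next `consume` would return,
    /// without advancing. Under concurrent use the answer is only advisory.
    pub fn peek<'a>(&self, inventory: &'a [String]) -> Option<&'a str> {
        if inventory.is_empty() {
            return None;
        }
        let i = self.next.load(Ordering::Relaxed) % inventory.len();
        Some(inventory[i].as_str())
    }

    /// Create path: return the next device and advance the counter in one
    /// atomic step, so concurrent callers each get a distinct counter value.
    pub fn consume<'a>(&self, inventory: &'a [String]) -> Option<&'a str> {
        if inventory.is_empty() {
            return None;
        }
        let i = self.next.fetch_add(1, Ordering::Relaxed) % inventory.len();
        Some(inventory[i].as_str())
    }
}
```

With an inventory of `nvidia.com/gpu=0` and `nvidia.com/gpu=1`, repeated `peek` calls on a fresh cursor keep returning `nvidia.com/gpu=0`, while successive `consume` calls return `nvidia.com/gpu=0`, `nvidia.com/gpu=1`, then `nvidia.com/gpu=0` again. Because the cursor only counts, two sandboxes can still land on the same device; that matches the "without allocation tracking" wording in the commit message.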
Testing
- `mise x -- cargo test -p openshell-core`
- `mise x -- cargo test -p openshell-driver-docker`
- `mise x -- cargo test -p openshell-driver-podman`
- `mise x -- cargo test --manifest-path e2e/rust/Cargo.toml --features e2e-gpu --test gpu_device_selection --no-run`
- `mise run pre-commit`
- `docker.com/gpu=webgpu`, no `nvidia.com/gpu=...`; Podman is not installed; `/dev/nvidia0` is absent.

Tests added:
Checklist
Documentation updated:
- `docs/sandboxes/manage-sandboxes.mdx`
- `docs/reference/sandbox-compute-drivers.mdx`
- `crates/openshell-driver-docker/README.md`
- `crates/openshell-driver-podman/README.md`