
feat(gpu): route device selection through driver config #1716

Draft
elezar wants to merge 5 commits into pull-request/1156 from poc/gpu-device-driver-config-count

Member

@elezar elezar commented Jun 3, 2026

Summary

Draft POC that routes driver-specific GPU device selection through the selected runtime's driver_config, while portable GPU presence and count stay in resource_requirements.gpu. It targets the updated shape of PR #1156 and leaves exact device-ID support out of the Kubernetes driver for now.

Related Issue

Related to #1156 and #1589.

Driver Config Shape

Public requests use a driver-keyed envelope under template.driver_config. Exact GPU device selection stays driver-owned, while portable GPU intent stays in resource_requirements.gpu.count:

{
  "resource_requirements": {
    "gpu": {
      "count": 1
    }
  },
  "template": {
    "driver_config": {
      "docker": {
        "gpu_device_ids": ["nvidia.com/gpu=0"]
      },
      "podman": {
        "gpu_device_ids": ["nvidia.com/gpu=0"]
      },
      "vm": {
        "gpu_device_ids": ["0000:2d:00.0"]
      }
    }
  }
}

The gateway selects the active driver block and forwards only the inner object to the compute driver. For Docker, the driver receives:

{
  "gpu_device_ids": ["nvidia.com/gpu=0"]
}
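
The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the gateway code from this PR: `DriverConfigEnvelope`, `example_envelope`, and `select_driver_config` are hypothetical names, and each driver block is held as raw JSON text only to make the point that the gateway does not look inside it.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the driver-keyed envelope under
/// `template.driver_config`. Each value is the raw, uninterpreted JSON
/// text of one driver's block.
type DriverConfigEnvelope = HashMap<String, String>;

/// Builds an envelope with Docker and VM blocks like the request shown above.
fn example_envelope() -> DriverConfigEnvelope {
    let mut envelope = HashMap::new();
    envelope.insert(
        "docker".to_string(),
        r#"{"gpu_device_ids":["nvidia.com/gpu=0"]}"#.to_string(),
    );
    envelope.insert(
        "vm".to_string(),
        r#"{"gpu_device_ids":["0000:2d:00.0"]}"#.to_string(),
    );
    envelope
}

/// Returns the inner block for the active driver, or `None` when the
/// request carries no block for it. The block's contents are never parsed
/// here; interpretation is left to the compute driver.
fn select_driver_config<'a>(
    envelope: &'a DriverConfigEnvelope,
    active_driver: &str,
) -> Option<&'a str> {
    envelope.get(active_driver).map(String::as_str)
}
```

With this sketch, `select_driver_config(&example_envelope(), "docker")` yields only the Docker block, and a driver without a block in the request (for example `"kubernetes"`) yields `None`.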

The CLI path in this POC builds that common shape from --gpu-device. The same envelope could also be accepted through a command-line config input, such as a future --driver-config-json, for callers that want to provide the driver block directly.
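
As a rough sketch of that CLI path, the snippet below derives both halves of the request from repeated `--gpu-device` values. `GpuRequest` and `gpu_request_from_cli` are hypothetical names for illustration; the assumption here is that the portable count is simply the number of device IDs given, with duplicates left for the driver to reject.

```rust
/// Hypothetical request fragment built from repeated `--gpu-device` values.
#[derive(Debug, PartialEq)]
struct GpuRequest {
    /// Portable intent, i.e. `resource_requirements.gpu.count`.
    gpu_count: u32,
    /// Driver-owned selection, i.e. the `gpu_device_ids` list placed in the
    /// active driver's block under `template.driver_config`.
    gpu_device_ids: Vec<String>,
}

/// Builds the GPU part of a request from the CLI device list.
/// Returns `None` when no `--gpu-device` was passed, since nothing then
/// implies GPU intent in this sketch.
fn gpu_request_from_cli(gpu_devices: &[&str]) -> Option<GpuRequest> {
    if gpu_devices.is_empty() {
        return None;
    }
    Some(GpuRequest {
        // Assumption: the count mirrors the number of requested device IDs.
        gpu_count: gpu_devices.len() as u32,
        gpu_device_ids: gpu_devices.iter().map(|id| String::from(*id)).collect(),
    })
}
```

Passing two device IDs produces a count of 2 and a two-element `gpu_device_ids` list, so the request satisfies the count-equals-IDs rule by construction.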

Changes

  • Add per-driver gpu_device_ids handling for Docker, Podman, and VM driver config.
  • Allow --gpu-device to imply GPU intent and set the portable GPU count to match the requested device IDs.
  • Select the active driver config block in the gateway and pass it to the driver without interpreting nested fields.
  • Validate gpu_device_ids at the driver level: non-empty IDs require a non-zero GPU count, duplicates are rejected, and the number of unique IDs must equal resource_requirements.gpu.count.
  • Leave Kubernetes without exact device-id handling for this POC.
  • Update driver docs, sandbox docs, architecture notes, and focused tests.
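
The driver-level validation rules listed above could look something like the sketch below. The names `GpuConfigError` and `validate_gpu_device_ids` are hypothetical, and two readings are assumed: "non-empty IDs" means a non-empty `gpu_device_ids` list, and an empty list is accepted without further checks.

```rust
use std::collections::HashSet;

/// Hypothetical error type covering the three rejection cases.
#[derive(Debug, PartialEq)]
enum GpuConfigError {
    /// Device IDs were given but the portable GPU count is zero.
    IdsWithoutGpuCount,
    /// The same device ID appears more than once.
    DuplicateId(String),
    /// The number of unique IDs differs from the portable GPU count.
    CountMismatch { requested: u32, ids: usize },
}

/// Checks `gpu_device_ids` against `resource_requirements.gpu.count`.
fn validate_gpu_device_ids(
    gpu_count: u32,
    gpu_device_ids: &[String],
) -> Result<(), GpuConfigError> {
    // Assumption: with no exact selection there is nothing to validate here.
    if gpu_device_ids.is_empty() {
        return Ok(());
    }
    if gpu_count == 0 {
        return Err(GpuConfigError::IdsWithoutGpuCount);
    }
    // `insert` returns false when the ID was already present.
    let mut seen = HashSet::new();
    for id in gpu_device_ids {
        if !seen.insert(id.as_str()) {
            return Err(GpuConfigError::DuplicateId(id.clone()));
        }
    }
    if seen.len() != gpu_count as usize {
        return Err(GpuConfigError::CountMismatch {
            requested: gpu_count,
            ids: seen.len(),
        });
    }
    Ok(())
}
```

Checking duplicates before the count comparison means a request such as two copies of `0000:2d:00.0` with a count of 2 fails with a duplicate error rather than a less specific mismatch.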

Testing

  • mise run pre-commit passes
  • mise exec -- cargo test -p openshell-core -p openshell-driver-docker -p openshell-driver-podman -p openshell-driver-vm --lib gpu
  • mise run check
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

elezar added 4 commits June 3, 2026 12:57
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
copy-pr-bot Bot commented Jun 3, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions Bot commented Jun 3, 2026

Signed-off-by: Evan Lezar <elezar@nvidia.com>
elezar force-pushed the poc/gpu-device-driver-config-count branch from 69be336 to e106f02 on June 3, 2026 18:09
copy-pr-bot Bot force-pushed the pull-request/1156 branch 2 times, most recently from 4c18100 to cec0e21 on June 4, 2026 07:54