Skip to content

feat(gpu): introduce GPU request spec#1156

Open
elezar wants to merge 5 commits into
mainfrom
feat/gpu-request-spec
Open

feat(gpu): introduce GPU request spec#1156
elezar wants to merge 5 commits into
mainfrom
feat/gpu-request-spec

Conversation

@elezar
Copy link
Copy Markdown
Member

@elezar elezar commented May 4, 2026

Summary

Adds compact GPU resource requirements to the public and compute-driver protos, moving the API toward the resource-requirements shape proposed in #1360.

Related Issue

Related to #1444 and #1360

Changes

  • Replaces SandboxSpec.gpu with SandboxSpec.resource_requirements.gpu.
  • Supports default GPU requests, GPU count requests, and interim driver-specific device_ids.
  • Preserves driver behavior: Kubernetes honors counts, VM accepts one GPU, Docker/Podman reject count requests while using CDI device IDs.
  • Updates CLI mapping, gateway validation/translation, tests, and docs.

Testing

  • mise run pre-commit

Checklist

  • Follows Conventional Commits
  • Docs updated

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 4, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@elezar elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 8b95e09 to 3d17c4e Compare May 13, 2026 16:00
@elezar elezar force-pushed the feat/gpu-request-spec branch from 3d17c4e to 930c581 Compare May 13, 2026 16:44
@github-actions
Copy link
Copy Markdown

@elezar elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 07b171d to dd18d21 Compare May 15, 2026 15:12
@elezar elezar marked this pull request as ready for review May 15, 2026 15:12
@elezar elezar requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 15, 2026 15:12
@elezar elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 9ff168a to cfceac9 Compare May 18, 2026 13:28
@elezar elezar added the test:e2e-gpu Requires GPU end-to-end coverage label May 18, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e-gpu applied for cfceac9. Open the existing run and click Re-run all jobs to execute with the label set. The E2E Gate check on this PR will flip green automatically once the run finishes.

@elezar elezar force-pushed the feat/gpu-request-spec branch 3 times, most recently from ff9e99d to 76af3a3 Compare May 20, 2026 11:57
@elezar elezar mentioned this pull request May 20, 2026
7 tasks
@elezar elezar force-pushed the feat/gpu-request-spec branch 4 times, most recently from 7f09b83 to 3c6a501 Compare June 1, 2026 21:15
Comment thread proto/openshell.proto Outdated
message GpuRequestSpec {
// Optional driver-native device identifiers. Empty means the driver chooses
// its default GPU assignment behavior.
repeated string device_id = 1;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the PR description

Replace legacy GPU-specific request fields in the public and driver protos with ResourceRequirements carrying a GPUSpec. This also adds GPU count requests while preserving the existing default GPU request shape used by --gpu and image-name auto-detection.

Where do we support counts in this PR?

Another thought, with the device_config convention we're trying to establish, does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do we support counts in this PR?

Looking locally, I think I had tested a version of this branch on a machine with GPUs and then pushed these changes without first pulling the commit including counts or forgot to push the local changes).

does it make as much sense to include device_id in this proto? device_id could get represented from the driver_config, and gpu could simply represent the number of gpus to request.

Using driver_config for GPU ids is an interesting idea and may make sense since the actual IDs available depend on the driver implementation (e.g. index vs BDF vs CDI device name).

This probably means that we're looking at the following shape:

driver_config:
  docker:
    device_ids: 
    - nvidia.com/gpu=0
  podman:
    device_ids: 
    - nvidia.com/gpu=0
  vm:
    device_ids: 
    - 0

This would select the specified devices when --gpu-count=1 (assuming that we extend the docker and podman drivers to support this). We would also remove / deprecate the --gpu-device command line flag in this case and have --gpu imply --gpu-count=1. Note that the device_ids would ONLY be processed if a count is actually specified so that we don't allow driver-config to override sandbox-specific resources.

Let's me use this as a concrete driver for an implementaion of the proposal in #1589.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #1716 as a draft POC for what this would look like. I still need to make another pass or two though.

@elezar elezar force-pushed the feat/gpu-request-spec branch 2 times, most recently from 7bdaa73 to 23bdbb1 Compare June 3, 2026 11:29
@elezar
Copy link
Copy Markdown
Member Author

elezar commented Jun 3, 2026

Compatibility note: this revision intentionally keeps the proto compact while the API is still pre-alpha. It reuses the old direct GPU field position for resource_requirements and does not reserve the removed GPU-related tags/names. If reviewers prefer compatibility protection here, we can reserve the old fields in a follow-up revision.

@elezar
Copy link
Copy Markdown
Member Author

elezar commented Jun 3, 2026

@drew I readded the counts (from local history), and then went a step further and added the ResourceRequirements envelope instead of doing that as a follow up.

This is currently a separate commit, so I can roll it back if we're more comfortable with the initial change.

Comment thread crates/openshell-cli/src/run.rs Outdated
Comment thread architecture/compute-runtimes.md Outdated
Comment on lines 2614 to +2616
gpu,
gpu_device.as_deref(),
gpu_count,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a follow-up: Does it make sense to introduce a type for gpu and other resources going forward?

Comment on lines 1681 to 1685
gpu: bool,
gpu_device: Option<&str>,
gpu_count: Option<u32>,
cpu: Option<&str>,
memory: Option<&str>,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For follow-up: Let's add a struct that handles these resources together instead of adding new arguments.

Comment on lines +14 to +19
pub fn cdi_gpu_device_ids(gpu: Option<&DriverGpuResourceRequirement>) -> Option<Vec<String>> {
match gpu {
Some(gpu) if gpu.device_ids.is_empty() => Some(vec![CDI_GPU_DEVICE_ALL.to_string()]),
Some(gpu) => Some(gpu.device_ids.clone()),
None => None,
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the core changes of #1675 is to not default to nvidia.com/gpu=all when an empty device_id is specified, but instead select the "next" GPU id. We're currently keeping the behaviour unchanged here as part of this PR, but it may be useful to call this out in a comment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note is that #1675 builds on this PR to also add support for count-based GPU requests.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. cdi_gpu_device_ids now documents that an empty explicit device list preserves the current all-GPU CDI default for this PR.

Comment on lines +1724 to +1730
fn driver_gpu_requirement(
spec: &openshell_core::proto::compute::v1::DriverSandboxSpec,
) -> Option<&DriverGpuResourceRequirement> {
spec.resource_requirements
.as_ref()
.and_then(|requirements| requirements.gpu.as_ref())
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a question: Is this a common Rust pattern to extract a reference to a specific field (or sub-field) form a type?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question: This specific function does not seem docker-specific. Should it be moved to somewhere like crates/openshell-core/src/gpu.rs if it is also present in podman or vm implementations?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The shared extraction helper now lives in openshell_core::gpu, and Docker/Podman/Kubernetes/VM use the common driver_gpu_requirement path.

const GPU_RESOURCE_QUANTITY: &str = "1";
const DEFAULT_GPU_COUNT: u32 = 1;

fn driver_gpu_requirement(spec: &SandboxSpec) -> Option<&DriverGpuResourceRequirement> {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is common to the other drivers.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. This now uses the shared openshell_core::gpu::driver_gpu_requirement helper instead of a Kubernetes-local pattern.

Comment on lines +322 to +328
if let Some(gpu) = sandbox.spec.as_ref().and_then(driver_gpu_requirement)
&& gpu_has_explicit_device_ids(Some(gpu))
{
return Err(KubernetesDriverError::Precondition(
"kubernetes compute driver does not support explicit GPU device IDs".to_string(),
));
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we seemingly duplicating the sandbox validation here?

sandbox_template_to_k8s(
&SandboxTemplate::default(),
spec.is_some_and(|s| s.gpu),
spec.and_then(driver_gpu_requirement),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between spec.and_then(driver_gpu_requirement) and just calling driver_gpu_requirement(spec). If they're equivalent, can we settle on one of these for consistency?

static ENV_LOCK: std::sync::LazyLock<std::sync::Mutex<()>> =
std::sync::LazyLock::new(|| std::sync::Mutex::new(()));

fn gpu_request(count: Option<u32>) -> DriverGpuResourceRequirement {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have multiple mechanisms to construct a DriverGpuResourceRequirement in the kubernetes driver? This particular method seems to only be used in testing. If this is the case, is there a way to mark it as such. Furthermore, if we could "normalize" the construction, so that we always construct a DriverGpuResourceRequirement internally with the default count (i.e. the default request), then we don't need to remember to handle the special case when Count is not specified.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The construction helpers are now local to the Kubernetes test module, with default_gpu_request() for the default request and gpu_request(count) only for count-specific cases.

Comment on lines +2081 to +2095
fn gpu_has_explicit_device_ids_only_when_ids_are_present() {
assert!(!gpu_has_explicit_device_ids(None));
assert!(!gpu_has_explicit_device_ids(Some(
&DriverGpuResourceRequirement {
device_ids: vec![],
count: None,
}
)));
assert!(gpu_has_explicit_device_ids(Some(
&DriverGpuResourceRequirement {
device_ids: vec!["nvidia.com/gpu=0".to_string()],
count: None,
}
)));
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we testing this in the kubernetes driver? Explicit device IDs are not supported here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The Kubernetes test now verifies that explicit GPU device IDs are rejected, since Kubernetes does not support device-specific requests in this PR.

sandbox_template_to_k8s(
&template,
true,
Some(&gpu_request(None)),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably just be a "defaultGpuRequest" which maps better to gpu == true before these changes.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The helper is now default_gpu_request(), with gpu_request(count) kept for count-specific Kubernetes tests.

Comment thread crates/openshell-driver-kubernetes/README.md Outdated
Comment on lines +821 to +826
resource_requirements: Some(DriverSandboxResourceRequirements {
gpu: Some(DriverGpuResourceRequirement {
device_ids: vec![],
count: None,
}),
}),
Copy link
Copy Markdown
Member Author

@elezar elezar Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a "defaultGPURequest"? (which may be testing-only and make this simpler to read). This may not be the case though if we're only using it once.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. The Podman test now uses default_gpu_request() for the default GPU request case.

) -> Result<(), ComputeDriverError> {
let gpu_requested = sandbox.spec.as_ref().is_some_and(|s| s.gpu);
Self::validate_gpu_request(gpu_requested)
let gpu = sandbox.spec.as_ref().and_then(driver_gpu_requirement);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note to be consistent on how the DriverGpuRequirements are constructed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Podman now uses the shared driver_gpu_requirement extraction helper, and the test construction path uses the default GPU request helper where appropriate.

gpu: Option<&DriverGpuResourceRequirement>,
gpu_enabled: bool,
) -> Result<(), Status> {
if gpu.is_some() && !gpu_enabled {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it idiomatic in Rust to perform a quick return if gpu is NOT set?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is the usual guard-clause shape in Rust for optional validation: return early for None, then validate the concrete GPU request without repeatedly unwrapping or matching.

@elezar elezar force-pushed the feat/gpu-request-spec branch from 009a6ee to 4c18100 Compare June 3, 2026 21:21
elezar added 5 commits June 4, 2026 09:47
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar force-pushed the feat/gpu-request-spec branch from 4c18100 to cec0e21 Compare June 4, 2026 07:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e-gpu Requires GPU end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants