
feat: Support K8s DRA Resources V1 APIs #596

Closed

adityasingh0510 wants to merge 10 commits into NVIDIA:main from adityasingh0510:feature/k8s-v1-resource-api-support

Conversation

@adityasingh0510

@adityasingh0510 adityasingh0510 commented Dec 8, 2025

This PR updates dcgm-exporter to support both the stable resource.k8s.io/v1 API and the v1beta1 API for Dynamic Resource Allocation (DRA). This ensures compatibility with Kubernetes 1.34+ clusters (which serve v1) and older clusters (which serve v1beta1), with automatic detection and graceful fallback.

Problem

When enabling DRA labels in dcgm-exporter on Kubernetes 1.34+ clusters, the following error occurs:

failed to list v1beta1.ResourceSlice as we have v1.ResourceSlice

This happens because:

  • Kubernetes 1.34+ promotes the ResourceSlice API from v1beta1 to stable v1
  • Clusters may only expose the v1 API, breaking code that only uses v1beta1
  • Older clusters (1.27-1.33) still use v1beta1, so we need to support both

Changes

Files Modified

  • internal/pkg/transformation/dra.go:

    • Register both v1 and v1beta1 ResourceSlice informers
    • Implement separate event handlers for each API version:
      • onAddOrUpdateV1() / onAddOrUpdateV1beta1()
      • onDeleteV1() / onDeleteV1beta1()
    • Add cache checking in delete handlers to prevent premature device removal
    • Handle API structure differences:
      • v1beta1: dev.Basic.Attributes
      • v1: dev.Attributes (direct access, no Basic wrapper)
  • internal/pkg/transformation/types.go:

    • Add v1Informer and v1beta1Informer fields to DRAResourceSliceManager struct
  • go.mod / go.sum:

    • Upgrade k8s.io/api: v0.33.3 → v0.34.0 (adds support for resource/v1)
    • Upgrade k8s.io/client-go: v0.33.3 → v0.34.0 (ensures compatibility)
    • Upgrade k8s.io/apimachinery: v0.33.3 → v0.34.0

API Structure Changes

The v1 API has a different structure than v1beta1:

API Version   Device Attribute Access
v1beta1       dev.Basic.Attributes
v1            dev.Attributes (direct)

The implementation handles both structures correctly.

Behavior

Automatic API Detection

The code registers both informers and uses whichever is available:

// Both informers are registered
v1Informer := factory.Resource().V1().ResourceSlices().Informer()
v1beta1Informer := factory.Resource().V1beta1().ResourceSlices().Informer()

// At least one must sync successfully
v1Synced := cache.WaitForCacheSync(ctx.Done(), v1Informer.HasSynced)
v1beta1Synced := cache.WaitForCacheSync(ctx.Done(), v1beta1Informer.HasSynced)

Precedence Logic

When both APIs are available:

  • v1 takes precedence: v1beta1 only adds devices if v1 doesn't already have them
  • Delete protection: Before deleting, handlers check if the device exists in the other API's cache
  • No duplicate entries: Precedence logic ensures each device is only tracked once

Testing

Verification

  • Code compiles successfully with both API versions
  • All tests pass - existing unit tests continue to work
  • No linter errors
  • v1 API support - verified with Kubernetes 1.34+ API structure
  • v1beta1 API support - verified with Kubernetes 1.27-1.33 API structure
  • Dual API handling - both informers work correctly when both are available
  • Precedence logic - v1 correctly takes precedence over v1beta1
  • Delete handling - race conditions prevented with cache checking

Test Scenarios

  • Kubernetes 1.34+ clusters (v1 API only)
  • Kubernetes 1.27-1.33 clusters (v1beta1 API only)
  • Clusters with both APIs available (migration periods)
  • MIG devices work with both API versions

Backward Compatibility

Fully backward compatible:

  • Existing deployments on Kubernetes 1.27-1.33 continue to work unchanged
  • No breaking changes for any supported Kubernetes version
  • No configuration changes required

Forward compatible:

  • Ready for Kubernetes 1.34+ clusters
  • Automatically uses the best available API version

Breaking Changes

None - This is a backward and forward compatibility enhancement. The change:

  • Works on older clusters (1.27-1.33) using v1beta1
  • Works on newer clusters (1.34+) using v1
  • Works during migration periods when both are available
  • Requires no configuration changes

Related Issues

@adityasingh0510 adityasingh0510 force-pushed the feature/k8s-v1-resource-api-support branch from c179153 to 2d09218 on December 8, 2025 09:40
@adityasingh0510 adityasingh0510 changed the title feat: Add support for Kubernetes v1.34 resource.k8s.io/v1 APIs feat: Add dual API support for ResourceSlice (v1 and v1beta1) Dec 8, 2025
@adityasingh0510 adityasingh0510 changed the title feat: Add dual API support for ResourceSlice (v1 and v1beta1) feat: Support K8s DRA Resources V1 APIs Dec 8, 2025
Comment thread internal/pkg/transformation/dra.go Outdated
@guptaNswati
Contributor

@adityasingh0510 thank you for the PR and your patience. I have some comments that need to be addressed before the merge.

@adityasingh0510
Author

@guptaNswati thank you for the review and comments. I have addressed the feedback and pushed the updates for your review.

@guptaNswati
Contributor

guptaNswati commented Feb 24, 2026

Overall it looks good. Tests also need to be added: https://github.com/NVIDIA/dcgm-exporter/blob/main/internal/pkg/transformation/kubernetes_test.go

Need to double-check the edge cases; I will come back to it. Meanwhile, please address the comments, add tests, and paste the test output and logs.

Comment thread internal/pkg/transformation/dra.go Outdated

// Register informers for both v1 and v1beta1 to support both API versions
v1Informer := factory.Resource().V1().ResourceSlices().Informer()
v1beta1Informer := factory.Resource().V1beta1().ResourceSlices().Informer()
Contributor

need to use the discovery logic here also to decide which informer to start.


+1

Let's say we had a v1beta1 ResourceSlice and we upgraded to k8s v1.34. Even if the v1beta1 apiVersion is enabled, we should simply treat it as a v1 ResourceSlice and use it that way. The storageVersion of the object would already have been converted to v1 in v1.34+, but the object will be available for consumption in both the v1 and v1beta1 apiVersions, if enabled. So both of these informers would watch the same object.

We should only use the latest apiVersion enabled. With this, the rest of the code should be simplified.

Author

Thanks, that clarification helps. I’ll update the implementation to use discovery to select only the latest available API version (prefer v1, otherwise v1beta1), start a single informer for that version, and simplify the DRA handlers accordingly.

Comment thread internal/pkg/transformation/dra.go Outdated
func (m *DRAResourceSliceManager) onDelete(obj interface{}) {
// onAddOrUpdateV1 handles v1 API ResourceSlice events
func (m *DRAResourceSliceManager) onAddOrUpdateV1(obj interface{}) {
slice := obj.(*resourcev1.ResourceSlice)
Contributor

need to check the type assertion:

s, ok := obj.(*resourcev1beta1.ResourceSlice)
if !ok {
	return // event handlers return nothing; just skip unexpected types
}

I think it's missing in the original as well; needs to be fixed.

Comment thread internal/pkg/transformation/dra.go Outdated
slice := obj.(*resourcev1beta1.ResourceSlice)
pool := slice.Spec.Pool.Name
// onAddOrUpdate handles ResourceSlice add/update events for both v1 and v1beta1 APIs
func (m *DRAResourceSliceManager) onAddOrUpdate(adapter resourceSliceAdapter, apiVersion string, v1TakesPrecedence bool) {
Contributor

When this was originally written, the assumption was that ResourceSlices are static and once a device exists, it won't go away. But we recently added support for some features in the DRA driver where a ResourceSlice can be updated and republished. I am debating whether that should be handled here or tracked in a new issue.

Contributor

we can end up with stale keys here.

Contributor

@varunrsekar what is your opinion here? It should not cause any issues when VFIO mode is enabled, as DCGM won't work anyway, but later it will be important for VFIO too. Or we may move away from updating the ResourceSlice. Either way, it won't hurt to have a sync here in both add and delete.


  1. On ADD: add all devices from the slice to the cache
  2. On UPDATE: cleanup all cached devices from that slice and re-add new list to the cache.
  3. On DELETE: cleanup all slice devices

Otherwise, we'll end up leaking memory if the slice churns for whatever reason.

Contributor

We can simplify it even more by rebuilding the map on any event or at a fixed interval, though that can be costly in terms of churn.

Contributor

I went through the code again and realized the local caching is redundant. It was mainly done for fast lookup, and it doesn't handle update on delete. We should just use the informer cache.

Had a discussion with @varunrsekar on it on a call.

Author

@adityasingh0510 adityasingh0510 Feb 26, 2026

We need to remove all local cache maps (deviceToUUID, migDevices, sliceDevices) and use only the informer cache, querying it directly in GetDeviceInfo, right?


Comment on lines +76 to +77
v1Informer cache.SharedIndexInformer
v1beta1Informer cache.SharedIndexInformer

We should have only a single informer here. Depending on the latest API version enabled in the cluster, the corresponding informer should be configured here.

Contributor

yes this goes back to discovery first and choosing the highest version based on that.


Contributor

@adityasingh0510 we don't need both informers. Please refer to Varun's comment.

Contributor

@adityasingh0510 the logic is fixed below, but I still see two informers. Please fix this.

Comment thread internal/pkg/transformation/dra.go Outdated

deviceType := getAttrString(attr, "type")
deviceType := dev.GetAttribute("type")
switch deviceType {

can you add a default case to log the type that's not handled? It'll provide hints for users if they need to eventually implement it here

Contributor

+1.
Originally it's:

default:
	slog.Warn(fmt.Sprintf("Device [key:%s] has unknown type: %s", key, deviceType))

Comment thread internal/pkg/transformation/dra.go Outdated
key := pool + "/" + dev.GetName()

deviceType := getAttrString(attr, "type")
deviceType := dev.GetAttribute("type")

We are implicitly using the NVIDIA GPU DRA Driver as the reference for this code. If there are GPU DRA vendors that don't implement it this way, then DCGM-exporter will not work with it. Would be good to call it out.

Contributor

@guptaNswati guptaNswati Feb 25, 2026

Right now dcgm-exporter is only supposed to work with the nvidia-dra-driver. This may need to be revisited in the future. We can add a comment here.

Author

I’ve added a comment next to DRAGPUDriverName in const.go clarifying that the current DRA handling assumes the NVIDIA GPU DRA driver schema, and that other GPU DRA drivers may not be compatible with this implementation.


Comment thread internal/pkg/transformation/dra.go Outdated
Comment on lines +326 to +334
if v1TakesPrecedence {
if _, exists := m.deviceToUUID[key]; !exists {
m.deviceToUUID[key] = uuid
slog.Debug(fmt.Sprintf("Added gpu device [key:%s] with UUID: %s (%s)", key, uuid, apiVersion))
}
} else {
m.deviceToUUID[key] = uuid
slog.Debug(fmt.Sprintf("Added gpu device [key:%s] with UUID: %s (%s)", key, uuid, apiVersion))
}

If I read this piece of code and how onAddOrUpdate is invoked correctly:

  1. For the v1 API, we simply overwrite the deviceToUUID map entry.
  2. For the v1beta1 API, we don't overwrite and only add to the deviceToUUID map if the key doesn't exist.

Can you help me understand why this is needed?

Author

I've simplified this per your feedback: we now pick a single preferred API version, and on each add/update we clear the slice's devices from the cache and re-add them from the current spec, so v1 and v1beta1 are handled uniformly.

@adityasingh0510
Author

@guptaNswati @varunrsekar Thanks for the review. I’ve addressed the comments, refactored DRA to use discovery + a single informer with the informer cache, and added tests in dra_test.go and kubernetes_test.go.

I ran the tests and here are the commands and outputs:

cd /home/ubuntu/dcgm-exporter
go test ./internal/pkg/transformation -count=1
ok github.com/NVIDIA/dcgm-exporter/internal/pkg/transformation 0.328s

Comment thread internal/pkg/transformation/dra.go Outdated
Comment on lines +172 to +176
v1Available, err := discovery.IsResourceEnabled(discoveryClient, schema.GroupVersionResource{
Group: "resource.k8s.io",
Version: "v1",
Resource: "resourceslices",
})

Suggested change
v1Available, err := discovery.IsResourceEnabled(discoveryClient, schema.GroupVersionResource{
Group: "resource.k8s.io",
Version: "v1",
Resource: "resourceslices",
})
v1Available, err := discovery.IsResourceEnabled(discoveryClient, resourcev1.SchemeGroupVersion.WithResource("resourceslices"))

Comment thread internal/pkg/transformation/dra.go Outdated
Comment on lines 181 to 185
v1beta1Available, err := discovery.IsResourceEnabled(discoveryClient, schema.GroupVersionResource{
Group: "resource.k8s.io",
Version: "v1beta1",
Resource: "resourceslices",
})

Suggested change
v1beta1Available, err := discovery.IsResourceEnabled(discoveryClient, schema.GroupVersionResource{
Group: "resource.k8s.io",
Version: "v1beta1",
Resource: "resourceslices",
})
v1beta1Available, err := discovery.IsResourceEnabled(discoveryClient, resourcev1beta1.SchemeGroupVersion.WithResource("resourceslices"))

var adapter resourceSliceAdapter
switch obj := item.(type) {
case *resourcev1.ResourceSlice:
if obj.Spec.Driver != DRAGPUDriverName {

Can you move the comment on the DRAGPUDriverName here?

Comment thread internal/pkg/transformation/dra.go Outdated
var v1beta1Informer cache.SharedIndexInformer

if useV1 {
v1Informer = factory.Resource().V1().ResourceSlices().Informer()

We should add an indexer to this informer by pool name so that it's easier to list in GetDeviceInfo:

informer.AddIndexers(cache.Indexers{
    "<index>": func(obj interface{}) ([]string, error) {
      ...
    },
})

We can then list ResourceSlices like:

resourceSlices, err := informer.GetIndexer().ByIndex("<index>", "<poolName>")

Contributor

How about adding a pool+deviceName indexer? This only gives the slices with the desired devices. Or we maintain a map of pool/deviceName -> info (uuid, type, parentUUID, profile) like I was originally doing, but built and rebuilt directly from the informer cache based on events. This eliminates repeated scanning.

Contributor

@guptaNswati guptaNswati Mar 23, 2026

This stays. Can you look into indexing based on the above suggestion for fast lookups?

return ""
}

func NewDRAResourceSliceManager() (*DRAResourceSliceManager, error) {

@adityasingh0510 What would it take to move the whole DRAResourceSliceManager into a separate package internal/pkg/transformation/dra as two separate implementations?

  1. internal/pkg/transformation/dra/v1/ -> impl for the v1 API
  2. internal/pkg/transformation/dra/v1beta1/ -> impl for the v1beta1 API
  3. internal/pkg/transformation/dra/dra.go -> initialize the appropriate versioned manager depending on the available version

Let's also make it a generic DRAResourceManager where ResourceSlice is just a part of it.

Author

@varunrsekar Good idea in principle, but I think it’s more refactor than we need right now. The current DRAResourceSliceManager already hides the version and GPU/MIG details behind GetDeviceInfo / GetDynamicResourceInfo, and splitting into versioned packages would add quite a bit of plumbing and duplication for limited benefit today.
I’d prefer to keep this as-is and revisit a dedicated dra package if our DRA usage grows.

Contributor

I agree with @adityasingh0510. It's a limited API that does not need a separate package, and most of the k8s podresources logic of dcgm-exporter lives in pkg/transformation.


The spirit of my comment was to isolate each versioned implementation so that it's easier to read. Let's find a middle ground to achieve this.

For example:

func (m *DRAResourceSliceManager) GetDeviceInfo(pool, device string) (string, *DRAMigDeviceInfo) {
...
}

this can be made into:

func (m *DRAResourceSliceManager) GetDeviceInfo(pool, device string) (string, *DRAMigDeviceInfo) {
    if useV1 {
        return m.GetV1DeviceInfo(pool, device)
    }
    return m.GetV1Beta1DeviceInfo(pool, device)
}
Comment thread internal/pkg/transformation/dra.go Outdated
// local caches and ensures we always have the latest state from the API server.
// For MIG devices: returns (parentUUID, *DRAMigDeviceInfo)
// For full GPUs: returns (deviceUUID, nil)
func (m *DRAResourceSliceManager) GetDeviceInfo(pool, device string) (string, *DRAMigDeviceInfo) {

There are nested levels of conditionals in here that are a little difficult to follow:

  1. Is api v1 or v1beta1
  2. Is device "gpu" or "mig"

We should also make this function easier to consume by incorporating the versioned-manager:

func (m *DRAResourceManager) GetDynamicResourceInfo(resource *podresourcesapi.DynamicResource) *DynamicResourceInfo

Author

sure

@varunrsekar

@adityasingh0510 Thanks for your patience! Please check if we can refactor it further per comments...

@adityasingh0510
Author

Thanks a lot for the detailed feedback, @varunrsekar!

go test ./internal/pkg/transformation/... is passing. Please take another look and let me know if you'd like any further tweaks.

Comment thread internal/pkg/transformation/dra.go Outdated
}

// Select a single API version to watch.
apiVersion := "v1beta1"
Contributor

Hmm, we also need to make sure that this same API version is used by nvidia-dra-driver, as it gives users a Helm option to select the version:

https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/deployments/helm/nvidia-dra-driver-gpu/values.yaml#L49

Author

Yes. Added a --kubernetes-dra-resource-api-version CLI flag that allows users to configure the API version to match the Helm chart's resourceApiVersion setting.

Comment thread internal/pkg/transformation/dra.go Outdated
// Select a single API version to watch.
apiVersion := "v1beta1"
useV1 := false
if v1Available {
Contributor

Use a switch here and move it to a helper; it will be easier to read.

switch available {
case "v1":
    // create v1 informer
case "v1beta1":
    // create v1beta1 informer
default:
   // error
}

Comment thread internal/pkg/transformation/dra.go Outdated

v1Available, err := discovery.IsResourceEnabled(discoveryClient, resourcev1.SchemeGroupVersion.WithResource("resourceslices"))
if err != nil {
return nil, fmt.Errorf("error checking v1 ResourceSlice API availability: %w", err)
Contributor

@guptaNswati guptaNswati Feb 27, 2026

We should not return an error here, rather warn. Or maybe just use a list/probe:

  • LIST resource.k8s.io/v1 ResourceSlices
  • filter gpu.nvidia.com
  • if available, pick it
  • else repeat for v1beta1

Create a helper and use it for both.

something like:

hasNvidiaDRASlice := func(apiversion object) (bool, error) {
        list, err := clientset.apiversion.ResourceSlices().List(ctx, metav1.ListOptions{})
        if err != nil {
            return false, err
        }
        for _, s := range list.Items {
            if s.Spec.Driver == DRAGPUDriverName {
                return true, nil
            }
        }
        return false, nil
    }

this goes back to https://github.com/NVIDIA/dcgm-exporter/pull/596/changes#r2865748563

// MIG device
devices = append(devices, resourcev1.Device{
Name: "gpu-x",
Attributes: map[resourcev1.QualifiedName]resourcev1.DeviceAttribute{
Contributor

Please add tests for version selection:

  • if v1 served but returns 0 GPU slices, v1beta1 returns GPU slices → choose v1beta1
  • both served and both have objects → prefer v1

@adityasingh0510 adityasingh0510 force-pushed the feature/k8s-v1-resource-api-support branch from cf32e45 to bf1cabc on February 28, 2026 09:54
@adityasingh0510
Author

Hi @guptaNswati, I've addressed all the review feedback.
All tests are passing. Ready for another review when convenient. Thanks!

Below are the test logs:

root@adtiya-ai-platform:/home/ubuntu/dcgm-exporter# go test -count=1 ./internal/pkg/transformation -run "TestVersionSelection|TestGetDeviceInfo|TestPodDRAInfo"
ok github.com/NVIDIA/dcgm-exporter/internal/pkg/transformation 0.057s

Comment thread internal/pkg/transformation/dra.go Outdated

// Neither version has NVIDIA DRA slices, but we'll still return a default
// in case slices are created later. Prefer v1beta1 as fallback.
if v1Err == nil || v1beta1Err == nil {
Contributor

We should not fall back to v1beta1 by default; just log and return nil in case no NVIDIA DRA slice is found.

Comment thread internal/pkg/transformation/dra.go Outdated
return "", fmt.Errorf("invalid DRA ResourceSlice API version: %s (must be 'v1' or 'v1beta1')", preferredVersion)
}

hasSlices, err := hasNvidiaDRASlice(ctx, client, useV1)
Contributor

Hmm, this is hard to read. Please simplify with a switch statement. The goal is to check whether the API version is available along with an NVIDIA DRA slice, prefer v1, and if no NVIDIA slice is available, log and return nil.

@adityasingh0510
Author

Hi @guptaNswati, I pushed updates addressing the review notes. Could you please take another look?

@varunrsekar varunrsekar left a comment

Sorry for the late comments. It looks like the implementation has regressed from some of the earlier comments. Can you look into it?

And once again, appreciate your perseverance on this.

Comment thread internal/pkg/transformation/dra.go Outdated
// by listing ResourceSlices for the given API version.
func hasNvidiaDRASlice(ctx context.Context, client kubernetes.Interface, useV1 bool) (bool, error) {
if useV1 {
list, err := client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})

nit: use non-ambiguous variable names: resourceSlicesList

Comment thread internal/pkg/transformation/dra.go Outdated
}
return false, nil
} else {
list, err := client.ResourceV1beta1().ResourceSlices().List(ctx, metav1.ListOptions{})

nit: use non-ambiguous variable names: resourceSlicesList

return nil, fmt.Errorf("error getting kube client: %w", err)
}

ctx := context.Background()

nit: this isn't used anywhere. You can remove this line.

Comment thread internal/pkg/transformation/dra.go Outdated
Comment on lines +277 to +279
// Prefer v1 if it has NVIDIA GPU slices for this pool; otherwise fall back to v1beta1.
var v1Items, v1beta1Items []interface{}
var v1Err, v1beta1Err error

During NewResourceSliceManager initialization, we would have already determined which API version to use. Here we should process objects based on that

Comment thread internal/pkg/transformation/dra.go Outdated
Comment on lines +172 to +174
// Create informers for both API versions. We will prefer v1 at lookup time
// if it has NVIDIA DRA slices, otherwise fall back to v1beta1.
v1Informer := factory.Resource().V1().ResourceSlices().Informer()

We seem to have regressed on the implementation here. See:
#596 (comment)
#596 (comment)

Please reapply the changes to address those comments.

Also, I'd urge you to test this with:

  1. 1.33 cluster with no DRA
  2. 1.33 cluster with v1beta1 DRA
  3. 1.34 cluster with v1 DRA

Author

Hi @varunrsekar, thanks for pointing this out.

I’ve verified the behavior on a 1.34 cluster with v1 DRA and it seems to be working as expected. For the 1.33 scenarios (with no DRA and with v1beta1 DRA), I currently don’t have access to a suitable Kubernetes setup to validate these cases.

Also attaching the reference screenshot for the 1.34 validation.

[screenshot attached]

return ""
}

func NewDRAResourceSliceManager() (*DRAResourceSliceManager, error) {

The spirit of my comment was to isolate each versioned implementation so that it's easier to read. Let's find a middle ground to achieve this.

For eg:

func (m *DRAResourceSliceManager) GetDeviceInfo(pool, device string) (string, *DRAMigDeviceInfo) {
...
}

this can be made into:

func (m *DRAResourceSliceManager) GetDeviceInfo(pool, device string) (string, *DRAMigDeviceInfo) {
    if useV1 {
        return m.GetV1DeviceInfo(pool, device)
    } else {
        return m.GetV1Beta1DeviceInfo(pool, device)
    }
}

- type: markdown
attributes:
value: |
value:
Contributor

this came from a rebase?

Author

Hi @guptaNswati, this came in during the rebase. Sorry about that, I’ll clean it up.

@adityasingh0510 adityasingh0510 force-pushed the feature/k8s-v1-resource-api-support branch from 3b4adad to 93c2e4c on April 12, 2026 15:33
Comment thread internal/pkg/transformation/dra.go Outdated
return ""
}

// hasNvidiaDRASlice checks if there are any ResourceSlices with NVIDIA DRA driver
Contributor

I don't think we need this; it's redundant to countGPUSlices.

Author

Right, I'll remove that!

Comment thread internal/pkg/transformation/dra.go Outdated

for _, r := range resources.APIResources {
	// Be lenient: match both "resourceslices" and any subresource variants.
	if strings.HasPrefix(r.Name, "resourceslices") {
Contributor

We just need the resourceslices and not subresources. This is too broad; probably use r.Name == "resourceslices" instead.
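The exact-match check the reviewer asks for can be sketched as follows; hasResourceSlices and the literal name lists are illustrative, while the real code iterates resources.APIResources from the discovery client.

```go
package main

import "fmt"

// hasResourceSlices reports whether the discovered resource names contain
// exactly "resourceslices". An exact comparison deliberately excludes
// subresource entries such as "resourceslices/status", which a
// strings.HasPrefix match would also accept.
func hasResourceSlices(names []string) bool {
	for _, n := range names {
		if n == "resourceslices" {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(hasResourceSlices([]string{"resourceclaims", "resourceslices"}))
	fmt.Println(hasResourceSlices([]string{"resourceslices/status"}))
}
```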

}
}

// testInformerForDRA is a simple test implementation of SharedIndexInformer for DRA tests
Contributor

This is defined in kubernetes_test.go also; better to have a shared helper.

}
}
if len(devices) > 0 {
slice := &resourcev1.ResourceSlice{
Contributor

Please add a similar test for v1beta1 too, where preferredAPIVersion: "v1beta1", and check the attributes for v1beta1.

Comment thread internal/pkg/transformation/dra.go Outdated
case "v1beta1":
	return m.getV1beta1DeviceInfo(pool, device)
default:
	// Be defensive if the manager is constructed manually in tests.
Contributor

We don't need this default fallback if the manager correctly chooses the API version.


+1. Fallback should be an error

@guptaNswati
Contributor

Thanks for the updates @adityasingh0510 — this is much closer now.

I still think the API version selection in NewDRAResourceSliceManager() needs one more change.

Right now we first select the version based on which group-version is served:

switch {
case supportsResourceSliceGV(client, resourceGVV1):
    selected = "v1"
case supportsResourceSliceGV(client, resourceGVV1beta1):
    selected = "v1beta1"
default:
    ...
}

and only after that we list ResourceSlices for the selected version and return nil if there are no NVIDIA slices.

This means the following case still behaves incorrectly:

resource.k8s.io/v1 is served
resource.k8s.io/v1beta1 is also served
v1 has 0 NVIDIA DRA ResourceSlices
v1beta1 has NVIDIA DRA ResourceSlices

In that case we currently pick v1, see 0 slices, and return nil, instead of selecting v1beta1.

I think the selection should be:

if v1 is served and has NVIDIA DRA slices -> use v1
else if v1beta1 is served and has NVIDIA DRA slices -> use v1beta1
else log and return nil

Something along these lines would make the behavior explicit:

const (
    resourceGVV1      = "resource.k8s.io/v1"
    resourceGVV1beta1 = "resource.k8s.io/v1beta1"
)

v1Served := supportsResourceSliceGV(client, resourceGVV1)
v1beta1Served := supportsResourceSliceGV(client, resourceGVV1beta1)

v1HasNvidiaSlices := false
if v1Served {
    resourceSlicesList, err := client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
    if err != nil {
        return nil, fmt.Errorf("failed to list v1 ResourceSlices: %w", err)
    }
    items := make([]interface{}, 0, len(resourceSlicesList.Items))
    for i := range resourceSlicesList.Items {
        items = append(items, &resourceSlicesList.Items[i])
    }
    v1HasNvidiaSlices = countGPUSlices(items) > 0
}

v1beta1HasNvidiaSlices := false
if v1beta1Served {
    resourceSlicesList, err := client.ResourceV1beta1().ResourceSlices().List(ctx, metav1.ListOptions{})
    if err != nil {
        return nil, fmt.Errorf("failed to list v1beta1 ResourceSlices: %w", err)
    }
    items := make([]interface{}, 0, len(resourceSlicesList.Items))
    for i := range resourceSlicesList.Items {
        items = append(items, &resourceSlicesList.Items[i])
    }
    v1beta1HasNvidiaSlices = countGPUSlices(items) > 0
}

switch {
case v1HasNvidiaSlices:
    selected = "v1"
case v1beta1HasNvidiaSlices:
    selected = "v1beta1"
default:
    slog.Warn("No NVIDIA DRA ResourceSlices found; DRA labels will not be available")
    return nil, nil
}

This also matches the intended precedence:

if both versions are served and both have NVIDIA slices -> prefer v1
if only one has NVIDIA slices -> use that one

Comment thread internal/pkg/transformation/dra.go Outdated
Comment on lines +215 to +216
v1Informer: v1Informer,
v1beta1Informer: v1beta1Informer,

You don't need both; this can just be a single informer that's initialized in the switch-case above.

Comment on lines +224 to 234
// Wait for cache sync on the selected informer.
var synced bool
if m.v1Informer != nil {
	synced = cache.WaitForCacheSync(ctx.Done(), m.v1Informer.HasSynced)
} else {
	synced = cache.WaitForCacheSync(ctx.Done(), m.v1beta1Informer.HasSynced)
}
if !synced {
	cancel()
	return nil, fmt.Errorf("ResourceSlice informer cache sync failed")
}

With just 1 informer, this can be simplified
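The simplification suggested here can be sketched with stand-in types; syncedInformer and waitForSync are illustrative substitutes for client-go's SharedIndexInformer and cache.WaitForCacheSync, not the real API.

```go
package main

import (
	"errors"
	"fmt"
)

// syncedInformer stands in for the single informer the manager keeps after
// version selection; only HasSynced matters for this sketch.
type syncedInformer struct{ synced bool }

func (i *syncedInformer) HasSynced() bool { return i.synced }

// waitForSync mirrors the shape of the cache-sync wait for one informer:
// once the manager stores exactly one informer, there is no v1/v1beta1
// branching left in this block.
func waitForSync(inf *syncedInformer) error {
	if !inf.HasSynced() {
		return errors.New("ResourceSlice informer cache sync failed")
	}
	return nil
}

func main() {
	selected := &syncedInformer{synced: true}
	fmt.Println(waitForSync(selected))
}
```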

Comment thread internal/pkg/transformation/dra.go Outdated
}

// Only keep the manager if the selected API has NVIDIA DRA slices.
useV1 := selected == "v1"

Can this be defined in a function against the DRAResourceSliceManager?
Usage would be something like: if m.UseDRAV1APIs() {

Author

Thanks for the suggestion. We’ve already simplified the constructor to use a single selected informer and removed the earlier selected == "v1" branching there. At this point the remaining preferredAPIVersion checks are minimal, so I’d prefer to keep it as-is to avoid extra churn.
If you feel strongly, I’m happy to add a small helper like IsV1()/UseDRAV1API() in a follow-up.

Comment thread internal/pkg/transformation/dra.go Outdated

slog.Info(fmt.Sprintf("No UUID found for %s", key))
return "", nil
return m.getDeviceInfoFromItems(pool, device, items)

I'm confused here. We already have the context of v1 APIs. So why are we calling into a helper getDeviceInfoFromItems that's again trying to determine what API version to use?

Author

Good point, the helper isn’t re-selecting an API version; the version is already determined by which informer indexer we query (v1Informer vs v1beta1Informer). The helper only parses the items returned by that informer (and handles v1/v1beta1 object types). I renamed it to getDeviceInfoFromResourceSliceItems and added a comment to make that intent explicit.
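The parsing-only role described above can be sketched with simplified stand-in structs; in the real resource.k8s.io API, v1beta1 nests attributes under dev.Basic while v1 exposes them directly on the device, which is the structural difference this PR handles.

```go
package main

import "fmt"

// v1Device mimics the v1 shape: attributes live directly on the device.
type v1Device struct {
	Name       string
	Attributes map[string]string
}

// v1beta1Basic and v1beta1Device mimic the v1beta1 shape, where the
// attributes sit behind a Basic wrapper.
type v1beta1Basic struct {
	Attributes map[string]string
}

type v1beta1Device struct {
	Name  string
	Basic *v1beta1Basic
}

// attributesOf parses an item already returned by the selected informer.
// It does not choose an API version; it only handles both object shapes.
func attributesOf(item interface{}) (map[string]string, error) {
	switch d := item.(type) {
	case *v1Device:
		return d.Attributes, nil
	case *v1beta1Device:
		if d.Basic == nil {
			return nil, fmt.Errorf("v1beta1 device %s has no Basic block", d.Name)
		}
		return d.Basic.Attributes, nil
	default:
		return nil, fmt.Errorf("unsupported device type %T", item)
	}
}

func main() {
	attrs, _ := attributesOf(&v1Device{Name: "gpu-0", Attributes: map[string]string{"uuid": "GPU-123"}})
	fmt.Println(attrs["uuid"])
}
```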

Comment thread internal/pkg/transformation/dra.go Outdated
case "v1beta1":
return m.getV1beta1DeviceInfo(pool, device)
default:
// Be defensive if the manager is constructed manually in tests.

+1. Fallback should be an error

Comment thread internal/pkg/transformation/dra.go Outdated
for i := range resourceSlicesList.Items {
	items = append(items, &resourceSlicesList.Items[i])
}
gpuSliceCount = countGPUSlices(items)

If we have already determined to use V1 APIs, I don't see why we need to call into a helper that again does that determination.

Author

@varunrsekar We’re not re-determining the API version there; the version is already decided by whether we list via client.ResourceV1() vs client.ResourceV1beta1(), i.e. which informer we start. countGPUSlices is just shared logic to check “does this list contain NVIDIA GPU DRA slices (driver match + devices)?” for either object type, to avoid duplicating the same checks in two branches. If you prefer, I can inline it, but it would be identical code duplicated for v1 and v1beta1.

@adityasingh0510
Author

Hi @varunrsekar @guptaNswati, thanks for the thorough review. I’ve pushed updated code, PTAL.

@guptaNswati
Contributor

guptaNswati commented Apr 21, 2026

Looks much better. Thank you for addressing the comments. There are still a few nits that need to be fixed.

Also, can you please share test logs from the latest changes in both Full GPU and MIG mode with:

1.32 (enable PodResourcesDRA FG) or 1.33 for v1beta1
1.34+ for v1

@guptaNswati
Contributor

Given the amount of churn in this PR, it has become difficult to review end to end. Could you please redo this as a fresh, minimal PR from current main with only the final intended changes and tests?

I had to go through the flow and found a regression in how multiple devices/claims are getting mapped. toDeviceToPodsDRA() now calls GetDynamicResourceInfo() once per DynamicResource, and GetDynamicResourceInfo() returns on the first matching claim. That means if a single DynamicResource contains multiple devices, only the first one gets mapped and the rest are dropped.

Also, please acknowledge the use of AI in addressing the reviews, or otherwise.
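The device-mapping regression described above can be sketched with plain stand-in types (dynamicResource and deviceToPods are illustrative, not the real toDeviceToPodsDRA/GetDynamicResourceInfo signatures): the fix is to iterate every device of every resource rather than returning on the first match.

```go
package main

import "fmt"

// dynamicResource is a simplified stand-in for a pod's DynamicResource: one
// resource claim can reference several devices.
type dynamicResource struct {
	ClaimName string
	Devices   []string
}

// deviceToPods builds the device-to-pod mapping by walking every device of
// every DynamicResource. Returning after the first matching claim, as the
// review points out, would drop all but the first device.
func deviceToPods(pod string, resources []dynamicResource) map[string]string {
	m := map[string]string{}
	for _, r := range resources {
		for _, d := range r.Devices {
			m[d] = pod // every device of the claim maps to the pod
		}
	}
	return m
}

func main() {
	res := []dynamicResource{{ClaimName: "claim-a", Devices: []string{"gpu-0", "gpu-1"}}}
	fmt.Println(len(deviceToPods("pod-x", res)))
}
```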

@guptaNswati
Contributor

guptaNswati commented Apr 28, 2026

Hey @adityasingh0510, checking in on this! Do you think you’ll have a chance to address the feedback soon in a new PR? We’re hoping to get this into the next release. If you’re tied up, let me know; I’m happy to jump in and do it.

@adityasingh0510
Author

Sorry @guptaNswati, I missed that. I’ve created a minimal PR with only the final intended changes + tests:
#654

@guptaNswati
Contributor

@adityasingh0510 thank you. I will take a look. Please close this in favor of the new PR.


Successfully merging this pull request may close these issues.

Add support for K8s v1.34 resource.k8s.io/v1 DRA APIs