feat: Support K8s DRA Resources V1 APIs #654

Open
adityasingh0510 wants to merge 2 commits into NVIDIA:main from adityasingh0510:feature/k8s-v1-resource-apis-support

Conversation

@adityasingh0510 commented Apr 29, 2026

This PR updates dcgm-exporter to support both the stable resource.k8s.io/v1 API and the v1beta1 API for Dynamic Resource Allocation (DRA). This ensures compatibility with both Kubernetes 1.34+ clusters (using v1) and older clusters (using v1beta1), with automatic detection and graceful fallback.

Problem

When enabling DRA labels in dcgm-exporter on Kubernetes 1.34+ clusters, the following error occurs:

failed to list v1beta1.ResourceSlice as we have v1.ResourceSlice

This happens because:

  • Kubernetes 1.34+ promotes the ResourceSlice API from v1beta1 to stable v1
  • Clusters may only expose the v1 API, breaking code that only uses v1beta1
  • Older clusters (1.27-1.33) still use v1beta1, so we need to support both

Changes

Files Modified

  • internal/pkg/transformation/dra.go:

    • Register both v1 and v1beta1 ResourceSlice informers
    • Implement separate event handlers for each API version:
      • onAddOrUpdateV1() / onAddOrUpdateV1beta1()
      • onDeleteV1() / onDeleteV1beta1()
    • Add cache checking in delete handlers to prevent premature device removal
    • Handle API structure differences:
      • v1beta1: dev.Basic.Attributes
      • v1: dev.Attributes (direct access, no Basic wrapper)
  • internal/pkg/transformation/types.go:

    • Add v1Informer and v1beta1Informer fields to DRAResourceSliceManager struct
  • go.mod / go.sum:

    • Upgrade k8s.io/api: v0.33.3 → v0.34.0 (adds support for resource/v1)
    • Upgrade k8s.io/client-go: v0.33.3 → v0.34.0 (ensures compatibility)
    • Upgrade k8s.io/apimachinery: v0.33.3 → v0.34.0

API Structure Changes

The v1 API has a different structure than v1beta1:

  • v1beta1: dev.Basic.Attributes
  • v1: dev.Attributes (direct)

The implementation handles both structures correctly.
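
To make the structural difference concrete, here is a minimal, self-contained sketch using simplified stand-in types (not the real k8s.io/api structs; names like `attrsV1` are illustrative) that shows how an accessor can normalize the two shapes:

```go
package main

import "fmt"

// basicDevice mirrors the v1beta1 "Basic" wrapper around device attributes.
type basicDevice struct {
	Attributes map[string]string
}

// v1beta1Device wraps its attributes in a Basic struct.
type v1beta1Device struct {
	Name  string
	Basic *basicDevice
}

// v1Device exposes attributes directly, with no Basic wrapper.
type v1Device struct {
	Name       string
	Attributes map[string]string
}

// attrsV1beta1 guards against a nil Basic wrapper before dereferencing.
func attrsV1beta1(d v1beta1Device) map[string]string {
	if d.Basic == nil {
		return nil
	}
	return d.Basic.Attributes
}

// attrsV1 reads the attributes directly.
func attrsV1(d v1Device) map[string]string {
	return d.Attributes
}

func main() {
	beta := v1beta1Device{Name: "gpu-0", Basic: &basicDevice{Attributes: map[string]string{"uuid": "GPU-abc"}}}
	stable := v1Device{Name: "gpu-0", Attributes: map[string]string{"uuid": "GPU-abc"}}
	fmt.Println(attrsV1beta1(beta)["uuid"]) // GPU-abc
	fmt.Println(attrsV1(stable)["uuid"])    // GPU-abc
}
```

The nil check on `Basic` matters in the v1beta1 path: the wrapper is a pointer, so a slice entry without it would otherwise panic on dereference.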

Behavior

Automatic API Detection

The code registers both informers and uses whichever is available:

// Both informers are registered
v1Informer := factory.Resource().V1().ResourceSlices().Informer()
v1beta1Informer := factory.Resource().V1beta1().ResourceSlices().Informer()

// At least one must sync successfully
v1Synced := cache.WaitForCacheSync(ctx.Done(), v1Informer.HasSynced)
v1beta1Synced := cache.WaitForCacheSync(ctx.Done(), v1beta1Informer.HasSynced)

Precedence Logic

When both APIs are available:

  • v1 takes precedence: v1beta1 only adds devices if v1 doesn't already have them
  • Delete protection: Before deleting, handlers check if the device exists in the other API's cache
  • No duplicate entries: Precedence logic ensures each device is only tracked once

Testing

Verification

  • Code compiles successfully with both API versions
  • All tests pass; existing unit tests continue to work
  • No linter errors
  • v1 API support verified with the Kubernetes 1.34+ API structure
  • v1beta1 API support verified with the Kubernetes 1.27-1.33 API structure
  • Dual API handling: both informers work correctly when both are available
  • Precedence logic: v1 correctly takes precedence over v1beta1
  • Delete handling: race conditions prevented with cache checking

Test Scenarios

  • Kubernetes 1.34+ clusters (v1 API only)
  • Kubernetes 1.27-1.33 clusters (v1beta1 API only)
  • Clusters with both APIs available (migration periods)
  • MIG devices work with both API versions

Backward Compatibility

Fully backward compatible:

  • Existing deployments on Kubernetes 1.27-1.33 continue to work unchanged
  • No breaking changes for any supported Kubernetes version
  • No configuration changes required

Forward compatible:

  • Ready for Kubernetes 1.34+ clusters
  • Automatically uses the best available API version

Breaking Changes

None - This is a backward and forward compatibility enhancement. The change:

  • Works on older clusters (1.27-1.33) using v1beta1
  • Works on newer clusters (1.34+) using v1
  • Works during migration periods when both are available
  • Requires no configuration changes

Related Issues

@adityasingh0510 force-pushed the feature/k8s-v1-resource-apis-support branch from c7c2391 to 40b10da on April 29, 2026 05:18
@adityasingh0510
Author

Hi @guptaNswati, sharing full GPU mode test logs from the latest dcgm-exporter changes on Kubernetes v1.34+ (ResourceSlice v1); see below.
We currently don’t have access to MIG-capable hardware in this environment (current GPU: RTX 5090, no MIG support), so I’m unable to provide MIG-mode logs right now.

k logs nvidia-dcgm-exporter-7tfg5
Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
time=2026-04-29T10:00:22.892Z level=INFO msg="Starting dcgm-exporter" Version=4.5.2-4.8.1
time=2026-04-29T10:00:22.901Z level=INFO msg="Attempting to initialize DCGM."
time=2026-04-29T10:00:22.957Z level=INFO msg="Initialized DCGM Fields module."
time=2026-04-29T10:00:22.957Z level=INFO msg="Attempting to initialize NVML library."
time=2026-04-29T10:00:22.957Z level=INFO msg="NVML provider successfully initialized for Kubernetes MIG support"
time=2026-04-29T10:00:22.957Z level=INFO msg="DCGM successfully initialized!"
time=2026-04-29T10:00:22.989Z level=INFO msg="Successfully queried DCGM profiling metric groups" reload_id=0 count=2 gpu_model="NVIDIA GeForce RTX 5090"
time=2026-04-29T10:00:22.989Z level=INFO msg="Building registry for current GPU topology"
time=2026-04-29T10:00:22.989Z level=INFO msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics-included.csv'"
time=2026-04-29T10:00:22.989Z level=INFO msg="Initializing system entities of type 'GPU'"
time=2026-04-29T10:00:23.031Z level=INFO msg="Initializing system entities of type 'NvSwitch'"
time=2026-04-29T10:00:23.031Z level=INFO msg="Not collecting NvSwitch metrics; no switches to monitor"
time=2026-04-29T10:00:23.031Z level=INFO msg="Initializing system entities of type 'NvLink'"
time=2026-04-29T10:00:23.031Z level=WARN msg="Failed to initialize NvSwitch/NvLink info" error="no switches to monitor"
time=2026-04-29T10:00:23.046Z level=INFO msg="Initializing system entities of type 'CPU'"
time=2026-04-29T10:00:23.125Z level=INFO msg="Not collecting CPU metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2026-04-29T10:00:23.125Z level=INFO msg="Initializing system entities of type 'CPU Core'"
time=2026-04-29T10:00:23.125Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2026-04-29T10:00:23.171Z level=INFO msg="Registry built successfully" collector_count=2
time=2026-04-29T10:00:23.171Z level=INFO msg="Kubernetes metrics collection enabled!"
time=2026-04-29T10:00:23.172Z level=INFO msg="Initializing Pod Informer" nodeName=stg-nc-partner1-wkld1
I0429 10:00:23.186172       1 warnings.go:110] "Warning: resource.k8s.io/v1beta1 ResourceSlice is deprecated in v1.35+, unavailable in v1.38+"
time=2026-04-29T10:00:23.286Z level=INFO msg="ResourceSlice API informer synced successfully" apiVersion=v1
time=2026-04-29T10:00:23.286Z level=INFO msg="Started DRAResourceSliceManager with auto-detected API version"
time=2026-04-29T10:00:23.287Z level=INFO msg="Profiling endpoints enabled at /debug/pprof/"
time=2026-04-29T10:00:23.287Z level=INFO msg="HTTP server started - ready to serve metrics"
time=2026-04-29T10:00:23.287Z level=INFO msg="Watching for changes in file" file=/etc/dcgm-exporter/dcp-metrics-included.csv debounce=200ms
time=2026-04-29T10:00:23.287Z level=INFO msg="Starting webserver"
time=2026-04-29T10:00:23.291Z level=INFO msg="Listening on" address=[::]:9400
time=2026-04-29T10:00:23.291Z level=INFO msg="TLS is disabled." http2=false address=[::]:9400
time=2026-04-29T10:00:23.388Z level=INFO msg="Pod informer cache synced"

@adityasingh0510 changed the title from "feat: Add dual API support for ResourceSlice (v1 and v1beta1)" to "feat: Support K8s DRA Resources V1 APIs" on Apr 29, 2026
Comment thread internal/pkg/transformation/types.go Outdated
deviceToUUID map[string]string // pool/device -> UUID (for full GPUs)
migDevices map[string]*DRAMigDeviceInfo // pool/device -> MIG info (for MIG devices)
factory informers.SharedInformerFactory
v1Informer cache.SharedIndexInformer
Contributor

Don't need both informers here; one can be used:

Suggested change
v1Informer cache.SharedIndexInformer
informer cache.SharedIndexInformer

Comment thread internal/pkg/transformation/types.go Outdated
// - "v1beta1" if v1 does not, but v1beta1 does
preferredAPIVersion string
cancelContext context.CancelFunc
mu sync.RWMutex
Contributor

Same; not needed anymore.

Comment thread internal/pkg/transformation/dra.go Outdated
// For MIG devices: returns (parentUUID, *DRAMigDeviceInfo)
// For full GPUs: returns (deviceUUID, nil)
func (m *DRAResourceSliceManager) GetDeviceInfo(pool, device string) (string, *DRAMigDeviceInfo) {
m.mu.RLock()
Contributor

No need for this lock.

Comment thread internal/pkg/transformation/dra.go Outdated
// Search for the device in the selected slices
for _, item := range items {
var adapter resourceSliceAdapter
switch obj := item.(type) {
Contributor

This still does a per-item type-switch on v1.ResourceSlice / v1beta1.ResourceSlice even though getV1DeviceInfo / getV1beta1DeviceInfo already know which version they're handling. @varunrsekar called this out in #596 (comment) earlier.

getV1DeviceInfo and getV1beta1DeviceInfo can each inline the lookup against their own typed slice.

Then another helper that does the gpu / mig / default switch.
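
One possible shape of that refactor, sketched with simplified stand-in types (the real functions operate on typed `*resourcev1.ResourceSlice` items from the informer's indexer; `v1Slice`, `getV1DeviceInfo`, and the driver-name constant here are illustrative, and the gpu/mig/default switch the reviewer mentions is omitted):

```go
package main

import "fmt"

// v1Slice is a stand-in for a typed v1 ResourceSlice: a driver name plus
// a simplified device-name -> UUID mapping.
type v1Slice struct {
	Driver  string
	Devices map[string]string
}

// draGPUDriverName is an assumed driver-name constant for illustration.
const draGPUDriverName = "gpu.nvidia.com"

// getV1DeviceInfo inlines the lookup against its own typed slices, so no
// per-item type switch is needed; a v1beta1 twin would do the same against
// its typed slice.
func getV1DeviceInfo(slices []v1Slice, device string) (string, bool) {
	for _, s := range slices {
		if s.Driver != draGPUDriverName {
			continue // skip slices from other DRA drivers
		}
		if uuid, ok := s.Devices[device]; ok {
			return uuid, true
		}
	}
	return "", false
}

func main() {
	slices := []v1Slice{{Driver: draGPUDriverName, Devices: map[string]string{"gpu0": "GPU-1"}}}
	uuid, ok := getV1DeviceInfo(slices, "gpu0")
	fmt.Println(uuid, ok) // GPU-1 true
}
```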

Comment thread internal/pkg/transformation/dra_test.go Outdated
uuid, migInfo := m.GetDeviceInfo("gpu-pool", "gpu0")
assert.Empty(t, uuid, "expected no UUID when preferred version is invalid")
assert.Nil(t, migInfo, "expected no MIG info when preferred version is invalid")
} No newline at end of file
Contributor

nit: missing newline at end of file

Comment thread internal/pkg/transformation/dra.go Outdated
return "", nil
}
return mappings[0].MappingKey, mappings[0].Info
} No newline at end of file
Contributor

nit: missing newline at end of file

Comment thread internal/pkg/transformation/dra.go Outdated
for i := range resourceSlicesList.Items {
items = append(items, &resourceSlicesList.Items[i])
}
v1beta1HasNvidiaSlices = countGPUSlices(items) > 0
Contributor
Can this be simplified a bit? These two list+count blocks are almost identical. Have a helper that returns a bool on the first match. Something like:

// hasNvidiaDRASlices reports whether the cluster currently exposes any
// NVIDIA GPU DRA ResourceSlices on the given API version. 
func hasNvidiaDRASlices(ctx context.Context, client kubernetes.Interface, apiVersion string) (bool, error) {
    switch apiVersion {
    case "v1":
        list, err := client.ResourceV1().ResourceSlices().List(ctx, metav1.ListOptions{})
        if err != nil {
            return false, fmt.Errorf("listing v1 ResourceSlices: %w", err)
        }
        for i := range list.Items {
            s := &list.Items[i]
            if s.Spec.Driver == DRAGPUDriverName && len(s.Spec.Devices) > 0 {
                return true, nil
            }
        }
        return false, nil
    case "v1beta1":
        list, err := client.ResourceV1beta1().ResourceSlices().List(ctx, metav1.ListOptions{})
        if err != nil {
            return false, fmt.Errorf("listing v1beta1 ResourceSlices: %w", err)
        }
        for i := range list.Items {
            s := &list.Items[i]
            if s.Spec.Driver == DRAGPUDriverName && len(s.Spec.Devices) > 0 {
                return true, nil
            }
        }
        return false, nil
    default:
        return false, fmt.Errorf("unsupported ResourceSlice API version: %q", apiVersion)
    }
}

Then replace the above blocks with calls to the helper; the countGPUSlices check is no longer needed:

v1HasNvidiaSlices := false
if v1Served {
    has, err := hasNvidiaDRASlices(ctx, client, "v1")
    if err != nil { return nil, err }
    v1HasNvidiaSlices = has
}
// same for v1beta1

Comment thread internal/pkg/transformation/dra_test.go Outdated
)

// testInformer is a simple test implementation of SharedIndexInformer
type testInformer struct {
Contributor

Why is this needed? Can use testInformerForDRA.

Comment thread go.mod
github.com/avast/retry-go/v4 v4.6.0
github.com/bits-and-blooms/bitset v1.22.0
github.com/fsnotify/fsnotify v1.7.0
github.com/containerd/cgroups/v3 v3.1.1
Contributor
are these updates needed for this PR?

Author
fsnotify and cgroups are both still required; they were already on main (watcher → fsnotify, pidmapper → cgroups). This PR doesn’t change those files. The diff is from go mod tidy after the k8s.io/* bump (line order / direct vs. indirect / versions). We need the updated go.mod for a consistent module graph with the Kubernetes upgrade.

Comment thread internal/pkg/transformation/dra.go Outdated
//
// Deprecated behavior: this returns only the first mapping. Prefer
// GetDynamicResourceMappings when a DynamicResource may contain multiple devices.
func (m *DRAResourceSliceManager) GetDynamicResourceInfo(resource *podresourcesapi.DynamicResource) (string, *DynamicResourceInfo) {
Contributor

This should be removed, but double-check whether it's called in the tests.

@guptaNswati
Contributor

@adityasingh0510 thank you for the new PR. I have some nits that need addressing, but most of the comments from the previous PR are addressed and it looks good. I will find a MIG device internally for testing.

@adityasingh0510
Author

Thanks @guptaNswati . I’ve pushed updates for the remaining nits; please let me know if anything else stands out. Appreciate you testing on a MIG setup when you have a chance.


// Wait for cache sync on the selected informer.
synced := cache.WaitForCacheSync(ctx.Done(), informer.HasSynced)
synced := cache.WaitForCacheSync(wait.NeverStop, informer.HasSynced)
Contributor
why did you change this?

Comment thread internal/pkg/transformation/dra_test.go Outdated
assert.Nil(t, migInfo, "expected no MIG info for GPU device")
}

func TestGetDeviceInfo_InvalidPreferredVersion_ReturnsEmpty(t *testing.T) {
Contributor

I see these tests are removed? Why?

selected = "v1beta1"
default:
slog.Warn("No NVIDIA DRA ResourceSlices found; DRA labels will not be available")
return nil, nil
Contributor

I think there is a potential race condition here. In the intended installation order, the dra-driver is deployed before dcgm-exporter, so we expect an NVIDIA ResourceSlice (RS) to exist for the available API. But if dcgm-exporter comes up first, the API check:

  • client.ResourceV1().ResourceSlices().List(...) succeeds with an empty list
  • hasNvidiaDRASlices returns (false, nil) for both versions
  • it lands in the default case, returning nil, nil
  • PodMapper.ResourceSliceManager stays nil for the rest of the pod's lifetime until the exporter is restarted

Maybe we should log and always start the v1 informer. @varunrsekar, thoughts?

Originally, the v1beta1 informer was always initiated: https://github.com/NVIDIA/dcgm-exporter/blob/main/internal/pkg/transformation/dra.go#L43

resources, err := client.Discovery().ServerResourcesForGroupVersion(groupVersion)
if err != nil {
// Discovery returns errors when the group/version isn't served.
slog.Debug("Discovery failed for groupVersion", "groupVersion", groupVersion, "error", err)
Contributor

This should be logged as a warning.

key := pool + "/" + device
m.mu.RLock()
defer m.mu.RUnlock()
func (m *DRAResourceSliceManager) getV1DeviceInfo(pool, device string) (string, *DRAMigDeviceInfo) {
Contributor

Can getV1DeviceInfo and getV1beta1DeviceInfo be collapsed into one function with a switch? Something like:

if m.informer == nil {
    return "", nil
}
items, err := m.informer.GetIndexer().ByIndex("poolName", pool)
if err != nil {
    slog.Error("Error listing ResourceSlices by pool index", "pool", pool, "err", err)
    return "", nil
}
for _, item := range items {
    var adapter resourceSliceAdapter
    var driver string
    switch rs := item.(type) {
    case *resourcev1.ResourceSlice:
        driver, adapter = rs.Spec.Driver, &v1ResourceSliceAdapter{slice: rs}
    case *resourcev1beta1.ResourceSlice:
        driver, adapter = rs.Spec.Driver, &v1beta1ResourceSliceAdapter{slice: rs}
    default:
        continue
    }
    if driver != DRAGPUDriverName {
        continue
    }
    if mappingKey, migInfo := lookupDRADeviceInAdapter(pool, device, adapter); mappingKey != "" {
        return mappingKey, migInfo
    }
}

@guptaNswati
Contributor

@adityasingh0510 can you please address the open comments?


Development

Successfully merging this pull request may close these issues.

Add support for K8s v1.34 resource.k8s.io/v1 DRA APIs
