@surajssd surajssd commented Feb 4, 2026

  • Add Test_DCGM_Exporter_Compatibility E2E test to detect dependency mismatches between dcgm-exporter and its DCGM dependencies before they are cached on VHDs
  • Update datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-proprietary from 1:4.4.2-1 to 1:4.5.1-1 to match dcgm-exporter 4.8.0 requirements
  • Update dcgm-exporter from 4.7.1 to 4.8.0 across Ubuntu 22.04, Ubuntu 24.04, and Azure Linux 3.0

Problem

The dcgm-exporter package has strict version dependencies on datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-proprietary. These packages are downloaded and cached separately during VHD build time. However, package caching succeeds even when the versions are incompatible; the mismatch only surfaces later, during node provisioning, when dpkg or rpm attempts to install the packages together.

This creates a problematic failure mode:

  1. VHD builds successfully with mismatched package versions cached
  2. Node provisioning fails with dependency errors like:
    dpkg: dependency problems prevent configuration of dcgm-exporter:
    dcgm-exporter depends on datacenter-gpu-manager-4-core (= 1:4.4.2-1); however:
    Version of datacenter-gpu-manager-4-core on system is 1:4.5.0-1.
  3. GPU monitoring is broken on affected nodes

A similar issue was fixed in PR #7736.

Solution

Add a new E2E test that validates dependency compatibility before packages are cached on VHDs:

  1. Download the dcgm-exporter package from PMC (without installing)
  2. Extract the package's dependency metadata (dpkg-deb -f for Ubuntu, rpm -qpR for Azure Linux)
  3. Compare the required dependency versions against what's specified in components.json
  4. Fail the test if there's a mismatch

This shifts detection from "node provisioning time" to "PR review time", preventing broken VHDs from being built and released.
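
Steps 2 to 4 for the Ubuntu case can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function names `parseDebDepends` and `findMismatches` are hypothetical, the `Depends` string is copied from the `dpkg-deb -f ... Depends` output quoted in this thread, and the expected versions stand in for whatever is read from components.json. Alternative dependencies (`a | b`) are not handled.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// parseDebDepends (hypothetical helper) extracts exactly pinned versions,
// i.e. entries of the form "name (= version)", from the value of a Debian
// control file's Depends field. Other constraints (>=, <<, ...) and
// unversioned dependencies are skipped.
func parseDebDepends(depends string) map[string]string {
	pinned := make(map[string]string)
	for _, entry := range strings.Split(depends, ",") {
		entry = strings.TrimSpace(entry)
		name, constraint, found := strings.Cut(entry, " (")
		if !found {
			continue // unversioned dependency such as libcap2-bin
		}
		constraint = strings.TrimSuffix(constraint, ")")
		if strings.HasPrefix(constraint, "= ") {
			pinned[name] = strings.TrimSpace(strings.TrimPrefix(constraint, "= "))
		}
	}
	return pinned
}

// findMismatches (hypothetical helper) compares the versions pinned in
// components.json (expected) against what the package requires (actual)
// and returns one message per disagreement, sorted by package name.
func findMismatches(expected, actual map[string]string) []string {
	names := make([]string, 0, len(expected))
	for name := range expected {
		names = append(names, name)
	}
	sort.Strings(names)

	var mismatches []string
	for _, name := range names {
		got, ok := actual[name]
		switch {
		case !ok:
			mismatches = append(mismatches,
				fmt.Sprintf("%s: pinned to %s but not required by the package", name, expected[name]))
		case got != expected[name]:
			mismatches = append(mismatches,
				fmt.Sprintf("%s: components.json has %s but dcgm-exporter requires %s", name, expected[name], got))
		}
	}
	return mismatches
}

func main() {
	// Depends value as reported for the 4.8.0 Ubuntu package in this thread.
	depends := "libc6 (>= 2.34), datacenter-gpu-manager-4-core (= 1:4.5.1-1), " +
		"datacenter-gpu-manager-4-proprietary (= 1:4.5.1-1), libcap2-bin"
	// Stand-in for the versions read from components.json before the bump.
	expected := map[string]string{
		"datacenter-gpu-manager-4-core":        "1:4.4.2-1",
		"datacenter-gpu-manager-4-proprietary": "1:4.4.2-1",
	}
	for _, m := range findMismatches(expected, parseDebDepends(depends)) {
		fmt.Println(m)
	}
}
```

With the stale 1:4.4.2-1 pins shown here, both datacenter-gpu-manager packages are reported as mismatched; an empty result would mean the pins agree with the package metadata.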

Copilot AI review requested due to automatic review settings February 4, 2026 22:08

github-actions bot commented Feb 4, 2026

PR Title Lint Failed ❌

Current Title: test: Add DCGM Exporter compatibility e2e

Your PR title doesn't follow the expected format. Please update your PR title to follow one of these patterns:

Conventional Commits Format:

  • feat: add new feature - for new features
  • fix: resolve bug in component - for bug fixes
  • docs: update README - for documentation changes
  • refactor: improve code structure - for refactoring
  • test: add unit tests - for test additions
  • chore: remove dead code - for maintenance tasks
  • chore(deps): update dependencies - for updating dependencies
  • ci: update build pipeline - for CI/CD changes

Guidelines:

  • Use lowercase for the type and description
  • Keep the description concise but descriptive
  • Use imperative mood (e.g., "add" not "adds" or "added")
  • Don't end with a period

Examples:

  • feat(windows): add secure TLS bootstrapping for Windows nodes
  • fix: resolve kubelet certificate rotation issue
  • docs: update installation guide
  • Invalid: Added new feature
  • Invalid: Fix bug.
  • Invalid: Update docs

Please update your PR title and the lint check will run again automatically.

Copilot AI left a comment

Pull request overview

This PR adds an e2e test to verify package dependency compatibility between DCGM Exporter and datacenter-gpu-manager packages for Ubuntu 24.04 GPU nodes. The test validates that the package versions specified in components.json match the actual dependencies declared in the dcgm-exporter package from Microsoft's repository.

Changes:

  • Adds new compatibility test that downloads and inspects dcgm-exporter package metadata
  • Validates datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-proprietary version dependencies

surajssd commented Feb 4, 2026

Support for Azure Linux is blocked because the dcgm-exporter RPM does not currently list the two datacenter-gpu-manager packages as dependencies the way the Ubuntu package does.

Ubuntu shows datacenter-gpu-manager packages as dependencies:

# curl -LO https://packages.microsoft.com/repos/microsoft-ubuntu-noble-prod/pool/main/d/dcgm-exporter/dcgm-exporter_4.7.1-ubuntu24.04u2_amd64.deb
# dpkg-deb -f dcgm-exporter_4.7.1-ubuntu24.04u2_amd64.deb Depends
libc6 (>= 2.34), datacenter-gpu-manager-4-core (= 1:4.4.2-1), datacenter-gpu-manager-4-proprietary (= 1:4.4.2-1), libcap2-bin

The Azure Linux package, unlike the Ubuntu one, lists no datacenter-gpu-manager dependencies:

# curl -LO https://packages.microsoft.com/azurelinux/3.0/prod/cloud-native/x86_64/Packages/d/dcgm-exporter-4.7.1-2.azl3.x86_64.rpm
# rpm -qpR dcgm-exporter-4.7.1-2.azl3.x86_64.rpm
warning: dcgm-exporter-4.7.1-2.azl3.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID 3135ce90: NOKEY
/bin/bash
/bin/sh
/bin/sh
/bin/sh
config(dcgm-exporter) = 4.7.1-2.azl3
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.3.2)(64bit)
libc.so.6(GLIBC_2.32)(64bit)
libc.so.6(GLIBC_2.34)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libresolv.so.2()(64bit)
openssl-libs
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsZstd) <= 5.4.18-1
systemd
systemd
systemd

With the new version, the output should look something like this:

# rpm -qpR /tmp/dcgm-exporter-4.8.0-1.azl3.x86_64.rpm
/bin/bash
/bin/sh
/bin/sh
/bin/sh
config(dcgm-exporter) = 4.8.0-1.azl3
datacenter-gpu-manager-4-core = 1:4.5.1-1
datacenter-gpu-manager-4-proprietary = 1:4.5.1-1
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.3.2)(64bit)
libc.so.6(GLIBC_2.32)(64bit)
libc.so.6(GLIBC_2.34)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libresolv.so.2()(64bit)
openssl-libs
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsZstd) <= 5.4.18-1
systemd
systemd
systemd
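
For the Azure Linux side, extracting the pinned versions from `rpm -qpR` output could look roughly like the sketch below. It is an illustration under assumptions rather than the test's actual code: `parseRPMRequires` is a hypothetical name, and the sample input is an abbreviated copy of the 4.8.0 output above. Only lines of the exact form `name = version` for the requested package names are kept, so entries such as `config(dcgm-exporter) = ...` and `rpmlib(...) <= ...` are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// parseRPMRequires (hypothetical helper) scans `rpm -qpR` output and returns
// the version of every "name = version" requirement whose name is in wanted.
// Warning lines, file paths, and non-equality constraints are skipped.
func parseRPMRequires(output string, wanted []string) map[string]string {
	want := make(map[string]bool, len(wanted))
	for _, name := range wanted {
		want[name] = true
	}

	pinned := make(map[string]string)
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 3 && fields[1] == "=" && want[fields[0]] {
			pinned[fields[0]] = fields[2]
		}
	}
	return pinned
}

func main() {
	// Abbreviated copy of the 4.8.0 requirements shown in this comment.
	output := `/bin/bash
config(dcgm-exporter) = 4.8.0-1.azl3
datacenter-gpu-manager-4-core = 1:4.5.1-1
datacenter-gpu-manager-4-proprietary = 1:4.5.1-1
libc.so.6(GLIBC_2.34)(64bit)
rpmlib(PayloadIsZstd) <= 5.4.18-1
systemd`

	wanted := []string{
		"datacenter-gpu-manager-4-core",
		"datacenter-gpu-manager-4-proprietary",
	}
	pinned := parseRPMRequires(output, wanted)
	for _, name := range wanted {
		version, ok := pinned[name]
		if !ok {
			fmt.Printf("%s: no pinned requirement found\n", name)
			continue
		}
		fmt.Printf("%s: %s\n", name, version)
	}
}
```

Run against the 4.7.1 output, the same function returns an empty map, which is the situation that blocked Azure Linux support: there is nothing to compare against components.json.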

surajssd commented Feb 10, 2026

This is a sample test run after cherry-picking the code from PR #7809:

Ubuntu run:

➜  go test -run "^Test_DCGM_Exporter_Compatibility/Ubuntu2404$" -v -timeout 90m -parallel 100
...
    scenario_gpu_managed_experience_test.go:199: Expected versions from components.json:
    scenario_gpu_managed_experience_test.go:199:   dcgm-exporter: 4.8.0-ubuntu24.04u1
    scenario_gpu_managed_experience_test.go:199:   datacenter-gpu-manager-4-core: 1:4.4.2-1
    scenario_gpu_managed_experience_test.go:199:   datacenter-gpu-manager-4-proprietary: 1:4.4.2-1
    scenario_gpu_managed_experience_test.go:202: Downloading dcgm-exporter package from PMC...
    scenario_gpu_managed_experience_test.go:207: Extracting dependency versions from package...
    scenario_gpu_managed_experience_test.go:212: Package dependencies: libc6 (>= 2.34), datacenter-gpu-manager-4-core (= 1:4.5.1-1), datacenter-gpu-manager-4-proprietary (= 1:4.5.1-1), libcap2-bin
    scenario_gpu_managed_experience_test.go:215: Actual versions from dcgm-exporter package:
    scenario_gpu_managed_experience_test.go:215:   datacenter-gpu-manager-4-core: 1:4.5.1-1
    scenario_gpu_managed_experience_test.go:215:   datacenter-gpu-manager-4-proprietary: 1:4.5.1-1
    scenario_gpu_managed_experience_test.go:218: 🔴 FAIL:
        	Error Trace:	/Users/code/AgentBaker/e2e/scenario_gpu_managed_experience_test.go:218
        	            				/Users/code/AgentBaker/e2e/test_helpers.go:372
        	            				/Users/code/AgentBaker/e2e/test_helpers.go:228
        	            				/Users/code/AgentBaker/e2e/test_helpers.go:91
        	            				/Users/code/AgentBaker/e2e/scenario_gpu_managed_experience_test.go:191
        	            				/opt/homebrew/Cellar/go/1.25.6/libexec/src/runtime/asm_arm64.s:1268
        	Error:      	Not equal:
        	            	expected: "1:4.4.2-1"
        	            	actual  : "1:4.5.1-1"

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-1:4.4.2-1
        	            	+1:4.5.1-1
        	Test:       	Test_DCGM_Exporter_Compatibility/Ubuntu2404
        	Messages:   	datacenter-gpu-manager-4-core version mismatch: components.json has 1:4.4.2-1 but dcgm-exporter requires 1:4.5.1-1
    scenario_gpu_managed_experience_test.go:218: 🔴 FAIL:
    vmss.go:468: extracted VM logs to scenario-logs/Test_DCGM_Exporter_Compatibility/Ubuntu2404
--- FAIL: Test_DCGM_Exporter_Compatibility (0.00s)
    --- FAIL: Test_DCGM_Exporter_Compatibility/Ubuntu2404 (307.28s)
FAIL
exit status 1
FAIL	github.com/Azure/agentbaker/e2e	308.791s

Azure Linux 3.0 run:

➜  go test -run "^Test_DCGM_Exporter_Compatibility/AzureLinux3$" -v -timeout 90m -parallel 100
...
    scenario_gpu_managed_experience_test.go:199: Expected versions from components.json:
    scenario_gpu_managed_experience_test.go:199:   dcgm-exporter: 4.8.0-1.azl3
    scenario_gpu_managed_experience_test.go:199:   datacenter-gpu-manager-4-core: 1:4.4.2-1
    scenario_gpu_managed_experience_test.go:199:   datacenter-gpu-manager-4-proprietary: 1:4.4.2-1
    scenario_gpu_managed_experience_test.go:202: Downloading dcgm-exporter package from PMC...
    scenario_gpu_managed_experience_test.go:207: Extracting dependency versions from package...
    scenario_gpu_managed_experience_test.go:212: Package dependencies: datacenter-gpu-manager-4-core = 1:4.5.1-1
        datacenter-gpu-manager-4-proprietary = 1:4.5.1-1
    scenario_gpu_managed_experience_test.go:215: Actual versions from dcgm-exporter package:
    scenario_gpu_managed_experience_test.go:215:   datacenter-gpu-manager-4-core: 1:4.5.1-1
    scenario_gpu_managed_experience_test.go:215:   datacenter-gpu-manager-4-proprietary: 1:4.5.1-1
    scenario_gpu_managed_experience_test.go:218: 🔴 FAIL:
        	Error Trace:	/Users/code/AgentBaker/e2e/scenario_gpu_managed_experience_test.go:218
        	            				/Users/code/AgentBaker/e2e/test_helpers.go:372
        	            				/Users/code/AgentBaker/e2e/test_helpers.go:228
        	            				/Users/code/AgentBaker/e2e/test_helpers.go:91
        	            				/Users/code/AgentBaker/e2e/scenario_gpu_managed_experience_test.go:191
        	            				/opt/homebrew/Cellar/go/1.25.6/libexec/src/runtime/asm_arm64.s:1268
        	Error:      	Not equal:
        	            	expected: "1:4.4.2-1"
        	            	actual  : "1:4.5.1-1"

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-1:4.4.2-1
        	            	+1:4.5.1-1
        	Test:       	Test_DCGM_Exporter_Compatibility/AzureLinux3
        	Messages:   	datacenter-gpu-manager-4-core version mismatch: components.json has 1:4.4.2-1 but dcgm-exporter requires 1:4.5.1-1
    scenario_gpu_managed_experience_test.go:218: 🔴 FAIL:
    vmss.go:468: extracted VM logs to scenario-logs/Test_DCGM_Exporter_Compatibility/AzureLinux3
--- FAIL: Test_DCGM_Exporter_Compatibility (0.00s)
    --- FAIL: Test_DCGM_Exporter_Compatibility/AzureLinux3 (330.35s)
FAIL
exit status 1
FAIL	github.com/Azure/agentbaker/e2e	331.540s

@surajssd surajssd marked this pull request as ready for review February 10, 2026 00:42
Copilot AI review requested due to automatic review settings February 10, 2026 00:42
@github-actions github-actions bot added the components This pull request updates cached components on Linux or Windows VHDs label Feb 10, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

@surajssd surajssd changed the title from "e2e: Add Ubuntu 24.04 DCGM Exporter compatibility test" to "test: Add Ubuntu 24.04 DCGM Exporter compatibility e2e" Feb 10, 2026
@surajssd surajssd changed the title from "test: Add Ubuntu 24.04 DCGM Exporter compatibility e2e" to "test: Add DCGM Exporter compatibility e2e" Feb 10, 2026
Copilot AI review requested due to automatic review settings February 10, 2026 17:18
@surajssd surajssd force-pushed the suraj/compatible-dcgm branch from 40047bb to 42b3f71 Compare February 10, 2026 17:18

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

surajssd and others added 3 commits February 10, 2026 10:11
Add Test_DCGM_Exporter_Compatibility to verify that the dcgm-exporter
package dependencies (`datacenter-gpu-manager-4-core` and
`datacenter-gpu-manager-4-proprietary`) match the versions specified in
`components.json`.

The test downloads the dcgm-exporter package from PMC, extracts its
dependency requirements, and validates they match our pinned versions.
This prevents version mismatches that could cause GPU monitoring
failures on nodes.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

Update datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-proprietary
from 1:4.4.2-1 to 1:4.5.1-1 for Ubuntu 22.04, Ubuntu 24.04, and Azure Linux 3.0.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Copilot AI review requested due to automatic review settings February 10, 2026 18:31

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
