test: Add DCGM Exporter compatibility e2e #7787
Pull request overview
This PR adds an e2e test to verify package dependency compatibility between DCGM Exporter and datacenter-gpu-manager packages for Ubuntu 24.04 GPU nodes. The test validates that the package versions specified in components.json match the actual dependencies declared in the dcgm-exporter package from Microsoft's repository.
Changes:
- Adds new compatibility test that downloads and inspects dcgm-exporter package metadata
- Validates datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-proprietary version dependencies
Support for Azure Linux is blocked because the dcgm-exporter package there does not currently list the two packages as dependencies the way the Ubuntu package does.

Ubuntu shows the datacenter-gpu-manager packages as dependencies:

# curl -LO https://packages.microsoft.com/repos/microsoft-ubuntu-noble-prod/pool/main/d/dcgm-exporter/dcgm-exporter_4.7.1-ubuntu24.04u2_amd64.deb
# dpkg-deb -f dcgm-exporter_4.7.1-ubuntu24.04u2_amd64.deb Depends
libc6 (>= 2.34), datacenter-gpu-manager-4-core (= 1:4.4.2-1), datacenter-gpu-manager-4-proprietary (= 1:4.4.2-1), libcap2-bin

Azure Linux, unlike Ubuntu, does not list the datacenter-gpu-manager packages at all:

# curl -LO https://packages.microsoft.com/azurelinux/3.0/prod/cloud-native/x86_64/Packages/d/dcgm-exporter-4.7.1-2.azl3.x86_64.rpm
# rpm -qpR dcgm-exporter-4.7.1-2.azl3.x86_64.rpm
warning: dcgm-exporter-4.7.1-2.azl3.x86_64.rpm: Header V4 RSA/SHA256 Signature, key ID 3135ce90: NOKEY
/bin/bash
/bin/sh
/bin/sh
/bin/sh
config(dcgm-exporter) = 4.7.1-2.azl3
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.3.2)(64bit)
libc.so.6(GLIBC_2.32)(64bit)
libc.so.6(GLIBC_2.34)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libresolv.so.2()(64bit)
openssl-libs
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsZstd) <= 5.4.18-1
systemd
systemd
systemd

With the new version we will see something like this:

# rpm -qpR /tmp/dcgm-exporter-4.8.0-1.azl3.x86_64.rpm
/bin/bash
/bin/sh
/bin/sh
/bin/sh
config(dcgm-exporter) = 4.8.0-1.azl3
datacenter-gpu-manager-4-core = 1:4.5.1-1
datacenter-gpu-manager-4-proprietary = 1:4.5.1-1
libc.so.6()(64bit)
libc.so.6(GLIBC_2.2.5)(64bit)
libc.so.6(GLIBC_2.3.2)(64bit)
libc.so.6(GLIBC_2.32)(64bit)
libc.so.6(GLIBC_2.34)(64bit)
libc.so.6(GLIBC_2.4)(64bit)
libresolv.so.2()(64bit)
openssl-libs
rpmlib(CompressedFileNames) <= 3.0.4-1
rpmlib(FileDigests) <= 4.6.0-1
rpmlib(PayloadFilesHavePrefix) <= 4.0-1
rpmlib(PayloadIsZstd) <= 5.4.18-1
systemd
systemd
systemd
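The two outputs above use different formats, so pulling the pinned DCGM versions out of them needs one parser per format. The following is a minimal sketch, not the code from this PR: the function names `parseDebDepends` and `parseRPMRequires` are hypothetical, and the sketch assumes only exact (`=`) pins are of interest and ignores Debian alternatives (`a | b`).

```go
package main

import "strings"

// parseDebDepends parses the value of a Debian "Depends" field, for example
// "libc6 (>= 2.34), datacenter-gpu-manager-4-core (= 1:4.4.2-1)", and returns
// only the exactly pinned ("=") versions, keyed by package name.
// Hypothetical helper; alternatives ("a | b") are not handled.
func parseDebDepends(field string) map[string]string {
	pinned := make(map[string]string)
	for _, dep := range strings.Split(field, ",") {
		dep = strings.TrimSpace(dep)
		start := strings.Index(dep, "(")
		end := strings.LastIndex(dep, ")")
		if start < 0 || end < start {
			continue // dependency without a version constraint
		}
		name := strings.TrimSpace(dep[:start])
		parts := strings.Fields(dep[start+1 : end])
		if len(parts) == 2 && parts[0] == "=" {
			pinned[name] = parts[1]
		}
	}
	return pinned
}

// parseRPMRequires parses `rpm -qpR` output (one requirement per line) and
// returns every "name = version" requirement, keyed by name.
// Hypothetical helper; lines with other operators or no version are skipped.
func parseRPMRequires(output string) map[string]string {
	pinned := make(map[string]string)
	for _, line := range strings.Split(output, "\n") {
		parts := strings.Fields(line)
		if len(parts) == 3 && parts[1] == "=" {
			pinned[parts[0]] = parts[2]
		}
	}
	return pinned
}
```

Note that `parseRPMRequires` also picks up entries such as `config(dcgm-exporter) = 4.8.0-1.azl3`; a caller would look up only the two `datacenter-gpu-manager-4-*` names.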
Here is a sample test run with the code from PR #7809 cherry-picked.

Ubuntu run:

➜ go test -run "^Test_DCGM_Exporter_Compatibility/Ubuntu2404$" -v -timeout 90m -parallel 100
...
scenario_gpu_managed_experience_test.go:199: Expected versions from components.json:
scenario_gpu_managed_experience_test.go:199: dcgm-exporter: 4.8.0-ubuntu24.04u1
scenario_gpu_managed_experience_test.go:199: datacenter-gpu-manager-4-core: 1:4.4.2-1
scenario_gpu_managed_experience_test.go:199: datacenter-gpu-manager-4-proprietary: 1:4.4.2-1
scenario_gpu_managed_experience_test.go:202: Downloading dcgm-exporter package from PMC...
scenario_gpu_managed_experience_test.go:207: Extracting dependency versions from package...
scenario_gpu_managed_experience_test.go:212: Package dependencies: libc6 (>= 2.34), datacenter-gpu-manager-4-core (= 1:4.5.1-1), datacenter-gpu-manager-4-proprietary (= 1:4.5.1-1), libcap2-bin
scenario_gpu_managed_experience_test.go:215: Actual versions from dcgm-exporter package:
scenario_gpu_managed_experience_test.go:215: datacenter-gpu-manager-4-core: 1:4.5.1-1
scenario_gpu_managed_experience_test.go:215: datacenter-gpu-manager-4-proprietary: 1:4.5.1-1
scenario_gpu_managed_experience_test.go:218: 🔴 FAIL:
Error Trace: /Users/code/AgentBaker/e2e/scenario_gpu_managed_experience_test.go:218
/Users/code/AgentBaker/e2e/test_helpers.go:372
/Users/code/AgentBaker/e2e/test_helpers.go:228
/Users/code/AgentBaker/e2e/test_helpers.go:91
/Users/code/AgentBaker/e2e/scenario_gpu_managed_experience_test.go:191
/opt/homebrew/Cellar/go/1.25.6/libexec/src/runtime/asm_arm64.s:1268
Error: Not equal:
expected: "1:4.4.2-1"
actual : "1:4.5.1-1"
Diff:
--- Expected
+++ Actual
@@ -1 +1 @@
-1:4.4.2-1
+1:4.5.1-1
Test: Test_DCGM_Exporter_Compatibility/Ubuntu2404
Messages: datacenter-gpu-manager-4-core version mismatch: components.json has 1:4.4.2-1 but dcgm-exporter requires 1:4.5.1-1
scenario_gpu_managed_experience_test.go:218: 🔴 FAIL:
vmss.go:468: extracted VM logs to scenario-logs/Test_DCGM_Exporter_Compatibility/Ubuntu2404
--- FAIL: Test_DCGM_Exporter_Compatibility (0.00s)
--- FAIL: Test_DCGM_Exporter_Compatibility/Ubuntu2404 (307.28s)
FAIL
exit status 1
FAIL github.com/Azure/agentbaker/e2e 308.791s

Azure Linux 3.0 run:

➜ go test -run "^Test_DCGM_Exporter_Compatibility/AzureLinux3$" -v -timeout 90m -parallel 100
...
scenario_gpu_managed_experience_test.go:199: Expected versions from components.json:
scenario_gpu_managed_experience_test.go:199: dcgm-exporter: 4.8.0-1.azl3
scenario_gpu_managed_experience_test.go:199: datacenter-gpu-manager-4-core: 1:4.4.2-1
scenario_gpu_managed_experience_test.go:199: datacenter-gpu-manager-4-proprietary: 1:4.4.2-1
scenario_gpu_managed_experience_test.go:202: Downloading dcgm-exporter package from PMC...
scenario_gpu_managed_experience_test.go:207: Extracting dependency versions from package...
scenario_gpu_managed_experience_test.go:212: Package dependencies: datacenter-gpu-manager-4-core = 1:4.5.1-1
datacenter-gpu-manager-4-proprietary = 1:4.5.1-1
scenario_gpu_managed_experience_test.go:215: Actual versions from dcgm-exporter package:
scenario_gpu_managed_experience_test.go:215: datacenter-gpu-manager-4-core: 1:4.5.1-1
scenario_gpu_managed_experience_test.go:215: datacenter-gpu-manager-4-proprietary: 1:4.5.1-1
scenario_gpu_managed_experience_test.go:218: 🔴 FAIL:
Error Trace: /Users/code/AgentBaker/e2e/scenario_gpu_managed_experience_test.go:218
/Users/code/AgentBaker/e2e/test_helpers.go:372
/Users/code/AgentBaker/e2e/test_helpers.go:228
/Users/code/AgentBaker/e2e/test_helpers.go:91
/Users/code/AgentBaker/e2e/scenario_gpu_managed_experience_test.go:191
/opt/homebrew/Cellar/go/1.25.6/libexec/src/runtime/asm_arm64.s:1268
Error: Not equal:
expected: "1:4.4.2-1"
actual : "1:4.5.1-1"
Diff:
--- Expected
+++ Actual
@@ -1 +1 @@
-1:4.4.2-1
+1:4.5.1-1
Test: Test_DCGM_Exporter_Compatibility/AzureLinux3
Messages: datacenter-gpu-manager-4-core version mismatch: components.json has 1:4.4.2-1 but dcgm-exporter requires 1:4.5.1-1
scenario_gpu_managed_experience_test.go:218: 🔴 FAIL:
vmss.go:468: extracted VM logs to scenario-logs/Test_DCGM_Exporter_Compatibility/AzureLinux3
--- FAIL: Test_DCGM_Exporter_Compatibility (0.00s)
--- FAIL: Test_DCGM_Exporter_Compatibility/AzureLinux3 (330.35s)
FAIL
exit status 1
FAIL github.com/Azure/agentbaker/e2e 331.540s
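The failures in both runs come from comparing the versions pinned in components.json with the versions the package requires. A rough sketch of such a comparison follows; it is an illustration rather than the PR's implementation (the runs above show the real test asserting equality per package). The function name `findMismatches` and the "not listed" message are assumptions. Only the mismatch message format is copied from the log output.

```go
package main

import (
	"fmt"
	"sort"
)

// findMismatches compares the versions pinned in components.json (expected)
// with the versions the dcgm-exporter package requires (actual) and returns
// one message per problem, sorted for stable output.
// Hypothetical helper written for illustration.
func findMismatches(expected, actual map[string]string) []string {
	var msgs []string
	for name, want := range expected {
		got, ok := actual[name]
		switch {
		case !ok:
			// Assumed wording; this case matches the Azure Linux 4.7.1 package,
			// which lists no DCGM dependency at all.
			msgs = append(msgs, fmt.Sprintf("%s is not listed as a dependency of dcgm-exporter", name))
		case got != want:
			msgs = append(msgs, fmt.Sprintf(
				"%s version mismatch: components.json has %s but dcgm-exporter requires %s",
				name, want, got))
		}
	}
	sort.Strings(msgs)
	return msgs
}
```

An empty result means the pinned versions are compatible with the package. With the values from the runs above, the function returns one message for each of the two DCGM packages.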
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
40047bb to 42b3f71
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
42b3f71 to 5816853
5816853 to 588727f
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Add Test_DCGM_Exporter_Compatibility to verify that the dcgm-exporter package dependencies (`datacenter-gpu-manager-4-core` and `datacenter-gpu-manager-4-proprietary`) match the versions specified in `components.json`. The test downloads the dcgm-exporter package from PMC, extracts its dependency requirements, and validates they match our pinned versions. This prevents version mismatches that could cause GPU monitoring failures on nodes.
Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

Update datacenter-gpu-manager-4-core and datacenter-gpu-manager-4-proprietary from 1:4.4.2-1 to 1:4.5.1-1 for Ubuntu 22.04, Ubuntu 24.04, and Azure Linux 3.0.
Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
722fa0b to 66ea522
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
- Add `Test_DCGM_Exporter_Compatibility` E2E test to detect dependency mismatches between `dcgm-exporter` and its DCGM dependencies before they are cached on VHDs
- Update `datacenter-gpu-manager-4-core` and `datacenter-gpu-manager-4-proprietary` from `1:4.4.2-1` to `1:4.5.1-1` to match `dcgm-exporter 4.8.0` requirements
- Update `dcgm-exporter` from `4.7.1` to `4.8.0` across Ubuntu 22.04, Ubuntu 24.04, and Azure Linux 3.0

Problem
The `dcgm-exporter` package has strict version dependencies on `datacenter-gpu-manager-4-core` and `datacenter-gpu-manager-4-proprietary`. These packages are downloaded and cached separately during VHD build time. However, package caching succeeds even when the versions are incompatible; the mismatch only surfaces later, during node provisioning, when `dpkg` or `rpm` attempts to install the packages together.

This creates a problematic failure mode:
dpkg: dependency problems prevent configuration of dcgm-exporter:
dcgm-exporter depends on datacenter-gpu-manager-4-core (= 1:4.4.2-1); however:
Version of datacenter-gpu-manager-4-core on system is 1:4.5.0-1.
A similar issue was fixed in PR #7736.
Solution
Add a new E2E test that validates dependency compatibility before packages are cached on VHDs:
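- Downloads the `dcgm-exporter` package from PMC (without installing)
- Extracts its dependency requirements (`dpkg-deb -f` for Ubuntu, `rpm -qpR` for Azure Linux)
- Validates that they match the versions pinned in `components.json`

This shifts detection from "node provisioning time" to "PR review time", preventing broken VHDs from being built and released.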
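The extraction step depends on the package format. As a small sketch of that choice, the function below picks the query command from the file extension. The name `dependencyQueryCommand` and the extension-based dispatch are assumptions made for illustration; the PR may select the command differently, for example by distro. The two commands themselves are the ones quoted in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// dependencyQueryCommand returns the argv that prints the dependencies of a
// downloaded package file without installing it.
// Hypothetical helper; dispatching on the file extension is an assumption.
func dependencyQueryCommand(pkgPath string) ([]string, error) {
	switch {
	case strings.HasSuffix(pkgPath, ".deb"):
		// Prints only the Depends control field of the .deb file.
		return []string{"dpkg-deb", "-f", pkgPath, "Depends"}, nil
	case strings.HasSuffix(pkgPath, ".rpm"):
		// Lists the requirements of the .rpm file, one per line.
		return []string{"rpm", "-qpR", pkgPath}, nil
	default:
		return nil, fmt.Errorf("unsupported package format: %s", pkgPath)
	}
}
```

The returned slice could be passed to `exec.Command(argv[0], argv[1:]...)` for a local check, or joined into a single command line and run on the test VM. Which of these the test in this PR does is not stated here.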