-
Notifications
You must be signed in to change notification settings - Fork 247
feat: adding support for amdama (supernova) gpus. #7697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
pkg/agent/datamodel/helper.go
Outdated
| return true | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method duplicated 3 times. Should be no more than 2. (Ideally 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just following the method used in each file. Presumably the ones we don't need will get removed when the move from scripted to scriptless is done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also. would you recommend breaking form the model to make lint happy?
pkg/agent/variables.go
Outdated
| "userAssignedIdentityID": config.UserAssignedIdentityClientID, | ||
| "isVHD": isVHD(profile), | ||
| "gpuNode": strconv.FormatBool(config.EnableNvidia), | ||
| "amdamaNode": strconv.FormatBool(datamodel.IsAmdAmaEnabledSKU(profile.VMSize)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, but there's one for Nvidia, so I figured better include it just in case.
| }) | ||
| } | ||
|
|
||
| func Test_AzureLinuxV3_MA35D(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AzureLinux is the only OS that support this VMSku correct ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With AKS, yes.
| } | ||
|
|
||
| setupAmdAma() { | ||
| if [ "$(isARM64)" -eq 1 ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this even possible ? to have AMDAMA_MODE == true and with a arm64 arch ? What would be the consequences here if this cluase is true and we expect to have AMDAMA_MODE and we dont setup anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should never happen.
| return | ||
| fi | ||
|
|
||
| if isMarinerOrAzureLinux "$OS"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same, could we have ubuntu ? if it's the case what will happen ? should we baill out instead and cause anode bootstrap failure ? I would expect that Ubuntu/Flatcar and ARM64 will all be blocked at the RP level to provide users with early validation failure ?
One more question, does isMarinerOrAzureLinux return true for OSGuard I recall there were extrra check required to ensure it wasn't LinuxOSGuard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. If someone did try to create an AKS cluster with a different OS, it just wouldn't install the driver, but AKS cluster creation would succeed. That's fine.
No idea for OSGuard...
| sudo tdnf install -y azurelinux-repos-amd | ||
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | ||
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we combine into a single dnf call ?
| # Install core package | ||
| sudo tdnf install -y libzip | ||
| sudo tdnf install -y azurelinux-repos-extended | ||
| sudo RPM_FRONTEND=noninteractive tdnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very much hardcoded, if it's for testing purposes for now, I'm ok but we need to move the package to components.json and also ensure we can detect new versions automatically with renovate. Let's do that in another PR.
pkg/agent/baker.go
Outdated
| } | ||
|
|
||
| // IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | ||
| func IsAmdAmaEnabledSKU(vmSize string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is the first time we perform a logic in the VHD to mapping SKU to feature... normally this is done in RP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is because aks-rp is unaware that this is a GPU SKU so it doesn't interfere with Nvidia/AMD GPUs, and so we don't have to add a whole new mechanism just for this SKU. This was Yi's decision.
| fi | ||
|
|
||
| # Install and configure AMD AMA (Supernova) drivers if this is an AMA node | ||
| if [ "${AMDAMA_NODE}" = "true" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can query instance metadata server to get the sku and install it directly here instead of going through all the hoops if the intention is default enabled for the SKU
# Retrieves the VM SKU (size) from the cached IMDS instance metadata.
# Outputs:
# The VM size/SKU (e.g., "Standard_NM16ads_MA35D") or empty string if not found.
# Returns:
# 0 on success, non-zero on failure
get_imds_vm_sku() {
local vm_sku=""
vm_sku=$(jq -r '.compute.vmSize // ""' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
echo "$vm_sku"
}
if [ "$(get_imds_vm_sku)" = "Standard_NM16ads_MA35D" ]; then
echo "Exact match!"
fi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds support for AMD AMA (Supernova) GPU hardware by enabling detection and configuration for the Standard_NM16ads_MA35D VM SKU. The changes introduce a new VM size detection mechanism, install necessary drivers and packages for AMD AMA GPUs, and configure the required system settings for Azure Linux environments.
Changes:
- Added VM SKU detection logic for AMD AMA-enabled instances (Standard_NM16ads_MA35D)
- Implemented driver installation and configuration for AMD AMA GPUs on Azure Linux
- Added E2E tests to validate AMD AMA GPU functionality
- Updated firewall rules to allow access to download.microsoft.com for AMD AMA package installation
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| aks-node-controller/helpers/const.go | Added constant for Standard_NM16ads_MA35D VM size |
| aks-node-controller/parser/helper.go | Added helper function to detect AMD AMA enabled SKUs |
| aks-node-controller/parser/parser.go | Added AMDAMA_NODE environment variable to CSE configuration |
| pkg/agent/baker.go | Added IsAmdAmaEnabledSKU function for template rendering |
| pkg/agent/datamodel/helper.go | Added IsAmdAmaEnabledSKU function for VM SKU detection |
| pkg/agent/variables.go | Added amdamaNode variable to CSE command variables |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Declared AMDAMA_NODE variable for shell scripts |
| parts/linux/cloud-init/artifacts/cse_config.sh | Implemented setupAmdAma function to install drivers and configure system |
| parts/linux/cloud-init/artifacts/cse_main.sh | Added call to setupAmdAma during node preparation |
| e2e/scenario_test.go | Added E2E tests for AMD AMA GPU support (standard and scriptless) |
| e2e/aks_model.go | Added firewall rule to allow download.microsoft.com access |
pkg/agent/baker.go
Outdated
| // IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | ||
| func IsAmdAmaEnabledSKU(vmSize string) bool { | ||
| switch vmSize { | ||
| case "Standard_NM16ads_MA35D": | ||
| return true | ||
| } | ||
| return false | ||
| } | ||
|
|
Copilot
AI
Jan 30, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The IsAmdAmaEnabledSKU function is duplicated in both pkg/agent/baker.go and pkg/agent/datamodel/helper.go. This creates maintenance overhead as both implementations need to be updated when new AMD AMA SKUs are added. Consider removing this function from baker.go and calling datamodel.IsAmdAmaEnabledSKU instead, following the pattern used by IsSgxEnabledSKU which only exists in datamodel/helper.go.
| // IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support. | |
| func IsAmdAmaEnabledSKU(vmSize string) bool { | |
| switch vmSize { | |
| case "Standard_NM16ads_MA35D": | |
| return true | |
| } | |
| return false | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 19 out of 70 changed files in this pull request and generated 2 comments.
| sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf" | ||
| sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages" | ||
| if [ "$(systemctl is-active kubelet)" = "active" ]; then |
Copilot
AI
Feb 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hugepages are configured by appending to /etc/sysctl.conf and writing directly to /proc/sys/vm/nr_hugepages. This is not idempotent (re-runs will keep appending) and can fail if the kernel can’t allocate the requested pages, which would currently abort provisioning under set -e. Consider writing a dedicated drop-in under /etc/sysctl.d/ and applying via sysctl with explicit error handling/logging for allocation failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 19 out of 70 changed files in this pull request and generated 4 comments.
| return | ||
| fi | ||
|
|
||
| if isMarinerOrAzureLinux "$OS"; then |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setupAmdAma is gated by isMarinerOrAzureLinux "$OS", but the function installs azurelinux-repos-* packages. This will likely fail on Mariner images (and could break provisioning if someone uses this SKU on Mariner). Consider restricting this to Azure Linux only (and optionally specific Azure Linux versions) before attempting installs.
| if isMarinerOrAzureLinux "$OS"; then | |
| if isAzureLinux "$OS"; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 19 out of 70 changed files in this pull request and generated 3 comments.
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | ||
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver-1.3.0*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | ||
| if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then | ||
| echo "Unable to find AMD AMA driver package for current kernel version, exiting..." | ||
| exit $ERR_AMDAMA_DRIVER_NOT_FOUND | ||
| fi | ||
| dnf install -y $AMD_AMA_DRIVER_PACKAGE |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In setupAmdAma, KERNEL_VERSION/AMD_AMA_DRIVER_PACKAGE are not local, and $AMD_AMA_DRIVER_PACKAGE is unquoted when passed to dnf install, which can lead to word-splitting/globbing issues. Also, the grep -E pattern isn’t anchored (unlike the similar CUDA logic), increasing the chance of matching the wrong repoquery output. Consider making these variables local, quoting expansions, and tightening the match (e.g., anchor to the start of the package name).
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | |
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver-1.3.0*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1) | |
| if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then | |
| echo "Unable to find AMD AMA driver package for current kernel version, exiting..." | |
| exit $ERR_AMDAMA_DRIVER_NOT_FOUND | |
| fi | |
| dnf install -y $AMD_AMA_DRIVER_PACKAGE | |
| local KERNEL_VERSION | |
| KERNEL_VERSION=$(uname -r | sed 's/-/./g') | |
| local AMD_AMA_DRIVER_PACKAGE | |
| AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver-1.3.0*" | grep -E "^amd-ama-driver-[0-9]+.*_${KERNEL_VERSION}$" | sort -V | tail -n 1) | |
| if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then | |
| echo "Unable to find AMD AMA driver package for current kernel version, exiting..." | |
| exit $ERR_AMDAMA_DRIVER_NOT_FOUND | |
| fi | |
| dnf install -y "$AMD_AMA_DRIVER_PACKAGE" |
| Description: "Tests that a node using a AzureLinuxV3 can support MA35D SKU", | ||
| Config: Config{ |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar nit: "a AzureLinuxV3" should be "an AzureLinuxV3" (both in the non-scriptless and scriptless MA35D scenario descriptions).
| Description: "Tests that a node using a AzureLinuxV3 can support MA35D SKU", | ||
| Tags: Tags{ |
Copilot
AI
Feb 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar nit: "a AzureLinuxV3" should be "an AzureLinuxV3" in this scenario description as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 20 out of 71 changed files in this pull request and generated 2 comments.
| Description: "Tests that a node using a AzureLinuxV3 can support MA35D SKU", | ||
| Tags: Tags{ |
Copilot
AI
Feb 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammar: "a AzureLinuxV3" should be "an AzureLinuxV3".
| TestTimeout time.Duration `env:"TEST_TIMEOUT" envDefault:"60m"` | ||
| TestTimeoutCluster time.Duration `env:"TEST_TIMEOUT_CLUSTER" envDefault:"60m"` |
Copilot
AI
Feb 10, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Increasing the default E2E timeouts to 60m/60m affects the entire test suite and can mask regressions or significantly lengthen CI when tests hang. If this change is only needed for MA35D/driver bring-up, consider using per-scenario timeouts (if supported) or narrowing the scope of the timeout increase; otherwise, please document why the global increase is required.
| TestGalleryImagePrefix string `env:"TEST_GALLERY_IMAGE_PREFIX" envDefault:"abe2etest"` | ||
| TestGalleryNamePrefix string `env:"TEST_GALLERY_NAME_PREFIX" envDefault:"abe2etest"` | ||
| TestPreProvision bool `env:"TEST_PRE_PROVISION" envDefault:"false"` | ||
| TestTimeout time.Duration `env:"TEST_TIMEOUT" envDefault:"35m"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this PR does / why we need it:
Adding support for AMDAMA (Supernova) GPUs.