Skip to content

Conversation

@mipresmsft
Copy link

What this PR does / why we need it:

Adding support for AMDAMA (Supernova) GPUs.

@mipresmsft mipresmsft changed the title Moved changes to sync with main branch. Adding support for AMDAMA (Supernova) GPUs. Jan 21, 2026
return true
}
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method duplicated 3 times. Should be no more than 2. (Ideally 1)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just following the method used in each file. Presumably the ones we don't need will get removed when the move from scripted to scriptless is done?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also. would you recommend breaking form the model to make lint happy?

"userAssignedIdentityID": config.UserAssignedIdentityClientID,
"isVHD": isVHD(profile),
"gpuNode": strconv.FormatBool(config.EnableNvidia),
"amdamaNode": strconv.FormatBool(datamodel.IsAmdAmaEnabledSKU(profile.VMSize)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks unused?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, but there's one for Nvidia, so I figured better include it just in case.

})
}

func Test_AzureLinuxV3_MA35D(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AzureLinux is the only OS that support this VMSku correct ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With AKS, yes.

}

setupAmdAma() {
if [ "$(isARM64)" -eq 1 ]; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this even possible ? to have AMDAMA_MODE == true and with a arm64 arch ? What would be the consequences here if this cluase is true and we expect to have AMDAMA_MODE and we dont setup anymore.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should never happen.

return
fi

if isMarinerOrAzureLinux "$OS"; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, could we have ubuntu ? if it's the case what will happen ? should we baill out instead and cause anode bootstrap failure ? I would expect that Ubuntu/Flatcar and ARM64 will all be blocked at the RP level to provide users with early validation failure ?

One more question, does isMarinerOrAzureLinux return true for OSGuard I recall there were extrra check required to ensure it wasn't LinuxOSGuard

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. If someone did try to create an AKS cluster with a different OS, it just wouldn't install the driver, but AKS cluster creation would succeed. That's fine.

No idea for OSGuard...

sudo tdnf install -y azurelinux-repos-amd
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
sudo tdnf install -y $AMD_AMA_DRIVER_PACKAGE
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we combine into a single dnf call ?

# Install core package
sudo tdnf install -y libzip
sudo tdnf install -y azurelinux-repos-extended
sudo RPM_FRONTEND=noninteractive tdnf install -y https://download.microsoft.com/download/16b04fa7-883e-4a94-88c2-801881a47b28/amd-ama-core_1.3.0-2503242033-amd64.rpm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very much hardcoded, if it's for testing purposes for now, I'm ok but we need to move the package to components.json and also ensure we can detect new versions automatically with renovate. Let's do that in another PR.

}

// IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support.
func IsAmdAmaEnabledSKU(vmSize string) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is the first time we perform a logic in the VHD to mapping SKU to feature... normally this is done in RP.

Copy link
Author

@mipresmsft mipresmsft Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because aks-rp is unaware that this is a GPU SKU so it doesn't interfere with Nvidia/AMD GPUs, and so we don't have to add a whole new mechanism just for this SKU. This was Yi's decision.

fi

# Install and configure AMD AMA (Supernova) drivers if this is an AMA node
if [ "${AMDAMA_NODE}" = "true" ]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can query instance metadata server to get the sku and install it directly here instead of going through all the hoops if the intention is default enabled for the SKU

# Retrieves the VM SKU (size) from the cached IMDS instance metadata.
# Outputs:
#   The VM size/SKU (e.g., "Standard_NM16ads_MA35D") or empty string if not found.
# Returns:
#   0 on success, non-zero on failure
get_imds_vm_sku() {
    local vm_sku=""
    vm_sku=$(jq -r '.compute.vmSize // ""' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
    echo "$vm_sku"
}

if [ "$(get_imds_vm_sku)" = "Standard_NM16ads_MA35D" ]; then
    echo "Exact match!"
fi

Copilot AI review requested due to automatic review settings January 30, 2026 21:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for AMD AMA (Supernova) GPU hardware by enabling detection and configuration for the Standard_NM16ads_MA35D VM SKU. The changes introduce a new VM size detection mechanism, install necessary drivers and packages for AMD AMA GPUs, and configure the required system settings for Azure Linux environments.

Changes:

  • Added VM SKU detection logic for AMD AMA-enabled instances (Standard_NM16ads_MA35D)
  • Implemented driver installation and configuration for AMD AMA GPUs on Azure Linux
  • Added E2E tests to validate AMD AMA GPU functionality
  • Updated firewall rules to allow access to download.microsoft.com for AMD AMA package installation

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
aks-node-controller/helpers/const.go Added constant for Standard_NM16ads_MA35D VM size
aks-node-controller/parser/helper.go Added helper function to detect AMD AMA enabled SKUs
aks-node-controller/parser/parser.go Added AMDAMA_NODE environment variable to CSE configuration
pkg/agent/baker.go Added IsAmdAmaEnabledSKU function for template rendering
pkg/agent/datamodel/helper.go Added IsAmdAmaEnabledSKU function for VM SKU detection
pkg/agent/variables.go Added amdamaNode variable to CSE command variables
parts/linux/cloud-init/artifacts/cse_cmd.sh Declared AMDAMA_NODE variable for shell scripts
parts/linux/cloud-init/artifacts/cse_config.sh Implemented setupAmdAma function to install drivers and configure system
parts/linux/cloud-init/artifacts/cse_main.sh Added call to setupAmdAma during node preparation
e2e/scenario_test.go Added E2E tests for AMD AMA GPU support (standard and scriptless)
e2e/aks_model.go Added firewall rule to allow download.microsoft.com access

Comment on lines 1275 to 1283
// IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support.
func IsAmdAmaEnabledSKU(vmSize string) bool {
switch vmSize {
case "Standard_NM16ads_MA35D":
return true
}
return false
}

Copy link

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The IsAmdAmaEnabledSKU function is duplicated in both pkg/agent/baker.go and pkg/agent/datamodel/helper.go. This creates maintenance overhead as both implementations need to be updated when new AMD AMA SKUs are added. Consider removing this function from baker.go and calling datamodel.IsAmdAmaEnabledSKU instead, following the pattern used by IsSgxEnabledSKU which only exists in datamodel/helper.go.

Suggested change
// IsAmdAmaEnabledSKU determines if an VM SKU has AMD AMA GPU HW support.
func IsAmdAmaEnabledSKU(vmSize string) bool {
switch vmSize {
case "Standard_NM16ads_MA35D":
return true
}
return false
}

Copilot uses AI. Check for mistakes.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r2k1 @djsly Are these no longer needed?

Copilot AI review requested due to automatic review settings January 30, 2026 21:54
Copilot AI review requested due to automatic review settings February 7, 2026 00:01
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 70 changed files in this pull request and generated 2 comments.

Comment on lines +1007 to +1009
sh -c "echo 'vm.nr_hugepages=4096' >> /etc/sysctl.conf"
sh -c "echo 4096 >> /proc/sys/vm/nr_hugepages"
if [ "$(systemctl is-active kubelet)" = "active" ]; then
Copy link

Copilot AI Feb 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hugepages are configured by appending to /etc/sysctl.conf and writing directly to /proc/sys/vm/nr_hugepages. This is not idempotent (re-runs will keep appending) and can fail if the kernel can’t allocate the requested pages, which would currently abort provisioning under set -e. Consider writing a dedicated drop-in under /etc/sysctl.d/ and applying via sysctl with explicit error handling/logging for allocation failures.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings February 9, 2026 22:54
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 70 changed files in this pull request and generated 4 comments.

return
fi

if isMarinerOrAzureLinux "$OS"; then
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setupAmdAma is gated by isMarinerOrAzureLinux "$OS", but the function installs azurelinux-repos-* packages. This will likely fail on Mariner images (and could break provisioning if someone uses this SKU on Mariner). Consider restricting this to Azure Linux only (and optionally specific Azure Linux versions) before attempting installs.

Suggested change
if isMarinerOrAzureLinux "$OS"; then
if isAzureLinux "$OS"; then

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings February 9, 2026 23:53
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 70 changed files in this pull request and generated 3 comments.

Comment on lines +993 to +999
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver-1.3.0*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then
echo "Unable to find AMD AMA driver package for current kernel version, exiting..."
exit $ERR_AMDAMA_DRIVER_NOT_FOUND
fi
dnf install -y $AMD_AMA_DRIVER_PACKAGE
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In setupAmdAma, KERNEL_VERSION/AMD_AMA_DRIVER_PACKAGE are not local, and $AMD_AMA_DRIVER_PACKAGE is unquoted when passed to dnf install, which can lead to word-splitting/globbing issues. Also, the grep -E pattern isn’t anchored (unlike the similar CUDA logic), increasing the chance of matching the wrong repoquery output. Consider making these variables local, quoting expansions, and tightening the match (e.g., anchor to the start of the package name).

Suggested change
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver-1.3.0*" | grep -E "amd-ama-driver-[0-9]+.*_$KERNEL_VERSION" | sort -V | tail -n 1)
if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then
echo "Unable to find AMD AMA driver package for current kernel version, exiting..."
exit $ERR_AMDAMA_DRIVER_NOT_FOUND
fi
dnf install -y $AMD_AMA_DRIVER_PACKAGE
local KERNEL_VERSION
KERNEL_VERSION=$(uname -r | sed 's/-/./g')
local AMD_AMA_DRIVER_PACKAGE
AMD_AMA_DRIVER_PACKAGE=$(dnf repoquery -y --available "amd-ama-driver-1.3.0*" | grep -E "^amd-ama-driver-[0-9]+.*_${KERNEL_VERSION}$" | sort -V | tail -n 1)
if [ -z "$AMD_AMA_DRIVER_PACKAGE" ]; then
echo "Unable to find AMD AMA driver package for current kernel version, exiting..."
exit $ERR_AMDAMA_DRIVER_NOT_FOUND
fi
dnf install -y "$AMD_AMA_DRIVER_PACKAGE"

Copilot uses AI. Check for mistakes.
Comment on lines +1350 to +1351
Description: "Tests that a node using a AzureLinuxV3 can support MA35D SKU",
Config: Config{
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar nit: "a AzureLinuxV3" should be "an AzureLinuxV3" (both in the non-scriptless and scriptless MA35D scenario descriptions).

Copilot uses AI. Check for mistakes.
Comment on lines +1375 to +1376
Description: "Tests that a node using a AzureLinuxV3 can support MA35D SKU",
Tags: Tags{
Copy link

Copilot AI Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar nit: "a AzureLinuxV3" should be "an AzureLinuxV3" in this scenario description as well.

Copilot uses AI. Check for mistakes.
Copilot AI review requested due to automatic review settings February 10, 2026 02:02
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 71 changed files in this pull request and generated 2 comments.

Comment on lines +1375 to +1376
Description: "Tests that a node using a AzureLinuxV3 can support MA35D SKU",
Tags: Tags{
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar: "a AzureLinuxV3" should be "an AzureLinuxV3".

Copilot uses AI. Check for mistakes.
Comment on lines +79 to +80
TestTimeout time.Duration `env:"TEST_TIMEOUT" envDefault:"60m"`
TestTimeoutCluster time.Duration `env:"TEST_TIMEOUT_CLUSTER" envDefault:"60m"`
Copy link

Copilot AI Feb 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increasing the default E2E timeouts to 60m/60m affects the entire test suite and can mask regressions or significantly lengthen CI when tests hang. If this change is only needed for MA35D/driver bring-up, consider using per-scenario timeouts (if supported) or narrowing the scope of the timeout increase; otherwise, please document why the global increase is required.

Copilot uses AI. Check for mistakes.
TestGalleryImagePrefix string `env:"TEST_GALLERY_IMAGE_PREFIX" envDefault:"abe2etest"`
TestGalleryNamePrefix string `env:"TEST_GALLERY_NAME_PREFIX" envDefault:"abe2etest"`
TestPreProvision bool `env:"TEST_PRE_PROVISION" envDefault:"false"`
TestTimeout time.Duration `env:"TEST_TIMEOUT" envDefault:"35m"`
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@r2k1 @djsly Do we want to keep these?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants