Skip to content

Conversation

@awesomenix
Copy link
Contributor

  • Extend skipTestIfSKUNotAvailableErr to also skip tests on OperationNotAllowed quota exceeded errors (409 with 'exceeding approved' + 'quota' in message)
  • Enable SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE in GPU E2E pipeline
  • Make go test timeout configurable via E2E_GO_TEST_TIMEOUT env var (default 90m)
  • Set GPU pipeline timeout to 75m so tests terminate cleanly with log output before the 90m ADO task timeout kills the process

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings February 9, 2026 22:41
@awesomenix awesomenix changed the title Skip GPU E2E tests on SKU quota exceeded and add configurable test timeout fix: skip gpu e2e tests on sku quota exceeded and add configurable test timeout Feb 9, 2026
E2E_SUBSCRIPTION_ID="${E2E_SUBSCRIPTION_ID:-}"
TAGS_TO_SKIP="${TAGS_TO_SKIP:-}"
TAGS_TO_RUN="${TAGS_TO_RUN:-}"
E2E_GO_TEST_TIMEOUT="${E2E_GO_TEST_TIMEOUT:-90m}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's actually quite a lot. Maybe we should lower it to 60m?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need different timeout for GPU tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am just keeping the default for now :) we should just have it for 20m IMO but considering we can anytime create a new cluster we must have kept a high timeout

// sometimes the SKU quota is exceeded and we can't do anything. Skip the test in this case.
if respErr.ErrorCode == "OperationNotAllowed" &&
strings.Contains(respErr.Error(), "exceeding approved") &&
strings.Contains(respErr.Error(), "quota") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This strings.Contains a bit magic. Is there a sub code we can use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESPONSE 409: 409 Conflict
        	            	ERROR CODE: OperationNotAllowed
        	            	--------------------------------------------------------------------------------
        	            	{
        	            	  "error": {
        	            	    "code": "OperationNotAllowed",
        	            	    "message": "Operation could not be completed as it results in exceeding approved StandardNCADSA100v4Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: WestUS3, Current Limit: 50, Current Usage: 48, Additional Required: 24, (Minimum) New Limit Required: 72. Setup Alerts when Quota reaches threshold. Learn more at https://aka.ms/quotamonitoringalerting . Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%228ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22WestUS3%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22StandardNCADSA100v4Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:72,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22StandardNCADSA100v4Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at [https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests"](https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests%22)
        	            	  }
        	            	}
        	            	```

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves E2E reliability for GPU scenarios by (1) skipping known-capacity-related Azure errors when configured and (2) letting the E2E go test timeout be tuned per pipeline to avoid abrupt Azure DevOps task termination.

Changes:

  • Extend skipTestIfSKUNotAvailableErr to optionally skip tests on Azure quota-exceeded errors (OperationNotAllowed 409 with specific message content).
  • Add E2E_GO_TEST_TIMEOUT to .pipelines/scripts/e2e_run.sh (default 90m) and use it for the go test -timeout value.
  • Enable SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE and set E2E_GO_TEST_TIMEOUT=75m in the GPU E2E pipeline.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
e2e/vmss.go Adds skip logic for quota-exceeded capacity errors when the skip flag is enabled.
.pipelines/scripts/e2e_run.sh Makes go test -timeout configurable via E2E_GO_TEST_TIMEOUT (defaults to 90m).
.pipelines/e2e-gpu.yaml Turns on SKU-capacity skipping for GPU runs and sets a shorter test timeout (75m).

…meout

- Extend skipTestIfSKUNotAvailableErr to also skip tests on OperationNotAllowed
  quota exceeded errors (409 with 'exceeding approved' + 'quota' in message)
- Enable SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE in GPU E2E pipeline
- Make go test timeout configurable via E2E_GO_TEST_TIMEOUT env var (default 90m)
- Set GPU pipeline timeout to 60m so tests terminate cleanly with log output
  before the 90m ADO task timeout kills the process
@awesomenix awesomenix merged commit 718a790 into main Feb 9, 2026
20 of 28 checks passed
@awesomenix awesomenix deleted the nishp/fix.e2e branch February 9, 2026 23:50
TAGS_TO_RUN: "gpu=true"
TAGS_TO_SKIP: "os=windows"
SKIP_E2E_TESTS: false
SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE: true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably skip for GPU e2e only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is GPU E2E :)

}

func skipTestIfSKUNotAvailableErr(t testing.TB, err error) {
// sometimes the SKU is not available and we can't do anything. Skip the test in this case.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we print a warning in ADO on quota issues when we skip ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants