-
Notifications
You must be signed in to change notification settings - Fork 247
fix: skip gpu e2e tests on sku quota exceeded and add configurable test timeout #7836
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| E2E_SUBSCRIPTION_ID="${E2E_SUBSCRIPTION_ID:-}" | ||
| TAGS_TO_SKIP="${TAGS_TO_SKIP:-}" | ||
| TAGS_TO_RUN="${TAGS_TO_RUN:-}" | ||
| E2E_GO_TEST_TIMEOUT="${E2E_GO_TEST_TIMEOUT:-90m}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's actually quite a lot. Maybe we should lower it to 60m?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need different timeout for GPU tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am just keeping the default for now :) we should just have it for 20m IMO but considering we can anytime create a new cluster we must have kept a high timeout
| // sometimes the SKU quota is exceeded and we can't do anything. Skip the test in this case. | ||
| if respErr.ErrorCode == "OperationNotAllowed" && | ||
| strings.Contains(respErr.Error(), "exceeding approved") && | ||
| strings.Contains(respErr.Error(), "quota") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This strings.Contains a bit magic. Is there a sub code we can use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RESPONSE 409: 409 Conflict
ERROR CODE: OperationNotAllowed
--------------------------------------------------------------------------------
{
"error": {
"code": "OperationNotAllowed",
"message": "Operation could not be completed as it results in exceeding approved StandardNCADSA100v4Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: WestUS3, Current Limit: 50, Current Usage: 48, Additional Required: 24, (Minimum) New Limit Required: 72. Setup Alerts when Quota reaches threshold. Learn more at https://aka.ms/quotamonitoringalerting . Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/%7B%22subscriptionId%22:%228ecadfc9-d1a3-4ea4-b844-0d9f87e4d7c8%22,%22command%22:%22openQuotaApprovalBlade%22,%22quotas%22:[%7B%22location%22:%22WestUS3%22,%22providerId%22:%22Microsoft.Compute%22,%22resourceName%22:%22StandardNCADSA100v4Family%22,%22quotaRequest%22:%7B%22properties%22:%7B%22limit%22:72,%22unit%22:%22Count%22,%22name%22:%7B%22value%22:%22StandardNCADSA100v4Family%22%7D%7D%7D%7D]%7D by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at [https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests"](https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests%22)
}
}
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR improves E2E reliability for GPU scenarios by (1) skipping known-capacity-related Azure errors when configured and (2) letting the E2E go test timeout be tuned per pipeline to avoid abrupt Azure DevOps task termination.
Changes:
- Extend
skipTestIfSKUNotAvailableErrto optionally skip tests on Azure quota-exceeded errors (OperationNotAllowed409 with specific message content). - Add
E2E_GO_TEST_TIMEOUTto.pipelines/scripts/e2e_run.sh(default90m) and use it for thego test -timeoutvalue. - Enable
SKIP_TESTS_WITH_SKU_CAPACITY_ISSUEand setE2E_GO_TEST_TIMEOUT=75min the GPU E2E pipeline.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
e2e/vmss.go |
Adds skip logic for quota-exceeded capacity errors when the skip flag is enabled. |
.pipelines/scripts/e2e_run.sh |
Makes go test -timeout configurable via E2E_GO_TEST_TIMEOUT (defaults to 90m). |
.pipelines/e2e-gpu.yaml |
Turns on SKU-capacity skipping for GPU runs and sets a shorter test timeout (75m). |
…meout - Extend skipTestIfSKUNotAvailableErr to also skip tests on OperationNotAllowed quota exceeded errors (409 with 'exceeding approved' + 'quota' in message) - Enable SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE in GPU E2E pipeline - Make go test timeout configurable via E2E_GO_TEST_TIMEOUT env var (default 90m) - Set GPU pipeline timeout to 60m so tests terminate cleanly with log output before the 90m ADO task timeout kills the process
01a8fc9 to
040c6df
Compare
| TAGS_TO_RUN: "gpu=true" | ||
| TAGS_TO_SKIP: "os=windows" | ||
| SKIP_E2E_TESTS: false | ||
| SKIP_TESTS_WITH_SKU_CAPACITY_ISSUE: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably skip for GPU e2e only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is GPU E2E :)
| } | ||
|
|
||
| func skipTestIfSKUNotAvailableErr(t testing.TB, err error) { | ||
| // sometimes the SKU is not available and we can't do anything. Skip the test in this case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we print a warning in ADO on quota issues when we skip ?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #