chore: Backport v20260512#1307
Merged
Merged
Conversation
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
* fix: deterministic envelope Work names to prevent duplicates Envelope Work names were generated with a fresh uuid.NewUUID() on every Create, so two back-to-back reconciles of the same binding with a stale informer-cache List could each build a Work with different names and both succeed. The `len > 1` branch then returned UnexpectedBehaviorError without any recovery, leaving the binding permanently stuck in WorkSynchronized=False/SyncWorkFailed and the placement never converging on that cluster. buildNewWorkForEnvelopeCR now derives the name suffix from a truncated SHA-256 hash of (envelope type, namespace, name). Two concurrent Creates for the same (binding, envelope) produce the same name and collide at the API server; the existing AlreadyExists handling in upsertWork requeues into the len == 1 update branch. Auto-deletion on detection of multiple Works is deliberately avoided: the member agent may have already applied both Works, so deleting one would trigger resource fluctuation on the member side. When the `len > 1` state is still observed (possible only in environments that were stuck before this change), the controller now emits a Warning Event on the binding (reason=DuplicateEnvelopeWorks) and a clear log message instructing the operator to delete all but the oldest Work in the affected namespace. Fixes Azure#630 Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> * Addressing feedback Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> --------- Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
…LI improvements (Azure#575) Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
…re#670) HasBindingFailed previously returned true for any condition with Status=False, regardless of the Reason. Several in-progress reasons on the per-cluster binding conditions (Overridden, WorkSynchronized, Applied, Available) are expressed as Status=False rather than Status=Unknown — most notably WorkNotSynchronizedYetReason and NotAvailableYetReason. The over-broad classification caused upstream controllers (rollout, updaterun) to bail on bindings that were still making forward progress. Switch HasBindingFailed to consult an explicit allowlist of non-terminal reasons (nonTerminalBindingFailureReasons) drawn from the condition package: OverriddenPendingReason, WorkNotSynchronizedYetReason, ApplyPendingReason, NotAvailableYetReason. Anything outside that set is treated as terminal — the safer default, since wrongly classifying an unknown reason as "transient" would stall rollouts forever. Per code review feedback (the original draft used a `strings.Contains` on the literal "failed", which is fragile against typos and convention drift), the allowlist uses typed constants. New "Pending" / "NotYet" reasons added to pkg/utils/condition must also be added to the allowlist; this is documented inline. Tests: - TestHasBindingFailed_NonTerminalReasons covers each known transient reason returning false, each known terminal reason returning true, an unknown-reason fall-through to terminal, and a mixed-condition case where a transient reason on one condition does not mask a terminal failure on another. - Existing TestHasBindingFailed (which uses camelCase reason strings not in the allowlist) still passes — those legacy strings are correctly classified as terminal. - TestCheckClusterUpdateResult in the updaterun package was encoding the previous bug: it asserted that WorkNotSynchronizedYet should produce an error from checkClusterUpdateResult. Updated the case to match corrected semantics (in-progress, no error) and added a new case for the terminal SyncWorkFailedReason path. This aligns with the existing TODO at execution.go:626 that called for distinguishing recoverable from non-recoverable failures. Fixes Azure#648 Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> Co-authored-by: Claude Code <noreply@anthropic.com>
…e#667) * refactor: extract IsLatest/LookupLatest policy snapshot helpers Move the two duplicated "latest policy snapshot" lookups into pkg/utils/controller/policy_snapshot_resolver.go so both the scheduler and the policy-snapshot watcher share the implementation. Adds two helpers: - IsLatestPolicySnapshot(snapshot) (bool, error): inspects the IsLatestSnapshot label. Returns (false, ErrUnexpectedBehavior-wrapped) for missing or non-bool values. The watcher's Reconcile now uses this in place of inline label parsing. - LookupLatestPolicySnapshot(ctx, client.Reader, placement) (PolicySnapshotObj, error): wraps the existing FetchLatestPolicySnapshot and enforces the "exactly one latest snapshot" invariant. Returns a plain error for the (transient) zero case and ErrUnexpectedBehavior for the >1 case. The scheduler's lookupLatestPolicySnapshot now delegates to this and keeps only its own log + retry handling. The watcher's buildCustomPredicate continues to do its own ParseBool because it compares old-vs-new label transitions, which doesn't fit the helper's signature. Adds table-driven tests for both helpers covering: label-true/false, missing label, nil labels map, malformed value, exactly-one snapshot, zero snapshots, multiple snapshots, ignoring snapshots from other placements, and ignoring non-latest snapshots. Drops two now-unused imports (strconv, labels) from scheduler.go. Fixes #643 Co-Authored-By: Claude Code <noreply@anthropic.com> Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> * chore: address review feedback Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> * test: align integration substring with new error wording Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> * test: cover LookupLatestPolicySnapshot list-error branch Address codecov/patch shortfall on PR Azure#667. The List-failure path in controller.LookupLatestPolicySnapshot and the corresponding wrapper in scheduler.lookupLatestPolicySnapshot were not exercised by any existing unit test. Add an interceptor-based test in each package that injects a non-StatusError on List, asserts (nil, error) and that the error wraps ErrUnexpectedBehavior (NewAPIServerError promotes non-Status errors via isUnexpectedCacheError). Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> --------- Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> Co-authored-by: Claude Code <noreply@anthropic.com>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.35.2 to 4.35.3. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@95e58e9...e46ed2c) --- updated-dependencies: - dependency-name: github/codeql-action dependency-version: 4.35.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Simon Waight <simon.waight@gmail.com>
Update community meetings. Signed-off-by: Simon Waight <simon.waight@microsoft.com>
Contributor
|
LGTM ;) Thanks! |
* fix: specify which object failed in override error message Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> * fix: harden formatOverrideTarget fallback and align IsClusterMatched log Address review feedback on PR Azure#686: - formatOverrideTarget: fall back to "Unknown" when the unstructured target has an empty Kind, so a malformed snapshot doesn't render as `"" "name"` in user-facing error messages. - applyOverrideRules: drop the NewUnexpectedBehaviorError log wrapper on the IsClusterMatched failure path. The outer applyOverrides wrap already tags this as ErrUserError, so classifying the log line as "unexpected" contradicted the user-error classification. - applyOverrides: switch the outer wrap from the inline fmt.Errorf("%w: ...", controller.ErrUserError, ...) form to controller.NewUserError(fmt.Errorf(...)) to match the form used by every other call site in the codebase. Behaviorally identical (same Error() string, same errors.Is, same trim output at workgenerator/controller.go:188-194). - Add TestFormatOverrideTarget covering namespaced, cluster-scoped, and the new empty-Kind fallback (both scopes). - Trim the verbose doc and test comments introduced by this PR. Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com> --------- Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Contributor
|
Note: the UTs are failing because of the a hardcoded error string about |
… be placed Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of your changes
Fixes #
I have:
make reviewableto ensure this PR is ready for review.How has this code been tested
Special notes for your reviewer