Skip to content

chore: Backport v20260512#1307

Merged
michaelawyu merged 21 commits into
Azure:mainfrom
britaniar:backport-v20260512
May 14, 2026
Merged

chore: Backport v20260512#1307
michaelawyu merged 21 commits into
Azure:mainfrom
britaniar:backport-v20260512

Conversation

@britaniar
Copy link
Copy Markdown
Contributor

Description of your changes

Fixes #

I have:

  • [ x] Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Special notes for your reviewer

jwtty and others added 16 commits April 28, 2026 15:17
Signed-off-by: Akash Kumar <meakash7902@gmail.com>
* fix: deterministic envelope Work names to prevent duplicates

Envelope Work names were generated with a fresh uuid.NewUUID() on every
Create, so two back-to-back reconciles of the same binding with a stale
informer-cache List could each build a Work with different names and
both succeed. The `len > 1` branch then returned UnexpectedBehaviorError
without any recovery, leaving the binding permanently stuck in
WorkSynchronized=False/SyncWorkFailed and the placement never
converging on that cluster.

buildNewWorkForEnvelopeCR now derives the name suffix from a truncated
SHA-256 hash of (envelope type, namespace, name). Two concurrent Creates
for the same (binding, envelope) produce the same name and collide at
the API server; the existing AlreadyExists handling in upsertWork
requeues into the len == 1 update branch.

Auto-deletion on detection of multiple Works is deliberately avoided:
the member agent may have already applied both Works, so deleting one
would trigger resource fluctuation on the member side. When the `len > 1`
state is still observed (possible only in environments that were stuck
before this change), the controller now emits a Warning Event on the
binding (reason=DuplicateEnvelopeWorks) and a clear log message
instructing the operator to delete all but the oldest Work in the
affected namespace.

Fixes Azure#630

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>

* Addressing feedback

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>

---------

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
…LI improvements (Azure#575)

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
…re#670)

HasBindingFailed previously returned true for any condition with
Status=False, regardless of the Reason. Several in-progress reasons on
the per-cluster binding conditions (Overridden, WorkSynchronized,
Applied, Available) are expressed as Status=False rather than
Status=Unknown — most notably WorkNotSynchronizedYetReason and
NotAvailableYetReason. The over-broad classification caused upstream
controllers (rollout, updaterun) to bail on bindings that were still
making forward progress.

Switch HasBindingFailed to consult an explicit allowlist of
non-terminal reasons (nonTerminalBindingFailureReasons) drawn from the
condition package: OverriddenPendingReason, WorkNotSynchronizedYetReason,
ApplyPendingReason, NotAvailableYetReason. Anything outside that set is
treated as terminal — the safer default, since wrongly classifying an
unknown reason as "transient" would stall rollouts forever.

Per code review feedback (the original draft used a `strings.Contains`
on the literal "failed", which is fragile against typos and convention
drift), the allowlist uses typed constants. New "Pending" / "NotYet"
reasons added to pkg/utils/condition must also be added to the
allowlist; this is documented inline.

Tests:

- TestHasBindingFailed_NonTerminalReasons covers each known transient
  reason returning false, each known terminal reason returning true,
  an unknown-reason fall-through to terminal, and a mixed-condition
  case where a transient reason on one condition does not mask a
  terminal failure on another.
- Existing TestHasBindingFailed (which uses camelCase reason strings
  not in the allowlist) still passes — those legacy strings are
  correctly classified as terminal.
- TestCheckClusterUpdateResult in the updaterun package was encoding
  the previous bug: it asserted that WorkNotSynchronizedYet should
  produce an error from checkClusterUpdateResult. Updated the case to
  match corrected semantics (in-progress, no error) and added a new
  case for the terminal SyncWorkFailedReason path. This aligns with
  the existing TODO at execution.go:626 that called for distinguishing
  recoverable from non-recoverable failures.

Fixes Azure#648

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
…e#667)

* refactor: extract IsLatest/LookupLatest policy snapshot helpers

Move the two duplicated "latest policy snapshot" lookups into
pkg/utils/controller/policy_snapshot_resolver.go so both the scheduler
and the policy-snapshot watcher share the implementation.

Adds two helpers:

- IsLatestPolicySnapshot(snapshot) (bool, error): inspects the
  IsLatestSnapshot label. Returns (false, ErrUnexpectedBehavior-wrapped)
  for missing or non-bool values. The watcher's Reconcile now uses this
  in place of inline label parsing.

- LookupLatestPolicySnapshot(ctx, client.Reader, placement)
  (PolicySnapshotObj, error): wraps the existing FetchLatestPolicySnapshot
  and enforces the "exactly one latest snapshot" invariant. Returns a
  plain error for the (transient) zero case and ErrUnexpectedBehavior for
  the >1 case. The scheduler's lookupLatestPolicySnapshot now delegates
  to this and keeps only its own log + retry handling.

The watcher's buildCustomPredicate continues to do its own ParseBool
because it compares old-vs-new label transitions, which doesn't fit the
helper's signature.

Adds table-driven tests for both helpers covering: label-true/false,
missing label, nil labels map, malformed value, exactly-one snapshot,
zero snapshots, multiple snapshots, ignoring snapshots from other
placements, and ignoring non-latest snapshots. Drops two now-unused
imports (strconv, labels) from scheduler.go.

Fixes #643

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>

* chore: address review feedback

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>

* test: align integration substring with new error wording

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>

* test: cover LookupLatestPolicySnapshot list-error branch

Address codecov/patch shortfall on PR Azure#667. The List-failure path in
controller.LookupLatestPolicySnapshot and the corresponding wrapper
in scheduler.lookupLatestPolicySnapshot were not exercised by any
existing unit test. Add an interceptor-based test in each package
that injects a non-StatusError on List, asserts (nil, error) and
that the error wraps ErrUnexpectedBehavior (NewAPIServerError
promotes non-Status errors via isUnexpectedCacheError).

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>

---------

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
Co-authored-by: Claude Code <noreply@anthropic.com>
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.35.2 to 4.35.3.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@95e58e9...e46ed2c)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.35.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Simon Waight <simon.waight@gmail.com>
Update community meetings.

Signed-off-by: Simon Waight <simon.waight@microsoft.com>
ytimocin
ytimocin previously approved these changes May 12, 2026
Copy link
Copy Markdown
Contributor

@ytimocin ytimocin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My changes look good.

@michaelawyu
Copy link
Copy Markdown
Contributor

LGTM ;) Thanks!

ytimocin and others added 3 commits May 12, 2026 20:03
* fix: specify which object failed in override error message

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>

* fix: harden formatOverrideTarget fallback and align IsClusterMatched log

Address review feedback on PR Azure#686:

- formatOverrideTarget: fall back to "Unknown" when the unstructured
  target has an empty Kind, so a malformed snapshot doesn't render as
  `"" "name"` in user-facing error messages.
- applyOverrideRules: drop the NewUnexpectedBehaviorError log wrapper on
  the IsClusterMatched failure path. The outer applyOverrides wrap
  already tags this as ErrUserError, so classifying the log line as
  "unexpected" contradicted the user-error classification.
- applyOverrides: switch the outer wrap from the inline
  fmt.Errorf("%w: ...", controller.ErrUserError, ...) form to
  controller.NewUserError(fmt.Errorf(...)) to match the form used by
  every other call site in the codebase. Behaviorally identical
  (same Error() string, same errors.Is, same trim output at
  workgenerator/controller.go:188-194).
- Add TestFormatOverrideTarget covering namespaced, cluster-scoped, and
  the new empty-Kind fallback (both scopes).
- Trim the verbose doc and test comments introduced by this PR.

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>

---------

Signed-off-by: Yetkin Timocin <ytimocin@microsoft.com>
@michaelawyu
Copy link
Copy Markdown
Contributor

Note: the UTs are failing because of the a hardcoded error string about github.com/kubefleet-dev/kubefleet in the new error handling utility UTs. I will prepare a PR to address this in the upstream repo; meanwhile this should not block the current release.

britaniar added 2 commits May 13, 2026 13:41
… be placed

Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Signed-off-by: Britania Rodriguez Reyes <britaniar@microsoft.com>
Copy link
Copy Markdown
Contributor

@michaelawyu michaelawyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@michaelawyu michaelawyu merged commit 578625a into Azure:main May 14, 2026
19 of 20 checks passed
@britaniar britaniar deleted the backport-v20260512 branch May 14, 2026 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants