Retry 422 errors on ResourceClaim status patch #154

Merged: jkupferer merged 3 commits into main from fix/retry-422-resourceclaim-status-patch on May 18, 2026

@marcosmamorim

Contributor

Summary

  • Fix silent 422 drop in update_status_from_handle that caused stale ResourceClaim status for multi-subject claims
  • Add retry-with-refetch loop at both call sites (update_status and manage_resource in resourcehandle.py)
  • Follows existing retry pattern from resourcepoolscaling.py (bounded at 10 retries, refetch() without sleep)
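
The retry-with-refetch loop described in the bullets above might look roughly like the following. This is a minimal sketch, not the code from this PR: `ApiError`, `patch_with_refetch`, `FakeClaim`, and `flaky_patch` are hypothetical names, and the only details taken from the description are the bound of 10 retries and the `refetch()` call with no sleep between attempts. Re-raising every non-422 error is an assumption of the sketch.

```python
import asyncio


class ApiError(Exception):
    """Hypothetical stand-in for the Kubernetes client's API exception."""

    def __init__(self, status: int):
        super().__init__(f"API error {status}")
        self.status = status


MAX_RETRIES = 10  # bound quoted in the summary above


async def patch_with_refetch(claim, build_and_apply_patch):
    """Retry a status patch on 422, refetching the claim between attempts.

    `claim.refetch()` and `build_and_apply_patch(claim)` are hypothetical
    coroutines standing in for the operator's real methods.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return await build_and_apply_patch(claim)
        except ApiError as err:
            # Assumption: only 422 is retried; other errors propagate.
            if err.status != 422 or attempt > MAX_RETRIES:
                raise
            # Reload current state so the next patch is built on fresh paths.
            await claim.refetch()


class FakeClaim:
    """Toy claim whose patch fails with 422 a fixed number of times."""

    def __init__(self, failures: int):
        self.failures = failures
        self.refetches = 0

    async def refetch(self):
        self.refetches += 1


async def flaky_patch(claim):
    if claim.failures > 0:
        claim.failures -= 1
        raise ApiError(422)
    return "patched"


claim = FakeClaim(failures=2)
result = asyncio.run(patch_with_refetch(claim, flaky_patch))
print(result, claim.refetches)
```

Skipping the sleep keeps the loop simple. The refetch itself is a round trip to the API server, which spaces the attempts out somewhat.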

Root Cause

In production (separated operator mode), each watch event creates a new ResourceClaim instance with its own lock, so the locks do not serialize concurrent events for the same ResourceClaim (RC). Each event builds its JSON Patch against an independent state snapshot. The first patch succeeds; the second gets a 422 because the paths it targets have changed. The 422 was caught and dropped with only a warning log: no retry, no re-raise.

Evidence: 20 occurrences of "Failed to apply patch" in 42h on babylon-west, all on multi-subject RCs (summit labs).
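
To illustrate the failure mode, here is a small self-contained simulation of two JSON Patches built from the same snapshot. It is only a sketch: the status fields (`pending`, `resources`, `state`) are invented for the example, `PatchConflict` stands in for the 422 response, and the tiny patch applier covers just the `replace` and `remove` operations from RFC 6902, both of which require the target path to exist.

```python
import copy


class PatchConflict(Exception):
    """Stand-in for the 422 returned when a patch cannot be applied."""


def resolve(doc, path):
    """Return (container, key) for a JSON Pointer such as /status/pending."""
    *parents, last = path.lstrip("/").split("/")
    node = doc
    for token in parents:
        node = node[int(token)] if isinstance(node, list) else node[token]
    key = int(last) if isinstance(node, list) else last
    return node, key


def apply_patch(doc, ops):
    """Apply `replace`/`remove` ops to a copy of doc, or raise PatchConflict."""
    patched = copy.deepcopy(doc)
    try:
        for op in ops:
            container, key = resolve(patched, op["path"])
            container[key]  # both ops fail if the target is missing
            if op["op"] == "remove":
                del container[key]
            elif op["op"] == "replace":
                container[key] = op["value"]
            else:
                raise ValueError(f"unsupported op: {op['op']}")
    except (KeyError, IndexError) as err:
        raise PatchConflict(f"cannot apply {op}") from err
    return patched


# Hypothetical RC status, as seen by two concurrent event handlers.
snapshot = {
    "status": {
        "pending": True,
        "resources": [{"state": "starting"}, {"state": "starting"}],
    }
}

# Both patches are computed from the same snapshot.
patch_a = [
    {"op": "replace", "path": "/status/resources/0/state", "value": "started"},
    {"op": "remove", "path": "/status/pending"},
]
patch_b = [
    {"op": "replace", "path": "/status/resources/1/state", "value": "started"},
    {"op": "remove", "path": "/status/pending"},
]

server = apply_patch(snapshot, patch_a)  # first writer succeeds
try:
    apply_patch(server, patch_b)  # /status/pending is already gone
    conflict = False
except PatchConflict:
    conflict = True
print("second patch conflicted:", conflict)
```

Rebuilding `patch_b` against `server` instead of `snapshot` would drop the stale `remove` and apply cleanly. That rebuild is what a refetch before retrying makes possible.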

When multiple AnarchySubjects in a ResourceClaim change state
concurrently, the JSON Patch to update RC status can fail with 422
because concurrent callers hold independent RC instances with stale
state. Previously, 422 was silently dropped with only a warning log,
leaving the RC status out of sync until the next periodic reconciliation.

- Remove 422 suppression in update_status_from_handle (resourceclaim.py)
- Add retry-with-refetch at both call sites in resourcehandle.py
  (update_status and manage_resource)
- Follow existing codebase pattern from resourcepoolscaling.py:
  while True, attempt > 10 bail-out, refetch() without sleep
- Add unit tests for retry behavior (success, transient 422, exhausted
  retries, 404, unexpected exceptions)

Fixes: 20 occurrences of "Failed to apply patch" observed in 42h on
babylon-west production cluster across handler and watch pods.
@marcosmamorim requested a review from jkupferer on May 9, 2026 16:02

Remove unused asyncio and patch imports flagged by linter.

@jkupferer left a comment

Collaborator

I like the idea here, but let's put the retry loop in the update_status_from_handle method itself rather than in both places where it is called. Perhaps rename the existing method to make it private (e.g. __update_status_from_handle), then add a new update_status_from_handle that implements the retry mechanism you added in the other locations?
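
One way to read this suggestion, as a sketch only: the single-attempt logic moves into a private method, and the public method of the same original name wraps it in the retry loop, so callers need no changes. The class body below is a toy stand-in, not the real `ResourceClaim`. `ApiError`, the `transient_failures` parameter, and the returned string are hypothetical, and nothing here shows what was actually merged.

```python
import asyncio


class ApiError(Exception):
    """Hypothetical stand-in for the Kubernetes client's API exception."""

    def __init__(self, status: int):
        super().__init__(f"API error {status}")
        self.status = status


class ResourceClaim:
    """Toy stand-in for the operator's class, for illustration only."""

    def __init__(self, transient_failures: int = 0):
        self._failures = transient_failures
        self.refetch_count = 0

    async def refetch(self):
        # The real method would reload the object from the API server.
        self.refetch_count += 1

    async def __update_status_from_handle(self, handle):
        # Single attempt. The double underscore triggers name mangling, so
        # outside the class this is _ResourceClaim__update_status_from_handle.
        if self._failures > 0:
            self._failures -= 1
            raise ApiError(422)
        return f"status synced from {handle}"

    async def update_status_from_handle(self, handle):
        # Public entry point: callers keep using this name and get retries.
        attempt = 0
        while True:
            attempt += 1
            try:
                return await self.__update_status_from_handle(handle)
            except ApiError as err:
                if err.status != 422 or attempt > 10:
                    raise
                await self.refetch()


claim = ResourceClaim(transient_failures=1)
result = asyncio.run(claim.update_status_from_handle("handle-1"))
print(result, claim.refetch_count)
```

With this shape, both call sites in resourcehandle.py would simply await `update_status_from_handle` and carry no retry logic of their own.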

@jkupferer merged commit ee4a3db into main on May 18, 2026
2 of 3 checks passed
@jkupferer deleted the fix/retry-422-resourceclaim-status-patch branch on May 18, 2026 22:02