Retry 422 errors on ResourceClaim status patch #154
Merged
When multiple AnarchySubjects in a ResourceClaim change state concurrently, the JSON Patch that updates the RC status can fail with 422 because concurrent callers hold independent RC instances with stale state. Previously, the 422 was silently dropped with only a warning log, leaving the RC status out of sync until the next periodic reconciliation.

- Remove 422 suppression in update_status_from_handle (resourceclaim.py)
- Add retry-with-refetch at both call sites in resourcehandle.py (update_status and manage_resource)
- Follow the existing codebase pattern from resourcepoolscaling.py: while True, attempt > 10 bail-out, refetch() without sleep
- Add unit tests for retry behavior (success, transient 422, exhausted retries, 404, unexpected exceptions)

Fixes: 20 occurrences of "Failed to apply patch" observed in 42h on the babylon-west production cluster across handler and watch pods.
Remove unused asyncio and patch imports flagged by linter.
jkupferer
requested changes
May 18, 2026
jkupferer
left a comment
Collaborator
I like the idea here, but let's put the retry loop in the update_status_from_handle method rather than in both places where it is called. Perhaps rename the existing method to make it private (e.g. __update_status_from_handle) and then add a new update_status_from_handle that implements the retry mechanism you added in the other locations?
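A minimal sketch of the structure suggested in this review, not the merged code. The real method bodies in resourceclaim.py are not shown in this conversation, so `ApiError`, the `failures_before_success` knob, and the stubbed `refetch()` are hypothetical stand-ins; only the two method names and the "attempt > 10, refetch without sleep" shape come from the PR description.

```python
import asyncio


class ApiError(Exception):
    """Hypothetical stand-in for the Kubernetes client's API exception."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


class ResourceClaim:
    MAX_RETRIES = 10  # mirrors the "attempt > 10" bail-out in the PR description

    def __init__(self, failures_before_success=0):
        # Hypothetical test knob: number of patch attempts that fail with 422.
        self.failures_remaining = failures_before_success
        self.refetch_count = 0

    async def refetch(self):
        # Stub: the real operator would reload the object from the API server.
        self.refetch_count += 1

    async def __update_status_from_handle(self, resource_handle):
        # Single patch attempt. A 422 is raised to the caller, not swallowed.
        if self.failures_remaining > 0:
            self.failures_remaining -= 1
            raise ApiError(422)

    async def update_status_from_handle(self, resource_handle):
        # Public entry point: retry on 422 after refreshing local state.
        attempt = 0
        while True:
            attempt += 1
            try:
                await self.__update_status_from_handle(resource_handle)
                return
            except ApiError as exc:
                # Anything other than 422, or too many attempts, propagates.
                if exc.status != 422 or attempt > self.MAX_RETRIES:
                    raise
                # Local state was stale: reload it and try again, no sleep.
                await self.refetch()
```

With this shape, callers such as update_status and manage_resource would only need a plain `await resource_claim.update_status_from_handle(handle)`, and the retry policy lives in one place.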
jkupferer
approved these changes
May 18, 2026
Summary
- Remove the 422 suppression in `update_status_from_handle` that caused stale ResourceClaim status for multi-subject claims
- Add retry-with-refetch at both call sites (`update_status` and `manage_resource` in `resourcehandle.py`)
- Follow the existing pattern from `resourcepoolscaling.py` (bounded at 10 retries, `refetch()` without sleep)

Root Cause
In production (separated operator mode), each watch event creates a new `ResourceClaim` instance with its own lock. Concurrent events for the same RC build JSON Patches against independent state snapshots. The first patch succeeds; the second gets 422 because paths changed. The 422 was caught and silently discarded: no retry, no re-raise.

Evidence: 20 occurrences of "Failed to apply patch" in 42h on babylon-west, all on multi-subject RCs (summit labs).
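The root cause can be illustrated with a toy model. This is a sketch under assumptions, not the operator's code: `apply_json_patch` implements only a small subset of JSON Patch (RFC 6902) on object members, `UnprocessableEntity` stands in for the API server's 422 response, and `build_patch` and the `summary` field are hypothetical. The specific operations the operator emits are not shown in this PR.

```python
import copy


class UnprocessableEntity(Exception):
    """Hypothetical stand-in for an HTTP 422 response from the API server."""


def apply_json_patch(document, operations):
    """Apply "add", "replace", and "remove" operations to object members."""
    result = copy.deepcopy(document)
    for op in operations:
        if op["op"] not in ("add", "replace", "remove"):
            raise ValueError(f"unsupported op {op['op']}")
        *parents, key = op["path"].lstrip("/").split("/")
        target = result
        for part in parents:
            if not isinstance(target, dict) or part not in target:
                raise UnprocessableEntity(op["path"])
            target = target[part]
        if not isinstance(target, dict):
            raise UnprocessableEntity(op["path"])
        if op["op"] == "add":
            target[key] = op["value"]
        elif key not in target:
            # RFC 6902: the target of "replace" and "remove" must exist.
            raise UnprocessableEntity(op["path"])
        elif op["op"] == "replace":
            target[key] = op["value"]
        else:
            del target[key]
    return result


def build_patch(snapshot, key, value):
    """Hypothetical builder: "replace" if the snapshot has the key, else "add"."""
    op = "replace" if key in snapshot.get("status", {}) else "add"
    return [{"op": op, "path": f"/status/{key}", "value": value}]


def demo():
    server = {"status": {"summary": "provisioning"}}
    stale_snapshot = copy.deepcopy(server)  # second caller's independent copy

    # First caller's patch succeeds and changes which paths exist.
    server = apply_json_patch(server, [{"op": "remove", "path": "/status/summary"}])

    # Second caller's patch was built against the stale snapshot and fails.
    try:
        apply_json_patch(server, build_patch(stale_snapshot, "summary", "started"))
        stale_failed = False
    except UnprocessableEntity:
        stale_failed = True

    # After a refetch, the rebuilt patch matches the server state and applies.
    server = apply_json_patch(server, build_patch(server, "summary", "started"))
    return stale_failed, server
```

In this model the second patch is rejected only because it was computed from an outdated snapshot, which is why refetching before rebuilding the patch lets a retry succeed where simply resending the same patch would not.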