Retry 422 errors on ResourceClaim status patch by marcosmamorim · Pull Request #154 · rhpds/poolboy

marcosmamorim · 2026-05-09T16:01:38Z

Summary

- Fix silent 422 drop in update_status_from_handle that caused stale ResourceClaim status for multi-subject claims
- Add retry-with-refetch loop at both call sites (update_status and manage_resource in resourcehandle.py)
- Follow the existing retry pattern from resourcepoolscaling.py (bounded at 10 retries, refetch() without sleep)

Root Cause

In production (separated operator mode), each watch event creates a new ResourceClaim instance with its own lock. Concurrent events for the same ResourceClaim (RC) therefore build JSON Patches against independent state snapshots. The first patch succeeds; the second gets a 422 because the paths it targets changed in the meantime. The 422 was caught and silently discarded: no retry, no re-raise.

Evidence: 20 occurrences of "Failed to apply patch" in 42h on babylon-west, all on multi-subject RCs (summit labs).
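
The race described under Root Cause can be reproduced in miniature. The sketch below is not poolboy code: the `pending` list, `build_remove_patch`, and the tiny `apply_json_patch` helper (a small subset of JSON Patch, without `~0`/`~1` escaping) are hypothetical, and `PatchError` merely stands in for the 422 response. It shows two patches built from the same snapshot, where the second one targets a path that no longer exists after the first is applied.

```python
import copy


class PatchError(Exception):
    """Stand-in for the 422 returned when a JSON Patch cannot be applied."""


def apply_json_patch(doc, patch):
    """Apply 'remove' and 'replace' operations to a copy of doc."""
    doc = copy.deepcopy(doc)
    for op in patch:
        *parents, last = op["path"].lstrip("/").split("/")
        target = doc
        try:
            for key in parents:
                target = target[int(key)] if isinstance(target, list) else target[key]
            if isinstance(target, list):
                last = int(last)
            target[last]  # the path must exist for both remove and replace
        except (KeyError, IndexError, ValueError) as exc:
            raise PatchError(f"path not found: {op['path']}") from exc
        if op["op"] == "remove":
            del target[last]
        elif op["op"] == "replace":
            target[last] = op["value"]
        else:
            raise PatchError(f"unsupported op: {op['op']}")
    return doc


def build_remove_patch(snapshot, name):
    """Hypothetical patch builder: drop one subject from a pending list."""
    index = snapshot["status"]["pending"].index(name)
    return [{"op": "remove", "path": f"/status/pending/{index}"}]


# Server-side state and two independent snapshots taken before either patch lands.
server = {"status": {"pending": ["a", "b"]}}
patch_a = build_remove_patch(copy.deepcopy(server), "a")  # /status/pending/0
patch_b = build_remove_patch(copy.deepcopy(server), "b")  # /status/pending/1

server = apply_json_patch(server, patch_a)  # first patch succeeds
try:
    server = apply_json_patch(server, patch_b)  # index 1 no longer exists
    second_patch_failed = False
except PatchError:
    second_patch_failed = True

# Rebuilding the patch from fresh state (the "refetch") makes it apply cleanly.
patch_b = build_remove_patch(copy.deepcopy(server), "b")
server = apply_json_patch(server, patch_b)
```

Dropping the failure instead of rebuilding the patch would leave `"b"` in the list, which is the toy equivalent of the stale status described above.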

When multiple AnarchySubjects in a ResourceClaim change state concurrently, the JSON Patch to update RC status can fail with 422 because concurrent callers hold independent RC instances with stale state. Previously, 422 was silently dropped with only a warning log, leaving the RC status out of sync until the next periodic reconciliation.

- Remove 422 suppression in update_status_from_handle (resourceclaim.py)
- Add retry-with-refetch at both call sites in resourcehandle.py (update_status and manage_resource)
- Follow existing codebase pattern from resourcepoolscaling.py: while True, attempt > 10 bail-out, refetch() without sleep
- Add unit tests for retry behavior (success, transient 422, exhausted retries, 404, unexpected exceptions)

Fixes: 20 occurrences of "Failed to apply patch" observed in 42h on babylon-west production cluster across handler and watch pods.
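
For readers unfamiliar with the pattern named in the commit message, here is a minimal sketch of a "while True, attempt > 10 bail-out, refetch() without sleep" loop. It is an illustration under assumptions, not the code from this PR: `ApiException` is a local stand-in for the Kubernetes client exception, `FakeResourceClaim` is a test double, `update_status_with_retry` is a hypothetical name, and `refetch()` is assumed to reload the object in place. How the real change treats 404 is not shown in the description, so this sketch simply re-raises anything that is not a 422.

```python
import asyncio

MAX_RETRIES = 10  # bound quoted in the PR description


class ApiException(Exception):
    """Hypothetical stand-in for the Kubernetes client's API exception."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


class FakeResourceClaim:
    """Test double: fails with 422 a fixed number of times, then succeeds."""

    def __init__(self, conflicts):
        self.conflicts = conflicts
        self.patch_calls = 0
        self.refetch_calls = 0

    async def update_status_from_handle(self, resource_handle):
        self.patch_calls += 1
        if self.conflicts > 0:
            self.conflicts -= 1
            raise ApiException(422)

    async def refetch(self):
        # Assumed to reload the object from the API server in place.
        self.refetch_calls += 1


async def update_status_with_retry(resource_claim, resource_handle):
    """Retry the status patch on 422, refetching between attempts."""
    attempt = 0
    while True:
        try:
            await resource_claim.update_status_from_handle(resource_handle)
            return
        except ApiException as exc:
            if exc.status != 422:
                raise  # only patch conflicts are retried in this sketch
            attempt += 1
            if attempt > MAX_RETRIES:
                raise  # bail out instead of looping forever
            await resource_claim.refetch()  # no sleep: retry on fresh state


claim = FakeResourceClaim(conflicts=2)
asyncio.run(update_status_with_retry(claim, resource_handle=None))
```

With two simulated conflicts the fake is patched three times and refetched twice. With a conflict that never clears, the loop makes `MAX_RETRIES + 1` attempts and then lets the 422 propagate, so the failure is no longer silent.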

jkupferer

I like the idea here, but let's put the retry loop in the update_status_from_handle method itself rather than in both places where it is called. Perhaps rename the existing method to make it private (e.g. __update_status_from_handle), then add a new update_status_from_handle that implements the retry mechanism you added in the other locations?
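
The shape of this suggestion might look roughly like the sketch below. It is a toy model only: `ToyResourceClaim`, its `conflicts` counter, and the local `ApiException` are hypothetical stand-ins, and the method bodies are placeholders rather than poolboy's implementation. The point is the structure: a single-attempt private method, and a public method of the original name that owns the retry, so callers stay unchanged.

```python
import asyncio


class ApiException(Exception):
    """Hypothetical stand-in for the Kubernetes client's API exception."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


class ToyResourceClaim:
    """Toy model of a claim whose status patch can hit transient 422s."""

    def __init__(self, conflicts=0):
        self.conflicts = conflicts  # number of simulated 422 responses
        self.status = None

    async def refetch(self):
        """Placeholder: the real method would reload state from the API."""

    async def __update_status_from_handle(self, resource_handle):
        """Single attempt: build and apply the patch, raising on failure."""
        if self.conflicts > 0:
            self.conflicts -= 1
            raise ApiException(422)
        self.status = {"handle": resource_handle}

    async def update_status_from_handle(self, resource_handle):
        """Public entry point: retry the single attempt on 422."""
        attempt = 0
        while True:
            try:
                await self.__update_status_from_handle(resource_handle)
                return
            except ApiException as exc:
                attempt += 1
                if exc.status != 422 or attempt > 10:
                    raise
                await self.refetch()


claim = ToyResourceClaim(conflicts=1)
asyncio.run(claim.update_status_from_handle("handle-1"))
```

One detail of the double-underscore name: Python mangles it to `_ToyResourceClaim__update_status_from_handle`, so it is reachable as `self.__update_status_from_handle` only inside the class body. A single leading underscore is the lighter-weight convention if subclasses need to override the single-attempt method.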

marcosmamorim requested a review from jkupferer May 9, 2026 16:02

Remove unused imports from retry test

ec6575d

Remove unused asyncio and patch imports flagged by linter.

jkupferer requested changes May 18, 2026

Move retry logic into update_status_from_handle wrapper

18d63cf

jkupferer approved these changes May 18, 2026

jkupferer merged commit ee4a3db into main May 18, 2026
2 of 3 checks passed

jkupferer deleted the fix/retry-422-resourceclaim-status-patch branch May 18, 2026 22:02

jkupferer merged 3 commits into main from fix/retry-422-resourceclaim-status-patch
