fix: avoid panic when LoadAssignment has fewer endpoints than active BackendRefs#2159
fix: avoid panic when LoadAssignment has fewer endpoints than active BackendRefs#2159mtparet wants to merge 3 commits into
Conversation
…BackendRefs maybeModifyCluster indexed cluster.LoadAssignment.Endpoints[lbEndpointIndex] inside the BackendRefs loop without bounds-checking. When envoy-gateway emits a LoadAssignment with fewer endpoint sets than the rule's active (non-zero-weight) backendRefs - a transient state during translation churn - the controller panicked with "index out of range" and crashed. Add a length guard that breaks out of the loop and logs a structured warning so the underlying translation mismatch remains observable. Cover the case with a unit test that constructs a 2-backendRef rule against a 1-entry LoadAssignment and asserts no panic. Signed-off-by: Matthieu Paret <mp@blackfuel.ai>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2159 +/- ##
==========================================
+ Coverage 84.33% 84.34% +0.01%
==========================================
Files 134 134
Lines 19352 19361 +9
==========================================
+ Hits 16320 16331 +11
+ Misses 2033 2032 -1
+ Partials 999 998 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Same test passed locally, could be a flaky test error ? |
|
/retest |
| // than the rule has active (non-zero-weight) BackendRefs. This is a | ||
| // transient translation mismatch; skip the remaining backendRefs rather |
There was a problem hiding this comment.
Could you elaborate a bit more on what the "transient translation mismatch" is, and under what conditions it can appear?
There was a problem hiding this comment.
One of our custom controller was patching BackendSecurityPolicy / Backend every ~30 s forcing EG to re-translate continuously. One intermediate LoadAssignment had 1 endpoint set for a 2-backendRef rule → index out of range [1] with length 1 and causing a panic. I suppose it can happens in any system quickly automatically modifying gateway resources and causing this race condition.
There was a problem hiding this comment.
Can you explain this and elaborate on the comment in the code itself, so that users and maintainers who get to this code in the future fully understand the conditions under which this can happen?
Description
maybeModifyClusterininternal/extensionserver/post_translate_modify.goindexedcluster.LoadAssignment.Endpoints[lbEndpointIndex]inside theBackendRefsloop without bounds-checking. When envoy-gateway emits aLoadAssignmentwith fewer endpoint sets than the rule's active (non-zero-weight) backendRefs - a transient state during translation churn - the controller panicked withindex out of rangeand crashed, eventually triggering CrashLoopBackOff on the affected replica.This change adds a length guard that breaks out of the loop and logs a structured warning (cluster name, route namespace/name, endpoints length, lb endpoint index, backend ref index) so the underlying translation mismatch remains observable. A unit test constructs a 2-backendRef rule against a 1-entry
LoadAssignmentand assertsmaybeModifyClusterreturns without panicking; the test fails onmainwithout the fix.Related Issues/PRs (if applicable)
None upstream.
Special notes for reviewers (if applicable)
The guard uses
break(rather thancontinue) becauselbEndpointIndexis monotonically increasing across iterations; once it exceeds the available endpoint sets, no later backendRef in the same rule can be served either.This PR was drafted with the help of an AI assistant; the change has been reviewed and is owned by the submitter.