Skip to content

fix: avoid panic when LoadAssignment has fewer endpoints than active BackendRefs#2159

Open
mtparet wants to merge 3 commits into
envoyproxy:mainfrom
blackfuel-ai:mp/maybe-modify-cluster-bounds-guard
Open

fix: avoid panic when LoadAssignment has fewer endpoints than active BackendRefs#2159
mtparet wants to merge 3 commits into
envoyproxy:mainfrom
blackfuel-ai:mp/maybe-modify-cluster-bounds-guard

Conversation

@mtparet
Copy link
Copy Markdown
Contributor

@mtparet mtparet commented May 25, 2026

Description

maybeModifyCluster in internal/extensionserver/post_translate_modify.go indexed cluster.LoadAssignment.Endpoints[lbEndpointIndex] inside the BackendRefs loop without bounds-checking. When envoy-gateway emits a LoadAssignment with fewer endpoint sets than the rule's active (non-zero-weight) backendRefs - a transient state during translation churn - the controller panicked with index out of range and crashed, eventually triggering CrashLoopBackOff on the affected replica.

This change adds a length guard that breaks out of the loop and logs a structured warning (cluster name, route namespace/name, endpoints length, lb endpoint index, backend ref index) so the underlying translation mismatch remains observable. A unit test constructs a 2-backendRef rule against a 1-entry LoadAssignment and asserts maybeModifyCluster returns without panicking; the test fails on main without the fix.

Related Issues/PRs (if applicable)

None upstream.

Special notes for reviewers (if applicable)

The guard uses break (rather than continue) because lbEndpointIndex is monotonically increasing across iterations; once it exceeds the available endpoint sets, no later backendRef in the same rule can be served either.

This PR was drafted with the help of an AI assistant; the change has been reviewed and is owned by the submitter.

…BackendRefs

maybeModifyCluster indexed cluster.LoadAssignment.Endpoints[lbEndpointIndex]
inside the BackendRefs loop without bounds-checking. When envoy-gateway emits
a LoadAssignment with fewer endpoint sets than the rule's active
(non-zero-weight) backendRefs - a transient state during translation churn -
the controller panicked with "index out of range" and crashed.

Add a length guard that breaks out of the loop and logs a structured warning
so the underlying translation mismatch remains observable. Cover the case
with a unit test that constructs a 2-backendRef rule against a 1-entry
LoadAssignment and asserts no panic.

Signed-off-by: Matthieu Paret <mp@blackfuel.ai>
@mtparet mtparet requested a review from a team as a code owner May 25, 2026 08:54
@dosubot dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label May 25, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.34%. Comparing base (3509761) to head (d9cd6ec).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2159      +/-   ##
==========================================
+ Coverage   84.33%   84.34%   +0.01%     
==========================================
  Files         134      134              
  Lines       19352    19361       +9     
==========================================
+ Hits        16320    16331      +11     
+ Misses       2033     2032       -1     
+ Partials      999      998       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mtparet
Copy link
Copy Markdown
Contributor Author

mtparet commented May 26, 2026

Same test passed locally, could be a flaky test error ?

@nacx
Copy link
Copy Markdown
Member

nacx commented May 26, 2026

/retest

Comment on lines +229 to +230
// than the rule has active (non-zero-weight) BackendRefs. This is a
// transient translation mismatch; skip the remaining backendRefs rather
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate a bit more on what the "transient translation mismatch" is, and under what conditions it can appear?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of our custom controller was patching BackendSecurityPolicy / Backend every ~30 s forcing EG to re-translate continuously. One intermediate LoadAssignment had 1 endpoint set for a 2-backendRef rule → index out of range [1] with length 1 and causing a panic. I suppose it can happens in any system quickly automatically modifying gateway resources and causing this race condition.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this and elaborate on the comment in the code itself, so that users and maintainers who get to this code in the future fully understand the conditions under which this can happen?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

friendly ping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants