Skip to content

extproc: skip CONTINUE_AND_REPLACE when no body change is needed#2170

Open
ChrisJBurns wants to merge 2 commits into
envoyproxy:mainfrom
ChrisJBurns:upstream-fix/issue-extproc-body-replay
Open

extproc: skip CONTINUE_AND_REPLACE when no body change is needed#2170
ChrisJBurns wants to merge 2 commits into
envoyproxy:mainfrom
ChrisJBurns:upstream-fix/issue-extproc-body-replay

Conversation

@ChrisJBurns
Copy link
Copy Markdown

@ChrisJBurns ChrisJBurns commented May 28, 2026

Description

Scope: this PR fixes the narrow case where the upstream cluster ext_proc
filter issued CONTINUE_AND_REPLACE even though nothing in the request
pipeline needed to mutate the body
— no translation, no model override, no
retry, no backend HTTPBodyMutation. In that path, the upstream filter
replayed u.parent.originalRequestBodyRaw (snapshotted at the router phase),
which clobbered any body mutation applied by an earlier ext_proc filter in the
listener chain. After this PR, that path emits CONTINUE and leaves the body
alone.

The broader question of "preserve an intermediate filter's body mutation
across translation / retries / HTTPBodyMutation" is not addressed by
this PR. When wantBodyReplace == true, the replacement body is still derived
from the router-phase snapshot, so any intermediate filter's mutation is lost
on those branches. That's an architectural property of the snapshot-at-router-
phase design — the upstream filter has no view of Envoy's current body, by
design, to avoid re-buffering through ext_proc at the upstream phase. Fixing
it requires a separate design (either re-buffering at upstream, or
propagating a diff via dynamic metadata) and should be its own PR.

Root cause of the narrow bug. The existing safety net in
applyBodyMutation looks like it covers the no-config case:

if bodyMutator == nil {
    return bodyMutation
}

but it does not fire in practice. bodymutator.NewBodyMutator(nil, ...)
returns a non-nil struct even when its *filterapi.HTTPBodyMutation argument
is nil. So the early-return is dead code; the function falls through, calls
Mutate(originalRequestBodyRaw) on a no-config mutator (which returns its
input unchanged), wraps that in a BodyMutation, and the upstream filter
sends CONTINUE_AND_REPLACE with the original body — even though nothing
asked for a body replacement.

Fix:

  • Add BodyMutator.HasMutations() predicate. NewBodyMutator's contract is
    unchanged (it still returns non-nil); callers can ask whether the mutator
    actually has anything to do.

  • In upstreamProcessor.ProcessRequestHeaders, compute

    mutatorHasMutations := u.bodyMutator != nil && u.bodyMutator.HasMutations()
    wantBodyReplace := bodyMutation != nil || forceBodyMutation || mutatorHasMutations
    

    before calling applyBodyMutation. When false, skip the body-mutation work
    and emit CONTINUE instead of CONTINUE_AND_REPLACE, preserving translator
    header mutations (e.g. path rewrite), header-mutator output, and auth-handler
    header additions. Skip the content-length dynamic-metadata restamp on the
    CONTINUE branch since the body length did not change.

All existing CONTINUE_AND_REPLACE callers stay on the existing path (these
are the branches where this PR explicitly does not change behavior, including
the loss-of-earlier-mutation property documented above):

  • Translator emits a body (AWS Bedrock/SigV4, OpenAI to Anthropic, OpenAI to
    Vertex, etc.) — bodyMutation != nil.
  • Backend HTTPBodyMutation configured — HasMutations() == true.
  • Retry — forceBodyMutation == true via u.onRetry().
  • Streaming with IncludeUsage=falseforceBodyMutation == true.
  • Model override — translator emits newBody via sjson.SetBytesOptions, so
    bodyMutation != nil.

Tests:

  • Added Test_chatCompletionProcessorUpstreamFilter_ProcessRequestHeaders_BodyReplaceContract
    with two subtests:
    • no translator body, no mutator, no force -> CONTINUE exercises the
      previously untested path. Asserts Status == CONTINUE, no BodyMutation,
      preserved translator and auth-handler header mutations, and
      DynamicMetadata == nil (no content-length restamp).
    • translator body present -> CONTINUE_AND_REPLACE pins the existing
      behavior so a future refactor cannot quietly flip the contract back.

The coverage gap that let this ship: every existing t.Run("ok", ...) subtest
under Test_chatCompletionProcessorUpstreamFilter_ProcessRequestHeaders set
retBodyMutation on the mock translator, so the "translator returns nil body"
path was never exercised.

Related Issues/PRs (if applicable)

None.

Special notes for reviewers (if applicable)

  • The fix is request-side only. Response-side code is untouched.
  • NewBodyMutator's contract is intentionally not changed (it still returns
    non-nil even when given a nil *filterapi.HTTPBodyMutation). A predicate is
    added instead, to keep the blast radius local to ProcessRequestHeaders. If
    reviewers prefer changing the constructor to return nil in the no-config
    case, that is a larger refactor and a separate PR.
  • The bug is in the generic upstreamProcessor[...], so the fix applies to
    every endpoint spec (chat completions, messages, embeddings, completions,
    GCP, AWS Bedrock, etc.). The new tests exercise one endpoint spec for the
    same reason the existing ProcessRequestHeaders table-test does — locking
    the generic method's contract.

The upstream cluster ext_proc filter's `upstreamProcessor.ProcessRequestHeaders`
unconditionally returns `CONTINUE_AND_REPLACE` with a `BodyMutation`, even when
nothing in the request pipeline actually needs to change the body. The
replacement body it sends is the original request bytes captured by the
downstream router phase.

That clobbers the request body when an earlier ext_proc filter in the listener
chain has already returned `CONTINUE_AND_REPLACE` with a mutated body: Envoy
applies that filter's mutation, then the upstream filter replaces the body
again with the original captured bytes — silently undoing the earlier
mutation.

The existing safety net `if bodyMutator == nil` in `applyBodyMutation` does
not fire in practice: `bodymutator.NewBodyMutator(nil, ...)` returns a
non-nil struct even when its `*filterapi.HTTPBodyMutation` argument is nil,
so the early-return is dead code. The function falls through, calls
`Mutate(originalRequestBodyRaw)` on a no-config mutator (which returns its
input unchanged), wraps that in a `BodyMutation`, and the upstream filter
sends `CONTINUE_AND_REPLACE` with the original body.

Fix:

* Add `BodyMutator.HasMutations()` predicate. `NewBodyMutator`'s contract is
  unchanged (still returns non-nil); callers can ask whether the mutator
  actually has anything to do.
* In `upstreamProcessor.ProcessRequestHeaders`, compute
  `wantBodyReplace := bodyMutation != nil || forceBodyMutation
                       || (u.bodyMutator != nil && u.bodyMutator.HasMutations())`
  before calling `applyBodyMutation`. When false, skip the body-mutation work
  and emit `CONTINUE` instead of `CONTINUE_AND_REPLACE`, preserving translator
  header mutations, header-mutator output, and auth-handler header additions.
  Skip the content-length dynamic-metadata restamp on the `CONTINUE` branch
  since the body length did not change.

All existing CONTINUE_AND_REPLACE callers stay on the existing path: translator
emits a body (Bedrock/SigV4, OpenAI to Anthropic, OpenAI to Vertex, etc.) sets
bodyMutation != nil; backend HTTPBodyMutation sets HasMutations() == true;
retries and streaming-with-IncludeUsage=false set forceBodyMutation == true;
model override emits newBody via sjson.SetBytesOptions.

Tests:

* Added Test_chatCompletionProcessorUpstreamFilter_ProcessRequestHeaders_BodyReplaceContract
  with two subtests:
  * "no translator body, no mutator, no force -> CONTINUE" exercises the
    previously untested path. Asserts Status == CONTINUE, no BodyMutation,
    preserved translator and auth-handler header mutations, and no
    content-length dynamic metadata.
  * "translator body present -> CONTINUE_AND_REPLACE" pins existing behavior
    so a future refactor cannot quietly flip the contract back.

The coverage gap that let this ship: every existing t.Run("ok", ...) subtest
under Test_chatCompletionProcessorUpstreamFilter_ProcessRequestHeaders set
retBodyMutation on the mock translator, so the translator-returns-nil-body
path was never exercised.

Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
@ChrisJBurns ChrisJBurns requested a review from a team as a code owner May 28, 2026 13:57
@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 28, 2026
@PatilHrushikesh
Copy link
Copy Markdown
Contributor

if wantBodyReplace is true, it will still overwrite earlier mutation right?

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.44%. Comparing base (edef4c5) to head (50edc79).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2170      +/-   ##
==========================================
+ Coverage   84.42%   84.44%   +0.02%     
==========================================
  Files         134      134              
  Lines       19162    19179      +17     
==========================================
+ Hits        16177    16196      +19     
+ Misses       1998     1997       -1     
+ Partials      987      986       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ChrisJBurns
Copy link
Copy Markdown
Author

ChrisJBurns commented Jun 2, 2026

if wantBodyReplace is true, it will still overwrite earlier mutation right?

Correct - when wantBodyReplace == true, the replacement body is still derived from u.parent.originalRequestBodyRaw (snapshotted at the router phase), so any mutation an earlier ext_proc filter applied between the router phase and the upstream phase is lost. That's an architectural limitation of the snapshot-at-router-phase design and not in scope for this PR. The narrower thing this PR fixes is the case where the upstream filter was issuing CONTINUE_AND_REPLACE even though nothing needed to mutate the body (no translation, no override, no retry, no HTTPBodyMutation) - that path used to always replay the original and clobber earlier filters; now it emits CONTINUE and leaves the body alone. I'll tighten the PR description to make that scope explicit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants