fix(llm): snapshot vision encoder output before caching #1229
Merged
The per-image embedding cache stored the EValue returned by vision_encoder.execute(), whose tensor aliases the method's reusable output buffer. The next execute() overwrites that buffer, so in any conversation with more than one image every cached entry silently became the most recently encoded image — the model would describe the first picture as the second one on re-prefilled turns. The audio path already snapshots its encoder output for exactly this reason; do the same for vision: copy the output bytes into the cache entry and serve cache hits from a tensor over the owned bytes. Authored with Claude. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
msluszniak
approved these changes
Jun 11, 2026
Description
In any multimodal conversation with more than one image, the model starts describing earlier images as the most recently sent one on later turns.
VisionEncoder::encode caches the EValue returned by vision_encoder.execute() per image path. That tensor aliases the method's reusable output buffer, so the next execute() (the second image, or any later encode) overwrites the bytes behind every cached entry. On re-prefilled turns the prefiller then splices the latest image's embeddings into every image slot. The audio path already snapshots its encoder output for exactly this reason (see the AudioSlot comment in multimodal_prefiller.cpp); vision never got the same treatment.
The fix copies the encoder output into bytes owned by the cache entry immediately after execute() and serves cache hits from a tensor wrapping those owned bytes (unordered_map nodes are pointer-stable, so the blob stays valid).
The bug is backend-independent (the cache sits above the delegate), so XNNPACK/Vulkan multimodal models are affected the same way.
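The aliasing failure and the snapshot fix can be illustrated with a minimal, self-contained sketch. It does not use the real ExecuTorch API: FakeEncoder, AliasingCache, SnapshotCache and first_value_of_a_after_b are hypothetical names invented for this illustration, and a plain float buffer stands in for the EValue tensor. The only assumption carried over from the description is that execute() returns a view into a buffer that is reused on every call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for an encoder whose execute() returns a view
// into a single output buffer that is reused on every call.
struct FakeEncoder {
  std::vector<float> out = std::vector<float>(4);

  const float* execute(float seed) {
    for (std::size_t i = 0; i < out.size(); ++i) {
      out[i] = seed + static_cast<float>(i);
    }
    return out.data();  // aliases `out`; the next execute() overwrites it
  }

  std::size_t numel() const { return out.size(); }
};

// Buggy variant: the cache entry stores the aliasing pointer itself.
struct AliasingCache {
  FakeEncoder enc;
  std::unordered_map<std::string, const float*> entries;

  const float* encode(const std::string& path, float seed) {
    auto it = entries.find(path);
    if (it != entries.end()) return it->second;  // cache hit
    return entries.emplace(path, enc.execute(seed)).first->second;
  }
};

// Fixed variant: the cache entry owns a copy of the output bytes.
struct SnapshotCache {
  FakeEncoder enc;
  std::unordered_map<std::string, std::vector<float>> entries;

  const float* encode(const std::string& path, float seed) {
    auto it = entries.find(path);
    if (it != entries.end()) return it->second.data();  // cache hit
    const float* raw = enc.execute(seed);
    std::vector<float> owned(raw, raw + enc.numel());  // snapshot now
    // unordered_map elements are not relocated on rehash, so the
    // pointer into the owned vector stays valid for later hits.
    return entries.emplace(path, std::move(owned)).first->second.data();
  }
};

// Encodes image "a", then image "b", then reads "a" back from the cache.
template <typename Cache>
float first_value_of_a_after_b() {
  Cache cache;
  cache.encode("a.png", 10.0f);
  cache.encode("b.png", 20.0f);
  return cache.encode("a.png", 10.0f)[0];
}
```

In this sketch the aliasing cache returns image b's data when asked for image a, while the snapshot cache returns a's own data. That mirrors the behaviour described above, at the cost of one copy per cache miss.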
Introduces a breaking change?
Type of change
Tested on
Testing instructions
Before this fix, step 4 describes image B's content (both image slots receive B's embeddings on the re-prefilled turn). After the fix, the model correctly recalls image A.
Screenshots
N/A
Related issues
N/A
Checklist
Additional notes