fix(llm): snapshot vision encoder output before caching #1229

Merged: msluszniak merged 1 commit into main from @nk/fix-vision-encoder-stale-cache on Jun 11, 2026

NorbertKlockiewicz commented Jun 11, 2026

Contributor

Description

In any multimodal conversation with more than one image, the model on later turns describes earlier images as if they were the most recently sent one.

VisionEncoder::encode caches the EValue returned by vision_encoder.execute() per image path. That tensor aliases the method's reusable output buffer, so the next execute() (the second image, or any later encode) overwrites the bytes behind every cached entry. On re-prefilled turns the prefiller then splices the latest image's embeddings into every image slot. The audio path already snapshots its encoder output for exactly this reason (see the AudioSlot comment in multimodal_prefiller.cpp); vision never got the same treatment.
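
The aliasing described above can be reproduced without the real runtime. The following is a minimal sketch, not the project's code: `ToyEncoder`, `AliasingCache`, and `first_entry_after_second_encode` are hypothetical names, and a raw `const float*` stands in for a tensor that views the method's reusable output buffer.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an encoder method that reuses one output buffer.
struct ToyEncoder {
  std::vector<float> output = std::vector<float>(4, 0.0f);

  // Overwrites the shared buffer and returns a non-owning view into it,
  // the way an aliasing output tensor would.
  const float* execute(float pixel_value) {
    for (float& v : output) v = pixel_value;
    return output.data();
  }
};

// Buggy cache (assumed shape): it stores the view, not the bytes.
struct AliasingCache {
  ToyEncoder encoder;
  std::unordered_map<std::string, const float*> entries;

  const float* encode(const std::string& path, float pixel_value) {
    auto it = entries.find(path);
    if (it != entries.end()) return it->second;  // cache hit
    const float* view = encoder.execute(pixel_value);
    entries.emplace(path, view);
    return view;
  }
};

// Encodes two images, then reads the first image's cached embedding.
float first_entry_after_second_encode() {
  AliasingCache cache;
  const float* a = cache.encode("a.png", 1.0f);
  const float* b = cache.encode("b.png", 2.0f);
  assert(a == b);  // both cached views alias the same output buffer
  return cache.encode("a.png", 1.0f)[0];
}
```

In this toy model the cache hit for `a.png` returns the value written for `b.png`, which mirrors the "every image slot gets the latest image" symptom.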

The fix copies the encoder output into bytes owned by the cache entry immediately after execute() and serves cache hits from a tensor wrapping those owned bytes (unordered_map nodes are pointer-stable, so the blob stays valid).
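
A minimal sketch of the snapshot approach, under the same kind of simplification: `ToyEncoder` and `SnapshotCache` are hypothetical names, a `std::vector<float>` plays the role of the bytes owned by the cache entry, and a raw pointer plays the role of the tensor wrapping them. It illustrates the idea only and is not the actual change in this PR.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an encoder method that reuses one output buffer.
struct ToyEncoder {
  std::vector<float> output = std::vector<float>(4, 0.0f);

  const float* execute(float pixel_value) {
    for (float& v : output) v = pixel_value;
    return output.data();
  }
};

// Cache that owns a copy of each encoder output (assumed shape).
struct SnapshotCache {
  ToyEncoder encoder;
  std::unordered_map<std::string, std::vector<float>> entries;

  const float* encode(const std::string& path, float pixel_value) {
    auto it = entries.find(path);
    if (it != entries.end()) return it->second.data();  // serve owned bytes
    const float* view = encoder.execute(pixel_value);
    // Snapshot right after execute(), before anything can overwrite the buffer.
    auto result = entries.emplace(
        path, std::vector<float>(view, view + encoder.output.size()));
    return result.first->second.data();
  }
};

// Encodes two images, then checks the first entry is intact and unmoved.
float first_entry_survives_second_encode() {
  SnapshotCache cache;
  const float* a = cache.encode("a.png", 1.0f);
  cache.encode("b.png", 2.0f);  // may rehash, but elements do not move
  assert(a == cache.encode("a.png", 1.0f));  // same owned bytes on a hit
  return a[0];
}
```

The pointer comparison relies on the standard guarantee that rehashing a `std::unordered_map` invalidates iterators but not pointers or references to its elements, so the stored bytes stay at the same address until the entry is erased.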

The bug is backend-independent (the cache sits above the delegate), so XNNPACK/Vulkan multimodal models are affected the same way.

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

  1. Run the example LLM app with a multimodal model (e.g. Gemma 4 E2B multimodal) on the Multimodal LLM screen.
  2. Send image A with "What's in this picture?" — answer is correct.
  3. Send image B (different content) with the same question — answer is correct.
  4. Ask "What was in the FIRST picture I sent?".

Before this fix, step 4 describes image B's content (both image slots receive B's embeddings on the re-prefilled turn). After the fix, the model correctly recalls image A.

Screenshots

N/A

Related issues

N/A

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

The per-image embedding cache stored the EValue returned by
vision_encoder.execute(), whose tensor aliases the method's reusable
output buffer. The next execute() overwrites that buffer, so in any
conversation with more than one image every cached entry silently
became the most recently encoded image — the model would describe the
first picture as the second one on re-prefilled turns. The audio path
already snapshots its encoder output for exactly this reason; do the
same for vision: copy the output bytes into the cache entry and serve
cache hits from a tensor over the owned bytes.

Authored with Claude.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

msluszniak self-requested a review June 11, 2026 16:55
msluszniak added the "bug fix" label (PRs that are fixing bugs) Jun 11, 2026
msluszniak merged commit 04063c0 into main Jun 11, 2026
4 of 5 checks passed
msluszniak deleted the @nk/fix-vision-encoder-stale-cache branch June 11, 2026 19:49