fix(llm): snapshot vision encoder output before caching by NorbertKlockiewicz · Pull Request #1229 · software-mansion/react-native-executorch

NorbertKlockiewicz · 2026-06-11T16:48:19Z

Description

In any multimodal conversation with more than one image, on later turns the model describes earlier images as if they were the most recently sent one.

VisionEncoder::encode caches the EValue returned by vision_encoder.execute() per image path. That tensor aliases the method's reusable output buffer, so the next execute() (the second image, or any later encode) overwrites the bytes behind every cached entry. On re-prefilled turns the prefiller then splices the latest image's embeddings into every image slot. The audio path already snapshots its encoder output for exactly this reason (see the AudioSlot comment in multimodal_prefiller.cpp); vision never got the same treatment.
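The aliasing problem described above can be reproduced in miniature. The following is only a sketch under stated assumptions: `FakeEncoder`, `AliasingCache`, and `first_value_of_a_after_encoding_b` are hypothetical stand-ins, not the ExecuTorch or react-native-executorch API, and a raw pointer plays the role of the cached EValue's tensor data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an encoder method whose output lives in one
// reusable buffer. This is NOT the ExecuTorch API.
struct FakeEncoder {
  std::vector<float> output = std::vector<float>(4);  // reused on every call

  // Fills the reusable buffer and returns a non-owning pointer into it.
  const float* execute(float image_value) {
    for (float& x : output) x = image_value;
    return output.data();
  }
};

// Buggy cache (illustrative): stores the aliasing pointer instead of a copy.
struct AliasingCache {
  FakeEncoder encoder;
  std::unordered_map<std::string, const float*> entries;

  const float* encode(const std::string& path, float image_value) {
    auto it = entries.find(path);
    if (it != entries.end()) return it->second;  // cache hit
    const float* out = encoder.execute(image_value);
    entries.emplace(path, out);  // entry aliases the encoder's buffer
    return out;
  }
};

// Returns the first value served for image A after image B was encoded.
inline float first_value_of_a_after_encoding_b() {
  AliasingCache cache;
  cache.encode("a.png", 1.0f);
  cache.encode("b.png", 2.0f);  // overwrites the bytes behind "a.png"
  return cache.encode("a.png", 1.0f)[0];
}
```

In this toy version the cache hit for `a.png` yields image B's value, which mirrors every image slot receiving the latest image's embeddings.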

The fix copies the encoder output into bytes owned by the cache entry immediately after execute() and serves cache hits from a tensor wrapping those owned bytes (unordered_map nodes are pointer-stable, so the blob stays valid).
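The shape of the fix can be sketched the same way. This is a minimal illustration, not the code in this PR: `FakeEncoder`, `OwnedEntry`, `SnapshotCache`, and `snapshot_survives_second_encode` are hypothetical names, and a `std::vector<float>` stands in for the bytes owned by the cache entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical encoder with a single reusable output buffer (not ExecuTorch).
struct FakeEncoder {
  std::vector<float> output = std::vector<float>(4);

  const float* execute(float image_value) {
    for (float& x : output) x = image_value;
    return output.data();
  }
};

// Cache entry that owns a snapshot of the encoder output.
struct OwnedEntry {
  std::vector<float> data;
};

struct SnapshotCache {
  FakeEncoder encoder;
  std::unordered_map<std::string, OwnedEntry> entries;

  const float* encode(const std::string& path, float image_value) {
    auto it = entries.find(path);
    if (it != entries.end()) return it->second.data.data();  // owned bytes
    const float* out = encoder.execute(image_value);
    // Copy right after execute(), before another call can reuse the buffer.
    OwnedEntry entry{std::vector<float>(out, out + encoder.output.size())};
    auto inserted = entries.emplace(path, std::move(entry)).first;
    // References to unordered_map elements survive later insertions and
    // rehashing, so this pointer stays valid while the entry exists.
    return inserted->second.data.data();
  }
};

// True if image A's snapshot is intact after image B has been encoded.
inline bool snapshot_survives_second_encode() {
  SnapshotCache cache;
  const float* a = cache.encode("a.png", 1.0f);
  cache.encode("b.png", 2.0f);
  return a[0] == 1.0f && cache.encode("a.png", 1.0f) == a;
}
```

The cost is one extra copy per newly encoded image; cache hits stay copy-free because they are served straight from the owned bytes.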

The bug is backend-independent (the cache sits above the delegate), so XNNPACK/Vulkan multimodal models are affected the same way.

Introduces a breaking change?

Yes
No

Type of change

Bug fix (change which fixes an issue)
New feature (change which adds functionality)
Documentation update (improves or adds clarity to existing documentation)
Other (chores, tests, code style improvements etc.)

Tested on

iOS
Android

Testing instructions

1. Run the example LLM app with a multimodal model (e.g. Gemma 4 E2B multimodal) on the Multimodal LLM screen.
2. Send image A with "What's in this picture?". The answer is correct.
3. Send image B (different content) with the same question. The answer is correct.
4. Ask "What was in the FIRST picture I sent?".

Before this fix, step 4 describes image B's content (both image slots receive B's embeddings on the re-prefilled turn). After the fix, the model correctly recalls image A.

Screenshots

N/A

Related issues

N/A

Checklist

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have updated the documentation accordingly
My changes generate no new warnings

Additional notes

The per-image embedding cache stored the EValue returned by vision_encoder.execute(), whose tensor aliases the method's reusable output buffer. The next execute() overwrites that buffer, so in any conversation with more than one image every cached entry silently became the most recently encoded image: the model would describe the first picture as the second one on re-prefilled turns.

The audio path already snapshots its encoder output for exactly this reason; do the same for vision: copy the output bytes into the cache entry and serve cache hits from a tensor over the owned bytes.

Authored with Claude.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

msluszniak self-requested a review June 11, 2026 16:55

msluszniak assigned NorbertKlockiewicz Jun 11, 2026

msluszniak added the bug fix label (PRs that are fixing bugs) Jun 11, 2026

msluszniak approved these changes Jun 11, 2026

msluszniak merged commit 04063c0 into main Jun 11, 2026
4 of 5 checks passed

msluszniak deleted the @nk/fix-vision-encoder-stale-cache branch June 11, 2026 19:49
