feat: ExecuTorch 1.3 with MLX (iOS) and Vulkan (Android) backends + Gemma 4 E2B #1223
Merged
Adds 'mlx' to the Backend union and the backend-resolution order, and routes MLX models through a chunked prefill path. The MLX backend's forward is exported with a sliding-window cap on the sequence dimension, and a one-shot prefill spikes Metal memory. Prefill therefore runs in steps of the forward's declared max input length, read from the method metadata (input_tensor_meta sizes) rather than a fixed constant. Non-MLX backends pass a chunk size of 0 and keep the original one-shot path unchanged. Authored with Claude. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
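The chunked prefill described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `ForwardFn`, `prefill`, and `chunkSizeFor` are hypothetical names, and the assumption that the sequence length is the last entry of the input sizes is mine.

```typescript
// Hypothetical stand-in for one forward call: consumes a slice of prompt
// tokens that starts at position `startPos` in the full prompt.
type ForwardFn = (tokens: number[], startPos: number) => void;

// Runs prefill in chunks of at most `chunkSize` tokens and returns the
// number of forward calls made. A chunk size of 0 (or less) means
// "no chunking": the whole prompt goes through in one call.
function prefill(tokens: number[], chunkSize: number, forward: ForwardFn): number {
  if (tokens.length === 0) {
    return 0;
  }
  if (chunkSize <= 0) {
    forward(tokens, 0);
    return 1;
  }
  let calls = 0;
  for (let pos = 0; pos < tokens.length; pos += chunkSize) {
    forward(tokens.slice(pos, pos + chunkSize), pos);
    calls++;
  }
  return calls;
}

// Hypothetical helper: derive the chunk size from the forward method's
// declared input sizes instead of a fixed constant.
// Assumption: the token input is shaped [batch, seq], so the last entry is
// the maximum sequence length per call.
function chunkSizeFor(backend: string, inputSizes: number[]): number {
  if (backend !== 'mlx' || inputSizes.length === 0) {
    return 0; // non-MLX backends keep the one-shot path
  }
  return inputSizes[inputSizes.length - 1];
}
```

With a five-token prompt and a declared maximum of two tokens per call, this makes three forward calls at positions 0, 2, and 4. Passing 0 collapses to a single call, which mirrors the unchanged path for non-MLX backends.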
Rebuilt the iOS xcframework and static libs and the Android libexecutorch.so (arm64-v8a, x86_64) from the labs ExecuTorch fork: ExecuTorch 1.3 with the MLX backend enabled for iOS and the Vulkan backend enabled for Android. Authored with Claude. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registers gemma4_e2b under models.llm with mlx/xnnpack/vulkan variants served from the react-native-executorch-gemma-4 HF repo. Platform defaults select MLX (Apple GPU) on iOS and XNNPACK on Android, where MLX is unavailable. Authored with Claude. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
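A registry entry with per-backend variants and platform defaults might look something like the sketch below. The shape of `ModelEntry`, the `AVAILABLE` table, the `resolveBackend` function, and the file names are all hypothetical; only the backend names and the defaults stated in this commit (MLX on iOS, XNNPACK on Android) come from the PR. Treating Vulkan as Android-only is an assumption based on the build notes.

```typescript
type Backend = 'mlx' | 'vulkan' | 'xnnpack';
type Platform = 'ios' | 'android';

// Hypothetical registry shape: one model file per backend variant, plus a
// default backend per platform.
interface ModelEntry {
  variants: Partial<Record<Backend, string>>;
  defaults: Record<Platform, Backend>;
}

// Assumption: MLX is only built for iOS and Vulkan only for Android.
const AVAILABLE: Record<Platform, Backend[]> = {
  ios: ['mlx', 'xnnpack'],
  android: ['vulkan', 'xnnpack'],
};

// Placeholder file names, not the real artifact names in the HF repo.
const gemma4E2b: ModelEntry = {
  variants: {
    mlx: 'gemma4_e2b_mlx.pte',
    vulkan: 'gemma4_e2b_vulkan.pte',
    xnnpack: 'gemma4_e2b_xnnpack.pte',
  },
  defaults: { ios: 'mlx', android: 'xnnpack' },
};

// Resolution order in this sketch: explicit request, then the platform
// default, then XNNPACK as a CPU fallback. A candidate is skipped when the
// backend is not available on the platform or the model has no such variant.
function resolveBackend(entry: ModelEntry, platform: Platform, requested?: Backend): Backend {
  const candidates: Array<Backend | undefined> = [requested, entry.defaults[platform], 'xnnpack'];
  for (const b of candidates) {
    if (b !== undefined && AVAILABLE[platform].indexOf(b) !== -1 && entry.variants[b] !== undefined) {
      return b;
    }
  }
  throw new Error('no usable backend for ' + platform);
}
```

Under these assumptions, requesting `mlx` on Android falls through to the platform default rather than failing, while an explicit `vulkan` request on Android is honored.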
…budget Registers gemma4_e2b with mlx/xnnpack/vulkan variants; Android now defaults to Vulkan (verified coherent end-to-end). For dynamic-shape PTEs, the text runner now derives the generation budget from get_max_context_len instead of get_max_seq_len. The latter is the per-call decoder chunk size; using it resolved max_new_tokens to ~0 and ended generation immediately after prefill. Authored with Claude. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
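The budget fix can be illustrated with a small calculation. This sketch assumes the runner computes the budget as "limit minus prompt length, floored at zero"; that formula, the `remainingBudget` helper, and the numbers are illustrative assumptions, not values taken from the PR.

```typescript
// Tokens left for generation once the prompt has been consumed.
// Assumed formula; the real runner may apply further caps.
function remainingBudget(limit: number, promptLen: number): number {
  return Math.max(0, limit - promptLen);
}

// Illustrative limits for a dynamic-shape model: the total context window
// is much larger than the per-call decoder chunk size.
const exampleLimits = { maxContextLen: 2048, maxSeqLen: 128 };
const examplePromptLen = 200;

// Using the per-call chunk size as the limit leaves no room to generate.
const budgetFromSeqLen = remainingBudget(exampleLimits.maxSeqLen, examplePromptLen);

// Using the context length leaves the rest of the window for new tokens.
const budgetFromContextLen = remainingBudget(exampleLimits.maxContextLen, examplePromptLen);
```

Here `budgetFromSeqLen` is 0 and `budgetFromContextLen` is 1848, which matches the symptom described above: any prompt longer than the chunk size ends generation right after prefill.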
Rebuilt libexecutorch.so (arm64-v8a, x86_64) from the labs ExecuTorch 1.3 fork with the Gemma4 Vulkan support: aten.rms_norm lowering and the SDPA shaders, ported onto 1.3's tile-load helper API with the DHSB Q/K/V layout the Gemma4 export uses. Verified coherent generation on device. Authored with Claude. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
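For reference, the operation that an `rms_norm` lowering has to reproduce is the standard RMS normalization over the last dimension. The scalar sketch below shows that definition only; it is not the Vulkan shader from the fork, and the `rmsNorm` name and the default `eps` are assumptions.

```typescript
// Reference RMS normalization of one vector:
//   y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
// `x` and `weight` are assumed to have the same length.
function rmsNorm(x: number[], weight: number[], eps: number = 1e-6): number[] {
  const meanSq = x.reduce((acc, v) => acc + v * v, 0) / x.length;
  const invRms = 1 / Math.sqrt(meanSq + eps);
  return x.map((v, i) => v * invRms * weight[i]);
}
```

With unit weights and `eps` set to 0, the output has a mean square of 1, which is a quick sanity check when comparing a GPU implementation against a CPU reference.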
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
81c6113 to
2cbd555
msluszniak reviewed Jun 11, 2026
… Multimodal category Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mkopcins approved these changes Jun 11, 2026
msluszniak approved these changes Jun 11, 2026
mkopcins pushed a commit that referenced this pull request Jun 11, 2026
…emma 4 E2B (#1223)

Description

Bumps ExecuTorch to 1.3 and adds two GPU backends with Gemma 4 E2B support:

- **MLX (iOS / Apple GPU)**: new backend, with metadata-driven chunked prefill. The MLX `forward` is exported with a sliding-window cap on the sequence dimension, and a one-shot prefill spikes Metal memory, so MLX models are prefilled in steps of the forward's declared max input length (read from the method metadata). Non-MLX backends keep the original one-shot path.
- **Vulkan (Android GPU)**: Gemma 4 E2B now runs on Vulkan. The prebuilt `libexecutorch.so` (arm64-v8a, x86_64) is rebuilt from the labs 1.3 fork with the Gemma4 Vulkan support: the `aten.rms_norm` lowering and the Gemma SDPA shaders, ported onto 1.3's tile-load helper API with the DHSB Q/K/V layout the Gemma4 export uses.

`models.llm.gemma4_e2b` is registered with `mlx` / `xnnpack` / `vulkan` variants and defaults to **MLX on iOS** and **Vulkan on Android**.

Introduces a breaking change?

- [ ] Yes
- [x] No

Type of change

- [x] Bug fix (change which fixes an issue)
- [x] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [x] Other (chores, tests, code style improvements etc.)

Tested on

- [x] iOS
- [x] Android

Testing instructions

1. Build and run the LLM example app (`apps/llm`) on a physical device (Vulkan/MLX need a real GPU, not the simulator/emulator).
2. In the model picker, select **Gemma 4 E2B**.
3. Send a prompt and confirm coherent generation:
   - iOS → runs on the MLX backend.
   - Android → runs on the Vulkan backend.
4. Confirm generation does not stop immediately after prefill and produces multiple tokens.

Screenshots

<!-- Add screenshots here, if applicable -->

Related issues

<!-- Link related issues here using #issue-number -->

Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [x] My changes generate no new warnings

Additional notes

The Vulkan Gemma variant won't work until @mkopcins's PR is merged.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
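Description

Bumps ExecuTorch to 1.3 and adds two GPU backends with Gemma 4 E2B support:

- **MLX (iOS / Apple GPU)**: new backend, with metadata-driven chunked prefill. The MLX `forward` is exported with a sliding-window cap on the sequence dimension, and a one-shot prefill spikes Metal memory, so MLX models are prefilled in steps of the forward's declared max input length (read from the method metadata). Non-MLX backends keep the original one-shot path.
- **Vulkan (Android GPU)**: Gemma 4 E2B now runs on Vulkan. The prebuilt `libexecutorch.so` (arm64-v8a, x86_64) is rebuilt from the labs 1.3 fork with the Gemma4 Vulkan support: the `aten.rms_norm` lowering and the Gemma SDPA shaders, ported onto 1.3's tile-load helper API with the DHSB Q/K/V layout the Gemma4 export uses.

`models.llm.gemma4_e2b` is registered with `mlx` / `xnnpack` / `vulkan` variants and defaults to **MLX on iOS** and **Vulkan on Android**.

Introduces a breaking change?

- [ ] Yes
- [x] No

Type of change

- [x] Bug fix (change which fixes an issue)
- [x] New feature (change which adds functionality)
- [ ] Documentation update (improves or adds clarity to existing documentation)
- [x] Other (chores, tests, code style improvements etc.)

Tested on

- [x] iOS
- [x] Android

Testing instructions

1. Build and run the LLM example app (`apps/llm`) on a physical device (Vulkan/MLX need a real GPU, not the simulator/emulator).
2. In the model picker, select **Gemma 4 E2B**.
3. Send a prompt and confirm coherent generation:
   - iOS → runs on the MLX backend.
   - Android → runs on the Vulkan backend.
4. Confirm generation does not stop immediately after prefill and produces multiple tokens.

Screenshots

Related issues

Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [ ] I have updated the documentation accordingly
- [x] My changes generate no new warnings

Additional notes

The Vulkan Gemma variant won't work until @mkopcins's PR is merged.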