feat: Add FlexMLRT NPU vision backend for Qwen2.5-VL #984

Draft
liangliangchang wants to merge 8 commits into gfx11 from npu-vision-support

liangliangchang commented Jun 1, 2026

Introduce pluggable NPU vision support without scheduler or engine pipelining changes. Vision encoding runs synchronously on the NPU when VLLM_VISION_NPU_BACKEND=flexmlrt is set, keeping core v1 scheduling untouched for easier upstream review.

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

liangliangchang and others added 8 commits June 1, 2026 15:41
Introduce pluggable NPU vision support without scheduler or engine
pipelining changes. Vision encoding runs synchronously on the NPU when
VLLM_VISION_NPU_BACKEND=flexmlrt is set, keeping core v1 scheduling
untouched for easier upstream review.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use VLLM_VISION_NPU_CACHE as the sole enable switch instead of
VLLM_VISION_NPU_BACKEND. Move Qwen2.5-VL CPU preprocessing into
vision_npu/models/ and drop the unoptimized numpy path. Support-only:
sync FlexMLRTVisionBackend with no async pipelining env or wrapper.

Co-authored-by: Cursor <cursoragent@cursor.com>
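
A minimal sketch of what "sole enable switch" could look like, assuming the variable is read from the process environment and that an unset or empty value means the NPU path is off. The helper names `npu_vision_cache_path` and `npu_vision_enabled` are hypothetical and are not taken from the PR.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "VLLM_VISION_NPU_CACHE"  # variable name from the commit message


def npu_vision_cache_path(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the configured cache path, or None when the NPU path is disabled.

    Hypothetical helper: treats an unset or whitespace-only value as "disabled".
    """
    env = os.environ if env is None else env
    value = env.get(ENV_VAR, "").strip()
    return value or None


def npu_vision_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """The cache variable alone decides; no second backend variable is consulted."""
    return npu_vision_cache_path(env) is not None
```

With this shape, setting only the older `VLLM_VISION_NPU_BACKEND` variable has no effect, which matches the intent stated in the commit message.
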
Move NPU-specific vision logic into Qwen2_5_VisionTransformerNPU and
restore Qwen2_5_VisionTransformer as PyTorch-only. Add
build_qwen2_5_vision_transformer() factory and replace npu_backend
conditionals with isinstance checks.

Co-authored-by: Cursor <cursoragent@cursor.com>
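
The factory-plus-`isinstance` pattern described in this commit can be sketched as follows. The two classes here are plain stand-ins (the real towers are PyTorch modules), and the factory signature, the `cache_path` attribute, and the `uses_npu` helper are assumptions made for illustration only.

```python
from typing import Optional, Union


class Qwen2_5_VisionTransformer:
    """Stand-in for the PyTorch-only vision tower."""

    def __init__(self, config: dict) -> None:
        self.config = config


class Qwen2_5_VisionTransformerNPU:
    """Stand-in for the NPU tower; keeps the path to the compiled model."""

    def __init__(self, config: dict, cache_path: str) -> None:
        self.config = config
        self.cache_path = cache_path


VisionTower = Union[Qwen2_5_VisionTransformer, Qwen2_5_VisionTransformerNPU]


def build_qwen2_5_vision_transformer(
    config: dict, npu_cache_path: Optional[str] = None
) -> VisionTower:
    """Pick the tower once, at construction time (hypothetical signature)."""
    if npu_cache_path:
        return Qwen2_5_VisionTransformerNPU(config, npu_cache_path)
    return Qwen2_5_VisionTransformer(config)


def uses_npu(visual: VisionTower) -> bool:
    """Call sites branch on the tower's type rather than on a backend flag."""
    return isinstance(visual, Qwen2_5_VisionTransformerNPU)
```

Keeping the decision inside the factory means the PyTorch class carries no NPU conditionals, and callers that need NPU-specific handling can test the object they already hold.
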
Validated on hardware that NPU embeddings match the GPU token-count
formula after forward() padding. Remove NPU-only single-image split
path and add split_qwen2_5_vision_embedding_sizes() for both towers.

Co-authored-by: Cursor <cursoragent@cursor.com>
NPU and GPU share spatial_merge_size and the same grid_thw token formula,
so drop split_embedding_sizes helpers and use the original inline split.

Co-authored-by: Cursor <cursoragent@cursor.com>
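
For readers unfamiliar with the shared token formula, here is a small sketch. It assumes the usual Qwen2-VL-style rule that each image contributes `t * h * w / spatial_merge_size**2` embedding rows, and it uses plain Python lists instead of tensors. The function names are hypothetical and are not the helpers that were removed in this commit.

```python
from typing import List, Sequence, Tuple


def embedding_sizes(
    grid_thw: Sequence[Tuple[int, int, int]], spatial_merge_size: int
) -> List[int]:
    """Number of merged vision tokens per item, from its (t, h, w) patch grid."""
    merge_area = spatial_merge_size * spatial_merge_size
    sizes = []
    for t, h, w in grid_thw:
        patches = t * h * w
        if patches % merge_area != 0:
            raise ValueError(f"grid {(t, h, w)} is not divisible by {merge_area}")
        sizes.append(patches // merge_area)
    return sizes


def split_embeddings(rows: Sequence, sizes: Sequence[int]) -> List[list]:
    """Split a concatenated embedding sequence back into per-item chunks."""
    if sum(sizes) != len(rows):
        raise ValueError("embedding row count does not match the grid formula")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(list(rows[start:start + size]))
        start += size
    return chunks
```

For example, grids `(1, 4, 6)` and `(1, 8, 8)` with `spatial_merge_size = 2` give sizes `[6, 16]`, so a 22-row output splits into chunks of 6 and 16 rows. Because both towers produce the same row counts, one split path can serve both.
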
Replace the duplicate embed_input_ids override with the one-liner used by
other VL models. Revert unrelated qwen2.py **kwargs change.

Co-authored-by: Cursor <cursoragent@cursor.com>
…rt_npu

The extension runs FlexMLRT on the NPU; CPU preprocessing is separate
in Python. Rebuild the bridge after pulling this change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Memory-map the compiled RAI file and pass fbs_buffer to FlexMLRT, matching
test_generic -r. VLLM_VISION_NPU_CACHE must point to a .rai file; ONNX for
CPU preprocess is resolved from the same bundle directory.

Co-authored-by: Cursor <cursoragent@cursor.com>
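
A minimal sketch of the memory-mapping step using only the Python standard library. It does not call FlexMLRT: how the mapped buffer is handed over as `fbs_buffer` is specific to the extension and is not shown here. The function names and the `*.onnx` glob used to locate the preprocessing model are assumptions.

```python
import mmap
from pathlib import Path
from typing import List, Tuple


def map_rai_file(cache_path: str) -> Tuple[mmap.mmap, Path]:
    """Memory-map a compiled .rai file read-only and return (buffer, bundle dir)."""
    path = Path(cache_path)
    if path.suffix != ".rai":
        raise ValueError(f"VLLM_VISION_NPU_CACHE must point to a .rai file: {path}")
    with path.open("rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after f closes.
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return buf, path.parent


def find_preprocess_onnx(bundle_dir: Path) -> List[Path]:
    """Hypothetical lookup: list ONNX files that sit next to the .rai file."""
    return sorted(bundle_dir.glob("*.onnx"))
```

The caller owns the returned buffer and should keep it alive for as long as the runtime reads from it, then call `close()` on it.
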