feat: Add FlexMLRT NPU vision backend for Qwen2.5-VL by liangliangchang · Pull Request #984 · ROCm/vllm

liangliangchang · 2026-06-01T21:45:38Z

Introduce pluggable NPU vision support without scheduler or engine pipelining changes. Vision encoding runs synchronously on the NPU when VLLM_VISION_NPU_BACKEND=flexmlrt is set, keeping core v1 scheduling untouched for easier upstream review.
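As a rough illustration of the "pluggable, synchronous" design described above, the sketch below selects a vision encoder by the value of VLLM_VISION_NPU_BACKEND and calls it inline. The registry, `register_backend`, `select_backend`, and `encode` are hypothetical names invented for this example; they are not taken from the PR, and the real backend interface may differ.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional

# Hypothetical encoder signature: flat pixel values in, embeddings out.
VisionEncoder = Callable[[List[float]], List[float]]

# Hypothetical registry mapping a backend name to an encoder callable.
_BACKENDS: Dict[str, VisionEncoder] = {}


def register_backend(name: str, encoder: VisionEncoder) -> None:
    """Register an encoder under a case-insensitive backend name."""
    _BACKENDS[name.lower()] = encoder


def select_backend(env: Optional[Mapping[str, str]] = None) -> Optional[VisionEncoder]:
    """Return the encoder named by VLLM_VISION_NPU_BACKEND, or None if unset."""
    env = os.environ if env is None else env
    name = env.get("VLLM_VISION_NPU_BACKEND", "").strip().lower()
    if not name:
        return None
    if name not in _BACKENDS:
        raise ValueError(f"unknown vision NPU backend: {name!r}")
    return _BACKENDS[name]


def encode(pixels: List[float], default: VisionEncoder,
           env: Optional[Mapping[str, str]] = None) -> List[float]:
    """Encode synchronously: the call blocks until the chosen encoder returns,
    so nothing in the scheduler needs to know which backend ran."""
    backend = select_backend(env)
    encoder = backend if backend is not None else default
    return encoder(pixels)
```

Because the call is a plain blocking function call, the caller's control flow is the same with or without the NPU backend, which is the property the description relies on to leave scheduling untouched.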

Purpose

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Introduce pluggable NPU vision support without scheduler or engine pipelining changes. Vision encoding runs synchronously on the NPU when VLLM_VISION_NPU_BACKEND=flexmlrt is set, keeping core v1 scheduling untouched for easier upstream review. Co-authored-by: Cursor <cursoragent@cursor.com>

Use VLLM_VISION_NPU_CACHE as the sole enable switch instead of VLLM_VISION_NPU_BACKEND. Move Qwen2.5-VL CPU preprocessing into vision_npu/models/ and drop the unoptimized numpy path. This commit is support-only: FlexMLRTVisionBackend runs synchronously, with no async pipelining environment variable or wrapper. Co-authored-by: Cursor <cursoragent@cursor.com>

Move NPU-specific vision logic into Qwen2_5_VisionTransformerNPU and restore Qwen2_5_VisionTransformer as PyTorch-only. Add build_qwen2_5_vision_transformer() factory and replace npu_backend conditionals with isinstance checks. Co-authored-by: Cursor <cursoragent@cursor.com>
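The factory-plus-isinstance pattern this commit describes might look roughly like the sketch below. The two classes are empty stand-ins rather than the real towers, the `npu_cache_path` parameter and the `describe_tower` helper are assumptions made for illustration, and whether the NPU class inherits from the PyTorch one is not stated in the commit message (they are kept unrelated here).

```python
from typing import Optional, Union


class Qwen2_5_VisionTransformer:
    """Stand-in for the PyTorch-only vision tower (no NPU logic)."""


class Qwen2_5_VisionTransformerNPU:
    """Stand-in for the NPU tower; holds the NPU-specific state."""

    def __init__(self, npu_cache_path: str) -> None:
        self.npu_cache_path = npu_cache_path


Tower = Union[Qwen2_5_VisionTransformer, Qwen2_5_VisionTransformerNPU]


def build_qwen2_5_vision_transformer(npu_cache_path: Optional[str] = None) -> Tower:
    """Hypothetical signature: return the NPU tower when a cache path is
    given, otherwise the plain PyTorch tower."""
    if npu_cache_path:
        return Qwen2_5_VisionTransformerNPU(npu_cache_path)
    return Qwen2_5_VisionTransformer()


def describe_tower(visual: Tower) -> str:
    """Call sites branch on the tower's type instead of an npu_backend flag."""
    if isinstance(visual, Qwen2_5_VisionTransformerNPU):
        return "npu"
    return "pytorch"
```

Keeping the decision in one factory means the PyTorch class carries no NPU conditionals, and any remaining special cases are visible as explicit isinstance checks.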

Validated on hardware that NPU embeddings match the GPU token-count formula after forward() padding. Remove NPU-only single-image split path and add split_qwen2_5_vision_embedding_sizes() for both towers. Co-authored-by: Cursor <cursoragent@cursor.com>

NPU and GPU share spatial_merge_size and the same grid_thw token formula, so drop split_embedding_sizes helpers and use the original inline split. Co-authored-by: Cursor <cursoragent@cursor.com>
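For readers unfamiliar with the token formula mentioned here, the sketch below shows the arithmetic on plain Python lists. It assumes the usual Qwen2.5-VL convention that each image contributes t * h * w / spatial_merge_size² embedding rows; the helper names are illustrative and are not the ones removed by this commit.

```python
from typing import List, Sequence, Tuple


def embedding_sizes(grid_thw: Sequence[Tuple[int, int, int]],
                    spatial_merge_size: int) -> List[int]:
    """Rows of embedding output per image: t*h*w patches, merged in
    spatial_merge_size x spatial_merge_size groups."""
    merge_area = spatial_merge_size * spatial_merge_size
    return [(t * h * w) // merge_area for t, h, w in grid_thw]


def split_embeddings(embeds: Sequence, sizes: Sequence[int]) -> List[Sequence]:
    """Split a concatenated embedding sequence back into per-image chunks."""
    if sum(sizes) != len(embeds):
        raise ValueError(
            f"expected {sum(sizes)} embedding rows, got {len(embeds)}")
    chunks, start = [], 0
    for n in sizes:
        chunks.append(embeds[start:start + n])
        start += n
    return chunks


# Two images: a 1x4x4 grid and a 1x2x6 grid, merge size 2.
sizes = embedding_sizes([(1, 4, 4), (1, 2, 6)], spatial_merge_size=2)
print(sizes)  # [4, 3]
print(split_embeddings(list(range(7)), sizes))  # [[0, 1, 2, 3], [4, 5, 6]]
```

If both towers emit exactly this many rows per image, a single split works for either, which is the reason given for dropping the separate helpers.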

Replace the duplicate embed_input_ids override with the one-liner used by other VL models. Revert unrelated qwen2.py **kwargs change. Co-authored-by: Cursor <cursoragent@cursor.com>

…rt_npu The extension runs FlexMLRT on the NPU; CPU preprocessing is separate in Python. Rebuild the bridge after pulling this change. Co-authored-by: Cursor <cursoragent@cursor.com>

Memory-map the compiled RAI file and pass fbs_buffer to FlexMLRT, matching test_generic -r. VLLM_VISION_NPU_CACHE must point to a .rai file; ONNX for CPU preprocess is resolved from the same bundle directory. Co-authored-by: Cursor <cursoragent@cursor.com>
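A minimal sketch of the loading steps described here, using only the Python standard library: validate that VLLM_VISION_NPU_CACHE names a .rai file, locate an ONNX file in the same directory, and memory-map the .rai bytes read-only. The ONNX filename `preprocess.onnx` and both function names are placeholders, and the call that hands the buffer to FlexMLRT is deliberately omitted because its API is not shown in the PR text.

```python
import mmap
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_bundle(env: Optional[Mapping[str, str]] = None,
                   onnx_name: str = "preprocess.onnx",  # placeholder filename
                   ) -> Optional[Tuple[Path, Path]]:
    """Return (rai_path, onnx_path), or None when the switch is unset."""
    env = os.environ if env is None else env
    raw = env.get("VLLM_VISION_NPU_CACHE", "")
    if not raw:
        return None
    rai_path = Path(raw)
    if rai_path.suffix != ".rai":
        raise ValueError(f"VLLM_VISION_NPU_CACHE must point to a .rai file: {raw}")
    if not rai_path.is_file():
        raise FileNotFoundError(raw)
    # The ONNX model for CPU preprocessing is looked up next to the .rai file.
    return rai_path, rai_path.parent / onnx_name


def map_rai(rai_path: Path) -> mmap.mmap:
    """Memory-map the compiled file read-only. The mapping stays valid after
    the file object is closed; the caller must close() the mapping."""
    with open(rai_path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Mapping the file rather than reading it avoids a second in-memory copy of the compiled model; the resulting buffer object is what would be passed on as the `fbs_buffer`.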

liangliangchang and others added 8 commits June 1, 2026 15:41

refactor: restore upstream image embedding split in _process_image_input

aae1236

refactor: use standard SupportsMultiModal.embed_input_ids for Qwen2.5-VL

0f4f86c

refactor: rename pybind module _vision_flexmlrt_cpu to _vision_flexml…

c6db17b

liangliangchang wants to merge 8 commits into gfx11 from npu-vision-support
