feat: Add FlexMLRT NPU vision backend for Qwen2.5-VL #984
Draft
liangliangchang wants to merge 8 commits into
Introduce pluggable NPU vision support without scheduler or engine pipelining changes. Vision encoding runs synchronously on the NPU when VLLM_VISION_NPU_BACKEND=flexmlrt is set, keeping core v1 scheduling untouched for easier upstream review. Co-authored-by: Cursor <cursoragent@cursor.com>
Use VLLM_VISION_NPU_CACHE as the sole enable switch instead of VLLM_VISION_NPU_BACKEND. Move Qwen2.5-VL CPU preprocessing into vision_npu/models/ and drop the unoptimized numpy path. Support-only: sync FlexMLRTVisionBackend with no async pipelining env or wrapper. Co-authored-by: Cursor <cursoragent@cursor.com>
Move NPU-specific vision logic into Qwen2_5_VisionTransformerNPU and restore Qwen2_5_VisionTransformer as PyTorch-only. Add build_qwen2_5_vision_transformer() factory and replace npu_backend conditionals with isinstance checks. Co-authored-by: Cursor <cursoragent@cursor.com>
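A minimal sketch of the factory-plus-isinstance pattern this commit describes. The class names and build_qwen2_5_vision_transformer() come from the commit message; the factory's `cache_path` argument, the class bodies, and the `uses_npu()` helper are hypothetical stand-ins, since the real signatures are not shown here.

```python
from typing import Optional, Union


class Qwen2_5_VisionTransformer:
    """Stand-in for the PyTorch-only tower (the real class is an nn.Module)."""


class Qwen2_5_VisionTransformerNPU:
    """Stand-in for the NPU-backed tower. The constructor argument is assumed."""

    def __init__(self, cache_path: str) -> None:
        self.cache_path = cache_path


def build_qwen2_5_vision_transformer(
    cache_path: Optional[str] = None,
) -> Union[Qwen2_5_VisionTransformer, Qwen2_5_VisionTransformerNPU]:
    # Hypothetical signature: pick the NPU tower only when a cache path is given.
    if cache_path:
        return Qwen2_5_VisionTransformerNPU(cache_path)
    return Qwen2_5_VisionTransformer()


def uses_npu(visual: object) -> bool:
    # Callers branch on the tower's type instead of an npu_backend flag.
    return isinstance(visual, Qwen2_5_VisionTransformerNPU)
```

Keeping the choice inside one factory means call sites never read the environment themselves; they only ask which tower they were given.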
Validated on hardware that NPU embeddings match the GPU token-count formula after forward() padding. Remove NPU-only single-image split path and add split_qwen2_5_vision_embedding_sizes() for both towers. Co-authored-by: Cursor <cursoragent@cursor.com>
NPU and GPU share spatial_merge_size and the same grid_thw token formula, so drop split_embedding_sizes helpers and use the original inline split. Co-authored-by: Cursor <cursoragent@cursor.com>
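For readers unfamiliar with the token formula, here is a sketch of how per-image embedding counts can be derived from grid_thw. It assumes the formula is t * h * w // spatial_merge_size**2, as in the upstream Qwen2.5-VL code; the commit itself does not spell the formula out, and the function names below are hypothetical. Plain lists stand in for tensors.

```python
from typing import List, Sequence, Tuple


def vision_token_counts(
    grid_thw: Sequence[Tuple[int, int, int]], spatial_merge_size: int
) -> List[int]:
    """Tokens per image or video: patches divided by the merge window area."""
    merge_area = spatial_merge_size**2
    counts = []
    for t, h, w in grid_thw:
        patches = t * h * w
        if patches % merge_area:
            raise ValueError(f"grid {(t, h, w)} is not divisible by {merge_area}")
        counts.append(patches // merge_area)
    return counts


def split_embeddings(
    embeddings: Sequence, grid_thw: Sequence[Tuple[int, int, int]],
    spatial_merge_size: int,
) -> List[Sequence]:
    """Split a concatenated embedding sequence back into per-item chunks."""
    counts = vision_token_counts(grid_thw, spatial_merge_size)
    if sum(counts) != len(embeddings):
        raise ValueError("embedding length does not match grid_thw")
    chunks, start = [], 0
    for n in counts:
        chunks.append(embeddings[start:start + n])
        start += n
    return chunks
```

Because both towers are said to produce the same counts, a single split like this can serve either one without a backend-specific helper.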
Replace the duplicate embed_input_ids override with the one-liner used by other VL models. Revert unrelated qwen2.py **kwargs change. Co-authored-by: Cursor <cursoragent@cursor.com>
…rt_npu The extension runs FlexMLRT on the NPU; CPU preprocessing is separate in Python. Rebuild the bridge after pulling this change. Co-authored-by: Cursor <cursoragent@cursor.com>
Memory-map the compiled RAI file and pass fbs_buffer to FlexMLRT, matching test_generic -r. VLLM_VISION_NPU_CACHE must point to a .rai file; ONNX for CPU preprocess is resolved from the same bundle directory. Co-authored-by: Cursor <cursoragent@cursor.com>
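The loading step could look roughly like the sketch below. Only VLLM_VISION_NPU_CACHE, the .rai requirement, and the same-directory ONNX lookup come from the commit messages. The function name, the `*.onnx` glob, and the return shape are assumptions, and the actual handoff of the buffer to FlexMLRT is omitted because its API is not shown here.

```python
import mmap
import os
from pathlib import Path
from typing import List, Mapping, Optional, Tuple


def open_rai_bundle(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[mmap.mmap, List[Path]]]:
    """Return (read-only mapping of the .rai file, ONNX files beside it).

    Returns None when VLLM_VISION_NPU_CACHE is unset, meaning NPU vision is off.
    """
    raw = env.get("VLLM_VISION_NPU_CACHE")
    if not raw:
        return None

    rai_path = Path(raw)
    if rai_path.suffix != ".rai" or not rai_path.is_file():
        raise ValueError(f"VLLM_VISION_NPU_CACHE must point to a .rai file: {raw}")

    # Map the file read-only; the mapping stays valid after the file is closed.
    with rai_path.open("rb") as f:
        buffer = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Assumption: the CPU-preprocess ONNX sits in the same bundle directory.
    onnx_files = sorted(rai_path.parent.glob("*.onnx"))
    return buffer, onnx_files
```

Memory-mapping avoids copying the compiled model into Python memory; the caller is responsible for closing the mapping once the runtime no longer needs it.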
Introduce pluggable NPU vision support without scheduler or engine pipelining changes. Vision encoding runs synchronously on the NPU when VLLM_VISION_NPU_BACKEND=flexmlrt is set, keeping core v1 scheduling untouched for easier upstream review.
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.