Summary
winml analyze should detect the Gemm→Reshape→Transpose pattern that signals a CNN-ViT hybrid unfold block and emit a warning when highdimRTR_lowdimRTR optimization is requested, because this flag causes measurable regression on such architectures.
Background: What is highdimRTR and why does it regress?
highdimRTR_lowdimRTR is an ORT optimization pass that simplifies Reshape→Transpose→Reshape (RTR) chains from high-dimensional to lower-dimensional layout — beneficial for pure-ViT Attention patterns.
However, on CNN-ViT hybrid architectures (e.g., MobileViT), it backfires:
| Model | Architecture | highdimRTR effect |
|---|---|---|
| DINOv2-small | Pure ViT | +38% speedup |
| MobileViT-small | CNN-ViT hybrid | -19% regression (QNN NPU) / -6.9% (QNN GPU) |
Root cause
MobileViT's CNN encoder uses an unfold operation (sliding window patch extraction) implemented in ONNX as:
Conv → Gemm → Reshape → Transpose → ...
The Gemm→Reshape→Transpose sequence is the architectural fingerprint of this unfold block. When highdimRTR runs on these RTR chains, it inserts ~36 spurious Reshape nodes after the Gemm layers instead of simplifying them, creating unnecessary layout round-trips that increase memory traffic on both the HTP (NPU) and DX12 compute pipeline (GPU).
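One way to check the "spurious Reshape" claim on a given model is to compare op-type histograms of the graph before and after the optimization runs. The sketch below is only an illustration: `op_histogram_delta` is a hypothetical helper, and the toy node lists stand in for `graph.node` of a baseline and an optimized model (they are not taken from MobileViT).

```python
from collections import Counter
from types import SimpleNamespace


def op_histogram_delta(nodes_before, nodes_after):
    """Return {op_type: count_after - count_before} for op types whose count changed."""
    before = Counter(n.op_type for n in nodes_before)
    after = Counter(n.op_type for n in nodes_after)
    return {
        op: after[op] - before[op]
        for op in sorted(set(before) | set(after))
        if after[op] != before[op]
    }


def _nodes(*op_types):
    # Toy stand-ins for ONNX nodes; only .op_type is needed here.
    return [SimpleNamespace(op_type=t) for t in op_types]


baseline = _nodes("Conv", "Gemm", "Reshape", "Transpose")
optimized = _nodes("Conv", "Gemm", "Reshape", "Reshape", "Transpose")
delta = op_histogram_delta(baseline, optimized)
print(delta)  # {'Reshape': 1}
```

A positive `Reshape` delta after the pass would be consistent with the mechanism described above, while a negative delta is what one would expect when the pass simplifies chains as intended.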
Verified results from QNN catalog sweep (research/autoconfig — 3×500 iters, Phase C confirmed):
- MobileViT QNN NPU: h9 highdimRTR → median 31.8 ms vs baseline 26.6 ms = -19.5% regression, DISCARD
- MobileViT QNN GPU: h9 highdimRTR → -6.9% regression (cross-EP, same mechanism)
- DINOv2 QNN NPU: h9 highdimRTR → +38.1% speedup (pure-ViT, no Gemm-unfold blocks)
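The MobileViT NPU figure can be reproduced from the two medians as a relative latency change. The sweep's own formula is not shown here, so the helper below is an assumption that happens to match the reported 19.5%; how the +38.1% speedup for DINOv2 was computed is not stated and is left out.

```python
def relative_latency_change(baseline_ms, candidate_ms):
    """Percent change in latency relative to baseline; positive means slower."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0


# Medians reported for MobileViT on QNN NPU (baseline vs h9 highdimRTR).
change = relative_latency_change(26.6, 31.8)
print(f"{change:.1f}% slower")  # 19.5% slower
```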
What the static analyzer should do
Detection
Add a graph-level pattern check in analyze_insight.py (or equivalent in the static analyzer pipeline) to count Gemm→Reshape→Transpose chains:
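# producer maps each tensor name to the node that outputs it
producer = {out: n for n in graph.node for out in n.output}
gemm_unfold_count = 0
for node in graph.node:
    if node.op_type == "Reshape":
        pred = producer.get(node.input[0])
        if pred is not None and pred.op_type in ("Gemm", "MatMul"):
            # Check if this Reshape feeds a Transpose
            # (_single_consumer: helper returning the sole consumer node, or None)
            consumer = _single_consumer(node)
            if consumer is not None and consumer.op_type == "Transpose":
                gemm_unfold_count += 1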
Warning / skip hint
When gemm_unfold_count > 0 and the model also has RTR chains (highdimRTR candidate):
- Surface as a FusionCandidate with risk="HIGH" and tag highdimRTR_risky
- In the autoconfig sweep: add h_highdimRTR to skip_set so the sweep skips it for this model
- In winml analyze output: print a warning, e.g.:
⚠ Detected 12 Gemm→Reshape→Transpose unfold blocks (CNN-ViT hybrid pattern).
highdimRTR_lowdimRTR optimization inserts spurious Reshape nodes after
Gemm layers on this architecture → confirmed -19% regression on QNN NPU.
Recommendation: skip this optimization flag for this model.
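The three actions above could hang off a single hint object. The sketch below assumes a shape for `FusionCandidate` (its real fields are not shown in this issue) and uses a hypothetical helper `highdim_rtr_hint`; only the tag, risk level, and `h_highdimRTR` skip entry come from the text above.

```python
from dataclasses import dataclass


@dataclass
class FusionCandidate:
    # Hypothetical shape; the real FusionCandidate fields may differ.
    tag: str
    risk: str
    message: str


def highdim_rtr_hint(gemm_unfold_count, has_rtr_chains):
    """Return a risk hint when unfold blocks coexist with RTR chains, else None."""
    if gemm_unfold_count <= 0 or not has_rtr_chains:
        return None
    # ASCII arrows keep the message printable on any console encoding.
    message = (
        f"Detected {gemm_unfold_count} Gemm->Reshape->Transpose unfold blocks "
        "(CNN-ViT hybrid pattern). Recommendation: skip highdimRTR_lowdimRTR "
        "for this model."
    )
    return FusionCandidate(tag="highdimRTR_risky", risk="HIGH", message=message)


skip_set = set()
hint = highdim_rtr_hint(gemm_unfold_count=12, has_rtr_chains=True)
if hint is not None:
    skip_set.add("h_highdimRTR")  # sweep skips this optimization for the model
    print(hint.message)
```

Keeping the decision in one function means the CLI warning and the sweep's skip list cannot drift apart.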
Discriminator logic (architecture classification)
This ties into the broader need for architecture-aware optimization gating:
| Graph signature | Architecture class | highdimRTR recommendation |
|---|---|---|
| Dense Transpose (≥49 nodes), no Gemm-unfold | Pure ViT (DINOv2, ViT-B) | ✅ Candidate (+38%) |
| Gemm→Reshape→Transpose blocks present | CNN-ViT hybrid (MobileViT) | ❌ SKIP (confirmed -19% NPU) |
| Sparse Transpose, Gemm-dominated | Pure CNN (ResNet) | Neutral (test, low priority) |
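The table can be read as a small decision function. The sketch below is one possible encoding, with assumptions: the function name and return labels are hypothetical, the unfold check is placed first so that a hybrid with many Transpose nodes is still skipped, and everything that is neither hybrid nor Transpose-dense falls into the neutral bucket.

```python
def classify_for_highdim_rtr(transpose_count, gemm_unfold_count, dense_threshold=49):
    """Map graph counts to (architecture class, highdimRTR recommendation)."""
    if gemm_unfold_count > 0:
        # Gemm->Reshape->Transpose blocks present: treat as CNN-ViT hybrid.
        return ("cnn_vit_hybrid", "skip")
    if transpose_count >= dense_threshold:
        # Dense Transpose usage without unfold blocks: pure-ViT candidate.
        return ("pure_vit", "candidate")
    # Sparse Transpose: assumed to be the pure-CNN / neutral case.
    return ("pure_cnn", "neutral")


# Illustrative counts only; not measured from real models.
print(classify_for_highdim_rtr(transpose_count=60, gemm_unfold_count=0))
print(classify_for_highdim_rtr(transpose_count=60, gemm_unfold_count=12))
print(classify_for_highdim_rtr(transpose_count=4, gemm_unfold_count=0))
```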
Acceptance Criteria
- analyze_insight.py (or static analyzer) counts Gemm→(Reshape→)Transpose unfold blocks
- FusionCandidate with tag highdimRTR_risky added to results
- catalog_qnn_sweep.py (and equivalent sweep scripts) consume this hint → skip h_highdimRTR for affected models
- winml analyze CLI output includes a human-readable warning for this pattern
- MobileViT-small is flagged highdimRTR_risky; DINOv2-small is NOT
Related
- research/autoconfig/ep_knowledge/qnn_npu.json → npu-010, research/autoconfig/ep_knowledge/qnn_gpu.json → gpu-008
- research/autoconfig/catalog-qnn-sweep/apple--mobilevit-small/results.json (h9)