Add Qwen3.6-27B hybrid DeltaNet/GQA contrib model and vLLM serving path #164
m-deepankar-singh wants to merge 3 commits into
Working with our team to evaluate.
Qwen3.6: Fused DeltaNet Direct-Solve Follow-Up
Summary
This branch is a reviewer-friendly presentation of the fused DeltaNet direct-solve result for Qwen3.6. It is intentionally based on PR 164,
The important result is that the fused DeltaNet CTE path is now coherent with realistic Qwen gate values when the Neumann power-doubling solve is replaced by a direct triangular RHS solve.
Branch Lineage
The actual development history was:
PR 164 is the original Qwen3.6 vLLM APC baseline for this branch, and PR 164
This clean branch extracts the direct-solve fused DeltaNet work and its result artifacts onto PR 164 for review. It does not include the entire
Why This Exists
The original fused DeltaNet path used a Neumann power-doubling solve for the recurrence. That approach is mathematically convenient, but it is fragile for realistic Qwen gate scales because it repeatedly forms full matrix powers and can amplify numerical error. In practice, the fused path could produce unstable or incoherent tokens.
The chunked DeltaNet path already used a more stable direct triangular solve. This branch ports that idea to the fused path: compute the causal recurrence through a direct triangular RHS solve instead of Neumann power-doubling.
The goal is not to claim the fused path is now the final production baseline by itself. The goal is to make the fused-kernel stability fix reviewable and to preserve the validation evidence that it produces coherent output inside the full experimental runtime lineage.
Major Changes From PR 164 To The Tested Branch
The full tested branch differs from PR 164 by roughly 105 source/result files. The main work streams from
Hybrid APC Runtime
vLLM / NxDI Scheduler Bridge
Qwen Model Execution
NxDI Prefix-Cache Plumbing
DeltaNet NKI Kernels
FP8 / Artifact Compile Path
Serving And API Compatibility
Validation And Results
What This Clean Branch Adds
This branch extracts only the fused DeltaNet follow-up commits:
Concretely, it changes:
Implementation Details
The fused kernel previously used Neumann power-doubling to solve the recurrent DeltaNet update. The direct-solve version computes the lower-triangular causal recurrence explicitly in the RHS solve path. This avoids the repeated full-matrix power operations that were unstable under realistic gate values.
The validator was also made standalone enough to load the fused NKI kernel directly. This matters because review and debug runs should not depend on package import side effects.
The CPU regression test was updated to cover realistic decay/gate scales and to catch the class of instability that showed up in the fused branch.
Validation Results
Validation artifacts are stored under:
Local Checks
python3 -m py_compile \
    contrib/models/Qwen3.6-27B/src/nki_kernels/nki_deltanet_fused.py \
    contrib/models/Qwen3.6-27B/scripts/validate_deltanet_fused_nki.py \
    contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py
python3 -m pytest contrib/models/Qwen3.6-27B/test/unit/test_deltanet_decay.py -q
Result:
Artifact Under Test
Runtime:
Coherence
File:
Result:
Decode
File:
Result:
Cold / Warm Prefill
File:
Memory
File:
Result:
Memory caveat: this is a Neuron high-water peak from the Hybrid APC artifact,
Known Limitations
The compiled artifact used for this validation has
That is an artifact bucket coverage limitation, not a direct-solve correctness failure. A long-context follow-up compile should include prefix buckets beyond 16K, ideally through the target 64K/128K range.
The artifact results were produced from
What Is Intentionally Not Included
This clean branch does not include the full 80+ commit
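The difference between the two solves discussed under "Why This Exists" and "Implementation Details" can be illustrated with a small NumPy sketch. It is not the NKI kernel from this branch. It assumes the per-chunk recurrence reduces to a unit lower-triangular system (I + A) X = B with A strictly lower triangular, which is a common way to write chunked DeltaNet but is an assumption here. The function names and the `scale` parameter (a stand-in for gate/decay magnitude) are hypothetical.

```python
import numpy as np


def neumann_power_doubling_solve(A, B):
    """Solve (I + A) X = B for strictly lower-triangular A.

    Uses (I + A)^-1 = (I + N)(I + N^2)(I + N^4)... with N = -A.
    The product is exact once 2^steps >= n because N is nilpotent,
    but every step forms a full matrix power.
    """
    n = A.shape[0]
    N = -A
    inv = np.eye(n, dtype=A.dtype)
    power = N.copy()
    steps = max(1, int(np.ceil(np.log2(n))))
    for _ in range(steps):
        inv = inv + inv @ power  # inv <- inv @ (I + power)
        power = power @ power    # N^(2^j) -> N^(2^(j+1))
    return inv @ B


def direct_triangular_solve(A, B):
    """Solve (I + A) X = B by forward substitution, one row at a time."""
    n = A.shape[0]
    X = np.zeros_like(B)
    for i in range(n):
        # Row i depends only on already-solved rows 0..i-1 (causal order).
        X[i] = B[i] - A[i, :i] @ X[:i]
    return X


def max_abs_errors(n=16, d=4, scale=0.1, seed=0, dtype=np.float32):
    """Return (neumann_error, direct_error) against a float64 reference."""
    rng = np.random.default_rng(seed)
    A64 = np.tril(rng.standard_normal((n, n)), k=-1) * scale
    B64 = rng.standard_normal((n, d))
    ref = np.linalg.solve(np.eye(n) + A64, B64)
    A, B = A64.astype(dtype), B64.astype(dtype)
    err_neumann = np.abs(neumann_power_doubling_solve(A, B) - ref).max()
    err_direct = np.abs(direct_triangular_solve(A, B) - ref).max()
    return float(err_neumann), float(err_direct)


if __name__ == "__main__":
    for scale in (0.1, 1.0, 4.0):
        print(scale, max_abs_errors(scale=scale))
```

Running the loop at increasing `scale` is one way to compare how the two methods behave as entries grow. Both are exact in exact arithmetic, so any gap between the two errors comes from finite-precision effects in the intermediate matrix powers.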
882e9da to f5b2eaf
Clean-room cut of codex/nki-deltanet-multihead-cte (43ca740) onto upstream main: contrib model + NKI kernels, hybrid APC/GDN cache support, core substrate, unit tests, validation scripts, and the verified nativechunk baseline evidence.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
f5b2eaf to 5619ee0
Closing in favor of #173, a fresh PR from a clean branch rebased on current main, with the same contribution and committed validation evidence. The smaller, single-purpose diff should be much easier to review.
Summary
This adds a Qwen3.6-27B contrib implementation for NxDI. Qwen3.6-27B is not a standard transformer-only decoder; it uses a hybrid [3 DeltaNet + 1 GQA] x 16 architecture with 48 recurrent DeltaNet/GDN layers and 16 GQA layers. Supporting it requires model code, DeltaNet NKI kernels, hybrid recurrent-state cache handling, vLLM/OpenAI serving glue, and validation coverage.
The main additions are:
contrib/models/Qwen3.6-27B/.
Why this is larger than a normal contrib model
Stock NxDI has the standard serving substrate for transformer models, but Qwen3.6-27B needs additional hybrid recurrent-state semantics. Attention KV prefix reuse alone is insufficient. A safe reusable prefix is the intersection of attention KV cache hits, GDN recurrent checkpoint hits, and GDN convolution checkpoint hits. The PR therefore includes the minimal scheduler/model/cache bridge needed for coherent long-context Hybrid APC.
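The intersection rule above can be sketched in a few lines of Python. This is an illustration only, not the scheduler code in this PR: the function name, its arguments, and the idea that checkpoints are tracked as sets of token offsets are all assumptions made for the example.

```python
def safe_reusable_prefix(attn_kv_hit_tokens, recurrent_ckpts, conv_ckpts):
    """Return the longest prefix length (in tokens) that every cache can serve.

    attn_kv_hit_tokens: tokens covered by the attention KV prefix hit
                        (hypothetical input).
    recurrent_ckpts:    token offsets with a saved GDN recurrent state
                        (hypothetical representation).
    conv_ckpts:         token offsets with a saved GDN convolution state
                        (hypothetical representation).
    """
    # A prefix is reusable only if both GDN states were checkpointed at
    # exactly that offset and the attention KV hit covers it.
    shared = set(recurrent_ckpts) & set(conv_ckpts)
    candidates = [p for p in shared if 0 < p <= attn_kv_hit_tokens]
    return max(candidates, default=0)


# Attention KV covers 4096 tokens, but the only offset with both GDN
# checkpoints inside that range is 2048, so only 2048 tokens are reusable.
print(safe_reusable_prefix(4096, {1024, 2048, 8192}, {2048, 4096}))
```

The point of the sketch is that a longer attention KV hit does not help on its own: without matching recurrent and convolution checkpoints at the same offset, the recurrent layers would have to be recomputed from an earlier position.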
Validation
Recorded on trn2.3xlarge, TP=4, LNC=2. Evidence committed under validation_outputs/qwen36_nativechunk_baseline_20260609T000000Z/.
pass=true, thinking enabled.
usage.prompt_tokens, 235.98 s TTFT, 1,029.2 tok/s usage-accounted, pass=true, thinking enabled.
enable_thinking=False.
Known limitations and follow-ups
max_num_seqs=1 is not included in this baseline.
Tests
test/integration/test_model.py) for model load/generation/coherence on trn2.3xlarge.
🤖 Generated with Claude Code