fix(capabilities): accept MLA architectures when full geometry is exposed #137
Open
mvkorobkov wants to merge 1 commit into
PR chrishayuk#96 wired MLA absorption (`mla_absorb::absorb`) into the f32 weight writer. When an architecture reports `uses_mla() == true` AND exposes all three of `mla_qk_nope_head_dim` / `mla_qk_rope_head_dim` / `mla_v_head_dim`, the writer fuses the four low-rank tensors (q_a / q_b / kv_a / kv_b) into standard dense Q/K/V/O at write time. After that, the on-disk manifest is a standard Q/K/V/O vindex, which is exactly what `ensure_standard_attention_supported` checks for. But the gate still hard-rejected every `uses_mla()` arch, so the absorption path was unreachable from `larql extract --level inference/attention/all` for any DeepSeek-V2/V3/Kimi-K2 model. The CLI failed before the writer could even try:
```
$ larql extract --level inference DS-V2-Lite-Chat.Q4_K.gguf
Error: unsupported architecture 'deepseek' for extract pipeline: multi-head latent attention (MLA) is not implemented
```
This commit narrows the gate to keep rejecting *only* MLA archs whose geometry fields are missing (where absorption can't safely guess a `qk_head_dim` split). Complete-geometry MLA archs pass through, and `write_f32` runs the absorption path that PR chrishayuk#96 already shipped.
End-to-end verification: built locally on top of chrishayuk#135 (which surfaces the MLA fields from GGUF metadata), extracted DeepSeek-V2-Lite-Chat Q4_K (10.4 GB, 27 layers, kv_lora=512, qk_nope=128, qk_rope=64, v_head=128) → 1.13 GB inference-level vindex with `attn_weights.bin` sized at 216 MB (post-absorption standard QKVO).
Tests:
- `mla_with_full_geometry_is_accepted_so_absorption_can_run`: proves the lower-level `ensure_standard_attention_supported` accepts the complete-geometry case.
- `extract_level_inference_accepts_mla_with_full_geometry`: drives the CLI-facing gate at Inference AND All levels.
- All existing rejection tests still pass: incomplete-geometry MLA (the existing `mla_arch()` fixture has none of qk_nope/qk_rope/v_head) remains rejected at every level above Browse. 9/9 pass.
Combined with chrishayuk#133 / chrishayuk#135 / chrishayuk#136, this completes `larql extract` for DeepSeek-V2 family GGUFs (and clears the last gate for Kimi K2 once chrishayuk#135 + chrishayuk#136 merge).
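As a rough illustration of what "fusing" a pair of low-rank factors means, the sketch below multiplies an up-projection `b` (out × r) by a down-projection `a` (r × in) to get the dense out × in matrix that a standard Q/K/V/O layout stores. `fuse_low_rank` is a hypothetical helper, not the `mla_absorb::absorb` implementation. It uses plain row-major `Vec<f32>` matrices and assumes nothing non-linear sits between the two factors. Real MLA checkpoints may normalize the latent in between and route a RoPE slice around the up-projection; how the writer handles those details is not covered here.

```rust
/// Hypothetical helper: fuse a low-rank factorization W = B · A into one
/// dense matrix. All matrices are row-major slices of f32.
/// `b` is `out x r` (up-projection), `a` is `r x inp` (down-projection).
fn fuse_low_rank(b: &[f32], a: &[f32], out: usize, r: usize, inp: usize) -> Vec<f32> {
    assert_eq!(b.len(), out * r, "b must be out x r");
    assert_eq!(a.len(), r * inp, "a must be r x inp");
    let mut w = vec![0.0f32; out * inp];
    for i in 0..out {
        for k in 0..r {
            let bik = b[i * r + k];
            // Accumulate row k of A, scaled by B[i][k], into row i of W.
            for j in 0..inp {
                w[i * inp + j] += bik * a[k * inp + j];
            }
        }
    }
    w
}

fn main() {
    // Rank-1 example: out = 2, r = 1, inp = 3.
    let b = vec![1.0, 2.0]; // 2 x 1
    let a = vec![3.0, 4.0, 5.0]; // 1 x 3
    let w = fuse_low_rank(&b, &a, 2, 1, 3);
    println!("{:?}", w); // [3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
}
```

The fused matrix lets downstream code treat the layer as ordinary dense attention, at the cost of storing out × in values instead of r × (out + in).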
Summary
PR #96 added MLA absorption (`mla_absorb::absorb`) to the f32 weight writer, but the entry-point gate (`ensure_standard_attention_supported`) was never updated to let MLA architectures through. Result: the absorption path is unreachable from `larql extract --level inference/attention/all`, because the CLI refuses MLA up front.
Reproduces with any DeepSeek-V2/V3/Kimi-K2 GGUF:
```
$ larql extract --level inference DeepSeek-V2-Lite-Chat.Q4_K.gguf
Error: unsupported architecture 'deepseek' for extract pipeline:
multi-head latent attention (MLA) is not implemented
```
Fix
Narrow the gate to keep rejecting only MLA archs whose pre-absorption geometry fields are missing (where there is no defensible split for `qk_head_dim`). Complete-geometry MLA archs pass through, and `write_f32` runs the absorption code PR #96 already shipped.
```rust
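// after this PR:
if arch.uses_mla() {
    // Absorption needs all three pre-absorption head dims; without them
    // there is no defensible qk_head_dim split.
    let has_geom = arch.mla_qk_nope_head_dim().is_some()
        && arch.mla_qk_rope_head_dim().is_some()
        && arch.mla_v_head_dim().is_some();
    // Incomplete geometry: still reject.
    if !has_geom {
        return Err(...);
    }
}
// Complete-geometry MLA is no longer rejected here; write_f32 then absorbs it.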
```
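A self-contained model of the narrowed gate can help reason about which cases the tests cover. Everything in it is a hypothetical stand-in rather than larql's real types: the `MlaGeometry` struct, the `ExtractLevel` enum (limited to the levels this PR names), `check_attention`, and the error string are all assumptions. It also returns the absorbed head dims, taking `qk_head_dim = qk_nope + qk_rope` as in DeepSeek-V2-style configs, which is the split that complete geometry makes unambiguous.

```rust
/// Hypothetical stand-in for the MLA fields an architecture may expose.
#[derive(Clone, Copy, Default)]
struct MlaGeometry {
    qk_nope_head_dim: Option<usize>,
    qk_rope_head_dim: Option<usize>,
    v_head_dim: Option<usize>,
}

/// Hypothetical subset of extract levels (only those named in this PR).
#[derive(Clone, Copy, PartialEq)]
enum ExtractLevel {
    Browse,
    Attention,
    Inference,
    All,
}

/// Sketch of the narrowed gate. Returns the absorbed (qk_head_dim, v_head_dim)
/// for complete-geometry MLA, `None` when no MLA check applies, or an error.
fn check_attention(
    level: ExtractLevel,
    uses_mla: bool,
    geom: MlaGeometry,
) -> Result<Option<(usize, usize)>, String> {
    // Per the PR, MLA is only rejected above Browse; non-MLA archs have
    // nothing to check in this sketch.
    if level == ExtractLevel::Browse || !uses_mla {
        return Ok(None);
    }
    match (geom.qk_nope_head_dim, geom.qk_rope_head_dim, geom.v_head_dim) {
        // Complete geometry: the qk_head_dim split is unambiguous.
        (Some(nope), Some(rope), Some(v)) => Ok(Some((nope + rope, v))),
        // Any field missing: keep rejecting rather than guess a split.
        _ => Err("MLA geometry incomplete; cannot absorb".to_string()),
    }
}

fn main() {
    // Geometry reported for DeepSeek-V2-Lite-Chat in this PR.
    let full = MlaGeometry {
        qk_nope_head_dim: Some(128),
        qk_rope_head_dim: Some(64),
        v_head_dim: Some(128),
    };
    for level in [ExtractLevel::Inference, ExtractLevel::All] {
        println!("{:?}", check_attention(level, true, full)); // Ok(Some((192, 128)))
    }
    // No geometry at all, like the existing fixture: rejected above Browse.
    let bare = MlaGeometry::default();
    println!("{:?}", check_attention(ExtractLevel::Attention, true, bare).is_err()); // true
}
```

Matching on the tuple of `Option`s keeps the "all three or reject" rule in one place; the PR's snippet expresses the same condition with chained `is_some()` calls.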
End-to-end verification
Built locally on top of #135 (which surfaces the MLA fields from GGUF metadata), extracted DeepSeek-V2-Lite-Chat Q4_K (10.4 GB single-file GGUF, 27 layers, `kv_lora=512`, `qk_nope=128`, `qk_rope=64`, `v_head=128`) → 1.13 GB inference-level vindex. `index.json` confirms `extract_level: "inference"`, `family: "deepseek"`, `model_type: "deepseek_v2"`. `attn_weights.bin` is 216 MB (post-absorption standard QKVO).
Tests
- `mla_with_full_geometry_is_accepted_so_absorption_can_run`: `ensure_standard_attention_supported` accepts complete-geometry MLA.
- `extract_level_inference_accepts_mla_with_full_geometry`: drives the CLI-facing gate at both Inference and All levels.
- All existing rejection tests still pass: incomplete-geometry MLA (the existing `mla_arch()` fixture has none of `qk_nope`/`qk_rope`/`v_head`) remains rejected. 9/9 pass.
Stacking
This is the third in a small stack (#135 → #136 → this) that together restore the full DeepSeek-V2/V3/Kimi-K2 extraction pipeline. Each PR is logically independent — any merge order works — but the chain only completes end-to-end when all three land.