feat(gguf): surface MLA metadata for DeepSeek-V2/V3 + Kimi K2 — closes #67#135

Open
mvkorobkov wants to merge 1 commit into chrishayuk:main from mvkorobkov:feat/deepseek-mla-gguf-kimi-k2

@mvkorobkov

Summary

Closes #67. With the MLA absorption work in #96 already merged, the last missing piece for Kimi K2 (and any DeepSeek-V2/V3) extraction from GGUF was reading the MLA geometry off the GGUF metadata. `to_config_json` dropped every `attention.q_lora_rank` / `attention.kv_lora_rank` / `attention.key_length[_mla]` / `attention.value_length[_mla]` / `rope.dimension_count` key it saw, so `ModelConfig.qk_nope_head_dim` etc. came back as `None` and `uses_mla()` stayed `false` for GGUF-sourced models.

Surprise from looking at real Kimi K2.6 files

Inspecting Kimi-K2.6 UD-Q8_K_XL (unsloth's 554 GB 14-shard split) with `gguf-dump` showed that the "Q8_K_XL" naming is misleading — the tensor type histogram across multiple shards is only BF16 + F32 + Q4_0, every one of which is already covered by larql's existing ggml dequant. No Q8_K (type 15) dequant work was actually needed for this family. The blocker was purely the missing config plumbing.

(For posterity: shard 2 = 54 BF16 / 33 F32 / 13 Q4_0; shards 3, 7, 10 = same pattern with 45/30/15; shard 14 trails with 2/2/2 — all supported.)

What this PR changes

In `crates/larql-models/src/loading/gguf.rs::to_config_json`:

| GGUF key | HF field surfaced | Notes |
|---|---|---|
| `{arch}.attention.q_lora_rank` | `q_lora_rank` | Kimi K2.6: 1536 |
| `{arch}.attention.kv_lora_rank` | `kv_lora_rank` | Kimi K2.6: 512 |
| `{arch}.attention.key_length_mla` (or `.key_length`) | `qk_nope_head_dim` = key_length − rope.dim | Kimi K2.6: 192 − 64 = 128 |
| `{arch}.attention.value_length_mla` (or `.value_length`) | `v_head_dim` | Kimi K2.6: 128 |
| `{arch}.rope.dimension_count` | `qk_rope_head_dim` | Kimi K2.6: 64 |

For per-head dims the loader prefers the `_mla` variants when present — those carry the pre-absorption (DeepSeek-V3-standard) split that `mla_absorb::absorb` operates on. Kimi K2.6's GGUF exposes both forms (192/128 in `_mla`, 576/512 absorbed); we want the 192/128.
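The mapping and the `_mla` preference can be sketched in isolation as below. This is an illustration, not the code in `to_config_json`: the `MlaFields` struct, the `get` and `surface_mla_fields` helpers, and the flattened `HashMap<String, u64>` metadata shape are all hypothetical. Gating on `kv_lora_rank` is an assumption of the sketch, chosen so that non-MLA architectures yield no MLA fields.

```rust
use std::collections::HashMap;

/// MLA geometry under HF-style field names (hypothetical struct for this sketch).
/// `None` means the GGUF metadata did not expose the value.
#[derive(Debug, Default, PartialEq)]
struct MlaFields {
    q_lora_rank: Option<u64>,
    kv_lora_rank: Option<u64>,
    qk_nope_head_dim: Option<u64>,
    qk_rope_head_dim: Option<u64>,
    v_head_dim: Option<u64>,
}

/// Hypothetical helper: look up `{arch}.{suffix}` in a flattened metadata map.
fn get(meta: &HashMap<String, u64>, arch: &str, suffix: &str) -> Option<u64> {
    meta.get(&format!("{arch}.{suffix}")).copied()
}

/// Map GGUF MLA keys to HF-style fields, preferring the `_mla` variants.
fn surface_mla_fields(meta: &HashMap<String, u64>, arch: &str) -> MlaFields {
    // Assumption of this sketch: without `kv_lora_rank` the model is not MLA,
    // so every field stays unset.
    let Some(kv_lora_rank) = get(meta, arch, "attention.kv_lora_rank") else {
        return MlaFields::default();
    };

    // Prefer the pre-absorption `_mla` keys; fall back to the plain keys.
    let key_len = get(meta, arch, "attention.key_length_mla")
        .or_else(|| get(meta, arch, "attention.key_length"));
    let value_len = get(meta, arch, "attention.value_length_mla")
        .or_else(|| get(meta, arch, "attention.value_length"));
    let rope_dim = get(meta, arch, "rope.dimension_count");

    // qk_nope_head_dim = key_length - rope.dim; `checked_sub` turns a
    // malformed (too small) key_length into `None` instead of underflowing.
    let qk_nope = match (key_len, rope_dim) {
        (Some(k), Some(r)) => k.checked_sub(r),
        _ => None,
    };

    MlaFields {
        q_lora_rank: get(meta, arch, "attention.q_lora_rank"),
        kv_lora_rank: Some(kv_lora_rank),
        qk_nope_head_dim: qk_nope,
        qk_rope_head_dim: rope_dim,
        v_head_dim: value_len,
    }
}
```

With Kimi K2.6-shaped input (key_length 576 alongside key_length_mla 192, rope dimension 64), the `_mla` key wins and the sketch yields `qk_nope_head_dim` = 192 − 64 = 128, matching the table above.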

Verification

`cargo test -p larql-models` — 281/281 pass.

Three new tests:

  1. `test_kimi_k2_gguf_to_config_json_extracts_mla_fields` — synthesises Kimi K2.6-shaped metadata (q_lora=1536, kv_lora=512, key_length=576+key_length_mla=192, value_length=512+value_length_mla=128, rope.dim=64); checks all MLA fields end up in the HF config, then drives `detect_from_json` and asserts `uses_mla() == true` with the pre-absorption dims.
  2. `test_gguf_mla_falls_back_to_non_mla_key_length_when_mla_keys_absent` — older DS-V2 GGUFs that ship only `key_length`/`value_length` still produce the correct split.
  3. `test_gguf_mla_fields_absent_for_non_mla_architectures`: llama / qwen / mistral etc. don't emit MLA keys, so the loader leaves every optional MLA field unset and the streaming path keeps its existing behaviour (no regression).
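What the first and third tests assert can be illustrated with a small stand-in for the MLA part of the config. This is a sketch under assumptions: `MlaConfig` and its `uses_mla` rule below are hypothetical and may differ from the real `ModelConfig::uses_mla()` in larql. `q_lora_rank` is deliberately left out of the predicate here, on the assumption that some MLA checkpoints omit query compression.

```rust
/// Hypothetical stand-in for the MLA-related fields of `ModelConfig`.
#[derive(Debug, Default)]
struct MlaConfig {
    q_lora_rank: Option<usize>,
    kv_lora_rank: Option<usize>,
    qk_nope_head_dim: Option<usize>,
    qk_rope_head_dim: Option<usize>,
    v_head_dim: Option<usize>,
}

impl MlaConfig {
    /// Assumed rule for this sketch: MLA is active only when the KV
    /// compression rank and all three per-head dims are known.
    fn uses_mla(&self) -> bool {
        self.kv_lora_rank.is_some()
            && self.qk_nope_head_dim.is_some()
            && self.qk_rope_head_dim.is_some()
            && self.v_head_dim.is_some()
    }

    /// Full per-head query/key width: the non-rotary part plus the rotary part.
    fn qk_head_dim(&self) -> Option<usize> {
        Some(self.qk_nope_head_dim? + self.qk_rope_head_dim?)
    }
}

/// Kimi K2.6 pre-absorption geometry, using the values quoted in this PR.
fn kimi_k2_like() -> MlaConfig {
    MlaConfig {
        q_lora_rank: Some(1536),
        kv_lora_rank: Some(512),
        qk_nope_head_dim: Some(128),
        qk_rope_head_dim: Some(64),
        v_head_dim: Some(128),
    }
}
```

Under this rule a config built from non-MLA metadata (`MlaConfig::default()`) reports `uses_mla() == false`, which is the no-regression property the third test guards.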

What this unlocks

Combined with the three already-merged PRs (#96 MLA absorption + #103 Q3_K/Q5_K dequant + #133 GGUF-input fix), this PR completes the chain: `larql extract --level inference` works end-to-end on Kimi K2 family GGUFs. Same path works for any DeepSeek-V2/V3 GGUF that exposes the standard MLA metadata.

I plan to extract Kimi K2.6 UD-Q8_K_XL once this lands — happy to share the resulting vindex `index.json` for sanity-check.

 chrishayuk#67

llama.cpp emits DeepSeek-V2/V3 (and Kimi K2) MLA geometry in the GGUF
metadata under {arch}.attention.* and {arch}.rope.dimension_count.
`to_config_json` was dropping every one of these fields, so the parsed
ModelConfig had MLA disabled and PR chrishayuk#96's absorption never fired for
GGUF-sourced inputs.

This surfaces the relevant fields into the HF-shaped config the parser
consumes:

- `attention.q_lora_rank`       → `q_lora_rank`
- `attention.kv_lora_rank`      → `kv_lora_rank`
- `attention.key_length[_mla]`  → `qk_nope_head_dim` (= key_length − rope.dim)
- `attention.value_length[_mla]`→ `v_head_dim`
- `rope.dimension_count`        → `qk_rope_head_dim`

For per-head dims the loader prefers the `_mla` variants when present —
those carry the pre-absorption (DS-V3-standard) split that
`mla_absorb::absorb` operates on. Kimi K2.6's GGUF exposes both forms
(192/128 for `_mla`, 576/512 absorbed); we want 192/128.

Verified against Kimi K2.6 UD-Q8_K_XL GGUF metadata (the unsloth name
is misleading — actual tensor types are BF16 + F32 + Q4_0, all already
supported by larql's existing dequant). Three new tests cover:

1. Kimi K2.6-shaped metadata → full MLA fields populated, MLA detected
2. Non-`_mla` variant fallback (DS-V2 with key_length only)
3. Non-MLA architectures (llama) keep their fields absent

281/281 larql-models tests pass. Combined with PR chrishayuk#96 + chrishayuk#103 + chrishayuk#133,
this unlocks inference-level extraction of Kimi K2 family and any
other DeepSeek-V2/V3 GGUF that exposes the standard MLA metadata.

Development

Successfully merging this pull request may close these issues.

Kimi K2 support
