feat(gguf): multi-shard reader for *-NNNNN-of-NNNNN.gguf splits #136

Open
mvkorobkov wants to merge 1 commit into chrishayuk:main from mvkorobkov:feat/multishard-gguf

Summary

Adds first-class support for multi-shard GGUF models (<prefix>-<NNNNN>-of-<NNNNN>.gguf) so larql extract can load the increasingly-common big-model splits produced by llama-gguf-split. Before this PR, GgufFile::open read exactly one file regardless of split.count, so any model that ships as multiple shards — Kimi K2.6 (14 shards), DeepSeek-V4-Flash (3 shards), DeepSeek-V3 quants on Bartowski, etc. — could not be loaded at all.

Verified locally that this unblocks the chain: combined with #96 / #103 / #133 / #135, larql extract --level inference now ingests Kimi K2.6 UD-Q8_K_XL (554 GB across 14 shards) and DeepSeek-V4-Flash (127 GB across 3 shards).

What changed

  1. ShardInfo + shards: Vec<ShardInfo> on GgufFile — every GgufFile carries the full list of shards (length 1 for single-file). shards[0].path == self.path and shards[0].data_offset == self.data_offset always hold, so existing single-file callers don't need to change.
  2. GgufFile::open detects multi-shard via:
    • split.count > 1 in the metadata (canonical signal from llama-gguf-split), OR
    • the filename matching <prefix>-<NNNNN>-of-<NNNNN>.gguf (fallback for splitters that omit the metadata).
  3. Sibling discovery rebuilds filenames at the same digit width the user's shard uses (00001-of-00014, 001-of-003, etc.) and errors cleanly if any expected sibling is missing on disk.
  4. GgufTensorInfo gains an internal shard_idx field; tensor infos from each shard are appended with the right index. The combined total is cross-checked against split.tensors.count when the splitter emits it.
  5. load_tensors_filtered mmaps lazily per shard — shards whose tensors are all skipped by skip_key are never opened (matters for walk-only loads on big MoE models).
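The filename fallback in item 2 and the width handling in item 3 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `ShardName` struct and the exact signature of `parse_shard_filename` are assumptions made for the example, and only the standard library is used.

```rust
/// Pieces of a `<prefix>-<NNNNN>-of-<NNNNN>.gguf` filename.
/// (Hypothetical type for this sketch; the PR may use a different shape.)
#[derive(Debug, PartialEq, Eq)]
pub struct ShardName {
    pub prefix: String,
    pub index: u32,   // 1-based shard number
    pub count: u32,   // total number of shards
    pub width: usize, // digit width shared by both numeric fields
}

/// Returns `None` for anything that is not a well-formed shard name.
pub fn parse_shard_filename(name: &str) -> Option<ShardName> {
    let stem = name.strip_suffix(".gguf")?;
    // Split from the right so dashes inside the prefix are left alone.
    let (head, count_str) = stem.rsplit_once("-of-")?;
    let (prefix, index_str) = head.rsplit_once('-')?;

    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    if prefix.is_empty() || !all_digits(index_str) || !all_digits(count_str) {
        return None;
    }
    // Both numeric fields must be zero-padded to the same width.
    if index_str.len() != count_str.len() {
        return None;
    }

    let index: u32 = index_str.parse().ok()?;
    let count: u32 = count_str.parse().ok()?;
    if index == 0 || index > count {
        return None;
    }
    Some(ShardName {
        prefix: prefix.to_string(),
        index,
        count,
        width: index_str.len(),
    })
}
```

Keeping the parsed `width` around is what lets sibling discovery rebuild names like `001-of-003` and `00001-of-00014` without guessing the padding.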

The single-file path is untouched at the byte level: open routes through open_single (the previous open body), and loading runs the same dequantization loop, just with a per-shard mmap.
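The lazy per-shard behaviour can be illustrated with a small sketch. To stay dependency-free it uses `std::fs::read` where the PR uses an mmap, and `LazyShards`, `load_filtered`, and the simplified `TensorInfo` fields (`offset`, `len`) are hypothetical names for this example only; dequantization is left out.

```rust
use std::convert::TryFrom;
use std::fs;
use std::io;
use std::path::PathBuf;

pub struct ShardInfo {
    pub path: PathBuf,
    pub data_offset: u64, // start of the tensor data section in this shard
}

pub struct TensorInfo {
    pub name: String,
    pub shard_idx: usize,
    pub offset: u64, // relative to the owning shard's data_offset
    pub len: usize,
}

/// Holds one optional buffer per shard; a shard is read on first use only.
pub struct LazyShards<'a> {
    shards: &'a [ShardInfo],
    loaded: Vec<Option<Vec<u8>>>,
}

impl<'a> LazyShards<'a> {
    pub fn new(shards: &'a [ShardInfo]) -> Self {
        LazyShards { shards, loaded: vec![None; shards.len()] }
    }

    /// Number of shards that have actually been opened so far.
    pub fn opened(&self) -> usize {
        self.loaded.iter().filter(|s| s.is_some()).count()
    }

    pub fn tensor_bytes(&mut self, info: &TensorInfo) -> io::Result<&[u8]> {
        let shards: &'a [ShardInfo] = self.shards;
        let shard = shards.get(info.shard_idx).ok_or_else(|| {
            io::Error::new(io::ErrorKind::InvalidInput, "shard_idx out of range")
        })?;
        let slot = &mut self.loaded[info.shard_idx];
        if slot.is_none() {
            // Stand-in for the per-shard mmap: read the whole shard once.
            *slot = Some(fs::read(&shard.path)?);
        }
        let bytes = slot.as_deref().expect("slot was filled above");
        let range = shard
            .data_offset
            .checked_add(info.offset)
            .and_then(|s| usize::try_from(s).ok())
            .and_then(|s| s.checked_add(info.len).map(|e| s..e));
        range.and_then(|r| bytes.get(r)).ok_or_else(|| {
            io::Error::new(io::ErrorKind::UnexpectedEof, "tensor range lies outside its shard")
        })
    }
}

/// Copies out every tensor not rejected by `skip_key`, and reports how many
/// shards had to be opened to do so.
pub fn load_filtered(
    shards: &[ShardInfo],
    infos: &[TensorInfo],
    skip_key: impl Fn(&str) -> bool,
) -> io::Result<(Vec<(String, Vec<u8>)>, usize)> {
    let mut lazy = LazyShards::new(shards);
    let mut out = Vec::new();
    for info in infos {
        if skip_key(&info.name) {
            continue; // skipped tensors never touch their shard
        }
        let bytes = lazy.tensor_bytes(info)?.to_vec();
        out.push((info.name.clone(), bytes));
    }
    Ok((out, lazy.opened()))
}
```

With a single-element `shards` slice this degenerates to the old single-file behaviour, which is the property the refactor relies on.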

Tests

8 new + all existing pass (35 GGUF / 286 larql-models):

  • parse_shard_filename_canonical_layout — 5-digit Kimi-K2.6-UD-Q8_K_XL-00003-of-00014.gguf
  • parse_shard_filename_rejects_single_file — plain .gguf returns None
  • parse_shard_filename_rejects_unmatched_widths: 00003-of-0014 is rejected (different widths)
  • parse_shard_filename_supports_3digit_split: 001-of-003 works
  • discover_shard_siblings_finds_all_in_order — caller can point at any shard; returns shard 1 → N in order
  • discover_shard_siblings_errors_when_one_missing — clean error, not panic, when a sibling file is gone
  • open_multi_shard_combines_tensors_from_all_shards — builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor is read from its own shard's data section
  • open_rejects_multi_shard_when_a_shard_file_is_missing — caller passes the lone shard 1; loader sees split.count=2, refuses to silently truncate
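The two discover_shard_siblings cases can be pictured with a sketch like the one below. It is an assumption-laden illustration rather than the PR's implementation: the signature (directory, prefix, count, width passed explicitly) is invented for the example, and the error type is plain `std::io::Error`.

```rust
use std::io;
use std::path::{Path, PathBuf};

/// Rebuilds every shard filename from 1 to `count` at the given digit width
/// and returns the paths in shard order. Fails on the first missing file
/// instead of returning a truncated list.
pub fn discover_shard_siblings(
    dir: &Path,
    prefix: &str,
    count: u32,
    width: usize,
) -> io::Result<Vec<PathBuf>> {
    let mut paths = Vec::with_capacity(count as usize);
    for idx in 1..=count {
        // Zero-pad both numbers to the same width the caller's shard used.
        let name = format!("{}-{:0w$}-of-{:0w$}.gguf", prefix, idx, count, w = width);
        let path = dir.join(name);
        if !path.is_file() {
            return Err(io::Error::new(
                io::ErrorKind::NotFound,
                format!("missing GGUF shard {} of {}: {}", idx, count, path.display()),
            ));
        }
        paths.push(path);
    }
    Ok(paths)
}
```

Because the list is rebuilt from the prefix and count rather than from the file the caller passed in, pointing at shard 7 of 14 yields the same ordered result as pointing at shard 1.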

Verification with real big models

Tested on bearden+ai-main against the three models we currently care about (one single-file baseline and two multi-shard):

  • DeepSeek-V2-Lite-Chat.Q4_K (10.4 GB single-file): baseline, confirming the refactor didn't regress single-file loading. (Also exercises the in-memory GGUF→ModelWeights path landed in #133, "fix(extract): accept GGUF input (file or directory)", which closes #131.)
  • DeepSeek-V4-Flash Q3_K_M (127 GB, 3 shards) — loads, tensor infos combine across all three shards.
  • Kimi K2.6 UD-Q8_K_XL (554 GB, 14 shards) — same.

(I'll attach index.json artifacts once the inference-level vindex extractions finish — they're real model-scale extractions, currently running.)

Stacking note

The DeepSeek-V2/V3/Kimi MLA pipeline only really completes when this lands alongside #135 (which surfaces the MLA geometry fields from GGUF metadata). #135 unblocks the config plumbing; this PR unblocks the tensor loading. Either order of merge works — the PRs touch different parts of gguf.rs and don't conflict.

llama.cpp's gguf-split produces multi-file GGUFs (canonical naming:
`<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full
metadata header but only owns its own slice of tensors. The current
`GgufFile::open` reads one file, so multi-shard models (Kimi K2.6 with
14 shards, DeepSeek-V4-Flash with 3 shards, and increasingly any large
modern LLM) cannot be loaded for vindex extraction.

This change:

1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>`
   field on `GgufFile`. Single-file GGUFs have `shards.len() == 1`.
2. `GgufFile::open` detects multi-shard via the explicit `split.count`
   metadata key, falling back to the filename pattern when the splitter
   omits the metadata.
3. Discovers all sibling shards in the same directory by reconstructing
   filenames at the digit width of the shard that was passed in
   (`00001-of-00014` and `001-of-003` are both supported).
4. Appends each sibling's `tensor_infos` to the combined list, tagging
   them with the right `shard_idx`. Cross-checks the total against
   `split.tensors.count` when present.
5. `load_tensors_filtered` mmaps each shard lazily on first use and
   reads each tensor from `shards[info.shard_idx].path` at the right
   per-shard `data_offset`. Shards whose tensors are all skipped by
   `skip_key` are never opened.
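Step 4 (tagging each tensor with its shard and cross-checking the total) might look roughly like this. The function name `combine_tensor_infos`, the reduced `TensorInfo` struct, and the `String` error are all assumptions for illustration; the real `GgufTensorInfo` carries more fields.

```rust
/// Reduced stand-in for the real tensor info: just a name and owning shard.
#[derive(Debug, Clone, PartialEq)]
pub struct TensorInfo {
    pub name: String,
    pub shard_idx: usize,
}

/// `per_shard[i]` lists the tensor names found in shard `i` (in shard order).
/// `expected_total` mirrors an optional `split.tensors.count` metadata value;
/// when it is `None` the cross-check is skipped.
pub fn combine_tensor_infos(
    per_shard: Vec<Vec<String>>,
    expected_total: Option<u64>,
) -> Result<Vec<TensorInfo>, String> {
    let mut combined = Vec::new();
    for (shard_idx, names) in per_shard.into_iter().enumerate() {
        for name in names {
            // Record which shard owns the tensor so the loader can find it.
            combined.push(TensorInfo { name, shard_idx });
        }
    }
    if let Some(expected) = expected_total {
        if combined.len() as u64 != expected {
            return Err(format!(
                "shards contain {} tensors but split.tensors.count says {}",
                combined.len(),
                expected
            ));
        }
    }
    Ok(combined)
}
```

Failing on a count mismatch, rather than proceeding with whatever was found, matches the PR's stated goal of refusing to silently truncate a model.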

Backward-compatible: existing `GgufFile::open` callers and the
single-file test fixtures keep working with `shards = vec![…one…]`.

Tests (8 new + all existing pass):

- parse_shard_filename: canonical layout, plain `.gguf` rejection,
  mismatched widths rejection, 3-digit split width support
- discover_shard_siblings: complete set discovery from any-position
  shard, error when sibling missing
- open_multi_shard_combines_tensors_from_all_shards: builds two real
  2-shard GGUFs with disjoint tensor sets, opens via either shard,
  verifies each tensor reads from its own shard's data section
- open_rejects_multi_shard_when_a_shard_file_is_missing
- existing 27 tests stay green; 286/286 larql-models tests pass

Combined with chrishayuk#96 (MLA absorption), chrishayuk#103 (Q3_K/Q5_K dequant), chrishayuk#133
(GGUF extract input), and chrishayuk#135 (DeepSeek-V2/V3 MLA metadata reading),
this completes the chain — `larql extract --level inference` works
end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard
GGUFs.