feat(gguf): multi-shard reader for *-NNNNN-of-NNNNN.gguf splits by mvkorobkov · Pull Request #136 · chrishayuk/larql

mvkorobkov · 2026-05-24T11:37:34Z

Summary

Adds first-class support for multi-shard GGUF models (<prefix>-<NNNNN>-of-<NNNNN>.gguf) so larql extract can load the increasingly-common big-model splits produced by llama-gguf-split. Before this PR, GgufFile::open read exactly one file regardless of split.count, so any model that ships as multiple shards — Kimi K2.6 (14 shards), DeepSeek-V4-Flash (3 shards), DeepSeek-V3 quants on Bartowski, etc. — could not be loaded at all.

Verified locally that this unblocks the chain: combined with #96 / #103 / #133 / #135, larql extract --level inference now ingests Kimi K2.6 UD-Q8_K_XL (554 GB across 14 shards) and DeepSeek-V4-Flash (127 GB across 3 shards).

What changed

ShardInfo + shards: Vec<ShardInfo> on GgufFile — every GgufFile carries the full list of shards (length 1 for single-file). shards[0].path == self.path and shards[0].data_offset == self.data_offset always hold, so existing single-file callers don't need to change.
GgufFile::open detects multi-shard via:
- split.count > 1 in the metadata (canonical signal from llama-gguf-split), OR
- the filename matching <prefix>-<NNNNN>-of-<NNNNN>.gguf (fallback for splitters that omit the metadata).
Sibling discovery rebuilds filenames at the same width the user's shard uses (00001-of-00014, 001-of-003, etc.) and errors cleanly if any expected sibling is missing on disk.
GgufTensorInfo gains an internal shard_idx field; tensor infos from each shard are appended with the right index. The combined total is cross-checked against split.tensors.count when the splitter emits it.
load_tensors_filtered mmaps lazily per shard — shards whose tensors are all skipped by skip_key are never opened (matters for walk-only loads on big MoE models).
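
For reviewers who want a feel for the filename fallback, here is a minimal sketch of how a `<prefix>-<NNNNN>-of-<NNNNN>.gguf` name could be taken apart. It is not the code in this PR: the return type (a tuple of prefix, 1-based shard index, shard count, and digit width) and the exact rejection rules are assumptions chosen for illustration.

```rust
/// Hypothetical sketch of the filename fallback; the real helper in gguf.rs
/// may use a different signature and return type.
/// Returns (prefix, shard index, shard count, digit width) on a match.
fn parse_shard_filename(name: &str) -> Option<(String, u32, u32, usize)> {
    let stem = name.strip_suffix(".gguf")?;
    // Peel pieces off from the right so the prefix may itself contain '-'.
    let (rest, count_str) = stem.rsplit_once("-of-")?;
    let (prefix, idx_str) = rest.rsplit_once('-')?;

    let all_digits = |s: &str| !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit());
    // Both numbers must be plain digits padded to the same width.
    if !all_digits(idx_str) || !all_digits(count_str) || idx_str.len() != count_str.len() {
        return None;
    }

    let idx: u32 = idx_str.parse().ok()?;
    let count: u32 = count_str.parse().ok()?;
    // Assumption: shard numbers in filenames are 1-based and never exceed the count.
    if prefix.is_empty() || idx == 0 || idx > count {
        return None;
    }
    Some((prefix.to_string(), idx, count, idx_str.len()))
}
```

Splitting from the right is the design choice that matters here: model names such as Kimi-K2.6-UD-Q8_K_XL contain hyphens, so only the last two hyphen-separated fields are treated as shard numbering.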

The single-file path is untouched at the byte level: open routes through open_single (the previous open body), and loading goes through the same dequantization loop, just with a per-shard mmap.
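
To illustrate the lazy per-shard behaviour, the sketch below keeps one optional handle per shard and fills it only when a tensor that survives `skip_key` lives in that shard. It is a simplified model under stated assumptions, not the PR's loader: `TensorRef`, `LazyShards`, and `shards_touched` are hypothetical names, and a `String` stands in for the real mmap so the example needs no external crate.

```rust
/// Hypothetical stand-in for the PR's tensor info; only the fields needed here.
struct TensorRef {
    name: String,
    shard_idx: usize,
}

/// One slot per shard; a slot is filled the first time that shard is needed.
struct LazyShards<H> {
    handles: Vec<Option<H>>,
}

impl<H> LazyShards<H> {
    fn new(shard_count: usize) -> Self {
        Self { handles: (0..shard_count).map(|_| None).collect() }
    }

    /// Returns the cached handle, calling `open` only on first use.
    fn get_or_open(&mut self, idx: usize, open: impl FnOnce() -> H) -> &H {
        self.handles[idx].get_or_insert_with(open)
    }

    /// Indices of shards that were actually opened, in ascending order.
    fn opened(&self) -> Vec<usize> {
        self.handles
            .iter()
            .enumerate()
            .filter(|(_, h)| h.is_some())
            .map(|(i, _)| i)
            .collect()
    }
}

/// Walks the tensor list the way a filtered load might, and reports which
/// shards had to be opened. The String handle is a placeholder for an mmap.
fn shards_touched(
    infos: &[TensorRef],
    shard_count: usize,
    skip_key: impl Fn(&str) -> bool,
) -> Vec<usize> {
    let mut shards = LazyShards::new(shard_count);
    for info in infos {
        if skip_key(&info.name) {
            continue; // skipped tensors never trigger an open
        }
        let _handle = shards.get_or_open(info.shard_idx, || format!("shard-{}", info.shard_idx));
    }
    shards.opened()
}
```

With a `skip_key` that rejects every tensor in a given shard, that shard's slot stays `None`, which is the property that matters for walk-only loads on big MoE models.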

Tests

8 new + all existing pass (35 GGUF / 286 larql-models):

parse_shard_filename_canonical_layout — 5-digit Kimi-K2.6-UD-Q8_K_XL-00003-of-00014.gguf
parse_shard_filename_rejects_single_file — plain .gguf returns None
parse_shard_filename_rejects_unmatched_widths — 00003-of-0014 rejected (different widths)
parse_shard_filename_supports_3digit_split — 001-of-003 works
discover_shard_siblings_finds_all_in_order — caller can point at any shard; returns shard 1 → N in order
discover_shard_siblings_errors_when_one_missing — clean error, not panic, when a sibling file is gone
open_multi_shard_combines_tensors_from_all_shards — builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor is read from its own shard's data section
open_rejects_multi_shard_when_a_shard_file_is_missing — caller passes the lone shard 1; loader sees split.count=2, refuses to silently truncate
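
The sibling-discovery behaviour these tests pin down (any shard as the entry point, results in shard 1 → N order, a clean error on a missing file) can be sketched roughly as follows. This is an illustration only: `expected_siblings` is a hypothetical name, and the `exists` closure is an assumption standing in for a filesystem check so the logic can be exercised without real files.

```rust
/// Hypothetical sketch: rebuild every sibling filename at the given digit
/// width and fail with an error (not a panic) if one is missing.
/// `exists` stands in for a check such as `Path::exists` on the shard's directory.
fn expected_siblings(
    prefix: &str,
    count: u32,
    width: usize,
    exists: impl Fn(&str) -> bool,
) -> Result<Vec<String>, String> {
    let mut found = Vec::with_capacity(count as usize);
    for i in 1..=count {
        // Zero-pad both numbers to the width used by the shard the caller passed in.
        let name = format!("{prefix}-{i:0width$}-of-{count:0width$}.gguf");
        if !exists(&name) {
            return Err(format!("missing shard {i} of {count}: {name}"));
        }
        found.push(name);
    }
    Ok(found)
}
```

Because the names are generated from the prefix, count, and width alone, the result does not depend on which shard the caller pointed at.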

Verification with real big models

Tested on bearden+ai-main against the three models we currently care about, one single-file baseline and two multi-shard:

DeepSeek-V2-Lite-Chat.Q4_K (10.4 GB single-file): baseline, confirming the refactor didn't regress single-file loading. (Also exercises the in-memory GGUF→ModelWeights path landed in #133, "fix(extract): accept GGUF input (file or directory)", which closes #131.)
DeepSeek-V4-Flash Q3_K_M (127 GB, 3 shards) — loads, tensor infos combine across all three shards.
Kimi K2.6 UD-Q8_K_XL (554 GB, 14 shards) — same.

(I'll attach index.json artifacts once the inference-level vindex extractions finish — they're real model-scale extractions, currently running.)

Stacking note

The DeepSeek-V2/V3/Kimi MLA pipeline only really completes when this lands alongside #135 (which surfaces the MLA geometry fields from GGUF metadata). #135 unblocks the config plumbing; this PR unblocks the tensor loading. Either order of merge works — the PRs touch different parts of gguf.rs and don't conflict.

llama.cpp's gguf-split produces multi-file GGUFs (canonical naming: `<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full metadata header but only owns its own slice of tensors. The current `GgufFile::open` reads one file, so multi-shard models (Kimi K2.6 with 14 shards, DeepSeek-V4-Flash with 3 shards, and increasingly any large modern LLM) could not be loaded for vindex extraction.

This change:

1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>` field on `GgufFile`. Single-file GGUFs get a `shards.len() == 1`.
2. `GgufFile::open` detects multi-shard via the explicit `split.count` metadata key, falling back to the filename pattern when the splitter omits the metadata.
3. Discovers all sibling shards in the same directory by reconstructing filenames at the prefix's chosen width (`00001-of-00014` vs `001-of-003` both supported).
4. Appends each sibling's `tensor_infos` to the combined list, tagging them with the right `shard_idx`. Cross-checks the total against `split.tensors.count` when present.
5. `load_tensors_filtered` mmaps each shard lazily on first use and reads each tensor from `shards[info.shard_idx].path` at the right per-shard `data_offset`. Shards whose tensors are all skipped by `skip_key` are never opened.

Backward-compatible: existing `GgufFile::open` callers and the single-file test fixtures keep working with `shards = vec![…one…]`.

Tests (8 new + all existing pass):

- parse_shard_filename: canonical layout, plain `.gguf` rejection, mismatched widths rejection, 3-digit split width support
- discover_shard_siblings: complete set discovery from any-position shard, error when sibling missing
- open_multi_shard_combines_tensors_from_all_shards: builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor reads from its own shard's data section
- open_rejects_multi_shard_when_a_shard_file_is_missing
- existing 27 tests stay green; 286/286 larql-models tests pass

Combined with chrishayuk#96 (MLA absorption), chrishayuk#103 (Q3_K/Q5_K dequant), chrishayuk#133 (GGUF extract input), and chrishayuk#135 (DeepSeek-V2/V3 MLA metadata reading), this completes the chain: `larql extract --level inference` works end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard GGUFs.
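
As a rough illustration of step 4 in the commit message above (tagging each tensor with its shard and cross-checking the total), the sketch below merges per-shard tensor name lists. `TaggedTensor` and `combine_tensor_infos` are hypothetical names, the error type is a plain `String` for brevity, and real tensor infos carry more than a name.

```rust
/// Hypothetical, trimmed-down tensor record: just a name and its owning shard.
#[derive(Debug, Clone, PartialEq)]
struct TaggedTensor {
    name: String,
    shard_idx: usize,
}

/// Appends every shard's tensors in shard order, tagging each with its shard
/// index, then compares the total with the splitter's count when one is given.
/// `expected_total` models an optional `split.tensors.count` metadata value.
fn combine_tensor_infos(
    per_shard: Vec<Vec<String>>,
    expected_total: Option<usize>,
) -> Result<Vec<TaggedTensor>, String> {
    let mut combined = Vec::new();
    for (shard_idx, names) in per_shard.into_iter().enumerate() {
        for name in names {
            combined.push(TaggedTensor { name, shard_idx });
        }
    }
    if let Some(expected) = expected_total {
        if combined.len() != expected {
            return Err(format!(
                "split.tensors.count says {expected} tensors, shards contain {}",
                combined.len()
            ));
        }
    }
    Ok(combined)
}
```

Treating the count as optional mirrors the "when present" wording: a splitter that omits the key simply skips the check rather than failing the load.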

This was referenced May 24, 2026

fix(capabilities): accept MLA architectures when full geometry is exposed #137

Open

fix(gguf): fall back to expert_feed_forward_length for MoE-only configs #138

Open

feat(streaming): GGUF support in extract pipeline (browse-level) #139

Open

mvkorobkov wants to merge 1 commit into chrishayuk:main from mvkorobkov:feat/multishard-gguf
