feat(gguf): multi-shard reader for *-NNNNN-of-NNNNN.gguf splits#136
Open
mvkorobkov wants to merge 1 commit into
Open
feat(gguf): multi-shard reader for *-NNNNN-of-NNNNN.gguf splits#136mvkorobkov wants to merge 1 commit into
mvkorobkov wants to merge 1 commit into
Conversation
llama.cpp's gguf-split produces multi-file GGUFs (canonical naming: `<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full metadata header but only owns its own slice of tensors. The current `GgufFile::open` reads one file, so multi-shard models — Kimi K2.6 (14 shards), DeepSeek-V4-Flash (3 shards), and increasingly any large modern LLM — could not be loaded for vindex extraction. This change: 1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>` field on `GgufFile`. Single-file GGUFs get a `shards.len() == 1`. 2. `GgufFile::open` detects multi-shard via the explicit `split.count` metadata key, falling back to the filename pattern when the splitter omits the metadata. 3. Discovers all sibling shards in the same directory by reconstructing filenames at the prefix's chosen width (`00001-of-00014` vs `001-of-003` both supported). 4. Appends each sibling's `tensor_infos` to the combined list, tagging them with the right `shard_idx`. Cross-checks the total against `split.tensors.count` when present. 5. `load_tensors_filtered` mmaps each shard lazily on first use and reads each tensor from `shards[info.shard_idx].path` at the right per-shard `data_offset`. Shards whose tensors are all skipped by `skip_key` are never opened. Backward-compatible: existing `GgufFile::open` callers and the single-file test fixtures keep working with `shards = vec![…one…]`. Tests (8 new + all existing pass): - parse_shard_filename: canonical layout, plain `.gguf` rejection, mismatched widths rejection, 3-digit split width support - discover_shard_siblings: complete set discovery from any-position shard, error when sibling missing - open_multi_shard_combines_tensors_from_all_shards: builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor reads from its own shard's data section - open_rejects_multi_shard_when_a_shard_file_is_missing - existing 27 tests stay green; 286/286 larql-models tests pass Combined with chrishayuk#96 (MLA absorption), chrishayuk#103 (Q3_K/Q5_K dequant), chrishayuk#133 (GGUF extract input), and chrishayuk#135 (DeepSeek-V2/V3 MLA metadata reading), this completes the chain — `larql extract --level inference` works end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard GGUFs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds first-class support for multi-shard GGUF models (
<prefix>-<NNNNN>-of-<NNNNN>.gguf) solarql extractcan load the increasingly-common big-model splits produced byllama-gguf-split. Before this PR,GgufFile::openread exactly one file regardless ofsplit.count, so any model that ships as multiple shards — Kimi K2.6 (14 shards), DeepSeek-V4-Flash (3 shards), DeepSeek-V3 quants on Bartowski, etc. — could not be loaded at all.Verified locally that this unblocks the chain: combined with #96 / #103 / #133 / #135,
larql extract --level inferencenow ingests Kimi K2.6 UD-Q8_K_XL (554 GB across 14 shards) and DeepSeek-V4-Flash (127 GB across 3 shards).What changed
ShardInfo+shards: Vec<ShardInfo>onGgufFile— everyGgufFilecarries the full list of shards (length 1 for single-file).shards[0].path == self.pathandshards[0].data_offset == self.data_offsetalways hold, so existing single-file callers don't need to change.GgufFile::opendetects multi-shard via:split.count > 1in the metadata (canonical signal fromllama-gguf-split), OR<prefix>-<NNNNN>-of-<NNNNN>.gguf(fallback for splitters that omit the metadata).00001-of-00014,001-of-003, etc.), errors cleanly if any expected sibling is missing on disk.GgufTensorInfogainsshard_idxinternal field; tensor infos from each shard are appended with the right index. Cross-checks againstsplit.tensors.countwhen the splitter emits it.load_tensors_filteredmmaps lazily per shard — shards whose tensors are all skipped byskip_keyare never opened (matters for walk-only loads on big MoE models).The single-file path is untouched at the byte level — open paths through
open_single(the previousopenbody), and load paths through the same dequantization loop, just with a per-shard mmap.Tests
8 new + all existing pass (35 GGUF / 286 larql-models):
parse_shard_filename_canonical_layout— 5-digitKimi-K2.6-UD-Q8_K_XL-00003-of-00014.ggufparse_shard_filename_rejects_single_file— plain.ggufreturns Noneparse_shard_filename_rejects_unmatched_widths—00003-of-0014rejected (different widths)parse_shard_filename_supports_3digit_split—001-of-003worksdiscover_shard_siblings_finds_all_in_order— caller can point at any shard; returns shard 1 → N in orderdiscover_shard_siblings_errors_when_one_missing— clean error, not panic, when a sibling file is goneopen_multi_shard_combines_tensors_from_all_shards— builds two real 2-shard GGUFs with disjoint tensor sets, opens via either shard, verifies each tensor is read from its own shard's data sectionopen_rejects_multi_shard_when_a_shard_file_is_missing— caller passes the lone shard 1; loader seessplit.count=2, refuses to silently truncateVerification with real big models
Tested on bearden+ai-main against the three multi-shard models we currently care about:
(I'll attach
index.jsonartifacts once the inference-level vindex extractions finish — they're real model-scale extractions, currently running.)Stacking note
The DeepSeek-V2/V3/Kimi MLA pipeline only really completes when this lands alongside #135 (which surfaces the MLA geometry fields from GGUF metadata). #135 unblocks the config plumbing; this PR unblocks the tensor loading. Either order of merge works — the PRs touch different parts of
gguf.rsand don't conflict.