feat(streaming): GGUF support in extract pipeline (browse-level) #139

Open: mvkorobkov wants to merge 5 commits into chrishayuk:main from mvkorobkov:feat/streaming-gguf


Summary

Adds GGUF to the streaming-extract pipeline alongside safetensors. Today GGUF input is routed through the in-memory load_model_dir_validated path which dequantises every tensor to f32 in RAM — fine for small models but architecturally unworkable for ≥70B GGUFs (Kimi K2.6 at 554 GB, DS-V4-Flash at 127 GB).

Design — TensorSource enum

enum TensorSource {
    Safetensors { shards, index },   // existing
    Gguf(GgufTensorSource),          // new
}

StreamingContext::new detects the input format (a .gguf file directly, or a directory whose first / largest .gguf shard is used as the entry point) and constructs the appropriate variant. Each stage now calls self.tensor_source.get_tensor_f32(key) for the canonical 2D dequant path. The MXFP4 raw-pair access (DeepSeek-V4 packed gate_up_proj_blocks / down_proj_blocks) stays safetensors-only — GGUF has no equivalent packed format.
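The dispatch described above can be sketched as follows. This is a minimal illustration, not the PR's code: the payload types are simplified to in-memory maps, `detect_format` and `InputFormat` are hypothetical names, and directory scanning for the first / largest `.gguf` shard is left out.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Stand-in for the real GGUF-backed source; here it only holds f32 data.
struct GgufTensorSource {
    tensors: HashMap<String, Vec<f32>>,
}

/// Simplified mirror of the `TensorSource` enum (payloads are assumptions).
enum TensorSource {
    Safetensors { tensors: HashMap<String, Vec<f32>> },
    Gguf(GgufTensorSource),
}

impl TensorSource {
    /// Single entry point that every stage calls, whatever the backing format.
    fn get_tensor_f32(&self, key: &str) -> Option<&[f32]> {
        match self {
            TensorSource::Safetensors { tensors } => tensors.get(key).map(|v| v.as_slice()),
            TensorSource::Gguf(src) => src.tensors.get(key).map(|v| v.as_slice()),
        }
    }
}

#[derive(Debug, PartialEq)]
enum InputFormat {
    Safetensors,
    Gguf,
}

/// Extension-based detection for a single file (hypothetical helper).
/// Returns `None` for anything else, e.g. a directory that needs scanning.
fn detect_format(path: &Path) -> Option<InputFormat> {
    match path.extension().and_then(|e| e.to_str()) {
        Some(ext) if ext.eq_ignore_ascii_case("gguf") => Some(InputFormat::Gguf),
        Some(ext) if ext.eq_ignore_ascii_case("safetensors") => Some(InputFormat::Safetensors),
        _ => None,
    }
}
```

Keeping the format choice inside one enum means the stages never branch on the input type themselves; they only see `get_tensor_f32`.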

GGUF specifics

  • Multi-shard splits are handled via GgufFile::open (feat(gguf): multi-shard reader for *-NNNNN-of-NNNNN.gguf splits #136). Each shard is mmap'd eagerly — virtual address space only, the OS pages in only the tensors we touch.
  • Per-tensor read does data_offset + offset into the right shard, slices tensor_data_size bytes, and dequantises via larql_models::quant::ggml::dequantize (Q4_K / Q5_K / Q6_K / Q8_0 / BF16 / F32 etc — all already supported).
  • Dim ordering follows load_gguf's reshape-to-(dims[1], dims[0]) convention. Canonical FFN orientation (the in-memory loader's orient_in_place) is applied here too, driven by (hidden_size, intermediate_size) from the detected architecture — without it tensor.shape()[0] would be hidden instead of intermediate for some quants and downstream matmul would produce NaN.
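A rough sketch of the per-tensor byte range and the orientation step, under stated assumptions: `tensor_byte_range`, `Matrix`, and `orient_ffn` are hypothetical names, the matrix is plain row-major f32, and `orient_ffn` only covers the gate-style case where rows should equal `intermediate`. The real `orient_in_place` may handle more cases.

```rust
use std::ops::Range;

/// Byte range of one tensor inside its shard: the shard's data-section start
/// plus the tensor's offset within that section, spanning `size` bytes.
/// Returns `None` on arithmetic overflow instead of wrapping.
fn tensor_byte_range(data_offset: usize, offset: usize, size: usize) -> Option<Range<usize>> {
    let start = data_offset.checked_add(offset)?;
    let end = start.checked_add(size)?;
    Some(start..end)
}

/// Row-major 2D f32 matrix (illustrative stand-in for the real tensor type).
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>,
}

/// Plain out-of-place transpose.
fn transpose(m: &Matrix) -> Matrix {
    let mut data = vec![0.0; m.data.len()];
    for r in 0..m.rows {
        for c in 0..m.cols {
            data[c * m.rows + r] = m.data[r * m.cols + c];
        }
    }
    Matrix { rows: m.cols, cols: m.rows, data }
}

/// If the tensor arrived as (hidden, intermediate), flip it so that
/// rows == intermediate. Square tensors are left alone because the shape
/// alone cannot tell the two orientations apart.
fn orient_ffn(m: Matrix, hidden: usize, intermediate: usize) -> Matrix {
    if hidden != intermediate && m.rows == hidden && m.cols == intermediate {
        transpose(&m)
    } else {
        m
    }
}
```

The range would then be used to slice the mmap'd shard before handing the bytes to the dequantiser.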

CLI routing

| Input | Level / quant | Path |
| --- | --- | --- |
| safetensors | any | streaming |
| GGUF | --level browse (and no quant) | streaming (new) |
| GGUF | attention / inference / all | in-memory (unchanged) |
| GGUF | any level with --quant q4k | in-memory (unchanged) |

Inference / Q4K levels for GGUF still need the StreamingWeights writer subsystem ported to read tensors via ggml::dequantize per tensor — tracked as a follow-on PR. The stage gate returns a clear VindexError::Parse(…) if the user requests an unsupported level/quant combo for GGUF input.
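The routing rules and the stage gate can be expressed as a small decision function. The sketch below is illustrative only: the enums, `route`, and `streaming_stage_gate` are hypothetical names, and `GateError::Parse` stands in for `VindexError::Parse(…)`, whose real payload is not shown in this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Input {
    Safetensors,
    Gguf,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Browse,
    Attention,
    Inference,
    All,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Quant {
    None,
    Q4k,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Route {
    Streaming,
    InMemory,
}

/// Hypothetical error type standing in for `VindexError::Parse(...)`.
#[derive(Debug, PartialEq)]
enum GateError {
    Parse(String),
}

/// CLI routing: safetensors always streams; GGUF streams only at browse
/// level without quantisation; every other GGUF combination stays in-memory.
fn route(input: Input, level: Level, quant: Quant) -> Route {
    match (input, level, quant) {
        (Input::Safetensors, _, _) => Route::Streaming,
        (Input::Gguf, Level::Browse, Quant::None) => Route::Streaming,
        (Input::Gguf, _, _) => Route::InMemory,
    }
}

/// Gate inside the streaming path: reject combinations it cannot serve.
fn streaming_stage_gate(input: Input, level: Level, quant: Quant) -> Result<(), GateError> {
    match route(input, level, quant) {
        Route::Streaming => Ok(()),
        Route::InMemory => Err(GateError::Parse(format!(
            "streaming extract does not support {:?} input at level {:?} with quant {:?}",
            input, level, quant
        ))),
    }
}
```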

Validation

  • DS-R1-0528-Qwen3-8B-Q3_K_L (10 GB, dense, mixed Q3_K / Q4_K / Q6_K) → 3.4 GB gate_vectors.bin + 1.2 GB embeddings.bin written cleanly through the streaming path. Resident memory stays at the per-layer-tensor scale, not the full-model scale.
  • 1074 existing vindex unit tests pass.

Known gap (pre-existing)

The streaming pipeline's MoE branch looks up per-expert 2D keys (mlp.experts.K.gate_proj.weight) which GGUF stores as 3D-packed tensors (blk.L.ffn_gate_exps.weight, [hidden, intermediate, n_experts]). Both streaming and in-memory load_gguf currently skip these 3D tensors. Unpacking them lives in the same follow-on PR as inference-level GGUF.
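For orientation only, one way the follow-on unpacking could slice a single expert out of a packed tensor. This is a sketch under assumptions that this PR does not confirm: the tensor is already dequantised to a flat f32 buffer, and the expert axis is the slowest-varying dimension, so each expert occupies one contiguous `d0 * d1` block. `expert_slice` is a hypothetical helper.

```rust
/// Borrow the 2D block belonging to one expert from a packed 3D buffer.
/// `dims` is `[d0, d1, n_experts]`, with the expert axis assumed slowest.
/// Returns `None` if the expert index or the buffer length does not fit.
fn expert_slice(packed: &[f32], dims: [usize; 3], expert: usize) -> Option<&[f32]> {
    let [d0, d1, n_experts] = dims;
    if expert >= n_experts {
        return None;
    }
    let per_expert = d0.checked_mul(d1)?;
    let start = expert.checked_mul(per_expert)?;
    let end = start.checked_add(per_expert)?;
    packed.get(start..end)
}
```

Because the slice borrows from the packed buffer, no per-expert copy is needed until the stage actually consumes the data.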

Test plan

  • cargo check --workspace clean
  • cargo test -p larql-vindex --lib — 1074 / 1074 pass
  • End-to-end on dense GGUF (Qwen3-8B Q3_K_L) — gate_vectors.bin + embeddings.bin populated correctly
  • End-to-end on multi-shard GGUF (Kimi K2.6 at browse level) — pending, this PR unblocks it
  • Inference / Q4K for GGUF — follow-on PR

Mykhailo Korobkov added 5 commits May 24, 2026 14:24
llama.cpp's gguf-split produces multi-file GGUFs (canonical naming:
`<prefix>-<NNNNN>-of-<NNNNN>.gguf`). Each shard carries the full
metadata header but only owns its own slice of tensors. The current
`GgufFile::open` reads one file, so multi-shard models — Kimi K2.6
(14 shards), DeepSeek-V4-Flash (3 shards), and increasingly any large
modern LLM — could not be loaded for vindex extraction.

This change:

1. Adds `ShardInfo` (path + data_offset) and a `shards: Vec<ShardInfo>`
   field on `GgufFile`. Single-file GGUFs have `shards.len() == 1`.
2. `GgufFile::open` detects multi-shard via the explicit `split.count`
   metadata key, falling back to the filename pattern when the splitter
   omits the metadata.
3. Discovers all sibling shards in the same directory by reconstructing
   filenames at the prefix's chosen width (`00001-of-00014` vs `001-of-003`
   both supported).
4. Appends each sibling's `tensor_infos` to the combined list, tagging
   them with the right `shard_idx`. Cross-checks the total against
   `split.tensors.count` when present.
5. `load_tensors_filtered` mmaps each shard lazily on first use and
   reads each tensor from `shards[info.shard_idx].path` at the right
   per-shard `data_offset`. Shards whose tensors are all skipped by
   `skip_key` are never opened.
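The filename handling in steps 2 and 3 can be sketched like this. The function names echo the test names listed below, but the signatures, the `ShardName` struct, and `sibling_names` are assumptions made for illustration, not the code in this commit.

```rust
/// Parsed pieces of a `<prefix>-<NNNNN>-of-<NNNNN>.gguf` filename.
#[derive(Debug, PartialEq)]
struct ShardName {
    prefix: String,
    index: usize, // 1-based, as in `00001-of-00014`
    count: usize,
    width: usize, // digit width shared by index and count
}

/// Returns `None` for plain `.gguf` names, mismatched digit widths,
/// non-digit fields, or an index outside `1..=count`.
fn parse_shard_filename(name: &str) -> Option<ShardName> {
    let stem = name.strip_suffix(".gguf")?;
    let (rest, count_str) = stem.rsplit_once("-of-")?;
    let (prefix, index_str) = rest.rsplit_once('-')?;
    let width = index_str.len();
    if width == 0 || width != count_str.len() {
        return None;
    }
    let all_digits = |s: &str| s.bytes().all(|b| b.is_ascii_digit());
    if !all_digits(index_str) || !all_digits(count_str) {
        return None;
    }
    let index: usize = index_str.parse().ok()?;
    let count: usize = count_str.parse().ok()?;
    if index == 0 || index > count {
        return None;
    }
    Some(ShardName { prefix: prefix.to_string(), index, count, width })
}

/// Rebuild every sibling filename at the same zero-padded width.
fn sibling_names(shard: &ShardName) -> Vec<String> {
    (1..=shard.count)
        .map(|i| {
            format!(
                "{}-{:0width$}-of-{:0width$}.gguf",
                shard.prefix,
                i,
                shard.count,
                width = shard.width
            )
        })
        .collect()
}
```

Splitting from the right keeps prefixes that themselves contain hyphens intact.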

Backward-compatible: existing `GgufFile::open` callers and the
single-file test fixtures keep working with `shards = vec![…one…]`.

Tests (8 new + all existing pass):

- parse_shard_filename: canonical layout, plain `.gguf` rejection,
  mismatched widths rejection, 3-digit split width support
- discover_shard_siblings: complete set discovery from any-position
  shard, error when sibling missing
- open_multi_shard_combines_tensors_from_all_shards: builds two real
  2-shard GGUFs with disjoint tensor sets, opens via either shard,
  verifies each tensor reads from its own shard's data section
- open_rejects_multi_shard_when_a_shard_file_is_missing
- existing 27 tests stay green; 286/286 larql-models tests pass

Combined with chrishayuk#96 (MLA absorption), chrishayuk#103 (Q3_K/Q5_K dequant), chrishayuk#133
(GGUF extract input), and chrishayuk#135 (DeepSeek-V2/V3 MLA metadata reading),
this completes the chain — `larql extract --level inference` works
end-to-end on Kimi K2.6 UD-Q8_K_XL and DeepSeek-V4-Flash multi-shard
GGUFs.

The streaming extract pipeline in `larql-vindex` needs per-tensor
metadata access to look up the right shard / byte range / quant type
for each tensor on demand (without bulk-loading a 500 GB+ MoE model
into RAM). All the building blocks were already on `GgufFile.shards`
and the free helpers `normalize_gguf_key` / `dequantize` /
`tensor_data_size`; this commit only adds the read-only accessors on
`GgufTensorInfo` so a consumer can:

    for info in &gguf.tensor_infos {
        let hf_key = normalize_gguf_key(info.name());
        if !want(&hf_key) { continue; }
        let shard = &gguf.shards[info.shard_idx()];
        // mmap shard, slice [shard.data_offset + info.offset()
        //                    .. + tensor_data_size(info.tensor_type(), n_elements)?],
        // dequantize to f32, reshape to (dims[1], dims[0]).
    }

No behaviour change — purely additive accessors. Used in the
follow-up streaming-GGUF work that lets `build_vindex_streaming`
ingest GGUF inputs alongside safetensors.

Adds GGUF to the streaming-extract pipeline alongside safetensors. Until
now, GGUF input was routed through the in-memory `load_model_dir_validated`
path which dequantises every tensor to f32 in RAM — fine for small models
but architecturally unworkable for ≥70B GGUFs (Kimi K2.6 at 554 GB,
DS-V4-Flash at 127 GB).

Design — TensorSource enum:

  enum TensorSource {
      Safetensors { shards, index },   // existing
      Gguf(GgufTensorSource),          // new
  }

`StreamingContext::new` detects the input format (`.gguf` file directly,
or a directory whose first/largest `.gguf` shard is used as the entry
point) and constructs the appropriate variant. Each stage now calls
`self.tensor_source.get_tensor_f32(key)` for the canonical 2D dequant
path. The MXFP4 raw-pair access (DeepSeek-V4 packed gate_up_proj_blocks /
down_proj_blocks) stays safetensors-only — GGUF has no equivalent
packed format.

GGUF specifics:

- Multi-shard splits are handled via `GgufFile::open` (added previously);
  each shard is mmap'd eagerly (virtual address space only — the OS
  pages in only what we touch).
- Per-tensor read does `data_offset + offset` into the right shard,
  slices `tensor_data_size` bytes, and dequantises via
  `larql_models::quant::ggml::dequantize` (Q4_K / Q5_K / Q6_K / Q8_0 /
  BF16 / F32 etc — all already supported by the workspace).
- The dim ordering convention matches `load_gguf`'s reshape to
  `(dims[1], dims[0])`. Canonical FFN orientation (the in-memory
  loader's `orient_in_place`) is applied here too, driven by
  `(hidden_size, intermediate_size)` from the detected architecture —
  without it `tensor.shape()[0]` would be `hidden` instead of
  `intermediate` for some quants and downstream matmul would produce
  NaN.

CLI routing:

  - safetensors (any level)            → streaming
  - GGUF + browse + quant=none         → streaming (new)
  - GGUF + attention/inference/all     → in-memory (unchanged)
  - GGUF + any level with --quant q4k  → in-memory (unchanged)

Inference / Q4K levels for GGUF still need the `StreamingWeights` writer
subsystem (Q4_K + f32 attn/FFN writers) ported to read tensors via
`ggml::dequantize` per tensor — that's the follow-on PR. The stage gate
returns a clear `VindexError::Parse(...)` if the user requests an
unsupported level/quant combo for GGUF input.

Validation:

- DS-R1-0528-Qwen3-8B-Q3_K_L (10 GB, dense, mixed Q3_K/Q4_K/Q6_K) →
  3.4 GB gate_vectors.bin + 1.2 GB embeddings.bin written cleanly
  through the streaming path on ai-main.
- 1074 vindex unit tests pass.

Known gap (pre-existing, not introduced here): the streaming pipeline's
MoE branch looks up per-expert 2D keys (`mlp.experts.K.gate_proj.weight`)
which GGUF stores as 3D-packed tensors (`blk.L.ffn_gate_exps.weight`,
`[hidden, intermediate, n_experts]`). Both the streaming pipeline and
the in-memory `load_gguf` currently skip these 3D tensors. Unpacking
them lives in the same follow-on PR as inference-level GGUF.

Previously the streaming `down_meta` stage accumulated every layer's
feature meta in memory and called `write_binary` exactly once at the
end of the projection loop. For a dense ≥30B model, that loop is a
single-threaded matmul that can run for an hour, so killing the run
mid-projection lost every completed layer's work.

Fix: snapshot `all_down_meta` to `down_meta.bin` after each layer
finishes. `write_binary` already uses a tempfile + atomic rename, so
the on-disk file is never in a half-written state — readers always see
either the previous snapshot or the new one.

The loop is restructured from `iter_mut().enumerate()` to index-based
iteration so the per-iteration mutable borrow on `all_down_meta[layer]`
drops before the immutable borrow `write_binary` needs.

Cost: ~1.5 MB extra write per layer (well under the per-layer matmul
time). Benefit: a killed run preserves every completed layer of
projection — a 40-min interruption no longer loses 40 min of work.

The final write after the loop is kept for the resumed-from-checkpoint
branch (where the loop runs zero iterations).
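
The per-layer snapshot pattern can be sketched with the standard library alone. This is an assumption-laden illustration: `write_atomic` and `project_with_checkpoints` are hypothetical names, the per-layer meta is reduced to raw bytes, and the temp-file naming is invented here; the real `write_binary` may differ.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Write to a sibling temp file, then rename over the target. On the same
/// filesystem the rename replaces the file in one step, so a reader sees
/// either the old snapshot or the new one. (No fsync here; a sketch only.)
fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = path.with_extension("tmp");
    fs::write(&tmp, bytes)?;
    fs::rename(&tmp, path)
}

/// Index-based loop: the mutable borrow of `all_meta[layer]` ends with the
/// assignment, so the whole slice can be read for the snapshot right after.
fn project_with_checkpoints(
    all_meta: &mut [Vec<u8>],
    out: &Path,
    mut project: impl FnMut(usize) -> Vec<u8>,
) -> io::Result<()> {
    for layer in 0..all_meta.len() {
        all_meta[layer] = project(layer);
        // Snapshot everything computed so far; later layers are still empty.
        let snapshot: Vec<u8> = all_meta.iter().flatten().copied().collect();
        write_atomic(out, &snapshot)?;
    }
    Ok(())
}
```

If the process is killed during layer N, the file on disk still holds the snapshot written after layer N-1.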
The DeepSeek-V4 family emits only `{arch}.expert_feed_forward_length`,
never the global `{arch}.feed_forward_length`, because no dense FFN
layer exists above the per-expert size. The current loader reads only
the global key, so `intermediate_size` comes back as `0` and config
validation rejects the model:

  Error: failed to load GGUF model: config validation failed:
  [ConfigValidationError { field: "intermediate_size",
   message: "must be greater than 0" }]

This is the same fix as upstream PR chrishayuk#138, applied directly to this
branch so DS-V4-Flash can flow through the streaming-GGUF path. (chrishayuk#138
will land independently; this commit becomes a no-op once it merges.)
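
A minimal sketch of the fallback, assuming the metadata has already been read into a map of integer values. The `intermediate_size` helper and the `example_arch` prefix used in the usage note are hypothetical; the actual fix in chrishayuk#138 may be structured differently.

```rust
use std::collections::HashMap;

/// Prefer the global feed-forward length; fall back to the per-expert key
/// when the global one is absent or zero. Returns `None` if neither key
/// yields a positive value.
fn intermediate_size(meta: &HashMap<String, u64>, arch: &str) -> Option<u64> {
    let keys = [
        format!("{}.feed_forward_length", arch),
        format!("{}.expert_feed_forward_length", arch),
    ];
    keys.iter()
        .filter_map(|k| meta.get(k).copied())
        .find(|&n| n > 0)
}
```

With only `example_arch.expert_feed_forward_length` present, the helper returns that value instead of the `0` that tripped config validation.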