feat: add PrismML Q1_0/Q1_0_G128 1-bit ternary quantization support #6
Open
carlosfundora wants to merge 1 commit into
Port PrismML's custom Q1_0 (32-element blocks) and Q1_0_G128 (128-element blocks) 1-bit ternary quantization types into the turboquant fork. This enables loading and running Bonsai 1-bit models (1.7B, 4B, 8B), which use these types.

Changes:
- ggml.h: Add GGML_TYPE_Q1_0=42, GGML_TYPE_Q1_0_G128=43, bump COUNT to 44. Add GGML_FTYPE_MOSTLY_Q1_0=27, GGML_FTYPE_MOSTLY_Q1_0_G128=28
- ggml-common.h: Add block_q1_0 (6 bytes) and block_q1_0_g128 (18 bytes) struct definitions with static_asserts
- ggml-quants.h/c: Add quantize_row, dequantize_row, and quantize functions for both Q1_0 and Q1_0_G128
- ggml.c: Add type_traits entries, ftype switch cases, quantize dispatch
- ggml-cpu/ggml-cpu.c: Register CPU type traits with vec_dot support
- ggml-cpu/quants.h/c: Add vec_dot implementations (Q1_0 x Q8_0 and Q1_0_G128 x Q8_0) and quantize_row wrappers
- ggml-cpu/ops.cpp: Add Q1_0/Q1_0_G128 to all quantized-type switch statements (get_rows, set_rows, cpy, etc.)
- gguf.cpp: Add PrismML compatibility remap: when general.file_type indicates a PrismML model (ftype 40 or 41), remap tensor type IDs 40->42 (Q1_0) and 41->43 (Q1_0_G128) to avoid collision with NVFP4 (40) and TQ3_0 (41)
- gguf-py/constants.py: Add Python enum entries and block size mappings

Type ID conflict resolution:
- PrismML uses type 40=Q1_0, 41=Q1_0_g128
- Turboquant uses type 40=NVFP4, 41=TQ3_0
- This port assigns Q1_0=42, Q1_0_G128=43 and remaps at GGUF load time

Algorithm: Ternary quantization where each value is encoded as a single sign bit (1=+scale, 0=-scale), with scale = mean(abs(block_values)).

Tested: Bonsai-4B and Bonsai-8B GGUFs load successfully and produce coherent output on CPU inference (0.7 t/s for 4B, 0.3 t/s for 8B).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
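The encoding described in the commit message can be sketched in a few lines of Python. This is an illustration only, not the code in ggml-quants.c: it assumes each block stores a little-endian fp16 scale followed by the sign bits packed least-significant-bit first, which is consistent with the stated sizes (2 + 32/8 = 6 bytes for Q1_0, 2 + 128/8 = 18 bytes for Q1_0_G128) but is not confirmed by the PR text. The helper names and the treatment of zero as positive are likewise assumptions.

```python
import struct


def quantize_block(values):
    """Pack one block of floats as an fp16 scale plus one sign bit per value.

    Assumed layout (not confirmed by the PR): 2-byte little-endian fp16 scale,
    then len(values) / 8 bytes of sign bits, least-significant bit first.
    """
    n = len(values)
    if n % 8 != 0:
        raise ValueError("block length must be a multiple of 8")
    # scale = mean(abs(block_values)), as stated in the commit message
    scale = sum(abs(v) for v in values) / n
    bits = bytearray(n // 8)
    for i, v in enumerate(values):
        if v >= 0:  # bit=1 -> +scale, bit=0 -> -scale (zero treated as positive here)
            bits[i // 8] |= 1 << (i % 8)
    return struct.pack("<e", scale) + bytes(bits)


def dequantize_block(block, n):
    """Inverse of quantize_block: expand n sign bits back to +/- scale."""
    (scale,) = struct.unpack_from("<e", block, 0)
    bits = block[2:]
    return [scale if (bits[i // 8] >> (i % 8)) & 1 else -scale for i in range(n)]


# A 32-value block packs into 6 bytes, a 128-value block into 18 bytes.
q1_0_block = quantize_block([0.5, -1.5] * 16)
q1_0_g128_block = quantize_block([0.25] * 128)
print(len(q1_0_block), len(q1_0_g128_block))
```

Every value in a block dequantizes to either +scale or -scale, so the original magnitudes survive only through the block mean.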
Summary
Ports PrismML's custom Q1_0 and Q1_0_G128 1-bit ternary quantization types into llama-turboquant. This enables loading and running Bonsai 1-bit models (1.7B, 4B, 8B) that use these types.
Problem
PrismML's fork assigns type IDs 40 (Q1_0) and 41 (Q1_0_G128), which collide with NVFP4 (40) and TQ3_0 (41) in this fork. When turboquant tries to read a PrismML GGUF, it misinterprets tensor sizes, causing offset mismatches and load failures.
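To see why a wrong type ID breaks loading, it helps to look at how a tensor's byte size follows from its block layout. The sketch below is illustrative Python, not the loader's code. The Q1_0 and Q1_0_G128 layouts come from this PR; Q8_0 (32 elements, 34 bytes per block) is used only as a stand-in for "some other type" and is not one of the colliding types, whose layouts are not given here. The function name and the layout table are hypothetical.

```python
def tensor_nbytes(n_elements, block_elems, block_bytes):
    """Bytes occupied by a quantized tensor stored as whole blocks."""
    if n_elements % block_elems != 0:
        raise ValueError("element count must be a multiple of the block size")
    return n_elements // block_elems * block_bytes


# (elements per block, bytes per block)
LAYOUTS = {
    "Q1_0": (32, 6),        # from this PR
    "Q1_0_G128": (128, 18), # from this PR
    "Q8_0": (32, 34),       # stand-in only, not a colliding type
}

n = 4096 * 4096  # example weight matrix
for name, (block_elems, block_bytes) in LAYOUTS.items():
    print(name, tensor_nbytes(n, block_elems, block_bytes))
```

The same element count yields very different byte sizes under different layouts, so a reader that resolves ID 40 or 41 to the wrong type computes sizes that disagree with the offsets stored in the file.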
Solution
When general.file_type indicates a PrismML model (ftype 40 or 41), remap tensor type IDs 40→42 and 41→43
Algorithm
Ternary quantization: each value is encoded as 1 sign bit (bit=1 → +scale, bit=0 → −scale), where scale = mean(abs(block_values)).
Files Changed (11 files, +282 lines)
- ggml/include/ggml.h
- ggml/src/ggml-common.h
- ggml/src/ggml-quants.h/c
- ggml/src/ggml.c
- ggml/src/ggml-cpu/ggml-cpu.c
- ggml/src/ggml-cpu/quants.h/c
- ggml/src/ggml-cpu/ops.cpp
- ggml/src/gguf.cpp
- gguf-py/gguf/constants.py
Testing
Known Limitations
ggml-cuda/dequantize kernels
Risks
general.file_type KV metadata; hand-crafted GGUFs without this field won't trigger remap
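The remap rule and the risk above can be sketched as follows. This is a Python illustration of the logic described in this PR, not the C++ code in gguf.cpp; the function and constant names are hypothetical, and a missing general.file_type key is modeled as None.

```python
PRISMML_FTYPES = {40, 41}          # file types that mark a PrismML model
PRISMML_TYPE_REMAP = {40: 42, 41: 43}  # PrismML Q1_0 / Q1_0_G128 -> IDs in this fork


def remap_tensor_type(file_type, type_id):
    """Return the tensor type ID to use in this fork.

    file_type is the value of general.file_type, or None if the key is absent.
    """
    if file_type in PRISMML_FTYPES:
        return PRISMML_TYPE_REMAP.get(type_id, type_id)
    # No PrismML marker: IDs 40 and 41 keep their meaning in this fork.
    return type_id


print(remap_tensor_type(40, 40))    # PrismML file, Q1_0 tensor
print(remap_tensor_type(None, 40))  # no general.file_type: left untouched
```

Because the decision depends only on general.file_type, a file that omits the key passes through unchanged, and its type 40 and 41 tensors are read with this fork's own meaning for those IDs.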