feat: add PrismML Q1_0/Q1_0_G128 1-bit ternary quantization support by carlosfundora · Pull Request #6 · unixsysdev/llama-turboquant

carlosfundora · 2026-04-01T16:36:48Z

Summary

Ports PrismML's custom Q1_0 and Q1_0_G128 1-bit ternary quantization types into llama-turboquant. This enables loading and running Bonsai 1-bit models (1.7B, 4B, 8B) that use these types.

Problem

PrismML's fork assigns type IDs 40 (Q1_0) and 41 (Q1_0_G128), which collide with NVFP4 (40) and TQ3_0 (41) in this fork. When turboquant tries to read a PrismML GGUF, it misinterprets tensor sizes, causing offset mismatches and load failures.

Solution

Assign new non-conflicting type IDs: Q1_0 = 42, Q1_0_G128 = 43
Add GGUF load-time remapping: when general.file_type indicates a PrismML model (ftype 40 or 41), remap tensor type IDs 40→42 and 41→43
Full CPU inference path: block structs, quantize/dequantize, vec_dot (Q1_0 × Q8_0), get_rows, set_rows, cpy ops
Python GGUF constants updated with enum entries and block size mappings
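
The load-time remap described above can be sketched roughly as follows. This is an illustration in Python, not the actual shim in `ggml/src/gguf.cpp`; the names `PRISMML_FTYPES`, `PRISMML_TYPE_REMAP`, and `remap_tensor_type` are hypothetical, and the only facts taken from this PR are the ftype values (40, 41) and the 40→42 / 41→43 mapping.

```python
# Hypothetical names for illustration; the real shim lives in ggml/src/gguf.cpp.
PRISMML_FTYPES = {40, 41}               # general.file_type values treated as PrismML
PRISMML_TYPE_REMAP = {40: 42, 41: 43}   # PrismML Q1_0 -> 42, Q1_0_G128 -> 43


def remap_tensor_type(file_type, tensor_type):
    """Return the tensor type ID this fork should use.

    file_type is the value of the general.file_type KV entry, or None when
    the GGUF does not carry that field (in which case nothing is remapped).
    """
    if file_type in PRISMML_FTYPES:
        # Only IDs 40 and 41 are rewritten; every other type passes through.
        return PRISMML_TYPE_REMAP.get(tensor_type, tensor_type)
    return tensor_type
```

For example, `remap_tensor_type(41, 41)` yields 43, while `remap_tensor_type(None, 40)` leaves the ID at 40, which is the hand-crafted-GGUF case listed under Risks.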

Algorithm

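Each value is encoded as one sign bit (bit=1 → +scale, bit=0 → −scale), where scale = mean(abs(block_values)) is computed per block. This yields two levels per weight, so the scheme is effectively binary despite the "ternary" name.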

Q1_0: 32-element blocks, 6 bytes each (fp16 scale + 4 bytes sign bits)
Q1_0_G128: 128-element blocks, 18 bytes each (fp16 scale + 16 bytes sign bits)
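
A minimal Python sketch of the block encoding, to make the byte counts concrete. It assumes a little-endian fp16 scale followed by sign bits packed LSB-first, and it treats zero as positive; the PR does not state the bit order or the zero convention, so both are assumptions, and `quantize_block` / `dequantize_block` are hypothetical names rather than functions from this change.

```python
import struct


def quantize_block(values):
    """Pack one block as: fp16 scale (2 bytes) + one sign bit per value."""
    n = len(values)
    assert n % 8 == 0, "block length must be a multiple of 8"
    scale = sum(abs(v) for v in values) / n      # scale = mean(abs(block_values))
    bits = bytearray(n // 8)
    for i, v in enumerate(values):
        if v >= 0:                               # assumption: zero maps to +scale
            bits[i // 8] |= 1 << (i % 8)         # assumption: LSB-first packing
    return struct.pack("<e", scale) + bytes(bits)


def dequantize_block(block):
    """Expand a packed block back to a list of +scale / -scale floats."""
    (scale,) = struct.unpack("<e", block[:2])
    bits = block[2:]
    return [
        scale if (bits[i // 8] >> (i % 8)) & 1 else -scale
        for i in range(len(bits) * 8)
    ]
```

With these layouts a 32-element block takes 2 + 4 = 6 bytes (1.5 bits per weight) and a 128-element block takes 2 + 16 = 18 bytes (1.125 bits per weight), matching the sizes listed above.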

Files Changed (11 files, +282 lines)

File	Change
`ggml/include/ggml.h`	New type enums + ftype entries
`ggml/src/ggml-common.h`	Block struct definitions
`ggml/src/ggml-quants.h/c`	Quantize/dequantize implementations
`ggml/src/ggml.c`	type_traits, ftype mapping, quantize dispatch
`ggml/src/ggml-cpu/ggml-cpu.c`	CPU type traits with vec_dot registration
`ggml/src/ggml-cpu/quants.h/c`	vec_dot + quantize_row wrappers
`ggml/src/ggml-cpu/ops.cpp`	Switch cases for all quantized-type ops
`ggml/src/gguf.cpp`	PrismML type ID remap shim
`gguf-py/gguf/constants.py`	Python enum + block size mappings

Testing

Model	Result
Bonsai-4B (Q1_0_G128, 546 MB)	✅ Loads, coherent output, 0.7 t/s CPU
Bonsai-8B (Q1_0_G128, 1.1 GB)	✅ Loads, coherent output, 0.3 t/s CPU

Known Limitations

CPU-only inference (no HIP/CUDA kernels yet for Q1_0)
No SIMD-optimized vec_dot (generic C implementation)
GPU offload will require adding Q1_0 to ggml-cuda/ dequantize kernels
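
For reference, the scalar (non-SIMD) Q1_0 × Q8_0 dot product mentioned above reduces to a signed sum of the int8 values, scaled once per block. The sketch below is an assumption-laden illustration in Python, not the C code from this PR: `vec_dot_q1_0_q8_0` is a hypothetical name, the Q8_0 block is modelled as a scale `d` plus 32 int8 values, and the sign bits are assumed to be packed LSB-first.

```python
def vec_dot_q1_0_q8_0(scale, sign_bits, d, q8):
    """Dot product of one Q1_0 block with one Q8_0 block.

    scale     -- Q1_0 block scale
    sign_bits -- packed sign bits (bit=1 -> +scale, bit=0 -> -scale)
    d         -- Q8_0 block scale
    q8        -- the Q8_0 block's int8 values
    """
    acc = 0
    for i, q in enumerate(q8):
        bit = (sign_bits[i // 8] >> (i % 8)) & 1   # assumption: LSB-first
        acc += q if bit else -q                    # integer-only inner loop
    # Both scales are applied once per block, outside the loop.
    return scale * d * acc
```

Keeping the inner loop integer-only and applying `scale * d` once per block is the property a later SIMD or GPU kernel would most likely exploit.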

Risks

Type ID remap relies on general.file_type KV metadata; hand-crafted GGUFs without this field won't trigger remap
Performance is baseline (scalar loops); SIMD optimization is future work

Port PrismML's custom Q1_0 (32-element blocks) and Q1_0_G128 (128-element blocks) 1-bit ternary quantization types into the turboquant fork. This enables loading and running Bonsai 1-bit models (1.7B, 4B, 8B) which use these types.

Changes:
- ggml.h: Add GGML_TYPE_Q1_0=42, GGML_TYPE_Q1_0_G128=43, bump COUNT to 44. Add GGML_FTYPE_MOSTLY_Q1_0=27, GGML_FTYPE_MOSTLY_Q1_0_G128=28
- ggml-common.h: Add block_q1_0 (6 bytes) and block_q1_0_g128 (18 bytes) struct definitions with static_asserts
- ggml-quants.h/c: Add quantize_row, dequantize_row, and quantize functions for both Q1_0 and Q1_0_G128
- ggml.c: Add type_traits entries, ftype switch cases, quantize dispatch
- ggml-cpu/ggml-cpu.c: Register CPU type traits with vec_dot support
- ggml-cpu/quants.h/c: Add vec_dot implementations (Q1_0 x Q8_0 and Q1_0_G128 x Q8_0) and quantize_row wrappers
- ggml-cpu/ops.cpp: Add Q1_0/Q1_0_G128 to all quantized-type switch statements (get_rows, set_rows, cpy, etc.)
- gguf.cpp: Add PrismML compatibility remap: when general.file_type indicates a PrismML model (ftype 40 or 41), remap tensor type IDs 40->42 (Q1_0) and 41->43 (Q1_0_G128) to avoid collision with NVFP4 (40) and TQ3_0 (41)
- gguf-py/constants.py: Add Python enum entries and block size mappings

Type ID conflict resolution:
PrismML uses type 40=Q1_0, 41=Q1_0_g128
Turboquant uses type 40=NVFP4, 41=TQ3_0
This port assigns Q1_0=42, Q1_0_G128=43 and remaps at GGUF load time

Algorithm: Ternary quantization where each value is encoded as a single sign bit (1=+scale, 0=-scale), with scale = mean(abs(block_values)).

Tested: Bonsai-4B and Bonsai-8B GGUFs load successfully and produce coherent output on CPU inference (0.7 t/s for 4B, 0.3 t/s for 8B).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions bot added the ggml and python labels on Apr 1, 2026

carlosfundora wants to merge 1 commit into unixsysdev:main from carlosfundora:feat/q1_0-1bit-quantization
