feat: add PrismML Q1_0/Q1_0_G128 1-bit ternary quantization support #6

Open
carlosfundora wants to merge 1 commit into unixsysdev:main from carlosfundora:feat/q1_0-1bit-quantization

Summary

Ports PrismML's custom Q1_0 and Q1_0_G128 1-bit ternary quantization types into llama-turboquant. This enables loading and running Bonsai 1-bit models (1.7B, 4B, 8B) that use these types.

Problem

PrismML's fork assigns type IDs 40 (Q1_0) and 41 (Q1_0_G128), which collide with NVFP4 (40) and TQ3_0 (41) in this fork. When turboquant tries to read a PrismML GGUF, it misinterprets tensor sizes, causing offset mismatches and load failures.

Solution

  • Assign new non-conflicting type IDs: Q1_0 = 42, Q1_0_G128 = 43
  • Add GGUF load-time remapping: when general.file_type indicates a PrismML model (ftype 40 or 41), remap tensor type IDs 40→42 and 41→43
  • Full CPU inference path: block structs, quantize/dequantize, vec_dot (Q1_0 × Q8_0), get_rows, set_rows, cpy ops
  • Python GGUF constants updated with enum entries and block size mappings
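
The load-time remap described above can be sketched as a small pure function. This is a minimal illustration, not the code in gguf.cpp: the helper names `is_prismml_ftype` and `remap_tensor_type` and the `has_file_type` flag are hypothetical, and only the numeric IDs come from this description.

```c
#include <assert.h>
#include <stdint.h>

/* Type IDs taken from the PR description. */
enum {
    PRISMML_TYPE_Q1_0      = 40, /* same number as NVFP4 in this fork */
    PRISMML_TYPE_Q1_0_G128 = 41, /* same number as TQ3_0 in this fork */
    LOCAL_TYPE_Q1_0        = 42,
    LOCAL_TYPE_Q1_0_G128   = 43
};

/* Hypothetical helper: nonzero when general.file_type marks a PrismML file. */
static int is_prismml_ftype(int64_t file_type) {
    return file_type == 40 || file_type == 41;
}

/* Hypothetical helper: remap one on-disk tensor type ID.
 * has_file_type is 0 when the GGUF carries no general.file_type key. */
static uint32_t remap_tensor_type(uint32_t type_id, int has_file_type, int64_t file_type) {
    if (!has_file_type || !is_prismml_ftype(file_type)) {
        return type_id; /* native file: 40 and 41 keep their NVFP4/TQ3_0 meaning */
    }
    switch (type_id) {
        case PRISMML_TYPE_Q1_0:      return LOCAL_TYPE_Q1_0;
        case PRISMML_TYPE_Q1_0_G128: return LOCAL_TYPE_Q1_0_G128;
        default:                     return type_id; /* all other types pass through */
    }
}
```

Keying the remap on the file-level metadata rather than on the tensor type alone is what lets native NVFP4 and TQ3_0 files keep loading unchanged; it is also why a file without general.file_type is left untouched.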

Algorithm

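Each value is encoded as a single sign bit (bit=1 → +scale, bit=0 → −scale), where scale = mean(abs(block_values)) is computed per block. Although the types are labeled ternary, one bit per value means a dequantized weight is either +scale or −scale.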

  • Q1_0: 32-element blocks, 6 bytes each (fp16 scale + 4 bytes sign bits)
  • Q1_0_G128: 128-element blocks, 18 bytes each (fp16 scale + 16 bytes sign bits)
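
A minimal sketch of the encoding for one 32-element block, assuming LSB-first bit order within each byte and that zero maps to the positive sign; neither detail is stated in this description. The struct name and function names are hypothetical, and the scale is kept as a `float` only to avoid an fp16 conversion routine (the real block stores fp16, which gives the 6-byte size).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK1_0 32 /* elements per Q1_0 block, per the PR description */

/* Sketch of one block; the real layout uses a 2-byte fp16 scale. */
typedef struct {
    float   d;             /* scale = mean(|x|) over the block */
    uint8_t qs[QK1_0 / 8]; /* 32 sign bits, LSB-first (assumption) */
} block_q1_0_sketch;

static void quantize_block_q1_0_sketch(const float *x, block_q1_0_sketch *y) {
    float sum_abs = 0.0f;
    for (int i = 0; i < QK1_0; i++) {
        sum_abs += x[i] < 0.0f ? -x[i] : x[i];
    }
    y->d = sum_abs / QK1_0;
    memset(y->qs, 0, sizeof(y->qs));
    for (int i = 0; i < QK1_0; i++) {
        if (x[i] >= 0.0f) { /* zero treated as positive (assumption) */
            y->qs[i / 8] |= (uint8_t)(1u << (i % 8));
        }
    }
}

static void dequantize_block_q1_0_sketch(const block_q1_0_sketch *x, float *y) {
    for (int i = 0; i < QK1_0; i++) {
        const int bit = (x->qs[i / 8] >> (i % 8)) & 1;
        y[i] = bit ? x->d : -x->d;
    }
}

/* Storage cost in bits per weight for a block of n elements stored in `bytes` bytes. */
static double bits_per_weight(int bytes, int n) {
    assert(n > 0);
    return 8.0 * bytes / n;
}
```

With the block sizes listed above, `bits_per_weight(6, 32)` gives 1.5 and `bits_per_weight(18, 128)` gives 1.125, so the larger group amortizes the fp16 scale over four times as many weights.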

Files Changed (11 files, +282 lines)

| File | Change |
| --- | --- |
| ggml/include/ggml.h | New type enums + ftype entries |
| ggml/src/ggml-common.h | Block struct definitions |
| ggml/src/ggml-quants.h/c | Quantize/dequantize implementations |
| ggml/src/ggml.c | type_traits, ftype mapping, quantize dispatch |
| ggml/src/ggml-cpu/ggml-cpu.c | CPU type traits with vec_dot registration |
| ggml/src/ggml-cpu/quants.h/c | vec_dot + quantize_row wrappers |
| ggml/src/ggml-cpu/ops.cpp | Switch cases for all quantized-type ops |
| ggml/src/gguf.cpp | PrismML type ID remap shim |
| gguf-py/gguf/constants.py | Python enum + block size mappings |

Testing

| Model | Result |
| --- | --- |
| Bonsai-4B (Q1_0_G128, 546 MB) | ✅ Loads, coherent output, 0.7 t/s CPU |
| Bonsai-8B (Q1_0_G128, 1.1 GB) | ✅ Loads, coherent output, 0.3 t/s CPU |

Known Limitations

  • CPU-only inference (no HIP/CUDA kernels yet for Q1_0)
  • No SIMD-optimized vec_dot (generic C implementation)
  • GPU offload will require adding Q1_0 to ggml-cuda/ dequantize kernels
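
For reference, a generic scalar Q1_0 × Q8_0 dot product of the kind described above might look like the following. This is a sketch under stated assumptions, not the PR's implementation: the struct and function names are hypothetical, scales are held as `float` instead of fp16, and sign bits are assumed LSB-first. The Q8_0 side follows ggml's usual layout of one scale plus 32 int8 values per block.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32 /* Q1_0 and Q8_0 both use 32-element blocks */

typedef struct { float d; uint8_t qs[QK / 8]; } q1_block_sketch; /* scale + sign bits */
typedef struct { float d; int8_t  qs[QK];     } q8_block_sketch; /* scale + int8 values */

/* Dot product of n elements (n must be a multiple of QK). */
static float vec_dot_q1_q8_sketch(int n, const q1_block_sketch *x, const q8_block_sketch *y) {
    assert(n % QK == 0);
    float sum = 0.0f;
    for (int b = 0; b < n / QK; b++) {
        int32_t acc = 0;
        for (int i = 0; i < QK; i++) {
            const int bit = (x[b].qs[i / 8] >> (i % 8)) & 1;
            /* a set bit contributes +q8, a clear bit contributes -q8 */
            acc += bit ? y[b].qs[i] : -y[b].qs[i];
        }
        /* apply both block scales once per block, after the integer sum */
        sum += x[b].d * y[b].d * (float)acc;
    }
    return sum;
}
```

Because each weight only adds or subtracts an int8 value, the inner loop needs no multiplications, which is the property a later SIMD kernel would be expected to exploit.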

Risks

  • Type ID remap relies on general.file_type KV metadata; hand-crafted GGUFs without this field won't trigger remap
  • Performance is baseline (scalar loops); SIMD optimization is future work

Port PrismML's custom Q1_0 (32-element blocks) and Q1_0_G128 (128-element
blocks) 1-bit ternary quantization types into the turboquant fork. This
enables loading and running Bonsai 1-bit models (1.7B, 4B, 8B) which use
these types.

Changes:
- ggml.h: Add GGML_TYPE_Q1_0=42, GGML_TYPE_Q1_0_G128=43, bump COUNT to 44
  Add GGML_FTYPE_MOSTLY_Q1_0=27, GGML_FTYPE_MOSTLY_Q1_0_G128=28
- ggml-common.h: Add block_q1_0 (6 bytes) and block_q1_0_g128 (18 bytes)
  struct definitions with static_asserts
- ggml-quants.h/c: Add quantize_row, dequantize_row, and quantize
  functions for both Q1_0 and Q1_0_G128
- ggml.c: Add type_traits entries, ftype switch cases, quantize dispatch
- ggml-cpu/ggml-cpu.c: Register CPU type traits with vec_dot support
- ggml-cpu/quants.h/c: Add vec_dot implementations (Q1_0 x Q8_0 and
  Q1_0_G128 x Q8_0) and quantize_row wrappers
- ggml-cpu/ops.cpp: Add Q1_0/Q1_0_G128 to all quantized-type switch
  statements (get_rows, set_rows, cpy, etc.)
- gguf.cpp: Add PrismML compatibility remap — when general.file_type
  indicates a PrismML model (ftype 40 or 41), remap tensor type IDs
  40->42 (Q1_0) and 41->43 (Q1_0_G128) to avoid collision with
  NVFP4 (40) and TQ3_0 (41)
- gguf-py/constants.py: Add Python enum entries and block size mappings

Type ID conflict resolution:
  PrismML uses type 40=Q1_0, 41=Q1_0_G128
  Turboquant uses type 40=NVFP4, 41=TQ3_0
  This port assigns Q1_0=42, Q1_0_G128=43 and remaps at GGUF load time

Algorithm: Ternary quantization where each value is encoded as a single
sign bit (1=+scale, 0=-scale), with scale = mean(abs(block_values)).

Tested: Bonsai-4B and Bonsai-8B GGUFs load successfully and produce
coherent output on CPU inference (0.7 t/s for 4B, 0.3 t/s for 8B).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>