ggml: add Vulkan and SYCL backends for TQ3_0 KV cache quantization by metalchef1 · Pull Request #7 · unixsysdev/llama-turboquant

metalchef1 · 2026-05-07T10:50:34Z

Add hardware-accelerated TQ3_0 support to the Vulkan and SYCL backends. This enables a TQ3_0 KV cache on Intel Arc GPUs and cuts KV cache VRAM by 58.8% compared with q8_0.

Vulkan backend:

- types.glsl: add block_tq3_0_packed16 struct and A_TYPE_PACKED16 macro
- dequant_funcs.glsl: add dequantize/dequantize4/get_dm for TQ3_0 using 32-point Walsh-Hadamard Transform with Max-Lloyd centroids
- copy_to_quant.comp: TQ3_0 quantization path (F32 -> TQ3_0)
- dequant_tq3_0.comp: standalone dequantize shader (TQ3_0 -> F32)
- flash_attn_base.glsl: TQ3_0 dequantize4 for flash attention K/V path
- vulkan-shaders-gen.cpp: compile TQ3_0 shaders (cpy, dequant, mul_mat_vec, get_rows, flash_attn scalar path)
- ggml-vulkan.cpp: pipeline creation for cpy/dequant/mul_mat_vec/flash_attn, supports_op entries for CPY/DUP/CONT/MUL_MAT/FLASH_ATTN_EXT
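
To make the dequantize step above more concrete, here is a minimal CPU-side sketch in C++ of a 32-point Walsh-Hadamard dequantization. It is not the shader code from this PR. The centroid table (standard 8-level Lloyd-Max points for a unit Gaussian), the one-index-per-byte layout, the single per-block scale, and the names `fwht32` and `dequantize_block` are all assumptions made for illustration.

```cpp
#include <array>
#include <cmath>
#include <cstdint>

// Block size of the transform named in the PR description.
constexpr int kBlock = 32;

// ASSUMPTION: 8-level Lloyd-Max reconstruction points for a unit Gaussian.
// The PR does not list its centroid table; these values are illustrative.
constexpr std::array<float, 8> kCentroids = {
    -2.1520f, -1.3439f, -0.7560f, -0.2451f,
     0.2451f,  0.7560f,  1.3439f,  2.1520f};

// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(32) so the
// transform is orthonormal and therefore its own inverse.
inline void fwht32(std::array<float, kBlock>& v) {
    for (int len = 1; len < kBlock; len <<= 1) {
        for (int i = 0; i < kBlock; i += 2 * len) {
            for (int j = i; j < i + len; ++j) {
                const float a = v[j];
                const float b = v[j + len];
                v[j] = a + b;        // butterfly: sum
                v[j + len] = a - b;  // butterfly: difference
            }
        }
    }
    const float norm = 1.0f / std::sqrt(static_cast<float>(kBlock));
    for (float& x : v) x *= norm;
}

// Hypothetical dequantization: look up one centroid per 3-bit index,
// apply the per-block scale, then undo the rotation with the WHT.
// Indices are passed unpacked (one per byte) to keep the sketch short.
inline std::array<float, kBlock> dequantize_block(
        const std::array<std::uint8_t, kBlock>& idx, float scale) {
    std::array<float, kBlock> out{};
    for (int i = 0; i < kBlock; ++i) {
        out[i] = kCentroids[idx[i] & 7u] * scale;
    }
    fwht32(out);
    return out;
}
```

Because the scaled transform is its own inverse, the same `fwht32` routine could serve both the quantize and dequantize directions in a sketch like this; whether the PR's shaders share code that way is not stated.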

SYCL backend:

- dequantize.hpp: TQ3_0 WHT dequantize kernel
- dmmv.cpp: dequantize + matrix-vector multiply (DMMV) kernel for TQ3_0 x F32
- convert.cpp: F32 -> TQ3_0 and TQ3_0 -> F32 copy kernels
- ggml-sycl.cpp: register TQ3_0 pipelines

Shared:

- ggml-common.h: add QR_TQ3_0 = 1 macro
- llama-context.cpp: enable flash_attn for TQ3_0 KV cache (was force-disabled)

Tested on Intel Arc Pro B60 (24GB) with Qwen3.6-35B-A3B-IQ4_XS:
ctx=32768, K+V q8_0: 340 MiB
ctx=32768, K+V tq3_0: 140 MiB (58.8% reduction, correct outputs verified)
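
The reported reduction can be sanity-checked with a little arithmetic, sketched below in C++. The q8_0 block size (one fp16 scale plus 32 int8 values, 34 bytes) is ggml's standard layout. The TQ3_0 block size is an assumption: 32 three-bit indices plus one fp16 scale, 14 bytes in total. The PR does not state the layout, but this guess reproduces the measured ratio.

```cpp
#include <cstddef>

// q8_0 in ggml: one fp16 scale (2 bytes) + 32 int8 values = 34 bytes per 32 elements.
constexpr std::size_t kQ8BlockBytes = 2 + 32;

// ASSUMPTION: a TQ3_0 block holds 32 three-bit indices (12 bytes)
// plus one fp16 scale (2 bytes). Not confirmed by the PR text.
constexpr std::size_t kTq3BlockBytes = (32 * 3) / 8 + 2;

// Fractional saving when memory use drops from `before` to `after`.
constexpr double reduction(double before, double after) {
    return (before - after) / before;
}

// Measured figures from the PR: 340 MiB -> 140 MiB.
constexpr double kMeasured = reduction(340.0, 140.0);

// Predicted figure from the assumed block sizes: 34 bytes -> 14 bytes.
constexpr double kPredicted = reduction(static_cast<double>(kQ8BlockBytes),
                                        static_cast<double>(kTq3BlockBytes));
```

Both values come out to 20/34, roughly 0.588, which matches the 58.8% quoted above and corresponds to 3.5 bits per element under the assumed layout versus 8.5 for q8_0.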

A TQ3_0 KV cache runs through flash attention (FLASH_ATTN_EXT), since the K cache is accessed non-contiguously. Both caches must use the type: pass --cache-type-v tq3_0 alongside --cache-type-k tq3_0.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

The github-actions bot added the ggml, SYCL, and Vulkan labels on May 7, 2026.

metalchef1 wants to merge 1 commit into unixsysdev:main from metalchef1:main