ggml: add Vulkan and SYCL backends for TQ3_0 KV cache quantization #7

Open: metalchef1 wants to merge 1 commit into unixsysdev:main from metalchef1:main

Add GPU-accelerated TQ3_0 support to the Vulkan and SYCL backends, enabling a TQ3_0 KV cache on Intel Arc GPUs. In the test reported below, the TQ3_0 KV cache is about 58% smaller than the q8_0 one.

Vulkan backend:

  • types.glsl: add block_tq3_0_packed16 struct and A_TYPE_PACKED16 macro
  • dequant_funcs.glsl: add dequantize/dequantize4/get_dm for TQ3_0 using 32-point Walsh-Hadamard Transform with Max-Lloyd centroids
  • copy_to_quant.comp: TQ3_0 quantization path (F32 -> TQ3_0)
  • dequant_tq3_0.comp: standalone dequantize shader (TQ3_0 -> F32)
  • flash_attn_base.glsl: TQ3_0 dequantize4 for flash attention K/V path
  • vulkan-shaders-gen.cpp: compile TQ3_0 shaders (cpy, dequant, mul_mat_vec, get_rows, flash_attn scalar path)
  • ggml-vulkan.cpp: pipeline creation for cpy/dequant/mul_mat_vec/flash_attn, supports_op entries for CPY/DUP/CONT/MUL_MAT/FLASH_ATTN_EXT
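The 32-point Walsh-Hadamard Transform mentioned for dequant_funcs.glsl can be illustrated on the CPU. The following is a minimal C++ sketch, not the shader code from this PR: the function name `wht32`, the natural (Hadamard) ordering, and the 1/sqrt(32) normalization are assumptions made for the example. With that normalization the transform is orthonormal, so applying it twice returns the input, which is why one routine can serve both the quantize and dequantize directions.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Block length for the sketch; the PR text mentions a 32-point transform.
constexpr std::size_t kBlock = 32;

// In-place fast Walsh-Hadamard transform in natural (Hadamard) order.
// ASSUMPTION: the 1/sqrt(32) scaling is chosen here so that the transform
// is orthonormal and self-inverse; the PR does not state its normalization.
inline void wht32(std::array<float, kBlock>& v) {
    // log2(32) = 5 butterfly stages with strides 1, 2, 4, 8, 16.
    for (std::size_t h = 1; h < kBlock; h *= 2) {
        for (std::size_t i = 0; i < kBlock; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j]     = a + b;  // sum branch of the butterfly
                v[j + h] = a - b;  // difference branch of the butterfly
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(kBlock));
    for (float& x : v) {
        x *= scale;
    }
}
```

Calling `wht32(v)` twice on the same array should reproduce the original values up to floating-point rounding, and a unit impulse in element 0 spreads to 32 equal coefficients of 1/sqrt(32).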

SYCL backend:

  • dequantize.hpp: TQ3_0 WHT dequantize kernel
  • dmmv.cpp: dequantize + matrix-vector multiply kernel for TQ3_0 x F32
  • convert.cpp: F32 -> TQ3_0 and TQ3_0 -> F32 copy kernels
  • ggml-sycl.cpp: register TQ3_0 pipelines
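The F32 -> TQ3_0 and TQ3_0 -> F32 copy paths listed for both backends can be pictured as a rotate, quantize, look up, rotate-back round trip. The sketch below is a hedged CPU illustration in C++, not the kernels from this PR. Everything in it is an assumption: the `SketchBlock` layout (unpacked indices, float scale) is hypothetical and is not `block_tq3_0`, the RMS-based scale is a guess, and the centroid table holds the textbook 8-level Lloyd-Max levels for a unit Gaussian as stand-ins, since the PR's actual table is not shown.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kN = 32;

// ASSUMPTION: textbook 8-level Lloyd-Max reconstruction levels for a unit
// Gaussian, used as stand-ins for the PR's (unshown) centroid table.
constexpr std::array<float, 8> kCentroids = {
    -2.1520f, -1.3439f, -0.7560f, -0.2451f,
     0.2451f,  0.7560f,  1.3439f,  2.1520f};

// Hypothetical block layout for illustration only (not block_tq3_0):
// indices are left unpacked, one byte per 3-bit code.
struct SketchBlock {
    float scale;
    std::array<std::uint8_t, kN> idx;
};

// Orthonormal 32-point Walsh-Hadamard transform; it is its own inverse.
inline void hadamard32(std::array<float, kN>& v) {
    for (std::size_t h = 1; h < kN; h *= 2) {
        for (std::size_t i = 0; i < kN; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = v[j];
                const float b = v[j + h];
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
    }
    const float s = 1.0f / std::sqrt(static_cast<float>(kN));
    for (float& x : v) x *= s;
}

// F32 -> sketch block: rotate, normalize by RMS, pick the nearest centroid.
inline SketchBlock quantize_sketch(std::array<float, kN> x) {
    hadamard32(x);
    float sumsq = 0.0f;
    for (float c : x) sumsq += c * c;
    const float scale = std::sqrt(sumsq / static_cast<float>(kN));
    SketchBlock b{scale, {}};
    for (std::size_t i = 0; i < kN; ++i) {
        const float t = scale > 0.0f ? x[i] / scale : 0.0f;
        std::uint8_t best = 0;
        for (std::uint8_t k = 1; k < 8; ++k) {
            if (std::fabs(t - kCentroids[k]) < std::fabs(t - kCentroids[best])) {
                best = k;
            }
        }
        b.idx[i] = best;
    }
    return b;
}

// Sketch block -> F32: centroid lookup, rescale, rotate back.
inline std::array<float, kN> dequantize_sketch(const SketchBlock& b) {
    std::array<float, kN> x{};
    for (std::size_t i = 0; i < kN; ++i) {
        x[i] = b.scale * kCentroids[b.idx[i]];
    }
    hadamard32(x);
    return x;
}
```

A caller would run `dequantize_sketch(quantize_sketch(x))` and compare the result with `x`. The reconstruction is lossy by design; a real implementation would also pack the 3-bit indices rather than spend a byte on each.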

Shared:

  • ggml-common.h: add QR_TQ3_0 = 1 macro
  • llama-context.cpp: enable flash_attn for TQ3_0 KV cache (was force-disabled)

Tested on Intel Arc Pro B60 (24 GB) with Qwen3.6-35B-A3B-IQ4_XS at ctx=32768 (KV cache size):

  • K+V q8_0: 340 MiB
  • K+V tq3_0: 140 MiB (58.8% reduction, correct outputs verified)

The TQ3_0 KV cache relies on flash attention (FLASH_ATTN_EXT) because the K cache is accessed non-contiguously. It also requires both caches to use the format: pass --cache-type-v tq3_0 together with --cache-type-k tq3_0.
